Abstract-Aside from the benefits it brings, 3D-IC technology inevitably exacerbates the difficulty of power delivery, because power consumption grows volumetrically. Recent work managed to "recycle" current within the 3D stack by linking the different layers' supply/ground nets into a series connection. This charge-recycled (also known as voltage-stacked, or V-S) scheme provides a scalable solution for 3D-IC power delivery because it supports an arbitrary number of layers with a constant off-chip current demand. Although prior work has studied the circuit implementation of a V-S power delivery network (PDN) and its current-reduction benefits, a whole-system evaluation of V-S PDNs' transient voltage noise and a noise comparison between the V-S PDN and the traditional PDN are missing. In this paper, we build a system-level model to examine voltage-stacked 3D-ICs' transient noise and explore the impact of different PDN design parameters and workload behaviors. Our results show that, compared with the traditional PDN scheme, V-S provides stronger isolation against cross-layer noise interference, which in turn grants higher performance benefits for run-time noise mitigation techniques such as dynamic margin adaptation. We observe that, compared with traditional PDNs, V-S PDNs provide up to 60% lower transient noise in the worst-case scenario. Furthermore, we show that V-S PDNs significantly reduce the packaging cost, because their noise is almost insensitive to the package impedance (e.g., a 300% impedance increase raises worst-case noise by less than 0.3% Vdd).
I. Introduction
Three-dimensional integrated circuits (3D-ICs) make it possible to continue the historical trend of increasing device integration while maintaining high bandwidth, low latency, and small form factor. Since the number of device layers in a 3D-IC stack is expected to grow, power density will inevitably increase. Unfortunately, the severity of the two major power-delivery-related reliability issues, namely supply voltage noise and electromigration-induced (EM-induced) power grid wearout, is directly related to the on-chip power density. Consequently, power delivery quality will become a limiting factor on the road towards many-layer 3D-ICs.
In response to the power delivery challenge caused by excessive current consumption, various research proposals [1] [2] [3] [4] explored the idea of using a charge-recycled power delivery structure to support 3D-IC. Charge-recycling, or voltage-stacking (V-S), refers to power delivery that arranges multiple circuit blocks electrically in series. By connecting one block's ground net directly to the next one's power supply net, a V-S power delivery network (PDN) "recycles" current between blocks. Blocks that share a V-S PDN carry the same current, while their Vdd values add up. V-S provides a scalable solution for 3D-IC power delivery: because current is recycled between layers, adding more layers to a 3D stack only requires a higher off-chip supply voltage, while the current density within the PDN remains constant. This breaks the fundamental mismatch between 3D-IC's volumetric power dissipation and surface-limited (i.e., Controlled Collapse Chip Connection, or C4 array-based) power delivery.
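The scaling argument above can be illustrated with a minimal Python sketch. It assumes an idealized stack in which every layer draws the same power and the regulators are lossless; the function name `offchip_supply` is hypothetical, and the 16-core, 475mW-per-core figures are the peak values reported in the simulation setup of this paper.

```python
def offchip_supply(layer_power_w, n_layers, vdd=1.0, stacked=False):
    """Return (supply_voltage_V, supply_current_A) for an idealized stack.

    Assumption: all layers draw the same power and conversion is lossless.
    """
    if stacked:
        # Series (V-S) connection: layer voltages add, one layer's current is shared.
        return n_layers * vdd, layer_power_w / vdd
    # Traditional parallel connection: one Vdd rail, layer currents add.
    return vdd, n_layers * layer_power_w / vdd


if __name__ == "__main__":
    layer_power = 16 * 0.475  # 16 cores per layer at 475 mW peak each
    for n in (2, 4, 8):
        print(n, offchip_supply(layer_power, n),
              offchip_supply(layer_power, n, stacked=True))
```

Under these assumptions the traditional PDN's off-chip current grows linearly with the layer count, whereas the V-S PDN's current stays at the single-layer value and only the supply voltage grows.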
With reduced current density in C4 bumps, through-silicon vias (TSVs), and on-chip wires, V-S significantly improves 3D-IC's robustness against EM-induced PDN wearout [5]. However, V-S PDNs are not guaranteed to have lower supply voltage noise than the traditional power delivery scheme, in which all layers' power supply nets and all layers' ground nets are respectively connected by TSVs: when the power consumption of the various layers is not perfectly matched, the voltages at the intermediate nodes in the V-S stack deviate from the nominal value. Therefore, explicit voltage regulation is required in V-S PDNs to compensate for the current-consumption mismatch between layers and to regulate the voltages at the internal nodes. Based on circuit-level implementations and tests, prior research proposals have demonstrated the feasibility of using these explicit regulators in V-S PDNs [2, 4]. However, the trade-off in voltage noise between V-S PDNs and traditional PDNs is not clear. To understand voltage noise in V-S PDNs under different workload conditions, to explore the impact of various PDN design parameters, and ultimately to determine whether or when V-S PDNs have better noise quality than traditional PDNs, system-level modeling and analysis are required.
In this paper, we first design and validate a compact RC model for the voltage regulators in V-S PDNs. We then extend an open-source, system-level PDN model, VoltSpot version 1.0 [6], and integrate it with our regulator model, producing the first platform to enable whole-system, transient simulation of many-layer 3D-ICs' V-S PDNs. This new version of VoltSpot has been released as version 2.0. Using an example low-power, ARM-based manycore 3D processor, we then compare the supply noise between voltage-stacked and traditional PDNs, and explore the impact of (a) cross-layer noise, (b) on-chip decoupling capacitance, and (c) package impedance. We observe that: 1. V-S provides stronger cross-layer noise isolation, increasing the effectiveness of run-time noise mitigation, and therefore system efficiency; 2. Under an area constraint for integrated capacitors, V-S provides up to 60% lower worst-case noise amplitude; 3. V-S PDNs are less sensitive to package impedance. Consequently, we conclude that V-S achieves lower noise and lower cost compared with traditional 3D PDNs.
II. Background and Related Work
A. Voltage Noise and Timing Margin in 3D-IC
Supply voltage noise, which includes IR drop, LdI/dt, and LC resonance, refers to voltage fluctuation in the power-delivery network. Since transistor delay depends directly on the source-to-drain potential difference [7], it is common design practice to assign a timing margin to critical paths to avoid noise-induced timing errors. Besides a design-time allocation that guards against the worst-case scenario, the timing margin can also be adjusted dynamically to improve system efficiency. For example, Lefurgy et al. [8] proposed a technique that detects the available timing margin at run-time with critical path monitors. Using digital phase-locked loops, their scheme can rapidly change clock frequency to save energy during average-case execution (i.e., reduce margin) while guaranteeing functionality in the worst case (i.e., increase margin).
As more active device layers are stacked together, the aggregate current demand increases, and the amplitude of the voltage noise grows proportionally with the layer count if the PDN impedance is kept constant [9]. To maintain a traditional PDN's robustness against voltage noise, 3D-IC designers will have to keep increasing the timing margin (which degrades system performance through lower clock frequency) and/or reducing the PDN impedance (which increases PDN cost through extra area overhead for on-chip decoupling capacitance or higher packaging complexity). Unfortunately, neither of these two approaches is scalable to many-layer 3D-ICs.
B. Voltage Regulation in V-S PDN
Although V-S significantly reduces 3D-IC's off-chip current demand [5], it introduces extra voltage noise caused by the workload imbalance between device layers. This is because, when layers are connected in series, the ratio of their effective resistances (which are inversely proportional to their power consumptions) directly affects the voltage levels of the intermediate nodes. Consequently, layers with higher power will experience greater voltage drops. To regulate this noise, prior work proposed using explicit regulators with V-S PDNs [4, 10]. Considering the rapid improvement of capacitive technology, we focus on switched-capacitor (SC) converters in this paper, due to their regulation efficiency [11]. Fig. 1 shows the detailed circuit structure of the V-S SC converter we adopt from the literature [4]. Each converter cell consists of two fly-capacitors (C1 and C2) and eight switches. By periodically interchanging the positions of the fly-caps (i.e., phases CLK1 and CLK2 in Fig. 1), the SC converter can either "source" or "sink" the charge difference between the stacked loads to regulate the voltage at its output. For a 2-layer system, this fixed 2:1 push-pull converter acts merely as a charge equalizer to assist the natural 2:1 voltage down-conversion of the stacked loads. For many-layer systems, we arrange the SC converters into a multi-output ladder structure to generate higher voltages. Similar to [5], we assume a fixed switching frequency for all converters to reduce design complexity.
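To make the imbalance effect concrete, the following Python sketch computes the layer voltages of an unregulated series stack. It is an illustration only, assuming each layer behaves as a fixed resistance R = Vdd^2 / P evaluated at the nominal Vdd (a linearization); the function names are hypothetical and no SC converter is modeled.

```python
def effective_resistance(power_w, v_nominal=1.0):
    """Effective load resistance of one layer, R = Vdd^2 / P (assumed linear load)."""
    return v_nominal ** 2 / power_w


def unregulated_layer_voltages(v_supply, layer_powers_w, v_nominal=1.0):
    """Voltage across each series-connected layer when no regulator is present."""
    resistances = [effective_resistance(p, v_nominal) for p in layer_powers_w]
    stack_current = v_supply / sum(resistances)  # the same current flows through every layer
    return [stack_current * r for r in resistances]


if __name__ == "__main__":
    # Hypothetical 2-layer stack: one layer at 475 mW, the other at half that power.
    print(unregulated_layer_voltages(2.0, [0.475, 0.2375]))
```

In this hypothetical case the higher-power layer sees about 0.67V instead of the nominal 1V, which is the deviation that the explicit regulators are meant to remove.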
C. System-level Supply Voltage Noise Modeling
In the past, researchers constructed system-level models to examine the supply voltage noise in both 2D ([6, 12]) and 3D ([5, 9]) chips. While prior work has demonstrated that stacking more layers of active silicon using the traditional PDN structure will monotonically increase on-chip noise [9], it is still not clear whether, or in which scenarios, the V-S scheme provides better power delivery quality (in terms of transient noise) for 3D-ICs. To answer this question, we build a whole-system evaluation platform for V-S PDNs by designing a compact RC model for SC converters and integrating it with a full-chip power grid model.
The topic of SC converter modeling has been discussed in the past. However, prior work either focused on the traditional 2D-IC case without voltage stacking ( [13] ), or only studied the static noise (i.e., IR drop) of SC converters ( [5] ). To the best of our knowledge, ours is the first work to model transient voltage noise in SC-converter-supported V-S PDNs and compare V-S PDNs with traditional PDNs.
III. V-S PDN Modeling Methodologies
The power delivery networks of contemporary processors are usually large systems that contain up to several billion nodes, even in the context of 2D-IC. 3D integration and voltage stacking further increase the PDN's complexity with more device layers and new components such as TSVs and voltage regulators. For this reason, circuit-level simulations are extremely computationally intensive and incapable of supporting whole-system design-space exploration studies. To enable a system-level study of V-S PDNs' voltage noise, we design and validate a compact RC model for the SC converters and integrate it with a pre-RTL PDN model. This section discusses our modeling methodology and the validation results.
Note that although we exchange the positions (i.e., electric charge) of the fly-caps at each clock edge, the resistance of each top and bottom branch is kept unmodified. This is because each time we "flip" the position of the fly-caps, we also change the set of switches that conduct the current (Fig. 1). Fortunately, the switches are designed symmetrically, such that both the top and bottom RC branches have the same equivalent resistance in the two clock phases [4]. Therefore, we can collapse the eight switches into two resistors (Rt
A common design technique to smooth the output ripple is to divide the single-cell converter into multiple sub-cells and interleave their switching clocks [4]. To model this structure, we simply instantiate a pair of top/bottom RC branches for each sub-cell, scale the capacitance values according to the total number of sub-cells, and shift the phase of each sub-cell's control clock. Fig. 2a illustrates an example model for a two-way interleaved SC converter. Similar to [4], we assume that all the sub-cells have identical structure and, therefore, the same RC values.
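The sub-cell instantiation described above might be sketched as follows in Python. This is not the released VoltSpot code: the `SubCell` container and `interleaved_subcells` function are hypothetical, the fly-capacitance is assumed to be split evenly across sub-cells, the branch resistance is taken as a given per-sub-cell value, and the clock offsets are assumed to be spaced evenly over half a switching period (the source does not state the exact offset).

```python
from dataclasses import dataclass


@dataclass
class SubCell:
    c_top_f: float         # fly-capacitance of the top RC branch (F)
    c_bottom_f: float      # fly-capacitance of the bottom RC branch (F)
    r_top_ohm: float       # equivalent switch resistance, top branch
    r_bottom_ohm: float    # equivalent switch resistance, bottom branch
    clock_offset_s: float  # delay of this sub-cell's clock relative to sub-cell 0


def interleaved_subcells(c_fly_f, r_branch_ohm, f_switch_hz, n_subcells):
    """Build one top/bottom RC branch pair per sub-cell.

    Assumptions (not from the source): even capacitance split, identical
    per-sub-cell resistance, offsets spaced by T / (2 * n_subcells).
    """
    period_s = 1.0 / f_switch_hz
    step_s = period_s / (2 * n_subcells)
    c_sub = c_fly_f / n_subcells
    return [
        SubCell(c_sub, c_sub, r_branch_ohm, r_branch_ohm, k * step_s)
        for k in range(n_subcells)
    ]


if __name__ == "__main__":
    for cell in interleaved_subcells(4e-9, 1.0, 50e6, 4):
        print(cell)
```

Keeping all sub-cells identical mirrors the assumption in the text; a different offset rule can be substituted by changing `step_s`.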
B. Validation
We implement a 4-way interleaved, 2:1 push-pull SC converter in a commercial 28nm CMOS technology to validate our modeling methodology. It has an optimum switching frequency of 50MHz and a total capacitance of 8nF. Each SC converter can source/sink up to 100mA of current to/from the load at a nominal voltage of 1V. Using the Cadence ADE environment and the Spectre simulator, we simulate this converter in a two-layer, voltage-stacked system (i.e., Vhead = 2V, Vfoot = 0V) and compare results against the output of our RC model. workload conditions. Since the SC converter's output voltage is directly related to its output current, we attach an ideal current source directly to the Vout port and sweep the test cases from maximum sourcing (positive 100mA) to maximum sinking (negative 100mA). Under a constant workload, the output voltage shows a periodic rippling behavior caused by the converters' switching activities. Validation results show that, with a 1V Vdd, our model's maximum DC error is 0.75%.
We also use a time-varying load current to validate our model. Fig. 3b shows the output voltage trace over 300 ns. The load current is sampled from the Parsec 2.0 benchmark raytrace [14]; it induces an average current of 66.3mA in an ARM Cortex A9 core. Over the entire simulated time window, the output voltage trace of our model matches circuit simulation well in terms of DC component, AC amplitude, and slew rate. Overall, our model captures the SC converter's transient output voltage with less than 72mV error at all times.
To study the interaction between the SC converters and the on-chip PDN grid, and to evaluate V-S PDNs' overall noise quality, we combine our SC converter model with an existing PDN model, VoltSpot [6]. VoltSpot uses a distributed RLC network to model the entire on-chip PDN metal stack, and a lumped RLC loop to model the chip package. Section 4-B will discuss the parameters we use and the modifications we made to VoltSpot in detail. Fig. 2b shows the structure of the whole-system model we build for many-layer V-S PDNs. For each SC converter, we connect its three ports (i.e., Vhead, Vout, and Vfoot) to three consecutive layers in the voltage-stacked power grids. We note that ideally, Vout = (Vhead + Vfoot)/2, which indicates that any change in either Vhead or Vfoot will also affect the regulator's output voltage. Our model directly captures this inter-layer voltage dependency.
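The ideal midpoint relation can be written down directly. The short Python sketch below (function name hypothetical) shows only the ideal, lossless relation Vout = (Vhead + Vfoot)/2 and how a disturbance on one rail propagates to the output; it does not model ripple or load-dependent droop.

```python
def ideal_vout(v_head, v_foot):
    """Ideal output of a 2:1 push-pull SC converter: the midpoint of its rails."""
    return (v_head + v_foot) / 2.0


if __name__ == "__main__":
    nominal = ideal_vout(2.0, 0.0)       # two-layer stack from the validation setup
    drooped = ideal_vout(2.0 - 0.05, 0.0)  # hypothetical 50 mV droop on Vhead
    print(nominal, drooped, nominal - drooped)
```

In this ideal picture, half of any head-droop or foot-bounce appears at Vout, which is the inter-layer dependency the whole-system model captures.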
IV. Simulation Setup

A. Many-core 3D Processor Modeling
To study supply voltage noise in realistic 3D-IC design scenarios, we model an example many-core, many-layer 3D-IC based on a 40nm ARM Cortex A9 IP [15]. Using the architecture-level power and area model McPAT [16], we observe that when running at 1GHz with a 1V supply voltage, each core has a peak power density of 172mW/mm² (475mW over 2.76mm²). Due to the power-efficient nature of these ARM processors, we can build our example many-layer 3D-IC without relying on aggressive, volumetric cooling solutions. With the help of the pre-RTL floorplan tool ArchFP [17] and the thermal model HotSpot [18], we evaluate the 3D stacks' maximum temperature and find that, with a conventional air-cooling solution, we can stack up to eight layers of 16-core processors without violating the typical upper limit of 100 °C.
Although many-layer 3D-ICs, especially those with many logic layers, pose various fabrication challenges [3], the possibility of manufacturing 3D stacks economically has been exemplified by existing commercial products (e.g., the Micron hybrid memory cube with 4-8 layers [19]). To study the voltage noise in both short-term and long-term future 3D-ICs, and to evaluate how 3D scaling affects PDN design tradeoffs, we build a series of example 3D systems with 2 to 8 layers stacked together. With 16 ARM cores per layer, the peak power consumption of these 3D processors ranges from 30.4W to 60.8W.
B. PDN Modeling
Besides integrating our SC converter model with VoltSpot, we also modify this 2D PDN model to support transient simulations for 3D-IC. Our major extension is an explicit resistor-inductor model for the TSVs. We adopt TSV parameters from prior work [20]. Similar to prior work [21], we ignore TSV capacitance in this paper, because it is usually orders of magnitude smaller than the on-chip and package decoupling capacitance. Other modeling parameters (Table I) are adopted from prior work [6].
By default, VoltSpot version 1.0 uses ideal current sources to model the load (i.e., switching transistors). In order to model the voltage-stacked PDN organization, we replace the current sources with time-varying resistors. This is a necessary modification, because a V-S PDN connects multiple layers of load in series, and using a resistive load model eliminates potential current-source cutsets in the modeling circuit (if such a cutset exists, the circuit solution is not unique). The load resistance is calculated as R = Vdd² / Power. This modification increases the model's computational complexity through more frequent LU decomposition operations. This is because, unlike the original VoltSpot, where the modeling circuit is time-invariant (only the current excitation changes), our model changes the load resistors over time to match the power consumption.
To explore a broader design space within an affordable simulation time (e.g., 1 hour to simulate 1k cycles), we adopt the modeling methodology from Huang et al. [21] and only simulate a "slice" of the entire 3D stack. Since each layer of our example 3D processor is a homogeneous 16-core ARM chip, we exploit the symmetry and simulate a reduced system of 2 cores per layer.
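As a rough illustration of the resistive load model, the Python sketch below converts a power trace into a load-resistance trace using R = Vdd² / Power and counts how often the resistance changes, as a proxy for how often the circuit matrix would need a new LU factorization. The function names and the `min_power_w` floor (which avoids division by zero) are assumptions for illustration, not part of VoltSpot.

```python
def load_resistance(power_w, vdd=1.0, min_power_w=1e-6):
    """Time-varying load resistor value, R = Vdd^2 / Power.

    min_power_w is a hypothetical floor that keeps R finite for idle cycles.
    """
    return vdd ** 2 / max(power_w, min_power_w)


def resistance_trace(power_trace_w, vdd=1.0):
    """Map a per-cycle power trace (W) to a per-cycle load resistance trace (ohm)."""
    return [load_resistance(p, vdd) for p in power_trace_w]


def count_refactorizations(resistances):
    """Count the steps at which the circuit matrix changes (first step included)."""
    count = 0
    previous = None
    for r in resistances:
        if previous is None or r != previous:
            count += 1
            previous = r
    return count


if __name__ == "__main__":
    trace = resistance_trace([0.4, 0.4, 0.5, 0.5, 0.3])
    print(trace, count_refactorizations(trace))
```

With ideal current sources the matrix would be factorized once; with resistive loads every change in power alters the matrix, which is the extra cost the text refers to.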
C. Workload Modeling
Using an integrated tool flow that combines McPAT with the performance simulator Gem5 [22], we simulate the Parsec 2.0 benchmark suite [14] and extract dynamic power consumption traces to build realistic test cases for our noise study. Because PDN simulation speed is limited, we simulate 2k-cycle-long samples of power traces instead of whole applications. To construct representative multi-layer workload behaviors, we first randomly collect a large number (i.e., 1000) of power samples from each benchmark, then profile each sample's average power consumption and maximum noise amplitude when running alone (on a 2D-IC). Section 5 gives more details about the workload combinations we use in our study.
V. Results
A. Cross-layer Noise Interference
To study whether or how different layers' voltage variations affect each other in traditional and V-S PDNs, we pick one noisy workload and three less noisy ones from our sample pool and assign them to our 4-layer example 3D processor. The first row in Table II shows each workload sample's maximum noise amplitude when running alone on a single-layer chip. Fig. 4 shows each layer's maximum voltage drop (%Vdd) over time.
In the traditional PDN (Fig. 4a), voltage noise in all layers is clearly highly correlated, a consequence of the layers' high-density, parallel interconnection. Supply voltage fluctuations in one layer affect the entire 3D stack through the vertical connections (i.e., TSVs). Conversely, the V-S PDN connects layers in series and regulates voltage levels with SC converters. Consequently, it breaks the inter-layer noise correlation (Fig. 4b). Table II shows each layer's maximum noise amplitude over the entire simulated time window. Compared with a 2D PDN, the traditional 3D PDN significantly reduces Task3's noise, because the decoupling capacitors (decap) on adjacent layers help to stabilize local voltage variation. However, other layers' voltage noise is also affected by Task3. In contrast, the V-S PDN isolates Task3's noise so that other layers have lower noise.
With dynamic margin adaptation (see Sec. 2-A, also reference [8]), each layer can adjust its timing margin according to its own maximum noise amplitude. Consequently, less noisy layers can run faster. Given the approximately linear relationship between noise amplitude and transistor delay, we assume that x% Vdd noise requires an x% decrease in clock frequency. The last column in Table II shows the arithmetic mean of all four layers' maximum noise amplitudes. This cross-layer mean metric gives the whole stack's average slowdown when per-layer margin adaptation is used. By isolating the cross-layer noise interaction, the V-S PDN can improve system performance with less slowdown. Since margin adaptation changes clock frequency only slightly (e.g., a few percent), we ignore its impact on processors' power consumption in this study.
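The two margin policies can be contrasted with a small Python sketch. It applies the stated assumption that x% Vdd noise costs x% clock frequency; the function names and the per-layer noise values are hypothetical and are not taken from Table II.

```python
def slowdown_per_layer_adaptation(max_noise_pct_per_layer):
    """Average slowdown (%) when each layer adapts its margin to its own noise.

    This is the cross-layer mean of the per-layer maximum noise amplitudes.
    """
    return sum(max_noise_pct_per_layer) / len(max_noise_pct_per_layer)


def slowdown_static_margin(max_noise_pct_per_layer):
    """Slowdown (%) when a single design-time margin must cover the worst layer."""
    return max(max_noise_pct_per_layer)


if __name__ == "__main__":
    noise = [3.0, 3.5, 8.0, 3.5]  # hypothetical %Vdd maxima for a 4-layer stack
    print(slowdown_per_layer_adaptation(noise), slowdown_static_margin(noise))
```

When one noisy layer no longer raises the noise of its neighbors, the cross-layer mean drops even if the whole-stack maximum does not, which is why isolation pays off mainly under per-layer adaptation.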
B. Allocating On-Chip Capacitance: A Tradeoff Study
The on-chip integrated capacitors can serve either as explicit decap for both traditional and V-S PDNs, or as fly-caps for V-S PDNs' SC converters. Because of their high area overhead, the total amount of on-chip capacitance is usually limited. It is therefore important to understand the tradeoff between the allocation of explicit decap and SC converters in the V-S PDN before we compare the overall area overhead and voltage noise quality between the two schemes.
B.1 Workload selection
In order to understand 3D-ICs' voltage noise level under a wide range of workload conditions, we construct different scenarios to stress both traditional and V-S PDNs. Starting from our sample pool, we first sort all workloads by average power consumption and then select the top, middle, and bottom one-percentile samples as candidate groups, categorized as high (H), medium (M), and low (L). Using these candidate groups, we build the following three classes of multi-layer workloads. The first class (All_H, All_M, and All_L) assigns different samples from the same group to different layers in the 3D-IC. The second class (H/M and H/L) selects samples from any two candidate groups and assigns them to the 3D stack in an interleaved fashion. This pattern is particularly stressful for V-S PDNs, because it forces all layers' SC converters to provide the same large amount of current, and the SC converters' output voltage drop is directly proportional to the load. In fact, the interleaved high-low (H/L) combination is the worst-case scenario for V-S PDNs. The last class (H_lkstp, M_lkstp, and L_lkstp) constructs a "lock-step" execution pattern by replicating the same workload across the entire stack. With all layers' power consumption changing simultaneously, this class excites the largest LdI/dt and LC resonance voltage noise in the PDN. As an estimate of the worst-case scenario, we select the workloads with the highest single-layer noise within each of the H, M, and L candidate groups.
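The selection procedure can be sketched in Python as follows. Samples are represented as hypothetical `(sample_id, average_power_w)` tuples; the function names, the choice of the centered slice for the medium group, and the tie-breaking are illustrative assumptions rather than the exact procedure used for the paper's experiments.

```python
def candidate_groups(samples, fraction=0.01):
    """Split samples into low/medium/high groups by average power.

    samples: list of (sample_id, average_power_w) tuples.
    Returns the bottom, middle, and top `fraction` of the power-sorted pool.
    """
    ranked = sorted(samples, key=lambda s: s[1])
    n = len(ranked)
    k = max(1, round(n * fraction))
    mid_start = (n - k) // 2  # assumed: medium group is centered on the median
    return {
        "L": ranked[:k],
        "M": ranked[mid_start:mid_start + k],
        "H": ranked[n - k:],
    }


def interleaved_assignment(group_a, group_b, n_layers):
    """Alternate samples from two groups across layers (e.g., the H/L class)."""
    layers = []
    for layer in range(n_layers):
        group = group_a if layer % 2 == 0 else group_b
        layers.append(group[(layer // 2) % len(group)])
    return layers


def lock_step_assignment(sample, n_layers):
    """Replicate one sample on every layer (the lock-step class)."""
    return [sample] * n_layers


if __name__ == "__main__":
    pool = [(i, 0.1 + 0.001 * i) for i in range(1000)]  # synthetic sample pool
    groups = candidate_groups(pool)
    print(interleaved_assignment(groups["H"], groups["L"], 4))
```

The same-group class (All_H, All_M, All_L) follows by taking the first `n_layers` entries of a single group.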
B.2 Tradeoff study
Using our example 4-layer 3D processor, we simulate both V-S PDNs and traditional PDNs with different on-chip capacitance allocations. Fig. 5a shows the cross-layer mean noise amplitude. For both PDN schemes, we sweep the percentage of die area allocated to explicit decap (x axis within each data group). For V-S PDNs, we assign different numbers of SC converters to each core (lines with different markers). We note that all SC converters have the same amount of capacitance and the same switching frequency. Using an advanced, high-density technology (e.g., trench capacitors [23]), each SC converter occupies 0.082mm², which is 3% of an ARM core. Therefore, the V-S PDN's on-chip capacitance area equals decap_area + number_SC_percore × 3%.
According to Fig. 5a, the V-S PDNs' overall noise is not as sensitive to the amount of explicit decap as the traditional PDNs', especially in the lock-step scenarios, where the traditional PDN suffers from LC resonance. This is because the SC converters not only help to smooth local LdI/dt noise with their built-in fly-capacitors, but also isolate the on-chip PDN from the package RLC loop, so that the package LC resonance is greatly suppressed. Consequently, designers can significantly reduce the amount of explicit decap in V-S PDNs. If we compare two PDN designs with the same on-chip area allocated to overall capacitance (i.e., a V-S PDN with 4 per-core converters and 3% decap allocation, and a traditional PDN with 15% decap allocation), we observe that, in terms of their respective cross-layer means, the V-S PDN's noise is significantly lower than the traditional PDN's. This means that if per-layer run-time margin adaptation is used, the performance loss will be significantly lower for V-S.
Fig. 5b shows the maximum noise amplitude observed in any layer for all test cases. The observation that the V-S PDN's cross-layer mean noise (Fig. 5a) is significantly lower than its global maximum noise (Fig. 5b) further confirms the superior cross-layer noise isolation of V-S. It also suggests that if a static worst-case noise margin is used, the V-S PDN will be worse. V-S PDN performance is better only when per-layer dynamic margin adaptation is used.
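The area accounting used for the equal-area comparison is simple enough to check in a few lines of Python. The sketch below (function and constant names hypothetical) uses the per-converter area and core area quoted in the text.

```python
SC_AREA_MM2 = 0.082   # area of one SC converter (trench-capacitor technology)
CORE_AREA_MM2 = 2.76  # area of one ARM Cortex A9 core
SC_AREA_PCT = 3.0     # rounded share of core area per converter, as used in the text


def vs_capacitance_area_pct(decap_area_pct, n_sc_per_core):
    """Total on-chip capacitance area of a V-S PDN, in percent of core area."""
    return decap_area_pct + n_sc_per_core * SC_AREA_PCT


if __name__ == "__main__":
    print(100.0 * SC_AREA_MM2 / CORE_AREA_MM2)  # close to the rounded 3%
    print(vs_capacitance_area_pct(3.0, 4))      # matches a 15% traditional decap budget
```

A V-S design with 4 converters per core and 3% explicit decap therefore spends the same 15% of area as the traditional design it is compared against.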
C. Impact of 3D Scaling
To explore the effect of 3D scaling (i.e., stacking more layers) on both the V-S and traditional PDN's noise, we simulate our example 3D processors with two to eight layers, using the eight workload combinations. To make fair comparisons, we pick the design points described in Sec. 5-B that allocate 15% of on-chip area to capacitors in both PDN schemes. Fig. 6 plots all test cases' maximum noise amplitude (both whole-stack max and cross-layer mean) across all workload conditions. In general, stacking more layers together increases voltage noise in both types of PDNs. If a constant noise margin is applied to all layers at design time, this margin has to accommodate the worst-case whole-stack maximum noise. Consequently, the V-S structure requires a smaller margin in 3D-ICs with 2 layers or more than 6 layers. With a per-layer dynamic margin adaptation technique enabled, the whole stack's average margin will be no larger than the worst-case cross-layer mean value. As a result, V-S PDNs always require a smaller timing margin, regardless of layer count. In the 8-layer 3D-IC, the V-S PDN's noise is 60% lower than the traditional PDN's.
One interesting observation is that a 2-layer V-S PDN's whole-stack maximum noise is significantly lower than the maximum noise of V-S PDNs with more layers. This is because, in a 2-layer V-S PDN, the output voltage variations of the SC converters only affect one supply net (either foot-bounce or head-droop) of any layer, while the other net is directly connected to the off-chip voltage source via C4 pads. As silicon layers are added, foot-bounce and head-droop can both occur on the same layer, which significantly increases noise.
D. Impact of Package Impedance
Chip package impedance has a significant impact on the supply-voltage noise [24]. Although package designs with lower impedance can provide more current with lower noise, they usually have higher cost due to their increased complexity (e.g., more layers of power planes to reduce the Table I) and reduce the package capacitance by half. We note that this scaling factor does not change the package RLC loop's resonance frequency.
Fig. 7 illustrates how package impedance affects both V-S and traditional PDNs' noise in a 4-layer 3D processor. Compared with the traditional PDNs, the maximum noise in V-S PDNs is much less sensitive to the package quality. For example, a 300% impedance increase raises the V-S PDN's worst-case noise by only 0.23% Vdd. Since the V-S PDN reduces off-chip current significantly, package impedance contributes much less noise overall. By relaxing the constraint on package impedance, the V-S PDN is expected to reduce the cost of 3D-IC packaging.
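The remark about an unchanged resonance frequency can be illustrated with the textbook lumped-LC relations f0 = 1/(2π√(LC)) and Z0 = √(L/C). The Python sketch below assumes that the package inductance is multiplied by the same factor k that divides the capacitance, which is the scaling under which f0 stays constant; this factor pairing, the function names, and the baseline L and C values are hypothetical, since the source text describing the exact scaling is incomplete.

```python
import math


def resonance_hz(l_h, c_f):
    """Resonance frequency of a lumped LC loop, f0 = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_h * c_f))


def characteristic_impedance_ohm(l_h, c_f):
    """Characteristic impedance of the loop, Z0 = sqrt(L / C)."""
    return math.sqrt(l_h / c_f)


def scale_package(l_h, c_f, k):
    """Assumed scaling: multiply L by k and divide C by k (keeps L*C constant)."""
    return l_h * k, c_f / k


if __name__ == "__main__":
    l0, c0 = 50e-12, 100e-6  # hypothetical baseline package values
    l1, c1 = scale_package(l0, c0, 2.0)  # halves the package capacitance
    print(resonance_hz(l0, c0), resonance_hz(l1, c1))
    print(characteristic_impedance_ohm(l1, c1) / characteristic_impedance_ohm(l0, c0))
```

Under this assumption, halving the capacitance doubles the loop's characteristic impedance while leaving the resonance frequency where it was, so noise differences can be attributed to impedance magnitude rather than a shifted resonance.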
VI. Conclusions
In this paper, we build a whole-system PDN model to: 1. Examine voltage-stacked 3D-ICs' transient noise under different workload conditions; 2. Compare voltage noise between the V-S PDN and the traditional PDN in the context of 3D scaling; 3. Explore the impact of various PDN design parameters. Our simulation results show that, compared with a traditional PDN, the V-S PDN provides stronger isolation against cross-layer noise interference, but suffers higher noise in the particular case of highly imbalanced workloads. This is mitigated if dynamic, per-layer margin adaptation is used to respond to severe noise. In that case, the V-S PDN can better reduce timing margin and improve system performance. Without incurring extra on-chip area overhead for the integrated capacitors, the V-S PDN's cross-layer mean noise amplitude under the worst-case scenario is up to 60% lower than the traditional PDN's. Furthermore, we observe that the V-S PDN allows lower packaging cost for 3D-ICs. Overall, we demonstrate that the V-S PDN provides a low-noise, low-cost, and scalable solution to the challenges of 3D-IC power delivery.
