ABSTRACT
INTRODUCTION
Two dominant types of noise are present in a power delivery network (PDN): peak noise and resonance noise [1] . Peak noise usually occurs when the instantaneous switching current load becomes maximum [2] for a short duration with its energy spectrum lying in the high-frequency range [1] . Abundant research has been done to minimize peak noise for PDN design (e.g., [3] [4] [5] [6] ).
Resonance noise is a result of the distributed RLC characteristics of a PDN, which includes parasitic inductance of interconnect and decoupling capacitance. The PDN forms a resonant tank that produces impedance peaks at multiple resonant frequencies. The dominant resonance frequency (fres) usually occurs in the low-to-middle frequency range (MHz to 100 MHz) [7, 8]. Although the operating frequencies of high-performance chips are generally much higher than this resonance frequency, the runtime loading frequency of a chip is not. When a looping sequence of instruction execution makes the current loads repeat at a rate close to fres, the load excites the impedance peak at this resonance frequency, causing persistent undershoots and overshoots that exceed the droop tolerance of the PDN. Resonance noise compromises chip performance, hold-time margins, and gate oxide integrity [7, 9]. Despite the importance of resonance noise for reliable PDN design, resonance noise suppression has not gained enough attention in the EDA community.
Traditional static solutions, such as adding more passive capacitors or more supply pins, are not effective at suppressing resonance noise. Hence, dynamic run-time solutions have recently been studied in the literature. For example, the authors of [9] proposed to dynamically switch on-chip decoupling capacitors to suppress resonance noise in microprocessor PDN designs, while the authors of [7] provided an on-die resonance-suppression circuit that uses band-limited active damping to reduce resonance noise. However, all these approaches are retroactive, i.e., they remedy the noise problem only after it has occurred, which is often too late because wrong values might already have been latched. A better approach should therefore proactively suppress the resonance noise when such issues are predicted to happen soon.
The major contributions of this paper are as follows. We model chip dynamic current loads as a high-dimensional generalized Markov process, and develop a novel stochastic method to predict the future current load based on knowledge of the existing current profile. A proactive PDN design approach is proposed to suppress resonance noise by leveraging a frequency actuator consisting of on-chip programmable PLLs and dynamic power supply current sensors [10, 11]. We develop an efficient control algorithm to judiciously select the run-time clock frequency so that the resonance noise is contained below the tolerance bound with minimum impact on chip performance. Compared with the baseline design without a frequency actuator, experimental results show that our frequency actuator design alone reduces maximum noise by 16% and average noise by 30%, while our proactive frequency actuator with current prediction reduces maximum noise by 77% and average noise by 85%. In terms of system-level performance, compared with the baseline model, our frequency actuator alone can reduce the system latency overhead by up to 35%, and with current prediction it can reduce the system latency overhead by up to 93%.
The remainder of the paper is organized as follows. We motivate the study of this work in section 2, and present the problem formulation and overall design methodology in section 3. We develop the stochastic current prediction algorithm in section 4, and propose the optimum frequency selection in section 5. The experimental results are presented in section 6 and concluding remarks are given in section 7.
MOTIVATION
Most existing work on PDN designs models the load of a port as a single current spike I0(t) with a short duration time τ as shown in Fig. 1 , which is typically modeled as a triangular waveform within [0, τ ]. Hence the peak noise resulting from I0(t) is essentially a high-frequency noise with its frequency in the range of 1/τ . For 65 nm designs, the duration τ is on the order of 100 ps, which produces peak noise at the frequency range on the order of 10 GHz. But in reality, the current load I(t) possesses a much lower frequency component because of the periodic nature of functional execution. Without loss of generality, we assume I(t) has I0(t) as its only current component with a period of T >> τ as shown in Fig. 1 ; while the real case can be treated as a superposition of this simple scenario with different combinations of I0(t) and T . By performing AC analysis on the circuit, we obtain the voltage response at the port of interest as
where H(jω) and I(jω) are the impedance and current load of the PDN at the port of interest, respectively, with ω0 = 2π/T. Though the PDN model used in our work is a meshed RLC network, we illustrate the concept of resonance noise through a simple circuit model of the PDN as shown in Fig. 1. The dominant resonance frequency fres of this system is approximately given by
where Cc is the lumped on-chip intrinsic capacitance, Cd is the decoupling capacitance, and Lp is the lumped package inductance. For some typical extracted values of an industrial design, Fig. 2 illustrates its impedance frequency response with the dominant resonance frequency around 50 MHz. As the resonance frequency fres for a typical PDN design is on the order of 100 MHz (equivalent to 20 cycles for a 2 GHz CPU), it is far from the frequency range of peak noise, but rather closer to the periodic function execution frequency. When the low-frequency components of I(t) are close to ωres, i.e.,
the voltage fluctuation Vn(jkω0) would increase significantly and cause chip malfunction. The voltage drop measured at this resonance frequency is called resonance noise. Since the impedance H(jω) at the port of interest is fixed for a given PDN topology, the magnitude of Vn(jω) is proportional to the magnitude of the current load I(jω). Therefore, we also refer to the current load measured at this resonance frequency as resonance noise whenever there is no ambiguity.
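The resonance condition above can be checked numerically. The following is a minimal sketch, assuming the standard lumped LC-tank expression fres ≈ 1/(2π√(Lp(Cc + Cd))) for the simple circuit model; the function names and the component values are hypothetical and chosen only to land near the 50 MHz resonance quoted for Fig. 2.

```python
import math


def resonance_frequency(l_pkg, c_chip, c_decap):
    """Resonance frequency in Hz of a lumped LC tank (assumed model):
    f_res = 1 / (2*pi*sqrt(Lp * (Cc + Cd)))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_pkg * (c_chip + c_decap)))


def nearest_harmonic(load_period, f_res):
    """Return (k, detuning) where k/load_period is the harmonic of the
    periodic load closest to f_res and detuning is |k*f0 - f_res| / f_res."""
    f0 = 1.0 / load_period
    k = max(1, round(f_res / f0))
    return k, abs(k * f0 - f_res) / f_res


# Hypothetical lumped values: 0.1 nH package inductance, 40 nF intrinsic
# capacitance, 60 nF decoupling capacitance.
f_res = resonance_frequency(0.1e-9, 40e-9, 60e-9)

# A load that repeats every 20 ns (a 40-cycle loop at 2 GHz) has its
# fundamental at 50 MHz, which sits almost on top of f_res.
k, detuning = nearest_harmonic(20e-9, f_res)
print(f"f_res = {f_res / 1e6:.1f} MHz, harmonic k = {k}, detuning = {detuning:.3%}")
```

A small detuning for some harmonic k indicates that the load period T places energy near the impedance peak, which is the situation the frequency actuator is meant to avoid.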
It is generally believed that inserting decoupling capacitance can minimize noise of a PDN. This is valid only for high-frequency peak noise reduction, but not for the suppression of resonance noise. Resonance noise greatly depends on the run-time operation, as it affects the low-frequency components of I(t). An effective way to minimize the low-frequency resonance noise would be to change clock period T directly so that the relation (2) does not hold.
PROBLEM FORMULATION
In order to control the low frequency component of the current load to avoid the resonance frequency, we need to dynamically adjust work load period T . There are two possible ways to apply the adjustment: the first one is to adjust the PLL to change clock frequency; the other one is to adjust the power supply voltage level [12] . Either approach can effectively change the duration of work load, thus achieving different frequency response. As an illustration, we choose to adjust clock frequency directly by employing a programmable PLL design similar to [13, 14] in this paper.
We assume the PLL allows a range of clock frequencies from fmin to fmax, where the chip is signed-off for fmax. In other words, the chip will work for any frequency below fmax. The actual chip performance will vary for different frequencies depending on the application. However, note that frequency adjustments are temporary, so the impact on performance is also limited, as will be verified by our experiments.
We employ on-chip current sensors to monitor the dynamic current load I(t) for each clock domain of interest in the design. Based on the historical I(t) data, a control unit determines an optimal clock frequency to be generated by the programmable PLL. This procedure is repeated so that the low-frequency components of the current load I(t) do not come close to the resonance frequency fres. In other words, we keep the resonance noise below a user-specified tolerance bound. Because it takes time for the PLL to track the adjustment, it is important to select an optimal clock frequency so that the performance degradation is minimized.
The control unit can be implemented as a frequency actuator in hardware, and it consists of two major parts. The first is a current load prediction module or predictor, which predicts the incoming current load profile and its impact on frequency response based on the history data of current loads. The second module is an optimizer, which determines an optimal clock frequency to be generated by the programmable PLL.
To reflect reality, the following design constraints are considered: (1) a finite number m of discrete clock frequencies for the programmable PLL to choose from; (2) non-instantaneous PLL tracking time, i.e., it takes a certain number of clock cycles for the programmable PLL to transition from the current clock frequency to the next one; (3) transition overhead, i.e., to limit the cost of frequency switching, the PLL has to stay at each chosen clock frequency for at least a certain number of clock cycles before it can move to the next frequency.
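The three constraints can be captured in a small behavioral model. The sketch below is illustrative only: the class name `PllModel`, its methods, and the minimum dwell of 100 cycles are hypothetical, while the 75-cycle tracking time matches the value used in the experiments.

```python
class PllModel:
    """Behavioral sketch of a programmable PLL with (1) a discrete frequency
    set, (2) a tracking delay, and (3) a minimum dwell time per frequency."""

    def __init__(self, freqs_ghz, tracking_cycles, min_dwell_cycles):
        self.freqs = list(freqs_ghz)
        self.tracking = tracking_cycles
        self.min_dwell = min_dwell_cycles
        self.current = self.freqs[0]
        self.target = self.current
        self.tracking_left = 0
        # Start "dwelled" so the very first request can be accepted.
        self.cycles_since_change = min_dwell_cycles

    def request(self, freq):
        """Ask for a new frequency; return True if the request is accepted."""
        if freq not in self.freqs:
            return False          # constraint (1): not an available frequency
        if freq == self.current and self.tracking_left == 0:
            return True           # nothing to do
        if self.tracking_left > 0:
            return False          # constraint (2): still tracking a change
        if self.cycles_since_change < self.min_dwell:
            return False          # constraint (3): dwell time not yet met
        self.target = freq
        self.tracking_left = self.tracking
        return True

    def tick(self):
        """Advance one clock cycle and return the frequency in effect."""
        if self.tracking_left > 0:
            self.tracking_left -= 1
            if self.tracking_left == 0:
                self.current = self.target
                self.cycles_since_change = 0
        else:
            self.cycles_since_change += 1
        return self.current


pll = PllModel([1.5, 1.4, 1.3], tracking_cycles=75, min_dwell_cycles=100)
accepted = pll.request(1.4)
for _ in range(75):
    pll.tick()
print(accepted, pll.current)
```

An optimizer built on top of such a model has to plan around the cycles during which `request` is refused, which is why prediction over a horizon of future cycles matters.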
In the interest of space, we focus on the CAD aspects of the proposed methodology, i.e., how we predict the current profile based upon the historical current sensor data, and how this profile can be utilized to select the optimal frequency to suppress resonance noise for a PDN design. A detailed discussion of how to design power supply dynamic current sensors with minimum area and power overhead is beyond the scope of this paper; interested readers may refer to [10, 11].
STOCHASTIC CURRENT PREDICTION

Current Prediction Modeling
To select an optimal frequency, we need to know how the frequency response would change for the incoming work load variations. To do so, we first need to predict how the current loads would change for the next few clock cycles.
For a given clock domain of interest, there are n current sensors monitoring the current load. We represent the current waveform within one clock cycle as a triangular waveform, and each current sensor records either a peak or an average current value for this waveform. The monitored value for current sensor j at cycle k is denoted as i^j_k; in other words, there is a correspondence between a monitored current value and the triangular current waveform that it represents. We record all the currents for the same cycle as a vector I k , i.e.,
where n is the total number of sensors. Under different input vectors and working conditions, I k would be different for different cycles. Moreover, for I k that are close in cycles, they are highly correlated; while for I k that are far apart in cycles, they are less (or even not) correlated. The correlation distance D is the number of cycles such that all I k are uncorrelated when they are at least D cycles apart.
Based on these observations, we model I k as a generalized Markov stochastic process over different clock cycles. A generalized Markov process is a stochastic process whose value at time k depends not only on its value at time k − 1, but also on its values at time k − 2, . . ., k − Q. These past states collectively are called the history of length Q of the process [15] .
We propose to use a linear filter as the predictor to predict the current load
where Ψi are n × n coefficient matrices to be determined, while I k−i are historical current vectors. Clearly the choice of Q, and hence of the number of matrices Ψi, balances the trade-off between prediction accuracy and computational efficiency. In this work, we set Q equal to the correlation distance D, as any current vectors that are D cycles apart have no correlation. Moreover, instead of using all Q historical current vectors, we sample M of them for prediction, i.e.,
where L is the sampling separation such that M · L = Q. In other words, we reduce the number of unknown coefficient matrices Ψi from Q to M. Our goal is to determine the set of M coefficient matrices Ψi such that (5) is a good predictor for I k for any randomly selected current vectors in Q consecutive clock cycles. Mathematically, this problem can be stated as follows.
We propose to solve (6) in two approaches with each providing different trade-offs between prediction accuracy and computation complexity (hardware area cost). Obviously, the prediction accuracy depends on both M and L, and we will report this in the experimental section.
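To make the sampled predictor concrete, the sketch below evaluates one prediction step. It assumes the sampled form of the linear filter is Î k = Σ Ψi I k−i·L for i = 1, ..., M, which is one plausible reading of the description above; the function names and the toy coefficient values are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply an n-by-n matrix (list of rows) by a length-n vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def predict_current(psi, history, spacing):
    """Predict the next current vector from sampled history.

    psi      list of M coefficient matrices (each n-by-n)
    history  list of past current vectors, history[-1] being the most recent
    spacing  sampling separation L, so M * L history vectors are required
    """
    m = len(psi)
    if len(history) < m * spacing:
        raise ValueError("need at least M * L history vectors")
    estimate = [0.0] * len(history[-1])
    for i in range(1, m + 1):
        past = history[-i * spacing]          # current vector i*L cycles back
        term = mat_vec(psi[i - 1], past)
        estimate = [e + t for e, t in zip(estimate, term)]
    return estimate


# Toy case: n = 2 sensors, M = 2 samples, L = 2 cycles between samples.
half = [[0.5, 0.0], [0.0, 0.5]]
history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(predict_current([half, half], history, spacing=2))
```

With the toy coefficients, the prediction is simply the average of the two sampled vectors, which makes the indexing into the history easy to verify by hand.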
LMS Adaptive Filter
The first approach is based on the framework of a least-mean-square (LMS) adaptive filter as illustrated in Fig. 3, where the matrices Ψi are time-varying and are dynamically adjusted at every clock cycle during runtime. We denote Ψ i,k as the value of Ψi at clock cycle k, and δΨ i,k as the adjustment for Ψ i,k . Then we have
where μ is the step size determined by experiments and e k = I k − Î k is the prediction error. In general, the LMS adaptive filter is accurate as a predictor, because it automatically adjusts itself to follow large changes in the statistical behavior of the sequence of current vectors, and is thus suitable for systems with diverse operations. However, its hardware implementation cost is relatively high, and convergence of the coefficients Ψ i,k cannot be guaranteed in all situations [16]. Therefore, we propose a second approach, called the predetermined linear filter, as an alternative solution.
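The adaptation step can be sketched as follows. The update used here is the textbook LMS rule Ψ i,k+1 = Ψ i,k + μ e k (I k−i·L)ᵀ, which is an assumption about the exact form of δΨ i,k; the function name and the toy signal are hypothetical.

```python
def lms_step(psi, history, actual, spacing, mu):
    """Run one LMS iteration in place and return (prediction, error).

    psi      list of M coefficient matrices (n-by-n lists), updated in place
    history  past current vectors, history[-1] being the most recent
    actual   the current vector observed for the cycle being predicted
    spacing  sampling separation L
    mu       step size
    """
    m = len(psi)
    n = len(actual)
    samples = [history[-i * spacing] for i in range(1, m + 1)]
    predicted = [
        sum(psi[i][r][c] * samples[i][c] for i in range(m) for c in range(n))
        for r in range(n)
    ]
    error = [a - p for a, p in zip(actual, predicted)]
    # Textbook LMS update: each matrix moves along error * sample^T.
    for i in range(m):
        for r in range(n):
            for c in range(n):
                psi[i][r][c] += mu * error[r] * samples[i][c]
    return predicted, error


# Toy case: one sensor, one tap, constant load of 2.0. The ideal tap is 1.0.
psi = [[[0.0]]]
for _ in range(50):
    _, err = lms_step(psi, [[2.0]], [2.0], spacing=1, mu=0.1)
print(psi[0][0][0], err[0])
```

In this toy case the tap error shrinks geometrically, but a step size that is too large for the input power would make the same loop diverge, which mirrors the convergence caveat noted above.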
Predetermined Linear Filter
The idea of a predetermined linear filter is to run off-line simulation and use the simulation results as training data to find an optimal set of constant matrices Ψi for the problem of (6). Similar to the vectorless P/G analysis in [17], we assume that each clock domain under study is partitioned into blocks such that different blocks are relatively independent. For each block, there are multiple ports connected to the power network, and each port is modeled as a time-varying current load for the power network. We apply extensive simulation to each block independently to get the current signatures for all ports, which are then aggregated to obtain the current signature for the n current sensors of this clock domain. After extensive simulation, we have current vectors at many different clock cycles. We then need to determine a set of Ψi that solves (6) based on these simulation data. To do so, we present the following theorem, with the detailed proof omitted because of limited space.
Theorem 1. If we define a set of matrices ri,j = E(IiI
where S ∈ R n×M ·L , R ∈ R M ·L×M ·L , and ei ∈ R M ·L×n , and are given by
with I being an n × n identity matrix at the i th block matrix of ei.
Without going into too much detail, we note that R is closely related to two types of correlation for current vectors: (1) logic-induced correlation, i.e., current loads at different locations are correlated and cannot reach their maximum at the same time because of the inherent logic dependency of a given design; and (2) temporal correlation, i.e., current loads at the same port cannot attain their maximum value at all times, and depending on the functionality being performed, the current variations in different clock cycles are correlated. The element at the m-th row and n-th column of the block matrix ri,j reflects the logic-induced correlation between locations m and n and the temporal correlation between clock cycles i and j. Therefore, the matrix R characterizes both the logic-induced correlation over all n locations and the temporal correlation over all clock cycles.
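As a rough illustration of how constant coefficients can be fitted from training data, the sketch below handles the single-sensor case (n = 1) by accumulating sample correlations and solving the resulting normal equations. This is an ordinary least-squares fit written for illustration and is not claimed to reproduce the closed form of Theorem 1; the function names and the synthetic training trace are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_taps(trace, taps, spacing):
    """Least-squares taps so that trace[k] ~ sum_i psi[i-1] * trace[k - i*spacing]."""
    corr = [[0.0] * taps for _ in range(taps)]   # sample correlation of history
    cross = [0.0] * taps                         # correlation with the target
    for k in range(taps * spacing, len(trace)):
        past = [trace[k - i * spacing] for i in range(1, taps + 1)]
        for a in range(taps):
            cross[a] += trace[k] * past[a]
            for b in range(taps):
                corr[a][b] += past[a] * past[b]
    return solve_linear(corr, cross)


# Synthetic training trace that obeys x[k] = 1.2*x[k-1] - 0.72*x[k-2] exactly.
trace = [1.0, 0.5]
for _ in range(58):
    trace.append(1.2 * trace[-1] - 0.72 * trace[-2])
print(fit_taps(trace, taps=2, spacing=1))
```

Because the fit is done once off-line, the run-time hardware only needs the multiply-accumulate of the predictor, which is consistent with the lower area cost reported for the predetermined filter.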
Once we obtain Ψi according to (10), at any clock cycle k, we can predict the current vectors of the next L cycles, Î k , Î k+1 , . . ., Î k+L−1 , by using the M · L history current loads
As there exists correspondence between a triangular waveform model and a recorded (or predicted) current value, we can reconstruct future K cycles' current waveform u(t) for all K ≤ L as
where T is the clock period, uΔ(t) is a triangular waveform with a unit peak current value whose starting time is at the beginning of each clock cycle k, and ui(t) = Î k+i uΔ(t) is the triangular waveform with the predicted peak current value of Î k+i .
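The reconstruction can be sketched for a single sensor as below. The sketch assumes that the unit triangle spans one full clock period with its apex at T/2 and that cycle i contributes Î k+i · uΔ(t − i·T); both the triangle shape and the function names are assumptions made for illustration.

```python
def unit_triangle(t, period):
    """Unit-peak triangle on [0, period] with its apex at period / 2 (assumed shape)."""
    if t < 0.0 or t > period:
        return 0.0
    half = period / 2.0
    return t / half if t <= half else (period - t) / half


def reconstruct(peaks, period, t):
    """Evaluate u(t) = sum_i peaks[i] * unit_triangle(t - i * period)."""
    return sum(p * unit_triangle(t - i * period, period)
               for i, p in enumerate(peaks))


# Two predicted cycles with peak currents 2.0 and 4.0 and a period of 1.0.
peaks = [2.0, 4.0]
for t in (0.25, 0.5, 1.0, 1.5):
    print(t, reconstruct(peaks, 1.0, t))
```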
OPTIMUM FREQUENCY SELECTION
The Fourier transformation of current load (15) can be written as
where Hi(jω) (i > 0) is the Fourier transform of ui(t), and H0(jω) is the Fourier transform of u(t) for t ≤ 0. Following the discussion in section 2, our goal is to minimize the resonance noise, i.e., the magnitude of the frequency-domain response of the current load H(jω), at ω = ω0, i.e.,
where Tmin is the clock period corresponding to the maximum clock frequency. The reason for us to add a weighted penalty function λ(T − Tmin) to the objective function is to consider the impact of performance loss resulting from changing clock frequency. The positive number λ reflects aggressiveness of our frequency actuator design. It is clear that (17) is an unconstrained nonlinear optimization problem and any general optimization techniques such as Newton's method can be applied to solve it efficiently.
In practice, by knowing the fact that only a finite number of discrete clock frequencies are available for any digital-based programmable PLL design, we develop a more efficient way of solving the problem. We denote the finite set of available programmable frequencies as {1/T1, . . . , 1/Tq, . . . , 1/Tm}, then we can easily find the optimal frequency by evaluating (17) over different Tq and select the optimal one that minimizes the objective function, i.e., (17) can be rewritten as
where HΔ(jω) is the Fourier transformation of the unit triangular waveform uΔ(t). This is exactly the optimization problem we need to solve at clock cycle k. To further improve efficiency in evaluating (18), we can pre-calculate and store
in a look-up table, as Ai,q is a floating number for 1 ≤ i ≤ L and 1 ≤ q ≤ m.
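The table-driven search over the discrete periods can be sketched as follows for a single sensor. Several details are assumptions rather than the paper's exact formulation: the unit triangle spans one clock period, the spectrum is evaluated at the resonance angular frequency, the table entries are stored as complex numbers, the contribution of past cycles (the H0 term) is ignored, and times are expressed in ns. All function names are hypothetical.

```python
import cmath
import math


def triangle_spectrum(omega, period):
    """Fourier transform at omega of a unit-peak triangle spanning [0, period]."""
    x = omega * period / 4.0
    sinc = 1.0 if x == 0.0 else math.sin(x) / x
    return (period / 2.0) * sinc * sinc * cmath.exp(-1j * omega * period / 2.0)


def build_table(num_cycles, periods, omega_res):
    """table[q][i]: spectrum of a unit triangle placed in cycle i at periods[q]."""
    return [
        [triangle_spectrum(omega_res, t) * cmath.exp(-1j * omega_res * i * t)
         for i in range(num_cycles)]
        for t in periods
    ]


def select_period(peaks, periods, table, penalty):
    """Pick the period minimizing |spectrum at resonance| + penalty * (T - Tmin)."""
    t_min = min(periods)
    best_period, best_cost = None, float("inf")
    for q, t in enumerate(periods):
        spectrum = sum(p * a for p, a in zip(peaks, table[q]))
        cost = abs(spectrum) + penalty * (t - t_min)
        if cost < best_cost:
            best_period, best_cost = t, cost
    return best_period


# Resonance at 0.1 GHz; candidate clocks of 1.0 GHz and 0.8 GHz (periods in ns).
omega_res = 2.0 * math.pi * 0.1
periods = [1.0, 1.25]
# Predicted peaks: a loop of 5 busy and 5 idle cycles, repeated 4 times.
peaks = ([1.0] * 5 + [0.0] * 5) * 4
table = build_table(len(peaks), periods, omega_res)
print(select_period(peaks, periods, table, penalty=1.0))
```

At 1.0 GHz the 10-cycle loop repeats exactly at the resonance frequency, so a small penalty weight steers the choice to the slower clock; a very large weight makes the performance term dominate and keeps the fastest clock.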
EXPERIMENTAL RESULTS
Current Prediction Verification
We first verify the accuracy and efficiency of our prediction algorithm with current data measured on a mobile chip from an industrial design. We apply both the predetermined linear filter and the LMS adaptive filter designs to our frequency actuator, and the prediction results based on simulation are illustrated in Fig. 4 (a) and (b), respectively. Both methods use 32 points (M = 32) in history with spacing L = 400, and predict the currents in the incoming 400 clock cycles. From the figure we can see that the adaptive filter provides a better prediction result (closer to the actual current) than the predetermined linear filter. Experimental results show that the adaptive filter has an average prediction error of 1.51%, whereas that of the predetermined linear filter is 13.4%. On the other hand, we observe that the maximum prediction error for the adaptive filter can be as large as 311% in the time period 5650 to 5700 ns, indicating a failure of convergence, whereas the predetermined linear filter has an error of 11.6% in those clock cycles. Fortunately, such error does not affect the proposed resonance reduction, as explained below.
Fig. 5 illustrates the predicted current spectrum from the predetermined filter (a) and the adaptive filter (b) compared with the actual current spectrum. From the figure we can see that both methods are accurate where the responses are sharp. The main prediction error occurs only at frequencies where the frequency-domain response is small, and thus does not affect our selection of the correct clock frequency.
Fig. 6 (a) shows the relationship between the average prediction error and the number of history data points M for fixed spacing L = 400. We can see that the prediction accuracy improves as M increases. For M < 30, increasing M results in a large decrease in the average error, while for M > 30, changing M has little impact on the error. Fig. 6 (b) shows the relationship between the average prediction error and the spacing L for a fixed number of history data points M = 32. From the figure, we see that there is an optimal L that corresponds to the smallest error for both methods, and this optimal value is roughly the same (L = 400) for both methods. Another interesting observation is that the adaptive filter is less sensitive to parameter changes than the predetermined filter. This is expected, as the adaptive property enables it to adjust itself to changes in the parameters.
Resonance Noise Reduction
We first study how the number of current sensors affects noise reduction on the same mobile chip. As shown in Fig. 7, the noise reduction is almost the same once the number of current sensors is greater than 5% of the total number of system ports, which translates to 10 to 100 current sensors for a leading chip. This suggests that there is no need to place many sensors for the measurement. Next we conduct experiments on the same mobile chip and one additional high-performance microprocessor to illustrate the impact on resonance noise reduction. We assume that the current profile obtained from measurement scales with the clock cycle. For these two designs, the tracking time for the PLL is 75 clock cycles. The available clock frequencies range from 1.5 GHz down to 0.8 GHz in steps of 0.1 GHz. The retroactive model incrementally reduces the clock frequency by 0.1 GHz until the noise is below the tolerance bound. Then it tries to incrementally increase the clock frequency in 0.1 GHz steps until the maximum frequency is reached or a noise violation occurs. The proactive model selects the optimal frequency based on predicted currents. We run simulation with the current profile and the distributed PDN to get the maximum and average voltage droop. The comparison results are shown in Table 1. Compared with the baseline model without a frequency actuator, the retroactive approach can only reduce the max noise by up to 14% and the mean noise by up to 33%. On the other hand, our proactive approach with the predetermined linear filter can reduce the max noise by up to 61% and the mean noise by up to 67%, while the proactive approach with the LMS adaptive filter can reduce the max noise by up to 79% and the mean noise by up to 87%.
Performance and Area Impact
For both the microprocessor and mobile examples, we simulate the system latency for one time resonance noise violation such that one time reboot is required in the baseline case. From the design, it takes 1μs to do a full save and 1μs to do a restore of the whole architecture state. The ideal latencies for the retroactive and proactive approaches are, respectively, 8M and 20M cycles. switches to avoid resonance noise and to increase clock frequency when the resonance is gone, and time loss due to slowing down the clock. In Table 2 , we report normalized latency overhead with respect to the ideal latencies for the baseline, retroactive and proactive cases. For proactive case, we have tested both the predetermined linear filter and the LMS adaptive filter. From the table, we see that compared with the baseline model, the retroactive method can reduce the system latency overhead by up to 35%, the proactive model with the predetermined linear filter can reduce that by up to 74%, and the proactive model with the LMS adaptive filter can reduce that by up to 93%. This further illustrates the importance of the proactive frequency actuator for high performance systems.
We also compare the gate count for the LMS adaptive filter and the predetermined linear filter based designs obtained from Cadence Encounter RTL Compiler, and the results are reported in Table 3 . From the table, we see that the predetermined linear filter based actuator can cause the gate count to be increased only by 0.02% for the microprocessor design, while the design of the LMS adaptive filter causes the gate count to increase by 0.4%.
CONCLUSIONS
Because of the distributed RLC characteristics of a power delivery network (PDN), runtime resonance noise in the low-to-middle frequency range may significantly affect the reliability of a PDN and chip performance. In contrast to existing retroactive solutions that remedy the noise problem only after it has occurred, we have proposed a novel design approach to proactively suppress resonance noise. We have developed an efficient stochastic current load prediction method based on generalized Markov process modeling. We have presented a frequency actuator that utilizes both on-chip dynamic current sensors and a programmable PLL for frequency adjustment. A novel optimal frequency selection algorithm has also been developed. Compared with the baseline design without a frequency actuator, experimental results show that our frequency actuator design alone reduces maximum noise by 16% and average noise by 30%, while our proactive frequency actuator with current prediction reduces maximum noise by 77% and average noise by 85%. In terms of system-level performance, compared with the baseline model, our frequency actuator alone can reduce the system latency overhead by up to 35%, and with current prediction it can reduce that by up to 93%.
