In a synchronous clock distribution network with zero latencies, digital circuits switch simultaneously on the clock edge and therefore generate substrate noise due to the sharp peaks in the supply current. We present a novel methodology that optimizes the clock tree for lower substrate noise generation by using statistical single-cycle supply current profiles computed for every clock region, taking the timing constraints into account. Our methodology is novel in that it uses an error-driven compressed data set during the optimization over a number of clock regions specified for a significant reduction in substrate noise. It also produces a quality analysis of the computed latencies as a function of the clock skew. The experimental results show a reduction of more than a factor of 2 in substrate noise generation for circuits with four clock regions whose latencies are optimized.
INTRODUCTION
There is a trend towards single-chip integration of more complex mixed-signal systems, higher speeds and lower supply voltages where the signal-integrity analysis becomes a challenging task. In mixed-signal ICs, substrate noise degrades the performance of the analog circuits due to noisy digital circuits in the same substrate.
A few publications concentrate on reducing the noise generation at the source. There are publications on low-noise logic cell design, such as low-voltage logic [1], current-mode logic [2], and CMOS gates with guard wiring and decoupling [3]. Speed degradation and lower noise margins are the drawbacks of low-voltage logic. The increase in static power consumption is a major drawback of current-mode logic and is not tolerable in large digital systems. The gates with decoupling and guard wiring have the drawback of increased area and additional supply rails. Until now, no good methodology has existed to reduce the substrate noise at its source without drawbacks such as increased area, increased power or speed degradation.
Decreasing the peak and the slope of the supply current will reduce the substrate noise since a large part of the noise is generated due to the ringing of the damped LC-tank, which is formed by the on-chip capacitance and the package inductance with series resistance on the supply. Flattening the supply current profile requires an estimation of current waveforms, which is a complex task since:
• At a given time, one or more gates can switch simultaneously, depending on the inputs and the state of the circuit.
• For a gate, the supply current differs due to the gate, the load, the supply, the input transition time and the input state.
• The waveform width depends on the penetration depth of the switching into the combinatorial logic.
Designing regular blocks, where each block has the same delay and the same supply waveform pattern, can solve the problem. One must introduce a lot of logic redundancies to achieve this goal, which is not acceptable. Another possibility is to introduce different latencies in the different regions in a clock tree to make the supply current flatter. To optimize the latencies, the use of the total transient data of the supply current is not acceptable due to the complexity of an exhaustive search for optimum clock latencies. It is therefore necessary to find a representative current waveform of all clock cycles to reduce the number of points. There exist a number of techniques to find a representative supply current waveform. A pattern-independent algorithm, iMAX [4] , generates a maximum instantaneous peak current for one clock cycle. The drawback of lacking logical constraints in iMAX has been solved by adding signal correlations [5] . Simulation based approaches have also been proposed to estimate either the maximum peak or RMS value of the supply current [6] . However, those techniques require not only an extensive simulation time due to the large number of input patterns but also the produced vectors are still large to be used for an exhaustive optimization. Besides this, none of the techniques described above gives an error interval on the estimation.
The algorithm described in [7] compresses the supply currents of M current sources into C compression sets, with a higher compression than a single-cycle for a single compression set, assuring also a user-specified error bound. However, this algorithm can result in a high number of compression sets, which is not acceptable for our optimization, as the number of compression sets is highly dependent on the temporal locality and periodicity of the currents across different clock cycles.
In this paper we present a novel methodology for substrate noise reduction, which is based on an error-driven optimization of the clock tree latencies using supply current profiles taking timing constraints and the clock skew into account. There are techniques using clock latencies in order to reduce the peak current [8] and the ground bounce [9] . However, they suffer from a large number of the constraints, given as the total number of the flip-flops. Importantly they do not give a value for the number of the clock regions, which is set by the relation between the major resonance frequency of the circuit and the rise/fall time of the supply current.
First we describe the substrate noise simulation methodology that will be used to evaluate the results of the latency optimization. Next, the steps of the latency optimization are described together with their computational complexity. Then we define figures of merit for the clock skew sensitivity of the computed latencies. Finally we present our experimental results and draw conclusions.
OVERVIEW OF SUBSTRATE NOISE SIMULATION METHODOLOGY
For large designs, it is not feasible to use a transistor-level simulation of the substrate noise generation with detailed substrate models. We have proposed a methodology called SWAN [10] to simulate the substrate noise generation from large digital circuits. The accuracy of SWAN has been verified with measurements [11]. In this paper, we will use SWAN to simulate the supply current and substrate noise. An overview of the methodology is shown in Figure 1. For every gate, a substrate macro model is characterized once using a transistor-level model of the gate. Each macro model contains two substrate noise injection mechanisms, the bulk and the supply current, together with the coupling impedances between VDD, VSS and the substrate [10]. A chip-level substrate model is extracted (see Figure 1) by using the macro models and the switching data generated by an event-driven simulator. Finally, we simulate the substrate noise using the chip-level substrate model.
CLOCK TREE LATENCY OPTIMIZATION METHODOLOGY
The coupling to the substrate from the supply current is due to the ringing of the supply voltage caused by the Ldi/dt noise and the resistive coupling via the ground contact resistance. Therefore, decreasing both the amplitude and the time derivative of the supply current will reduce the substrate noise generation. The RMS value of substrate noise is proportional to the integral of its spectrum, resulting from the multiplication of the supply current spectrum and the supply current transfer function to the substrate. Since most of the noise power is due to the major resonance frequency in the transfer function, reducing the spectrum power under this resonance will also reduce the substrate noise (see Figure 2 ). 
Figure 2. Effect of supply current on the substrate noise (the supply current spectrum Isupply(s) multiplied by the supply current transfer function H(s) gives the substrate noise spectrum).
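To make the role of the resonance concrete, the following minimal sketch treats the supply network as an ideal series RLC loop. This second-order model, the function names and the component values are illustrative assumptions, not values taken from the chip-level substrate model.

```python
import math


def lc_resonance_hz(inductance_h: float, capacitance_f: float) -> float:
    """Undamped resonance of an LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def damping_ratio(resistance_ohm: float, inductance_h: float,
                  capacitance_f: float) -> float:
    """Damping ratio of a series RLC loop: zeta = (R/2)*sqrt(C/L)."""
    return 0.5 * resistance_ohm * math.sqrt(capacitance_f / inductance_h)


# Hypothetical values: 5 nH package inductance, 0.5 ohm series resistance
# and an assumed 1 nF of on-chip capacitance.
f_res = lc_resonance_hz(5e-9, 1e-9)
zeta = damping_ratio(0.5, 5e-9, 1e-9)
print(f"f_res = {f_res / 1e6:.1f} MHz, zeta = {zeta:.3f}")
```

A damping ratio well below 1 indicates an underdamped tank, which is the condition under which sharp supply current edges excite ringing.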
For synchronous CMOS circuits, the total supply current in the time domain can be approximated by a triangular waveform as shown in Figure 2 where I p , t r , t f and E are the peak current, rise time, fall time and total charge respectively. Fourier transformation of the supply current in Figure 2 gives:
(1)
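One way to write the spectrum of such a triangle is sketched below. It assumes an ideal triangle that starts at t = 0, peaks at t = t_r and returns to zero at t = t_r + t_f; it is a derivation under that assumption and not necessarily the form used for equation (1).

```latex
% Differentiating the triangle twice leaves three impulses, so that
% (j\omega)^2 I(\omega) is a sum of three exponentials:
\[
I(\omega) = -\frac{I_p}{\omega^{2}}
\left[ \frac{1}{t_r}
     - \left(\frac{1}{t_r}+\frac{1}{t_f}\right) e^{-j\omega t_r}
     + \frac{1}{t_f}\, e^{-j\omega (t_r+t_f)} \right],
\qquad I_p = \frac{2E}{t_r+t_f}.
\]
% Low-frequency limit and high-frequency envelope:
\[
I(0) = \frac{I_p\,(t_r+t_f)}{2} = E,
\qquad
|I(\omega)| \le \frac{2}{\omega^{2}}\cdot\frac{2E}{t_r+t_f}
\left(\frac{1}{t_r}+\frac{1}{t_f}\right).
\]
```

The envelope is proportional to the product of the peak, 2E/(t_r+t_f), and the slope term, 1/t_r+1/t_f. For t_r = t_f the magnitude reduces to E·sinc²(ω t_r/2), whose notches lie at integer multiples of 1/t_r.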
As the fastest oscillating harmonic in equation (1) determines the first local minimum on the oscillating term, we can define the corner frequency (f corner ) in the supply current spectrum as the minimum of 1/t r and 1/t f . One can adjust the corner frequency, by modifying I p , t r , and t f , in order to eliminate the major resonance frequency (f res ), which is set by the chip-level substrate model (see Figure 1 ). An optimum value for the rise/fall time is computed by:
The reduction becomes larger when the corner frequency is chosen at one of the notch points in (1) as:
Without looking at the timing implications, the required minimum number of clock regions, M, is found from the ratio of the actual rise/fall time over the optimum rise/fall time. The actual rise/fall time is computed after the triangular approximation of the total supply current of the circuit. Normally, the timing implications and the multiple peaks in the supply current prevent the optimum rise/fall time from being reached simply by using M or more clock regions. This requires an optimization of the latencies of these clock regions.
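The effect of shaping on the spectrum at the resonance can be checked numerically. The sketch below evaluates the closed-form Fourier transform of an ideal triangle (obtained by differentiating the waveform twice) before and after shaping. The function name is hypothetical, and the two waveforms reuse the (50mA, 200ps) and (10mA, 1000ps) shapes and the 950MHz resonance quoted in the experimental section purely as illustrative inputs.

```python
import cmath
import math


def triangle_spectrum(freq_hz: float, i_peak: float,
                      t_rise: float, t_fall: float) -> complex:
    """Fourier transform of a triangle rising over t_rise, falling over t_fall.

    I(w) = -(I_p / w^2) * [1/t_r - (1/t_r + 1/t_f) e^{-j w t_r}
                           + (1/t_f) e^{-j w (t_r + t_f)}]
    """
    if freq_hz == 0.0:
        # DC value equals the total charge E = I_p * (t_r + t_f) / 2.
        return complex(0.5 * i_peak * (t_rise + t_fall))
    w = 2.0 * math.pi * freq_hz
    bracket = (1.0 / t_rise
               - (1.0 / t_rise + 1.0 / t_fall) * cmath.exp(-1j * w * t_rise)
               + (1.0 / t_fall) * cmath.exp(-1j * w * (t_rise + t_fall)))
    return -i_peak * bracket / (w * w)


F_RES = 950e6  # resonance frequency used only as an example
sharp = abs(triangle_spectrum(F_RES, 50e-3, 200e-12, 200e-12))
flat = abs(triangle_spectrum(F_RES, 10e-3, 1000e-12, 1000e-12))
print(f"spectral magnitude at resonance: {sharp:.3e} -> {flat:.3e}")
```

Both waveforms carry the same charge, so the difference at the resonance comes only from the lower peak and slower edges. With 1000ps edges the first notch falls at 1GHz, close to the assumed resonance.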
The clock latency optimization (see Figure 3 ) consists of 3 main parts: (1) assignment of every instance into M clock regions, (2) folding of supply current transients, (3) optimization of latencies. The next three sections will describe these steps in more detail.
Figure 3. Clock latency optimization flow. Inputs: gate-level netlist, extended VHDL library to monitor switching activities, substrate macro library, set-up and hold time constraints between clock regions. Steps: gate-level VHDL simulation; derivation of weighted transition probability for every instance; assignment of cells into M clock regions; generation of supply current transients for M clock regions; generation of a single-cycle supply current profile for each clock region; optimization of latencies for each clock region. Output: gate-level netlist with new clock tree; gate-level VHDL simulation.
Clock region assignment
It is important to balance the instances over the clock regions for a significant reduction in the substrate noise generation. First, we divide the instances into sets FF i , where FF i is the set of all instances (u ij ) that have a data dependency on the driving flip-flop (ff i ). The set FF i also contains the driving flip-flop (ff i ). Note that in general the intersection of two sets may not be empty. Instance assignment is based on transitive fan-out analysis from each flip-flop (ff i ). We define the weighted transition probability, a figure indicating the switching statistics of all instances assigned to clock region-m (CR m ), weighted with the individual contribution of the instances during a single switch, as follows:
where n ij,switch is the number of switching activities of the instance (u ij ), and A ij,RMS is the average RMS value of the supply current for u ij over all switching occurrences of u ij . N is the total number of flip-flops. F(i) is the total number of instances in set FF i . The A ij,RMS factor reflects the average contribution of u ij to the RMS value of the total supply current waveform during a single switching of u ij . An instance u ij is strictly an element of CR m if all sets (FF i ) to which u ij belongs are in CR m . To derive the A ij,RMS and n ij,switch values for each instance in the network, a gate-level simulation of the initial netlist using SWAN is performed. The sets FF i are joined into appropriate clock regions such that the weighted transition probabilities are balanced at 1/M. When u ij ∈{FF i1 , FF i2 } and FF i1 and FF i2 are assigned to different clock regions, e.g. FF i1 → CR m1 and FF i2 → CR m2 , u ij is assigned to CR m1 if the weighted transition probability of u ij is caused mostly by CR m1 , and to CR m2 otherwise. It is vital to reduce the shared set of cells as much as possible to reduce the possible glitches, which cause an increase in power, integrity problems and the error term in the supply current compression described in Section 3.2.
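The balancing step can be illustrated with a simple greedy heuristic. The text does not state which balancing algorithm is used, so the longest-first assignment below is a swapped-in technique. The weight of a set is assumed to be the sum of n ij,switch times A ij,RMS over its instances, shared instances are ignored, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def set_weight(instances: Sequence[Tuple[int, float]]) -> float:
    """Assumed weight of one FF set: sum of n_switch * A_rms over its instances."""
    return sum(n_switch * a_rms for n_switch, a_rms in instances)


def balance_regions(weights: Dict[str, float], num_regions: int) -> List[List[str]]:
    """Greedy longest-first assignment of FF sets to clock regions.

    Each set goes to the currently lightest region, so the final loads
    differ by at most the largest single weight.
    """
    regions: List[List[str]] = [[] for _ in range(num_regions)]
    loads = [0.0] * num_regions
    for name in sorted(weights, key=lambda n: (-weights[n], n)):
        target = min(range(num_regions), key=lambda m: loads[m])
        regions[target].append(name)
        loads[target] += weights[name]
    return regions


# Hypothetical switching statistics: (n_switch, A_rms in mA) per instance.
ff_sets = {
    "ff1": [(40, 0.10), (40, 0.10)],
    "ff2": [(70, 0.10)],
    "ff3": [(30, 0.20)],
    "ff4": [(50, 0.10)],
    "ff5": [(20, 0.20)],
}
weights = {name: set_weight(inst) for name, inst in ff_sets.items()}
regions = balance_regions(weights, num_regions=2)
shares = [sum(weights[n] for n in r) / sum(weights.values()) for r in regions]
print(regions, shares)
```

The printed shares show how close each region comes to the 1/M target.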
Folding of supply current transients
After the assignment of the instances to each clock region, the individual supply transient for each clock region is generated from the previously explained transient simulation of the supply current used during clock region assignment. This results in M.(T clock /∆t).n cycle data points, where M is the number of clock regions, T clock is the clock period (an integer multiple of ∆t), ∆t is the unit time step of the simulation and n cycle is the total number of clock cycles. We discretize a single clock cycle into K time intervals, where each time interval can be chosen as ∆t, resulting in K=T clock /∆t. We define I(k,m,n) as the actual value of the supply current at time interval-k, clock cycle-n and clock region-m. For each clock region, the union of I(k,m,n) points is compressed into a set of supply current profiles, each having a single clock cycle representation. Every clock region contains at least one current profile defined over every time interval. The compression can create more than one profile in a clock region depending on the user error bound. The set of supply current profiles is defined as:

$$IP_m = \{\, I_p(k,m,p) \mid p = 1, \ldots, P(m) \,\} \tag{5}$$
where P(m) is the number of elements in the set of supply current profiles, IP m , defined for clock region-m. Under a tight assumption on the temporal locality and periodicity of the currents across different clock cycles, one can choose as the supply current profile a single supply current cycle with maximum peak-to-peak value, that is I p (k,m,p=1)= I(k,m,n o ), where n o is the clock cycle number with maximum peak-to-peak value on the supply current. This choice of n o leads to wrong results when the temporal locality assumption is violated. A better representation is to compute statistical properties such as the mean, the standard deviation and the probability density function at each profile point I p (k,m,p) using all the points in the actual waveform. I p (k,m,p) contains a set of statistical functions given as:

$$I_p(k,m,p) = \{\, \mu(k,m,p),\ \sigma(k,m,p),\ h(k,m,p) \,\} \tag{6}$$

where µ, σ and h are the mean, the standard deviation and the normalized histogram of the actual supply current points at time interval-k. Figure 4 shows the folding procedure.
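The folding of one clock region into a single profile can be sketched as follows. This is a minimal illustration that assumes the transient is stored as a flat list of K samples per cycle and keeps only the mean and standard deviation at each time interval; the histogram and the error-driven creation of additional profiles are omitted, and the function name is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def fold_transient(samples: Sequence[float],
                   k_per_cycle: int) -> List[Tuple[float, float]]:
    """Fold a multi-cycle transient into one cycle of (mean, std) pairs.

    samples[n * K + k] is the current at time interval k of clock cycle n.
    """
    if k_per_cycle <= 0 or len(samples) % k_per_cycle != 0:
        raise ValueError("transient length must be a multiple of K")
    n_cycle = len(samples) // k_per_cycle
    profile = []
    for k in range(k_per_cycle):
        # All samples that fall on the same time interval k.
        column = [samples[n * k_per_cycle + k] for n in range(n_cycle)]
        profile.append((fmean(column), pstdev(column)))
    return profile


# Two clock cycles with K = 3 time intervals each (arbitrary units).
profile = fold_transient([0.0, 2.0, 4.0, 0.0, 4.0, 4.0], k_per_cycle=3)
print(profile)
```

A large standard deviation at some interval signals that a single profile represents the cycles poorly there, which is what the error interval analysis quantifies.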
The error interval on the supply current folding is derived for every time interval by using the derived probabilities from the normalized histogram h(k,m,p). For a confidence percentage, P c (such as 98%), one has to find a minimum value of ε from:
As n cycle →∞, a normal distribution is likely to occur when the circuit contains a large number of independent processes and has an input vector having a normal distribution. In this case, the error interval on I p (k,m,p) is given as µ(k,m,p)±2.32σ(k,m,p) with 98% confidence. Using error interval analysis on I p (k,m,p), the set of supply current profiles, IP, is extracted within a given error interval (ε user ). For each I(k,m,n), the confidence percentage, P c , on ε user is tested using (7) by substituting ε= ε user . If the inequality in (7) is not satisfied for any of the elements of the IP m set, an additional current profile, I p (k,m,p), is generated in IP m using (6) and P(m) is incremented by 1; otherwise, I p (k,m,p) is computed for the profile-p in IP m for which P c is maximum. The trade-off in this error-driven profile extraction is the decrease of the compression ratio. T clock always has to be chosen larger than the maximum width of the supply current to satisfy timing constraints. For this reason, a filtering is performed on the supply current transients to increase the computational efficiency of the folding algorithm.
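A minimal sketch of the error interval computation at one time interval is given below. It assumes that the interval is centered on the mean and that ε is the smallest half-width containing at least a fraction P c of the samples; this reading of (7) and the function name are assumptions. The last lines check the normal-distribution factor: a two-sided 98% interval corresponds to the 0.99 quantile of the standard normal distribution.

```python
import math
from statistics import NormalDist, fmean
from typing import Sequence


def min_error_interval(values: Sequence[float], confidence: float) -> float:
    """Smallest eps such that >= confidence of values lie within mean +/- eps."""
    if not values or not 0.0 < confidence <= 1.0:
        raise ValueError("need samples and a confidence in (0, 1]")
    mu = fmean(values)
    deviations = sorted(abs(v - mu) for v in values)
    # Number of samples that must fall inside the interval.
    needed = max(1, math.ceil(confidence * len(deviations) - 1e-9))
    return deviations[needed - 1]


# Empirical interval for 101 evenly spread hypothetical samples.
eps = min_error_interval(list(range(101)), confidence=0.98)

# Normal case: half-width in units of sigma for 98% two-sided confidence.
z_98 = NormalDist().inv_cdf(0.99)
print(eps, round(z_98, 3))
```

The computed factor of about 2.326 agrees with the 2.32σ interval quoted above for the normal case.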
Clock latency optimization
Clock latency optimization is based on an exhaustive search of all latencies for a minimum on the cost function, which will be described later. An exhaustive search is necessary as the problem is NP-complete. The latencies have to be constrained with timing constraints defined in Figure 5 . The clock region-i communicating with clock region-j has to satisfy the following constraints:
Setup time constraint: 
where ∆t clk,max/min (i,j) is the maximum/minimum allowed latency between Clk i and Clk j , and δ is the clock uncertainty due to the unexpected skew coming from the clock interconnect. The constraint for each clock latency becomes:
The optimization procedure is to find the best M-latency bundle (l 1 ,l 2 ,..,l m ,…,l M ) that gives the minimum value of the cost function, computed using the total supply current shifted with the latencies, where l m is defined as the latency value of clock region-m. One can freely set one of the latencies to zero, such as l 1 =0, so that one of the clock regions is aligned to the edge of the clock.
At each latency value, the cost function is evaluated as the product of the peak value and the slope of the total supply current. This comes as a direct result of equation (1), which states the direct proportionality of the supply current spectrum to the product of the peak, 2E/(t r +t f ), and the inverse of the rise/fall time, 1/t r +1/t f , which is proportional to the slope of the supply current. The optimization tries to minimize this factor in order to reduce the spectral energy of the supply current and therefore the RMS value of the substrate noise. The optimization is performed on the constraint space formed by the latencies (l 1 , l 2 ,…,l M ) as follows:

$$\min_{(l_1, l_2, \ldots, l_M)} f_{cost}(l_1, l_2, \ldots, l_M), \qquad f_{cost} = I_{peak} \cdot (di/dt)_{RMS} \tag{10}$$
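A minimal sketch of such an exhaustive search is given below. It assumes one single-cycle profile per clock region sampled at K points, integer latencies in units of the time step, circular shifting within the clock period, and pairwise bounds on latency differences standing in for the set-up and hold constraints. The cost is the peak times the RMS of the step-to-step slope. All names, bounds and profile values are hypothetical.

```python
import math
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

Bounds = Dict[Tuple[int, int], Tuple[int, int]]


def total_current(profiles: Sequence[Sequence[float]],
                  latencies: Sequence[int]) -> List[float]:
    """Sum the region profiles, each shifted circularly by its latency."""
    k_points = len(profiles[0])
    total = [0.0] * k_points
    for profile, shift in zip(profiles, latencies):
        for k, value in enumerate(profile):
            total[(k + shift) % k_points] += value
    return total


def cost(total: Sequence[float]) -> float:
    """Peak value times the RMS of the step-to-step slope."""
    k_points = len(total)
    steps = [total[(k + 1) % k_points] - total[k] for k in range(k_points)]
    slope_rms = math.sqrt(sum(s * s for s in steps) / k_points)
    return max(total) * slope_rms


def optimize_latencies(profiles: Sequence[Sequence[float]], max_latency: int,
                       bounds: Bounds) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """Try every latency bundle with l_1 = 0; bounds[(i, j)] limits l_j - l_i."""
    best = None
    for rest in product(range(max_latency + 1), repeat=len(profiles) - 1):
        latencies = (0,) + rest
        feasible = all(lo <= latencies[j] - latencies[i] <= hi
                       for (i, j), (lo, hi) in bounds.items())
        if not feasible:
            continue
        value = cost(total_current(profiles, latencies))
        if best is None or value < best[0]:
            best = (value, latencies)
    return best


# Two identical hypothetical region profiles with K = 8 points.
profile = [0.0, 4.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
best_cost, best_latencies = optimize_latencies(
    [profile, profile], max_latency=3, bounds={(0, 1): (0, 3)})
print(best_cost, best_latencies)
```

The loop visits (max_latency+1)^(M-1) bundles, which is why restricting the search space with the timing constraints and compressing the current data matter for larger M.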
COMPUTATIONAL COMPLEXITY
During the folding of the supply current transients of M clock regions, each having K.n cycle data points, the mean, the standard deviation and the probability histogram are computed recursively. Due to this recursion, the computational complexity is bounded by O(K.M.n cycle ) when there is no error control. With error control, the complexity becomes O(K.M.P MAX .n cycle ), where P MAX is the maximum number of supply profiles stored in IP m . P MAX increases as the user error interval decreases. The compression factor with and without error control becomes P MAX /n cycle and 1/n cycle respectively. The complexity of the clock latency optimization is O(K.(M+1).P MAX .a^(M-1)), where a^(M-1) << K^(M-1) is the dimension of the search space of all M-1 latencies under set-up and hold time constraints when one of the latencies is set to zero. The overall complexity of the folding and latency optimization is O(K.M.P MAX .n cycle ) + O(K.(M+1).P MAX .a^(M-1)). If the number of compression sets in each clock region is bounded by 1, the computational complexity becomes O(n)+O(a^(M-1).n/n cycle ), where n is the number of data points (n=K.M.n cycle ). The computational complexity approaches O(n), a first-order dependency on the number of data points, with increasing number of clock cycles (n cycle ).
CLOCK SKEW SENSITIVITY
Due to clock routing, load balancing or other random effects within different clock regions, there will be an uncertainty or skew at each clock region. Because of the high slew rate of the supply current used during the optimization, the optimum point can have a high sensitivity to clock skew. To analyze the sensitivity of the results to skew, we construct a skew radius around the optimum point. In addition, we exhaustively search the space around the optimum for a given radius (δ). Within this radius, we introduce the following skew figure on the quality of the results:
where l opt and r are the optimum latency bundle and the skew effect on the latencies respectively. f cost (0) is the value of the cost function (see equation (10)) before the optimization. The square root of the ratio is necessary to scale the cost function, which is given as I peak .(di/dt) RMS , appropriately in order to resemble the reduction factor of the substrate noise generation. SB MAX,RMS (δ) is an indicator showing the maximum and RMS value of the reduction factor due to the clock skews, which is bound by δ.
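A minimal sketch of such a skew analysis is given below. It assumes that the figure is the square root of f cost (l opt +r)/f cost (0), that skews are integer multiples of the time step, and that the radius bounds each latency independently (a box rather than a sphere). These choices, the function name and the toy cost function are assumptions made for illustration.

```python
import math
from itertools import product
from typing import Callable, Sequence, Tuple


def skew_figures(cost_fn: Callable[[Tuple[int, ...]], float],
                 l_opt: Sequence[int], cost_initial: float,
                 radius: int) -> Tuple[float, float]:
    """Maximum and RMS of sqrt(cost(l_opt + r) / cost_initial) for |r_m| <= radius."""
    ratios = []
    for r in product(range(-radius, radius + 1), repeat=len(l_opt)):
        shifted = tuple(l + d for l, d in zip(l_opt, r))
        ratios.append(math.sqrt(cost_fn(shifted) / cost_initial))
    worst = max(ratios)
    rms = math.sqrt(sum(x * x for x in ratios) / len(ratios))
    return worst, rms


def toy_cost(latencies: Tuple[int, ...]) -> float:
    """Hypothetical cost with its minimum at a latency of 2 time steps."""
    return 1.0 + (latencies[0] - 2) ** 2


sb_max, sb_rms = skew_figures(toy_cost, l_opt=(2,), cost_initial=4.0, radius=1)
print(round(sb_max, 4), round(sb_rms, 4))
```

Under this convention, values below 1 mean that the noise-related cost stays below its pre-optimization level for every skew inside the radius.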
EXPERIMENTAL RESULTS
The methodology is illustrated on a 4-bit Pseudo-Random-Noise-Generator (PRBS) implemented in a 0.35-µm CMOS process on an EPI-type substrate at a 3.3V supply. Later, the results from ITC'99 benchmark circuits [12] and a test chip [13] are presented. Figure 6 depicts the test circuit and its division into four different clock regions. The supply current transfer function to the substrate has a resonance frequency of 2.3GHz. The 3dB bandwidth of the resonance stretches from 1.3GHz to 3.2GHz. The supply current has a corner frequency at 270MHz. Choosing 4 as the number of clock regions is therefore appropriate, as the initial corner frequency is already well below the resonance frequency. The design has a clock period of 4ns and supply line parasitics of 5nH+0.5Ω. A single supply current profile has been constructed for each clock region using the actual supply current data from a total transient simulation of 105 clock cycles using SPICE. Choosing 105 clock cycles, considering the intrinsic periodicity of the 4-bit PRBS, results in an unbiased estimate. Figure 7 shows the transients (a) and the profiles (b) of the supply current with/without latencies. Figure 8 shows the simulated substrate noise transients (a) and the corresponding spectra (b) with/without latencies. The design with optimized latencies achieves reduction factors of 2.10 and 1.75 in the peak-to-peak and the RMS value of the substrate noise respectively. The design with optimized latencies has a reduction of 8.9dB, 20dB, 18dB, 14dB, and 13dB at the fundamental, 2nd, 3rd, 4th and 5th harmonics of the clock respectively. This is due to the reduction of the spectral power of the supply current as depicted in Figure 8. In this example, the spectral power of the supply current has been reduced by 6dB. Table 1 shows the clock skew sensitivity figures, explained in section 5, of the latency optimization results.
The reduction becomes more significant when the initial corner frequency is above the major resonance frequency and is shifted below it. Figure 10 shows the transients and the corresponding spectra of a circuit, having a resonance at 950MHz, as a result of changing the amplitude and slope of its supply current. The supply current has been shaped from (50mA, 200ps) to (10mA, 1000ps), where we assume no timing constraints and equal rise and fall times. A significant reduction by a factor of 5.4 and 7.9 is achieved for the peak-to-peak and RMS value of the substrate noise respectively.

We have also tested the effect of introducing latencies on the substrate noise reduction figures of the ITC'99 benchmark circuits [12]. It is no longer feasible to simulate those circuits using SPICE due to their increased complexity relative to the PRBS circuit illustrated above. Therefore, we use SWAN [10] to simulate the supply current and substrate noise voltage transients using random test vectors as input. The circuits have been implemented in a 0.35-µm CMOS process on an EPI-type substrate using 1nH+0.1Ω package parasitics and a 3.3V supply. The latencies of each design have been computed for the 4 clock regions. Table 2 shows the initial values of the substrate noise generation and the supply current together with the reduction percentages as a result of the introduced latencies. The reduction is around 37% on average for the RMS values of the substrate noise.

We have also designed and measured a mixed-signal chip (see Figure 11), fabricated in a 0.35-µm CMOS process on an EPI-type substrate, in order to compare several low-noise digital designs [13]. A comparison of two realizations of a 5Kgate synchronous CMOS circuit, a reference circuit (REF) and a low-noise design with optimized clock latencies (LN1) in four clock regions, shows more than a factor of 2 reduction in the substrate noise generation.
CONCLUSIONS
Shaping the supply current is shown to be very effective as the coupling from the ringing of the supply into the substrate is dominant. In this paper, we have presented a methodology to optimize the clock tree latencies to reduce the substrate noise generation by using an error-driven compression of the supply current profiles. Before the optimization, the number of clock regions is computed based on the elimination of the major resonance frequency set by the on-chip circuit capacitance and the supply parasitics. Using compressed supply current transients, the computational complexity is reduced from O((n/M)^M) to O(n), where M and n are the number of clock regions and the total number of transient data points respectively. Experimental results show a factor of 2 reduction in the generated substrate noise on the designs with four clock regions. The efficiency of the methodology has been verified with measurements on a fabricated mixed-signal chip. The supply current shaping by the use of clock latencies is shown to be very effective if timing constraints allow shaping.

Figure 11. Microphotograph of the test chip
