Abstract-The complexity of timing optimization in high-performance microprocessors has been increasing with the number of channel-connected transistors in the various paths of dynamic CMOS circuits and with the rising magnitude of process variations in nanometer CMOS processes. In this paper, a process variation aware transistor sizing algorithm for dynamic CMOS circuits that considers the Load Balance of Multiple Paths (LBMP) is proposed. The proposed iterative optimization algorithm is deterministic and is illustrated first on a 2-b weighted binary-to-thermometric converter (WBTC), whose critical path was optimized from an initial delay of 355 ps to an optimal delay of 157 ps, a 55.77% delay improvement. A 4-b unity weight binary-to-thermometric converter (UWBTC) was also designed, and its critical path was optimized from an initial delay of 152 ps to an optimal delay of 103 ps, a 32.23% delay improvement. Finally, a 64-b parallel binary adder was partitioned into a mixed dynamic-static CMOS style, and its critical path delay and power delay product were optimized to 632 ps and 84.17 pJ, respectively.
I. INTRODUCTION
The performance of microprocessors has traditionally been driven by dynamic CMOS technology and microarchitectural improvements [1], and can be enhanced at the circuit level through design and physical organization. At the circuit level, the dynamic logic style has been used predominantly in microprocessors, and the use of custom dynamic circuits has improved timing performance significantly over static CMOS circuits [1] [2]. One of the challenges in timing optimization of dynamic CMOS circuits is transistor sizing, which is complicated by charge sharing, noise immunity, process variations, and leakage.
Research has demonstrated that process variations have caused about 30% variation in chip frequency, along with 20X variation in chip leakage [17]. Integrated circuits have always been vulnerable to inherent die-to-die (inter-die) and within-die (intra-die) parameter variations during the fabrication process [12]. With the continued scaling of CMOS technology towards the 45 nanometer (nm) transistor channel length, the magnitude of the relevant sources of environmental and semiconductor process variations has been increasing rapidly. This increased magnitude of process variations could lower the performance of a circuit by one generation [12], and might even result in design failure [13]. The share of intra-die channel length variations has been estimated to increase from 35% of total variation in the 130 nm process to 60% in the 70 nm CMOS process. Variations in wire width, height, and thickness are also estimated to increase from 25% to 35% at the 70 nm CMOS process [13].
Transistor sizing and optimization affect the delay and power of dynamic CMOS logic. However, designs optimized for power by transistor sizing are more susceptible to frequency impact from within-die variations, because such optimization sharpens the path delay distributions and makes a large number of paths and transistors critical [20]. This further highlights the importance of considering process variations while optimizing delay and power.
II. PREVIOUS WORK
A substantial body of literature exists on automating transistor sizing [3] [4] [5] [6] [7] [8] [9]. Most of the proposed methods focus on static CMOS circuits and on technologies using multiple threshold voltages. TILOS [4] presented an algorithm that iteratively sizes the transistors in the critical path by a factor. This algorithm does not guarantee convergence of the timing optimization and is not a deterministic approach. MINFLOTRANSIT [5] is a transistor sizing algorithm based on an iterative relaxation method, but it requires directed acyclic graphs to be generated iteratively for timing optimization.
Methods to limit the effect of process variations in CMOS processes were proposed in [12] [13] [14] [15] [16] [17] [18] [19]. These methods deal with statistical variations and are not optimal for designs with a large number of parameter variations [21]. A technique called Adaptive Body Biasing (ABB) was presented in [17, 19] to improve variation tolerance. The ABB technique is applied post-silicon: each die receives a unique bias voltage, which reduces the variance of the frequency variation. However, this method does not address intra-die variations, as each block in the design would require a unique bias voltage. Another limitation of this method is the increased leakage power caused by the reduction of threshold voltage. Using keepers to compensate for process variations was proposed in [18]. This method works for designs with a large number of parallel stacks, similar to NOR gates, but is not optimal for designs without parallel stacks, as it requires additional hardware to program the keeper transistors.
Selecting multiple corners to simulate a design accounts for systematic variations but not random variations. The Monte Carlo method considers both systematic and random variations [23]. As variations in ΔL and ΔW are random and are predicted to be major contributors to total variation [13], Monte Carlo simulation results are promising when delay is the constraint. Although the Monte Carlo method is often perceived as slow, it is well suited to cases where the number of sources of variation is large [21]. The advantage of the Monte Carlo method is its theoretical accuracy, and it is commonly used as a golden reference. It can also be used to explain the behavior of a circuit clearly, and it can easily be extended to incorporate crosstalk and IR drop effects in simulation [21].
Research has shown that intra-die variations primarily impact the mean delay, and inter-die variations primarily impact the variance of delay [12]. Design tools aimed at optimizing timing and yield should therefore consider both inter-die and intra-die variations. In addition to optimizing timing by reducing delay, performance has to be improved by reducing the delay uncertainty and sensitivity due to process variations, as depicted in (1) and (2), where T_Max is the worst-case delay, T_Min is the best-case delay, μ is the mean delay, and σ is the standard deviation from Monte Carlo simulations.
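As a hedged illustration of how the four statistics named above could be collected, the following Python sketch runs a small Monte Carlo experiment on a toy delay model. The model (delay proportional to L/W), the nominal dimensions, the Gaussian spreads, and all function names are assumptions made purely for illustration; they are not taken from this paper, and a real flow would invoke a circuit simulator in place of `toy_delay_ps`.

```python
import random
import statistics


def toy_delay_ps(length_nm, width_nm, k=100.0):
    """Hypothetical first-order model: delay rises with L and falls with W."""
    return k * length_nm / width_nm


def monte_carlo_delay(n_samples, l_nom=45.0, w_nom=90.0,
                      sigma_l=2.0, sigma_w=4.0, seed=0):
    """Sample L and W from independent Gaussians and return the delays."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    delays = []
    for _ in range(n_samples):
        length = rng.gauss(l_nom, sigma_l)
        width = rng.gauss(w_nom, sigma_w)
        delays.append(toy_delay_ps(length, width))
    return delays


delays = monte_carlo_delay(1000)
mu = statistics.mean(delays)      # mean delay
sigma = statistics.stdev(delays)  # sample standard deviation
t_min = min(delays)               # best-case delay observed
t_max = max(delays)               # worst-case delay observed
print(f"mu={mu:.2f} ps  sigma={sigma:.2f} ps  "
      f"T_Min={t_min:.2f} ps  T_Max={t_max:.2f} ps")
```

Only the sampled statistics are computed here; the sketch does not attempt to reproduce the definitions of delay uncertainty and sensitivity in (1) and (2).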
III. LOAD BALANCE OF MULTIPLE PATHS (LBMP)
The delay of a dynamic CMOS circuit is highly dependent on the number and size of transistors in the critical path. Increasing the size of transistors in a path increases the discharging current and reduces the output pull-down path delay. However, increasing the size of transistors to reduce one path delay may also increase the capacitive load of channel-connected transistors on other paths and substantially increase the delays of those paths. This complexity grows with the number of paths in the design. In this paper, a 2-b Weighted Binary-to-Thermometric Converter (WBTC), shown in Fig. 1, is used as a first benchmark to explain the complexity of path delay optimization while considering process variations.
Conventionally, the worst-case path is identified based on the mean of the delay distribution, which accounts only for intra-die variations. As inter-die variations are equally important, the standard deviation needs to be considered as well. Consider two paths (path-1 and path-2) with different delay distributions, as shown in Fig. 2. Path-2 has a high mean delay and path-1 has a high standard deviation. If only the mean delay (μ) is considered, path-2 would be chosen as the critical path for timing optimization. Optimizing the design by increasing the size of transistors on path-2 may reduce the mean delay (μ), but may not reduce the standard deviation (σ). However, when the worst scenario, μ + σ, is considered, path-1 would be the critical path to be optimized. As both inter-die and intra-die variations are to be considered during optimization, the proposed timing optimization algorithm ranks the critical path delays based on the sum of mean delay and standard deviation, μ + σ.
The LBMP algorithm proposed for transistor sizing of dynamic CMOS circuits while considering process variations is depicted in Fig. 3. As shown in Fig. 1, the discharge time of transistors near Gnd is longer than that of transistors near the outputs, as transistors near Gnd are usually driven by many paths. Therefore, path delay is optimized by increasing the size of transistors near Gnd the most and the size of transistors near the outputs the least.
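The ranking rule described in this section, which orders paths by the sum of mean delay and standard deviation rather than by mean delay alone, can be sketched in Python as follows. The delay samples are invented for illustration and do not come from Fig. 2; the function name `rank_paths` and the use of the population standard deviation are likewise assumptions.

```python
import statistics


def rank_paths(delay_samples):
    """Return (path, mu, sigma, mu + sigma) tuples, worst path first."""
    ranked = []
    for path, samples in delay_samples.items():
        mu = statistics.mean(samples)
        sigma = statistics.pstdev(samples)  # population std. dev. (assumption)
        ranked.append((path, mu, sigma, mu + sigma))
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked


# Hypothetical Monte Carlo delay samples in ps.
samples = {
    "path-1": [80.0, 100.0, 120.0, 140.0, 160.0],   # lower mean, wide spread
    "path-2": [124.0, 125.0, 126.0, 125.0, 125.0],  # higher mean, narrow spread
}

ranked = rank_paths(samples)
worst_by_mean = max(samples, key=lambda p: statistics.mean(samples[p]))
worst_by_sum = ranked[0][0]
print("critical path by mean only:", worst_by_mean)
print("critical path by mean + std:", worst_by_sum)
```

With these invented samples, the mean-only criterion selects path-2, while the combined criterion selects path-1, mirroring the two-path argument in the text.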
As increasing the size of the transistor that appears in the largest number of paths reduces the delays of most paths, the number of paths in which a transistor is present is computed and denoted as "repeats". The initial step in the LBMP algorithm is to size adjacent transistors on every path with a fixed size ratio, e.g., 1.1, for optimization convergence. Thereafter, a weight is assigned to each transistor, with the one nearest Gnd having the highest value and the one nearest the output having the lowest value. Once the repeats and the weights of all transistors are computed, Monte Carlo simulations that account for process variations are performed to obtain the delay profile of each path. The transistors on the top 20% critical paths are grouped into set-x, and their sizes are increased as calculated by (3).
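The bookkeeping in this step (repeats, weights, and the selection of set-x) can be sketched in Python under several stated assumptions. Each path is assumed to be a list of transistor names ordered from the output towards Gnd; the weight is assumed to be the position index along the path, taking the maximum when a transistor appears on several paths; and the top 20% is assumed to be rounded up to a whole number of paths. The netlist, the scores, and all function names are hypothetical, and the sizing rule of (3) is not reproduced.

```python
import math
from collections import Counter


def compute_repeats(paths):
    """Count, for each transistor, the number of paths it appears in."""
    counts = Counter()
    for transistors in paths.values():
        counts.update(set(transistors))
    return dict(counts)


def compute_weights(paths):
    """Assumed weighting: position from the output (1) towards Gnd (highest)."""
    weights = {}
    for transistors in paths.values():
        for position, name in enumerate(transistors, start=1):
            weights[name] = max(weights.get(name, 0), position)
    return weights


def critical_set(paths, scores, fraction=0.2):
    """Pick the top fraction of paths by score and collect their transistors."""
    n_top = max(1, math.ceil(fraction * len(paths)))
    top = sorted(paths, key=lambda p: scores[p], reverse=True)[:n_top]
    set_x = set()
    for p in top:
        set_x.update(paths[p])
    return top, set_x


# Hypothetical netlist: transistors listed from output towards Gnd.
paths = {
    "p1": ["T1", "T3", "T5"],
    "p2": ["T2", "T3", "T5"],
    "p3": ["T4", "T5"],
    "p4": ["T1", "T6"],
    "p5": ["T2", "T6"],
}
# Hypothetical mean-plus-standard-deviation scores in ps.
scores = {"p1": 150.0, "p2": 140.0, "p3": 100.0, "p4": 90.0, "p5": 80.0}

repeats = compute_repeats(paths)
weights = compute_weights(paths)
top_paths, set_x = critical_set(paths, scores)
print(repeats, weights, top_paths, sorted(set_x))
```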
As the delay of the critical path depends on the capacitive load of channel-connected transistors, reducing this capacitive load reduces the overall delay. The first-order connection transistors of set-x are identified and grouped into set-y. Then, transistors in set-y that are not in set-x of the current iteration are grouped into set-z. For each transistor in set-z, it is checked whether the transistor was present in set-x of the previous iteration. If so, its size is decreased as calculated by (4) and (5). If not, its size is decreased as calculated by (6).
Once new transistor sizes are determined, Monte Carlo simulations are performed to identify the new top 20% critical paths. If the new worst-case path delay is higher than the delay in the previous iteration, the sizes of transistors in set-z of the new worst-case path are changed to the average of the old and new sizes. Iterations are repeated until the solution converges to an optimum.
The 34 timing paths in the 2-b WBTC are presented in Table I. The transistor repeat and weight profiles are shown in Table II. The critical path order profile over a few iterations is shown in Table III. With minimum-size transistors, the worst-case path is path-1. After the first iteration of the LBMP algorithm, its delay is reduced from 355 ps to 244 ps. However, path-17, whose transistors (T20, T21) were reduced in size, entered the set of new critical paths. Repeated iterations of the LBMP algorithm reduced the worst-case path delay, and the solution finally converged to an optimum of 157 ps, a 55.77% delay improvement. Table IV shows the 2-b WBTC delay convergence profile over 10 iterations.
The first column represents the iteration number, the second column represents the worst-case critical path number based on the μ + σ delay, the third column represents the minimum delay of the worst-case path due to process variations, the fourth column represents the maximum delay of the worst-case path due to process variations, and the fifth column represents the delay of the worst-case path.
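The set construction used in each LBMP iteration (set-y as the first-order connections of set-x, set-z as the part of set-y outside set-x, and the averaging of old and new sizes when the worst-case delay regresses) can be sketched in Python as follows. The connectivity map, transistor names, and sizes are hypothetical, and the size-reduction rules of (4) to (6) are deliberately not reproduced.

```python
def build_sets(set_x, prev_set_x, neighbours):
    """Derive set-y, set-z, and the two set-z subgroups that are sized differently."""
    set_y = set()
    for name in set_x:
        set_y.update(neighbours.get(name, ()))  # first-order connections
    set_z = set_y - set_x                       # connected, but not critical now
    z_prev_critical = set_z & prev_set_x        # was in set-x last iteration
    z_other = set_z - prev_set_x                # was not in set-x last iteration
    return set_y, set_z, z_prev_critical, z_other


def average_sizes(old_sizes, new_sizes, transistors):
    """Replace the new size by the mean of old and new for the given transistors."""
    result = dict(new_sizes)
    for name in transistors:
        result[name] = 0.5 * (old_sizes[name] + new_sizes[name])
    return result


# Hypothetical channel-connection map and critical sets.
neighbours = {
    "M1": {"M2", "M7"},
    "M2": {"M1", "M3", "M8"},
    "M3": {"M2", "M9"},
}
set_x = {"M1", "M2", "M3"}
prev_set_x = {"M7", "M5"}

set_y, set_z, z_prev, z_other = build_sets(set_x, prev_set_x, neighbours)

# Hypothetical sizes before and after one sizing step.
old = {"M7": 2.0, "M8": 2.0, "M9": 2.0}
new = {"M7": 1.0, "M8": 1.5, "M9": 1.8}
averaged = average_sizes(old, new, {"M7"})
print(sorted(set_z), sorted(z_prev), sorted(z_other), averaged)
```

Averaging the old and new sizes acts as a damping step: it halves the most recent size change for the affected transistors when that change made the worst-case delay worse.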
The efficiency of the LBMP algorithm is illustrated through the reduction in delay sensitivity as shown in (2). Table V shows that although delay sensitivity is reduced in the majority of the paths, it has also slightly increased for some paths (4, 5, 13, 14, 18, 28 and 31). The ranks of these paths based on their delays are shown in Table VI. The increase in delay sensitivity of these paths is acceptable, as these paths, with the exception of path-31, do not fall in the set of critical paths.
A comparison of applying the LBMP algorithm to the 2-b WBTC with and without consideration of process variations during timing optimization is shown in Table VII. The 2-b WBTC designed without considering process variations has a delay of 161.37 ps [24] and occupies an area of 2.054 μm². By accounting for process variations, the delay was reduced from 161.37 ps to 144 ps, and the area was reduced from 2.054 μm² to 1.695 μm². This corresponds to a 10.8% delay improvement and a 17.4% area improvement.
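The improvement figures quoted for the 2-b WBTC follow from a relative-reduction calculation, which the short Python check below reproduces from the delay and area values stated in the text. The helper name `improvement_pct` is an assumption introduced for illustration.

```python
def improvement_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


delay_gain = improvement_pct(161.37, 144.0)  # delays in ps
area_gain = improvement_pct(2.054, 1.695)    # areas in the same unit
print(f"delay improvement: {delay_gain:.2f}%")
print(f"area improvement:  {area_gain:.2f}%")
```

Both results agree with the quoted 10.8% and 17.4% to within 0.1 percentage point; the small residual difference depends on whether the value is rounded or truncated.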
V. LBMP FOR 4-B UNITY WEIGHT BINARY-TO-THERMOMETRIC CONVERTER
Another circuit used to validate the LBMP algorithm is the 4-b Unity Weight BTC (UWBTC) used in digital-to-analog converters, as shown in Fig. 4. The UWBTC takes a 4-b binary input and generates a thermometric output in which the number of '1's equals the value of its binary input. Along with the increase in the number of transistors in this 4-b UWBTC, the number of timing paths has also increased to 83. With minimum-size transistors, the worst-case delay of the 4-b UWBTC was 152 ps.
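The input-output behavior of a unity weight binary-to-thermometric conversion can be illustrated with a short behavioral model in Python. This is a functional sketch only, not a description of the circuit in Fig. 4; the output width of 2^n - 1 bits, the bit ordering (ones first), and the function name are assumptions.

```python
def binary_to_thermometer(value, n_bits=4):
    """Return a list of 2**n_bits - 1 bits containing exactly `value` ones."""
    if not 0 <= value < 2 ** n_bits:
        raise ValueError(f"value must fit in {n_bits} bits")
    width = 2 ** n_bits - 1
    # Assumed ordering: the ones occupy the lowest-indexed positions.
    return [1 if i < value else 0 for i in range(width)]


for value in (0, 3, 15):
    code = binary_to_thermometer(value)
    print(value, "".join(str(bit) for bit in code))
```

For every 4-b input, the number of ones in the output equals the input value, which is the defining property stated in the text.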
After the first iteration of the LBMP algorithm, the worst-case delay was reduced from 152 ps to 114 ps. Repeated iterations of the algorithm reduced the delay to 103 ps, a 32.23% delay improvement. Table VIII shows the delay convergence profile of the 4-b UWBTC, demonstrating that the LBMP algorithm works effectively for complex designs with a large number of timing paths.
VI. LBMP FOR A MIXED DYNAMIC-STATIC ADDER
An optimal balance of delay and power can be achieved by partitioning the design into a mixed dynamic-static circuit style [3]. A 64-b adder architecture used as a test case for timing optimization is shown in Fig. 5, and is divided into two blocks operating in parallel for timing performance [24]. [11] and a Final Sum (FS) block. The FS block comprises a Thermometric-to-Abacus Converter with add-1 logic (Fig. 7), a Thermometric-to-Abacus Converter with add-0 logic, two Abacus-to-Binary Converters (Fig. 8), and multiplexers.
The 64-b adder is partitioned into a mixed dynamic-static circuit style and designed in four combinations, as shown in Table IX. The 64-b adder designed with CCT, CG and BTC in dynamic CMOS and FS in static CMOS has the least delay, 632 ps, with a power of 133.19 mW. Changing the BTC to static CMOS reduced the power from 133.19 to 125.34 mW, a 5.8% power improvement; however, the delay increased from 632 to 1462.33 ps, a 131.38% increase. Furthermore, changing the CG to static CMOS reduced the power from 133.19 to 125.02 mW, a 6.1% power improvement; however, the delay increased from 632 to 1646.5 ps, a 160.52% increase. Keeping CCT and BTC in dynamic CMOS and CG and FS in static CMOS gives a power of 133.45 mW; however, the delay increased from 632 to 862.4 ps, a 36.45% increase.
A comparison of applying the LBMP algorithm to the CCT blocks and BTC of the 64-b adder with and without consideration of process variations in the timing optimization is shown in Table X. When the CCT block and BTC are optimized without considering process variations, the worst-case delay of the 64-b adder in case-1 was 686 ps. Considering process variations in LBMP reduced the delay further from 686 ps to 632 ps, and the power delay product from 91.6 pJ to 84.17 pJ, an improvement of about 8% in both delay and power delay product. Similarly, accounting for process variations reduced the worst-case delay of the 64-b adder in case-4 from 890.56 ps to 862.4 ps, and the power delay product from 118.98 pJ to 115.08 pJ, a 3.16% improvement in delay and a 3.28% improvement in power delay product.
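As a consistency check, the power delay product values quoted for case-1 and case-4 can be recomputed from the power and delay figures stated in this section. The Python sketch below assumes that the power delay product is simply power multiplied by worst-case delay and that the quoted powers (133.19 mW and 133.45 mW) belong to case-1 and case-4 respectively; the function name is hypothetical.

```python
def power_delay_product_pj(power_mw, delay_ps):
    """Power delay product in pJ from power in mW and delay in ps."""
    # mW * ps = 1e-3 W * 1e-12 s = 1e-15 J, and 1 pJ = 1e-12 J.
    return power_mw * delay_ps * 1e-3


pdp_case1 = power_delay_product_pj(133.19, 632.0)
pdp_case4 = power_delay_product_pj(133.45, 862.4)
print(f"case-1 PDP: {pdp_case1:.2f} pJ")
print(f"case-4 PDP: {pdp_case4:.2f} pJ")
```

Under these assumptions, both products match the quoted 84.17 pJ and 115.08 pJ to within 0.01 pJ.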
VII. CONCLUSION
In this paper, it is shown that the importance and complexity of timing optimization of dynamic CMOS circuits increase as the number of timing paths and the number and magnitude of process variations increase. A solution addressing these issues is presented in the form of a process variation aware transistor sizing algorithm for dynamic CMOS circuits that considers the load balance of multiple paths in a design.
A 2-b weighted binary-to-thermometric converter was first analyzed, and the worst-case delay was reduced from 355 ps to 157 ps, a 55.77% delay improvement. In addition to reducing the worst-case path delay, it was shown that the proposed LBMP algorithm also reduces the sensitivity and uncertainty due to process variations. A 4-b unity weight binary-to-thermometric converter used in digital-to-analog converters was also analyzed, and the worst-case path delay was reduced through the LBMP algorithm from 152 ps to 103 ps, a 32.23% delay improvement. Furthermore, through implementation on a 64-b parallel binary adder and partitioning of the design into a mixed dynamic-static CMOS logic style, the critical path delay was optimized to 632 ps and the power delay product was optimized to 84.17 pJ.
