ABSTRACT The recently proposed dual mode logic (DML) gates family enables a very high level of energy-delay optimization flexibility at the gate level. In this paper, this flexibility is utilized to improve the energy efficiency and performance of combinatorial circuits by manipulating their critical and noncritical paths. An approach that locates the design's critical paths and operates these paths in the boosted performance mode is proposed. The noncritical paths are operated in the low energy DML mode, which does not affect the performance of the design but allows a significant reduction in energy consumption. The proposed approach is analyzed on a 128-bit carry skip adder. Simulations, carried out in a standard 40 nm digital CMOS process with V_DD = 400 mV, show that the proposed approach allows a performance improvement of X2 along with a reduction in energy consumption of X2.5, as compared with a standard CMOS implementation. At V_DD = 1.1 V, improvements of 1.3X and 1.5X in performance and energy are achieved, respectively.
I. INTRODUCTION
The DML logic gates family was proposed in order to provide a very high level of energy-delay (E-D) optimization flexibility [1] - [3] . DML allows an on-the-fly change between two operational modes at the gate level: static mode and dynamic mode. In the static mode, DML gates consume very low energy, with some performance degradation, as compared to standard CMOS gates. Alternatively, dynamic DML gates operation obtains very high performance at the expense of increased energy dissipation. A DML basic gate is based on a static logic family gate, e.g., a conventional CMOS gate, and an additional transistor. While DML gates have very simple and intuitive structure, they require an unconventional sizing scheme to achieve the desired behavior [1] , [3] .
Performance of most digital circuits and systems is determined by the delay of critical paths (CP). Even though standard synthesis tools attempt to design logic blocks without CPs [4]-[6] (i.e. with equalized path delays), slack relative to the targeted Clk (Clock) frequency always exists and should be repaired by designers. Many methods have been proposed to address this slack. These methods include adaptive voltage scaling with a CP emulator circuit [7], and multiple oxide thickness driven threshold voltages and multiple channel lengths for energy reduction in the non-CPs and performance boost in the CPs [8], [9]. Meijer et al. and Liu et al. applied a body bias on a non-CP to improve energy consumption and to increase performance of the CPs, respectively [10], [11]. While the aforementioned methods solve the critical path slack problem, in most cases they also result in a significant increase in energy consumption. In addition to these gate level approaches, higher-level approaches have been presented, such as multi-mode logic and parameterised logic [24]. In this paper, we address both the gate and the higher architectural levels.
This paper proposes to meet the delay requirements of CPs while lowering the overall energy consumption of the design by utilizing the modularity of DML. We propose and analyze a new approach, which locates the design's CPs and utilizes the on-the-fly modularity of DML to operate these paths in the boosted (dynamic) performance mode. The non-critical paths are operated in the low energy static DML mode, which does not affect the performance of the design. Since in most cases the majority of gates in the design are not on the CPs, the increase in energy consumption of the critical paths will be negligible in comparison to the total consumption of the circuit. Moreover, DML static gates dissipate less power than their CMOS counterparts, resulting in reduced power dissipation of the whole design.
The proposed approaches have been analyzed on a 128-bit Carry Skip Adder (CSA) benchmark. Simulations, carried out in a standard 40 nm CMOS process with V_DD = 400 mV, show that the proposed approaches allow a performance improvement of X2 along with a reduction in energy consumption of X2.5, as compared to a standard CMOS implementation. At V_DD = 1.1 V, improvements of 1.3X and 1.5X in performance and energy were achieved, respectively.
The rest of this paper is constructed as follows: Section II discusses basic properties of the DML family. The proposed CP-DML approach is discussed in Section III. Section IV presents the test-bench, chosen for the evaluation of the proposed approach. Simulation results of the benchmark circuits are presented in Section V. Section VI concludes the paper.
II. DML BASICS
A. DML OVERVIEW
A basic DML gate architecture is composed of an un-clocked static gate, e.g. CMOS, and an additional transistor M1, whose gate is connected to a global clock signal [1], [2]. In this paper we focus on DML gates where the static gate implementation is based on conventional CMOS. A DML gate can be implemented in one of two topologies: ''Type A'' and ''Type B'', as shown in Figure 1(a-b) and Figure 1(c-d), respectively. In the static DML mode of operation (Static mode), the M1 transistor is cut off by applying a high Clk signal for the ''Type A'' topology and a low Clk_bar signal for the ''Type B'' topology. Therefore, the gates of both topologies operate similarly to the static logic gate, CMOS in this case. For dynamic operation of the gate (Dynamic mode), the Clk is enabled for toggling, providing two separate phases: pre-charge and evaluation. During the pre-charge phase, the output is charged to V_DD in ''Type A'' gates and discharged to GND in ''Type B'' gates. During evaluation, the output is evaluated according to the values at the gate inputs, in a similar fashion to NORA/np-CMOS implementations [12], [13]. It was shown that DML gates operate very robustly in both static and dynamic modes under process, voltage and temperature (PVT) variations and at low supply voltages [1]-[3]. Dynamic mode robustness is mainly achieved by the intrinsic active restorer (pull-up in ''Type A'' and pull-down in ''Type B''). This restorer also allows the gate to tolerate glitches, charge leakage and charge sharing. Unique sizing of the DML gate transistors is the key factor for achieving low energy consumption in the static DML mode (in which the topology of the gate is identical to the static gate). This sizing is also responsible for the reduction of all capacitances of the gate. In a similar way, the unique transistor sizing enables evaluation through a low resistance network, achieving fast operation in the dynamic mode. An intuitive visualization of the tradeoff inherently related to DML is shown in Figure 1(e).
Energy efficiency is achieved in the static DML mode at the expense of slower operation (Low Energy and Low Performance, left scales). However, the dynamic mode is characterized by high performance, albeit with increased energy consumption (High Energy and High Performance, right scales). These tradeoffs allow a very high level of flexibility at the system level, as will be shown.
The transistors of DML gates are sized for dynamic operation [3]. Figure 1(h) shows the conventional sizing of a standard CMOS gate, where W_MIN is the minimal transistor width, β is the PUN to PDN inherent up-sizing factor and f is the gate's general up-sizing factor [3], [14], [15]. As can be seen, the input/output capacitances of DML gates are significantly reduced, as compared to CMOS gates, due to the utilization of minimal width transistors in the pull-up network of ''Type A'' gates or the pull-down network of ''Type B'' gates. The size of the pre-charge transistor is kept equal to S·W_MIN in order to maintain a fast pre-charge period despite the output load of the up-sized gate, where S is the evaluation network up-sizing factor. For more details, the reader is referred to [3]. Footed ''Type A'' and headed ''Type B'' DML gates are also used. The use of these topologies is explained in detail in [1]. They allow a successful pre-charge when standard static gates or synchronous devices are cascaded into DML logic. Many aspects of DML gate sizing, as well as the preferred set of gates for the ''Type A'' and ''Type B'' topologies, have been analyzed and discussed. Optimization of the network up-sizing parameters for load driving was conducted using the Logical Effort (LE) method [3]. The key achievement of DML is that, while the proposed sizing yields very high performance in the dynamic mode, the same topology also enables improved energy efficiency in the static mode, as compared to conventional CMOS.
B. STATIC DML AS A SEMI-ENERGY-OPTIMAL CMOS
The design space of a CMOS gate is mainly influenced by V_TH, transistor width, V_DD, channel length, oxide thickness and body voltage. The influence of these parameters on optimization in the E-D plane is being explored. For the CMOS family, the symmetry of the gate (i.e. equal rise and fall times) is highly important. This is due to the fact that in a combinational system there is always some uncertainty regarding the transition type. As a result, the pull-up network (PUN) of CMOS gates, which is constructed from low mobility PMOS devices, is sized up by the β parameter [14]. When optimizing a CMOS gate's energy at the expense of its performance, the transistor width is the main parameter used for reducing the energy consumption. This is due to several facts: (1) Switching energy is proportional to the load capacitance and quadratically dependent on V_DD. Under energy optimization, the symmetry of the gate's performance does not constitute a constraint, so the transistor width can be reduced, as well as β. This significantly lowers the load capacitances. (2) With the lowering of circuit V_DD [26], [27] and technology scaling, leakage energy has become one of the key factors in static power dissipation. The leakage energy is caused by the numerous leakage currents of a device. The main leakage currents are the sub-threshold and gate leakage currents [16], [27]. These currents are linearly dependent on the transistor width, and under energy optimization they are considerably reduced.
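The first-order dependence of switching energy on load and supply can be illustrated with a small sketch. It assumes the textbook model E = α·C_L·V_DD², and the 1 fF load and the function name `switching_energy` are hypothetical values chosen only for illustration; the sketch is not a substitute for the circuit simulations reported later.

```python
def switching_energy(c_load, v_dd, activity=1.0):
    """First-order switching energy per operation: E = activity * C_load * V_DD^2."""
    return activity * c_load * v_dd ** 2


# Hypothetical 1 fF load, used only to illustrate the scaling trend.
C_LOAD = 1e-15

e_high = switching_energy(C_LOAD, 1.1)   # strong inversion supply
e_low = switching_energy(C_LOAD, 0.4)    # low voltage supply
supply_ratio = e_high / e_low            # quadratic dependence on V_DD

# Halving the load (e.g. by narrower transistors) halves the energy.
load_ratio = switching_energy(C_LOAD, 0.4) / switching_energy(C_LOAD / 2, 0.4)

print(f"E(1.1 V) / E(0.4 V) = {supply_ratio:.4f}")
print(f"E(C) / E(C/2)       = {load_ratio:.1f}")
```

Under this simplified model, reducing the transistor widths lowers energy linearly through the load capacitance, while supply scaling acts quadratically.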
A CMOS based DML gate operated in static mode, with transistor sizes optimized for the dynamic mode, is de facto a semi-energy-optimal CMOS structure with an additional negligible output capacitance due to the Clk transistors (transistors M1 and M2). Static DML is still highly robust due to its complementary nature [1], [2] and withstands aggressive voltage scaling. This methodology can also be regarded as a stand-alone technique for reducing the energy consumption of digital circuits. The E-D tradeoff space under this approach is very wide, and in this paper the discussion is limited only to transistor sizing, as shown in Figure 1.
III. CP-DML APPROACHES FOR ENERGY EFFICIENCY AND HIGH PERFORMANCE
This section elaborates the proposed design approaches for energy efficient and high performance design of combinatorial systems. Sub-Section A presents an approach which utilizes DML gates in the dynamic mode on the CPs in order to improve their delays. Sub-Section B elaborates various aspects of energy reduction of all non-CP portions of the design.
Theoretically, a general DML design can be controlled (by input-signal-driven or external-signal-driven control) to operate each gate in one of two modes: static and dynamic. This means that a general design can be operated in 2^(number of gates) different configurations, each one leading to a different E-D tradeoff. Switching the whole design between these two modes leads to the two distinct extremes of this tradeoff, meaning that the design is optimized either to achieve maximum performance or minimum energy consumption.
A. SOLVING CPs TIMING VIOLATIONS
As discussed in Section I, the CPs of a design are automatically identified using standard design flow tools. By replacing only these paths with DML gates and applying the dynamic mode to them, their delay can be reduced. The rest of the design is implemented using standard CMOS static logic. Of course, special design constraints should be enforced at all the intersections between a static path and a dynamic one. In some of these cases, a footer/header should be applied [1], [2], [17]. Figure 2(d) presents a design in which the CPs were located and only those paths were given the option to toggle between dynamic and static mode, according to the system requirements. If the system can withstand slower operation, the CP logic will operate in static mode. If the system is required to meet the defined Clk period in all cycles, the CPs will operate in the dynamic mode. An example application is a smartphone that operates at two frequencies: a slow one for a power save/hibernation mode and a fast one for video streaming. Note that low complexity systems will normally have only one operating frequency, and therefore their CPs will constantly operate in the dynamic mode. Normally, the number of gates on the CPs is small compared to the total number of gates in the design. Therefore, in most cases, the inherent dynamic-operation energy of these CPs will lead to an insignificant increase in the total energy consumption of the design.
B. SOLVING THE CPs TIMING VIOLATION WHILE REDUCING THE TOTAL ENERGY CONSUMPTION
As described in the previous Sub-Section, the CPs are mapped and operated in the dynamic DML mode. In Sub-Section A, the rest of the circuit was assumed to keep a standard CMOS logic gate topology. Therefore, that design solves the CPs' timing constraints at the expense of a small degradation in energy consumption, as compared to a complete CMOS design. In this Sub-Section, all portions of the design that are not part of the CPs are mapped to static mode DML gates (similar to the semi-energy-optimal CMOS gates described in Section II). In most designs, these non-CPs are not time constrained, and therefore the asymmetric behavior of their transitions, and consequently their performance degradation, will still fit within the Clk period. The use of the static DML mode for the vast majority of gates in the design will lead to a significant reduction in the total dynamic and static energy consumption. Figure 3 visualizes this approach.
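The mapping described in this section can be sketched as a simple longest-path analysis followed by a mode assignment. The sketch below is only an illustration under stated assumptions: the netlist, gate names and unit delays are hypothetical, and real designs would rely on the timing reports of standard design flow tools rather than on this toy traversal.

```python
from functools import lru_cache

# Hypothetical netlist: gate name -> (delay in arbitrary units, fan-in gates).
NETLIST = {
    "g1": (1.0, []),
    "g2": (1.0, []),
    "g3": (2.0, ["g1"]),
    "g4": (1.0, ["g1", "g2"]),
    "g5": (2.0, ["g3"]),
    "g6": (1.0, ["g4"]),
    "out": (1.0, ["g5", "g6"]),
}


def arrival_times(netlist):
    """Latest arrival time at the output of every gate (acyclic netlist assumed)."""
    @lru_cache(maxsize=None)
    def arrival(gate):
        delay, fanin = netlist[gate]
        return delay + max((arrival(g) for g in fanin), default=0.0)

    return {gate: arrival(gate) for gate in netlist}


def critical_path(netlist):
    """Trace back from the latest-arriving gate through its slowest fan-in."""
    at = arrival_times(netlist)
    gate = max(at, key=at.get)
    path = [gate]
    while netlist[gate][1]:
        gate = max(netlist[gate][1], key=at.get)
        path.append(gate)
    return list(reversed(path))


def assign_modes(netlist):
    """CP gates run in the dynamic DML mode, all other gates in the static mode."""
    cp = set(critical_path(netlist))
    return {g: ("dynamic" if g in cp else "static") for g in netlist}


cp = critical_path(NETLIST)
modes = assign_modes(NETLIST)
print("critical path:", cp)
print("modes:", modes)
```

In this toy netlist only the gates on the single longest path are assigned the dynamic mode; a practical flow would typically treat every path that violates the Clk period as critical.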
IV. MODULAR BENCHMARK
This section presents the chosen benchmarks. As described in Section III, we discuss three designs:
1) A CPs accelerator, as described in Sub-Section III(A), which has 2 operation modes:
- ''DML Carry Path-Dynamic'': the DML CPs are activated in the dynamic mode.
- ''DML Carry Path-Static'': the DML CPs are activated in the static mode.
In both of these modes the non-CP portions of the system are designed with standard CMOS.
2) A CPs accelerator with low energy consuming non-CPs, as described in Sub-Section III(B), which has 2 operation modes:
- ''DML Carry Path-Dynamic. With low energy nonCPs-Static'': the DML CPs are activated in the dynamic mode, while the rest of the system operates in the DML static mode.
- ''DML Carry Path-Static. With low energy nonCPs-Static'': the DML CPs are activated in the DML static mode, similar to the rest of the system.
3) CMOS equivalent design.
A Carry Skip Adder (CSA, also called carry bypass adder) was chosen as a benchmark to demonstrate and evaluate the proposed concept. The CP of the CSA grows with the number of inputs, making it possible to examine the E-D trends as a function of the CP length. It is important to note that the proposed methods can be applied to any combinatorial circuit; a CSA was chosen only due to its modularity and simplicity.
A. CMOS CSA DESIGN
A conventional CSA is composed of a set of Ripple Carry Adder (RCA) blocks. It essentially utilizes the carry propagation signals in order to skip the carry from one RCA block to the next. It is possible to predict the propagation of the carry with a simple XOR gate [18]. Such a prediction mechanism can substantially reduce the delay [19]. The CP in a CSA occurs when the carry ripples through the first block, then skips the rest of the blocks, and then ripples again through the last block. This is the longest possible route in the CSA. Lehman et al. have researched CSAs with non-uniformly sized RCA blocks [20]. Majerski has presented a multi-level carry-skip propagation architecture [21]. Guyot et al. and Oklobdzija et al. proposed algorithms for choosing optimized block sizes [22], [23]. In this paper, a simple CMOS CSA with fixed-size 4-bit blocks was designed, as shown in Figure 4. Clearly, the methods presented in this paper can be generalized to any CSA block size, constant or variable, and to multi-level or single-level carry paths. The equations of a general single-bit Full Adder (FA) are:
$$S = A \oplus B \oplus C_{in}, \qquad C_{out} = A \cdot B + C_{in} \cdot (A \oplus B) \tag{1}$$
where ⊕ is the conventional XOR symbol. For an RCA, C_out will be an input to the next FA. For the CP, the carry would propagate through all FAs. Because C_out is on the CP for each RCA, the mirror circuit for computing C_out is used [19], as shown in Figure 5. This circuit calculates the inverted value of C_out and, when serially chained, it reduces the circuitry on the CP (i.e. eliminates one inverter for each FA). Furthermore, the use of the mirror adders creates the need for inverting inputs for all odd FAs and inverting outputs for all even FAs [18], as shown in Figure 4. All the logic gates presented in the figure are constructed with standard CMOS. A standard sizing optimization for the RCA of mirror FAs using Logical Effort [15] yields the sizing factor F_i (as shown in Figure 4 for all the carry path gates). For all i that are a multiple of 4, F_i = 1, and for all the rest F_i = 3.5.
B. DML CRITICAL PATH DESIGN
Figure 5 shows the DML implementation of the CSA's CP. The CP flows through the first NOR (assuming that the carry in of the whole design is 0) and through all the MUXs of the design. The gate level implementation of the CP can be constructed with various topologies of DML: DML NOR gates are most efficiently implemented in the ''Type A'' topology and NAND gates in ''Type B'', as discussed in [1]-[2]. The Boolean logic does not allow an efficient implementation of a MUX with a NOR following a NAND or vice-versa, which is the preferred topology for DML logic design. Therefore, in the chosen topology, the CP is composed only of NANDs (where one of them is implemented using the efficient ''Type B'' structure and the other one has a less optimal ''Type A'' structure). The last inverter in each RCA block is a headed ''Type B'' inverter, which maintains a correct pre-charge phase for the CP. The sizes of the transistors in terms of minimal transistor width are shown in Figure 5.
In a design implemented in this way, only 8% of the transistors will (optionally) operate dynamically, while the remaining 92% of the transistors are kept in the low energy static mode. This modular design keeps the same complexity and the same dynamic-to-static gates ratio as a function of the input vector's length, N [bits].
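The behavior of the benchmark can be checked with a purely functional model of a carry skip adder built from fixed 4-bit RCA blocks. This is a bit-level sketch under stated assumptions: it uses the standard full adder equations, the function names are hypothetical, and it ignores the mirror-adder inversions, transistor sizing and timing, so it only confirms the logic of the skip mechanism.

```python
def full_adder(a, b, c_in):
    """Single-bit full adder: returns (sum, carry out, propagate)."""
    p = a ^ b                       # propagate signal, a simple XOR
    s = p ^ c_in
    c_out = (a & b) | (c_in & p)
    return s, c_out, p


def carry_skip_add(x, y, n_bits=128, block=4, c_in=0):
    """Add two n_bits-wide integers with fixed-size ripple blocks and a skip mux.

    n_bits is assumed to be a multiple of block.
    """
    total = 0
    carry = c_in
    for base in range(0, n_bits, block):
        block_cin = carry
        ripple = block_cin
        all_propagate = 1
        for i in range(base, base + block):
            a = (x >> i) & 1
            b = (y >> i) & 1
            s, ripple, p = full_adder(a, b, ripple)
            all_propagate &= p
            total |= s << i
        # Skip mux: bypass the block when every bit position propagates.
        carry = block_cin if all_propagate else ripple
    return total, carry


# Worst-case style stimulus: a carry generated at bit 0 travels to the top.
x = (1 << 128) - 1
y = 1
result, carry_out = carry_skip_add(x, y)
print(hex(result), carry_out)
```

The skip mux does not change the computed value, since a fully propagating block passes its carry in unchanged; it only shortens the path that the carry has to travel.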

V. SIMULATION RESULTS
The modular benchmark circuits described in the previous section were simulated in a standard 40 nm CMOS process, using the Cadence Spectre simulator [25]. Implementations of these methods on the benchmark CSAs were mainly examined over the E-D plane and as a function of the operating voltage and the CP length. Note that the naming convention for the different designs and operating modes is given at the beginning of Section IV. All energy and delay measurements are per-operation.
A. THE E-D PLANE AS f(V_DD)
Each design was carefully measured as a function of the supply voltage. Nowadays, even standard manufacturers realize the potential of near/sub-threshold operation. Standard cell libraries designed for 700-800 mV are widespread. For special low power applications, the libraries are normally designed for 200-500 mV. In order to examine the proposed concept both for low voltage [26] and strong inversion operation, measurements are performed with supply voltages varying from 0.4 V to 1.1 V.
The E-D curves for all designs of a 128-bit CSA are plotted in Figure 6(a). The order of the curves from top to bottom is: (1) CMOS, (2) CMOS design with a CP in dynamic DML mode, (3) CMOS design with a CP in static DML mode, (4) low energy non-CP design with a CP in dynamic DML mode, and (5) low energy non-CP design with a CP in static DML mode. The last two curves lie in the gray region at the bottom of the graph. This region represents the low energy area of the E-D plane, achieved by implementing all non-CPs with the low energy DML static mode (which, as described in Section II, could also be referred to as ''low energy CMOS''). The two areas of interest are circled at the edges of Figure 6(a) and are enlarged in Figures 6(b) and 6(c). Figure 6(b) shows the tradeoff area for a 400 mV operating voltage for all designs. Figure 6(c) presents the same tradeoff for 1.1 V. These two extremes clearly show that these designs are highly flexible in energy consumption and performance over the whole range of voltages. Comparing the DML enhanced CP plots (second and third curves) with the CMOS plot (first curve) for the 0.4 V supply (Figure 6(b)) shows that the DML enhanced CP achieves X2 in performance. This achievement, however, comes at the expense of a 16% increase in energy consumption. If the system allows two operational frequencies, then when low-power operation is required, the static mode (with a low frequency) could be applied, yielding an X2.5 energy improvement at the expense of a performance degradation of X1.3. This ability to change the operating point on the E-D plane on-the-fly is a feature that can be easily utilized to improve the system flexibility and E-D efficiency.
For the 1.1 V supply (Figure 6(c)), it is shown that boosting the performance of the CP by 20% increases energy consumption by only 3%. Again, if the system allows two operational frequencies, then when low-power operation is required, the static mode could be applied, yielding an X1.5 energy improvement at the expense of a performance degradation of X1.4. These results reveal that low-voltage operation magnifies the differences between the modes. There are a few reasons for this trend. First, the performance advantage of DML circuits in the dynamic mode over standard CMOS intensifies as the supply voltage is lowered [1]-[2], [26], [27]. The second, less dominant factor is the reduced sensitivity of DML circuits to increased leakage currents at low supply voltages [1]-[2].
By examining the plots of the DML performance optimized CP with low energy non-CPs (the two lowermost curves), it is clear that the total energy is reduced by X2-X3 (gray region) over the whole voltage range, which is substantial. In addition, improvements in CP performance of X1.3 and X2.1 are achieved for the 1.1 V and 400 mV supplies, respectively. The results for the CP are quite similar to the results achieved when operating without the low energy non-CP gates. This is due to the fact that the CPs themselves have not changed. To conclude, the flexibility of the DML design led to a significant improvement in both energy and performance.
B. THE E-D PLANE AS f(N)
This Sub-Section examines the efficiency of the proposed concept as a function of the CP length, which is closely related to the size of the design. The CSA's size/length depends on the number of inputs, N. Figure 7 shows the E/D trends for all designs as a function of N. Each plot starts with the minimal CP, related to N = 4, and goes up to the longest examined CP, of N = 128. The point where N = 128 appears both in Figure 6 and Figure 7. The key point of this analysis is to show the scalability of the method for various design sizes and not only for a very long CP. The E-D trends of the different designs with N = 128 were discussed in Sub-Section A and therefore will not be discussed here again. Figure 7(a) and Figure 7(b) show that as N increases (or log2(N) increases), the energy and performance improvements remain almost constant, both for 400 mV and 1.1 V. There is another interesting point regarding the 128-bit design with V_DD = 1.1 V, presented in Figure 7(b): the Low Energy design (DML static mode for non-CPs) with CPs operated in the dynamic mode consumes slightly more energy than the standard CMOS non-CP design with DML dynamic CP, but achieves more than X2 improvement in performance.
As can be seen from Figure 7 (a), all designs (N = 4..128) with performance improved CP show a significant improvement in performance at 400mV, as compared to the CMOS counterparts. However, for the 1.1V supply (Figure 7(b) ), this efficiency can be observed only from N = 32. This behavior naturally depends on the specific gates topology of the chain, as mentioned in Section IV B. The CSA specific design represents an average case where some of the DML gates on the CP are very fast in comparison to CMOS, such as ''Type B'' NAND, and others hold very small improvements, such as ''Type A'' NAND. For this reason, we can expect that for other benchmarks, the improvement in E/D will occur for an N > N MIN .
C. STIMULI INPUT VECTOR COMPLEXITY
The measurements presented in the previous two Sub-Sections simulated input stimuli that activated the CP of each circuit. These stimuli trigger the worst delays possible for these designs. Each circuit requires a different input for activating its CP. The worst case of energy consumption also depends on the input vector. It is reached when the input vector switches as many gates as possible in each RCA chain (the static portions of the design). In the previous two Sub-Sections, for the case of the 128-bit CSA, input vectors were chosen to switch 40 outputs, regardless of the CP switching. This approach is quite pessimistic, since the average number of switching outputs is lower than 40. Assume equal probabilities for logic ''1'' and logic ''0'' for each input. The probability for a carry in a FA is q = 0.5. The probability for a carry to propagate through K successive bits is:
$$P(K) = q^{K}$$
Alternatively, the probability of a carry being either killed or generated through K successive bits is 1 - q^K. Therefore, the probability for propagating more than 4 bits is 6.25%, which is quite low. For example, consider the 128-bit design composed of 4-bit RCAs (i.e. 32 segments): rippling of 2 bits inside each 4-bit RCA (in addition to the switching of the whole CP) is, in terms of probability, a quite reasonable or even a harsh case. Nevertheless, input vectors which are more energy consuming (for the static parts of the design) were simulated (60 and 80 switched outputs). As anticipated, the results showed that as the input stimulus complexity rises, the additional energy required for the dynamically operated CP becomes more and more negligible in comparison to the total energy of the designs. These results are, of course, reassuring for all worst/typical/best case input vectors, energy-wise.
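The probabilities used in this argument can be reproduced with a few lines of arithmetic. The sketch assumes, as in the text, independent bit positions with q = 0.5; the function names are hypothetical.

```python
def propagate_probability(k, q=0.5):
    """Probability that a carry propagates through k successive bits: q^k."""
    return q ** k


def kill_or_generate_probability(k, q=0.5):
    """Probability that the carry is killed or generated within k bits: 1 - q^k."""
    return 1.0 - q ** k


for k in (1, 2, 4):
    print(k, propagate_probability(k), kill_or_generate_probability(k))
```

For K = 4 the propagation probability is 0.5^4 = 0.0625, which matches the 6.25% figure quoted above.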
VI. CONCLUSION
CP timing violations and energy minimization are important issues in all digital circuits. As shown in this paper, the flexibility inherent to design with DML gates can be leveraged to meet CP timing while reducing the total energy consumed by the circuit. Until now, with conventional methods, meeting the CP timing was closely tied to a rise in the consumed energy. This work contradicts that paradigm: both the timing and the low energy consumption requirements are met. We showed that the performance of the 40 nm CSA benchmark circuit was improved by X2, while also achieving a reduction in energy consumption of X2.5. Since the CSA circuit is not optimal for DML implementations, it is expected that these improvements will be even more significant for other designs.
