Abstract
Introduction
Flip-flops and latches (collectively referred to as timing elements in this paper) are critical components in modern synchronous VLSI designs. Timing element (TE) design has a large impact on both system cycle time and system energy consumption, and consequently there has been significant interest in the development of fast and energy-efficient TE circuits [2, 10, 11, 12, 14, 15, 16, 17, 18]. The evaluation methodology presented in previous work often employs a very limited set of data patterns and has usually assumed that the clock switches every cycle [10, 11, 12, 14, 15, 17, 18]. In real VLSI designs, however, there is a wide variation in clock and data activity across different TE instances.
In this paper, we show that there can be significant energy savings if each TE instance is selected from a heterogeneous library of designs, each tuned to a different operating regime. For example, low-power microprocessors make extensive use of clock gating [6, 7], resulting in many TEs whose energy consumption is dominated by input data transitions rather than clocking, and for which we should select devices with low energy on data transitions. Other TEs, in contrast, have negligible data input activity but are clocked every cycle; for these we should select TE designs with low clock transition energy. Previous work has also focused on the delay or energy-delay product of TEs, but real designs often include many TEs that are not on the critical path. This timing slack can be exploited by using slower, lower-energy TEs.
We use detailed energy analysis to compare a number of TE designs in this paper, including designs that exploit particular combinations of signal activity and timing slack. To demonstrate the potential savings from activity-sensitive TE selection, we instrument a pipelined MIPS microprocessor datapath design to gather statistics on TE activity, and simulate five SPECint95 benchmarks for a total of 2.7 billion CPU cycles. We then show that selecting appropriate TEs can reduce total TE energy without increasing cycle time.
3522-869WO1 $10.00 © 2001 IEEE
Designing with a heterogeneous mix of flip-flop and latch structures may have the disadvantage of complicating timing verification. However, advanced designs with clock gating already perform verification for each local clock independently [1], and in this case the added complexity is minimal. Additionally, many of the alternative TE structures are used on non-critical timing paths for which verification is usually relatively straightforward.
In this work, we select flip-flop and latch structures based on activation patterns and timing slack. When selecting TE structures for a real design, more factors would come into play, including input drive and output load, presence of differential inputs, desirability of complementary outputs, robustness to clock skew and process variations, and the ability to provide time-borrowing. These factors will tend to limit the set of designs from which TEs are selected.
Other related work has explored the use of timing slack to reduce energy in non-critical gates: traditional transistor sizing uses smaller transistors, cluster voltage scaling [19] uses a lower supply voltage, multiple threshold voltages can be used to reduce leakage current [4, 5], or series transistors can be added to reduce leakage currents in a single-threshold process [9]. These techniques are also applicable to TE design, but to our knowledge this paper is the first work that systematically exploits signal activity to reduce energy by changing the TE structure.
The paper is organized as follows. Section 2 presents a range of TE designs targeted for particular operating regimes. Section 3 describes our methodology for characterizing the energy profile of a given TE design and presents detailed simulation results for the set of candidate TE designs. Section 4 shows how the relative energy ranking of the TE designs varies widely depending on signal activity and on allowable slack. Sections 5 and 6 present results from applying activity-sensitive TE selection to a MIPS processor datapath, and Section 7 concludes.
Latch and flip-flop designs
Figures 1 and 2 present schematics for the latch and flip-flop designs we evaluated. To allow arbitrarily low clock frequencies and to allow clocks that can be gated in either phase, we restricted our designs to include only fully static structures. We used only single-rail input and output signals, and where TEs had complementary outputs we loaded only the selected output. Although not covered in this paper, we expect that our technique will also accommodate dynamic and/or complementary TEs. To ensure design robustness, we required that circuits have input buffers to isolate input sources from any actively driven feedback nodes (e.g., PTLA, Figure 1(b)). We assume that both true and inverted clock signals are generated by clock buffers and so do not insert local clock inverters (although some pulsed latch designs require local inverters to generate pulses). Also, we do not penalize inverting TEs (e.g., PPCLA) because in general it is not obviously preferable to have either true or complement output. For each TE design, we developed both a low-power version and a high-speed version by sizing the transistors accordingly.
Figure 1. High-enabled latch designs. Transistor sizes are shown for a low-power design (in parentheses: (n)) and a high-speed design (in brackets: [n]). A transistor labeled with size n means that its W/L ratio is n times that of a minimum-sized transistor. For gates, the sizes of all transistors are shown.
Figure 1(a), PPCLA, is a transparent latch based on the PowerPC 603 design, which is known to be reasonably fast and energy-efficient [17]. Figure 1(b), PTLA, is a pass-transistor latch, which we chose because of its low clock load. Figure 1(c), SSALA, is a latch based on a fully static differential sense amp, which we chose for its low clock load. Figure 1(d), SSA2LA, is a minor variant of SSALA, which has greater clock load but lower data transition energy while the clock is gated.
Figure 1(e), CPNLA, is a PPCLA preceded by a clocked pseudo-NMOS input buffer. The pseudo-NMOS input buffer reduces the input loading of this latch and so reduces input data transition energy when the latch is closed. When the latch is transparent, the p-transistor in the clocked inverter acts as the pseudo-NMOS load and so dissipates considerable static power when the data input is high.
Figure 2(a), PPCFF, is a flip-flop design using master-slave PowerPC-style latch stages, which is known to have low energy and delay [17]. Figure 2(b), SSAFF, is a master-slave flip-flop using static sense-amp latch stages, which we include for its low clock load.
We also measured the performance of various pulsed latch structures, which all employ an edge-triggered pulse generator to provide a short transparency window. Compared to flip-flops with master-slave latch designs, pulsed latches have the advantages of requiring only one latch stage per clock cycle and of allowing time-borrowing across cycle boundaries. The major disadvantages of pulsed latch structures are the increased susceptibility to timing hazards and the energy dissipation of the local clock pulse generators. The clock pulse generators can be shared among a few latch cells to reduce energy, although care must be taken that the pulse shape does not degrade due to wire delay, signal coupling, and noise. We measured designs both with individual pulse generators and with pulse generators shared among four latch bits, in which case we divide the energy used by the pulse generator among the four latch instances.
Delay and energy characterization
Our test-bench setup is similar to [17], as shown in Figure 3. In order to have realistic input signals, the data input was driven with a minimum-sized inverter which was itself driven by a loaded minimum-sized inverter. The clock inputs were designed to simulate a local clock buffer, and the clock drivers were sized to give equal clock rise and fall times for each TE design. The TE outputs were loaded with a 7.2 fF capacitance, simulating a fanout of four minimum-sized inverters (FO4-min). Other studies [12, 15, 17] use strong input drivers and much larger output loads (200 fF). However, we have extracted capacitance values for a processor datapath (described below), including transistor gate and drain capacitances and wire substrate and coupling capacitances, and we found that over 40% of TEs have output loads less than the FO4-min load, over 60% have loads less than twice this amount, and none have loads greater than 60 fF. For brevity, we here consider only one size of output load, but in general TE characterization should consider a variety of loads; we are investigating TE load sensitivity in ongoing work.
The TE designs were implemented in a 0.25 µm TSMC CMOS technology. Layouts were extracted using the SPACE 2D extractor [20], which extracts layout parasitics including capacitance to substrate, fringe capacitance, crossover coupling capacitance, and capacitance between parallel wires. All tests were run under nominal conditions of Vdd = 2.5 V and T = 25°C. Figure 4 shows the delays for both versions of each timing element (low-power and high-speed). For latches, delay is defined as the D-Q propagation delay. For flip-flops, we used the methodology proposed by [17], in which delay is defined as the minimum D-Q delay (in general the C-Q delay changes depending on when D arrives in relation to C, and there is some optimal arrival time that minimizes the total D-Q delay). These delays were obtained using HSpice.
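The minimum D-Q delay definition can be illustrated with a short Python sketch. It assumes that a sweep of D arrival times has already been simulated and that each sample records how early D arrived before the clock edge together with the resulting C-Q delay. The function name and the sweep values are hypothetical and are not measurements from this work.

```python
from typing import Sequence, Tuple

def min_d_to_q_delay(samples: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (best_d_to_clk, min_d_to_q) from a sweep of D arrival times.

    Each sample is (d_to_clk, clk_to_q) in picoseconds: how early D arrives
    before the clock edge, and the C-Q delay measured for that arrival time.
    The D-Q delay of a sample is the sum of the two.
    """
    if not samples:
        raise ValueError("need at least one sweep sample")
    best = min(samples, key=lambda s: s[0] + s[1])
    return best[0], best[0] + best[1]

# Hypothetical sweep: C-Q delay grows as D arrives closer to the clock edge.
sweep = [(200.0, 150.0), (100.0, 152.0), (50.0, 160.0), (20.0, 185.0), (5.0, 260.0)]
best_arrival, d_to_q = min_d_to_q_delay(sweep)
print(best_arrival, d_to_q)
```

In this made-up sweep the D-Q delay is smallest when D arrives 20 ps before the clock edge; arriving any later causes the C-Q delay to grow faster than the arrival time shrinks.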
We rely on accurate energy models to characterize candidate flip-flop and latch designs. Traditionally, the power consumption of flip-flop and latch designs has been measured using an un-gated clock and a small number of input activation patterns [10, 11, 12, 14, 15, 17, 18]. Instead, we adopt a more accurate methodology based on [21] in which all possible states of the TE are enumerated and the energy consumption of each state transition is measured.
Canonical state transition diagrams for latch and flip-flop designs are shown in Figure 5 .
In general, the state transition diagram for a given flip-flop or latch design may be more intricate than these canonical examples because the design may have internal nodes which are not uniquely determined by the values of C, D, and Q. In this case, the design has two or more distinct states for a given CDQ combination; its internal nodes have different values depending on the sequence of transitions taken to obtain those C, D, and Q values. To characterize the TE designs, we simulated each transition using HSpice and measured the energy consumption. The output energy of the shaded inverters in Figure 3 was included (as in [17]), but the energy dissipated on the output load capacitance was not (the purpose of this capacitor is only to simulate reasonable output signal slopes). The resulting energy numbers for our TE designs are shown in Table 1 and Table 2. When flip-flops or latches have two states corresponding to some CDQ combination, both energy numbers are shown for transitions leaving these states. We note that these differences are usually small, and for the remainder of this paper we use the average value for each transition to simplify the analysis.
Since the CPNLA design has static current dissipation when C and D are both high, we must make some assumptions in order to characterize its energy usage. We assume that the clock is gated low, so that the clock input never remains high for more than half a clock period, and we assume that the clock cycle time is a pessimistic 32 FO4 delays. Thus, in Table 2, whenever there is a transition into a state where C and D are both high, we include in the energy value the static current energy consumed during half a clock period. If D goes low during this time, the static current path will be broken, but we always assume worst-case timing so that the static current lasts for the full half cycle.
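The worst-case static energy assumption for CPNLA can be written as a short calculation. The Python sketch below uses the nominal supply voltage and the 32 FO4 cycle time from the text; the static current, the FO4 delay, and the dynamic energy value are hypothetical placeholders, since they are not stated here.

```python
VDD = 2.5  # volts, nominal supply used for all tests

def static_half_cycle_energy(i_static_amps: float, fo4_delay_s: float,
                             cycle_fo4: int = 32, vdd: float = VDD) -> float:
    """Energy (J) of a static current that flows for half a clock period."""
    half_period_s = 0.5 * cycle_fo4 * fo4_delay_s
    return vdd * i_static_amps * half_period_s

def transition_energy(dynamic_j: float, enters_c_and_d_high: bool,
                      i_static_amps: float, fo4_delay_s: float) -> float:
    """Dynamic energy of a transition, plus the worst-case static energy
    when the transition enters a state with C and D both high."""
    if enters_c_and_d_high:
        return dynamic_j + static_half_cycle_energy(i_static_amps, fo4_delay_s)
    return dynamic_j

# Hypothetical values: 20 uA static current, 100 ps FO4 delay, 50 fJ dynamic.
e_static = static_half_cycle_energy(20e-6, 100e-12)
e_total = transition_energy(50e-15, True, 20e-6, 100e-12)
print(e_static, e_total)
```

With these placeholder values the half period is 1.6 ns, so the static term adds roughly 80 fJ to every transition into a state with C and D both high.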
Energy analysis
In order to more easily analyze the energy numbers in Tables 1 and 2, we constructed several example waveforms, shown in Figure 6. These tests are designed to exemplify the different operating regimes for flip-flops and latches. For example, Tests 1 and 2 emphasize clock activity, while Tests 3 and 4 emphasize data activity. Tests 5, 6, and 7 exhibit high clock, input data, and output data activity. Test 8 has both clock and input data activity, but no output activity.
For each test, we used Tables 1 and 2 to calculate energy. The resulting energy consumption is shown in Table 3. We can see that the optimal flip-flop or latch for each regime varies considerably; some designs perform extremely well in certain regimes, but extremely poorly in others. For example, in Test 2 the low-power SSAFF design uses 8 times less energy than the HLFF structure, but in Test 3 it uses 7 times more energy. Another good example of a TE specialized for an operating regime is CPNLA; this latch design is by far the best choice for Test 3, but by far the worst choice in all other cases.
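The per-test calculation can be sketched in Python as follows. A waveform is represented as a sequence of (C, D, Q) states, and its energy is the sum of the tabulated energies of the transitions it contains. The two design names and all energy values below are hypothetical stand-ins and are not entries from Tables 1 and 2.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int, int]                     # (C, D, Q)
EnergyTable = Dict[Tuple[State, State], float]   # fJ per state transition

def waveform_energy(table: EnergyTable, states: List[State]) -> float:
    """Sum the transition energies along a sequence of TE states."""
    return sum(table[(a, b)] for a, b in zip(states, states[1:]) if a != b)

def best_design(tables: Dict[str, EnergyTable], states: List[State]) -> str:
    """Name of the design that uses the least energy on this waveform."""
    return min(tables, key=lambda name: waveform_energy(tables[name], states))

IDLE, CLK_HIGH, DATA_HIGH = (0, 0, 0), (1, 0, 0), (0, 1, 0)

# Hypothetical designs: one is cheap to clock, the other cheap on data edges.
tables = {
    "low_clock_energy": {(IDLE, CLK_HIGH): 2.0, (CLK_HIGH, IDLE): 2.0,
                         (IDLE, DATA_HIGH): 9.0, (DATA_HIGH, IDLE): 9.0},
    "low_data_energy":  {(IDLE, CLK_HIGH): 8.0, (CLK_HIGH, IDLE): 8.0,
                         (IDLE, DATA_HIGH): 1.0, (DATA_HIGH, IDLE): 1.0},
}

clock_only = [IDLE, CLK_HIGH, IDLE, CLK_HIGH, IDLE]  # clock toggles, data idle
data_only = [IDLE, DATA_HIGH, IDLE]                  # clock gated low, data toggles

print(best_design(tables, clock_only), best_design(tables, data_only))
```

Even with only two made-up designs, the ranking reverses between the clock-dominated and the data-dominated waveform, which is the same kind of regime dependence seen in Table 3.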
In these results we also see the flaw in the methodology of many flip-flop and latch analyses which test only a limited set of data activations with the clock always un-gated [10, 11, 12, 14, 15, 17, 18]. These studies typically look at Tests 5, 6, and 7; however, we see that the optimal TE choice may be very different if we take Tests 1-4 into consideration. Also, in these studies, the TEs are typically optimized for energy-delay product. Our results show that if we size a design for high speed and low power separately, the energy usage can differ substantially. When the TE is not on a critical path the low-power design should be used, and when timing is critical the high-speed design should be used. If TEs are only optimized for energy-delay product, the result will be a slower circuit that burns more power.
Another important observation is that CCPPCFF never uses less energy than SSAFF, even when data is inactive. This is because both designs have two transistor gate loads on the clock. Additionally, SSAFF is significantly faster and less complex than CCPPCFF, so we conclude that it is always a better choice. The analyses in [14, 16, 18] which advocate an individually gated clock are unfair in that they only compare their designs with flip-flops that have eight transistor gate loads on the clock.
Processor design and simulation
To evaluate the effectiveness of designing with diverse flip-flop and latch structures, we tested our idea on a processor datapath. Our processor design is a classic 32-bit MIPS RISC five-stage pipeline (R3000 compatible), including caches and system coprocessor registers.
We are implementing this design as part of a low-power processor project. To date, we
In order to characterize the behavior of the flip-flops and latches in the CPU datapath, we simulated the design using a fast cycle-accurate simulator. We augmented the framework previously presented in [13] to count the relevant TE state transitions. This simulation framework tracks the input and output values of all blocks in the design (flip-flops, adders, muxes, etc.), and is cycle-accurate for both the high and low regions of the clock period. However, it does not accurately track the timing of signals and it does not model glitches.
If modeled accurately, glitching activity would have the effect of increasing the input data activity for TEs, and could possibly affect the optimal design choice. In low-power datapath designs, however, glitching activity is usually kept to a minimum. As a test set, we chose five programs from the SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20, 9), and lzw¹. In total, the benchmarks executed 1.71 billion instructions in 2.69 billion cycles (CPI = 1.57). For each TE, we counted the number of relevant state transitions, subject to the constraints of a cycle-accurate simulator mentioned above. Negative-edge-triggered flip-flops and low-enabled latches were implemented as their positive/high counterparts, but with inverted clock signals.
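The transition counting can be illustrated with a small Python sketch. It assumes that the simulator yields one (C, D, Q) sample per clock phase for a TE and treats every change between consecutive samples as one state transition, which simplifies the bookkeeping described above. The function names and the trace are hypothetical and do not reflect the framework of [13].

```python
from collections import Counter
from typing import Iterable, List, Tuple

State = Tuple[int, int, int]  # (C, D, Q) sampled once per clock phase

def invert_clock(trace: Iterable[State]) -> List[State]:
    """Map a negative-edge or low-enabled TE onto its positive/high
    counterpart by inverting the clock value in every sample."""
    return [(1 - c, d, q) for c, d, q in trace]

def count_transitions(trace: List[State]) -> Counter:
    """Count how often each (from_state, to_state) pair occurs in a trace."""
    counts: Counter = Counter()
    for prev, cur in zip(trace, trace[1:]):
        if prev != cur:
            counts[(prev, cur)] += 1
    return counts

# Hypothetical trace: one idle clock pulse, then D rises and is captured.
trace = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1), (0, 1, 1)]
counts = count_transitions(trace)
print(sum(counts.values()), counts[((0, 0, 0), (1, 0, 0))])
```

Keeping only the counts, rather than the full trace, means the storage per TE stays constant however many cycles are simulated.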
Processor energy results
A simplified view of the data collected by the simulations is shown in Figure 8. Here, the TE state transition counts have been compressed into clock and input data activity. It is readily apparent that the various TEs have substantially different activation patterns. Also, we notice that data activity tends to be very low, while clock activation is generally much greater.
Next we show the total energy used by all TEs in the datapath if a single design is used universally. As a point of reference, the energy for the total datapath other than the flip-flops and latches (and not including caches or control logic) was about 0.21 J for these tests. Figures 9 and 10 show the TE energy plotted against the delay of each TE (from Figure 4). As long as at least one TE is on a critical path (as is the case for the CPU design), this delay has a direct impact on the maximum clock frequency of the circuit. Also plotted (for HLFF, SSASPL, and PPCLA) is the energy usage when a fast design is used for all TEs with critical timing, and the low-power version of this same design is used for non-critical TEs. This shows the improvement that would be obtained by traditional transistor sizing on non-critical timing paths.
We also show optimal points obtained using activity-sensitive selection of TE designs.
One option (Lowest-Energy) is to always choose the optimal TE design to minimize the energy consumption for a particular TE in the datapath. This results in minimal energy, but the delay impact is set by the slowest TE on a critical path. The other option we show (for HLFF, SSASPL, and PPCLA) is High-Speed-Lowest-Energy (HSLE), in which a fast design is used for any timing-critical TE, and the design which results in lowest energy is used otherwise. In this study, we choose a design universally for each multi-bit TE; we found that choosing the optimal design for every bit in every TE only improved results by less than one percent. This is because the clock activity for all bits in a TE is identical, and the data activity tends to be similar.
Table 5 shows the energy breakdown in more detail. For each TE instance, we show the energy for the fastest TE (HLFF-hs, PPCLA-hs), along with that for the lowest-energy TE. We also include SSASPL-hs as a high-speed flip-flop option since it is only slightly slower than HLFF-hs (214 ps vs. 204 ps) but uses much less energy. The totals given show the energy for a fast design with homogeneous TEs, the saving achieved by transistor sizing, and the saving using HSLE activity-sensitive selection. For flip-flops, HSLE selection reduces energy by 69% compared to a fast homogeneous design using HLFF-hs, and 52% compared to a design with transistor sizing. If we start with SSASPL-hs as the base case, the saving is 43% compared to a homogeneous design, and 25% compared to a design with transistor sizing. For latches, the opportunity to save energy is reduced because they are simpler structures, and the fastest latch (PPCLA) is also quite energy efficient for the activation patterns in the datapath. Nevertheless, the energy saving with HSLE selection is 8.3% compared to a homogeneous design using PPCLA-hs, and 6.1% compared to a design using transistor sizing.
¹ This is an optimized version of the SPECint95 compress benchmark.
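The HSLE rule can be sketched in a few lines of Python. The sketch assumes that, for each multi-bit TE instance, the total energy under every candidate design is already known from the transition counts. The instance names and all energy values are hypothetical; only the design names are taken from the text.

```python
from typing import Dict

def select_hsle(instances: Dict[str, dict], fast_design: str) -> Dict[str, str]:
    """Fast design for timing-critical TEs, lowest-energy design otherwise."""
    choice = {}
    for name, info in instances.items():
        if info["critical"]:
            choice[name] = fast_design
        else:
            choice[name] = min(info["energy"], key=info["energy"].get)
    return choice

def total_energy(instances: Dict[str, dict], choice: Dict[str, str]) -> float:
    """Total energy (mJ) when each instance uses its chosen design."""
    return sum(info["energy"][choice[name]] for name, info in instances.items())

def saving_percent(base: float, new: float) -> float:
    """Relative energy saving of `new` over `base`, in percent."""
    return 100.0 * (base - new) / base

# Hypothetical instances and energies (mJ); not values from Table 5.
instances = {
    "pc_reg":     {"critical": True,
                   "energy": {"HLFF-hs": 10.0, "HLFF-lp": 6.0, "SSAFF-lp": 4.0}},
    "status_reg": {"critical": False,
                   "energy": {"HLFF-hs": 8.0, "HLFF-lp": 5.0, "SSAFF-lp": 1.0}},
    "debug_reg":  {"critical": False,
                   "energy": {"HLFF-hs": 6.0, "HLFF-lp": 2.0, "SSAFF-lp": 3.0}},
}

homogeneous = {name: "HLFF-hs" for name in instances}
sizing = {name: "HLFF-hs" if info["critical"] else "HLFF-lp"
          for name, info in instances.items()}
hsle = select_hsle(instances, "HLFF-hs")

base = total_energy(instances, homogeneous)
print(hsle, saving_percent(base, total_energy(instances, hsle)))
```

Because the timing-critical instance keeps the fast design in all three configurations, the cycle time is unchanged and only the energy of the non-critical instances differs.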
Overall, the saving we get for flip-flops and latches using HSLE activity-sensitive selection is 63% compared to a homogeneous design with HLFF-hs and PPCLA-hs, and 46% compared to a design with transistor sizing. If SSASPL-hs is used as the base case flip-flop, the HSLE saving is 35% compared to a homogeneous design, and 19% compared to a design with transistor sizing. Table 5 shows that several different TE structures are used when the processor design is optimized for both energy and speed; this validates our hypothesis that a heterogeneous mix of TE structures can result in a lower energy design without degrading performance.
Summary
Traditionally, designers have chosen flip-flop and latch structures to use uniformly throughout a circuit. Because of this, many studies have compared TE designs based on a limited set of activation patterns in order to determine the best universal design. The proposition of this paper is that no flip-flop or latch design is universally optimal. Designs vary significantly in parameters such as delay, clock switching energy, and input data switching energy. Two important observations allow us to use this variance to enable circuit designs with better energy usage and performance. First, the activation patterns for various TEs in a given circuit may differ considerably. Second, most TEs do not lie on critical paths, and thus have ample timing slack. Based on these observations, we propose an alternative methodology in which the designs for various flip-flops and latches are chosen from among a range of alternatives based on the local operating conditions and delay requirements. We present a variety of TE structures with separate transistor sizings for high speed and low power, and provide complete energy and delay characterizations. We examine several operating regimes based on clock and data activity, and find that indeed there is considerable variation in the optimal TE design for different regimes.
We apply our technique to a MIPS RISC processor design which we simulate for 2.7 billion cycles of program execution to determine flip-flop and latch activation patterns. We determine that, compared to a high-performance design with homogeneous flip-flop and latch structures, a processor designed with activity-sensitive selection of TE structures results in a total TE energy reduction of 63% with no loss in performance. Compared to a design which uses transistor sizing alone to reduce energy, activity-sensitive selection results in a total TE energy reduction of 46%.
Figure 9. The total energy used by all flip-flops in the processor datapath while executing the entire benchmark test set is shown for each candidate design assuming that it is used universally. This energy is plotted against the delay of the flip-flop design, which has a direct impact on maximum clock frequency. A -hs suffix refers to a flip-flop design sized for high speed, while a -lp suffix refers to a design sized for low power. Lowest-Energy shows the results of using activity-sensitive selection to minimize energy for each flip-flop instance. HLFF-Sizing uses HLFF-hs for all timing-critical flip-flops, and HLFF-lp otherwise. HLFF-HSLE uses HLFF-hs for all timing-critical flip-flops, and activity-sensitive selection to pick the lowest energy design otherwise. SSASPL-Sizing and SSASPL-HSLE are analogous to the corresponding HLFF markers.
Table 5. A breakdown of the total energy used by TEs in the processor datapath while executing the entire benchmark test set. Shown are energy numbers (in mJ) for the fastest TE designs (HLFF-hs, PPCLA-hs) and the designs which use the lowest energy in each instance. SSASPL-hs is also included as a high-speed flip-flop option. The total energy is shown as well as the total energy obtained using transistor sizing and the total energy using HSLE activity-sensitive selection. The bold values indicate which energy numbers are chosen with HSLE selection, based on which TEs have critical timing requirements.
