In this paper we describe an area efficient power minimization scheme, "Control Generated Clocking", that saves significant amounts of power in datapath registers and clock drivers of sequential circuits. Power savings are achieved by making simple changes to the state machines controlling the datapath. These changes enable the control signals from the state machines themselves to be used as clocks for the datapath registers. Use of these control generated clocks makes the static timing analysis of designs implementing this scheme simpler when compared to techniques such as clock gating. This scheme preserves the cycle boundaries on which registers load data, thereby allowing reuse of functional test cases developed for the original circuit. In this paper we also describe timing requirements of a design in which this scheme has been implemented, cost-benefit aspects of this scheme and an algorithm for the automatic synthesis of control generated clocks. Results from application of this technique on a complex design are then discussed.
Introduction
High levels of integration, high operational frequencies and the proliferation of battery operated applications have rendered power dissipation a key parameter in the design of present day VLSI circuits, since it affects reliability, performance and cost of the circuit [1] [8]. In high frequency CMOS digital circuits, the dynamic power resulting from charging and discharging of parasitic capacitances dominates over the power dissipated due to leakage currents [1] [8]. The focus of most minimization techniques is therefore dynamic power. In a sequential circuit, the sources of dynamic power dissipation and commonly used techniques for power minimization are listed in Table 1.
Clock gating [1] [6][8] is an effective technique for power minimization in circuits that are idle for long periods of time. Clock gating in its most general form has some practical difficulties, viz.
1. Possibility of glitches on the gated clock signal when the gating signal arrives later than the clock.
2. Difficulties in using static timing analysis effectively on the design when the gating signal is data dependent, which is usually the case.
In this paper, we describe a robust new technique called Control Generated Clocking (CGC) that can be applied to minimize dynamic power in a synchronous sequential circuit. This scheme achieves higher power savings compared to clock gating while overcoming the drawbacks of clock gating listed above. This scheme minimizes the power dissipation in the datapath registers and in the clock drivers, without affecting the cycle boundaries on which registers load data.
The organization of the rest of the paper is as follows. Section 2 of the paper provides an overview of the proposed scheme. Following this, we describe the static timing analysis of a design with CGC in section 3. Section 4 describes an algorithm for automatic synthesis of CGC and we discuss the cost-benefit aspects of CGC in section 5. We present results obtained after implementing CGC in a RISC processor in section 6 and finally conclude in section 7.
Control Generated Clocking (CGC)
Consider a general edge-triggered synchronous sequential circuit shown in figure 1(a), in which multiple state machines are shown, sequencing operations in the datapath. Control signals from the state machines enable writes to registers in the datapath. The master clock MCLK synchronizes all activity in the datapath registers as well as in the controlling state machines. Note that when writes to certain datapath registers are infrequent, a lot of power is unnecessarily dissipated in the drivers of the clock MCLK and in flip-flops of the datapath registers.
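The waste described above can be illustrated with a small counting sketch. It assumes, purely for illustration, that every flip-flop spends the same energy per clock event, that a register clocked by MCLK sees one event per cycle, and that a register clocked only on writes sees one event per write. All numbers and the helper name `register_clock_energy` are hypothetical, and the overheads of any scheme that achieves write-only clocking are ignored.

```python
def register_clock_energy(n_flops, e_clk, n_clock_events):
    """Clock energy spent in a register that sees n_clock_events clock events.

    n_flops: number of flip-flops in the register
    e_clk:   energy per flip-flop per clock event (arbitrary units, assumed equal)
    """
    return n_flops * e_clk * n_clock_events


# Illustrative numbers only; they are not taken from the paper.
N_FLOPS = 32    # width of the register
E_CLK = 1       # energy unit per flip-flop per clock event
CYCLES = 1000   # length of the observation window in MCLK cycles
WRITES = 50     # number of cycles in which the register is actually written

baseline = register_clock_energy(N_FLOPS, E_CLK, CYCLES)    # clocked every cycle
write_only = register_clock_energy(N_FLOPS, E_CLK, WRITES)  # clocked only on writes
fraction_avoided = (baseline - write_only) / baseline

print(baseline, write_only, fraction_avoided)
```

The fraction of avoided clock energy in this idealized model depends only on how rarely the register is written, which is why infrequently written registers are the natural targets.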
Figure 1(b) shows the timing diagram for a write to a single flip-flop in any datapath register in such circuits. The figure shows the control signal WRITE, which is synchronous to the clock MCLK. The flip-flop loads the data on the next rising edge of clock MCLK. Observe that for a single write operation, this also happens to be the falling edge of the control signal. Our scheme essentially uses this control signal as the clock for the flip-flop. The flip-flop must now respond to the falling edge of the WRITE signal. This simple arrangement does not work when there are back-to-back writes to the flip-flop. In such a scenario, the WRITE signal changes only at the end of the last write to the flip-flop. In the CGC scheme described below, this problem is alleviated using RZ pulse generator circuits. In CGC, the general system shown in figure 1(a) is transformed to a system shown in figure 2(a). This scheme introduces certain changes to the original circuit listed below:
Figure 2(b) illustrates the timing relationships between signals in an implementation of this scheme. The figure indicates two back-to-back writes taking place to a flip-flop. Note that the WRITE signal has only one falling edge in this case. If the WRITE signal were directly fed to the clock input of the flip-flop, only one write would take place. The RZ pulse generator toggles the control clock once every cycle, allowing back-to-back writes to take place. The RZ pulse generation circuit is not necessary for those flip-flops that never get updated in a back-to-back fashion.
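The back-to-back case can be checked with a small waveform sketch. It is an idealized model, not the paper's circuit: each cycle is sampled twice, WRITE is held high for the whole of every write cycle, and the hypothetical RZ generator is assumed to hold the control clock low in the first half of a write cycle and high in the second half, so that it returns to zero at every cycle boundary. The function names are illustrative.

```python
def write_waveform(write_cycles, n_cycles):
    """WRITE signal, two samples per cycle, high during every write cycle."""
    wave = []
    for cycle in range(n_cycles):
        level = 1 if cycle in write_cycles else 0
        wave += [level, level]
    return wave


def rz_waveform(write_cycles, n_cycles):
    """Assumed RZ control clock: low then high within each write cycle, else low."""
    wave = []
    for cycle in range(n_cycles):
        wave += [0, 1] if cycle in write_cycles else [0, 0]
    return wave


def falling_edges(samples):
    """Sample indices at which the signal goes from 1 to 0."""
    return [i for i in range(1, len(samples))
            if samples[i - 1] == 1 and samples[i] == 0]


# Two back-to-back writes in cycles 2 and 3 of a six-cycle window.
writes = {2, 3}
direct_edges = falling_edges(write_waveform(writes, 6))
rz_edges = falling_edges(rz_waveform(writes, 6))

print(len(direct_edges), len(rz_edges))
# Each falling edge at sample index i lies on the boundary that starts cycle i // 2.
print([i // 2 for i in rz_edges])
```

In this model the raw WRITE signal yields a single falling edge for the two writes, while the RZ clock yields one falling edge per write, each on a cycle boundary.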
Note from the timing diagram that data gets loaded into the flip-flops at the same cycle boundaries as in the original circuit. This is desirable, because test cases written for the verification of the original circuit can now be reused without any changes.
Timing Analysis
In this section, we describe the timing constraints that must be met to implement CGC successfully in a circuit. In a circuit with CGC, in addition to the original clock domain of MCLK, new clock domains of the control clocks (CCLKs) are introduced. The constraints to be satisfied for ensuring reliable transfers between registers in different clock domains are based on the source and the target clock domains. In CGC, the control clocks are derived from flip-flop outputs, and will have deterministic delays with respect to the master clock. To derive the constraints for reliable transfers between any two flip-flops in a design with control generated clocking, we assume the model shown in figure 3. For simplicity we assume that RZ pulse generators generate all control clocks. We assume that the maximum possible value of the clock skew on the master clock in the design is known and its value is denoted by $+T_{skew}$. Table 2 lists all the timing paths in a design with CGC and the corresponding allowable maximum and minimum delay. In the table, $a_1 = T_{clk} - T_{clk\text{-}q} - T_{su} - T_{skew}$ and $a_2 = T_{skew} + T_{hold} - T_{clk\text{-}q}$, where $T_{clk\text{-}q}$ denotes the output delay of the flip-flop at the source relative to its clock, $T_{su}$ denotes the setup time of the flip-flop in the target, $T_{hold}$ its hold time, and T, the propagation delay through the RZ pulse generator.
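The two delay quantities defined above (written `a1` and `a2` in this sketch) can be evaluated numerically. The sketch assumes that `t_clk` is the clock period, that `a1` acts as an upper bound and `a2` as a lower bound on the combinational delay of a transfer within a single clock domain, and that all values are integer picoseconds. The per-path entries of Table 2 are not reproduced, and the numbers are illustrative rather than taken from the paper.

```python
def max_delay_a1(t_clk, t_clk_q, t_su, t_skew):
    """a1 = Tclk - Tclk-q - Tsu - Tskew (setup-side bound)."""
    return t_clk - t_clk_q - t_su - t_skew


def min_delay_a2(t_skew, t_hold, t_clk_q):
    """a2 = Tskew + Thold - Tclk-q (hold-side bound)."""
    return t_skew + t_hold - t_clk_q


def path_is_safe(delay, a1, a2):
    """Assumed check: the path delay must lie between a2 and a1 inclusive."""
    return a2 <= delay <= a1


# Illustrative values in picoseconds (hypothetical cell and clock data).
T_CLK, T_CLK_Q, T_SU, T_HOLD, T_SKEW = 10_000, 500, 300, 100, 200

a1 = max_delay_a1(T_CLK, T_CLK_Q, T_SU, T_SKEW)
a2 = min_delay_a2(T_SKEW, T_HOLD, T_CLK_Q)
print(a1, a2, path_is_safe(4_000, a1, a2), path_is_safe(9_500, a1, a2))
```

A negative `a2`, as in this example, means that in this model even a zero-delay path would satisfy the hold requirement.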
For transfers between an MCLK domain and a CCLK domain, the delay introduced by the RZ pulse generator manifests itself as additional clock skew. For transfers between two different CCLK domains, this method can expose false timing paths in the original design. In table 2, n indicates the minimum path length between the set of states that write to the source and the set of states that write to the target, in the controller state transition diagram.
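The quantity n can be computed with a breadth-first search over the state transition diagram. The following is a minimal sketch under stated assumptions: the diagram is given as a list of (source state, destination state) pairs, path length is counted in transitions, and every counted path contains at least one transition, so a state that writes both registers is only reached again by going around a loop. The function name `min_path_length` and the example machine are hypothetical.

```python
from collections import deque


def min_path_length(transitions, source_states, target_states):
    """Fewest transitions (at least one) from any source state to any target state.

    Returns None when no target state is reachable.
    """
    successors = {}
    for src, dst in transitions:
        successors.setdefault(src, []).append(dst)

    # Seed the search with the one-step successors of every source state,
    # so that each counted path has at least one transition.
    distance = {}
    queue = deque()
    for state in source_states:
        for nxt in successors.get(state, []):
            if nxt not in distance:
                distance[nxt] = 1
                queue.append(nxt)

    while queue:
        state = queue.popleft()
        if state in target_states:
            return distance[state]
        for nxt in successors.get(state, []):
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    return None


# Hypothetical four-state ring controller: S0 -> S1 -> S2 -> S3 -> S0.
ring = [("S0", "S1"), ("S1", "S2"), ("S2", "S3"), ("S3", "S0")]
n_forward = min_path_length(ring, {"S1"}, {"S3"})
n_same_state = min_path_length(ring, {"S1"}, {"S1"})
print(n_forward, n_same_state)
```

A larger n means more master clock cycles separate the launch of data at the source from its capture at the target in this model.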
In sum, timing analysis of a circuit with control generated clocking is simpler than that of a circuit implementing clock gating. This is due to the fact that this scheme introduces deterministic skews on the control clocks. 
Automation
This section describes a procedure to automate the insertion of CGC in a circuit. It takes as input a description of the state machines controlling the registers in the datapath along with a set of target registers, and outputs a re-coded state machine suitable for CGC. This procedure also marks those registers that require shadow flip-flops and can be updated in a back-to-back manner. Each state in the state machine description has an attribute code that represents its encoding. The set of registers in the datapath that are to be targeted by this scheme is specified in the set T. Each register $r_k \in T$ has an attribute CS, called the control set, which is the set of all states which cause a write to $r_k$. The control set of each register is constructed from the input state machine description. Every target register has two other boolean attributes, namely shadow and bb, which respectively indicate whether the register needs a shadow flip-flop for its control signal and whether it can get updated in a back-to-back fashion. The attribute bb is used to insert an RZ pulse generator for the control signal. The algorithm for inserting CGC is described below.
// Main body of the algorithm
FSM f;
set_of_registers T;
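The attributes CS and bb described in this section can be sketched in Python. This is not the paper's algorithm, which also re-codes the state machine and sets the shadow attribute. It only illustrates building control sets from a state machine description and marking bb under the assumption that a register can be updated back to back when some transition leads from one state of its control set to another state of that set. The data layout and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TargetRegister:
    name: str
    cs: set = field(default_factory=set)  # control set: states that write this register
    bb: bool = False                      # True if back-to-back writes are possible


def build_control_sets(writes, targets):
    """writes maps each state to the set of register names written in that state."""
    regs = {name: TargetRegister(name) for name in targets}
    for state, written in writes.items():
        for name in written:
            if name in regs:
                regs[name].cs.add(state)
    return regs


def mark_back_to_back(regs, transitions):
    """Assumed rule: bb holds if a transition connects two states of the control set."""
    for reg in regs.values():
        reg.bb = any(src in reg.cs and dst in reg.cs for src, dst in transitions)
    return regs


# Hypothetical controller: register A is written in S1 and S2, register B only in S2.
writes = {"S0": set(), "S1": {"A"}, "S2": {"A", "B"}, "S3": set()}
transitions = [("S0", "S1"), ("S1", "S2"), ("S2", "S3"), ("S3", "S0")]

regs = mark_back_to_back(build_control_sets(writes, {"A", "B"}), transitions)
for name in sorted(regs):
    print(name, sorted(regs[name].cs), regs[name].bb)
```

Under this rule only register A would need an RZ pulse generator, since S1 is followed directly by S2 and both write A.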
In this section we derive analytical expressions for the power saved in a circuit with CGC, and the overheads associated with it. With CGC, power is saved in the registers and in the clock drivers. In subsection 5.1, we derive an expression for register power saved. Following this, in subsection 5.2, an expression for the power saved in the clock drivers is derived. In subsection 5.3, we estimate the overheads in CGC.
Power saved in registers
Power dissipated in the target logic without CGC is first computed. We assume that each target flip-flop in the datapath consumes the same energy Eclk per clock, and the same energy 
$\sum_{i=1} C_{cclk}(i)\, f_{write\text{-}i}$
Power saving in the clock drivers is calculated by substituting for $P$ and $P_{opt}$ in Eqn (1), to yield
Clock Power Saving = - where $k_1 = N_{reg} C_{overhead}$ is a parameter that indicates the change in master clock net capacitance after implementing CGC, and $k_2$ = is a parameter that indicates the ratio of the loads due to the untargeted logic to the targeted logic.
Note that $k_1$ and $k_2$ are constants for a given design, but are not independent of one another. Unlike the previous expression for power saving in registers, the expression for power saving in clock power can take on negative values. When n has its maximum possible value of 1, the savings in clock power has a value -1. This means that the circuit will consume more clock power after optimization. By keeping the overhead capacitances low, $k_1$ can be minimized sufficiently to keep the additional power minimal.
$N_{flip\text{-}flops}\, C_{flip\text{-}flop}$
$N_{flip\text{-}flops}\, C_{flip\text{-}flop}$ $k_1 + k_2$
Overheads in CGC
The overheads in CGC are in re-coding the state machines, the RZ pulse generators, and in the shadow flip-flops. As an estimate for the overhead, we use the number of additional flip-flops that have to be added to the design for CGC. The overhead depends on the nature of the state machines in the design before implementing this scheme. If the initial design had a Moore state machine with $N_{states}$ states as the controller, and n out of these states write to at least one register, then after re-coding the state machines, the number of flip-flops needed to implement the state machines is given by 
Nff-design
As indicated by this expression, the overhead increases roughly linearly with the number of states that write to at least one register. This expression also indicates that if this scheme targets a large number of registers in the circuit, $N_{flip\text{-}flops}$ in the denominator would be large and the overheads can be kept low. This also has the positive effect of increasing the power savings. In addition to the area overhead derived above, this scheme incurs a small timing penalty on account of skew introduced on each control clock, as described in section 3.
Results
In this section we discuss results obtained after implementing CGC in an experimental processor. The processor has an instruction set based on the DLX described in figure 5, it is evident that with aggressive application of CGC, in DLXA-DLXE, significant amounts of power (up to 73.:!5% in this case) can be saved. One may note that the control power is higher in circuits in which CGC has been applied aggressively. This is due to the fact that the control block dissipates the clock power of datapath blocks with CGC. The higher control power is also due to the increased gate count of the control block. It is also observed that the clock power decreases when CGC is aggressively applied. This is because the datapath blocks derive their clocks from the control block, lowering the capacitance on the clock line. In DLXF, we observe that the Decode block power is nearly the same as in DLX. This is due to the fact that Design Compiler depends on the HDL description to identify idle conditions. The tool sometimes misses opportunities for minimization, depending on the HDL coding style used. It is observed that the Control block consumes very little power in DLXF. This is due to the fact that Design Compiler can gate clocks in the control blocks as well. Internally, the tool does not distinguish between datapath and control. The tool uses the information from the HDL for the generation of the gated clock. Finally, we observe that aggressive application of CGC can yield about 16% higher power savings than clock gating.
From figure 6 it is evident that CGC achieves a reduction in the gate count of blocks in which it is used. This is because CGC eliminates the multiplexers used by datapath registers to recirculate data when idle. It is also observed that the gate count of the control block increases due to the overheads associated with CGC. implementation of clock gating) reveals that DLXF requires a higher gate count.
Conclusions
In this paper, we describe an area efficient power minimization scheme "Control generated clocking" (CGC) for minimizing dynamic power dissipation of registers and clock drivers in a synchronous sequential circuit. CGC minimizes power by utilizing control signals generated by FSMs as clocks for the registers. CGC is a robust method in which power minimization is achieved without the possibility of introducing glitches on the clocks. Timing analysis of a circuit with CGC is simpler than in circuits with gated clocks. CGC preserves the cycle boundaries on which registers load data thereby allowing reuse of test cases developed for the original design without any modifications. CGC is a structured method and can be easily incorporated into a synthesis tool for automation. The utility of CGC has been demonstrated with a complex example.
