This paper proposes a novel method to estimate and reduce the redundant power of synchronous circuits at the RT level of design. Because much of the redundant power is caused by redundant clockings, which activate registers unnecessarily, we detect these clockings from the difference between the numbers of incoming and outgoing data of each register. We then introduce a gated-clock scheme, guided by our estimation results, to reduce the power consumption of the circuits. Experimental results demonstrate the accuracy of our method and its effectiveness for power reduction.
Introduction
Analyzing the power consumption of a circuit is very important for low power design. Several power estimation techniques for CMOS digital circuits have been proposed [2-9]. These techniques estimate the power consumption of the whole circuit. However, they cannot distinguish redundant behaviors from essential ones. Therefore, an LSI designer cannot learn which part of the circuit behaves redundantly or how much power can be reduced, although such information is very useful for low power design.
In this paper, we propose a novel method which identifies and reduces redundant clockings. Since these clockings activate registers unnecessarily, they are a critical issue for low power design. A clock signal charges and discharges a large wire load capacitance and the internal capacitance of register cells at high frequency. Furthermore, the output of a register causes switching in the subsequent circuits. If the power consumed by redundant clockings is estimated, an LSI designer can know how much power each register wastes and apply low power techniques such as the gated-clock scheme [11] or a multi-phase clock scheme.
Hereafter, when a register A feeds data to a register B, we refer to register A as a "source register" of register B, and to register B as a "destination register" of register A.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ©1997 ACM 0-89791-903-3/97/08..$3.50
In our method, redundant clockings are detected from the difference between the number of data transferred from source registers and the number of data transferred to destination registers. We regard the number of times a data-transfer condition becomes true as the number of transferred data. The data-transfer conditions among registers are extracted statically by analyzing RT level HDL descriptions. Then, we dynamically count how many times the conditions are satisfied during RT level simulation.
This paper is organized as follows. In section 2, we describe the redundant behaviors of a register and the basic ideas to detect redundant clockings. In section 3, we present an algorithm to estimate the power consumed by redundant clocking. In section 4, we discuss how to reduce the power consumption of the circuit using the estimation results. Section 5 presents experimental results that show the effectiveness of our method, and section 6 concludes the paper.
Basic Ideas
In this section, we show the basic ideas to detect redundant clockings. Redundant clockings occur when a register stores excessive data from its source registers or feeds excessive data to its destination registers. To detect them, we focus on the difference between the numbers of incoming and outgoing data of a register, or on the balance between the number of incoming data and the number of clockings for a register.
We define three types of redundant behaviors of a register which are caused by redundant clockings.
Unused data latching: if a register stores excessive data, some data are not transferred to any destination register. We call this behavior unused data latching.
Unchanged data latching: if the source registers of a register feed data which has not been updated, the register stores the same data that is already stored in it. We call this behavior of the register unchanged data latching.
Redundant data holding: if a register is clocked but does not store data incoming from its source registers in a certain clock cycle, the register stores the data from itself. We call this behavior redundant data holding.

First, let us consider a register which has a single source register and a single destination register. Fig. 1 shows an example circuit. Registers A, B, and X are driven by signal clock. We assume that the numbers of data-transfers during 10 clock cycles are as follows:
Data incoming to A: 5
Data transferred from A to X: 8
Data transferred from X to B: 6
Clockings for X: 10
We focus on register X. The number of data incoming from A to X, eight, is larger than the number of data outgoing from X to B, six, so we identify two unused data latchings of X. The number of clockings for X, ten, is larger than the number of data incoming to X, eight, so we also identify two redundant data holdings of X. Now consider register A. The number of data incoming to A, five, is smaller than the number of data outgoing from A, eight. From these two numbers for A, we identify three unchanged data latchings of X.
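The arithmetic of the Fig. 1 example can be checked with a few lines of Python. This is only a worked illustration of the counting argument above; the variable names are ours, and the counts are the ones given for the 10-cycle example.

```python
# Counts from the 10-cycle example of Fig. 1 (variable names are illustrative).
clockings_x = 10   # clock edges that reach register X
lat_a = 5          # data incoming to register A
a_to_x = 8         # data transferred from A to X (= data incoming to X)
x_to_b = 6         # data transferred from X to B (= data outgoing from X)

# X is clocked but stores nothing from A: redundant data holding.
redundant_holding_x = clockings_x - a_to_x
# X stores data that never reaches B: unused data latching.
unused_latching_x = a_to_x - x_to_b
# A feeds X more often than A itself is updated: unchanged data latching of X.
unchanged_latching_x = a_to_x - lat_a

print(redundant_holding_x, unused_latching_x, unchanged_latching_x)  # 2 2 3
```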
Next, let us consider a more complex case where there are multiple source registers and/or multiple destination registers. In that case, we treat the multiple source registers or the multiple destination registers as a pseudo-register.
An example of a complex circuit is shown in Fig. 2. For register X, we introduce two pseudo-registers F and G. We make the following assumptions:
The number of data incoming from G to X is the sum of the numbers of data from A, B, and C.
If A, B, and/or C stores data in a certain clock cycle, G is updated.

On these assumptions, we treat the case of multiple source registers and/or multiple destination registers in the same way as the case of a single source register and/or a single destination register.
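A minimal sketch of the two pseudo-register assumptions, assuming we have per-cycle boolean traces of whether A, B, and C store data and per-source counts of data transferred into X. The function name and trace format are hypothetical, chosen only to make the two rules concrete.

```python
from typing import Dict, List, Tuple


def pseudo_source(stores: Dict[str, List[bool]],
                  transfers_to_x: Dict[str, int]) -> Tuple[int, int]:
    """Merge several source registers into one pseudo-register G.

    stores[r][t] is True if source register r stores data in cycle t.
    transfers_to_x[r] is the number of data transferred from r to X.
    Returns (cycles in which G is updated, data incoming from G to X).
    """
    n_cycles = len(next(iter(stores.values())))
    # G is updated in a cycle if A, B, and/or C stores data in that cycle.
    g_updates = sum(1 for t in range(n_cycles)
                    if any(trace[t] for trace in stores.values()))
    # Data incoming from G to X is the sum of the data from A, B, and C.
    g_to_x = sum(transfers_to_x.values())
    return g_updates, g_to_x


stores = {"A": [True, False, False, True],
          "B": [False, False, True, True],
          "C": [False, False, False, False]}
print(pseudo_source(stores, {"A": 2, "B": 1, "C": 1}))  # (3, 4)
```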

Algorithm
In this section, we describe in detail our algorithm to count the numbers of data-transfers among registers. First, we define data-transfer conditions, which become true when data-transfers from/to a register occur. They are extracted statically by analyzing RT level HDL descriptions. Then, we count the number of times these conditions become true; these numbers are counted dynamically in RT level simulation.
Extraction of the data-transfer conditions is shown in section 3.1. Estimation methods for the number of redundant clockings and for the redundant power are described in sections 3.2 and 3.3, respectively. The outline of our algorithm is described in section 3.4.
Extraction of the Data Transfer Conditions
We define a data transfer graph (DTG) to capture the relationship of data-transfers among registers on the data-path. A data transfer graph is a directed graph, as shown in Fig. 3. A node $v_i$ represents a register i in the circuit. A directed edge $(v_i, v_j)$ exists if and only if data is transferred from register i to register j through combinational circuits only. We treat a primary input as a register which feeds data in every clock cycle, and a primary output as a register which stores data in every clock cycle.

Each edge $(v_i, v_j)$ has a data-transfer condition $C_{RT}(v_i, v_j)$, which becomes true when data is transferred from register i to register j. For example, suppose the HDL description of an edge $(v_i, v_j)$ means "when signal cond equals '1' at the rising edge of signal ck, the value of i is assigned to j." Then, the data-transfer condition $C_{RT}(v_i, v_j)$ is represented by (1).

$$C_{RT}(v_i, v_j) = (cond = \text{'1'}) \wedge \mathit{rise}(ck) \qquad (1)$$

where $\mathit{rise}(ck)$ is true at a rising edge of ck. We represent $\wedge$ as logical product (AND) and $\vee$ as logical sum (OR) throughout the paper.

$C_{LAT}(v_i)$ is a condition of data-transfer between register i and one or more source registers of register i. Let registers 1, 2, ..., m be the source registers of register i; then $C_{LAT}(v_i)$ is represented by (2).

$$C_{LAT}(v_i) = \bigvee_{k=1}^{m} C_{RT}(v_k, v_i) \qquad (2)$$

$C_{USED}(v_i)$ is a condition of data-transfer between register i and one or more destination registers of register i. Let registers 1, 2, ..., n be the destination registers of register i; then $C_{USED}(v_i)$ is represented by (3).

$$C_{USED}(v_i) = \bigvee_{k=1}^{n} C_{RT}(v_i, v_k) \qquad (3)$$
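The following Python sketch models a DTG whose edges carry data-transfer conditions as predicates over a per-cycle snapshot of signal values; $C_{LAT}$ and $C_{USED}$ of a register are then the logical OR of the conditions on its incoming and outgoing edges. The class, the method names, and the signal names (rise_ck, en_ax, en_xb) are hypothetical; a real tool would extract the predicates from the HDL rather than have them written by hand.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, bool]             # signal values sampled in one clock cycle
Condition = Callable[[State], bool]


class DataTransferGraph:
    """Directed graph: edge (i, j) carries the condition C_RT(v_i, v_j)."""

    def __init__(self) -> None:
        self.edges: Dict[Tuple[str, str], Condition] = {}

    def add_edge(self, src: str, dst: str, cond: Condition) -> None:
        self.edges[(src, dst)] = cond

    def c_lat(self, reg: str, state: State) -> bool:
        # OR of C_RT(v_k, v_reg) over every source register k of reg.
        return any(c(state) for (s, d), c in self.edges.items() if d == reg)

    def c_used(self, reg: str, state: State) -> bool:
        # OR of C_RT(v_reg, v_k) over every destination register k of reg.
        return any(c(state) for (s, d), c in self.edges.items() if s == reg)


dtg = DataTransferGraph()
# "when en_ax equals '1' at the rising edge of ck, the value of A is assigned to X"
dtg.add_edge("A", "X", lambda s: s["rise_ck"] and s["en_ax"])
dtg.add_edge("X", "B", lambda s: s["rise_ck"] and s["en_xb"])

state = {"rise_ck": True, "en_ax": True, "en_xb": False}
print(dtg.c_lat("X", state), dtg.c_used("X", state))  # True False
```

A register with no incoming edges (A here) gets $C_{LAT} = $ false, since the OR over an empty set of conditions is false.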
Estimation of the Number of Redundant Clockings
In this section, we describe how to estimate the number of redundant clockings.
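Let $A_{CK}(v_i)$ be the number of clockings for register i, and let $A_{LAT}(v_i)$ and $A_{USED}(v_i)$ be the numbers of clock cycles in which $C_{LAT}(v_i)$ and $C_{USED}(v_i)$ become true during RT level simulation. Let $A_{HOLD}(v_i)$, $A_{UU}(v_i)$, and $A_{UC}(v_i)$ be the numbers of redundant data holdings, unused data latchings, and unchanged data latchings of register i, respectively.

$A_{HOLD}(v_i)$ is estimated using (5).

$$A_{HOLD}(v_i) = A_{CK}(v_i) - A_{LAT}(v_i) \qquad (5)$$

$A_{UU}(v_i)$ and $A_{UC}(v_i)$ are estimated using (6) and (7), respectively, where $v_s$ denotes the source register (or source pseudo-register) of register i.

$$A_{UU}(v_i) \ge A_{LAT}(v_i) - A_{USED}(v_i) \qquad (6)$$

$$A_{UC}(v_i) \ge A_{LAT}(v_i) - A_{LAT}(v_s) \qquad (7)$$

Equality in (6) holds when the destination register of register i always receives updated data, that is, when no datum stored in register i is transferred more than once. If some data incoming to register i are transferred to the destination register more than once, the right side of (6) underestimates $A_{UU}(v_i)$. Equality in (7) holds when all data incoming to the source register of register i are transferred to register i. If some data incoming to the source register are not transferred to register i, the right side of (7) underestimates $A_{UC}(v_i)$.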
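The dynamic counting step can be sketched as below, assuming the RT level simulator exports, for each clock cycle, whether register i was clocked and whether $C_{LAT}(v_i)$, $C_{USED}(v_i)$, and $C_{LAT}(v_s)$ of its source (pseudo-)register were true. It evaluates $A_{HOLD} = A_{CK} - A_{LAT}$ and, following our reading of the bounds, $A_{LAT}(v_i) - A_{USED}(v_i)$ and $A_{LAT}(v_i) - A_{LAT}(v_s)$ as lower bounds on $A_{UU}$ and $A_{UC}$. The record layout is hypothetical, and clamping the bounds at zero is our own choice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CycleSample:
    clocked: bool   # register i received a clock edge in this cycle
    lat: bool       # C_LAT(v_i) was true
    used: bool      # C_USED(v_i) was true
    src_lat: bool   # C_LAT(v_s) was true for the source (pseudo-)register


def count_redundant(trace: List[CycleSample]) -> Dict[str, int]:
    a_ck = sum(s.clocked for s in trace)      # A_CK(v_i)
    a_lat = sum(s.lat for s in trace)         # A_LAT(v_i)
    a_used = sum(s.used for s in trace)       # A_USED(v_i)
    a_src_lat = sum(s.src_lat for s in trace) # A_LAT(v_s)
    return {
        "A_HOLD": a_ck - a_lat,                 # clocked without storing new data
        "A_UU_min": max(a_lat - a_used, 0),     # lower bound on unused latchings
        "A_UC_min": max(a_lat - a_src_lat, 0),  # lower bound on unchanged latchings
    }


# Reproduce only the Fig. 1 counts for X: 10 clockings, 8 in, 6 out, A updated 5 times.
trace = [CycleSample(True, t < 8, t < 6, t < 5) for t in range(10)]
print(count_redundant(trace))  # {'A_HOLD': 2, 'A_UU_min': 2, 'A_UC_min': 3}
```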
Estimation of the Power Consumed by Redundant Clocking
We estimate the redundant power which is caused by redundant clocking. The power consumption of CMOS circuits is denoted as follows:

$$P = \alpha C_L V^2 f$$

where $P$ is power consumption, $V$ is supply voltage, $C_L$ is load capacitance, $\alpha$ is switching rate, and $f$ is clock frequency [1].
Let $P_{HOLD}(v_i)$, $P_{UU}(v_i)$, and $P_{UC}(v_i)$ be the power consumed by the redundant behaviors of register i, namely redundant data holding, unused data latching, and unchanged data latching, respectively. Then, they are estimated as follows:
In equations (10)-(13), $CK(v_i)$ is the clock driving register i. $P_{UU}^{CK}(v_i)$ and $P_{UU}^{FUNC}(v_i)$ are, respectively, the power consumed by the clock net and the power consumed by the subsequent combinational circuit of register i when unused data latching occurs. $L_{CK}(v_i)$ is the load capacitance which is charged and discharged when the clock signal changes, and $L_{FUNC}(v_i)$ is the load capacitance which is charged and discharged when the output of register i changes. In (12), we assume that the output of a register changes from '1' to '0' n/2 times during n clock cycles.
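As a rough illustration only, the CMOS dynamic-power formula, assumed here in the form $P = \alpha C_L V^2 f$, can be turned into a per-register estimate by taking the switching rate of the clock load as the fraction of cycles with a redundant data holding. This is our simplifying assumption, not a reconstruction of equations (10)-(13), and the capacitance, voltage, and frequency values below are made up.

```python
def dynamic_power(alpha: float, c_load: float, vdd: float, f_clk: float) -> float:
    """CMOS dynamic power P = alpha * C_L * V^2 * f (watts, farads, volts, hertz)."""
    return alpha * c_load * vdd ** 2 * f_clk


def holding_clock_power(a_hold: int, n_cycles: int, l_ck: float,
                        vdd: float, f_clk: float) -> float:
    """Hypothetical clock-net power of redundant data holding.

    Assumes each redundant holding charges and discharges the clock load
    L_CK(v_i) once, so the effective switching rate is a_hold / n_cycles.
    """
    return dynamic_power(a_hold / n_cycles, l_ck, vdd, f_clk)


# 2 redundant holdings in 10 cycles, 5 pF clock load, 3.3 V, 20 MHz (made-up values).
p = holding_clock_power(2, 10, 5e-12, 3.3, 20e6)
print(f"{p * 1e3:.4f} mW")  # 0.2178 mW
```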
Outline of Our Method
We show the outline of our algorithm. 
Power Reduction
The gated-clock scheme is one of the solutions for low power design. Although it sometimes causes clock skew problems in the timing design phase, it is still widely used for low power design of synchronous circuits due to its effectiveness. We adopt this scheme to eliminate redundant clockings.
Since modifying the clocking scheme of every register that wastes power requires long redesign time and much effort, we select the registers whose clocking should be modified as follows:
1. Detect the clock cycles in which each register is clocked redundantly:
i) Calculate the sum of $A_{HOLD}$, $A_{UU}$, and $A_{UC}$ for each register in every clock cycle.
ii) Record t for each register if and only if the sum in clock cycle t is larger than the sum in clock cycle t-1, since a redundant clocking for the register is detected in clock cycle t.
2. Group registers which behave redundantly in similar clock cycles as follows:
foreach i (registers which do not belong to any group)
    Let register i belong to a new group Gi.
    foreach j (registers which do not belong to any group)
        Count the number of clock cycles in which both registers i and j behave redundantly.
        if (counted number > given threshold)
            Let register j belong to group Gi.
3. Calculate the total redundant power for each group.
4. Select groups whose total redundant power is more than a given threshold power. They are the targets of clocking-scheme modification for power reduction.

[Figure: (b) Example description of gated-clock scheme.]
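The selection steps above translate fairly directly into Python. The sketch assumes each register's running sum $A_{HOLD} + A_{UU} + A_{UC}$ is available per clock cycle and that per-register redundant power has already been estimated; the function names, thresholds, and data are hypothetical. Like the pseudocode, it compares each candidate j only with the register i that started the group, so the result depends on the order in which registers are visited.

```python
from typing import Dict, List, Set


def redundant_cycles(running_sum: List[int]) -> Set[int]:
    """Step 1: record t whenever the running sum grows from cycle t-1 to t."""
    cycles: Set[int] = set()
    prev = 0  # treat the sum before cycle 0 as zero
    for t, total in enumerate(running_sum):
        if total > prev:
            cycles.add(t)
        prev = total
    return cycles


def group_registers(cycles: Dict[str, Set[int]], threshold: int) -> List[List[str]]:
    """Step 2: greedy grouping of registers that are redundant in the same cycles."""
    unassigned = list(cycles)
    groups: List[List[str]] = []
    while unassigned:
        i = unassigned.pop(0)            # register i starts a new group Gi
        group = [i]
        for j in list(unassigned):       # iterate over a copy while removing
            if len(cycles[i] & cycles[j]) > threshold:
                group.append(j)
                unassigned.remove(j)
        groups.append(group)
    return groups


def select_groups(groups: List[List[str]], power: Dict[str, float],
                  p_threshold: float) -> List[List[str]]:
    """Steps 3 and 4: keep groups whose total redundant power exceeds p_threshold."""
    return [g for g in groups if sum(power[r] for r in g) > p_threshold]


cycles = {"r1": redundant_cycles([0, 1, 2, 3, 4, 4]),  # -> {1, 2, 3, 4}
          "r2": {2, 3, 4, 5},
          "r3": {7, 8}}
groups = group_registers(cycles, threshold=2)
print(groups)  # [['r1', 'r2'], ['r3']]
print(select_groups(groups, {"r1": 0.2, "r2": 0.15, "r3": 0.1}, 0.3))  # [['r1', 'r2']]
```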
We introduce a single gated clock for each selected group. An enabling condition for a gated clock is derived from the conditions of data-transfer for each register in the group. An example HDL description is shown in Fig. 6. The condition of data-transfer for register j is cond = '1'. We assume that a 2-input AND gate is used for gating clock ck. Then the enabling signal for gated clock signal gck is cond.
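The effect of a shared gated clock on the number of clock pulses can be sketched with a cycle-level model. We assume, hypothetically, that the group enable is the OR of the members' data-transfer conditions and that each member keeps its own data-transfer condition, so the gated clock only suppresses cycles in which no member stores data. With a plain 2-input AND gate the enable must also stay stable while ck is high to avoid glitches on gck; this model does not capture timing.

```python
from typing import Dict, List


def group_enable(conds: Dict[str, List[bool]]) -> List[bool]:
    """Hypothetical enable: high in cycle t if any member's data-transfer condition is true."""
    n_cycles = len(next(iter(conds.values())))
    return [any(trace[t] for trace in conds.values()) for t in range(n_cycles)]


def gated_clock_pulses(enable: List[bool]) -> int:
    """gck = ck AND en: one gck pulse in every cycle whose enable is high."""
    return sum(enable)


conds = {"j": [True, False, True, False],
         "k": [False, False, True, True]}
en = group_enable(conds)
print(en, gated_clock_pulses(en), "of", len(en), "pulses")
# [True, False, True, True] 3 of 4 pulses
```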
Because the gated-clock scheme requires additional circuits for the clock enabling signal, it may cause area, delay, and power overheads. In practical design, the trade-off between these overheads and the power reduction should be considered; when the power reduction for a register is small, the clocking scheme for the register should not be modified. Estimation results also depend on the test patterns given for RT level simulation. Consider a register which wastes power in simulation with a given test pattern. Introducing the gated-clock scheme for the register does not change the functionality of the circuit, because we derive the enabling signal for a gated clock from the conditions of data-transfer. However, to reduce redesign effort, test patterns that reproduce the actual behavior of the circuit should be used for RT level simulation.
Experimental Results
We have developed a power analysis system for RT level circuits and applied it to two example circuits. Experimental results demonstrate that our method can precisely estimate the power which can be reduced, and that information about redundant clocking is useful for low power design. We use a commercial CAD tool to estimate the total power consumption of the whole circuit. The load capacitances $L_{CK}(v_i)$ and $L_{FUNC}(v_i)$ described in section 3.3 are also calculated by the tool.
The result of the power estimation is shown in Table 2. Using our method (section 3), the redundant power consumed by redundant clockings is estimated as 31.2mW, which is about 40% of the total power. Table 3 shows the distribution of registers which behave redundantly. The first column shows percentages of redundant clockings, and the other columns show the numbers of registers which exhibit redundant data holding, unused data latching, and unchanged data latching, respectively. It shows that 25 registers exhibit redundant data holding in over 90% of all clock cycles, and that 5 registers store unchanged data in over 90% of all clock cycles.
Power Reduction of Circuit A
Using our method (section 4), 13 groups are selected as targets of modification when the threshold of total redundant power of a group is set to 0.3mW. The number of registers and the total redundant power in each group are shown in Table 4. The estimated total redundant power of all selected registers is 26.9mW. We manually added 13 gated clocks to the HDL description to drive these registers, which reduced the power consumption by 28.9mW, 37% of the total power, as shown in Table 5. This result shows that our method can estimate redundant power accurately and reduce the power consumption efficiently.
The number of gates of the modified circuit is 2,387, which is smaller than that of the original circuit. One reason for the reduction in gate count is that the individual data-transfer control circuits of all flip-flops in a group are replaced with a single enabling circuit for a gated clock.
Example Circuit B
Circuit B is another part of the video signal processor described in section 5.1. Its dimensions are shown in Table 6. This circuit operates in two modes, recording and playback. The power consumption of the whole circuit and the redundant power estimated using our method are shown in Table 7, which shows that about 25% of the total power is consumed by redundant clockings in both modes. In this circuit, redundant clockings are detected at 117 of 180 registers. The gated-clock scheme described in section 4 is used for low power. We selected 66 of the 180 registers as targets of modification. The redundant power of the selected registers and the power reduced by the gated-clock scheme are shown in Table 8. We reduced the power by 37mW in recording mode and 35mW in playback mode, which are 29% and 27% of the total power, respectively.
In this case, the reduced power is larger than the estimated one. Recall the example circuit in Fig. 1. If data is not transferred from X to B in a clock cycle t, one unused data latching of X is identified. However, the clocking for A in clock cycle t-1 is also a redundant clocking, and our algorithm does not identify this behavior of A as unused data latching. In the experiment, we also modified the HDL description of the clocking for registers such as A, based on the information about the redundant behavior of X.
Conclusion
We have proposed a method to detect redundant clockings of registers, namely redundant data holding, unused data latching, and unchanged data latching, in an RT level circuit. To detect the redundant clockings, the numbers of data-transfers are profiled using RT level simulation, and the redundant power is estimated. In the experiments, we estimated the wasted power in two circuits and obtained 27-37% power reduction by introducing a gated-clock scheme using the information about redundant clockings. The experimental results show that our method can estimate redundant power accurately.
