Sleep mode operation and exploiting it to minimize the average power consumption are of great importance in modern VLSI circuits. In general, sleep mode refers to the mode in which parts of the system are idle. In this paper, we study the problem of partitioning a circuit according to the activity patterns of its elements such that circuit elements with similar activity patterns are packed into the same partition. A partition can then be placed in sleep mode during the time intervals in which all elements contained in that partition are idle. We formulate the partitioning problem to exploit sleep mode operation and show that the problem is NP-complete. We present polynomial time algorithms for practical classes of the problem. Applications of the problem to memory and module partitioning and clock gating are discussed. The experimental data confirm that a careful partitioning allows up to 40% more sleep time, which could be exploited to minimize the average power consumption.
Introduction
Advances in VLSI and packaging technologies have increased the average transistor count in a chip by about one-hundredfold every decade [2], allowing much more complex functionality.
Moreover, the advent of portable and mobile communication and computing services has stirred a great deal of interest in both the commercial and research areas. The dissipation of the heat generated by highly integrated circuits is a crucial factor because virtually all failure mechanisms are boosted at higher temperatures [2]. The minimization of power consumption in modern circuits is therefore of great importance. Due to this importance, there has been a considerable shift of attention in the logic and layout synthesis areas [14, 17, 20, 22, 23] and more recently in high-level synthesis [4, 5, 15] from the delay and area minimization issues towards low power design. Previous research for low power synthesis of digital circuits has focused on issues such as activity-driven technology decomposition and mapping [17, 20, 22], low-power state assignment [12, 21], architectural transformation and reduction of power supply voltage [4], wire and driver sizing [6, 18], and reversible and adiabatic computing [7, 26]. For a survey of these techniques see [8].
Transition density, or average switching rate at different sites in a circuit, is introduced in [16] as a quantity to measure the circuit activity, which can be used to estimate the average power consumption in a digital circuit. Recent studies [1] indicate that the clock signal and the memory unit in digital computers each consume somewhere between 15 and 45 percent of the total power. This suggests good opportunities for savings in power consumption due to these sources. Exploiting sleep mode operation is an attempt to do so. In general, the term sleep mode refers to the mode in which there is no activity in parts of the system during certain periods of time. The sleep mode issue can be studied at different levels, e.g., behavioral level, register-transfer level (RTL), logic level and transistor level.
In this paper we study the partitioning problem to exploit sleep mode for power minimization in digital circuits. The general problem can be viewed as partitioning a set of circuit elements such that the savings in power consumption achieved by switching each partition as a whole into sleep mode is maximized. A partition can be switched into sleep mode during time interval $I = [l, r]$ if all the elements in that partition are idle during $I$. The set of intervals during which an element $m$ is idle is referred to as the idle set of $m$. We present a general formulation for the problem and study its complexity. The problem finds many applications in low power design, e.g., the following (see Figure 1):
Memory segmentation.
Partitioning to power-down portions of the design.
Clock tree construction.
We assume we have synthesis/simulation-based or statistical data on the idle times of the data items (in the case of memory segmentation) or the idle times of the modules (for the two other cases).
We present a general formulation for this problem, propose polynomial time algorithms to solve special classes of the problem optimally, and show that the general problem is NP-complete. The rest of this paper is organized as follows: Section 2 presents the necessary background. Section 3 briefly describes how to obtain the idle times for a set of memory or clocked elements in a design. Section 4 presents the problem formulation. The complexity of the problem is discussed in Section 5. Exact algorithms to solve the general problem are presented in Section 6. Some special classes of the problem are discussed in Section 7 and polynomial time algorithms are presented to solve these classes. Section 8 focuses on some generalization of the problem. Experimental results for 
Background
There are three sources of power consumption in CMOS circuits: the charging and discharging of capacitive loads during transitions at gate outputs, the short-circuit current which flows during output transitions, and the leakage current. The last two sources should be dealt with and optimized using proper device and circuit design techniques [24]; hence the design automation community has focused on the minimization of the first source, which is frequently referred to as the switching power or dynamic power. The average dynamic power consumption for a CMOS gate $g$ with load capacitance $C_g$ is given by:

$$P_{avg} = 0.5 \, C_g V_{dd}^2 D(g) \qquad (1)$$

where $D(g)$ and $V_{dd}$ represent the transition density (see footnote 3) of the signal at the output of $g$, and the voltage of the power supply, respectively.
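As a small worked illustration of Equation (1), the sketch below evaluates the average dynamic power of a single gate. The function name and the numeric values (a 50 fF load, a 3.3 V supply, and $10^7$ transitions per second) are assumptions chosen only for illustration; they do not come from the experiments in this paper.

```python
def average_dynamic_power(c_load, v_dd, transition_density):
    """Equation (1): P_avg = 0.5 * C_g * V_dd^2 * D(g).

    c_load             -- load capacitance C_g in farads
    v_dd               -- supply voltage V_dd in volts
    transition_density -- D(g), average transitions per second
    Returns the average dynamic power in watts.
    """
    return 0.5 * c_load * v_dd ** 2 * transition_density


# Hypothetical gate: 50 fF load, 3.3 V supply, 1e7 transitions per second.
p = average_dynamic_power(50e-15, 3.3, 1e7)
print(f"{p * 1e6:.4f} uW")
```

Because the power is linear in both $C_g$ and $D(g)$, a net that is heavily loaded and toggles every cycle, such as the clock, dominates this sum.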
This suggests that a signal has a high contribution to the dynamic power consumption if it has either a relatively large load capacitance or a relatively high transition density. Both are true of the clock signal in moderately sized synchronous digital systems. Recent studies [1] indicate that the clock signal and the memory each consume somewhere between 15 and 45 percent of the total power in digital computers. Hence, it would be worthwhile to study the mechanisms and approaches through which the power consumption due to these sources can be optimized.
Exploiting sleep mode is an attempt to do so. Consider a scenario in which the access times to a set of dynamic memory elements are known. If we can partition these memory elements such that for long periods of time either of the partitions contains no data, then we can turn off the memory refresh circuitry for that partition during these periods and thus reduce the power consumption. A similar partitioning approach can be applied for clock-tree construction when the activity patterns of the clocked elements are known. The clock signal destinations with close activity patterns should be partitioned into the same subtree to allow maximum savings in power consumption via clock gating (see Figure 1). Clearly, there is some overhead involved, caused by the extra control logic needed to switch the partitions in and out of sleep mode and the amount of power that switching in and out of sleep mode will consume. This overhead is mainly dependent on the switching pattern and switching frequency of the partitions in and out of sleep mode.
To have a general formulation, we talk about elements. Depending on the application, an element may refer to a memory element, a clocked element, or a module in the circuit. Given the activity patterns of a set of elements, the question is how to partition this set to maximize the savings in power consumption achievable through sleep mode, and how much power this technique would save. We believe that there is a high potential for savings in power consumption using this technique, and our paper is an attempt to study this problem.

Footnote 3: Average number of transitions per unit time.
Obtaining Idle Sets
In this section, we briefly describe methodologies to obtain the activity patterns and the idle sets of the memory and clocked elements in our design. The availability of these activity patterns is vital for the partitioning algorithm to be applicable.
Idle Sets for Memory Elements
Let $M = \{m_1, m_2, \ldots, m_r\}$ represent the set of dynamic memory elements (MEs) in an application. Assume that the access sequence for each ME $m_i \in M$ during a whole run cycle is given as a sequence of ordered pairs, each of the form $(t_i, A_i)$, where $t_i$ corresponds to the access time and $A_i \in \{R, W\}$ represents the type of access, read (R) or write (W) (see Figure 2). Given the access sequence for all the MEs, we can use the following rules to generate the set of intervals for each ME $m_i$ during which $m_i$ need not be refreshed (see Figure 2), and thus obtain the idle set for each ME. We say ME $m_i$ is idle during interval $I$ if it need not be refreshed during $I$.
Therefore ME $m_i$ is idle:
After its final access time.
Before each write access, back to the closest preceding read access or the start of computation.
To obtain the access sequence for the MEs, we can use simulation-based tools that take as input an application program and produce statistics on the resource utilization over time and space. 
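The two idleness rules for memory elements can be sketched as follows. This is a minimal illustration, not the tool used in our experiments: the function name, the `(time, kind)` tuple encoding of an access, and the explicit `start` and `end` bounds of the run are all assumptions made for the example.

```python
def memory_idle_set(accesses, start, end):
    """Idle intervals of one memory element (ME).

    accesses -- list of (time, kind) pairs sorted by time, kind in {"R", "W"}
    start    -- start of computation
    end      -- end of the run cycle
    Returns a sorted list of disjoint (l, r) intervals.
    """
    raw = []
    last_read = start  # closest preceding read, or the start of computation
    for t, kind in accesses:
        if kind == "W":
            # Rule 2: the stored value is dead from the closest preceding
            # read (or the start of computation) until this write.
            if t > last_read:
                raw.append((last_read, t))
        else:
            last_read = t
    # Rule 1: the ME is idle after its final access.
    final = accesses[-1][0] if accesses else start
    if end > final:
        raw.append((final, end))

    # Merge overlapping or touching intervals into maximal ones.
    merged = []
    for l, r in sorted(raw):
        if merged and l <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], r))
        else:
            merged.append((l, r))
    return merged


print(memory_idle_set([(2, "W"), (4, "R"), (6, "R"), (9, "W"), (11, "R")], 0, 15))
```

Two consecutive writes with no read in between yield overlapping raw intervals, which is why the merge step is needed.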
Idle Sets for Clocked Elements
Consider the description of a design after the scheduling and allocation steps have been performed.
We assume that the functional units (FUs) have registers at their inputs. This means that if an FU $M$ is not used for a consecutive set of cycles, then we can gate the clock signal to the registers feeding this FU during this idle time, which will reduce the power consumption due to the clock tree. Furthermore, it guarantees that there would be no dynamic power consuming activity during this time in $M$. From the scheduled and allocated design we can say that if FU $M$ is assigned to a control step $c$, then it is active during $c$. Otherwise, it is idle during this time. This allows us to generate the idle sets for each of the FUs in our design. In other contexts, the multipliers or other multi-step FUs may need to be clocked during their whole execution cycle. The idle times should be computed according to these requirements.
Furthermore, the switching of a partition is equal to the number of such intervals.
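The control-step rule for functional units described in this section can be sketched as below. The encoding is an assumption made for illustration: control steps are numbered from 0, step $c$ occupies the time interval $[c, c+1)$, and the function name is hypothetical.

```python
def fu_idle_set(active_steps, num_steps):
    """Idle intervals of one functional unit (FU).

    active_steps -- set of control steps to which the FU is assigned
    num_steps    -- total number of control steps in the schedule
    Returns maximal (l, r) intervals in which the FU is not used,
    so the clock feeding its input registers could be gated.
    """
    idle = []
    run_start = None  # first step of the current idle run, if any
    for c in range(num_steps):
        if c in active_steps:
            if run_start is not None:
                idle.append((run_start, c))
                run_start = None
        elif run_start is None:
            run_start = c
    if run_start is not None:
        idle.append((run_start, num_steps))
    return idle


# Hypothetical adder used in control steps 2, 3 and 6 of an 8-step schedule.
print(fu_idle_set({2, 3, 6}, 8))
```

For a multi-step FU that must stay clocked throughout its execution, every step of the operation would simply be added to `active_steps`.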
In (4), the term $t_1 + t_2$ accounts for the savings in power consumption due to sleep mode operation of partitions $S_1, S_2$, and the term $a(sw_1 + sw_2)$ accounts for the overhead resulting from the extra control circuitry needed to supervise sleep mode operation. Parameter $a$ is introduced to control the relative significance of the savings vs. overhead terms. Figure 4 shows an example of memory partitioning to exploit sleep mode.
Note that many problems can be formulated as a decision or an optimization problem. If the decision version of a problem P is NP-complete, then its optimization version is at least as hard (NP-hard), and if its optimization version is polynomially solvable then its decision version can also be solved in polynomial time. We now formulate our problem as a decision problem:
P1:
Instance: Ordered quadruple $(a, b, c, S)$, where $a$ is a non-negative number, $b$, $c$ are positive integers, and $S = \{N_1, N_2, \ldots, N_r\}$ is a set of NISs.
Objective: Determine whether there exists a $b$-balanced bi-partitioning $(S_1, S_2)$ of $S$ such that:

$$G_a(S_1, S_2) \geq c \qquad (5)$$

Footnote 4: An interval is maximal with respect to a property P if it has the property P, but no interval containing it as a proper sub-interval has property P. Here, the property P is that the density of the partition during this interval is equal to the size of the partition.
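To make the decision problem concrete, the sketch below checks whether a given bi-partitioning is a witness for a P1 instance. It rests on assumptions that should be read as such: the objective is taken to have the form $G_a(S_1, S_2) = t_1 + t_2 - a(sw_1 + sw_2)$, consistent with the description of the savings and overhead terms; $b$-balanced is taken to mean that each partition holds at least $b$ elements; and each NIS is encoded as a sorted list of disjoint `(l, r)` intervals. All function names are hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of disjoint (l, r) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def common_idle(partition):
    """Intervals during which every element of a non-empty partition is idle."""
    result = partition[0]
    for idle_set in partition[1:]:
        result = intersect(result, idle_set)
    return result


def gain(s1, s2, a):
    """Assumed objective: t1 + t2 - a * (sw1 + sw2)."""
    total = 0.0
    for part in (s1, s2):
        sleep = common_idle(part)
        t = sum(r - l for l, r in sleep)  # exploitable sleep time
        sw = len(sleep)                   # number of sleep intervals
        total += t - a * sw
    return total


def is_witness(s1, s2, a, b, c):
    """True if (s1, s2) is b-balanced and reaches the target gain c."""
    return len(s1) >= b and len(s2) >= b and gain(s1, s2, a) >= c


s1 = [[(0, 4), (6, 10)], [(1, 5), (7, 9)]]
s2 = [[(0, 10)], [(2, 8)]]
print(gain(s1, s2, 0), gain(s1, s2, 0.5), is_witness(s1, s2, 0.5, 2, 9))
```

Verifying a witness takes time polynomial in the number of intervals, which is the easy half of an NP-completeness argument; the hard half is the reduction.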
NP-Completeness
In this section we discuss the complexity of P1 and show that it is NP-complete. We present a transformation from the MIN-CUT INTO BOUNDED SETS problem [11], which we will denote as MCP (Min-Cut Problem). This problem can be stated as follows (see footnote 6). That is, the size of each partition is lower bounded by $B$, and the number of edges in $E$ with one endpoint in $V_1$ and the other endpoint in $V_2$ is no more than $K$. We will refer to the number of such edges as the cost of the bi-partitioning and denote it as $C(V_1, V_2)$.
Given a partitioning $(V_1, V_2)$ of $V$, we define an attribute $c_i$ for each edge $e_i = (v_{i1}, v_{i2})$ in $E$. If $c_i = 1$ we say that edge $e_i$ is cut by the partitioning; otherwise $e_i$ is not cut (uncut). It is straightforward to show that:

Footnote 6: Note that we are using a special formulation of the MIN-CUT INTO BOUNDED SETS problem which is still NP-complete. This special formulation is used to simplify our NP-completeness proofs for problem P1.

General Properties: As shown in Figure 5, the P1 instance is constructed such that corresponding to each edge $e_i = (v_j, v_k)$ in the MCP instance there are $|V|$ intervals, one in each of the NISs, and they are all overlapping. Let $I_i = \{I_{i1}, I_{i2}, \ldots, I_{i|V|}\}$ represent the set of these intervals, where $I_{ip}$ is the interval corresponding to $e_i = (v_j, v_k)$ in $N_p$. Among these intervals, $I_{ij}$ and $I_{ik}$, the two in the NISs $N_j$, $N_k$ corresponding to vertices $v_j$, $v_k$ (the two ends of edge $e_i$), extend from $3i$ to $3i + 1$, and the rest extend from $3i$ to $3i + 2$. Consider a bi-partitioning $(S_1, S_2)$ of $S$. An example is shown in Figure 6. The partitioning $(V_1, V_2)$ of the MCP instance and the corresponding partitioning $(S_1, S_2)$ of the constructed P1 instance are shown and the values $C(V_1, V_2)$, $G_a(S_1, S_2)$ are computed.
As shown in Figure 7, we can categorize the edges $e_i$ in $E$ as cut and uncut edges, and hence categorize the sets $I_i$ corresponding to them and compute the contributions $c_i$, $t_{1i}$ and $t_{2i}$ of each of the edges to $C(V_1, V_2)$ and $G_a(S_1, S_2)$ as follows:

| Category | Contribution $c_i$ to $C(V_1, V_2)$ | Contributions $t_{1i}, t_{2i}$ to $G_a(S_1, S_2)$ |
| Cut edges $e_i$ | $c_i = 1$ | $t_{1i} = 1$, $t_{2i} = 1$ |
| Uncut edges $e_i$ | $c_i = 0$ | $t_{1i} = 1$, $t_{2i} = 2$ OR $t_{1i} = 2$, $t_{2i} = 1$ |
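The construction and the contribution table can be checked numerically with the sketch below. It builds, for each edge $e_i$, one interval per NIS (from $3i$ to $3i+1$ for the two endpoint NISs and from $3i$ to $3i+2$ for the rest), and then confirms that the total sleep time equals $3|E| - C(V_1, V_2)$, which is what the table implies since a cut edge contributes 2 and an uncut edge contributes 3. The function names, the 0-based edge indexing and the small example graph are assumptions made for illustration, and both partitions are assumed non-empty.

```python
def build_p1_instance(num_vertices, edges):
    """One NIS per vertex; edge i adds one interval starting at 3*i to each."""
    nis = [[] for _ in range(num_vertices)]
    for i, (j, k) in enumerate(edges):
        for p in range(num_vertices):
            right = 3 * i + 1 if p in (j, k) else 3 * i + 2
            nis[p].append((3 * i, right))
    return nis


def intersect(a, b):
    """Intersect two sorted lists of disjoint (l, r) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def sleep_time(nis, members):
    """Total time during which all NISs indexed by `members` are idle."""
    members = sorted(members)
    common = nis[members[0]]
    for p in members[1:]:
        common = intersect(common, nis[p])
    return sum(r - l for l, r in common)


def cut_cost(edges, v1):
    """C(V1, V2): number of edges with exactly one endpoint in V1."""
    return sum((j in v1) != (k in v1) for j, k in edges)


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]  # hypothetical 4-cycle
v1, v2 = {0, 1}, {2, 3}
nis = build_p1_instance(4, edges)
total = sleep_time(nis, v1) + sleep_time(nis, v2)
print(total, 3 * len(edges) - cut_cost(edges, v1))
```

Under this identity, maximizing $t_1 + t_2$ in the constructed instance is the same as minimizing the cut in the original graph, which is the intuition behind the transformation.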
Exact Algorithms
The fact that P1 is NP-complete rules out the existence of a polynomial time algorithm for P1 unless P = NP [11]. The general strategy in such circumstances is to work on two fronts. On the theoretical front, special sub-classes of the general problem that are potentially solvable in polynomial time are studied. Pinning down such sub-classes, of course, is not always an easy task. On the practical front, heuristic approaches are developed to solve the problem sub-optimally but in polynomial time. Occasionally, it has been observed that formulating an exact solution to a general NP-complete problem, despite its exponential running time, provides valuable insights on how to design practical heuristic algorithms for the problem. Such exact solutions may also help in understanding special sub-classes of the general problem that are optimally solvable in polynomial time.
In this section we address two algorithms, Partition Exact1 and Partition Exact2, to solve P1. The outline of these algorithms is shown in Figure 8.

    If $|P_2| < b$ {
        $P = \{N_i \in P_1 \mid M \wedge N_i = M\}$
        If $|P| + |P_2| \geq b$ {
            $P$ = a subset of $P$ with size $p$, where $b - |P_2| \leq p \leq |P_1| - b$;
            $P_1 = P_1 - P$; $P_2 = P_2 \cup P$;
        }
    }

Figure 9: Implementation of steps 6, 7 of Algorithm Exact2
We can use the basic algorithm Partition Exact2 to solve P2; however, the following observation allows us to achieve a much faster algorithm.
Observation 5. Let $P = \{N_1, N_2, \ldots, N_k\}$ be a set of NISs, each containing a single interval. Then the internal-intersection of $P$ is a NIS that consists of either a single interval or no interval.
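Observation 5 follows from the fact that the intersection of intervals is either empty or again an interval, bounded by the largest left endpoint and the smallest right endpoint. A minimal sketch, with a hypothetical function name and intervals encoded as `(l, r)` pairs:

```python
def internal_intersection_single(intervals):
    """Intersection of single-interval NISs.

    intervals -- non-empty list of (l, r) pairs, one per NIS
    Returns the common interval, or None if the intersection is empty.
    """
    left = max(l for l, _ in intervals)
    right = min(r for _, r in intervals)
    return (left, right) if left < right else None


print(internal_intersection_single([(0, 10), (3, 8), (5, 12)]))
print(internal_intersection_single([(0, 2), (3, 5)]))
```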
This observation tells us that no matter how we partition the set of NISs $S$ of a P2 instance into $S_1$ and $S_2$, the internal-intersection of either of the partitions $S_1$, $S_2$ consists of at most a single interval. That is, we do not need to spend time on multiple-interval NISs for $N$ and $M$, since such NISs cannot possibly be the internal-intersection of partitions for a bi-partitioning $(S_1, S_2)$ of $S$.
Therefore, to solve P2 we can use algorithm Partition Exact2 with the for loops modified such that only single-interval NISs are picked for $N$ and $M$. This leads to $f_2 = O(s + r) = O(s)$ and a time complexity of $O(sp^4)$, where $s$ is the number of intervals in the problem instance and $p$ is the cardinality of the endpoint set of $S$, and hence we have the following theorem:
Theorem 2. The problem P2 can be solved in polynomial time.
Observation 6. Let $I_{min}$ and $I_{med}$ represent the intervals in the P2 instance with the smallest and $b$-th largest lengths, respectively. Then it is easy to show that for one of the partitions we only need to enumerate intervals of lengths no more than $I_{med}$ and for the other partition we only need to enumerate intervals of lengths no more than $I_{min}$.
This observation allows limiting the solution space to be searched during the execution of the algorithm. However, it does not improve the asymptotic time complexity of the algorithm.
It should be noted that the solution to problem P2 suggests a heuristic algorithm for the general problem P1. The idea would be to devise a function $F$ mapping the set $S$ of multi-interval NISs in the given instance of P1 onto a set $S'$ of single-interval NISs and thus generate a P2 instance. 
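One simple candidate for the mapping function $F$ is to keep only the longest idle interval of each NIS. The sketch below illustrates that choice only; it is an assumption-laden example rather than a recommended mapping, the function names are hypothetical, and an element with an empty idle set is mapped to an empty NIS.

```python
def longest_interval(nis):
    """Longest (l, r) interval of a NIS; ties go to the earliest one."""
    return max(nis, key=lambda iv: iv[1] - iv[0])


def to_p2_instance(s):
    """Map multi-interval NISs to single-interval NISs (one choice of F)."""
    return [[longest_interval(nis)] if nis else [] for nis in s]


s = [[(0, 2), (5, 9)], [(1, 4)], []]
print(to_p2_instance(s))
```

Since the mapping discards idle time, the optimum of the resulting P2 instance is a lower bound on the sleep time the same partitioning achieves on the original P1 instance when $a = 0$.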
Bounded Number of Switchings
In practice, switching the partitions in and out of sleep mode is itself a power consuming activity which should be minimized. Moreover, as the number of such switchings is increased, the complexity of the extra control logic needed to supervise the sleep mode is also increased. As 
Generalization
In this section we briefly mention a couple of generalizations of P1 and its counterparts P2, P3. This is intended to suggest that the basic formulation is easily adaptable to cover a broader range of optimization problems. We discuss two generalizations: multi-way partitioning and weighted partitioning. Note that we could also have multi-way and weighted partitioning combined. 
Experimental Results
The algorithm Partition Exact2 and its modifications to optimally solve P2 are implemented in C and tested. Because test data are unavailable due to the novelty of the problem and its formulation, a set of randomly generated data with controlled parameters was used as test cases.
The results of experiments are shown in Table 1 . To simplify the comparison, the following settings are made for all the test cases:
$|S| = 100$ ($S$ is the set of elements).
Balance factor $b = 40$ (each partition should contain at least 40 elements).
A single interval per NIS (complying with the P2 instance).
Factor $a$ is set to 0 ($a$ is the penalty factor for the total number of switchings). This makes sense because the switching of either of the partitions is in the range $\{0, 1\}$; hence the sleep mode control circuitry will cause negligible overhead on the area or power consumption.

To apply this algorithm to the general case, one can use a pre-processing step which takes as input the idle sets of the clocked elements (CEs) and generates as output a single idle interval for each CE. The generated single idle interval for a CE can simply be the longest interval in the idle set of that CE, or it can be obtained using a more complicated strategy. The parameter min-len shows the length of the shortest interval in each problem instance. For each value of min-len, 10 random inputs are generated and tested with the algorithm. The minimum, maximum and average values for the ratio $(t_1 + t_2)/T$ resulting from our partitioning algorithm and from a random partitioning algorithm are shown, where $t_1$ and $t_2$ are the exploitable sleep times of the partitions in the resulting bi-partitioning. The higher this ratio is, the greater the savings in power consumption would be if we place the corresponding partitions in sleep mode. Note that if we do not consider the idle times in a partitioning scheme (as has been done so far), the result is essentially equivalent to a random partitioning. However, by partitioning the set of elements according to their idle times we can

Table 1: Comparison of our partitioning algorithm and random partitioning

observed that as the length of the minimum idle times (min-len) is increased to cover the whole time window, the results get closer. Note that since the computation window has width $T = 50$, the practical range for min-len is 5 to 25. These cases are shown in bold face in the first column in Table 1. In such cases, our algorithm produces superior results, with an average of 7% to 40% more sleep time, compared to random partitioning.
Discussion and Conclusion
In this paper we studied the circuit partitioning problem to exploit sleep mode operation for minimization of the average power consumption. The motivation is to de-activate the memory refresh circuitry, apply power down, or just disable the clock signals during the inactive periods of operation of the corresponding circuit elements. The idea is to partition the set of elements such that the elements with close activity patterns are grouped into the same partition, so that each partition can be switched into sleep mode during the time intervals in which all of its elements are idle.
We formulated the problem and showed that it is NP-complete. We also discussed some special of the memory unit, using an iterative improvement partitioning technique. To obtain the idle sets for memory elements in the work reported in [9], the applications were run on an emulator with a profiling tool that kept track of different resource utilizations over time and space. The idle sets were then calculated from the access sequence provided by the profiling tool, using an idea similar to the one mentioned in Section 3.1. Further work on gated clock tree design has been reported in [3, 19]. The following provides directions for further research in this area:
Improving the time complexity of the algorithms. Although the algorithms presented for special cases P2 and P3 are polynomial time algorithms, the growth rate of the running time with problem size limits the applicability of this approach.
Having shown that sleep mode and its exploitation could lower the power consumption gives rise to new problems in high-level synthesis, namely how to perform the scheduling and allocation tasks such that the potential savings in power consumption achievable by exploiting sleep mode operation are maximized. It is noteworthy that the register allocation step in high-level synthesis tends to minimize the sleep time of the registers in order to reduce the required number of registers in the design. This brings up the trade-off between area and power consumption in high-level synthesis, which calls for further investigation.
We mentioned earlier that using a mapping function $F$ to obtain single-interval NISs from multi-interval NISs in $S$, we can construct a P2 instance from a given P1 instance. The constructed P2 instance can then be solved optimally to lead to a heuristic partitioning solution for our original P1 instance. Further theoretical and experimental studies can be pursued to identify suitable choices for the mapping function $F$.
It would be worthwhile to devise heuristics that perform the partitioning sub-optimally, but fast. This could be of use as a design aid for low power design to provide quick feedback to the designer on how the design modifications or decisions made at higher levels would affect the sleep times of the partitions.
The geometric flavor of the problem calls for carefully designed algorithms that exploit the geometric features of the problem to achieve good solutions. Hence it is worthwhile to study this problem from a geometric viewpoint in search of fast approximation or heuristic algorithms for the general or special classes of the problem.
Another interesting problem is whether or not P1 can be formulated as a hypergraph partitioning problem or, in general, as any hypergraph problem at all. Our attempts indicate that such a formulation is unlikely to exist, although we have no formal proof to present for it. Further research is in order to show whether or not such a formulation is possible. In the case of a positive answer, the existing algorithms for the hypergraph formulation can be applied to solve P1.
As P1 is formulated as a set partitioning problem, it would be interesting to see how well the existing heuristics for MCP, e.g., Kernighan-Lin [13], Fiduccia-Mattheyses [10], Ratio-Cut [25], etc., can be modified to operate on P1 instances, how fast they can be implemented, and how well they perform.
A crucial assumption in this paper was the availability of the activity patterns (idle times) as input to our problem. It is of particular interest to categorize the designs for which such patterns can be generated efficiently. Furthermore, in cases where such patterns may not be generated as a set of exact idle sets, statistical approaches could be employed to generate some weighted version of the idle sets in which the weights could represent the probabilities of being idle during different periods. It is therefore worthwhile to formulate and study the weighted version of the problem.
A generalization of the problem would be to allow multi-way partitioning, and perhaps to compute the optimal number of partitions as well as the contents of each partition.
It is also useful to investigate other areas in which problem P1 could find applications.
