Power optimization has always been an important issue for modern IC design. In this paper, we present a power optimization technique for clock tree by applying multi-bit flip-flops and reducing total wire length. Through merging flip-flops into MBFFs, we effectively reduce power consumption caused by clock buffers. Moreover, by judiciously merging and placing the MBFFs, the total wire length is also significantly reduced. The combined effect of both techniques leads to a strong reduction in total power consumption of the clock network.
Introduction
In modern VLSI design, power has become a critical issue. With limited power/thermal budget, as well as the increasing demand of reducing power dissipation, minimizing power consumption has become one of the most important objectives.
Power consumption of an IC chip can be categorized into two types: dynamic and static power consumption. The clock tree is one of the major causes of both types of power dissipation: it may consume as much as 40% of the total power of the IC [1] due to its frequent switching activity, and it is also a major consumer of leakage power because of the large number of buffers it contains. Much work has been proposed on reducing the power consumption of the clock tree. Some approaches address the problem when constructing the clock network, reducing power consumption by planning a suitable topology and inserting buffers wisely [11]. Creating multiple supply voltages [3] is another approach. Donno et al. [9] and Mahmoodi et al. [8] use clock gating to resolve this problem. Energy recovery is also a feasible approach, adopted in [8]. Lu et al. [6] and Lou et al. [10] focus on minimizing clock network power by replacing non-timing-critical cells with their high-$V_t$ counterparts.
Hou [5], Kretchmer [7], and Chang [2] took another direction: applying multi-bit flip-flops (MBFFs), or register banks. MBFF is one of the most effective methodologies for saving both chip area and power consumption. Hou [5] proposed an incremental clock tree placement flow applying MBFFs. Kretchmer [7] introduced a design methodology to create models of multi-bit registers in the cell library to be inferred by logic synthesis tools. Chang [2] incrementally applied MBFFs at the post-placement stage. Figure 1 shows an example of merging two 1-bit flip-flops into one 2-bit flip-flop. Originally, each flip-flop requires its own two inverters to generate the clock signal. However, due to manufacturing ground rules, the inverters in flip-flops tend to be oversized; as process technology advances into smaller geometry nodes such as 65 nm and beyond, even a minimum-sized inverter chain can drive more than one flip-flop. We can therefore merge multiple flip-flops into a multi-bit flip-flop; by sharing the clock signal, the number of inverters required, as well as the power consumption and the area occupied, can be significantly reduced. Table 1 compares the power consumption and area of flip-flops with different bit numbers. In addition, by choosing proper clusters and placement locations, the wire length can also be reduced.
We address the problem of applying multi-bit flip-flops when performing synthesis of the clock network. We propose a novel power optimization method for the clock tree that applies MBFFs with a clique-based approach. The clustering problem of flip-flops is formulated as clique-finding and maximum independent-set problems. Through iterative window-based optimization, the flip-flops are gradually clustered into MBFFs, reducing the total power consumption and area of the clock tree; the total wire length is also taken into consideration to avoid additional driving load and to further reduce the power consumption caused by long metal wires.
The remaining chapters of this paper are organized as follows: Chapter 2 gives the motivation of our approach and analyzes the weaknesses of previous work. Chapter 3 describes the problem formulation. Chapter 4 details our proposed algorithm. Chapter 5 presents the experimental results. Chapter 6 concludes the paper.
Previous Work and Motivation
In Chang et al.'s work [2], an algorithm was proposed to reduce the power consumption of the clock tree by replacing flip-flops with multi-bit flip-flops. In this algorithm, a window is selected at each step. For the flip-flops within the window, a set of cliques consisting of these flip-flops is computed, where each clique denotes a cluster combination that satisfies the given constraints. To resolve conflicts between the cliques, a greedy selection method is adopted to generate an independent set of the cliques. The priority for selecting a clique is evaluated in terms of power consumption and wire length overhead. The algorithm repeats the above procedure until the whole chip is processed. This flow is capable of effectively merging flip-flops into multi-bit flip-flops. However, three key factors of this algorithm are left to be discussed.
First is the size of the selected window. In Chang's work, the window size is fixed to two values. We found it is better to adjust it dynamically. A simple example of our reasoning is shown in Fig. 2 . Initially the distribution of flip-flops is relatively dense; a smaller window size suits [ Fig. 2(a) ]. However, as the clustering process proceeds, the distribution of flip-flops becomes sparser. A larger window size is required in order to reach solutions involving clustering distant flip-flops [ Fig. 2(b) ]. We proposed a mechanism to dynamically adjust the size of the selected window to reflect the distribution of flip-flops.
The second observation is that in Chang's work, once a flip-flop is merged into a multi-bit flip-flop, other solutions in which this flip-flop is merged differently are no longer considered. Fig. 3(a) shows the original flip-flops. If the cluster state of Fig. 3(b) is reached first, the solution of clustering the flip-flops as in Fig. 3(c) will not be considered anymore. In order to rectify this deficiency, we propose disruptive collection, which decomposes the merged multi-bit flip-flops so that a broader solution space can be explored.
Last is the process of generating the clique set that corresponds to cluster combinations satisfying the given constraints. An accurate and fast wire length estimation for cliques is proposed to identify potentially long metal wires. A pruning mechanism that uses this estimation to eliminate inferior cluster combinations is adopted in our synthesis flow. Benchmarking results show that by applying this estimation mechanism, we significantly reduce the total wire length for merged flip-flops and the resulting power consumption.
Fig. 2. As the distribution of FFs becomes sparser, a larger window size is preferred.
CHANG and HWANG
The details of our algorithm will be explained in Chap. 4.
Problem Formulation
This chapter describes the problem formulation in detail. It is organized as follows: Section 3.1 gives the details and definitions of the input data; Section 3.2 formally defines the objective functions; Section 3.3 specifies the three major constraints of this problem.
Input
This problem has the following inputs:
• The width $W_c$ and height $H_c$ of the chip $C$.
• A set of pre-placed logic blocks $K$. Each $k_i \in K$ has its own coordinate $(x_{k_i}, y_{k_i})$ and area $A_{k_i}$.
• The maximum placement density $D_{max}$.
Objective function
The Synthesis of Multi-bit Flip-flops for Clock Power Reduction Problem is to minimize the total power consumption of all flip-flops $f_i \in F$ by replacing flip-flops with MBFFs specified in the given cell library $L$, as well as the total wire length of every net $n_{ij}(p_i, f_j) \in N$.
The total power consumption of all flip-flops $f_i \in F$ is calculated by aggregating the power consumption $PC_{f_i}$ of each flip-flop $f_i \in F$:
$$\sum_{f_i \in F} PC_{f_i}.$$
The total wire length of $N$ is the aggregation of the wire length of every net $n_{ij} \in N$:
$$\sum_{n_{ij} \in N} L_{n_{ij}}.$$
The wire length $L_{n_{ij}}$ of net $n_{ij}(p_i, f_j)$ is defined as the Manhattan distance between $p_i$ and flip-flop $f_j$. Let $(x_{p_i}, y_{p_i})$ be the coordinate of pin $p_i$, and $(x_{f_j}, y_{f_j})$ be the coordinate of flip-flop $f_j$; then
$$L_{n_{ij}} = |x_{p_i} - x_{f_j}| + |y_{p_i} - y_{f_j}|.$$
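The two objective terms can be illustrated with a minimal sketch. It assumes integer grid coordinates and represents each net as a (pin, flip-flop) coordinate pair; the function names are illustrative and do not come from the original implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    """Manhattan distance between two grid coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_wire_length(nets: List[Tuple[Point, Point]]) -> int:
    """Sum of the pin-to-flip-flop Manhattan lengths over all nets."""
    return sum(manhattan(pin, ff) for pin, ff in nets)


def total_power(ff_powers: List[float]) -> float:
    """Sum of the power consumption of every flip-flop."""
    return sum(ff_powers)


# Two pins connected to one flip-flop located at (3, 4).
nets = [((0, 0), (3, 4)), ((5, 1), (3, 4))]
print(total_wire_length(nets))  # 7 + 5 = 12
```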
Constraint
In this optimization problem, there are three constraints: the non-overlap constraint, the placement density constraint, and the timing slack constraint. We discuss them in detail in the following subsections. Note that the coordinates of pins cannot be changed. However, the flip-flops in the original design can be re-placed in order to optimize the total wire length.
Placement density constraint
The total area consumption $D_i$ of a bin $b_i$ is the aggregation of the area of all flip-flops and pre-placed logic blocks within $b_i$. In order to avoid routing congestion, a placement density constraint $D_{max}$ applies to all bins. When a new flip-flop is generated, it can only be placed into a bin whose area consumption, after adding the area of the new flip-flop, does not exceed $D_{max}$.
An example is shown in Fig. 4. Each square denotes a bin. The number on each bin is the total area of pre-placed logic blocks and flip-flops within the bin. Let $D_{max}$ be equal to 10. In this example, the grey bins in Fig. 4 violate the placement density constraint since their total area exceeds $D_{max}$, while the white bins satisfy it.
To further define this constraint, let $A^K_{b_i}$ be the total area of pre-placed logic blocks within bin $b_i$, and $A^F_{b_i}$ be the total area of flip-flops in $b_i$. Every bin must then satisfy $D_i = A^K_{b_i} + A^F_{b_i} \le D_{max}$.
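The placement density check can be sketched as follows, assuming each bin is summarized by a single number, its current total area; the helper names are hypothetical.

```python
from typing import List


def can_place(bin_area: float, new_ff_area: float, d_max: float) -> bool:
    """True if adding the new flip-flop keeps the bin within D_max."""
    return bin_area + new_ff_area <= d_max


def violating_bins(bin_areas: List[float], d_max: float) -> List[int]:
    """Indices of the bins whose total area already exceeds D_max."""
    return [i for i, area in enumerate(bin_areas) if area > d_max]


# With D_max = 10, a bin holding area 8 accepts a flip-flop of area 2 but not 3.
print(can_place(8, 2, 10), can_place(8, 3, 10))
```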
Timing slack constraint
For each $n_{ij}(p_i, f_j) \in N$, there is a slack $S_{ij}$ given in the input file for the net. $S_{ij}$ is the additional driving load that pin $p_i$ can afford. The timing slack constraint demands that the slack of every net remain larger than or equal to zero. This constraint may be violated when flip-flops are repositioned. If a flip-flop $f_j$ of a net $n_{ij}(p_i, f_j)$ is relocated, the slack of the net increases or decreases according to the change in distance between the pin and the flip-flop. When merging several flip-flops into a multi-bit flip-flop, the location of the newly generated flip-flop must be chosen wisely, so that the additional distance does not lead to negative slack on any connected net. Figure 5 is an example of invalid slack. There are two 1-bit flip-flops, $f_1$ and $f_2$; $f_1$ is connected to $p_1$ and $p_2$ by nets $n_{11}$ and $n_{21}$ respectively; $f_2$ is connected to $p_3$ and $p_4$ by nets $n_{32}$ and $n_{42}$. $f_1$ and $f_2$ are to be merged into a 2-bit flip-flop $f_3$, which connects to all four pins, $p_1$, $p_2$, $p_3$ and $p_4$. As shown in Fig. 5(b), due to relocating and merging $f_1$ and $f_2$, the additional wire length and the resulting driving load lead to a negative slack on $n_{13}(p_1, f_3)$ and $n_{33}(p_3, f_3)$, violating the timing slack constraint.
The timing slack constraint can be formulated as follows. Let $T_{ijMAX}$ be the maximum timing tolerance of net $n_{ij}(p_i, f_j)$, where $f_j$ is at its original location as specified in the golden input.
After repositioning flip-flops or merging them into MBFFs, for every net $n_{ij}(p_i, f_j) \in N$ on the chip, the corresponding $T_{ij}$ must remain less than or equal to its $T_{ijMAX}$.
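A minimal sketch of the timing slack check follows. It rests on an assumption that is not confirmed by the text above: the maximum tolerance of a net is taken to be its original Manhattan wire length plus its slack, with both measured in grid units. All function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def max_tolerance(pin: Point, ff_original: Point, slack: int) -> int:
    # ASSUMPTION: tolerance = original wire length + slack (grid units).
    return manhattan(pin, ff_original) + slack


def location_is_valid(nets: List[Tuple[Point, int]], new_location: Point) -> bool:
    """nets holds (pin coordinate, T_max) for every net of the moved flip-flop."""
    return all(manhattan(pin, new_location) <= t_max for pin, t_max in nets)


# A pin at (0, 0) drives a flip-flop originally at (2, 0) with slack 1.
t_max = max_tolerance((0, 0), (2, 0), 1)
print(t_max, location_is_valid([((0, 0), t_max)], (4, 0)))
```

Under this assumption, moving the flip-flop one grid further from the pin uses up the slack, and moving it two grids further makes the slack negative.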
The golden input design should also meet all aforementioned constraints before performing power/wire length optimization.
Synthesis of MBFF
Based on the aforementioned problem descriptions, we proposed a window-based cluster algorithm, which iteratively selects a window from the chip and merges the flip-flops within the window into multi-bit flip-flops.
Our optimization process is divided into four phases, as shown in Fig. 6 . Each phase performs a variation of our window-based cluster algorithm. The first phase, Regular Non-Disruptive Clustering (RNDC) generates an initial solution; the second phase, Dynamic Non-Disruptive Clustering (DNDC) and the third phase, Dynamic Disruptive Clustering (DDC) further refine the result; the last phase, Corner Case Refinement (CCR) focuses on enhancing corner cases. The detailed differences and transitions of these four phases will be explained in the later sections of this chapter.
Detailed pseudocode of our window-based optimization is shown in Fig. 7. In the first stage, a window $W$ is selected. In the second stage, we collect $F_{local}$, the set of flip-flops within $W$, and based on $F_{local}$ we compute $F_{target}$, which is the target set from which our algorithm generates MBFFs. In the third stage, a clique set $C$ for $F_{target}$ is computed, where each $c^m_i \in C$ is a legal cluster corresponding to a cell type $f^m_j \in L$. The algorithm greedily selects the $c^m_i \in C$ with the lowest cost and generates its corresponding multi-bit flip-flop. All cliques conflicting with the selected clique are then removed from $C$. The process is repeated until $C$ becomes an empty set.
There are two key factors of our window-based clustering algorithm. First is the size of the selected window W. Second is the mechanism to compute F target . Instead of adopting all flip-flops in F local directly as in [2] , we introduce disruptive collection, optionally decomposing flip-flops in F local before merging them. The details will be discussed in the later sections.
The remaining sections of this chapter are organized as follows: Section 4.1 describes the transitions between the four phases of our optimization process. Section 4.2 focuses on deciding the size of the selected window. Section 4.3 gives the details of disruptive collection, our proposed mechanism for computing the target set of flip-flops to merge. Section 4.4 explains the clique-based merging process, including the generation of cliques and the pruning of the solution space.
Phase transition

RNDC
In the first phase of our optimization process, Regular Non-Disruptive Clustering (RNDC), the chip is divided equally into windows of the same size. This phase ends after the clustering of all windows is completed.
DNDC and DDC
The second and third phases (DNDC and DDC) end when the solution converges. Convergence is defined as follows. Each time a window is selected and the flip-flops within it are clustered is called a round. After completing a round, the reduction in total power consumption is computed. Due to the nature of our algorithm, this gain is guaranteed to be greater than or equal to zero. If the number of consecutive rounds with zero gain exceeds a threshold $T$, our optimization process ends the current phase and continues to the next phase.
Threshold $T$ is defined in terms of $L$, the library of MBFFs given in the input. Ideally, during each round the process should be able to merge several flip-flops into one MBFF with the minimum bit number. We therefore use the ratio of the original number of flip-flops in the input to the number of bits of the smallest MBFF as the threshold that indicates when to end the current phase. This ratio is scaled by a coefficient: for DNDC, the coefficient is set to 1; for DDC, in order to pursue solution quality more persistently, it is set to 5 to allow more rounds of attempts to search for better clusters.
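The convergence rule can be sketched as below. Two points are assumptions rather than statements from the text: the "smallest MBFF" is read as the smallest library cell with more than one bit, and the values 1 and 5 are treated as a multiplier `coeff` on the ratio. The callback `one_round` is a hypothetical stand-in for one window selection plus clustering that returns the power gain of that round.

```python
from typing import Callable, Iterable


def zero_gain_threshold(num_input_ffs: int, library_bits: Iterable[int],
                        coeff: float) -> float:
    """Threshold T = coeff * (#input FFs / bit number of the smallest MBFF)."""
    # ASSUMPTION: 1-bit cells are not counted as MBFFs.
    smallest_mbff = min(b for b in library_bits if b > 1)
    return coeff * num_input_ffs / smallest_mbff


def run_phase(one_round: Callable[[], float], threshold: float) -> int:
    """Run rounds until more than `threshold` consecutive rounds give zero gain."""
    zero_streak = 0
    rounds = 0
    while zero_streak <= threshold:
        gain = one_round()
        rounds += 1
        zero_streak = zero_streak + 1 if gain == 0 else 0
    return rounds


# Toy run: gains 1.0, 0.0, 0.5, then zero forever; threshold of 2 zero rounds.
gains = iter([1.0, 0.0, 0.5])
print(run_phase(lambda: next(gains, 0.0), threshold=2))  # 6
```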
CCR
The last phase of our optimization process, Corner Case Refinement (CCR), focuses on improving the flip-flops with the highest 20% power consumption. In each round, one of those flip-flops is used as the center of the selected window to perform our window-based clustering. This phase ends after all of these flip-flops are processed.
Select window
We apply a window-based approach to reduce the problem size. By processing only a window instead of the whole chip at a time, the original problem is divided into smaller ones to be conquered.
In Chang et al.'s work [2], the size of the selected window is fixed as either 2 × 2 or 4 × 4 bins. We observe that the window size should relate to the specific volume of the chip instead of bins. Moreover, the window size should be able to adjust more freely. Our algorithm adopts fixed and dynamic window sizes according to the different stages of clustering. Here we introduce the specific value that controls the window size:
$$\text{specific value} = \frac{\text{Width} \times \text{Height}}{\text{number of FFs}}.$$
The details of our mechanism will be discussed in the following subsections. 
Fixed window size
When computing an initial solution of our algorithm in the RNDC phase, a fixed window size is preferable. In consideration of run time and the relatively dense nature of the initial input, we use a fixed specific value in this stage.
The fixed specific value is computed from $W_c$ and $H_c$, the width and height of the chip, and the number of flip-flops in the input. Since these are all constants, the fixed specific value is also constant. The chip is divided into identical windows with the same width and height, determined by the fixed specific value, to be processed.
Dynamic window size
The dynamic specific value is computed from $W_c$ and $H_c$, the width and height of the chip, which are both constant, and the current number of FFs on the chip. As the clustering process proceeds, the number of FFs on the chip decreases, leading to a larger dynamic specific value. This trend fits our requirement of a growing window size.
In the DNDC and DDC phases, the exact coordinates of the selected window are randomly decided, with the size computed from the dynamic specific value.
In the CCR phase, we focus on clustering poorly merged flip-flops. The flip-flops with the highest 20% power consumption are re-processed. The window size in this stage is computed in the same fashion, from the dynamic specific value. However, the flip-flops to be processed are set as the centers of the selected windows.
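The growth of the specific value during clustering can be illustrated with a short sketch. It only evaluates the ratio of chip area to flip-flop count; how the window width and height are derived from that value is not reproduced here, and the function name is illustrative.

```python
def specific_value(chip_width: float, chip_height: float, num_ffs: int) -> float:
    """Chip area per flip-flop: (Width x Height) / number of FFs."""
    return chip_width * chip_height / num_ffs


# As flip-flops are merged, fewer cells remain and the value grows.
for remaining_ffs in (1000, 600, 400):
    print(remaining_ffs, specific_value(100, 100, remaining_ffs))
```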
Compute target FF set
In Chang et al.'s work [2], once a flip-flop is merged into a multi-bit flip-flop, other solutions in which this flip-flop is merged differently are no longer considered. In order to overcome this weakness, when computing $F_{target}$, the set of flip-flops to merge into MBFFs, we propose the option of disruptive collection, which decomposes the merged flip-flops into their component flip-flops and re-clusters these components, instead of directly adopting the existing $F_{local}$ (non-disruptive collection). The details of both mechanisms are given in the following subsections.
Non-disruptive collection
In our algorithm, each time a window $W$ is selected to be processed, $F_{local}$ is the set of flip-flops whose coordinates are within $W$, and $F_{target}$ is the set of flip-flops to be merged. If $F_{target}$ is computed by directly adopting $F_{local}$, it is called non-disruptive collection. The advantage of this method is that it is easy to compute and has a fast run time. However, as mentioned before, this mechanism lacks the ability to reach a broader range of the solution space. We utilize this approach in the RNDC phase to compute the initial solution, generating MBFFs with fast run time and fair quality.
Disruptive collection
After the initial solution is computed, in order to search the solution space more thoroughly, we propose disruptive collection, which decomposes the merged flip-flops and re-clusters them, as shown in Fig. 9(c).
The procedure to decompose a given flip-flop is shown in Algorithm 1. $F$ is the global set of flip-flops of the clustering result, $F_{target}$ is the set of target flip-flops to be merged, and $f_i$ is the flip-flop to be decomposed; $f_i$ is recursively broken down into its component flip-flops.
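The recursive decomposition can be sketched as follows. This is not a transcription of Algorithm 1: the `FlipFlop` class, its `components` field, and the exact bookkeeping on the two sets are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(eq=False)  # identity-based hashing, so instances can live in sets
class FlipFlop:
    name: str
    bits: int
    components: List["FlipFlop"] = field(default_factory=list)


def decompose(ff: FlipFlop, global_set: Set[FlipFlop],
              target_set: Set[FlipFlop]) -> None:
    """Recursively replace a merged flip-flop by its component flip-flops."""
    if not ff.components:      # an original flip-flop: nothing left to split
        target_set.add(ff)
        return
    global_set.discard(ff)     # the merged cell no longer exists
    target_set.discard(ff)
    for part in ff.components:
        global_set.add(part)
        decompose(part, global_set, target_set)


a, b, c, d = (FlipFlop(n, 1) for n in "abcd")
ab = FlipFlop("ab", 2, [a, b])
abcd = FlipFlop("abcd", 4, [ab, c, d])
F, F_target = {abcd}, set()
decompose(abcd, F, F_target)
print(sorted(f.name for f in F_target))  # ['a', 'b', 'c', 'd']
```

Keeping the component list on each merged cell is also what would make it cheap to restore the previous cluster state when re-clustering gives a worse result.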
In our algorithm, disruptive collection is adopted in DNDC, DDC, and CCR phases to compute F target . If the quality of the solution in measure of power consumption deteriorates after performing disruptive collection and re-clustering, our algorithm would restore the cluster status back to before decomposing and re-clustering.
Generating MBFF
Once a window is selected and the set of flip-flops $F_{target}$ within the window is obtained, the clustering of $F_{target}$ is performed. We propose a clique-based algorithm to decide how to cluster the flip-flops in $F_{target}$ into multi-bit flip-flops while satisfying the constraints in Chap. 3, §3.3.
Valid timing slack region (VTSR) computation
In order to meet the timing slack constraint, every flip-flop must be placed on a grid at which all nets connecting to the flip-flop have a slack larger than or equal to zero. For a flip-flop $f_i$, let $P_i$ be the set of pins connected to $f_i$ by a set of nets $N_i$. Let $T_{kiMAX}$ be the maximum timing tolerance for net $n_{ki}(p_k, f_i) \in N_i$, $p_k \in P_i$, as defined in eq. (5). Let the VTSR of pin $p_k$ be the set of grids whose Manhattan distance to $p_k$ is less than or equal to $T_{kiMAX}$. To satisfy the timing slack constraint for all nets in $N_i$, $f_i$ must be placed within the intersection of the VTSRs of every pin in $P_i$. This intersection is defined as the valid timing slack region of $f_i$, $\mathrm{VTSR}_{f_i}$. An example is shown in Fig. 10: $f_1$ is a 1-bit flip-flop connected to two pins, $p_1$ and $p_2$, and $\mathrm{VTSR}_{f_1}$ is the intersection of the VTSRs of $p_1$ and $p_2$.
$$\mathrm{VTSR}_{f_i} = \bigcap_{p_k \in P_i} \mathrm{VTSR}_{p_k}.$$
Given two flip-flops $f_i$ and $f_j$, let $P_i$ be the set of pins connected to $f_i$ by net set $N_i$, and $P_j$ be the set of pins connected to $f_j$ by net set $N_j$. If $f_i$ and $f_j$ are to be merged into a new flip-flop $f_k$, the timing slack on all nets in $N_i$ and $N_j$ must be satisfied after reconnecting to $f_k$. $\mathrm{VTSR}_{f_k}$ is the intersection of $\mathrm{VTSR}_{f_i}$ and $\mathrm{VTSR}_{f_j}$, which are the intersections of the VTSRs of all pins in $P_i$ and $P_j$ respectively, as in eq. (12):
$$\mathrm{VTSR}_{f_k} = \mathrm{VTSR}_{f_i} \cap \mathrm{VTSR}_{f_j}.$$
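A minimal sketch of the VTSR computation follows, representing each region as an explicit set of grid points on a small chip. The explicit-set representation and the function names are assumptions chosen for clarity; a production implementation would likely use a more compact geometric representation.

```python
from typing import Iterable, Set, Tuple

Grid = Tuple[int, int]


def pin_vtsr(pin: Grid, t_max: int, width: int, height: int) -> Set[Grid]:
    """Grids on the chip whose Manhattan distance to the pin is <= t_max."""
    px, py = pin
    return {(x, y)
            for x in range(max(0, px - t_max), min(width, px + t_max + 1))
            for y in range(max(0, py - t_max), min(height, py + t_max + 1))
            if abs(x - px) + abs(y - py) <= t_max}


def ff_vtsr(pins: Iterable[Tuple[Grid, int]], width: int, height: int) -> Set[Grid]:
    """Intersection of the VTSRs of all (pin, T_max) pairs of one flip-flop."""
    region = None
    for pin, t_max in pins:
        current = pin_vtsr(pin, t_max, width, height)
        region = current if region is None else region & current
    return region if region is not None else set()


def merged_vtsr(region_a: Set[Grid], region_b: Set[Grid]) -> Set[Grid]:
    """VTSR of the flip-flop obtained by merging two flip-flops."""
    return region_a & region_b


f1 = ff_vtsr([((0, 0), 2), ((3, 0), 2)], width=10, height=10)
f2 = ff_vtsr([((9, 9), 1)], width=10, height=10)
print(sorted(f1), merged_vtsr(f1, f2) == set())
```

An empty merged region means that no grid satisfies the timing slack constraint for every connected net, so the two flip-flops cannot be merged.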
$\mathrm{VTSR}_{f_1}$ and $\mathrm{VTSR}_{f_2}$ are the valid timing slack regions of $f_1$ and $f_2$ respectively. $\mathrm{VTSR}_{f_1}$ and $\mathrm{VTSR}_{f_2}$ do not intersect, so $\mathrm{VTSR}_{f_1 f_2}$ is a null set. $f_1$ and $f_2$ cannot be merged, since no grid satisfies the timing slack constraints for every net connecting to $f_1$ and $f_2$. Let $F$ be the set of flip-flops to cluster together; the VTSR of the merged multi-bit flip-flop is the intersection of the VTSRs of every $f_i \in F$. For example, in Fig. 12, $f_1$, $f_2$, $f_3$, and $f_4$ are the flip-flops to be merged, and $\mathrm{VTSR}_{f_1}$, $\mathrm{VTSR}_{f_2}$, $\mathrm{VTSR}_{f_3}$ and $\mathrm{VTSR}_{f_4}$ are their respective valid timing slack regions. The merged MBFF must be placed within the intersection of the VTSRs of all four flip-flops, $\mathrm{VTSR}_{f_1 f_2 f_3 f_4}$.
Valid timing slack clique (VTSC) generation
A VTSR intersection graph is an undirected graph $G(V, E)$, where each vertex $v_i \in V$ corresponds to a flip-flop $f_i$ in the design. An edge $e_{ij}$ between vertices $v_i$ and $v_j$ exists if $\mathrm{VTSR}_{f_i} \cap \mathrm{VTSR}_{f_j} \neq \emptyset$.
If a set of flip-flops is to be merged, as mentioned above, there must be a non-null intersection of the VTSRs of all flip-flops in the set, which means that there is an edge $e_{ij}$ between every pair of corresponding vertices $v_i, v_j$. In other words, the vertices corresponding to this set of flip-flops form a clique.
Fig. 12. The merged MBFF must be placed within $\mathrm{VTSR}_{f_1 f_2 f_3 f_4}$, which is the intersection of the VTSRs of $f_1$, $f_2$, $f_3$, and $f_4$.
Synthesis of Multi-Bit Flip-Flops for Clock Power Reduction 153
Let a window $W$ be selected, and let the target set of flip-flops to merge, $F_{target}$, be collected. To explore the cluster combinations of $F_{target}$, the VTSR intersection graph of $F_{target}$ is first computed. Then we enumerate cliques with degree less than $m$ in the intersection graph, where there is a corresponding cell type in the input library $L$ with bit number $m$.
To enumerate cliques, the flip-flops in the input form the initial cliques in the clique list. Each time, two cliques are tested to see whether their VTSRs have a non-null intersection; if so, these two cliques form a new clique. This new clique is in turn tested with other cliques to derive larger cliques. The process repeats until all combinations that form cliques corresponding to the multi-bit flip-flop with the maximum bit number in the library are explored. An example is shown in Fig. 13. Figure 13(a) is the VTSR intersection graph of six flip-flops. Figure 13(b) shows the corresponding enumerated cliques.
However, if the above clique generation method is applied directly, many redundant solutions with large wire length cost may be generated, consuming considerable computation time. A branch-and-bound method is applied to eliminate these undesired cliques and limit their number. If the estimated wire length of a clique and its corresponding flip-flop is too long, the clique is considered undesirable and is pruned. In Fig. 14(b), the grey cliques are those pruned by this mechanism. Pruned cliques are not used to derive further cliques. For instance, $c_{23}$ in Fig. 14(b) is pruned, along with its derivative $c_{234}$.
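The enumeration with pruning can be sketched as below. Several simplifications are assumptions of this sketch: every flip-flop is 1-bit, VTSRs are explicit sets of grid points, cliques are grown by one flip-flop at a time in name order, and the pruning test is passed in as a hypothetical predicate `is_too_long` because the wire length bound itself is not reproduced here.

```python
from typing import Callable, Dict, FrozenSet, Set, Tuple

Grid = Tuple[int, int]
Clique = FrozenSet[str]


def enumerate_cliques(vtsr: Dict[str, Set[Grid]], max_bits: int,
                      is_too_long: Callable[[Clique, Set[Grid]], bool]
                      ) -> Dict[Clique, Set[Grid]]:
    """Map every surviving clique to the intersection of its members' VTSRs."""
    names = sorted(vtsr)
    frontier = {frozenset([n]): set(vtsr[n]) for n in names}
    found = dict(frontier)
    for _ in range(max_bits - 1):
        next_frontier: Dict[Clique, Set[Grid]] = {}
        for clique, region in frontier.items():
            last = max(clique)
            for name in names:
                if name <= last:      # extend in name order to avoid duplicates
                    continue
                merged_region = region & vtsr[name]
                if not merged_region:  # no common legal location
                    continue
                bigger = clique | {name}
                if is_too_long(bigger, merged_region):
                    continue           # pruned: neither kept nor extended
                next_frontier[bigger] = merged_region
        found.update(next_frontier)
        frontier = next_frontier
    return found


regions = {"f1": {(0, 0), (1, 0)}, "f2": {(1, 0), (2, 0)},
           "f3": {(1, 0)}, "f4": {(9, 9)}}
all_cliques = enumerate_cliques(regions, 4, lambda c, r: False)
pruned = enumerate_cliques(regions, 4, lambda c, r: c == frozenset({"f1", "f2"}))
print(len(all_cliques), len(pruned))  # 8 6
```

The result still contains cliques whose size has no matching cell in the library (for example size 3 with a 1/2/4-bit library); a caller would filter those out before selection.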
The pruning procedure is shown as Algorithm 2. In this algorithm, AVWL is the average wire length of each pin in the golden input, computed as eq. (13). To compute the coordinate of the estimation point (EP), we first need to compute three types of estimation points: inner point (IP), meso-point (MP), and outer point (OP). The estimation point of $f^m_i$ is the weighted center of its inner points, meso-points and outer points.
The shared estimation point for all pins inside the VTSR, the inner point $IP(x_{inner}, y_{inner})$, is the center of those pins. The x-coordinate of the inner point, $x_{inner}$, is computed as eq. (14).
$W^{inner}_x$ is the weight of $x_{inner}$ and is computed as eq. (15).
$y_{inner}$ and its corresponding weight $W^{inner}_y$ are computed in similar fashion. For a pin $p_s$ that is outside $\mathrm{VTSR}_{f_j}$ but within the bounding box of $\mathrm{VTSR}_{f_j}$, the estimation point for $p_s$, $MP_{p_s}$, is the point on the edge of $\mathrm{VTSR}_{f_j}$ with the minimum Manhattan distance to $p_s$. The meso-point can be obtained by solving linear equations. The weight of $MP_{p_s}(x^{meso}_{p_s}, y^{meso}_{p_s})$, $W^{meso}_{p_s}$, is the distance between $p_s$ and $MP_{p_s}$.
For a pin $p_t$ that is outside the bounding box of $\mathrm{VTSR}_{f_j}$, the estimation point, the outer point, is the corner of the VTSR with the minimum Manhattan distance to $p_t$. The weight of $OP_{p_t}(x^{outer}_{p_t}, y^{outer}_{p_t})$, $W^{outer}_{p_t}$, is the distance between $p_t$ and $OP_{p_t}$. Figure 15 shows an example of estimation points. $f_1$ is a 2-bit flip-flop with four connected pins, $p_1$, $p_2$, $p_3$, and $p_4$. First, $p_1$ and $p_2$ are inside the VTSR of $f_1$. The inner point for $p_1$ and $p_2$ is computed as eq. (14) and is marked as point IP in the figure.
$y_{estimate}$ is computed in similar fashion.
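The final combination step, taking the weighted center of the inner, meso- and outer points, can be sketched as follows. The sketch assumes the individual points and their weights have already been computed and that one weight applies to both coordinates of a point; the wire length estimate from the resulting point is likewise an illustrative assumption.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def weighted_center(points: List[Tuple[Point, float]]) -> Point:
    """Weighted center of (point, weight) pairs; weights must not sum to zero."""
    total = sum(weight for _, weight in points)
    x = sum(p[0] * weight for p, weight in points) / total
    y = sum(p[1] * weight for p, weight in points) / total
    return (x, y)


def estimated_wire_length(estimate: Point, pins: List[Point]) -> float:
    """Sum of Manhattan distances from the estimation point to every pin."""
    return sum(abs(estimate[0] - px) + abs(estimate[1] - py) for px, py in pins)


# One inner point with weight 1 and one outer point with weight 3.
ep = weighted_center([((0.0, 0.0), 1.0), ((4.0, 0.0), 3.0)])
print(ep, estimated_wire_length(ep, [(0.0, 0.0), (6.0, 0.0)]))
```

Because the meso- and outer-point weights are distances to the pins, pins that lie far outside the region pull the estimate toward their side.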
Clique selection
Let $C$ be the set of all VTSCs of the flip-flop set $F$ explored by the method in §4.4.2. To select VTSCs to merge into multi-bit flip-flops, we propose an intuitive greedy heuristic.
Let $R_i$ be the cost of a VTSC $c_i \in C$ corresponding to a multi-bit flip-flop $f_i$.
Each time, our algorithm picks the clique $c_{min}$ with the minimum cost to perform clustering; then every other clique that shares a flip-flop with $c_{min}$ is removed from the set.
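The greedy selection can be sketched as below. The clique costs are taken as given inputs, since the cost formula is not reproduced here, and ties are broken by member names, which is an assumption of this sketch.

```python
from typing import Dict, FrozenSet, List

Clique = FrozenSet[str]


def greedy_select(costs: Dict[Clique, float]) -> List[Clique]:
    """Repeatedly take the cheapest clique and drop every clique sharing a member."""
    remaining = dict(costs)
    chosen: List[Clique] = []
    while remaining:
        best = min(remaining, key=lambda c: (remaining[c], sorted(c)))
        chosen.append(best)
        # Removes `best` itself and every clique that conflicts with it.
        remaining = {c: r for c, r in remaining.items() if not (c & best)}
    return chosen


costs = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"b", "c"}): 0.5,
    frozenset({"c", "d"}): 2.0,
    frozenset({"a"}): 3.0,
    frozenset({"d"}): 3.0,
}
print([sorted(c) for c in greedy_select(costs)])  # [['b', 'c'], ['a'], ['d']]
```

The selected cliques are pairwise disjoint, so they form an independent set in the conflict relation between cliques.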
Experimental Result
We implemented our algorithm in the C programming language under the Linux operating system. We applied six industrial test cases to corroborate the quality of the solutions of our algorithm. The number of flip-flops in these cases ranges from approximately 100 to 170000. The exact composition of the test cases is listed in Table 2. We adopt a library with three kinds of flip-flops: 1-bit, 2-bit, and 4-bit. The specification of the library is shown in Table 3.
Our experiments include two parts. First is to validate the effectiveness of our pruning mechanism and wire length estimation method. Second we compare the power reduction, wire length ratio, and runtime of our algorithm with Chang et al.'s work [2] .
Pruning and wire length estimation
In Chap. 4 we proposed an approach to estimate the wire length of a clique based on its corresponding implementation of multi-bit flip-flop, and used this information to prune inferior cliques. Table 4 shows the number of cliques enumerated during computing each case with and without applying our pruning mechanism. The result demonstrated the effectiveness of our pruning mechanism to reduce the computation time. Table 5 further shows the average wire length estimation error percentage of each test case. Among all cases the maximum average estimation error is less than five percent, manifesting the accuracy of our estimation mechanism, which is capable of predicting the wire length without actually searching every placement grid in VTSC of the clique. Figure 16 visualizes the estimated wire length and the actual minimum wire length for all cliques generated during computing test case c1. The blue line is the estimated wire length, and the red line is the actual minimum wire length for the clique. We can see that the two lines are almost identical in their trends, providing demonstration of the accuracy of our estimation.
Power, wire length and run time
Table 6 shows the results of our clustering algorithm, and Table 7 shows the comparison between Chang et al.'s work [2] and ours based on the given cell library. Our work has competitive, if not slightly better, performance in power consumption of flip-flops, as well as a significant improvement in wire length. Our algorithm takes longer to compute due to a more thorough search of the solution space, but the run time is still acceptable in every case. Even in the largest case, with around 170000 flip-flops, our program still takes less than five minutes.
Compared with Chang's work, our work places more emphasis on wire length reduction. As manufacturing technology evolves, the share of power consumption caused by metal wires becomes more and more significant across the whole chip. In addition to the power reduction achieved by merging flip-flops, we further take into consideration the power benefits of reduced wire length. Under the 65 nm technology node, assume the grid size is equal to the minimum distance between two flip-flops, which is 0.56 micron. The power ratio of a 1-bit flip-flop to 1 micron of metal wire under the 65 nm technology node is 250:1, thus the normalized power of a 1-bit flip-flop to one unit-length (one grid) of metal wire is 140. We estimate the power consumption of 2-bit and 4-bit flip-flops based on the ratio of their area. Table 8 shows the normalized power of each cell type.
Based on above normalized cell power, we compute the combined power consumption of both flip-flops and metal wires before and after our clustering process. The impact of reduced wire length is shown in Table 9 .
Conclusion
In this paper, we introduced a problem formulation for synthesizing multi-bit flip-flops to optimize the power consumption of the clock tree, as well as the wire length consumed. We proposed a window-based algorithm in which clique-based clustering is performed. Experimental results based on industrial cases show the effectiveness of our algorithm in merging flip-flops and decreasing wire length. In addition to the benefits of applying multi-bit flip-flops, compared with previous work, our algorithm has a stronger impact on total power dissipation since it reduces the wire length more significantly.
