In the recent past, mesh-based clock distribution has received interest because of its tolerance to process variations in deep-submicron technology. Mesh buffers are placed on the mesh to drive the large load formed by the clock-sink capacitance and the mesh wire capacitance. In this paper, we propose a buffer placement algorithm that mitigates the short-circuit power dissipated in clock meshes. Our algorithm uses a clustering technique to judiciously place buffers such that short-circuit power is minimized while skew is also minimized. This is verified by Monte Carlo simulations in NGSPICE incorporating process, voltage and systemic variations.
Introduction
Traditionally, tree-based clock distribution has been the preferred method in most VLSI designs. However, in deep-submicron (DSM) technology, clock skew induced by process variations threatens the reliability of tree-based distribution. Mesh-based distribution, owing to its redundant paths, offers a variation-tolerant alternative to tree-based distribution at the cost of increased power dissipation (due to increased wire capacitance) [1]. A tree-driven mesh is a hybrid clock distribution scheme in which a top-level clock tree feeds a leaf-level mesh. Leaf-level clock mesh synthesis is well studied in [1–3]. In the leaf-level mesh, buffers are placed at the mesh nodes (Fig. 1) to drive the large capacitance, which consists of the input capacitance of the flip-flops/registers (called clock sinks) and the mesh wire capacitance.
Existing buffer placement algorithms
Buffer placement on the mesh is a crucial step in the design of the clock distribution network. Although buffers can be placed at all the mesh nodes, such liberal placement is an inefficient way to minimize skew in a mesh because of the power and area consumed by the buffers. All previous works [1–3] place buffers at selected mesh nodes such that the clock latency to all the sinks is equalized. The buffer placed at a particular mesh node is designed to drive the capacitive load around that node.
In Ref. [1], the buffers are placed using a set-cover algorithm with a discrete buffer library. In Ref. [2], the buffers are placed by the Iterative Buffer Deletion (IBD) algorithm. In Ref. [3], the buffers are placed at the mesh nodes according to the density of sinks in their vicinity and their proximity to the sinks. Since buffers have to be placed at hundreds of locations in designs with thousands of sinks in a clock mesh, buffer sizing and placement are typically automated. Hence a discrete buffer library, Library = {b_1, b_2, …, b_n}, is used. For example, if the load capacitance at the mesh nodes varies between 0 and 300 fF in a design, a library of 6 buffers can be used: b_1 to drive 0–50 fF, b_2 to drive 51–100 fF, b_3 to drive 101–150 fF, …, b_6 to drive 251–300 fF. Each buffer is designed/sized to drive its load capacitance under a particular slew constraint. Since clock signals are critical timing signals, their slew is tightly constrained, typically to 10% of the clock period. Hence an ideal buffer feeding a 1 GHz clock (1 ns period) to a clock mesh must have a slew no greater than 100 ps when driving its corresponding load.
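The binning in the example library, and the 10%-of-period slew budget, can be expressed as a small lookup. The sketch below is a hypothetical illustration, not code from the paper: it assumes the 6-buffer, 50 fF-per-bin library described above, treats bin k as covering loads up to 50k fF (so a fractional load such as 50.5 fF falls into b_2), and the helper names `select_buffer` and `slew_budget_ps` are invented for illustration.

```python
import math

# Hypothetical library from the example above: b_k drives loads up to
# 50*k fF (b_1: 0-50 fF, b_2: 51-100 fF, ..., b_6: 251-300 fF).
BIN_WIDTH_FF = 50.0
NUM_BUFFERS = 6


def select_buffer(load_ff: float) -> str:
    """Return the smallest library buffer whose load range covers load_ff."""
    if load_ff < 0 or load_ff > BIN_WIDTH_FF * NUM_BUFFERS:
        raise ValueError(f"load {load_ff} fF is outside the library range")
    # Bin k covers (50*(k-1), 50*k] fF; a 0 fF load still gets b_1.
    k = max(1, math.ceil(load_ff / BIN_WIDTH_FF))
    return f"b{k}"


def slew_budget_ps(clock_freq_ghz: float, fraction: float = 0.10) -> float:
    """Slew limit as a fraction of the clock period (10% by default)."""
    period_ps = 1000.0 / clock_freq_ghz  # 1 GHz -> 1000 ps period
    return fraction * period_ps


if __name__ == "__main__":
    print(select_buffer(101), select_buffer(149))  # b3 b3
    print(slew_budget_ps(1.0))
```

Both a 101 fF and a 149 fF load map to b_3 in this sketch, which shows the granularity of a discrete library: loads nearly 50 fF apart receive the same buffer size.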
All the aforementioned algorithms place buffers using a discrete buffer library, which can be inefficient: to drive either a 101 fF load or a 149 fF load, the buffer placement algorithm might place the same buffer, b_3. Buffers also incur an area penalty, and area is a costly commodity in submicron designs. More importantly, short-circuit (SC) power is dissipated between the buffers in the mesh. As illustrated in Fig. 2, a short-circuit path forms between adjacent buffers when the clock reaches the mesh buffers at different times. In this paper, we refer to the power dissipated through this short-circuit path as short-circuit power dissipation. It differs from the short-circuit power inherent in CMOS switching, which can be made negligible by careful design. The difference in clock arrival times at the mesh buffers is attributed to process variations in the top-level tree driving the mesh [1]. The existing buffer placement algorithms overlook this short-circuit power dissipation when placing buffers on the mesh; they concentrate only on achieving minimum skew with the minimum number of buffers. Hence, we need an algorithm that places buffers such that SC power dissipation is minimized while the clock skew is preserved. In this paper, we propose a clustering-based buffer placement algorithm with the following properties:
1. Our buffer placement algorithm is short-circuit power 'aware'. Since buffers are placed near the centroids of clusters, they are physically spread apart, minimizing the possibility of forming short-circuit paths. This reduces SC power dissipation, as will be verified by simulations in Section 5.1.
2. It is computationally less intensive than the buffer placement algorithms of [1,2] (Section 5.2).
3. It is more robust to variations in the top-level tree driving the mesh (Section 5.3).
Clustering-based buffer placement algorithm
The distribution of clock sinks is highly uneven in real designs, as is evident from the ISPD2010 clock network synthesis benchmarks [4], which are based on real 45 nm Intel microprocessor designs (Fig. 4). The proposed buffer placement is based on clustering, a technique for finding groups of similar items in data. Clustering is often used in image processing and data mining to find patterns in images and data sets. Here we use clustering to find groups of sinks that can be assigned a single buffer. The detailed steps are:
1. Start with the sink closest to the lower-left corner of the mesh. Let it be sink s_1 of the first cluster, c_1.
2. Add the sinks nearest to s_1 to cluster c_1, one at a time, until the total capacitance of the sinks in the cluster reaches a target capacitance, say 100 fF. Let m be the number of sinks in c_1.
3. Find the capacitance centroid (x_c, y_c) of the sinks in cluster c_1 as follows:

$$x_c = \frac{\sum_{i=1}^{m} C_{s_i}\, x_i}{\sum_{i=1}^{m} C_{s_i}}, \qquad y_c = \frac{\sum_{i=1}^{m} C_{s_i}\, y_i}{\sum_{i=1}^{m} C_{s_i}}$$

where s_i, i = 1, …, m, are the m sinks belonging to cluster c_1 and C_{s_i} is the capacitance of sink s_i, located at (x_i, y_i).
4. Place a buffer that can drive the target load (100 fF) at the mesh node closest to the capacitance centroid of cluster c_1.
5. Select the unclustered sink closest to the last sink of cluster c_1 as the seed of the next cluster, and repeat Steps (2)–(4) for the remaining sinks until every sink is covered by a cluster.

Fig. 3 illustrates the clustering algorithm. A direct application of the algorithm results in clusters spanning large areas when the sink distribution is very sparse (as with cluster c_2). Such clusters increase clock latency and therefore degrade the skew of the mesh. To avoid this, Step (2) internally uses an imaginary box that limits the cluster area even when the target capacitance has not been reached.

Fig. 2. Short-circuit path in a clock mesh driven by buffers.
Fig. 3. Formation of clusters of clock sinks that can each be assigned a single buffer.
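A minimal Python sketch of Steps (1)–(5), including the bounding-box limit, may make the procedure concrete. It is an illustration under stated assumptions rather than the authors' implementation: sinks are hypothetical (x, y, capacitance) tuples with positive capacitance, distances are Euclidean, the mesh is assumed to be a uniform grid anchored at (0, 0) with spacing `pitch`, the imaginary box is modeled as a square of half-width `box_half` centred on the cluster's seed sink, and Step (5) is read as seeding the next cluster with the unclustered sink nearest the last sink added. All names and parameter values are placeholders.

```python
import math
from typing import List, Tuple

Sink = Tuple[float, float, float]  # hypothetical (x, y, capacitance in fF)
Node = Tuple[float, float]


def capacitance_centroid(cluster: List[Sink]) -> Node:
    """Capacitance-weighted centroid (x_c, y_c) of one cluster's sinks."""
    total = sum(c for _, _, c in cluster)
    x_c = sum(c * x for x, _, c in cluster) / total
    y_c = sum(c * y for _, y, c in cluster) / total
    return x_c, y_c


def snap_to_mesh(x: float, y: float, pitch: float) -> Node:
    """Nearest node of an assumed uniform mesh anchored at (0, 0)."""
    return round(x / pitch) * pitch, round(y / pitch) * pitch


def dist(a: Sink, b: Sink) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def cluster_sinks(sinks: List[Sink], target_ff: float = 100.0,
                  box_half: float = 50.0,
                  pitch: float = 20.0) -> List[Tuple[List[Sink], Node]]:
    """Greedy clustering; returns (cluster, buffer mesh node) pairs."""
    remaining = list(sinks)
    if not remaining:
        return []
    # Step 1: seed with the sink closest to the lower-left corner (0, 0).
    seed = min(remaining, key=lambda s: math.hypot(s[0], s[1]))
    result = []
    while remaining:
        remaining.remove(seed)
        cluster, cap = [seed], seed[2]
        # Step 2: add sinks nearest the seed until the target capacitance is
        # reached, considering only sinks inside the box around the seed.
        while cap < target_ff:
            in_box = [s for s in remaining
                      if abs(s[0] - seed[0]) <= box_half
                      and abs(s[1] - seed[1]) <= box_half]
            if not in_box:
                break  # box exhausted: close the cluster below target
            nearest = min(in_box, key=lambda s: dist(s, seed))
            remaining.remove(nearest)
            cluster.append(nearest)
            cap += nearest[2]
        # Steps 3-4: one buffer at the mesh node nearest the centroid.
        node = snap_to_mesh(*capacitance_centroid(cluster), pitch)
        result.append((cluster, node))
        # Step 5: next seed is the unclustered sink nearest the last one added.
        if remaining:
            last = cluster[-1]
            seed = min(remaining, key=lambda s: dist(s, last))
    return result


if __name__ == "__main__":
    sinks = [(0, 0, 40), (10, 0, 40), (0, 10, 40),
             (200, 200, 60), (210, 200, 60)]
    for cluster, node in cluster_sinks(sinks):
        print(len(cluster), node)  # 3 (0.0, 0.0), then 2 (200.0, 200.0)
```

Because the box check runs before a sink can be added, an isolated sink forms a small cluster of its own instead of pulling in distant sinks, which is the latency-limiting behavior the imaginary box is meant to provide.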
