FPGA clock networks consume a significant amount of power, since they toggle every clock cycle and must be flexible enough to implement the clocks for a wide range of different applications. The efficiency of FPGA clock networks can be improved by reducing this flexibility; however, reducing the flexibility introduces stricter constraints during the clustering and placement stages of the FPGA CAD flow. These constraints can reduce the overall efficiency of the final implementation. This article examines the trade-off between the power consumption and flexibility of FPGA clock networks.
INTRODUCTION
With advancements in process technology, programmable architecture, and computer-aided design (CAD), field-programmable gate arrays (FPGAs) are now being used to implement and prototype large system-level applications. These applications often have several clock domains. In order to support applications with multiple clock domains, FPGA vendors incorporate complex clock distribution circuitry within their devices [Actel 2007; Altera 2005, 2006; Xilinx 2007].
Designing a suitable clock distribution network for an FPGA is significantly more challenging than designing such a network for a fixed-function chip such as an application-specific integrated circuit (ASIC). In an ASIC, the locations and skew requirements of each domain are known when the clock network is designed. In an FPGA, however, a single clock network that works well across many applications must be created. When the FPGA is designed, the number of clock domains the user will require, the clock signals that will be generated, the skew requirements of each domain, and where each domain will be located on the chip are all unknown. This forces FPGA vendors to create very complex yet flexible clock distribution circuitry.
This flexibility has significant area and power overheads. Power is of particular concern, since the clock signals toggle every clock cycle and are connected to a large number of flip-flops. Previous studies have indicated that in a typical FPGA, 19% of the total FPGA power is dissipated in the clock network [Tuan 2006]. The more flexible the clock network, the more parasitic capacitance on the clock nets, and the more routing switches traversed by each clock signal; this leads to increased power dissipation. Clearly, FPGA vendors must carefully balance the flexibility of their clock distribution networks and the power dissipated by these networks.
The clock distribution network also has an impact on the ability of computer-aided design (CAD) tools to minimize the power and maximize the clock frequency of a user circuit. FPGA clock networks typically are not flexible enough to supply any clock signal to any flip-flop. As we will discuss in this article, typical networks allow only a subset of clock signals to reach particular regions of the FPGA. This imposes additional constraints on the placement algorithm, as well as on the clustering algorithm that groups logic elements into clusters. If the clock network is not flexible enough, these constraints could result in increased power dissipation and delay in a user circuit. Again, this balance must be considered by FPGA vendors as they design their clock distribution networks.
In this work, we investigate the trade-off between clock network flexibility and the power and speed of user circuits implemented on FPGAs. In particular, we make the following contributions.
1. We present a parameterized framework that describes a family of FPGA clock networks and encompasses the salient features of commercial FPGA clock networks. Such a framework is important, as it allows us to reason about and explore clock networks.
2. We present new clock-aware placement techniques that satisfy the placement constraints imposed by the clock network. As described before, the topology of the clock distribution architecture implies constraints on where logic blocks can be placed, and on which logic blocks can be packed together into clusters. For a given clock distribution architecture, our placement algorithm finds a legal placement solution that meets these constraints. As a secondary goal, the algorithms try to find a solution that uses the clock distribution network efficiently. For example, it may be possible to group together logic blocks such that parts of each clock network can be "turned off" or remain unused, thereby saving significant power. Most clock distribution architectures provide for both global and local clocks; the placement algorithm also determines which clocks are carried by the global distribution network and which are carried by a local distribution network. We consider several placement algorithms and compare their ability to meet the placement constraints, as well as to minimize power by using the clock network efficiently. Our algorithms are implemented in the versatile place-and-route (VPR) tool [Betz 1999], and are flexible enough to target an FPGA with any clock distribution network that fits within our parameterized framework.
3. We examine how the architecture of the clock network and the placement algorithm used affect the overall power, area, and delay of the FPGA. We consider both the cost of the clock network itself and the impact of the constraints imposed by the clock network. In doing so, we identify the key parameters in our clock framework that have the most significant impact.
It is important to note that the approach we take is empirical, and therefore the results and conclusions we make regarding FPGA clock networks are inherently dependent on the clock-aware placement techniques used. With this in mind, however, we endeavor to make the clock-aware enhancements using standard and intuitive techniques and consider a number of different techniques to determine which are most suitable.
Together, the aim of these contributions is to provide insight into what makes a good FPGA clock distribution architecture.
Early versions of the parameterized clock network framework and clock-aware placement techniques were presented in Lamoureux [2006] and Lamoureux [2007], respectively. In this article we combine and expand these studies. Specifically, the assumptions we make regarding buffer sizing and buffer sharing within the clock networks have been revisited to reflect current low-skew design techniques. Moreover, the results have been expanded by including the impact of clock network constraints on the clustering stage of the FPGA CAD flow. Finally, the empirical study has been made more concrete by incorporating a greater number of multiple clock-domain benchmark circuits.
This work is organized as follows. Section 2 provides background on clock networks and previous work related to FPGA clock networks and CAD. Section 3 describes the parameterized framework used to describe and compare different FPGA clock networks. Section 4 describes new clock-aware placement techniques for FPGAs. Section 5 examines which clock-aware placement techniques perform the best in terms of power and speed. Section 6 then examines how FPGA clock networks affect overall FPGA power, area, and speed. Finally, Section 7 summarizes our conclusions. 
BACKGROUND AND PREVIOUS WORK
This section provides background on clock networks and then describes previous work related to FPGA clock networks and clock-aware CAD.
Background
The primary goal when designing a clock network for any digital circuit (ASICs and FPGAs alike) is to minimize clock skew, and the secondary aim is to minimize power and area. Many low-skew and low-power techniques for ASIC clock networks have been described in the literature. Buffered trees comprise the most common strategy for distributing clock signals [Friedman 2001]. Buffered trees have a root (clock source), a trunk, branches, and leaves (registers), and are driven by buffers at the trunk and/or along the branches of the tree, as illustrated in Figure 1(a). Another buffered-tree approach uses symmetry to minimize clock skew, as illustrated in Figure 1(b). Symmetric buffered trees utilize a hierarchy of symmetric H-tree or X-tree structures to make the path from the source to each register the same in length.
FPGA clock networks differ from ASIC clock networks since they are designed before the application is implemented. Thus, in addition to minimizing clock skew and power, the FPGA clock network must be flexible enough to implement a wide range of different applications. Commercial FPGAs, such as the Altera Stratix III, Actel ProASIC3, and Xilinx Virtex 5 [Xilinx 2007] devices, support multiple local and global clock domains. In each of these devices, the FPGA is divided into regions, as illustrated in Figure 2. The Stratix III has four quadrants that can be further subdivided into four subregions (per quadrant), the ProASIC3 has four quadrants, and the Virtex 5 has fixed-size regions that are 20 logic rows high and span half the width of the FPGA. The Altera Stratix III provides 16 global clock signals, which can be connected to all the flip-flops on the FPGA, and 22 local clock networks in each of the four quadrants, which can be connected to any of the flip-flops within that quadrant. Similarly, Actel ProASIC3 devices provide 6 global clocks and 3 local clocks per quadrant, and Xilinx Virtex 5 devices provide 32 global clocks and 10 local ones. The global clocks in Virtex 5 are not connected to flip-flops directly; instead, the global clocks drive local clocks within each region.
Within the regions, the clocks are typically distributed to rows of logic blocks through a row multiplexer and rib routing channels. In Stratix II devices [Altera 2005], each row multiplexer chooses 6 clocks from the 24 local/global clocks, and provides them to all the flip-flops in that row, as shown in Figure 3. In ProASIC3 devices, the row multiplexers choose from 6 global, 3 local, and several internal signals. In Virtex 5 devices, the row multiplexers choose between 10 local clocks.
The circuitry that drives the clock networks is similar for each of the three devices. As shown in Figure 4, the local and global clock networks are driven by control blocks that select the clock signal and dynamically enable or disable the clock to reduce power consumption when the clock signal is not being used. The clock networks can be driven by an external source, an internal source, or by clock management circuitry which multiplies, divides, and/or shifts an external source.
For a number of years, commercial FPGAs have incorporated programmable clock distribution networks that support applications with multiple clocks and feature local and global clock regions; however, to our knowledge, the associated clock-aware CAD techniques have not been disclosed.
Full Crossbar and Concentrator Networks.
In this article, we will employ both full crossbar and concentrator crossbar networks as building blocks to connect various stages of the clock distribution network together, as described in Section 3. An n × m single-stage crossbar connects n inputs to m outputs through one stage of switches, as shown in Figure 5. Figure 5(a) shows a full crossbar network in which each output can be driven by any input. Such a crossbar requires n · m switches. Figure 5(b) shows a concentrator network in which any m-element set of the n input signals can be mapped to the m outputs (without regard for the order of outputs). A concentrator built this way contains m · (n − m + 1) switches [Nakamura 1982]. Sparse-crossbar concentrators are most efficient when m is close to n (i.e., the network is close to square), or when m is very small (close to 1). A perfectly square crossbar concentrator has only n switches. As the crossbar concentrator becomes more rectangular, the number of switches approaches that of a full crossbar. Full and concentrator crossbars can also be implemented using fanin-based switches (multiplexers followed by buffers).
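The two switch-count formulas above can be compared directly. The following is a minimal sketch; the function names are illustrative and not taken from any tool described in this article.

```python
def full_crossbar_switches(n: int, m: int) -> int:
    """Switches in an n x m full crossbar: each output can select any input."""
    return n * m


def concentrator_switches(n: int, m: int) -> int:
    """Switches in an n x m sparse-crossbar concentrator: m * (n - m + 1)."""
    if not 1 <= m <= n:
        raise ValueError("a concentrator needs 1 <= m <= n")
    return m * (n - m + 1)


if __name__ == "__main__":
    # Compare both structures for a few (n, m) shapes.
    for n, m in [(6, 1), (6, 3), (6, 6), (24, 6)]:
        print(n, m, full_crossbar_switches(n, m), concentrator_switches(n, m))
```

For the square case (m = n) and the single-output case (m = 1) the concentrator needs only n switches, which matches the efficiency remark in the text.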
As an example, the 6 × 3 concentrator crossbar network illustrated in Figure 6 shows how concentrators are used to connect rib tracks to spine tracks of the clock networks described in this article. The multiplexers are implemented with pass-transistors followed by cascaded buffers, as described in Section 5.1.
Previous Work
Existing literature on FPGAs has mostly assumed simplified FPGA clock architectures. The study described in George [1999] examines the design of energy-efficient FPGAs. For clock networks, the study proposes that dual edge-triggered flip-flops be used within logic blocks to reduce the power dissipated by the clock network, since this reduces the toggle rates by a factor of two. In Li [2003] and Poon [2005], FPGA power models are described that assume simple H-tree clock networks with buffers inserted at the end of each branch. In both models, clock networks span the entire FPGA and are implemented using dedicated (nonconfigurable) resources; in both cases, the architecture of the clock network is fixed.
Clock-Aware CAD.
Clock-aware placement has been considered in several studies related to ASICs. As an example, the quadratic placer described in Natesan [1996] minimizes clock skew by biasing the placement to evenly disperse clocked cells. As a result, the clock tree generated after placement is more balanced. Another technique, described in Edahiro [1996], minimizes clock skew using a placement algorithm to optimally size and place clock buffers within the generated clock tree. Although useful for ASICs, these techniques are not applicable to FPGAs, which have fixed (but configurable) clock networks.
FPGA Clock CAD.
Only a few CAD studies have considered FPGA clock networks. In Zhu [1997], clock skew is minimized during placement by balancing the usage in each branch of the programmable clock network. This technique, however, is not necessary for recent FPGAs, since the clock inputs to logic blocks are buffered, which makes clock delay independent of usage. In Brynjolfson [2000], dynamic clock management is applied to FPGAs to minimize power. The technique involves dynamically slowing clock frequencies when the bandwidth demand decreases. This technique can also be used in conjunction with dynamic voltage scaling to further reduce overall power. Finally, the T-VPack clustering tool described in Betz [1999] is clock aware, since it limits the number of clock signals used within each cluster as specified by the user. Although the clusterer supports circuits with multiple clocks, the aforementioned study focused on FPGAs with only one clock.
PARAMETERIZED FPGA CLOCK-NETWORK FRAMEWORK
This section presents a parameterized framework for describing FPGA clock networks. The framework can be used to describe a broad range of clock networks and encompasses the salient features of the clock networks used in current and future FPGAs. This framework is important, since it provides a basis for comparing new FPGA clock networks and clock-aware CAD techniques. The framework assumes a clock network topology with three stages. The first stage programmably connects some number of clock sources (clock pads, PLL/DLL outputs, or internal signals) to the center of some number of clock regions. The second stage programmably distributes the clocks to logic blocks within each region. The third stage connects the clock inputs of each logic block to the flip-flops within that logic block. This topology is described in more detail in the following subsections.
Clock Sources
The source of the user clocks can be external, from a dedicated input pad, or internal, generated by a phase-locked loop (PLL), a delay-locked loop (DLL), or even the core of the FPGA. In all cases, we assume these clock sources to be distributed evenly around the periphery of the FPGA core. In our model, the number of potential clock sources is denoted nsource, meaning there are nsource/4 potential clock sources on each side of the FPGA.
Global and Local Clock Regions
The clock network has both local and global resources. The global clock resources distribute large user clocks across the entire chip (but not necessarily to every logic block, as described in Section 3.3). Local clock resources, on the other hand, distribute smaller user clocks to individual regions of the FPGA. Although it is possible to distribute a large user clock to the entire chip by stitching together several local clocks, this would be less efficient than using a global clock.
To support local clocks, the FPGA fabric is broken down into a number of regions, each of which can be driven by a different set of clock sources (the same clock source can be connected to more than one region). The number of regions in the X-dimension is denoted by nx region, and the number of regions in the Y-dimension by ny region. The total number of regions is thus nx region × ny region. Figure 7 shows an example FPGA in which both nx region and ny region are 3, which produces 9 clock regions.
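As an illustration of how a regular grid of regions partitions the fabric, the sketch below maps a logic-block coordinate to its clock region. It assumes a rectangular array of logic blocks split into regions of (nearly) equal size; the function name and the `width`/`height` parameters are hypothetical and not part of the framework's parameter list.

```python
def region_of(x: int, y: int, width: int, height: int,
              nx_region: int, ny_region: int) -> tuple[int, int]:
    """Return the (rx, ry) region indices of the logic block at column x, row y.

    Assumes the width x height array is divided into nx_region x ny_region
    regions that are as equal in size as integer division allows.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("logic block lies outside the array")
    rx = x * nx_region // width
    ry = y * ny_region // height
    return rx, ry


if __name__ == "__main__":
    # A 9 x 9 array with 3 x 3 regions, as in the example with 9 clock regions.
    regions = {region_of(x, y, 9, 9, 3, 3) for x in range(9) for y in range(9)}
    print(len(regions))
```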
First Network Stage
The first stage of the clock network, through programming, connects clock sources on the periphery of the chip to the center of each region. This first stage consists of two parallel networks: one for global clocks and one for local. We denote the total number of global clock signals as W global . The global clock network selects W global /4 signals from each of the nsource/4 potential clock sources on each side, as shown in Figure 8 . This selection is done using a concentrator network on each side of the chip. The use of a concentrator network guarantees that any set of W global /4 signals can be selected from the nsource/4 sources on each side. This architecture does not, however, guarantee that any set of W global signals can be selected from the nsource potential sources; it would be impossible to select more than W global /4 signals from any side. Relaxing this restriction would require an additional level of multiplexing (or larger concentrators and longer wires), which we do not include in our model.
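The per-side restriction described above can be expressed as a simple feasibility check. This is a sketch under the stated model only: each side contributes at most W global/4 signals, and the side names and function name are assumptions made for illustration.

```python
from collections import Counter

SIDES = ("north", "east", "south", "west")


def global_selection_is_feasible(chosen_sides: list[str], w_global: int) -> bool:
    """Check whether a set of global clock sources can be selected.

    chosen_sides holds, for every source picked as a global clock, the side
    of the chip on which that source sits. The concentrator on each side can
    deliver any W_global/4 of its sources, but never more than that.
    """
    unknown = set(chosen_sides) - set(SIDES)
    if unknown:
        raise ValueError(f"unknown sides: {sorted(unknown)}")
    per_side_limit = w_global // 4
    counts = Counter(chosen_sides)
    return all(counts[side] <= per_side_limit for side in SIDES)


if __name__ == "__main__":
    # With W_global = 8, each side may contribute at most 2 global clocks.
    print(global_selection_is_feasible(["north", "north", "east"], 8))
    print(global_selection_is_feasible(["north", "north", "north"], 8))
```

The second call fails even though only 3 of the 8 global tracks are requested, which is the restriction that the extra level of multiplexing would remove.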
All W global signals are routed on dedicated wires to the center of the chip. Since all sources come from the periphery of the chip, the connection between the sources and the center of the chip will introduce little or no skew. In an architecture in which some sources come from inside the chip, the drivers can be sized such that skew is minimized. From the center of the chip, all W global clocks are distributed to the center of all regions using a spine-and-ribs distribution network with ny region ribs, as shown in Figure 9 . Although an H-tree topology would have a lower skew, it is more difficult to mesh such a topology onto a tiled FPGA with an arbitrary number of rows and columns.
There is one local clock network per region. We denote the number of local clocks per region as W local . The W local signals are selected from the two sides of the FPGA that are closest to the region, as shown in Figure 8 (if the number of regions in either dimension is odd, the selection of the "closest" side is arbitrary for regions in the middle). Half of the W local signals are selected from the sources on each of the two sides, using a concentrator network. The use of a concentrator network guarantees that any set of W local /2 signals can be selected from the nsource/4 potential sources on each side. Driver sizing can be used to minimize the skew among the clocks connected to each region. Skew between regions is not as important, since global clocks will likely be used if a clock is to drive multiple regions.
Second Network Stage
The second network stage, through programming, connects clock signals from the center of each region to the logic blocks within that region. There is one such network for each region.
The input to each second-stage network consists of W global global clocks and W local local clocks from the first stage, described in Section 3.3. These clocks are distributed using a spine-and-ribs topology, as shown in Figure 10. The spine contains W global + W local wires. In each row, a concentrator network is used to select any set of W rib clocks from the spine. These clocks are distributed to the logic blocks in that row through a local rib. Each logic block in the row connects to the rib through another concentrator network; the concentrator is used to select any set of W lb clocks from the rib.
Third Network Stage
Finally, the third network stage, by means of programming, connects the W lb logic-block clocks to the N logic elements within the logic block. This is illustrated in Figure 11 . In order to provide flexibility to the clustering tool, we assume that the clock pins of the logic block are connected to the clock pins of the logic elements (within that logic block), using a full-crossbar network such that each of the N logic elements can be clocked by any of the W lb logic-block clocks. Note that this is the only stage of the clock network that uses full crossbars; the other stages use concentrator crossbars to reduce the number of switches. Table I summarizes the parameters of the FPGA clock network framework and includes the corresponding values that are used in the empirical studies described in Sections 5 and 6 of this article.
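To make the switch budget of the second and third stages concrete, the sketch below counts the switches in one row of a region using the concentrator formula m · (n − m + 1) from Section 2 and the full-crossbar count N · W lb. The class name, function names, and the numeric values in the example are hypothetical; they are not the values of Table I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockNetworkParams:
    w_global: int  # global clocks reaching each region
    w_local: int   # local clocks per region
    w_rib: int     # clocks selected per row (rib)
    w_lb: int      # clock inputs per logic block
    n_le: int      # logic elements (N) per logic block


def concentrator_switches(n: int, m: int) -> int:
    """Switch count of an n x m sparse-crossbar concentrator."""
    return m * (n - m + 1)


def row_switches(p: ClockNetworkParams, blocks_per_row: int) -> int:
    """Switches needed to clock one row of logic blocks within a region."""
    spine_width = p.w_global + p.w_local
    spine_to_rib = concentrator_switches(spine_width, p.w_rib)
    rib_to_block = concentrator_switches(p.w_rib, p.w_lb)
    block_crossbar = p.n_le * p.w_lb  # third stage: full crossbar
    return spine_to_rib + blocks_per_row * (rib_to_block + block_crossbar)


if __name__ == "__main__":
    params = ClockNetworkParams(w_global=4, w_local=4, w_rib=4, w_lb=2, n_le=10)
    print(row_switches(params, blocks_per_row=5))
```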
CLOCK-AWARE FPGA CAD
The previous section presented a parameterized framework to describe FPGA clock networks. This section describes new clock-aware placement techniques for finding a legal placement solution that satisfies all of the placement constraints imposed by the clock network.
Clock-Aware Placement
Placement can have a significant impact on power, delay, and routability. Placing logic blocks close together on a critical path minimizes delay, and placing logic blocks connected by many wires close together improves power and routability. Power can be further minimized by assigning more weight to connections with high toggle rates, as described in Lamoureux [2005]. For applications with many clock domains, however, constraints that limit the number of clock nets within ribs (W rib) and within a region (W local) can interfere with these optimizations. To investigate how clock network constraints affect the power, delay, and routability during placement, the T-VPlace algorithm from Betz [1999] was enhanced to make it clock aware. T-VPlace is based on simulated annealing. The original cost function used by T-VPlace has two components. The first is the wiring cost, which is the sum of the bounding box dimensions of all nets (not including clock nets). The second component is the timing cost, which is a weighted sum of the estimated delay of all nets. The cost of a swap is then

\[ \Delta C = \lambda \cdot \frac{\Delta \mathit{TimingCost}}{\mathit{PreviousTimingCost}} + (1 - \lambda) \cdot \frac{\Delta \mathit{WiringCost}}{\mathit{PreviousWiringCost}} \tag{1} \]
The PreviousTimingCost and PreviousWiringCost terms in the expression are normalizing factors that are updated once every temperature change, and λ is a user-defined constant which determines the relative importance of the cost components. Three enhancements were made to T-VPlace in order to make it clock aware. First, a new term was added to the existing cost function to account for the cost of the clock. Second, a new processing step was added that determines which user clock nets use local clock network resources and which use global. Third, the random initial placement approach was replaced with a new routine that finds a legal placement. Each enhancement is described next.
New Cost Function.
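The new cost function is the same as the original cost function, except that it has a new term to account for the cost of using clock network resources. Intuitively, the new term in the cost function minimizes the usage of clock network resources, which minimizes clock power and is key for finding legal placements. The new cost function is described by the following expression.

\[ \Delta C = \lambda \cdot \frac{\Delta \mathit{TimingCost}}{\mathit{PreviousTimingCost}} + (1 - \lambda) \cdot \frac{\Delta \mathit{WiringCost}}{\mathit{PreviousWiringCost}} + \gamma \cdot \frac{\Delta \mathit{ClockCost}}{\mathit{PreviousClockCost}} \tag{2} \]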
Like the two original terms, the new term is normalized by the cost of the clock from the previous iteration and is weighted by the factor γ. The best value for γ is found empirically by doing a parametric sweep to find the value that produces the most efficient (and legal) placements. In this article we consider two different cost functions for the clock term. The best γ values were found to be 1.0 and 0.3 for the first and second cost functions (see the following), respectively. The first cost function, which we call the standard cost function, is based on the amount of clock-network resources needed to route all user clock nets. Specifically, the cost function counts the number of clocks used in each rib, local region, and global region. Moreover, the cost of each resource type (rib, local, or global) is scaled by a constant (K rib, K local, or K global, respectively) to reflect the capacitance of these resources.
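The standard cost function and the clock-aware swap cost can be sketched as follows. This is an illustration under stated assumptions rather than the VPR implementation: the data layout (a block-to-(region, row) map and a block-to-clock-nets map) is hypothetical, global clocks are counted once chip-wide, and the swap cost assumes the clock term is simply added to the two normalized T-VPlace terms with weight γ.

```python
from collections import defaultdict


def standard_clock_cost(placement, block_clocks, global_nets,
                        k_rib, k_local, k_global):
    """Count clock resources in use, scaled per resource type.

    placement:    block -> (region, row)   (hypothetical layout)
    block_clocks: block -> set of clock nets used inside that block
    global_nets:  clock nets assigned to the global network
    """
    rib_nets = defaultdict(set)    # (region, row) -> clock nets on that rib
    local_nets = defaultdict(set)  # region -> local clock nets in that region
    used_global = set()            # global clock nets in use (assumed chip-wide)
    for block, (region, row) in placement.items():
        for net in block_clocks.get(block, ()):
            rib_nets[(region, row)].add(net)
            if net in global_nets:
                used_global.add(net)
            else:
                local_nets[region].add(net)
    return (k_rib * sum(len(s) for s in rib_nets.values())
            + k_local * sum(len(s) for s in local_nets.values())
            + k_global * len(used_global))


def swap_cost(d_timing, d_wiring, d_clock,
              prev_timing, prev_wiring, prev_clock, lam, gamma):
    """Normalized cost change of a swap, with an added clock term."""
    return (lam * d_timing / prev_timing
            + (1.0 - lam) * d_wiring / prev_wiring
            + gamma * d_clock / prev_clock)
```

A swap that removes the last pin of a clock net from a rib lowers the rib count by one, which makes `d_clock` negative and the swap more likely to be accepted.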
Although straightforward, the aforesaid cost function can be short-sighted when large user clock nets (with numerous pins) occupy more than one region. Unless all the pins are moved from a region, the cost of occupying that region does not change.
The second cost function, which we call the gradual cost function, changes when large nets are partially moved. The function scales the cost of adding an LE to a region, based on how many other LEs in this region are connected to the same user clock net as this LE. Specifically, the incremental cost of adding an LE that uses clock net i to region j is described by.
where pins(i, j) is the number of pins of net i in region j, and max_pins(i, j) = min(number of pins of clock net i, number of LEs in region j).
In the expression, K j is the weight factor for region j, which is either K rib, K local, or K global, depending on the type of region being considered; pins(i, j) is the number of pins of net i that are currently placed in region j; and max_pins(i, j) is the maximum number of pins from net i that could be placed in region j. The latter is determined either by the number of pins on net i, if the entire net fits within region j, or by the number of LEs in region j, if the net has more pins than there are LEs in this region.
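The two quantities defined above are straightforward to compute. The sketch below covers only pins(i, j) and max_pins(i, j); it does not attempt the gradual cost expression itself. The dictionary layout (net name mapped to the list of regions holding its pins) is an assumption made for illustration.

```python
def pins(net_pin_regions: dict[str, list[int]], net: str, region: int) -> int:
    """Number of pins of `net` currently placed in `region`."""
    return sum(1 for r in net_pin_regions[net] if r == region)


def max_pins(net_pin_regions: dict[str, list[int]], net: str,
             les_per_region: int) -> int:
    """Largest number of pins of `net` that one region could hold."""
    return min(len(net_pin_regions[net]), les_per_region)


if __name__ == "__main__":
    # clk_a has five pins: three in region 0 and two in region 1.
    layout = {"clk_a": [0, 0, 0, 1, 1]}
    print(pins(layout, "clk_a", 0), max_pins(layout, "clk_a", les_per_region=4))
```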
Intuitively, both cost functions encourage swaps that reduce the amount of clock network resources that are used. In the second cost function, however, the cost of moving a logic block to a region is smaller when other logic blocks in that region are connected to the same clock net. The goal of the gradual cost function is to prevent large clock nets from spreading out to more regions than necessary. However, since the function does not reflect the actual cost of the clock as directly as does the first cost function, the overall results may not be minimized as intended. To determine which cost function is most suitable, we performed an empirical study, described in Section 5.
Clock Resource Assignment.
The second enhancement made to the placer was to make it able to determine which user clock nets should use the local clock network resources and which should use the global. Although this decision could be left to the user, it is more appropriate for the tool to make the assignment, since it is convenient and the user may not be familiar with the underlying architecture of the clock network.
Global clock network resources are more expensive than the local clock network resources in terms of power, since they are routed to the center of the FPGA before spanning to the center of the clock regions. Depending on the clock network and application, global clock network resources may also be in short supply. Therefore, global clock networks should be reserved for large nets that do not fit within local regions, or for nets that are inherently spread out.
We consider two approaches for assigning global clock network resources. The first approach assigns global resources statically, based on the size (fanout) of the clock nets. The advantage of this static approach is that it is quick and easy. The disadvantage is that it overlooks smaller nets that require global resources because they are inherently spread out. The second approach assigns global resources dynamically during placement based on the amount of spread between the LEs driven by the same clock net. Figures 12, 13, and 14 describe each assignment technique in more detail.
The static assignment routine is described in Figure 12. It begins by determining how many clock nets use global clock network resources, based on the number of clock nets in the application, the number of global clock network resources available, and a user-defined relaxation factor called RELAX_FAC. Intuitively, we are trying to use the fewest possible number of global clock resources, since they consume more power and should be reserved for those clock domains that need to be spread out across more than one clock region. In Figure 12, the num_global_min parameter is the absolute minimum number of global clock resources that would be needed to legally place the application. This value is determined by calculating how many application clock nets would remain if each local clock network resource could be used to implement one of the clock nets. The placer, however, may not be able to find a placement that only uses local clock resources, since application clock nets are often too large and need to be spread out over more than one clock region. The relaxation factor makes finding a legal solution easier by allowing more clock nets to be implemented on global clock resources. In our experiments we use a relaxation factor of 0.5. Once the number of global clock resources to be used (num_global_relaxed) has been decided, the routine assigns the global resources to those clock nets with the greatest number of pins (highest fanout).
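The static routine can be sketched as below. Since the pseudocode of Figure 12 is not reproduced here, the way the relaxation factor is applied is an assumption: the sketch interpolates between num_global_min and the largest usable number of global resources and rounds up. The function name and argument layout are likewise hypothetical.

```python
import math


def static_global_assignment(fanout: dict[str, int], num_regions: int,
                             w_local: int, w_global: int,
                             relax_fac: float = 0.5) -> set[str]:
    """Pick the clock nets that use global resources, by fanout.

    fanout maps each clock net to its number of pins.
    """
    num_nets = len(fanout)
    # Nets left over if every local clock resource carried one net.
    num_global_min = max(0, num_nets - num_regions * w_local)
    num_global_max = min(num_nets, w_global)
    if num_global_min > num_global_max:
        raise ValueError("not enough clock resources for this application")
    # Assumed relaxation: move part of the way from the minimum to the maximum.
    num_global_relaxed = num_global_min + math.ceil(
        relax_fac * (num_global_max - num_global_min))
    # Highest-fanout nets first; ties broken by name for determinism.
    ranked = sorted(fanout, key=lambda net: (-fanout[net], net))
    return set(ranked[:num_global_relaxed])


if __name__ == "__main__":
    nets = {"clkA": 500, "clkB": 40, "clkC": 300, "clkD": 10, "clkE": 5}
    print(sorted(static_global_assignment(nets, num_regions=4,
                                          w_local=1, w_global=4)))
```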
The dynamic assignment technique described in Figure 13 is applied during placement. Initially, all user clock nets are assigned to use local clock network resources. Then, during the simulated annealing routine, clock nets can be reassigned to use global resources if the placer cannot find a legal placement. Specifically, the placer reassigns clock nets when the cost of the clock no longer decreases and the placement is still not legal. When clock nets are reassigned, the temperature is reset back to the initial temperature to find a placement solution, given the new assignment. The pseudocode in Figure 13 is based on T-VPlace and described in Betz [1999] .
The clock nets are reassigned using the routine described in Figure 14 . It begins by calculating the spread of each clock net by calculating the locality distance, which counts how many clock pins would need to move in order to make the clock net local. After sorting each net by locality distance, it then assigns global clock network resources to that half of the remaining local clock nets which have the highest locality distance.
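A sketch of the reassignment step follows. It assumes that a net becomes local once all of its pins sit in a single region, so the locality distance is the number of pins outside the region that already holds the most pins of that net. Rounding the "half" up for an odd number of nets is also an assumption, as are the function names and data layout.

```python
import math
from collections import Counter


def locality_distance(pin_regions: list[int]) -> int:
    """Pins that would have to move for the net to fit in one region."""
    if not pin_regions:
        return 0
    most_common_count = max(Counter(pin_regions).values())
    return len(pin_regions) - most_common_count


def reassign_to_global(local_nets: dict[str, list[int]]) -> set[str]:
    """Return the half of the local nets with the highest locality distance."""
    ranked = sorted(local_nets,
                    key=lambda net: (-locality_distance(local_nets[net]), net))
    return set(ranked[:math.ceil(len(ranked) / 2)])


if __name__ == "__main__":
    nets = {
        "a": [0, 0, 0, 1],     # distance 1
        "b": [0, 1, 2, 3],     # distance 3
        "c": [2, 2],           # distance 0
        "d": [0, 0, 1, 1, 1],  # distance 2
    }
    print(sorted(reassign_to_global(nets)))
```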
Legalization.
The final enhancement needed to make the placer clock aware is to produce a placement that is legal. Legalization ensures that the number of different clock nets used in every region is less than or equal to the number of clock resources available in that region.
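The legality condition can be written as a direct check against the framework parameters. The following is a minimal sketch, assuming a hypothetical data layout (block mapped to its region and row, and block mapped to the clock nets it uses); the per-logic-block limit W lb is left to the clusterer and is not checked here.

```python
from collections import defaultdict


def is_legal(placement, block_clocks, global_nets, w_rib, w_local, w_global):
    """Check rib, local-region, and global clock capacities.

    placement:    block -> (region, row)
    block_clocks: block -> set of clock nets used inside that block
    global_nets:  clock nets assigned to the global network
    """
    rib_nets = defaultdict(set)
    local_nets = defaultdict(set)
    used_global = set()
    for block, (region, row) in placement.items():
        for net in block_clocks.get(block, ()):
            rib_nets[(region, row)].add(net)
            if net in global_nets:
                used_global.add(net)
            else:
                local_nets[region].add(net)
    return (all(len(nets) <= w_rib for nets in rib_nets.values())
            and all(len(nets) <= w_local for nets in local_nets.values())
            and len(used_global) <= w_global)
```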
We consider two approaches. The first, called the preplacement approach, finds a legal solution before placement. A legal solution is found using simulated annealing with the timing and wiring-cost components turned off (leaving only the clock cost component turned on). If a legal placement is found, the actual placement is then performed with all three cost components turned on, but only allowing legal swaps.
The second approach involves legalizing during the actual placement. The algorithm starts with a random placement and then uses simulated annealing with the timing, wiring, and clock costs (from Eq. (2)) turned on. To gradually legalize the placement, the clock cost component is modified to severely penalize those swaps that make the placement either illegal or more illegal. In other words, illegal swaps are allowed, but have higher cost. Specifically, we multiplied the cost of using an unavailable rib, spine, or global routing resource by a constant value, called Illegal Factor. Intuitively, a large value forces the placer to find a legal solution quickly, but limits its ability to make major changes to the way the clock resources are used once a legal placement is found.
In our experiments, we found 10 to be a suitable value for Illegal Factor.
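To illustrate how such a penalty behaves, the sketch below charges a unit cost for each clock resource that is available and multiplies the cost of each unavailable resource by Illegal Factor. It is not the clock term of Eq. (2), which is not reproduced here; the per-resource `unit_cost` and the function name are assumptions made for illustration.

```python
ILLEGAL_FACTOR = 10  # value reported as suitable in the text

def clock_resource_cost(demand, capacity, unit_cost=1.0,
                        illegal_factor=ILLEGAL_FACTOR):
    """Cost of using `demand` resources of one kind when `capacity` exist.

    Resources within capacity cost `unit_cost` each; each resource beyond
    capacity (an unavailable one) costs `unit_cost * illegal_factor`.
    """
    legal = min(demand, capacity)
    excess = max(demand - capacity, 0)
    return unit_cost * (legal + illegal_factor * excess)
```

Under these assumptions, a swap that pushes a rib from 4 to 6 clocks when only 4 are available raises the cost from 4 to 24, so the annealer may still accept it at high temperature but is steered away from it as the temperature drops.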
CLOCK-AWARE PLACEMENT RESULTS
This section begins by describing the experimental framework used in this article and then compares the clock-aware placement techniques described in the previous section.
Experimental Framework
The same empirical framework is used in Section 5 and in Section 6. A suite of benchmark circuits is implemented on a user-specified FPGA architecture, using standard academic FPGA CAD tools. The CAD tools consist of the Emap technology mapper [Lamoureux 2005 ], the T-VPack clusterer [Betz 1999 ], the VPR placer (with clock-aware enhancements), and the VPR router [Betz 1999 ]. Note that T-VPack does not need to be enhanced, since it is already clock aware. Finally, the power, area, and delay of each implementation are modeled, using VPR for area and delay and the power model from Poon [2005] , which has been integrated into VPR.
The VPR models are very detailed, taking into account specific switch patterns, wire lengths, and transistor sizes. After generating a user-specified FPGA architecture, VPR places and routes a circuit on the FPGA and then models the power, area, and delay of this circuit. The area is estimated by summing the area of every transistor in the FPGA, including the routing, CLBs, and configuration memory. The delay is estimated using the Elmore delay model as well as detailed resistance and capacitance information obtained from the router. The power is modeled using the capacitance information from the router and externally generated switching activities to estimate dynamic, short-circuit, and leakage power. In this article, the switching activities required by the power model are obtained using gate-level simulation and pseudorandom input vectors.
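As a reminder of what the Elmore model computes, the sketch below evaluates it for a simple RC chain: each node capacitance is multiplied by the total resistance between the driver and that node. This is only a chain-case illustration with hypothetical inputs; the VPR implementation handles routing trees and buffered switches and is not reproduced here.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay from the driver to the far end of an RC chain.

    resistances[i] is the series resistance of segment i (ohms) and
    capacitances[i] is the capacitance at the node after it (farads).
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # total resistance from the driver to this node
        delay += upstream_r * c  # this node's contribution to the delay
    return delay
```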
We enhanced the VPR power and area models to account for the parameterized clock network described in Section 3. Similar techniques to those used in VPR, along with the following buffer-sharing and -sizing assumptions, were used to model the clock networks.
1. Each switch is implemented using a transmission gate controlled by a SRAM cell. The transmission gate consists of one minimum-sized NMOS transistor and one 2X PMOS transistor, in parallel.
2. Shared buffers are used to drive all periphery, spine, rib, and CLB clock network wires.
3. Large cascaded buffers with four stages (1X, 4X, 16X, and 64X) are used to drive the periphery, spine, and rib wires, and smaller cascaded buffers with three stages (1X, 4X, and 16X) are used to drive the CLB clock wires.
4. Large (64X) noninverting repeaters are used to drive very long wires and are spaced by 40 FPGA tiles.
5. Unused clock networks are turned off to reduce power consumption.
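An example of clock-network switches and buffers is illustrated in Figure 15. Note that these buffer-sizing and -sharing assumptions differ from those described in our earlier work [Lamoureux 2006], in which the buffers were smaller but not shared and the repeaters were spaced much more closely, only 1 FPGA tile apart. The new assumptions reduce the power, area, and skew of the clock network by reducing the number of buffers and repeaters and the overall amount of parasitic capacitance.

Table II (excerpt). Benchmark circuits: number of LUTs; flip-flops (FFs) in the subcircuits, between subcircuits, and in total; primary inputs (PIs) in the subcircuits and in the benchmark circuit; primary outputs (POs) in the subcircuits and in the benchmark circuit.

| Circuit | LUTs | FFs in subcircuits | FFs between subcircuits | FFs total | PIs in subcircuits | PIs in benchmark | POs in subcircuits | POs in benchmark |
|---|---|---|---|---|---|---|---|---|
| 30mixedsize | 12537 | 1343 | 2056 | 3399 | 1163 | 150 | 993 | 133 |
| 30seq20comb | 16398 | 1342 | 1642 | 2984 | 992 | 191 | 803 | 165 |
| 3lrg50sml | 10574 | 2861 | 1572 | 4433 | 925 | 156 | 904 | 155 |
| 40mixedsize | 12966 | 2758 | 1820 | 4578 | 1071 | 161 | 941 | 146 |
| 4lrg4med16sml | 11139 | 3278 | 1748 | 5026 | 1000 | 126 | 1006 | 133 |
| 4lrg4med4sml | 8873 | 2598 | 1270 | 3868 | 739 | 107 | 721 | 105 |
| 50mixedsize | 16477 | 3685 | 1996 | 5681 | 1177 | 179 | 1155 | 177 |
| 5lrg35sml | 11397 | 2222 | 1424 | 3646 | 771 | 60 | 940 | 231 |
| 60mixedsize | 15246 | 1731 | 3486 | 5217 | 1900 | 198 | 1479 | 168 |
| 6lrg60sml | 15838 | 3673 | 4364 | 8037 | 2388 | 218 | 1554 | 165 |
| 70s1423 | 13440 | 5180 | 2128 | 7308 | 1260 | 251 | 420 | 124 |
| 70sml | 9301 | 1134 | 1264 | 2398 | 801 | 170 | 735 | 162 |
| lotsaffs | 7712 | 3123 | 632 | 3755 | 367 | 51 | 477 | 162 |
| lotsaffs2 | 9309 | 3609 | 958 | 4567 | 597 | 118 | 616 | 138 |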
Benchmark Circuits.
In order to empirically investigate new FPGA clock-network architectures and clock-aware CAD tools, benchmark circuits with multiple clock-domains are needed. Existing academic benchmark circuits, however, are small and have only one clock. Moreover, since developing system-level circuits with multiple clock-nets is labor intensive and expensive, commercial vendors are reluctant to release their intellectual property.
As a solution, we developed a technique to combine a number of single-domain benchmark circuits to form larger multidomain circuits; the latter resemble system-level circuits from a place-and-route viewpoint. To connect the circuits, we begin by determining the number of primary inputs (N PI ) and primary outputs (N PO ) the metacircuit should have, using Rent's rule [Landman 1971]. Rent's rule relates the number of pins (N p ) to the number of gates (N g ) in a logic design, using the expression

$$N_p = K_p \, N_g^{\beta}$$
where β is Rent's constant and K p is a proportionality constant. The values of the two constants vary for different circuit types. Specifically, β is lower for circuits with pin counts that do not increase quickly when the size of the circuit increases, such as static memory, and K p is lower for circuits with fewer pins in general. In Bakoglu [1990], an empirical study found that β varies between 0.12 and 0.63, and K p varies between 0.82 and 6 for chip-level designs. For circuits implemented on gate arrays, β was 0.5 and K p was 1.9. These are the values used in this work. After determining the number of primary inputs and outputs, all sinks and sources of the metacircuit are listed in arrays, as illustrated in Figure 16. Each sink is then assigned a source using the algorithm in Figure 17. Two synchronizing flip-flops are inserted between connections from IP outputs to IP inputs, since the IP cores use different clocks.
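As a worked example, Rent's rule in its usual form, N p = K p · N g^β, can be evaluated directly with the gate-array constants quoted above. The gate count of 10,000 is a hypothetical input, and the sketch does not attempt to split the pin count into primary inputs and outputs, since that step is not specified here.

```python
def rent_pins(n_gates, k_p=1.9, beta=0.5):
    """Estimate the pin count of a design from its gate count (Rent's rule).

    Defaults are the gate-array constants quoted in the text.
    """
    return k_p * n_gates ** beta

# Hypothetical metacircuit with 10,000 gates: 1.9 * sqrt(10000) = 190 pins.
estimated_pins = rent_pins(10_000)
```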
Using this technique and benchmark circuits from MCNC, several large benchmark circuits with multiple clock domains were generated. Table II lists the new benchmark circuits and provides additional details regarding circuit size and the number of clock domains. Specifically, column 2 specifies the total number of LUTs in each benchmark circuit; columns 3 to 5 specify the number of flip-flops in the subcircuits, between subcircuits, and in total, respectively; columns 6 and 7 specify the number of primary inputs in the subcircuits and in the benchmark circuit, respectively; and columns 8 and 9 specify the number of primary outputs in the subcircuits and in the benchmark circuit, respectively. Note also that the benchmark names are intended to provide some insight regarding the size of clock domains. As an example, 1lrg40sml consists of one large circuit and 40 small ones.
The FPGA architecture assumed in the experiments is that described in Betz [1999], which consists of logic blocks with 10 logic elements each and a segmented routing fabric with length-4 wires. For the clock network, we assume the baseline clock architecture described in Table I, which is similar to current commercial architectures.
For each experiment, the size of the FPGA is determined by the size of the benchmark circuit. Specifically, the size is determined by finding the smallest two-dimensional array of FPGA tiles (nx × ny) with enough logic blocks to implement the benchmark circuit. For general routing (not clock routing), the channel width is selected by finding the minimum routable channel width and then adding 20% to this width. These assumptions serve to model the case where FPGAs are highly utilized.
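The sizing rules can be sketched as below. Two assumptions are made that the text does not state: the array is taken to be square (nx = ny), and the 20% channel-width margin is rounded up to a whole track. The function names are hypothetical.

```python
import math

def smallest_square_array(n_logic_blocks):
    """Side length of the smallest square tile array holding n_logic_blocks.

    Assumption: nx == ny. Requires n_logic_blocks >= 1.
    """
    return math.isqrt(n_logic_blocks - 1) + 1  # integer ceil(sqrt(n))

def padded_channel_width(min_routable_width):
    """Minimum routable channel width plus 20%, rounded up (assumption)."""
    return (min_routable_width * 12 + 9) // 10  # integer ceil(1.2 * w)
```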
Placement Results
To compare the techniques for each of the three enhancements, we implemented eight different clock-aware placers (one for each possible combination). The eight placers are described in Table III . As an example, the first placer (placer 1) uses the standard function for the clock term in the cost function, the static approach to assign global clock network resources, and the preplacement approach to legalize the placement. Note that placers 3 and 7 assign global clock network resources dynamically during the preplacement, while placers 4 and 8 assign global clock network resources dynamically during the actual placement, since there is no preplacement in the latter implementations.
Table IV presents the overall energy per cycle dissipated by each benchmark circuit when implemented by the eight different clock-aware placers. The average is calculated using the geometric mean (rather than the arithmetic mean) to ensure that each benchmark circuit contributes evenly to the average, regardless of its size. Moreover, the averages only include those benchmark circuits that were successfully implemented by every placer, to make the results comparable.
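The averaging procedure can be sketched as follows: only circuits that every placer implemented successfully are kept, and the geometric mean is taken over those. The data layout, with `None` marking a failed placement, and the function name are assumptions for illustration; no values from Table IV are used.

```python
import math

def common_geomean(energy):
    """Geometric mean per placer over circuits that every placer completed.

    energy: dict placer -> dict circuit -> energy per cycle (None on failure).
    Requires at least one circuit that all placers completed.
    """
    completed = [{c for c, e in results.items() if e is not None}
                 for results in energy.values()]
    common = set.intersection(*completed)
    return {
        placer: math.exp(sum(math.log(results[c]) for c in common) / len(common))
        for placer, results in energy.items()
    }
```

Because the geometric mean averages logarithms, halving the energy of a small circuit moves the mean by the same ratio as halving the energy of a large one, which is why each circuit contributes evenly.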
A number of observations can be drawn from Table IV. First, the placer fails to find a legal solution in some cases. There is no guarantee that a legal placement will be found, even when the placer starts by looking for a legal solution. In this experiment, placers P1 and P2 failed to find a legal placement for the 10lrg, 70s1423, and 8lrg benchmark circuits, and placers P5 and P6 failed to find a legal placement for 70s1423. These placers all use the static approach to assign local and global clock resources. In terms of energy, those placers that use the static approach (P1, P2, P5, and P6) also performed worse than the other placers (P3, P4, P7, and P8), which use the dynamic approach. Intuitively, assigning global resources dynamically works better, since more information is available when the resources are being assigned; the placer can determine which nets are more widely dispersed and minimize the number of global clock resources used. Another notable observation is that energy efficiency is similar when the standard and gradual cost functions are used. On average, the energy results of placers P1 to P4 correspond closely to those of P5 to P8. However, the results suggest that the gradual cost function works better for finding a legal solution. Placers that use the standard function failed in seven cases, compared with only two cases for placers that use the gradual function.
Finally, the third observation is that legalizing during preplacement works as well as during placement. In fact, when static assignment is used, the preplacement results are slightly better. The preplacement does not seem to lock the placement in local minima, probably because the preplacement goal of placing logic together within individual clock domains does not conflict with the goals of final placement.
To examine the placement techniques further, Table V compares the average energy of the clock and general-purpose routing resources, as well as the average critical-path delay. The table demonstrates that clock constraints can have a significant impact on the energy dissipated by clock networks and general-purpose routing resources. As expected, clock-aware implementations dissipate significantly less clock power than the original (non-clock-aware) implementations, since the former minimize clock usage. On the other hand, the table also shows that clock-aware implementations dissipate more power within general-purpose routing resources. This is especially true for placers P1 to P4, which use the standard clock cost function. This effect is reduced for placers P5 to P8, which use the gradual clock cost function. In terms of speed, clock-aware placement techniques only have a small impact, with average critical-path delay variations of approximately 1% in either direction. Finally, in this experiment, placer 8 produces the best overall placements, with the lowest overall energy per cycle and the second-fastest average critical-path delay. Specifically, placer 8 is 4.7% more energy efficient than the original VPR placer (which does not produce legal placements), with no increase in average critical-path delay.
CLOCK NETWORK ARCHITECTURE RESULTS
In the previous section, we compared different clock-aware placement techniques. In this section we use the best clock-aware placer from the previous section (placer 8) to compare how different clock network parameters affect overall power, area, and delay. Specifically, we consider four clock network parameters: W lb , W rib , W global , and the number of clock regions (nx region, ny region). For each experiment, we vary one parameter at a time. Our goal is not to exhaustively measure the cost of every possible combination of parameters, but rather to examine the impact that each clock network constraint has on the overall power, area, and critical-path delay of system-level applications.
Clocks per Logic Block (W lb )
We first consider the flexibility within the logic blocks by varying W lb , the number of clocks per logic block. Intuitively, the larger this number, the simpler the task for the clustering algorithm (since the constraint on the number of clocks per logic block is not as severe); however, the larger W lb , the more power-hungry the clock network. Specifically, we would expect that limiting the number of clocks per logic block would reduce the packing efficiency of the clusterer, since this limits which blocks can be packed together. To verify this, we packed the benchmark circuits while varying the number of clock signals per logic block between 1 and 3. Table VI presents the results. The table shows that increasing the number of clocks per logic block increases packing efficiency. However, the efficiency only increases by 1.2% when W lb is increased from 1 to 3. The main reason for this negligible impact is that the efficiency is already close to 100% when W lb is limited to 1, which leaves little room for improvement.
We next consider how W lb affects the area of the clock network. Figure 18 shows a breakdown of the clock network area relative to the overall FPGA area. We varied W lb from 1 to 10 and used the baseline value from Table I for the remaining parameters. As shown, the area due to the logic block-to-logic element (LB-LE) switches increases significantly as W lb increases. Recall that a full-crossbar network is assumed within the logic block. Therefore, increasing W lb by 1 adds 1 extra switch for every logic element in the FPGA.
The area due to the rib-to-logic block (RIB-LB) switches increases as W lb increases from 1 to 6, and then decreases as W lb increases from 7 to 10. This is because the number of switches in a concentrator network is smallest when the number of outputs is either very small or close to the number of inputs. Hence, the incremental area cost of increasing W lb is small when W lb is more than half of W rib (the number of clocks in each rib).
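The rise-then-fall behavior can be reproduced with a simple count. The sketch below assumes the m(n − m + 1) switch count that is commonly cited as the minimum for an n-input, m-output sparse crossbar concentrator; the article's own switch-count expression is not shown here, and the value of 12 inputs is a hypothetical stand-in for W rib.

```python
def concentrator_switches(n_inputs, m_outputs):
    """Switch count of an n-to-m sparse crossbar concentrator.

    Assumption: the m * (n - m + 1) count for sparse crossbar concentrators.
    """
    if not 1 <= m_outputs <= n_inputs:
        raise ValueError("need 1 <= m_outputs <= n_inputs")
    return m_outputs * (n_inputs - m_outputs + 1)

W_RIB = 12  # hypothetical number of rib clocks, not a value from Table I
switches_per_w_lb = {w_lb: concentrator_switches(W_RIB, w_lb)
                     for w_lb in range(1, 11)}
```

Under these assumptions the count peaks when the number of outputs is about half the number of inputs and shrinks again as the outputs approach the inputs, which matches the trend described for the RIB-LB switches.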
Note that the preliminary clock network area results presented in Lamoureux [2006] differ from those in this article, since we use different buffer sizing and sharing assumptions. Although the conclusions we draw are similar, the area overhead of the clock network is smaller when the new assumptions are modeled. As an example, the area of the baseline clock network described in Table I accounts for 4.0% of the overall area (in this article) as compared to 7.2% in Lamoureux [2006].
Finally, we consider how the W lb constraint affects the energy per clock cycle and the critical-path delay. Table VII presents the overall energy for each circuit, and Table VIII gives a breakdown of this energy and shows how critical-path delay is affected when W lb is varied from 1 to 3.
The results indicate that increasing the number of clocks per logic block actually increases (rather than decreases) the overall energy and critical-path delay. Intuitively, increasing W lb should decrease energy and delay, since the clusterer can pack logic elements more efficiently when logic elements with different clocks can be packed together. However, both tend to increase. Upon further inspection, the average critical-path delay increases because flip-flops from different clock domains are sometimes packed in the same logic block (during clustering) and the corresponding clock domains are placed in different regions of the FPGA (during placement). When this occurs, flip-flops end up being placed far apart, which leads to increased critical-path delay. This issue could likely be resolved after placement by moving or duplicating those logic elements that are causing the delay increase, as described in Schabas [2003]; however, this feature is not supported in our experimental CAD flow. The overall energy increases for two reasons. First, the added logic-block clocks add a significant amount of parasitic capacitance to the rib tracks in the clock network. Second, greater availability of logic block clocks increases the usage of logic-block and rib-clock resources. Increasing W lb from 1 to 2 increases clock energy by 24.5%, which overshadows the 7.0% energy savings obtained in the routing resources.
Clocks per Rib (W rib )
We next consider the impact of varying the number of wires in each rib within a region. Intuitively, the placement tool has to ensure that the total number of clocks used by all logic blocks lying on a single rib is no larger than W rib . The higher this value, the easier the placement task becomes; however, a larger value of W rib means the clock network will be larger and consume more power. Figure 19 illustrates the clock area when W rib is varied from 6 to 14 and the baseline values from Table I are used for the remaining parameters. The figure shows that the clock network area increases significantly as W rib increases. This is due to an increase in the number of rib-to-logic block (Rib-LB) switches and spine-to-rib (Spine-Rib) switches. The area of the remaining clock network connections stays unchanged and is mostly due to logic block-to-logic element (LB-LE) switches.
Table IX presents the overall energy per cycle when W rib is varied from 6 to 14. The results from this table show that the overall energy decreases only slightly when W rib is increased. In most cases, the majority of the savings are gained when W rib is increased by 1 or 2 tracks beyond the minimum W rib that produces a legal solution. Likewise, Table X, which gives a corresponding breakdown of energy dissipation and critical-path delay with respect to W rib , shows that the energy savings obtained when W rib is increased originate from the savings obtained in the general-purpose routing. Unlike the previous case (where W lb was increased), the energy savings obtained in general-purpose routing are not overshadowed by the increase in clock energy dissipation. In terms of critical-path delay, Table X shows that W rib does not have a significant impact.
Global Channel Width (W global )
In this section, we consider the impact, in terms of area, power, and delay, of changing the number of clock wires in the spine of the regions (W local + W global ). Intuitively, a wider spine means more clocks can be distributed within a given region, which makes placement easier but increases the area and power consumption of the clock network. Figure 20 illustrates the clock network area when W local + W global is varied from 40 to 120, keeping a 1:1 ratio between W local and W global (matching the baseline architecture in Table I). The graph shows that increasing the number of clock wires in the spine does increase the area of the clock network, but not as significantly as does increasing the W lb and W rib parameters. Most of this area increase comes from the increase in the number of spine-to-rib (Spine-Rib) switches. The number of switches between the clock sources and the spine (Pad to Local and Pad to Global) also increases, but this area is negligible and not apparent in the figure.
Table XI presents the average overall energy, clock energy, routing energy, and critical-path delay when the number of global clocks (W global ) is varied between 0 and 40 and the baseline values are used for the remaining parameters. Results when W global is 0 and 4 are not shown, since legal placements could not be found for every circuit. This emphasizes the importance of providing sufficient global clock network resources. The table also shows that (so long as a legal solution can be found) the number of global clock network resources only has a small impact on overall energy, clock energy, routing energy, and critical-path delay. Intuitively, the impact on energy is small, since most of the power dissipated by the clock network is dissipated in the ribs and logic blocks (not in the global clock network resources). Therefore, since W global has only a small impact on area, power, and speed, clock networks should be designed with enough global clock resources to ensure that all target applications can be legally placed. These results follow the general trend that clock network flexibility is beneficial and less expensive when close to clock sources (clock pads or internal sources) and becomes less beneficial and more expensive when close to clock sinks (flip-flops).
Number of Clock Regions (nx region, ny region)
Finally, this section examines how the number of clock regions affects power, area, and delay. As described earlier, we assume the FPGA is broken down into nx region × ny region regions, each containing its own set of local clocks. A clock network with many clock regions is suitable for implementing applications with many clock domains. In the best case, we can map each domain of a user circuit to a single region of the FPGA. Intuitively, this would save power because each clock would only be routed to a minimum number of clock regions. Figure 21 shows a breakdown of the clock network area when the number of clock regions is 1, 4, 9, and 16 and the baseline values from Table I are used for the remaining parameters. The figure illustrates that increasing the number of clock regions increases the area fairly significantly and that most of this area increase is due to the increase in the number of spine-to-rib (Spine-Rib) switches.
Note, however, that as we increase the number of regions and therefore reduce their size, the number of clock domains of the user circuit that are mapped to each region decreases. In the case where each user clock can be placed within one clock region, the number of clocks required in each region is inversely proportional to the number of regions. To examine this effect, Figure 22 shows a breakdown of the clock network area when the number of clock regions is 1, 4, 9, and 16, W local is 64, 16, 7, and 4, respectively, W global is 16, and the baseline values from Table I are used for the remaining parameters. Figure 22 shows that the area of the clock network does not increase significantly when the number of clock regions is increased, so long as the number of clocks in the spines is roughly inversely proportional to the number of clock regions.
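The W local values used for Figure 22 (64, 16, 7, and 4) are consistent with dividing a budget of 64 local clocks by the number of regions and rounding to the nearest integer. The sketch below reproduces that scaling; the rounding rule and the function name are assumptions, since the text states only the resulting values.

```python
def local_clocks_per_region(n_regions, total_local_clocks=64):
    """Scale W_local inversely with the number of clock regions.

    Assumption: round to the nearest integer, with a minimum of 1.
    """
    return max(1, round(total_local_clocks / n_regions))

w_local_by_regions = {n: local_clocks_per_region(n) for n in (1, 4, 9, 16)}
```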
Finally, we consider how the number of clock regions affects the overall energy. Tables XII and XIII present the overall energy and a breakdown of energy and critical-path delay, respectively, when the number of clock regions is varied and the baseline values are used for the remaining parameters. Note that in the following experiments, the size of clock regions varies with circuit size. In commercial devices, the size of clock regions is usually fixed (to facilitate layout) and the number of clock regions varies depending on FPGA size. However, since our benchmark circuits do not vary widely in size, we believe the conclusions are consistent. Table XII reports two different geometric means. The first value is averaged over every benchmark and the second value is averaged over those benchmark circuits with valid results in each of the four experiments. Intuitively, increasing the number of clock regions increases the amount of circuitry needed to physically implement the local connections from clock sources on the periphery of the FPGA to the center of each region. At the same time, however, increasing the number of clock regions decreases the size of each region, which reduces the number of tracks used within clock regions and ribs. Table XII shows that increasing the number of clock regions significantly reduces overall energy, with average savings of 14.6% when a clock network with 25 regions is used instead of the baseline clock network, which has 4 regions. Moreover, the table also shows that reducing the number of clock regions from 4 to 1 increases overall energy by nearly 30% and makes finding a legal placement more difficult, with a total of 5 failed placements. In each of these cases, the placer failed to find a legal placement since the demand for rib tracks increased beyond the baseline value. Finally, Table XIII shows that the overall energy savings are due for the most part to clock energy savings and that increasing the number of clock regions does not have a significant impact on the average critical-path delay.
CONCLUSIONS AND FUTURE WORK
This article presented a new framework for describing FPGA clock networks, described new clock-aware placement techniques, and examined how FPGA clock networks as well as the constraints they impose on CAD tools affect the overall power, area, and speed of FPGAs.
The framework, which describes a wide range of FPGA clock networks, is key in this research, since it provides a basis for comparing new FPGA clock networks and clock-aware CAD techniques. The challenge in creating such a framework was to make it flexible enough to describe as many different "reasonable" clock network architectures as possible, and yet be as concise as possible. In our model, we describe a clock network using seven parameters.
After describing the framework, new clock-aware placement techniques were described that satisfy the placement constraints imposed on the clock network. Several useful clock-aware placement techniques were found. Specifically, we found that simulated annealing can be used to satisfy placement constraints imposed by the clock network (to legalize placement). Moreover, we found that the cost function used to minimize the usage of clock network resources and legalize the placement is important. We introduced a gradual cost function which produced implementations as energy efficient as those produced using a standard costing approach, but which aided in finding legal placements. Finally, for FPGAs with local and global clock network resources, we found that using a dynamic approach to assign global clock resources improves both the overall results and the likelihood of finding legal placements.
Using these clock-aware techniques, an empirical study was then performed to examine the impact of the constraints imposed by the clock network. A number of important observations were made. First, the clock network should be more flexible near clock sources and less so near the logic elements in order to minimize power. The results showed that adding flexibility near clock sources (near the pads) only slightly increases area and has little effect on overall energy, but makes finding legal placements significantly more straightforward. Moreover, adding flexibility near the logic elements significantly increases area, has little effect on overall energy, and can have a negative impact on critical-path delay. Another important observation is that dividing FPGA clock networks into smaller regions only slightly increases area, but significantly reduces overall energy when implementing applications with many clock domains. On average, those with 2×2 clock regions dissipated 14.6% more power than those with 4×4 clock regions for the benchmark circuits in our study.
