POWER-AWARE TECHNOLOGY MAPPING AND ROUTING FOR DUAL-VT FPGAS by LOKE WEI TING
  

















A THESIS SUBMITTED 
FOR THE DEGREE OF MASTER OF ENGINEERING 
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING 






In this thesis, we present a technology mapping and clustering scheme, as well as a novel 
interconnect routing architecture, for modern FPGAs with programmable dual-VT fabrics. 
The use of Reverse Back Bias (RBB) in circuit design is recognized today as a feasible 
strategy for mitigating leakage power, a critical issue as process technologies shrink 
relentlessly towards sub-nanometre proportions. FPGAs with the ability to adjust fabric VT at 
configuration time offer the ability to reduce leakage power with minimal or no 
sacrifice to circuit speed. 
 
Today, Altera’s Stratix-III/IV line of FPGAs already demonstrates the feasibility of a similar 
architecture, but with dual-VT optimization limited to post-P&R. We explore the limitations 
of such an approach. We also discuss why a dual-VT solution is superior to a dual-VDD one, 
an architecture adopted by some of the existing works in academia. Together, these form the 
basis for the contributions presented in this thesis. 
 
The first work presents RBBMap, a power-aware, dual-VT technology mapping tool, and 
RBBPack, a dual-VT logic clustering tool. Compared against the existing baseline tool Emap, the 
combined use of RBBMap and RBBPack yields average savings of 70.95% in logic block 
leakage power and 28.30% in total power. The second work explores a completely 
new domain: a programmable, dual-VT switch box routing architecture. This work holds 
promise in mitigating leakage power in the interconnect, the largest constituent component 
of the FPGA, yielding average savings of 53.69% in routing leakage power and 28.23% in total power. 





First of all, I would like to take this opportunity to thank my family for their heartfelt support 
in my pursuit of this Master’s Degree, without which this work would never have been 
possible. I especially want to thank my fiancée, Xianpin, who has ridden rough tides and 
distances to be my pillar of support today. 
I am immensely grateful to my supervisor, Asst. Prof. Ha Yajun, for his mentorship and 
guidance. Where I am today is due entirely to his sincerity and commitment. For this reason I 
will forever consider him to be an astute educator and my life mentor. 
I would also like to thank my fellow research group members for our numerous fruitful 
discussions and exchanges: Wenfeng, Xiaolei, Rizwan and Yu Heng. Wenfeng in particular 
was instrumental to the circuit feasibility study of the back bias architecture presented in this 
work. Special thanks to my closest friends, for being there for me every time. 
Extended appreciation goes out to Julien Lamoureux (Emap / Tektronix), Jason Luu (VPR / 
University of Toronto), Alan Mishchenko (ABC/UC Berkeley), Peter Jamieson (Odin-II / 
Miami University in Ohio) and Kara Poon (VPR Power Model), all of whom have contributed 
in some way to the framework supporting this work. 
Last but not least, I want to thank my company, Xilinx Inc., for providing financial assistance, 




TABLE OF CONTENTS 
ABSTRACT 
ACKNOWLEDGEMENT 
Chapter 1 INTRODUCTION 
1.1. Research Goals 
1.2. Approach 
1.3. Achievements 
1.4. Thesis Organization 
Chapter 2 Background and Related Works 
2.1. Modern FPGA Architectures 
2.1.1. Floorplan 
2.1.2. Configurable Logic Blocks 
2.1.3. Interconnect Network 
2.1.4. Global Network 
2.2. FPGA EDA 
2.2.1. Technology Mapping 
2.2.2. Logic Block Packing 
2.2.3. Placement 
2.2.4. Routing 
2.3. Related Works 
2.3.1. Dual-Voltage Fabric 
2.3.2. Dual-VDD Technology Mapping and Clustering 
2.3.3. Low Power Routing 
2.4. Focus and Contribution 
Chapter 3 Dual-VT Architectural Enhancements and Experimental Methodology 
3.1. Basis Architecture 
3.2. Architectural Enhancements 
3.2.1. Dual-VT Configurable Logic Block 
3.2.2. Dual-VT Switch Box Architecture 
3.3. Circuit Implementation and Area Overhead Analysis 
3.3.1. Reverse Back Bias Voltage Generation 
3.3.2. Configurable Logic Block 
3.3.3. Switch Box 
3.4. Experimental Methodology 
3.4.1. Discussion 
3.4.2. Dual-VT Technology Mapping and Clustering 
3.4.3. Dual-VT Switch Box Routing 
3.5. Power and Delay Modeling 
3.5.1. Elmore Delay Model 
3.5.2. Power Model 
Chapter 4 Power and Cluster Architecture-Aware Dual-VT Technology Mapping and Clustering 
4.1. Discussion 
4.2. RBBMap 
4.2.1. Cost Function 
4.2.2. Timing Analysis 
4.2.3. Slack Reclamation and Distribution Scheme 
4.2.4. Cluster Architecture Delay Annotation 
4.2.5. VT Migration 
4.2.6. Overall Algorithm 
4.3. RBBPack 
4.4. Experimental Results 
Chapter 5 Dual-VT Switch Box Pathfinder Routing Algorithm 
5.1. Discussion 
5.2. Dual-VT Switch Box Pathfinder Algorithm 
5.2.1. Motivation 
5.2.2. Net Criticality Ranking 
5.2.3. Switch Box Cost Enhancement 
5.2.4. Unused Switch Boxes 
5.2.5. Overall Algorithm 
5.3. Experimental Results 
5.3.1. Criticality Threshold Investigation 
5.3.2. Final Results 
5.4. Summary 
Chapter 6 Conclusion and Future Works 
6.1. Conclusion 
6.2. Future Works 
Appendix A Circuit Implementation Schematics 
A.1. Configurable Logic Block (CLB) Top 
A.2. CLB containing Basic Logic Elements 
A.3. SRAM Cell 
A.4. MUX 
A.5. Look-up Table (LUT) 
A.6. LUT with SRAM 
A.7. Sequential Logic within BLE 
A.8. Negative Charge Pump 
A.9. Voltage Doubler 





List of Figures 
 
Figure 2.1: Example of commercial FPGA floorplans. 
Figure 2.2: Basic Logic Element comprising one 4-input LUT, DFF, and mux. 
Figure 2.3: General structure of a Configurable Logic Block. 
Figure 2.4: The Subset, Wilton and Universal Switch Box topologies 
Figure 2.5: Switch Box and Connection Box depopulation example [9] 
Figure 2.6: General FPGA EDA Flow. 
Figure 2.7: Transforming a 4-feasible cut into a 4-input LUT 
Figure 2.8: Node Duplication 
Figure 2.9: A switching activity-aware mapper encapsulates high activity edges 
Figure 2.10: A non-switching activity-aware mapper may produce the above mapping 
Figure 2.11: The Stratix LAB architecture with body bias set by a CRAM 
Figure 2.12: Post P&R slack reclamation (a) No clustering, (b) with clustering 
Figure 2.13: Optimization loss in dual-VDD mapping 
Figure 3.1: Proposed Programmable Reverse Back Bias Configurable Logic Block Architecture 
Figure 3.2: Unidirectional single driver interconnect 
Figure 3.3: Connection box structure 
Figure 3.4: Proposed Programmable Reverse Back Bias Switch Box Architecture. 
Figure 3.5: Switch Capacitor Charge Pump Implementation for Reverse Back Bias Voltage Generation. 
Figure 3.6: Reverse Back Bias Control MUX Implementation 
Figure 3.7: Experimental Flow for Dual-VT Technology Mapping and Clustering (RBBMap) 
Figure 3.8: Experimental Flow for Dual-VT Switch Box Routing 
Figure 4.1: Transforming a mapped network into its corresponding timing graph. 
Figure 4.2: Intra-cluster delay within a CLB (highlighted in blue). 
Figure 4.3: Mapped network with annotated LUT and intra-cluster delays. 
Figure 4.4: Migration of VTH LUTs to VTL. 
Figure 4.5: Pseudocode for the RBBMap algorithm. 
Figure 4.6: Pseudocode for the RBBPack algorithm. 
Figure 5.1: Route extension due to switch box avoidance scheme 
Figure 5.2: Critical connection skipping through an unset switch box 
Figure 5.3: Pseudocode for the Dual-VT Switch Box Pathfinder Algorithm 
Figure 5.4: Plot of Critical Path Delay vs Criticality Threshold – Rank by Max Criticality 
Figure 5.5: Plot of Routing Leakage Power vs Criticality Threshold – Rank by Max Criticality 
Figure 5.6: Plot of Critical Path Delay vs Criticality Threshold – Rank by Weighted Average Criticality 
Figure 5.7: Plot of Routing Leakage Power vs Criticality Threshold – Rank by Weighted Average Criticality 
 
List of Tables 
 
Table 4.1: Comparison of converted nodes against reclamation schemes. 
Table 4.2: Comparison of Logic Block Leakage Power, Total Power and Delay of Emap against RBBMap/RBBPack. 
Table 5.1: Delay vs Criticality Threshold (0.6 to 0.78) – Rank by Max Criticality 
Table 5.2: Delay vs Criticality Threshold (0.8 to 0.98) – Rank by Max Criticality 
Table 5.3: Critical Path Delay Standard Deviation, Average and Percentage Standard Deviation – Rank by Max Criticality 
Table 5.4: Routing Leakage Power vs Criticality Threshold – Rank by Max Criticality 
Table 5.5: Leakage Power Standard Deviation, Average and Percentage Standard Deviation – Rank by Max Criticality 
Table 5.6: Delay vs Criticality Threshold – Rank by Weighted Average Criticality 
Table 5.7: Critical Path Delay Standard Deviation, Average and Percentage Standard Deviation – Rank by Max Criticality 
Table 5.8: Routing Leakage Power vs Criticality Threshold – Rank by Weighted Average Criticality 
Table 5.9: Leakage Power Standard Deviation, Average and Percentage Standard Deviation – Rank by Weighted Average Criticality 
Table 5.10: Results for our RBB switch box enhancements versus single VT 
Table 5.11: Ratio of VTH switch boxes set during routing versus total number of switch boxes 







List of Abbreviations 
 
“FPGA”: Field Programmable Gate Array 
“ASIC”: Application Specific Integrated Circuit 
“ASSP”: Application Specific Standard Product 
“EDA”: Electronic Design Automation 
“CAD”: Computer Aided Design 
“MVCMOS”: Multi-Voltage Complementary Metal Oxide Semiconductor 
“RBB”: Reverse Back Bias 
“CLB”: Configurable Logic Block 
“BLE”: Basic Logic Element 
“ALM”: Adaptive Logic Module 
“LAB”: Logic Array Block 












The management of power consumption has become an imperative for semiconductor 
vendors and customers. Despite efforts at multiple levels [1], the trend of ever-increasing 
leakage power does not appear to be letting up [2], [3]. At the same time, 
prohibitive mask costs have become a persuasive force for ASIC vendors considering a 
transition to FPGA-based solutions, which sit well with their traditional needs for 
quick prototyping and rapid turnaround. Yet, the power problem in FPGAs is one of 
the most challenging issues to tackle due to the native requirement of programmability, 
and is undoubtedly the single biggest obstacle to their penetration of the mobile market. 
With static power ready to overtake dynamic power at the 28nm node and below, 
architectural and EDA-level enhancements are necessary if FPGAs are to remain a 
competitive option. 
 
To some extent, combined efforts in the form of novel circuit design techniques and 
process technology advancements have helped to keep static power manageable in the 
last decade. Some examples of the latter are high-K metal gate and MVCMOS 
technology. MVCMOS in particular is comparatively easier to deploy in ASICs 
because circuit slacks are known during the design phase, so deployment involves only a 
choice of cells in the technology library used. Such is not the case in FPGAs, where critical 
path(s) cannot be determined until a design has been implemented. However, FPGAs 
hold the key to an interesting concept: back-bias programmability. The use of reverse 
back bias (RBB) to adjust transistor VT and hence reduce leakage power without 
compromising circuit speed has shown some promise in recent years [4], [5], and has 
been adopted by the industry for several years now [6]. With SOI processes steadily 
maturing, such a design paradigm looks set to continue [7]. 
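
As a rough illustration of why raising VT through reverse back bias reduces leakage, the sketch below combines the standard textbook body-effect expression for threshold voltage with the exponential dependence of subthreshold current on VT. It is only a first-order sketch: the function names and every device parameter (zero-bias threshold, body-effect coefficient, surface potential, subthreshold slope factor, bias voltage) are illustrative assumptions and are not values taken from this thesis.

```python
import math

BOLTZMANN_OVER_Q = 8.617333262e-5  # k/q in volts per kelvin


def threshold_with_body_bias(vt0, v_sb, gamma=0.4, phi_f2=0.8):
    """First-order body-effect estimate of threshold voltage for reverse bias v_sb >= 0.

    vt0 (V), gamma (V^0.5) and phi_f2 (2*phi_F, in V) are assumed, illustrative values.
    """
    return vt0 + gamma * (math.sqrt(phi_f2 + v_sb) - math.sqrt(phi_f2))


def leakage_ratio(delta_vt, n=1.5, temperature_k=300.0):
    """Subthreshold leakage relative to the unbiased device: exp(-delta_vt / (n * kT/q))."""
    thermal_voltage = BOLTZMANN_OVER_Q * temperature_k
    return math.exp(-delta_vt / (n * thermal_voltage))


if __name__ == "__main__":
    vt_low = threshold_with_body_bias(vt0=0.30, v_sb=0.0)   # no back bias
    vt_high = threshold_with_body_bias(vt0=0.30, v_sb=0.5)  # assumed 0.5 V reverse bias
    ratio = leakage_ratio(vt_high - vt_low)
    print(f"VT shift: {vt_high - vt_low:.3f} V, leakage scaled by {ratio:.3f}")
```

Because the dependence is exponential, even a modest VT shift on blocks with timing slack can remove a large share of their subthreshold leakage, which is the effect the dual-VT fabric exploits.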
 
1.1. Research Goals 
The works presented in this thesis focus on two dual-VT-related architectural and 
algorithmic enhancements: Power and Cluster Architecture-Aware Dual-VT 
Technology Mapping and Clustering, and Dual-VT Switch Box Pathfinder Routing 
Algorithm. Both of these works explore the programmable RBB technique, but in 
different architectural domains. The contributions in this thesis are thus both 
architectural and algorithmic. 
 
The driving motivations behind these works are as follows: 
1. Power and Cluster Architecture-Aware Dual-VT Technology Mapping and 
Clustering: The Dual-VT Configurable Logic Block architecture proposed as 
part of this work is most similar to the Dual-VT Logic Array Block (LAB) 
architecture of Altera’s Stratix-III/IV, a commercial FPGA available in the market today. 
In the CAD flow adopted by Altera, LABs on non-critical paths with sufficient 
slack are converted to low power (high VT) mode only after the user circuit has 
been placed and routed. Such an approach leads to a significant loss in 
optimization space, which we will show in Section 2.3.1. Nonetheless, the fact 
that an FPGA vendor with significant market share is willing to adopt such an 
architecture demonstrates its feasibility in cost and area overhead. We therefore adopt 
as close-matching an architecture as can reasonably be 
implemented in the academic FPGA place-and-route tool, VPR. The limitations 
in Altera’s CAD flow serve as the prime motivator for moving our 
optimization space upwards to the technology mapping level: RBBMap, 
presented in Chapter 4. The objective of this work is to gain an understanding of 
practical feasibility, power gains, and delay impact, if any. 
 
2. Dual-VT Switch Box Pathfinder Routing Algorithm: The interconnect 
network of an FPGA is one of the largest and most complex components of the 
FPGA to analyse. Typically, interconnects take up approximately half the die 
area in an FPGA. Interestingly, despite this large optimization opportunity, there 
is no work to date presenting a treatment of reverse back bias as 
applied to the FPGA interconnect network. This gap is the motivation of this work. 
The goal is similarly to demonstrate the practical feasibility, power gains, and 
delay impact attributed to our architectural and algorithmic enhancements. 
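
The post-place-and-route conversion described in item 1 above can be sketched as a greedy, slack-driven pass over a block-level timing graph. This is a minimal sketch under stated assumptions, not Altera's actual algorithm nor the RBBMap algorithm of Chapter 4: the graph representation, the uniform high-VT delay penalty, the visiting order, and the names `critical_path_delay` and `assign_high_vt` are all hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path_delay(fanin, delay):
    """Longest-path arrival time over a DAG; fanin maps each node to its drivers."""
    arrival = {}
    for node in TopologicalSorter(fanin).static_order():
        latest_input = max((arrival[p] for p in fanin.get(node, ())), default=0.0)
        arrival[node] = latest_input + delay[node]
    return max(arrival.values())


def assign_high_vt(fanin, low_vt_delay, penalty, target=None):
    """Greedily move blocks to high VT while the critical path stays within target.

    penalty is an assumed, uniform extra delay for a block in high-VT mode.
    If target is None, the all-low-VT critical path delay is used.
    """
    delay = dict(low_vt_delay)
    if target is None:
        target = critical_path_delay(fanin, delay)
    high_vt = set()
    for node in sorted(delay, key=str):
        delay[node] = low_vt_delay[node] + penalty  # tentatively slow the block down
        if critical_path_delay(fanin, delay) <= target + 1e-12:
            high_vt.add(node)                       # enough slack: keep it in high VT
        else:
            delay[node] = low_vt_delay[node]        # would violate timing: revert
    return high_vt


if __name__ == "__main__":
    fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
    low_vt_delay = {"a": 1.0, "b": 0.4, "c": 1.0, "d": 1.0}
    print(assign_high_vt(fanin, low_vt_delay, penalty=0.3))
```

In this toy graph only block `b` lies off the critical path with enough slack to absorb the penalty. Since the mapping and clustering are already fixed at this stage, the pass can only exploit whatever slack they happen to leave, which is the limitation that motivates moving the optimization up to technology mapping.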
 
1.2. Approach 
For FPGA architectural investigation works in particular, it is almost always necessary 
to run a benchmark suite of circuits through the entire FPGA EDA flow to understand 
the benefits brought about by the architectural or algorithmic enhancement in question. 
We describe the FPGA EDA flow in Chapter 2. In evaluating our work, we leverage 
existing, established academic tools in the public domain where possible. The baseline 
tools used are ABC [8] for technology mapping, T-VPack [9] for logic block packing 
and VPR 5.0 [10] for placement and routing. For node and cluster activity estimation, 
we use the Activity Estimation Tools [11] and Odin-II [12] respectively; in evaluating 
power, we use the VPR power model [11].  
 
To investigate the gains and impact from the tools developed in this work, we replace 
or enhance each tool in the flow accordingly. More details on our experimental 




Both works in this thesis have yielded excellent results with very few outlier cases. The 
first work, Power and Cluster Architecture-Aware Dual-VT Technology Mapping and 
Clustering,  yielded an average of 70.95% savings in logic block leakage power and 
28.30% savings in average total energy consumption, when compared against the 
baseline tool Emap. Critical paths were observed to be minimally impacted in all but 
one design, with better timing numbers in some cases. 
 
An early version of this work was accepted for Poster Presentation at The 20th 
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 
2012). The complete version is to appear at the upcoming 19th Reconfigurable 
Architectures Workshop (RAW 2012), to be held in May 2012 in Shanghai. 
 
In the second work, Dual-VT Switch Box Pathfinder Routing Algorithm, the 
combination of our RBB switch box architecture and Dual-VT Switch Box Pathfinder 
Algorithm demonstrated excellent results against the baseline VPR flow, yielding 
average savings of 53.69% in leakage power on the routing network alone, and 
28.23% in total power. Average delay impact amounted to just +1.28%. The 
results confirm the feasibility of this novel routing architecture and algorithm in 
tackling leakage power in routing logic through RBB. 
 
The first version of this work recently appeared in the 8th Symposium on Applied 
Reconfigurable Computing (ARC 2012), March 2012, Hong Kong. The complete 
version was recently submitted to the 22nd International Conference on Field 
Programmable Logic and Applications (FPL 2012) in response to its call for papers. 
 
1.4. Thesis Organization 
The rest of this thesis is organized as follows. We first give a background of FPGA 
architecture, EDA, and related works in Chapter 2. In Chapter 3, we present the 
architectural enhancements associated with our works and also the experimental 
methodology that we employ. The two core contributions of this thesis, Power and 
Cluster Architecture-Aware Dual-VT Technology Mapping and Clustering and the 
Dual-VT Switch Box Pathfinder Routing Algorithm, are presented in Chapters 4 and 5 











Field-Programmable Gate Arrays (FPGAs) are a special class of programmable 
integrated circuits whose function is not determined during fabrication. FPGAs are 
normally reconfigurable, allowing them to realize different functions depending on 
needs and requirements. Once used purely as glue logic, FPGAs today are serious 
contenders for application spaces once occupied by ASICs and ASSPs, owing to 
their ability to enable rapid prototyping and quick time-to-market. 
 
In this chapter, we give an overview of the architectural basics of FPGAs and the EDA 
tool chain associated with implementing a user design onto an FPGA. We also 
introduce some recent works relating to dual-voltage FPGAs, and conclude with our 
research focus and contribution. 
 
2.1. Modern FPGA Architectures 
FPGAs comprise five key components: Logic Blocks, Interconnect, I/O, Clocks and 
Configuration Logic. The last is responsible for realizing the configuration of the 
FPGA, which allows any digital circuit function to be implemented subject to 
constraints such as timing and sufficiency of resources. FPGAs today are composed of 
several different types of logic blocks, such as Look-Up Tables (LUTs), embedded memory, 
embedded digital signal processing blocks and transceivers. 
FPGAs comprising only CLBs are termed homogeneous, while FPGAs that also 
contain other types of logic blocks are termed heterogeneous. Today, more advanced 
FPGAs can also contain embedded processor blocks [13]. 
 
The actual configuration of an FPGA is realized by setting the configuration memory 
associated with each programmable component within the FPGA. Several forms of 
configuration memory exist, such as Static RAM (SRAM), antifuse [14] and floating 
gate transistors [15]. SRAM is by far the most ubiquitous in commercial FPGAs.  
 
2.1.1. Floorplan 
Commercial FPGAs today are mostly laid out in an island-style [9] or columnar [16] floorplan. A 
simplified representation of both floorplans, assuming a homogeneous fabric, is as 





























































The island-style floorplan is so-called because logic blocks are laid out in a manner 
resembling islands within a sea of routing logic. The columnar floorplan has all 
components of the FPGA laid out in columns. In both cases, logic blocks are typically 
arranged in a grid-coordinate fashion with horizontal and vertical wires for routing 
(indicated by lines interconnecting switch boxes, or SBs, in the figure). 
 
Generally, the island-style floorplan is picked where wirebond packaging is preferred, 
such as in low-cost, high-volume FPGAs. The opposite is true for the columnar 
floorplan, which is chosen where flip chip packaging is preferred, for example in high-
performance FPGAs. 
 
2.1.2. Configurable Logic Blocks 
The configurable logic block, or CLB, is the heart of the FPGA, in which any digital 
circuit function can be implemented. This is made possible primarily by the look-up 
tables (LUTs) that constitute the CLB. In typical FPGAs, the CLB consists of a cluster 
of basic logic elements (BLEs), each in turn comprising one LUT, one flip-flop, and 
local routing. The schematic of a BLE is shown below. 
 
 











Like the rest of the configuration circuitry, the LUT is programmed using SRAM cells 
that set the values stored in the LUT. A K-input LUT (K-LUT) is capable of 
implementing any K-input function using 2^K SRAM cells. In general, an FPGA with 
larger LUTs is able to implement a digital circuit function using fewer LUTs, thereby 
requiring less routing logic, but a tradeoff exists in the area required to implement large 
LUTs. 
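
To make the 2^K relationship concrete, the sketch below models a K-LUT as a truth table with one entry per SRAM cell, where the LUT inputs simply select which cell is read. The helper names `program_lut` and `evaluate_lut`, and the choice of the first input as the most significant address bit, are illustrative assumptions rather than part of any tool used in this thesis.

```python
from itertools import product


def program_lut(function, k):
    """Return the 2**k configuration bits (one per SRAM cell) for a k-input function."""
    return [int(bool(function(*bits))) for bits in product((0, 1), repeat=k)]


def evaluate_lut(sram_bits, inputs):
    """Read the SRAM cell addressed by the inputs, as the LUT's multiplexer tree would."""
    address = 0
    for bit in inputs:
        address = (address << 1) | int(bool(bit))  # first input is the most significant bit
    return sram_bits[address]


if __name__ == "__main__":
    # An arbitrary 4-input function; any other 4-input function fits the same 16 cells.
    def example(a, b, c, d):
        return (a and b) ^ (c or d)

    bits = program_lut(example, k=4)
    print(len(bits), "SRAM cells:", bits)
    print("f(1, 1, 0, 0) =", evaluate_lut(bits, (1, 1, 0, 0)))
```

Each additional input doubles the number of SRAM cells, which is the area cost that the tradeoff above weighs against the reduction in LUT count and routing.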
 
Within a CLB, BLEs are interconnected with additional intra-cluster routing resources 
that provide fast connections between LUTs in the same CLB. The motivation for this 
is better performance, more efficient LUT utilization, and reduced interconnect 
utilization (inter-cluster routing). Most commercial FPGAs deploy LUT sizes ranging 
from 4 to 6 and CLB sizes ranging from 4 to 10. Figure 2.3 below depicts the general 
structure of a CLB [9]. 
 






















2.1.3. Interconnect Network 
The interconnect is the other key component of an FPGA. This is the 
routing logic responsible for connecting CLBs and I/Os in the FPGA together through a 
process known as routing. The interconnect network comprises wire segments of 
varying lengths and channel directions (horizontal or vertical) and programmable 
switches that realize the connection between these wire segments, as well as between 
wire segments and CLBs. Generally, switches in the former category constitute the 
switch box, while switches in the latter category constitute the connection box. 
 
The design of a good interconnect network is integral to how well an FPGA performs 
in its desired application space [17]. Some of the architectural parameters fundamental 
to interconnect design include switch box topology, connection box topology, switch 
depopulation and wire length distribution [9]. Wires with long reach are desired for 
good speed and reduced routing logic utilization, at the expense of routing flexibility 
and access to nearby CLBs; the opposite is true for short wires. Similarly, different 
switch box and connection box topologies influence routing cost differently. Some 
example switch box topologies and depopulation are shown below. 
 



























Figure 2.5: Switch Box and Connection Box depopulation example [9] 
 
In the past, FPGAs utilized either pass transistors or tri-state buffers for routing. Today, 
modern FPGAs deploy a unidirectional single-driver (UDSD) architecture [22] in 
which each wire is driven at only one point and carries signals in only one direction. 
UDSD is shown to yield an average 25% area improvement and 9% delay 
improvement over multi-driver routing [22], [23]. We stay relevant to modern FPGAs 
by assuming this architecture in our work. 
 
2.1.4. Global Network 
In addition to the interconnect network, an FPGA also contains global networks such 
as the configuration logic and clock networks. Modern FPGAs have global clocks and 
regional clocks capable of handling different clock domains, as well as advanced clock 
management features such as delay-locked loops (DLL) and phase-locked loops (PLL) to 
implement complex memory interface features. In this thesis, a simple H-tree clocking 
network is assumed. 
 
2.2. FPGA EDA 
The FPGA Electronic Design Automation (EDA) tool chain is responsible for mapping 
a digital circuit onto the FPGA of interest. This is generally achieved through the 
sequence of steps as shown in Figure 2.6 below. At the end of the tool chain, the 
programming information for the FPGA, known as the bitstream, is generated. The 
bitstream is typically very large, comprising individual bit programming information 
for all SRAM cells within the FPGA. Commercial FPGA vendors normally encrypt the 
bitstream to protect against reverse-engineering. 
 
Figure 2.6: General FPGA EDA Flow. 
 
In this section, we briefly introduce the EDA tool chain used to map a general digital 
circuit onto FPGAs. We start with technology mapping since logic synthesis involves 
but a transformation of a hardware description language (HDL) into gate 
representation, and is agnostic to the architecture of the FPGA. Greater emphasis is 
also placed on technology mapping since it is related to one of the two main 












 
2.2.1 Technology Mapping 
We first review some of the terminology associated with technology mapping [24] [25] 
[26]. A Boolean network $N$ can be represented as a directed acyclic graph (DAG), 
where gates correspond to nodes in the graph and an edge $(u, v)$ exists where the output 
of gate $u$ is an input of gate $v$. Primary inputs (PIs) have no incoming edges and 
primary outputs (POs) have no outgoing edges. We use $input(v)$ to denote the set of 
nodes that are fanins to gate $v$, and $output(v)$ to denote the set of fanouts of gate $v$. 
$C_v$ denotes a cone of node $v$ in $N$. The maximum cone of $v$ comprising all PI 
predecessors of $v$ is termed the fanin cone of $v$. In a similar manner, $input(C_v)$ 
and $output(C_v)$ denote the input and output nodes of the cone $C_v$. A cut is a 
partitioning $(X, \overline{X})$ of $C_v$ such that $\overline{X}$ is a cone of $v$, with $v$ as 
the root node of the cut. The cut-set $V(X, \overline{X})$ comprises $input(\overline{X})$. 
$\overline{X}$ is K-feasible if the cardinality of its input is smaller than or equal to K. 
The level of a node $v$ is defined as the length of the longest path from any PI to $v$; 
correspondingly, the depth of the network is the largest level amongst all 
nodes $v \in N$. A fanout-free cone (FFC) at $v$ is a cone of $v$ in which no node other 
than the root fans out to a node outside the cone; a maximum fanout-free cone is the 
fanout-free cone with the most nodes. 
 
A cut $(X, \overline{X})$ can be represented as a product term, or p-term, of the variables 
corresponding to nodes in the cut-set $V(X, \overline{X})$. The p-terms constitute a 
Sum-of-Products expression that represents a set of cuts. Cut enumeration is performed 
according to the following theorem [25]: 

$$f(K, v) = \bigotimes^{K}_{u \in input(v)} \left[ u + f(K, u) \right]$$
(2.1) 

where $f(K, v)$ represents all the K-feasible cuts rooted at node $v$; $+$ and $\otimes$ are the 
Boolean OR and AND operations respectively. Each p-term on the RHS of the 
equation represents a possible cut; terms that are not K-feasible are discarded during 
enumeration. 






Figure 2.7: Transforming a 4-feasible cut into a 4-input LUT 
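
The cut enumeration described above can be sketched as follows. This is a minimal illustration in the spirit of the theorem in [25], not the implementation used by any mapper discussed in this thesis: each fanin contributes either itself or one of its own cuts, the contributions of all fanins are merged, and merged cuts with more than K inputs are discarded. The function name `enumerate_cuts` and the small example network are assumptions made for illustration only.

```python
from itertools import product

def enumerate_cuts(fanins, k):
    """Return {node: set of K-feasible cuts}; each cut is a frozenset of nodes.

    `fanins` maps every non-PI node to the list of its fanin nodes and must be
    given in topological order (dicts preserve insertion order). Nodes that
    never appear as a key are treated as primary inputs.
    """
    cuts = {}

    def choices(u):
        # Either cut the network at u itself, or reuse one of u's own cuts.
        # A primary input has no cuts of its own, so only {u} is available.
        return [frozenset([u])] + list(cuts.get(u, ()))

    for v, ins in fanins.items():
        merged = set()
        for combo in product(*(choices(u) for u in ins)):
            cut = frozenset().union(*combo)  # AND of p-terms = union of variables
            if len(cut) <= k:                # discard cuts that are not K-feasible
                merged.add(cut)
        cuts[v] = merged
    return cuts

# Hypothetical network: x = g(a, b), y = g(c, d), z = g(x, y)
fanins = {"x": ["a", "b"], "y": ["c", "d"], "z": ["x", "y"]}
cuts4 = enumerate_cuts(fanins, k=4)
cuts3 = enumerate_cuts(fanins, k=3)
```

For K = 4, node z has four cuts: {x, y}, {x, c, d}, {a, b, y} and {a, b, c, d}. For K = 3, the last of these is no longer feasible and is dropped. Practical mappers additionally prune and rank cuts, which this sketch omits.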
 
Technology mapping transforms the Boolean network into a network comprising only 
LUTs and flip-flops that can be implemented on the FPGA fabric. The process 
involves choosing a set of K-feasible cuts that covers the network, based on a given 
cost function. The cost function typically optimizes for, or balances, the metrics of 
interest: typically timing, area and power. Depth-oriented mapping [27], [28], [24] 
attempts to find the mapping that minimizes the level, and therefore the delay, of the 
mapped netlist. This normally introduces some amount of area overhead due to 
encapsulation of high fanout nets, a phenomenon known as node duplication. In such 
cases, affected nodes are mapped into more than one cut, resulting in an increased LUT 
count. Recent methods to deal with this problem include area flow for area recovery 
[27], [29]. 












Figure 2.8: Node Duplication 
 
Works relating to FPGA technology mapping have also started placing focus on power 
in recent years [30], [31], [32], [33], [34], [35]. Power dissipation comprises two main 
components: dynamic and static (also known as leakage) power. In FPGAs, dynamic 
power is mostly dissipated in the routing network as it comprises long, highly 
capacitive wires. To date, there has been no work on leakage mitigation in technology 
mapping, which is a concern since leakage power today has caught up with dynamic 
power, and the trend looks set to continue as process nodes continue to shrink. 
 
Existing works in technology mapping with dynamic power awareness attempt to 
minimize inter-cluster routing capacitance, which as described above forms most of 
the dynamic power component in FPGAs. Specifically, this is done by minimizing the 
switching activities between LUTs by encapsulating edges with high activities. The net 
effect is a containment of highly active switching to within LUTs, leaving lower 
activity wires for implementation by the routing network. An example of switching 
activity-aware mapping is shown in Figures 2.9 and 2.10. 














Figure 2.9: A switching activity-aware mapper encapsulates high activity edges 
 
 
Figure 2.10: A non-switching activity-aware mapper may produce the above mapping 
 
The number on each edge corresponds to the switching probability associated with that 
edge. In a switching activity-aware scheme, the mapper attempts to encapsulate, and 
thereby absorb, high activity nets into the LUT (4-input in this case). The resulting 
mapped network then comprises edges with low activities. In contrast, a non-switching 
activity-aware scheme may encapsulate edges with lower activities, leaving high 
activity edges present in the resulting network. These edges are eventually 
implemented in the routing logic, resulting in high dynamic power dissipation. Recent 
switching activity-aware works include Emap [31] and the power optimization 
toolbox in ABC [30]. 
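
The idea of preferring cuts that hide high activity nets can be sketched as below. This is a deliberately simplified illustration and not the cost function of Emap or ABC: it scores a cut only by the summed switching activity of its input nets (the nets that remain in the routing network), ignoring depth and area. The net names and activity values are hypothetical.

```python
def cut_switching_cost(cut_inputs, activity):
    """Sum of switching activities on the nets left outside the LUT."""
    return sum(activity[net] for net in cut_inputs)

def pick_low_power_cut(candidate_cuts, activity):
    """Choose the candidate cut whose exposed input nets switch the least."""
    return min(candidate_cuts, key=lambda cut: cut_switching_cost(cut, activity))

# Hypothetical switching probabilities: n1 and n2 are internal, highly active nets.
activity = {"a": 0.1, "b": 0.1, "c": 0.2, "d": 0.1, "n1": 0.8, "n2": 0.7}
cut_exposing_hot_nets = ("n1", "n2", "c", "d")  # n1, n2 end up in the routing
cut_hiding_hot_nets = ("a", "b", "c", "d")      # n1, n2 absorbed into the LUT
best = pick_low_power_cut([cut_exposing_hot_nets, cut_hiding_hot_nets], activity)
```

Here the second cut is selected, since its exposed activity (0.5) is far lower than that of the first (1.8). A real mapper would combine such a term with depth and area considerations.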
 
2.2.2 Logic Block Packing 
The second step in the FPGA EDA flow is logic block packing, also known as 
clustering in other literature. The goal of packing is typically to minimize external 
routing across clusters, often with timing-awareness. The packing algorithm does this 
by partitioning BLEs in the input netlist into clusters matching the intended CLB 
architecture. 
 
Packing is considered to be a well-studied problem [36], [37], [38], [39], [40]. In 
particular, the T-VPack algorithm [36] is an efficient timing-driven clustering tool that 
has been widely adopted in academia since its conception over a decade ago. The 
algorithm uses a seed attraction function to select a seed BLE from the set of unpacked 
BLEs. Once a seed BLE has been chosen, a second attraction function picks the next 
unpacked BLE to pack into the cluster. The seed attraction function chooses the BLE in 
the technology-mapped netlist that lies on the most critical path; the second attraction 
function is given by: 

$$Attraction(B) = \alpha \cdot Criticality(B) + (1 - \alpha) \cdot \frac{|Nets(B) \cap Nets(C)|}{MaxNets}$$
(2.2) 

where $Criticality(B)$ is a measure of how close $B$ is to the critical path, $Nets(B)$ are 
the nets connected to BLE $B$, $Nets(C)$ are the nets connected to the BLEs already in 
the cluster $C$, and $\alpha$ is a parameter controlling the tradeoff between net-sharing and 
delay minimization. $MaxNets$ is the maximum number of nets that can connect to any BLE. 
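
As an illustration of the second attraction function (2.2), the sketch below assumes the usual T-VPack form: a weighted sum of the BLE's criticality and the number of nets it shares with the cluster, normalized by the maximum number of nets per BLE. The helper names, the default weight of 0.75 and the value of 5 nets per BLE are assumptions for this example only.

```python
def attraction(ble_nets, cluster_nets, criticality, alpha, max_nets):
    """Weighted sum of criticality and normalized net sharing with the cluster."""
    shared = len(set(ble_nets) & set(cluster_nets))
    return alpha * criticality + (1.0 - alpha) * shared / max_nets

def pick_next_ble(unpacked, cluster_nets, alpha=0.75, max_nets=5):
    """Return the unpacked BLE with the highest attraction to the cluster.

    `unpacked` maps a BLE name to a (nets, criticality) pair.
    """
    return max(
        unpacked,
        key=lambda b: attraction(unpacked[b][0], cluster_nets,
                                 unpacked[b][1], alpha, max_nets),
    )

# Hypothetical state: B1 shares two nets with the cluster, B2 is timing-critical.
cluster_nets = {"n1", "n2", "n3"}
unpacked = {"B1": ({"n1", "n2", "n7"}, 0.2), "B2": ({"n9"}, 0.9)}
timing_choice = pick_next_ble(unpacked, cluster_nets, alpha=0.75)
sharing_choice = pick_next_ble(unpacked, cluster_nets, alpha=0.0)
```

With the weight biased towards timing, the critical BLE B2 is chosen; with the weight set to zero, the function degenerates to pure net sharing and B1 is chosen instead.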




2.2.3 Placement 
Following logic block packing, the next step is to decide the physical location of each 
cluster on the actual device. Placing timing-critical clusters close to one another is 
essential to obtaining good speed. The goal of placement algorithms is therefore to 
shorten the critical path delay as much as possible, while also reducing the wiring needed 
during routing. Four broad categories of placers exist today: quadratic [41], [42], 
simulated annealing [9], [43], [44], analytic [45], [46] and min-cut [47]. 
 
2.2.4 Routing 
In the final stage of the FPGA EDA flow, the placed design is routed by making 
connections between each logic cluster. The main objective of a routing algorithm is to 
minimize timing, and to utilize as little of the routing resource available as possible. 
Several routing schemes were investigated in the past, such as two-step routing with 
global routing followed by detailed routing, or combined global-detailed routing [44], 
the latter being more popular in FPGAs. 
 
The timing-driven router in VPR is based on the original Pathfinder algorithm, with 
enhancements that optimize delay under the Elmore delay model. It uses a 
congestion-negotiation mechanism that places priority on the most critical nets within 
a given routing iteration, and relaxes non-critical nets to relieve congestion. The router performs 
well and is popular in academia. In our second work, the Dual-VT Pathfinder, we 





2.3. Related Works 
In this section, we discuss the existing literature that most closely relates to the works 
presented in this thesis. 
 
2.3.1. Dual-Voltage Fabric 
Most existing works in academia investigate dual/multi-voltage fabric with varying 
degrees of body bias granularity and algorithmic enhancements. In [48], a fine-grain 
multi-VT scheme was proposed for logic clusters, while in [49], a low power FPGA 
comprising dual-VDD and dual-VT fabrics was introduced. 
 
Today, Altera’s Stratix-III/IV line of FPGAs offers dual-VT capability at the logic 
region of one Logic Array Block (LAB); refer to Figure 2.11 below. We discuss 
some of the architectural considerations in Section 3.2.1. The optimization approach 
adopted [6] was to reclaim slack post place and route; specifically, after place and 
route, the CAD flow attempts to convert regions of logic and routing to low power. 
The tool then re-analyses timing and, if a critical path violation is found, the 
conversion is reverted. Enhancements to cluster more critical logic into the same 
LAB and thereby reduce the ratio of LABs on the critical path were not considered, on 






Figure 2.11: The Stratix LAB architecture with body bias set by a CRAM 
 
However, we assert that such a flow results in the loss of significant optimization 
space. Refer to the example in Figure 2.12 below, which serves as the prime motivator 
for us in moving our optimization space to the technology mapping level. 
 
 
Figure 2.12: (a) No clustering; (b) With clustering 
 
Here, LB2 has a slack of 2 units. However, because LB2 has been packed into 
the same cluster 1 as other LBs having no slack, it is not possible to set the cluster 
to high VT; the same applies to LB6, with a slack of 1. To do so would require backing 
up to the packer stage, yet the slack values are known only post-P&R. The alternative 
would be to keep the cluster size small, since this reduces the likelihood of sharing a 
cluster with a critical-path LB. However, this means losing some of the benefits of fast 
intra-cluster routing. Hence, such a method is sub-optimal. 
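
The limitation can be illustrated with a small sketch. It assumes, for illustration only, that switching a cluster to high VT costs every LB in it a fixed delay penalty, so a cluster qualifies only if all of its LBs have at least that much slack. The slack values of LB2 and LB6 follow the example above; the remaining values, the cluster assignment and the penalty of 1 unit are hypothetical.

```python
def clusters_eligible_for_high_vt(clusters, slack, penalty):
    """A cluster may be set to high VT only if every LB in it absorbs the penalty."""
    return [
        name
        for name, lbs in clusters.items()
        if all(slack[lb] >= penalty for lb in lbs)
    ]

# LB2 and LB6 have slack; all other LBs are assumed to be on the critical path.
slack = {"LB1": 0, "LB2": 2, "LB3": 0, "LB4": 0, "LB5": 0, "LB6": 1}

# Hypothetical packing in which the slack LBs share clusters with critical LBs.
packed = {"cluster 1": ["LB1", "LB2", "LB3"], "cluster 2": ["LB4", "LB5", "LB6"]}
# The same LBs with one LB per back bias region (no clustering).
unclustered = {lb: [lb] for lb in slack}

packed_result = clusters_eligible_for_high_vt(packed, slack, penalty=1)
unclustered_result = clusters_eligible_for_high_vt(unclustered, slack, penalty=1)
```

In the packed case no cluster qualifies, whereas without clustering both LB2 and LB6 could be set to high VT. A mapper and packer that anticipate this constraint can instead group LBs with slack together.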
 
2.3.2. Dual-VDD Technology Mapping and Clustering 
Very recently (Nov 2010), another work closely related to the dual-VT technology 
mapping tool presented in this thesis was published by Chen et al. [26]. In this work, a 
dual-VDD technology mapping and clustering tool was proposed to mitigate dynamic 
leakage power dissipation. The cost function used accounts for optimisation factors 
such as node coverage, cut size, switching activity and output fanout number: 
   
           
              
 
(2.3) 
where     is the cut size of the cut  ,    is the switching activity on input   of the cut, 
     is the total number of nodes covered by the cut, and     is the number of fanouts 
on the root node.   and   are constants. 
 
The cost function is used during cut enumeration to cost each cut. Upon mapping, the 
actual power of each mapped LUT is estimated using a suggested power model: 
 
              
          
 
(2.4) 
where    represents the power associated with a cut  ,    is the power contributed by 
the cut   itself, and    is the number of fanouts of signal  . Furthermore, a global and 
local cost adjustment scheme was designed alongside slack reclamation in which level 
shifters must be added during VDDL LUT to VDDH LUT transitions. 
 
Generally speaking, dual-VT is better suited to mitigating leakage power, while dual-
VDD is better suited to mitigating dynamic power [50] (although the latter does help in 
reducing leakage to some extent as well). However, we contend that a dual-VDD 
approach results in signiﬁcant loss to the optimization space, namely due to the 









































Consider the case in Figure 5.1. In line with [26], we assume that a VDDH LUT, a VDDL 
LUT and a level shifter contribute delays of 1, 1.4 and 0.3 units respectively. Red arrows 
indicate that level-shifting is required. We see that 4 VDDL-VDDH sequences add 
4×0.3=1.2 units of delay, so we lose the delay budget of 1 VDDH LUT. In 10 such 
sequences, we lose 3 VDDH LUTs. Clearly, the delay of the level shifter has a big impact 
on the optimization space; if the delay is 0.33, we lose 1 VDDH LUT in just 3 sequences. 
 
As a simple analysis, suppose 30% of the LUTs are VDDL, which is reasonable given 
the 1.4 delay ratio between VDDL and VDDH. A path that passes through 10 LUTs will 
then have, on average, 1.9 VDDL-VDDH transitions requiring level shifters. This implies 
adding a delay of 1.9×0.3≈0.57 to a path that had an original delay of 
10×(0.3×1.4+0.7×1)=11.2, or approximately +5% delay due to level-shifting alone. The 
dual-VDD solution thus requires a significantly better power-delay tradeoff than dual-
VT in order to recover from this delay impact. 
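
The arithmetic in this analysis can be reproduced with the short sketch below. One way to arrive at the average of 1.9 transitions, assumed here, is to treat each LUT on the path as VDDL independently with probability 0.3, so that each of the 9 adjacent LUT pairs is a VDDL-VDDH hand-off with probability 0.3×0.7. The function name and this independence assumption are ours for illustration.

```python
def level_shifter_overhead(n_luts, frac_vddl, d_vddh=1.0, d_vddl=1.4, d_shift=0.3):
    """Return (expected transitions, base delay, added delay, relative overhead)."""
    # Expected VDDL -> VDDH hand-offs among the n_luts - 1 adjacent LUT pairs,
    # assuming each LUT is VDDL independently with probability frac_vddl.
    transitions = (n_luts - 1) * frac_vddl * (1.0 - frac_vddl)
    # Average path delay without any level shifters.
    base_delay = n_luts * (frac_vddl * d_vddl + (1.0 - frac_vddl) * d_vddh)
    # Delay added by the level shifters alone.
    extra_delay = transitions * d_shift
    return transitions, base_delay, extra_delay, extra_delay / base_delay

transitions, base_delay, extra_delay, overhead = level_shifter_overhead(10, 0.3)
```

Under these assumptions the path sees about 1.89 transitions, a base delay of 11.2 units and roughly 0.57 units of added delay, i.e. an overhead of about 5%, in agreement with the figures quoted above.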
 
Area-wise, a dual-VDD solution is also more costly. While a dual-VT solution requires 
only one SRAM for RBB programmability, a dual-VDD solution for the same 
programmable granularity requires two level-shifters on top of the SRAM. In practice, 
programmable dual-VDD circuits are also extremely challenging to implement and 
characterize [51]. 
 
2.3.3. Low Power Routing 
To our knowledge, no existing work provides a treatment of dual-voltage 
programmability at the switch box level. Most works relating to low power 
routing logic today target enhancements at the routing-switch level. 
 
In [52], a low power routing switch design was proposed. In this design, each routing 
switch is able to operate in 3 modes: high speed, low power, or sleep. Whether the 
switch operates in high speed or low power mode depends on whether sufficient 
slack exists in the route. An extra configuration SRAM cell selects between the 
two modes. The circuit implementation involves two sleep transistors arranged 
in parallel. Such a design incurs significant area overhead, particularly in the 
SRAM cell, since some amount of latch up circuitry exists in the routing switches of 
FPGAs today to keep unused routing switches in a known state. Similarly in [53], while 
an SRAM-efficient programmable VDD switch was proposed amongst other dual-VDD 
fabric structures, we deem the level of control to be excessively granular and hence 
area-expensive. 
 
In [54], a mixed dual-VDD/dual-VT channel was proposed as part of the interconnect 
architecture. In this work, the routing switch for each track is fixed at a designated 
voltage level. The voltage levels were not programmable. While the results in this 
work yielded good power savings, a significant impact to delay was observed, which is 
difficult to tolerate in practice. 
 
2.4. Focus and Contribution 
The focus of this thesis is on the following two works: (1) Power and Cluster 
Architecture-Aware Dual-VT Technology Mapping and Clustering, and (2) Dual-VT 




The deficiencies in Altera’s CAD flow serve as the prime motivator for the first work 
in this thesis. By moving optimizations up to the technology mapping level, which is 
the highest level in the FPGA EDA flow, we free ourselves from the clustering 
constraints limiting Altera’s post-P&R slack reclamation method. In addition, it is 
widely established that circuit optimizations tend to yield the most benefit when 
injected higher up in the design flow [55]. The decision in this thesis to adopt a dual-
VT fabric as opposed to a dual-VDD one is motivated by the analysis given in Section 
2.3.2 above. 
 
On the second work presented in this thesis, we observe that dual voltage 
programmability at the granularity of a switch box has not been investigated to date, 
thus providing fertile ground for exploration in this thesis. Given that a proven, 
programmable dual-VT logic cluster architecture exists in the industry today, it is 
reasonable to assume that a dual-VT switch box architecture is also feasible and that 
similar gains should be expected. 
 
The contributions in this thesis are twofold: architectural and algorithmic. We 
summarize our accomplishments as follows: 
1. Proposed and implemented a programmable, dual-VT configurable logic block 
architecture; 
2. Proposed and implemented a programmable, dual-VT switch box architecture; 
3. Developed a power and cluster architecture-aware technology mapping scheme 
for FPGAs with dual-VT LUTs; 
4. Developed a novel, dual-VT switch box-aware timing driven routing algorithm 







Dual-VT Architectural Enhancements 





The work in this thesis comprises enhancements in two broad areas: Architecture and 
Algorithm. In this chapter, we present the architectural enhancements required in order 
to leverage our EDA level enhancements. We start with our basis CLB and switch box 
architectures, following which we discuss our architectural enhancements, circuit 
considerations, as well as the area overhead associated with our proposed architecture. 
We also describe the experimental methodology that we deploy to measure the 
contribution of our work. 
 
3.1. Basis Architecture 
We first lay down the basis architecture adopted in this thesis, which is invariant 
throughout the discussions that follow. We assume a homogeneous fabric architecture 
with an island-style floorplan. Configuration is effected using SRAM. We choose a CLB 
cluster size of 10, with 4-input LUTs. A unidirectional single-driver (UDSD) 
interconnect architecture with Wilton switch box topology was chosen. For wire 
distribution, we assume a frequency spread of 4x, 2x and 1x for length-1, length-2 and 
length-4 wires respectively, with no switch box nor connection box depopulation. Such 




3.2. Architectural Enhancements 
The work in this thesis deals with a programmable, dual-VT FPGA fabric. Specifically, 
RBBMap focuses on reverse back bias-capable LUTs, and the dual-VT Switch Box 
Routing Algorithm leverages reverse back bias-capable routing switches.  In this 
section, we present the architectural enhancements to the FPGA fabric associated with 
our EDA-level algorithms. 
 
3.2.1. Dual-VT Configurable Logic Block 
The architecture proposed as part of our first work, RBBMap, supports dual-VT CLBs 
through reverse back bias. Depending on whether the CLB is in VTL or VTH mode, a 
configuration SRAM bit sets the back bias voltage as required. The VTL and VTH 
modes correspond to the cases where no back bias and where a reverse back bias 
voltage is applied, respectively. When required to operate at high speed, the CLB is set 
to VTL, meeting its nominal speed. When required to operate at low power, the CLB is 
set to VTH, reducing leakage power dissipation. 
 
Figure 3.1 below depicts this scheme. For clarity, only the NMOS case is shown, 
where a negative back bias voltage is applied when in VTH mode. In the PMOS case, a 
positive back bias voltage should be applied when in the same mode. Setting CLBs on 
non-critical paths to VTH effects slack reclamation on these paths, lowering overall 






Figure 3.1: Proposed Programmable Reverse Back Bias Configurable Logic Block Architecture. 
 
In the circuit implementation of the BLE, the back bias voltage is supplied to all 
transistors within the BLE. A schematic reference is available in Appendix A. The 
closest existing FPGA in the market today with a similar architecture is Altera’s 
Stratix-III/IV line of FPGAs. In [6], the primary questions that were explored for the 
architecture were the granularity of the back bias region, and the back bias voltage that 
should be applied. Fine granularity is preferred; however, the back bias control 
circuitry incurs significant area overhead in the form of power switches and the circuit 
spacing required for implementation. We explore some of these design 
aspects in Section 3.3. Because larger back bias regions incur a smaller proportion of 
area overhead for this circuitry, a tradeoff exists between the granularity of the back 
bias region and area overhead. It was decided from architectural evaluation in [6] that 
the granularity be one Logic Array Block (LAB), comprising logic elements with local 
routing. In our architecture, we adopt the closest match of one CLB per back bias 
region. We note that differences do exist in the form of local routing logic and 
advanced logic element architectures in the Stratix-III/IV LAB that we are unable to 
model in VPR due to reasons of practicality. 
 
3.2.2. Dual-VT Switch Box Architecture 
The architectural enhancement proposed with our second work, the Dual-VT Switch 
Box Pathfinder, is the dual-VT switch box. Per our basis architecture, we adopt a 
UDSD interconnect architecture with Wilton switch box topology. Specifically, a 
switch box is used wherever a connection between wires, or a connection from a 
logic block output to a wire, is required. Signals from interconnect wires drop off 
to logic block inputs via connection boxes. Figure 3.2 and Figure 3.3 below depict 
this driving structure. 
 
 







Figure 3.3: Connection box structure 
 
We make the back bias option available at the granularity of one switch box, meaning 
all connections utilizing a particular routing switch in a particular switch box will be 
subject to the delay assigned by the VT level of that switch box (a routing switch 
comprises a mux and a buffer, and drives a wire in one direction only). Wires longer 
than length-1 feed through all intermediate switch boxes and are not subject to these 
delays. However, if a signal drops oﬀ at any of these intermediate points, then it will 
be subject to the delay of the associated switch box. This gives the router the ﬂexibility 
to “skip” past switch boxes of diﬀerent VT levels. This enhancement is simple, elegant, 
and architecturally agnostic in the sense that it is applicable to any switch box topology, 







Figure 3.4: Proposed Programmable Reverse Back Bias Switch Box Architecture. 
 
One alternative granularity option that was considered is to have a subset of routing 
switches within one switch box fall within one back bias region. A grouping scheme 
can be devised to determine which routing switches should fall within this subset; a 
viable option for the Wilton switch box would be the track number associated with 
each segment. This allows flexibility for the routing algorithm to decide hop-off and 
hop-on points for live slack reclamation during routing. Naturally, the feasibility of 
any grouping scheme would depend strongly upon the switch box topology in question. 
However, the major concern with such a granularity option is the area overhead, since 
design rules exist in the spacing of back bias regions. Given that routing switches in 
switch boxes are typically grouped tightly together in circuit design, such a granularity 


























A back bias region comprising multiple switch boxes was also deemed to be too 
coarse-grained, limiting the optimization space of slack paths excessively. Based on 
the reasoning above, we choose a granularity of one switch box per back bias region in 
our work. 
 
3.3. Circuit Implementation and Area Overhead Analysis 
The back bias circuitry in our proposed architecture necessarily incurs noticeable area 
overhead. This overhead is important in deciding the granularity of the back bias 
region and involves a tradeoff of tuning flexibility and real estate space on the die, as 
was discussed above. This section discusses our circuit implementation and presents an 
area overhead analysis involved in our proposed architecture. 
 
3.3.1. Reverse Back Bias Voltage Generation 
In conventional circuits, nominal voltage levels are GND and VDD. One of the major 
circuit design considerations in a circuit implementing reverse back bias is then the 
generation of the back bias voltage. To do this, several structures can be used; two 
popular options include low-dropout regulators (LDO) and switched-capacitor charge 
pumps [56], [57]. In this work, we implemented the latter for both the negative voltage 
generator and the voltage doubler; these supply the back bias voltage for the NMOS and 
PMOS transistors respectively. In addition, a triple-well process is required in order to 
supply the negative back bias voltage. We used the UMC65LL triple-well process in 
our circuit implementation (HSPICE simulations on regular VT transistor models in the 





For simplicity in modeling, our CLB and switch box circuits were implemented using 
regular VT transistors with VT levels of 0.41 V and 0.37 V for NMOS and PMOS respectively. 
Back bias voltages were set to -1.2 V and 2.4 V for NMOS and PMOS respectively. The 
simplified schematics for our negative charge pump and voltage doubler circuit is 









(b) Voltage doubler for PMOS body bias 
 
Figure 3.5: Switch Capacitor Charge Pump Implementation for Reverse Back Bias Voltage 
Generation. 
 
In reality, such a structure can support up to a maximum of -0.6 V and 1.9 V for NMOS 
and PMOS transistors respectively. To realize the voltages required, a level shifter stage 
was added. Nonetheless, the area overhead involved is largely dominated by the charge 
pump capacitor. 
 
The size of the charge pump depends entirely on the on-chip capacitor available, e.g. 
Metal-Insulator-Metal (MIM) or Metal-Oxide-Metal (MOM) capacitors. The 
approximate size of one charge pump in our implementation is 100 μm × 100 μm. The 
number of charge pumps required is correlated mostly to the number of transistors 
requiring back bias. As an estimate, we base our area overhead analysis below on the 
number of transistors in each back bias region. However, we note that in practice, 
design rules require circuit spacing between back bias regions, which will incur 
additional cost that is difficult to estimate without the actual layout. 
 
3.3.2. Configurable Logic Block 
In line with our basis architecture, we adopt a cluster architecture with 10 BLEs and 22 
inputs (thus requiring a mux size of 32 for the local MUXes to each BLE). Each BLE 
comprises four 32-input MUXes, one 4-input LUT, one D Flip-Flop with 
asynchronous set/reset, SRAM bit cells and buffers where required. SRAM bit cells 
are used during configuration and as such can be implemented using high VT cells in 
the technology library; they should therefore be excluded from the area overhead 
estimation. Estimated transistor count per BLE is then ~520 transistors (400 NMOS 
and 120 PMOS) and ~5200 transistors for a CLB. Implementation schematics are 
available in Appendix A. 
 
From HSPICE simulations, one CLB requires around 13.3nA for Pwell and 4.9nA for 
Nwell in the nominal case. This design can therefore support sufficient well current for 
up to 1 million transistors [56], meaning one unit area of the biasing circuit 





We note also that the control mux setting the back bias incurs a minimal area overhead. 
Assuming the same SRAM bit cells are used to set both the NMOS and PMOS back 
bias control muxes, an overhead of 16 transistors – 2 in each mux and 6 in each SRAM 





Figure 3.6: Reverse Back Bias Control MUX Implementation 
 
3.3.3. Switch Box 
In our experimental flow (introduced later in Section 3.4), we size the device to the 
smallest area required to implement the design. Accordingly, track count will be the 
minimum required to route the design. Because switch box size is entirely dependent 
on track count, the number of switch boxes that a charge pump can support 
also depends entirely on track count. Additionally, in the UDSD interconnect 
architecture, one routing switch is required in each direction. 
 
As such, it is more meaningful in the case of the switch box to estimate the number of 
routing switches supported by the biasing circuit. From Figure 3.4, each routing switch 
comprises one 3 input MUX and a buffer, which is implemented as two inverters. 













the biasing circuit is able to support ~142,800 routing switches. As a ballpark 
estimate, if we assume a channel width of 60, each switch box would contain 
240 routing switches. One unit area of the biasing circuit is then capable of supporting 
~590 switch boxes (or ~4:1 ratio versus the CLB). The area overhead associated with 
our architectural enhancement is thus very small, which supports its feasibility. Similar 
to the CLB, negligible area is required to implement the back bias control mux. 
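
The ballpark estimate above can be reproduced as follows. The sketch assumes that the 240 routing switches per switch box arise from one switch per track on each of the four sides of the switch box (4 × 60); this decomposition and the function name are assumptions for illustration.

```python
def switch_boxes_per_bias_circuit(switches_supported, channel_width, sides=4):
    """Return (routing switches per switch box, switch boxes per biasing circuit)."""
    # Assumed: one routing switch per track on each side of the switch box.
    switches_per_box = sides * channel_width
    return switches_per_box, switches_supported // switches_per_box

switches_per_box, boxes_supported = switch_boxes_per_bias_circuit(142_800, 60)
```

This gives 240 routing switches per switch box and 595 switch boxes per unit area of the biasing circuit, consistent with the estimate of roughly 590. Since the channel width is design-dependent in our flow, the figure scales inversely with the track count.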
 
3.4. Experimental Methodology 
In this section, we describe the experimental methodology that was deployed to 
measure the gains of the RBBMap and Dual-VT Switch Box Pathfinder algorithms 
presented in this thesis.  
 
3.4.1. Discussion 
FPGA EDA algorithms interact in complex ways when mapping a circuit onto the 
architecture of interest. Commercial FPGAs are typically characterized for accurate 
timing and power figures using industrial-strength tools like HSPICE, after which 
models are developed for consumption by downstream EDA tools for optimization and 
reporting purposes. In the ideal scenario, if it is desired to understand the effectiveness 
or optimality of EDA algorithms for a given architecture, a large regression suite of 
benchmark circuits should be mapped onto the real FPGA that is built upon this 
architecture, with delay and power figures obtained through these models. 
 
However, such a flow is not possible for an FPGA architect who is responsible for 
conceiving the architecture to be designed. Similarly in our case, because FPGAs with 
architectures representative of those proposed in this work do not exist today, it is 
necessary for us to leverage existing tools and models for the purpose of evaluating the 
effectiveness of our algorithms. 
 
Early works focusing on FPGA power [33], [34], [35] often leveraged overly simplified 
models. For example, it was often assumed in these models that each net has 
the same capacitance, or that capacitance scales with the fanout of each net. In reality, the 
power dissipated in a net depends on a complex interplay of how long the net is and how it 
is connected. Some of these works were also focused purely on power, even at the 
expense of timing. Such an approach yields little value in practice since timing remains 
the single most important metric to FPGA users. Maintaining critical path delays 
should therefore be a hard constraint on any power-oriented EDA work. Today, 
academic tools such as VPR [9] and the power model [11], [12] provide an established 
baseline framework for observing timing and power figures, which we leverage in our 
experimental flow. 
 
Because the two works in this thesis focus on different architectural enhancements, the 
flows used to evaluate them were kept separate in order to maintain 
observability of the timing and power impact of each work. We present their 
respective experimental methodologies in the subsections that follow. 
 
 
3.4.2. Dual-VT Technology Mapping and Clustering 
The overview of our experimental framework for RBBMap is shown in Figure 3.7 
below. In this flow, the only architectural enhancement injected is the Dual-VT CLB 
described in Section 3.2.1. To perform power estimation, we leveraged the power 
model for VPR 5.0, developed by Poon and Jamieson [11], [58]. Although the power 
estimation framework does not provide ﬁgures exact to SPICE simulation, it has 





Figure 3.7: Experimental Flow for Dual-VT Technology Mapping and Clustering (RBBMap) 
 
 
In this flow, we first use ABC [8] to perform structural hashing on each benchmark circuit, 
converting it into an And-Invert-Graph (AIG), a graph comprising only 2-input AND and 
1-input NOT gates. Structural hashing expands the optimization space for the 
technology mapping tool, resulting in significantly better mapped circuits with reduced 






















Next, switching activity estimates for each node in the mapped circuit are generated 
using the activity estimator tool (ACE), developed by Poon [11]. In a real-world 
scenario, switching activities can be collected by exercising the circuit using test 
vectors representative of the circuit's use case. The AIG network, along with its 
activity information generated from ACE, is then fed into RBBMap to perform Dual-VT 
technology mapping. The output is a technology-mapped BLIF netlist annotated 
with VT levels for each LUT. This netlist is then fed into RBBPack for packing into 
distinct VTL and VTH clusters. The output of RBBPack is a NET file annotated with VT 
level information for each CLB. The NET file is then fed into the cluster activity 
estimator tool in Odin-II, created by Jamieson [12]. This generates the activity and 
function files needed for power estimation in VPR. 
 
At this stage, we are ready to run VPR. The NET file containing annotated VTL and 
VTH information for each cluster is input into VPR for timing-driven placement 
and routing. We also push VPR for the minimum device area required to implement 
the circuit. The power 
model within VPR has been updated to support dual-VT CLBs. We note here that we 
do not use P-T-VPack and P-VPR [59] because, while these tools are power-aware, they were 
shown to introduce a certain amount of delay impact in all circuits. This must be 
avoided in our work since we want to maintain visibility on the delay impact 
contributed by RBBMap/RBBPack alone, if any. 
 
We generate an architecture XML file with timing and power information extracted 
from our circuit implementation described in Section 3.3.2 (Configurable Logic Block). 
All other base VPR parameters are set according to our basis architecture described in 
Section 3.1. 
 
Our regression suite of benchmark testcases comprises the “Golden” 20 MCNC circuits: alu4, 
apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, 
misex3, pdc, s298, s38417, s38584.1, seq, spla, and tseng. Half of these circuits are 
sequential, while the other half are purely combinational. The circuits range in size 
from 1858 to 14233 after structural hashing. While not extremely large, these circuits 
are still considered to be highly representative of industrial designs. 
 
The baseline flow against which we compare the performance of RBBMap is Emap [31]. 
We use the same flow as in Figure 3.7 but with RBBMap replaced with Emap and 
RBBPack replaced with T-VPack. While Emap is not dual-VT aware (this is part of our 
major contributions), it is one of the more recent works that deals with FPGA 
power at the technology mapping stage. Because DVmap-2 [26] is not available in the 
public domain, a comparison with this tool is not currently possible. 
 
In terms of metrics, we focus on logic block leakage power, total power consumption and 
critical path delay. We choose logic block leakage power as one of the two key metrics 
to report because this is what RBBMap focuses on optimizing. Naturally, total 
power should also be reported since this is the ultimate concern for the FPGA user. 
Lastly, we compare critical path delay in order to understand the delay impact 




3.4.3. Dual-VT Switch Box Routing 
The overview of our experimental framework for Dual-VT Switch Box Routing is shown in 
Figure 3.8. While the flow is largely similar to that above, there are a few important 
differences, which we highlight here.  
 
 
Figure 3.8: Experimental Flow for Dual-VT Switch Box Routing 
 
This time, the only architectural enhancement introduced is the Dual-VT Switch Box, 
described in Section 3.2.2. This is because we want to observe only the gains 
attributed to our architectural and algorithmic enhancements in this area. Our 
enhancements are relevant only to VPR, which we call DVT-VPR to distinguish 
it from the original VPR 5.0 tool. We also update the power estimation tool within 

















As such, we no longer use RBBMap and RBBPack in our experimental flow. For the 
technology mapping phase, we leverage ABC [8], an established and well-maintained 
logic synthesis tool available in the public domain. ABC is efficient at generating 
mapped circuits with low LUT count, in turn translating to both dynamic and leakage 
power savings. For logic block packing, we use T-VPack. Cluster activity information 
is then generated using the cluster activity estimator tool in Odin-II. This part of the 
flow is now invariant between our Dual-VT Switch Box flow and the baseline flow. For 
our baseline flow, we use the original VPR 5.0 with its native single-VT switch box 
architecture. Since no similar work exists today, a comparison with prior work was not 
possible. 
 
Next, the NET file from T-VPack, along with the activity and function files from 
Odin-II, are fed into DVT-VPR for placement and routing. As before, we push 
VPR for timing-driven placement and routing and for minimum area. Our Dual-VT Switch Box Routing 
algorithm is an enhancement of the original timing-driven Pathfinder algorithm native to 
VPR. Upon successful completion of routing, power and timing numbers are reported. 
 
For metrics, we focus on Routing Leakage Power Dissipation, Total Power Dissipation, 
and Critical Path Delay, with similar motivation as before. The first metric is directly 
related to our enhancements, while the second metric concerns the user. The third 






3.5. Power and Delay Modeling 
In this section, we briefly touch on the power and delay models deployed in the 
framework described above, and the associated enhancements to support our dual-VT 
architectural enhancements. 
 
3.5.1. Elmore Delay Model 
To perform timing analysis, we leverage the built-in Elmore delay model in VPR. The 
Elmore model, while not extremely accurate, is nonetheless able to rank net delays 
correctly [9]. This is important since the effectiveness of the timing-driven algorithm 
depends entirely on the fidelity of the underlying timing analyser. 
 
The Elmore delay of a source-sink path $p$ is given as [60]: 

$$T_D(p) = \sum_{i \in p} \left( R_i \cdot C(\mathrm{subtree}_i) + T_{d,i} \right) \qquad (3.1)$$

where $T_{d,i}$ is the intrinsic delay of a buffer if $i$ is a buffer, and 0 otherwise. Given our 
UDSD architecture, the former will always be true. $R_i$ is the equivalent resistance of $i$, 
and $C(\mathrm{subtree}_i)$ is the total capacitance of the dc-connected subtree at $i$. Again, since 
we employ a UDSD architecture, the subtree in all cases is shortly segmented. This 
also implies a greater level of accuracy in the Elmore model when applied to the 




The enhancement made to our Elmore timing analyser accounts for the 
delays of slow switches in the dual-VT switch box case. We annotate the intrinsic buffer delay 
with the fractional delay introduced when the switch is in high-VT mode, 
characterized from HSPICE simulations of our implementation circuit described in Section 
3.3.3. 
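
As an illustration of the delay model and its dual-VT annotation, the sketch below sums the per-element Elmore terms along one source-sink path and scales the intrinsic buffer delay of any switch marked as high-VT. It is a minimal sketch rather than the VPR implementation: the `Stage` record, the function name and the numerical values are hypothetical, and the high-VT penalty is treated as a single fractional increase.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One element on a source-sink path (hypothetical record layout)."""
    r_ohm: float           # equivalent resistance of the element
    c_subtree_f: float     # total capacitance of its dc-connected subtree
    t_buf_s: float = 0.0   # intrinsic buffer delay (0 if not a buffer)
    high_vt: bool = False  # True if the switch is biased to high VT


def elmore_delay(path, high_vt_penalty=0.0):
    """Sum of R*C plus intrinsic delay over all elements on the path.

    high_vt_penalty is the fractional increase in intrinsic delay of a
    high-VT switch (an assumed, externally characterized number).
    """
    total = 0.0
    for stage in path:
        t_buf = stage.t_buf_s
        if stage.high_vt:
            t_buf *= 1.0 + high_vt_penalty
        total += stage.r_ohm * stage.c_subtree_f + t_buf
    return total


# Two buffered switches, the first one in high-VT mode (made-up values).
path = [Stage(1000.0, 1e-15, 100e-12, high_vt=True),
        Stage(1000.0, 1e-15, 100e-12)]
print(elmore_delay(path, high_vt_penalty=0.2))
```

With the penalty set to zero the function reduces to the plain Elmore sum, so the same routine can serve both the baseline and the dual-VT case.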
 
3.5.2. Power Model 
The VPR power model [11] is a tool to estimate the dynamic, leakage, and short-
circuit power of circuits after they have been placed and routed on the specified 
architecture. Power constituents are the logic blocks, routing logic, and the clock 
network. First, the transition density model [61] is used to estimate the switch activity 
of each node in the circuit. Next, switching activities and estimated capacitances at the 
transistor level are used to evaluate power dissipation of the implemented circuit. 
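The transition density $D(y)$ of a signal $y$ is the expected number of times the signal will 
toggle in each clock cycle, and is given by the expression: 

$$D(y) = \sum_{i=1}^{n} P\!\left(\frac{\partial y}{\partial x_i}\right) D(x_i) \qquad (3.2)$$

where $n$ is the number of input signals to $y$, $P(\partial y / \partial x_i)$ is the probability that a change in $x_i$ 
will cause $y$ to change as well, and $D(x_i)$ is in turn the transition density for the input 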




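Dynamic power is evaluated using the expression: 

$$P_{dynamic} = \sum_{\text{all nodes } y} 0.5 \cdot C_y \cdot V_{supply} \cdot V_{swing} \cdot D(y) \cdot f_{clk} \qquad (3.3)$$

where $C_y$ is the capacitance of node $y$ and $f_{clk}$ is determined by the Elmore delay model. 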
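
To make the two steps concrete, the sketch below propagates transition density through a single 2-input AND gate and then evaluates the switching power of its output node. It assumes independent inputs, for which the Boolean difference of an AND gate with respect to one input is simply the other input, and it uses the standard switching-power form (half of capacitance times supply voltage times swing, per transition). Function names and numerical values are illustrative assumptions.

```python
def and2_transition_density(p1, d1, p2, d2):
    """Transition density at the output of y = x1 AND x2.

    p1, p2: static probabilities that x1, x2 are logic 1.
    d1, d2: transition densities of x1, x2 (toggles per clock cycle).
    For an AND gate, dy/dx1 = x2 and dy/dx2 = x1, so
    D(y) = P(x2) * D(x1) + P(x1) * D(x2), assuming independent inputs.
    """
    return p2 * d1 + p1 * d2


def dynamic_power(c_farads, v_supply, v_swing, density, f_clk):
    """Switching power of one node: 0.5 * C * Vsupply * Vswing * D * f."""
    return 0.5 * c_farads * v_supply * v_swing * density * f_clk


d_out = and2_transition_density(0.5, 0.2, 0.5, 0.2)
print(d_out)
print(dynamic_power(10e-15, 1.0, 1.0, d_out, 100e6))
```

Summing `dynamic_power` over all nodes gives the total dynamic power; nodes with low transition density contribute little regardless of their capacitance.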
For leakage power, the power model assumes drain leakage is negligible. Subthreshold 
current is evaluated with the classic equation: 
                          
          
   
  
(3.4) 
Leakage power is then found by multiplying the subthreshold current by the supply voltage. Although the 
model is observed to generate an absolute error of 13.4% [11], it has been shown to 
have good fidelity. 
To support power estimation for dual-VT clusters, a different     value was assigned 










Power and Cluster Architecture-Aware 





This chapter presents one of the two contributions in this thesis: a power and 
cluster-architecture aware technology mapping tool, RBBMap¹. We also describe our 
companion dual-VT-aware logic block packing tool, RBBPack. 
 
4.1. Discussion 
There are two prime motivators for moving dual-VT optimization up to the technology 
mapping stage. Firstly, by predicting candidate high-VT LUTs upfront, we free 
ourselves from the logic block packing constraints that limit Altera’s post-P&R slack 
reclamation method. Secondly, it is often the case that circuit optimizations yield 
greater benefits higher up in the EDA flow [55]. While any method that operates at 
higher levels of the CAD ﬂow has the inherent disadvantage of reduced visibility on 
the eventual critical path following P&R, optimizing slack paths is expected to present 
the least delay impact. On this basis, we target nodes on non-critical paths in the 
technology mapped netlist for slack reclamation.  
 
Slack reclamation can be performed either inline during the cut selection process, or 
after formation of the LUT network. An inline method can do this by extending slack 
paths during cut selection, with the aim of minimizing node duplication (having fewer 
                                                 
¹ This work is to appear at The 19th




nodes in the mapped network is beneficial to the dual-VT CLB architecture since 
unused CLBs can be set to high-VT). Reclaiming slack after forming the LUT network, 
however, has the advantage of retaining slack on non-critical paths before the 
reclamation process, allowing more nodes to be converted to high-VT. We adopt 
the latter approach in RBBMap since minimizing node duplication during the mapping 
process is less effective compared to advanced methods such as Area Recovery [27], 
which can also be deployed on mapped networks. Most importantly, this allows us to 
preserve the structure of the mapped network. Consequently, the eventual critical path 
after P&R is likely to correspond to the same LUT sequence as that of the mapped 
network. This should be verified by taking the mapped network through the entire EDA 
flow all the way to routing, and comparing against the baseline. 
 
4.2. RBBMap 
RBBMap is a technology mapping tool capable of mapping a netlist into LUTs of two 
different VT levels, while maintaining network depth. The high level algorithm is 
presented in Figure 4.5 below. The main contribution in RBBMap is in the two-phase 
slack reclamation scheme with cluster-architecture awareness. We also describe 
RBBPack, our supporting dual-VT logic block packing tool. 
 
4.2.1. Cost Function 
When selecting cuts for mapping to LUTs, we adopt a scheme similar to Emap with 
enhancements to account for the height of a candidate cut. The baseline Emap cost 
function is effective in encapsulating edges with high switching activity, 
which reduces dynamic power consumption in the routing logic. It also attempts to 
minimize node duplication where possible. We make the additional observation that, in 
a dual-VT scenario, optimizing for depth on non-critical paths will help introduce more 
slack to these paths. This encourages more nodes along these paths to be converted to 
VTH . At the same time, we do not want to over-emphasize height to the extent that 
excessive node duplication occurs. 
 
Our cost function for cut selection is as follows: 
             
 
           
 
               
                     
  
                      
           




The middle and right-hand terms are essentially the same as that used in Emap, which 
we will discuss only brieﬂy.          is a cut in the cut set        
   of the node  . It 
comprises a set of nodes encapsulated within the cut, rooted at  .             is the 
set of nodes in     that are root nodes to other cuts.           is a weight variable 
that determines the likelihood of   being chosen as a root node of the LUT.        is 
the estimated switching activity of the net driven by node  , and λ controls the 
importance of the switching activity.           is the set of fanout nodes of node u. 
 
The middle term penalizes node duplication (numerator), while encouraging cuts to 
encapsulate more nodes (denominator). The right-hand term favours cuts with lower 
input activities and cuts with low fanouts to reduce node duplication. 
 
Our new first term is the reciprocal of the height factor of the cut, which we introduce 
as a costing enhancement to assist in dual-VT mapping. If the height of the cut is below a 
pre-defined limit, we ignore this term by setting the height factor to 1. However, if it is equal 
to or greater than this limit, we set the height factor to be equal to the height of the cut, 
thereby lowering its cost. This encourages the algorithm to select “taller” cuts, thereby 
shortening paths along this cut and allowing more nodes to be converted to high VT. 
The height factor is an enhancement feature which the user can calibrate 
against the circuit of interest. 
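
The effect of the height factor on a cut's cost can be sketched as follows. Here `base_cost` stands in for the product of the remaining Emap-style terms, which this sketch does not compute; the function names are hypothetical and the limit of 4 in the example is the value used later in our experiments.

```python
def height_factor(cut_height: int, limit: int) -> int:
    """Height factor of a cut: 1 below the limit, the cut height otherwise."""
    return cut_height if cut_height >= limit else 1


def scaled_cost(base_cost: float, cut_height: int, limit: int) -> float:
    """Cut cost after applying the reciprocal of the height factor.

    base_cost is a placeholder for the remaining cost terms (an assumption
    of this sketch, not something computed here).
    """
    return base_cost / height_factor(cut_height, limit)


# Same base cost, heights 3 and 4, limit 4: only the taller cut is discounted.
print(scaled_cost(12.0, 3, 4), scaled_cost(12.0, 4, 4))
```

Raising the limit restricts the discount to very tall cuts, which is one way a user could trade node duplication against the slack gained on non-critical paths.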
 
4.2.2. Timing Analysis 
Timing analysis is used to estimate the slack for each path in the mapped network. 
Slack reclamation can then be performed by converting VTL LUTs on non-critical paths to 
VTH. In order to perform timing analysis, we first transform the mapped network into a 
levelized DAG-representation timing graph, as shown in Figure 4.1 below. 
 
 
Figure 4.1: Transforming a mapped network into its corresponding timing graph. 
 
Primary inputs constitute all nodes at level 0; LUT inputs and primary outputs are at 
odd levels (“in-level”), while LUT outputs are at even levels (“out-levels”). Primary 
inputs and register outputs have no incident edges, while primary outputs and register 
inputs have no outward edges. The maximum level always corresponds to either 
primary outputs or register inputs. Levelization enables efficient breadth-first analysis 
of the timing graph during timing analysis. Delays are annotated only on the edges of 
the timing graph; in the base case, LUT edges (in-level to out-level) constitute a delay 
of 1 unit, while edges between LUTs (out-level to in-level) constitute a delay of 0 units. 
Cluster architecture delay annotation is described in Section 4.2.4 below. 
The first step in the timing analysis procedure is determining the critical path of the 
network. This can be done by performing a linear-time breadth-first traversal of the timing 
graph from level 0 down to the maximum level. Nodes at level 0 are labelled with a 
signal arrival time, $arr$, of 0. Downstream nodes are then labelled with their 
respective arrival times according to: 

$$arr(v) = \max_{u \in fanin(v)} \big( arr(u) + delay(u, v) \big) \qquad (4.2)$$

where $v$ is the node currently being labelled and $delay(u, v)$ is the delay from node 
$u$ to node $v$. Accordingly, the largest arrival time in the graph gives the critical path delay of 
the mapped network. 
 
To find the slack of each timing graph edge, we perform a backwards breadth-first 
traversal of the timing graph. Nodes with no outward edges are labelled with the signal 
required time, $req$, equal to the critical path delay. Upstream nodes are then labelled 
according to: 

$$req(u) = \min_{v \in fanout(u)} \big( req(v) - delay(u, v) \big) \qquad (4.3)$$

where $u$ is the node currently being labelled this time. Finally, slack for the edge from 
node $u$ to node $v$ is evaluated according to: 

$$slack(u, v) = req(v) - arr(u) - delay(u, v) \qquad (4.4)$$
Edges on the critical path will have no slack, while edges on non-critical paths will 
have some slack. The latter are assessed as candidates for our slack reclamation 
scheme, described next. 
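
The forward and backward labelling passes described in this section can be sketched as below. This is a minimal sketch, not the RBBMap source: the graph is given as a dictionary of edge delays, a topological order is used in place of explicit level-by-level traversal (both visit every node after all of its fanins), and all names are illustrative.

```python
from collections import defaultdict


def analyse_timing(edges):
    """Arrival times, required times and edge slacks of a timing DAG.

    edges maps (u, v) to the delay of the edge from node u to node v.
    """
    fanin, fanout, nodes = defaultdict(list), defaultdict(list), set()
    for (u, v) in edges:
        fanout[u].append(v)
        fanin[v].append(u)
        nodes.update((u, v))

    # Topological order (Kahn's algorithm): sources first, sinks last.
    indeg = {n: len(fanin[n]) for n in nodes}
    frontier = sorted(n for n in nodes if indeg[n] == 0)
    order = []
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in fanout[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)

    # Forward pass: nodes without fanin arrive at time 0.
    arr = {}
    for v in order:
        arr[v] = max((arr[u] + edges[(u, v)] for u in fanin[v]), default=0)
    critical_delay = max(arr.values())

    # Backward pass: nodes without fanout are required at the critical delay.
    req = {}
    for u in reversed(order):
        req[u] = min((req[v] - edges[(u, v)] for v in fanout[u]),
                     default=critical_delay)

    slack = {(u, v): req[v] - arr[u] - d for (u, v), d in edges.items()}
    return arr, req, slack


# a -> x -> y is the critical path; b -> y has one unit of slack.
arr, req, slack = analyse_timing({("a", "x"): 1, ("x", "y"): 1, ("b", "y"): 1})
print(slack)
```

In the example, both edges on the path through `x` have zero slack, while the edge from `b` to `y` has a slack of 1 and is therefore a candidate for reclamation.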
 
4.2.3. Slack Reclamation and Distribution Scheme 
A slack reclamation sweep on the timing graph can be performed either bottom-up (from PO 
to PI) or top-down (from PI to PO) to convert VTL nodes on slack paths to VTH. 
Depending on the structure of the network, either method may perform better. 
In general, a bottom-up sweep converts more nodes for a downward cone network; 
the opposite is true for an upward cone network. For our benchmark circuits, we adopt 
a bottom-up sweep. 
 
In addition, not all nodes that qualify for conversion should actually be converted. This 
is because nodes with high fanin and fanout degrees tend to influence a greater part of 
the graph; in other words, converting a high-degree node may deprive other nodes on 
the same path of slack that could otherwise be used for conversion. To mitigate this 
problem, we introduce a slack distribution scheme and a two-pass slack reclamation 
sweep to convert as many nodes as possible in a general graph. This is captured in the 
pseudocode in Figure 4.5. We first perform a breadth-first, bottom-up sweep to 
convert qualified nodes to VTH, disqualifying nodes with high degrees even if the node 
has sufficient slack. After the first sweep completes, we perform another bottom-up 
sweep, this time converting nodes as long as the node has sufficient slack. This is to 
avoid missing out on nodes that were disqualified in the first sweep, but whose slack was not 
taken up by other nodes along the path. 
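
The two-pass sweep can be sketched on a simplified LUT-level graph as follows. This is an illustrative sketch under several assumptions that differ from the full tool: delays sit on nodes rather than on timing graph edges, a node is converted when its slack is at least the extra delay of a high-VT LUT, node degree is measured by fanout only, and slacks are recomputed from scratch after every decision. All names are hypothetical.

```python
def node_slacks(order, fanin, delay):
    """Per-node slack and critical delay; `order` must be topological."""
    arr = {}
    for n in order:  # forward pass: arrival time at each node output
        arr[n] = delay[n] + max((arr[u] for u in fanin[n]), default=0.0)
    crit = max(arr.values())
    req = {n: crit for n in order}
    for n in reversed(order):  # backward pass: required time at each output
        for u in fanin[n]:
            req[u] = min(req[u], req[n] - delay[n])
    return {n: req[n] - arr[n] for n in order}, crit


def reclaim_slack(order, fanin, base_delay, vth_penalty, deg_threshold):
    """Two bottom-up sweeps; the first one defers high-fanout nodes."""
    fanout_count = {n: 0 for n in order}
    for n in order:
        for u in fanin[n]:
            fanout_count[u] += 1
    delay = dict(base_delay)
    high_vt = set()
    for distributed in (True, False):  # pass 1, then pass 2
        for n in reversed(order):      # bottom-up: outputs towards inputs
            if n in high_vt:
                continue
            if distributed and fanout_count[n] > deg_threshold:
                continue               # defer high-degree nodes in pass 1
            slack, _ = node_slacks(order, fanin, delay)
            if slack[n] >= vth_penalty:
                delay[n] += vth_penalty
                high_vt.add(n)
    return high_vt, delay


# Critical chain p1 -> p2 -> p3, plus node h driving t1 and t2.
order = ["p1", "p2", "p3", "h", "t1", "t2"]
fanin = {"p1": [], "p2": ["p1"], "p3": ["p2"], "h": [], "t1": ["h"], "t2": ["h"]}
base = {n: 1.0 for n in order}
converted, new_delay = reclaim_slack(order, fanin, base, 1.0, 1)
print(sorted(converted), node_slacks(order, fanin, new_delay)[1])
```

Because a node is only converted when its slack covers the added delay, the critical delay of the example stays at 3.0 while two of the six nodes move to high VT.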
 
4.2.4. Cluster Architecture Delay Annotation 
In reality, any two nodes in sequence are separated minimally by an intra-cluster delay. 
We build architectural awareness into our slack reclamation scheme by annotating 
each LUT edge in the timing graph with its combinational delay and each edge 
between LUTs with an intra-cluster delay, characterized from our circuit 
implementation described in Section 3.2.1. A full timing analysis is then performed to 
obtain exact slacks before carrying out the reclamation scheme described above. This 
is illustrated in Figure 4.2 and Figure 4.3 below. 
 




Figure 4.3: Mapped network with annotated LUT and intra-cluster delays. 
 
Table 4.1 below compares the number of converted nodes for each of the reclamation 
schemes. It should be noted that the low conversion rates in three of the testcases 
(apex4, bigkey and dsip) are due to their extremely flat circuit structures, resulting in 
minimal slack available on any path. This is correlated to the number of levels in the 
mapped network. For instance, bigkey and dsip would be expected to yield low conversion 
rates since the critical path is only 3 LUTs high. Regardless, cluster architecture delay 




Table 4.1: Comparison of converted nodes against reclamation schemes. 
 
4.2.5. VT Migration 
Because the routing network consumes significant dynamic power, we make RBBMap 
convert certain VTH LUTs back to VTL if doing so is expected to incur less routing cost and 
hence reduce overall power dissipation. 
 
Consider the following example in Figure 4.4, where part of a design is mapped into 6 
VTL and 2 VTH LUTs. A cluster of size 4 would result in 3 clusters mapped. However, 
if we migrate the 2 VTH LUTs back to VTL, as in the case of Figure 4.4b, they can be 
absorbed into the same cluster as 2 of the VTL LUTs, therefore incurring less routing 






Design      Nodes  Levels  Top down  Bottom up  Distributed  Cluster
alu4         1300       7        68         67           67      202
apex2        1632       8        31         31           31      191
apex4        1221       6         1          1            1       60
bigkey       1810       3         1          1            1        6
clma         6839      16      1866       1855         1858     2487
des          1358       6       156        183          183      263
diffeq       1057      14       703        712          712      757
dsip         1364       3         1          1            1        2
elliptic     2282      18      1822       1808         1808     1968
ex1010       4229       8        64         64           64      579
ex5p          990       7        33         33           33      119
frisc        2406      23      2103       2103         2103     2173
misex3       1226       7        54         54           54      281
pdc          4122       9       353        352          352     1177
s298         1657      15       837        837          837      988
s38417       5085      11      2003       2048         2056     2611
s38584.1     4519      10      3410       3411         3411     4061
seq          1479       7        98         98           98      331
spla         3510       8       185        185          185      719




(a) Before migration 
 
(b) After migration 
Figure 4.4: Migration of VTH LUTs to VTL. 
 
The migration condition is specified as follows: 
                                
(4.5) 
where   corresponds to CLB size. Further, nodes with higher input switching activity 
are picked for migration. This is due to our assumption that less leakage power is 
dissipated during actual switching; consequently, nodes that switch most often 
dissipate the least leakage. Migrating these nodes back to VTL therefore represents the 
least sacriﬁce to leakage power. 
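
The cluster count in the example above, and the selection of migration candidates by input switching activity, can be sketched as follows. The sketch assumes that packing is limited only by cluster size (input pin limits are ignored) and that VTL and VTH LUTs never share a cluster; function names and activity values are hypothetical.

```python
from math import ceil


def clusters_needed(n_vtl: int, n_vth: int, cluster_size: int) -> int:
    """Clusters required when VTL and VTH LUTs are packed separately."""
    return ceil(n_vtl / cluster_size) + ceil(n_vth / cluster_size)


def pick_for_migration(input_activity, count):
    """Choose the `count` VTH nodes with the highest input activity."""
    ranked = sorted(input_activity, key=input_activity.get, reverse=True)
    return ranked[:count]


# 6 VTL and 2 VTH LUTs with clusters of size 4, before and after migration.
print(clusters_needed(6, 2, 4), clusters_needed(8, 0, 4))
print(pick_for_migration({"a": 0.1, "b": 0.7, "c": 0.4}, 2))
```

Before migration the example needs three clusters; after the two VTH LUTs are migrated back to VTL, the eight LUTs fit into two.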
 
4.2.6. Overall Algorithm 
The full RBBMap algorithm is shown in Figure 4.5 below. In the first stage, LUT 
mapping is performed to transform the original gate network into a LUT network that 
is implementable on the FPGA fabric. The mapping process itself comprises a cut 
enumeration step, cut selection step, and network collapsing step similar to Emap, but 
























Figure 4.5: Pseudocode for the RBBMap algorithm. 
 
/* N is the network to be mapped      */
/* M is the timing graph of network N */
/* K refers to LUT input size         */
function rbbmap (N) {




    backwards_slack_sweep (M, TRUE);  // with slack distribution
    backwards_slack_sweep (M, FALSE); // w/o slack distribution
    migrate_VT (M);
    generate_blif (N, M);
}  /* end rbbmap */
function map_to_LUT (network N) {
    for (each node n ∈ N) {
        enumerate_cuts(n, K);
    }
    for (each node n ∈ N) {
        depth(n) = level(n) in depth-oriented mapping;
        set PI and PO as rooted;
    }
    set crit_path = max (depth(n) for all n ∈ N);
    for (each node n ∈ N) {
        evaluate_slack(n);
    }
    for (each node n ∈ N) {
        if (is_rooted(n)) {
            C(n) = cut_with_lowest_cost (cutset(n));
            for (each fanin node u ∈ C(n)) {
                evaluate_slack(u);





}   /* end LUT mapping */
function backwards_slack_sweep (timing_graph M, distributed) {
    for (each input level from max_level down to 0) {
        for (each node m ∈ nodes_at_level(level)) {
            // edge from m to v corresponds to LUT input to output
            // v is sole outedge from m
            if (slack(m, v) > delay_with_rbb(m, v))
                if (distributed and num_fanout(v) > deg_threshold)
                    skip;
                else
                    set delay(m, v) = delay_with_rbb(m, v);







Next, the full timing graph corresponding to the mapped network is constructed and 
levelized to enable fast timing analysis. Timing graph edges are then annotated with 
their respective LUT or intra-cluster delays, following which the two-phase slack 
reclamation scheme is performed. Finally, a netlist annotated with VT information for 
each LUT is generated. 
 
4.3. RBBPack 
Logic block packing in a dual-VT scenario involves adding the constraint of packing 
LUTs of the same VT levels together in the same cluster, since in our architecture RBB 
is available only at the cluster granularity. 
 
RBBPack is a modified version of T-VPack with this constraint in place. An overview 
of the algorithm is shown in Figure 4.6 below. When starting a new cluster, we first set 
the VT level of that cluster to be equal to that of the seed BLE, that is, the unclustered 
BLE with the most used inputs. 
Subsequently, when evaluating the attraction cost of BLEs, we consider only BLEs of 
the same VT level. This ensures that only BLEs of the same VT level as the 
current cluster are attracted. The seed attraction function and second attraction function for 




Figure 4.6: Pseudocode for the RBBPack algorithm. 
 
4.4. Experimental Results 
An overview of our experimental framework for this work was presented in Section 
3.4. We review the framework briefly in this section. 
 
To obtain power estimates, we use the updated power model in VPR 5.0 [11], [12]. 
Enhancements were made to VPR to support dual-VT clusters, and to the power model 
to support dual-VT power calculations. In the first step of our flow, structural hashing 
is carried out on each benchmark circuit to obtain its corresponding And-Invert-Graph 
(AIG) network using ABC. Switching activity estimates are then generated using ACE. 
/* UnclusteredBLEs is the set of BLEs not contained in any cluster */
/* C is the set of BLEs contained in the current cluster           */
/* LogicClusters is the set of clusters                            */
/* Each cluster is in turn a set of BLEs                           */
UnclusteredBLEs = patternMatchToBLEs (LUTs, Registers);
LogicClusters = NULL;
while (UnclusteredBLEs != NULL) {  // more BLEs left to cluster
    C = GetBLEwithMostUsedInputs (UnclusteredBLEs);
    ClusterVt (C) = BLEVt (C);
    while (|C| < N) {  // cluster not full
        BestBLE = MaxAttractionLegalBLE (C, ClusterVt, UnclusteredBLEs);
        if (BestBLE == NULL)
            break;
        UnclusteredBLEs = UnclusteredBLEs - BestBLE;
        C = C ∪ BestBLE;
    }
    if (|C| < N) {  // cluster not full: try hill climbing
        while (|C| < N) {
            BestBLE = MinClusterInputIncreaseBLE (C, ClusterVt, UnclusteredBLEs);
            C = C ∪ BestBLE;









The AIG network, along with its activity information, is then input into RBBMap. For 
this experiment, a height factor limit of 4 was set. Next, the technology-mapped BLIF netlist, 
annotated with VT information for each LUT, is input into RBBPack for logic block 
packing into VTL and VTH clusters. The output NET file of RBBPack is then fed into the 
cluster activity estimation tool in Odin-II. This generates the activity and function files 
needed for power estimation in VPR. These, along with the NET file output by 
RBBPack containing VTL and VTH information for each cluster, are then input into VPR 
for timing-driven P&R. Additionally, VPR was pushed for minimum area, so as 
to obtain objective figures in which the leakage power saving is attributed almost entirely 
to our RBB enhancements and not to unused logic blocks.  
 
For our baseline, we adopt Emap in conjunction with T-VPack and VPR 5.0 with no 
dual-VT enhancements. We maintain the same experimental flow as described above 
but with RBBMap and RBBPack replaced with Emap and T-VPack respectively. 
P-T-VPack and P-VPR were not used, to avoid further impact on delay. 
 
We generate an architecture XML file for the same process (UMC65LL) used in 
implementing our circuit. HSPICE simulations showed an average 2x increase in 
leakage power relative to 0.18 µm. We utilize a unidirectional single-driver interconnect 
architecture and Wilton switch block, with the cluster architecture presented in Section 
3.2.1. For VTL levels, we have 0.41 V and 0.37 V for NMOS and PMOS respectively. 




In our experiments, we run 18 MCNC benchmark circuits through our toolchain of 
RBBMap, RBBPack and VPR, and report the percentage savings offered by our 
framework against Emap as the baseline. We compare logic block leakage power and 
total power consumption between Emap and our framework. We choose logic block 
leakage power as one of the two key metrics to report because this is what we are 
focused on optimizing. We also report the critical path delay in each case. Table 4.2 
summarises the results of RBBMap/RBBPack against Emap. 
 
 
Table 4.2: Comparison of Logic Block Leakage Power, Total Power and Delay of Emap 
against RBBMap/RBBPack. 
 
From the results, we observe that our tools yield substantial logic block leakage savings in all cases. In 
particular, an average of 70.95% savings in logic block leakage power and 28.30% 
savings in total power consumption is observed. This is to be expected since 
we have a dual-VT LUT architecture. Furthermore, critical paths are minimally 
impacted in all but one design; in some cases, even better figures were obtained. This 
is an interesting find since intuitively, one would expect the circuit to be slower in the 
          |              Emap               |                           RBBMap/RBBPack
Design    | LB Leakage  Total Power  Delay     | LB Leakage  Savings (%)  Total Power  Savings (%)  Delay      Change (%)
alu4        0.0107      0.0217       4.941E-08   0.0034      68.62        0.0199         8.43       4.678E-08   -5.33
apex2       0.0077      0.0199       5.777E-08   0.0031      60.36        0.0224       -12.94       5.699E-08   -1.36
apex4       0.0143      0.0265       5.194E-08   0.0042      70.89        0.0160        39.55       5.753E-08   10.76
bigkey      0.2709      0.3055       3.528E-08   0.0595      78.02        0.0777        74.56       4.010E-08   13.64
clma        0.0424      0.0768       1.281E-07   0.0139      67.26        0.0322        58.05       1.228E-07   -4.17
des         0.4350      0.4750       8.507E-08   0.0935      78.51        0.1488        68.66       5.745E-08  -32.47
diffeq      0.0098      0.0138       6.082E-08   0.0025      74.35        0.0116        16.28       5.275E-08  -13.26
dsip        0.2908      0.3139       3.446E-08   0.0632      78.27        0.0822        73.80       3.446E-08    0.00
elliptic    0.0193      0.0628       7.418E-08   0.0049      74.74        0.0395        37.12       6.942E-08   -6.42
ex1010      0.0427      0.1464       9.975E-08   0.0126      70.51        0.0430        70.59       9.998E-08    0.23
ex5p        0.0136      0.0298       6.000E-08   0.0038      72.37        0.0108        63.57       4.952E-08  -17.48
frisc       0.0132      0.0538       6.638E-08   0.0038      71.19        0.0604       -12.23       6.989E-08    5.29
misex3      0.0140      0.0306       5.061E-08   0.0040      71.85        0.0163        46.86       5.201E-08    2.77
pdc         0.0262      0.0738       9.470E-08   0.0086      67.31        0.0511        30.74       9.875E-08    4.28
s298        0.0063      0.0136       7.447E-08   0.0023      64.17        0.0170       -24.96       6.410E-08  -13.93
seq         0.0147      0.0346       4.747E-08   0.0043      70.75        0.0322         7.01       5.390E-08   13.53
spla        0.0153      0.0467       8.278E-08   0.0060      60.57        0.0907       -94.15       8.221E-08   -0.69
tseng       0.0224      0.0286       3.655E-08   0.0051      77.36        0.0118        58.52       4.853E-08   32.77
AVERAGE                                                      70.95                      28.30                   -0.66
presence of slower VTH clusters. There are therefore two takeaways: (1) upfront 
optimizations early in the EDA flow are feasible and practical, and (2) FPGA 
algorithms interact in complex ways, so results may not be predictable in advance. 
The latter point speaks strongly for our experimental methodology: the entire EDA 
flow must be run in order to understand how well FPGA algorithms work in specific 
architectural scenarios; considering individual stages in isolation is not recommended. 
 
However, we note that total power in fact increased for a few circuits (apex2, frisc, 
s298 and spla). This is due to the additional routing logic required for these circuits, 
resulting in greater routing power dissipation. The exact network characteristics 











In this chapter, we present the second of the two contributions in this thesis: a 
dual-VT switch box architecture and algorithm². While works relating to power reduction in 
routing switches exist, no work to date presents a treatment of 
programmable dual-VT on routing switches or switch boxes. 
 
5.1. Discussion 
Programmable RBB granularity at the switch box level presents a unique challenge to 
an FPGA routing algorithm. While such an architecture presents the clear advantage of 
incurring extremely low area overhead, critical nets have to be segregated from 
non-critical nets, since a switch box shared by both cannot have its VT tuned without 
compromising circuit speed. This is non-trivial given that, in all practical designs, 
routed nets of varying criticalities are typically intertwined and share switch boxes 
and connection boxes. 
 
The Pathfinder Negotiated Congestion Algorithm [62] is a popular routing algorithm 
in academia. In the timing-driven version of this algorithm in VPR, wavefront 
expansion is performed based on the upstream and expected downstream cost of the net in 
question until all sinks are reached. Assuming all nets are routable, timing analysis is 
then performed to obtain the slack of each source-sink connection, which is then used in 
the next routing iteration to resolve congestion by prioritizing paths with little or no 
slack. With each routing iteration, the previous routing is ripped up and rerouted; in 
the first iteration, every connection is routed as critical, even if congestion occurs, 
and after some number of iterations all congestion is resolved and a routed design is 
obtained. The algorithm has demonstrated excellent performance in most previous work. 

² The first version of this work was recently presented at the 8th Symposium on Applied 
Reconfigurable Computing (ARC 2012), Mar 2012, Hong Kong. 
 
Intuitively, one way to take advantage of such an architecture is to have nets with low 
criticality avoid low-VT switch boxes that are used by the critical net, and vice-versa 
for highly critical nets. However, such an avoidance scheme would result in extremely 
poor performance as this would cause non-critical nets to extend in length,  rendering 
them more critical in the next iteration. In the worst case, such nets can end up 
becoming more critical than the “real” critical nets. The algorithm would then optimize 
this falsely critical net in the next routing iteration, resulting in indeﬁnitely extended 





Figure 5.1: Route extension due to switch box avoidance scheme 
 
We assume in this example that the critical net (red) was routed before the non-critical 
net (green). In the original Pathfinder algorithm, the non-critical net can use SB_2 to 
reach its sink at CLB_7. However, because SB_2 has already been designated as low-VT 
when routing the critical net, the modified algorithm needs to take a longer detour, 
namely through SB_4 and SB_5, to get to the same sink. If the delay 
of this detoured route exceeds that of the critical net, the former will be 
recognized as critical in the next iteration and prioritized over the real critical net. 
The problem is repeated in subsequent routing iterations, with the net effect of critical path 































5.2. Dual-VT Switch Box Pathfinder Algorithm 
To deal with the limitations of the original Pathfinder algorithm as applied to the 
dual-VT switch box problem, we propose an enhanced version of the Pathfinder algorithm, 
which we call the Dual-VT Switch Box Pathfinder Algorithm, with added intelligence 
in the handling of net criticality. 
 
5.2.1. Motivation 
We make the following observations in a routing scenario: 
• The original Pathfinder algorithm permits relatively free play in the routing 
  order of nets. While the pins of a net are sorted on criticality when routing that 
  net, no sorting is performed on the nets themselves; 
• A single net can have sinks of vastly different criticalities; 
• Any net must be completely routed before proceeding to the next net. It is not 
  possible to route a different net before the current net has completed routing; 
• Assuming the VT level of a switch box is set by the first net routed through it, 
  nets that are routed later in the routing iteration are increasingly 
  deprived of the right to decide VT levels, even if these nets are critical; 
• Blocking expansions into switch boxes of different VT levels severely limits the 
  availability of routing logic to the current net being routed; 
• In reality, very few nets actually constitute the critical path in the routed design. 
With respect to the fourth point, one alternative is to override VTH switch boxes to VTL 
when routing critical nets. Such a heuristic is, however, difficult to design in practice, 
since switch boxes that have been set to VTL have no way of being set back to VTH, 
resulting in the likely scenario where most or all nodes are set to VTL, defeating the 
original intent of the algorithm. 
 
Motivated by the above observations, the Dual-VT Switch Box Pathfinder Algorithm 
adopts a net criticality ranking scheme, wherein critical nets are routed before 
non-critical nets. Two different methods of determining and ranking net criticalities were 
implemented: Maximum Criticality and Weighted Average Criticality. Both methods 
are presented later. 
 
VT levels of switch boxes are set according to the criticality of the first net that uses the 
switch box. If the net’s criticality exceeds a specified criticality threshold, it is 
considered to be critical, and is routed as a critical net. The opposite is true if the 
criticality falls below this threshold: the net is routed as non-critical. Critical nets are assured to 
have sufficient VTL switch boxes for routing since all critical nets are routed before 
non-critical nets. Consequently, switch boxes that are not used during the routing of 
critical nets can assume VTH levels. When performing wave expansion for non-critical 
nets, we do not prune away switches on VTL switch boxes. Instead, we cost switches 
on the wavefront accordingly when evaluating the Elmore delay [9] for expanding into 
a node   that was reached via the switch       : 
                                                
         
 





              
                                                
                                                                 
  
In this way, we no longer limit the switches available to the current route – the 
algorithm is free to choose between switches of lowest cost with no impact to the 
number of switch boxes that can be set to VTH (since unused switch boxes should 
always be set to VTH). 
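
The first-user-decides rule described above can be illustrated with a minimal Python sketch. It is not the DVT-VPR implementation: the routes are supplied up front rather than found by wavefront expansion, and the names `assign_switch_box_vt`, `routes` and `net_crit` are hypothetical, introduced only for this illustration.

```python
def assign_switch_box_vt(routes, net_crit, crit_threshold, all_switch_boxes):
    """Return {switch_box: "VTL" or "VTH"} under a first-user-decides rule.

    routes           -- hypothetical map: net name -> switch boxes on its route
    net_crit         -- hypothetical map: net name -> net criticality
    crit_threshold   -- nets above this value are routed as critical
    all_switch_boxes -- every switch box in the array, used or not
    """
    vt = {}
    # Nets are visited in decreasing criticality, so critical nets claim first.
    for net in sorted(routes, key=lambda n: net_crit[n], reverse=True):
        level = "VTL" if net_crit[net] > crit_threshold else "VTH"
        for sb in routes[net]:
            # setdefault keeps the level chosen by the first net to use the box;
            # a later non-critical net may still pass through a VTL box.
            vt.setdefault(sb, level)
    # Switch boxes that no net touched default to VTH to reduce leakage.
    for sb in all_switch_boxes:
        vt.setdefault(sb, "VTH")
    return vt


routes = {"crit_net": ["SB_1", "SB_2"], "slow_net": ["SB_2", "SB_4", "SB_5"]}
net_crit = {"crit_net": 0.99, "slow_net": 0.40}
vt = assign_switch_box_vt(routes, net_crit, 0.9,
                          ["SB_1", "SB_2", "SB_3", "SB_4", "SB_5"])
print(vt)
```

In this example the non-critical net shares SB_2 with the critical net without changing its VT level, while SB_3, which no net uses, ends up at VTH.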
 
5.2.2. Net Criticality Ranking 
In our algorithm, we introduce the notion of net criticality, which we define as the 
criticality associated with a net. Previously, criticality was restricted only to sinks for 
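each net [9]: 
\[ \mathrm{Crit}(i,j) = \max\left( \left[ \mathrm{MaxCrit} - \frac{\mathrm{slack}(i,j)}{D_{max}} \right]^{\eta},\ 0 \right) \tag{5.2} \]
where $\mathrm{Crit}(i,j)$ is the criticality of sink $j$ in the current net $i$ being routed, $D_{max}$ is the 
circuit critical path delay, and $\mathrm{slack}(i,j)$ is the slack of the connection between the source 
and sink $j$ of net $i$. $\eta$ and $\mathrm{MaxCrit}$ are parameters controlling how a connection’s 
slack impacts the congestion-delay trade-off in the cost function to include the node $n$ 
in net $i$’s routing: 
\[ \mathrm{Cost}(n) = \mathrm{Crit}(i,j) \cdot \mathrm{delay}(n) + \left[ 1 - \mathrm{Crit}(i,j) \right] \cdot b(n) \cdot h(n) \cdot p(n) \tag{5.3} \]
where $b(n)$, $h(n)$ and $p(n)$ are the base cost, historical congestion, and present 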




The notion of net criticality enables us to perform ranking of nets. Two methods of net 
criticality evaluation and ranking were considered: Maximum Criticality, and Weighted 
Average Criticality. In both cases, the key requirement is to preserve, as far as possible, 
the critical path delay as obtained in a non-dual-VT scenario. Results comparing the 
two ranking methods are presented in Section 5.4.1. 
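
As a hedged illustration of the sink criticality (5.2) and node cost (5.3) that the ranking builds on, the Python sketch below assumes the standard timing-driven VPR form of both expressions from [9]. The function and parameter names are hypothetical, and the defaults `max_crit=0.99` and `eta=1.0` are illustrative values only.

```python
def sink_criticality(slack, d_max, max_crit=0.99, eta=1.0):
    """Criticality of one source-sink connection, assumed form of (5.2).

    slack -- slack of the connection (s); d_max -- critical path delay (s).
    """
    # Clamp before exponentiation so a negative base is never raised to a
    # fractional power; for eta > 0 this equals max([...]**eta, 0).
    base = max(max_crit - slack / d_max, 0.0)
    return base ** eta


def node_cost(crit, delay, base_cost, hist_cost, pres_cost):
    """Cost of adding a routing node, assumed form of (5.3)."""
    # Critical connections weigh delay; slack-rich ones weigh congestion.
    return crit * delay + (1.0 - crit) * base_cost * hist_cost * pres_cost


print(sink_criticality(0.0, 50e-9))     # connection on the critical path
print(sink_criticality(50e-9, 50e-9))   # slack as large as the critical path
print(node_cost(0.0, 2.0, 1.0, 3.0, 2.0))
```

A connection with zero slack receives the maximum criticality, so its cost is dominated by delay; a connection with ample slack is costed almost entirely on congestion.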
 
Maximum Criticality 
In the Maximum Criticality scheme, nets are ranked by the most critical sink of the net: 
\[ \mathrm{Crit}(i) = \max_{j}\ \mathrm{Crit}(i,j) \tag{5.4} \]
When more than one net has the same criticality, the tie breaker is the summation of 
criticalities over all sinks. This has the advantage of guaranteeing that the most critical 
nets are routed first; the disadvantage is its inherent inability to account for how many 
critical sinks each net has. Consider three nets, A, B and C, with the following sink 
criticality distribution: 
Net A 
• Sink 1: 0.99 
• Sink 2: 0.10 
 
Net B 
• Sink 1: 0.98 
• Sink 2: 0.98 
• Sink 3: 0.96 
 
Net C 
• Sink 1: 0.70 
• Sink 2: 0.70 
• Sink 3: 0.60 
• Sink 4: 0.70 




Assume a criticality threshold of 0.9. In the Maximum Criticality scheme, Net A, 
having the sink (sink 1) with the highest criticality, is ranked higher and therefore routed 
before Net B. This is despite the fact that Net B has more sinks, each with a 
criticality exceeding the criticality threshold. Net A, on the other hand, has only 
one other sink, with criticality much lower than the criticality threshold. 
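
The ranking rule can be sketched in a few lines of Python using the three example nets. The dictionary layout and the function name `rank_by_max_criticality` are assumptions made for this illustration, not the data structures used in VPR.

```python
# Sink criticalities of the three example nets.
NETS = {
    "A": [0.99, 0.10],
    "B": [0.98, 0.98, 0.96],
    "C": [0.70, 0.70, 0.60, 0.70],
}


def rank_by_max_criticality(nets):
    """Order nets by their most critical sink; ties go to the larger sum."""
    return sorted(nets,
                  key=lambda name: (max(nets[name]), sum(nets[name])),
                  reverse=True)


order = rank_by_max_criticality(NETS)
print(order)  # ['A', 'B', 'C']
```

Net A is routed first purely on the strength of its single 0.99 sink, which is the behaviour the example is meant to expose.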
 
Weighted Average Criticality 
In view of the problem above, the Weighted Average Criticality scheme is proposed. 
We first observe that simply taking the average criticality of all sinks is not a good 
solution: nets with both extremely critical and extremely slack sinks become 
noticeably skewed. In the example above, Net A has an average criticality of 0.545, 
rendering it easily outranked by Net C (0.675), which is in fact not critical by any margin. 
Taking the total criticality over all sinks is also a poor solution, since such a 
scheme favours nets with more sinks. Again in the example above, Net C (total 2.70) 
comfortably outranks Net A (1.09) and nearly matches Net B (2.92). 
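
The two rejected measures are easy to check numerically for the example nets. The short Python sketch below is an illustration only, with hypothetical variable names.

```python
# Sink criticalities of the three example nets.
nets = {
    "A": [0.99, 0.10],
    "B": [0.98, 0.98, 0.96],
    "C": [0.70, 0.70, 0.60, 0.70],
}

# Plain average and plain total of the sink criticalities of each net.
average = {name: sum(c) / len(c) for name, c in nets.items()}
total = {name: sum(c) for name, c in nets.items()}

by_average = sorted(nets, key=average.get, reverse=True)
by_total = sorted(nets, key=total.get, reverse=True)

for name in nets:
    print(name, round(average[name], 3), round(total[name], 2))
print(by_average, by_total)
```

Under both measures Net A, which carries the single most critical sink, falls behind Net C, which has no sink above the 0.9 threshold.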
 
The Weighted Average Criticality scheme attempts to balance this by doubling the 
weightage of sinks with criticalities above the criticality threshold: 
            
                            




                         
                                  





In the example above, assuming a criticality threshold of 0.9 with    , Net A, B and 
C would have weighted average criticality values of 1.04, 5.84 and 0.64 respectively. 
The routing order would then be Net B, Net A and then Net C. 
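
One possible reading of the weighting rule is sketched below in Python. Dividing by the number of sinks is an assumption made for this sketch: it reproduces the value quoted for Net A and the routing order B, A, C, but the original equation may normalize differently, so the other absolute values should not be read off this sketch. The function name and the `weight` parameter are hypothetical.

```python
def weighted_average_criticality(sink_crits, crit_threshold, weight=2.0):
    """Average sink criticality with above-threshold sinks weighted up.

    Assumption: the weighted sum is divided by the number of sinks.
    """
    weighted = [c * weight if c > crit_threshold else c for c in sink_crits]
    return sum(weighted) / len(sink_crits)


nets = {
    "A": [0.99, 0.10],
    "B": [0.98, 0.98, 0.96],
    "C": [0.70, 0.70, 0.60, 0.70],
}
scores = {name: weighted_average_criticality(c, 0.9) for name, c in nets.items()}
order = sorted(nets, key=scores.get, reverse=True)
print(order)  # ['B', 'A', 'C']
```

Doubling the weight of critical sinks lifts Net B above Net A, while Net C, with no sink above the threshold, keeps its plain average and drops to last place.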
 
5.2.3. Switch Box Cost Enhancement 
In addition, we introduce a switch box enhancement to minimize the number of VTL 
switch boxes used. When expanding into a wire for critical nets, we prioritize longer 
wires by lowering the associated downstream cost with a factor   if it stays within the 
bounding box; this encourages the algorithm to use fewer VTL switch boxes, as well 
as guiding it to “skip” past potential VTH switch boxes. In Figure 5.2 below, the VTL switch 
box SB_1 can “skip” through the switch box SB_2 using a length-2 wire. We found a   





Figure 5.2: Critical connection skipping through an unset switch box 
[Figure labels: SB_1 (VTL), SB_2 (unset/VTH), SB_3 (VTL)] 
 
5.2.4. Unused Switch Boxes 
In addition, all unused switch boxes and LUTs are set to VTH in order to further reduce 
leakage. While the effect of this is relatively limited in our experimental flow, because 
VPR implements the minimum array size required to realize a particular circuit, such 
an approach is expected to be highly relevant to the industry, since real FPGAs have a 
fixed array size and the amount of leftover unused logic can be quite high. 
 
5.2.5. Overall Algorithm 
The pseudocode for the Dual-VT Switch Box Pathfinder Algorithm is shown in Figure 5.3 
below. At the start of each routing iteration, nets are first ranked using either the 
Maximum Criticality or the Weighted Average Criticality scheme before proceeding with 
timing-driven routing. A criticality threshold determines whether the current net being 
routed is critical or non-critical. At the end of some iteration n, either a successful 





Figure 5.3: Pseudocode for the Dual-VT Switch Box Pathfinder Algorithm 
 
/* PriorityQueue stores nodes in the current expansion, sorted on total_cost */
function dvt_timing_driven_route () {
    for (all nets i and sinks j)
        set Crit(i,j) = MaxCrit;
    while (overused resources exist) {
        rank_nets (all nets i);  // rank by max crit or weighted average
        foreach (net i in decreasing order of criticality) {

            update switch box locations and VT;
        }
        update historical costs for all n;
        perform timing analysis and update Crit(i,j);
    }
}  // end single routing iteration

function timing_driven_route_net (net, crit_threshold) {
    rip up routing tree;
    update congestion cost;
    foreach (sink in decreasing Crit(i,j)) {

        while (not sink) {  // start wave expansion
            pop lowest cost node m from PriorityQueue;
            if (best path to m) {
                update path_cost(m) and total_cost(m);

                        update cost to VTL;
                    }
                    else {  // not route_as_critical
                        if (switch box is VTL)
                            update cost to VTL;
                        else {
                            if (switch box not used)
                                set to VTH;
                            update cost to VTH;
                        }
                    }
                    add n to PriorityQueue;
                }
            }
        }  // end wave expansion
        update congestion costs and routing tree delay;
        for (all expanded nodes n)





5.3. Experimental Results 
An overview of our experimental framework for this work was presented in Section 
3.4.3. We review the framework briefly in this section. 
 
In the first stage, we take the benchmark circuit through ABC. Next, the 
technology-mapped BLIF generated by ABC is fed into T-VPack for timing-driven packing. The 
output net file from T-VPack is then input into the cluster activity estimation tool (or 
ACE) in Odin-II. The netlist file generated by T-VPack is passed into DVT-VPR, 
where our Dual-VT Switch Box Pathfinder Algorithm resides. Finally, the output 
files from ACE are consumed by the power estimation tool in VPR for dynamic and 
static power calculations.  
 
Similar to RBBMap, an architecture file was generated from characterization of our 
circuit implementation. For wire distribution, we create a frequency spread of 4x, 2x 
and 1x for length-1, length-2 and length-4 wires respectively, with neither switch box nor 
connection box depopulation. We push VPR for minimum area so as to obtain 
objective figures in which leakage power saving is attributed entirely to our routing 
algorithm and not to unused switch boxes. Again, we do not use P-T-VPack and P-VPR, 
so as to restrict delay impact only to our enhancements. 
 
We utilize a unidirectional single-driver interconnect architecture and the Wilton switch 
block, but with our dual-VT enhancements. For VTL levels, we have 0.41 V and 0.37 V for 
NMOS and PMOS respectively. RBB voltage for the switch box is set to -1.2 V and 
2.4 V for NMOS and PMOS respectively. 
 
5.3.1. Criticality Threshold Investigation 
In this section, we investigate the influence of the criticality threshold on the 
performance of each net criticality ranking method. Experiments were performed on a 
spread of criticality thresholds ranging from 0.60 to 0.98, in steps of 0.02. 
 
Maximum Criticality 
We start with net ranking by Maximum Criticality. Individual critical path delays and 
percentage standard deviations are tabulated in Table 5.1 and Table 5.3 respectively. 
 
 
Table 5.1: Delay vs Criticality Threshold (0.6 to 0.78) – Rank by Max Criticality 
Delay (s) vs Criticality Threshold: Rank by Max Criticality
Design 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78
alu4 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.56E-08 4.52E-08 4.50E-08 4.52E-08
apex2 5.17E-08 5.15E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08
apex4 4.45E-08 4.45E-08 4.58E-08 4.84E-08 5.02E-08 4.46E-08 4.71E-08 4.75E-08 4.42E-08 4.82E-08
bigkey 2.84E-08 2.84E-08 2.84E-08 2.84E-08 2.84E-08 2.84E-08 2.84E-08 2.76E-08 2.89E-08 2.84E-08
clma 9.67E-08 1.01E-07 9.51E-08 9.44E-08 9.44E-08 9.42E-08 9.44E-08 9.44E-08 9.43E-08 9.48E-08
des 5.92E-08 5.92E-08 5.92E-08 5.99E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08
diffeq 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08
dsip 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08
elliptic 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08
ex1010 9.13E-08 9.11E-08 9.38E-08 9.14E-08 9.10E-08 9.13E-08 9.12E-08 9.28E-08 9.13E-08 9.10E-08
ex5p 4.57E-08 4.64E-08 4.55E-08 4.57E-08 4.62E-08 4.57E-08 4.53E-08 4.56E-08 4.53E-08 4.57E-08
frisc 6.26E-08 6.52E-08 6.38E-08 7.01E-08 6.59E-08 6.30E-08 6.37E-08 6.22E-08 6.28E-08 6.23E-08
misex3 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08
pdc 8.74E-08 8.96E-08 8.75E-08 8.68E-08 9.11E-08 9.62E-08 9.02E-08 1.06E-07 1.01E-07 8.53E-08
s298 5.72E-08 5.68E-08 5.72E-08 5.69E-08 5.72E-08 5.72E-08 5.74E-08 5.72E-08 5.67E-08 5.68E-08
s38417 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.54E-08 4.46E-08
s38584.1 4.90E-08 4.83E-08 4.78E-08 4.82E-08 4.82E-08 4.82E-08 4.82E-08 4.82E-08 4.80E-08 4.82E-08
seq 4.92E-08 4.93E-08 4.89E-08 4.92E-08 4.94E-08 4.92E-08 4.92E-08 4.92E-08 4.92E-08 4.92E-08
spla 6.57E-08 6.86E-08 6.92E-08 6.62E-08 7.24E-08 6.74E-08 6.66E-08 6.70E-08 6.70E-08 7.15E-08




Table 5.2: Delay vs Criticality Threshold (0.8 to 0.98) – Rank by Max Criticality 
Delay (s) vs Criticality Threshold: Rank by Max Criticality
Design 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98
alu4 4.52E-08 4.52E-08 4.52E-08 4.57E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08
apex2 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08 5.17E-08
apex4 4.91E-08 4.40E-08 4.49E-08 4.57E-08 4.53E-08 5.13E-08 5.16E-08 4.73E-08 4.71E-08 4.71E-08
bigkey 2.84E-08 2.84E-08 2.84E-08 2.84E-08 2.82E-08 2.76E-08 2.84E-08 2.76E-08 2.76E-08 2.84E-08
clma 9.44E-08 9.47E-08 9.56E-08 9.67E-08 9.44E-08 9.42E-08 9.64E-08 9.52E-08 9.41E-08 9.58E-08
des 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08
diffeq 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08
dsip 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08
elliptic 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08
ex1010 9.20E-08 9.11E-08 9.13E-08 9.13E-08 9.10E-08 9.11E-08 9.13E-08 9.29E-08 9.17E-08 9.17E-08
ex5p 4.73E-08 4.70E-08 4.57E-08 4.73E-08 4.62E-08 4.57E-08 4.57E-08 4.57E-08 4.58E-08 4.58E-08
frisc 6.25E-08 6.20E-08 6.45E-08 6.54E-08 6.54E-08 6.50E-08 6.37E-08 6.51E-08 6.22E-08 6.30E-08
misex3 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08
pdc 1.01E-07 8.99E-08 8.44E-08 9.37E-08 1.23E-07 9.32E-08 8.63E-08 1.15E-07 9.22E-08 8.55E-08
s298 5.72E-08 5.72E-08 5.67E-08 5.67E-08 5.72E-08 5.67E-08 5.67E-08 5.72E-08 5.72E-08 5.67E-08
s38417 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08
s38584.1 4.82E-08 4.90E-08 4.90E-08 4.90E-08 4.82E-08 4.90E-08 4.90E-08 4.90E-08 4.90E-08 4.78E-08
seq 5.00E-08 4.89E-08 4.92E-08 4.92E-08 4.92E-08 4.92E-08 4.94E-08 4.92E-08 4.92E-08 4.92E-08
spla 6.56E-08 7.30E-08 6.59E-08 6.70E-08 7.75E-08 6.82E-08 6.78E-08 6.64E-08 6.56E-08 7.70E-08
tseng 2.37E-08 2.42E-08 2.42E-08 2.34E-08 2.34E-08 2.34E-08 2.56E-08 2.34E-08 2.34E-08 2.34E-08
 
Table 5.3: Critical Path Delay Standard Deviation, Average and Percentage Standard 
Deviation – Rank by Max Criticality 
Design Std Dev (s) Average (s) %age Std Dev
alu4 1.49E-10 4.52E-08 0.33%
apex2 3.77E-11 5.17E-08 0.07%
apex4 2.34E-09 4.69E-08 4.98%
bigkey 3.52E-10 2.82E-08 1.25%
clma 1.52E-09 9.53E-08 1.60%
des 1.61E-10 5.93E-08 0.27%
diffeq 0.00E+00 3.05E-08 0.00%
dsip 6.79E-24 2.88E-08 0.00%
elliptic 6.79E-24 5.37E-08 0.00%
ex1010 7.44E-10 9.16E-08 0.81%
ex5p 5.97E-10 4.60E-08 1.30%
frisc 1.92E-09 6.40E-08 2.99%
misex3 1.36E-23 4.50E-08 0.00%
pdc 1.04E-08 9.43E-08 11.03%
s298 2.63E-10 5.70E-08 0.46%
s38417 1.74E-10 4.46E-08 0.39%
s38584.1 4.54E-10 4.85E-08 0.94%
seq 2.19E-10 4.93E-08 0.44%
spla 3.63E-09 6.88E-08 5.28%
tseng 1.77E-09 2.41E-08 7.32%
 
From the tables above, we observe that while there is some influence on delay across 
different criticality thresholds, the deviation tends to be small. To determine whether a 
trend exists,  Figure 5.4 below plots the critical path delays against criticality threshold 
(to reduce clutter, we limit the plot to 8 designs with similar critical path delays): 
 
 
Figure 5.4: Plot of Critical Path Delay vs Criticality Threshold – Rank by Max Criticality 
 
We see that critical path delay is not extremely sensitive to criticality threshold value. 
This result is actually to be expected, since the most critical nets will always have 
criticality values of 0.99, exceeding the criticality threshold in all cases. What the 
results suggest is that, in most designs, ample slack exists on non-critical paths for 
RBB optimization, which we know to be the case. 
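
The percentage standard deviation reported per design can be recomputed from the per-threshold delays. The Python sketch below does so for alu4 using the values in Tables 5.1 and 5.2; it assumes the sample standard deviation, since the estimator is not stated, and `percentage_std_dev` is a hypothetical helper name.

```python
import statistics


def percentage_std_dev(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# alu4 critical path delays (s) for thresholds 0.60 to 0.98 in steps of 0.02,
# taken from Tables 5.1 and 5.2.
alu4_delays = [
    4.52e-08, 4.52e-08, 4.52e-08, 4.52e-08, 4.52e-08,
    4.52e-08, 4.56e-08, 4.52e-08, 4.50e-08, 4.52e-08,
    4.52e-08, 4.52e-08, 4.52e-08, 4.57e-08, 4.52e-08,
    4.52e-08, 4.52e-08, 4.52e-08, 4.52e-08, 4.52e-08,
]

pct = percentage_std_dev(alu4_delays)
print(f"{pct:.2f}%")
```

For alu4 this comes to roughly a third of a percent, in line with the small deviations reported in Table 5.3.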
 
Next, we investigate the influence of the criticality threshold on routing leakage power. 


























scheme was set out to optimize. Figures and plots are generated in similar fashion to 
critical path delay. 
 
Table 5.4: Routing Leakage Power vs Criticality Threshold – Rank by Max Criticality 
Routing Leakage Power (W) vs Criticality Threshold: Rank by Max Criticality
Design 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78
alu4 3.01E-03 3.02E-03 3.00E-03 4.02E-03 2.99E-03 2.99E-03 4.02E-03 4.10E-03 3.02E-03 3.00E-03
apex2 1.06E-02 9.18E-03 7.73E-03 1.06E-02 1.21E-02 7.73E-03 1.06E-02 9.10E-03 7.68E-03 9.01E-03
apex4 1.60E-02 1.33E-02 1.56E-02 1.72E-02 1.56E-02 1.20E-02 1.42E-02 1.33E-02 1.94E-02 1.45E-02
bigkey 1.40E-02 1.41E-02 1.38E-02 1.40E-02 1.41E-02 1.40E-02 1.40E-02 1.41E-02 1.40E-02 1.39E-02
clma 9.46E-02 1.15E-01 9.93E-02 1.88E-01 8.35E-02 1.46E-01 1.63E-01 8.98E-02 9.47E-02 1.25E-01
des 2.77E-02 2.76E-02 2.75E-02 2.76E-02 2.75E-02 2.74E-02 2.76E-02 2.72E-02 2.73E-02 2.76E-02
diffeq 3.52E-03 3.55E-03 3.46E-03 3.06E-03 3.53E-03 3.50E-03 3.53E-03 3.54E-03 3.54E-03 3.10E-03
dsip 1.56E-02 1.57E-02 1.56E-02 1.57E-02 1.57E-02 1.57E-02 1.56E-02 1.58E-02 1.58E-02 1.56E-02
elliptic 2.15E-02 2.63E-02 3.07E-02 2.31E-02 2.48E-02 2.80E-02 2.77E-02 1.69E-02 1.37E-02 2.47E-02
ex1010 7.32E-02 7.34E-02 7.34E-02 5.80E-02 5.09E-02 7.33E-02 7.33E-02 7.33E-02 5.11E-02 5.07E-02
ex5p 6.74E-03 6.92E-03 6.81E-03 6.83E-03 6.79E-03 6.93E-03 6.81E-03 6.82E-03 6.77E-03 6.82E-03
frisc 1.29E-01 1.34E-01 1.56E-01 1.35E-01 1.44E-01 1.61E-01 1.21E-01 1.11E-01 1.40E-01 1.41E-01
misex3 9.62E-03 1.04E-02 8.65E-03 5.78E-03 8.61E-03 8.60E-03 5.81E-03 6.78E-03 6.87E-03 6.80E-03
pdc 1.44E-01 1.08E-01 1.60E-01 1.44E-01 1.30E-01 1.02E-01 1.49E-01 1.47E-01 8.36E-02 9.15E-02
s298 6.54E-03 6.69E-03 6.60E-03 6.71E-03 4.93E-03 6.58E-03 6.65E-03 6.63E-03 6.63E-03 6.62E-03
s38417 2.76E-02 2.74E-02 2.73E-02 2.77E-02 2.76E-02 2.78E-02 2.79E-02 2.79E-02 2.93E-02 2.95E-02
s38584.1 1.61E-02 1.59E-02 1.60E-02 1.61E-02 1.60E-02 1.60E-02 1.60E-02 1.60E-02 1.60E-02 1.61E-02
seq 1.17E-02 1.19E-02 1.18E-02 1.19E-02 1.19E-02 1.19E-02 1.18E-02 1.18E-02 1.18E-02 1.19E-02
spla 2.00E-02 6.12E-02 7.08E-02 6.24E-02 2.01E-02 6.34E-02 3.95E-02 6.31E-02 6.28E-02 3.67E-02
tseng 3.99E-03 3.99E-03 4.02E-03 4.01E-03 3.99E-03 4.01E-03 3.98E-03 3.99E-03 4.03E-03 4.04E-03
Routing Leakage Power (W) vs Criticality Threshold: Rank by Max Criticality
Design 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98
alu4 3.00E-03 3.04E-03 3.01E-03 3.02E-03 4.02E-03 3.00E-03 4.02E-03 4.02E-03 3.05E-03 3.00E-03
apex2 9.12E-03 1.07E-02 9.08E-03 1.07E-02 9.03E-03 9.03E-03 9.03E-03 9.03E-03 1.05E-02 1.06E-02
apex4 2.11E-02 1.07E-02 1.86E-02 1.33E-02 1.97E-02 1.29E-02 1.19E-02 1.21E-02 1.20E-02 1.20E-02
bigkey 1.37E-02 1.39E-02 1.40E-02 1.38E-02 1.42E-02 1.40E-02 1.39E-02 1.41E-02 1.40E-02 1.41E-02
clma 1.51E-01 9.37E-02 1.21E-01 1.20E-01 1.43E-01 1.36E-01 1.10E-01 1.70E-01 1.36E-01 1.20E-01
des 2.75E-02 2.75E-02 2.77E-02 2.76E-02 2.75E-02 2.78E-02 2.75E-02 2.73E-02 2.78E-02 2.74E-02
diffeq 3.06E-03 3.07E-03 3.10E-03 3.09E-03 3.09E-03 3.10E-03 3.54E-03 3.54E-03 3.07E-03 3.53E-03
dsip 1.57E-02 1.57E-02 1.56E-02 1.55E-02 1.54E-02 1.55E-02 1.56E-02 1.56E-02 1.54E-02 1.57E-02
elliptic 2.33E-02 2.46E-02 2.33E-02 2.18E-02 1.68E-02 1.40E-02 2.32E-02 2.32E-02 2.30E-02 1.71E-02
ex1010 5.06E-02 7.31E-02 7.37E-02 7.33E-02 7.35E-02 5.10E-02 7.29E-02 7.30E-02 7.34E-02 7.34E-02
ex5p 6.81E-03 5.90E-03 5.94E-03 5.92E-03 6.71E-03 6.79E-03 6.79E-03 6.79E-03 6.80E-03 6.65E-03
frisc 1.26E-01 1.48E-01 1.36E-01 1.57E-01 1.14E-01 9.26E-02 1.29E-01 1.11E-01 1.83E-01 1.04E-01
misex3 5.95E-03 6.75E-03 6.69E-03 8.76E-03 8.76E-03 8.76E-03 8.76E-03 8.66E-03 8.66E-03 8.66E-03
pdc 1.75E-01 8.87E-02 7.40E-02 1.36E-01 1.33E-01 1.19E-01 1.08E-01 8.85E-02 8.78E-02 1.07E-01
s298 6.60E-03 6.66E-03 5.84E-03 6.64E-03 6.65E-03 6.67E-03 6.68E-03 6.60E-03 6.69E-03 6.68E-03
s38417 2.93E-02 2.76E-02 2.76E-02 2.79E-02 2.93E-02 2.61E-02 2.93E-02 2.93E-02 2.81E-02 2.77E-02
s38584.1 1.61E-02 1.61E-02 1.60E-02 1.60E-02 1.60E-02 1.35E-02 1.35E-02 1.35E-02 1.35E-02 1.36E-02
seq 1.19E-02 1.18E-02 1.18E-02 1.18E-02 1.18E-02 1.18E-02 1.19E-02 1.20E-02 1.18E-02 1.18E-02
spla 3.64E-02 3.95E-02 6.26E-02 6.27E-02 1.99E-02 6.26E-02 2.64E-02 6.34E-02 6.17E-02 2.02E-02




Table 5.5: Leakage Power Standard Deviation, Average and Percentage Standard 
Deviation – Rank by Max Criticality 
Design Std Dev (W) Average (W) %age Std Dev
alu4 4.81E-04 3.32E-03 14.51%
apex2 1.18E-03 9.55E-03 12.36%
apex4 3.04E-03 1.48E-02 20.55%
bigkey 1.16E-04 1.40E-02 0.83%
clma 2.90E-02 1.25E-01 23.22%
des 1.65E-04 2.75E-02 0.60%
diffeq 2.28E-04 3.33E-03 6.87%
dsip 1.11E-04 1.56E-02 0.71%
elliptic 4.60E-03 2.24E-02 20.53%
ex1010 1.01E-02 6.69E-02 15.09%
ex5p 3.29E-04 6.67E-03 4.93%
frisc 2.16E-02 1.34E-01 16.15%
misex3 1.35E-03 7.92E-03 17.06%
pdc 2.89E-02 1.19E-01 24.33%
s298 4.15E-04 6.51E-03 6.37%
s38417 9.09E-04 2.81E-02 3.23%
s38584.1 1.11E-03 1.54E-02 7.19%
seq 6.95E-05 1.18E-02 0.59%
spla 1.87E-02 4.78E-02 39.15%





































While there is again no discernible trend in routing leakage power against criticality 
threshold, a greater percentage standard deviation is observed. This suggests that the 
criticality threshold does in fact have a noticeable impact on the conversion rates of 
switch boxes to VTH, although finding a sweet spot that works well for all designs is a 
challenging task. Another important implication of this result is that critical nets tend 
to be routed through the same switch boxes in this scheme; otherwise, we would 
observe increased leakage power with decreasing criticality thresholds. 
 
Weighted Average Criticality 
The same experiment is repeated with the Weighted Average Criticality net ranking 
method, with    . Respective figures and plots for critical path delay and routing 




Delay (s) vs Criticality Threshold: Rank by Weighted Average Criticality
Design 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78
alu4 4.55E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.50E-08 4.52E-08 4.52E-08 4.54E-08 4.50E-08
apex2 5.67E-08 5.30E-08 5.17E-08 5.22E-08 5.22E-08 5.17E-08 5.17E-08 5.20E-08 5.17E-08 5.17E-08
apex4 4.39E-08 4.83E-08 4.78E-08 4.65E-08 4.79E-08 4.39E-08 4.68E-08 5.83E-08 4.59E-08 4.53E-08
bigkey 2.81E-08 2.81E-08 2.76E-08 2.76E-08 2.84E-08 2.76E-08 2.76E-08 2.76E-08 2.76E-08 2.76E-08
clma 9.69E-08 9.44E-08 9.47E-08 9.61E-08 9.44E-08 9.51E-08 9.50E-08 9.96E-08 9.44E-08 9.61E-08
des 5.92E-08 5.92E-08 5.92E-08 5.94E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.93E-08
diffeq 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08
dsip 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08
elliptic 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.44E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08
ex1010 9.34E-08 9.16E-08 9.13E-08 9.17E-08 9.19E-08 9.15E-08 9.19E-08 9.19E-08 9.17E-08 9.17E-08
ex5p 4.58E-08 4.51E-08 4.53E-08 4.50E-08 4.58E-08 4.51E-08 4.57E-08 4.64E-08 4.52E-08 4.58E-08
frisc 7.49E-08 6.23E-08 6.43E-08 6.26E-08 8.12E-08 6.34E-08 6.38E-08 6.25E-08 6.24E-08 6.19E-08
misex3 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08
pdc 8.40E-08 8.79E-08 8.66E-08 8.28E-08 8.72E-08 8.38E-08 8.64E-08 8.76E-08 8.66E-08 1.03E-07
s298 5.69E-08 5.76E-08 5.72E-08 5.67E-08 5.69E-08 5.72E-08 5.67E-08 5.67E-08 5.74E-08 5.67E-08
s38417 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08
s38584.1 4.82E-08 4.82E-08 4.78E-08 4.90E-08 4.82E-08 4.82E-08 4.82E-08 4.83E-08 4.78E-08 4.82E-08
seq 4.92E-08 4.95E-08 4.93E-08 4.93E-08 4.90E-08 4.93E-08 4.93E-08 5.06E-08 4.90E-08 4.93E-08
spla 7.12E-08 7.20E-08 7.61E-08 7.32E-08 6.58E-08 6.83E-08 6.55E-08 6.95E-08 6.63E-08 6.85E-08




Table 5.6: Delay vs Criticality Threshold – Rank by Weighted Average Criticality 
Delay (s) vs Criticality Threshold: Rank by Weighted Average Criticality
Design 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98
alu4 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.52E-08 4.57E-08 4.52E-08 4.47E-08 4.57E-08
apex2 5.17E-08 5.17E-08 5.20E-08 5.17E-08 5.17E-08 5.15E-08 5.15E-08 5.17E-08 5.21E-08 5.17E-08
apex4 4.91E-08 5.37E-08 5.12E-08 5.14E-08 6.01E-08 4.56E-08 4.81E-08 4.44E-08 4.66E-08 4.90E-08
bigkey 2.81E-08 2.76E-08 2.76E-08 2.76E-08 2.84E-08 2.81E-08 2.83E-08 2.84E-08 2.81E-08 2.84E-08
clma 9.44E-08 9.44E-08 9.46E-08 9.44E-08 9.95E-08 9.44E-08 9.62E-08 9.43E-08 9.44E-08 9.44E-08
des 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.92E-08 5.93E-08
diffeq 3.05E-08 3.12E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08 3.05E-08
dsip 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08 2.88E-08
elliptic 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.37E-08 5.54E-08
ex1010 9.16E-08 9.20E-08 9.22E-08 9.17E-08 9.20E-08 9.20E-08 9.17E-08 9.35E-08 9.17E-08 9.22E-08
ex5p 4.51E-08 4.57E-08 4.56E-08 4.63E-08 4.57E-08 4.51E-08 4.51E-08 4.68E-08 4.50E-08 4.51E-08
frisc 6.34E-08 6.64E-08 6.29E-08 6.30E-08 6.61E-08 6.23E-08 6.27E-08 6.26E-08 6.29E-08 6.67E-08
misex3 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.50E-08 4.47E-08 4.47E-08
pdc 8.63E-08 8.88E-08 8.40E-08 8.12E-08 9.14E-08 8.93E-08 8.81E-08 9.19E-08 9.30E-08 9.10E-08
s298 5.67E-08 5.72E-08 5.67E-08 5.93E-08 5.72E-08 5.69E-08 5.67E-08 5.72E-08 5.95E-08 5.69E-08
s38417 4.54E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08 4.46E-08
s38584.1 4.82E-08 4.80E-08 4.83E-08 4.82E-08 4.82E-08 4.90E-08 4.90E-08 4.90E-08 4.90E-08 4.90E-08
seq 4.93E-08 4.90E-08 5.00E-08 4.90E-08 4.90E-08 4.93E-08 4.93E-08 4.93E-08 5.00E-08 4.93E-08
spla 6.65E-08 6.99E-08 7.40E-08 6.94E-08 7.09E-08 6.85E-08 6.68E-08 6.75E-08 8.05E-08 6.75E-08
tseng 2.45E-08 2.41E-08 2.42E-08 2.36E-08 2.46E-08 2.37E-08 2.37E-08 2.49E-08 2.34E-08 2.34E-08
 
Table 5.7: Critical Path Delay Standard Deviation, Average and Percentage Standard 
Deviation – Rank by Weighted Average Criticality 
Design Std Dev (s) Average (s) %age Std Dev
alu4 2.35E-10 4.52E-08 0.52%
apex2 1.12E-09 5.21E-08 2.16%
apex4 4.42E-09 4.87E-08 9.07%
bigkey 3.31E-10 2.79E-08 1.19%
clma 1.60E-09 9.54E-08 1.68%
des 3.61E-11 5.92E-08 0.06%
diffeq 1.56E-10 3.06E-08 0.51%
dsip 6.79E-24 2.88E-08 0.00%
elliptic 4.17E-10 5.38E-08 0.78%
ex1010 5.44E-10 9.20E-08 0.59%
ex5p 5.16E-10 4.55E-08 1.13%
frisc 4.80E-09 6.49E-08 7.40%
misex3 7.73E-11 4.49E-08 0.17%
pdc 4.64E-09 8.80E-08 5.27%
s298 7.98E-10 5.72E-08 1.40%
s38417 1.74E-10 4.46E-08 0.39%
s38584.1 4.21E-10 4.84E-08 0.87%
seq 4.09E-10 4.94E-08 0.83%
spla 3.79E-09 6.99E-08 5.42%























































































Routing Leakage Power (W) vs Criticality Threshold: Rank by Weighted Average Criticality
Design 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78
alu4 3.02E-03 4.00E-03 2.99E-03 4.00E-03 2.99E-03 5.13E-03 4.08E-03 4.06E-03 3.02E-03 4.02E-03
apex2 9.21E-03 7.75E-03 7.71E-03 1.05E-02 9.12E-03 2.33E-02 9.13E-03 1.63E-02 1.21E-02 1.06E-02
apex4 1.20E-02 2.61E-02 1.31E-02 1.59E-02 1.32E-02 1.06E-02 1.31E-02 2.08E-02 9.46E-03 1.47E-02
bigkey 1.40E-02 1.38E-02 1.39E-02 1.39E-02 1.39E-02 1.40E-02 1.39E-02 1.39E-02 1.39E-02 1.38E-02
clma 1.04E-01 1.36E-01 1.46E-01 1.56E-01 1.57E-01 1.29E-01 1.20E-01 7.83E-02 1.14E-01 1.25E-01
des 2.75E-02 2.79E-02 2.74E-02 2.77E-02 2.74E-02 2.76E-02 2.78E-02 2.73E-02 2.77E-02 2.77E-02
diffeq 3.52E-03 3.53E-03 3.60E-03 3.54E-03 3.54E-03 3.52E-03 3.09E-03 3.50E-03 3.54E-03 3.08E-03
dsip 1.56E-02 1.56E-02 1.56E-02 1.57E-02 1.57E-02 1.57E-02 1.55E-02 1.57E-02 1.55E-02 1.57E-02
elliptic 2.16E-02 1.38E-02 2.03E-02 2.47E-02 2.02E-02 3.41E-02 2.93E-02 2.31E-02 2.65E-02 1.86E-02
ex1010 8.08E-02 5.07E-02 7.31E-02 5.09E-02 6.56E-02 7.33E-02 7.27E-02 6.54E-02 7.31E-02 7.33E-02
ex5p 6.73E-03 6.78E-03 6.68E-03 6.81E-03 6.80E-03 6.79E-03 5.79E-03 7.49E-03 6.87E-03 6.69E-03
frisc 1.60E-01 1.08E-01 1.81E-01 1.12E-01 1.53E-01 1.12E-01 1.07E-01 1.51E-01 9.89E-02 1.31E-01
misex3 5.72E-03 8.64E-03 5.90E-03 5.80E-03 7.59E-03 6.72E-03 9.45E-03 6.71E-03 5.89E-03 6.75E-03
pdc 9.18E-02 1.27E-01 1.14E-01 1.56E-01 8.05E-02 1.06E-01 1.05E-01 1.60E-01 1.09E-01 7.27E-02
s298 5.04E-03 7.46E-03 6.60E-03 6.73E-03 6.52E-03 6.65E-03 6.61E-03 6.63E-03 6.68E-03 6.71E-03
s38417 2.91E-02 2.77E-02 2.74E-02 2.94E-02 1.93E-02 2.93E-02 2.79E-02 2.78E-02 2.77E-02 2.77E-02
s38584.1 1.61E-02 1.60E-02 1.61E-02 1.60E-02 1.36E-02 1.61E-02 1.61E-02 1.61E-02 1.61E-02 1.60E-02
seq 1.19E-02 1.18E-02 1.17E-02 1.17E-02 1.19E-02 1.19E-02 1.18E-02 1.18E-02 1.18E-02 1.17E-02
spla 2.00E-02 6.47E-02 7.09E-02 3.60E-02 5.61E-02 3.55E-02 6.50E-02 4.60E-02 2.32E-02 3.29E-02








Table 5.9: Leakage Power Standard Deviation, Average and Percentage Standard 
Deviation – Rank by Weighted Average Criticality 
Design
Routing Leakage Power (W) vs Criticality Threshold: Rank by Weighted Average Criticality
0.8 0.82 0.84 0.86 0.88 0.9 0.92 0.94 0.96 0.98
alu4 4.02E-03 2.99E-03 2.99E-03 3.01E-03 3.01E-03 3.02E-03 3.00E-03 4.00E-03 4.12E-03 3.01E-03
apex2 1.21E-02 1.07E-02 1.21E-02 1.06E-02 1.06E-02 1.20E-02 1.05E-02 9.07E-03 1.92E-02 1.20E-02
apex4 2.34E-02 2.21E-02 1.69E-02 2.61E-02 1.20E-02 1.82E-02 1.19E-02 1.81E-02 1.82E-02 1.71E-02
bigkey 1.38E-02 1.39E-02 1.39E-02 1.38E-02 1.37E-02 1.38E-02 1.40E-02 1.39E-02 1.37E-02 1.41E-02
clma 1.41E-01 1.51E-01 1.15E-01 1.41E-01 1.20E-01 1.09E-01 1.84E-01 1.10E-01 1.30E-01 1.76E-01
des 2.73E-02 2.75E-02 2.75E-02 2.75E-02 2.75E-02 2.74E-02 2.75E-02 2.73E-02 2.75E-02 2.74E-02
diffeq 3.08E-03 3.53E-03 3.55E-03 3.53E-03 3.57E-03 3.52E-03 3.52E-03 3.50E-03 3.51E-03 3.51E-03
dsip 1.56E-02 1.56E-02 1.57E-02 1.56E-02 1.57E-02 1.56E-02 1.55E-02 1.56E-02 1.56E-02 1.57E-02
elliptic 2.18E-02 2.01E-02 1.38E-02 2.17E-02 1.99E-02 2.31E-02 1.85E-02 2.48E-02 1.54E-02 2.17E-02
ex1010 7.27E-02 7.30E-02 5.82E-02 7.35E-02 8.04E-02 7.29E-02 7.30E-02 7.33E-02 5.08E-02 7.30E-02
ex5p 6.85E-03 6.79E-03 6.84E-03 6.76E-03 6.83E-03 6.82E-03 6.70E-03 6.72E-03 6.79E-03 6.68E-03
frisc 1.30E-01 1.52E-01 1.75E-01 1.25E-01 1.60E-01 1.30E-01 1.18E-01 1.25E-01 1.40E-01 7.52E-02
misex3 5.79E-03 6.80E-03 5.86E-03 7.54E-03 7.54E-03 6.74E-03 6.76E-03 8.54E-03 6.73E-03 6.73E-03
pdc 1.46E-01 1.61E-01 1.05E-01 1.23E-01 9.90E-02 1.00E-01 6.19E-02 1.01E-01 9.31E-02 1.68E-01
s298 7.50E-03 7.43E-03 6.63E-03 6.57E-03 6.60E-03 7.53E-03 6.58E-03 6.73E-03 6.68E-03 6.75E-03
s38417 2.94E-02 2.10E-02 2.10E-02 2.94E-02 2.74E-02 2.79E-02 2.75E-02 1.77E-02 2.78E-02 2.74E-02
s38584.1 1.60E-02 1.60E-02 1.61E-02 1.60E-02 1.60E-02 1.61E-02 1.61E-02 1.61E-02 1.61E-02 1.61E-02
seq 1.18E-02 1.19E-02 1.18E-02 1.19E-02 1.18E-02 1.32E-02 1.19E-02 1.19E-02 1.18E-02 1.18E-02
spla 6.58E-02 6.36E-02 3.57E-02 6.27E-02 2.32E-02 3.98E-02 3.62E-02 6.22E-02 3.61E-02 6.21E-02
tseng 4.45E-03 4.45E-03 4.44E-03 4.42E-03 4.39E-03 4.46E-03 4.39E-03 4.33E-03 4.44E-03 4.45E-03
Design | Leakage Power Std Dev (W) | Leakage Power Average (W) | %age Std Dev
alu4 6.37E-04 3.52E-03 18.08%
apex2 3.82E-03 1.17E-02 32.59%
apex4 4.99E-03 1.67E-02 29.96%
bigkey 1.01E-04 1.39E-02 0.73%
clma 2.53E-02 1.32E-01 19.19%
des 1.69E-04 2.75E-02 0.61%
diffeq 1.66E-04 3.46E-03 4.79%
dsip 7.16E-05 1.56E-02 0.46%
elliptic 4.88E-03 2.16E-02 22.55%
ex1010 9.19E-03 6.90E-02 13.32%
ex5p 2.85E-04 6.76E-03 4.21%
frisc 2.69E-02 1.32E-01 20.33%
misex3 1.04E-03 6.91E-03 15.10%
pdc 3.04E-02 1.14E-01 26.65%
s298 5.28E-04 6.73E-03 7.85%
s38417 3.59E-03 2.65E-02 13.57%
s38584.1 5.51E-04 1.59E-02 3.46%
seq 3.07E-04 1.19E-02 2.58%
spla 1.68E-02 4.69E-02 35.80%




Figure 5.7: Plot of Routing Leakage Power vs Criticality Threshold – Rank by Weighted 
Average Criticality 
 
We make similar observations to the case of Ranking by Maximum Criticality: although the 
criticality threshold influences both delay and routing leakage power, there is no 
discernible trend amongst the testcases as to which threshold value works best. Again, 
this is because the most critical nets in VPR will always have sinks with criticality = 
0.99. In addition, the weighted average criticality scheme appears to prioritise the same 
nets, or nets with similar criticalities, as the maximum criticality scheme; this is 
suggested by the relatively small standard deviation in critical path delay. As in the 
case of ranking by maximum criticality, the criticality threshold influences routing 
leakage power much more than critical path delay, with different thresholds working 
better for different designs. 
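
The summary statistics reported for each design can be reproduced along the following lines. This is a minimal sketch using the alu4 routing leakage figures tabulated above; the function name `summarise` is our own, and the use of the sample standard deviation is an assumption, since the estimator is not stated.

```python
from statistics import mean, stdev

# Routing leakage power (W) for alu4 at thresholds 0.60 to 0.98 in steps of 0.02,
# copied from the Rank by Weighted Average Criticality tables.
ALU4_LEAKAGE_W = [
    3.02e-3, 4.00e-3, 2.99e-3, 4.00e-3, 2.99e-3, 5.13e-3, 4.08e-3, 4.06e-3, 3.02e-3, 4.02e-3,
    4.02e-3, 2.99e-3, 2.99e-3, 3.01e-3, 3.01e-3, 3.02e-3, 3.00e-3, 4.00e-3, 4.12e-3, 3.01e-3,
]


def summarise(samples):
    """Return (std dev, average, std dev as a percentage of the average)."""
    avg = mean(samples)
    sd = stdev(samples)  # sample standard deviation (assumed estimator)
    return sd, avg, 100.0 * sd / avg


sd, avg, pct = summarise(ALU4_LEAKAGE_W)
print(f"alu4: std dev = {sd:.2e} W, average = {avg:.2e} W, %age std dev = {pct:.2f}%")
```

Because the tabulated inputs are rounded to three significant figures, the output agrees with the alu4 row of the summary table only approximately.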
 
5.3.2. Final Results 
We now focus on the potential power savings offered by our RBB switch box 

baseline architecture and algorithm, we adopt the native single-VT architecture and the 
built-in timing-driven Pathfinder algorithm in VPR 5.0. Because no related work on 
switch box-level architectures exists today, a comparison against prior work was not 
possible. In our algorithm, we choose to rank nets by Maximum Criticality, with the 
criticality threshold set to 0.9. This is a reasonable decision since the investigation in 
Section 5.4.1 did not yield conclusive hints as to which ranking method or criticality 
threshold performs better for our given suite of designs. 
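
To make the chosen configuration concrete, the following minimal sketch ranks nets by the maximum criticality over their sinks and flags those at or above the 0.9 threshold. The net names, the sink criticality values and the helper names are hypothetical, and how the router acts on a flagged net is not modelled; this illustrates only the ranking step as described here and should not be read as the actual VPR implementation.

```python
CRITICALITY_THRESHOLD = 0.9  # value chosen in this section

# Hypothetical nets mapped to the criticalities of their sinks.
# The most critical sinks in VPR have criticality 0.99.
nets = {
    "net_a": [0.99, 0.42, 0.10],
    "net_b": [0.65, 0.61],
    "net_c": [0.93, 0.91, 0.88],
    "net_d": [0.30],
}


def rank_by_max_criticality(net_sinks):
    """Return (net, max sink criticality) pairs, most critical first."""
    scored = [(name, max(crits)) for name, crits in net_sinks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def critical_nets(net_sinks, threshold=CRITICALITY_THRESHOLD):
    """Return the names of nets whose maximum criticality meets the threshold."""
    return [name for name, score in rank_by_max_criticality(net_sinks) if score >= threshold]


ranking = rank_by_max_criticality(nets)
print(ranking)
print(critical_nets(nets))
```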
 
We compare three broad metrics: Routing Leakage Power Dissipation, Total Power 
Dissipation, and Critical Path Delay. The first two metrics are directly relevant to our 
architectural enhancements. The third metric is important as we want to observe the 
delay impact attributed to our switch box architecture and algorithm enhancements. 





Table 5.10: Results for our RBB switch box enhancements versus the single-VT baseline 
 
From the results, we see that the combination of our RBB switch box architecture and 
Dual-VT Switch Box Pathfinder Algorithm performs well against the baseline, returning an 
average of 53.69% savings in leakage power on the routing network alone, and 28.23% 
savings in total power. This also suggests that, at the 65nm node, leakage power in the 
routing network alone constitutes a significant proportion of total power, which 
additionally comprises both dynamic and leakage power dissipation in the logic blocks 
and clock network, as well as dynamic power dissipated in the routing network. Our 
enhancements also had minimal impact on timing, incurring an average delay increase of 
only 1.28%. However, we note that outliers do exist, namely pdc and especially spla, for 
which both routing leakage power and total power increase relative to the baseline. This 
indicates that our enhanced Pathfinder 
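Design | Baseline Routing Leakage Power (W) | Baseline Total Power (W) | Baseline Critical Path (s) | Dual-VT Routing Leakage Power (W) | Routing Leakage Savings (%) | Dual-VT Total Power (W) | Total Power Savings (%) | Dual-VT Critical Path (s) | Critical Path Change (%)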
alu4 0.0167 0.0286 4.42E-08 0.0030 82.07 0.0149 47.89 4.52E-08 2.22
apex2 0.0187 0.0335 5.32E-08 0.0090 51.64 0.0241 27.99 5.17E-08 -2.74
apex4 0.0383 0.0513 4.83E-08 0.0129 66.22 0.0257 49.80 5.13E-08 6.33
bigkey 0.0668 0.3803 2.85E-08 0.0140 79.00 0.3277 13.83 2.76E-08 -3.07
clma 0.2030 0.2399 1.01E-07 0.1360 33.01 0.1737 27.57 9.42E-08 -7.08
des 0.1148 0.5652 5.76E-08 0.0278 75.76 0.4777 15.48 5.92E-08 2.78
diffeq 0.0125 0.0230 2.72E-08 0.0031 75.14 0.0132 42.59 3.05E-08 12.15
dsip 0.0676 0.3806 3.42E-08 0.0155 77.12 0.3305 13.15 2.88E-08 -15.76
elliptic 0.0618 0.0996 5.34E-08 0.0140 77.42 0.0518 48.00 5.37E-08 0.38
ex1010 0.0924 0.1294 9.16E-08 0.0510 44.79 0.0881 31.94 9.11E-08 -0.46
ex5p 0.0160 0.0276 4.07E-08 0.0068 57.66 0.0181 34.32 4.57E-08 12.14
frisc 0.2217 0.2419 6.65E-08 0.0926 58.23 0.1131 53.24 6.50E-08 -2.24
misex3 0.0189 0.0319 4.59E-08 0.0088 53.65 0.0220 31.00 4.50E-08 -2.12
pdc 0.1110 0.1430 8.37E-08 0.1188 -6.96 0.1504 -5.20 9.32E-08 11.37
s298 0.0186 0.0259 5.47E-08 0.0067 64.21 0.0139 46.44 5.67E-08 3.58
s38417 0.0458 0.0724 4.41E-08 0.0261 42.88 0.0527 27.18 4.46E-08 1.14
s38584.1 0.0530 0.1249 4.29E-08 0.0135 74.50 0.0839 32.80 4.90E-08 14.32
seq 0.0304 0.0469 4.51E-08 0.0118 61.36 0.0278 40.63 4.92E-08 9.15
spla 0.0371 0.0654 6.85E-08 0.0626 -68.56 0.0911 -39.30 6.82E-08 -0.47
tseng 0.0159 0.0432 2.79E-08 0.0040 74.72 0.0323 25.18 2.34E-08 -16.11
AVERAGE 53.69 28.23 1.28
algorithm may not be ideal for the solution space presented by these designs. A study 
of circuit structures causing such behaviour is proposed as future work. 
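
The percentage columns of Table 5.10 follow from the baseline and dual-VT figures in the usual way. The sketch below recomputes the routing leakage savings for three rows and flags the designs whose leakage power increases; the function names are our own, and small differences from the tabulated percentages are expected because the inputs are rounded.

```python
# (baseline routing leakage W, dual-VT routing leakage W), copied from Table 5.10
ROUTING_LEAKAGE_W = {
    "alu4": (0.0167, 0.0030),
    "pdc": (0.1110, 0.1188),
    "spla": (0.0371, 0.0626),
}


def percent_savings(baseline, enhanced):
    """Positive when the enhanced figure is lower than the baseline."""
    return 100.0 * (baseline - enhanced) / baseline


def outliers(table):
    """Designs whose routing leakage power exceeds the baseline."""
    return [name for name, (base, enh) in table.items() if percent_savings(base, enh) < 0]


for name, (base, enh) in ROUTING_LEAKAGE_W.items():
    print(f"{name}: {percent_savings(base, enh):+.2f}%")
print("outliers:", outliers(ROUTING_LEAKAGE_W))
```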
A good indicator of how well our routing algorithm performs is the ratio of high-VT 
switch boxes set during the routing phase, versus the total number of switch boxes 




Table 5.11: Ratio of VTH switch boxes set during routing versus total number of switch 
boxes used to implement the route 
 
 
Testcase | # high-VT SBs | # low-VT SBs | Ratio of high-VT SBs / total SBs used in routing
alu4 88 55 0.615
apex2 90 78 0.536
apex4 61 82 0.427
bigkey 605 178 0.773
clma 415 113 0.786
des 795 293 0.731
diffeq 68 52 0.567
dsip 476 307 0.608
elliptic 197 91 0.684
ex1010 320 163 0.663
ex5p 65 55 0.642
frisc 196 92 0.681
misex3 84 59 0.587
pdc 121 278 0.303
s298 50 70 0.417
s38417 329 70 0.825
s38584.1 436 92 0.826
seq 99 69 0.589
spla 183 140 0.567
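
The ratio column of Table 5.11 is the number of high-VT switch boxes divided by the total number of switch boxes used in the route. A minimal sketch for three of the rows follows; the function name is our own.

```python
# (# high-VT SBs, # low-VT SBs), copied from Table 5.11
SWITCH_BOXES = {
    "alu4": (88, 55),
    "pdc": (121, 278),
    "s38584.1": (436, 92),
}


def high_vt_ratio(high, low):
    """Fraction of the switch boxes used in routing that are set to high VT."""
    return high / (high + low)


for name, (high, low) in SWITCH_BOXES.items():
    print(f"{name}: {high_vt_ratio(high, low):.3f}")
```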




In this chapter, a novel enhancement to the original timing-driven Pathfinder algorithm, 
which we call the Dual-VT Switch Box Pathfinder Algorithm, was presented. To our 
knowledge, this is the first work that explores RBB at the switch box level, and it has 
yielded promising results. When compared against the baseline single-VT architecture, 
the combination of our proposed architecture with the Dual-VT Switch Box Pathfinder 
achieved an average of 53.69% savings in routing leakage power, and 28.23% savings 
in total power. These results make a strong case for our proposed work, and 
show promise in tackling the critical need for leakage power mitigation in process 












In this thesis, two novel works were presented: a Power and Cluster Architecture-Aware 
Dual-VT Technology Mapping and Clustering Scheme, and the Dual-VT Switch Box 
Pathfinder Routing Algorithm. 
 
On the first work, we discussed the advantages of moving the optimization space up 
to technology mapping, which is the highest level possible in the FPGA CAD flow, as 
well as the benefits of a dual-VT approach versus a dual-VDD one. We presented 
RBBMap, a dual-VT capable technology mapper with a two-phase slack reclamation 
scheme and cluster size and delay awareness, and RBBPack, a logic block packer with 
cluster VT awareness. The combined use of RBBMap/RBBPack with our dual-VT CLB 
architecture yielded excellent results, obtaining an average of 70.95% savings in logic 
block leakage power and 28.30% savings in average total energy consumption against 
the baseline Emap/T-VPack combination. In addition, critical path delay is minimally 
impacted; in fact, better figures were obtained in some cases. This demonstrates the 
feasibility of considering optimizations upfront in the EDA flow; in our case, the 




The second work presented opens the door to a previously unexplored domain: dual-VT 
switch boxes. A novel enhancement to the original Pathfinder algorithm with a net 
criticality ranking scheme ensures that the most critical nets are given priority in 
routing, preserving as far as possible the fidelity of the original algorithm. Two net 
ranking schemes were proposed: Rank by Maximum Criticality, and Rank by 
Weighted Average Criticality. Interestingly, there is no discernible trend in Criticality 
Threshold versus routing leakage power and timing, even though an influence is 
observed. The combination of our RBB switch box architecture and Dual-VT Switch 
Box Pathfinder Algorithm yields excellent results against the baseline single-VT 
Pathfinder, returning an average of 53.69% savings in routing leakage power 
and 28.23% savings in total power. A small average increase of 1.28% in critical path 
delay was observed. 
 
The results presented make a strong case for the works presented in this thesis, 
demonstrating their effectiveness in tackling the pressing issue of leakage power in the 
ever-shrinking process nodes of the future. 
 
6.2.  Future Works 
Both of the works presented in this thesis are very new; as such, we believe there is 
ample room for further optimization. In the area of dual-VT technology mapping, we 
believe there is much to explore in dual-VT-aware area recovery methods to reduce node 
count. In addition, the recent concept of edge recovery for improved FPGA routability 
[63] should also be explored in the dual-VT context. 
In areas more closely related to the current RBBMap, a few testcases were found to 
have incurred higher routing power, to the extent that total power was higher than in 
the baseline. The circuit network characteristics causing this should be explored, and 
may open new doors to understanding the correlation between network structures and 
routing in the dual-VT scenario. 
 
As to our dual-VT switch box work, this is, to our knowledge, the first work that 
explores dual-VT programmability at the switch box level. A few outliers exist in 
which higher routing power was incurred, causing total power to exceed the baseline. 
The algorithm is therefore not well suited to certain circuit structures, which should be 
further understood so that the algorithm can be improved. Finally, given how new this 
work is, we expect to see further algorithmic possibilities, some of which may be better 
suited to our proposed architecture. 
 
Finally, amalgamating both of the architectural enhancements suggested in this thesis 
into a true dual-VT logic cluster and switch box FPGA architecture, with associated EDA 
algorithms all the way from technology mapping to placement and routing, is an 
exciting space to explore. The work in this thesis has already demonstrated the 
effectiveness of each architectural enhancement in its own right. We strongly believe 
that the combination of both, coupled with dual-VT, architecturally-aware algorithms, will 

[1] Julien Lamoureux and Wayne Luk, "An Overview of Low-Power Techniques for 
Field-Programmable Gate Arrays," in Adaptive Hardware and Systems, 
NASA/ESA Conference on, 2008, pp. 338-345. 
[2] T. Sakurai, "Perspectives on power-aware electronics," in Solid-State Circuits 
Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, 
2003, pp. 26-29. 
[3] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, "Leakage current 
mechanisms and leakage reduction techniques in deep-submicrometer CMOS 
circuits," in Proceedings of the IEEE, 2003, pp. 305-327. 
[4] K. Usami and M. Horowitz, "Clustered voltage scaling technique for low-power 
design," in Proceedings of the 1995 International Symposium on Low Power 
Design, ISLPED ’95, New York, NY, USA, 1995, pp. 3-8. 
[5] A. Rahman and V. Polavarapuv, "Evaluation of low-leakage design techniques for 
Field Programmable Gate Arrays," in Proceedings of the 2004 ACM/SIGDA 12th 
International Symposium on Field Programmable Gate Arrays, FPGA ’04, New 
York, NY, USA, 2004, pp. 23-30. 
[6] D. Lewis et al., "Architectural enhancements in Stratix-III and Stratix-IV," in 
Proceedings of the ACM/SIGDA international symposium on Field programmable 
gate arrays, FPGA ’09, Monterey, California, USA, 2009, pp. 33-42. 
[7] A. Keshavarzi, K. Roy, and C. F. Hawkins, "Intrinsic Leakage in Low Power 
Deep Submicron CMOS ICs," in Test Conference, International, 1997, p. 146. 
[8] A. Mishchenko et al. (2009) ABC: A System for Sequential Synthesis and 
Verification. [Online]. http://www.eecs.berkeley.edu/alanmi/abc 
[9] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron 
FPGAs. Norwell, Massachusetts: Kluwer Academic Publishers, 1999. 
[10] J. Rose et al. (2009, Mar) VPR and T-VPack 5.0.2. [Online]. 
http://www.eecg.toronto.edu/vpr/ 
[11] K. K. W. Poon, S. J. E. Wilton, and A. Yan, "A detailed power model for 
field-programmable gate arrays," in ACM Trans. Des. Autom. Electron. Syst., April 
2005, pp. 279-302. 
[12] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon, "Odin II – An Open-Source 
Verilog HDL Synthesis Tool for CAD Research," in Annual IEEE Symposium on 
Field-Programmable Custom Computing Machines, 2010, pp. 149-156. 
[13] Xilinx Inc. (2011) Zynq-7000 Extensible Processing Platform. [Online]. 
http://www.xilinx.com/products/silicon-devices/epp/zynq-7000/index.htm 
[14] J. Greene, E. Hamdy, and S. Beal, "Antifuse Field Programmable Gate Arrays," 
Proceedings of the IEEE, pp. 1042-1056, July 1993. 
[15] S. Brown, "An Overview of Technology, Architecture and CAD Tools for 
Programmable Logic Devices," in Custom Integrated Circuits Conference, 1994, 
pp. 69-76. 
[16] Xilinx Inc. (2011) 7 Series. [Online]. 
http://www.xilinx.com/support/documentation/7_series.htm 
[17] V. Betz and J. Rose, "FPGA routing architecture: segmentation and buffering to 
optimize speed and density," in Proceedings of the 1999 ACM/SIGDA seventh 
international symposium on Field programmable gate arrays, Monterey, 
California, United States, 1999, pp. 59-68. 
[18] Xilinx Inc., The Programmable Logic Data Book., 1994. 
[19] G. G. Lemieux and S. D. Brown, "A detailed router for allocating wire segments 
in field-programmable gate arrays," in Proceedings of the ACM Physical Design, 
Apr. 1993. 
[20] S. J. E. Wilton, "Architectures and Algorithms for Field-Programmable Gate 
Arrays with Embedded Memory," University of Toronto, Toronto, PhD thesis 
1997. 
[21] Y.W. Chang, D. Wong, and C. Wong, "Universal switch modules for FPGA 
design," in ACM Transactions on Design Automation of Electronic Systems, Jan. 
1996, pp. 80-101. 
[22] G. Lemieux, E. Lee, M. Tom, and A. Yu, "Directional and single-driver wires in 
FPGA interconnect," in IEEE International Conference on Field-Programmable 
Technology, 2004, Dec. 2004, pp. 41-48. 
[23] David Lewis et al., "The Stratix Routing and Logic Architecture," in Proceedings 
of the 2003 ACM/SIGDA eleventh international symposium on Field 
programmable gate arrays, Monterey, California, USA, 2003, pp. 12-20. 
[24] J. Cong and Y. Hwang, "Simultaneous Depth and Area Minimization in LUT-Based 
FPGA Mapping," in ACM International Symposium on Field-Programmable Gate 
Arrays, Monterey, CA, USA, Feb. 1995, pp. 68-74. 
[25] J. Cong, C. Wu, and Y. Ding, "Cut ranking and pruning: enabling a general and 
efficient FPGA mapping solution," in Proceedings of the 1999 ACM/SIGDA 
seventh international symposium on Field programmable gate arrays, FPGA ’99, 
New York, NY, USA, 1999, pp. 29-35. 
[26] D. Chen, J. Cong, L. He, F. Li, and C.C. Peng, "Technology Mapping and 
Clustering for FPGA Architectures With Dual Supply Voltages," IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 
29, pp. 1709-1722, 2010. 
[27] A. Mishchenko, S. Chatterjee, and R. K. Brayton, "Improvements to Technology 
Mapping for LUT-Based FPGAs," IEEE Transactions on Computer-Aided Design 
of Integrated Circuits and Systems, vol. 26, pp. 240-253, February 2007. 
[28] J. Cong and Y. Ding, "FlowMap: An Optimal Technology Mapping Algorithm for 
Delay Optimization in Lookup-Table Based FPGA Designs," IEEE Trans. on 
Computer-Aided Design, vol. 13, no. 1, pp. 1-12, January 1994. 
[29] A. Mishchenko, Sungmin Cho, and S. Chatterjee, "Combinational and sequential 
mapping with priority cuts," in IEEE/ACM International Conference on 
Computer-Aided Design, 2007. ICCAD 2007., San Jose, CA, 2007, pp. 354-361. 
[30] S. Jang, K. Chung, A. Mishchenko, and R. Brayton, "A Power Optimization 
Toolbox for Logic Synthesis and Mapping," ERL Technical Report, EECS Dept., 
UC Berkeley 2011. 
[31] J. Lamoureux and S. J. E. Wilton, "On the Interaction Between Power-Aware 
FPGA CAD Algorithms," in Proceedings of the 2003 IEEE/ACM international 
conference on Computer-aided design, Washington, DC, USA, 2003, p. 701. 
[32] J. Anderson and F.N. Najm, "Power-Aware Technology Mapping for LUT-Based 
FPGAs," in IEEE International Conference on Field-Programmable Technology, 
Dec 2002, pp. 211-218. 
[33] Z-H. Wang, E-C. Liu, J. Lai, and T-C. Wang, "Power Minimization in LUT-
Based FPGA Technology Mapping," in ACM Asia South Pacific Design 
Automation Conference, 2001, pp. 635-640. 
[34] H. Li, W-K. Mak, and S. Katkoori, "LUT-Based FPGA Technology Mapping for 
Power Minimization with Optimal Depth," in IEEE Computer Society Workshop 
on VLSI, Orlando, 2001, pp. 123-128. 
[35] C-C. Wang and C-P. Kwan, "Low Power Technology Mapping by Hiding 
High-Transition Paths in Invisible Edges for LUT-Based FPGAs," in IEEE 
International Symposium on Circuits and Systems, Jun. 1997, pp. 1536-1539. 
[36] A. Marquardt, V. Betz, and J. Rose, "Using Cluster-based Logic Blocks and 
Timing-Driven Packing to Improve FPGA Speed and Density," in ACM 
International Symposium on Field-Programmable Gate Arrays, 1999, pp. 37-46. 
[37] J. Cong, H. Li, S. Lim, T. Shibuya, and D. Xu, "Large Scale Circuit Partitioning 
with Loose / Stable Net Removal and Signal Flow Based Clustering," in 1997 
IEEE/ACM International Conference on Computer-Aided Design, 1997. Digest of 
Technical Papers, San Jose, CA, USA, 1997, pp. 441-446. 
[38] C. Alpert and A. Kahng, "Recent Directions in Netlist Partitioning: A Survey," 
VLSI Journal, vol. 19, pp. 1-81, 1995. 
[39] J. Cong, L.W. Hagen, and A.B. Kahng, "Random Walks for Circuit Clustering," 
1991, pp. 14.2.1-14.2.4. 
[40] J. Cong and M. Smith, "A Parallel Bottom-up Clustering Algorithm with 
Applications to Circuits Partitioning in VLSI Design," in ACM/IEEE Design 
Automation Conference, 1993, pp. 755-760. 
[41] W. Hou, X. Hong, W. Wu, and Y. Cai, "A path-based timing-driven quadratic 
placement algorithm," in Asia and South Pacific Design Automation Conference, 
2003. Proceedings of the ASP-DAC 2003, 2003, pp. 745-748. 
[42] G. Nam, S. Reda, C. J. Alpert, P.G. Villarrubia, and A.B. Kahng, "A Fast 
Hierarchical Quadratic Placement Algorithm," IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 25, no. 4, pp. 678-691, Apr 
2006. 
[43] A. Marquardt, V. Betz, and J. Rose, "Timing-Driven Placement for FPGAs," in 
ACM International Symposium on Field-Programmable Gate Arrays, Monterey, 
CA, USA, 2000, pp. 203-213. 
[44] V. Betz, "Architecture and CAD for the Speed and Area Optimization of FPGAs," 
University of Toronto, Toronto, Ph.D. Dissertation 1998. 
[45] J. Kleinhans, G. Sigl, F. Johannes, and K. Antreich, "Gordian: VLSI Placement by 
Quadratic Programming and Slicing Optimization," IEEE Transactions on 
Computer-Aided Design, pp. 356-365, 1991. 
[46] B. Riess and G. Ettelt, "Speed: Fast and Efficient Timing Driven Placement," in 
IEEE International Symposium on Circuits and Systems, 1995, pp. 377-380. 
[47] D. Huang and A. Kahng, "Partitioning-Based Standard-Cell Global Placement 
with an Exact Objective," in ACM Symposium on Physical Design, 1997, pp. 18-
25. 
[48] T. Kawanami et al., "Preliminary Evaluation of Flex Power FPGA: A Power 
Reconfigurable Architecture with Fine Granularity," IEICE Transactions on 
Information and Systems, pp. 2004-2010, 2004. 
[49] F. Li, Y. Lin, L. He, and J. Cong, "Low-Power FPGA Using Pre-defined 
Dual-Vdd/Dual-Vt Fabrics," in Proceedings of the 2004 ACM/SIGDA 12th 
international symposium on Field programmable gate arrays (FPGA '04), 2004, 
pp. 42-50. 
[50] J. Tschanz, S. Narendra, R. Nair, and V. De, "Effectiveness of adaptive supply 
voltage and body bias for reducing impact of parameter variations in low power 
and high performance microprocessors," IEEE Journal of Solid-State Circuits, 
vol. 38, pp. 826-829, May 2003. 
[51] L. He. (2005) Vdd Programmable and Variation Tolerant FPGA Circuits and 
Architectures. PDF. 
[52] J. H. Anderson and F. N. Najm, "A novel low-power FPGA routing switch," in 
IEEE Custom Integrated Circuits Conference, 2004, pp. 719-722. 
[53] Y. Lin, F. Li, and L. He, "Power modeling and architecture evaluation for FPGA 
with novel circuits for Vdd programmability," in Proceedings of the 2005 
ACM/SIGDA 13th international symposium on Field-programmable gate arrays 
(FPGA '05), 2005, pp. 199-207. 
[54] S. Mondal and S. Memik, "A Low Power FPGA Routing Architecture," in IEEE 
International Symposium on Circuits and Systems, 2005. ISCAS 2005, 2005, pp. 
1222-1225. 
[55] B. Bailey, G. Martin, and A. Piziali, ESL Design and Verification: A Prescription 
for Electronic System Level Methodology, 1st ed. Morgan Kaufmann, 2007. 
[56] G. Pique and M. Meijer, "A 350nA voltage regulator for 90nm CMOS digital 
circuits with reverse-body-bias," in 2011 Proceedings of the ESSCIRC, Sep. 2011, 
pp. 379-382. 
[57] B. Li, "Triple Well No Body Effect Negative Charge Pump," 6452438 B1, Sep 
17, 2002. 
[58] P. Jamieson, W. Luk, S. Wilton, and G. Constantinides, "An energy and power 
consumption analysis of FPGA routing architectures," in International Conference 
on Field-Programmable Technology, 2009. FPT 2009., Dec 2009, pp. 324-327. 
[59] J. Lamoureux and S.J.E. Wilton, "On the Interaction between Power-Aware CAD 
Algorithms for FPGAs," in IEEE/ACM International Conference on Computer 
Aided Design (ICCAD), 2003, pp. 701-708. 
[60] T. Okamoto and J. Cong, "Buffered Steiner Tree Construction with Wire Sizing 
for Interconnect Layout Optimization," in International Conference on Computer 
Aided Design, 1996, pp. 44-49. 
[61] F.N. Najm, "Transition Density, A New Measure of Activity in Digital Circuits," 
Texas Instruments Technical Report #7529/0032 Aug. 1991. 
[62] C. Ebeling, L. McMurchie, S. A. Hauck, and S. Burns, "Placement and Routing 
Tools for the Triptych FPGA," in IEEE Trans. on VLSI, Dec 1995, pp. 473-482. 
[63] S. Jang, B. Chan, K. Chung, and A. Mishchenko, "WireMap: FPGA technology 
mapping for improved routability," in Proceedings of the 16th international 
ACM/SIGDA symposium on Field programmable gate arrays (FPGA '08), 
Monterey, CA, USA, 2008, pp. 47-55. 
[64] Mahnke, S. Panenka, M. Embacher, W. Stechele, and W. Hoeld, "Efficiency of 
dual supply voltage logic synthesis for low power in consideration of varying 
delay constraint strictness," in 9th International Conference on Electronics, 
Circuits and Systems, 2002, 2002, pp. 701-704. 
[65] S. Raje and M. Sarrafzadeh, "Variable voltage scheduling," in Proceedings of the 
1995 international symposium on Low power design, ISLPED’95, New York, 
NY, USA, 1995, pp. 9-14. 
[66] P. Spindler and F. M. Johannes, "Fast and robust quadratic placement combined 
with an exact linear net model," in Proc. of ICCAD, pp. 179-186, 2006. 
 
 
 
