A study on coarse-grained placement and routing for low-power FPGA architecture by Li Ce
Waseda University Doctoral Dissertation
A Study on Coarse-Grained Placement and Routing for
Low-Power FPGA Architecture
LI, Ce
Graduate School of Information, Production and Systems
Waseda University
February, 2012

Abstract
Enabled by nanometer process technology, ICs (Integrated Circuits) are now used in
almost all electronic equipment, such as computers, mobile phones, domestic
appliances and toys. They have revolutionized our lives. ASIC (Application
Specific Integrated Circuit) and FPGA (Field Programmable Gate Array) are two
major types of IC.
An ASIC is an IC designed for a specific application. It has great advantages in
terms of area, performance and power, but design and verification before tape-out
take a long development cycle in order to avoid bugs. Once an ASIC leaves the
production line, it can no longer be altered. If serious bugs are found at that
stage, the chip must be re-designed and taped out again, which wastes money and time.
The programmability of an FPGA allows manufacturers to correct mistakes and to
send out patches or updates after the product has been sold. Manufacturers also
take advantage of this by prototyping in an FPGA, so that the design can be
thoroughly tested and revised on the PCB before it is sent to the IC foundry for
ASIC production.
An FPGA consumes much more power than an ASIC that implements the same circuit
in the same process technology. This is why FPGAs have not been used in mobile
electronic devices. Reducing the power consumption is complicated, because not
only the hardware architecture but also the EDA tools for mapping, placement and
routing must be improved. In this thesis, we propose a region-based low-power FPGA
architecture and related placement & routing methods to reduce the power
consumption. The thesis focuses on the following three points.
First, a region-based FPGA architecture is proposed, in which the whole area is
divided into coarse-grained regions and the power supply is controlled for each
region. Various power reduction methods are used in ASIC design, such as power
gating, clock gating and multi-Vdd. In our FPGA architecture, coarse-grained power
gating and clock gating are used to reduce the complexity of the design and of the
power network. Isolation is also added to avoid functional errors and leakage when
some logic is powered off. A power control hard macro (PCHM) is designed as the
controller of these low-power methods for each region. Power-up and power-down
sequences are also generated by the PCHM. To save more power, the switch boxes,
connection boxes and routing channels can be powered off together with the cluster
logic blocks (CLBs) in the same region. To reduce congestion in the routing
channels that are always on, their width is enlarged. An asymmetric switch box (SB)
is also designed to connect routing channels of different widths. As a result, the
routing area is reduced by 6.4% on average. The critical path delay with mixed SBs
is 2.2% smaller than that of the original FPGA architecture.
Second, we focus on the impact of CLB placement on power consumption. A simulated
annealing algorithm is used to obtain a better placement, where a cost function
composed of several costs evaluates the placement quality. By introducing a new
term, named region cost, into the function, CLBs can be placed close together
using the minimum number of regions. The unused regions can be powered on or off
statically. The regions that contain CLBs can be powered on or off dynamically
according to the state of a circuit in the FPGA, or even by an input pad of the
FPGA. Parameters are introduced to adjust the proportion of the region cost in the
total cost and thus to control the number of used regions. By gating the unused
sleep regions (SRs), power is reduced by 14.1% while two modules are working. The
power consumption is reduced by about 30.3% and 31.8% when different power domains
are powered off dynamically.
Finally, a routing method that routes connections only among routing resources
with the same power state is studied. We add power domain information to each
routing node and resource. The CLBs used to implement a user circuit also carry
this information before placement. Links between two nodes that are not in the
same power domain are disconnected.
A full CAD framework from logic synthesis to placement and routing is introduced
to support the proposed FPGA architecture. With the enhanced CAD framework, a
detailed study using MCNC benchmarks is carried out under the different region
sizes supported by the proposed FPGA architecture. The best result is obtained
when the region size is 4*4.
In conclusion, the power consumption of the FPGA can be controlled dynamically,
which suits circuits that have low-power states. With the proposed architecture
and the developed CAD framework, FPGAs can be used for low-power ASIC emulation,
and they are promising for wide application in portable devices.
This thesis is organized as follows:
Chapter 1 introduces the components of FPGA power consumption. The causes of both
dynamic and static power consumption are explained. Based on these kinds of power,
the related low-power methods used in this FPGA architecture are discussed. The
reasons why other methods cannot be adopted are also given in this chapter.
Chapter 2 discusses the state-of-the-art architectures of commercial FPGAs and
the academic FPGA architecture provided by Versatile Place and Route (VPR). Some
points common to these FPGA architectures are listed in this chapter.
In Chapter 3, a novel low-power FPGA architecture based on static coarse-grained
power gating is proposed to reduce power consumption. Not only cluster logic blocks
but also switch boxes and connection boxes can be powered off. The new placement
algorithm and routing resource graph for sleep regions are also presented. With the
enhanced CAD framework, our proposed FPGA architecture reduces the total power
consumption by 21.1% compared with the traditional FPGA. This reduction consists of
a 14.7% static power reduction and an 8.0% dynamic power reduction, offset by a 1.6%
power overhead from the low-power-related logic. The area increases by 4.3% compared
with the normal FPGA architecture with the same CLB count when the SR size is 16.
In Chapter 4, a dynamic coarse-grained power gating FPGA architecture is proposed,
in which the whole area of an FPGA is partitioned into several regions and the power
supply is controlled for each region. We also propose a region-oriented FPGA
placement algorithm, based on VPR, that fits the user’s hierarchical design. The CPU
time of placement increases by 5.3%, and the routing time increases by 6.4% for one
module. Simulation results show that static power gating and clock gating reduce
the power consumption by 14.1% when two circuits are in the working state. The power
consumption can be reduced by 30.3% on average by putting unused modules or regions
into sleep mode.
Chapter 5 proposes a sophisticated routing method for a region-oriented FPGA
architecture that supports dynamic power gating. This chapter has two main
contributions. First, it gives a special routing method to support switch boxes
that are dynamically powered off in a sleep region. Second, asymmetric Wilton
switch boxes are introduced in the public routing channels to reduce channel width.
Experimental results show that the routing area can be reduced by 6.4% compared
with the symmetric Wilton switch box.
Finally, Chapter 6 concludes this work and discusses the future work.
Acknowledgements
Studying towards a Ph.D. at the Graduate School of Information, Production and
Systems, Waseda University took three years. It was a long process filled with mixed
feelings: satisfaction and disappointment, pleasure and pressure. Satisfaction often
came from recognition of the research and work, while disappointment always came
from rejections and misunderstanding. In these three years, I have learned that I
could never have done the research and written this thesis without the support and
encouragement of the following people.
First, Professor Takahiro Watanabe of Waseda Univ. is the one I thank most.
Thank you very much for being my supervisor. You are also like a father to me. I
respect you for your personality, knowledge and creativity. You taught me not just
how to be a researcher, but also how to be a man. You have guided me forward on the
road of my life. I am lucky to be one of your students.
I am also very grateful to Prof. Shinji Kimura and Prof. Takeshi Yoshimura of
Waseda Univ. who revised the entire thesis and gave a lot of valuable comments.
I also thank them for their constructive advice on my research.
I am indebted to Dr. Song Chen, Dr. Zhangcai Huang and Dr. Wei Li, for their
constructive and fruitful discussions of my research.
I would also like to thank my classmates Dr. Yiping Dong, Dr. Zhiguo Bao, Dr. Jiongyao
Ye, Shuo Li, Yang Jiang et al., my friends in Watanabe Lab, and many other
students and faculty, too numerous to name. I could discuss not just my research
with you, but also my troubles.
Finally, I give my deepest gratitude to my family: my parents, my wife and my
grandmother. Without their unending support and love, I would never have made it
through this process. Every piece of my achievement contains an invisible part of
their contribution.
Thank you.
Kitakyushu, Japan Ce Li
Nov. 5, 2011
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Glossary
Notations
Abbreviations
1 Introduction
1.1 Motivation
1.2 The Power Consumption in ICs
1.2.1 Dynamic Power
1.2.2 Static Power
1.3 Low Power Technology
1.3.1 Clock Gating
1.3.2 Power Gating
1.3.3 Multi-Vdd
1.4 Organization of This Thesis
2 FPGA Architectures
2.1 Conventional FPGA Architectures
2.1.1 FPGA Architecture of Altera Corp.
2.1.2 FPGA Architecture of Xilinx Inc.
2.2 Academic FPGA Architecture
2.2.1 BLE and CLB
2.2.2 Routing Architecture
2.3 Conclusion
3 Static Coarse-grained Power Gating FPGA Architecture
3.1 Introduction
3.2 Proposed FPGA Architecture
3.2.1 Sleep Region in Proposed FPGA
3.2.2 Power Domain
3.2.3 Signal Isolating
3.3 Routing Resource
3.4 Placement
3.4.1 Cost Function
3.4.2 Improvement for Fast SA
3.4.3 Placement and Routing Result
3.5 CAD Framework
3.6 Experimental Results and Discussion
3.7 Conclusion
4 Dynamic Coarse-grained Power Gating FPGA Architecture and Region Oriented Placement Algorithm
4.1 Introduction
4.2 Related Work
4.3 Background
4.3.1 Sleep Module
4.3.2 Proposed FPGA Architecture and Sleep Region
4.3.3 Power Control Hard Macro (PCHM)
4.4 Placement Algorithm
4.4.1 Linear Congestion Wire Cost and Timing Driven Cost
4.4.2 Sleep Region Cost
4.4.3 FPGA Total Cost
4.4.4 Cost Function Comparison
4.5 CAD Framework and SR-VPR
4.6 Experimental Results and Discussion
4.6.1 Conditions for Experiments
4.6.2 Architecture Comparison
4.6.3 Placement Algorithm Comparison
4.7 Conclusion
5 Region Oriented Routing for Dynamic Power Gating FPGA
5.1 Introduction
5.2 Background
5.2.1 FPGA Architecture
5.2.2 CAD Framework
5.3 Placement
5.4 Routing
5.4.1 Routing Algorithm
5.4.2 Routing Resource Graph
5.4.3 Asymmetric Wilton SB
5.5 Experimental Results and Discussion
5.6 Conclusion
6 Conclusions
Publication List
Bibliography
List of Tables
4.1 The number of CLBs in each SR
4.2 The comparison between different FPGA architectures
4.3 The FPGA power consumption result with one module
4.4 The FPGA power consumption result with two modules
5.1 Experimental results based on different SB styles
List of Figures
1.1 Dynamic power
1.2 Switch power
1.3 Clock gating
1.4 Power gating
2.1 The LE architecture of Altera FPGA [1]
2.2 Cyclone IV Device LAB Structure [2]
2.3 Arrangement of slices within the CLB [3]
2.4 ASMBL architecture [4]
2.5 Architecture of VPR 5.0 [5]
2.6 The basic logic element in VPR
2.7 Schematic of CLB in VPR
2.8 Island style FPGA architecture
2.9 Single driver routing [5]
3.1 The proposed FPGA architecture using SR
3.2 The proposed FPGA internal connection
3.3 Signal isolating. (a) from CB to CLB; (b) from CLB to CB; (c) from CB to SB
3.4 The routing resource graph for the proposed FPGA architecture
3.5 CLB swapping results
3.6 Initial connection before routing
3.7 VPR routing result
3.8 SR-VPR routing result
3.9 CAD framework for the SR based FPGA
3.10 The number of CLBs vs. NSR−size
3.11 Area result vs. NSR−size
3.12 Channel width vs. NSR−size
3.13 Critical path delay vs. NSR−size
3.14 Power vs. NSR−size
4.1 Power gating example of an SoC
4.2 A new FPGA architecture based on SR
4.3 PCHM block diagram
4.4 CLB swapping results
4.5 Initial placement
4.6 VPR placement result
4.7 SR-VPR placement result (NSR−size = 9)
4.8 SR-VPR placement result (NSR−size = 16)
4.9 Software flow for the SR based FPGA
4.10 SR-VPR placement flow
5.1 Region based FPGA architecture [6]
5.2 Architecture evaluation CAD flow
5.3 Placement results with one module
5.4 Placement results with two modules
5.5 FPGA architecture with PDs [7]
5.6 rr-graph modeling [7]
5.7 rr-graph modeling with PDs
5.8 Symmetric and asymmetric SBs
5.9 Mixed routing architecture with both symmetric and asymmetric Wilton SBs
5.10 Average channel width vs. δ. (Type (a) is symmetric Wilton SB, type (b) and (c) are asymmetric Wilton SBs)
Glossary
Some notations may have different meanings locally.
Notations
CostW wire cost
CostT timing driven cost
CostSR sleep region cost
δ aspect ratio (LSB/WSB) of the asymmetric Wilton SB
Fcin connection flexibility of cluster logic block input
Fcout connection flexibility of cluster logic block output
Fs connection flexibility of switch box
LSB edge length of switch box
NSR−size the total number of CLBs in a sleep region
NSR−length the number of CLBs on the edge of sleep region
Rlimit the distance limit for swapping two CLBs
W the number of tracks in a routing channel
WSB edge width of switch box
Vdd the voltage of power supply
Abbreviations
ASIC application-specific integrated circuit
CB connection box
CLB cluster logic block
CHANX horizontal routing channel
CHANY vertical routing channel
EDA electronic design automation
FPGA field-programmable gate array
GND ground
ISOC isolate cell
LAB logic array block
LE logic element
LUT look-up table
MUX multiplexer
PCHM power control hard macro
PD power domain
P&R placement and routing
rr graph routing resource graph
RC routing channel
RTL register transfer level
SA simulated annealing
SB switch box
SoC system on chip
SR sleep region
VDD power supply
VPR versatile place and route
Chapter 1
Introduction
This thesis focuses on a region-based FPGA architecture and related EDA tools for
low power. In this chapter, the components of power consumption are listed. The
causes of both dynamic and static power consumption are explained. Based on these
kinds of power, the related low-power methods used in this FPGA architecture are
discussed. The reasons why other methods cannot be adopted are also given in this
chapter.
1.1 Motivation
Enabled by nanometer process technology, ICs are now used in almost all electronic
equipment, such as computers, mobile phones, domestic appliances and toys. They
have revolutionized our lives. ASIC (Application Specific Integrated Circuit) and
FPGA (Field Programmable Gate Array) are two major types of IC.
An ASIC is an IC designed for a specific application. It has great advantages in
terms of area, performance and power, but design and verification before tape-out
take a long development cycle in order to avoid bugs. Once an ASIC leaves the
production line, it can no longer be altered. If serious bugs are found at that
stage, the chip must be re-designed and taped out again, which wastes money and time.
That is why designers need to be completely sure of their design, especially when
producing the same ASIC in large quantities: the non-recurring cost of an ASIC is
relatively high, often reaching into millions of dollars.
Another type of IC, the FPGA, is not built for any particular circuit at production
time. As the name implies, it can be programmed to implement any function within the
capability of the user and of the device itself. In an FPGA, a certain number of
transistors are spent on providing design flexibility. Therefore, the cost of an
FPGA is often higher than that of an ASIC. However, when design risk is considered,
some designers prefer FPGAs.
The programmability of an FPGA allows manufacturers to correct mistakes and to
send out patches or updates after the product has been sold. Manufacturers also
take advantage of this by prototyping in an FPGA, so that the design can be
thoroughly tested and revised on the PCB before it is sent to the IC foundry for
ASIC production.
Gordon E. Moore, a co-founder of Intel Corp., said, “the number of transistors
that can be placed inexpensively on an IC chip doubles approximately every two
years”[8]. This is known as Moore’s law, a famous observation on the progress of
semiconductor technology. In the early days of integrated circuits, only a few
transistors could be placed on a chip, because the feature sizes of the technology
of the time were large and manufacturing yields were low by today’s standards. With
each new process technology, the width and length of the transistor shrink, so that
millions or even billions of transistors can be placed in the same chip area.
As IC processes improve, power consumption has become a hot issue, especially in
LSI (Large Scale Integrated circuits) and VLSI (Very Large Scale Integrated
circuits). Designers have to face the effects of power consumption and heat removal
caused by the high integration of chips. High power consumption leads to high
temperature.
These days, mobile equipment such as smartphones is familiar to us and offers more
and more functions. While we enjoy the convenience of these phones, we also complain
about their battery life, which lasts only a few hours when we play games or access
the Internet. Both high performance and long battery life are needed.
As a special type of IC, FPGAs consume more power than ASICs when implementing the
same circuit, which limits the application of FPGAs. Therefore, we need to know
where the power consumption comes from before we can solve this problem.
Figure 1.1 Dynamic power (the load capacitance CL at OUT is charged from Vdd and discharged to GND)
In this thesis, we focus on a region-oriented FPGA architecture and related
placement & routing (P&R) methods to reduce the power consumption of FPGAs. With
dynamic power gating applied, power can be reduced by 38% when the design is in the
sleep state. With a small area increase, the performance of the FPGA chip is also
improved.
1.2 The Power Consumption in ICs
The power consumption of an IC consists of two parts, dynamic power and static
power. Dynamic power is consumed when logic levels switch in the circuit: the higher
the toggle rate of the circuit, the higher the dynamic power. Static power is caused
by leakage current and is consumed even in the absence of any switching activity.
1.2.1 Dynamic Power
The dynamic power of an IC chip consists of two parts. The first and primary source
of dynamic power consumption is switching power. The other is internal power.
Usually, there is capacitance between the output of each logic gate and ground
(GND) on the chip. When the output is switched by the input, this capacitance is
charged and discharged, and dynamic power is dissipated during this time, as shown
in Fig. 1.1.
Figure 1.2 Switch power
Switching power can be described as follows,
$P_{switch} = C_L \cdot V_{dd}^{2} \cdot P_{trans} \cdot f_{clock}$ (1.2.1)
where CL is the output capacitance, Vdd is the supply voltage, Ptrans is the
probability of an output transition, and fclock is the effective operating
frequency [9]. From this equation, we can see that the switching power is not a
function of transistor size, but depends on the output capacitance, the supply
voltage and the operating frequency. The higher the operating frequency of the
gate, the more power it consumes. With the other conditions unchanged, reducing Vdd
reduces the switching power quadratically.
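As a minimal sketch of equation 1.2.1, the function below evaluates the switching power and compares the effect of halving Vdd with that of halving the clock frequency. The function name and all numerical values (10 fF load, 10% transition probability, 100 MHz clock) are illustrative assumptions, not figures from this thesis.

```python
def switching_power(c_load, vdd, p_trans, f_clock):
    """Equation 1.2.1: P_switch = C_L * Vdd^2 * P_trans * f_clock (in watts)."""
    return c_load * vdd ** 2 * p_trans * f_clock


# Illustrative values only (assumptions, not taken from the thesis).
C_LOAD = 10e-15   # output capacitance in farads
P_TRANS = 0.1     # probability of an output transition
F_CLOCK = 100e6   # effective operating frequency in hertz

p_nominal = switching_power(C_LOAD, 1.2, P_TRANS, F_CLOCK)
p_half_vdd = switching_power(C_LOAD, 0.6, P_TRANS, F_CLOCK)
p_half_freq = switching_power(C_LOAD, 1.2, P_TRANS, F_CLOCK / 2)

print(f"nominal switching power : {p_nominal:.3e} W")
print(f"Vdd halved              : {p_half_vdd / p_nominal:.2f} x nominal")
print(f"f_clock halved          : {p_half_freq / p_nominal:.2f} x nominal")
```

Because Vdd enters the equation squared, halving it leaves a quarter of the switching power, whereas halving the frequency leaves half of it.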
Internal power is the other source of dynamic power. It is mainly caused by what is
usually called crowbar current, which occurs when there is a momentary path from
VDD to GND. During switching, both the N-MOS and the P-MOS transistors can be
partially on at the same time, so a current flows through the two transistors, as
shown in Fig. 1.2.
Internal power can be calculated by the following equation,
$P_{short} = t_{sc} \cdot V_{dd} \cdot I_{peak} \cdot f_{clock}$ (1.2.2)
where tsc is the duration of the short-circuit current, and Ipeak is the total
internal switching current, including the short-circuit current and the current
required to charge the internal capacitance [9].
$P_{dyn} = P_{switch} + P_{short}$ (1.2.3)
$P_{dyn} \approx C_L \cdot V_{dd}^{2} \cdot P_{trans} \cdot f_{clock}$ (1.2.4)
The total dynamic power consists of Pswitch and Pshort, as shown in equation
1.2.3. Because the short-circuit current flows only for a very short time compared
with the switching time, Pshort can be neglected when we calculate the dynamic
power. The total dynamic power in equation 1.2.3 can therefore be approximated by
equation 1.2.4.
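The approximation from equation 1.2.3 to equation 1.2.4 can be checked numerically with a short sketch. The helper names and the operating point (10 ps short-circuit duration, 5 µA peak current, and so on) are hypothetical values chosen only to illustrate the case where Pshort is small. They are not measurements from this work.

```python
def switching_power(c_load, vdd, p_trans, f_clock):
    """Equation 1.2.1 (watts)."""
    return c_load * vdd ** 2 * p_trans * f_clock


def short_circuit_power(t_sc, vdd, i_peak, f_clock):
    """Equation 1.2.2 (watts)."""
    return t_sc * vdd * i_peak * f_clock


# Hypothetical operating point (all values are assumptions).
VDD = 1.2          # volts
F_CLOCK = 100e6    # hertz
p_switch = switching_power(c_load=10e-15, vdd=VDD, p_trans=0.1, f_clock=F_CLOCK)
p_short = short_circuit_power(t_sc=10e-12, vdd=VDD, i_peak=5e-6, f_clock=F_CLOCK)

p_dyn_exact = p_switch + p_short    # equation 1.2.3
p_dyn_approx = p_switch             # equation 1.2.4
relative_error = (p_dyn_exact - p_dyn_approx) / p_dyn_exact

print(f"P_switch = {p_switch:.3e} W, P_short = {p_short:.3e} W")
print(f"error of equation 1.2.4 vs 1.2.3: {relative_error:.1%}")
```

The error of the approximation is exactly the share of Pshort in the total, so it grows if the short-circuit duration or peak current is larger than assumed here.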
1.2.2 Static Power
The rest of the power consumption is static power, which is consumed independently
of switching activity. Static power consists of several parts: sub-threshold
leakage, gate leakage, gate-induced drain leakage, etc.
Sub-threshold leakage current flows from the drain to the source of a transistor
when it is not turned completely off. That is, when the gate voltage is below the
threshold voltage but higher than GND, a small current flows from drain to source.
The sub-threshold leakage current can be calculated as below [9]:
$I_{sub} = \mu C_{ox} V_{th}^{2} \frac{W}{L} e^{\frac{V_{GS} - V_T}{n V_{th}}}$ (1.2.5)
where µ is the carrier mobility, Cox is the gate capacitance, Vth is the thermal
voltage, W and L are the width and length of the transistor respectively, and n is
a parameter depending on the fabrication process. The sub-threshold leakage current
(Isub) depends exponentially on the difference between VGS and VT, which are the
gate-source voltage and the threshold voltage.
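The exponential dependence in equation 1.2.5 can be illustrated with a small sketch. The thermal voltage is computed as kT/q. The device parameters (mobility, capacitance, W/L, n) and the two threshold voltages are placeholder assumptions, so only the ratio between the two currents is meaningful, not their absolute values.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temperature_k=300.0):
    """Thermal voltage V_th = kT/q in volts."""
    return BOLTZMANN * temperature_k / ELECTRON_CHARGE


def subthreshold_current(mu, c_ox, w, l, v_gs, v_t, n, temperature_k=300.0):
    """Equation 1.2.5: I_sub = mu*C_ox*V_th^2*(W/L)*exp((V_GS - V_T)/(n*V_th))."""
    v_th = thermal_voltage(temperature_k)
    return mu * c_ox * v_th ** 2 * (w / l) * math.exp((v_gs - v_t) / (n * v_th))


# Placeholder device parameters (assumptions for illustration only).
device = dict(mu=0.04, c_ox=0.01, w=180e-9, l=90e-9, v_gs=0.0, n=1.5)

i_high_vt = subthreshold_current(v_t=0.4, **device)
i_low_vt = subthreshold_current(v_t=0.3, **device)
ratio = i_low_vt / i_high_vt

print(f"thermal voltage at 300 K: {thermal_voltage() * 1e3:.2f} mV")
print(f"leakage increase when V_T drops by 100 mV: {ratio:.1f}x")
```

With the gate off (VGS = 0), lowering VT by a fixed step multiplies the leakage by a fixed factor, which is the conflict between voltage scaling and leakage discussed next.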
There is a conflict here: if we reduce dynamic power by scaling Vdd and VT down
with a new technology, the leakage current becomes exponentially worse.
When the process technology is scaled down to 90 nm and below, the gate oxide is
only a few atoms thick, and the tunneling current becomes substantial. Gate leakage,
which flows directly from the gate through the oxide to the substrate, increases.
The static power can become as large as the dynamic power. New CMOS material
technologies such as high-k metal gate silicon technology [10] should be used.
1.3 Low Power Technology
Many methods are used to reduce the power of ICs, targeting different sources of
power consumption or different design levels, such as clock gating, power gating,
multi-Vdd, VTCMOS and so on.
1.3.1 Clock Gating
Clock gating is a popular technique used to save the dynamic power caused by clock
toggling. Many synchronous circuits contain clock trees whose leaf nodes are
balanced. Usually, additional clock gating logic is added at the root of the clock
tree or of a sub-tree. When the clock is gated, no dynamic power is consumed in the
clock buffers or the flip-flops, because there is no switching activity in the
sequential circuit.
An enable condition written in the RTL code can be translated automatically into
fine-grained clock gating by synthesis tools. Figure 1.3 shows the clock gating
logic synthesized by the EDA tool for registers. The clocks of all registers that
are controlled by “EN” in the hardware description language are gated. GCLK is the
gated clock. An ICG (Integrated Clock Gating) hard macro is inserted by the EDA
synthesis tool. Designers can also insert clock gating logic into the design by
instantiating library cells to gate the clock of a group of registers or of a
whole module.
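The behaviour of the gated clock can be sketched with a small sample-based model. Figure 1.3 does not give the internal structure of the ICG cell, so the model below assumes the common arrangement of a latch that is transparent while CLK is low, followed by an AND gate. The function names and the input waveforms are hypothetical.

```python
def simulate_icg(clk_samples, en_samples):
    """Behavioural model of a latch-based ICG (assumed structure).

    The enable is captured only while CLK is low, so a change of EN
    during the high phase cannot create a glitch on GCLK.
    """
    latched_en = 0
    gclk = []
    for clk, en in zip(clk_samples, en_samples):
        if clk == 0:              # latch is transparent while CLK is low
            latched_en = en
        gclk.append(clk & latched_en)
    return gclk


def count_rising_edges(signal):
    """Number of 0 -> 1 transitions, i.e. clock edges seen by the registers."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Hypothetical waveforms: four clock cycles, EN deasserted in the middle.
clk = [0, 1, 0, 1, 0, 1, 0, 1]
en = [1, 1, 0, 0, 0, 1, 1, 1]

gclk = simulate_icg(clk, en)
print("GCLK:", gclk)
print("CLK rising edges :", count_rising_edges(clk))
print("GCLK rising edges:", count_rising_edges(gclk))
```

Every rising edge that is removed from GCLK is one cycle in which the downstream clock buffers and flip-flops do not toggle, which is where the dynamic power saving comes from.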
1.3.2 Power Gating
To reduce the dynamic power, we can use clock gating when the related circuit does
not need to switch. But sub-threshold leakage still occurs even when the clock is
gated. An efficient way to reduce the static leakage is power gating.
Figure 1.3 Clock gating (an ICG cell driven by CLK and EN generates GCLK for the DFFs)
Figure 1.4 Power gating (sleep transistors controlled by SLP create a virtual VDD and a virtual GND for the logic)
Sleep transistors with high VT are used as headers, footers or both to gate the
power. The virtual VDD or virtual GND, shown in Figure 1.4, is connected to the
circuit that needs power gating.
Power gating uses low-leakage P-MOS transistors as header switches to shut off
power supplies to parts of a design in standby or sleep mode. N-MOS footer switches
can also be used as sleep transistors. Inserting the sleep transistors splits the chip’s
power network into a permanent power network connected to the power supply and
a virtual power network that drives the cells and can be turned off [9].
Fine-grained Power Gating
Fine-grained power gating is a power control method that adds a sleep transistor
to every cell and controls the power of each cell individually. Usually, the
fine-grained power gating cells are provided by the library IP vendor or the
standard cell designer at gate level [11]. Since a sleep transistor is added to
each cell, a large area penalty is imposed on the chip. To limit the area increase,
some designs implement fine-grained power gating only for low-VT cells [9].
Coarse-grained Power Gating
Coarse-grained power gating drives cells locally through shared virtual power
networks. Each virtual power network is gated by one or more P-MOS or N-MOS
switches. A coarse-grained power gating architecture imposes a smaller area overhead
than the fine-grained one, which implements power gating in each cell. However,
when a sleep transistor drives many standard cells, it becomes a part of the power
distribution network.
The size of power gating sleep transistor depends on the overall switching current
of the module at any given time. Since only a fraction of circuits switch at any point
of time, power gate sizes are smaller compared to the fine-grained switches. Dynamic
power simulation using worst case vectors can determine the worst case switching for
the module and hence the size. The IR drop can also be factored into the analysis.
Simultaneous switching capacitance is a major consideration in coarse-grained power
gating implementation. In order to limit simultaneous switching, gate control buffers
can be daisy chained, and special counters can be used to selectively turn on blocks
of switches.
1.3.3 Multi-Vdd
In Equation 1.2.1, the switching power is proportional to $V_{dd}^2$, so a lower Vdd can reduce
power significantly. However, a lower voltage leads to larger latency, which can affect the
performance of the chip. Therefore, in ASICs, the high (normal) voltage is applied to the critical
path to keep the performance, while a low voltage is used on non-critical paths to
reduce the power consumption while still meeting system timing [12].
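As a rough worked illustration of the quadratic dependence, assume the switching activity, capacitance and frequency stay unchanged and only the supply voltage differs. The two voltage values below are assumed for the example and are not taken from this thesis:

```latex
\frac{P_{sw,low}}{P_{sw,high}}
  = \left(\frac{V_{dd,low}}{V_{dd,high}}\right)^{2}
  = \left(\frac{0.8\,\mathrm{V}}{1.0\,\mathrm{V}}\right)^{2}
  = 0.64
```

Under these assumptions, a 20% lower supply voltage on the non-critical cells removes 36% of their switching power.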
Multi-Vdd [12, 13, 14] and Vdd-hopping [15] can reduce the power consumption,
but they add complexity to the design. When logic supplied by the
low voltage drives a cell powered by the high voltage, a level converter [16] should be
used to make sure that the high level of the driving signal is higher than VT. Otherwise,
leakage current occurs in the high voltage domain. Multi-Vdd also needs some IOs
to support a lower voltage supply.
FPGAs with dual-Vdd supplies are researched in [17, 18, 19, 20]. In this paper
we do not use multi-Vdd in our low power FPGA, because if the Vdd of each region could
be programmed, a huge number of level converters would be needed on the boundaries of
the regions, which would increase the area significantly. To evaluate the effect of the
proposed P&R method, we simplify the power supply of the FPGA and use only one
Vdd. However, the P&R method in this paper can support multi-Vdd FPGA fabrics.
1.4 Organization of This Thesis
The rest of this paper is organized as follows:
Chapter 2 reviews the state-of-the-art in current FPGA architectures of some
FPGA vendors and the academic FPGA architecture which is used in this paper.
Chapter 3 describes low power placement and routing for the coarse-grained power
gating FPGA architecture where some regions are directly power gated or clock gated
after placement. The routing channels, switch boxes and connection boxes in these
regions are also powered off.
Chapter 4 describes a region-oriented placement algorithm for a coarse-grained
power-gating FPGA architecture whose regions can be dynamically powered off. This
architecture can be used for user designs which have low power states or sleep modules.
Chapter 5 demonstrates the optimization of the switch box for the coarse-grained power
gating FPGA architecture. Enlarging the routing channels that are never
powered off can reduce the routing congestion.
Chapter 6 concludes this thesis and discusses the future work.
Chapter 2
FPGA Architectures
This chapter discusses the state-of-the-art architectures of commercial FPGAs and
the academic FPGA which is provided by versatile place and route (VPR) [7]. Some
common points and differences between commercial FPGA and academic FPGA are
listed in this chapter. Based on these points, we can do our research by using the
academic FPGA. The fruits of research can be used in commercial FPGAs with some
improvements in the future.
2.1 Conventional FPGA Architectures
2.1.1 FPGA Architecture of Altera Corp.
Altera is one of the major FPGA vendors. Its FPGA products fall into three series:
Stratix, Arria and Cyclone [21]. The Stratix series has the
highest performance among Altera FPGAs. The Arria series targets high-performance
computation while keeping costs down. The Cyclone series is intended for
the lowest power and cost applications. So, in this section, we focus on the
architecture of the Cyclone series.
Logic Element
The smallest logic unit in a Cyclone IV device is the logic element (LE). LEs are used to
realize the user design. Each LE has a 4-input look-up table (LUT), a programmable
register, a carry chain connection, a register chain connection, clear-related logic and
so on, as shown in Fig. 2.1.
Each LE register has data, clock, clock enable and clear inputs
and can be configured for various flip-flop operations. Signals that use the global clock
network, general-purpose I/O pins, or any internal logic can drive the clock and clear
control signals of the register [2].
Figure 2.1 The LE architecture of Altera FPGA [1]
Multiplexers (MUXes) provide configuration flexibility. When a MUX bypasses
the register and the LUT output drives the LE outputs directly, the LE can be
used for combinational functions. The MUXes which are connected to the outputs of the
register and the LUT can drive three outputs of the LE. Two outputs drive the column or row
and direct link routing connections, while the last output drives the local interconnect
resources alone. This feature can improve device utilization, because the device
can use the register and the LUT for unrelated functions. The clock enable signal
can be controlled by either general-purpose I/O pins or the internal logic [2].
The register feedback mode allows the register output to feed back into the LUT of
the same LE to ensure that the register is packed with its own fan-out LUT, providing
another mechanism for improved fitting. The LE can also drive out registered and
unregistered versions of the LUT output.
Logic Array Block
Each logic array block (LAB) consists of 16 LEs. The LAB local interconnect
is driven by column and row interconnects and by LE outputs in the same LAB, as shown
in Fig. 2.2 [2]. Neighboring LABs, phase-locked loops (PLLs), memory blocks, and
embedded multipliers from the left and right can also drive the local interconnect of a
LAB through the direct link connection. The direct link connection feature minimizes
the use of row and column interconnects, providing higher performance and flexibility.
Each LE can drive up to 48 LEs through fast local and direct link interconnects.
Figure 2.2 Cyclone IV Device LAB Structure [2]
2.1.2 FPGA Architecture of Xilinx Inc.
Xilinx is another major FPGA vendor, which has three FPGA families:
Artix, Kintex, and Virtex. All three families share the same architecture.
The basic logic element, called a slice, is used for logic realization. A slice is
composed of four 6-input LUTs and eight storage elements. Each 6-input LUT can
be used as two 5-input LUTs with independent outputs.
Configurable Logic Block
High-performance FPGA logic in Xilinx's 7 series FPGAs is provided by configurable
logic blocks (CLBs). Each CLB contains a pair of slices, as shown in Fig. 2.3 [3], which
are connected to the general routing switch matrix. CLBs are the main logic resources for implementing
sequential as well as combinatorial circuits. The two slices do not have direct
connections to each other within the CLB, and each slice is organized
as a column. Each slice in a column has an independent carry chain. Shift registers,
high-speed carry logic and wide MUXes are also included in the CLB.
Figure 2.3 Arrangement of slices within the CLB [3]
ASMBL Architecture [3]
The CLBs are arranged in columns in the Xilinx’s 7 series FPGAs. This series is
based on the unique columnar approach provided by the Advanced Silicon Modular
Block (ASMBL) architecture. ASMBL architecture is used to enable FPGA platforms
with varying feature mixes optimized for different application domains. Through this
innovation Xilinx offers a greater selection of devices, enabling customers to select
the FPGA with the right mix of features and capabilities for their specific design.
Figure 2.4 provides a high-level description of the different types of column-based
resources.
Figure 2.4 ASMBL architecture [4]
2.2 Academic FPGA Architecture
An academic FPGA architecture and a related CAD framework are provided by VPR [7,
22]. Reference [23] also presents a flexible FPGA architecture evaluation framework,
named fpga EVA-LP. In this paper, the research is based on VPR, which has been widely
used in academic research for the past decade. It has now been updated to “VPR
5.0” [5] to include the following features which are present in modern FPGAs.
First, it now supports a broad range of single-driver routing architectures [24],
which are also used in the FPGA architectures of Altera and Xilinx mentioned in the
previous sections [4, 1]. Second, hard memory and multipliers, which are ubiquitous
in modern FPGAs, can be included in this version, as shown in Fig. 2.5. The blocks
which are taller than one grid tile indicate the heterogeneous blocks, such as hard
memory and multipliers.
Figure 2.5 Architecture of VPR 5.0 [5]
The most important reason for selecting VPR is that it is open source software
which we can modify for our research. Also, the technology information
for processes below 40nm is provided by [25].
2.2.1 BLE and CLB
VPR models an array of logic blocks in a two-level hierarchy. The first level is a basic
logic element (BLE) which consists of an LUT and a flip-flop, as shown in Fig. 2.6.
VPR uses a parameter, K, to define the number of LUT inputs. The combinational
logic of the user circuit is realized by the LUTs in the BLEs, and the flip-flops are used
for sequential logic. N BLEs make up a cluster logic block (CLB), which is the
second level, as shown in Fig. 2.7. I denotes the number of inputs of each CLB. In a
fully-connected cluster, not all of the K × N BLE inputs are connected directly to the CLB
inputs. Ref. [7] indicates that if I = 2N + 2, the FPGA can reach 98% logic
utilization when K is 4.
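The relation between the cluster parameters can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the cluster size N = 4 used in the note below is an assumed example value, not a setting taken from this thesis.

```python
def cluster_inputs(n_ble: int) -> int:
    """CLB input count I = 2N + 2, the rule quoted from Ref. [7] for K = 4."""
    return 2 * n_ble + 2


def total_ble_inputs(k: int, n_ble: int) -> int:
    """Total number of LUT inputs inside one cluster, K * N."""
    return k * n_ble
```

For K = 4 and an assumed N = 4, the cluster has 16 BLE inputs but only 10 CLB inputs; the remaining connections are covered by sharing inputs and by local feedback inside the cluster.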
Figure 2.6 The basic logic element in VPR
Figure 2.7 Schematic of CLB in VPR
Without considering the heterogeneous blocks, the island-style FPGA architecture
which is used by VPR can be simplified as shown in Fig. 2.8. The basic units in this
style of FPGA [1] are the CLB, input output block (IOB), connection box (CB), routing
channel (RC) and switch box (SB). The FPGA implements the user circuit by dividing it into small
pieces which can be realized by CLBs. CBs, RCs and SBs are used as wires to connect
the inputs and outputs among the CLBs. Many MUXes, buffers and transmission
gates are used in the FPGA basic units to achieve the flexibility and reprogrammability
of the FPGA. Unused routing resources still consume power.
Figure 2.8 Island style FPGA architecture
2.2.2 Routing Architecture
RCs, SBs and CBs provide the flexibility of FPGA routing. Each RC is composed of W
routing tracks. The inputs and outputs of CLBs are connected to the RCs via CBs.
The number of tracks in each channel to which a CLB input is connected is called
Fcin, and the number of tracks to which a CLB output is connected is Fcout [26].
SBs link up tracks in different RCs on different sides. MUXes are used as the switches of
the track connections, as shown in Fig. 2.9. Fs defines the connection flexibility
of an SB [26]. It is the number of tracks on different sides that are connected to an input of the SB.
All SBs have the same Fs in the original FPGA architecture of VPR.
Figure 2.9 Single driver routing [5]
In the single-driver FPGA routing architecture shown in Fig. 2.9, each track has
one driver. That means only one source track can be programmed to drive the destination
track at a time. The track length, L, is 1, which means each track is as long as one
CLB.
2.3 Conclusion
In this chapter, the state-of-the-art FPGA architectures are explained. More description
is given of the architecture of the academic FPGA which is used in our research in
this paper. Based on this architecture, we develop the placement and routing method
to reduce the power consumption of the FPGA.
Chapter 3
Static Coarse-grained Power
Gating FPGA Architecture
In this chapter, a novel low-power FPGA architecture based on static coarse-grained
power gating is proposed to reduce power consumption. A new placement algorithm
and routing resource graph for sleep regions are also presented. After enhancing the
CAD framework, a detailed discussion is given for the different region sizes supported
by the new FPGA architecture.
3.1 Introduction
Power consumption limits the application of FPGAs in portable electronic
devices. Many commercial FPGA companies focus on shrinking the process
technology to lower the supply power. However, leakage current increases dramatically
once the technology shrinks to 90nm and below [9, 27, 28], and the unused logic in the
FPGA still consumes power. Previous research focuses on power reduction methods
such as power gating, clock gating, dual-Vdd/Vth and so on.
Fine-grained power gating is used for each LUT [29], CLB [17] or routing switch
[19] in FPGA architectures. Usually, a sleep transistor and a 1-bit SRAM control
register are needed for each unit. Obviously, adding sleep transistors leads to area
overhead. However, this overhead is reduced by using coarse-grained power gating,
because several LUTs, CLBs and SBs can be gated by the same sleep transistor.
Some coarse-grained FPGA architectures were proposed [6, 30]. However, they
did not gate the power of routing resources such as SBs. Furthermore, unused CLBs
may be needlessly powered on, since they did not take placement and routing
into consideration.
Dual-Vdd and dual-VT techniques are effective in reducing power consumption.
They are achieved by adding two sleep transistors, 2-bit SRAMs and level converters
to each CLB, and also by using dual-VT libraries. High voltage is applied to the critical
path to keep performance. In Ref. [19], an FPGA with configurable dual supply
voltages and high-VT SRAM was designed. However, these techniques incur more area overhead
in a fine-grained architecture than in a coarse-grained one. Therefore, a coarse-grained
architecture combined with dual-Vdd and dual-VT techniques can have
advantages over the fine-grained architecture.
Placement affects performance and power consumption. A former study [7]
paid more attention to how the critical path is affected by the placement. In Ref. [31],
power gating of logic fabrics was investigated, and region-constrained placement was
applied to reduce the leakage power of unused logic blocks on Xilinx FPGAs. This
placement algorithm placed a designed circuit into contiguous regions by utilizing
two different styles: horizontal and vertical placement. One limitation of this idea is
that, in horizontal placement, a higher row cannot be used until the lower rows are fully filled.
So, the circuit for an IO pad on the top edge of the FPGA may be placed in
the bottom row when the FPGA is large, which increases the wire length and decreases
the FPGA performance.
Based on our proposed FPGA architecture, a new algorithm for circuit placement
and routing has to be explored. The CLBs which implement a circuit are
placed into regions, and the CLBs, CBs and SBs in the unused regions can be powered
off. We use an enhanced routing resource graph to avoid using the powered-off
routing resources. Applying this idea to the proposed FPGA architecture, the total
power consumption is reduced by 21.1% on average. The CAD framework for the
placement is also described in this chapter.
The rest of this chapter is organized as follows. In the next section, the preliminaries
of the low power FPGA design are introduced. Section 3.3 describes the enhanced
routing resource graph. The simulated annealing placement algorithm considering sleep
regions is presented in Section 3.4. The CAD framework which supports this new
FPGA architecture is shown in Section 3.5. Finally, we discuss the experimental
results and conclude this chapter.
3.2 Proposed FPGA Architecture
3.2.1 Sleep Region in Proposed FPGA
Sleep regions (SRs) used in our proposed FPGA architecture are shown in Fig. 3.1,
where SR(x_sr, y_sr) means the sleep region located at SR coordinate (x_sr, y_sr),
such as SR(1, 1) or SR(1, 2). The difference from Fig. 2.8 is that the FPGA is partitioned
into several regions and the power of each region is separately controlled by a
P-MOS sleep transistor. Unlike [30], we control not only the power of the CLBs, but
also that of some CBs and SBs. Square regions are used for SRs. $N_{SR-length}$ is the length
of one side of an SR, that is, the number of CLBs on the SR edge, and $N_{SR-size}$ is
the number of CLBs in one SR. $N_{SR-length}$ and $N_{SR-size}$ are uniform in a fixed
SR-based FPGA. We set $N_{SR-length}$ to 2 and $N_{SR-size}$ to 4 in Fig. 3.1. So, SR(1,2)
is a complete SR which has 2 columns and 2 rows of CLBs, and SR(1,1), SR(2,1) and
SR(2,2) are subsets of a complete SR. If the user circuit is implemented in SR(1,2)
and SR(2,2), the lower two SRs can be powered off. Note that the CBs and SBs
which do not belong to any SR are always powered on.
3.2.2 Power Domain
Power Domain Definition
Because a powered-off SR cannot be used for routing, power domain (PD) information
is used to indicate which regions are powered off and which regions contain the
implemented circuit, so that unused areas can be effectively powered off. In the proposed
FPGA architecture, each SR is the basic unit of a PD. We define three PDs for
placement and routing.
Figure 3.1 The proposed FPGA architecture using SR
• PD(0)
PD(0) is used for the CBs and SBs which are not in any SR. They are not power
gated. We also use PD(0) for an unused CLB inside a used SR, such as CLB(3,3) in Fig. 3.1.
• PD(1)
PD(1) represents an SR which contains used CLBs. Except for the unused CLB slots,
all the CLBs, CBs and SBs located in a PD(1) region are marked as PD(1).
• PD(-1)
The last power domain is the power-off domain. After the enhanced placement, no logic is
implemented in the SRs which are in PD(-1). The CLBs, SBs and CBs in this
power domain are not used. By shutting down these SRs, the power consumption
is reduced.
Algorithm 1 Pseudo code of power domain distribution
x = the horizontal coordinate of CLB
y = the vertical coordinate of CLB
x_sr = the horizontal coordinate of SR
y_sr = the vertical coordinate of SR
sr_length = the side length of SR in CLBs ($N_{SR-length}$)
∗.p = the PD of current unit
x_sr = (x − 1) / sr_length + 1;   (integer division)
y_sr = (y − 1) / sr_length + 1;   (integer division)
if CLB(x, y) ≠ empty then
    CLB(x, y).p = 1;
else
    if SR(x_sr, y_sr) ≠ empty then
        CLB(x, y).p = 0;
    else
        CLB(x, y).p = −1;
    end if
end if
if x % sr_length ≠ 0 then
    CHANY(x, y).p = SR(x_sr, y_sr).p;
else if y % sr_length ≠ 0 then
    CHANX(x, y).p = SR(x_sr, y_sr).p;
else
    CHANX(x, y).p = 0;
    CHANY(x, y).p = 0;
end if
if x % sr_length ≠ 0 and y % sr_length ≠ 0 then
    SB(x, y).p = SR(x_sr, y_sr).p;
else
    SB(x, y).p = 0;
end if
Power Domain Distribution
Before the initial placement, we set the PD information of all routing resources to
PD(0). When a circuit is initially mapped onto the FPGA, the CLBs filled with logic
are set to PD(1). After placement, the PD of each SR which contains CLBs
of PD(1) is set to PD(1). Once we know the PD information of each SR, we can
use Algorithm 1 to distribute the PD information to each FPGA unit. Given x, y,
x_sr and y_sr as the coordinates of the CLB and the SR, the power domains of each CLB(x,
y), SR(x_sr, y_sr), SB(x, y), CHANX(x, y) and CHANY(x, y) are determined. We
take x and y modulo $N_{SR-length}$ to check whether a routing resource
(CHANX or CHANY) is inside an SR.
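The power domain distribution can be sketched in Python as follows. This is a minimal illustration rather than the SR-VPR implementation: the names `sr_of`, `distribute_power_domains` and `used_clbs` are hypothetical, and an SR is assumed to be PD(1) when it contains at least one used CLB and PD(-1) otherwise. One further interpretive assumption is that the CHANX and CHANY tests are applied independently of each other, because that reading reproduces the PD labels drawn in Fig. 3.1.

```python
from typing import Dict, Set, Tuple

Coord = Tuple[int, int]


def sr_of(x: int, y: int, sr_length: int) -> Coord:
    """Map a 1-based unit coordinate to its 1-based SR coordinate."""
    return ((x - 1) // sr_length + 1, (y - 1) // sr_length + 1)


def distribute_power_domains(
    used_clbs: Set[Coord], width: int, height: int, sr_length: int
) -> Dict[str, Dict[Coord, int]]:
    """Assign PD(1), PD(0) or PD(-1) to every CLB, CHANX, CHANY and SB."""
    # Assumption: an SR is "not empty" when it holds at least one used CLB.
    used_srs = {sr_of(x, y, sr_length) for (x, y) in used_clbs}
    pd: Dict[str, Dict[Coord, int]] = {"CLB": {}, "CHANX": {}, "CHANY": {}, "SB": {}}
    for x in range(1, width + 1):
        for y in range(1, height + 1):
            sr_pd = 1 if sr_of(x, y, sr_length) in used_srs else -1
            # CLB: used -> 1, unused inside a used SR -> 0, otherwise -1.
            if (x, y) in used_clbs:
                pd["CLB"][(x, y)] = 1
            elif sr_pd == 1:
                pd["CLB"][(x, y)] = 0
            else:
                pd["CLB"][(x, y)] = -1
            # A resource on an SR boundary (remainder 0) stays in PD(0).
            inside_x = x % sr_length != 0
            inside_y = y % sr_length != 0
            pd["CHANY"][(x, y)] = sr_pd if inside_x else 0
            pd["CHANX"][(x, y)] = sr_pd if inside_y else 0
            pd["SB"][(x, y)] = sr_pd if (inside_x and inside_y) else 0
    return pd
```

Called with the used CLBs of Fig. 3.1, that is (1,4), (2,4), (3,4), (1,3) and (2,3) on a 3 × 4 grid with an SR side length of 2, the sketch marks CLB(3,3) as PD(0), the units of SR(1,1) and SR(2,1) as PD(-1), and the boundary resources such as SB(2,3) and CHANX(1,2) as PD(0).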
An SRAM register named “PWROFF”, which is supplied by the un-gated power, is used to save the
PD information of each SR. It controls the sleep transistor
and the clock gating shown in Fig. 3.2. When the value of this register is “zero”, all the
CLBs, CBs and SBs in the SR are powered on. When the value is “one”, the SR is powered off.
Only one CLB in the SR is shown in this figure. The Basic Logic Element (BLE)
consists of one 4-LUT, one D flip-flop and one 2-to-1 MUX. For clock gating, we use a
clock sub-tree for each SR which can be gated by the state of “PWROFF”. The roots
of the clock sub-trees are balanced on the FPGA chip.
Figure 3.2 The proposed FPGA internal connection
3.2.3 Signal Isolating
Power-on CB Output to Power-off CLB Input
The CB from a set of routing tracks to a CLB input is implemented by using a
multiplexer, as shown in Fig. 3.3(a). Buffers isolate each track from the input of the CB,
and also isolate the output from the input of the CLB. When the CLB is powered off, the
buffer output would cause sneak leakage in the CLB input MUX, which is composed
of transmission gates. Based on the design rule in [11], we put the buffers in the CLB, so that
when the CLB is powered off, the inputs do not lead to sneak leakage through the
powered-off MUX.
Power-off CLB Output to Power-on CB Input
Fig. 3.3(b) shows a buffer shared by the CBs which are connected to the CLB output.
When the CLB is powered off, its floating output signal would cause short-circuit current in
the buffer of the powered-on CB. To avoid this current, an AND2 gate and a NOT
gate are used for output gating. The control signal for gating comes from the
PWROFF register. When the register is set to “1”, the output signal is gated
to low level. Although a powered-off CLB connects to a powered-on CB only
on the SR boundary, we use this method for every CLB output to balance them.
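The gating logic described above reduces to a single Boolean expression. The following Python sketch models it at the logic level only; the function name is hypothetical, and a floating output cannot be represented in Boolean terms, so it is modelled here as an arbitrary 0 or 1.

```python
def gated_clb_output(clb_out: int, pwroff: int) -> int:
    """AND2 of the CLB output with the inverted PWROFF bit (the NOT gate)."""
    return clb_out & (1 - pwroff)
```

Whatever value the powered-off CLB output takes, the gated signal is 0 when PWROFF is 1, and the CLB output passes through unchanged when PWROFF is 0.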
Power-off CB Output to Power-on SB Input
Unidirectional routing architecture, which uses MUXes in the SB as shown in Fig. 3.3(c),
is used in all of the recent commercial devices [1, 4]. Our proposed FPGA is based on
the unidirectional routing architecture. When a CB is powered off, the wires which
connect the CB and the SB have no drivers. However, the SBs which are in none of the SRs
are always powered on, and their MUXes can select a non-floating signal to avoid
leakage.
Figure 3.3 Signal isolating. (a) from CB to CLB; (b) from CLB to CB; (c) from CB to SB
3.3 Routing Resource
To determine FPGA connection information quickly, VPR [7] uses a routing-resource
graph (rr-graph for short) to describe the connectivity. In this graph, each node represents
a wire or a logic block pin, and an edge between two nodes represents
a switch [7]. Figure 3.4 lays out the rr-graph corresponding to the architecture of
Fig. 3.1. SOURCE and SINK routing nodes are used for the logically equivalent pins
of a CLB. For example, all the IN pins of a CLB are connected to the same SINK.
Each wire is a routing node, but to simplify the rr-graph, we just show CHANX and
CHANY instead of the individual wires in the x-channel and y-channel in Fig. 3.4. We also use
a bi-directional edge instead of a pair of directed edges for the unidirectional routing
structure.
The rr-graph of the original VPR does not contain PD information. CBs (CHANY
and CHANX) are connected to each other when they have a common SB neighbor. This
kind of rr-graph cannot be used for the proposed FPGA architecture, because
when we power down the units in a PD(-1) region, the nodes and edges in this region
are no longer available in the rr-graph. So, the rr-graph should be enhanced for
the coarse-grained power gating FPGA. First, we add the PD information to each
routing node. SOURCE, SINK, IN and OUT nodes have the same PD information as their
CLB. The PDs of wires in CHANX and CHANY are set based on Algorithm
1.
Figure 3.4 The routing resource graph for the proposed FPGA architecture
Since the units in SR(1,1) and SR(2,1) in Fig. 3.1 are powered off, they cannot
be used in the rr-graph. No edges are connected to these routing resources, as shown
in Fig. 3.4. The remaining routing resources are powered on. During the placement,
VPR moves the CLBs into $N_{SRmin}$ SRs, where $N_{SRmin}$ is the minimum SR
count needed for the user circuit. When the placement is finished, the power domain
information of each SR is known. The routing resources are then connected, except for the
powered-off ones whose PD is (-1).
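The effect of the PD information on the rr-graph can be illustrated with a small adjacency-list sketch. It is not VPR's internal data structure: the node names, the `prune_rr_graph` function and the choice to drop PD(-1) nodes entirely (rather than keep them without edges) are assumptions made for the illustration.

```python
from typing import Dict, Iterable, List, Tuple


def prune_rr_graph(
    node_pd: Dict[str, int], edges: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Build an adjacency list that excludes every PD(-1) routing node."""
    # Keep only the nodes that remain powered (PD(0) or PD(1)).
    adjacency: Dict[str, List[str]] = {
        node: [] for node, pd in node_pd.items() if pd != -1
    }
    for src, dst in edges:
        # An edge survives only if both of its end points are still powered.
        if src in adjacency and dst in adjacency:
            adjacency[src].append(dst)
    return adjacency
```

A router that searches only this pruned graph can never select a wire or switch inside a powered-off SR, which is the property the enhanced rr-graph is meant to guarantee.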
3.4 Placement
The FPGA placement algorithm determines the locations of the user circuit's CLBs on
the chip to minimize the total area, wire connection and delay of the critical path. We
focus our efforts on power evaluation, because placement also plays an essential role
in reducing power consumption based on SRs. An FPGA CAD tool called Versatile
Place and Route (VPR [7]) is used in our experiment with some modifications.
3.4.1 Cost Function
VPR can obtain a better placement result in a shorter time than other placers [7].
It models the difficulty of routing connections in areas with different channel widths
by using a wire cost, $Cost_W$. The wire cost penalizes placements which require more
routing in areas of the FPGA that have narrower channels.
$$Cost_W = \sum_{i=1}^{N_{net}} q(i)\left(\frac{bb_x(i)}{C_{av,x}(i)^{\beta}} + \frac{bb_y(i)}{C_{av,y}(i)^{\beta}}\right) \qquad (3.4.1)$$
In Eq. 3.4.1, the summation covers all nets in the circuit. For each net i, $bb_x(i)$
and $bb_y(i)$ denote the horizontal and vertical spans of its bounding box respectively.
q(i) is a factor which compensates for the fact that the bounding box wire length model
underestimates the wiring necessary to connect a net with more than three terminals
[32]. Its value depends on the number of terminals of net i. $C_{av,x}(i)$ and $C_{av,y}(i)$
are the average channel capacities (in tracks) in the x and y directions respectively, over
the bounding box of net i. The exponent, β, allows the relative cost of using
narrow and wide channels to be adjusted [7].
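Eq. 3.4.1 can be evaluated directly once the per-net quantities are known. The sketch below is illustrative only: the `NetBox` record and the `wire_cost` function are hypothetical names, and the values of q(i), the bounding box spans and the channel capacities are assumed to be supplied by the caller rather than computed from a netlist.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NetBox:
    q: float        # terminal-count compensation factor q(i)
    bb_x: float     # horizontal span of the bounding box
    bb_y: float     # vertical span of the bounding box
    c_av_x: float   # average channel capacity (tracks) in the x direction
    c_av_y: float   # average channel capacity (tracks) in the y direction


def wire_cost(nets: Iterable[NetBox], beta: float) -> float:
    """Sum q(i) * (bb_x / C_av,x^beta + bb_y / C_av,y^beta) over all nets."""
    return sum(
        n.q * (n.bb_x / n.c_av_x ** beta + n.bb_y / n.c_av_y ** beta)
        for n in nets
    )
```

With β = 0 the channel capacities drop out and the cost reduces to the weighted bounding box half-perimeter; a larger β penalizes nets that cross narrow channels more strongly.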
A path-timing-driven approach is also adopted during the placement to get a faster
FPGA. A timing cost, $Cost_T$, is used for the placement [33]. Ref. [34] introduces a cost
for clock-aware placement.
Placement in VPR is based on $Cost_W$ and $Cost_T$.
However, these costs alone cannot serve the low power architecture and routing approach. Figure
3.5(a) shows the random placement of four CLBs in two SRs. Figures 3.5(b), (c) and
(d) each may be the placement result after CLB swapping based on different wire
and timing costs. Of course, the placement result shown in Fig. 3.5(d) is preferable,
because one blank SR can be powered off. Therefore we introduce another
cost to increase the number of powered-off SRs. First, we define the CLB density of each SR:
$$\rho(x_{sr}, y_{sr}) = \frac{N_{CLB}(x_{sr}, y_{sr})}{N_{SR-size}} \qquad (3.4.2)$$
where $N_{CLB}(x_{sr}, y_{sr})$ is the number of CLBs placed in SR(x_sr, y_sr), and $N_{SR-size}$
is the maximum number of CLBs in each SR. $N_{SR-size}$ is fixed before placement. Then,
we define $Cost_{SR}(x_{sr}, y_{sr})$ as the cost of the SR located at the physical
coordinate (x_sr, y_sr) on the chip. The cost function is as follows,
$$Cost_{SR}(x_{sr}, y_{sr}) = \begin{cases} 1 - \rho(x_{sr}, y_{sr})^2, & \rho(x_{sr}, y_{sr}) \neq 0 \\ 0, & \rho(x_{sr}, y_{sr}) = 0 \end{cases} \qquad (3.4.3)$$
When an SR is not empty, its cost decreases as its CLB count increases, so $Cost_{SR}$
rewards swaps that increase the SR density. It should also reward the generation of blank
SRs, so the cost is 0 when a blank SR is generated by a CLB
swap. The total SR cost is calculated by the following equation,
$$Cost_{SR} = 1 + \sum_{x_{sr}=1}^{FPGA\_length} \sum_{y_{sr}=1}^{FPGA\_width} Cost_{SR}(x_{sr}, y_{sr}) \qquad (3.4.4)$$
The costs of all SRs are summed to obtain the full-chip SR cost. For example, the total
$Cost_{SR}$ is 2.5 in Fig. 3.5(a), 2.375 in (b) and 2.5 in (c), but it is 1.0 in (d). If we only
use $Cost_{SR}$ for the SA placement, a layout result like Fig. 3.5(d) is preferred.
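The totals quoted above can be reproduced with a short Python sketch of Eqs. 3.4.2 to 3.4.4. The function names are hypothetical, and the per-SR CLB counts used in the note below, with $N_{SR-size}$ = 4, are an assumption: they are one assignment of the four CLBs to the two SRs that is consistent with the quoted totals.

```python
from typing import Iterable


def sr_cost(n_clb: int, sr_size: int) -> float:
    """Cost of one SR: 0 for a blank SR, otherwise 1 - density^2 (Eq. 3.4.3)."""
    if n_clb == 0:
        return 0.0
    rho = n_clb / sr_size  # CLB density of the SR (Eq. 3.4.2)
    return 1.0 - rho ** 2


def total_sr_cost(clb_counts: Iterable[int], sr_size: int) -> float:
    """Full-chip SR cost: 1 plus the sum of all per-SR costs (Eq. 3.4.4)."""
    return 1.0 + sum(sr_cost(n, sr_size) for n in clb_counts)
```

Under the assumed counts, two CLBs in each SR give 2.5 (panels (a) and (c)), a 3/1 split gives 2.375 (panel (b)), and packing all four CLBs into one SR gives 1.0 (panel (d)).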
Figure 3.5 CLB swapping results.
3.4.2 Improvement for Fast SA
The original VPR uses a Simulated Annealing (SA) placement algorithm. The total
cost function $Cost_{WTSR}$ for the FPGA is based on the wire cost $Cost_W$, the
path timing cost $Cost_T$ and the SR cost $Cost_{SR}$. To trade off among $Cost_W$, $Cost_T$ and
$Cost_{SR}$, parameters γ and τ are introduced,
$$Cost_{WTSR} = (1-\gamma)\left[(1-\tau)\frac{Cost_W}{Cost_{W,oldT}} + \tau\frac{Cost_T}{Cost_{T,oldT}}\right] + \gamma\frac{Cost_{SR}}{Cost_{SR,oldT}} \qquad (3.4.5)$$
To save CPU time during the placement, we do not re-compute $Cost_{WTSR}$
at each CLB swap. We only add the cost change for the affected nets and
SRs when moving from the old temperature (oldT) to the new one (newT). The old-temperature
values are used to normalize each cost, so that all the costs have roughly the same
magnitude.
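$$\begin{aligned} Cost_{WTSR,newT} &= Cost_{WTSR,oldT} + \Delta Cost_{WTSR} \\ &= Cost_{WTSR,oldT} + (1-\gamma)\left[(1-\tau)\frac{\Delta Cost_W}{Cost_{W,oldT}} + \tau\frac{\Delta Cost_T}{Cost_{T,oldT}}\right] + \gamma\frac{\Delta Cost_{SR}}{Cost_{SR,oldT}} \end{aligned} \qquad (3.4.6)$$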
$\Delta Cost_W$ and $\Delta Cost_T$ are calculated from the nets affected by the two
swapped CLBs (or by moving a CLB to a blank grid location). $\Delta Cost_{SR}$ is
determined by the SRs whose density changes. Only moving a CLB to a blank slot
in a different SR can change $Cost_{SR}$. Swapping CLBs within the same SR
does not affect $Cost_{SR}$.
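The incremental update of Eq. 3.4.6 amounts to one weighted sum per swap. The following Python sketch shows that step in isolation; the function and parameter names are hypothetical, and the three cost deltas are assumed to have been computed elsewhere from the affected nets and SRs.

```python
def delta_cost_wtsr(
    d_wire: float, d_timing: float, d_sr: float,
    wire_old: float, timing_old: float, sr_old: float,
    gamma: float, tau: float,
) -> float:
    """Normalized change of the combined cost for one CLB swap (Eq. 3.4.6)."""
    # Wire and timing terms, each normalized by its old-temperature value.
    wire_timing = (1 - tau) * d_wire / wire_old + tau * d_timing / timing_old
    # gamma sets the weight of the SR term against wire and timing.
    return (1 - gamma) * wire_timing + gamma * d_sr / sr_old
```

A swap inside one SR has `d_sr = 0`, so it is judged on wire and timing alone, whereas a move that empties an SR contributes a negative `d_sr` whose influence grows with γ.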
As in the former work [7], the conditions for exiting SA and for decreasing the temperature
are not changed. When the swapping range is small at low temperature,
$Cost_{SR}$ hardly changes, and only $Cost_W$ and $Cost_T$ optimize the local placement
to reduce the wire cost.
Twenty MCNC [35] benchmarks are placed and globally routed in the smallest
FPGA that could accommodate each. With our algorithm, placement and routing
take 1.1x the CPU time. Because the final result is partly affected by the
rr-graph, in which some RCs and SBs cannot be used for routing, extra time is
spent avoiding the powered-off routing resources. However, the compile time does
not affect the circuit performance, and we obtain a good power reduction, as described in
Section 3.6.
3.4.3 Placement and Routing Result
Fig. 3.6 shows the initial placement of VPR, and Fig. 3.7 shows the placement result
of the original VPR, where we cannot power down the unused CLBs and the related CBs or
SBs. In the coarse-grained power gating architecture, however, the placement and routing results
shown in Fig. 3.8 indicate that the SRs whose CLBs, CBs and SBs are not used can be
powered off.
Figure 3.6 Initial connection before routing
3.5 CAD Framework
A common CAD software flow, from front-end RTL synthesis to back-end placement
and routing, is built on ODIN II [36], ABC [37], T-VPACK, VPR
[7, 22] and so on. Figure 3.9 shows the CAD framework used in our design flow. To
evaluate the power consumption, we use a power model [38] in terms of dynamic
power, short-circuit power and leakage power.
Figure 3.7 VPR routing result
ODIN II and ABC perform synthesis, logic optimization and mapping.
The output of ABC is a .blif (Berkeley Logic Interchange Format) [39] netlist that
consists of LUTs and flip-flops. The T-VPACK program packs the LUTs and flip-flops
into CLBs, each of which contains one or more BLEs, and then generates the output
file in .net format. The activity estimator determines the switching activities
inside the circuit. The Transition Density Model, a probabilistic technique, is used
in the activity generation step [38].
VPR has been widely used as the core CAD tool in research on FPGA architecture,
placement and routing. It supports various FPGA architectures with different BLE
counts, CLB counts or channel widths. Ref. [40] merges the power estimation work [38]
Figure 3.8 SR-VPR routing result
and the latest version of VPR [41]. We therefore modify this VPR to support the
placement and routing algorithm of our FPGA architecture. The MCNC benchmark circuits
are used to evaluate the proposed FPGA architecture with SR.
The intelligent FPGA Architecture Repository (iFAR) [25] contains architecture files
with accurate area and timing estimates for the logic and routing of varied
island-style FPGA architectures, covering CMOS technologies from 22 nm to 180 nm. We
use the 45 nm technology file in our experiment. For our FPGA architecture, not only
the net cost but also the SR size should be considered. So, two new SR arguments
are added to the architecture file: one is γ, the other is NSR−size.
Figure 3.9 CAD framework for the SR based FPGA (flow: .v Verilog code, front-end
synthesis with ODIN II, logic optimization and technology mapping with ABC, .blif
netlist of LUTs and flip-flops, CLB packing with T-VPACK, .net netlist, activity
estimator and activity file, placement and routing with SR-VPR using the iFAR
architecture file and the SR setting file, placement and routing output files)
When placement with the SR arguments is finished, VPR routes the internal
connections. Finally, VPR outputs the total logic area, routing area, circuit placement,
routing information and power consumption. After our enhancement, the PWROFF
SRAM setting of each SR is also generated in the output file.
To obtain accurate power consumption results, we use HSPICE to generate the
parameters in the FPGA architecture file. The SRAM leakage current data comes
from [42]. To be compatible with the current popular process technology, the 45 nm
Predictive Technology Model (PTM) [43] is used.
3.6 Experimental Results and Discussion
Since a new region based FPGA architecture and the related P&R method are
introduced, we run experiments on their effect on area, timing, power
and channel width. FPGAs with different SR sizes are compared with normal
VPR FPGAs that have the same number of CLBs.
We performed experiments on the 20 largest MCNC benchmarks [35]. The .blif netlists
are packed into CLBs that consist of 10 BLEs, and each BLE has one 4-LUT and one
flip-flop. The routing parameters are set as Fcin=0.25, Fcout=0.1, Fs=3, and segment
length L=4. The routing channel width is set to 1.2 times the minimum channel
width required to route each benchmark circuit. We set all the channels to the
same capacity to compare the performance of different placers in our environment.
Figure 3.10 The number of CLBs vs. NSR−size.
To support the SR architecture, we set the FPGA size to a multiple of the SR size.
Figure 3.10 compares the total number of CLBs, the number
of power-on CLBs and the number of used CLBs for different SR sizes. The average
number of CLBs is 200.4 over the 20 MCNC benchmarks. The total CLB
count is the total number of CLBs on the FPGA chip. Although the enhanced
VPR automatically resizes the FPGA according to NSR−size and the benchmark, note
that the values shown by the bars in Fig. 3.10 may not be multiples of NSR−size,
because they are arithmetic means over the 20 benchmarks. Different SR sizes need
different FPGA sizes to make sure that each SR contains exactly NSR−size CLB slots.
When the SR size is 1, the number of power-on CLBs equals the number of used CLBs,
which means that all unused CLBs are powered off. When NSR−size becomes larger, not
all SRs are completely filled with CLBs, so the power-on CLB count grows.
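The relation between used CLBs and power-on CLBs can be sketched in a few lines of Python. This is an illustration only: the function name, the 0-based slot coordinates and the sample placement are assumptions, and SRs are taken to be square.

```python
def power_on_clbs(used_slots, sr_side):
    """Count CLB slots that stay powered: every slot of an SR holding >= 1 used CLB."""
    # Map each used slot to the SR that contains it.
    active_srs = {(x // sr_side, y // sr_side) for (x, y) in used_slots}
    return len(active_srs) * sr_side * sr_side


# Hypothetical placement of four used CLBs on a 6*6 FPGA.
used = [(0, 0), (1, 0), (2, 0), (5, 5)]
for side in (1, 2, 3):
    print(side * side, power_on_clbs(used, side))
```

With an SR size of 1 the power-on count equals the number of used CLBs; with larger SRs partially filled regions keep their unused slots powered.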
Figure 3.11 Area result vs. NSR−size.
As shown in Fig. 3.11, the area becomes larger as the CLB count increases. The
values for a normal FPGA with the same number of CLBs as the SR-based FPGA are
shown in these figures for comparison. The area overhead caused by power gating and
clock gating logic, such as the sleep transistors, power-gating configuration SRAM
and isolation cells, is 2.3% when NSR−size is 16. Because fewer sleep transistors are
used, this area overhead decreases further as NSR−size grows. The trend
of the channel width is shown in Fig. 3.12. When NSR−size is larger, the SR based
routing algorithm needs more channels because some of the CBs and SBs are powered
off.
Figure 3.12 Channel width vs. NSR−size.
Figure 3.13 Critical path delay vs. NSR−size.
Figure 3.14 Power vs. NSR−size.
As shown in Fig. 3.13, the critical path delay is generally not much affected when
NSR−size is less than 25. But when NSR−size grows larger and an SR is powered off,
the routing resources in that SR cannot be used, so connections must take longer
routes to avoid the powered-off RCs and SBs. The performance of the FPGA therefore
decreases when NSR−size is larger.
Although more power is saved when the SR size is larger, as shown in Fig. 3.14, an
NSR−size of 9 or 16 is recommended for the smaller chip area. Taking into account
the power overhead of the low power control logic, the power consumption is reduced
by up to 21.1% when NSR−size is 16. This consists of a 14.7% static power reduction
and an 8.0% dynamic power reduction, less the 1.6% power consumed by the low power
related logic. The area increases by 4.3% compared with the normal FPGA architecture
with the same CLB count. The critical path delay is not affected in this setting,
and the channel width is also unchanged.
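As a quick check of the power figures quoted for NSR−size = 16, the net reduction is the sum of the static and dynamic savings minus the overhead of the low power logic:

```latex
\[
14.7\% + 8.0\% - 1.6\% = 21.1\%
\]
```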
3.7 Conclusion
We have proposed a new coarse-grained power gating FPGA architecture. The placement
algorithm and routing resources are also optimized for this architecture. Compared
with the original FPGA architecture of VPR, which has no low power methods
and no supporting P&R algorithm, a 21.1% power reduction is obtained at the cost of
a 4.3% area increase when NSR−size is 16, by using the proposed FPGA
and the SR-VPR described in this chapter. When the user circuit is smaller, more
unused regions on the SR-based FPGA can be powered off, which gives a larger power
reduction. As a result, our proposed FPGA can be applied to many portable devices
thanks to its low power.
Chapter 4
Dynamic Coarse-grained Power
Gating FPGA Architecture and
Region Oriented Placement
Algorithm
In this chapter, dynamic gating for a coarse-grained FPGA architecture is proposed.
A hard macro is designed to gate the power supply and clock. By using this FPGA
architecture, modules in the user's design can be powered off dynamically to save
power when they are in sleep mode. Power domain information
is updated for multiple power domains. We also propose a region oriented FPGA
placement algorithm, based on SR-VPR, that fits the user's hierarchical design.
4.1 Introduction
Field-Programmable Gate Arrays (FPGAs) have many advantages, such as short
development time and flexibility for commercial design. Their disadvantage is power
consumption, which limits their use in mobile devices. Many commercial FPGA
companies focus on process scaling to reduce supply power
[1][4].
Also, many ASIC companies use FPGA to emulate their designs to check the
design quality and to debug before tape-out, which can reduce the non-recurring
engineering (NRE) cost [7]. Moreover, more and more chips in portable devices and
notebook computers, such as ARM and x86 CPUs and chipsets, are designed with a
hierarchical design method. Power can be saved by dynamically powering off
modules that are in sleep mode. In the ASIC floor-plan, Sleep Regions
(SRs) are used for sleep modules. For FPGAs, fine-grained power gating has been
discussed [17, 19, 29]. It provides flexibility for power gating and is almost
independent of placement. Its drawback is increased area, because each Clustered
Logic Block (CLB) needs its own sleep transistor and control logic on the FPGA
chip. Therefore, most current commercial FPGA chips do not adopt power
gating.
This chapter presents a dynamic coarse-grained power gating FPGA architecture,
instead of the traditional fine-grained one, which supports hierarchical design with
sleep modules. The aim is not only to fill each SR with CLBs that come from the same
module but also to power off unused SRs. In this architecture, a CLB of a sleep
module cannot be placed in the same region as a CLB of an always-on module. Besides,
the fewer SRs are used, the lower the power consumption. The placement algorithm
therefore plays an essential role in a coarse-grained power gating FPGA architecture,
and we focus our effort on the module placement algorithm.
The remainder of this chapter is organized as follows. In the next section, related
work is described. Section 4.3 introduces low power design background and SR based
FPGA architecture. Placement algorithm based on SR is discussed in Section 4.4.
The CAD framework which supports this new FPGA architecture and region oriented
placement algorithm is shown in Section 4.5. Finally, experimental results and the
conclusion are presented in Sections 4.6 and 4.7, respectively.
4.2 Related Work
Former research [9, 17, 30] focused on power reduction methods such as power
gating, clock gating, dual-VTH/VDD and micro-VDD-hopping. Paper [19]
introduced field programmability of dual supply voltages for FPGA power reduction.
A high VDD was applied to critical path logic and a low VDD to non-critical paths to
save power. This is a good way to lower power consumption, but it does
not take the user's top-down design method into consideration. When dynamic
power gating is applied to this kind of FPGA, routing the power control signals of
all CLBs is a heavy burden. In ASIC design, logic in the same module usually has
the same voltage and power state. So, this chapter pays more attention to
module-level region placement with a single supply voltage. Moreover, dual-VDD
can be added for different regions on top of our FPGA architecture if needed.
An asynchronous FPGA architecture based on autonomous fine-grained power gating
was proposed in [29]. It is more power-efficient than a synchronous FPGA when less
than 30% of the total resources are utilized. However, most current systems are
synchronous, so an asynchronous approach cannot be adopted.
In ref. [31], power gating of logic fabrics was investigated, and region-constrained
placement was applied to reduce the leakage power of unused logic blocks on a Xilinx
FPGA. The placement algorithm placed a designed circuit into contiguous regions
in one of two styles: horizontal or vertical placement. One limitation
of this idea is that, in horizontal placement, a row can be used only when the rows
below it are fully filled. So, the circuit for an Input Output (IO) PAD on the top
edge of the FPGA may be placed in the bottom row when the FPGA is large, which may
increase the wire length and decrease the FPGA performance.
In the preceding chapter, we proposed a circuit named 'power control hard
macro' (PCHM) [44], by which synchronous FPGA logic blocks are autonomously
powered off by an IO PAD or internal logic signal, at the cost of increased area.
We applied fine-grained power gating by using a sleep enable (SLPEN) signal, but the
power efficiency is low for small CLBs.
This chapter proposes coarse-grained power gating instead of fine-grained. Only
one PCHM is used for each SR, which is composed of several CLBs, and the PCHM
controls the power of its SR. Reducing the number of PCHMs gives high efficiency
in both power and area. The former study [7] paid more attention to the critical path
affected by the placement. The placement affects the routing resources and power
consumption, but this kind of placement cannot support a low power FPGA
design with sleep modules. Based on the FPGA architecture with SR, a placement
algorithm for user designs with sleep modules is therefore also explored.
4.3 Background
4.3.1 Sleep Module
ASIC chips are composed of several modules, each of which uses its own supply,
because different modules have different performance objectives and constraints. An
example of a mobile SoC (System on Chip) is shown in Fig. 4.1. The processor module
and memory module use un-gated power, while the USB and WIFI modules use gated
power. A power management unit (PMU) controls the power supplies of the USB and WIFI
modules at the command of the processor. Power is saved by shutting down
modules that are in the sleep or idle state.
FPGA users generally pay more attention to optimizing their circuits, but can do
little to reduce power on a fixed FPGA architecture. So, sleep-module power
gating based on their circuits cannot be implemented on an
FPGA chip.
4.3.2 Proposed FPGA Architecture and Sleep Region
We focus on island-style FPGA architectures in this chapter, as shown in Fig. 4.2.
In our proposed architecture, a power control hard macro (PCHM) is used as a low
power controller, like the PMU in Fig. 4.1. Each PCHM can power off 2 (columns) *
2 (rows) CLBs, which together are called an SR. The power of each SR is gated by a
sleep transistor that is controlled by the corresponding PCHM. A MUX is used by the
PCHM to select the SLPEN signal from the internal connections. In other words, we can
assume that the PCHM gets the SLPEN signal from one edge of the SB at the top-left of
the SR. The number of MUXes equals the number of SRs on the FPGA chip. To avoid
leakage, isolation cells (ISOC), which are not shown in this figure, are added to the
CLB outputs. Because the SBs, CBs and RCs become bigger as the circuit grows,
adding ISOCs at the CLB outputs needs less area than adding them at the CB or SB
Figure 4.1 Power gating example of an SoC (blocks: processor module, memory module,
USB sleep module, WIFI sleep module, PMU, VDD)
output. In Fig. 4.2, when the CLBs in SR(1,2) and SR(2,2) are used for the same
module or have the same power behavior, these two PCHMs have the same power state.
SR(1,1) can be used for logic in another module. If no logic is mapped into SR(2,1),
it can be powered off.
4.3.3 Power Control Hard Macro (PCHM)
PCHM controls the power-down and power-up sequencing [44]. It also gates the
clock of the SR. Figure 4.3 shows the PCHM block diagram. The power-off register at
the top-left of this figure is the key register that gates the power of the connected
SR with the highest priority. SLPEN comes from an FPGA IO PAD or from another CLB
output after routing. It controls the power state of the SR dynamically when the
value of the power-off register is "0". The delay counter, a group of D flip-flops
(DFFs), provides the signal phase delays. The number of DFFs can be changed according
to the power sequence. To reduce power consumption, we use a heartbeat clock (32 kHz)
for the delay counter. SLP0, SLP1, SLP2 and SLP3 are the outputs of the delay
counter. They are used by combinational logic to generate the SR control signals
ISOL, CLBRST, VDD_SW, and
Figure 4.2 A new FPGA architecture based on SR. (four SRs, SR(1,1), SR(2,1), SR(1,2)
and SR(2,2), each with four CLBs, one PCHM with its MUX and a VDD connection,
surrounded by SBs)
GCLK. ISOL is used by the ISOCs to isolate the powered-off domains from the
powered-on domain. VDD_SW and GCLK are the power gating control signal and the gated
clock sent to the SR, respectively. CLBRST is the CLB reset signal, derived from the
global clear signal (CLR) and the power on/off sequence.
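The staggered SLP0 to SLP3 outputs can be pictured with a small behavioral sketch in Python. It assumes that the delay counter behaves as a simple shift chain of four DFFs and that the sleep request is the power-off register with SLPEN as a fallback; both are assumptions made for illustration. The combinational logic that derives ISOL, CLBRST, VDD_SW and GCLK is not specified in the text and is therefore not modeled.

```python
class DelayCounter:
    """Chain of D flip-flops clocked by the heartbeat clock (behavioral sketch)."""

    def __init__(self, stages=4):
        self.slp = [0] * stages  # SLP0 .. SLP3

    def tick(self, request):
        # On each heartbeat edge the request shifts one stage down the chain.
        self.slp = [request] + self.slp[:-1]
        return tuple(self.slp)


def sleep_request(power_off_reg, slpen):
    # The power-off register has the highest priority; SLPEN matters only when it is 0.
    return 1 if power_off_reg == 1 else slpen


counter = DelayCounter()
history = [counter.tick(sleep_request(0, 1)) for _ in range(4)]
print(history)
```

Each heartbeat cycle asserts one more SLP output, which gives the later logic distinct phases for a power-down sequence.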
4.4 Placement Algorithm
The FPGA placement algorithm determines the location of each CLB of the user circuit
on the chip. The goals are to minimize total area, wire connections, critical path
delay and so on. We focus our effort on finding a fast method that separates the CLBs
of different modules into different SRs.
Figure 4.3 PCHM block diagram. (inputs: SLPEN, CLK, heartbeat clock, CLR, VDD;
blocks: power-off register, delay counter built from DFFs with outputs SLP0 to SLP3;
outputs: CLBRST, ISOL, GCLK, VDD_SW)
4.4.1 Linear Congestion Wire cost and Timing Driven Cost
Based on an adaptive annealing schedule, the FPGA CAD tool Versatile Place
and Route (VPR [7]) obtains a good placement result within a short time by using
the Simulated Annealing (SA) method. With the linear congestion wire cost function of
Eq. 3.4.1, VPR can model the difficulty of routing connections in areas with different
channel widths. The placement tries to reduce the wire length based on CostW.
To get higher FPGA performance, a path timing driven cost, CostT, is used in [7].
With CostT, CLBs on the critical path are placed close together, which reduces the
critical path delay of the whole FPGA chip.
4.4.2 Sleep Region Cost
None of the above cost functions supports SR placement. A new cost, CostSR, is
introduced to indicate the sleep region cost. To make this idea clear, CLB placement
Figure 4.4 CLB swapping results. (panels (a), (b) and (c); each panel shows SR(1,1)
and SR(2,1) before and after swapping, with every CLB slot labeled PD1, PD2 or
UNUSED (PD0))
cases that can occur after initial placement or during CLB swapping are shown in
Fig. 4.4 [45]. Each SR shown on the left side has CLBs from two modules (M1 and M2),
colored red and blue. M0 means that the CLB slot is unused. Figure 4.4(a)
shows four CLBs of M1 and three CLBs of M2 placed in two SRs. The
desired placement result is shown on the right: all the CLBs in each SR come from
the same module. Figure 4.4(b) shows a special case of Fig. 4.4(a). Although there
are only two CLBs, from different modules, and one SR is large enough to hold them,
we cannot put them in the same SR because their power states may differ. Two SRs are
enough to place the two modules in both Fig. 4.4(a) and Fig. 4.4(b); CLBs of different
modules are separated and placed in the two SRs. Figure 4.4(c) shows an undesired
case, in which different modules share an SR. After swapping, SR(1,1) is completely
filled with CLBs of M1. But SR(2,1) must not hold CLBs that come from
Table 4.1 The number of CLBs in each SR
                                   Before Swap           After Swap
                                 SR(1,1)  SR(2,1)     SR(1,1)  SR(2,1)
Case (a)  NCLB(x_sr, y_sr, 0)       1        0           0        1
          NCLB(x_sr, y_sr, 1)       2        2           4        0
          NCLB(x_sr, y_sr, 2)       1        2           0        3
          NPD(x_sr, y_sr)           2        2           1        1
Case (b)  NCLB(x_sr, y_sr, 0)       2        4           3        3
          NCLB(x_sr, y_sr, 1)       1        0           1        0
          NCLB(x_sr, y_sr, 2)       1        0           0        1
          NPD(x_sr, y_sr)           2        0           1        1
Case (c)  NCLB(x_sr, y_sr, 0)       1        1           0        2
          NCLB(x_sr, y_sr, 1)       3        2           4        1
          NCLB(x_sr, y_sr, 2)       0        1           0        1
          NPD(x_sr, y_sr)           1        2           1        2
different modules when these two modules do not have the same power state. If there
are more SRs on the FPGA chip, swapping continues. So, we must calculate the
minimum number of SRs for all modules and then use an FPGA with enough space
for placement. More conditions are described in Section 4.5.
Two parameters hold the module information of each SR. One is the number
of power domains that have CLBs in SR(x_sr, y_sr), called NPD(x_sr, y_sr); the other
is the number of CLBs of module p in SR(x_sr, y_sr), called NCLB(x_sr, y_sr, p).
x_sr and y_sr are the SR coordinates on the FPGA chip. p is the module index, which
runs from 0 to the total number of modules (NPD) of the FPGA. No logic is assigned
to a CLB slot when p is zero. The values of NPD(x_sr, y_sr) and NCLB(x_sr, y_sr, p)
for the cases of Fig. 4.4 are listed in Tab. 4.1. The placement goal for the SR
architecture FPGA is to make each NPD(x_sr, y_sr) less than 2 and to make
NCLB(x_sr, y_sr, p) as large as possible for each SR.
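The two per-SR parameters can be derived from a list of slot labels, as in the following Python sketch. The function name and the slot ordering are illustrative assumptions; the counts reproduce the "Before Swap" columns of Case (a) in Tab. 4.1.

```python
from collections import Counter


def sr_statistics(slots):
    """slots: module index of each CLB slot in one SR (0 = unused slot)."""
    n_clb = Counter(slots)  # n_clb[p] plays the role of NCLB(x_sr, y_sr, p)
    # NPD counts only real modules; index 0 marks unused slots.
    n_pd = sum(1 for p, n in n_clb.items() if p != 0 and n > 0)
    return n_clb, n_pd


# Case (a) of Tab. 4.1 before swapping (slot order inside the SR is arbitrary).
n_clb_11, n_pd_11 = sr_statistics([1, 1, 2, 0])  # SR(1,1)
n_clb_21, n_pd_21 = sr_statistics([1, 1, 2, 2])  # SR(2,1)
print(dict(n_clb_11), n_pd_11)
print(dict(n_clb_21), n_pd_21)
```

Both SRs report two power domains, which is exactly the situation the placement tries to remove.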
A cost function, CostSR(x_sr, y_sr, p), is introduced to indicate the cost of module
p in the SR located at physical coordinate (x_sr, y_sr) of the FPGA, as shown in
Eq. 4.4.1. It rewards swaps that increase the SR density. NSR−size is the maximum
number of CLBs in one SR; if the SR size is 2*2, NSR−size is 4. CostSR(x_sr, y_sr, p)
becomes zero when the SR is completely filled with CLBs of the same module or holds
no CLBs.
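\[
Cost_{SR}(x_{sr}, y_{sr}, p) =
\begin{cases}
1 - \left(\dfrac{N_{CLB}(x_{sr}, y_{sr}, p)}{N_{SR-size}}\right)^{2} & N_{CLB}(x_{sr}, y_{sr}, p) \neq 0 \\[2ex]
0 & N_{CLB}(x_{sr}, y_{sr}, p) = 0
\end{cases}
\tag{4.4.1}
\]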
The CostSR(x_sr, y_sr) of one SR(x_sr, y_sr) is the sum of the costs of its power
domains, multiplied by NPD(x_sr, y_sr), as shown in Eq. 4.4.2.
\[
Cost_{SR}(x_{sr}, y_{sr}) = N_{PD}(x_{sr}, y_{sr}) \cdot \sum_{p=1}^{N_{PD}(x_{sr}, y_{sr})} Cost_{SR}(x_{sr}, y_{sr}, p)
\tag{4.4.2}
\]
Because module index 0 (M0) is used for unused CLB slots, its cost is not added
into CostSR. If we only summed CostSR(x_sr, y_sr, p) over the power
domains, we could not handle the case shown in Fig. 4.4(b): the cost sum of the
two SRs would not change after swapping. So, we use NPD(x_sr, y_sr) to penalize
an SR that holds CLBs from different modules. After swapping,
NPD(1, 1) and NPD(2, 1) are both 1 in Fig. 4.4(b), and CostSR for the two SRs is
reduced. The total SR cost of the FPGA is the sum of all CostSR(x_sr, y_sr) on the
FPGA, as shown in Eq. 3.4.4. So in Fig. 4.4(a), CostSR before swapping is 7.375, and
it is 1.4375 after swapping.
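The effect of the NPD factor on Case (b) can be reproduced with a short Python sketch of Eq. 4.4.1 and Eq. 4.4.2 alone. The function names and the dictionary layout are illustrative assumptions, and the sum is assumed to run over every module index p ≥ 1 that has CLBs in the SR.

```python
def cost_sr_module(n_clb, n_sr_size):
    """Eq. 4.4.1: cost of one module inside one SR."""
    if n_clb == 0:
        return 0.0
    return 1.0 - (n_clb / n_sr_size) ** 2


def cost_sr(n_clb_by_module, n_sr_size):
    """Eq. 4.4.2: NPD times the summed module costs (module 0 = unused, skipped)."""
    counts = [n for p, n in n_clb_by_module.items() if p != 0 and n > 0]
    n_pd = len(counts)
    return n_pd * sum(cost_sr_module(n, n_sr_size) for n in counts)


# Case (b) of Tab. 4.1 with 2*2 SRs (NSR-size = 4); keys are module indices.
before = cost_sr({0: 2, 1: 1, 2: 1}, 4) + cost_sr({0: 4}, 4)
after = cost_sr({0: 3, 1: 1}, 4) + cost_sr({0: 3, 2: 1}, 4)
print(before, after)
```

In this sketch the two-SR total drops from 3.75 to 1.875 after the swap. Without the NPD factor both totals would be 1.875, so the swap would not be rewarded.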
4.4.3 FPGA Total Cost
We update the FPGA full chip cost function, CostWTSR, based on CostW,
CostT and CostSR. To trade off among these costs, the factors γ and τ are introduced
as shown in Eq. 3.4.5.
In the original VPR environment, τ is set to 0.5 to balance CostW and CostT
[41]. In our enhanced VPR, the minimal γ can be determined automatically from the
user's design. This makes sure that the circuit keeps good performance, driven by
CostT and CostW, while SR placement is supported. A fixed γ provided by the user is
also supported. More information about the CAD flow is given in Section 4.5.
Millions of potential block swaps will be evaluated in a typical placement even
with a good annealing schedule. Making computation as fast as possible is crucial.
To reduce CPU time during placement, we do not re-compute CostWTSR from scratch for
each swap. The swapped CLBs may come from the same module or from different modules,
or a CLB may move to an empty slot. We calculate only the cost change, based on the
affected nets and SRs.
Algorithm 2 Pseudo code of trying swap.
(x_from, y_from) = start coordinate of CLB
(x_to, y_to) = destination coordinate of CLB
p_from = power domain information of start CLB
p_to = power domain information of destination CLB
(x_sr_from, y_sr_from) = SR coordinate where the start CLB is located
(x_sr_to, y_sr_to) = SR coordinate where the destination is located
swap2clb {
    random_pick_start(CLB(x_from, y_from));
    random_pick_destination(CLB(x_to, y_to), Rlimit);
    if p_from ≠ p_to and (x_sr_from, y_sr_from) ≠ (x_sr_to, y_sr_to) then
        ∆CostSR ← calculate_SR_cost_change(CLB(x_from, y_from), CLB(x_to, y_to));
    else
        ∆CostSR ← 0;
    end if
    ∆CostWTSR ← calculate_cost_change(∆CostW, ∆CostT, ∆CostSR);
    if swap_accept(∆CostWTSR, T) then
        CostWTSR_newT ← CostWTSR_oldT + ∆CostWTSR;
        if p_from ≠ p_to and (x_sr_from, y_sr_from) ≠ (x_sr_to, y_sr_to) then
            NCLB(x_sr_from, y_sr_from, p_from) ← NCLB(x_sr_from, y_sr_from, p_from) − 1;
            NCLB(x_sr_to, y_sr_to, p_from) ← NCLB(x_sr_to, y_sr_to, p_from) + 1;
            NCLB(x_sr_to, y_sr_to, p_to) ← NCLB(x_sr_to, y_sr_to, p_to) − 1;
            NCLB(x_sr_from, y_sr_from, p_to) ← NCLB(x_sr_from, y_sr_from, p_to) + 1;
        end if
        swap(CLB(x_from, y_from), CLB(x_to, y_to));
    end if
}
The changes of CostW and CostT are calculated from the nets affected by the two
swapped CLBs (or by moving a CLB to an empty slot). The change of CostSR comes from
the SRs whose NCLB(x_sr, y_sr, p) changes. Based on Eq. 4.4.1 and Eq. 4.4.2, only
swapping two CLBs of different modules between different SRs can affect this cost.
Nothing needs to be recalculated when the swap happens inside the same SR.
Algorithm 2 shows the pseudo code to swap two CLBs.
During each swap, we randomly pick one CLB of the user's design and get its
position (x_from, y_from) and PD information (PD(1) or PD(2)). Then, we randomly
choose a destination position (x_to, y_to) within Rlimit (the distance limit for
swapping two CLBs). If the PD information of CLB(x_from, y_from) differs from that of
CLB(x_to, y_to) and they are not in the same region, CostSR is affected by the
swap. Otherwise, the change of CostSR need not be calculated.
After the changes of CostW, CostT and CostSR are calculated, SA decides whether to
accept the swap based on the cost change (∆CostWTSR) and the
temperature (T). Each cost change is normalized by its cost at the old temperature
(oldT) to balance the different costs, which makes the trade-off factors more
accurate. If the swap is accepted, the cost at the new temperature (newT) is updated
by adding the cost change, as in Eq. 3.4.6.
The CLB counts (NCLB) of the affected PDs (p_from and p_to) in the
start SR(x_sr_from, y_sr_from) and the destination SR(x_sr_to, y_sr_to) are updated
before the two CLBs are swapped.
The conditions to exit SA and to decrease T are almost the same as in the former
work [7]. When the swapping range is small at low temperature, CostSR hardly changes,
and only CostW and CostT optimize the local placement to reduce the wire cost.
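The bookkeeping part of the swap trial can be sketched in Python as follows. The acceptance test uses the Metropolis rule that is common in simulated annealing; this rule, the function names and the nested-dictionary layout of NCLB are assumptions of the sketch, not a description of the SR-VPR source code.

```python
import math
import random


def accept_swap(delta_cost, temperature, rng=random):
    """Metropolis rule (assumed): always accept improvements, sometimes accept worse."""
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp(-delta_cost / temperature)


def apply_swap(n_clb, sr_from, sr_to, p_from, p_to):
    """Update per-SR module counts for an accepted swap; p = 0 marks an empty slot."""
    if p_from == p_to or sr_from == sr_to:
        return  # same module or same SR: the counts do not change
    n_clb[sr_from][p_from] -= 1
    n_clb[sr_to][p_from] += 1
    n_clb[sr_to][p_to] -= 1
    n_clb[sr_from][p_to] += 1


# Case (a) of Tab. 4.1 before swapping: n_clb[(x_sr, y_sr)][p]
n_clb = {(1, 1): {0: 1, 1: 2, 2: 1}, (2, 1): {0: 0, 1: 2, 2: 2}}
# Swap the M2 CLB in SR(1,1) with an M1 CLB in SR(2,1).
if accept_swap(-0.1, temperature=1.0):
    apply_swap(n_clb, (1, 1), (2, 1), p_from=2, p_to=1)
print(n_clb)
```

Only the four affected counters are touched, so the cost of a swap trial does not grow with the size of the FPGA.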
4.4.4 Cost Function Comparison
Different cost functions are compared on the same FPGA architecture to explore
the merits of region oriented placement. Figure 4.5 shows the initial placement result.
Figures 4.6, 4.7 and 4.8 show the placement results for different cost functions or
different region sizes. Since there is no benchmark for evaluating the sleep module
feature on FPGAs, we use two MCNC [35] benchmarks instead and assume that they are in
different power states. The difference between two benchmarks and a user's design
that contains sleep modules is the interconnection between the modules. However, the
comparison based on two benchmarks is fair to the different placers.
Figure 4.5 shows initial placement after VPR is enhanced to get two circuits.
CLBs in different color come from different modules. After initial placement, CLBs
Figure 4.5 Initial placement.
are placed randomly in the FPGA. The placement result without the SR cost is shown
in Fig. 4.6. The blank grids located between the two modules are scattered. Although
the blank space is large enough for a 4*4 sleep region, power gating cannot be used
because its position is not fixed. More effort would be needed to apply low power
design methods. With the SR cost, in contrast, the CLBs are placed into the minimum
number of SRs. With 4*4 SRs, one region can be powered off, as shown in Fig. 4.8. If
we change the SR size to 3*3, three SRs can be powered off, as shown in Fig. 4.7.
4.5 CAD Framework and SR-VPR
There are many commercial FPGA design tools. However, most of them support only
the FPGAs of their own vendor; the placement and routing tools in particular
cannot support other architectures. So, a CAD framework is needed to support and
evaluate our proposed architecture, that is, one with a placement dedicated to the
SR based architecture.
We use two benchmarks as two modules in the user's hierarchical design. Figure 4.9
illustrates the CAD software flow that supports two modules. This flow forms a CAD
Figure 4.6 VPR placement result.
framework that can be used for packing, placement, routing, and power simulation
for non-commercial FPGAs.
The Berkeley Logic Interchange Format (.blif) [39] files describe the MCNC
benchmark circuits. A .blif file can be obtained from a Verilog RTL file with
ODIN II [36] and ABC [37]. The T-VPACK [41] program packs LUTs and flip-flops into
CLBs, each containing one or more Logic Elements (LEs), and writes a .net file for
each module.
VPR supports various FPGA architectures with different numbers of LEs and CLBs
or different channel widths. It places the design with the SA algorithm [7] and routes
it for the normal FPGA architecture.
A power model covering dynamic power, short-circuit power and leakage power is
presented in [38]. The activity estimator determines the switching activities inside
the design. The transition density model, a probabilistic technique, is used in the
activity generation step. Ref. [40] merges the power model with the latest version
of VPR. We enhance this CAD software into Sleep Region VPR (SR-VPR) to
support FPGA designs with sleep modules.
Our developed SR-VPR can treat multiple circuits as different modules. We also
Figure 4.7 SR-VPR placement result (NSR−size = 9); three regions are labeled EMPTY.
modify VPR according to our SR placement algorithm. It reads SR setting parameters
such as NSR−size. After placement, SR-VPR routes the internal connections. The output
of SR-VPR describes the area, critical path delay, circuit placement, routing
information and power consumption. The power-off register setting of each SR is also
generated in the output file.
VPR can auto-size the FPGA based on the benchmark circuit. To support
the SR based FPGA architecture, we enhance this feature so that the FPGA size is the
smallest multiple of the SR size that fits. In SR-VPR, for example, if one benchmark
contains 7 CLBs and the SR size is 3*3, the FPGA is sized to 3*3 for this benchmark.
If the SR size is 2*2, the FPGA is sized to 4*4 and contains 4 SRs.
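The sizing rule can be checked against the 7-CLB example with a small Python sketch. It assumes a square FPGA made of whole square SRs and ignores IO pads; the function names are illustrative and do not come from SR-VPR.

```python
import math


def min_sr_count(clbs_per_module, sr_side):
    """Each module needs its own SRs, so the CLB count is rounded up per module."""
    n_sr_size = sr_side * sr_side
    return sum(math.ceil(n / n_sr_size) for n in clbs_per_module)


def fpga_side(clbs_per_module, sr_side):
    """Side length, in CLBs, of the smallest square FPGA built from whole SRs."""
    needed = min_sr_count(clbs_per_module, sr_side)
    if needed == 0:
        return 0
    srs_per_side = math.isqrt(needed - 1) + 1  # integer ceil(sqrt(needed))
    return srs_per_side * sr_side


print(fpga_side([7], 3))  # one 3*3 SR is enough
print(fpga_side([7], 2))  # two 2*2 SRs needed, so a 2*2 grid of SRs
```

The same helper also gives the minimum SR count for several modules, for example two modules of 5 and 3 CLBs with 2*2 SRs need 2 + 1 = 3 SRs.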
VPR can also find a good solution to place and route the user's design on
different architectures. Introducing SRs, however, brings a new challenge for
placement: not only the net delay but also the module information must be considered.
If the user does not specify the parameters of CostSR, SR-VPR finds the minimum γ
that supports SR placement. Figure 4.10 shows the placement flow with
parameter selection in SR-VPR. After reading the netlist, it checks NPD and
NCLB[1..p]. When the minimum number of SRs for the circuit has been calculated,
Figure 4.8 SR-VPR placement result (NSR−size = 16); one region is labeled EMPTY.
the required FPGA size is fixed. Then, SR-VPR checks whether γ is fixed. If so, it
performs the placement and reports the result. If the user does not give a value, a
placement that uses only the SR cost is run first to check whether the current setting
can support SR placement at all. After that, γ is increased in a loop from a
very small initial value. We set the initial value of γ to 0.05 based on earlier
experiments. When γ becomes large enough for the placement to succeed, SR-VPR starts
routing.
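The γ selection loop can be summarized by the following Python sketch. The callback place is hypothetical: it stands for one placement run with the given γ and returns True when the design fits into the minimum number of SRs. The step of 0.01 is taken from the "γ + 0.01" box of Fig. 4.10; everything else is an illustrative assumption.

```python
def find_gamma(place, gamma_start=0.05, gamma_step=0.01):
    """Return the smallest tried gamma for which place(gamma) succeeds, else None."""
    # SR cost only (gamma = 1): can the design be placed in the minimum SR count at all?
    if not place(1.0):
        return None
    steps = round((1.0 - gamma_start) / gamma_step)
    for i in range(steps + 1):
        # Rounding avoids floating point drift when stepping by 0.01.
        gamma = round(gamma_start + i * gamma_step, 10)
        if place(gamma):
            return gamma
    return None


# Hypothetical design that needs gamma >= 0.2 to reach the minimum SR count.
result = find_gamma(lambda gamma: gamma >= 0.2)
print(result)
```

Starting from a small γ keeps the weight on CostW and CostT as high as possible, so the first γ that succeeds disturbs wire length and timing the least.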
4.6 Experimental Results and Discussion
4.6.1 Conditions for Experiments
To compare the performance of different cost functions, 20 MCNC benchmarks
are placed and globally routed. In the .blif netlists, each CLB consists of 10 LEs,
and each LE has one 4-LUT and one flip-flop. The intelligent FPGA Architecture
Repository (iFAR) [25] contains architecture files with accurate area and timing
estimates for the logic and routing of varied island-style FPGA architectures. We use
the 45 nm technology file in our experiment. The routing parameters are set as
Fcin=0.25, Fcout=0.1,
Figure 4.9 Software flow for the SR based FPGA. (flow: Netlist1 (.blif) for Module1
and Netlist2 (.blif) for Module2, packing with T-VPACK into Netlist1 (.net) and
Netlist2 (.net), activity estimator producing Activity1 and Activity2 (.ac2 & .fun),
placement and routing with SR-VPR using the iFAR architecture file and the SR setting,
outputs: area, power, delay and SR result)
Fs=3, and length of segment L=4 in this file. Based on our experiments in Section
3.6, the best result can be gotten when NSR−size is 4*4 for less area and higher
performance. So, following results of coarse-grained are based on this NSR−size. The
same as VPR, minimum number of transistor is used for area comparison. The
routing channel width is set to 1.2 times of the minimum channel width required
to route each of the benchmark circuits. IO PAD capacity is 4. To get accurate
result of power consumption, we use HSPICE to generate power parameters in FPGA
architecture file. Compatible with the current popular process technology, 45 nm
Predictive Technology Model (PTM) [43] is used. We set VDD to 1.0V.
To evaluate our coarse-grained architecture and placement algorithm, this section
compares results for both a single module and two modules using the 20 MCNC
benchmarks. Section 4.6.2 analyzes the advantages of our proposed coarse-grained
power gating by comparing the results of different FPGA architectures. Based on
this architecture, Section 4.6.3 compares VPR and SR-VPR to determine the impact
of the new placement algorithm.
[Flow chart blocks: START; Read multiple power domain netlists; Check CLB count in
each power domain; Calculate the minimum SR count for the netlist; Fix the FPGA
size; Fixed γ?; Set the γ to 1; Initial placement; Auto place; placed in the
minimal SR count?; γ = 1?; γ = 1 twice?; γ = 0.05; γ + 0.01; γ < 1?; Report
SUCCESS (netlist could be placed); Report FAILED (netlist could not be placed);
FINISH.]
Figure 4.10 SR-VPR placement flow.
4.6.2 Architecture Comparison
All FPGA architectures discussed here are island-style [7]. The normal FPGA
architecture, which is used in the original VPR environment, has no sleep
transistors or PCHMs. The fine-grained and coarse-grained power gating
architectures are based on the normal FPGA architecture with sleep transistors and
PCHMs added. In the fine-grained architecture each CLB has its own P-MOS sleep
transistor and PCHM, while in the coarse-grained architecture all the CLBs in an
SR share one P-MOS sleep transistor and one PCHM. Before comparing VPR and SR-VPR
in detail, we need to determine whether the coarse-grained power gating
architecture has advantages over the others. Table 4.2 compares the different FPGA
architectures using the 20 MCNC benchmarks. We apply the placement algorithm
suited to each FPGA architecture. The original VPR placement algorithm is used for
the normal and fine-grained FPGA architectures. The coarse-grained power gating
architecture needs more support from the placer, which must pack the CLBs into the
minimum number of SRs to maximize the power reduction. We therefore use SR-VPR for
the coarse-grained power gating FPGA in this test.
We run this comparison on a workstation with AMD Opteron 2435 CPUs running at
2.6 GHz. The total memory of this workstation is 4 GB. The operating system is
Linux.
Table 4.2 The comparison between different FPGA architectures.
                                                    FPGA architecture
                                                    Normal     Fine-grained     Coarse-grained
Power-on CLB percent                                100%       67.7%            70.3%
Area of logic                                       5044863    5053846          5124607
Area of sleep transistor and related control logic  -          392611 (7.7%)    118669 (2.3%)
Critical path delay (s)                             4.24E-9    4.37E-9          4.39E-9
Placement time (s)                                  23         26               28 (89)
Routing time (s)                                    278        278              296
Power (W)                                           7.5E-2     5.81E-2          5.63E-2
To compare the architectures, we set the number of CLBs in the normal and
fine-grained FPGA architectures to be the same as in the coarse-grained
architecture. Table 4.2 shows the average powered-on CLB percentage, CPU time of
placement and routing, chip area, critical path delay and power consumption for
the different FPGA architectures. The FPGA in this table is automatically sized to
the minimal size for each benchmark. Usually, the number of CLBs used by a
benchmark is not a multiple of NSR−size. When an SR contains both used and unused
CLBs, the unused CLBs in this SR cannot be powered off. In our experiments, 32.3%
of the CLBs are unused and can be powered off by fine-grained power gating, while
in the coarse-grained power gating architecture this value reaches 29.7%.
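The effect of a partly used SR on the powered-on CLB percentage can be illustrated
with a short calculation. The sketch below is not taken from the thesis: the
function name and the example sizes (100 used CLBs on a 144-CLB FPGA with 16-CLB
SRs) are hypothetical, and it assumes the placer packs the used CLBs into the
minimum number of SRs.

```python
import math
from typing import Optional


def powered_on_fraction(used_clbs: int, total_clbs: int,
                        sr_size: Optional[int] = None) -> float:
    """Fraction of CLBs that must stay powered on.

    sr_size=None models fine-grained gating (every unused CLB is gated).
    Otherwise whole SRs stay on, assuming ideal packing into ceil(n / size) SRs.
    """
    if sr_size is None:
        on = used_clbs
    else:
        on = math.ceil(used_clbs / sr_size) * sr_size
    return on / total_clbs


# Hypothetical example: 100 used CLBs, 144 CLBs on the chip, SR size 4*4 = 16.
fine = powered_on_fraction(100, 144)        # 100 CLBs stay on
coarse = powered_on_fraction(100, 144, 16)  # 7 SRs = 112 CLBs stay on
```

The two fractions coincide only when the number of used CLBs is an exact multiple
of the SR size, which is why the coarse-grained column of Table 4.2 is slightly
higher than the fine-grained one.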
Area results for the different FPGA architectures are given as the number of
minimum-width transistors. Including the CLBs and the routing area, the logic area
of the coarse-grained FPGA architecture increases by 1.4% compared to the normal
architecture. When the sleep transistors used by the CLBs in the same SR are
merged, the area of the sleep transistors can be reduced [6]. This is one of the
advantages of the coarse-grained power gating architecture. Since a normal FPGA
has no sleep transistors, this area is omitted for it. Sleep transistors add a
7.7% area overhead in the fine-grained architecture, but in the coarse-grained
architecture this overhead is reduced to 2.3%. In other words, the area of the
sleep transistors and related control logic is reduced by 69.8% in the
coarse-grained power gating architecture compared to the fine-grained one. When
the area saved by using fewer PCHMs is also considered, the total FPGA area can be
reduced by 3.7%. Our SR-based placement algorithm does not place a heavy burden on
the CPU when γ is fixed, but when γ is detected automatically, the CPU time is
almost 4 times that of VPR. Chip performance is not affected much by SR-VPR, as
shown in the critical path delay row.
By powering off the SBs, RCs and CBs in the unused SRs, the coarse-grained power
gating FPGA architecture finally achieves a larger power reduction than the normal
and fine-grained architectures, as shown in Tab. 4.2. It also saves more area.
These are the reasons why we focus on the coarse-grained architecture.
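Several of the percentages quoted in this subsection follow directly from the
entries of Table 4.2. The short script below recomputes them as a sanity check. It
is only a reader's aid: the variable names are made up here, and it assumes that
the bracketed 89 s in the placement row is the SR-VPR time with automatic γ
detection and that the VPR reference is the 23 s of the normal architecture.

```python
# Values copied from Table 4.2 (areas in minimum-width transistors, power in W).
sleep_area = {"fine": 392611, "coarse": 118669}
power_w = {"normal": 7.5e-2, "fine": 5.81e-2, "coarse": 5.63e-2}
on_pct = {"fine": 67.7, "coarse": 70.3}
place_s = {"vpr": 23, "sr_vpr_auto": 89}  # assumption: 89 s is the auto-gamma run


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Sleep-transistor area saved by sharing one transistor per SR (about 69.8%).
sleep_saving = reduction_pct(sleep_area["fine"], sleep_area["coarse"])

# Power saved relative to the normal architecture.
power_saving = {arch: reduction_pct(power_w["normal"], watts)
                for arch, watts in power_w.items() if arch != "normal"}

# Share of CLBs that can be powered off (about 32.3% and 29.7%).
off_pct = {arch: 100.0 - pct for arch, pct in on_pct.items()}

# Placement slowdown with automatic gamma detection ("almost 4 times").
slowdown = place_s["sr_vpr_auto"] / place_s["vpr"]
```

Under these assumptions the coarse-grained power saving evaluates to about 24.9%,
which agrees with the 75.1% entry in the percent row of Table 4.3.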
4.6.3 Placement Algorithm Comparison
For a detailed evaluation of the proposed SR-VPR, we run placement with VPR and
with SR-VPR on the coarse-grained power gating FPGA architecture. The first
experiment is single-module placement, and the results are shown in Tab. 4.3. In
this experiment we use the minimum FPGA size for each benchmark. This lets us
check the performance of SR-VPR against VPR when the benchmark is treated as one
sleep module.
The usage of IO PADs, CLBs and SRs is listed in this table. When the CLB usage is
lower than the SR usage, at least one SR contains unused CLBs. Different circuits
have different minimum γ values, ranging from 0.05 to 0.11; the average row shows
that γ is 0.07 on average. Placement CPU time increases by 5.3%, and routing time
increases by 6.4%. At the cost of a 2.5% increase in chip area and a 1.6% increase
in critical path delay, power consumption is reduced by 24.9%. Note that the area
of the sleep transistors is not included in the chip area when these two
algorithms are compared on the same FPGA architecture. Compared with the result of
Tab. 4.2, which uses the minimum FPGA size for each benchmark, SR-VPR performs
better when the FPGA usage is lower.
Table 4.3 The FPGA power consumption result with one module
Circuit   IO PAD  CLB     SR      γ     Placement           Routing             Area                Critical path       Power
          usage   usage   usage         time (s)            time (s)            (No. of Tran.)      delay (s)           (W)
                                        VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR
alu4      11.5%   81.9%   88.9%   0.07  8         9         50        39        2.0E+6    2.1E+6    3.6E-9    3.4E-9    5.3E-2    5.0E-2
apex2     21.4%   95.8%   100%    0.08  12        13        52        54        2.2E+6    2.3E+6    3.7E-9    3.9E-9    5.0E-2    5.1E-2
apex4     14.6%   77.8%   77.8%   0.10  9         9         84        73        2.2E+6    2.2E+6    3.5E-9    3.5E-9    3.4E-2    3.0E-2
bigkey    95.1%   15.2%   16.3%   0.11  13        14        30        41        7.8E+6    7.9E+6    2.1E-9    2.1E-9    1.6E-1    7.8E-2
clma      37.5%   78.8%   80.6%   0.07  81        87        738       776       9.1E+6    9.4E+6    7.0E-9    6.7E-9    9.7E-2    9.1E-2
des       97.9%   13.5%   14.1%   0.05  25        27        159       175       1.2E+7    1.2E+7    3.4E-9    3.6E-9    2.4E-1    1.2E-1
diffeq    53.6%   64.6%   66.7%   0.06  10        9         22        21        1.7E+6    1.7E+6    4.5E-9    4.6E-9    2.8E-2    2.4E-2
dsip      95.1%   14.7%   16.3%   0.05  16        18        106       192       8.5E+6    8.5E+6    2.1E-9    2.2E-9    1.5E-1    9.2E-2
elliptic  95.7%   81.3%   81.3%   0.08  29        32        79        103       3.5E+6    3.5E+6    6.3E-9    6.6E-9    4.2E-2    3.8E-2
ex1010    5.2%    73.8%   75.0%   0.11  59        62        2445      2578      1.1E+7    1.1E+7    4.8E-9    4.8E-9    1.0E-1    8.8E-2
ex5p      37.0%   63.2%   66.7%   0.06  7         8         51        40        2.0E+6    2.1E+6    3.9E-9    3.6E-9    3.5E-2    2.7E-2
frisc     53.1%   95.3%   100%    0.07  38        38        191       197       3.9E+6    3.9E+6    8.4E-9    8.5E-9    3.2E-2    3.2E-2
misex3    14.6%   78.5%   88.9%   0.05  8         9         38        51        2.0E+6    2.0E+6    3.2E-9    3.1E-9    5.3E-2    5.0E-2
pdc       17.5%   84.3%   88.0%   0.08  45        46        578       654       6.8E+6    7.4E+6    5.2E-9    5.8E-9    6.4E-2    5.8E-2
s298      5.2%    66.0%   66.7%   0.06  6         6         38        41        2.0E+6    2.0E+6    5.4E-9    5.6E-9    1.8E-2    1.3E-2
s38417    42.2%   85.8%   88.0%   0.05  39        40        46        59        4.8E+6    4.9E+6    4.4E-9    4.7E-9    6.7E-2    6.1E-2
s38584.1  89.1%   66.5%   66.7%   0.11  65        68        117       113       7.2E+6    7.7E+6    3.8E-9    4.0E-9    1.1E-1    8.1E-2
seq       39.6%   95.1%   100%    0.05  14        14        60        58        2.2E+6    2.2E+6    3.1E-9    3.2E-9    5.7E-2    5.6E-2
spla      19.4%   66.5%   68.0%   0.08  30        33        661       632       6.1E+6    6.3E+6    4.2E-9    4.1E-9    7.2E-2    6.3E-2
tseng     90.6%   54.9%   55.6%   0.09  9         9         14        17        1.7E+6    1.8E+6    4.7E-9    4.8E-9    2.9E-2    2.3E-2
average   46.8%   67.7%   70.3%   0.07  26        28        278       296       4.9E+6    5.1E+6    4.4E-9    4.4E-9    7.5E-2    5.6E-2
percent                                           105.3%              106.4%    100%      102.5%    100%      101.6%    100%      75.1%
The second experiment is two-module placement, and the results are shown in
Tab. 4.4. The most notable characteristic of SR-VPR is that it can place the CLBs
of different modules into different SRs. VPR cannot separate the CLBs of different
modules into different regions, as shown in Fig. 4.6. We performed the two-module
experiments using each of the 20 MCNC benchmarks as the first module (M1) and alu4
as the second module (M2). Table 4.4 shows the simulation results for M1 and M2.
Because SR-VPR searches for the best γ, its placement CPU time is 4.4 times that
of VPR, while routing time increases by 12.3%. The placement time can of course be
reduced by using a fixed γ. With the coarse-grained power gating method, the area
of the PCHMs becomes almost negligible as the FPGA chip grows larger. The power
consumption in different states is also compared. Taking the power of the normal
un-gated FPGA architecture placed by VPR as 100%, gating the unused SRs reduces
power by 14.1% while both modules are working. Different modules are powered on
during the “M1ON” and “M2ON” states, in which only M1 or only M2 is powered on,
respectively. The power consumption is reduced by about 30.3% and 31.8% in these
two power states.
Table 4.4 The FPGA power consumption result with two modules
M1        M2    Placement           Routing             Area                Critical path       Power (W)
                time (s)            time (s)            (No. of Tran.)      delay (s)
                VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR    VPR       SR-VPR    M1ON      M2ON
alu4      alu4  23        105       131       174       3.9E+6    4.1E+6    3.5E-9    3.5E-9    1.1E-1    1.0E-1    8.1E-2    8.0E-2
apex2     alu4  27        118       234       292       5.8E+6    6.1E+6    4.1E-9    4.0E-9    1.1E-1    9.6E-2    7.5E-2    7.1E-2
apex4     alu4  25        101       135       143       4.1E+6    4.4E+6    3.5E-9    4.0E-9    9.4E-2    8.8E-2    6.5E-2    7.2E-2
bigkey    alu4  36        158       146       234       1.1E+7    1.2E+7    3.4E-9    3.4E-9    1.7E-1    1.2E-1    9.8E-2    1.1E-1
clma      alu4  108       500       811       997       1.3E+7    1.4E+7    7.2E-9    6.7E-9    1.2E-1    1.0E-1    9.1E-2    7.0E-2
des       alu4  48        213       186       246       1.8E+7    1.8E+7    3.6E-9    3.9E-9    3.0E-1    2.1E-1    1.9E-1    1.6E-1
diffeq    alu4  30        121       103       167       3.5E+6    3.9E+6    4.3E-9    4.3E-9    7.2E-2    6.6E-2    4.8E-2    6.1E-2
dsip      alu4  39        165       186       262       1.1E+7    1.2E+7    3.5E-9    3.7E-9    1.6E-1    1.2E-1    9.5E-2    1.0E-1
elliptic  alu4  56        242       140       177       5.8E+6    6.0E+6    6.3E-9    6.2E-9    7.4E-2    6.6E-2    5.3E-2    5.0E-2
ex1010    alu4  84        395       2122      2098      1.1E+7    1.2E+7    4.6E-9    4.7E-9    1.4E-1    1.3E-1    1.1E-1    9.4E-2
ex5p      alu4  22        96        138       140       3.8E+6    4.1E+6    3.5E-9    3.8E-9    9.2E-2    8.5E-2    6.1E-2    7.3E-2
frisc     alu4  63        272       294       341       6.4E+6    6.6E+6    8.3E-9    8.6E-9    5.7E-2    5.3E-2    4.3E-2    4.1E-2
misex3    alu4  23        96        108       111       3.8E+6    4.0E+6    3.5E-9    3.4E-9    1.0E-1    9.6E-2    7.3E-2    7.8E-2
pdc       alu4  66        291       943       891       1.0E+7    1.1E+7    4.7E-9    5.0E-9    1.2E-1    1.1E-1    9.5E-2    8.2E-2
s298      alu4  21        89        120       108       3.7E+6    4.1E+6    5.1E-9    5.3E-9    5.6E-2    5.1E-2    3.5E-2    4.8E-2
s38417    alu4  60        259       128       203       7.8E+6    8.1E+6    4.7E-9    4.5E-9    1.2E-1    1.1E-1    8.9E-2    9.2E-2
s38584.1  alu4  100       436       149       190       8.4E+6    8.5E+6    3.8E-9    3.8E-9    1.6E-1    1.5E-1    1.3E-1    1.2E-1
seq       alu4  28        122       266       294       6.0E+6    6.2E+6    3.5E-9    4.1E-9    1.2E-1    1.1E-1    8.6E-2    8.0E-2
spla      alu4  55        244       305       352       6.4E+6    7.1E+6    4.3E-9    5.7E-9    1.0E-1    9.7E-2    7.7E-2    7.2E-2
tseng     alu4  29        123       114       169       3.5E+6    3.9E+6    4.5E-9    4.5E-9    7.3E-2    6.6E-2    5.0E-2    5.9E-2
average         47        207       338       379       7.3E+6    7.7E+6    4.5E-9    4.7E-9    1.2E-1    1.0E-1    8.2E-2    8.0E-2
percent         100%      439.2%    100%      112.3%    100%      105.6%    100%      103.4%    100%      85.9%     69.7%     68.2%
Because the two placement methods are compared on the same coarse-grained power
gating architecture, and the CLB size is pre-defined in the architecture file, the
area difference is caused by the channel width and SB size. The area increases by
2.5% in Tab. 4.3 and by 5.6% in Tab. 4.4. The increases in critical path delay,
also shown in Tab. 4.3 and Tab. 4.4, are within 3.4%.
4.7 Conclusion
We have proposed a new low-power FPGA architecture and a placement algorithm for
it, which is incorporated into a top-down design method with sleep modules. The
CLBs of the user circuit are packed into the minimal number of sleep regions. As a
result, power consumption is reduced by 14.1% when the circuit is in its working
state, and by 30.3% when a sleep module is in its sleep state.
Chapter 5
Region Oriented Routing for
Dynamic Power Gating FPGA
Dynamic power gating, when applied to an FPGA, can effectively reduce power
consumption. In this chapter, we propose a sophisticated routing architecture for
a region-oriented FPGA that supports dynamic power gating. This is the first
routing solution for dynamic power gating on a coarse-grained power gating FPGA.
There are two main contributions in this chapter. First, we give a special routing
method that supports dynamically powering off the switch boxes in a sleep region.
Second, an asymmetric Wilton switch box is introduced in the public routing
channels to reduce the channel width. As a result, our proposed FPGA architecture
with sophisticated P&R can reduce the power consumption of a system implemented on
an FPGA.
5.1 Introduction
FPGAs are widely used in commercial designs because of their short development
time and flexibility. Their disadvantage is the power consumption per function,
which limits their application, especially in mobile devices. Many commercial FPGA
vendors focus on technology scaling to obtain a lower supply voltage [1, 4].
However, leakage current has increased dramatically as technology has shrunk to
90 nm and below [9].
Earlier research focused on power reduction methods such as power gating, clock
gating and dual-Vdd [17, 19, 30, 46]. Leakage power can be reduced by gating the
unused logic blocks. Dynamic power can be reduced by supplying a low Vdd to CLBs
on non-critical paths. An FPGA architecture based on autonomous fine-grain power
gating was proposed in [29].
None of the above low-power methods takes the states of the user circuit, such as
active, idle and sleep mode, into consideration. Previous power gating methods are
dedicated to powering modules on or off. Dynamic power gating can reduce power
consumption further [6, 44], because when a logic module changes from active mode
to sleep mode, the Vdd or clock of this module can be gated at the same time as
the power state changes. Ref. [6] introduced a region-based FPGA architecture that
supports dynamic power gating. In this architecture, CLBs whose power is
controlled by a common sleep transistor are combined into one region, so that
power consumption can be reduced efficiently when these CLBs have the same power
state (power-on or power-off). However, placement and routing methods suitable for
this FPGA architecture were not explicitly given.
The placement result affects both the power consumption and the use of routing
resources such as CBs, SBs and RCs. We proposed a region-based placement algorithm
that places CLBs into the minimal number of regions and places the multiple
modules of a user circuit into different regions [45, 47]. These works give
placement solutions for dynamic power gating FPGA architectures. However,
region-based dynamic power gating needs not only placement support but also a
sophisticated routing method if the routing resources in different regions are to
be powered off dynamically.
This chapter is based on the region-oriented FPGA architecture and focuses on the
routing method. We add power domain information to each net in the user circuit
and to each routing resource on the FPGA chip, to prevent SBs and RCs from being
used by nets in other power domains. A power domain defines the FPGA region whose
units (such as CLBs, RCs and SBs) are used for nets and logic in the same power
state. The SBs in the region can then be powered off dynamically. Because the
routing resources are partitioned among the power domains, not all of them are
available to each power domain. When the nets of each power domain are restricted
to the corresponding routing resources, the channel width may increase
significantly to avoid congestion. In this chapter, we newly introduce an
asymmetric “Wilton” SB [48] to increase the routing resources in the public RCs
and SBs, which are shared by every power domain. As a result, the increase in
channel width caused by routing for the dynamic power gating FPGA architecture can
be reduced.
The remainder of this chapter is organized as follows. The next section explains
the background of the dynamic power gating FPGA architecture and the CAD
framework. The placement used for our routing is introduced in Section 5.3. The
distribution of routing resources among different modules and the asymmetric
“Wilton” SB are described in Section 5.4. Finally, experimental results and
conclusions are presented.
Figure 5.1 Region based FPGA architecture [6]
5.2 Background
5.2.1 FPGA Architecture
The basic units in a conventional FPGA such as [7] are the CLB, CB, RC and SB. An
FPGA implements a user's circuit by dividing it into small pieces, each of which
can be implemented by a CLB. CBs, RCs and SBs serve as wiring that connects the
inputs and outputs of the CLBs. To reduce power consumption, sleep transistors are
added to gate the power of the CLBs in a fine-grained power gating FPGA
architecture.
Region-level power gating reduces area overhead and leakage power more than a
CLB-level architecture [6]. Fig. 5.1 shows a coarse-grained FPGA architecture that
supports dynamic power gating. A region is composed of 2*2 (column*row) CLBs and
the RCs between them. Each region is power gated by one sleep transistor, or by
several sleep transistors that switch at the same time. The sleep transistor for a
CLB or RC is controlled by a MUX, which selects one of the power states of the CLB
or RC: static power-on, static power-off or dynamic power control [6].
The basic idea of dynamic power gating comes from a low-power ASIC design method.
Modules are power gated while they are in a sleep state and are powered on when
they are active in the system. The states of these modules are controlled by a
power management unit on the chip or by control signals from outside the ASIC
chip. Based on the state of an always-powered-on control signal that comes from an
internal CLB output or an IO PAD, the FPGA can be dynamically powered off to
reduce power when it is in sleep mode.
However, some problems remain to be solved for this architecture. First, the CLBs
in the same region should belong to the same module or have the same power state.
If CLBs with different power states are placed in the same region, the region can
be powered off only when all of its CLBs enter the sleep state; otherwise, it must
stay powered on. Second, [6] gives only guidance on SB power gating in the
architecture, but power gating of RCs and SBs needs the support of the router. For
example, consider a route from point A to point B in Fig. 5.1. Two paths are
candidates, but path2 will be disconnected when region2 is powered off.
Consequently, a sophisticated router is indispensable for the region-based FPGA
architecture.
We solve these two problems, for placement and for routing, in the following two
sections respectively. Here we define PD information for each routing resource to
specify its power state in this architecture. For example, in Fig. 5.1, if the
CLBs located in region1 and region2 have the same power state, we add “PD(1)”
information to all the CLBs, RCs and SBs in these two regions. If the CLBs in
these two regions have different power states, that is, PG CTRL1 and PG CTRL2 (the
power gating control signals for region1 and region2) are in different states or
the MUXs of the two regions do not select the same control value, we add different
PD information such as “PD(1)” and “PD(2)” to each region. The SBs and RCs that do
not belong to any region are in “PD(0)”. They can be used as public routing
resources by every PD. A region containing CLBs from different PDs is also marked
as “PD(0)”. The 3-to-1 MUX applies ‘0’ to the control signal of the sleep
transistor for a “PD(0)” region. If the FPGA is larger than the circuit and some
regions are not used, the unused regions are marked as “PD(-1)”. The PD
information of each region is determined during placement by the CLBs in it. The
PD information of the RCs and SBs in each region is set according to the region.
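The labeling rules above can be summarized in a small Python sketch. It is an
illustration only: the function names and the data layout (a list of the PD
numbers of the CLBs placed in one region) are assumptions made here, not part of
SR-VPR.

```python
from typing import Dict, Hashable, Iterable, Optional


def region_pd(clb_pds: Iterable[int]) -> int:
    """PD label of one region, given the PDs of the CLBs placed in it.

    -1: unused region (no CLB placed in it)
     0: mixed region (CLBs from different PDs), kept always on
     k: pure region whose CLBs all belong to PD(k)
    """
    pds = set(clb_pds)
    if not pds:
        return -1
    if len(pds) == 1:
        return pds.pop()
    return 0


def resource_pd(region: Optional[Hashable],
                region_pds: Dict[Hashable, int]) -> int:
    """PD label of an RC or SB.

    Resources outside every region (region is None) are public and belong to
    PD(0); resources inside a region inherit the label of that region.
    """
    if region is None:
        return 0
    return region_pds[region]
```

For instance, a region holding two CLBs of PD(1) is labeled 1, a region holding
CLBs of PD(1) and PD(2) is labeled 0, and an SB lying between regions is always
labeled 0.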
5.2.2 CAD Framework
VPR [41] is a well-known placement and routing tool used in academic FPGA
research. It supports many kinds of FPGA architectures, with different numbers of
BLEs and CLBs or different channel widths, and uses different cost functions. For
the region-based FPGA architecture, not only the wire cost and timing cost but
also the region size should be considered. VPR places the circuit with a simulated
annealing algorithm and then routes it for the normal FPGA architecture.
However, the original VPR cannot be applied to our architecture because of two
shortcomings: (1) it cannot place the CLBs in the minimal number of regions; (2)
it cannot route along a desired path such as path1 in Fig. 5.1. Therefore, we
propose an enhanced VPR, named Sleep Region VPR (SR-VPR). The main flow is shown
in Fig. 5.2.
The circuits with sleep modules are input to SR-VPR, and PD information is added
to each net and CLB. SR-VPR then allocates the rr graph (short for routing
resource graph), which is described in Section 5.4.2, based on the architecture
file, and performs the initial placement. By converting the architecture file into
the rr graph, SR-VPR knows the FPGA resources and architecture available for
placement and routing. The PD information of all regions is initially set to
PD(0). It is then updated whenever a region is formed by CLBs of the same PD, and
the RCs and SBs in that region are reserved for routing the nets of this PD. The
placement is finished when the SA terminates. After the placement with the region
arguments is finished, SR-VPR routes the internal connections based on the latest
rr graph. CLBs with the same PD information are placed in the same region. The
output of SR-VPR describes the total logic area, routing area, circuit placement,
routing information and so on [38].
Figure 5.2 Architecture evaluation CAD flow
5.3 Placement
With its adaptive SA schedule, VPR can obtain a good placement result in a short
time. Through the wire cost, VPR models the difficulty of routing connections in
areas with different channel widths. To achieve higher FPGA performance, a
path-timing-driven cost based on the Elmore timing model is usually used. With
this cost, CLBs connected by the critical path are placed close together to reduce
the path delay of the whole FPGA chip.
Ref. [6] uses the placement result of VPR without any enhancement, assuming that
CLBs connected by nets are placed close together by a cost function made up of the
wire cost and the critical path delay. However, after investigating the placement
results, we found that the placement of the original VPR is not suitable for the
region-based FPGA architecture. It has two limitations:
• the minimal number of regions for the circuit cannot be reached,
• mixed placement of CLBs from multiple modules cannot be handled.
Fig. 5.3(a) shows the initial placement of one benchmark. At this point, all CLBs
of the circuit are placed on the FPGA randomly. A pure region, containing only
CLBs from the same PD, is hardly ever generated by the initial placement.
Figure 5.3(b) is the placement result using the wire cost and timing cost in VPR.
The number of regions occupied by the circuit is not reduced to the minimum, so
more regions could be powered off by improving the placement. Figure 5.3(c) shows
the placement result using the wire cost, timing cost and the region cost we
proposed in [47]. The CLBs are placed into the minimal number of regions.
Therefore, more regions can be directly powered off by selecting “1” for the sleep
transistors of these regions.
(a) Initial placement (b) Original VPR placement
(c) SR-VPR placement
Figure 5.3 Placement results with one module
The other limitation occurs when multiple modules are used on the chip. In
Fig. 5.4(a), the CLBs of different modules are placed randomly. With the wire cost
and timing-driven cost, CLBs of different modules may be placed close together, as
shown in Fig. 5.4(b). Regions composed of CLBs from different modules cannot be
powered off; if they were, the function of the related modules would be affected.
In Fig. 5.4(c), however, the CLBs from different modules are placed into different
regions. With our method proposed in [45], the regions occupied by the CLBs of
different modules can be powered off separately.
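Both limitations can be measured on a finished placement by counting the regions
that are occupied and the regions that mix modules. The sketch below is a
hypothetical helper written for illustration, not code from VPR or SR-VPR; it
assumes a placement is given as a dictionary from CLB grid coordinates to module
numbers and that regions are square blocks of `region_size` CLBs per side.

```python
from typing import Dict, Set, Tuple

Coord = Tuple[int, int]


def region_modules(placement: Dict[Coord, int],
                   region_size: int) -> Dict[Coord, Set[int]]:
    """Map each occupied region to the set of modules with CLBs inside it."""
    regions: Dict[Coord, Set[int]] = {}
    for (x, y), module in placement.items():
        key = (x // region_size, y // region_size)
        regions.setdefault(key, set()).add(module)
    return regions


def summarize(placement: Dict[Coord, int],
              region_size: int) -> Tuple[int, int]:
    """Return (occupied regions, regions mixing CLBs of several modules)."""
    regions = region_modules(placement, region_size)
    mixed = sum(1 for modules in regions.values() if len(modules) > 1)
    return len(regions), mixed


# Four CLBs of one module on a 4x4 grid with 2x2 regions.
scattered = {(0, 0): 1, (2, 0): 1, (0, 2): 1, (2, 2): 1}  # one CLB per region
packed = {(0, 0): 1, (1, 0): 1, (0, 1): 1, (1, 1): 1}     # all in one region
```

Here `summarize(scattered, 2)` reports four occupied regions while
`summarize(packed, 2)` reports one, so three more regions could be powered off
after packing. A nonzero second value marks regions that cannot be gated together
with a single module.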
5.4 Routing
Once we have a placement suitable for the dynamic coarse-grained power gating FPGA
architecture, routing is executed with important enhancements over [6]. Routing is
a complex problem in the back-end design of an FPGA. Both the total routing area
and the latency of the critical path are affected by the routing algorithm and the
routing architecture. The routing result determines the final size and performance
of the FPGA.
(a) Initial placement (b) Original VPR placement
(c) SR-VPR placement
Figure 5.4 Placement results with two modules
5.4.1 Routing Algorithm
There are usually two kinds of routing: global routing and detailed routing. The
global router balances the track requirements throughout the FPGA, while the
detailed router assigns the actual routing resources. Several routing algorithms
have been proposed in earlier research, such as the timing-driven [33] and
routability-driven [7] routers of VPR, ROAD [49], and a routing approach via
search-based Boolean satisfiability [50]. Ref. [51] compares these routers and
shows that VPR is a state-of-the-art router suitable for FPGAs. Our research
focuses on the timing-driven VPR router, which performs combined global-detailed
routing.
The VPR router is based on the PathFinder algorithm [52], which is both delay
sensitive and congestion sensitive. A non-timing-critical net is routed along a
longer, uncongested path so that a timing-critical net can use a minimum-delay
path. All nets are ripped up and rerouted during each routing iteration [7]. The
cost of each node for the circuit is updated based on the Elmore delay model. The
routing terminates when all nets are successfully routed and no routing resource
is overused.
[Diagram labels: CLB1 and CLB2, each with pins IN1, IN2 and OUT; channels CHANX1,
CHANX2, CHANY1 and CHANY2; switch boxes SB1 and SB2; connection boxes CB1 to CB6;
regions PD(1), PD(0) and PD(2).]
Figure 5.5 FPGA architecture with PDs [7]
5.4.2 Routing Resource Graph
VPR performs routing on a routing resource graph (rr graph), which models the FPGA
routing problem. In this graph, SOURCE and SINK nodes are CLBs, and routing nodes
are wires (in CHANX or CHANY) or pins (IN or OUT pins of a CLB). Directed edges
between nodes represent the switches in an SB or CB that make the interconnection.
The initial rr graph is generated from the architecture file. Fig. 5.5 [7] is an
example of an FPGA architecture whose CLBs have two inputs and one output, and
Figure 5.6 shows the corresponding rr graph. Assume there is a net between CLB1
and CLB2 in Fig. 5.5. The VPR router searches for a path from SOURCE (CLB1) to
SINK (CLB2) on the rr graph of Fig. 5.6 by traversing directed edges and
evaluating the node capacity, which is the amount of available resources given by
the channel width.
[Diagram nodes: SOURCE, OUT, IN1, IN2 and SINK for each of CLB1 and CLB2, and the
wire nodes CHANX1, CHANY1, CHANX2 and CHANY2.]
Figure 5.6 rr-graph modeling [7]
In the region-based power gating FPGA architecture described in Chapter 4, the PD
information of the routing resources may change after each swap. In Fig. 5.5, CLB1
and CHANX1 are located in PD(1), while CLB2, SB2, CHANX2 and CHANY2 are in PD(2);
the two CLBs are placed in different regions. CHANY1 and SB1 are located in PD(0)
and are used as public routing resources.
If SB2 is to be powered off whenever PD(2) is powered off, nets in PD(1) must not
be routed through PD(2) or use SB2. To solve this problem, we reconstruct an
rr graph for each PD. Figure 5.7 shows the rr graphs for PD(1) and PD(2). CHANY1
is a node used in both PD(1) and PD(2) because it is in PD(0). After the rr graph
is split, however, the edges between CHANX1 and CHANX2 are removed from the
rr graph of each power domain, because CHANX2 cannot be used by the nets in PD(1).
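The per-PD graph reconstruction can be illustrated with a toy example. The sketch
below is not the SR-VPR implementation: the node names (A, B, X2, P1, P2) are
hypothetical, and a plain breadth-first search stands in for the PathFinder-based
VPR router. It only shows how keeping the nodes of one PD plus the public PD(0)
nodes forces a net onto the public channel, as with path1 and path2 in Fig. 5.1.

```python
from collections import deque
from typing import Dict, List, Optional


def split_rr_graph(node_pd: Dict[str, int],
                   edges: Dict[str, List[str]],
                   pd: int) -> Dict[str, List[str]]:
    """Keep only nodes in PD(pd) or public PD(0), and edges between kept nodes."""
    allowed = {n for n, d in node_pd.items() if d in (pd, 0)}
    return {n: [m for m in edges.get(n, []) if m in allowed] for n in allowed}


def route(graph: Dict[str, List[str]],
          source: str, sink: str) -> Optional[List[str]]:
    """Shortest path by hop count (BFS); a stand-in for the real router."""
    prev: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical net in PD(1): source A and sink B, one wire X2 inside PD(2),
# and two public wires P1 and P2 in PD(0).
node_pd = {"A": 1, "B": 1, "X2": 2, "P1": 0, "P2": 0}
edges = {"A": ["X2", "P1"], "X2": ["B"], "P1": ["P2"], "P2": ["B"]}

unrestricted = route(edges, "A", "B")                         # through X2
pd1_path = route(split_rr_graph(node_pd, edges, 1), "A", "B") # through P1, P2
```

On the unrestricted graph the shorter route crosses X2 and would break when PD(2)
is powered off; on the PD(1) graph the search can only return the longer route
through the public wires.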
SR-VPR uses the rr graph not only in routing but also in placement. Although the
PD information for the rr graph could be generated after placement, we update it
after each CLB swap, because the costs of some nets are affected by the placement
when the rr graph is updated. SR-VPR calculates the placement cost on the renewed
rr graph, so it can make each new placement decision based on the latest rr graph,
which carries the power domain information.
[Diagram nodes: the rr graph of Figure 5.6 drawn per power domain. PD(1): SOURCE,
OUT, IN1, IN2 and SINK of CLB1, CHANX1 and CHANY1. PD(2): SOURCE, OUT, IN1, IN2
and SINK of CLB2, CHANX2, CHANY2 and CHANY1.]
Figure 5.7 rr-graph modeling with PDs
5.4.3 Asymmetric Wilton SB
SBs are used on the chip to connect RCs. Several kinds of SBs are discussed in
[53], which presents a switch box in which a pin labeled i can be connected to
another, disjoint pin. Each pin can be connected to three wires on different edges
of the SB (Fs=3, where Fs is the number of tracks to which an input pin on an SB
edge can connect [26]). SB type (a) in Fig. 5.8 is the original Wilton SB; we call
it the “symmetric Wilton SB”. Sa denotes the set of connections of this symmetric
SB [48] and is defined as follows:
\[
S_a = \bigcup_{i=0}^{W_{SB}-1} \bigl\{
  (t_{0,i},\, t_{2,i}),\;
  (t_{1,i},\, t_{3,i}),\;
  (t_{0,i},\, t_{1,(W_{SB}-i) \bmod W_{SB}}),\;
  (t_{1,i},\, t_{2,(i+1) \bmod W_{SB}}),\;
  (t_{2,i},\, t_{3,(2W_{SB}-2-i) \bmod W_{SB}}),\;
  (t_{3,i},\, t_{0,(i+1) \bmod W_{SB}})
\bigr\}
\tag{5.4.1}
\]
(b) (c)
(a) WSB
WSB
WSB
WSB
WSB
L
SB
WSB
L
SB
L
SB
L
SB
L
SB
L
SB
Figure 5.8 Symmetric and asymmetric SBs
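Eq. 5.4.1 can be enumerated directly. The following minimal Python sketch is not part
of SR-VPR: the function names symmetric_wilton and pin_degree, and the encoding of a
pin t_{s,i} as the tuple (s, i), are assumptions made for illustration. It builds Sa
for a given WSB and counts the connections of every pin, which corresponds to Fs.

```python
from collections import Counter


def symmetric_wilton(w_sb):
    """Enumerate the connection set Sa of Eq. 5.4.1.

    A pin t_{s,i} (edge s, track i) is encoded as the tuple (s, i);
    this encoding is an illustrative assumption.
    """
    conns = set()
    for i in range(w_sb):
        conns.add(((0, i), (2, i)))
        conns.add(((1, i), (3, i)))
        conns.add(((0, i), (1, (w_sb - i) % w_sb)))
        conns.add(((1, i), (2, (i + 1) % w_sb)))
        conns.add(((2, i), (3, (2 * w_sb - 2 - i) % w_sb)))
        conns.add(((3, i), (0, (i + 1) % w_sb)))
    return conns


def pin_degree(conns):
    """Count how many connections touch each pin (its Fs)."""
    degree = Counter()
    for a, b in conns:
        degree[a] += 1
        degree[b] += 1
    return degree


s_a = symmetric_wilton(4)
fs = pin_degree(s_a)
print(len(s_a), set(fs.values()))
```

For WSB = 4 the sketch yields 24 connections, and each of the 16 pins has exactly
three connections, in agreement with Fs=3.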
When the SBs in a region are dynamically powered off, every net routed through them
is affected. A net should therefore be routed either in a region that has the same
power state as the net, or in the public channels, which do not belong to any region
and are never powered off. The public channels may carry a heavier load than the
channels inside a region: since some routing nodes cannot be used by every PD,
routing may need more channels in PD(0). To alleviate this problem, we introduce
an “asymmetric Wilton SB” whose channel counts differ in length (LSB) and
width (WSB), for the public routing channels in PD(0). In the dynamic power gating
FPGA architecture, asymmetric Wilton SBs are used only on the region edges. The
SBs at the corners or inside of a region are symmetric Wilton SBs, as shown in Fig. 5.9.
We call this type of routing architecture the “mixed SB” architecture. It requires
a switch box whose edges have different lengths and widths. SBs (b) and (c) in Fig. 5.8
show two different ways of enhancing the symmetric Wilton SB into such a switch
box.
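The placement rule of the mixed SB architecture can be sketched in Python as follows.
The coordinate convention is an assumption that does not come from the text: SBs sit
on integer grid points (x, y), and region boundary lines lie at multiples of the
region side length (4 for a 4*4 region). The function name sb_type is hypothetical.
An SB that lies on exactly one boundary line is on a region edge. An SB on two
boundary lines is at a corner, and an SB on none is inside a region.

```python
def sb_type(x, y, region_side=4):
    """Classify the SB at grid point (x, y) for the mixed SB architecture.

    Assumes region boundaries at multiples of `region_side` (hypothetical
    coordinate convention, used only for illustration).
    """
    on_vertical_boundary = x % region_side == 0
    on_horizontal_boundary = y % region_side == 0
    if on_vertical_boundary != on_horizontal_boundary:
        return "asymmetric"  # on a region edge, but not at a corner
    return "symmetric"       # region corner (both) or region interior (neither)


# Print the SB types of one 4*4 region including its boundary.
for y in range(5):
    print(" ".join(sb_type(x, y)[0] for x in range(5)))
```

Each printed letter is the initial of the SB type at that grid point (a for
asymmetric, s for symmetric).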
The type (b) SB is generated in two steps. First, we enlarge the SB
size to WSB ∗ LSB and add input pins on both long edges. Then, we connect the pins
on different edges of the SB. The connections are shown in Fig. 5.8, and the
connection set of type (b) is defined by Eq. 5.4.2.
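\[
S_b = \Bigl\{ \bigcup_{i=0}^{W_{SB}-1} (t_{0,i},\, t_{2,i}),\;
\bigcup_{i=0}^{L_{SB}-1} \bigl\{
  (t_{1,i},\, t_{3,i}),\;
  (t_{0,\,i \bmod W_{SB}},\, t_{1,\,(L_{SB}-i) \bmod L_{SB}}),\;
  (t_{1,i},\, t_{2,\,(i+1) \bmod W_{SB}}),\;
  (t_{2,\,i \bmod W_{SB}},\, t_{3,\,(2L_{SB}-2-i) \bmod L_{SB}}),\;
  (t_{3,i},\, t_{0,\,(i+1) \bmod W_{SB}})
\bigr\} \Bigr\} \tag{5.4.2}
\]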
For the SB of type (c), we first increase the size of the SB from WSB ∗ WSB to
LSB ∗ LSB. Then, we remove the redundant pins on the short edges, together with
the connections of these pins. Finally, we redraw the SB as WSB ∗ LSB, shown as
type (c) in Fig. 5.8, and keep the remaining connections. The connection set of
type (c) is as follows:
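\[
S_c = \Bigl\{ \bigcup_{i=0}^{W_{SB}-1} (t_{0,i},\, t_{2,i}),\;
\bigcup_{i=0}^{L_{SB}-1} (t_{1,i},\, t_{3,i}),\;
\bigcup_{i=0}^{W_{SB}-1} \bigl\{
  (t_{0,i},\, t_{1,\,(L_{SB}-i) \bmod L_{SB}}),\;
  (t_{2,i},\, t_{3,\,(2L_{SB}-2-i) \bmod L_{SB}})
\bigr\},\;
\bigcup_{i=0}^{L_{SB}-1} \bigl\{
  (t_{1,i},\, t_{2,\,(i+1) \bmod W_{SB}}),\;
  (t_{3,i},\, t_{0,\,(i+1) \bmod W_{SB}})
\bigr\} \Bigr\} \tag{5.4.3}
\]
Figure 5.9 Mixed routing architecture with both symmetric and asymmetric Wilton
SBs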
The difference between these two types of asymmetric Wilton SB is that, in type (b),
the Fs of some incoming wires on the short edges is more than 3, while in type (c)
the Fs of some incoming wires on the long edges is less than 3. The
asymmetric Wilton SB can be routed with the same routing algorithm as the symmetric
Wilton SB (in which Fs is 3 for every incoming wire). The horizontal edges of both
types of asymmetric SB are longer than the vertical edges when LSB is larger than WSB.
SBs whose vertical edges are longer than their horizontal edges are also needed, as
shown in Fig. 5.9. These SBs are generated in the same way as the SBs above.
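The Fs values of the two asymmetric SBs can be checked by enumerating Eq. 5.4.2 and
Eq. 5.4.3 exactly as they are written. The following minimal Python sketch is an
illustration only: the function names, the encoding of a pin t_{s,i} as the tuple
(s, i), and the example sizes WSB = 4 and LSB = 6 are assumptions, not part of
SR-VPR. Edges 0 and 2 are the short edges (WSB pins) and edges 1 and 3 are the long
edges (LSB pins).

```python
from collections import Counter


def wilton_type_b(w_sb, l_sb):
    """Connection set Sb of Eq. 5.4.2."""
    conns = {((0, i), (2, i)) for i in range(w_sb)}
    for i in range(l_sb):
        conns.add(((1, i), (3, i)))
        conns.add(((0, i % w_sb), (1, (l_sb - i) % l_sb)))
        conns.add(((1, i), (2, (i + 1) % w_sb)))
        conns.add(((2, i % w_sb), (3, (2 * l_sb - 2 - i) % l_sb)))
        conns.add(((3, i), (0, (i + 1) % w_sb)))
    return conns


def wilton_type_c(w_sb, l_sb):
    """Connection set Sc of Eq. 5.4.3."""
    conns = {((0, i), (2, i)) for i in range(w_sb)}
    conns |= {((1, i), (3, i)) for i in range(l_sb)}
    for i in range(w_sb):
        conns.add(((0, i), (1, (l_sb - i) % l_sb)))
        conns.add(((2, i), (3, (2 * l_sb - 2 - i) % l_sb)))
    for i in range(l_sb):
        conns.add(((1, i), (2, (i + 1) % w_sb)))
        conns.add(((3, i), (0, (i + 1) % w_sb)))
    return conns


def fs_by_edge(conns):
    """Map each edge to the sorted list of per-pin connection counts."""
    degree = Counter()
    for a, b in conns:
        degree[a] += 1
        degree[b] += 1
    per_edge = {}
    for (edge, _), count in degree.items():
        per_edge.setdefault(edge, []).append(count)
    return {edge: sorted(counts) for edge, counts in per_edge.items()}


fs_b = fs_by_edge(wilton_type_b(4, 6))
fs_c = fs_by_edge(wilton_type_c(4, 6))
print("type (b):", fs_b)
print("type (c):", fs_c)
```

With these example sizes, every long-edge pin of type (b) keeps three connections
while some short-edge pins have more, and some long-edge pins of type (c) have only
two. When LSB equals WSB, both functions return the same 6·WSB connections.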
To compare the two asymmetric Wilton SBs with the symmetric Wilton SB, we ran an
experiment using a parameter δ, the aspect ratio (LSB/WSB) of the
asymmetric Wilton SB. Figure 5.10 shows the average channel width
over 20 MCNC benchmarks for the different SB types. Each time, two instances of
the same benchmark circuit are placed and routed on one FPGA chip. The region
size is 16 (4*4). When δ is 1.2, the minimum channel width is obtained by the
mixed SB architecture that combines the symmetric Wilton SB with the asymmetric
Wilton SB of type (b). But if δ grows larger than 1.4, the utilization of the longer
edges of the asymmetric Wilton SB is low once all the incoming wires on the short
edges are used. The channel width therefore grows while more channels are left unused.
Figure 5.10 Average channel width vs. δ. (Type (a) is symmetric Wilton SB, type
(b) and (c) are asymmetric Wilton SBs)
5.5 Experimental Results and Discussion
After the placement for the dynamic coarse-grained power gating, we compare the
routing results of the mixed Wilton SB with those of the symmetric one. Each CLB
consists of 10 basic logic elements, each of which has one 4-LUT and one flip-flop.
The routing parameters defined in [26] are set as Fcin=0.25 and Fcout=0.1 (the
fractions of tracks to which the input and output pins of a CLB connect), Fs=3, and
a segment length of 4. Combined global-detailed routing is run on the 20 MCNC
benchmarks (as PD(1)), each paired with alu4 (as PD(2)). After the minimum channel
width is found, all the channels are widened by a factor of 1.2 for the final
routing. We use the 45nm process architecture [25] and set NSR−size to 16 (4*4).
Because the mixed SB needs fewer channels than the symmetric one, the routing area
is reduced by 6.4% on average, as shown in Table 5.1. The CPU time needed to route
wires in the mixed SB architecture is larger than in the architecture whose SBs are
all the same. However, the 5.4% CPU time overhead can be ignored, because the
routing CPU time has no relationship with the performance of the chip. The critical
path delay of the mixed SB is 2.2% shorter than that of the symmetric SB.
We ran this test on a workstation with an AMD Opteron 2435 CPU running at
2.6 GHz and 4 GB of memory. The operating system is
Linux.
Table 5.1 Experimental results based on different SB styles
                  CPU time of routing (min)   Routing area (No. of Tran.)   Critical path delay (s)
M1        M2      Sym. SB    Mixed SB         Sym. SB       Mixed SB        Sym. SB     Mixed SB
alu4      alu4    174        186              2.05E+06      1.78E+06        3.55E-09    3.47E-09
apex2     alu4    292        280              2.92E+06      2.59E+06        3.95E-09    3.86E-09
apex4     alu4    143        153              2.34E+06      2.07E+06        3.99E-09    3.77E-09
bigkey    alu4    234        285              5.57E+06      5.42E+06        3.45E-09    3.53E-09
clma      alu4    997        1044             7.28E+06      6.78E+06        6.74E-09    6.48E-09
des       alu4    246        238              7.99E+06      7.99E+06        3.87E-09    3.91E-09
diffeq    alu4    167        195              1.93E+06      1.74E+06        4.34E-09    4.38E-09
dsip      alu4    262        283              5.74E+06      5.60E+06        3.68E-09    3.52E-09
elliptic  alu4    177        196              2.97E+06      2.72E+06        6.17E-09    5.69E-09
ex1010    alu4    2098       2134             7.13E+06      6.87E+06        4.74E-09    4.53E-09
ex5p      alu4    140        129              2.11E+06      1.87E+06        3.83E-09    3.58E-09
frisc     alu4    341        367              3.48E+06      3.08E+06        8.64E-09    8.32E-09
misex3    alu4    111        135              2.04E+06      1.94E+06        3.36E-09    3.76E-09
pdc       alu4    891        906              5.91E+06      5.33E+06        5.00E-09    4.98E-09
s298      alu4    108        153              1.92E+06      1.76E+06        5.31E-09    5.29E-09
s38417    alu4    203        241              3.67E+06      3.27E+06        4.48E-09    4.43E-09
s38584.1  alu4    190        210              3.62E+05      3.55E+05        3.75E-09    3.60E-09
seq       alu4    294        318              2.92E+06      2.83E+06        4.10E-09    4.00E-09
spla      alu4    352        369              3.66E+06      3.53E+06        5.67E-09    5.21E-09
tseng     alu4    169        175              1.88E+06      1.62E+06        4.46E-09    4.75E-09
average           379        400              3.7E+06       3.5E+06         4.7E-09     4.6E-09
percent           100%       105.4%           100%          93.6%           100%        97.8%
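As a worked check of the "average" and "percent" rows, the routing CPU time columns
of Table 5.1 can be recomputed with a few lines of Python. The sketch assumes that
the percent row is the ratio of the two column averages. This assumption is
consistent with the CPU time figures but is not stated in the text, and the variable
names are illustrative.

```python
# Routing CPU times (min) from Table 5.1, in row order (alu4 ... tseng).
cpu_sym = [174, 292, 143, 234, 997, 246, 167, 262, 177, 2098,
           140, 341, 111, 891, 108, 203, 190, 294, 352, 169]
cpu_mixed = [186, 280, 153, 285, 1044, 238, 195, 283, 196, 2134,
             129, 367, 135, 906, 153, 241, 210, 318, 369, 175]

avg_sym = sum(cpu_sym) / len(cpu_sym)
avg_mixed = sum(cpu_mixed) / len(cpu_mixed)

# Mixed SB CPU time relative to the symmetric SB, in percent (assumed definition).
percent = 100.0 * avg_mixed / avg_sym

print(f"average: {avg_sym:.0f} vs {avg_mixed:.0f} min, mixed/sym = {percent:.1f}%")
```

The result reproduces the averages of 379 and 400 minutes and the 105.4% entry, that
is, the 5.4% CPU time overhead of the mixed SB architecture.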
5.6 Conclusion
In this chapter, dynamic power gating for the coarse-grained FPGA architecture is
discussed. Based on a placement that can place the CLBs of different modules into
different regions, this chapter proposed a new routing method to support dynamic
power gating. By adding power domain information to the routing resource
graph, we prevent nets from being routed through SBs in different power domains. We
also presented a mixed Wilton switch box architecture that reduces the routing area
when the new routing method is used. The results show that the routing area can be
reduced by 6.4% with the mixed
Wilton SB compared with the symmetric SB. Consequently, after the placement [45, 47]
that targets effective power reduction on the dynamic power gating FPGA architecture,
our proposed routing method achieves both low power and area efficiency.
Chapter 6
Conclusions
This thesis focuses on low-power FPGA architecture design and the related
region-oriented P&R. The proposed FPGA can realize the low-power states of the user
circuit to reduce the power consumption.
First, the coarse-grained FPGA architecture and the PCHM design with power gating
and clock gating are discussed. The region size (NSR−size = 16) is determined
experimentally. To avoid congestion, we also change the width proportion
between the always-on routing channels and the dynamic-off routing channels. We then
introduce a new asymmetric SB for the different channel width proportions. It reduces
the congestion caused by routing wires only in the RCs that have the same PD
information or in the public RCs whose PD is PD(0). As a result, the routing area is
reduced by 6.4% on average, and the critical path delay of the mixed SB is 2.2%
shorter than that of the original FPGA architecture.
Second, both static and dynamic coarse-grained power gating and clock gating
are used based on the region-oriented placement method. By introducing a sleep
region cost, the simulated annealing placement algorithm packs the CLBs into the
minimum number of sleep regions for the static power-off architecture, whose blank
regions are powered off all the time. When the CBs and SBs in the unused SRs are
powered off simultaneously, the power consumption is reduced by up to 21.1%. This
consists of a 14.7% static power reduction and an 8.0% dynamic power reduction,
offset by the 1.6% power consumed by the low-power related logic. When the user's
circuit supports low-power design, for example when it contains sleep modules that
can enter sleep mode, the placement algorithm can separate CLBs with different power
domain information into different regions. Two MCNC benchmarks are used as two
different power domains. By gating the unused SRs, the power is reduced by 14.1%
while both modules are working. The power consumption is reduced by about 30.3% and
31.8% when the two power domains are powered off separately.
Finally, to handle placement with multiple modules, power domain information
is added to the routing resource graph. The power domain information of each routing
node is updated during the placement. The rr graph is reconstructed by connecting the
routing resources that have the same power domain information. VPR is extended into
SR-VPR, which finds the best trade-off between wire cost, timing cost and sleep
region cost.
Although the VPR architecture has some common ground with commercial FPGAs,
such as the products of Altera and Xilinx, it cannot be applied to them as it is. We
hope to carry out future research based on real FPGA chips so that FPGAs can be
used in more low-power applications.
Publication List
Journal Paper
1. C. Li, Y. Dong and T. Watanabe, “Region-oriented Placement Algorithm for
Coarse-grained Power-gating FPGA Architecture”, IEICE Trans. on Information and
Systems, Feb. 2012. (to be published)
2. C. Li, Y. Dong, and T. Watanabe, “Low Power Placement and Routing for the
Coarse-Grained Power Gating FPGA Architecture”, IEICE Trans. on Fundamentals
of Electronics Communications and Computer Sciences, Vol. E94-A, No. 12, pp.
2519-2527, Dec. 2011.
3. Y. Dong, C. Li, Z. Lin, and T. Watanabe, “A Hybrid Layer-multiplexing and
Pipeline Architecture for Efficient FPGA-based Multilayer Neural Network,”
IEICE NOLTA, IEICE Vol. E94-N, No. 10, pp. 522-532, Oct. 2011.
4. Y. Dong, C. Li, Z. Lin, H. Zhang, and T. Watanabe, “High Performance
Feedforward Neural Network Mapped by NoC Architecture with A New Routing Strategy
Implementation Method,” Journal of Signal Processing (JSP), Vol. 15, No. 3, pp.
113–122, Mar. 2011.
5. Y. Dong, C. Li, Z. Lin, and T. Watanabe, “Multiple Network-on-Chip Model for
High Performance Neural Network,” IEEK trans. Journal of Semiconductor
Technology and Science (JSTS), Vol. 10, No. 2, pp. 28–36, May 2010.
6. Y. Dong, C. Li, K. Kumai, Y. H. Li, Y. Wang, and T. Watanabe, “A new
flexible network on chip architecture for mapping complex feedforward neural network,”
Journal of Signal Processing (JSP), Vol. 13, No. 6, pp. 453–462, Nov. 2009.
Book
7. C. Li, T. Watanabe, Z. Wu, H. Li and Y. Huangfu, “The Real-time and Embedded
Soccer Robot Control System”, InTech, pp. 1-17, Jan. 2010.
Conference Paper with Review
8. Y. Dong, Y. Tang, C. Li and T. Watanabe, “Single Packet Multiple Destinations
Router of NoC for Hardware ANN”, International Conference on Intelligent
Computing and Intelligent Systems, pp. 43-46, Nov. 2011.
9. C. Li, Y. Dong, and T. Watanabe, “New Power-aware Placement for Region-based
FPGA Architecture combined with Dynamic Power Gating by PCHM,” in Proc.
ISLPED’11, pp. 223-228, Aug. 2011.
10. C. Li, Y. Dong, and T. Watanabe, “New Power-Efficient FPGA Design
Combining with Region-Constrained Placement and Multiple Power Domains,” in Proc.
NEWCAS’11, pp. 69-72, Jun. 2011.
11. Y. Dong, C. Li, H. Liu, and T. Watanabe, “A High Performance Digital Neural
Processor Design by Network on Chip Architecture,” in Proc. VLSI-DAT’11, pp.
243–246, Apr. 2011.
12. C. Li, Y. Dong, and T. Watanabe, “A Novel Low Power FPGA Architecture,” in
Proc. FIT’10, pp. 65–68, Sep. 2010.
13. C. Li, Y. Jiang, Z. Wu and T. Watanabe, “A Multiprocessor System for a Small Size
Soccer Robot Control System”, 4th IEEE International Symposium on Electronic
Design, Test and Applications, pp. 115-118, Jan. 2008.
Bibliography
[1] Altera, Stratix IV Device Handbook. Altera Corp., February 2011.
http://www.altera.com/literature/hb/stratix-iv/stratix4_handbook.pdf.
[2] Altera, Cyclone IV Device Handbook. Altera Corp., Dec. 2011.
[3] Xilinx, 7 Series FPGAs Configurable Logic Block User Guide. Xilinx Inc., Sept.
2011.
[4] Xilinx, Virtex-5 User Guide. Xilinx Inc., May 2010.
http://www.xilinx.com/support/documentation/user_guides/ug190.pdf.
[5] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W.M. Fang, and J. Rose,
“VPR 5.0: FPGA cad and architecture exploration tools with single-driver
routing, heterogeneity and process scaling,” Proceeding of the ACM/SIGDA
international symposium on Field programmable gate arrays, FPGA ’09, New York,
NY, USA, pp.133–142, ACM, 2009.
[6] A. Bsoul and S. Wilton, “An FPGA Architecture Supporting Dynamically
Controlled Power Gating,” Field-Programmable Technology (FPT), 2010
International Conference on, pp.1–8, Dec. 2010.
[7] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron
FPGAs, Kluwer Academic Publishers, Norwell, MA, USA, 1999.
[8] G. Moore, “Cramming More Components Onto Integrated Circuits,”
Proceedings of the IEEE, vol.86, no.1, pp.82–85, Jan. 1998.
[9] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power
Methodology Manual: For System-on-Chip Design, Springer Publishing Company,
Incorporated, 2007.
[10] R.S. Cha, “Intel’s Breakthrough in High-K Gate Dielectric Drives Moore’s Law
Well into the Future,” Jan. 2004.
[11] B.H. Calhoun, F.A. Honore, and A. Chandrakasan, “Design methodology for
fine-grained leakage control in MTCMOS,” ISLPED’03, New York, NY, USA,
pp.104–109, ACM, 2003.
[12] K. Usami and M. Horowitz, “Clustered Voltage Scaling Technique for Low-power
Design,” Proceedings of the 1995 international symposium on Low power design,
ISLPED ’95, New York, NY, USA, pp.3–8, ACM, 1995.
[13] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and
S. Kulkarni, “Pushing ASIC performance in a power envelope,” Proceedings of
the 40th annual Design Automation Conference, DAC ’03, New York, NY, USA,
pp.788–793, ACM, 2003.
[14] M. Takahashi, M. Hamada, T. Nishikawa, H. Arakida, Y. Tsuboi, T. Fujita,
F. Hatori, S. Mita, K. Suzuki, A. Chiba, T. Terazawa, F. Sano, Y. Watanabe,
H. Momose, K. Usami, M. Igarashi, T. Ishikawa, M. Kanazawa, T. Kuroda,
and T. Furuyama, “A 60 mW MPEG4 video codec using clustered voltage
scaling with variable supply-voltage scheme,” Solid-State Circuits Conference, 1998.
Digest of Technical Papers. 1998 IEEE International, pp.36–37, Feb. 1998.
[15] K. Hiroshi, Z. Gang, L. Seongsoo, S. Youngsoo, and S. Takayasu, “A Controller
LSI for Realizing VDD-Hopping Scheme with Off-the-Shelf Processors and Its
Application to MPEG4 System,” IEICE TRANS. ELECTRON., pp.263–271,
2002.
[16] F. Ishihara, F. Sheikh, and B. Nikolić, “Level conversion for dual-supply
systems,” Proceedings of the 2003 international symposium on Low power electronics
and design, ISLPED ’03, New York, NY, USA, pp.164–167, ACM, 2003.
[17] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, and T. Tuan,
“A Dual-VDD Low Power FPGA Architecture,” In Proceedings of International
Conference on Field Programmable Logic and Applications, pp.145–157, 2004.
[18] F. Li, Y. Lin, L. He, and J. Cong, “Low-power FPGA using pre-defined dual-
Vdd/dual-Vt fabrics,” Proceedings of the 2004 ACM/SIGDA 12th international
symposium on Field programmable gate arrays, FPGA ’04, New York, NY, USA,
pp.42–50, ACM, 2004.
[19] F. Li, Y. Lin, and L. He, “Field Programmability of Supply Voltages for FPGA
Power Reduction,” Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, vol.26, no.4, pp.752–764, April 2007.
[20] F. Li, Y. Lin, and L. He, “FPGA power reduction using configurable dual-Vdd,”
Proceedings of the 41st annual Design Automation Conference, DAC ’04, New
York, NY, USA, pp.735–740, ACM, 2004.
[21] Altera, Altera Product Catalog. Altera Corp., 2011.
[22] V. Betz and J. Rose, “VPR: A New Packing, Placement, and Routing Tool for
FPGA Research,” August 1997.
[23] F. Li, D. Chen, L. He, and J. Cong, “Architecture evaluation for power-efficient
FPGAs,” Proceedings of the 2003 ACM/SIGDA eleventh international
symposium on Field programmable gate arrays, FPGA ’03, New York, NY, USA,
pp.175–184, ACM, 2003.
[24] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and single-driver wires in
FPGA interconnect,” Field-Programmable Technology, 2004. Proceedings. 2004
IEEE International Conference on, pp.41–48, Dec. 2004.
[25] intelligent FPGA Architecture Repository.
http://www.eecg.utoronto.ca/vpr/architectures.
[26] J. Rose and S. Brown, “Flexibility of interconnection structures for field-
programmable gate arrays,” Solid-State Circuits, IEEE Journal of, vol.26, no.3,
pp.277–282, Mar. 1991.
[27] T. Tuan and B. Lai, “Leakage power analysis of a 90nm FPGA,” Custom
Integrated Circuits Conference, 2003. Proceedings of the IEEE 2003, pp.57–60,
Sept. 2003.
[28] J.H. Anderson, F.N. Najm, and T. Tuan, “Active leakage power optimization for
FPGAs,” Proceedings of the 2004 ACM/SIGDA 12th international symposium
on Field programmable gate arrays, FPGA ’04, New York, NY, USA, pp.33–41,
ACM, 2004.
[29] S. Ishihara, M. Hariyama, and M. Kameyama, “A Low-power FPGA Based on
Autonomous Fine-grain Power-gating,” Proceedings of the 2009 Asia and South
Pacific Design Automation Conference, ASP-DAC ’09, Piscataway, NJ, USA,
pp.119–120, IEEE Press, 2009.
[30] C.Q. Tran, H. Kawaguchi, and T. Sakurai, “95% Leakage-Reduced FPGA using
Zigzag Power-gating, Dual-VTH/VDD and Micro-VDD-Hopping,” Asian Solid-
State Circuits Conference, 2005, pp.149–152, Nov. 2005.
[31] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, and T. Tuan,
“Reducing Leakage Energy in FPGAs Using Region-Constrained Placement,” in
Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, pp.51–58, 2004.
[32] C.E. Cheng, “RISA: Accurate And Efficient Placement Routability
Modeling,” Computer-Aided Design, 1994., IEEE/ACM International Conference on,
pp.690–695, Nov. 1994.
[33] A. Marquardt, V. Betz, and J. Rose, “Timing-driven Placement for FPGAs,”
Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field
programmable gate arrays, New York, NY, USA, pp.203–213, ACM, 2000.
[34] J. Lamoureux and S. Wilton, “Clock-Aware Placement for FPGAs,” Field
Programmable Logic and Applications, 2007. FPL 2007. International Conference
on, pp.124–131, Aug. 2007.
[35] S. Yang, “Logic Synthesis and Optimization Benchmarks User Guide Version
3.0,” 1991.
[36] P. Jamieson and J. Rose, “A Verilog RTL synthesis tool for heterogeneous
FPGAs,” Field Programmable Logic and Applications, 2005. International
Conference on, pp.305–310, Aug. 2005.
[37] B.L. Synthesis and V. Group, ABC: A System for Sequential Synthesis and
Verification, 2007. http://www.eecs.berkeley.edu/~alanmi/abc/abc.htm/.
[38] K.K.W. Poon, S.J.E. Wilton, and A. Yan, “A Detailed Power Model for Field-
Programmable Gate Arrays,” ACM Trans. Des. Autom. Electron. Syst., vol.10,
pp.279–302, April 2005.
[39] U.o.C. Berkeley, Berkeley Logic Interchange Format, Feb. 2005.
http://www.cs.uic.edu/~jlillis/courses/cs594/spring05/blif.pdf.
[40] P. Jamieson, W. Luk, S. Wilton, and G. Constantinides, “An energy and
power consumption analysis of FPGA routing architectures,” Field-Programmable
Technology, 2009. FPT 2009. International Conference on, pp.324–327, Dec.
2009.
[41] V. Betz, T. Campbell, W. Fang, I. Kuon, J. Luu, A. Marquardt, J. Rose, and
A. Ye, VPR and T-VPACK User’s Manual, July 2009.
http://www.eecg.utoronto.ca/vpr/VPR_5.pdf.
[42] S. Huang, “45nm Stable SRAM Structure for Ultra Low Leakage Power,”
Master’s thesis, IPS, Waseda University, Fukuoka, Japan, 2008.
[43] http://ptm.asu.edu.
[44] C. Li, Y.P. Dong, and T. Watanabe, “A Novel Low Power FPGA Architecture,”
FIT2010 of IPSJ, pp.65–68, Sep. 2010.
[45] C. Li, Y.P. Dong, and T. Watanabe, “New Power-Efficient FPGA Design
Combining with Region-Constrained Placement and Multiple Power Domains,” 9th
IEEE International NEWCAS Conference, pp.69–72, Jun. 2011.
[46] T. Tuan, A. Rahman, S. Das, S. Trimberger, and S. Kao, “A 90-nm Low-Power
FPGA for Battery-Powered Applications,” Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, vol.26, no.2, pp.296–300, Feb. 2007.
[47] C. Li, Y.P. Dong, and T. Watanabe, “New Power-aware Placement for Region-
based FPGA Architecture Combined with Dynamic Power Gating by PCHM,”
ISLPED 2011, Aug. 2011.
[48] S.J.E. Wilton, Architectures and Algorithms for Field-Programmable Gate
Arrays with Embedded Memory, Ph.D. thesis, University of Toronto, 1997.
[49] H. Arslan and S. Dutt, “ROAD: an order-impervious optimal detailed router for
FPGAs,” Computer Design, 2003. Proceedings. 21st International Conference
on, pp.350–356, Oct. 2003.
[50] G.J. Nam, K. Sakallah, and R. Rutenbar, “A new FPGA detailed routing
approach via search-based Boolean satisfiability,” Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on, vol.21, no.6, pp.674–684,
Jun. 2002.
[51] D.F.G. Prado, “Tutorial on FPGA Routing.” UNMSM Magazine, 2006.
[52] C. Ebeling, L. McMurchie, S. Hauck, and S. Burns, “Placement and Routing
Tools for the Triptych FPGA,” Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, vol.3, no.4, pp.473–482, Dec. 1995.
[53] M.I. Masud and S.J.E. Wilton, “A New Switch Block for Segmented FPGAs,”
Proceedings of the 9th International Workshop on Field-Programmable Logic
and Applications, London, UK, pp.274–281, Springer-Verlag, 1999.
