Architecture study of a 3D CMOS-NEM FPGA by Li, Chong
© 2013 Chong Li
ARCHITECTURE STUDY OF A 3D CMOS-NEM FPGA
BY
CHONG LI
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2013
Urbana, Illinois
Adviser:
Associate Professor Deming Chen
ABSTRACT
In this paper, we introduce a reconfigurable architecture, named 3D CMOS-
NEM FPGA, which utilizes nanoelectromechanical (NEM) relays and 3D
integration techniques synergistically. Unique features of our architecture
include: hybrid CMOS-NEM FPGA look-up tables (LUTs) and configurable
logic blocks (CLBs), NEM-based switch blocks (SBs) and connection blocks
(CBs), and face-to-face 3D stacking. This architecture also has a built-in
feature named direct links, which are dedicated local communication channels
that use the short vertical wires between the two stacks to further enhance
performance. A customized 3D FPGA placement and routing flow
and a customized cycle-accurate mixed-level power/thermal simulator have
been developed. It is shown that 3D stacking together with NEM devices
achieves a 33.11% delay reduction, 29.19% power reduction, and 78.23% foot-
print reduction over the baseline simultaneously, with negligible temperature
penalty.
To my wife and my parents, for their love and support
ACKNOWLEDGMENTS
First and foremost I would like to express my gratitude to Prof. Deming
Chen for his kind guidance, without which this thesis would not have been
possible.
I thank Prof. Martin Wong for teaching those intriguing courses that led
me into VLSI CAD research. Special thanks to Prof. Jont Allen for recruiting
me to the University of Illinois in the first place.
My sincere thanks go to my fellow lab mates at the University of Illinois
for stimulating discussions.
I would like to thank my wife and my parents who have always been
extremely supportive.
This work is partially supported by NSF CCF 07-46608 and DARPA
(NBCH 1090002).
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 RELATED WORK
2.1 3D FPGA and Emerging Nano-devices in FPGA
2.2 Power and Thermal Analysis of FPGA
CHAPTER 3 NEM RELAYS
CHAPTER 4 NEM-BASED FPGA ARCHITECTURE
4.1 NEM Relays as LUT Memory Elements
4.2 NEM Relay as FPGA Routing Switch
4.3 Face-to-Face Stacking and Via Density
4.4 3D Switch Block
4.5 Direct Link
4.6 Area and Footprint Reduction
CHAPTER 5 CAD FLOW
5.1 3D Architecture Generation
5.2 3D Placement and Routing
CHAPTER 6 POWER AND THERMAL SIMULATION
6.1 Overview
6.2 Parameter Extraction
6.3 Power Estimation Framework
6.4 Thermal Characterization
CHAPTER 7 EXPERIMENTAL RESULTS AND DISCUSSION
7.1 Experimental Setup
7.2 Timing Performance
7.3 Power and Temperature Reduction
CHAPTER 8 CONCLUSION
REFERENCES
CHAPTER 1
INTRODUCTION
FPGA (field-programmable gate array) devices can lower the amortized
manufacturing cost per unit and dramatically improve the design produc-
tivity through re-use of the same silicon implementation for a wide range of
applications. The major performance bottleneck of the FPGA is its programmable
interconnects and routing elements, which account for up to 80%
of the total delay, 80% of the power consumption, and 70% of the die area [1], [2].
One recognized solution to reduce interconnect delay and power penalty is
to move to a three-dimensional (3D) architecture, where layers of logic are
stacked on top of each other instead of being spread across a 2D plane. 3D
integration optimizes the interconnect network vertically. Both delay and
power will be reduced due to the reduction in wire resistance and capaci-
tance.
In deep sub-micron technology, leakage power has become the dominant
power dissipation mechanism [3]. Leakage power of routing switches is con-
sidered to be one of the major challenges for FPGAs. Nanoelectromechani-
cal (NEM) relays [4], which are electro-statically-actuated switches with zero
leakage at off-state and low resistance at on-state, show promising electrical
characteristics and offer the potential to overcome the leakage challenges.
Chen et al. [5] utilized NEM relays to replace routing switches and routing
SRAMs in traditional two-dimensional complementary metal-oxide-semiconductor
(CMOS) FPGAs (2D CMOS FPGAs). In addition, by stacking NEM
relays on top of the CMOS [6] (see footnote 1), reference [5] reported promising
results in delay, power, and footprint reductions.
This thesis is a collaborative work. Part of this thesis has been published in Dong et
al., “Architecture and performance evaluation of 3D CMOS-NEM FPGA,” 2011 13th
International Workshop on System Level Interconnect Prediction (SLIP), 2011, pp. 1-8.
Footnote 1: NEM relays can be encapsulated into metal layers and do not occupy extra
device area.
Realizing that both 3D stacking and NEM relays can be used to optimize the
FPGA architecture, in this paper we explore the synergy between these
two technologies and evaluate the combined effect of both technologies. We
present a 3D hybrid CMOS-NEM FPGA architecture. As proposed in [5],
NEM switches are integrated into metal layers and overlaid on top of a CMOS
device layer to save footprint. Using such a technology, we designed a new
NEM-based look-up table (LUT) cell, which uses NEM relays as its pro-
grammable SRAM cells. Our LUT design offers reduction in LUT area,
power, and delay. In addition, a 3D face-to-face bonding process [7], [8]
has been applied in this study to optimize the interconnect vertically. Fur-
thermore, to maximize the performance gain of 3D stacking, dedicated di-
rect links are inserted between vertical neighboring configurable logic blocks
(CLBs). These direct links connect CLBs without programming switches,
thus enabling fast layer-to-layer transportation. To evaluate the benefit of
this new architecture, a 3D placement and routing flow has been developed
based on the state-of-the-art FPGA placement and routing tool VPR 5.0 [9]. The
placement and routing algorithms in VPR are tuned and enhanced.
Vertically stacked devices cause a rapid increase in power density and tem-
perature [10]. From a device perspective, higher temperature causes lower
IDSAT and higher leakage current in MOS devices, both of which are unde-
sirable [11]. Temperature gradient across the die results in mechanical stress
on the device. Such mechanical stress can lead to various temperature-dependent
failures [12]. To accurately characterize the thermal profile of the chip,
we enhanced fpgaEVA-LP2 [13], which is a customized cycle-accurate FPGA
power simulator under the VPR 4.3 framework for 2D FPGAs. The enhanced
power estimation framework is realized on top of VPR 5.0, and is named
fpgaEVA-3D-LP. fpgaEVA-3D-LP supports power simulation of 3D FPGAs.
It is flexible enough to explore various FPGA architectures. By utilizing
the layout information from our 3D placement and routing tool, it is pos-
sible to accurately calculate power consumption in each tile of the FPGA.
Hotspot [14], which is a popular thermal simulator in the community with
good fidelity, is used to generate the thermal profile of the FPGA.
This paper is organized as follows: Chapter 2 reviews related work in 3D
FPGA, NEM relays and thermal issues of FPGAs. Chapter 3 introduces
the principle of operation and advantages of the NEM device. In Chapter
4, NEM-based LUT and routing switch designs are discussed. The overall
description of 3D CMOS-NEM FPGA architecture is presented. Chapter 5
discusses our CAD flow. In Chapter 6, details about our power and ther-
mal simulation are given. In Chapter 7, we compare our 3D CMOS-NEM
architecture to 2D CMOS FPGA and 3D CMOS FPGA in terms of delay,
footprint, power consumption, and temperature. Chapter 8 concludes this
paper.
CHAPTER 2
RELATED WORK
2.1 3D FPGA and Emerging Nano-devices in FPGA
With advances in 3D packing technology [15], [16], the 3D FPGA has become
an increasingly promising solution to address challenges in deep sub-micron
FPGA design. Lin et al. [17] studied performance improvements in mono-
lithically stacked 3D-FPGA in terms of logic density, critical path delay, and
power consumption. Ababei et al. [18] proposed a partitioning-based placement
algorithm for 3D FPGAs. Their placement algorithm provides results comparable
to those of VPR’s simulated annealing-based algorithm, with considerably
shorter run time. Chandrasekhar [19] proposed a dual-interconnect
architecture for a 3D FPGA which has parasitic capacitance comparable to
2D FPGAs. On the fabrication front, Naito et al. [20] reported the world’s
first monolithic 3D FPGA. All of the aforementioned works target CMOS
FPGAs.
A number of works focused on applying emerging nano-devices in FPGAs.
Wang et al. [21] reported an architecture in which CLBs are built using silicon
carbide NEM switches. This architecture is able to operate at a very high
temperature since the operation of mechanical NEM switches is insensitive to
ambient temperature. However, a major disadvantage of building CLBs using
NEM devices lies in their large mechanical switching delay (around 5 ns),
which would significantly degrade timing performance of the FPGA. To avoid
such performance degradation, some researchers take a hybrid CMOS-NEM
approach in which routing switches and/or SRAMs are replaced with nano-devices,
while the CLBs remain CMOS-based. Dong et al. [22] reported a
hybrid CMOS-NEM architecture that incorporates carbon nanotube bundles
and nanowire crossbars into the CMOS fabrication process. Chong et al.
[23] presented a hybrid NEM/CMOS SRAM cell, in which the two pull-
down transistors of a conventional CMOS six transistor (6T) SRAM cell are
replaced with NEM relays. Such a hybrid SRAM cell is applied to FPGA
routing switches to evaluate performance benefits in terms of leakage power
and delay [24].
2.2 Power and Thermal Analysis of FPGA
There has been intensive research on power modeling and thermal-aware design
techniques since the advent of 3D integrated circuits (ICs). Lau et al. [25]
studied the effects of thermal through-silicon vias (TSVs) based on heat-transfer
CFD (computational fluid dynamics) analyses. Kim et al. [26] proposed a
cooling scheme using interlayer microfluidic channels. There is a rich lit-
erature on the physical design front as well. Cong et al. [27] reported a
thermal-driven placement algorithm. This algorithm first does placement
in 2D, then transforms 2D placement into 3D, followed by refinement steps.
Chu and Wong [28] suggested a placement algorithm based on matrix synthe-
sis techniques. Cong et al. [29] proposed a 3D routing algorithm that reduces
maximum temperature on the chip with a reasonable wire-length penalty.
Leading FPGA vendors have long been interested in power and thermal is-
sues in FPGA. Some commercial FPGA architectures have built-in hardware
to monitor temperature of the chip [30]. Boemo and Lopez-Buedo [31] pro-
posed using a ring-oscillator as a thermal monitor and successfully demon-
strated utilizing a Xilinx OSC4 cell as the thermal transducer. Franco et
al. [32] reported an experiment on ring oscillator-based temperature sensors
for low-voltage Virtex series FPGAs. Those hardware-based techniques are
highly effective in monitoring the operating temperature of the FPGA. However,
they offer designers little insight into designing power- and
thermal-aware FPGAs.
In an effort to carry out pre-layout power and thermal estimation on the de-
sign level, researchers proposed flexible power and thermal estimation tools
for FPGAs. Anderson and Najm [33] proposed techniques to predict net
activity and interconnect capacitance of FPGAs. Those techniques are im-
portant for accurate dynamic power estimation. Ho et al. [34] reported a
dynamic power estimation flow. Poon et al. [35] proposed a detailed power
estimation flow which considers both dynamic and static power based on the
VPR framework. Li et al. [13] proposed an FPGA power estimation framework
that supports programmable voltage sources and power gating. This
cycle-accurate power simulator is able to capture glitch power, which is not
usually characterized in other FPGA power simulators. Compared to SPICE
simulation, the framework is verified to be of high accuracy.
Tools [14], [36], [37] that are developed for IC thermal analysis can be used
to estimate the thermal profile of FPGAs. These tools allow users to specify
physical parameters of the chip, including die size, heat spreader size, and
thermal conductivity/capacity of die material. Such flexibility is important
for accurate simulation.
On the physical design front, some researchers studied thermal-aware place-
ment for both homogeneous FPGAs and platform FPGAs. Platform FPGAs
are FPGAs whose functional block consists of not only configurable logic
blocks (CLB), but also embedded circuit blocks, such as multipliers, DSP
modules, and digital clock managers (DCM). Sundararajan et al. [38] studied
placement of embedded circuit blocks on platform FPGAs, showing temper-
ature variation as high as 20◦C on the die. In contrast to platform FPGAs,
the homogeneous FPGA’s logic part consists of CLBs exclusively. Siozios
and Soudris [39] proposed a temperature-aware placement algorithm that at-
tempts to evenly spread CLBs with higher switching activity across the die.
In [39], thermal simulation is based on a thermal-resistive network.
Thermal-resistive network-based methods are computationally efficient, which is
crucial for simulated annealing-based placement algorithms. However, such
methods sacrifice accuracy. It is shown in [38] and [40] that in homoge-
neous FPGA architecture, temperature across different parts of the chip is
consistent with a small temperature gradient, suggesting that thermal-aware
placement for homogeneous FPGAs may not be useful.
For our architecture study, we explore a homogeneous FPGA. Although
hotspots are hard to observe for such an FPGA, due to the large footprint
reduction through 3D stacking and NEM encapsulation, it is still desirable to
evaluate the steady-state thermal profile of our new 3D CMOS-NEM archi-
tecture. Therefore, we constructed a CAD flow that accurately estimates
power consumption in each tile of the FPGA, then we used Hotspot 5.0 to
precisely characterize the thermal profile of the die. Studying power, delay,
and thermal profiles of platform FPGAs requires new CAD simulation flows,
and will be our future work.
CHAPTER 3
NEM RELAYS
NEM relays are electrostatically-actuated switches that have zero off-state
leakage and are promising to achieve relatively low on-state resistance com-
pared to CMOS pass transistors. Figure 3.1 (a) shows the structure of a
three-terminal (3T) NEM relay, which consists of: (1) a deflecting beam
(connected to the source electrode), which forms the channel for current
flow; (2) a gate electrode with a gap to the beam which can control the state
of the switch through electrostatic force; and (3) a drain electrode, which
connects to the beam when the NEM-relay is in its on-state [5].
Figure 3.1: (a) Structure of a 3-terminal (3T) NEM relay and its IDS − VGS
curve; (b) measured I-V characteristics of a fabricated NEM relay, showing
zero leakage in the off state and around 2000 Ω on-resistance.
When gate voltage (VGS) is applied, electrostatic force attracts the beam
towards the gate. At pull-in voltage (Vpi), the elastic force of the beam can no
longer balance the electrostatic force, and the beam collapses toward the gate
until contact is made at the drain. Since pull-in is achieved through elec-
tromechanical instability, the voltage at which the beam disconnects from
the drain (pull-out voltage, Vpo) is smaller than Vpi. This leads to hysteresis
in the current-voltage characteristics of NEM relays (Figure 3.1(a)). Figure
3.1(b) shows the I-V characteristics of a fabricated 3T NEM relay, where zero
leakage in the off-state is confirmed, and an on-resistance of 2000 Ω is demon-
strated [4]. All structural materials to fabricate NEM relays can potentially
be typical materials in the standard CMOS back-end-of-line (BEOL) process.
Due to low processing temperatures of these materials, it is promising that
the fabrication of NEM relays could be compatible with the CMOS BEOL
process. Encapsulating NEM relays between metal layers after fabrication
enables monolithic 3D integration of NEM relays on top of the CMOS to
reduce area, as shown in Figure 3.2.
Figure 3.2: Encapsulated NEM relays between metal layers to enable
monolithic 3D integration with silicon CMOS.
CHAPTER 4
NEM-BASED FPGA ARCHITECTURE
4.1 NEM Relays as LUT Memory Elements
Hysteresis characteristics of NEM relays enable the use of NEM relays as
memory elements, which makes it possible to replace each CMOS SRAM cell
inside CMOS LUTs with two NEM relays. As shown in Figure 3.1(a), after
being pulled in by applying a VGS greater than Vpi, a NEM relay stays in the
pull-in (closed) state as long as VGS remains inside the hysteresis window
(Vpo < VGS < Vpi). However, if a NEM relay has not been pulled in, applying a
VGS inside the hysteresis window leaves the relay in the
pull-out (open) state. As NEM relays have zero leakage in off-state, and
can be placed on top of the CMOS, replacing CMOS SRAM cells with NEM
relays will help reduce LUT leakage and reduce LUT layout area.
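The hold behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `NemRelay` and the pull-out value of 0.3 V are hypothetical (the text only states that Vpo is smaller than Vpi), while the pull-in value of about 0.8 V is the figure quoted later in Section 4.6.

```python
from dataclasses import dataclass


@dataclass
class NemRelay:
    """Behavioral sketch of a 3T NEM relay with pull-in/pull-out hysteresis."""
    v_pi: float = 0.8     # pull-in voltage in volts (about 0.8 V, Section 4.6)
    v_po: float = 0.3     # pull-out voltage in volts (hypothetical value)
    closed: bool = False  # a relay starts in the pull-out (open) state

    def apply(self, v_gs: float) -> bool:
        """Apply a gate-to-source voltage and return True if the relay is closed."""
        if v_gs >= self.v_pi:
            self.closed = True    # beam collapses and contacts the drain
        elif v_gs <= self.v_po:
            self.closed = False   # beam releases from the drain
        # Inside the hysteresis window the previous state is retained.
        return self.closed


relay = NemRelay()
hold_voltage = 0.5                 # inside the window: v_po < 0.5 < v_pi
print(relay.apply(hold_voltage))   # relay was never pulled in, so it stays open
print(relay.apply(0.9))            # above v_pi, the relay closes
print(relay.apply(hold_voltage))   # the hold voltage keeps it closed
print(relay.apply(0.0))            # below v_po, the relay opens again
```

The same hold voltage yields two different states depending on history, which is the property that lets a relay act as a memory element.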
In CMOS SRAM-based FPGAs, look-up tables (LUTs), each consisting of
CMOS SRAM cells and an NMOS pass transistor-based multiplexer (Figure
4.1(a)), are used to provide programmable logic functions. Inside each LUT,
pre-programmed SRAM cells provide corresponding values to the output,
which could be either logic high (Vdd) or logic low (Gnd). Although each
NEM relay has two stable states, i.e., open or closed, an NEM relay in the open
state cannot generate a specific output voltage. Therefore, we propose a
new memory cell design in this work. In order to provide both Vdd and Gnd
outputs, two NEM relays are needed to replace one CMOS SRAM cell, as
shown in Figure 4.2(b). For convenience, we call this design an NEM memory
cell. In this NEM memory cell, only one NEM relay will be programmed to
the closed state, connecting either Vdd or Gnd to the output. Each NEM
relay can be programmed individually through the half-select programming
scheme, as described in [5].
Figure 4.1: (a) Traditional CMOS SRAM-based 4-input LUT; (b)
CMOS-NEM 4-LUT, where NEM memory elements are stacked on top of
the CMOS.
Figure 4.2: (a) CMOS 6-transistor SRAM cell used in CMOS SRAM-based
LUT; (b) NEM memory cell which can be used to replace one CMOS
SRAM cell in LUT.
Figure 4.2 shows the idea of replacing CMOS SRAM cells in CMOS LUTs
with NEM memory cells. For convenience, we call the hybrid LUT a CMOS-
NEM LUT. In this new type of LUT, pre-configured NEM memory cells are
used to store corresponding logic values; an NMOS pass transistor-based mul-
tiplexer is used to select the desired output based on input values. Stacking
NEM relays on top of the CMOS, the NEM-based LUT achieves a 53.1%
reduction in the LUT layout area. In the meantime, a 55% leakage reduction
and a 9.3% delay reduction are achieved due to zero leakage of the off state
and low on-resistance of the NEM relay.
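A functional sketch of the CMOS-NEM LUT may help make the structure concrete. It models only logic behavior, not timing or the half-select programming scheme, and all class and attribute names (`NemMemoryCell`, `CmosNemLut`, and so on) are hypothetical illustrations rather than names from this work. Each memory cell holds two relays of which exactly one is closed, and a multiplexer selects one cell output based on the LUT inputs.

```python
VDD, GND = 1, 0  # logic high and logic low


class NemMemoryCell:
    """Two NEM relays; exactly one is closed, tying the output to Vdd or Gnd."""

    def __init__(self, stored_value: int):
        self.vdd_relay_closed = bool(stored_value)
        self.gnd_relay_closed = not self.vdd_relay_closed

    def output(self) -> int:
        return VDD if self.vdd_relay_closed else GND


class CmosNemLut:
    """K-input LUT: 2**K NEM memory cells read through a multiplexer."""

    def __init__(self, truth_table):
        n = len(truth_table)
        if n < 2 or n & (n - 1):
            raise ValueError("truth table length must be a power of two")
        self.k = n.bit_length() - 1
        self.cells = [NemMemoryCell(bit) for bit in truth_table]

    def evaluate(self, inputs) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        # The inputs form the multiplexer select word (inputs[0] is the LSB).
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return self.cells[index].output()


# A 4-input AND: only the last truth-table entry stores logic high.
and4 = CmosNemLut([0] * 15 + [1])
print(and4.evaluate([1, 1, 1, 1]), and4.evaluate([1, 0, 1, 1]))
```

A 4-LUT built this way uses 16 memory cells, that is, 32 NEM relays in place of 16 SRAM cells.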
4.2 NEM Relay as FPGA Routing Switch
Traditional CMOS SRAM-based FPGA uses SRAM-controlled NMOS pass
transistors to implement the programmable routing switch. As described
in [5], both the controlling SRAM cell and the NMOS pass transistor can
be replaced at the same time using just a single NEM relay. In this work,
we used the same scheme as [5] for CB (connection block) and SB (switch
block) designs. Unlike NEM memory cells, only one NEM relay is needed to
replace one NMOS pass transistor and the corresponding controlling SRAM
cell. Chen et al. [5] also reported using NEM relays to design MUXs. The NEM
relay will be programmed using a half-select programming scheme [5].
Selective buffer removal was discussed in [41]. With NEM relays as routing
switches, the input and output buffers of the CLB can be removed and the buffers
on the routing track can be down-sized without affecting performance of the
circuit.
4.3 Face-to-Face Stacking and Via Density
The 3D face-to-face CMOS-NEM FPGA adopts the traditional island-style FPGA
architecture. As shown in Figure 4.3, each 3D layer contains a fabric of repeated
tiles where each tile consists of one switch block (SB), two connection
blocks (CBs), and one configurable logic block (CLB), which contains a group
of LUTs.
Figure 4.3: (a) CMOS SRAM and corresponding NEM switch; (b) NEM
relay-based FPGA connection block (CB) and switch block (SB).
Figure 4.4: Two-layer face-to-face stacking.
In this work, the face-to-face bonding process as introduced in [7] is adopted
to fabricate the 3D NEM FPGA. During face-to-face bonding, metallization
layers are joined, and the size of the connecting vias is determined by the
accuracy of the layer alignment technique used. Since these vias are not
through-silicon vias (TSVs), their feature sizes can be smaller.
Figure 4.4 demonstrates the concept of such a face-to-face bonding solution
for our study. The top and bottom CMOS device layers contain addressing
circuits, flip-flops, and buffers and multiplexers in LUTs. SRAM cells, SBs,
and CBs are implemented by NEM switches and encapsulated within the
metal layers as shown in Figure 3.2 and Figure 4.1(b). Vertical connections
have been added among SBs as well as CBs between the two layers through
face-to-face bonding. Details will be described in the following sections. Compared
to TSVs used in face-to-back (or back-to-back) bonding and multilayer
stacking [42], face-to-face bonding enables high via density [43], [44].
In this study, a 3D via can be 0.75 µm × 0.75 µm with a pitch of 1.5 µm
at 22 nm technology node [43], [44]. This high 3D via density enables great
layer-to-layer communication bandwidth in the 3D design. Two-layer face-
to-face bonding is also relatively easier to fabricate than the multi-layer 3D
stacking case. Therefore, we limit our study to a two-layer 3D architecture
design with a novel combination of NEM relay and CMOS for higher logic
density and performance.
The density of 3D vias inserted through the bonding layer is determined
by the bonding layer area and the 3D via pitch. The tile area is equal to the CLB
area in our 3D layout, as discussed in Section 4.6. However, additional area
is required to insert the 3D via array. In this study, a 5 × 5 via array is used
for a tile. Each via occupies an area of 64 λ × 64 λ in the 22 nm technology
based on ITRS 2009. The total tile area is the sum of the logic area and the via
area, which is 2200 λ × 2200 λ. Figure 4.5 shows the conceptual layout of
the 5 × 5 via array within the tile. It also shows 10 extra 3D vias used for
direct links for faster and dedicated layer-to-layer communication, which will
be discussed later.
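The λ-based dimensions above can be converted to physical units as a quick sanity check. The sketch below assumes λ = 11 nm, the value given for the 22 nm node in Section 4.6, and counts only the via footprint (64 λ × 64 λ each), not the spacing implied by the via pitch. The variable names are illustrative.

```python
LAMBDA_NM = 11.0          # lambda at the 22 nm node (Section 4.6)
VIA_SIDE_LAMBDA = 64      # side of one 3D via
TILE_SIDE_LAMBDA = 2200   # side of the total tile (logic area plus via area)
ROUTING_VIAS = 5 * 5      # the 5 x 5 via array in each tile
DIRECT_LINK_VIAS = 10     # extra vias reserved for direct links


def lambda_to_um(length_in_lambda: float) -> float:
    """Convert a length in lambda units to micrometers."""
    return length_in_lambda * LAMBDA_NM / 1000.0


via_side_um = lambda_to_um(VIA_SIDE_LAMBDA)
tile_side_um = lambda_to_um(TILE_SIDE_LAMBDA)
total_vias = ROUTING_VIAS + DIRECT_LINK_VIAS

# Fraction of the tile occupied by via footprints alone (pitch spacing ignored).
via_area_fraction = total_vias * VIA_SIDE_LAMBDA**2 / TILE_SIDE_LAMBDA**2

print(f"via side:  {via_side_um:.3f} um")
print(f"tile side: {tile_side_um:.1f} um")
print(f"vias per tile: {total_vias}")
print(f"via footprint share of tile: {via_area_fraction:.2%}")
```

Under these assumptions a via side comes out near 0.7 µm, in line with the 0.75 µm × 0.75 µm via size quoted above.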
Figure 4.5: CLB area and 3D via density.
4.4 3D Switch Block
Figure 4.6 shows two vertically-stacked tiles and the SB and CB designs
sandwiched in between. Each CMOS layer has its own metal layers (upper
metal layers and lower metal layers in Figure 4.4). The top metal layers of
the two face-to-face stacks are connected through NEM 3D switch blocks
incorporating 3D vias. The 3D switch block is a MUX-based design. Each
wire in the routing channel is unidirectional and driven by a MUX. Inputs of
a driver MUX are coming from different channels of different directions. In
the 3D case, the MUX also contains inputs from the vertical direction. More
details on the 3D switch block will be introduced in Section 5.1.
Figure 4.6: 3D stacking with SB and CB.
In Figure 4.6, one output of CLB3 is connected to a switch point underneath.
This switch point can connect through a 3D via to reach the switch
block of CLB1. Note that the figure only shows the switch block of CLB3
on the upper layer and does not show the switch block of CLB1 on the lower
layer. By configuring the MUXs accordingly, the output signal can, for example,
be routed through a MUX on the lower layer associated with CLB1, reach the
connection block of CLB2, and then reach the CLB2 input MUX. Wires
in the upper metal layer and lower metal layer are drawn in black and brown
respectively. Routing on the same layer can be carried out in the same way
by configuring MUX connections. These MUXs can be implemented by NEM
relays and encapsulated within the metal layers so they do not occupy extra
footprint.
4.5 Direct Link
As observed in Figure 4.7(a), if two vertically stacked CLBs need to commu-
nicate with each other, a routing path would go through switch block MUX
and connection block MUX. Figure 4.7(b) shows the equivalent topology in
the 2D FPGA. Given the face-to-face bonding with short layer-to-layer in-
terconnect length, going through several MUXs is costly. This motivates us
to provide another architectural enhancement by including direct connec-
tions between two layers. Table 4.1 shows the delays of different Length-1
interconnects. The delay values are based on SPICE simulations at 22 nm
technology node using the predictive technology model (PTM) [45]. In this
study, all Length-1 wires are driven by the same 5x buffer, a typical buffer
size used in FPGA study. The wire propagation delay is measured from in-
put crossing 50% at the wire starting point to output crossing 50% at the
wire ending point. Compared to regular Length-1 interconnect delays which
consist of the wire delay and routing switch delay crossing one CLB, direct
links are much faster. There are two reasons. First, a direct link connects two
CLBs without routing switches in the SB. Second, a direct link is a dedicated
link which has much smaller wire load capacitance from CB input pins. 3D direct
links can provide the best performance in terms of RC delay due to the small
inter-layer distance.
As shown in Figure 4.7(c), a direct connection between an output of CLB1
and the CB of CLB2 is created. Figure 4.7(d) shows the equivalent topology
in the 2D FPGA. This connection bypasses the switch block and saves a
MUX delay as well as the wire RC load from the routing track. Some 2D
CMOS FPGAs (e.g., some Xilinx devices) have direct links among neighboring
CLBs. 3D direct links extend this concept to the third dimension. Since
these vertical wires are much shorter and faster than the 2D direct links (the
third and fifth columns in Table 4.1), they can enable our 3D placement
and routing tool to group and pack closely connected CLBs together in a
3D fashion to reduce the routing delay. To improve the utilization rate of
these direct links, each CLB is given direct links to 5 neighbors
in the other layer, as illustrated in Figure 4.8. The direct links are inserted
in a balanced way on four sides of each CLB. Figure 4.8 shows the case that
the CLB cluster size is 10 (10 LUTs in a CLB, thus 10 outputs). Two direct
links on each side of the upper CLB (CLB1) go to a bottom layer CLB that
is immediately adjacent to the corresponding side of CLB2. Two extra links
are inserted in between CLB1 and CLB2. Note that the figure only shows
top-down direct links. There are 10 bottom-up direct links from CLB2 to
the CLBs on the top layer as well. The overhead of direct links is the slight
increase of the size of the CLB input MUX (including inputs from its own
layer and direct link inputs from the other layer). For example, in an
architecture with channel width 100 and Fc = 0.5 (50% of the wires in the wire
channel are connected to a CLB input), a 50-to-1 MUX is required at each CLB
input pin. Inserting the 10 direct links shown in Figure 4.8 adds two or three
MUX inputs on each side of the CLB. This increases
the original MUX size (without direct links) from 50 to 52 or 53, respectively.
The propagation delay of the MUX itself will slightly increase; however, this
delay increase is very small compared to the delay reduction of direct link on
global interconnects.
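The MUX sizing arithmetic in this example can be written out as a short sketch. The function names are hypothetical, and the way the 10 direct links are split over the four sides (three on two sides, two on the other two) is an assumption made only to reproduce the "two or three" extra inputs stated above.

```python
def clb_input_mux_size(channel_width: int, fc_in: float,
                       direct_link_inputs: int = 0) -> int:
    """Inputs of a CLB input MUX: connected channel wires plus direct links."""
    return round(channel_width * fc_in) + direct_link_inputs


def direct_links_per_side(total_links: int = 10, sides: int = 4) -> list:
    """Spread the incoming direct links as evenly as possible over the sides."""
    base, extra = divmod(total_links, sides)
    return [base + 1 if side < extra else base for side in range(sides)]


baseline = clb_input_mux_size(100, 0.5)
with_links = [clb_input_mux_size(100, 0.5, n) for n in direct_links_per_side()]

print("MUX size without direct links:", baseline)
print("MUX sizes with direct links:  ", with_links)
```

The relative growth of the MUX is about 4% to 6%, which is consistent with the statement that the added MUX delay is small.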
Figure 4.7: Connection of two vertically stacked CLBs: (a)-(b) without
direct link; (c)-(d) with direct link.
Figure 4.8: Direct links insertion.
Table 4.1: Delay Comparison of 2D and 3D Length-1 Interconnect
Length-1 Wires   2D      2D Direct Link   3D      3D Direct Link
Delay (ps)       43      7.75             35.8    2.76
Length (µm)      29.6    29.6             22.5    1.08
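To make the comparison in Table 4.1 easier to read, the delay ratios can be computed directly from the tabulated values. The dictionary keys and the helper name `speedup` below are illustrative. The numbers are copied from the table.

```python
# Delay (ps) and length (um) of Length-1 interconnects, copied from Table 4.1.
delay_ps = {
    "2D": 43.0,
    "2D direct link": 7.75,
    "3D": 35.8,
    "3D direct link": 2.76,
}
length_um = {
    "2D": 29.6,
    "2D direct link": 29.6,
    "3D": 22.5,
    "3D direct link": 1.08,
}


def speedup(slower: str, faster: str) -> float:
    """Ratio of two interconnect delays from Table 4.1."""
    return delay_ps[slower] / delay_ps[faster]


for slower, faster in [("2D", "2D direct link"),
                       ("3D", "3D direct link"),
                       ("2D direct link", "3D direct link")]:
    print(f"{faster} vs {slower}: {speedup(slower, faster):.2f}x faster")
```

The ratios show that a 3D direct link is more than an order of magnitude faster than a regular 3D Length-1 wire and several times faster than a 2D direct link.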
4.6 Area and Footprint Reduction
The CMOS baseline FPGA tile area is estimated using the minimum-transistor-
width area model [2]. For CMOS-NEM FPGAs, since NEM relays are stacked
on top of the CMOS circuitry, the final layout area will be determined by
the larger area between the CMOS layer and the NEM layer, as shown in
Figure 4.4. To estimate the area of the NEM layer, we use the same dimen-
sion as described in [5] (also shown in Figure 3.1 (d)), which will lead to a
pull-in voltage around 0.8 V at 22 nm technology node (λ = 11 nm). Based
on the 3T NEM relay layout, the minimum NEM relay layout area can be
estimated. In 3D CMOS-NEM architecture, the area occupied by the 3D via
is added to the NEM layer. The size of the 3D via in 22 nm technology is
set to 0.75 µm × 0.75 µm with a pitch of 1.5 µm [43], [44].
Due to NEM relay encapsulation, compared to the baseline CMOS tile,
we observe a 54.67% area reduction in the 2D CMOS-NEM FPGA tiles
and a 46.92% area reduction in the 3D CMOS-NEM FPGA tiles. Overall,
footprint reduction of the 3D CMOS-NEM FPGA architecture can reach
78.23% compared to the 2D CMOS baseline.
CHAPTER 5
CAD FLOW
In this work a timing-driven CAD flow has been developed. As shown in
Figure 5.1, each benchmark circuit goes through technology-independent
logic optimization using SIS [9] and is technology-mapped to K-LUTs us-
ing DAOmap [46], which is a popular performance-driven mapper working
on area minimization as well. The mapped netlist then feeds into T-VPACK
which performs timing-driven packing (i.e., clustering LUTs into CLBs). The
final step is another contribution in this work, which performs placement and
routing for the design targeting our 3D architecture. The new placement and
routing engine is developed within VPR 5.0. Our modified VPR program also
calculates delay and capacitance at each node of the circuit. Accurate power
and thermal estimation are then carried out. Details of our power and thermal
estimation flow will be discussed in Chapter 6.
Figure 5.1: The CAD evaluation flow.
5.1 3D Architecture Generation
One of VPR’s advantages is that it supports flexible FPGA architecture
exploration: users can easily redefine the architecture in the architecture
file. In this work, we extended the existing architecture description with
additional 3D-related options to guide the 3D FPGA architecture generation.
The new options include:
• max_3d_vias_per_tile This parameter sets an upper limit on the number
of 3D vias that can be inserted within each tile. A 3D via has a
relatively large pitch (1.5 µm) compared to its size. This value
needs to be extracted from a detailed area model to make sure that
there is enough space to accommodate all 3D vias in a tile.
• 3d_via_percentage This parameter defines the fraction of wires in a
wire channel that are connected to vertical vias. For example, for
an architecture with channel width 100, setting 3d_via_percentage to
0.15 creates 15 3D vias within each tile. The detailed process of 3D
via creation is discussed below. Note that the resulting via count is
capped at max_3d_vias_per_tile if it exceeds that limit.
• 3d_via_parameter This option defines the resistance and capacitance
of a 3D via. These values should be derived from the unit resistance
and capacitance of vertical interconnects and from the 3D FPGA
architecture information, i.e., the distance between the two layers and
the bonding process of the 3D stacking.
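The interaction between the via percentage and the per-tile cap can be illustrated with a minimal Python sketch. The function name and the use of rounding are assumptions made for illustration; they are not taken from the modified VPR source.

```python
def vias_per_tile(channel_width, via_percentage, max_vias):
    """Number of 3D vias created in one tile (illustrative sketch).

    Assumption: the requested count is channel_width * via_percentage,
    rounded to the nearest integer, then capped by the per-tile limit.
    """
    requested = round(channel_width * via_percentage)
    return min(requested, max_vias)


# Channel width 100 with 15% of the wires connected to vias.
print(vias_per_tile(100, 0.15, max_vias=20))  # 15
# The same request when the area model only allows 10 vias per tile.
print(vias_per_tile(100, 0.15, max_vias=10))  # 10
```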
Figure 5.2 is an example showing how 3D connections are made. In
VPR 5.0’s single-driver architecture, each outgoing wire in an SB is driven
by a MUX, and each incoming wire connects to a set of MUXs based on the
SB model. For example, in the existing CMOS FPGA architecture, each input
of the switch block connects to three MUXs, one on each of the other three
sides. In our 3D face-to-face architecture, an input not only connects
to the wires within its own layer but can additionally connect to all four
sides on the other layer. An example is the input in1 in Figure 5.2, where
the connections for in1 are all shown in red. Similarly, the upper-layer
wire in2 can also connect to four outgoing wires on the bottom layer (shown
in blue).
The wires that connect to vertical interconnects are evenly distributed
across the wire channel. Taking channel width 100 and 3d_via_percentage
0.15 as an example again, 15 3D vias in total will be generated: 8 of the
15 vias run from the bottom to the top layer and the other 7 run from the
top to the bottom layer. The 8 or 7 vertical connections are evenly
assigned to the wires in the channel. For example, if the wires with an odd
wire ID (1, 3, 5, 7, ..., 99) are incoming wires to an SB (and the wires
with even wire IDs are outgoing wires from the SB), then the 8 3D vias are
added to wires 1, 15, 29, ..., 99, respectively.
Figure 5.2: 3D via creation.
Figure 5.2 demonstrates a simple example with 4 wires in the channel
numbered from 1 to 4 clockwise. The percentage of switch points that have
3D capability is an architecture parameter.
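The even distribution of vias over the incoming wires can be sketched in Python as follows. The helper names and the end-to-end spacing rule are assumptions chosen so that the sketch reproduces the wire IDs of the example above; the actual assignment code in the modified VPR may differ.

```python
def split_vias(total):
    """Split the vias between the two directions (bottom-to-top gets the extra one)."""
    up = (total + 1) // 2
    return up, total - up


def spread_evenly(wire_ids, count):
    """Pick `count` wires spaced evenly from the first to the last candidate.

    Assumes 0 <= count <= len(wire_ids).
    """
    if count <= 0:
        return []
    if count == 1:
        return [wire_ids[0]]
    last = len(wire_ids) - 1
    return [wire_ids[round(i * last / (count - 1))] for i in range(count)]


incoming = list(range(1, 100, 2))    # odd wire IDs 1, 3, ..., 99
up, down = split_vias(15)
print(up, down)                      # 8 7
print(spread_evenly(incoming, up))   # [1, 15, 29, 43, 57, 71, 85, 99]
```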
5.2 3D Placement and Routing
To carry out 3D placement and routing, the first step is the construction
of the 3D routing graph. In VPR 5.0, each component is represented as a
routing node, and possible connections between components are represented
as routing edges. 3D routing graph construction links the appropriate
routing nodes in different layers and changes the values stored in them
accordingly, such as the outgoing edge array, resistance, and capacitance.
The detailed algorithm is shown in Figure 5.3. A 3D routing graph is
generated from two individual 2D routing graphs, which represent the two
stacked layers, respectively. Each routing node in these two planar graphs
has a unique node ID. The number and locations of the 3D vias are then
calculated following the flow described in Section 5.1. Since each wire
segment has a unique routing node ID, we can then add routing edges to
represent 3D vias. The resistance and capacitance values of the destination
routing node are then updated to incorporate the 3D via resistance and
capacitance for accurate timing analysis.
Figure 5.3: Process of 3D routing graph construction.
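A simplified Python sketch of this construction is given below. The RRNode class, its field names, and build_3d_graph are hypothetical stand-ins for VPR's routing-node structures, and the sketch assumes each destination node receives at most one via.

```python
from dataclasses import dataclass, field


@dataclass
class RRNode:
    """Hypothetical stand-in for a VPR routing-resource node."""
    node_id: int
    layer: int
    R: float                      # resistance in ohms
    C: float                      # capacitance in farads
    edges: list = field(default_factory=list)


def build_3d_graph(bottom, top, via_pairs, via_R, via_C):
    """Merge two 2D routing graphs and add one edge per 3D via."""
    graph = {n.node_id: n for n in bottom + top}
    if len(graph) != len(bottom) + len(top):
        raise ValueError("node IDs must be unique across both layers")
    seen_dst = set()
    for src, dst in via_pairs:
        if dst in seen_dst:
            raise ValueError("sketch assumes one via per destination node")
        seen_dst.add(dst)
        graph[src].edges.append(dst)  # new edge representing the via
        graph[dst].R += via_R         # fold the via RC into the destination
        graph[dst].C += via_C
    return graph


bottom = [RRNode(0, 0, 100.0, 1e-15), RRNode(1, 0, 100.0, 1e-15)]
top = [RRNode(2, 1, 100.0, 1e-15), RRNode(3, 1, 100.0, 1e-15)]
g = build_3d_graph(bottom, top, [(0, 2), (3, 1)], via_R=1.0, via_C=2e-15)
print(g[0].edges, g[2].R)  # [2] 101.0
```

The resistance and capacitance values in the usage lines are placeholders, not extracted parameters.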
3D placement uses the same simulated annealing approach as 2D placement,
but the random swaps are carried out both within a layer and between
layers. To speed up placement, VPR pre-calculates a delay matrix for net
delay lookup:
NetDelay = DelayMatrix[∆X, ∆Y]
where ∆X and ∆Y are the horizontal and vertical distances between two pins
of the net. In the 3D case, the pre-calculated delay matrix is expanded
into three dimensions:
NetDelay3D = DelayMatrix[∆X, ∆Y, ∆Z]
If [∆X, ∆Y] is [0, 0], [1, 0], or [0, 1] and ∆Z is not 0, the two pins
can be connected by a direct link, as shown in Figure 4.8. When a direct
link is used, DelayMatrix[∆X, ∆Y, ∆Z] is computed from the RC delay
of the direct link via. Otherwise, DelayMatrix[∆X, ∆Y, ∆Z] is computed
through the 3D switch block routing.
In 3D placement with direct links, the cost of each swap is estimated based
on the 3D DelayMatrix[∆X, ∆Y, ∆Z]. If two locations are directly linked,
the smaller net delay is loaded. For example, consider the case
[∆X, ∆Y, ∆Z] = [1, 0, 0] before the swap and [0, 0, 1] after the swap: two
connected CLBs are placed side by side in the same layer before the swap
and are stacked vertically after it. As explained in Section 4.6, the
directly linked [0, 0, 1] placement has a smaller delay value. Therefore,
solution [0, 0, 1] is preferred and this swap is accepted.
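The 3D delay lookup with direct links can be sketched in Python as follows. The delay values and the switch-block delay function are made-up placeholders, not results from this work; only the direct-link condition and the rule of loading the smaller delay follow the description above.

```python
def is_direct_link(dx, dy, dz):
    """Direct links exist only across layers, for offsets (0,0), (1,0), (0,1)."""
    return dz != 0 and (dx, dy) in {(0, 0), (1, 0), (0, 1)}


def build_delay_matrix(nx, ny, nz, sb_delay, direct_link_delay):
    """Pre-compute DelayMatrix[dx, dy, dz] for all offsets in the grid."""
    matrix = {}
    for dx in range(nx):
        for dy in range(ny):
            for dz in range(nz):
                delay = sb_delay(dx, dy, dz)
                if is_direct_link(dx, dy, dz):
                    # The smaller of the two delays is loaded.
                    delay = min(delay, direct_link_delay)
                matrix[(dx, dy, dz)] = delay
    return matrix


# Placeholder switch-block routing delay in seconds (hypothetical numbers).
def toy_sb_delay(dx, dy, dz):
    return 0.1e-9 + 0.2e-9 * (dx + dy) + 0.3e-9 * dz


delay = build_delay_matrix(4, 4, 2, toy_sb_delay, direct_link_delay=0.05e-9)
before, after = delay[(1, 0, 0)], delay[(0, 0, 1)]
print(after < before)  # True: the vertically stacked placement is faster here
```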
In VPR placement, the region within which two CLBs can be swapped is
restricted by a range limit, rlim. During the annealing process, rlim is
decreased from the whole-chip distance to the minimum of 1. This means
that at higher temperatures two blocks far apart can be swapped, whereas
at lower temperatures only two adjacent blocks can be swapped.
In our experiments, we found that 3D placement works best when the rlim
update rule of [2] is changed to rlim = rlim ∗ (0.75 + success_rat). With
this rule, rlim starts to shrink only when the swap success rate,
success_rat, drops below 25%. In other words, 3D placement achieves better
results when the window from which the two blocks are picked shrinks more
slowly than in 2D placement.
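The range-limit update can be written as a short Python sketch. The function name and the explicit clamping to [1, max_rlim] are assumptions based on the description above (rlim ranges from the whole-chip distance down to 1); the offset of 0.75 is the value reported in this section.

```python
def update_rlim(rlim, success_rat, max_rlim, offset=0.75):
    """Update the swap range limit after one annealing temperature step."""
    new_rlim = rlim * (offset + success_rat)
    # Keep the window between one tile and the whole chip.
    return max(1.0, min(new_rlim, max_rlim))


print(update_rlim(10.0, 0.25, max_rlim=20.0))  # 10.0 (break-even at 25%)
shrunk = update_rlim(10.0, 0.10, max_rlim=20.0)
grown = update_rlim(10.0, 0.60, max_rlim=20.0)
print(shrunk < 10.0 < grown)                   # True
```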
CHAPTER 6
POWER AND THERMAL SIMULATION
6.1 Overview
Though it is possible to directly measure die temperature with on-chip hard-
ware such as the temperature-sensitive diode and the ring-oscillator [32], a
power and thermal modeling tool for FPGAs could provide insight for
designers in architecture exploration. In this chapter, a detailed
description of our power and thermal estimation flow based on
fpgaEVA-3D-LP and Hotspot [14] is given.
6.2 Parameter Extraction
Our power and thermal estimation flow is flexible enough to explore various
FPGA architectures. Parameters related to timing analysis and dynamic
power consumption, such as gate capacitance of buffers, input capacitance of
MUX, and capacitance of wire, are technology- and architecture-dependent.
Leakage power of the components is also technology- and architecture-dependent.
These parameters will directly affect the timing performance and power con-
sumption of the circuit. An interface for users to define these parameters is
provided in the fpgaEVA-3D-LP framework (Auxiliary Capacitance Library
and FPGA Power Library as shown in Figure 5.1).
CMOS baseline FPGA tile area is estimated using the minimum-transistor-
width area model [2]. For CMOS-NEM FPGA, tile area is estimated using
a similar method. For the 3T NEM relay layout, we use the same dimension
as described in [5] (also shown in Figure 3.1 (d)), which will lead to a pull-in
voltage around 0.8 V at 22 nm technology node (λ = 11 nm). Based on the
3T NEM relay layout, the minimum NEM relay layout area can be estimated.
Using the minimum NEM relay layout area model and the minimum CMOS
transistor area model, we estimated separately the area for the required NEM
relays on top of the CMOS, and the area for the remaining CMOS circuitry.
Since NEM relays are stacked on top of the CMOS, the final layout area will
be determined by the larger area between the CMOS layer and the NEM
layer.
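The rule that the larger layer determines the tile area can be expressed as a small Python sketch. The function name and all numeric areas are hypothetical; the via area term reflects the earlier statement that the 3D via area is added to the NEM layer, with the total via area supplied by the caller.

```python
def tile_area(cmos_area, nem_relay_area, via_area=0.0):
    """Final tile area when the NEM layer is stacked on the CMOS layer.

    All areas use the same unit (e.g., um^2). The 3D via area is charged
    to the NEM layer; the larger of the two layers sets the tile area.
    """
    nem_layer_area = nem_relay_area + via_area
    return max(cmos_area, nem_layer_area)


# Placeholder numbers: 15 vias of 0.75 um x 0.75 um each.
vias = 15 * 0.75 * 0.75
print(tile_area(cmos_area=100.0, nem_relay_area=60.0, via_area=vias))  # 100.0
print(tile_area(cmos_area=50.0, nem_relay_area=60.0))                  # 60.0
```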
To extract related power parameters, we first extract the transistor-level
netlist from the FPGA tile layout, then run SPICE simulation using 22 nm
PTM models (high performance) to estimate capacitance, delay, and leakage
power.
Since LUTs have a regular structure, in our framework, the dynamic power
of LUTs is estimated based on macromodels to save run time, as in [13].
A macromodel characterizes an LUT by its leakage power and energy per
transition. Energy per transition of an LUT is estimated using a transistor-
level netlist of the LUT and PTM models. It is shown in [13] that energy
per transition of an LUT is dependent on the input vector pair (input vector
before and after the transition). To account for such effects, we randomly
generate a series of input vector pairs to find the average energy per transition
of the LUT.
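The averaging over random input vector pairs can be sketched in Python as follows. The energy_of_pair callback stands in for a SPICE measurement, and the toy energy model (proportional to the number of toggled inputs) is purely illustrative.

```python
import random


def average_energy(energy_of_pair, k, samples, seed=0):
    """Average energy per transition of a K-input LUT over random vector pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        before = rng.getrandbits(k)   # input vector before the transition
        after = rng.getrandbits(k)    # input vector after the transition
        total += energy_of_pair(before, after)
    return total / samples


# Toy stand-in for SPICE: 1 fJ per toggled input (hypothetical numbers).
def toy_energy(before, after):
    return bin(before ^ after).count("1") * 1e-15


avg = average_energy(toy_energy, k=4, samples=2000)
print(f"{avg:.2e} J per transition")
```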
6.3 Power Estimation Framework
Dynamic power consumption is determined in part by the parasitic
capacitance of circuit elements and their switching activity. With the
information in the routing-resource graph of VPR 5.0, fpgaEVA-3D-LP
calculates the effective input capacitance at the gate terminal of every
circuit element in the routing path. The Elmore delay model is used to
estimate the delay from one circuit element to another. The capacitance and
delay of the elements of logic blocks and interconnects are summarized in
the basic circuit (BC) netlist file. Information in the BC netlist file is
used to estimate switching activity and dynamic power. The BC netlist file
generation process considers uni-directional routing in VPR 5.0.
To accommodate our 3D CMOS-NEM architecture, a routing-resource graph
is constructed for 3D routing. Since the routing-resource graph remains
the data structure for routing information in 3D routing, capacitance and
delay information in a 3D architecture can be generated in a similar
fashion as in the 2D case.
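As an illustration of the two quantities involved, the following Python sketch computes the Elmore delay of a single RC chain and the textbook dynamic power expression. The function names and all numeric values are assumptions for illustration; they are not the fpgaEVA-3D-LP implementation. The factor 1/2 assumes that the activity is counted in transitions per clock cycle.

```python
def elmore_delay(stages):
    """Elmore delay of an RC chain given as [(R, C), ...] from driver to sink.

    Each capacitance is charged through all the resistance upstream of it.
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in stages:
        upstream_r += r
        delay += upstream_r * c
    return delay


def dynamic_power(activity, capacitance, vdd, freq):
    """Textbook switching power: 0.5 * activity * C * Vdd^2 * f."""
    return 0.5 * activity * capacitance * vdd ** 2 * freq


# A wire segment followed by a 3D via stage (placeholder RC values).
path = [(100.0, 1e-15), (200.0, 2e-15)]
print(f"{elmore_delay(path):.2e} s")  # 7.00e-13 s
print(f"{dynamic_power(0.5, 3e-15, 0.8, 1e9):.2e} W")
```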
fpgaEVA-3D-LP has a built-in mechanism to estimate the leakage
power when a used (active) circuit component is not switching. Meanwhile,
an accurate count of the unused circuit elements is crucial for
static power calculation. With the routing information and the
architectural definition of the FPGA tile (such as routing channel width),
it is possible to calculate the usage of the circuit elements. The numbers
of unused circuit elements are summarized by fpgaEVA-3D-LP. Leakage power
of buffers, LUTs, MUXs, flip-flops, and SRAMs is characterized by SPICE
simulation and recorded in the FPGA Power Library file. With the component
statistics and the FPGA Power Library, we can calculate the static power of
the unused circuit components.
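The static power of the unused components follows directly from the component counts and the per-component leakage values. A minimal Python sketch, with hypothetical component names, counts, and leakage numbers, is:

```python
def unused_static_power(total_counts, used_counts, leakage_lib):
    """Sum leakage over the unused instances of every component type.

    total_counts / used_counts: instances per component type.
    leakage_lib: leakage power per instance (W), playing the role of the
    FPGA Power Library in this sketch.
    """
    power = 0.0
    for kind, total in total_counts.items():
        unused = total - used_counts.get(kind, 0)
        power += unused * leakage_lib[kind]
    return power


# Placeholder statistics for a tiny design.
total = {"buffer": 1000, "mux": 400, "lut": 100}
used = {"buffer": 600, "mux": 150, "lut": 84}
leak = {"buffer": 2e-9, "mux": 5e-9, "lut": 8e-9}
print(f"{unused_static_power(total, used, leak):.3e} W")
```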
6.4 Thermal Characterization
To characterize the thermal profile of the FPGA, it is necessary to divide the
chip into sub-circuits. We use Hotspot [14] to characterize the thermal profile
of the various architectures explored. There are two thermal models provided
in Hotspot [14], namely grid thermal model and block thermal model. In this
study, the grid thermal model is used. Compared to the block thermal model,
the grid thermal model is more accurate, but also more computationally
expensive. Since we are not attempting to incorporate thermal simulation
result in placement or routing, the grid thermal model’s disadvantage in run
time can be tolerated.
In this study, thermal modeling parameters, such as thermal conductivity,
heat capacity, and thickness of the die, are extracted from an actual 3D
integration fabrication [47]. As discussed in [47], in a face-to-face bonding
structure, the top die can be thinned to 12 µm, and the thickness of the
bottom die is 765 µm. We assume that the metal layer in which
the NEM relays are encapsulated is 12 µm thick. Important thermal modeling
parameters are summarized in Table 6.1. Ambient temperature of 318.5 K
is assumed.
Table 6.1: Important Thermal Modeling Parameters in Thermal Simulation
Layer                          2D CMOS   3D CMOS/NEM Top Die   3D CMOS/NEM Bottom Die
Thickness (µm)                 765       18                    771
Heat Capacity (J/(m^3 K))      1.65E6    1.65E6                1.65E6
Thermal Resistivity ((m K)/W)  0.0067    0.0067                0.0067
As shown in Figures 6.1 and 6.2, compared to the baseline CMOS
architecture, the face-to-face bonding structure is thicker, making it
more difficult for the heat generated by the bottom CMOS layer to reach
the heat sink.
Figure 6.1: Face-to-face bonding for 3D CMOS-NEM structure in thermal
simulation (not to scale).
Figure 6.2: Base-line 2D CMOS structure in thermal simulation (not to
scale).
CHAPTER 7
EXPERIMENTAL RESULTS AND
DISCUSSION
7.1 Experimental Setup
To evaluate the 3D NEM FPGA, we use an LUT input size of K = 4 and
a logic cluster size of N = 10. The length of the interconnect wire is
set to 4. For timing analysis, we run the CAD flow shown in Figure 5.1
for different FPGA architectures.
In power and thermal simulation, we compare the 3D CMOS-NEM architecture
with the 2D CMOS baseline architecture and the 3D CMOS architecture. Some
smaller benchmarks used in timing analysis do not generate enough switching
activity for meaningful thermal analysis. This is especially true for small
combinational benchmarks, such as alu4. We therefore picked 8 larger
benchmarks for power and thermal simulation. Each mapped benchmark is
operated at its maximum frequency, which is determined by its critical path
delay. At each clock cycle, every primary input is assigned a 50% switching
probability. A total of 2000 random input vectors for the primary inputs
are generated for each benchmark to find the average power consumption.
Note that the flow we developed is flexible and can support different
architecture parameters.
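The stimulus generation described above can be sketched in Python as follows. The function name, the seed, and the choice of a random initial vector are assumptions for illustration; only the 50% switching probability and the count of 2000 vectors come from the setup above.

```python
import random


def random_vectors(num_inputs, num_vectors=2000, p_switch=0.5, seed=0):
    """Generate input vectors where each input toggles with probability p_switch."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(num_inputs)]
    vectors = [current]
    for _ in range(num_vectors - 1):
        # Flip each bit independently with probability p_switch.
        current = [bit ^ (rng.random() < p_switch) for bit in current]
        vectors.append(current)
    return vectors


vecs = random_vectors(num_inputs=8)
toggles = sum(a != b for prev, nxt in zip(vecs, vecs[1:])
              for a, b in zip(prev, nxt))
rate = toggles / (8 * (len(vecs) - 1))
print(len(vecs), round(rate, 2))
```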
7.2 Timing Performance
In this section, we quantify the overall performance improvements of the 3D
NEM FPGA over the baseline 2D CMOS FPGA, 2D NEM FPGA, and 3D
CMOS FPGA in terms of critical path delay. Specifically, the 2D CMOS
baseline is the CMOS-based FPGA design at the 22 nm technology node.
Architecture parameters of the CMOS baseline are obtained through SPICE
simulation at the 22 nm node using the PTM model [45]. A 2D NEM FPGA has
an architecture similar to the 2D CMOS baseline design, but all LUTs and
routing structures are NEM-based as described in Chapter 4. Strictly
speaking, a 2D NEM FPGA is no longer a pure 2D architecture because some
transistors (such as routing MUXs) and SRAM cells are implemented using
NEM relays, which are stacked on top of the CMOS devices. However, we use
this term to differentiate this architecture from the two-layer 3D stacking
architecture.
Figure 7.1: Performance comparisons of CMOS and NEM FPGA.
Figure 7.1 details the performance comparison results. The performance
improvement of 3D NEM FPGA is achieved from the combination of NEM-
based LUT, NEM-based routing design, and the 3D architecture.
On average, a 2D NEM FPGA provides a 15.27% delay reduction compared
to the baseline 2D CMOS FPGA. This delay reduction is achieved by the
reduced tile area from using the NEM design for the CB, SB, and CLB, which
shortens the global wires. Replacing the SRAM-based LUT with the NEM-based
LUT also contributes to delay reduction within the CLB itself. For the 3D
CMOS-NEM architecture, a 33.11% delay reduction compared to the baseline
2D CMOS FPGA can be achieved with direct links. The performance gain comes
from the 3D stacking, which dramatically reduces the FPGA footprint.
Figure 7.2: Thermal profile of benchmark des_perf in 3D CMOS-NEM
architecture. Temperature of the active layer farther away from the heat
sink is shown. Unit is kelvins.
7.3 Power and Temperature Reduction
Superiority of 3D CMOS-NEM FPGA in terms of timing performance is
clearly established in Figure 7.1. However, with drastic footprint reduc-
tion as discussed in Section 4.6, it is possible that the power density of 3D
CMOS-NEM FPGA would increase, leading to a potential operating tem-
perature penalty. In this section we show that power reduction enabled by
our 3D CMOS-NEM architecture is sufficient to offset reduction in the heat
spreading capability of the die.
The thermal profile of the benchmark des_perf is shown in Figure 7.2. The
difference between the maximum and minimum temperatures is less than 1 K.
We observe small temperature variations across the die for all the
benchmarks in every architecture we examined. This observation is
consistent with the measurement data shown in [38] and [40].
Figure 7.3: Dynamic power comparison of CMOS and CMOS-NEM FPGA.
Circuits are clocked at their maximum frequency, which is decided by the
critical path delay. F_CMOS-NEM / F_baseline CMOS is shown above each bar.
Figure 7.4: Static power comparison of CMOS and CMOS-NEM FPGA.
Since temperature gradients are small, instead of presenting the thermal
profile figure, we only report average die temperature for each benchmark.
For 3D architecture, we report the average temperature of the active layer
that is farther away from the heat sink (the bottom die).
Figure 7.5: Total power comparison of CMOS and CMOS-NEM FPGA.
Figure 7.6: Average temperature comparison of CMOS and CMOS-NEM
FPGA.
Comparisons of the power consumption and temperature of the 2D CMOS FPGA,
3D CMOS FPGA, and 3D CMOS-NEM FPGA are shown in Figures 7.3 to 7.6. Figure
7.3 shows the comparison of dynamic power. Note that the circuits are
clocked at their maximum frequency, which is determined by the critical
path delay. On average we observe a 33.11% critical path delay reduction in
the 3D CMOS-NEM architecture compared to the baseline 2D CMOS, meaning 3D
CMOS-NEM FPGAs are clocked at much higher frequencies. Despite the higher
frequency, the 3D CMOS-NEM architecture consumes less dynamic power than
the baseline 2D CMOS architecture. The dynamic power reduction in the
CMOS-NEM architecture is mainly due to smaller wire capacitance and a
significant reduction of output capacitance in the MUXs.
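To make the frequency claim concrete, the average delay reduction can be converted into a clock frequency ratio. This is a back-of-the-envelope estimate: it applies the average reduction to a single representative circuit, whereas the per-benchmark ratios are those shown in Figure 7.3. The second line assumes the usual proportionality of dynamic power to switched capacitance, supply voltage squared, and frequency, with unchanged supply voltage and switching activity.

```latex
\frac{f_{\mathrm{3D}}}{f_{\mathrm{2D}}}
  = \frac{T_{\mathrm{2D}}}{T_{\mathrm{3D}}}
  = \frac{1}{1 - 0.3311} \approx 1.495
\\[4pt]
P_{\mathrm{dyn}} \propto \alpha\, C\, V_{DD}^{2}\, f
  \;\Rightarrow\;
P_{\mathrm{dyn,3D}} < P_{\mathrm{dyn,2D}}
  \iff
\frac{C_{\mathrm{3D}}}{C_{\mathrm{2D}}} < \frac{f_{\mathrm{2D}}}{f_{\mathrm{3D}}} \approx 0.669
```

Under these assumptions, a dynamic power reduction at about 1.5 times the clock frequency implies that the effective switched capacitance dropped by more than roughly one third.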
In the 3D CMOS-NEM architecture, the leakage power caused by SRAM cells
and off-state switches is eliminated. An average static power reduction of
69.34% is observed in the 3D CMOS-NEM architecture compared to the baseline
2D CMOS architecture. The static power of the 3D CMOS architecture is
higher than that of the 2D CMOS architecture. This static power increase is
due to the increase in the total number of tiles. For example, for
benchmark clma with 842 CLBs, a 30 × 30 tile grid is sufficient to place
the CLBs in the 2D CMOS architecture. However, in the 3D CMOS architecture,
a 22 × 22 × 2 grid is needed, leading to an increase of about 7.6% in the
total number of tiles (from 900 to 968) and a 117% increase in the number
of unused tiles (from 58 to 126).
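The tile counts in the clma example can be checked with a few lines of Python. The grid sizes and the CLB count are taken from the text; the helper name is hypothetical, and the sketch counts logic tiles only.

```python
def tile_stats(num_clbs, nx, ny, layers=1):
    """Return (total tiles, unused tiles) for an nx x ny x layers grid."""
    total = nx * ny * layers
    return total, total - num_clbs


t2d, u2d = tile_stats(842, 30, 30)
t3d, u3d = tile_stats(842, 22, 22, layers=2)
print(t2d, u2d, t3d, u3d)  # 900 58 968 126

tile_increase = 100 * (t3d - t2d) / t2d
unused_increase = 100 * (u3d - u2d) / u2d
print(f"{tile_increase:.1f}% {unused_increase:.1f}%")  # 7.6% 117.2%
```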
As shown in Figure 7.5, on average, we observe a 29.19% total power
reduction. Despite the large footprint reduction and increased chip
thickness, thanks to the power reduction enabled by the 3D CMOS-NEM
architecture, the average operating temperature of 3D CMOS-NEM is only
0.48 K higher than that of the baseline 2D CMOS architecture, as shown in
Figure 7.6.
CHAPTER 8
CONCLUSION
In this thesis, we introduced a novel 3D CMOS-NEM FPGA architecture
that utilizes 3D integration techniques and NEM relays. The proposed ar-
chitecture consists of NEM-based LUT and routing elements. Two layers
of NEM-based CLBs are stacked face-to-face to pursue better performance
gain. A customized 3D design automation flow has been developed. We
evaluated the performance of this 3D CMOS-NEM FPGA with the largest
20 MCNC benchmarks and some large VPR 5.0 benchmarks. The evaluation
results demonstrate that the proposed 3D architecture provides
a 33.11% delay reduction, 29.19% power reduction, and 78.23% footprint
reduction over the traditional 2D CMOS FPGA. Despite drastic footprint
reduction and increased chip thickness caused by 3D stacking, with large
power reduction, the operating temperature penalty is negligible.
REFERENCES
[1] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-
submicron FPGA performance and density,” Very Large Scale Integra-
tion (VLSI) Systems, IEEE Transactions on, vol. 12, no. 3, pp. 288–298,
March 2004.
[2] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-
Submicron FPGAs. Kluwer Academic Publishers, 1999.
[3] T. Sakurai, “Perspectives on power-aware electronics,” in Solid-State
Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003
IEEE International, 2003, pp. 26–29 vol.1.
[4] R. Parsa, M. Shavezipur, W. Lee, S. Chong, D. Lee, H.-S. Wong,
R. Maboudian, and R. Howe, “Nanoelectromechanical relays with de-
coupled electrode and suspension,” in Micro Electro Mechanical Systems
(MEMS), 2011 IEEE 24th International Conference on, Jan. 2011, pp.
1361–1364.
[5] C. Chen, R. Parsa, N. Patil, S. Chong, K. Akarvardar, J. Provine,
D. Lewis, J. Watt, R. T. Howe, H.-S. P. Wong, and S. Mitra, “Efficient
FPGAs using nanoelectromechanical relays,” in Proceedings of the 18th
annual ACM/SIGDA international symposium on Field programmable
gate arrays, ser. FPGA ’10. New York, NY, USA: ACM, 2010, pp. 273–282.
[Online]. Available: http://doi.acm.org/10.1145/1723112.1723158
[6] S. Chong, B. Lee, S. Mitra, R. Howe, and H.-S. Wong, “Integration
of nanoelectromechanical relays with silicon nMOS,” Electron Devices,
IEEE Transactions on, vol. 59, no. 1, pp. 255–258, January 2012.
[7] S. J. Koester, A. M. Young, R. R. Yu, S. Purushothaman, K.-N.
Chen, D. C. La Tulipe, N. Rana, L. Shi, M. R. Wordeman, and
E. J. Sprogis, “Wafer-level 3D integration technology,” IBM J. Res.
Dev., vol. 52, no. 6, pp. 583–597, Nov. 2008. [Online]. Available:
http://dx.doi.org/10.1147/JRD.2008.5388565
[8] W. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. Sule,
M. Steer, and P. Franzon, “Demystifying 3D ICs: the pros and cons
of going vertical,” Design Test of Computers, IEEE, vol. 22, no. 6, pp.
498–510, Nov.-Dec. 2005.
[9] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W. M. Fang, K. Kent,
and J. Rose, “VPR 5.0: FPGA CAD and architecture exploration tools
with single-driver routing, heterogeneity and process scaling,” ACM
Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, pp. 32:1–32:23, Dec.
2011. [Online]. Available: http://doi.acm.org/10.1145/2068716.2068718
[10] P. Mangalagiri, S. Bae, R. Krishnan, Y. Xie, and V. Narayanan,
“Thermal-aware reliability analysis for platform FPGAs,” in Proceedings
of the 2008 IEEE/ACM International Conference on Computer-Aided
Design, ser. ICCAD ’08. Piscataway, NJ, USA: IEEE Press,
2008, pp. 722–727. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1509456.1509613
[11] Y. Taur and T. H. Ning, Fundamentals of modern VLSI devices. New
York, NY, USA: Cambridge University Press, 1998.
[12] G. H. Loh, Y. Xie, and B. Black, “Processor design in 3D die-stacking
technologies,” Micro, IEEE, vol. 27, no. 3, pp. 31–48, May-June 2007.
[13] F. Li, Y. Lin, L. He, D. Chen, and J. Cong, “Power modeling and charac-
teristics of field programmable gate arrays,” Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on, vol. 24, no. 11,
pp. 1712–1724, Nov. 2005.
[14] W. Huang, K. Skadron, S. Gurumurthi, R. Ribando, and M. Stan, “Dif-
ferentiating the roles of IR measurement and simulation for power and
temperature-aware design,” in Performance Analysis of Systems and
Software, 2009. ISPASS 2009. IEEE International Symposium on, April
2009, pp. 1–10.
[15] J. U. Knickerbocker, P. S. Andry, B. Dang, R. R. Horton, M. J. In-
terrante, C. S. Patel, R. J. Polastre, K. Sakuma, R. Sirdeshmukh, E. J.
Sprogis, S. M. Sri-Jayantha, A. M. Stephens, A. W. Topol, C. K. Tsang,
B. C. Webb, and S. L. Wright, “Three-dimensional silicon integration,”
IBM Journal of Research and Development, vol. 52, no. 6, pp. 553–569,
Nov. 2008.
[16] K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, “3-D ICs: a novel chip
design for improving deep-submicrometer interconnect performance and
systems-on-chip integration,” Proceedings of the IEEE, vol. 89, no. 5,
pp. 602–633, May 2001.
[17] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance benefits
of monolithically stacked 3D-FPGA,” in Proceedings of the 2006
ACM/SIGDA 14th international symposium on Field programmable
gate arrays, ser. FPGA ’06. New York, NY, USA: ACM, 2006, pp. 113–122.
[Online]. Available: http://doi.acm.org/10.1145/1117201.1117219
[18] C. Ababei, Y. Feng, B. Goplen, H. Mogal, T. Zhang, K. Bazargan,
and S. Sapatnekar, “Placement and routing in 3D integrated circuits,”
Design Test of Computers, IEEE, vol. 22, no. 6, pp. 520–531, Nov.-Dec.
2005.
[19] V. Chandrasekhar, “CAD for a 3-dimensional FPGA,” M.S. thesis,
MIT, 2007. [Online]. Available: http://hdl.handle.net/1721.1/40520
[20] T. Naito, T. Ishida, T. Onoduka, M. Nishigoori, T. Nakayama, Y. Ueno,
Y. Ishimoto, A. Suzuki, W. Chung, R. Madurawe, S. Wu, S. Ikeda, and
H. Oyamatsu, “World’s first monolithic 3D-FPGA with TFT SRAM
over 90nm 9 layer Cu CMOS,” in VLSI Technology (VLSIT), 2010 Sym-
posium on, June 2010, pp. 219–220.
[21] X. Wang, S. Narasimhan, A. Krishna, F. Wolff, S. Rajgopal, T.-H. Lee,
M. Mehregany, and S. Bhunia, “High-temperature reconfigurable com-
puting using silicon carbide NEMS switches,” in Design, Automation
Test in Europe Conference Exhibition (DATE), 2011, March 2011, pp.
1–6.
[22] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, “3-D nFPGA: A
reconfigurable architecture for 3-D CMOS/Nanomaterial hybrid digital
circuits,” Circuits and Systems I: Regular Papers, IEEE Transactions
on, vol. 54, no. 11, pp. 2489–2501, Nov. 2007.
[23] S. Chong, K. Akarvardar, R. Parsa, J.-B. Yoon, R. Howe, S. Mi-
tra, and H.-S. Wong, “Nanoelectromechanical (NEM) relays inte-
grated with CMOS SRAM for improved stability and low leakage,”
in Computer-Aided Design - Digest of Technical Papers, 2009. ICCAD
2009. IEEE/ACM International Conference on, Nov. 2009, pp. 478–484.
[24] C. Chen, S. Lee, J. Provine, S. Chong, R. Parsa, D. Lee, R. Howe, H.-S.
Wong, and S. Mitra, “Nano-electro-mechanical (NEM) relays and their
application to FPGA routing,” in Design Automation Conference (ASP-
DAC), 2012 17th Asia and South Pacific, Feb. 2012, p. 639.
[25] J. Lau and T. Yue, “Thermal management of 3D IC integration with
TSV (through silicon via),” in Electronic Components and Technology
Conference, 2009. ECTC 2009. 59th, May 2009, pp. 635–640.
[26] Y. J. Kim, Y. K. Joshi, A. G. Fedorov, Y.-J. Lee, and S.-K.
Lim, “Thermal characterization of interlayer microfluidic cooling of
three-dimensional integrated circuits with nonuniform heat flux,”
Journal of Heat Transfer, vol. 132, no. 4, p. 041009, 2010. [Online].
Available: http://link.aip.org/link/?JHR/132/041009/1
[27] J. Cong, G. Luo, J. Wei, and Y. Zhang, “Thermal-aware 3D IC Place-
ment Via Transformation,” in Design Automation Conference, 2007.
ASP-DAC ’07. Asia and South Pacific, Jan. 2007, pp. 780–785.
[28] C. Chu and D. Wong, “A matrix synthesis approach to thermal place-
ment,” Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, vol. 17, no. 11, pp. 1166–1174, Nov. 1998.
[29] J. Cong and Y. Zhang, “Thermal-driven multilevel routing for 3D ICs,”
in Design Automation Conference, 2005. Proceedings of the ASP-DAC
2005. Asia and South Pacific, vol. 1, Jan. 2005, pp. 121–126, Vol. 1.
[30] Xilinx, Virtex-5 FPGA System Monitor User Manual, Nov. 2008.
[31] E. I. Boemo and S. López-Buedo, “Thermal monitoring on FPGAs
using ring-oscillators,” in Proceedings of the 7th International
Workshop on Field-Programmable Logic and Applications, ser. FPL
’97. London, UK: Springer-Verlag, 1997, pp. 69–78. [Online]. Available:
http://dl.acm.org/citation.cfm?id=647924.738726
[32] J. Franco, E. Boemo, E. Castillo, and L. Parrilla, “Ring oscillators
as thermal sensors in FPGAs: Experiments in low voltage,” in Pro-
grammable Logic Conference (SPL), 2010 VI Southern, March 2010,
pp. 133–137.
[33] J. Anderson and F. Najm, “Power estimation techniques for FPGAs,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
vol. 12, no. 10, pp. 1015–1027, Oct. 2004.
[34] C. H. Ho, P. Leong, W. Luk, and S. Wilton, “Rapid estimation of power
consumption for hybrid FPGAs,” in Field Programmable Logic and Ap-
plications, 2008. FPL 2008. International Conference on, Sept. 2008,
pp. 227–232.
[35] K. K. W. Poon, S. J. E. Wilton, and A. Yan, “A detailed power
model for field-programmable gate arrays,” ACM Trans. Des. Autom.
Electron. Syst., vol. 10, pp. 279–302, April 2005. [Online]. Available:
http://doi.acm.org/10.1145/1059876.1059881
[36] X. Chen, R. Dick, and L. Shang, “Properties of and improvements to
time-domain dynamic thermal analysis algorithms,” in Design, Automa-
tion Test in Europe Conference Exhibition (DATE), 2010, March 2010,
pp. 1165–1170.
[37] G. Link and N. Vijaykrishnan, “Thermal trends in emerging technolo-
gies,” in Quality Electronic Design, 2006. ISQED ’06. 7th International
Symposium on, March 2006, pp. 8 pp.–632.
[38] P. Sundararajan, A. Gayasen, N. Vijaykrishnan, and T. Tuan, “Thermal
characterization and optimization in platform FPGAs,” in Proceedings
of the 2006 IEEE/ACM international conference on Computer-aided
design, ser. ICCAD ’06. New York, NY, USA: ACM, 2006, pp. 443–447.
[Online]. Available: http://doi.acm.org/10.1145/1233501.1233589
[39] K. Siozios and D. Soudris, “A novel methodology for temperature-aware
placement and routing of FPGAs,” in VLSI, 2007. ISVLSI ’07. IEEE
Computer Society Annual Symposium on, March 2007, pp. 55–60.
[40] S. Velusamy, W. Huang, J. Lach, M. Stan, and K. Skadron, “Monitor-
ing temperature in FPGA based SoCs,” in Computer Design: VLSI in
Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE
International Conference on, Oct. 2005, pp. 634–637.
[41] Chen Chen et al., “Nano-Electro-Mechanical relays for FPGA routing:
Experimental demonstration and a design technique,” in Design, Au-
tomation Test in Europe Conference Exhibition (DATE), 2012, 2012.
[42] P. Lindner, V. Dragoi, T. Glinsner, C. Schaefer, and R. Islam, “3D
interconnect through aligned wafer level bonding,” in Electronic Com-
ponents and Technology Conference, 2002. Proceedings. 52nd, 2002, pp.
1439–1443.
[43] Tezzaron Semiconductor, “Tezzaron’s Patented Technologies,”
http://www.tezzaron.com/, 2010.
[44] ITRS, “International Technology Roadmap for Semiconductors,”
http://public.itrs.net, 2010.
[45] A. Balijepalli, S. Sinha, and Y. Cao, “Compact modeling of carbon
nanotube transistor for early stage process-design exploration,” in
Proceedings of the 2007 international symposium on Low power
electronics and design, ser. ISLPED ’07. New York, NY, USA: ACM,
2007, pp. 2–7. [Online]. Available:
http://doi.acm.org/10.1145/1283780.1283783
[46] D. Chen and J. Cong, “DAOmap: a depth-optimal area optimization
mapping algorithm for FPGA designs,” in Proceedings of the 2004
IEEE/ACM International conference on Computer-aided design, ser.
ICCAD ’04. Washington, DC, USA: IEEE Computer Society, 2004,
pp. 752–759. [Online]. Available:
http://dx.doi.org/10.1109/ICCAD.2004.1382677
[47] D. H. Kim, K. Athikulwongse, M. B. Healy, M. M. Hossain, M. Jung,
I. Khorosh, G. Kumar, Y.-J. Lee, D. Lewis, T.-W. Lin, C. Liu, S. Panth,
M. Pathak, M. Ren, G. Shen, T. Song, D. H. Woo, X. Zhao, J. Kim,
H. Choi, G. Loh, H.-H. Lee, and S. K. Lim, “3D-MAPS: 3D massively
parallel processor with stacked memory,” in Proceeding of the 2012 In-
ternational Solid-State Circuits Conference, 2012.
