Rochester Institute of Technology

RIT Scholar Works
Theses
4-25-2017

Overcoming the Challenges for Multichip Integration: A Wireless
Interconnect Approach
Md Shahriar Shamim
ms5614@rit.edu

Recommended Citation
Shamim, Md Shahriar, "Overcoming the Challenges for Multichip Integration: A Wireless Interconnect
Approach" (2017). Thesis. Rochester Institute of Technology. Accessed from

This Dissertation is brought to you for free and open access by RIT Scholar Works. It has been accepted for
inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact
ritscholarworks@rit.edu.

OVERCOMING THE CHALLENGES FOR
MULTICHIP INTEGRATION: A WIRELESS
INTERCONNECT APPROACH

BY

MD SHAHRIAR SHAMIM

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTING AND
INFORMATION SCIENCES

B. THOMAS GOLISANO COLLEGE OF COMPUTING AND INFORMATION SCIENCES
DEPARTMENT OF COMPUTING AND INFORMATION SCIENCES-PHD
ROCHESTER INSTITUTE OF TECHNOLOGY
ROCHESTER, NEW YORK
April 25, 2017

Overcoming the Challenges for Multichip Integration: A
Wireless Interconnect Approach
By
Md Shahriar Shamim
Committee Approval:
We, the undersigned committee members, certify that we have advised and/or supervised the
candidate on the work described in this dissertation. We further certify that we have reviewed the
dissertation manuscript and approve it in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computing and Information Sciences.

____________________________________________________________________________
Dr. Amlan Ganguly
Date:
Dissertation Advisor, Dept. of Computer Engineering, R.I.T.

_____________________________________________________________________________
Dr. Andres Kwasinski
Date:
Dissertation Committee Member, Dept. of Computer Engineering, R.I.T.

_____________________________________________________________________________
Dr. Satish G. Kandlikar
Date:
Dissertation Committee Member, Dept. of Mechanical Engineering, R.I.T.

_____________________________________________________________________________
Dr. Minseok Kwon
Date:
Dissertation Committee Member, Dept. of Computer Science, R.I.T.
_____________________________________________________________________________
Dr. Dorin Patru
Date:
Dissertation Defense Chair, Dept. of Electrical & Microelectronic Engineering, R.I.T.
Certified by:
_____________________________________________________________________________
Dr. Pengcheng Shi
Date
Ph.D. Program Director, College of Computing and Information Sciences, R.I.T.

© 2017 Md Shahriar Shamim
All rights reserved.


Overcoming the Challenges for Multichip Integration: A
Wireless Interconnect Approach
By
Md Shahriar Shamim
ABSTRACT
Physical limitations in area, power density, and yield restrict the scalability of a single-chip
multicore system to a relatively small number of cores. Instead of building one large chip,
aggregating multiple smaller chips can overcome these physical limitations. Multiple dies can be
combined either by stacking them vertically or by placing them side by side on the same substrate
within a single package. However, to be widely accepted, both multichip integration
techniques need to overcome significant challenges.
In a horizontally integrated multichip system, traditional inter-chip I/O does not scale
well with technology scaling due to limitations of the pitch. Moreover, to transfer data between
cores or memory components on different chips, state-of-the-art inter-chip communication
over wireline channels requires data signals to travel from internal nets to the peripheral I/O ports
and then be routed over the inter-chip channels to the I/O port of the destination chip. The data is
then routed from the I/O port to the internal nets of the target chip over a wireline
interconnect fabric. This multi-hop communication increases energy consumption while
decreasing data bandwidth in a multichip system. In a vertically integrated
multichip system, on the other hand, the high power density resulting from the placement of
computational components on top of each other aggravates the thermal issues of the chip, leading
to degraded performance and reduced reliability. Liquid cooling through microfluidic channels can
provide the cooling capability required for effective management of chip temperatures in vertical
integration. However, to reduce mechanical stresses while ensuring temperature uniformity
and adequate cooling capacity, the height and width of the microchannels need to be
increased. This limits the area available to route Through-Silicon-Vias (TSVs) across the cooling
layers and makes the co-existence and co-design of TSVs and microchannels extremely
challenging.
Research in recent years has demonstrated that on-chip and off-chip wireless interconnects
are capable of establishing radio communications within as well as between multiple chips. The
primary goal of this dissertation is to propose design principles targeting both horizontally and
vertically integrated multichip systems to provide high-bandwidth, low-latency, and energy-efficient
data communication by utilizing mm-wave wireless interconnects. The proposed solution has two
parts. The first part proposes a design methodology for a seamless hybrid wired and wireless
interconnection network for the horizontally integrated multichip system to enable direct
chip-to-chip communication between internal cores. The second part proposes a Wireless
Network-on-Chip (WiNoC) architecture for the vertically integrated multichip system to realize
data communication across interlayer microfluidic coolers, eliminating the need to place and route
signal TSVs through the cooling layers. The integration of wireless interconnects will significantly
reduce the complexity of the co-design of TSV-based interconnects and microchannel-based
interlayer cooling. Finally, this dissertation presents a combined trade-off evaluation of such
wireless integration in both the horizontal and vertical directions and provides future directions
for the design of multichip systems.


ACKNOWLEDGEMENTS
The completion of this dissertation and the research behind it would not have been possible without
the guidance, support, and encouragement of many individuals. I would like to take this opportunity
to express my earnest and heartfelt gratitude towards them.
First and foremost, I would like to give special thanks to my advisor, Dr. Amlan Ganguly,
for his constant support and mentorship throughout my Ph.D. tenure at Rochester Institute of
Technology. He has taught me how to become a good and effective researcher. The meaningful
discussions that we have shared during my research have been truly inspirational to me. My
thoughts and ideas were always more thorough and well developed after he refined them. I cannot
thank him enough for his efforts in revising my manuscripts for journal and conference
publications. His positive attitude toward both research and life has influenced me deeply, and it
will certainly benefit me for the rest of my career.
I would also like to express my sincere appreciation to my Ph.D. dissertation committee
members, Dr. Andres Kwasinski, Dr. Satish G. Kandlikar, Dr. Minseok Kwon, and Dr. Dorin Patru
for agreeing to serve on my dissertation committee and giving me feedback and suggestions to
improve my dissertation. A special thanks to Dr. Jayanti Venkataraman for helping us with all the
antenna modeling and simulations. I would also like to thank Dr. Pengcheng Shi and Dr. Shanchieh
Yang for their support throughout my Ph.D. tenure.
I would like to thank all my lab mates, co-authors, and friends at the Multi-core System Lab,
especially Rounak Narde, Ashraf Mitul, and Meraj Ahmed, for their support. I would like to express
my special gratitude to my research sibling, Naseef Mansoor, with whom I have been very fortunate
to work since the beginning of my Ph.D. career. I cannot thank him enough for his companionship
and contributions to my research throughout the last four years. Our brainstorming and conversations
over many cups of coffee helped me strengthen the foundations of my conceptual
understanding of the problems. Without any doubt, caffeine and research make the perfect blend!
My internship at Intel Corporate Quality Network (CQN) Lab during my Ph.D. has been
an invaluable experience, and I appreciate the support, guidance, and the pleasant working
atmosphere my colleagues at Intel have provided. I particularly thank my mentors, Dan
Bockelman, Steven Chen, and Tim Pifer, as it has been a pleasure to work with them during the
time I spent at Intel.
I would like to thank my parents, Md Ghias Uddin and Sarifun Nahar, who have sacrificed
a lot to give me a good education. They have always encouraged my curiosity and my passion to
follow my dreams. None of this would have happened without their support and inspiration. I would
also like to thank my younger brothers, Kashfi and Mashfi, my elder brother Shahrish Shuvo, his wife
Susmita Karim, and Sabrina’s family, who have always supported and inspired me. I feel extremely blessed
to have such a wonderful family.
Last but most importantly, I would like to thank my wife and best friend, Sabrina Afrin,
who has always had my best interests at heart, and who motivated me and endured my constant nagging
(with occasional yelling) during the course of my Ph.D. She has made me the person I am
today, taught me to have faith in myself, and reminded me what is important in life.
The research that forms the basis of this dissertation has been supported in part by the US
National Science Foundation (NSF) CAREER grant CNS-1553264 and grant CCF-1162123, and
by the Department of Computer Engineering, RIT.

The text of Chapter 3 is in part a reprint of the material from the paper, M. S. Shamim, N.
Mansoor, R. S. Narde, V. Kothandapani, A. Ganguly and J. Venkataraman, "A Wireless
Interconnection Framework for Seamless Inter and Intra-Chip Communication in Multichip
Systems," in IEEE Transactions on Computers, vol. 66, no. 3, pp. 389-402, March 1, 2017. The
dissertation author was the primary researcher and author, and the co-authors involved in the
publication assisted with the research that forms the foundation for that manuscript. The zigzag
antennas were modeled and simulated by Rounak Singh Narde under the supervision of Dr. Jayanti
Venkataraman.
The text of Chapter 4 is in part a reprint of the material from the paper, M. S. Shamim, A.
Ganguly, C. Munuswamy, J. Venkataraman, J. Hernandez, and S. Kandlikar, "Co-design of 3D
wireless network-on-chip architectures with microchannel-based cooling," in Sixth International
Green and Sustainable Computing Conference (IGSC), Las Vegas, NV, 2015, pp. 1-6. The
dissertation author was the primary researcher and author, and the co-authors involved in the
publication assisted with the research that forms the foundation for that manuscript. The
microchannels were modeled and simulated by Jose-Luis Gonzalez-Hernandez under the
supervision of Dr. Satish G. Kandlikar. The zigzag antennas for the 3D wireless NoCs were
modeled and simulated by Chetan Munuswamy and Rounak Singh Narde under the guidance of
Dr. Jayanti Venkataraman.


DEDICATION

To my parents and Sabrina,
for their boundless love and support.


TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
DEDICATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER 1  INTRODUCTION
  1.1  MULTICHIP SYSTEM
  1.2  CHALLENGES OF THE MULTICHIP SYSTEM
    1.2.1  Horizontal Integration of the Multichip System
    1.2.2  Vertical Integration of the Multichip System
  1.3  POSSIBLE SOLUTION: WIRELESS INTERCONNECTS
  1.4  SUMMARY OF RESEARCH OBJECTIVES
  1.5  CONTRIBUTIONS
  1.6  DISSERTATION ORGANIZATION

CHAPTER 2  BACKGROUND AND STATE-OF-THE-ARTS
  2.1  INTRA AND INTER-CHIP INTERCONNECTION FOR HORIZONTALLY INTEGRATED MULTICHIP SYSTEM
  2.2  VERTICALLY INTEGRATED MULTICHIP SYSTEM AND MICROCHANNEL BASED COOLING
  2.3  WIRELESS TECHNOLOGY: A PROMISING INTERCONNECT PARADIGM

CHAPTER 3  DESIGN OF A WIRELESS INTERCONNECTION FRAMEWORK FOR SEAMLESS INTER AND INTRA-CHIP COMMUNICATION IN HORIZONTALLY INTEGRATED MULTICHIP SYSTEMS
  3.1  WIRELESS INTERCONNECTION FRAMEWORK FOR MULTICHIP SYSTEMS
    3.1.1  Topology
    3.1.2  Physical Layer
    3.1.3  Flow Control and Routing
    3.1.4  Wireless Communication Protocol
  3.2  EXPERIMENTAL RESULTS
    3.2.1  Simulation Platform
    3.2.2  Wireless Channel Characteristics and Wireless Link Budget Analysis
    3.2.3  Comparative Performance Evaluation
    3.2.4  Deployment of the Wireless Interconnection with Scaling of System Size
    3.2.5  Performance Evaluation with Non-uniform Traffic Patterns
    3.2.6  Comparative Evaluation with Respect to Emerging Multichip Integration Technologies
    3.2.7  Area Overheads
  3.3  SUMMARY

CHAPTER 4  WIRELESS INTERCONNECT AS AN ENABLER FOR DATA COMMUNICATION ACROSS MICROCHANNEL BASED COOLING LAYER IN VERTICALLY INTEGRATED MULTICHIP SYSTEM
  4.1  INTEGRATED DESIGN METHODOLOGY FOR WIRELESS 3D NOC WITH MICROCHANNEL BASED LIQUID COOLING
    4.1.1  Design of Microchannel Cooling Layer
    4.1.2  Proposed Topology of the 3D Wireless NoC Architecture
    4.1.3  Physical Layer
    4.1.4  Seamless Flow Control and Routing
    4.1.5  Wireless Communication Protocol and Transceiver
  4.2  EXPERIMENTAL RESULTS
    4.2.1  Dimensions of the Cooling Channels
    4.2.2  Evaluation of the Hybrid 3D NoC Architecture in the Presence of Microchannel based Liquid Cooling
      A.  Wireless Channel Modeling and Link Budget Analysis
      B.  Temperature and Performance Characteristics of 3D Wireless NoC Architectures with Synthetic Workload
    4.2.3  Temperature and Performance Characteristics of 3D NoC Architectures with Real Application-based Workloads
    4.2.4  Increasing Cooling Capacity of the Interlayer Coolers and its Impact on Performance of the 3D Wireless NoC
    4.2.5  Comparison with Alternative Wireless Communication Mechanism
    4.2.6  Area Overheads
    4.2.7  Trade-off Analysis
    4.2.8  Holistic Comparison of the Vertically Integrated Wireless System with Horizontally Integrated Wireless Multichip System
  4.3  SUMMARY

CHAPTER 5  CONCLUSION AND FUTURE RESEARCH DIRECTIONS
  5.1  CONCLUSION
  5.2  FUTURE RESEARCH DIRECTIONS
    5.2.1  Energy-Efficient Multi-gigabit Transceiver Design for Intra and Inter-Chip Wireless Interconnects
    5.2.2  Traffic-Aware Medium Access Mechanism for Multi-Chip System
    5.2.3  A Wireless Interconnection Framework for Multichip System with In-Package Memory
    5.2.4  60 GHz mm-wave Wireless Interconnects to Enable Contactless Testing

APPENDIX A
PUBLICATIONS
  JOURNALS
  CONFERENCES
  WORK-IN-PROGRESS (WIP)/POSTER PRESENTATION

BIBLIOGRAPHY

LIST OF FIGURES
FIGURE 1.1. A TYPICAL COMPUTING ORGANIZATION IN HPC ENVIRONMENTS I.E. DATA CENTER/SERVERS
FIGURE 1.2. THE RANGE OF DIFFERENT INTERCONNECTION ARCHITECTURES.
FIGURE 1.3. SCALING OF I/O PITCH AND MINIMUM GLOBAL INTERCONNECT PITCH [6]
FIGURE 2.1. CONCEPTUAL DIAGRAM OF A WIRELINE MESH BASED NOC AND ITS MULTIHOP COMMUNICATION NATURE.
FIGURE 2.2. CONCEPTUAL DIAGRAM OF A MULTICHIP SYSTEM INTERCONNECTED WITH C4 BUMPS AND IN-PACKAGE TRANSMISSION LINE.
FIGURE 2.3. VERTICALLY STACKED DIES WITH CONVENTIONAL AIR-COOLED HEAT SINK.
FIGURE 2.4. (A) INDUCTIVE COUPLING (B) CAPACITIVE COUPLING BASED INTERCONNECTED FOR VERTICALLY STACKED DIES [50].
FIGURE 3.1. CONCEPTUAL DIAGRAM OF THE WIRELESS MULTICHIP SYSTEM.
FIGURE 3.2. SPECIFIC DIMENSIONS OF THE ANTENNA AND FEED STRUCTURE.
FIGURE 3.3. BLOCK DIAGRAM OF MM-WAVE TOKEN-BASED WIRELESS INTERFACE.
FIGURE 3.4. (A) TOP VIEW OF THE MODEL (B) SIDE VIEW OF THE MODEL
FIGURE 3.5. (A) RADIATION PATTERN, (B) RETURN LOSS, AND (C) WORST CASE PATH LOSS FOR WIRELESS MULTICHIP SYSTEM
FIGURE 3.6. CONCEPTUAL VIEW OF (A) BUS I/O AND (B) NETWORK I/O BASED WIRELINE CONFIGURATION
FIGURE 3.7. PEAK ACHIEVABLE BANDWIDTH PER CORE WITH VARYING SYSTEM SIZE FOR DIFFERENT CONFIGURATIONS WITH UNIFORM TRAFFIC.
FIGURE 3.8. AVERAGE PACKET LATENCY OF VARIOUS MULTICHIP SYSTEMS.
FIGURE 3.9. PACKET ENERGY WITH VARYING SYSTEM SIZE FOR DIFFERENT CONFIGURATIONS WITH UNIFORM TRAFFIC.
FIGURE 3.10. RELATIVE GAIN IN BANDWIDTH AND PACKET ENERGY WITH DIFFERENT FLIT WIDTH FOR 2-CHIP SYSTEM.
FIGURE 3.11. BANDWIDTH PER CORE AND AVERAGE PACKET ENERGY FOR 1, 2, 4-CHIP SYSTEMS FOR TWO DIFFERENT DEPLOYMENT APPROACHES OF WIS.
FIGURE 3.12. RELATIVE GAIN IN BANDWIDTH AND AVERAGE PACKET ENERGY WITH DIFFERENT SYSTEM SIZES.
FIGURE 3.13. (A) BANDWIDTH PER CORE AND (B) PACKET ENERGY WITH NON-UNIFORM TRAFFIC FOR I/O BASED AND WIRELESS MULTICHIP SYSTEMS.
FIGURE 3.14. BANDWIDTH PER CORE AND PACKET ENERGY FOR 2-CHIP I/O BASED AND WIRELESS MULTICHIP SYSTEMS WITH VARYING LOCALIZATION.
FIGURE 3.15. BANDWIDTH PER CORE AND AVERAGE PACKET ENERGY FOR DIFFERENT INTERCONNECT TECHNOLOGIES FOR 4-CHIP SYSTEM
FIGURE 3.16. AREA OVERHEADS OF DIFFERENT WIRELINE AND WIRELESS ARCHITECTURE CONSIDERED IN THIS PAPER
FIGURE 4.1. SIDE VIEW OF PROPOSED 3D WIRELESS NOC ARCHITECTURE WITH THE INTERLAYER COOLING LAYER.
FIGURE 4.2. TOP VIEW OF ONE ACTIVE LAYER.
FIGURE 4.3. THERMAL RESISTANCE VARIATION FOR THE MICROCHANNELS.
FIGURE 4.4. PRESSURE DROP VARIATION FOR THE MICROCHANNELS.
FIGURE 4.5. PUMPING POWER VARIATION FOR THE MICROCHANNELS.
FIGURE 4.6. (A) FULL SIDE VIEW (B) VIEW INSIDE THE BOX
FIGURE 4.7. DIMENSIONS OF THE DESIGNED ZIG-ZAG ANTENNA.
FIGURE 4.8. RETURN LOSSES OF ALL 16 ANTENNAS.
FIGURE 4.9. THE INSERTION LOSS OF ANTENNA 1 IN LAYER1.
FIGURE 4.10. PEAK BANDWIDTH AND ENERGY COST PER BIT FOR DIFFERENT 3D NOC ARCHITECTURES.
FIGURE 4.11. NORMALIZED PEAK BANDWIDTH AND ENERGY COST PER BIT IN THE PRESENCE OF REAL APPLICATION TRAFFIC.
FIGURE 4.12. PEAK CHIP TEMPERATURE IN PRESENCE OF REAL APPLICATION TRAFFICS.
FIGURE 4.13. THE IMPACT OF FREQUENCY OF THE COOLING LAYERS.
FIGURE 4.14. PEAK BANDWIDTH AND ENERGY COST PER BIT FOR DIFFERENT WIRELESS COMMUNICATION MECHANISMS.

LIST OF TABLES
TABLE 3.1. MULTICHIP SYSTEMS WITH DIFFERENT INTER-CHIP INTERCONNECTION CONSIDERED IN THIS PAPER
TABLE 3.2. ENERGY PER BIT AND AGGREGATE BANDWIDTH FOR DIFFERENT INTERCONNECT TECHNOLOGIES
TABLE 4.1. SUMMARY AND COMPARISON OF THE SINGLE-PHASE MICROCHANNEL GEOMETRIES IN THE LITERATURE
TABLE 4.2. PEAK TEMPERATURE OF TWO ARCHITECTURES CONSIDERED HERE
TABLE 4.3. ENERGY PER BIT FOR A SINGLE POINT-TO-POINT LINK AND POSSIBLE AGGREGATE BANDWIDTH FOR DIFFERENT WIRELESS COMMUNICATION PROTOCOLS
TABLE 4.4. AREA OVERHEAD FOR THE SIGNAL TSVS THROUGH MICROCHANNEL COOLING LAYER.
TABLE 4.5. PERFORMANCE COMPARISON WITH THE HORIZONTALLY INTEGRATED WIRELESS MULTICHIP MODULE.
TABLE 4.6. HOLISTIC COMPARISON OF BOTH MULTICHIP INTEGRATION TECHNIQUES IN VARIOUS DOMAINS.

LIST OF ABBREVIATIONS
3D-CHiWiNoC     CDMA based 3D Wireless NoC
3D-HiWiNoC      3D Hierarchical Wireless NoC
3D-ICs          Three-dimensional Integrated Circuits
3D-MTSV         3D Mesh NoC with TSV
3D-THiWiNoC     Token based 3D Wireless NoC
APUs            Accelerated Processor Units
AR              Aspect Ratio
A-TDMA          Asynchronous TDMA
ATE             Automatic Test Equipment
BER             Bit-Error Rate
CDMA            Code Division Multiple Access
CNT             Carbon Nanotube
CSMA/CD         Carrier Sense Multiple Access/Collision Detection
CTE             Coefficient of Thermal Expansion
DSSS            Direct Sequence Spread Spectrum
DUT             Device-Under-Test
DVFS            Dynamic Voltage and Frequency Scaling
EDA             Electronic Design Automation
EMF             Electromotive Force
FDMA            Frequency Division Multiple Access
HPC             High-Performance Computing
ITRS            International Technology Roadmap for Semiconductor
MAC             Medium Access Control
MAD             Minimum Average Distance
MCM             Multi-Chip Module
MIMD            Multiple Instruction Multiple Data
mm-wave         Millimeter-Wave
MWCNT           Multi-Walled Carbon Nanotube
NoC             Network-on-Chip
OOK             On-Off Keying
PCB             Printed Circuit Boards
RFI             Radio Frequency based Interconnects
SiP             System-in-Package
SNR             Signal-to-Noise Ratio
SW              Small-World
TSV             Through-Silicon-Vias
UWB             Ultra-Wide Band
VCs             Virtual Channels
WDM             Wavelength Division Multiplexing
WI              Wireless Interfaces
WiNoC           Wireless Networks-on-Chip

Chapter 1
INTRODUCTION

1.1 Multichip System

Moore’s law, the primary guiding principle for the chip development, states that the numbers of
transistors on a chip will roughly double in every technology generation. Because of this scaling,
billions of transistors are now packed tightly into each microprocessor. Until recently, dynamic
power was considered as the most significant source of power consumption with technology
scaling, and Dennard’s scaling has helped to control it by reducing supply voltages. Since dynamic
power is proportional to the square of the supply voltage, reducing the voltage decreases dynamic
power consumption significantly. However, with technology scaling, sub-threshold leakage and
gate-oxide leakage increase in an exponential manner which are the main sources of leakage
current. As a result, static power is now starting to dominate total power consumption. This
upsurge in power consumption is not only increasing chip temperature and cooling cost but also
decreases chip reliability and performance. The multicore system has appeared as a feasible
solution to address the power and frequency limitations of the uniprocessor system. According to
Flynn’s taxonomy of parallel computer classification, the multicore system is an example of
Multiple Instruction Multiple Data (MIMD) computing organization where different cores execute
different threads (Multiple Instructions), working on different parts of memory (Multiple Data)
concurrently [1]. As a result, instead of running at a higher clock frequency, multicore systems
improve overall performance by running more tasks in parallel. This, in turn, helps to curb the
rapid growth in power consumption seen in uniprocessors. The Network-on-Chip (NoC) paradigm has
emerged as an enabling methodology for interconnecting hundreds of cores on the same die by
designing separate scalable interconnection fabrics to support high-speed communication between
cores [2] and has captured the attention of both academia and industry. Tilera’s 64-core
TILE64 [3] and Intel’s 80-core Polaris [4] and 48-core Single-chip Cloud Computer [5] are some
examples of such NoC based multicore chips.
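As a brief illustration of the voltage dependence of dynamic power noted earlier in this section, the standard first-order CMOS switching-power model can be written as below. The symbols (activity factor \alpha, switched capacitance C_L, supply voltage V_{DD}, clock frequency f) are not defined in this chapter and are introduced here only as assumptions for the sketch.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(V_{DD}/2)}{P_{\mathrm{dyn}}(V_{DD})} = \frac{1}{4}
\quad \text{(for fixed } \alpha,\ C_L,\ f \text{)}
```

Under this model, halving the supply voltage alone would cut dynamic power to one quarter, which is why supply-voltage reduction was so effective until leakage began to dominate.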
However, demand for performance improvement is still quite high, and according to the
projection from the International Technology Roadmap for Semiconductors (ITRS), it is expected to
grow to 300x by 2022 [6]. This growth will result in the integration of 100x more cores than in the
current state-of-the-art multicore system. Designing such a large multicore chip comes at a
price. A larger chip size usually results in lower yield. For example, let us consider two different
die sizes, one of 20 mm × 20 mm and another of 10 mm × 10 mm. A round wafer with a
diameter of 300 mm can pack 143 and 640 dies of those two sizes, respectively. If we consider
only 20 manufacturing defects due to process and fabrication variations (in reality, there can be
more), the larger die size results in a 14% yield loss, whereas for the smaller die size it is only 3.12%.
Although a smaller die size improves yield and provides finer granularity for binning,
per-die fixed costs, i.e. packaging, assembly, and test, can increase the total combined cost of

Figure 1.1. A typical computing organization in HPC environments i.e. data center/servers

development [7]. It also provides less functionality as a relatively small number of cores can be
packed into that small area. Aggregating multiple moderately smaller dies within a package can
provide the functionality of a large chip and at the same time can provide significant advantages
in terms of higher yield and better packing of rectangular die on a round wafer [8]. Moreover, the
disintegration allows easier reuse by supporting different system sizes. Combining multiple chips
in a single package can be done either by vertically stacking several chips with Through Silicon
Vias (TSVs) i.e. 3D integration [9] or by placing them horizontally on the same substrate within
the package i.e. multichip module [10][11]. Computing modules with the multichip system are
all-pervasive in hardware infrastructures from servers to data centers. A typical computing
organization in a High-Performance Computing (HPC) environment like data centers/clusters is
shown in Figure 1.1. Due to the scaling up of the number of individual computing nodes by several

Figure 1.2. The range of different interconnection architectures.

orders of magnitude in the HPC systems, the interconnection between them has become
increasingly sophisticated. For example, inter-chip interconnections vary from solder bumps or C4
interconnects in the multichip module within a System-in-Package (SiP), spanning 10 cm in range
at one end, to Ethernet used in data center warehouses, spanning about a kilometer at the other, as
shown in Figure 1.2. While intra-chip communication infrastructure is seeing a paradigm shift
from bus-based systems to NoC architectures [2], inter-chip communication also needs to evolve
at a rapid pace to cater to increasing bandwidth demands within the strict power and thermal
envelopes.
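The die-yield comparison given earlier in this section can be reproduced with a short calculation. The sketch below is illustrative only: the dies-per-wafer expression is a commonly used edge-loss approximation that happens to match the quoted counts (it is an assumption that this is the formula behind them), and the yield-loss model assumes each defect ruins exactly one distinct die. The function names are hypothetical.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_w_mm: float, die_h_mm: float) -> int:
    """Approximate gross dies per wafer (assumed edge-loss approximation)."""
    die_area = die_w_mm * die_h_mm
    # Wafer area divided by die area, ignoring the wafer edge.
    gross = math.pi * wafer_diameter_mm ** 2 / (4 * die_area)
    # Correction for partial dies lost along the circular edge.
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area)
    return math.floor(gross - edge_loss)


def yield_loss(defects: int, dies: int) -> float:
    """Fraction of dies lost, assuming each defect ruins one distinct die."""
    return defects / dies


large = dies_per_wafer(300, 20, 20)  # 20 mm x 20 mm dies
small = dies_per_wafer(300, 10, 10)  # 10 mm x 10 mm dies
print("dies per wafer:", large, small)
print("yield loss with 20 defects:", yield_loss(20, large), yield_loss(20, small))
```

With 20 defects, the loss fractions come out close to the 14% and 3.12% figures quoted in the text.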

1.2 Challenges of the Multichip System

The multichip system presents new opportunities to overcome the floorplan restriction, yield, and
scalability limitations of the traditional single-chip multicore system. However, to be widely
accepted, the multichip system requires high bandwidth, low latency, and energy-efficient
communication across the distinct chips and at the same time, needs to operate within a thermal
budget while maintaining high performance. Depending on the integration methodology, the
challenges of designing a multichip system can be different. Volume production and commercial
exploitation of multichip systems will only be feasible after addressing these concerns.
1.2.1 Horizontal Integration of the Multichip System
In environments like data centers or servers, the lower level cache is physically distributed between
all cores. Hence, cache or memory access eventually requires communication between components
in different chips. However, recent trends according to the ITRS predict that the pitch of the I/O
interconnects in ICs is not scaling as fast as the gate lengths or pitch of on-chip interconnects [6]

Figure 1.3. Scaling of I/O pitch and minimum global interconnect pitch [6]
(Plot axes: I/O pitch size (um) and on-chip minimum global interconnect pitch (nm).)

as shown in Figure 1.3. This implies a gap in density and performance of traditional I/O systems
relative to on-chip interconnections. The wiring complexity of both on-chip and off-chip
interconnects exacerbates the problem by presenting design challenges, crosstalk and signal
integrity issues [10]. Additionally, because of different interconnection frameworks for on-chip
and off-chip communication, data from cores located within the chips need to travel to the I/O
blocks, traverse the inter-chip link and then be routed to the final destination inside the target chip.
Besides, switching between protocols is necessary if the off-chip communication protocol is
different from the on-chip one. All these factors reduce the efficiency in terms of energy
consumption as well as latency and bandwidth of the data transfer between cores in a multichip
system. Integrated inter and intra-chip photonic interconnections [12][13] are a promising solution
to the off-chip interconnection challenges of traditional I/Os. However, the pitch of photonic
interconnects does not scale well due to the limitations in the size of silicon photonic devices. Also, the
optical loss of silicon waveguides (typically 3.6 dB/cm) [14] makes routing long inter-chip optical
channels impractical. Thus, to be widely accepted, the horizontally integrated multichip system
requires high bandwidth, low latency, and energy efficient communication across the different
chips.
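To make the impact of the quoted waveguide loss concrete, the following minimal sketch converts the 3.6 dB/cm figure into total loss and remaining optical power. The route lengths are illustrative assumptions (10 cm is the SiP-scale span mentioned earlier), and the function names are hypothetical.

```python
def waveguide_loss_db(length_cm: float, loss_db_per_cm: float = 3.6) -> float:
    """Total propagation loss in dB for a waveguide of the given length."""
    return length_cm * loss_db_per_cm


def remaining_power_fraction(loss_db: float) -> float:
    """Fraction of the launched optical power left after the given loss."""
    return 10 ** (-loss_db / 10)


# Illustrative route lengths in centimeters (assumed, not from the source).
for length_cm in (1, 5, 10):
    loss_db = waveguide_loss_db(length_cm)
    print(length_cm, "cm:", loss_db, "dB,", remaining_power_fraction(loss_db))
```

At 10 cm the accumulated loss is 36 dB, so well under a thousandth of the launched power reaches the receiver, which is consistent with the claim that long inter-chip optical channels are impractical at this loss figure.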
1.2.2 Vertical Integration of the Multichip System
Vertically stacking several dies with TSVs, i.e. 3D integration, is another way to
overcome the physical limitations of the single-chip multiprocessor system [9]. However, utilizing the
third dimension to provide additional device layers poses significant thermal challenges. High
power density is already a major problem in single-chip systems, and stacking vertical layers
increases the power dissipation density and the thermal footprint per unit area substantially. This
fact, combined with the slow lateral diffusion of heat in silicon, creates localized thermal hotspots.
Also, conventional cooling techniques can extract heat only from the top or
bottom layer of the entire 3D stack. Conventional thermal management techniques adopted in a
single planar chip, like Dynamic Voltage and Frequency Scaling (DVFS) or clock/power gating,
sacrifice performance to control the thermal behavior by slowing down or turning off the
processors when a critical temperature threshold is exceeded. On the other hand, task scheduling
or task reallocation based Dynamic Thermal Management (DTM) techniques redistribute existing
processes to available cores based on the current thermal profile of the chip. However, the
effectiveness of such DTM techniques depends on the availability of relatively cooler processors or
cores, which might be difficult to find in 3D integration due to the relatively high power density and
shorter inter-chip distance. Moreover, technology scaling is pushing the limits of affordable cooling, thereby
requiring suitable design techniques to reduce peak temperatures [15]. Sophisticated
cooling mechanisms like liquid cooling through microfluidic channels can provide the cooling
capacities necessary for effective management of chip temperatures in 3D ICs [16][17][18][19].
In liquid cooling, embedded inter-layer microchannels or a cooling chip are inserted in between
layers of the 3D chip, and a coolant fluid (e.g., water or other liquids) is pumped through the
microchannels to extract the heat from the interlayer regions effectively. However, pumping liquid
through the microchannels can cause high pressure drops, compromising the mechanical integrity
of the thin walls between TSVs and microchannels [20]. Lowering the coolant flow rate to reduce
the pressure drop has another disadvantage of higher temperature non-uniformity in the silicon
substrate along the flow length. Moreover, large thermal gradients along the fluid flow direction
inside microchannels can affect the structural reliability of the TSVs by inducing temperature
related expansion and contraction due to a mismatch in Coefficient of Thermal Expansion (CTE)
between copper and silicon. To reduce the mechanical stresses and at the same time, to provide
temperature uniformity and adequate cooling capabilities, the height and width of the
microchannels need to be increased. Several dimensions of microchannels are suggested in
literature ranging from 50 µm to 1000 µm in height and 100 µm to 1000 µm in width depending
on desired pressure drop and cooling capabilities [21][22]. This, in turn, imposes significant
restrictions on where and how many TSVs and microchannels can co-exist. TSVs with an
Aspect Ratio (AR = Height/Diameter) greater than 10 are difficult to manufacture at high yield due to
challenges related to etching, sidewall passivation, and the formation, insulation, and filling of vias
[6], and the co-dependency of the microchannels and electrical design makes the process even more
complex. Wider microchannels occupy a significant portion of the floor area of the vertically
integrated multichip system, severely restricting the freedom of placement and routing of TSV
based links. Moreover, increasing the microchannel height will eventually increase the die
thickness and consequently the height of the TSVs, which in turn will increase the diameter of the
TSVs to maintain a fixed AR. Also, to reduce voltage drop, TSVs used for the power delivery network
require a larger diameter and pitch than the signal TSVs. All these factors restrict the area available
to route TSVs across the cooling layers and make the co-existence and co-design of TSVs and
microchannels challenging, especially when thousands of TSVs are required for interconnections
in large chips with die areas larger than 100 mm² [23]. Contactless interconnects in 3D ICs through
inductive and capacitive coupling based vertical links have been proposed in recent years
[24][25][26]. However, their feasibility across microchannel based cooling layers is unknown.
Moreover, these inductive/capacitive coupling links have larger area overheads compared to TSVs
[26]. In addition, energy per bit of such links increases significantly with the communication
distance [26], making these contactless interconnects inefficient for communication across the
microchannels with a height greater than 50 µm.
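The coupling between microchannel height and TSV diameter described above follows directly from the aspect-ratio limit. The sketch below is a simplified illustration: it assumes the TSV height equals the microchannel height plus a fixed layer of base silicon, where the 50 µm base thickness is a hypothetical value chosen only for the example, and the function name is likewise hypothetical.

```python
def min_tsv_diameter_um(tsv_height_um: float, max_aspect_ratio: float = 10.0) -> float:
    """Smallest TSV diameter that keeps AR = height / diameter within the limit."""
    return tsv_height_um / max_aspect_ratio


# Hypothetical silicon thickness added to the microchannel height (assumption).
BASE_SILICON_UM = 50.0

# Microchannel heights within the 50 um to 1000 um range reported in the text.
for channel_height_um in (50.0, 300.0, 1000.0):
    tsv_height_um = channel_height_um + BASE_SILICON_UM
    print(channel_height_um, tsv_height_um, min_tsv_diameter_um(tsv_height_um))
```

Under these assumptions, taller microchannels force proportionally wider TSVs, which reduces the number of TSVs that fit in the area left between the channels.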

1.3 Possible Solution: Wireless Interconnects

Research in recent years has demonstrated that on-chip and off-chip wireless interconnects are
capable of establishing radio communications within as well as between multiple chips. The
absence of the need for physical layouts makes wireless interconnects stand out from other
emerging interconnects. Wireless data communication links up to 10 m in length with
multi-gigahertz bandwidths in millimeter wave (mm-wave) bands have been fabricated and demonstrated in
[27]. Using such on-chip antennas embedded in the chip, Wireless Networks-on-Chip (WiNoC)
architectures have been proposed [28][29]. These wireless NoCs are shown to improve the energy
efficiency and bandwidth of on-chip data communication in multicore chips [29][30]. On-chip
antennas like Carbon Nanotube (CNT) or graphene-based structures are predicted to provide high
bandwidth wireless communication channels [30][31]. However, integration of these antennas
with standard CMOS processes needs to overcome significant challenges. In contrast, mm-wave
antennas fabricated using top-layer metals are CMOS process compatible, making them suitable
for near-term solutions to the wired interconnect problem [29]. In mm-wave wireless
interconnects, bandwidth is limited by the state-of-the-art transceiver design and on-chip antenna
technology. To improve performance, multiple wireless transceivers need to access the wireless
medium to communicate with other wireless transceivers without interference. Medium access
mechanisms in WiNoCs using mm-wave transceivers range from simple token passing based
protocol to more sophisticated Code Division Multiple Access (CDMA) based mechanisms
[29][32][33][34]. The chosen on-chip antenna has to provide the best power gain for the smallest
area overhead. A metal mm-wave zigzag antenna has been demonstrated to possess these
characteristics, as it is more compact than other antenna structures such as a patch
antenna. Such 60 GHz mm-wave antennas are shown to have a bandwidth of 16 GHz for both
intra-chip and inter-chip [28][35] communication links. It has been noted in many earlier works that
the mm-wave wireless antennas are not directional and hence can be used for broadcast type
transmission over the shared wireless channel. This property gives an additional advantage, as
wireless interconnects can provide a broadcast-capable medium to distribute any kind of control
messages faster and more efficiently.
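The simplest of the medium access mechanisms mentioned above, token passing, can be sketched as follows. This is a minimal illustration and not the protocol of any cited work: it assumes the wireless interfaces form a fixed logical ring, that the token holder may send at most one flit before passing the token on, and that the names used here are hypothetical.

```python
from collections import deque


def token_passing_schedule(queues):
    """Return the (interface_id, flit) order in which flits win the shared channel.

    Assumed policy: interfaces form a logical ring, the token holder sends at
    most one flit, then passes the token to the next interface in the ring.
    """
    pending = [deque(q) for q in queues]
    schedule = []
    holder = 0
    idle_passes = 0
    # Stop once the token has gone all the way around without any transmission.
    while idle_passes < len(pending):
        if pending[holder]:
            schedule.append((holder, pending[holder].popleft()))
            idle_passes = 0
        else:
            idle_passes += 1
        holder = (holder + 1) % len(pending)
    return schedule


# Three wireless interfaces; the second one has nothing to send.
print(token_passing_schedule([["a0", "a1"], [], ["c0"]]))
```

Because only the token holder transmits, the shared wireless channel never carries two transmissions at once, at the cost of idle token hand-offs when interfaces have nothing to send.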
This work proposes to utilize mm-wave wireless interconnects to overcome the challenges
of the multichip system. A few cores inside the chips will be equipped with wireless transceivers,
which will be capable of establishing direct one-hop communication with other such cores in the
same as well as other chips. Depending on the target integration technology, the dissertation
proposes design principles and methodologies to use wireless interconnects for multichip based
systems. The first part proposes design methodologies to use wireless interconnects to provide
high bandwidth, low latency, and energy-efficient communication across the distinct chips for the
horizontally integrated multichip system. The second part addresses the co-design
challenges of TSVs and microchannels in vertically stacked multichip systems by utilizing wireless
interconnects for data communication across microchannel coolers, eliminating the need to place
and route signal TSVs through the cooling layer. Finally, this dissertation performs a
comparative evaluation of horizontally and vertically integrated wireless multichip systems in
terms of performance, energy-efficiency, and temperature and points towards various promising
future directions arising from this research work. The design scope of this dissertation is
encircled in Figure 1.2.

1.4 Summary of Research Objectives

1. Design of a Wireless Interconnection Framework for Seamless Inter and Intra-chip
Communication in Horizontally Integrated Multichip System
This research objective proposes to use wireless interconnects to establish a seamless
communication backbone which enables data exchange between cores in a single chip as well as
between chips in a multichip system with dimensions spanning up to a few tens of centimeters.
The same communication protocols used for on-chip data transfer in the intra-chip NoC will be
utilized for off-chip data as well, eliminating the need for protocol conversion. By deploying the
wireless transceivers in the internal nodes of the chips such that all cores are within a short distance
from their nearest transceivers, energy-efficient inter and intra-chip communication can be
achieved. The design methodologies for such multicore multichip systems will be developed, and a
comparative system-level performance evaluation against traditional I/O based multichip systems will
be completed.
2. Wireless Interconnect as an Enabler for Data Communication across
Microchannel based Cooling Layer in Vertically Integrated Multichip System
The objective of this research goal is to investigate design methodologies and suitable architectures
for vertical integration that realize data communication across the cooling layers with on-chip
wireless interconnects, with the microchannel dimensions chosen for the best trade-offs in
thermal and hydraulic performance. Using wireless interconnects in 3D integration for
data communication across the microchannel-based interlayer will eliminate the need for
accommodating signal TSVs through the cooling chip while providing energy-efficient data
communication. Therefore, the only TSV based links to be placed and routed across the cooling
layers would be the power and clock delivery networks. This integration methodology will
significantly reduce the complexity of the co-design challenge of TSV based interconnects and
microchannels based interlayer cooling.
3. Holistic Comparative Evaluation of the Horizontally Integrated Multichip System
with Vertically Stacked Multichip System
For the first two research goals, depending on the target integration technology, the design
methodologies and architectures will differ. However, both research activities propose to use
wireless interconnect to overcome the limitations of horizontal and vertical integration of the
multichip system. The final research goal aims to perform a holistic comparative evaluation of
these two different integration approaches in terms of performance, energy-efficiency, and
temperature and to provide future directions for design of the multichip system.

1.5 Contributions

In this dissertation, we develop design principles and methodologies to utilize wireless
interconnects for horizontally and vertically integrated multichip systems. The principal
contributions of this dissertation can be summarized as below:
1. Design of a Wireless Interconnection Framework for Seamless Inter and Intra-chip
Communication in Horizontally Integrated Multichip Systems
As part of this objective, this thesis explored the advantages possible if inter-chip communication
in horizontally integrated multichip modules can be realized with state-of-the-art mm-wave
wireless links operating in the 60GHz band. The specific contributions of this research goal are:
o Proposed two different interconnect frameworks to utilize wireless interconnects
for seamless inter and intra-chip communication. These frameworks
extend the NoC to span multiple chips.
o Designed suitable on-chip antennas to establish wireless interconnection in a
multichip system.
o Evaluated the performance of the wireless multichip system and compared it with
several traditional I/O based multichip systems.
o Proposed a methodology to deploy wireless interconnects when the system scales up.
o Performed a comparative evaluation of emerging multichip integration technologies.

2. Wireless Interconnect as an Enabler for Data Communication across
Microchannel based Cooling Layer in Vertically Integrated Multichip System
As part of this research goal, this dissertation accomplished the following tasks:
o Designed 3D wireless NoC architectures for 3D ICs with microchannel based
cooling to eliminate the need for TSVs across the cooling layers.
o Designed suitable on-chip antennas to establish wireless interconnection in
vertically stacked multichip integration.
o Evaluated the performance and thermal characteristics of 3D wireless NoCs
equipped with microchannel cooling layers and compared them with traditional
3D interconnection systems using TSVs.
3. Holistic Comparative Evaluation of the Horizontal and Vertical Integration of
Multichip Systems with Wireless Interconnect
As part of this research goal, this dissertation compared the horizontally integrated wireless
multichip system with the vertically stacked 3D wireless system in terms of interconnect
performance, energy-efficiency, and temperature and provided future directions for the design of
such multichip systems.

1.6 Dissertation Organization

The dissertation is organized into five chapters. Chapter 1 introduces the complexity of the
horizontal and vertical integration of the multichip system and gives an overview of the possible means
of addressing those issues. Chapter 2 discusses the background and summarizes the current state
of knowledge in this field. Chapter 3 presents design methodologies for a seamless, hybrid wired
and wireless interconnection network for horizontally integrated multichip systems with
dimensions spanning up to tens of centimeters with on-chip wireless transceivers. The same
communication protocols used for on-chip data transfer in the intra-chip NoC will be utilized for
off-chip data as well, eliminating the need for protocol conversion. A few cores inside the chips will be
equipped with wireless transceivers, which will be capable of establishing direct one-hop
communication with other such cores in the same as well as other chips. By deploying the wireless
transceivers in the internal nodes of the chips, such that all cores are within a short distance from
their nearest transceivers, energy-efficient inter and intra-chip communication can be achieved.
With system-level simulations, this chapter demonstrates that such a design increases the
bandwidth and reduces the energy consumption in comparison to state-of-the-art wireline I/O
based multichip communication. Chapter 4 proposes to realize the vertical interconnects for data
communication across the cooling layers with on-chip wireless interconnects and presents
energy-efficient wireless 3D NoC architectures designed for optimal dimensions of microchannels for best
thermal cooling capability and pressure drop characteristics. This chapter demonstrates that the
proposed 3D wireless NoC is capable of establishing data communication across the cooling layers
using wireless interconnects with lower energy consumption and that it reduces chip temperatures due to
the interlayer cooling channels. This chapter also presents a holistic comparison of the horizontally
integrated wireless multichip system with the vertically stacked 3D wireless system in
terms of interconnect performance, energy-efficiency, and temperature. Finally, Chapter 5
summarizes the important conclusions and points out the direction of future research.

Chapter 2

BACKGROUND AND STATE-OF-THE-ARTS

The research activities proposed in this dissertation are inspired by and founded upon the current state
of knowledge in three main directions. Here we discuss the most recent activities in these
directions.

2.1 Intra and Inter-chip Interconnection for Horizontally Integrated Multichip System
In the last few years, intra-chip communication infrastructure has seen a paradigm shift from
bus-based systems to Network-on-Chip (NoC) architectures [2]. The NoC paradigm aims to mitigate
global wire delays by designing separate scalable, plug-and-play interconnection fabrics to support
high-speed communication between cores. In this network-centric approach, packetized data is
routed from source to destination through a series of switches and links. Commonly, wormhole
switching is adopted for NoCs, where data packets are broken down into flow control units or flits.
The first flit or the header flit contains routing information that helps to establish a path from the

Figure 2.1. Conceptual diagram of a wireline Mesh based NoC and its multihop communication
nature.

source to destination, and all the other payload or body flits follow that path [36]. The grid-based
mesh topology shown in Figure 2.1 is the most widely used NoC topology, as it is relatively easy to design,
manufacture, and test [3][4][5]. However, as can be seen from Figure 2.1, data transfer between
two distant nodes happens in a multi-hop fashion due to its regular structure. This can cause high
latency and energy dissipation in metal wireline based traditional mesh architecture limiting the
possible performance gain of NoCs. Insertion of long range links using conventional metal wires
[37] or ultralow-latency and low-power express channels between communicating cores [38] have
been proposed in the literature to alleviate this problem. However, with technology scaling, the
gap between the global wire delays and gate delays increases significantly [6] and consequently,
restricts the performance benefits from these approaches. Hence, to enhance the performance of
metal wireline based NoC architectures, a few radically different interconnection technologies such
as photonic interconnects [39], multi-band RF transmission line interconnects [40], or wireless
interconnects [29][30] are currently being explored. The on-chip photonic interconnects are
implemented using on-chip optical waveguides, micro-ring resonators, and laser sources and are
capable of achieving low latency and low power dissipation due to single hop communication
between distant cores. However, the challenges regarding integration of photonic devices, precise
thermal tuning of electro-optic modulators and demodulators, the signal noise due to coupling
between waveguides, and the manufacturing process involving a separate photonic plane [39] need to
be resolved for photonic interconnects. While RF-interconnects (RF-I) [40] are compatible with
CMOS technology, they require long on-chip transmission lines to enable data transmission, which
can lead to routing challenges and significant area overhead. On the other hand, on-chip wireless
interconnects enable energy-efficient, high-bandwidth, and low-latency
communication over long-range paths. Moreover, the absence of the need for physical
interconnection layouts makes wireless interconnects a promising alternative to the performance
limitations seen by long-distance wired links. In addition, such mm-wave antennas fabricated
using top layer metals are CMOS process compatible making them suitable for near-term solutions
to the wired interconnect problem. A more detailed literature survey on wireless interconnects is
presented in Section 2.3.
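The wormhole switching and multi-hop behavior described above can be illustrated with a small sketch. It is a simplified model under stated assumptions: switches are addressed by (x, y) coordinates in a 2D mesh, the hop count is the minimal (Manhattan) distance, and the flit size and tuple layout are hypothetical choices made for the example.

```python
def mesh_hops(src, dst):
    """Minimum number of link traversals between two (x, y) switches in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def to_flits(payload: bytes, dst, flit_bytes: int = 4):
    """Split a packet into a header flit carrying the destination and body flits."""
    header = ("HEAD", dst)
    body = [("BODY", payload[i:i + flit_bytes])
            for i in range(0, len(payload), flit_bytes)]
    return [header] + body


# Corner-to-corner transfer in an 8 x 8 mesh (illustrative size).
print(mesh_hops((0, 0), (7, 7)))
print(to_flits(b"0123456789", dst=(7, 7)))
```

In this model a corner-to-corner transfer in an 8 x 8 mesh needs 14 wired hops, whereas a direct long-range link between the same two switches would need one, which is the motivation for the alternative interconnect technologies surveyed here.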
In a parallel direction, C4 bumps coupled with in-package transmission
lines are conventionally used to interconnect chips within a multichip system [41], as shown in Figure 2.2.
However, signal quality deterioration due to microwave effects, crosstalk coupling effects, signal
reflections, and frequency-dependent line losses in the transmission line limits the number of
concurrent, high-density inter-chip I/Os [10]. This, in turn, restricts the possible off-chip bandwidth.
Moreover, the pitch of chip-to-chip I/O does not scale in the same proportion as on-chip global
wires [6]. This creates a gap between the performance of on-chip interconnections and that of
off-chip communication. Different interconnect technologies such as photonic interconnects [12][13],
inductive or capacitive coupling based interconnects [24][25], and wireless interconnects [42] are
being explored to mitigate the performance issues of conventional I/O based multichip systems. In
[13], onboard integrated intra and inter-chip photonic networks are proposed. In [43] transceivers

Figure 2.2. Conceptual diagram of a multichip system interconnected with C4 bumps and
in-package transmission line.

for 60 GHz inter and intra-chip communications are designed. However, system-level performance
gains are not evaluated in this work. In [44], wirelessly connected multichip modules are proposed
for a High-Performance Computing (HPC) environment.

2.2 Vertically Integrated Multichip System and Microchannel based Cooling

Vertically stacked multichip integration, commonly known as 3D integration, provides
promising solutions to the challenges of footprint, device density, and energy cost of the planar
single-chip multicore system. The problem with metallic interconnect is not severe in 3D
integration because of the smaller footprint. Through-Silicon-Vias (TSVs) are most often used to
realize the interconnections between dies in the multichip system. A TSV is a metallic interconnect
that passes through the silicon substrate to provide high-bandwidth and energy-efficient die-to-die
interconnection due to its short length. However, one of the major issues in the implementation of
3D integration is the excessive heat flux generated by stacking multiple chips, giving rise to an

Figure 2.3. Vertically stacked dies with conventional air-cooled heat sink.

increase in the power generated per unit surface area as well as in the peak temperature [16][18].
Conventional cooling techniques can extract heat only from the top or bottom
of the entire 3D stack, as shown in Figure 2.3. Moreover, the conventional air-cooled heat sink
requires heat flux from the vertically stacked dies to travel through a longer conductive path in
order to dissipate through the heat sink, increasing the overall thermal resistance. For aggressive
cooling of 3D ICs, interlayer liquid cooling methods have been investigated by several researchers.
Tuckerman and Pease [17] first proposed the use of microchannels to cool ICs effectively.
They were able to dissipate 790 W/cm² with a thermal resistance of 0.09
K/W by using microchannels with a height of 302 µm and a width of 50 µm. However, the flow
rate required was 8.6 ml/s, which resulted in a pressure drop of 214 kPa. In [45], the authors decreased
the pressure drop from 25 to 1.01 kPa by increasing the microchannel height from 50 to 300 µm.
They reported that it is possible to achieve low pressure drops (< 10 kPa) and relatively low
substrate temperatures by having flow rates between 2 and 5 ml/s in a 100 µm tall microchannel.
In [46], the authors noted that the fluid temperature increases significantly for microchannel widths
beyond 300 µm under low flow rates. However, increasing the flow rate reduces the fluid
temperature rise. Kandlikar [20] presented a review of the available cooling schemes for 3D IC
stacks and identified that 3D IC cooling is suitable for hotspot management and proper thermal
regulation of the chip components. Comprehensive thermal management techniques are developed
in [47], where a combined approach utilizing dynamic voltage and frequency scaling (DVFS),
temperature-aware task allocation, and liquid cooling is proposed. In [48], the authors analyzed the
impact of liquid cooling on a 3D multi-core processor compared to conventional air cooling
and showed that integrating interlayer cooling improves the lifetime reliability of a chip
significantly by reducing the peak temperature. However, from a thermal viewpoint, to provide
adequate cooling capabilities at a low pressure drop, microchannels are required to be wider and
taller, which in turn complicates the placement and routing of TSVs. Two different approaches
exist in the literature to address the co-design problem. In the microchannel-first approach,
microchannel dimensions are optimized first to get the best cooling capabilities, and then TSVs
are placed in the remaining area [21][22]. In TSV-constrained placement approaches, in contrast,
TSVs are placed first to reduce the average wire length, and then microchannels are employed in
the remaining area [49]. However, in both methods, whatever is placed first occupies a significant
silicon area, leaving less area for the other, and consequently can result in either lower
performance or lower cooling capabilities.
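The trade-off between cooling capability and pressure drop can be made more tangible with the figures quoted above for [17]. The sketch below uses two textbook relations: ideal hydraulic pumping power as pressure drop times volumetric flow rate, and temperature rise as thermal resistance times dissipated power. The 1 cm² heated area is an assumption made here for illustration and is not stated in this section, and the function names are hypothetical.

```python
def pumping_power_w(pressure_drop_kpa: float, flow_ml_per_s: float) -> float:
    """Ideal hydraulic pumping power: pressure drop (Pa) times flow rate (m^3/s)."""
    return (pressure_drop_kpa * 1e3) * (flow_ml_per_s * 1e-6)


def temperature_rise_k(thermal_resistance_k_per_w: float,
                       heat_flux_w_per_cm2: float,
                       area_cm2: float = 1.0) -> float:
    """Temperature rise for a given thermal resistance (area_cm2 is an assumption)."""
    return thermal_resistance_k_per_w * heat_flux_w_per_cm2 * area_cm2


# Figures quoted in the text: 214 kPa at 8.6 ml/s, 0.09 K/W at 790 W/cm^2.
print(pumping_power_w(214, 8.6))
print(temperature_rise_k(0.09, 790))
```

Under these assumptions the pump must deliver roughly 1.8 W of hydraulic power, which hints at why later work targets much lower pressure drops through taller channels.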
Contactless wireless interconnects in 3D ICs through inductive and capacitive coupling
based vertical links have been proposed in recent years [24][25][26]. In inductive coupling,
planar spiral transmitter and receiver inductor pairs are placed on the silicon dies, and a time-varying
current is passed through the transmitter coil to generate magnetic flux, as shown in Figure 2.4 (a).
This coupled magnetic field, in turn, induces an Electromotive Force (EMF) in the receiving coil,
and the receiver coil converts this EMF into an electrical signal. However, these inductive coupling

(a)

(b)

Figure 2.4. (a) Inductive coupling (b) Capacitive coupling based interconnected for vertically
stacked dies [50].

20

links have larger area overheads compared to TSVs [26]. In addition, energy per bit of such links
increases significantly with the communication distance [26]. On the other hand, in the case of
capacitive coupling links, small metal plates are placed on two silicon dies which create parallel
plate capacitance between them as shown in Figure 2.4 (b). However, these dies are required to be
close to each other to create this electrical field. Due to this proximity requirements, the capacitive
coupling based links require the dies to be face-to-face [50]. This limits the number to dies to be
connected with capacitive coupling based interconnects.
The idea of using wireless interconnects with mm-wave on-chip antennas in a 3D NoC was
explored in [51]. However, how it affects the design methodology of a 3D IC with liquid cooling
was not discussed. Unlike capacitive/inductive coupling links, the energy consumption of mm-wave
wireless interconnects does not increase with the distance. This makes mm-wave wireless
interconnects a more viable solution for data communication across microchannel cooling layers
with heights of more than 50 µm. Enabling 3D NoCs with wireless interconnects will eliminate the
need to place and route signal TSVs across the cooling layer for data transfer. This integration
technique will ease the restrictions on the dimensions of the microchannels, reduce the
complexity of co-designing microchannels and TSVs, and make the fabrication of the
cooling layer more flexible.

2.3 Wireless Technology: A Promising Interconnect Paradigm

On-chip wireless interconnect is a promising alternative that overcomes the performance limitations of
long-distance wired links. The absence of the need for physical interconnection layouts makes
wireless interconnects stand out from other emerging interconnects. Moreover, wireless
interconnects are capable of enabling high-bandwidth and low-latency communication over long-range
paths, which is beneficial for inter-chip communication. A comprehensive survey
regarding various WiNoC architectures and their design principles is presented in [29]. Notable
examples include the design of a WiNoC based on CMOS ultra-wideband (UWB) technology [32],
hierarchical mm-wave WiNoC architecture [28], 2D concentrated mesh-based WCube architecture
using sub-THz wireless links [52], and the inter-router wireless scalable express channel for NoC
(iWISE) architecture [33]. In [44], on-chip wireless transceivers are used to facilitate quick
pre-bonding wafer testing enabled by direct access to components under test within the ICs. On-chip
antennas made from graphene or Carbon Nanotube (CNT) based structures are predicted to provide high
bandwidth wireless communication channels [30][31]. However, integration of these antennas
with standard CMOS processes needs to overcome significant challenges. On the other hand,
mm-wave CMOS transceivers operating in the sub-THz frequency range are a more near-term solution.
However, the bandwidth of the mm-wave wireless channels is limited by the state of the art in
transceiver design. The design of multiple non-overlapping channels enabling Frequency Division
Multiple Access (FDMA) is a non-trivial challenge from the perspective of transceiver design and
is not easily scalable. Hence, to efficiently utilize the available bandwidth, several WIs need to
share the wireless bandwidth for data communication. A synchronous and distributed medium
access mechanism is proposed in [32] for the Ultra-Wide Band (UWB) wireless NoC. In [34], a
Code Division Multiple Access (CDMA) based medium access scheme is proposed by utilizing
orthogonal codewords to enable simultaneous wireless transmission through the wireless channel.
In [30], a hybrid medium access scheme combining both Time Division Multiple Access (TDMA)
and FDMA is proposed for WiNoCs based on CNT antennas. A distributed MAC protocol is
proposed in [53]. The proposed mechanism uses simple orthogonally coded request packets,
processes the request packets, and grants permission to use the channel through a priority-based
mechanism. However, this mechanism has the overhead of maintaining the state of the current
transmission at each transceiver. In [54], the authors discussed the performance of ALOHA and
CSMA for graphene-based WiNoCs. A comparative performance evaluation of CSMA and
token-based MAC is presented in [55]. For WiNoCs utilizing mm-wave transceivers, a token passing
based medium access mechanism is used in [29][56].

Chapter 3

DESIGN OF A WIRELESS INTERCONNECTION FRAMEWORK FOR SEAMLESS INTER AND INTRA-CHIP
COMMUNICATION IN HORIZONTALLY INTEGRATED MULTICHIP SYSTEMS
In this chapter, we propose to use wireless interconnects to establish a seamless communication
backbone which enables data exchange between cores in a single chip as well as between chips in
a multichip system with dimensions spanning up to a few tens of centimeters. The same
communication protocols used for on-chip data transfer in the intra-chip NoC will be used for
off-chip data as well, eliminating the need for protocol conversion. A few cores inside the chips will be
equipped with wireless transceivers, which will be capable of establishing direct one-hop
communication with other such cores in the same as well as other chips. By deploying the wireless
transceivers in the internal nodes of the chips, such that all cores are within a short distance from
their nearest transceivers, energy-efficient inter and intra-chip communication can be achieved.
Here, we present the design methodologies for such multicore multichip systems and demonstrate
that the proposed design out-performs traditional wired I/O based multichip systems through
system-level simulations. The specific contributions of this chapter are:
1. Two different interconnect frameworks that utilize wireless interconnects for
seamless inter and intra-chip communication.
2. The design of suitable on-chip antennas to establish wireless interconnection in a multichip
system.
3. A performance evaluation of the wireless multichip system and a comparison with a traditional
I/O based multichip system.
4. A methodology to deploy wireless interconnects when the system scales up.
5. A comparative evaluation with respect to emerging multichip integration technologies.

3.1 Wireless Interconnection Framework for Multichip Systems

The interconnection fabric of the proposed multichip system with wireless interconnects is a hybrid
network with both wired and wireless links. Each core in all the multicore chips is connected with
a NoC switch. The switches within a single chip are interconnected in an intra-chip NoC
architecture. Certain switches in the NoC are equipped with Wireless Interfaces (WIs) to realize
the inter-chip communications. These switches can directly communicate with their counterparts
in the other chips. Figure 3.1 shows the conceptual architecture of the multichip system
interconnected with inter and intra-chip wireless network.
3.1.1 Topology
In the proposed wireless interconnection framework, cores within each chip are interconnected
using an intra-chip NoC. We discuss the interconnection architectures for the multichip systems
with two different intra-chip NoC topologies as case studies to exhibit their role in the overall
system. The two chosen intra-chip NoC topologies are Mesh and Small-World.
The Mesh is selected as it is a conventional NoC topology used in several multicore-based products
[4][5][3] and is relatively easy to design, verify, and manufacture. The Small-World topology is
chosen as it is suitable for designing wireless NoCs, as noted in [57], and is demonstrated to outperform
the Mesh-based NoC [37]. The multichip systems with the two chosen intra-chip NoCs are
described below.
A. Multichip System with Intra-chip Mesh

In the first multichip interconnection framework, the intra-chip interconnection topology is a
traditional Mesh-based NoC. For inter-chip communication, traditional chip I/O is connected at
the periphery of the chip to one of the corner switches. This requires inter-chip data between cores
embedded inside the chips to travel to the periphery and then be transferred over the I/O, resulting in
high latency and high energy consumption. To alleviate this problem, we equip NoC switches
associated with cores embedded within the chip with WIs. To deploy the WIs, the intra-chip mesh
NoC in each chip is further subdivided into a certain number of logical subnets. The WIs are
deployed in a switch at the center of each subnet, as shown in Figure 3.1, to avoid long multi-hop
paths from the cores in its subnet. This WI deployment strategy corresponds to the approach that
achieves Minimum Average Distance (MAD) between all switches in an intra-chip NoC in [58].
This improves the connectivity of the entire multichip system by establishing direct wireless links
between internal switches eliminating the need to travel to and from the periphery of the source
and destination chips respectively to access the traditional I/O modules.
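The effect of placing WIs at subnet centers can be illustrated with a minimal sketch. The subnet count (four 4x4 subnets in an 8x8 mesh), the tie-break used when a subnet has no unique central switch, and the function names are assumptions made for illustration; they are not taken from the dissertation.

```python
from itertools import product

def subnet_centers(mesh_dim, subnets_per_side):
    """Return the (row, col) of the central switch of each square subnet."""
    if mesh_dim % subnets_per_side != 0:
        raise ValueError("mesh_dim must be divisible by subnets_per_side")
    side = mesh_dim // subnets_per_side
    offset = (side - 1) // 2  # upper-left of the central switches when side is even
    return [(sr * side + offset, sc * side + offset)
            for sr, sc in product(range(subnets_per_side), repeat=2)]

def mean_hops_to_nearest_wi(mesh_dim, wis):
    """Average Manhattan (hop) distance from every switch to its nearest WI."""
    total = 0
    for r, c in product(range(mesh_dim), repeat=2):
        total += min(abs(r - wr) + abs(c - wc) for wr, wc in wis)
    return total / (mesh_dim * mesh_dim)

# Hypothetical configuration: 64-core chip as an 8x8 mesh with four 4x4 subnets.
centers = subnet_centers(8, 2)
corner = [(0, 0)]  # a single corner switch, as with peripheral I/O
print(centers)
print(mean_hops_to_nearest_wi(8, centers), mean_hops_to_nearest_wi(8, corner))
```

For this assumed configuration, the average distance to the nearest access point is 2.0 hops with subnet-center WIs, compared with 7.0 hops when all inter-chip traffic must reach a single corner switch.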
B. Multichip System with Intra-chip Small-World

Insertion of bypass paths or long-range shortcuts realized with metal interconnects is shown to
improve the performance of a traditional Mesh-based NoC [37]. Small-World networks are a type
of complex network often found in nature that is characterized by both short-distance and long-range
links. This improves the efficiency of the network, as such networks have a very low average number
of hops between nodes even for very large network sizes. Hence, such network topologies are
suitable for designing scalable, hybrid intra and inter-chip interconnection networks using wireless
links, as shown in [56][57].
To establish the wireline links within each intra-chip NoC while satisfying the properties
of Small-World graphs, we generate the wireline topology according to the following inverse
power law to minimize wiring costs [59]

$$P(i,j) = \frac{l_{ij}^{-\alpha} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} l_{ij}^{-\alpha} f_{ij}} \tag{3.1}$$

where P(i,j) is the probability of establishing a link between two switches i and j, l_ij is the
Manhattan distance between them, f_ij is the frequency of communication between switches i and j,
and n is the total number of switches. As can be seen from (3.1), the probability of inserting a link
between two switches i and j separated by a distance l_ij is proportional to that distance raised to a
finite negative power. The value of α is chosen such that optimal wiring costs [59] are obtained. The
distance is obtained by considering a tile-based floor plan of the cores on the die. The frequency of traffic
interaction between the cores, f_ij, is also factored into (3.1), so that more frequently communicating
cores have a higher probability of having a direct link, optimizing the topology for application-specific
traffic. This power-law based link distribution results in both short-distance connections
and long-range links due to the non-zero probability of links between far-away nodes. The total
number of these wireline links is kept the same as that in a mesh of the same size to ensure that no
undue advantage is granted to the small-world architecture due to additional links. Also, an upper
bound of 7 is imposed on the number of links attached to a particular switch so that no
switch becomes unrealistically large [57]. The link setup method is repeated until no core or group
of cores is left unconnected. In this way, the intra-chip wireline small-world NoC topology is
created. In addition to these wireline links, wireless transceivers are deployed to form the WIs
at the same switches as in the Mesh-based intra-chip NoC. This ensures that the
same overlaid inter-chip wireless interconnect topology is formed in the mesh and small-world based
multichip systems.

Figure 3.1. Conceptual diagram of the wireless multichip system.
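The link-generation procedure based on (3.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the value alpha = 2.0 is a placeholder (the text does not give the chosen value), traffic is taken as uniform when no f_ij matrix is supplied, the normalization runs over unordered pairs i < j rather than the double sum in (3.1) (which changes only the constant), and "repeating the link setup" is interpreted as resampling the whole topology until it is connected. All function names are hypothetical.

```python
import random
from itertools import combinations

MAX_DEGREE = 7  # upper bound on links per switch, as stated in the text

def link_probabilities(coords, freq, alpha):
    """Evaluate the power law of (3.1) for every unordered switch pair (i < j)."""
    weights = {}
    for i, j in combinations(range(len(coords)), 2):
        l_ij = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
        weights[(i, j)] = l_ij ** (-alpha) * freq[i][j]
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def sample_links(probs, num_links, rng, max_tries=100_000):
    """Draw distinct links according to probs while respecting MAX_DEGREE."""
    pairs = list(probs)
    weights = [probs[p] for p in pairs]
    links, degree = set(), {}
    tries = 0
    while len(links) < num_links:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place the requested number of links")
        i, j = rng.choices(pairs, weights=weights, k=1)[0]
        if (i, j) in links or max(degree.get(i, 0), degree.get(j, 0)) >= MAX_DEGREE:
            continue
        links.add((i, j))
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return links

def is_connected(n, links):
    """Depth-first search from switch 0 over the undirected link set."""
    adj = {v: [] for v in range(n)}
    for i, j in links:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == n

def small_world_topology(dim, alpha, freq=None, seed=0, max_attempts=1000):
    """Generate a connected wireline topology on a dim x dim tile floor plan."""
    coords = [(r, c) for r in range(dim) for c in range(dim)]
    n = len(coords)
    if freq is None:
        freq = [[1.0] * n for _ in range(n)]  # assumption: uniform traffic
    probs = link_probabilities(coords, freq, alpha)
    num_links = 2 * dim * (dim - 1)  # same link count as a dim x dim mesh
    rng = random.Random(seed)
    for _ in range(max_attempts):
        links = sample_links(probs, num_links, rng)
        if is_connected(n, links):
            return links
    raise RuntimeError("no connected topology found")

links = small_world_topology(dim=4, alpha=2.0)
print(len(links), "links")
```

Keeping the link count equal to that of the mesh follows the fairness constraint stated above; an application-specific f_ij matrix can be passed through the freq argument.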
3.1.2 Physical Layer
The main enabling technology for such inter and intra-chip wireless interconnection is the physical
layer design comprising the transceiver circuits and antennas. We envision a multichip system
where wireless interconnects will enable seamless intra and inter-chip communication. On-chip
communication will happen over the hybrid wireline and wireless NoC. Wireline links are realized
with traditional global-wire based interconnects depending on the specific topology adopted.
Several alternative technologies exist for realizing on-chip and off-chip wireless
interconnections [29][30][31][32][35]. We envision the use of on-chip embedded miniature antennas
operating in the 60 GHz mm-wave band that can be fabricated within the chip to establish direct
communication channels between internal switches of the chips. To realize such wireless channels,
we choose on-chip metal zig-zag antennas which have been shown to be effective in establishing
both on-chip and off-chip communication [27]. The chosen on-chip antenna has to provide the best
power gain for the smallest area overhead. Several on-chip antenna designs in the mm-wave bands
have been investigated [29][30][31][32][35]. A linear dipole occupies a large area proportional to
the wavelength of the carrier frequency. A patch antenna is directional, mostly radiating
perpendicular to its plane. A log-periodic antenna is highly directional [60][61]. We intend the
chosen antenna to be compact as well as non-directional. This is because we want to communicate
between antennas that are located in different chips and potentially at different angles with
respect to each other's axes. A metal mm-wave zig-zag antenna has been demonstrated to possess
these characteristics, as it is more compact than a linear dipole due to the zig-zag
folding of the arms. Also, such mm-wave antennas fabricated using top-layer metals are CMOS
process compatible, making them suitable as a near-term solution to the wired interconnect
problem [29]. Such 60 GHz mm-wave antennas are shown to have a bandwidth of 16 GHz for both
intra-chip [28] and inter-chip [35] communication links. We have designed mm-wave zig-zag on-chip
antennas to resonate at 60 GHz and studied their characteristics in terms of return
loss and path loss in a multichip system. A coplanar feed structure is chosen for the antenna as it
has lower losses than other feed structures such as microstrips. These antennas are also
shown to be non-directional. This enables each WI to communicate with any other WI in the system,
making the wireless medium a shared channel. Figure 3.2 shows the specific dimensions of the
antenna and its coplanar feed structure. A trace width of 5 µm is used for all arms of the antenna.

Figure 3.2. Specific dimensions of the antenna and feed structure.

The WI transceiver circuitry has to provide a very wide bandwidth as well as low power
consumption to ensure high throughput and energy efficiency. Hence, we adopt the transceiver
design from [28] where low power design considerations are taken into account at the architecture
level. Non-coherent On-Off Keying (OOK) modulation is chosen, as it allows relatively simple
and low-power circuit implementation.
3.1.3 Flow Control and Routing
The routing protocol for the proposed multichip system is a seamless intra and inter-chip data
communication mechanism. We adopt wormhole switching for both inter and intra-chip data
where data packets are broken down into flow control units or flits [36]. Wormhole switching is
known to reduce the buffering requirements at the switches because, unlike in packet switching, whole
packets are not stored and forwarded. This makes the on-chip NoC switches consume less power
and occupy less area. All switches have bidirectional ports for all links attached to them. All cores
in the system have unique addresses. As the overall system is not a regular network, we adopt the
shortest path routing to optimize network performance. For the wireless links, we adopt the same
wormhole switching with simple modifications to enable the energy-efficient token-based
sleep/awake transceiver modes of operation as discussed in the next subsection.
We use a forwarding table based routing over pre-computed shortest paths determined by
Dijkstra’s algorithm. Dijkstra’s algorithm extracts a minimum spanning tree, which provides the
shortest path between any pair of nodes in a graph. The exact minimum spanning tree depends on
the chosen start node for the algorithm but the length of paths between any particular pair, along
the tree does not rely on the start node. Hence, it is chosen randomly from among all the switches
in the system. However, for a specific start node, the shortest path along the extracted tree is always
unique as the minimum spanning tree eliminates loops inherently. Consequently, deadlock is
avoided by transferring flits along the shortest path routing tree extracted by Dijkstra’s algorithm,
as it is inherently free of cyclic dependencies. As a result of using shortest path routing, the wireless
links can also be used for intra-chip communication if they reduce the path lengths compared to a
complete wireline path. Each switch only forwards the header flits to the next switch in the path
to the final destination. The body flits simply follow the path laid out by the header according to
the adopted wormhole switching protocol. Hence, each switch only has local forwarding
information eliminating the need for maintaining non-scalable global routing information.
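Forwarding-table routing over pre-computed shortest paths can be illustrated with a short sketch. It runs Dijkstra's algorithm once per destination so that every table entry lies on a shortest path. This is an assumption of the sketch: it does not attempt to reproduce the single-tree construction or the deadlock-freedom argument described above. The example topology, the unit link costs, and all names are hypothetical.

```python
import heapq

def dijkstra_next_hops(adj, dest):
    """Run Dijkstra from dest; return {node: next hop on a shortest path to dest}."""
    dist = {dest: 0}
    next_hop = {}
    heap = [(0, dest)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj[u]:
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                next_hop[v] = u  # links are bidirectional, so v reaches dest via u
                heapq.heappush(heap, (nd, v))
    return next_hop

def build_forwarding_tables(adj):
    """Each switch stores only the next hop for every destination."""
    tables = {node: {} for node in adj}
    for dest in adj:
        for node, hop in dijkstra_next_hops(adj, dest).items():
            tables[node][dest] = hop
    return tables

def route(tables, src, dst):
    """Follow the per-switch tables, as a header flit would."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[path[-1]][dst])
    return path

# Hypothetical 8-switch chain with one wireless shortcut between switches 1 and 6.
adj = {i: [] for i in range(8)}

def add_link(u, v, cost=1):
    adj[u].append((v, cost))
    adj[v].append((u, cost))

for i in range(7):
    add_link(i, i + 1)
add_link(1, 6)  # wireless link, counted as a single hop (assumption)

tables = build_forwarding_tables(adj)
print(route(tables, 0, 7))  # [0, 1, 6, 7]
```

In this example the wireless link shortens the path from switch 0 to switch 7 from seven wireline hops to three, which is the situation in which a wireless link would also be used for intra-chip traffic.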

3.1.4 Wireless Communication Protocol
In mm-wave interconnects, wireless bandwidth is limited by the state-of-the-art transceiver design
and on-chip antenna technology. To improve performance, multiple wireless transceivers need to
access the wireless medium to communicate via the energy-efficient wireless interconnects.
Consequently, multiple transceivers share a single wireless frequency channel. Therefore, an
efficient and collision-free Medium Access Control (MAC) mechanism is needed.
A. Wireless Medium Access Control (MAC) Scheme

Several MAC protocols have been investigated in the context of wireless NoCs. To enable
Frequency Division Multiple Access (FDMA) in the mm-wave bands, transceivers tuned to
multiple carrier frequencies need to be designed. Power-efficient design of such transceivers is a
non-trivial challenge. The system-level performance of Code Division Multiple Access (CDMA)
based on-chip and off-chip wireless interconnection architectures has been evaluated in [34][42].
However, such CDMA schemes require precise synchronization between the transceivers to avoid
inter-channel interference by preserving the orthogonality of the code channels. Such a
synchronization is difficult to achieve in transceivers distributed across multiple chips. Similarly,
synchronized classical Time Division Multiple Access (TDMA) is difficult to adopt in a multichip
system for the same reason. Therefore, Asynchronous TDMA (A-TDMA) based on token passing
[29] or Carrier Sense Multiple Access/Collision Detection (CSMA/CD) [62] has been proposed.
However, CSMA-based A-TDMA does not perform well under high traffic density
due to exponential back-off [63][64]. A token-based medium access mechanism is proposed in
[29] for WiNoCs to access the wireless channel in a distributed fashion while avoiding collisions.
Hence, in this work, we adopt a similar token-based medium access mechanism for the multichip
systems using wireless interconnection. In a token-based medium access mechanism, access
to the wireless medium is granted by the possession of a token. Only the WI possessing the token
can transmit via the wireless medium. No separate request mechanism or priority is considered as
part of the token passing scheme, which avoids the need for a central grant or arbitration unit and
enables a distributed access mechanism.
To enable autonomous token passing among the WIs with fairness in accessing the wireless
medium, the WIs are numbered sequentially in a virtual token ring. The token circulates
autonomously between the WIs as a wireless flit in a round-robin fashion. Each WI holds the token
for a variable number of time slots (i.e., the token possession period), where one time slot is the same
as one system clock cycle. The WI currently possessing the token passes it to the next WI in the
virtual token ring when it does not have any more packets to send or when the maximum token
possession period expires. The maximum token possession period is given by

$$T_{max} = (n \times \eta_{flit} + 1)\, t_{flit} \tag{3.2}$$

where η_flit is the number of flits in a packet (packet size), t_flit is the time (number of cycles)
required to transmit a single flit over the wireless medium, and n is the number of Virtual Channels
(VCs) in the wireless port of the WI. This method allows the adoption of wormhole switching in
the wireless links. The buffer depth of the VCs in the wireless ports needs to be the same as the
maximum packet size (in the case of variable packet size) to hold entire packets before they can
be transmitted via the wireless channel, so that the modified wormhole switching can be adopted in
the wireless links for seamless communication. This also helps in increasing the energy efficiency by using
power gating of the wireless transceivers as discussed in the next subsection. From (3.2), the
maximum token possession period is the time required to send all the packets in the n VCs of the WI
as well as the token as a wireless flit. The architecture of the WI to enable the token-based medium
access is shown in Figure 3.3. The Token Unit is the main logical unit responsible for managing
the token passing mechanism. The Token Unit contains three registers: IDself, IDnext, and
HasToken. IDself and IDnext store the address of the WI itself and the address of the next WI
in the round-robin circulation of the token, respectively. HasToken indicates the presence of the token in
the WI. The Token Unit also contains a token possession period counter. When a token flit with a
destination address set to IDself is received at a WI, the Token Unit sets HasToken and triggers
the token possession period counter. On the other hand, when the token possession period counter
expires, indicating the end of the token possession period, a token flit containing the fields TokenID,
NextWI, and PrevWI is constructed. The WI currently possessing the token then transmits it over
the wireless medium. The field TokenID is an identifier that differentiates the token flit from a data flit.
IDnext and IDself are used to set the fields NextWI and PrevWI, respectively. Although the token
is circulated among the WIs in a round-robin fashion, all these fields are necessary for the token to
enable a distributed token passing mechanism without relying on synchronization between the WIs
distributed in different chips.

Figure 3.3. Block diagram of mm-wave token-based wireless interface.
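The behavior of the Token Unit and the bound in (3.2) can be sketched in a few lines. This is a simplified behavioral model rather than the RTL design: the TOKEN_ID value is a placeholder, the time taken to transmit the token flit itself is ignored, and the example values (4 VCs, 64-flit packets, 5 cycles per flit) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

TOKEN_ID = 0xFF  # hypothetical identifier marking a flit as a token

def max_token_possession(n_vcs, flits_per_packet, t_flit):
    """T_max from (3.2), in clock cycles."""
    return (n_vcs * flits_per_packet + 1) * t_flit

@dataclass
class TokenFlit:
    token_id: int
    next_wi: int
    prev_wi: int

class TokenUnit:
    def __init__(self, id_self, id_next, t_max):
        self.id_self = id_self    # IDself register
        self.id_next = id_next    # IDnext register
        self.has_token = False    # HasToken register
        self.t_max = t_max
        self.counter = 0          # token possession period counter

    def receive(self, flit):
        """Accept the token only if it is addressed to this WI."""
        if flit.token_id == TOKEN_ID and flit.next_wi == self.id_self:
            self.has_token = True
            self.counter = self.t_max

    def tick(self, has_packets_to_send):
        """Advance one cycle; return a TokenFlit when the token is released."""
        if not self.has_token:
            return None
        self.counter -= 1
        if self.counter <= 0 or not has_packets_to_send:
            self.has_token = False
            return TokenFlit(TOKEN_ID, self.id_next, self.id_self)
        return None

t_max = max_token_possession(n_vcs=4, flits_per_packet=64, t_flit=5)
ring = [TokenUnit(i, (i + 1) % 3, t_max) for i in range(3)]
ring[0].has_token, ring[0].counter = True, t_max

order = []
for _ in range(6):
    for wi in ring:
        token = wi.tick(has_packets_to_send=False)  # every WI is idle here
        if token is not None:
            order.append(token.prev_wi)
            for other in ring:
                other.receive(token)  # shared medium: every WI hears the token
            break
print(t_max, order)
```

With these assumed values, (3.2) gives (4 × 64 + 1) × 5 = 1285 cycles, and with idle WIs the token is released in the order 0, 1, 2, 0, 1, 2. Because every WI filters the token on its NextWI field, no central arbiter is needed.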
B. Improving Energy-Efficiency of the Wireless Interconnect Using MAC

The energy efficiency of the wireless interconnects can be further increased by using sleep
transistor based power-gated transceivers instead of keeping them always awake. In [65][66], such
a sleepy transceiver is implemented where control signals to turn it on/off are sent over
specialized low-latency wired global lines. However, in a multichip environment where WIs
are distributed across different chips, communicating the sleep/awake control signals using wired
lines is challenging as it would require additional pin and I/O overhead. Therefore, to enable a
power-efficient sleep mode in the transceivers when they are not in use, we modify the wireless
communication protocol. Each header flit to be sent across the wireless medium from a
transmitting WI contains the number of flits in the packet and the address of the destination WI.
All other WIs, which are not the destination of that particular header, will receive the header due
to the broadcast nature of the non-directional antennas. On decoding the number of flits contained
in that packet, all WIs except the source and destination WIs will go to sleep for the duration of
the packet transmission. They will wake up after this duration to receive the next header (if there are
other packets to send from other VCs) or the token (if the token is being passed) and react accordingly.
The flit type field in the header, body, or token flits will enable this feature. As the VCs contain
entire packets as noted in the previous subsection, only full packets will be transmitted together
over the wireless medium. Therefore, a new header or a token flit is transmitted and received by
all WIs when they wake up. In this mechanism, wormhole switching is modified as flits of a packet
are not transmitted over the wireless channel unless the whole packet is available for continuous
transmission in the VCs.
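The sleep decision taken by each WI on decoding a header can be sketched as below. The field names, the convention that the flit count includes the header, and the 5-cycle flit time in the example are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class HeaderFlit:
    dst_wi: int     # address of the destination WI
    num_flits: int  # flits in the packet, header included (assumption)

def sleep_cycles(my_id, header, t_flit, transmitting=False):
    """Cycles for which a WI may power-gate its transceiver after a header."""
    if transmitting or my_id == header.dst_wi:
        return 0  # source and destination stay awake for the whole packet
    # The header has already been received; sleep through the remaining flits.
    return (header.num_flits - 1) * t_flit

header = HeaderFlit(dst_wi=2, num_flits=64)
print(sleep_cycles(1, header, t_flit=5))  # bystander WI
print(sleep_cycles(2, header, t_flit=5))  # destination WI
```

With these assumed values, a bystander WI can sleep for 63 × 5 = 315 cycles, after which it wakes up in time for the next header or token flit.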

3.2 Experimental Results

In this section, we evaluate the performance and energy efficiency of the wireless multichip
systems. We compare the wireless interconnect based multichip system with both conventional
I/O based and alternative emerging interconnection based multichip systems. The chip-to-chip I/O is
adopted from [67] and is shown to have a bandwidth of 15 Gbps and an energy consumption of 5
pJ/bit. On the other hand, the delay and energy dissipation of the intra-chip wireline links are
obtained through Cadence simulations taking into account the specific length of each link based
on the established topology in the 20 mm × 20 mm dies. The wireless transceiver adopted from
[28][65] is designed and simulated using the TSMC 65-nm CMOS process and is shown to
dissipate 2.31 pJ/bit while sustaining a data rate of 16 Gbps with a Bit-Error Rate (BER) of less than
10^-15 and occupying an area of 0.3 mm². The network switches and the Token Unit are
synthesized from an RTL-level design using 65-nm standard cell libraries from CMP [68], using
Synopsys.
3.2.1 Simulation Platform
The delay and power dissipation including both dynamic and static power consumption of the
digital components are incorporated in an in-house cycle accurate simulator to evaluate the
performance and energy efficiency of different multichip systems. This cycle-accurate simulator
is written in MATLAB. This simulator takes network topology, flit injection rates, traffic
information, the number of switches, simulation cycles, and message length as input, processes
the messages at flit level, implements routing policies and flow control, and outputs performance
metrics i.e. peak bandwidth, packet energy, and latency. The simulator characterizes the multichip
architecture and models the progress of the flits over the switches and links per cycle accounting
for those flits that reach the destination as well as those that are stalled. Each core is considered to
be connected to a three-stage pipeline network switch [69]. The three stages correspond to input
buffering, routing/arbitration, and output buffering operations, respectively. Each switch can have
a number of VCs, which can be set by the user. The user can also configure the buffer depth of
each VC (in terms of number of flits it can store). In our experiments, ten thousand iterations were
performed eliminating transients in the first thousand iterations. The switches are connected to
other switches according to the topology. The conventional I/O is modeled as a high-speed serial
I/O port [67]. Similarly, the WI is also modeled as a port connected to the network switches where
they are deployed. We consider each input and output port of a switch to have 4 VCs with a buffer
depth of 2 flits for all the architectures considered in this work. To avoid an excessive number of
packets being stalled while waiting for the token, the ports associated with the WIs have an increased
buffer depth of 8 flits. We consider a representative packet size of 64 flits with a flit size of 32 bits
in our experiments unless otherwise mentioned. All the digital components are driven by a 2.5
GHz clock and a 1 V power supply, which are the nominal frequency and voltage for the 65-nm
technology node. In the Mesh-based NoCs, all wired links are considered to be single-cycle links,
whereas the long intra-chip wireline links in the Small-World architectures are pipelined by
inserting FIFO buffers such that an entire flit can be transferred between any two stages
in 1 clock cycle.
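The link parameters quoted in this section imply some simple per-flit and per-packet figures, worked out in the sketch below. The numbers are taken from the text (32-bit flits, 64-flit packets, 2.5 GHz clock, 16 Gbps at 2.31 pJ/bit for the wireless link, 15 Gbps at 5 pJ/bit for the wired I/O). Treating a link traversal as a pure serialization delay, and counting only the inter-chip link energy, are simplifying assumptions of this sketch rather than a description of the simulator.

```python
FLIT_BITS = 32
PACKET_FLITS = 64
CLOCK_HZ = 2.5e9

WIRELESS = {"rate_bps": 16e9, "energy_pj_per_bit": 2.31}
WIRED_IO = {"rate_bps": 15e9, "energy_pj_per_bit": 5.0}

def cycles_per_flit(rate_bps):
    """Clock cycles needed to serialize one flit over a link of the given rate."""
    return FLIT_BITS / rate_bps * CLOCK_HZ

def packet_link_energy_pj(energy_pj_per_bit):
    """Energy to move one full packet across a single inter-chip link."""
    return FLIT_BITS * PACKET_FLITS * energy_pj_per_bit

for name, link in (("wireless", WIRELESS), ("wired I/O", WIRED_IO)):
    print(name,
          round(cycles_per_flit(link["rate_bps"]), 2), "cycles/flit,",
          round(packet_link_energy_pj(link["energy_pj_per_bit"]), 2), "pJ/packet")
```

Under these assumptions, a flit takes 5 cycles on the wireless link and about 5.33 cycles on the wired I/O, and a 2048-bit packet costs about 4.73 nJ on the wireless link compared with 10.24 nJ on the wired I/O, excluding switch and wireline energy.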

3.2.2 Wireless Channel Characteristics and Wireless Link Budget Analysis
In this subsection, we present a link budget analysis to determine the transmitted power that is
required to achieve an acceptable BER on the intra and inter-chip wireless links. Figure 3.4 shows
the 4-chip system that we have used to design the on-chip antennas required for inter and intra-chip
communication. As seen in Figure 3.4 (a), the individual chips are 20 mm × 20 mm and are
separated from each other by 10 mm. Figure 3.4 (b) shows the side view of the multichip system
with the various layers and materials considered in our evaluation model. We have considered the
chips to be housed on a substrate of FR4 epoxy material, which is typically the material for Printed
Circuit Boards (PCBs) [70]. The individual chips are considered to be packaged in a dielectric
material called RXP4 [71], which allows electromagnetic wave propagation to enable the inter-chip
wireless communication. The antennas are considered to be embedded in a 2 µm layer of
silicon dioxide (silica) over a 633 µm thick silicon substrate of the chips.

Figure 3.4. (a) Top view of the model (b) Side view of the model

The transmitted power, Pt in dBm on the wireless channels is given by the following
equation:
$$P_t = \mathrm{SNR} + PL + N_f \tag{3.3}$$

Where SNR is the signal to noise ratio at the receiver in dB, PL is the path loss in dB and Nf is the
receiver noise floor in dBm. The relationship between Signal-to-Noise Ratio (SNR) and BER for
non-coherent OOK modulation [72] is given by:
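$$\mathrm{BER} = \frac{1}{2}\left(1 - Q\left(\sqrt{2\,\mathrm{SNR}},\ \sqrt{2 + \frac{\mathrm{SNR}}{2}}\right) + \exp\left(-\frac{2 + \frac{\mathrm{SNR}}{2}}{2}\right)\right) \tag{3.4}$$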

where Q(.) is the Marcum-Q function. An SNR of 15 dB results in a BER of less than 10^-15 for the
OOK modulation scheme adopted here. A BER of 10^-15 is comparable to wireline data transfer in
current technologies. Hence, we consider a required SNR of 15 dB in our link-budget analysis.
Figure 3.5 (a), (b), and (c) show the radiation pattern, return loss, and worst-case path loss,
respectively, of the designed mm-wave antennas in a 4-chip system, which we use for system-level
analysis in this work. The characteristics of the antennas are simulated using HFSS [73]. The
return loss shows that the antennas are tuned to resonate at 60 GHz. The antennas were designed
and tuned by Rounak Singh Narde under the supervision of Dr. Jayanti Venkataraman [74].

Figure 3.5. (a) Radiation pattern, (b) return loss, and (c) worst-case path loss for the wireless
multichip system

The worst-case path loss, PL, between the two antennas that are farthest apart in the 4-chip
system, as illustrated in Figure 3.4 (a), is 34.9 dB. The noise floor of the receiver is -59 dBm [75].
Consequently, the output power of the transmitter is -19.43 dBm in the worst case. The power
consumption of the transceivers, which are capable of generating this transmitted power as shown
in [28], is considered in the following sections for system-level performance evaluation. We have
observed that the return loss of the antennas is 0 dB at low frequencies between 0 and 10 GHz. This
eliminates the possibility of interference with digital signals in the ICs due to their non-overlapping
operational bands.
3.2.3 Comparative Performance Evaluation

We evaluate the wireless multichip systems in terms of bandwidth per core and energy efficiency
and compare them with several wireline I/O based multichip systems. The bandwidth per core is
measured as the peak sustainable data rate in number of bits successfully routed per core per second
at network saturation. The energy efficiency is measured as the packet energy, defined as the
average energy (i.e., both switch and link energy) required to route an entire packet from source to
destination successfully. It is computed as the sum of the energy dissipation of all the components
in the multichip system, such as switches and links, divided by the total number of successfully
routed packets. For the multichip systems with wireline I/O, whenever a flit traverses an inter-chip
link, the energy dissipated by the I/O is added to the total sum. Conversely, in the wireless
multichip systems, the wireless transceivers are always active, and hence their energy dissipation
is added to the total energy dissipation.
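The two metrics defined above can be written down compactly. The following Python sketch is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the individual energy terms are assumed to have already been accumulated by a simulator over the measurement window.

```python
def bandwidth_per_core(bits_delivered, num_cores, seconds):
    """Bits successfully routed per core per second (measured at saturation)."""
    if num_cores <= 0 or seconds <= 0:
        raise ValueError("num_cores and seconds must be positive")
    return bits_delivered / num_cores / seconds


def packet_energy(switch_nj, link_nj, io_nj, wireless_nj, packets_delivered):
    """Average energy per successfully routed packet, in nJ.

    io_nj:       energy charged each time a flit crosses a wireline I/O link
                 (zero for the wireless systems).
    wireless_nj: energy of the always-active wireless transceivers over the
                 measurement window (zero for the wireline systems).
    """
    if packets_delivered <= 0:
        raise ValueError("packets_delivered must be positive")
    total_nj = switch_nj + link_nj + io_nj + wireless_nj
    return total_nj / packets_delivered
```

Keeping the I/O and wireless terms separate mirrors the accounting in the text: the wireline I/O energy scales with inter-chip flit traversals, whereas the wireless term is charged regardless of traffic.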
In the following subsections, we demonstrate the performance of different multichip systems in
terms of the available bandwidth per core and packet energy dissipation.
A. Architectures for Comparison

We consider six multichip systems with different inter-chip connection configurations for a
comparative performance evaluation. These configurations are shown in Table 3.1. Among these
six configurations, four configurations use I/O based wireline chip-to-chip interconnection, and
two configurations use wireless interconnection for chip-to-chip communication. In these
arrangements, each multicore chip is considered to have 64 cores, where each core is connected to
a network switch. We also consider two different intra-chip topologies, Mesh and Small-World, to
evaluate the effect of the intra-chip network on the performance of the multichip system. The
number of interconnected chips in the system is varied from one to four, yielding system sizes of
64, 128, 192, and 256 cores.

Figure 3.6. Conceptual view of (a) Bus I/O and (b) Network I/O based wireline configuration

Due to the large pitch of substrate-to-board pins [41], the number of pins dedicated for I/O
operations is limited. Moreover, crosstalk between parallel inter-chip interconnects, which can be
several tens of millimeters long, severely limits signal integrity. As shown in [76], signal
integrity can be maintained in high-speed I/O based inter-chip communication only in the total
absence of crosstalk. Therefore, to eliminate crosstalk, only a single inter-chip interconnect line
is considered to exist between a pair of chips. To achieve this, only one switch along one edge of
each chip (excluding the corners) is connected to the I/O module in the Bus I/O based wireline
configurations. One of the middle switches is chosen as it is connected to three neighbors in the
mesh-based intra-chip NoC as shown in Figure 3.6 (a). For the small-world based configurations, the
same switches are chosen for the I/O modules to implement the same inter-chip architecture.
To investigate the effect of increased bandwidth of the inter-chip wired links, we
consider the Network I/O configuration, where we equip multiple switches in each chip with
I/O modules. However, between a particular pair of chips, there is only a single inter-chip link, thus
eliminating signal crosstalk. The chips are in turn connected in a mesh configuration among
themselves via switches along the edges, using the I/O based inter-chip interconnects as shown in
Figure 3.6 (b).
TABLE 3.1. MULTICHIP SYSTEMS WITH DIFFERENT INTER-CHIP INTERCONNECTION CONSIDERED IN THIS PAPER

| Architecture | Intra-chip configuration | Inter-chip configuration |
|---|---|---|
| Mesh+I/O (Bus) | Conventional grid based Mesh NoC | Bus-based I/O |
| Mesh+I/O (Network) | Conventional grid based Mesh NoC | Switching-based I/O |
| Small-World+I/O (Bus) | Small-world wireline NoC | Bus-based I/O |
| Small-World+I/O (Network) | Small-world wireline NoC | Switching-based I/O |
| Mesh+Token | Conventional grid based Mesh NoC | Token-based Wireless Links |
| Small-World+Token | Small-world wireline NoC | Token-based Wireless Links |

In the Bus I/O based wireline configuration, inter-chip communication happens through a
shared bus. For bus access, we have adopted an independent, guaranteed-bandwidth arbitration
appropriate for high-speed I/O buses, which combines a distributed Time Division Multiple Access
(TDMA) approach with round-robin access [77]. A simple slotted TDMA scheme is not realistic in
a multichip system because it is impossible to achieve precise synchronization between multiple
chips in current and future technologies. Therefore, an asynchronous and distributed access
mechanism is necessary. However, a traditional request/grant based asynchronous centralized
arbitration, common in on-chip SoC buses, is impractical as it needs control lines to the
arbiter in addition to the data lines. In high-speed I/O, as discussed before, implementing additional
control (request/grant) lines would need additional I/O ports and pins and exacerbate the crosstalk
noise, causing severe signal integrity issues. Therefore, we enabled the distributed TDMA with a
control flit broadcast to all the chips on the bus, which passes the access to the next chip at the end of
the transmission from the current chip. Each chip can access the bus for a maximum duration as
given by (3.2), similar to the WIs, to avoid bandwidth starvation. Also, the VC configurations of
the switches attached to the bus are the same as in the WIs.
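The bus arbitration described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the simulator used in this work: time is counted in flits, the maximum bus-holding duration from (3.2) is represented by a hypothetical parameter max_hold_flits, and the control flit is modeled simply as handing access to the next chip in round-robin order.

```python
from collections import deque


def round_robin_bus(queues, max_hold_flits):
    """Return the order in which flits appear on the shared bus.

    queues:         one list of pending flits per chip (hypothetical input).
    max_hold_flits: cap on flits a chip may send per turn; stands in for the
                    maximum access duration given by (3.2).
    """
    if max_hold_flits < 1:
        raise ValueError("max_hold_flits must be at least 1")
    pending = [deque(q) for q in queues]
    order = []
    holder = 0  # chip that currently owns the bus
    while any(pending):
        sent = 0
        # The current holder transmits until it is done or its turn expires.
        while pending[holder] and sent < max_hold_flits:
            order.append((holder, pending[holder].popleft()))
            sent += 1
        # Control flit broadcast: access passes to the next chip on the bus.
        holder = (holder + 1) % len(pending)
    return order
```

For example, with queues [["a1", "a2", "a3"], ["b1"], []] and a cap of two flits, chip 0 sends a1 and a2, chip 1 sends b1, chip 2 passes immediately, and chip 0 finishes with a3. No request or grant lines are needed because the hand-off travels on the data bus itself.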
Unlike the Bus I/O based configuration, wormhole switching is adopted for the
inter-chip communication in the Network I/O configuration. On the other hand, for both the
wireless configurations, we have considered 4 WIs per chip located at the center of the subnets
within the chips as discussed in the design methodology earlier.
B. Achievable Performance

In this subsection, we evaluate the performance of the multichip systems with wireless
interconnections. First, we evaluate the peak achievable bandwidth per core of different multichip
systems at network saturation using uniform random traffic. The peak achievable bandwidth per
core of these multichip systems is shown in Figure 3.7. It can be observed that the systems with
wireless interconnections have higher bandwidth compared to all the wireline I/O interconnections
for all system sizes. This is because the wireless nodes connect switches inside the chips directly
over single-hop links for both intra-chip and inter-chip data transfer. Therefore, even in the
single-chip case, where there is no inter-chip traffic, the configurations with wireless interconnects
have higher bandwidth compared to the completely wireline intra-chip NoCs. This is in agreement
with several mm-wave wireless intra-chip NoC papers [29][30][33]. On the other hand, for all the
wireline I/O based systems, the data packets need to travel from internal cores to the peripheral I/O
module, get routed over the inter-chip link, and again travel to internal nodes at the
destination chip. Among the wireline configurations, the Bus based multichip systems have the
lowest performance and are not scalable due to the non-scalable bus-based interconnection. The
Network I/O based wireline configuration has higher performance than the Bus. This is because
the Network configuration allows concurrent communication between the adjacent chips. It can be
observed that the wireless multichip system can sustain a bandwidth per core higher than 10 Gbps
even for a 4-chip system. The degradation also appears to level off asymptotically at 10 Gbps.
However, with the conventional I/O the bandwidth is more than 10× lower. This demonstrates the
significantly higher bandwidth provided by the direct chip-to-chip wireless links.

Figure 3.7. Peak achievable bandwidth per core with varying system size for different configurations
with uniform traffic. [Plot: bandwidth (Mbps per core, logarithmic axis from 100 to 100000) versus
system size (1-chip to 4-chip) for Mesh+I/O(Bus), Mesh+I/O(Network), Small-World+I/O(Bus),
Small-World+I/O(Network), Mesh+Token, and Small-World+Token.]

Figure 3.8. Average packet latency of various multichip systems. [Plot: average packet latency
(cycles) versus injection load (packets/core/cycle, 0.00001 to 0.01) for the same six configurations.]

Figure 3.8 shows the average packet latency for the various multichip systems with four
chips with uniform random traffic. Due to different average distances between cores in the
different multichip interconnection architectures, the latency characteristics are different. This is
demonstrated by the average latencies at low injection loads. It can be observed that the wireless
multichip system has the lowest latency compared to the systems with inter-chip wireline
interconnections. This is because of the shorter average path lengths due to WIs located inside the
chips providing single-hop links between cores located inside distant chips.
The performance of the Small-World NoC is higher than that of the Mesh-based NoC for both
wired and wireless systems. This is due to direct one-hop connections between distant nodes on
the chip. However, this gain is most apparent in the single-chip system. This is because, with
an increase in the number of chips, the impact of the local NoC in each chip on the
overall system performance decreases. We expect this trend to continue and hence the importance
of the intra-chip NoC architecture to diminish compared to the inter-chip interconnection as the
system size scales up.
C. Packet Energy

In this section, we compare the packet energy dissipation of different multichip systems
interconnected with I/O based wireline interconnects and wireless interconnects. Figure 3.9 shows
the packet energy dissipation of different multichip systems investigated in this paper. The packet
energy dissipation for all system sizes is lower for the wireless multichip systems compared to all
the I/O based multichip systems. The difference in packet energy between these wireline and
wireless multichip systems becomes more evident with an increase in system size, as the packet
energy dissipation for the I/O based multichip systems increases significantly with an increase in
the number of chips. In contrast, the packet energy in the wirelessly connected system does not
increase as drastically. This is due to the direct energy-efficient wireless links between cores
embedded in the multicore chips. Due to spatially uniform traffic, as the number of chips increases,
the inter-chip traffic also increases in proportion, from 0% in the single-chip scenario
to 75% in the 4-chip case. This implies that 75% of the total number of packets generated use the
wireline I/O in the case of the wired inter-chip interconnection systems. Because of this, a large
proportion of traffic travels to and from the I/O modules using multi-hop wired paths over the
intra-chip NoCs. This multi-hop path is shortened by the use of the WIs deployed inside the chips. This
is the main factor behind the gains in energy savings for the wireless multichip systems.
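The 0% and 75% figures quoted above follow from a simple proportion. The Python sketch below reproduces them under the assumption that every destination is drawn uniformly over all cores and that the chips are equally sized; the small correction for a core never addressing itself is ignored, and the function name is hypothetical.

```python
def inter_chip_fraction(num_chips):
    """Approximate share of uniform random traffic that leaves the source chip.

    With equally sized chips, a uniformly chosen destination lies on one of
    the other (num_chips - 1) chips with probability (num_chips - 1) / num_chips.
    """
    if num_chips < 1:
        raise ValueError("num_chips must be at least 1")
    return (num_chips - 1) / num_chips


for chips in (1, 2, 3, 4):
    print(f"{chips}-chip system: {inter_chip_fraction(chips):.1%} inter-chip traffic")
```

The fraction grows quickly for the first few chips and then flattens, which is consistent with the inter-chip interconnection dominating the packet energy as the system scales.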
Among all the I/O based configurations, the network-based interconnect configuration
has lower packet energy dissipation. This is because, in the networked interconnection based
multichip systems, the I/O buffers are less congested compared to the Bus configurations, resulting in
faster movement of data packets, which occupy buffers and interconnection resources for shorter
durations. As both the performance and the energy efficiency of the network or switching based
configuration are better than those of the other I/O based configurations discussed in this paper, we
consider this configuration as the baseline I/O based configuration for comparison in the following
sections.

Figure 3.9. Packet energy with varying system size for different configurations with uniform traffic.
[Plot: packet energy (nJ, logarithmic axis from 100 to 1000000) versus system size (1-chip to 4-chip)
for Mesh+I/O(Bus), Mesh+I/O(Network), Small-World+I/O(Bus), Small-World+I/O(Network),
Mesh+Token, and Small-World+Token.]

On the other hand, the Small-World NoC based wireless interconnection has lower packet
energy compared to the Mesh NoC based wireless interconnection. This is because the Small-World
nature of the topology reduces the average hop count of the network by establishing long-range
single-hop direct links. This effect is also demonstrated in recent literature [57][78][79].
Hence, in the next sections, we consider the Small-World+Token architecture to evaluate the
performance of the wireless multichip system.
D. Effect of Flit Width on Overall System Performance

In this section, we analyze the effect of increasing flit width for the Small-World+Token based
wireless multichip system with a uniform random traffic pattern and compare it with the
Small-World+I/O (Network) architecture. A 2-chip system is considered in this subsection. For this
experiment, we used four different flit sizes of 32, 64, 128, and 256 bits. This is because, as noted
in [80], flit widths beyond 128 bits are shown to provide only marginal gains in the performance of a
NoC based system.
In the case of wireline intra-chip interconnections, widening the physical channel to
accommodate a larger flit width will increase the data rate on the wireline links. In the case of the
conventional I/O based inter-chip interconnect, the increase in flit width translates into increasing
the bandwidth of the interconnection by using multiple channels per link. However, signal
deterioration due to crosstalk coupling effects, microwave effects, and frequency-dependent losses
in the transmission lines limits the number of parallel lines in the I/O modules. Here we only
analyze the system-level performance metrics such as bandwidth per core and packet energy for
these systems.
On the other hand, the data rate of the wireless links is governed by the speed of the
transceiver and the bandwidth of the antennas, which do not change with flit size. Hence, while the
wireline communication becomes faster with an increase in flit size, the wireless communication
speed remains constant. This results in a reduction in relative gains for the wireless multichip
communication architecture with respect to the conventional I/O based system as shown in Figure
3.10. However, even with a flit width of 256 bits (8 parallel I/O channels per link), we see a relative
improvement of 4.6x in data bandwidth and 3.1x in packet energy for a 2-chip system. In addition,
we note that the reduction in relative gains for both bandwidth and packet energy displays an
asymptotic behavior. This means that although the gains of using wireless interconnections
decrease with an increase in flit size, the gain will stabilize beyond a point as the performance of the
wireline interconnection does not continue to improve with flit size beyond 128 or 256 bits.

Figure 3.10. Relative gain in bandwidth and packet energy with different flit width for 2-chip system.

3.2.4 Deployment of the Wireless Interconnection with Scaling of System Size
In this section, we discuss the deployment methodology of the wireless interconnection for
multichip systems when the system scales up. In our earlier experiments, we keep the number of
WIs per chip constant and scale up the system, i.e., increase the number of chips per system.
However, in this approach, the total number of WIs keeps increasing, which will negatively affect
the performance beyond a point as it will take an increasingly longer time for each WI to possess the
token and gain access to the medium. Hence, we have considered an alternative approach to
deploying the WIs when the system scales up. In this second method, we keep the total number of WIs
per system constant and distribute the WIs among the chips. For the first approach, we consider
4 WIs per chip and increase the system size whereas, in the second method, the same number of
WIs is distributed among the chips of the system. In both cases, we evaluate the performance in
terms of bandwidth per core and packet energy. However, in the second approach, distributing
4 WIs among the chips results in 1 WI per chip for a 4-chip system, significantly
degrading the performance. Hence, to study the deployment methodology more comprehensively,
we consider another configuration with 16 WIs for the whole system and evaluate its performance.

Figure 3.11. Bandwidth per core and average packet energy for 1, 2, and 4-chip systems for two
different deployment approaches of WIs. [Plot: bandwidth per core (Gbps) and average packet energy
(nJ) versus number of chips (1, 2, 4) for constant WIs per chip, constant WIs per system (4 WIs), and
constant WIs per system (16 WIs).]

The bandwidth per core and packet energy for 1, 2, and 4-chip systems for the two different
wireless interconnection deployment methodologies are shown in Figure 3.11. For this study, the
Small-World+Token architecture is considered in all the cases. The peak bandwidth per core is
higher for the system with a constant number of WIs per chip than that of the alternative approach.
This is because, in the first approach, the number of WIs also increases with increasing system size.
Although increasing the number of WIs increases the token return period, it also helps to distribute
the inter-chip traffic among the WIs. On the other hand, for the second approach with 4 WIs for
the whole system, with increasing system size the volume of inter-chip communication increases
whereas the number of WIs per chip decreases. This increases congestion at the wireless interfaces
and adversely affects the bandwidth. This also causes a relative increase in the packet energy. To
study the impact of the second methodology with a higher number of WIs, we deploy 16 WIs in
the whole system. However, even in this case, the peak bandwidth per core is lower than that of
the system with a constant number of WIs per chip. In the single-chip case, the performance with
16 WIs is lower than that with 4 WIs as each WI has to wait much longer to access the wireless
channel. These two approaches are equal in peak bandwidth per core and packet energy in the
4-chip case because the two systems are identical. Hence, for the system sizes considered in this
experiment, having a constant number of WIs per chip is a better deployment approach for the
wireless multichip system.

Figure 3.12. Relative gain in bandwidth and average packet energy with different system sizes.

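The two deployment approaches compared above differ only in how the WIs are allotted to chips. As a rough illustration, the Python sketch below enumerates the WIs per chip for each approach; the function names are hypothetical, and an even split of the WIs across chips is assumed for the constant-per-system case.

```python
def constant_per_chip(num_chips, wis_per_chip=4):
    """First approach: every chip keeps the same number of WIs."""
    return [wis_per_chip] * num_chips


def constant_per_system(num_chips, total_wis):
    """Second approach: a fixed pool of WIs is split as evenly as possible."""
    base, extra = divmod(total_wis, num_chips)
    return [base + (1 if chip < extra else 0) for chip in range(num_chips)]


for chips in (1, 2, 4):
    print(chips,
          constant_per_chip(chips),
          constant_per_system(chips, 4),
          constant_per_system(chips, 16))
```

The enumeration makes two observations from the text concrete: a pool of 4 WIs leaves a 4-chip system with a single WI per chip, and a pool of 16 WIs coincides with the constant-per-chip deployment in the 4-chip case, while placing all 16 token holders on one chip in the single-chip case.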
To investigate the effect of this deployment policy on the scaling of system size and
dimensions further, we evaluate a multichip system with 9 chips. Each chip is considered to be
20 mm x 20 mm, and a space of 10 mm is assumed between the edges of adjacent chips as well as
between the chips and the edge of the substrate board. Thus, the overall dimensions of the board
are 10 cm x 10 cm. Figure 3.12 shows the relative gains in bandwidth and average packet energy of
the small-world based wireless multichip system with respect to the small-world based wireline
(Network) multichip system for various system sizes. With the increase in the number of chips and
the consequent increase in the number of WIs, the token-based wireless interconnection suffers a
degradation in performance. However, it can be seen that the relative gains do not decrease
significantly with an increase in size because the performance of the wireline multichip systems also
decreases with an increase in size.
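The board dimensions quoted for the 9-chip system can be checked with a one-line calculation. The Python sketch below assumes that the 9 chips are arranged in a 3 x 3 grid, which is implied by the stated dimensions but not spelled out in the text; the function name is hypothetical.

```python
def board_side_mm(chips_per_side, chip_side_mm=20.0, gap_mm=10.0):
    """Side length of a square board holding a square grid of chips.

    One gap separates each pair of adjacent chips, and one gap separates
    the outermost chips from each board edge.
    """
    return chips_per_side * chip_side_mm + (chips_per_side + 1) * gap_mm


# 3 chips of 20 mm plus 4 gaps of 10 mm give 100 mm, i.e. 10 cm per side.
print(board_side_mm(3))
```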

3.2.5 Performance Evaluation with Non-uniform Traffic Patterns
In this section, we analyze the bandwidth and packet energy in the Small-World+Token based
multichip system with non-uniform traffic patterns and compare it with the Small-World+I/O
(Network). We use the small-world based configurations in this section as they outperform the
Mesh-based ones as demonstrated in the previous section. First, we use hotspot and transpose
synthetic traffic patterns to evaluate these multichip systems. In the hotspot pattern, 5% of all traffic
generated from all cores has the same destination, which is the hotspot core. A single core was
chosen randomly from the system as the hotspot. All other packets are destined to other cores
following a uniform random distribution. This type of traffic pattern is fairly common for
directory-based cache-coherent shared memory multiprocessor systems, where communication
among the on-chip cores and the memory subsystem is more frequent [81]. To generate the
transpose traffic pattern, each core generates packets destined only to the core that is diametrically
opposite to it in the whole system. For example, the ith core will only send data packets to the
(N-i+1)th core, where N is the total number of cores in the entire system.
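The two synthetic patterns described above can be expressed as destination generators. The following Python sketch is illustrative only: cores are numbered from 1 to N as in the text, the function names are hypothetical, and the choice to exclude the source core from the uniform part of the hotspot pattern is an assumption not stated in the text.

```python
import random


def transpose_destination(i, n):
    """Core i (1-based) sends only to core N - i + 1."""
    return n - i + 1


def hotspot_destination(src, n, hotspot, rng, hotspot_fraction=0.05):
    """With probability hotspot_fraction target the hotspot core, otherwise
    pick a destination uniformly among the cores other than the source."""
    if src != hotspot and rng.random() < hotspot_fraction:
        return hotspot
    dst = rng.randrange(1, n + 1)
    while dst == src:  # assumption: a core never sends to itself
        dst = rng.randrange(1, n + 1)
    return dst


rng = random.Random(0)  # fixed seed so a run is repeatable
print(transpose_destination(1, 128), hotspot_destination(3, 128, 10, rng))
```

In a 2-chip system with 128 cores numbered chip by chip, transpose_destination maps every core on the first chip to a core on the second chip and vice versa, which is why this pattern sends all packets across the inter-chip medium.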

Figure 3.13. (a) Bandwidth per core and (b) packet energy with non-uniform traffic for I/O based and
wireless multichip systems. [Plots: (a) bandwidth (Gbps per core) and (b) packet energy (uJ) for
Small-World+I/O(Network) and Small-World+Token in 1-chip and 2-chip systems under hotspot and
transpose traffic.]

The bandwidth per core and packet energy at network saturation for the small-world NoC based
multichip system for the one-chip and two-chip cases are shown in Figure 3.13 (a) and (b) respectively.
As can be seen from the results, the wireless small-world system outperforms the I/O based multichip
system for all the non-uniform traffic patterns. In the 2-chip system, with both hotspot and transpose
traffic patterns, a significant portion of the traffic accesses the inter-chip communication medium.
Hence, the distributed wireless interconnects improve the bandwidth and packet energy in both
cases compared to the wireline inter-chip communication. In the transpose traffic pattern, all data
packets from all cores travel across the inter-chip communication medium. Hence, the relative
gains of the wireless inter-chip interconnection are most evident with this traffic pattern.
Our observations with the uniform and non-uniform traffic patterns indicate a strong
correlation between the overall performance of the multichip system and the proportion of
inter-chip traffic. However, it is hard to estimate or predict the proportion of inter-chip vs.
intra-chip traffic in the set of applications suitable for modern and future multichip systems. Hence, we
study the change in performance by varying the degree of localization in the traffic as a direct
parameter. We define the localization parameter as the percentage of data packets from each core
that have a destination randomly chosen from among the cores within the same chip. Figure 3.14
shows the bandwidth per core and packet energy as the localization parameter is
varied from 25% to 100% for a 2-chip system with each chip having 64 cores interconnected with
the small-world architecture. This captures the possible spectrum of traffic patterns while
demonstrating how the performance depends on it. As the localization parameter increases, the
performance of the multichip systems increases and the packet energy consumption decreases for
both wireless and conventional I/O based systems due to lower dependence on the inter-chip
communication fabric. More importantly, for low localization and increased inter-chip traffic, the
role of the inter-chip interconnections becomes significant, as one would expect, and the gains of the
wireless chip-to-chip links increase compared to the wired I/O system.

Figure 3.14. Bandwidth per core and packet energy for 2-chip I/O based and wireless multichip
systems with varying localization. [Plot: bandwidth (Gbps per core) and packet energy (uJ) versus
localization (25%, 50%, 75%, 100%) for Small-World+I/O(Network) and Small-World+Token.]

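The localization parameter defined above lends itself to a small destination generator. The Python sketch below is a minimal illustration under assumptions that are not part of the text: cores are numbered from 0, each chip owns a contiguous block of core indices, non-local packets pick one of the other chips uniformly, and a core never addresses itself. The function name is hypothetical.

```python
import random


def localized_destination(src, cores_per_chip, num_chips, localization, rng):
    """Pick a destination core for a packet generated at core src.

    localization: fraction (0.0 to 1.0) of packets that stay on the source chip.
    """
    if cores_per_chip < 2:
        raise ValueError("need at least two cores per chip")
    home = src // cores_per_chip  # assumption: contiguous core ids per chip
    if num_chips == 1 or rng.random() < localization:
        chip = home
    else:
        chip = rng.choice([c for c in range(num_chips) if c != home])
    while True:
        dst = chip * cores_per_chip + rng.randrange(cores_per_chip)
        if dst != src:
            return dst


rng = random.Random(42)  # fixed seed so a run is repeatable
sample = [localized_destination(5, 64, 2, 0.75, rng) for _ in range(8)]
print(sample)
```

Sweeping localization from 0.25 to 1.0 in such a generator corresponds to the horizontal axis of Figure 3.14, with 1.0 producing no inter-chip packets at all.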
3.2.6 Comparative Evaluation with Respect to Emerging Multichip Integration Technologies
In our prior sections, we considered a simple token passing based wireless multichip
system that outperforms wireline I/O based configurations. However, in the token passing based
wireless medium access mechanism, only a single transmitter can access the wireless channel at
any given instant of time although multiple transceivers are deployed over the entire system. This
limits the potential performance benefits of the wireless architecture. Enabling simultaneous
communication channels without any interference can ensure better utilization of the available
bandwidth. This can be achieved by either designing a MAC protocol like the Direct Sequence Spread
Spectrum (DSSS) based CDMA channel access mechanism [34][42], or FDMA using novel
antenna technology like Carbon Nanotube (CNT) based nano-antennas operating in THz frequency
bands [30]. To study the potential performance improvement with these advanced techniques, we
evaluate the same interconnection framework, just replacing the token-based wireless transceivers
with CDMA-based and CNT antenna based ones. In the CDMA-based medium access
mechanisms, Walsh codes are used to create orthogonal code channels for multiple access. Due to
this orthogonality between the code channels, bits in one code channel are not affected by other
channels. Transmitted bits are first encoded using the codeword, and at the receiving WI, the
received bit is XORed with the code words to extract the transmitted data. We adopt the
performance and power characteristics of the CDMA transceiver operating at 6 Gbps from [34] to
estimate the performance of the system. For the CNT antenna technology, we consider Multi-Walled
Carbon Nanotube (MWCNT) antennas as they are shown to be in excellent quantitative
agreement with traditional radio antenna theory [82]. These CNT antennas are excited using laser
sources of different frequencies, which results in concurrent frequency channels supporting a data
rate of 10 Gbps/channel [30].

TABLE 3.2. ENERGY PER BIT AND AGGREGATE BANDWIDTH FOR DIFFERENT INTERCONNECT TECHNOLOGIES

| | Token-based wireless interconnect | CDMA-based wireless interconnect | CNT-based wireless interconnect | Inter-chip photonic interconnect |
|---|---|---|---|---|
| Energy (pJ/bit) | 2.3 | 3.43 | 0.48 | 0.43 |
| Aggregate physical bandwidth (Gbps) | 16 | 6 | 160 | 160 |

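The Walsh-code channelization used by the CDMA-based access mechanism in this section can be illustrated with a small Python sketch. It is not the transceiver design of [34]: it assumes ideal, chip-synchronous, noise-free superposition of the transmitted signals and uses the bipolar (+1/-1) representation, in which multiplying symbols is the counterpart of XOR-ing 0/1 bits. All function names are hypothetical.

```python
def walsh_codes(order):
    """Return 2**order mutually orthogonal +1/-1 codewords (Sylvester construction)."""
    codes = [[1]]
    for _ in range(order):
        codes = ([row + row for row in codes] +
                 [row + [-c for c in row] for row in codes])
    return codes


def spread(bits, code):
    """Map each bit to +1/-1 and multiply it by every chip of the codeword."""
    return [(1 if b else -1) * c for b in bits for c in code]


def despread(signal, code):
    """Correlate each codeword-length window with the code to recover the bits."""
    n = len(code)
    bits = []
    for k in range(0, len(signal), n):
        corr = sum(s * c for s, c in zip(signal[k:k + n], code))
        bits.append(1 if corr > 0 else 0)
    return bits


codes = walsh_codes(2)                       # four codewords of length 4
tx_a = spread([1, 0, 1], codes[1])           # WI A on code channel 1
tx_b = spread([0, 0, 1], codes[2])           # WI B on code channel 2
channel = [a + b for a, b in zip(tx_a, tx_b)]  # simultaneous transmission
print(despread(channel, codes[1]), despread(channel, codes[2]))
```

Each receiver recovers its own bit stream because the correlation with any other codeword is zero. The sketch also shows the cost of spreading: every data bit occupies as many chips as the codeword is long, which lowers the data rate per code channel.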
Figure 3.15 shows the peak bandwidth per core and packet energy for the multichip systems
with these different wireless interconnect technologies. For a fair comparison, we considered the same
intra-chip architecture, i.e., Small-World (SW), with the same system size (4-chip) and the same number
of WIs per chip (4 WIs). The raw energy/bit of the wireless technology and the aggregate physical
bandwidth provided by each of these technologies are summarized in Table 3.2. From Figure 3.15
it can be seen that the SW+Token based system has the lowest bandwidth per core and the highest packet
energy among all the wireless configurations considered here. This is because only a single
transmitter can access the wireless channel at any given instant of time. This increases the queuing
delay for the packets, yielding a lower bandwidth and higher packet energy. Designing complex
MAC schemes like CDMA or using a novel antenna technology can improve this bandwidth to an
extent due to concurrent communication among the WIs. However, in the CDMA-based system,
these simultaneous communications use the orthogonal Walsh code channels,
resulting in lower bandwidth per channel due to the spectrum spreading effect of DSSS. Consequently,
SW+CDMA provides lower bandwidth and consumes higher packet energy compared to the
SW+CNT based system. However, the improvement of the performance by implementing a
complex MAC or utilizing novel antenna technology does not come without a price. The
transmitters are required to be synchronized to maintain the orthogonality among different code
channels in the CDMA-based MAC. This synchronization is difficult to achieve in a multichip
environment as the WIs are distributed across different chips. Similarly, integration of these
CNT antennas with standard CMOS fabrication processes needs to overcome significant
challenges [30]. Addressing these limitations by improving technology and designing efficient
wireless medium access mechanisms can exploit the full potential of the direct chip-to-chip
wireless interconnects in the future.

Figure 3.15. Bandwidth per core and average packet energy for different interconnect technologies for
4-chip system. [Plot: bandwidth per core (Gbps) and packet energy (uJ) for SW+Token, SW+CDMA,
SW+CNT, and SW+Photonic.]

In recent literature, off-chip photonic interconnects have emerged as another enabling
technology for chip-to-chip communication [12][13][83][84]. Next, we compare the wireless
interconnection architecture with a photonic multichip system. For a fair comparison, we simulate
the same 4-chip system. In this system as well, the intra-chip architecture is the same, i.e., small-world.
The only difference is in the inter-chip communication. In the photonic multichip system, the
inter-chip communication happens through high-bandwidth photonic interfaces. To connect these
interface switches through a single waveguide, we consider these switches to be located at one
edge of the chip. For our experiment, we consider four photonic interfaces per chip and one
waveguide with 16-way Wavelength Division Multiplexing (WDM) channels having a bandwidth
of 10 Gbps per channel [85]. The peak bandwidth per core and packet energy for the photonic
system are shown in Figure 3.15. The power consumption of the laser sources is factored into the
energy consumption per bit along with electro-optic conversions for both the photonic and the
CNT-based multichip systems as shown in Table 3.2. The SW+Photonic system outperforms both the
token and CDMA-based wireless multichip systems due to the presence of high-bandwidth
concurrent links. However, the performance of the photonic multichip system is lower than that of
the CNT-based wireless system. As noted in Table 3.2, the physical bandwidth of the CNT-based
wireless multichip interconnection is considered to be the same as that of the photonic inter-chip
waveguide. The energy per bit is also comparable. The gains in system bandwidth and packet
energy come from the fact that data packets can get routed from internal switches using the
CNT-based wireless links, whereas in the case of the photonic system, the data packets have to
reach the photonic interfaces at the periphery of the chip, thereby affecting system-level
performance and packet energy negatively. Moreover, in the CNT-based wireless multichip
system, intra-chip communication is also possible using the wireless channels without requiring
any additional overheads. It is worth noting that the aggregate physical bandwidth of both the
CNT-based wireless and the photonic interconnection frameworks can be increased by deploying
more CNT-based antennas and using denser WDM respectively. This will improve the
performance of both systems.
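The aggregate physical bandwidth of the photonic waveguide follows directly from the channel count and the per-channel rate given above. The short Python sketch below, with a hypothetical function name, reproduces the 160 Gbps entry of Table 3.2 and shows how a denser WDM configuration would scale it; the 32-channel case is an illustrative assumption, not a configuration evaluated in this work.

```python
def aggregate_bandwidth_gbps(num_channels, gbps_per_channel):
    """Aggregate physical bandwidth of concurrent, equal-rate channels."""
    return num_channels * gbps_per_channel


# 16-way WDM at 10 Gbps per channel, as considered for the photonic system.
print(aggregate_bandwidth_gbps(16, 10))   # 160
# Hypothetical denser WDM with twice as many wavelengths.
print(aggregate_bandwidth_gbps(32, 10))   # 320
```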
It is possible to improve the performance of the photonic multichip system by integrating
the inter-chip waveguide with an intra-chip photonic NoC [13]. However, the challenges regarding
integration of photonic devices, precise thermal tuning of electro-optic modulators and
demodulators and a manufacturing process involving 3D technology for a separate photonic plane
[39] need to be overcome.
Undoubtedly, the wireless multichip system with token-based MAC offers lower
bandwidth compared to the other systems considered in this section. However, due to fabrication
challenges and reliability concerns, implementing a complex MAC or emerging novel interconnect
technologies requires further investigation. On the other hand, the token-based wireless system
utilizing metal-zigzag antennas is CMOS compatible and outperforms conventional wireline I/O
based inter-chip communication systems, which makes it a nearer-term solution as the
communication backbone for designing such multichip systems, providing significant gains in
system performance.
3.2.7 Area Overheads
In this section, we estimate the corresponding area overheads of the various architectures studied
in this paper. The number of wired intra-chip links in all configurations is the same as that of a
conventional Mesh NoC, i.e., the number of intra-chip links in the small-world based architecture is
constrained to be the same as that of the conventional Mesh. The only difference is the I/O
modules, wireless transceivers and the area of the ports associated with them. Figure 3.16 shows the
total area overhead of the various interconnection architectures for the different multichip
configurations considered in this paper for a 4-chip system. In the case of token-based architectures,
each transceiver occupies an area of 0.3 mm² [28], whereas in I/O based architectures, each
transceiver has an area of 0.088 mm² [67]. For the wireless multichip systems of the largest
configuration, the total area of the interconnection network is 1.92% of the entire system while the
wireless overhead is only 0.46%, assuming each chip is 20 mm × 20 mm. The proportion of the
various area overheads remains similar for other system sizes using the wireless interconnections,
as the number of WIs per chip remains the same.

Figure 3.16. Area overheads of the different wireline and wireless architectures considered in this paper (vertical axis: area in mm²; legend: wired port area, I/O port area, wireless port area).

3.3 Summary

High-performance computing environments and data centers employ modules with multiple
multicore chips in a package or on board. The density and bandwidth of high-speed I/O for inter-chip interconnections are becoming the power-performance bottleneck for such multichip systems.
In this work, we explore the advantages possible if inter-chip communication in multichip modules
can be realized with state-of-the-art mm-wave wireless links operating in the 60 GHz band. While
the physical bandwidth of such wireless links is not necessarily higher compared to the high-speed
serial I/O links, the wireless links are capable of establishing direct communication channels
between cores in different chips via on-chip embedded antennas. Moreover, the wireless links can
be used for a seamless data transfer between cores in the same chip as well to augment the
traditional NoC backbone for intra-chip communications. These factors result in significant gains
in performance and energy efficiency in both intra and inter-chip data communications. The
energy efficiency of the wireless interconnects has been improved by careful wireless data
transfer protocol design that puts unused WIs to sleep using power-gated transceivers. In the future, it can be
further enhanced by using variable levels of power amplification [86] depending upon the length
of the wireless interconnects and the associated path losses.

Chapter 4

WIRELESS INTERCONNECT AS AN ENABLER FOR DATA COMMUNICATION ACROSS MICROCHANNEL BASED COOLING LAYER IN VERTICALLY INTEGRATED MULTICHIP SYSTEM

Vertically integrated multichip systems, i.e., three-dimensional integrated circuits (3D ICs), have
emerged as another feasible solution to overcome the performance limitations of 2D planar ICs [9].
Vertical interconnects realized using Through-Silicon-Vias (TSVs) provide high-density, high-bandwidth
communication paths between the active layers of 3D ICs, greatly mitigating the
global interconnect problems faced by planar ICs. However, utilizing the third dimension to
provide additional device layers poses thermal challenges, as stacking vertical layers significantly increases the
power dissipation density and the thermal footprint per unit area. TSVs are used for
both data and critical signals like clock and power delivery across the layers in a 3D IC
[16]. Conventional cooling techniques are limited to extracting heat only from the top or
bottom of the entire 3D stack. Consequently, the design of aggressive and sophisticated cooling
mechanisms is envisioned to alleviate the thermal issues in 3D ICs.
One such solution is to insert embedded inter-layer cooling microchannels or a cooling chip
in between layers of the vertically stacked multichip system. Tuckerman and Pease [17]
first proposed the use of microchannels to cool IC chips effectively. Introducing microchannels
between the active layers of the 3D ICs and circulating cooled liquid through these channels can
extract the heat from the interlayer regions and cool the 3D ICs more efficiently as compared to
conventional cooling techniques. However, to increase the cooling capability of the microchannels
their effective thermal conductivity should be high, or conversely, they should have low thermal
resistance. Short microchannel heights reduce their convective thermal resistance. However, short
microchannels increase the pressure drop between the entry and exit points of the liquid causing
thermal stresses [20]. These high stresses may compromise the mechanical integrity of the thin
walls between TSVs and microchannels. Lowering the coolant flow rate to reduce the pressure
drop has another disadvantage of higher temperature non-uniformity in the silicon substrate along
the flow length. Moreover, large thermal gradients along the fluid flow direction inside
microchannels can affect the structural reliability of the TSVs by inducing temperature related
expansion and contraction due to a mismatch in coefficient of thermal expansion (CTE) between
copper and silicon. To reduce the mechanical stresses and at the same time, to provide temperature
uniformity and adequate cooling capabilities, the height and width of the microchannels need to
be increased. Several dimensions of microchannels are suggested in literature ranging from 50 µm
to 1000 µm in height and 100 µm to 1000 µm in width depending on desired pressure drop and
cooling capabilities [21][22]. This, in turn, imposes significant restrictions on where and how
many TSVs and microchannels can co-exist together. TSVs with Aspect Ratio
(AR = Height/Diameter) greater than 10 are tough to manufacture at high yield due to challenges
related to etching, sidewall passivation, and formation, insulation, and filling of vias [6], and the co-dependency of the microchannels and electrical design makes the process even more complex.
Wider microchannels occupy a significant portion of the floor area of the 3D IC severely restricting
the freedom of placement and routing of TSV based links in 3D ICs. Moreover, increasing the
microchannels height will eventually increase the die thickness and consequently, the height of
TSVs, which in turn will increase the diameter of the TSVs to maintain a fixed AR. Also, to reduce
IR and L·di/dt drops, TSVs used for the power delivery network require a larger diameter and pitch than
the signal TSVs. All these factors restrict the area available to route TSVs across the cooling layers
and make the co-existence and co-design of TSVs and microchannels challenging, especially when
thousands of TSVs are required for interconnections in large chips with die areas larger than
100 mm² [23].
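The aspect-ratio constraint described above can be made concrete with a small calculation. The following is only an illustrative sketch: it assumes that the TSV height equals the thickness it has to cross and takes the AR limit of 10 from the text. The function name and the example heights are hypothetical and are not taken from the cited works.

```python
def min_tsv_diameter_um(tsv_height_um: float, max_aspect_ratio: float = 10.0) -> float:
    """Smallest TSV diameter (in µm) that keeps AR = height / diameter within the limit."""
    if tsv_height_um <= 0 or max_aspect_ratio <= 0:
        raise ValueError("height and aspect ratio must be positive")
    return tsv_height_um / max_aspect_ratio


# Hypothetical TSV heights (µm): a taller cooling layer forces wider TSVs.
for height_um in (50.0, 100.0, 500.0, 1000.0):
    diameter_um = min_tsv_diameter_um(height_um)
    print(f"TSV height {height_um:6.0f} um -> minimum diameter {diameter_um:5.1f} um")
```

Under these assumptions, every increase in the thickness that a TSV must cross translates directly into a proportionally larger minimum diameter, which is the area penalty the paragraph above refers to.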
In recent years, on-chip wireless interconnects in the millimeter-wave frequency bands have been
demonstrated to be more energy-efficient compared to conventional wireline interconnect fabrics
[28][29][30][31][33][34]. Moreover, wireless interconnects do not require physical layout of links
and provide direct single-hop links between transceivers distributed across the chips. Based on
these recent studies, we propose a wireless 3D NoC architecture to enable energy-efficient
communication utilizing wireless links across interlayer microfluidic coolers. This will reduce the
number of TSV based links across the microchannel layers, as data transfer across the cooling layer
will be achieved only through wireless links. This will eliminate the need to place and route signal
TSVs across the cooling layers. Therefore, the only TSV based links to be placed and routed across
the cooling layers would be the power and clock delivery networks. This will significantly reduce
the complexity of the co-design of TSV based interconnects and microchannel based interlayer
cooling. Figure 4.1 shows the side view of the proposed wireless 3D NoC.

Figure 4.1. Side view of proposed 3D wireless NoC architecture with the interlayer cooling layer.

In this chapter, we first determine the dimensions of the microchannel for optimal thermal
and hydraulic characteristics considering current trends in power consumption densities of 3D
multicore ICs. Next, we design the wireless physical layer suitable for communication across the
designed microchannel based coolers. Lastly, we evaluate the performance of a 3D wireless NoC
designed with the above physical layer.
The specific contributions of this chapter are listed below:
1. Determining optimal dimensions of microchannel based interlayer coolers for 3D multicore ICs.
2. The design of on-chip antennas for the physical layer of the 3D wireless NoC suitable for communication across the above designed cooling layers.
3. The design of 3D Wireless NoC architectures for 3D multicore ICs with microchannel based cooling to reduce the number of TSVs across the cooling layers.
4. Evaluate the performance, energy consumption and thermal characteristics of 3D wireless NoCs equipped with microchannel cooling layers.
5. Comparison of the proposed 3D Wireless NoCs with respect to traditional 3D interconnection systems using TSVs.
6. Holistic comparative evaluation with the horizontally integrated wireless multichip module.
7. Present a discussion on the various trade-offs available for the proposed design.

4.1 Integrated Design Methodology for Wireless 3D NoC with Microchannel based Liquid Cooling
In this research work, we propose to realize data communication links across the micro-channel
cooling layer in the 3D NoC with the wireless interconnects. The design of the NoC architecture
and the antennas will depend on the dimensions and characteristics of the cooling layers such that
a reasonable trade-off between the cooling capacity and pressure drop is obtained. Based on those
design specifications the methodology for designing the 3D Wireless NoC architecture and
antennas will be discussed next.
4.1.1 Design of Microchannel Cooling Layer
To enable efficient and powerful cooling in 3D IC systems, microchannel based liquid cooling
needs to be employed between active layers of the 3D IC. The modular interlayer cooling chip
needs to meet several design constraints: (i) the thermal performance should be high enough to
dissipate heat fluxes of 100 to 500 W/cm², which are expected from 3D multicore processors in
the near future [20]; (ii) it should provide temperature uniformity along the flow length; and (iii) it should
have high hydraulic performance, which corresponds to low pressure drops across the channel, to
reduce the power required for pumping the fluid and the structural strain on the IC. The number
of cooling layers, and how many active layers can be sandwiched between two cooling layers,
depend on the number of active layers, their heat flux and the cooling capability of the microchannels.
The thermal and hydraulic performance are two important factors in 3D IC cooling as they
determine the cooling capability and the mechanical stress on the IC package. The geometry of the
microchannels governs both the thermal and hydraulic performance. Finding a trade-off between
the cooling performance and reliability is not trivial because of the interdependence of the thermal
and hydraulic performance and the microchannel geometry. Here we discuss the interdependence,
which will guide the design of the microchannel geometry.
The thermal performance depends on the thermal resistance of the designed microchannels.
The thermal resistance $R_{th}$ is the ratio of the rise of the average surface temperature above
the inlet temperature of the fluid to the total heat dissipated. For a fully developed flow under a
constant heat flux, and considering the fluid flow to be parallel to the x-axis, the 1-D thermal resistance
can be defined as:

$$R_{th} = \frac{\Delta T_{max}}{q'' A_s} = \frac{T_{s,avg} - T_{f,in}}{q'' A_s} \tag{4.1}$$

where $\Delta T_{max} = T_{s,avg} - T_{f,in}$ is the maximum temperature rise in the microchannels, i.e., the
temperature difference between the peak temperature in the heat sink at the surface ($T_{s,avg}$) and the
fluid inlet temperature $T_{f,in}$, $A_s$ is the surface area, and $q''$ is the heat flux at the channel wall. The fluid
inlet temperature is used in eqn. 4.1 to incorporate the impact of the mass flow rate. Hence, the
convective resistance ($R_{conv}$) and the effective resistance due to the temperature rise of the liquid
($R_{heat}$) are both included in the analysis. Low values of the thermal resistance are desired for
microchannels in order to achieve lower temperatures at the surfaces where heat must be
dissipated.
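As a minimal sketch of how eqn. 4.1 can be evaluated, the function below computes the thermal resistance from a surface temperature, a fluid inlet temperature, a wall heat flux and a surface area. The function name and the numerical inputs are hypothetical and chosen only for illustration; they are not results from this work.

```python
def thermal_resistance_k_per_w(t_surface_k: float, t_inlet_k: float,
                               heat_flux_w_per_cm2: float, area_cm2: float) -> float:
    """Eqn. 4.1: R_th = (T_s,avg - T_f,in) / (q'' * A_s), returned in K/W."""
    total_heat_w = heat_flux_w_per_cm2 * area_cm2  # q'' * A_s
    if total_heat_w <= 0:
        raise ValueError("total dissipated heat must be positive")
    return (t_surface_k - t_inlet_k) / total_heat_w


# Illustrative (assumed) inputs: 50 K rise above the inlet while removing 100 W.
r_th = thermal_resistance_k_per_w(t_surface_k=353.0, t_inlet_k=303.0,
                                  heat_flux_w_per_cm2=100.0, area_cm2=1.0)
print(f"R_th = {r_th:.2f} K/W")
```

For a fixed heat input, a lower thermal resistance corresponds directly to a smaller temperature rise above the inlet temperature.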
On the other hand, as liquid flows through the microchannels, the pressure decreases from
the inlet section to the outlet section due to frictional losses. This pressure drop needs to be
overcome by an external pump circulating the liquid coolant. Thus, the hydraulic performance is
quantified by the pumping power i.e. the power required to drive the coolant through the flow
passages to achieve the desired flow rate required for cooling. The pumping power $W$ required to overcome
the flow resistance can be defined as:

$$W = \Delta p \cdot \dot{V} \tag{4.2}$$

where $\Delta p$ is the difference between the pressure at the inlet and the pressure at the outlet of the
microchannels, and $\dot{V}$ is the volumetric flow rate.
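Eqn. 4.2 can be evaluated directly once the pressure drop and the flow rate are expressed in SI units. The sketch below does this for two rows of Table 4.1 (the "This work" row and the row for [17]). It is an order-of-magnitude illustration only: the function name is hypothetical, and it is assumed here that the tabulated pressure drop and flow rate refer to the same flow passage.

```python
def pumping_power_w(pressure_drop_pa: float, flow_rate_ml_per_min: float) -> float:
    """Eqn. 4.2: W = delta_p * V_dot, with V_dot converted from ml/min to m^3/s."""
    flow_m3_per_s = flow_rate_ml_per_min * 1e-6 / 60.0
    return pressure_drop_pa * flow_m3_per_s


# (pressure drop in kPa, flow rate in ml/min) as listed in Table 4.1.
table_rows = {"This work": (9.0, 45.0), "[17]": (214.0, 516.0)}
for ref, (dp_kpa, flow_ml_min) in table_rows.items():
    power_mw = 1e3 * pumping_power_w(dp_kpa * 1e3, flow_ml_min)
    print(f"{ref:>9}: {power_mw:8.2f} mW")
```

The product form of eqn. 4.2 shows why reducing both the pressure drop and the flow rate lowers the pumping power multiplicatively.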
From these equations, we can see that a higher flow rate will reduce the temperature rise
for a given heat input and hence, results in lower thermal resistance. This will improve the thermal
performance. However, increasing the flow rate will increase the pressure drop and hence,
pumping power. As a result, it can cause structural strain on the IC. Moreover, the pressure drop
across the microchannel and thermal resistance depend on the geometry of the microchannel. All
these factors complicate the overall design of the microchannel. In later sections, we perform
numerical simulations to evaluate the thermal and hydraulic performance of the modular
microchannel cooling chip to determine the optimal design. For the numerical analysis, the
geometric parameters that were varied are the height of the microchannel, b, and the width of the
microchannel, w. The width of the fin between the microchannels is wf, and the microchannel pitch,
p is defined to be (w+wf)/2. Detailed results are presented in section 4.2.1. The results obtained
from these experiments will be used to guide the architecture design methodology as discussed
next.
4.1.2 Proposed Topology of the 3D Wireless NoC Architecture
We propose to realize links across the microchannel cooling layer in the 3D IC with the wireless
interconnects. The cores in the 3D multicore system will be interconnected using a NoC fabric
through switches and links. In our proposed architecture, each core is connected to a NoC switch.
Switches within a single layer are connected in a mesh topology with conventional copper wire
based NoC links. To enable interlayer communication between layers that are not separated by a
cooling layer, each switch is equipped with TSV based links to the switches vertically above or
below it.

Figure 4.2. Top view of one active layer.

Communication channels across the cooling layer are realized through wireless links. For this
purpose, each layer is logically divided into subnetworks or subnets, such that a particular switch
in each subnet is equipped with a Wireless Interface (WI). The WIs are deployed in a switch at the
center of the subnets to avoid long multi-hop paths from all cores in its subnet, assuming any core
can transmit inter-chip data at some point during the operation of the system. This WI deployment
strategy has been shown to provide the Minimum Average Distance (MAD) between all switches
in an intra-chip NoC in [58]. All cores that need to send data across the cooling layer access the
wireless channel through the WI in their subnet. The WIs are connected in an all-to-all fashion using
the shared wireless band. The data is transferred to the WI in the subnet of the destination core
from where it is routed to the final destination. To improve performance, data transfer within the
same layer or in adjacent layers not separated by a cooling layer can also use the WIs depending
on the adopted routing policy as discussed in section 4.1.4. In this way, a hybrid hierarchical 3D
wireless, wireline and TSV based NoC architecture (3D-HiWiNoC) is formed. Figure 4.2 shows
the top view of one active layer (vertical TSVs are not shown).
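The placement of a WI at the center of its subnet can be illustrated with a short sketch. It assumes a rectangular mesh subnet and uses the Manhattan hop count as the distance metric; both are assumptions made here for illustration, and the function name is hypothetical. The exact placement procedure of [58] may differ.

```python
from itertools import product


def center_switch(rows: int, cols: int) -> tuple:
    """Return the (row, col) switch with the smallest average Manhattan hop
    distance to all switches of a rows x cols mesh subnet (ties: first found)."""
    switches = list(product(range(rows), range(cols)))

    def average_hops(candidate):
        total = sum(abs(candidate[0] - r) + abs(candidate[1] - c) for r, c in switches)
        return total / len(switches)

    return min(switches, key=average_hops)


print(center_switch(3, 3))  # unique center of a 3 x 3 subnet
print(center_switch(4, 4))  # one of the four central switches of a 4 x 4 subnet
```

For subnets with an even number of rows or columns, several central switches tie, and any of them gives the same average distance under this metric.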
4.1.3 Physical Layer
Several alternative technologies exist for realizing on-chip and off-chip wireless interconnections
[29][30][31][32][35]. We envision the use of on-chip embedded miniature antennas that can be
fabricated within the chip to establish direct communication channels between internal switches
of the chips. To realize such wireless channels, we choose on-chip metal zig-zag antennas which
have been shown to be effective in establishing on-chip communication [27]. The chosen on-chip
antenna has to provide the best power gain for the smallest area overhead. A linear dipole occupies
a large area proportional to the wavelength of the carrier frequency. A patch antenna is directional,
mostly radiating perpendicular to its plane. A log-periodic antenna can have higher power gains
but is highly directional. We intend the chosen antenna to be compact as well as not directional.
This is because we want to communicate between antennas that are located in different layers of
the 3D IC and potentially at different angles with respect to each other’s axes. A metal mm-wave
zigzag antenna has been demonstrated to possess these characteristics, as it is more compact
than a linear dipole due to the zig-zag folding of the arms. In addition, such mm-wave
antennas fabricated using top layer metals are CMOS process compatible, making them suitable
for near-term solutions to the wired interconnect problem [27]. Therefore, to realize such wireless
channels, we choose on-chip metal zig-zag antennas, which have been shown to be effective in
establishing on-chip communication [29][32][33]. For this antenna, the effect of
rotation (the relative angle between transmitting and receiving antennas) on received signal strength is also negligible,
making it most suitable for on-chip wireless interconnects, as each antenna has to communicate
with other WIs in multiple directions. Such mm-wave 60 GHz antennas are shown to have a
bandwidth of 16 GHz for on-chip communication links. The antennas are placed at the center of
each subnet and are fed from the WI of the respective subnet in each layer. Consequently, the
antennas need to be tuned for the best radiation characteristics in this 3D system with the microchannel
based cooling layers separating active layers. The specific details of the designed antenna and its
radiation characteristics depend on the dimensions of the cooling layers and are shown in section
4.2 under experimental results. The antennas are tuned to work in the mm-wave band with a carrier
frequency of 60 GHz.

4.1.4 Seamless Flow Control and Routing
The routing protocol for the proposed 3D wireless system with microchannel based cooling is a
seamless intra and inter-chip data communication mechanism. We adopt wormhole switching for
wireline links in the proposed system where data packets are broken down into flow control units
or flits [36]. All switches have bidirectional ports for all links attached to them. The WIs have an
additional port equipped with the wireless transceivers to access the wireless physical channel. For
the wireless links, we adopt the same wormhole switching with a slight modification as explained
in the next subsection.
We use a forwarding table based routing algorithm over pre-computed shortest paths
determined by Dijkstra’s algorithm for both inter-chip and intra-chip data. Dijkstra’s algorithm
extracts a minimum spanning tree, which provides the shortest path between any pair of nodes in
a graph. The exact minimum spanning tree depends on the chosen start node for the algorithm but
the length of paths between any pair, along the tree does not rely on the start node. Hence, it is
chosen randomly from among all the switches in the system. However, for a specific start node,
the shortest path along the extracted tree is always unique as the minimum spanning tree eliminates
loops inherently. Consequently, deadlock is avoided by transferring flits along the shortest path
routing tree extracted by Dijkstra’s algorithm, as it is inherently free of cyclic dependencies. The
route computation overheads are reduced significantly, as the routing decisions are made locally
based on the forwarding table only for determining the next hop and are made only for the header
flit. The body and tail flits simply follow the reserved path as per wormhole switching.
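The forwarding-table idea described above can be sketched as follows. This is a simplified illustration and not the implementation used in this work: it runs Dijkstra's algorithm from one start node to extract a routing tree, then fills a next-hop table for every (switch, destination) pair by walking that tree. The graph encoding, the function names and the 2 x 2 example mesh with unit link weights are all assumptions.

```python
import heapq


def routing_tree(adj, root):
    """Dijkstra from root; returns {node: parent} describing the extracted tree."""
    dist, parent = {root: 0}, {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in adj[u].items():
            nd = d + weight
            if v not in dist or nd < dist[v]:
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return parent


def forwarding_tables(parent):
    """tables[switch][destination] -> next hop, using tree edges only."""
    tree = {n: set() for n in parent}
    for node, par in parent.items():
        if par is not None:
            tree[node].add(par)
            tree[par].add(node)
    tables = {n: {} for n in parent}
    for dst in parent:
        stack, seen = [dst], {dst}
        while stack:  # walk outwards from dst; each new node forwards back to u
            u = stack.pop()
            for v in tree[u]:
                if v not in seen:
                    seen.add(v)
                    tables[v][dst] = u
                    stack.append(v)
    return tables


def route(tables, src, dst):
    """Hop-by-hop lookup, as done for a header flit at each switch."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[path[-1]][dst])
    return path


# Hypothetical 2 x 2 mesh: switches 0..3, unit-weight links.
mesh = {0: {1: 1, 2: 1}, 1: {0: 1, 3: 1}, 2: {0: 1, 3: 1}, 3: {1: 1, 2: 1}}
tables = forwarding_tables(routing_tree(mesh, root=0))
print(route(tables, 0, 3))
```

In this sketch, paths that start at the start node are shortest paths, while other pairs are routed along tree edges only. Because a tree contains no loops, the set of routes has no cyclic dependencies.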

4.1.5 Wireless Communication Protocol and Transceiver
In mm-wave interconnects wireless bandwidth is limited by the state-of-the-art transceiver design
and on-chip antenna technology. Multiple wireless transceivers need to access the wireless
medium to communicate via the energy-efficient wireless interconnects to improve connectivity
and performance. Consequently, multiple transceivers share a single wireless frequency channel.
Therefore, an efficient and collision-free Medium Access Control (MAC) mechanism is needed.
The authors in [29] have proposed such a distributed and low-overhead token-based MAC
mechanism for on-chip wireless interconnects. The token-based MAC grants access to the shared
wireless medium to a single WI resulting in a contention free communication using the wireless
channel. However, in such a MAC only whole packets are transmitted to other WIs, to maintain
the integrity of the wormhole switching [87]. This increases the buffer requirement and hence
static power consumption in the WIs. Therefore, we propose a MAC mechanism that allows partial
packet transmission from a WI while maintaining the integrity of the wormhole switching.
In the proposed MAC, instead of circulating a token at the end of each transmission, each
WI broadcasts a control packet at the beginning of its transmission. The control packet consists of
a header for identification and differentiation of data packets. In addition, to enable partial packet
transmission and correct routing, the control packet has 3-tuples: (DestWI, PktID, NumFlits) for
every partial packet that it will transmit. Each 3-tuple contains the information about the number
of flits (i.e., NumFlits) to be transmitted from the WI to a particular destination (i.e., DestWI)
along with the packet ID (i.e., PktID) of the packet, to which the flits belong. The PktID enables
the destination WI to identify the VC number at the destination WI to put the flits, thus maintaining
wormhole switching. In case the PktID does not exist at the destination WI, the WI reserves an
unoccupied VC. The number of output VCs of the transmitting WI limits the number of 3-tuples in
a control packet. The control packet is broadcast to all WIs. Therefore, the next WI in sequence
computes the duration of the current transmission from the information in the control packet and
transmits its control packet when the current transmission is completed. For this purpose, the WIs
are numbered in a sequence. Thus, contention between WIs in accessing the channel is avoided.
This control packet based MAC enables an energy-efficient operation of the WIs by using sleep
transistors. We adopt the design of such sleepy transceivers from [65] to put particular receivers
to sleep when the transmitted data is not intended for them based on the information in the control
packets. This eliminates the overhead and layout complexity of the global signaling wires to carry
the $\text{sleep}/\overline{\text{wake}}$ signals as in [65].
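The control-packet mechanism described above can be sketched as a small data structure. The field names follow the 3-tuple (DestWI, PktID, NumFlits) from the text; everything else (class and function names, the round-robin numbering of WIs starting at 0, and the example values) is a hypothetical illustration rather than the actual hardware design.

```python
from dataclasses import dataclass
from typing import Set, Tuple

Entry = Tuple[int, int, int]  # (DestWI, PktID, NumFlits)


@dataclass(frozen=True)
class ControlPacket:
    src_wi: int
    entries: Tuple[Entry, ...]


def make_control_packet(src_wi: int, entries, num_output_vcs: int) -> ControlPacket:
    """The number of 3-tuples is limited by the output VCs of the transmitting WI."""
    entries = tuple(entries)
    if len(entries) > num_output_vcs:
        raise ValueError("more 3-tuples than output VCs")
    return ControlPacket(src_wi, entries)


def flits_to_transmit(cp: ControlPacket) -> int:
    """Total flits announced; other WIs derive the transmission duration from this."""
    return sum(num_flits for _, _, num_flits in cp.entries)


def receivers_kept_awake(cp: ControlPacket) -> Set[int]:
    """Only the destination WIs need to stay awake; the rest can sleep."""
    return {dest_wi for dest_wi, _, _ in cp.entries}


def next_wi_in_sequence(cp: ControlPacket, num_wis: int) -> int:
    """WIs are numbered in a sequence; the next one transmits after this burst."""
    return (cp.src_wi + 1) % num_wis


cp = make_control_packet(src_wi=2, entries=[(0, 17, 3), (5, 9, 4)], num_output_vcs=4)
print(flits_to_transmit(cp), sorted(receivers_kept_awake(cp)), next_wi_in_sequence(cp, 8))
```

Because every WI receives the broadcast control packet, each one can work out locally when the current transmission ends and whether it is a destination, without any additional global wires.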
The WI transceiver circuitry has to provide a very wide bandwidth as well as low power
consumption. The transceiver design is adopted from [28] where low power design considerations
are taken into account at the architecture level. Non-coherent On-Off Keying (OOK) modulation
is chosen, as it allows relatively simple and low-power circuit implementation. Next, we present
the performance evaluations of the proposed wireless interconnection framework.

4.2 Experimental Results

In this section, we determine the dimensions of the cooling layer, demonstrate the design of the
on-chip antennas with suitable communication properties, and estimate performance and
temperature profile of a 3D multicore IC incorporating the cooling layers and interconnected by
wireless links across them.

4.2.1 Dimensions of the Cooling Channels
To capture the operating conditions in the microchannel cooling chip module, a heat flux q'' = 100
W/cm² is imposed on each of the upper and lower walls of a single microchannel unit. A single
passage is modeled by considering symmetry planes along the length of the microchannel passing
at mid-planes of the fins to avoid simulating the whole module. The total width of the
computational domain is the pitch, p between mid-planes along the fins and the total length is 10
mm. Fully developed laminar flow at 303 K is considered at the inlet of the fluid passage, and the
radiation effects are neglected in the entire computational domain. Deionized water is used as the
cooling fluid flowing through the fluid passage, and the material of the microchannel is modeled
as silicon. The dielectric property of deionized water enhances the radiation characteristics of
wireless links, making it a suitable choice. All properties for the materials are taken as constant and
evaluated at 303 K, except the viscosity μ (Pa·s) of the deionized water, which is curve fitted by
the following equation:

$$\mu = 2.414 \times 10^{-5} \times 10^{247.8/(T - 140)} \tag{4.3}$$

where $T$ is the temperature in K.
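The viscosity fit of eqn. 4.3 is easy to evaluate numerically. The sketch below assumes that the equation is the common curve fit for water, μ = 2.414 × 10⁻⁵ × 10^(247.8/(T − 140)) with T in kelvin and μ in Pa·s; the function name is hypothetical.

```python
def water_viscosity_pa_s(temperature_k: float) -> float:
    """Eqn. 4.3 (as read here): mu = 2.414e-5 * 10 ** (247.8 / (T - 140)), T in K."""
    if temperature_k <= 140.0:
        raise ValueError("the fit is only meaningful for T > 140 K")
    return 2.414e-5 * 10.0 ** (247.8 / (temperature_k - 140.0))


# Inlet temperature used in the simulations (303 K) and an 80 degC surface (353.15 K).
for t_k in (303.0, 353.15):
    print(f"T = {t_k:6.2f} K -> mu = {water_viscosity_pa_s(t_k):.3e} Pa.s")
```

The fit captures the fact that water becomes noticeably less viscous as it warms along the channel, which is why viscosity is the one property not held constant in the simulations.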

The numerical models are simulated using the finite volume software Fluent 14.5 [88],
which provides accurate temperatures in the solid, and pressure, velocity, and temperature fields in
the fluid, using the assumptions above. A structured mesh consisting only of hexahedral elements
is used to split the computational domain into several control volumes. The mass, momentum, and
energy conservation equations are solved for each control volume. Convergence is considered to be reached
when the residuals for the governing equations are less than 10⁻⁶. These microchannels are modeled
and simulated by Jose-Luis Gonzalez-Hernandez using ANSYS Fluent 14.5 [88] under the
supervision of Dr. Satish G. Kandlikar [74].

A. Thermal Resistance

For this experiment, we consider the inlet temperature and the heat supplied to the layers to be
constant. Therefore, low values of the thermal resistance are desired for microchannels in order to
achieve lower temperatures at the surfaces where heat must be dissipated. The mass flow rate in
the simulations is selected as the lowest mass flow rate needed to achieve an 80 °C surface
temperature at the outlet. The thermal resistance for the microchannel heat sinks
modeled in this study is shown in Figure 4.3 for different heights of the microchannels as a function
of the pitch. As the microchannel height increases, the thermal resistance also increases, leading
to a lower thermal performance when the height is higher than 200 μm. For heights of 10 and 100
μm, thermal resistance values of 0.1 and 0.4 K/W are obtained, respectively. It is observed that
for a fixed microchannel height, the effect of the pitch on the thermal resistance becomes negligible
when the pitch is greater than 2000 μm.

Figure 4.3. Thermal resistance variation for the microchannels (Rth [K/W] versus microchannel height b [µm], for pitches p = 400, 500, 667, 1000, 2000 and 10000 µm).

B. Pressure Drop and Pumping Power

The pressure field is obtained from the simulations as explained in section 4.2.1. The pressure drop
is computed as the difference between the pressure at the inlet and the pressure at the outlet of the
channels. In Figure 4.4, the pressure drop characteristics of the modeled microchannel heat sinks
are shown. It is observed that the pressure drop decreases with increasing height of the
microchannel for all the pitches considered. The highest pressure drops, of the order of 1 MPa, are
observed for heights of 10 μm, which is the typical interlayer thickness in monolithic 3D ICs [20].
Such high hydraulic stress severely impacts the reliability of the 3D ICs [20]. The steep slopes
between heights of 10 μm and 100 μm indicate that a dramatic change in hydraulic performance
is achieved between these microchannel heights. Beyond a microchannel height of 100 μm the
slope of the pressure drop plots becomes less steep, and for heights greater than 200 μm the effect
of the microchannel height is negligible. Under these geometric parameters very low values for
the pressure drop are achieved (∆P < 10 kPa).

Figure 4.4. Pressure drop variation for the microchannels (∆P [kPa], logarithmic axis from 10⁻¹ to 10³, versus microchannel height b [µm], for pitches p = 400, 500, 667, 1000, 2000 and 10000 µm).

Figure 4.5 shows the pumping power required to drive the fluid through the cooling layer. For
microchannel heights lower than 100 µm, the pumping power is very high (> 2 W), and it decreases
almost exponentially as the value of b increases (taller microchannels). For a microchannel height
of b = 100 µm, the values of W are low enough (< 400 mW) to be attractive. Higher microchannel
heights lead to a very low pumping power; however, the thermal resistance is high, which indicates
a poorer thermal performance.
We observe that a height of 100 μm results in a low thermal resistance while the pumping
power remains low. Thus, implementing b = 100 μm offers the best overall performance, providing
a thermal resistance of ~0.4 K/W. For a fixed microchannel height of 100 μm, the effect of varying
the microchannel width on the thermal resistance is negligible for widths greater than 800 µm.
Hence, a microchannel height of 100 μm and a pitch of 800 μm are recommended for high overall
performance. While it is true that narrower channels result in higher heat transfer coefficients
[46], these widths also result in higher pressure drops. Increasing the channel width while keeping
the flow rate per unit width constant results in a lower pressure drop with the same temperature rise
throughout the chip. The somewhat lower heat transfer coefficients are more than offset by having
considerably lower pressure drops in the wider channels. For 3D IC applications, the pressure drop
is desired to be very low; thus, we recommend a channel width of 800 µm. For the subsequent
analysis, we assume a height of 100 μm for the microchannels for wireless link design and a
corresponding thermal resistance of 0.4 K/W in the cooling layers.

Figure 4.5. Pumping power variation for the microchannels (W [mW] versus microchannel height b [µm], for pitches p = 400, 500, 667, 1000, 2000 and 10000 µm).

TABLE 4.1. SUMMARY AND COMPARISON OF THE SINGLE-PHASE MICROCHANNEL GEOMETRIES IN THE LITERATURE

Ref         Thermal Resistance (K/W)   Pressure Drop (kPa)   Flow rate (ml/min)   Microchannel Dimensions (µm)
[17]        0.09                       214                   516                  width = 50, height = 320
[45]        0.20                       50                    220                  width = 200, height = 100
[46]        0.17                       280                   155                  width = 50, height = 100
[89]        0.32                       70                    96                   width = 200, height = 2000
[90]        0.38                       2.41                  145                  width = 560, height = 200
This work   0.40                       9                     45                   width = 800, height = 100

Table 4.1 compares the performance of different microchannel configurations reported in the
literature with our design. Among the works surveyed here, our design has the highest thermal
resistance. Nevertheless, this thermal resistance is sufficient to cool a heat flux of 200 W/cm²
while maintaining a surface temperature of 80 °C. Accepting the higher thermal resistance
enables us to reduce the pressure drop and flow rate significantly compared to earlier work.
This leads to a lower power required to drive the flow through the
channels. Therefore, we adopt the microchannel configuration with comparatively high widths
and heights. This choice does not have any impact on TSV placement, as we propose to establish
communication across these microchannels using wireless interconnects. In the next subsections,
we evaluate the impact of using wireless interconnects in the 3D NoC on its performance and
energy efficiency.
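The link between pressure drop, flow rate, and driving power can be illustrated with a short calculation. The sketch below uses the ideal hydraulic power, P = Δp × Q, applied to the pressure drops and flow rates listed in Table 4.1. It assumes an ideal pump (no efficiency losses), and it is not necessarily the definition of W used for Figure 4.5; the function and variable names are illustrative only.

```python
# Ideal hydraulic pumping power P = dp * Q for the rows of Table 4.1.
# Assumption: pump efficiency is ignored, so these are lower bounds on the
# power actually needed to drive the coolant.

ML_PER_MIN_TO_M3_PER_S = 1e-6 / 60.0  # 1 ml = 1e-6 m^3, 1 min = 60 s

# (reference, pressure drop in kPa, flow rate in ml/min), copied from Table 4.1
TABLE_4_1 = [
    ("[17]", 214.0, 516.0),
    ("[45]", 50.0, 220.0),
    ("[46]", 280.0, 155.0),
    ("[89]", 70.0, 96.0),
    ("[90]", 2.41, 145.0),
    ("This work", 9.0, 45.0),
]


def hydraulic_power_mw(pressure_drop_kpa, flow_ml_per_min):
    """Return the ideal hydraulic power in mW for a pressure drop and flow rate."""
    pressure_pa = pressure_drop_kpa * 1e3
    flow_m3_per_s = flow_ml_per_min * ML_PER_MIN_TO_M3_PER_S
    return pressure_pa * flow_m3_per_s * 1e3  # W -> mW


if __name__ == "__main__":
    for ref, dp_kpa, q_ml_min in TABLE_4_1:
        print(f"{ref:10s} {hydraulic_power_mw(dp_kpa, q_ml_min):9.2f} mW")
```

Under these assumptions, the configuration adopted in this work needs a few milliwatts of ideal hydraulic power, whereas the narrow-channel designs with pressure drops above 200 kPa need on the order of a watt.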
4.2.2 Evaluation of the Hybrid 3D NoC Architecture in the Presence of Microchannel based
Liquid Cooling
According to ITRS projections, 3D stacked ICs will have 4 high-performance layers by
2020, where each layer is projected to dissipate more than 100 W/cm² [6]. Hence, to evaluate
the performance and temperature characteristics of the proposed architecture, we consider a 64-core
3D chip consisting of 4 layers, each with a 10 mm x 10 mm footprint. For interlayer cooling, we
consider one cooling chip after every two active layers, as our microfluidic cooling
layer is capable of cooling a heat flux of 200 W/cm². We also evaluate the impact of inserting
cooling layers more frequently to increase the cooling capability in section 4.2.4.
Each layer consists of 16 cores and switches. In the 3D-HiWiNoC architecture, each layer is
divided into 4 subnets with 4 cores in each. Therefore, there are 16 WIs, one in each subnet,
connected to antennas located at the center of each subnet. The NoC architecture is characterized
using a cycle-accurate simulator that models the progress of the data flits per clock cycle,
accounting for the flits that reach their destination as well as those that are stalled. For each NoC
simulation, ten thousand iterations were performed, and the first thousand iterations were
discarded to eliminate transients. The width of all wired links is taken to be the same as the flit
size, which is either 32 or 256 bits in this work. These values represent the lower and higher ends
of the flit sizes found in modern NoC designs. We consider a moderate packet size of 64 flits for
all our experiments.
The NoC switch architecture has three functional stages, namely, input arbitration,
routing/switch traversal, and output arbitration [69]. Each switch port has four virtual channels,
each with a buffer depth of 2 flits. The wireless ports have an increased buffer depth of 8 flits to
avoid excessive packet dropping while waiting for the token [28]. The delay and energy dissipation
of the network switches are obtained from post-synthesis RTL models using 65 nm standard cell
libraries from CMP (http://cmp.imag.fr) at 1 V, using Synopsys. The NoC switches are driven by
a 2.5 GHz clock. The delays and energy dissipation on the wired links were obtained
through Cadence simulations, taking into account the specific length of each link based on the
established connections in the 10 mm x 10 mm layer following the topology of the NoCs. Each

Figure 4.6. (a) Full side view; (b) view inside the box.


device layer is considered to be 10 µm thick, whereas the cooling layer is considered to be 100 µm
thick. The power dissipation and delay of the TSVs are adopted from [26]. The wireless transceiver
adopted from [28] is shown to dissipate 36.7 mW while sustaining a data rate of 16 Gbps and occupying
an area of 0.3 mm² in a TSMC 65-nm CMOS process.
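The transceiver figures quoted above translate directly into an energy per transmitted bit. The sketch below is a minimal illustration, assuming the 36.7 mW is dissipated continuously while the link runs at its full 16 Gbps data rate; the function name is illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ.

    mW / Gbps = (1e-3 J/s) / (1e9 bit/s) = 1e-12 J/bit, i.e. pJ/bit.
    """
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    # Transceiver adopted from [28]: 36.7 mW at 16 Gbps
    print(round(energy_per_bit_pj(36.7, 16.0), 2))  # 2.29
```

The result, roughly 2.3 pJ/bit, agrees with the value listed for the token-based wireless interconnect in Table 4.3.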
A. Wireless Channel Modeling and Link Budget Analysis

In this subsection, we discuss the radiation characteristics of the antennas that are designed to
communicate across a cooling layer with microchannels of height 100 μm and width of 800 μm.
These dimensions are chosen based on the results of the previous subsection.
A.1 Characteristics of the Antennas

The characteristics of the antennas are simulated using HFSS [73], which solves Maxwell's
equations in the entire volume of the model by automatically dividing the volume into tetrahedral
elements and hence considers both the near field and the far field of the antennas. The antennas
were designed and tuned by Rounak Singh Narde under the supervision of Dr. Jayanti Venkataraman
[74]. The detailed design model of the 3D IC is shown in Figure 4.6. Four layers of silicon
substrate are placed on top of each other with a silicon dioxide (silica)
layer sandwiched in between. In addition, a 200 µm thick silicon cooling layer with embedded
100 µm tall microchannels is placed in the middle of the IC. The microchannels are 800 µm
wide with 200 µm walls, as widening the microchannels beyond 800 µm
alters the thermal characteristics only marginally, as noted in the previous subsections.


Figure 4.7. Dimensions of the designed zig-zag antenna.

We evaluated the communication capability of these antennas deployed on all the layers of the
3D IC both separated as well as not separated by the cooling layer. The antennas are placed
assuming the wireless 3D NoC architecture as proposed in section 4.1.2 at the center of each subnet
in each layer. Therefore, some pairs are separated only vertically while other pairs are separated
both vertically as well as horizontally (in hubs that are not vertically aligned).
The metal antennas are embedded in the middle of a 6 µm layer of silica.
Figure 4.7 shows the specific dimensions of the antenna and its coplanar feed structure. A trace
width of 5 µm and a thickness of 2 µm are used for all arms of the antenna. All the antennas are tuned
to resonate at 60 GHz with a return loss of -16 dB or better. Figure 4.8 shows the return loss
for the proposed antenna system. At low frequencies, below 10 GHz, there
is no resonance in these antennas, as the return loss is 0 dB. This demonstrates that the antennas
do not radiate at those frequencies, which eliminates the possibility of interference
with the clock or other electrical signals in the IC, most of which lie in frequency bands below 10 GHz.
As described in earlier sections, we consider a 64-core system with 4 active layers
and one cooling layer for our system-level simulation, where each active layer is divided into 4
subnets, resulting in a total of 16 WIs. The worst-case path loss is seen for antennas that are placed
in the lowest layer, i.e., close to the metallic ground plane. Figure 4.9 shows the insertion loss for
one antenna placed in the lowest layer. From the figure, it can be seen that transmission between
antennas that are in the near field, i.e., placed vertically on top of each other, is better than
between other pairs. This effect is also demonstrated in [91], which shows that transmission is
much higher in the near field than in the

Figure 4.8. Return losses of all 16 antennas.


far field. The worst-case path loss is 48.6 dB (a transmission of -48.6 dB in Figure 4.9), between
antennas that are deployed in the first layer (closest to the ground plane) and the third layer.
The cooling layer separates these layers. Also,
these antennas are in subnets that are diagonally across from each other in the planar dimension.
In the next subsection, we estimate the reliability of wireless communication using this antenna
system and existing on-chip mm-wave transceivers from the literature [28].
A.2 Link Budget Analysis

In this subsection, we present a link budget analysis to estimate the Bit Error Rate (BER) in the
wireless communication channel in the 3D IC with cooling layers. The following equation gives
the transmitted power, Pt in dBm on the wireless channels:
$P_t = SNR + PL + N_{floor}$.    (4.4)

Figure 4.9. The insertion loss of antenna 1 in layer 1.

where SNR is the signal-to-noise ratio at the receiver in dB, PL is the path loss in dB, and $N_{floor}$ is
the receiver noise floor in dBm. The worst-case path loss PL obtained from Figure 4.9 is 48.6 dB
(plotted as an insertion loss of -48.6 dB). The noise floor of the receiver is given by

$N_{floor} = 10 \log_{10}(kTB) + NF$,    (4.5)

where k is the Boltzmann constant, T is the temperature, B is the bandwidth of the receiver, and
NF is the noise figure of the receiver in dB. The noise figure of the adopted receiver is 13 dB [29].
This makes the receiver noise floor -62.68 dBm at 50 °C for a bandwidth of 16 GHz. Using the
0 dBm output power of the transmitter adopted from [28], the received SNR turns out to be 16.846
dB. With the non-coherent OOK modulation scheme adopted in the transceiver, this achieves a BER
of about $10^{-12}$, which is comparable to conventional interconnects in multicore SoCs.
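As a reading aid, Eq. (4.4) can be rearranged for the SNR available at the receiver when the transmit power is fixed. This only restates Eqs. (4.4) and (4.5). Two points are assumptions about the intended convention rather than statements from the text: PL is taken as a positive loss in dB, and kTB is referred to 1 mW so that the noise floor comes out in dBm.

```latex
\begin{align*}
  \mathrm{SNR} &= P_t - PL - N_{floor}, \\
  N_{floor}\,[\mathrm{dBm}] &= 10 \log_{10}\!\left(\frac{kTB}{1\,\mathrm{mW}}\right) + NF .
\end{align*}
```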
B. Temperature and Performance Characteristics of 3D Wireless NoC Architectures with Synthetic Workload
In this section, we evaluate the temperature profiles and performance of 3D multicore chips
interconnected with various NoC architectures and employing microchannel-based cooling as well
as conventional cooling, under synthetic workloads and traffic scenarios. To capture the impact of
the power dissipation of the cores on chip temperatures, we use the predictive power models proposed
in [92] to estimate the power profile of the individual cores for synthetic workloads. According to
[92], the chip power density in the 65 nm technology node is 0.5 W/mm². We consider a 64-core
chip consisting of 4 layers, each of 10 mm x 10 mm. Hence, based on the tile-based floorplan, each
core is estimated to dissipate 2.645 W. Using these power profiles of the cores and the
power profile of the NoC obtained with uniform random traffic, the network switches and links are
arranged on a 10 mm x 10 mm layer. These floorplans, along with the power profiles, are used in a thermal
modeling tool, HotSpot3D [93], to obtain the thermal profiles. We use a sampling interval of 100k
cycles, and all simulations are initialized at ambient temperature. Specifically, we consider the
following configurations:
1. 3D Mesh NoC with TSV with a conventional air-cooled heat sink (3D-MTSV) adopted from
[94]. For this architecture, each switch is connected to its cardinal neighbors as well as its vertical
neighbors above and below itself.
2. Token-based 3D Hierarchical Wireless NoC with interlayer cooling (3D-THiWiNoC) as
proposed here.
B.1 Temperature Evaluation with Synthetic Workload

The power dissipation of the cores, the power dissipation of the NoC switches and interconnects,
as well as the impact of the cooling infrastructure, are considered in the evaluation of the
temperatures. The maximum chip temperature can be the temperature of a core, a link, or a
switch. For interlayer cooling, we consider one cooling chip after every two active layers.
It is important to note that we did not incorporate any dynamic thermal management technique
in this experiment, as our goal was to study the effectiveness of the two cooling approaches
(conventional forced-air cooling and interlayer liquid cooling) considered in this subsection.
Table 4.2 shows the maximum chip temperature for each of the architectures studied here for two
different flit sizes. For the interlayer coolers, we used a thermal resistance of 0.4 K/W. As noted in
section 4.2.1, a thermal resistance of 0.4 K/W is obtained by optimizing the width and height of the
microchannels such that the pressure drop across them is not too high. From the table, it can be
seen that for a flit size of 32 bits, the maximum steady-state temperature of the 3D-MTSV reaches
102.17 °C with a conventional forced-air-cooled heat sink, whereas, with interlayer

TABLE 4.2. PEAK TEMPERATURE OF TWO ARCHITECTURES CONSIDERED HERE
                           | Flit size (32 bits)     | Flit size (256 bits)
                           | 3D-MTSV | 3D-THiWiNoC   | 3D-MTSV | 3D-THiWiNoC
Thermal resistance (K/W)   | 1*      | 0.4           | 1*      | 0.4
Peak chip temperature (°C) | 102.7   | 65.12         | 158.1   | 67.81
* Convectional cooling only

cooling, it decreases by 35.5% for the 3D-THiWiNoC architecture. On the other hand, increasing the flit
size increases the power dissipated by the interconnects (links and switches), as
described in the next section. Because of this, for a flit size of 256 bits, the maximum steady-state
temperature of the 3D-MTSV reaches 158.1 °C with a conventional forced-air-cooled heat
sink, whereas for the same flit size, the maximum steady-state temperature of the 3D-THiWiNoC
remains below 68 °C. Again, no additional DTM/DPM mechanism was considered in these cases. The interlayer
microfluidic cooling channels are more efficient at removing heat from the stack of active layers.
In contrast, with conventional convection cooling, heat is removed only from the top surface of the
3D IC, trapping heat in the internal layers and making temperatures rise dramatically in response to
similar power dissipation profiles.
B.2 Performance Evaluation with Synthetic Workload

In this section, we evaluate the proposed token-based 3D hierarchical WiNoC architecture for a
system size of 64 cores with uniform random traffic in terms of energy cost per bit and
peak network bandwidth, and compare it with the 3D wireline mesh architecture for flit
sizes of 32 bits and 256 bits. Energy cost per bit is the energy dissipated in transferring
one bit completely from source to destination at network saturation. Peak bandwidth is the
maximum achievable data rate for the NoC. The bandwidth is measured as the average number of
bits successfully arriving per core per second. From Figure 4.10, it can be seen that the 3D-MTSV
architecture yields a higher peak bandwidth for both flit sizes than the wireless
architecture with liquid cooling. This is because the number of active links in the wireless
architecture is lower than in the 3D-MTSV, as the wireless architecture eliminates the TSV-based
interconnects for data communication across the cooling layers. In the 64-core system divided into
4 layers with 16 cores each, there is a cooling layer after every 2 layers. This implies that in the
WiNoC architecture, the 16 TSV-based links connecting the vertically adjacent switches across the
cooling layer are eliminated. This results in a loss of an aggregate physical bandwidth of 1.28 Tbps
for a flit size of 32 bits (16 links with 32 bits each at 2.5 GHz). The additional wireless bandwidth
of 16 Gbps helps to interconnect the two segments across the cooling layer. The removal of the
interlayer TSV-based connections across the cooling chip degrades the bandwidth by approximately
the bisection bandwidth of the removed TSV-based links. On the other
hand, for a flit size of 256 bits, the 3D-MTSV architecture shows nearly a 7.6x improvement over a flit
size of 32 bits, whereas for the 3D-THiWiNoC architecture the improvement is about 6.2x. This is
because widening the physical channel to accommodate larger flits increases the data rate of the
wireline links and TSVs. However, the data rate of the wireless links is governed by the speed of the
transceiver and the bandwidth of the antennas, which do not change with flit size. Hence, while wireline
communication becomes faster with an increase in flit size, the wireless communication speed
remains constant, resulting in a lower relative improvement in bandwidth
compared to a completely wired NoC. However, wider TSV-based vertical links have a
significant impact on the area required to route them and consequently increase the challenges of
place and route of these communication channels across the cooling layers. The interlayer wireless
links alleviate this problem significantly, as discussed later in section 4.2.6.
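The aggregate physical bandwidth of the TSV links removed across the cooling layer follows from the link count, the flit width, and the switch clock. The sketch below is a minimal illustration, assuming each link transfers one flit-wide word per clock cycle; the function name is illustrative.

```python
def aggregate_link_bandwidth_tbps(num_links, flit_bits, clock_ghz):
    """Aggregate raw bandwidth in Tbps of `num_links` parallel links.

    Assumption: every link moves `flit_bits` bits per clock cycle.
    """
    gbps = num_links * flit_bits * clock_ghz  # bits/cycle * 1e9 cycles/s
    return gbps / 1e3                         # Gbps -> Tbps


if __name__ == "__main__":
    for flit_bits in (32, 256):
        bw = aggregate_link_bandwidth_tbps(16, flit_bits, 2.5)
        print(f"{flit_bits:3d}-bit flits: {bw:.2f} Tbps")
```

With 16 removed links clocked at 2.5 GHz, this gives 1.28 Tbps for 32-bit flits and 10.24 Tbps for 256-bit flits, both far above the 16 Gbps offered by the shared wireless channel.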
The WiNoCs, on the other hand, can reduce the energy cost per bit compared to the wireline
TSV-based 3D NoCs for both flit sizes. This is because long-range data transfer between switches
in different layers of the 3D NoC, which are also separated by several millimeters in the
planar dimension, can be accomplished in a single hop using the wireless links. In contrast, in the
3D mesh architectures, data transfer over even short vertical distances happens over multi-hop paths
through multiple switches, resulting in higher energy dissipation compared to the 3D-THiWiNoCs for
both flit sizes. In addition, the interlayer cooling chip reduces the temperature due to increased
cooling capacity. Therefore, while the bandwidth of the WiNoCs is less than that of the 3D mesh,

the packet energy and peak temperatures are significantly reduced.

[Figure 4.10 plot: peak bandwidth (Tbps, left axis) and energy cost per bit (nJ, right axis) for 3D-MTSV and 3D-THiWiNoC at flit sizes of 32 bits and 256 bits.]

Figure 4.10. Peak bandwidth and energy cost per bit for different 3D NoC architectures.


4.2.3 Temperature and Performance Characteristics of 3D NoC Architectures with Real
Application-based Workloads
In this section, we evaluate the performance and thermal characteristics of 3D multicore ICs
interconnected by the proposed token-based 3D-THiWiNoC and the 3D-MTSV architectures in the
presence of real application-based workloads mapped onto a 64-core chip.
We use GEM5 [95], a full-system simulator, to obtain detailed processor- and network-level
information on SPLASH-2 [96] and PARSEC [97] benchmarks, and HotSpot3D [93] to obtain
detailed thermal profiles. We consider a system of 64 Alpha cores running Linux within the GEM5
platform for all experiments. The memory system is MOESI_CMP_directory, set up with private
64 KB L1 instruction and data caches and a shared 64 MB (1 MB distributed per core) L2 cache.
The processor-level utilization statistics generated by the GEM5 simulations are fed into the
McPAT simulator [98] to determine the processor-level power statistics. The traffic interaction

patterns for each benchmark obtained from GEM5 are used in the NoC simulator to get the NoC

[Figure 4.11 plot: normalized peak bandwidth (left axis) and normalized energy cost per bit (right axis) for 3D-MTSV and 3D-THiWiNoC across the benchmarks.]

Figure 4.11. Normalized Peak bandwidth and energy cost per bit in the presence of real application
traffic.


performance in terms of peak bandwidth, average network latency, and average energy cost per
bit. Similar to the previous experiments, we consider a cooling chip after every two layers.
Figures 4.11 and 4.12 show the normalized peak bandwidth and energy cost per bit, and the
difference in peak chip temperature (ΔT), of the two architectures considered here in the presence of
different real application traffic. As the number of vertical links is lower in the 3D-THiWiNoC
architecture than in the 3D-MTSV, the peak bandwidth of the 3D-THiWiNoC is lower than that of the
3D-MTSV for all real application traffic. However, the presence of single-hop, energy-efficient
wireless links reduces the energy cost per bit compared to the wireline TSV-based 3D NoCs.
Also, the presence of the interlayer cooling chip reduces the peak temperature due to
increased cooling capacity, as can be seen from Figure 4.12. With a cooling chip after every two
layers, the 3D-THiWiNoC architecture shows on average a 13 °C and at most an 18 °C
reduction in peak temperature compared to the 3D-MTSV architecture with a conventional fan-based
air-cooled heat sink. We observe that although the peak bandwidth in 3D-THiWiNoC is less than

[Figure 4.12 plot: difference in peak chip temperature ΔT (°C) with respect to the 3D-MTSV architecture, per benchmark.]

Figure 4.12. Peak chip temperature in presence of real application traffics.


that of wireline 3D-MTSV architecture, it achieves lower energy cost per bit and peak temperatures
for all the application-specific workloads.
4.2.4 Increasing Cooling Capacity of the Interlayer Coolers and its Impact on Performance
of the 3D Wireless NoC
In this section, we study the impact of various techniques for increasing the cooling capacity of
the interlayer coolers in case the power consumption profiles of the 3D IC layers require it. The
cooling capability of the microchannels can be increased by increasing the flow rate of the coolant
through the microchannels. However, that increases the pressure drop, so
the flow rate can only be increased as long as the resultant pressure drop stays within the acceptable
limits of hydraulic stress that the 3D IC can sustain. Otherwise, the microchannel heights will need to be
increased, requiring the antennas to be retuned for the altered dimensions. Hence, in this section we
study the increase in cooling capacity obtained by increasing the frequency of insertion of the
cooling layers.
[Figure 4.13 plot: peak bandwidth (Tbps), energy cost per bit (nJ), and maximum temperature (°C) for 3D-THiWiNoC. Cooling layer after 1 active layer: 1.74 Tbps, 0.64 nJ, 50.23 °C. Cooling layer after 2 active layers: 2.49 Tbps, 0.55 nJ, 65.84 °C.]

Figure 4.13. The impact of frequency of the cooling layers.


The frequency of insertion of the cooling layers signifies the number of active device layers
between the cooling layers. We study the impact of varying the frequency of the cooling layers on
the performance, energy dissipation, and maximum chip temperature of the token-based 3D-THiWiNoC
architecture. We consider uniform random traffic with maximum injection load for this
experiment. Figure 4.13 shows the effect of varying the frequency of cooling layers from one after
every active layer to one after every 2 active layers on maximum chip temperature, peak
bandwidth, and energy cost per bit. From the figure, it can be seen that increasing the frequency
of cooling layers reduces the maximum chip temperature to a great extent.
With a cooling chip after every layer, each with a thermal resistance of
0.4 K/W, the maximum temperature reduces to 50.23 °C for the token-based 3D-THiWiNoC, from around
65 °C for the same NoC with one cooling layer after every two layers. However, increasing the
frequency of cooling layers reduces the performance of the wireless 3D architecture.
This is because in the token-based 3D-THiWiNoC architecture with a cooling chip after every active
layer, the number of active links is 33.33% lower than in the 3D wireline mesh architecture, to
accommodate the microchannel liquid cooling chips, which in turn reduces performance. With a cooling
chip after every two active layers, the TSV-based links retained between adjacent layers increase the
performance of the token-based 3D-THiWiNoC architecture. However, decreasing the frequency of
cooling layers increases the maximum chip temperature because more active layers are stacked between coolers.
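The 33.33% figure can be reproduced by counting links in the reference wireline mesh. The sketch below is illustrative only: it assumes a 4 x 4 x 4 mesh (64 cores in 4 layers of 16) and assumes that every vertical link crossing a cooling layer is removed. It counts links of the mesh baseline, not of the hierarchical WiNoC itself, and the function names are hypothetical.

```python
def mesh_link_counts(nx, ny, nz):
    """Return (horizontal, vertical) link counts of an nx x ny x nz 3D mesh."""
    horizontal = nz * ((nx - 1) * ny + nx * (ny - 1))
    vertical = (nz - 1) * nx * ny
    return horizontal, vertical


def removed_link_fraction(nx, ny, nz, cooling_every):
    """Fraction of mesh links lost when a cooling layer follows every
    `cooling_every` active layers and all vertical links crossing it are dropped."""
    horizontal, vertical = mesh_link_counts(nx, ny, nz)
    # Interface i sits between layer i and layer i + 1 (1-indexed); it holds
    # a cooling layer when i is a multiple of `cooling_every`.
    cooled_interfaces = sum(1 for i in range(1, nz) if i % cooling_every == 0)
    removed = cooled_interfaces * nx * ny
    return removed / (horizontal + vertical)


if __name__ == "__main__":
    print(mesh_link_counts(4, 4, 4))                      # (96, 48)
    print(f"{removed_link_fraction(4, 4, 4, 1):.2%}")     # 33.33%
    print(f"{removed_link_fraction(4, 4, 4, 2):.2%}")     # 11.11%
```

With a cooling layer after every active layer, all 48 vertical links of the 144-link mesh are lost; with a cooling layer after every two layers, only the 16 links of the middle interface are lost.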
4.2.5 Comparison with Alternative Wireless Communication Mechanism
In this section, we evaluate the performance of the proposed wireless multichip system coupled
with an alternative wireless communication mechanism, i.e., a Direct Sequence Spread Spectrum
(DSSS) based Code Division Multiple Access (CDMA) channel access mechanism, and compare
its performance in terms of bandwidth and energy cost per bit with the token-based 3D-THiWiNoC
architecture.
In the token-passing-based wireless medium access mechanism, only a single transmitter can
access the wireless channel at any given instant of time. This limits the performance benefits of
the wireless interconnection even though multiple transceivers are deployed over the entire system.
However, it is possible to design sophisticated MAC mechanisms that enable multiple WIs
to share the wireless channel without interference and ensure optimal utilization of the
available bandwidth. To enable multiple concurrent communications between the WIs, the
authors in [34] proposed a DSSS-based CDMA channel access mechanism for a wireless NoC. In
the CDMA-based medium access mechanism, Walsh codes are used to create orthogonal code
channels for multiple access. Due to this orthogonality, bits in one code
channel are not affected by the other channels. Transmitted bits are first encoded using the code word,
and at the receiving WI, the received bits are XORed with the code words to extract the transmitted

[Figure 4.14 plot: peak bandwidth (Tbps, left axis) and energy cost per bit (nJ, right axis) for 3D-THiWiNoC and 3D-CHiWiNoC.]

Figure 4.14. Peak bandwidth and energy cost per bit for different wireless communication
mechanisms.


TABLE 4.3. ENERGY PER BIT FOR A SINGLE POINT-TO-POINT LINK AND POSSIBLE AGGREGATE BANDWIDTH FOR
DIFFERENT WIRELESS COMMUNICATION PROTOCOLS
                                    | Token-based wireless interconnect | CDMA-based wireless interconnect
Energy (pJ/bit)                     | 2.3                               | 3.43
Aggregate physical bandwidth (Gbps) | 16                                | 6
Transceiver area (mm²)              | 0.3                               | 0.4

data. We adopt the performance and power characteristics of the CDMA transceiver from [34] to
estimate the performance of the system. The raw energy/bit of the wireless technology for a single
point-to-point link and aggregate physical bandwidth provided by each of these technologies are
summarized in Table 4.3.
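The role of orthogonal Walsh codes can be illustrated with a small, idealized example. The sketch below is not the transceiver of [34]: it assumes a noise-free, perfectly chip-synchronous channel that simply adds the transmitted chips, it uses bipolar chips (+1/-1, where multiplication plays the role of XOR on binary chips), and all function names and the bit assignment are hypothetical.

```python
def walsh_codes(order):
    """Return 2**order Walsh-Hadamard codes of length 2**order, chips in {+1, -1}."""
    codes = [[1]]
    for _ in range(order):
        codes = ([row + row for row in codes]
                 + [row + [-chip for chip in row] for row in codes])
    return codes


def spread(bit, code):
    """Encode one bit (0 or 1) as a sequence of chips using `code`."""
    symbol = 1 if bit else -1
    return [symbol * chip for chip in code]


def share_channel(bits_by_code, codes):
    """Sum the chips of all concurrent transmitters (ideal synchronous channel)."""
    channel = [0] * len(codes[0])
    for code_index, bit in bits_by_code.items():
        for i, chip in enumerate(spread(bit, codes[code_index])):
            channel[i] += chip
    return channel


def despread(channel, code):
    """Recover the bit sent on `code` by correlating the channel with it."""
    correlation = sum(s * c for s, c in zip(channel, code))
    return 1 if correlation > 0 else 0


if __name__ == "__main__":
    codes = walsh_codes(2)            # four orthogonal codes of length 4
    bits = {1: 1, 2: 0, 3: 1}         # hypothetical: code index -> bit sent by a WI
    channel = share_channel(bits, codes)
    print(channel)
    print({k: despread(channel, codes[k]) for k in bits})
```

Because the codes are mutually orthogonal, each receiver recovers its own bit even though three WIs transmit at the same time. If the chips of different transmitters were misaligned, the correlation would no longer cancel the other channels, which is the synchronization concern raised for CDMA in this section.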
Figure 4.14 shows the peak achievable bandwidth and energy cost per bit for the token-based
and the CDMA-based 3D wireless NoC (3D-CHiWiNoC) architectures. For a fair comparison, we
consider the same network topology and the same number of WIs per layer in both cases. From the
figure, it can be seen that the peak achievable bandwidth of the CDMA-based 3D-CHiWiNoC is
19.86% higher than that of its token-based counterpart. This is because in CDMA, concurrent wireless
communication is possible in different orthogonal code channels, whereas in token passing only
one wireless communication is possible at a time. Similarly, the energy cost per bit of the
CDMA-based 3D-CHiWiNoC is lower than that of the token-based 3D-THiWiNoC architecture. This is
because, in the token-based system, when the number of WIs is increased, the token returning period
of the WIs also increases. This results in a higher energy cost per bit in the token-based system, as
packets have to wait longer in the wireless buffers before being transmitted via the wireless
interconnect.
However, the performance improvement from a more sophisticated MAC does
not come without a price. Reliability can be an issue for the CDMA-based 3D-CHiWiNoC
architecture, as CDMA requires the transmitters to be synchronized with each other so that the
transmitted data is perfectly aligned with that of the other transmitters to maintain the orthogonality
between the code words. Such synchronization is difficult to achieve in a 3D environment, as the
WIs are distributed across different layers, and requires further investigation.
4.2.6 Area Overheads
In this section, we estimate the comparative area overheads of the signal TSVs routed through the
microchannel-based cooling layer. TSVs for power and clock networks are not considered here.
We consider a moderate-size signal TSV with a diameter of 2 µm and a pitch of 4 µm to calculate
TABLE 4.4. AREA OVERHEAD FOR THE SIGNAL TSVS THROUGH MICROCHANNEL COOLING LAYER.
                                                                                      | Flit size of 32 bits    | Flit size of 256 bits
                                                                                      | 3D-MTSV | 3D-THiWiNoC   | 3D-MTSV | 3D-THiWiNoC
Required area for place and route of signal TSVs (mm²)                                | 2.016   | 0             | 16.352  | 0
Available area for place and route of all the TSVs through microchannel walls (mm²)   | 20      | 20            | 20      | 20
Percentage of occupied area for place and route of signal TSVs through microchannel walls | 10.08% | 0%         | 81.76%  | 0%

the area overhead. The optimal width of each microchannel is found to be 800 µm in section 4.2.1,
leaving a 20 mm² area available for place and route of all the TSVs through the interlayer cooler.
Table 4.4 shows the area required to route the signal TSVs through the microchannel walls.
From the table, it can be seen that for a flit size of 32 bits, 10.08% of the area is occupied by the
signal TSVs routed through the interlayer cooling layer. As a result, 89.92% of the area is
available to place and route the power and clock delivery TSVs. For a flit size of 256 bits, however,
the signal TSVs occupy 16.352 mm² of the 20 mm² available, leaving only 18.24% of the
area to route the power and clock TSVs, which might not be sufficient. On the other hand,
in the wireless 3D NoC architecture with interlayer cooling, we eliminate all the
signal TSVs across the microchannels, keeping the entire 20 mm² area for power and clock delivery
TSVs. This, in turn, significantly reduces the complexity of the co-design of TSV-based interconnects
and microchannel-based interlayer cooling.
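The percentages in Table 4.4 follow directly from the required and available areas. The sketch below reproduces them from the table entries; the function name is illustrative, and the required areas themselves are taken from the table rather than derived.

```python
def tsv_area_share(required_mm2, available_mm2):
    """Return (occupied %, remaining %) of the routing area in the microchannel walls."""
    occupied = 100.0 * required_mm2 / available_mm2
    return occupied, 100.0 - occupied


if __name__ == "__main__":
    # (flit size in bits, required signal-TSV area in mm^2) for 3D-MTSV, from Table 4.4
    for flit_bits, required in ((32, 2.016), (256, 16.352)):
        occupied, remaining = tsv_area_share(required, 20.0)
        print(f"{flit_bits:3d}-bit flits: {occupied:.2f}% occupied, {remaining:.2f}% left")
```

For the 3D-THiWiNoC the required area is zero, so the whole 20 mm² remains available for power and clock TSVs.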
4.2.7 Trade-off Analysis
From all the results, we can conclude that the 3D-THiWiNoC with 256 bits per flit and
microchannel-based interlayer coolers provides the best trade-off between performance, energy
consumption per bit, and temperature. While the fully wired TSV-based counterpart provides higher
bandwidth, it imposes severe challenges on the co-design and place and route of the TSV-based links
across the cooling layers, due to its high area requirements. The same fully wired architecture with
32 bits per flit has lower area requirements for the vertical links but has lower bandwidth,
higher energy consumption, and a higher temperature than the 3D-THiWiNoC with
256 bits per flit. The 3D-THiWiNoC with 256 bits per flit enjoys the benefit of the wide TSV-based
vertical links among adjacent layers while enabling communication across the cooling layer
with the wireless links. Moreover, utilizing wireless interconnects for data communication across
the cooling layer relaxes the height restrictions on the microchannels and therefore results in a lower
pressure drop across the cooling layer. This, in turn, leads to a lower power requirement to drive
the flow through the channels and increases structural stability.
Increasing the frequency of cooling layers reduces the performance of the
wireless 3D architecture. This is because in the token-based 3D-THiWiNoC architecture with a
cooling chip after every active layer, the number of active links is reduced compared to the
3D wireline mesh architecture to accommodate the microchannel liquid cooling chips. This, in turn,
reduces the performance of the 3D wireless NoC. With a cooling chip after every two active layers,
the TSV-based links retained between adjacent layers increase the performance of the token-based
3D-THiWiNoC architecture. However, decreasing the frequency of cooling layers increases the maximum
chip temperature because more active layers are stacked between coolers.
Compared to other contactless interconnects such as capacitive or inductive coupling
based links, the energy consumption per bit of mm-wave wireless links does not increase with
communication distance. This makes mm-wave wireless interconnects a feasible solution for data
communication across microchannel cooling layers with heights of 100 µm.
In terms of the wireless communication protocol, in the token passing based wireless medium
access mechanism, only a single transmitter can access the wireless channel at any given instant
of time. This limits the performance benefits of the wireless interconnection even though multiple
transceivers are deployed over the entire system. More complex MAC mechanisms such as CDMA can
enable better performance compared to the token passing based method. However, the precise
synchronization required by CDMA links is difficult to achieve in a large 3D multicore chip. A
loss of synchronization will introduce inter-channel interference, resulting in unreliable
performance; this issue requires further investigation.
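The single-transmitter constraint of token passing can be illustrated with a minimal sketch. The function name, the round-robin token order, and the assumption that the token holder sends at most one flit per slot are illustrative choices of ours and do not reproduce the simulator used in this work.

```python
def simulate_token_mac(queues, slots):
    """Round-robin token passing over a shared wireless channel.

    queues: pending flit counts, one entry per wireless interface (WI).
    slots:  number of time slots to simulate.
    Returns a list of (slot, wi) pairs, one per transmitted flit.
    All names and the one-flit-per-slot rule are hypothetical.
    """
    pending = list(queues)
    token = 0
    log = []
    for slot in range(slots):
        # Only the WI currently holding the token may use the channel.
        if pending[token] > 0:
            pending[token] -= 1
            log.append((slot, token))
        # The token moves on whether or not the holder had data to send.
        token = (token + 1) % len(pending)
    return log


if __name__ == "__main__":
    transmissions = simulate_token_mac([3, 0, 1, 2], slots=8)
    print(transmissions)
    print("channel utilization:", len(transmissions) / 8)
```

In this toy run, slots are wasted whenever the token visits a WI with an empty queue, even though other WIs still have flits pending. This is the inefficiency that motivates the more complex MAC mechanisms mentioned above.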
4.2.8 Holistic Comparison of the Vertically Integrated Wireless System with Horizontally
Integrated Wireless Multichip System
In this section, we perform a holistic comparison of the wireless 3D system with respect to the
wireless multichip system in terms of performance, energy efficiency, and temperature.
Horizontal integration, or the Multi-Chip Module (MCM), is another way of integrating
multiple chips, in which the chips are placed horizontally on the same substrate or interposer within a
package; it is considered a predecessor to monolithic 3D integration. Conventionally, MCM
systems are interconnected by C4 bumps coupled with in-package transmission lines [41].
However, signal quality deterioration due to microwave effects, crosstalk coupling effects, signal
reflections, and frequency-dependent line losses in the transmission lines restricts the possible
performance benefits of the multichip system [10]. To improve performance and energy efficiency,
[87] proposes to use wireless interconnects for data transfer between chips in an MCM system.
Due to the analogous nature of the wirelessly connected MCM [87] and the 3D wireless NoC,
which integrates multiple layers with the same mm-wave technology, we compare the two systems in
terms of their performance, energy efficiency, and thermal profiles.

TABLE 4.5. PERFORMANCE COMPARISON WITH THE HORIZONTALLY INTEGRATED WIRELESS MULTICHIP MODULE.

                          Wireless MCM   3D-THiWiNoC            3D-THiWiNoC
                          System         (Cooling layer after   (Cooling layer after
                                         2 active layers)       every active layer)
Peak Bandwidth (Tbps)     1.737          2.491                  1.738
Energy Cost per Bit (nJ)  0.64           0.55                   0.64
Temperature (°C)          71.31          65.84                  50.23

For a fair comparison, the same number and location of wireless interconnects and the same
token-based wireless medium access mechanism are considered in both the 3D and MCM systems. In
addition, the intra-layer NoC architecture and the intra-chip NoC architecture for the individual chips
in the MCM are considered to be identical. The routing and switching protocols in the two systems
are also the same, with the same NoC switch design adopted in both. Table 4.5 shows the peak bandwidth,
energy cost per bit, and temperature for the 4-chip wireless MCM and the 3D-THiWiNoC
architectures at network saturation using uniform random traffic. For the 3D-THiWiNoC architecture,
we have considered two different configurations: a cooling layer after every active layer and a
cooling layer after every 2 active layers. It is important to note that inserting a cooling chip after
every active layer eliminates all the data TSVs, so that data communication across the cooling
layers relies solely on wireless interconnects, and results in an identical topology for the wireless MCM
and the 3D-THiWiNoC architecture. Consequently, these two systems are essentially equal in peak bandwidth
and energy cost per bit, as can be seen from Table 4.5. However, the cooling layers after every
active layer result in much better thermal characteristics. The 3D-THiWiNoC with a cooling chip
after every two layers has higher bandwidth and lower energy cost per bit compared to the other
configurations, due to the augmentation with TSVs for data communication between active layers that
are not separated by a cooling layer. On the other hand, its thermal characteristics are worse than
the 3D-THiWiNoC with cooling layers between every active layer. Both 3D-THiWiNoC
configurations have lower peak chip temperature compared to the wireless MCM system due to the
presence of microchannels, which are more efficient in heat removal than traditional fan-based
air-cooled heat sinks. Therefore, it can be observed that the 3D-THiWiNoC can provide identical
or better performance compared to a wireless MCM with better thermal characteristics.

TABLE 4.6. HOLISTIC COMPARISON OF BOTH MULTICHIP INTEGRATION TECHNIQUES IN VARIOUS DOMAINS.

Domain                 3D Wireless Multichip System      Wireless MCM System
                       with Microchannel Cooling
Projected Performance  Higher compared to wireless       Outperforms conventional wireline
                       MCM systems                       interchip communication systems
Form Factor            Smaller footprint due to the      Larger footprint due to horizontal
                       vertical stacking                 integration
Design Flow            Requires new EDA tools            Can utilize existing EDA tools
Manufacturing          Challenging                       Contemporary
Testing                New methods                       Contemporary
Device Impact          Stress from TSV fabrication       None
Cost                   High (initially)                  High Volume Manufacturing (HVM)
                                                         possible at low cost
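
The trade-offs reported in Table 4.5 can be made explicit with a short calculation. The sketch below only restates the values of Table 4.5; the dictionary keys and helper names are our own illustrative choices.

```python
# Values copied from Table 4.5 (network saturation, uniform random traffic).
TABLE_4_5 = {
    "wireless_mcm": {"bw_tbps": 1.737, "energy_nj": 0.64, "temp_c": 71.31},
    "3d_cooling_every_2_layers": {"bw_tbps": 2.491, "energy_nj": 0.55, "temp_c": 65.84},
    "3d_cooling_every_layer": {"bw_tbps": 1.738, "energy_nj": 0.64, "temp_c": 50.23},
}


def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base


def compare_to_mcm(config):
    """Hypothetical helper: deltas of one 3D configuration against the wireless MCM."""
    base = TABLE_4_5["wireless_mcm"]
    cfg = TABLE_4_5[config]
    return {
        "bw_change_pct": percent_change(cfg["bw_tbps"], base["bw_tbps"]),
        "energy_change_pct": percent_change(cfg["energy_nj"], base["energy_nj"]),
        "temp_drop_c": base["temp_c"] - cfg["temp_c"],
    }


if __name__ == "__main__":
    for name in ("3d_cooling_every_2_layers", "3d_cooling_every_layer"):
        deltas = compare_to_mcm(name)
        print(name, {key: round(value, 2) for key, value in deltas.items()})
```

Under these numbers, the configuration with a cooling layer after every 2 active layers gains roughly 43% in peak bandwidth and lowers the energy cost per bit by about 14% relative to the wireless MCM. The configuration with a cooling layer after every active layer lowers the temperature by about 21 °C at essentially unchanged bandwidth and energy cost.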

While improvement in energy efficiency and temperature profile is possible in the 3D
wireless NoC architecture with microchannel based liquid cooling, taking full
advantage of 3D integration requires more research to address various challenges in multiple
areas including TSV fabrication, testing, and CAD tool development. Table 4.6 shows the holistic
comparison of both multichip integration techniques. Fabrication of TSVs requires additional
process steps, and these additional steps make TSV manufacture at high yield extremely difficult.
In addition, to maintain good conductivity and minimize resistance, the TSVs between dies must
be aligned precisely. Moreover, there are a limited number of Electronic Design Automation (EDA)
tools available to design and test 3D integrated ICs. On the other hand, a horizontally integrated
wireless multichip system utilizing metal-zigzag antennas is CMOS compatible and does not
require any additional fabrication steps. Moreover, the wireless planar multichip system
outperforms conventional wireline interchip communication systems.

4.3 Summary

In this chapter, we present a wireless 3D NoC architecture that enables energy-efficient on-chip
data transfer along with liquid cooling technology suitable for 3D multicore ICs. We designed the
on-chip antennas to establish wireless communication across cooling layers depending upon the
dimensions of the microchannels for best trade-offs in thermal and hydraulic performance. The
hybrid wireless and wireline 3D NoC architecture was designed using these antennas. We
demonstrate that a wireless 3D NoC coupled with a cooling layer with microchannels can
significantly improve the thermal characteristics of the 3D IC compared to 3D NoCs with TSVs and
conventional cooling, while also reducing the energy cost per bit. This is due to a
reduction in multi-hop communication in both planar and vertical directions in a wireless 3D NoC.
However, due to the removal of TSV based high bandwidth links across the cooling layer, the
bandwidth of this wireless NoC is lower than the 3D Mesh.

Chapter 5

CONCLUSION AND FUTURE RESEARCH DIRECTIONS
The aim of this work is to demonstrate the potential of mm-wave wireless interconnects to
overcome the challenges of multichip integration. This chapter concludes the work accomplished
in this dissertation by summarizing significant contributions. It also points towards various
promising future directions originating from this research work.

5.1 Conclusion

The multichip system has emerged as a feasible solution to overcome the area, yield, and
scalability limitations of the single-chip multiprocessor system. However, as a
new technology, disintegration into multiple chips needs to overcome a few key challenges to be
widely accepted, and depending on the integration approach, these challenges are diverse in
nature. In the case of horizontal integration, the difficulties lie in the inter-chip communication,
whereas in vertical integration, the challenge is to find an enabling technology to communicate
across the cooling layer. Wireless interconnects can be a promising solution to deal with these
challenges. This dissertation proposes design methodologies to utilize wireless interconnects
as the communication backbone for both horizontally and vertically integrated multichip systems.
For the horizontally integrated multichip system, this work explores the advantages
possible if inter-chip communication in multichip modules can be realized with state-of-the-art
mm-wave wireless links operating in the 60 GHz band. The wireless links are capable of
establishing direct communication channels between cores in different chips via on-chip embedded
antennas. Moreover, the wireless links can be used for seamless data transfer between cores in
the same chip as well, to augment the traditional NoC backbone for intra-chip communications.
These factors result in significant gains in performance and energy efficiency in both intra and
inter-chip data communications. The energy efficiency of the wireless interconnects has been
improved by careful wireless data transfer protocol design that puts unused WIs to sleep using
power-gated transceivers. It can be further enhanced in the future by using variable levels of power
amplification [86] depending upon the length of the wireless interconnects and associated path losses.
As for vertical integration, this dissertation proposes to use wireless interconnect to enable
energy-efficient on-chip data transfer across the cooling layers. We designed the on-chip antennas
to establish wireless communication across cooling layers depending upon the dimensions of the
microchannels for best trade-offs in thermal and hydraulic performance. Using these antennas, the
hybrid wireless and wireline 3D NoC architecture is designed. From all the results, we can
conclude that the 3D-THiWiNoC with 256 bits per flit and microchannel based interlayer coolers
provides the best trade-offs in performance, energy consumption per bit and temperature. While
the fully wired TSV based counterpart provides higher bandwidth, it imposes severe challenges to
co-design and place and route of the TSV based links across the cooling layers due to its high area
requirements. The same fully wired architecture with 32 bits per flit has lower area requirements
for the vertical links but has lower bandwidth and higher energy consumption as well as worse
temperature compared to the 3D-THiWiNoC with 256 bits per flit. The 3D-THiWiNoC with 256
bits per flit enjoys the benefit of the wide TSV based vertical links among adjacent layers while
enabling communication across the cooling layer with the wireless links. Hence, this architecture
is suitable for applications and environments that have strict constraints on energy consumption
and temperature of the system and do not require extremely high bandwidths. This method of
reducing both energy and temperature of a 3D multicore IC is orthogonal to, and can co-exist with,
other dynamic thermal and power management mechanisms like DVFS which will provide further
enhanced control and trade-offs in the power-performance-temperature spectrum of the chip.
Moreover, dynamic control of the cooling capability of microchannels can be achieved by micropumps, which can dynamically vary the flow rates in the channels. Such a mechanism together
with DVFS can be designed for a more holistic dynamic thermal control of a 3D multi-core chip
in the future.
When compared to the planar multichip module as an alternative to monolithic 3D
integration with interlayer coolers, we find that the monolithic 3D-THiWiNoC (with a cooling
layer after every 2 active layers) has better performance, energy per bit, and temperature compared
to the planar counterpart. This is because of the TSV based links that interconnect the active layers
not separated by the cooling layer. The microchannel based coolers help in reducing the
temperature in the 3D-THiWiNoC. While improvement in energy efficiency and temperature
profile is possible in the 3D-THiWiNoC architecture, there are several other challenges.
Fabrication of TSVs requires additional process steps, and these additional steps make TSV
manufacture at high yield extremely difficult due to challenges related to etching, sidewall
passivation, and the formation, insulation, and filling of vias. In addition, to maintain good
conductivity and minimize resistance, the TSVs between dies must be aligned precisely. On the
other hand, a horizontally integrated wireless multichip system utilizing metal-zigzag antennas is
CMOS compatible and does not require any additional fabrication steps. Moreover, the wireless
planar multichip system outperforms conventional wireline interchip communication systems.
Hence, it is a nearer term alternative as the communication backbone for multichip systems,
providing significant gains in performance over conventional wireline systems. On the other hand,
going forward with a matured TSV fabrication process, the 3D-HiWiNoC with microchannel cooling
can be a future alternative, as it provides better performance, energy per bit, and temperature
compared to the planar counterpart.

5.2 Future Research Directions

Opportunities for extending the research performed in this dissertation are discussed in the
following sections.
5.2.1 Energy-Efficient Multi-gigabit Transceiver Design for Intra and Inter-Chip Wireless
Interconnects
The main enabling technology for the inter and intra-chip wireless interconnection proposed in this
dissertation is the physical layer design comprising the transceiver circuits and antennas. To
compete with state-of-the-art technologies, the power consumption of the transceiver circuits
should be minimal while providing the maximum possible data rates. Trends indicate a target
link energy efficiency of <1 pJ/bit at data rates of >10 Gbps [43][99]. Consequently, mm-wave
transceivers with non-coherent OOK modulation are suitable for such wireless communication
technologies [29][99] due to their low complexity and power consumption. In [100], a 60-GHz
transceiver system with a 2.5 Gbps data rate is implemented in a 90-nm CMOS process and has an
energy efficiency of 114 pJ/bit. In [101], the authors proposed the design of a transceiver with a
bit-energy efficiency of 6.26 pJ/bit at a data rate of 10.7 Gb/s. However, it requires an onboard
Yagi–Uda antenna on a non-silicon substrate, which can result in integration difficulties.
Moreover, none of the transceiver implementations meets all the desirable specifications of the
wireless interconnects, i.e., high multi-gigabit data rate and high energy efficiency [43][99].
Therefore, an energy-efficient mm-wave transceiver suitable for both intra and inter-chip
application is yet to be demonstrated.
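To put the quoted figures on a common scale, the energy per bit can be converted into the power drawn at the stated data rate. The sketch below is a back-of-the-envelope calculation only; the function name is hypothetical, and the 1 pJ/bit at 10 Gbps entry is simply the boundary of the target range quoted above, not a reported design.

```python
def link_power_mw(energy_pj_per_bit, data_rate_gbps):
    """Power drawn by a link running continuously at its full data rate.

    pJ/bit * Gbit/s = 1e-12 J/bit * 1e9 bit/s = 1e-3 W,
    so the plain product is already expressed in milliwatts.
    """
    return energy_pj_per_bit * data_rate_gbps


if __name__ == "__main__":
    designs = {
        "[100] 60-GHz, 90-nm CMOS": (114.0, 2.5),
        "[101] with Yagi-Uda antenna": (6.26, 10.7),
        "target boundary (1 pJ/bit, 10 Gbps)": (1.0, 10.0),
    }
    for name, (energy, rate) in designs.items():
        print(f"{name}: {link_power_mw(energy, rate):.1f} mW")
```

With these definitions, the transceiver of [100] corresponds to roughly 285 mW at 2.5 Gbps and that of [101] to roughly 67 mW at 10.7 Gb/s, whereas a link at the target boundary would draw about 10 mW.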
5.2.2 Traffic-Aware Medium Access Mechanism for Multi-Chip System
The MAC for intra-chip WiNoC has been identified by all research groups as one of the main
challenges in the design, particularly dynamic mechanisms that are adaptive to changes in the
system. Utilizing the full potential of the novel mm-wave interconnect technology in a multichip
system requires overcoming two critical design challenges: i) design of an efficient, simple, and fair
MAC mechanism, and ii) managing the wireless bandwidth effectively. Moreover, intra and
inter-chip traffic can have very different characteristics and requirements. Information exchange
between components in such multichip environments can be either control information or data
exchange. Control information required for tasks such as cache coherency protocols,
synchronization, and thread migration typically requires sporadic but low-latency, time-sensitive
communication. In contrast, reads and writes between processing engines and memory elements
require a higher volume of data exchange. Consequently, the data exchange in the multichip
system can vary from low-load, extremely sporadic yet latency-sensitive data transfer to high-load,
high-throughput data exchange.
Due to its distributed and low-overhead implementation and its fairness in channel access, a
token passing based Time Division Multiple Access (T-MAC) scheme is used in many intra-chip
architectures [29][32][33]. The MAC should also manage the sharing of the wireless
communication medium depending on traffic variation in the intra and inter-chip communication
to maximize performance. In a token-based MAC, a single WI possessing the token gains access
to the wireless medium to transmit for a certain number of time slots. However, in the multichip
environment, the traffic demand through the switches varies both temporally and spatially
depending on the application [102][103]. Therefore, the MAC for such a multichip system should
be able to allocate transmission slots dynamically to WIs in response to sudden and large variations
in traffic. Hence, energy-efficient dynamic MAC mechanisms that can predict the traffic demand
of the WIs and respond accordingly by adjusting the transmission slots of the WIs need to be
developed.
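One possible shape of such a mechanism is sketched below. It is a hypothetical illustration, not a mechanism evaluated in this dissertation: the exponential moving average predictor, the guaranteed minimum of one slot per WI, and the largest-remainder rounding are all assumptions made for the example.

```python
def predict_demand(previous_estimate, observed, alpha=0.5):
    """Exponential moving average of a WI's traffic demand (assumed predictor)."""
    return alpha * observed + (1.0 - alpha) * previous_estimate


def allocate_slots(demands, total_slots, min_slots=1):
    """Split a frame of `total_slots` among WIs in proportion to predicted demand.

    Every WI keeps `min_slots` so that idle WIs can still send sporadic,
    latency-sensitive control traffic. The remaining slots are shared
    proportionally, using largest-remainder rounding.
    """
    n = len(demands)
    if total_slots < n * min_slots:
        raise ValueError("frame too short for the guaranteed minimum")
    alloc = [min_slots] * n
    spare = total_slots - n * min_slots
    total_demand = sum(demands)
    if total_demand == 0:
        # No predicted traffic: spread the spare slots evenly.
        for i in range(spare):
            alloc[i % n] += 1
        return alloc
    shares = [d * spare / total_demand for d in demands]
    floors = [int(s) for s in shares]
    for i in range(n):
        alloc[i] += floors[i]
    leftover = spare - sum(floors)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    print(allocate_slots([8, 0, 2, 2], total_slots=16))
```

In this example, the WI with the highest predicted demand receives most of the frame, while the idle WI still keeps one slot, so fairness is not given up entirely in exchange for responsiveness.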
5.2.3 A Wireless Interconnection Framework for Multichip System with In-Package
Memory
Multichip computing modules with several chips integrated with memory banks can be found in a
wide range of platform based designs, from servers to embedded systems. These chips can be
processing chips such as multicore chips, CPUs, or GPUs, or a heterogeneous mix of such chips
(for example, AMD’s Fusion Accelerated Processing Units (APUs)), depending upon the desired
functionality. Due to the scaling up of the number of individual computing nodes by several orders of
magnitude in these systems, the interconnection between them has become increasingly complex.
Moreover, to satisfy the memory bandwidth demands, integration of in-package memory has
become a norm in these systems. Integrating memory within a single package can be done either
by placing memory, which itself may be vertically stacked, on top of a multicore die, i.e.,
monolithic 3D integration [9], or by placing them side-by-side on the same substrate or interposer, i.e.,
2.5D integration [8]. However, in the 3D stacked approach, the amount of memory that can be
integrated into the package is limited by the size of the multicore die (increasing the die size generally
reduces yield, and hence, increases manufacturing cost). In addition, the multicore processing
chips need to be thinned to accommodate TSVs through them, which can induce die-cracking and
structural yield issues. Moreover, as the integration of the memory will essentially block the path
of heat flow of the multicore die, the average die temperature of such a system can become
prohibitively high [104]. As a result, this approach requires sophisticated thermal management
techniques. On the other hand, in horizontal or 2.5D integration, the amount of memory that can
be integrated is not bound by the size of the multicore die, but is rather limited by the size of the
substrate board or interposer. As a result, it can provide more memory capacity. Moreover, this integration
technique allows disintegration of a large multicore processing chip into several smaller
processing chips. Consequently, for the same computational capabilities (i.e., the same number of cores
and memory sizes), this disintegration will lower the total manufacturing cost, considering the fact
that a smaller die size will eventually result in higher yield and better packing of the rectangular dies
on a circular wafer [8]. It also enables easy integration of heterogeneous chips and technologies
on the same platform. All these benefits over 3D stacking make 2.5D integration a nearer term
solution for a multichip system with in-package memory integration.
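The yield and wafer-packing argument can be illustrated with a small calculation. The sketch below assumes a simple Poisson defect model and a common dies-per-wafer approximation. The wafer diameter, die areas, and defect density are illustrative values rather than data from this work, and the costs of assembly and of the interposer are ignored.

```python
import math


def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation: wafer area over die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(gross - edge_loss)


def poisson_yield(die_area_mm2, defects_per_mm2):
    """Poisson yield model: probability that a die contains no defect."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


def good_systems_per_wafer(wafer_diameter_mm, die_area_mm2, dies_per_system, defects_per_mm2):
    """Expected number of complete systems obtainable from the good dies of one wafer."""
    good_dies = dies_per_wafer(wafer_diameter_mm, die_area_mm2) * poisson_yield(
        die_area_mm2, defects_per_mm2
    )
    return good_dies / dies_per_system


if __name__ == "__main__":
    # Assumed values: 300 mm wafer, 0.002 defects/mm^2,
    # one 400 mm^2 monolithic die versus four 100 mm^2 dies per system.
    monolithic = good_systems_per_wafer(300, 400, 1, 0.002)
    disintegrated = good_systems_per_wafer(300, 100, 4, 0.002)
    print(f"monolithic: {monolithic:.1f}, disintegrated: {disintegrated:.1f}")
```

Under these assumptions, one wafer yields about twice as many complete systems when the large die is split into four smaller ones. The gain comes both from the higher per-die yield and from less wasted area at the wafer edge.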
Recent trends according to the ITRS (http://www.itrs2.net/) predict that the pitch of the
I/O interconnects in ICs is not scaling as fast as the gate lengths or pitch of on-chip interconnects.
This implies a gap in density and performance of traditional I/O systems relative to on-chip
interconnections. The wiring complexity of both on-chip and off-chip interconnects exacerbates
the problem by posing design challenges, crosstalk, and signal integrity issues. Moreover, in the
case of disintegrated processing chips, cores that were previously on the same chip are now on
different processing chips. Therefore, inter-chip communication becomes critical and a potential
bottleneck. These factors reduce the efficiency in terms of energy consumption as well as latency
and bandwidth of the data transfer between communicating components such as processing cores
and memory blocks in a multichip system. Therefore, we need an energy-efficient, seamless,
scalable interconnection network that spans distances from a few millimeters (single chip) to
several centimeters (in a multichip environment). Integrated inter and intra-chip photonic
interconnections are a promising solution to the off-chip interconnection challenges of traditional
I/O. However, the pitch of photonic interconnects does not scale well due to the limitations in size
of silicon photonic devices. Moreover, this technology is challenging to integrate with standard
CMOS processes typically requiring a separate photonic device layer with large footprints on the
chip. Research in recent years has demonstrated that on-chip and off-chip wireless interconnects
are capable of establishing radio communications within as well as between multiple chips. Using
such on-chip antennas embedded in the chip, Wireless Network-on-Chip (NoC) architectures are
shown to improve energy efficiency and bandwidth of on-chip data communication in multicore
chips. As a future direction, we propose to use such wireless interconnects to establish a seamless
communication backbone which enables data exchange between chips in a multichip system with
in-package memory. The same communication protocols used for on-chip data transfer in the
intra-chip NoC will be used for off-chip data as well, eliminating the need for protocol transfer. Wireless
transceivers will be deployed inside each chip and memory stack, which will be capable of
establishing direct one-hop communication with other such transceivers in the system. Hence, the
benefits of the design methodologies outlined in this dissertation can be further exploited for such
multichip systems with several multicore processing chips and memory stacks.
5.2.4 60 GHz mm-wave Wireless Interconnects to Enable Contactless Testing
Aggregating multiple smaller chips can overcome the physical limitations in the area, power
density, and yield of a single chip multiprocessor system. However, a major concern in multichip
integration is the quality of the arriving dies and wafers before stacking. The yield and,
consequently, the cost benefits of the multichip system largely depend on the availability of
known good dies. This is because the overall manufacturing yield of the multichip system is a
function of the yields of the bare dies being stacked. The state-of-the-art chip manufacturing process
consists of numerous fabrication steps, and defects can be induced in any of these steps. The life cycle
of a chip starts after getting the masks of the circuit design. These masks are then used to fabricate
the circuits on a silicon wafer. After fabrication, the circuits first go through wafer testing
to check for functional defects by applying special test vectors. After wafer testing, the dies are
separated from the wafer using laser dicing and packaged individually. These packaged dies
then undergo an assembly or packaging test to check for package-induced defects. Finally, the
packaged dies that pass the packaging test are assembled onto a substrate (preferably a
PCB) to make the final product, and product-level testing is done to check the
interconnections and the assembly process. In each stage of testing, the faulty parts are marked
to be thrown away. As a result, it is very important to catch these defects at the early stages of the life
cycle of a chip to save cost and ensure quality. Detecting a defect late in the life cycle not only
increases cost significantly but also harms the overall reputation of the company, especially if
the consumer notices it after delivery. Hence, chip manufacturers aim to reach the
highest possible coverage in the wafer test in order to minimize the yield losses at the later stages.
However, wafer testing in multichip integration needs to overcome several serious difficulties.
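The dependence of the system yield on the individual die yields can be made concrete with a simple product model. This is a minimal sketch assuming independent defects; the yield values are illustrative and are not measured data.

```python
import math


def system_yield(die_yields, assembly_yield=1.0):
    """Yield of a multichip system assembled from independently defective dies.

    The system works only if every die works, so the die yields multiply.
    `assembly_yield` is an assumed extra factor for the integration step itself.
    """
    return math.prod(die_yields) * assembly_yield


if __name__ == "__main__":
    # Four dies assembled without effective wafer test (assumed 90% good each).
    untested = system_yield([0.90] * 4)
    # Four known good dies, where wafer test leaves an assumed 1% of faulty dies undetected.
    known_good = system_yield([0.99] * 4)
    print(f"untested dies: {untested:.4f}, known good dies: {known_good:.4f}")
```

In this example, four dies with 90% yield each give a system yield of only about 66%, whereas screening the dies to 99% confidence before integration raises it to about 96%. This is why high coverage at the wafer test stage matters more as the number of integrated dies grows.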
In wafer testing, probing dies typically requires a contact to be made between the
Automatic Test Equipment (ATE) and the die, as the ATE must be connected to the primary
inputs/outputs of the Device-Under-Test (DUT) during the test. These connections are made through
a set of mechanical probes or probe cards. However, the frequent physical contacts between the
probe card and the wafer-under-test have several shortcomings. The probe needles and contact
points can suffer deformation, and debris accumulating on the probe card can also affect the test outcome.
Abrasive cleaning can be used to remove the debris. However, this can damage the probe needles
[105][106]. Moreover, stress applied by these probe needles can cause die cracks. In addition, the
reduction in feature sizes and the increasing demand for parallel multisite testing limit the benefits
of the probe cards [107]. Several wireless or contactless test methods have been proposed in the recent
literature to overcome the limitations of wafer testing [105][108][109]. However, these methods
utilize the near-field communication of the antennas and hence need to be aligned precisely due
to the proximity requirements. Mm-wave wireless interconnects realized by metallic zigzag
antennas fabricated using top layer metals are CMOS process compatible, making them suitable
for wafer testing. Moreover, it has been noted in many earlier works that the mm-wave wireless
antennas are not directional and hence can be used for broadcast type transmission over the shared
wireless channel. This property gives an additional advantage, as wireless interconnects can
provide a broadcast-capable medium to distribute any kind of test content quickly and efficiently.
In our proposed multichip integration methodology using wireless interconnects, the intra-chip
interconnection is a hybrid one, consisting of both wired and wireless links. For wafer testing, we
will equip the probe card with multiple wireless transceivers operating at a 60 GHz carrier
frequency. These wireless transceivers will be used to send the test vectors to the chips and to collect
the test outcomes. As we are envisioning a shared broadcast medium, the probe card can send
the test vectors to multiple dies at the same time, as all the transceivers will be tuned to work on the
same frequency, which can reduce the testing time significantly.
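
The potential saving in test time from broadcasting can be estimated with a first-order model. The sketch below is purely illustrative. It assumes that test vectors are broadcast once over the shared channel, that responses are collected from one die at a time, and that computation time on the dies is negligible; the vector size, response size, and link rate are hypothetical values.

```python
def sequential_test_time(num_dies, vector_bits, response_bits, rate_bps):
    """Each die receives its own copy of the vectors and then returns its response."""
    return num_dies * (vector_bits + response_bits) / rate_bps


def broadcast_test_time(num_dies, vector_bits, response_bits, rate_bps):
    """Vectors are sent once to all dies; responses still share the channel in turn."""
    return (vector_bits + num_dies * response_bits) / rate_bps


if __name__ == "__main__":
    # Assumed values: 16 dies, 1 Mbit of test vectors, 10 kbit response per die, 10 Gbps link.
    args = (16, 1_000_000, 10_000, 10_000_000_000)
    t_seq = sequential_test_time(*args)
    t_bc = broadcast_test_time(*args)
    print(f"sequential: {t_seq * 1e6:.1f} us, broadcast: {t_bc * 1e6:.1f} us")
```

In this model, the benefit of broadcasting grows with the number of dies tested in parallel and with the ratio of test vector volume to response volume, since only the vector distribution is shared.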

APPENDIX A
PUBLICATIONS
Following is a list of publications published in reputed journals and conferences during this
research.

Journals:
J1. M. S. Shamim, N. Mansoor, R. S. Narde, V. Kothandapani, A. Ganguly and J.
Venkataraman, "A Wireless Interconnection Framework for Seamless Inter and
Intra-Chip Communication in Multichip Systems," In IEEE Transactions on
Computers, vol. 66, no. 3, pp. 389-402, March 1 2017.
J2. H. Mondal, S. Gade, M. Shamim, S. Deb, and A. Ganguly, "Interference-Aware
Wireless Network-on-Chip Architecture using Directional Antennas," In IEEE
Transactions on Multi-Scale Computing Systems, vol. PP, no. 99, pp.1-1.
J3. N. Mansoor, A. Vashist, M. Ahmed, M. S. Shamim, S. Mamun, and A. Ganguly.
“A Traffic-Aware Medium Access Control Mechanism for Energy-Efficient
Wireless Network-on-Chip Architectures." In IEEE Transactions on Computers.
(Under review).
J4. M. S. Shamim, R. S. Narde, J. Hernandez, A. Ganguly, J. Venkataraman, S. G.
Kandlikar. “Evaluation of Wireless Network-on-Chip Architectures with
Microchannel-Based Cooling in 3D Multicore Chips." In IEEE Transactions on
Computers. (Under review).

Conferences:
C1. M. S. Shamim, J. Muralidharan, and A. Ganguly. “An Interconnection
Architecture for Seamless Inter and Intra-Chip Communication Using Wireless
Links.” In Proceedings of the 9th International Symposium on Networks-on-Chip
(NOCS '15), Article 2, 8 pages, 2015.
C2. M. S. Shamim, N. Mansoor, A. Samaiyar, A. Ganguly, S. Deb, and S.S. Ram.
“Energy-efficient wireless network-on-chip architecture with log-periodic on-chip
antennas.” In Proceedings of the 24th edition of the great lakes symposium on VLSI
(GLSVLSI), 2014.
C3. M. S. Shamim, A. Mhatre, N. Mansoor, A. Ganguly and G. Tsouri,
"Temperature-aware wireless network-on-chip architecture," In Proceedings of International
Green Computing Conference, pp. 1-10, Dallas, TX, 2014.
C4. M. S. Shamim, A. Ganguly, C. Munuswamy, J. Venkataraman, J. Hernandez, S. G.
Kandlikar. “Co-design of 3D wireless network-on-chip architectures with
microchannel-based cooling" In Proceedings of the Sixth International Green
Computing and Sustainable Computing Conference (IGSC), pp. 1-6, 2015, Las
Vegas, NV, 2015.
C5. N. Mansoor, M. S. Shamim, and A. Ganguly, "A demand-aware predictive
dynamic bandwidth allocation mechanism for wireless network-on-chip," In
Proceedings of the ACM/IEEE International Workshop on System Level
Interconnect Prediction (SLIP), pp. 1-8, Austin, TX, 2016.

Work-in-Progress (WiP)/Poster Presentation:
P1. M. S. Shamim, N. Mansoor, M. Ahmed, M. Dhull, A. Ganguly, “An Energy-Efficient
Integration Methodology for Multichip Systems: A Low-Latency
Wireless Interconnection Approach," Accepted for poster presentation as Work-in-Progress
(WiP) at Design and Automation Conference (DAC), Austin, TX, 2017.
P2. M. S. Shamim, N. Mansoor, A. Ganguly, “Energy-Efficient Wireless
Interconnection Framework for Multichip Systems with In-package Memory
Stacks," Accepted for poster presentation at Golisano College of Computing &
Information Science Research Showcase, Rochester, NY, 2017.

BIBLIOGRAPHY
[1] M. J. Flynn, "Very high-speed computing systems," In Proceedings of the IEEE, vol.54,
no.12, pp. 1901-1909, Dec. 1966.
[2] L. Benini, and G. De Micheli, "Networks on chips: a new SoC paradigm," In Computer,
vol.35, no.1, pp. 70-78, Jan. 2002.
[3] D. Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," In
Proceedings of the IEEE Micro, vol.27, no.5, pp. 15-31, Sep./Oct. 2007.
[4] S.R. Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS," In IEEE
Journal of Solid-State Circuits, vol.43, no.1, pp.29-41, Jan. 2008.
[5] J. Howard et al., "A 48-Core IA-32 message-passing Processor with DVFS in 45nm
CMOS," In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
San Francisco, CA, pp. 108-109, 2010.
[6] International Technology Roadmap for Semiconductors (ITRS), online:
http://www.itrs2.net/itrs-reports.html

[7] S. Beamer et al., “Designing multi-socket systems using silicon photonics.” In Proceedings
of the International Conference on Supercomputing (ICS), pages 521- 522, 2009.
[8] A. Kannan, N. E. Jerger, and G. H. Loh. “Enabling interposer-based disintegration of multicore processors.” In Proceedings of the 48th International Symposium on
Microarchitecture (MICRO-48), Pages 546-558, 2015.
[9] A.W. Topol et al., "Three-dimensional integrated circuits," In IBM Journal of Research and
Development, vol.50, no.4.5, pp.491-506, July 2006.
[10] K. C. Yong; W. C. Song; B. E. Cheah; M. F. Ain, "Signaling analysis of inter-chip I/O package
routing for Multi-Chip Package," In Proceedings of the 4th Asia Symposium on Quality
Electronic Design (ASQED), vol., no., pp.243-248, July 2012.
[11] IBM Power Systems for High-Performance Computing System, URL: http://www03.ibm.com/systems/power
[12] R. Hendry; D. Nikolova; S. Rumley; K. Bergman, "Modeling and Evaluation of Chip-to-Chip
Scale Silicon Photonic Networks," In Proceedings of the 22nd Annual Symposium on IEEE
High-Performance Interconnects (HOTI), vol., no., pp.1-8, 26-28 Aug. 2014
[13] X. Wu et al., "UNION: A Unified Inter/Intrachip Optical Network for Chip Multiprocessors,"
In Proceedings of the IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol.22, no.5, pp. 1082-1095, May 2014.

[14] Y. Pan et al., "Firefly: Illuminating future network-on-chip with nanophotonics.”
In Proceedings of the 36th annual international symposium on Computer
architecture (ISCA), 2009.
[15] S. Gunther, F. Binns, D. M. Carmean, J. C. Hall. “Managing the Impact of Increasing
Microprocessor Power Consumption”. In Intel Technology Journal, vol.5, no.1, 2001.
[16] H. Mizunuma, Y.-C. Lu, and C.-L. Yang. “Thermal Modeling and Analysis for 3-D ICs
With Integrated Microchannel Cooling.” In IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol.30, no.9, pp. 1293 – 1306, 2011.
[17] D. B. Tuckerman, and R. F. W. Pease. “High-performance heat sinking for VLSI.” In IEEE
Electron Device Letters, vol. 2, no. 5, pp. 126-129, 1981.
[18] K. Puttaswamy and G. H. Loh. “Thermal analysis of a 3D die-stacked high-performance
microprocessor.” In Proceedings of the 16th ACM Great Lakes Symposium on VLSI, pp. 19-24, 2006.
[19] B. Dang, M. S. Bakir, D. C. Sekar, C. R. Jr. King, and J. D. Meindl, “Integrated Microfluidic
Cooling and Interconnects for 2D and 3D Chips” In IEEE Transactions on Advanced
Packaging, vol.33, no.1, pp. 79–87, 2010.
[20] S. G. Kandlikar. “Review and Projections of Integrated Cooling Systems for Three-Dimensional Integrated Circuits.” In Journal of Electronic Packaging, 2014.
[21] S. Ndao, Y. Peles, and M. K. Jensen, “Multi-objective thermal design optimization and
comparative analysis of electronics cooling technologies,” In International Journal of Heat
and Mass Transfer, vol.52, pp. 4317–4326, Sep. 2009.
[22] D. Liu and S. V. Garimella, “Analysis and optimization of the thermal performance of
microchannel heat sinks,” In International Journal of Numerical Methods for Heat & Fluid
Flow, vol.15, no.1, pp. 7–26, 2005.
[23] Young-Joon Lee et al., "Co-design of signal, power, and thermal distribution networks for 3D
ICs," In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition,
pp. 610-615, 2009.
[24] H. Matsutani et al., "Low-latency wireless 3D NoCs via randomized shortcut chips," In
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition
(DATE), pp. 1-6, 24-28 March 2014.
[25] N. Miura, D. Mizoguchi, T. Sakurai and T. Kuroda, "Analysis and design of inductive
coupling and transceiver circuit for inductive inter-chip wireless superconnect," in IEEE
Journal of Solid-State Circuits, vol.40, no.4, pp. 829-837, April 2005.
[26] J. Ouyang, J. Xie, M. Poremba, and Y. Xie, "Evaluation of using inductive/capacitive-coupling vertical interconnects in 3D network-on-chip," In Proceedings of the IEEE/ACM
International Conference on Computer-Aided Design (ICCAD), pp. 477-482, 2010.

[27] J. J. Lin et al., "Communication Using Antennas Fabricated in Silicon Integrated Circuits,"
in IEEE Journal of Solid-State Circuits, vol.42, no.8, pp. 1678-1687, Aug. 2007.
[28] K. Chang et al, “Performance evaluation and design trade-offs for wireless network-on-chip
architectures,” In ACM Journal on Emerging Technologies in Computing Systems (JETC),
vol.8, no.3, pp. 23:1–23:25, Aug. 2012.
[29] S. Deb, A. Ganguly, P. P. Pande, B. Belzer, and D. Heo, “Wireless NoC as Interconnection
Backbone for Multicore Chips: Promises and Challenges,” In IEEE Journal on Emerging and
Selected Topics in Circuits and Systems, vol.2, no.2, pp. 228–239, 2012.
[30] A. Ganguly et al., “Scalable Hybrid Wireless Network-on-Chip Architectures for Multicore
Systems,” In IEEE Transactions on Computers, vol.60, no.10, pp. 1485–1502, 2011.
[31] S. Abadal, E. Alarcón, A. Cabellos-Aparicio, M. Lemme, M. Nemirovsky, "Graphene-enabled wireless communication for massive multicore architectures," In IEEE
Communications Magazine, vol.51, no.11, pp. 137-143, Nov. 2013.
[32] D. Zhao and Y. Wang, “SD-MAC: Design and synthesis of a hardware-efficient collision-free
QoS-aware MAC protocol for wireless network-on-chip,” In IEEE Transactions on Computers,
vol.57, no.9, pp. 1230–1245, 2008.
[33] D. DiTomaso, A. Kodi, S. Kaya, and D. Matolak, “iWISE: Inter-router Wireless Scalable
Express Channels for Network-on-Chips (NoCs) Architecture,” in Proceedings of the 19th
IEEE Annual Symposium on High-Performance Interconnects (HOTI), pp. 11–18, 2011.
[34] V. Vijayakumaran et al., “CDMA Enabled Wireless Network-on-Chip.” In ACM Journal on
Emerging Technologies in Computing Systems (JETC). vol.10, no. 4, June 2014.
[35] H. H. Yeh and K. L. Melde, "60 GHz multi-antenna design in Multi-Core system," In
Proceedings of the IEEE Antennas and Propagation Society International Symposium
(APSURSI), pp. 1-2, 2012.
[36] J. Duato, S. Yalamanchili, and L. Ni, “Interconnection Networks: An Engineering
Approach,” Morgan Kaufmann, 2002.
[37] U.Y. Ogras and R. Marculescu, "'It's a small world after all': NoC performance optimization
via long-range link insertion," In IEEE Transactions on Very Large Scale Integration (VLSI)
Systems. vol.14, no.7, pp. 693-706, July 2006.
[38] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: towards the ideal
interconnection fabric," in Proceedings of the 34th annual international symposium on
Computer architecture (ISCA). ACM, New York, NY, USA, 150-161, 2007.
[39] D. Vantrease et al., “Corona: System Implications of Emerging Nanophotonic Technology.”
In Proceedings of the 35th Annual International Symposium on Computer
Architecture (ISCA). IEEE Computer Society, Washington, DC, USA, 153-164, 2008.

[40] M. F. Chang et al., "CMP network-on-chip overlaid with multi-band RF-interconnect," In
Proceedings of the IEEE International Symposium on High-Performance Computer
Architecture, Salt Lake City, UT, pp. 191-202, 2008.
[41] J. E. Jaussi, M. Leddige, B. Horine, F. O'Mahony and B. Casper, "Multi-Gbit I/O and
interconnect co-design for power efficient links," In Proceedings of the IEEE 19th Conference
on Electrical Performance of Electronic Packaging and Systems (EPEPS), pp. 1-4, 2010.
[42] M. S. Shamim, J. Muralidharan, and A. Ganguly. “An Interconnection Architecture for
Seamless Inter and Intra-Chip Communication Using Wireless Links.” In Proceedings of the
9th International Symposium on Networks-on-Chip (NOCS '15). ACM, New York, NY, USA,
Article 2, 8 pages, 2015.
[43] S. Laha, S. Kaya, D. W. Matolak, W. Rayess, D. DiTomaso and A. Kodi, "A New Frontier in
Ultralow Power Wireless Links: Network-on-Chip and Chip-to-Chip Interconnects," in IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.34, no.2, pp.
186-198, Feb. 2015.
[44] U. Chandran and D. Zhao, "Cost-optimal design of wireless pre-bonding test framework," In
Proceedings of the 27th IEEE International System-on-Chip Conference (SOCC), pp. 324-329, 2014.
[45] L. Biswal, S. Chakraborty and S. K. Som, "Design and Optimization of Single-Phase Liquid
Cooled Microchannel Heat Sink," in IEEE Transactions on Components and Packaging
Technologies, vol.32, no.4, pp. 876-886, Dec. 2009.
[46] M. M. Sabry, A. Sridhar, J. Meng, A. K. Coskun and D. Atienza, "GreenCool: An Energy-Efficient Liquid Cooling Design Technique for 3-D MPSoCs Via Channel Width
Modulation," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol.32, no.4, pp. 524-537, April 2013.
[47] M. M. Sabry, A. K. Coskun, D. Atienza, T. Š. Rosing and T. Brunschwiler, "Energy-Efficient
Multi-objective Thermal Control for Liquid-Cooled 3-D Stacked Architectures," in IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.30, no.12,
pp. 1883-1896, Dec. 2011.
[48] H. B. Jang, I. Yoon, C. H. Kim, S. Shin and S. W. Chung, "The impact of liquid cooling on
3D multi-core processors," In Proceedings of the IEEE International Conference on
Computer Design, pp. 472-478, 2009.
[49] B. Shi, and A. Srivastava. “TSV-constrained micro-channel infrastructure design for
cooling stacked 3D-ICs”. In Proceedings of the ACM International Symposium on
Physical Design, pp. 113-118, 2012.
[50] W. R. Davis et al., "Demystifying 3D ICs: the pros and cons of going vertical," in IEEE
Design & Test of Computers, vol.22, no.6, pp. 498-510, Nov.-Dec. 2005.
[51] A. More, and B. Taskin. “Wireless interconnects for inter-tier communication on 3D ICs.”
In Proceedings of the European Microwave Conference (EuMC), pp. 105-108, 2010.
[52] S. B. Lee et al., “A scalable micro wireless interconnect structure for CMPs,” In Proceedings
of ACM Annual International Conference on Mobile Computing and Networking, pp. 20-25,
2009.
[53] K. Duraisamy, R. G. Kim, P. P. Pande, "Enhancing performance of wireless NoCs with
distributed MAC protocols," In Proceedings of the 16th International Symposium on Quality
Electronic Design (ISQED), pp. 406-411, 2-4 Mar. 2015.
[54] Piro et al., "Initial MAC Exploration for Graphene-enabled Wireless Networks-on-Chip,"
In Proceedings of the ACM First Annual International Conference on Nanoscale Computing
and Communication (NANOCOM), ACM, New York, NY, USA, 2014.
[55] J. Chen; Y. H. Lai, "A study of CSMA-based and token-based wireless interconnects network-on-chip," In Proceedings of the IEEE International Conference on Communication Problem-Solving (ICCP), pp. 205-209, 5-7 Dec. 2014.
[56] N. Mansoor, P. J. S. Iruthayaraj and A. Ganguly, "Design Methodology for a Robust and
Energy-Efficient Millimeter-Wave Wireless Network-on-Chip," in IEEE Transactions on
Multi-Scale Computing Systems, vol.1, no.1, pp. 33-45, Jan.-Mar. 2015.
[57] A. Ganguly, P. Wettin, K. Chang, and P. Pande, “Complex network inspired fault-tolerant
NoC architectures with wireless links,” in Proceedings of the Fifth IEEE/ACM International
Symposium on Networks on Chip (NoCS), pp. 169–176, 2011.
[58] M. Yuan, W. Fu, T. Chen, and M. Wu, “An exploration on quantity and layout of wireless
nodes for hybrid wireless network-on-chip.” In Proceedings of the IEEE Conference on
High-Performance Computing and Communications, pp. 100–107, Aug 2014.
[59] T. Petermann and P. De Los Rios, “Spatial small-world networks: a wiring cost perspective,”
2005. arXiv:cond-mat/0501420v2.
[60] M.S. Shamim, N. Mansoor, A. Samaiyar, A. Ganguly, S. Deb, and S.S. Ram. “Energy-efficient wireless network-on-chip architecture with log-periodic on-chip antennas.” In
Proceedings of the 24th Great Lakes Symposium on VLSI (GLSVLSI '14), 2014.
[61] H. Mondal; S. Gade; M. Shamim; S. Deb; A. Ganguly, "Interference-Aware Wireless
Network-on-Chip Architecture using Directional Antennas," in IEEE Transactions on Multi-Scale Computing Systems, vol.PP, no.99, pp.1-1, 2016.
[62] T. Kagami, H. Matsutani, M. Koibuchi, Y. Take, T. Kuroda and H. Amano, "Efficient 3-D
Bus Architectures for Inductive-Coupling ThruChip Interfaces," In IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol.24, no.2, pp. 493-506, Feb. 2016.
[63] N. Mansoor and A. Ganguly, “Reconfigurable Wireless Network-on-Chip with a Dynamic
Medium Access Mechanism,” In Proceedings of the 9th International Symposium on
Networks-on-Chip (NOCS), 2015.
[64] N. Mansoor, M. S. Shamim, and A. Ganguly, "A demand-aware predictive dynamic
bandwidth allocation mechanism for wireless network-on-chip," In Proceedings of the
ACM/IEEE International Workshop on System Level Interconnect Prediction (SLIP), pp. 1-8, Austin, TX, 2016.
[65] H. K. Mondal, and S. Deb, “An energy-efficient wireless network-on-chip using power-gated
transceivers,” In Proceedings of the 27th IEEE International System-on-Chip Conference
(SOCC), pp. 243–248, Sept 2014.
[66] H. K. Mondal and S. Deb, “Energy efficient on-chip wireless interconnects with sleepy
transceivers,” In Proceedings of the 8th International Design and Test Symposium (IDT),
pp. 1–6, Dec. 2013.
[67] G. Balamurugan et al., "A Scalable 5–15 Gbps, 14–75 mW Low-Power I/O Transceiver in 65
nm CMOS," In IEEE Journal of Solid-State Circuits, vol.43, no.4, pp. 1010-1019, Apr. 2008.
[68] Chip MultiProjects. Retrieved Nov. 2015, from http://cmp.imag.fr
[69] P.P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance evaluation and design
trade-offs for network-on-chip interconnect architectures," In IEEE Transactions on
Computers, vol.54, no.8, pp.1025-1040, Aug. 2005.
[70] FR4 EPOXY URL: http://www.sunstone.com/pcb-capabilities/pcb-manufacturingcapabilities/pcb-materials/fr-4-material
[71] Lopez, Aida L. Vera et al. "Novel low loss thin film materials for wireless 60 GHz
application." In Proceedings of the 60th IEEE Electronic Components and Technology
Conference (ECTC), 2010.
[72] M. Schwartz, W. R. Bennett, and S. Stein, “Communication Systems and Techniques.” IEEE
Press., 1996.
[73] HFSS URL:
http://www.ansys.com/Products/Simulation+Technology/Electronics/Signal+Integrity/ANSYS+HFSS
[74] Personal Communication with Rounak, Dr. Jayanti, Jose-Luis, and Dr. Kandlikar.
[75] S. Deb, A. Ganguly, K. Chang, P. Pande, B. Belzer and D. Heo, "Enhancing performance of
network-on-chip architectures with millimeter-wave wireless interconnects," In Proceedings
of the 21st IEEE International Conference on Application-specific Systems Architectures
and Processors (ASAP), pp. 73-80, 2010.
[76] K. C. Yong; W. C. Song; B. E. Cheah; M. F. Ain, "Signaling analysis of inter-chip I/O
package routing for Multi-Chip Package," in Proceedings of the 4th Asia Symposium on
Quality Electronic Design (ASQED), pp. 243-248, 10-11 July 2012.
[77] P.J. Koopman and B.P. Upender, “Time Division Multiple Access without a Bus Master,”
In United Technologies Research Center, US, Technical Report RR-9500470, 1995.

[78] P. Wettin et al., "Performance evaluation of wireless NoCs in presence of irregular network
routing strategies," In Proceedings of the Design, Automation, and Test in Europe
Conference and Exhibition (DATE), pp. 1-6, 24-28 Mar. 2014.
[79] M. S. Shamim, A. Mhatre, N. Mansoor, A. Ganguly and G. Tsouri, "Temperature-aware
wireless network-on-chip architecture," In Proceedings of the International Green
Computing Conference, pp. 1-10. Dallas, TX, 2014.
[80] J. Lee, C. Nicopoulos, S. J. Park, M. Swaminathan and J. Kim, "Do we need wide flits in
Networks-on-Chip?," In Proceedings of the IEEE Computer Society Annual Symposium on
VLSI (ISVLSI), pp. 2-7, 2013.
[81] V. Soteriou, Hangsheng Wang, and L. Peh, "A Statistical Traffic Model for On-Chip
Interconnection Networks," In Proceedings of the 14th IEEE International Symposium on
Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 104-116, 2006.
[82] K. Kempa et al., "Carbon Nanotubes as Optical Antennae," In Advanced Materials, vol. 19,
pp. 421-426, 2007.
[83] P. Grani, S. Bartolini, E. Furdiani, L. Ramini and D. Bertozzi, "Integrated cross-layer
solutions for enabling silicon photonics into future chip multiprocessors," In Proceedings of
the 19th International Mixed-Signals, Sensors, and Systems Test Workshop (IMS3TW), pp.
1-8, 2014.
[84] P. Koka et al., “Silicon-photonic network architectures for scalable, power-efficient multi-chip systems.” In Proceedings of the 37th annual international symposium on Computer
architecture (ISCA). ACM, New York, NY, USA, 117-128, 2010.
[85] P. Dong et al., “Wavelength-tunable silicon microring modulator.” Optics Express, vol.18,
no.11, pp.10941-10946, 2010.
[86] A. Mineo, M. S. Rusli, M. Palesi, G. Ascia, V. Catania, and M. N. Marsono, “A closed-loop
transmitting power self-calibration scheme for energy efficient WiNoc architectures,” In
Proceedings of the Design, Automation Test in Europe Conference (DATE), Mar. 2015.
[87] M. S. Shamim, N. Mansoor, R. S. Narde, V. Kothandapani, A. Ganguly and J.
Venkataraman, "A Wireless Interconnection Framework for Seamless Inter and Intra-Chip
Communication in Multichip Systems," In IEEE Transactions on Computers, vol. 66, no. 3,
pp. 389-402, March 1, 2017.
[88] Fluent 14.5 URL:
http://www.ansys.com/Products/Simulation+Technology/Fluid+Dynamics/Fluid+Dynamics+Products/ANSYS+Fluent
[89] H.Y. Zhang et al., “Single-phase liquid cooled microchannel heat sink for electronic
packages,” In Applied Thermal Engineering, vol. 25, no. 10, pp. 1472-1487, July 2005.

[90] D. Lorenzini-Gutierrez, S.G. Kandlikar, “Variable Fin density flow channels for effective
cooling and mitigation of temperature non-uniformity in three-dimensional integrated
circuits.” In Journal of Electronic Packaging, vol. 136, no. 02, 2014.
[91] H. G. Schantz, "Near field propagation law & a novel fundamental limit to antenna gain
versus size," In Proceedings of the IEEE Antennas and Propagation Society International
Symposium, pp. 237-240, 2005.
[92] W. Huang, M. R. Stan, S. Gurumurthi, R. J. Ribando, and K. Skadron. “Interaction of
scaling trends in processor architecture and cooling.” In Proceedings of the 26th Annual
IEEE SEMI-THERM, pp. 198-204, 2010.
[93] J. Meng, K. Kawakami, and A. K. Coskun. "Optimizing energy efficiency of 3-D multicore
systems with stacked DRAM under power and thermal constraints." In Proceedings of the
49th ACM Annual Design Automation Conference, 2012.
[94] B.S. Feero, P.P. Pande. “Networks-on-Chip in a Three-Dimensional Environment: A
Performance Evaluation” In IEEE Transactions on Computers, vol.58, no.1, pp. 32-45, 2009.
[95] N. Binkert et al., “The GEM5 Simulator,” ACM SIGARCH Computer Architecture News,
vol.39, no.2, pp. 1-7, 2011.
[96] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, “The SPLASH-2 Programs:
Characterization and Methodological Considerations,” In Proceedings of the Annual
International Symposium on Computer Architecture, pp. 24-36, 1995.
[97] C. Bienia, “Benchmarking Modern Multiprocessors,” Ph.D. Dissertation, Princeton Univ.,
Princeton NJ, Jan. 2011.
[98] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, and D.M. Tullsen, “McPAT: an Integrated
Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,”
In Proceedings of the International Symposium on Microarchitecture, pp. 469-480,
2009.
[99] X. Yu, H. Rashtian, S. Mirabbasi, P. P. Pande and D. Heo, "An 18.7-Gb/s 60-GHz OOK
Demodulator in 65-nm CMOS for Wireless Network-on-Chip," in IEEE Transactions on
Circuits and Systems I: Regular Papers, vol. 62, no. 3, pp. 799-806, March 2015.
[100] Jri Lee, Yenlin Huang, Yentso Chen, Hsinchia Lu and Chiajung Chang, "A low-power fully
integrated 60GHz transceiver system with OOK modulation and on-board antenna
assembly," In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 316-317,317a, San Francisco, CA, 2009.
[101] W. Byeon, C. H. Yoon, and C. S. Park, “A 67-mW 10.7-Gb/s 60-GHz OOK CMOS
transceiver for short-range wireless communications,” In IEEE Transactions on Microwave
Theory and Techniques, vol. 61, no. 9, pp. 3391–3401, Sep. 2013.

[102] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, "A case for heterogeneous on-chip
interconnects for CMPs," In Proceedings of the 38th Annual International Symposium on
Computer Architecture (ISCA ‘11), San Jose, CA, pp. 389-399, 2011.
[103] M. Badr and N. E. Jerger, "SynFull: Synthetic traffic models capturing cache coherent
behaviour," In Proceedings of the ACM/IEEE 41st International Symposium on Computer
Architecture (ISCA ‘14), Minneapolis, MN, pp. 109-120, 2014.
[104] G.H. Loh, N.E. Jerger, A. Kannan, and Y. Eckert. “Interconnect-Memory Challenges for
Multi-chip, Silicon Interposer Systems.” In Proceedings of the ACM MEMSYS, pp. 3-10,
2015.
[105] E. J. Marinissen et al., "Contactless testing: Possibility or pipe-dream?," In Proceedings of
the Design, Automation & Test in Europe Conference & Exhibition, pp. 676-681, 2009.
[106] W. R. Mann, F. L. Taber, P. W. Seitzer and J. J. Broz, "The leading edge of production wafer
probe test technology," In Proceedings of the International Conference on Test, pp. 1168-1195, 2004.
[107] C. W. Wu, C. T. Huang, S.-Y. Huang, P.-C. Huang, T.-Y. Chang and Y.-T. Hsing, "The HOY
Tester-Can IC Testing Go Wireless?," In Proceedings of the International Symposium on
VLSI Design, Automation and Test, pp. 1-4, 2006.
[108] B. Moore et al., "Non-contact Testing for SoC and RCP (SIPs) at Advanced Nodes," In
Proceedings of the IEEE International Test Conference, pp. 1-10, Santa Clara, CA, 2008.
[109] B. Moore et al., "High throughput non-contact SiP testing," In Proceedings of the IEEE
International Test Conference, pp. 1-10, Santa Clara, CA, 2007.
[110] M. S. Shamim, A. Ganguly, C. Munuswamy, J. Venkataraman, J. Hernandez and S.
Kandlikar, "Co-design of 3D wireless network-on-chip architectures with microchannel-based cooling," In Proceedings of the Sixth International Green and Sustainable Computing
Conference (IGSC), pp. 1-6, Las Vegas, NV, 2015.


