Combined Dynamic Thermal Management Exploiting Broadcast-Capable Wireless Network-on-Chip Architecture by Vasudevan, Niraj
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
3-18-2016
Niraj Vasudevan
nv1440@rit.edu
Combined Dynamic Thermal Management Exploiting 
Broadcast-Capable Wireless Network-on-Chip Architecture 
by 
Niraj Vasudevan 
 
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of 
Master of Science in Computer Engineering 
Supervised by 
Dr. Amlan Ganguly 
Department of Computer Engineering 
Kate Gleason College of Engineering 
Rochester Institute of Technology 
Rochester, NY 
March 18, 2016 
 
 
Approved By: 
 
Dr. Amlan Ganguly 
Primary Advisor – R.I.T. Dept. of Computer Engineering 
 
Dr. Raymond Ptucha 
Secondary Advisor – R.I.T. Dept. of Computer Engineering 
 
Prof. Mark A. Indovina 
Secondary Advisor – R.I.T. Dept. of Electrical Engineering 
 
 
Dedication 
I would like to dedicate this thesis to my parents Mr. Vasudevan and Mrs. Sudha who 
have supported me from the beginning of this journey. I would also like to dedicate this 
to my mentor and all my friends who have been a great source of motivation and 
inspiration. 
 
Acknowledgements 
I take this opportunity to express my profound gratitude and deep regards to my primary 
advisor Dr. Amlan Ganguly for his exemplary guidance, monitoring and constant 
encouragement throughout this thesis. Dr. Ganguly dedicated his valuable time to review 
my work constantly and provide valuable suggestions which helped in overcoming many 
obstacles and keeping the work on the right track. I would also like to express my deepest 
gratitude to Dr. Raymond Ptucha and Prof. Mark A. Indovina for sharing their thoughts 
and suggesting valuable ideas which have had significant impact on this thesis. I am 
grateful for their valuable time and cooperation during the course of this work. I also take 
this opportunity to thank my research group members for all the constant support and 
help provided by them.  
 
Abstract 
With the continuous scaling of device dimensions, the number of cores on a single 
die is constantly increasing. This integration of hundreds of cores on a single die leads to 
high power dissipation and thermal issues in modern Integrated Circuits (ICs). This 
causes problems related to reliability, timing violations and lifetime of electronic devices. 
Dynamic Thermal Management (DTM) techniques have emerged as potential solutions 
that mitigate the increasing temperatures on a die. However, the scaling of system sizes 
and the adoption of the Network-on-Chip (NoC) paradigm as the interconnection fabric 
exacerbate the problem, as both cores and NoC elements contribute to the increased heat 
dissipation on the chip. 
Typically, DTM techniques can be either proactive or reactive. Proactive DTM 
techniques, in which the system predicts the thermal profile of the chip ahead of time, 
are more desirable than reactive DTM techniques, in which the system relies on thermal 
sensors to determine the current temperature of the chip. 
Moreover, existing DTM techniques address core-level and NoC-level thermal issues 
separately. Hence, this thesis proposes a combined proactive DTM technique that 
integrates both core level and NoC level DTM techniques. The combined DTM 
mechanism includes a dynamic temperature-aware routing approach for the NoC level 
elements, and includes task reallocation heuristics for the core level elements. 
On-chip wireless interconnects, recently envisioned to enable energy-efficient data 
exchange between cores in a multicore chip, are used to provide a broadcast-capable 
medium that efficiently distributes thermal control messages to trigger and manage 
the DTM. Combining the proactive DTM technique with on-chip wireless interconnects 
keeps the on-chip temperature within target temperatures without significantly 
affecting the performance of the NoC based interconnection fabric of the multicore chip. 
  
 
Table of Contents 
Dedication 
Acknowledgements 
Abstract 
List of Figures 
Chapter 1 Introduction 
    1.1. Thesis Contribution 
Chapter 2 Related Work 
Chapter 3 Combined Dynamic Thermal Management Scheme 
    3.1. Task Allocation Heuristics 
    3.2. Temperature-Aware Rerouting 
    3.3. Thermal Predictor 
Chapter 4 Test Cases and Evaluations 
    4.1. Topology 
    4.2. Antenna and Transceiver 
    4.3. Flow Control and Wireless Communication Protocol 
    4.4. Simulation Environment 
    4.5. Determination of an Optimized WiNoC Topology 
    4.6. Thermal Characteristics of Combined DTM 
    4.7. Performance tradeoffs of combined DTM scheme 
Chapter 5 Conclusions 
 
 
List of Figures 
Figure 1: An example small-world NoC architecture having wireless transceivers. 
Figure 2: The proposed combined Dynamic Thermal Management (DTM) scheme. 
Figure 3: Thermal control message format. 
Figure 4: The proposed task reallocation heuristic. 
Figure 5: Hysteresis based link cost function. 
Figure 6: Subdivided ANN streams of the proposed structure. 
Figure 7: Thermal profile evaluation simulation flow. 
Figure 8: Performance optimization with different number of WIs in a 64 core WiNoC. 
Figure 9: Maximum chip temperature with and without combined DTM for (a) CANNEAL, 
(b) BODYTRACK, (c) VIPS, (d) FLUIDANIMATE, (e) SWAPTION, (f) FREQMINE, (g) FFT, 
(h) RADIX, and (i) LU traffic, for uniform time. 
Figure 10: Maximum chip temperature with and without combined DTM for (a) CANNEAL, 
(b) BODYTRACK, (c) VIPS, (d) FLUIDANIMATE, (e) SWAPTION, (f) FREQMINE, (g) FFT, 
(h) RADIX, and (i) LU traffic, run as long as the combined DTM scheme is triggered thrice. 
Figure 11: Maximum chip temperature with and without combined DTM scheme for CANNEAL 
traffic running for long duration. 
Figure 12: Comparison of normalized performance metric of the system with only 
temperature-aware rerouting with system with combined DTM. 
Figure 13: Transient temperature response for CANNEAL with two different target 
temperatures. 
Figure 14: Normalized bandwidth of system with combined DTM for different 
application-specific traffics. 
Figure 15: Normalized packet energy of system with combined DTM for different 
application-specific traffics. 
Figure 16: Normalized latency of system with combined DTM for different 
application-specific traffics. 
Figure 17: Effect of target temperature on performance in presence of CANNEAL traffic. 
 
  
 
Chapter 1 Introduction 
As multicore processors and System-on-Chip (SoC) architectures integrate hundreds of cores 
on a single chip, the interconnect fabric plays a crucial role in overall performance. For this 
purpose, energy-efficient and high-performance Network-on-Chip (NoC) architectures are 
utilized [9][34][43][44]. However, such traditional network fabrics suffer from a crucial 
performance and power consumption limitation in multicore chips, owing to the high power 
consumption and latency of increasingly long global wires. 
To overcome this limitation, NoCs adopt small-world architectures, which place long-range 
wired links between distant cores to reduce the average hop count [1]. As the system size 
scales up, the average number of hops for communication between any two cores increases. 
Small-world topologies reduce the average hop count in such heavily scaled-up systems by 
introducing long-distance “shortcuts” [43][45]. Small-world architectures can be further 
improved by replacing the long-range wired links with energy-efficient wireless links. Because 
these long-range links are known to carry heavy traffic, making them wireless yields high 
energy savings. Figure 1 depicts a small-world NoC architecture with broadcast-capable 
wireless transceivers. 
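The effect of long-range shortcuts on the average hop count can be illustrated with a small sketch. It assumes a plain 8 x 8 mesh as the baseline and two hypothetical shortcut links between opposite corners; neither the baseline nor the shortcut placement is taken from the thesis, and the function names are illustrative only.

```python
from collections import deque
from itertools import product


def mesh_adjacency(n):
    """Build the adjacency sets of an n x n mesh; node id = row * n + col."""
    adj = {r * n + c: set() for r, c in product(range(n), repeat=2)}
    for r, c in product(range(n), repeat=2):
        node = r * n + c
        if c + 1 < n:                      # link to the east neighbour
            adj[node].add(node + 1)
            adj[node + 1].add(node)
        if r + 1 < n:                      # link to the south neighbour
            adj[node].add(node + n)
            adj[node + n].add(node)
    return adj


def average_hop_count(adj):
    """Mean shortest-path length over all ordered pairs of distinct nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                       # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def add_shortcuts(adj, shortcuts):
    """Return a copy of adj with bidirectional long-range links added."""
    new_adj = {u: set(vs) for u, vs in adj.items()}
    for a, b in shortcuts:
        new_adj[a].add(b)
        new_adj[b].add(a)
    return new_adj


mesh = mesh_adjacency(8)
# Hypothetical shortcut placement: links between opposite corners of the grid.
small_world = add_shortcuts(mesh, [(0, 63), (7, 56)])
mesh_avg = average_hop_count(mesh)
small_world_avg = average_hop_count(small_world)
print(f"mesh: {mesh_avg:.3f} hops, with shortcuts: {small_world_avg:.3f} hops")
```

For the 64-node mesh the average is 16/3 hops, and any added shortcut can only lower it. How much it drops depends on where the shortcuts are placed, which is why the topology is optimized separately in Section 4.5.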
However, even when using such energy efficient architectures and wireless links, the chip 
temperature of highly scaled multicore systems continues to increase. This can be attributed to 
the formation of thermal hotspots within the chip. Prolonged operation at such high temperatures 
leads to issues with respect to reliability, timing variations and lifetime of the electronic device. 
This thesis looks at a Dynamic Thermal Management (DTM) technique to mitigate such thermal 
issues. DTM has been an active area of research for several years now. State-of-the-art multicore 
processors such as Intel Core i7 and Single Chip Cloud computer (SCC) use DTM mechanisms 
like Dynamic Voltage Frequency Scaling (DVFS) and clock/power gating [1]. 
 
 
Figure 1: An example small-world NoC architecture having wireless transceivers. 
 
 
Such DTM schemes are triggered based on temperatures measured by on-chip thermal 
sensors. This makes them reactive, with long reaction times. In addition, the characteristics 
and reliability of the sensors directly impact the effectiveness of the DTM scheme [2]. To 
overcome this, predictive or proactive DTM schemes have recently been proposed [3][4]. In [4], 
a Look-Up Table (LUT) based approach was proposed to predict the temperature of the chip at 
future time instants using a LUT that characterizes the thermal response of the chip. Based on 
the predicted future temperatures, task reallocation heuristics were triggered as part of the 
DTM scheme. However, this LUT based approach does not scale to large system sizes because 
of the large memory required to store the LUT. 
On the other hand, Artificial Neural Networks (ANNs) are increasingly used for modeling 
and prediction due to their ability to learn and adapt [5]. Recently, an ANN based predictor 
was designed to track and avoid inter-core traffic congestion in multicore chips [6]. In this 
thesis, an ANN based prediction engine is used to trigger the proposed proactive DTM 
mechanism. A prediction based DTM scheme avoids the transient temperature overshoot seen in 
reactive DTM approaches. 
In addition to being predictive, modern DTM mechanisms are required to combine 
core-level and interconnect-level DTM techniques. The NoC paradigm is an interconnection 
fabric that stitches hundreds of cores together on the same die [7]. The proposed DTM scheme 
combines core-level and interconnect-level techniques. It is equipped with an intelligent 
algorithm that triggers either core-level or NoC-level DTM depending on the requirements and 
the cause of the thermal emergency. Thus, the proposed scheme is a proactive thermal 
management technique that aims to maintain temperature uniformity across the chip by taking 
into consideration the temperatures of both the cores and the network-level elements. 
Meanwhile, to mitigate the thermal issues due to NoC components, there is an ongoing 
search for energy-efficient interconnects. Several emerging interconnect technologies, such as 
3D integration, photonic, RF and wireless transceivers for on-chip inter-core data transfer, 
have been considered [8]. Millimeter-wave (mm-wave) on-chip Wireless Interconnects (WIs) are 
capable of achieving multi-gigabit data transfer rates. They are also CMOS compatible and can 
be fabricated relatively easily. It is shown in [9] that mm-wave transceivers can be designed 
and implemented in an energy-efficient manner. 
The inherent capability of the mm-wave WIs to share the wireless channel and provide a 
broadcast capability has been shown to be beneficial for the exchange of control and 
synchronization information across a multicore chip [14]. Particularly, in this thesis, these WIs 
are used to send the utilization information of all the cores, switches and links to the thermal 
predictor. On receiving the utilization information, the thermal predictor estimates the thermal 
profile of the chip to trigger the combined DTM. The trigger information is packetized and 
broadcast by the DTM module for all the switches and cores to take action. 
When compared to the conventional solely electronic NoC, the mm-wave based wireless 
NoCs (WiNoCs) have been shown to achieve significantly lower energy consumption and higher 
bandwidth [10][11]. In these mm-wave based WiNoCs, multiple transceivers share the wireless 
channel using a token based Medium Access Control (MAC) for distributed asynchronous 
transmission without collisions or contention [12][13]. Thus, for this thesis, a WiNoC which 
incorporates energy efficient mm-wave wireless transceivers is considered. This WiNoC enables 
fast and efficient triggering of the proposed combined proactive DTM technique.  
 
1.1. Thesis Contribution 
The contributions of this work are summarized as follows: 
• Proposed combined DTM mechanism 
    • Design and implementation of a combined dynamic thermal management scheme for 
      WiNoCs. 
    • Proposed a new task reallocation heuristic which takes into consideration both the 
      temperatures of elements on the chip and the communication density of the network. 
• Integration of DTM scheme with thermal predictor for a WiNoC architecture 
    • Integrated the proposed DTM scheme with an ANN based thermal predictor for a 64 
      core WiNoC based architecture. 
• Evaluation of thermal characteristics and performance of the WiNoC utilizing the 
  proposed combined DTM technique 
 
  
 
Chapter 2 Related Work 
Dynamic Thermal Management (DTM) techniques have been extensively investigated in the 
literature, and both reactive and proactive techniques have been studied. Existing techniques 
concentrate on mitigating the temperature of either the cores or the NoC components 
individually, ensuring that the temperature of one or the other does not exceed a specified 
target threshold. Dynamic task migration is a common technique for reducing peak temperature. 
It redistributes current processes onto available cores based on the current thermal profile of 
the die. Several migration schemes are discussed in [15]. A distributed scheme in which tasks 
migrate among neighboring cores is presented in [16]. Based on the current thermal profile of 
the die, such a technique allows the system to respond to real-time thermal changes of the chip 
and adapt to the workload. In [17], an efficient task mapping algorithm under power constraints 
is proposed for heterogeneous and morphable cores. “Heat-and-Run”, a temperature-aware task 
assignment and migration scheme that manages the chip temperature dynamically, is proposed 
in [18]. In [3], both reactive and proactive thermal management schemes are proposed; these 
schemes use autoregressive moving average and lookup table based temperature estimation. A 
convex optimization method that controls and maintains the chip temperature within a 
user-defined target threshold is presented in [19]. Thermal Herd is a runtime distributed scheme 
for thermal management that allows NoC routers to collaboratively regulate the network 
temperature profile so as to avoid thermal emergencies while minimizing performance 
impact [20]. 
At the same time, researchers are investigating NoCs designed with wireless interconnects 
so as to reduce the energy consumption of on-chip data transfer. A wireless NoC architecture 
based on CMOS Ultra Wideband (UWB) technology is proposed in [10]. In [21], a wireless NoC 
with some unequal RF transceivers is proposed to improve the performance of the conventional 
mesh topology. To enable the concurrent use of wireless channels, [22] discusses a 
time-multiplexed Medium Access Control (MAC) protocol that uses the ultra-short pulses 
generated by UWB transceivers. The design of miniature antennas that can operate in the 
sub-THz range of 100-500 GHz is proposed in [23]. The authors of [11] describe a wireless NoC 
with a small-world topology that uses carbon nanotube (CNT) antennas operating in the THz 
frequency range. However, integrating these antennas with standard CMOS processes requires 
overcoming significant manufacturing challenges. In contrast, mm-wave wireless CMOS on-chip 
antennas operating in the sub-THz frequency range are a more near-term solution. In [24], such 
mm-wave on-chip antennas are designed and evaluated for intra- and inter-chip communication. 
In [45], a system-level performance evaluation of intra- and inter-chip communication using 
mm-wave links is presented. Several possibilities for enabling on-chip communication through 
wireless antennas have been explored in [25] and [9], both of which propose a wireless 
architecture using long-range wireless shortcuts. It is shown in [26] that the WiNoC has a 
better temperature profile than the traditional mesh. It is also shown in [27] that incorporating 
a power management scheme such as Dynamic Voltage and Frequency Scaling (DVFS) in a WiNoC 
can improve the thermal profile of the chip. In [28], the authors discuss the effect of a thermal 
management scheme on a WiNoC architecture. In [29], a temperature-aware rerouting scheme 
for a wireless network-on-chip architecture is proposed. This thesis uses energy-efficient 
broadcast-capable wireless transceivers to communicate control packets between the cores and 
the thermal predictor, which has not yet been explored. 
However, most current research focuses on thermal management schemes for the processing 
cores or the NoC components individually, even though each contributes significantly to the 
temperature of the other and thus to the overall chip temperature. This thesis therefore 
addresses local hotspots in multicore chips through a combined dynamic thermal management 
scheme that uses a temperature-aware routing strategy and task reallocation to maintain the 
chip temperature within specified target thresholds. 
 
Chapter 3 Combined Dynamic Thermal Management Scheme 
In this chapter, a method for equipping the scheduler in a multicore chip with a combined 
proactive thermal management scheme is proposed. The proposed technique combines 
temperature-aware task reallocation with network-level rerouting to improve the thermal 
profile while minimizing the impact on system performance. As part of this proactive 
technique, a neural network based thermal predictor predicts the temperature so that 
corrective measures can be taken in advance, instead of reacting to temperatures measured by 
on-chip thermal sensors. This avoids the transient temperature overshoots that occur before 
reactive thermal management measures are activated. 
By redistributing the workload, task reallocation reduces the temperature of the cores in 
the system. However, task reallocation may not be able to remove network hotspots. For 
example, if a NoC switch happens to be a busy junction, it will heat up much more quickly than 
the other switches on the chip, which affects its reliable and sustainable operation. The same 
issue arises for specific links between switches that experience heavy traffic. In such cases, 
task reallocation may not be able to reduce the traffic activity on those NoC components. Thus, 
this thesis investigates a dynamic routing approach that takes into account the temperatures of 
NoC components (switches and links). The idea is to dynamically reconfigure the routing paths 
in response to temperature increases, so that heat dissipation is distributed more evenly. 
To activate the dynamic thermal management mechanism, a single target temperature 
threshold, Tth, is considered. However, because task reallocation changes the traffic among 
cores by redistributing tasks, triggering task reallocation may affect rerouting decisions. For 
this reason, a sliding window method is employed to trigger the combined DTM scheme, so as to 
avoid oscillations between task reallocation and rerouting. The neural network based thermal 
estimator, described in Section 3.3, predicts component temperatures for all time instances 
within the window, starting from the next time instance. If the estimated temperature of any 
core exceeds Tth, the scheduler using the combined DTM scheme activates a temperature-aware 
task reallocation scheme. 
However, if the predicted temperature of a switch or link exceeds Tth, the scheduler slides 
the window to predict the core temperatures within the next window interval. If any core 
exceeds Tth in the next window, the scheduler triggers task reallocation instead of rerouting; 
otherwise, it triggers temperature-aware rerouting. It is assumed that the scheduler is housed 
in one of the cores on the chip. Figure 2 shows a schematic representation of the combined 
thermal management scheme with a sliding window. 
 
Figure 2: The proposed combined Dynamic Thermal Management (DTM) scheme. 
 
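The trigger logic described above can be sketched as a small decision function. This is a minimal illustration rather than the scheduler implementation: the dictionary layout of a window prediction, the function names, and the 85-degree threshold are assumptions introduced here, and the predictor for the next window is passed in as a callable standing in for the ANN estimator.

```python
from typing import Callable, Dict, List

# Predicted peak temperature of every component within one window,
# grouped by component type (hypothetical layout).
WindowPrediction = Dict[str, List[float]]


def any_exceeds(temps: List[float], t_th: float) -> bool:
    """True if at least one predicted temperature is above the threshold."""
    return any(t > t_th for t in temps)


def choose_dtm_action(
    current: WindowPrediction,
    predict_next_window: Callable[[], WindowPrediction],
    t_th: float,
) -> str:
    """Return 'task_reallocation', 'rerouting' or 'none' for this window."""
    # A hot core in the current window always triggers task reallocation.
    if any_exceeds(current["cores"], t_th):
        return "task_reallocation"
    # A hot switch or link: slide the window and check the cores first,
    # because a later reallocation would change the traffic being rerouted.
    if any_exceeds(current["switches"] + current["links"], t_th):
        upcoming = predict_next_window()
        if any_exceeds(upcoming["cores"], t_th):
            return "task_reallocation"
        return "rerouting"
    return "none"


T_TH = 85.0  # illustrative target threshold, not a value from the thesis
noc_hot = {"cores": [70.0, 72.0], "switches": [88.0], "links": [60.0]}
cores_stay_cool = lambda: {"cores": [71.0, 73.0], "switches": [], "links": []}
cores_heat_up = lambda: {"cores": [86.0, 73.0], "switches": [], "links": []}

action_a = choose_dtm_action(noc_hot, cores_stay_cool, T_TH)
action_b = choose_dtm_action(noc_hot, cores_heat_up, T_TH)
print(action_a, action_b)  # rerouting task_reallocation
```

Deferring the rerouting decision until the next window has been checked is what prevents the scheme from rerouting traffic that an imminent task reallocation would reshape anyway.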
In the implementation considered in this thesis, the neural network based thermal 
estimator periodically predicts the temperatures of all chip components, i.e., cores, switches 
and links. If the temperature of any component exceeds the target threshold, that component 
is flagged and a notification flit containing either distance vector or DTM triggering 
information is sent to the Wireless Interconnect (WI) nearest to that component. 
From the WI, this one-flit thermal control message is routed downstream to the target 
switch whose associated component has exceeded the target threshold temperature. The format 
of this thermal control message is shown in Figure 3. 
 
 
Figure 3: Thermal control message format. 
 
 
3.1. Task Allocation Heuristics 
A task reallocation heuristic that considers both the temperature characteristics of the 
chip and the performance of the WiNoC is proposed in this thesis. Task reallocation is 
triggered using the estimated temperature from a neural network based thermal predictor. The 
heuristic is combined with temperature-aware rerouting, and the resulting temperature, 
performance and energy efficiency of data communication over the defined WiNoC architecture 
are evaluated. A task reallocation algorithm based on Future Temperature Trends (FTT) was 
proposed in [4]. It is a temperature-prioritized method in which threads with the highest power 
consumption are allocated to either the fastest cooling or the slowest heating cores. The 
algorithm in [4] is adopted and modified to factor in the performance of the NoC and the 
temperatures of the links and switches, along with the temperatures of the cores. A thermal 
weight, Tw, is defined to capture the rate of heating or cooling of a core, as shown in 
Equation (1). All cores are classified into two sets, Core+ for those with increasing 
temperature and Core- for those with decreasing temperature, based on the difference between 
the current and predicted temperatures of the core. Each core is assigned a weight, Tw, as: 
 
                   
                                       
                                        
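$$T_w = \begin{cases} T + a^{+}, & \text{core} \in \text{Core+} \\ T - a^{-}, & \text{core} \in \text{Core-} \end{cases} \qquad (1)$$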
 
where T is the current temperature, and a+ and a- are the temperature increment and 
decrement, respectively. Tasks with the highest power dissipation are mapped onto the cores 
with the lowest thermal weights. This yields a thermally optimal distribution of tasks and the 
best temperature uniformity across cores. However, the performance resulting from this 
reallocation may not be optimal, as highly communicating tasks may be mapped onto distant 
cores, so that packets traverse multiple hops before reaching their destination. For this 
reason, a performance weight factor, Pw, which accounts for the underlying architecture, is 
introduced for each core. Pw is computed from the communication density and hop counts 
between the cores. To improve performance, tasks with high communication densities should be 
allocated to cores that are as close to each other as possible. The performance factor Pw is 
proportional to the hop count between the core being considered for task reallocation and all 
other cores, and to its communication density with them. Equation (2) shows how Pw,i is 
computed for the i-th core. 
 
$$P_{w,i} = \sum_{j \neq i} h_{ij} f_{ij} \qquad (2)$$
 
 
Here, hij is the number of hops between core i and core j, and fij is the communication 
density between the two cores. To consider both performance and thermal characteristics, a 
final weight, W, is computed as follows: 
 
$$W = \alpha \hat{T}_w + (1 - \alpha) \hat{P}_w \qquad (3)$$
 
where $\hat{T}_w$ and $\hat{P}_w$ are the normalized thermal and performance weights, and 
$\alpha$ is a weight parameter between 0 and 1 that controls the relative importance of 
temperature and performance in the heuristic. $\alpha$ is set to 0.5 so as to give equal 
importance to the temperature and performance weights. The core selected for task reallocation 
is the one with the minimum W, obtained from Equation (3). Figure 4 is an algorithmic 
representation of the heuristic. 
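The core selection step can be illustrated with a short sketch. It assumes that the per-core thermal weights are already available, that the performance weight of a core is the sum over all other cores of hop count times communication density, and that both weights are min-max normalized before being blended. The normalization method, the parameter name `alpha`, the term it multiplies, and all example values are assumptions introduced here.

```python
def performance_weights(hops, density):
    """Per-core sum of hop count times communication density."""
    return [
        sum(h * f for h, f in zip(hop_row, density_row))
        for hop_row, density_row in zip(hops, density)
    ]


def normalize(values):
    """Min-max scale a list to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_core(thermal_w, perf_w, alpha=0.5):
    """Return the index of the core with the lowest combined weight."""
    t_hat = normalize(thermal_w)
    p_hat = normalize(perf_w)
    combined = [alpha * t + (1 - alpha) * p for t, p in zip(t_hat, p_hat)]
    best_core = min(range(len(combined)), key=combined.__getitem__)
    return best_core, combined


# Four cores in a row, so the hop count between core i and core j is |i - j|.
hops = [[abs(i - j) for j in range(4)] for i in range(4)]
# Illustrative communication densities between pairs of cores.
density = [
    [0, 4, 1, 0],
    [4, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
]
thermal = [80.0, 70.0, 75.0, 90.0]   # illustrative thermal weights

perf = performance_weights(hops, density)
best, w = select_core(thermal, perf, alpha=0.5)
print(perf, best)  # [6, 8, 7, 5] 0
```

In this example core 1 is the coolest and core 3 has the best communication locality, yet core 0 is selected because it is the best compromise when both criteria carry equal weight.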
 
Figure 4: The proposed task reallocation heuristic. 
 
3.2. Temperature-Aware Rerouting 
In this section, a dynamic temperature-aware rerouting mechanism is proposed that takes 
into account switch and link temperatures so that heat dissipation is evenly distributed. The 
mechanism is based on the Distance Vector Routing (DVR) algorithm [32], which builds on the 
Bellman-Ford equation, adapted to the NoC environment. DVR was designed to support routing 
under dynamic conditions in large-scale networks. It is the de-facto standard for intra-domain 
routing over the Internet, where varying congestion conditions and traffic flows affect network 
integrity [32]. 
Nodes (switches) maintain the cumulative path cost to all other nodes in the network, also 
known as the distance vector. In addition, each switch has a forwarding table containing the 
information about the next hop for all destinations. If a change in the link cost is 
detected, the distance vectors are packetized along with the time stamp and advertised between 
the nearest neighbors by all the switches. The routing table of a switch is updated every time a 
change in the link cost is triggered by the temperature of a NoC component exceeding the 
threshold. Thus, the routing tables in the switches may change several times until the entire 
network converges. This may result in the creation of deadlocks. In order to avoid such 
deadlocks, the proposed rerouting scheme uses two routing tables for every switch. The old 
routing table is used to route all the data packets until the network converges. Only after network 
convergence the  newly  calculated  routing  table  is  used  to  route  all  the  newly  generated  
data  packets. The worst convergence delay of the routing protocols are determined by the time 
necessary for propagating new paths. This propagation delay depends on the connectivity of the 
network i.e. maximum diameter of the network and message processing delay of the switches 
[33]. For the implementation used in this thesis, each switch has one virtual channel (VC) to 
process the packetized distance vector information. Moreover, since the control packets are of 
short length, the delay in message processing is significantly reduced. To add to this, these 
control packets use the wireless links that are deployed to optimize the network performance by 
minimizing the average hop count. A well connected small-world network augmented with 
single hop WIs ensures faster convergence of the DVR algorithm. Experimentally, it has been 
found that a period of 600 cycles is sufficient for the entire network to converge in a 64 core 
system. Thus, all switches start using the new routing table information 600 cycles after the 
rerouting is triggered. Switches know the exact time to start as the thermal predictor broadcasts 
the time stamp of the last prediction when it sends the control flit that triggers the rerouting.  
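
The two-table scheme described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the class and function names (`Switch`, `converge`, `commit`, `activation_cycle`) are hypothetical, and the advertisement exchange is modeled as synchronous rounds rather than as time-stamped control packets on a reserved VC. Only the 600-cycle convergence period is taken from the text.

```python
from dataclasses import dataclass, field

INF = float("inf")
CONVERGENCE_CYCLES = 600  # period reported in the text for a 64 core system


@dataclass
class Switch:
    name: str
    link_cost: dict                               # neighbor -> cost of the direct link
    active: dict = field(default_factory=dict)    # old table: destination -> next hop
    pending: dict = field(default_factory=dict)   # new table, built during convergence
    dist: dict = field(default_factory=dict)      # distance vector: destination -> cost

    def reset(self):
        """Start a fresh distance-vector computation; the active table is untouched."""
        self.dist = {self.name: 0}
        self.pending = {}

    def absorb(self, neighbor, vector):
        """Relax paths using a neighbor's advertised vector; True if anything changed."""
        changed = False
        for dest, cost in vector.items():
            candidate = self.link_cost[neighbor] + cost
            if candidate < self.dist.get(dest, INF):
                self.dist[dest] = candidate
                self.pending[dest] = neighbor
                changed = True
        return changed

    def next_hop(self, dest):
        """Data packets are always routed with the active (old) table."""
        return self.active[dest]


def converge(switches):
    """Exchange distance vectors in rounds until no switch changes its vector."""
    for sw in switches.values():
        sw.reset()
    rounds, changed = 0, True
    while changed:
        changed = False
        adverts = {name: dict(sw.dist) for name, sw in switches.items()}
        for sw in switches.values():
            for neighbor in sw.link_cost:
                if sw.absorb(neighbor, adverts[neighbor]):
                    changed = True
        rounds += 1
    return rounds


def activation_cycle(trigger_timestamp):
    """Cycle at which every switch swaps to the new table."""
    return trigger_timestamp + CONVERGENCE_CYCLES


def commit(switches):
    """Swap tables on all switches (performed at the activation cycle)."""
    for sw in switches.values():
        sw.active = dict(sw.pending)


# Hypothetical three-switch example: the direct A-C link becomes too hot.
net = {
    "A": Switch("A", {"B": 1, "C": 1}),
    "B": Switch("B", {"A": 1, "C": 1}),
    "C": Switch("C", {"A": 1, "B": 1}),
}
converge(net)
commit(net)
net["A"].link_cost["C"] = INF
net["C"].link_cost["A"] = INF
converge(net)                      # new tables are computed in the background
hop_before_swap = net["A"].next_hop("C")
commit(net)
hop_after_swap = net["A"].next_hop("C")
```

Until `commit` is called, data packets keep following the old table, which mirrors the rule that the new table is used only once the whole network has converged.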
 
When applying DVR, a link cost function is designed that responds locally to changing 
conditions, so that information is not routed through switches or links experiencing dangerously high 
or rising temperatures. In a dynamic routing scenario, DVR scales 
better than Dijkstra’s algorithm, since it does not require the aggregation of link state 
information from the entire network at a single authority. Thus, a local cost function that enables 
a distributed, adaptive, dynamic routing mechanism to avoid thermal hotspots is designed 
(Equation (4)). 

$$
C_{s,d} =
\begin{cases}
\infty, & \text{if } T_i > T_{th}^{H} \\
1, & \text{if } T_i < T_{th}^{L} \\
\text{unchanged}, & \text{otherwise}
\end{cases}
\qquad (4)
$$

Where $s$ is the source, $d$ is the node across the link, $T_i$ is the temperature of the $i$-th 
component and $T_{th}^{H}$, $T_{th}^{L}$ are the hysteresis thresholds as depicted in Figure 5. 
$T_{th}^{H}$ is considered to be equal to the target temperature of the proposed combined DTM 
scheme and $T_{th}^{L}$ to be 3ºC less than the target temperature to limit frequent network 
reconfiguration. At network initialization, the link costs of all switches are 1, supporting 
minimum path routing. The initial distance vectors are pre-computed based on the topology 
using Dijkstra’s algorithm and the forwarding-tables in the switches are initialized accordingly. 
During operation, the scheduler will predict temperatures of links and switches at certain 
intervals of time. If a link or switch temperature rises above $T_{th}^{H}$, it is effectively deactivated by 
setting the cost to $\infty$. Following the DVR protocol, the switch advertises a new distance vector to 
its neighbors. This forces the network to reconfigure its routing to alternative paths 
circumventing the link and/or the switch. Following the reduction of relaying through the hot 
link/switch, it cools down over time. When the temperature of a link or switch falls below the lower hysteresis threshold $T_{th}^{L}$, a 
cost of 1 is assigned to that component. The switch advertises its new distance vector and the 
network can use the component as a relay once more. The hysteresis structure of the cost 
function limits the rate of network reconfigurations thereby providing a stable network. 
In practical scenarios, the network paths using DVR eventually converge to the shortest 
path routing tree obtained through Dijkstra’s algorithm [32]. The old forwarding table is used till 
the entire network reaches convergence to avoid multiple paths between same source/destination 
pairs. Deadlock is avoided as at any point of time the flits are transferred over paths along the 
shortest path routing tree. 
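
The hysteresis behavior of the link cost can be illustrated with a short sketch. The function name `link_cost` and the temperature trace are hypothetical; the thresholds follow the text (upper threshold equal to a 68ºC target, lower threshold 3ºC below it), and the assumption that the cost keeps its previous value between the two thresholds is what the hysteresis description implies.

```python
import math

TARGET_TEMP_C = 68.0           # target temperature used in the experiments
T_HIGH = TARGET_TEMP_C         # upper hysteresis threshold
T_LOW = TARGET_TEMP_C - 3.0    # lower hysteresis threshold


def link_cost(temp_c, prev_cost, t_high=T_HIGH, t_low=T_LOW):
    """Return the cost of a link or switch given its predicted temperature."""
    if temp_c > t_high:
        return math.inf        # component is effectively deactivated
    if temp_c < t_low:
        return 1               # component may be used as a relay again
    return prev_cost           # between thresholds: keep the previous cost


# Hypothetical temperature trace of one link (degrees Celsius).
trace = [64.0, 67.0, 68.5, 67.0, 65.5, 64.9, 66.0]
cost, costs = 1, []
for temp in trace:
    cost = link_cost(temp, cost)
    costs.append(cost)

# Each change of cost corresponds to one new distance vector advertisement.
reconfigurations = sum(1 for a, b in zip([1] + costs, costs) if a != b)
```

In this trace the cost changes only twice even though the temperature crosses the region between the thresholds several times, which is the stabilizing effect the hysteresis is meant to provide.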
3.3. Thermal Predictor  
The efficiency of a DTM scheme depends on proper thermal estimation and response 
delay of the control mechanism. From this viewpoint, we can divide DTM techniques into two 
types: reactive and proactive DTM. In case of reactive DTM methods, generally on-chip thermal 
sensors are used to sense and measure the temperature. This allows the hardware to execute at 
full speed and initiate a corrective measure only when the temperature reaches a thermal limit 
 
Figure 5: Hysteresis based link cost function. 
 
 
which invokes a temperature control mechanism. However, the effectiveness of such a system 
relies on the associated response delay. Due to the response delay, thermal thresholds need to be 
set such that temperature overshoots that impact the system’s performance are avoided. 
Moreover, on-chip thermal sensors are sensitive to process variations and without recalibration, 
can report erroneous temperature [2]. 
In proactive or predictive DTM, the future temperature is estimated from the performance 
counters or utilization metrics in advance, hence eliminating the chance of temperature 
overshoots. Due to this, the impact on performance is considerably reduced for such proactive 
DTM mechanisms [3]. However, since these thermal management schemes have to be 
implemented on-chip, it is imperative that these schemes have low computational and area 
overheads. In this section, we look at two possible temperature prediction schemes: first, a LUT 
based approach, and second, an ANN based approach. 
In [4], the authors proposed an event driven LUT based thermal predictor for a 2x2 
system. Such an LUT based scheme was shown to be computationally less intensive and was 
capable of predicting the thermal profile very accurately. Even though the LUT based approach 
is known to be more accurate, for a 64 core system it is considerably taxing on the memory to 
house such an estimator when compared to its ANN based thermal estimator counterpart.  
ANN based prediction mechanisms have been shown to perform with relatively high 
levels of accuracy for different applications [5]. In this thesis, a hardware ANN-based 
thermal predictor as proposed in [6] is adopted. This ANN has been trained such that it is capable 
of predicting the temperature of any component at any given time based on the utilization of the 
chip components. The inputs to the ANN are utilization of all the chip components and the time 
 
at which the temperature of the system needs to be predicted. The output of the ANN is 
compared with the target threshold, producing a single bit output for each component.  
This single bit denotes whether the component’s temperature has crossed the target temperature 
threshold. The ANN consists of two elements: neurons or computational nodes, and synapses, 
which are the interconnections between neurons. So as to reduce the hardware overheads and to 
reuse available resources, the neurons are realized as parallel multiply-accumulate (MAC) 
operational units. 
As shown in Figure 6, the trained ANN consists of three streams: a core stream, a link 
stream and a switch stream. Having such a structure, with a separate stream for each 
element (core, link and switch), improves the accuracy and reduces the number of connections 
between the hidden layer and the output layer, i.e. the fully connected neurons exist only between 
 
 
Figure 6: Subdivided ANN streams of the proposed structure. 
 
 
the hidden neurons of cores and the output neurons of the cores. This reduction in the number of 
connections between the hidden neurons and the output neurons also decreases the latency of the 
prediction. For the 64 core system considered for this thesis, the total number of hidden layer 
neurons used for the core stream is 250, while for the switch and link streams it is 50 and 100 
respectively. Log-sigmoid and linear activation functions are respectively used for the hidden 
and output layers of the ANN. This results in better accuracy in prediction.  
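
The forward pass of one stream can be sketched as below, assuming a single hidden layer with log-sigmoid activation, a linear output layer, and a comparison against the target temperature that yields one bit per component. The helper names (`logsig`, `mac`, `predict_stream`) and the toy weights are hypothetical; the trained weights of [6] are not reproduced here. The hidden layer sizes in `HIDDEN_NEURONS` are the ones quoted in the text.

```python
import math

# Hidden layer sizes reported in the text for the 64 core system.
HIDDEN_NEURONS = {"core": 250, "switch": 50, "link": 100}


def logsig(x):
    """Log-sigmoid activation used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))


def mac(weights, inputs, bias):
    """Multiply-accumulate, the operation each neuron is realized with."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc


def predict_stream(inputs, w_hidden, b_hidden, w_out, b_out, target_temp):
    """Return one bit per component: 1 if the predicted temperature exceeds the target."""
    hidden = [logsig(mac(w, inputs, b)) for w, b in zip(w_hidden, b_hidden)]
    temps = [mac(w, hidden, b) for w, b in zip(w_out, b_out)]   # linear output layer
    return [1 if t > target_temp else 0 for t in temps]


# Toy stream with one hidden neuron and two output components (made-up weights).
# Inputs: two utilization values and the prediction time, all zero here.
bits = predict_stream(
    inputs=[0.0, 0.0, 0.0],
    w_hidden=[[1.0, 1.0, 1.0]],
    b_hidden=[0.0],
    w_out=[[10.0], [2.0]],
    b_out=[64.0, 64.0],
    target_temp=68.0,
)
```

With all-zero inputs the hidden neuron outputs 0.5, so the two predicted temperatures are 69.0 and 65.0 and only the first component is flagged.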
 
 
 
Chapter 4 Test Cases and Evaluations 
In recent times several WiNoC architectures have been explored. Small-world networks 
are a type of complex networks that are often found in nature, which are characterized by short 
distance and long range links [30]. These small-world networks have much improved 
performance as they have a very low average number of hops between any set of nodes, even in 
a very large network. Thus, such small-world networks are suitable for use in scalable, hybrid 
WiNoCs where conventional wireline interconnects are augmented with long-range wireless 
shortcut links. For this thesis, the approach followed by [29] in designing a WiNoC based 
architecture utilizing long-range wireless interconnects that are overlaid on a conventional 
wireline NoC is adopted. However, unlike in [29], in this thesis the wireless interconnects are 
used to broadcast temperature related control information to all the NoC switches. From 
several earlier research works it has been noted that the mm-wave wireless antennas are not 
directional and thus can be used for broadcast type transmissions over shared wireless channels.  
This thesis focuses on sending utilization information from all cores, switches and links 
using the shared wireless channel to a thermal predictor which then estimates the temperature of 
various components of the chip based on the information sent. The utilization of the cores is 
represented as the percentage of utilization of the processors, while the utilization of the 
switches is measured as the ratio of the actual buffer occupancy to the maximum buffer 
capacity. Utilization of links is measured at the switches attached to these links as the ratio of 
the actual rate of flits transferred over the link to the maximum capacity of the link. All utilization 
values are expressed in percentages.  
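
As a small illustration of these definitions, the two ratios can be written as follows. The function names, the measurement window and the example numbers are hypothetical; the default link capacity of one flit per cycle is an assumption made for this sketch.

```python
def switch_utilization(occupied_flits, buffer_capacity_flits):
    """Buffer occupancy relative to the maximum buffer capacity, in percent."""
    if buffer_capacity_flits <= 0:
        raise ValueError("buffer capacity must be positive")
    return 100.0 * occupied_flits / buffer_capacity_flits


def link_utilization(flits_transferred, window_cycles, max_flits_per_cycle=1.0):
    """Observed flit rate relative to the maximum link capacity, in percent."""
    if window_cycles <= 0 or max_flits_per_cycle <= 0:
        raise ValueError("window and capacity must be positive")
    return 100.0 * flits_transferred / (window_cycles * max_flits_per_cycle)


# Example: 6 of 10 buffer slots occupied; 250 flits sent in a 1000-cycle window.
sw_util = switch_utilization(6, 10)
ln_util = link_utilization(250, 1000)
```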
 
Every switch collects the utilization percentages of the core and outgoing links connected 
to it as well as its own utilization percentage, then packetizes this information and sends it to the 
thermal predictor through the wireless channel. Once the thermal predictor receives this 
information, the predictor estimates the future temperature profile of the chip so as to trigger the 
combined dynamic thermal management scheme that is proposed in this thesis. Due to the efficient 
use of WIs in the WiNoC, the overall hop count is reduced, and the switches and cores 
receive the thermal control packets quickly. The combined effect of the wireless transmission of 
control information along with the prediction based triggering of the thermal management 
scheme ensures a quick response, making sure there are no transient overshoots in temperature 
above the defined target temperature threshold. In the following two sub-sections, we look at 
the topology of the WiNoC and the physical layer of the wireless interconnects that are used to 
set up the simulation environment. The WiNoC discussed forms the platform on which we 
evaluate the proposed combined DTM technique. 
4.1. Topology 
The proposed WiNoC is a hybrid, small-world topology where each core is connected to 
a NoC switch, and each switch is connected to other switches using wired and wireless links as 
already shown in Figure 1. This topology is adopted from the WiNoC architecture discussed in 
[11]. In order to establish wired links and at the same time satisfy the properties of small-world 
graphs, the wireline topology is generated according to the inverse power law, so as to reduce 
wiring costs [31]. 
In Equation (5), $P(i,j)$ is the probability of establishing a link between two switches $i$ 
and $j$, $l_{ij}$ is the Manhattan distance between the switches, $f_{ij}$ is the frequency of communication 
between the switches $i$ and $j$, and $n$ is the total number of switches. Using the probability 
distribution mentioned in (5), a Monte Carlo method is used to establish links between pairs of 
switches. Also, from (5) it can be seen that the probability of establishing a link between two 
switches $i$ and $j$ which are separated by distance $l_{ij}$ is proportional to the distance raised to a 
finite negative power, $-\alpha$. The value of $\alpha$ is chosen so as to have optimized wiring costs [31]. 

$$
P(i,j) = \frac{l_{ij}^{-\alpha} f_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} l_{ij}^{-\alpha} f_{ij}} \qquad (5)
$$
 
 
To compute the distance between switches, a tile-based floorplan of the cores on the die 
is considered. In (5), the frequency of communication between cores, $f_{ij}$, is also considered. This 
makes sure that more frequently communicating cores have a greater probability of having a 
direct link, which optimizes the topology if the application specific non-uniform traffic pattern is 
known a priori. Such a power law based link distribution results in both short distance 
interconnections and long distance interconnections, as the probability of long distance 
links is always non-zero.  
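
The power law link distribution and the Monte Carlo link selection can be sketched as follows. This is an illustration under stated assumptions, not the generator used in the thesis: the names `link_probabilities` and `sample_links`, the 4x4 tile grid, the value of `alpha` and the uniform communication frequency are all hypothetical. The sketch normalizes over unordered switch pairs and does not enforce connectivity of the resulting graph.

```python
import itertools
import random


def manhattan(a, b):
    """Manhattan distance between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def link_probabilities(coords, freq, alpha):
    """Probability of each unordered switch pair, proportional to l^(-alpha) * f."""
    weights = {}
    for i, j in itertools.combinations(range(len(coords)), 2):
        distance = manhattan(coords[i], coords[j])
        weights[(i, j)] = (distance ** -alpha) * freq(i, j)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def sample_links(probs, num_links, rng):
    """Monte Carlo selection of distinct links according to the distribution."""
    pairs = [p for p in probs if probs[p] > 0]
    if num_links > len(pairs):
        raise ValueError("not enough candidate pairs")
    weights = [probs[p] for p in pairs]
    chosen = set()
    while len(chosen) < num_links:
        chosen.add(rng.choices(pairs, weights=weights, k=1)[0])
    return chosen


# Hypothetical 4x4 tile floorplan with uniform traffic and alpha = 2.0.
coords = [(r, c) for r in range(4) for c in range(4)]
probs = link_probabilities(coords, freq=lambda i, j: 1.0, alpha=2.0)
links = sample_links(probs, num_links=20, rng=random.Random(0))
```

Because every pair keeps a non-zero probability, a few long links are still selected occasionally, while neighboring tiles are chosen most often.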
After establishing the wireline topology, the WIs are deployed over the NoC so as to 
maximize performance. As already seen in Figure 1, the processing core consisting of the 
thermal prediction unit along with the scheduler is associated with a WI. This ensures fast and 
efficient exchange of broadcast type thermal control signals to and from the scheduler. Besides 
this WI, the number and location of other WIs are determined through an optimization step. In 
order to achieve the best performance of the WiNoC we optimize the average hop-count of the 
WiNoC by varying both the number and location of the WIs in the NoC. 
 
Due to the potentially huge search space and concerns about the scalability of the approach, we 
adopt a simulated annealing (SA) based optimization, which has been shown to achieve nearly optimal 
configurations in such WiNoC design problems in significantly fewer 
iterations than exhaustive search techniques. While increasing the number of 
WIs deployed in the NoC reduces the hop-count steadily due to better connectivity, the 
performance of individual wireless links degrades because the adopted MAC mechanism 
is a token passing protocol, as discussed in Section 4.3. As the token is circulated among a 
higher number of WIs, each WI is forced to wait longer for access to the wireless medium. This 
has a negative impact on the performance of the WiNoC. Thus, the hop count is optimized as a 
function of the location of the WIs using SA, and the WiNoC performance, measured as available 
bandwidth, is evaluated as a function of the number of WIs through system level simulations. Unlike 
in [29], in this thesis the traffic utilization is not optimized, as it is not practical to predict 
the traffic pattern accurately for unknown applications. Moreover, the proposed combined 
thermal management scheme includes temperature-aware rerouting to avoid thermal hotspots in 
a dynamic response to transient variations in temperature, which is a more practical approach to 
hotspot avoidance in the NoC components than static optimizations based on statistical 
knowledge of the expected traffic pattern as was proposed in [29]. 
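
A simulated annealing search over WI locations that minimizes the average hop count might look like the sketch below. Several simplifications are assumptions of this example and not taken from the thesis: the wired topology is a plain mesh rather than the small-world network, the number of WIs is fixed, every WI reaches every other WI in a single hop, and the annealing schedule (`steps`, `t0`, `cooling`) is arbitrary. All function names are hypothetical.

```python
import math
import random
from collections import deque


def mesh_adjacency(rows, cols):
    """Wired links of a rows x cols mesh, as node -> set of neighbors."""
    adj = {n: set() for n in range(rows * cols)}
    for r in range(rows):
        for c in range(cols):
            n = r * cols + c
            if c + 1 < cols:
                adj[n].add(n + 1)
                adj[n + 1].add(n)
            if r + 1 < rows:
                adj[n].add(n + cols)
                adj[n + cols].add(n)
    return adj


def average_hop_count(adj, wis):
    """Mean shortest-path length over all pairs, with WIs forming single-hop shortcuts."""
    wis = set(wis)
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            neighbors = adj[u] | (wis - {u} if u in wis else set())
            for v in neighbors:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def anneal_wi_placement(adj, num_wis, rng, steps=300, t0=1.0, cooling=0.98):
    """Search for WI locations with a low average hop count."""
    nodes = list(adj)
    current = set(rng.sample(nodes, num_wis))
    cur_cost = average_hop_count(adj, current)
    best, best_cost = set(current), cur_cost
    temp = t0
    for _ in range(steps):
        # Move: relocate one WI to a switch that currently has none.
        out = rng.choice(sorted(current))
        into = rng.choice([n for n in nodes if n not in current])
        cand = (current - {out}) | {into}
        cost = average_hop_count(adj, cand)
        # Accept improvements always, worse moves with a temperature-dependent probability.
        if cost < cur_cost or rng.random() < math.exp((cur_cost - cost) / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = set(cand), cost
        temp *= cooling
    return best, best_cost


adj = mesh_adjacency(4, 4)
wired_only = average_hop_count(adj, set())
best_wis, best_hops = anneal_wi_placement(adj, num_wis=3, rng=random.Random(1), steps=100)
```

In the thesis the number of WIs is then chosen separately through system level simulations, because the token based MAC penalizes large numbers of WIs; that second step is not modeled here.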
4.2. Antenna and Transceiver 
Two principal components of the WiNoC architecture are the antenna and the transceiver. 
The on-chip antenna for the WiNoC has to provide the best power gain for the smallest area 
overhead. A metal zigzag antenna has been demonstrated to possess these characteristics [24]. In 
addition, the zigzag antennas are not directional. For this reason, these zigzag antennas are 
 
suitable to transfer broadcast thermal control messages to all other WIs which are deployed in 
different parts of the chip as discussed in Chapter 3. For this thesis, the antenna design discussed 
in [9] is adopted which provides a 3dB bandwidth of 16 GHz with a center frequency around 60 
GHz for a communication range of 20 mm. For optimum power efficiency, the quarter wave 
antenna uses an axial length of 0.38 mm in the silicon substrate.  
To ensure high throughput and efficiency, the WI transceiver circuitry has to provide a 
very wide bandwidth as well as low power consumption. The transceiver design is adopted from 
[9], where low power design considerations are taken into account at the architectural level. 
Non-coherent on-off keying (OOK) modulation is chosen, as it allows a relatively simple and low 
power circuit implementation. 
4.3. Flow Control and Wireless Communication Protocol 
In the WiNoC, data is transferred via wormhole routing using virtual channel (VC) based 
switches. Data packets are broken down and transferred in the form of flow control units or flits 
which are the smallest amount of information that can be transferred between adjacent switches 
in one clock cycle [34]. In addition to normal VCs to transfer data flits, switches equipped with 
a WI will have one reserved VC to send the utilization and routing control packets. All the 
switches will packetize their link and switch activity information and send it to their nearest WIs, which 
in turn will send these packets to the scheduler. Based on this information, the scheduler will predict 
the temperature using the thermal predictor to trigger the proposed DTM scheme. 
Advances in antenna and mm-wave transceiver design in standard bulk CMOS 
technologies have made on-chip wireless interconnect feasible. However, the limited 
bandwidth of the wireless channels at such high frequencies limits the achievable performance 
benefits. Designing wireless transceivers in multiple frequency bands for enhancing the 
performance of the NoC is a non-trivial challenge and is not scalable in the near future. One way 
of avoiding interference and contention between multiple transmitters is a wireless token passing 
protocol that gives access to the medium to a single transmitter at a time. The token passing scheme 
eliminates the need for centralized control and arbitration among the transceivers, which might 
be located in distant parts of the die. Hence, the token passing scheme has been adopted in 
multiple WiNoC designs [9][13]. A single-bit register in the wireless switches can denote the 
presence of the token at a WI to minimize the associated hardware. When this register is set, it 
enables that particular WI to transmit data flits over the wireless medium. When the WI is done 
with its transmission it passes the token to the next WI in a round robin fashion. The token flit 
consists of two fields, nextWI and prevWI. The prevWI denotes the ID of the WI that released 
the token and the nextWI denotes the ID of the WI that will possess the token next. With the 
token passing protocol, a WI can communicate with any other WI while it possesses the token. 
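
The token mechanism described above can be sketched as follows. The class and function names (`TokenFlit`, `WirelessInterface`, `pass_token`) and the WI IDs are hypothetical; only the single-bit token register, the round robin order and the two token fields (prevWI and nextWI) come from the text.

```python
from dataclasses import dataclass


@dataclass
class TokenFlit:
    prev_wi: int   # ID of the WI that released the token
    next_wi: int   # ID of the WI that will possess the token next


class WirelessInterface:
    def __init__(self, wi_id):
        self.wi_id = wi_id
        self.has_token = False   # models the single-bit token register
        self.tx_queue = []       # flits waiting for the wireless medium

    def transmit(self):
        """Send all queued flits; only allowed while the token register is set."""
        if not self.has_token:
            raise RuntimeError("WI %d does not hold the token" % self.wi_id)
        sent, self.tx_queue = self.tx_queue, []
        return sent


def pass_token(wis, holder_index):
    """Release the token to the next WI in round robin order."""
    next_index = (holder_index + 1) % len(wis)
    token = TokenFlit(prev_wi=wis[holder_index].wi_id, next_wi=wis[next_index].wi_id)
    wis[holder_index].has_token = False
    wis[next_index].has_token = True
    return token, next_index


# Three WIs with made-up IDs; the token starts at the first one.
wis = [WirelessInterface(i) for i in (3, 17, 42)]
wis[0].has_token = True
wis[0].tx_queue = ["flit-a", "flit-b"]
sent = wis[0].transmit()
holder = 0
history = []
for _ in range(len(wis)):
    token, holder = pass_token(wis, holder)
    history.append((token.prev_wi, token.next_wi))
```

After one full rotation the token is back at the first WI, which shows why the waiting time of each WI grows with the number of WIs sharing the channel.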
4.4. Simulation Environment  
In order to evaluate the temperature profile, performance and energy consumption of the 
NoCs, an integrated simulation environment as shown in Figure 7 was developed. For 
application based traffic patterns, GEM5 [35], a full system simulator, was used to obtain detailed 
processor and network level information on SPLASH-2 [36] and PARSEC [37] benchmarks. A 
system running Linux with 64 Alpha cores is considered within the GEM5 platform. The memory 
system adopted is MOESI_CMP_directory, with a 64KB L1 instruction and data cache, and a 
shared 64MB (1MB distributed to each core) L2 cache. 
 
The traffic pattern information obtained for each benchmark from the GEM5 simulator is 
used in the NoC simulator to obtain NoC performance based on average network latency, peak 
bandwidth and average packet energy. The ANN thermal predictor estimates the thermal profile, 
which triggers the combined DTM mechanism, which appropriately updates the routing paths 
and task mappings dynamically in the NoC simulator. The NoC architecture is characterized 
using a cycle accurate simulator that models the progress of flits accurately per clock cycle 
accounting for those flits that reach their destination as well as those that are stalled. 
 The width of all wired links is considered to be the size of a flit, which is considered to be 
32 bits. The packet size is considered to be 64 flits for all experiments shown. Both wired and 
wireless links have adopted wormhole routing. The NoC switch arbitration used consists of three 
functional stages: input arbitration, routing/switch traversal, and output arbitration [39]. Each 
 
Figure 7: Thermal profile evaluation simulation flow. 
 
 
switch port has four virtual channels for data transfer and one reserved virtual channel for control 
packets, each with a buffer depth of 2 flits. However, the wireless ports have an increased buffer 
depth of 8 flits so as to avoid dropping packets while waiting for the token. Increasing the buffer 
depth beyond this has been shown to not produce any performance improvements for this packet 
size, but only adds to the additional area overhead [9]. The switches are synthesized from RTL 
level designs using the 65nm standard cell libraries from CMP [40], using Synopsys. The NoC 
switches are driven with a clock frequency of 2.5GHz at 1V.  
 The mm-wave wireless transceiver discussed in [9] is designed and characterized using the 
TSMC 65nm CMOS process and has been shown to dissipate 36.7mW for long range on-chip 
communication distances of the order of 20mm while sustaining a data rate of 16Gbps with a 
bit-error rate (BER) of less than $10^{-15}$.  
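
A few derived quantities follow directly from the parameters quoted in this section and in the discussion of DVR convergence. The short calculation below is only a back-of-the-envelope check; the variable names are hypothetical, the energy per bit assumes the transceiver transmits continuously at its full data rate, and the wired link figure assumes one flit is transferred per clock cycle.

```python
FLIT_BITS = 32                 # flit width, equal to the wired link width
PACKET_FLITS = 64              # packet size used in all experiments
CLOCK_HZ = 2.5e9               # switch clock frequency
CONVERGENCE_CYCLES = 600       # DVR convergence period for 64 cores
TRANSCEIVER_POWER_W = 36.7e-3  # mm-wave transceiver power
DATA_RATE_BPS = 16e9           # wireless data rate

packet_bits = FLIT_BITS * PACKET_FLITS                      # 2048 bits per packet
convergence_time_s = CONVERGENCE_CYCLES / CLOCK_HZ          # 240 ns
energy_per_bit_j = TRANSCEIVER_POWER_W / DATA_RATE_BPS      # about 2.29 pJ/bit
wired_link_peak_bps = FLIT_BITS * CLOCK_HZ                  # 80 Gbps per wired link
```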
4.5. Determination of an Optimized WiNoC Topology 
Before we delve into the experimental results of the combined DTM mechanism, it is 
 
Figure 8: Performance optimization with different number of WIs in a 64 core WiNoC. 
[Plot: bandwidth (Tbps) and average hop count versus number of WIs, for 6 to 14 WIs.] 
 
important to discuss the optimization of the WiNoC architecture. As is seen from Figure 8, 
increasing the number of wireless interconnects results in a much better connected NoC 
topology, with a reduced average hop count. The graph in Figure 8 is shown for a multicore chip 
with 64 cores. However, since the wireless nodes operate on the channel based on a token based 
MAC system, increasing the number of WIs increases the time taken for the token to return to a 
particular wireless node, especially under heavy traffic conditions, thus negatively affecting the 
performance. This overhead has been modeled as part of the token based MAC protocol in the 
network simulator. 
System level simulations were used to analyze the performance of the WiNoC in terms of 
network bandwidth as a function of the number of WIs. Bandwidth is measured as the average 
number of bits arriving per second. In order to represent a generic traffic scenario where both 
local and long distance traffics exist, uniform random traffic is used to generate the result seen in 
Figure 8. It can be seen from the figure that the bandwidth peaks for a particular number of WIs 
and then decreases as the number of WIs is increased further. This is due to the increased average wait 
time for a particular WI in a token based system. The figure also shows that 10 is the optimal 
number of WIs for a system with 64 cores, which is also the number of WIs considered in the 
WiNoC architecture for this thesis.  
4.6. Thermal Characteristics of Combined DTM  
In this section, we shall evaluate the thermal characteristics of the combined DTM 
mechanism. In order to evaluate the characteristics at a high target temperature, the die is first 
warmed up to a temperature of 60°C. The goal of this thesis is to show the effectiveness of the 
combined DTM scheme, and that it successfully restricts the maximum temperature of the chip 
 
within the target temperature. The target temperature can be chosen as some temperature that is 
lower than the maximum tolerable junction temperature of all the electronic devices on the chip, 
considering the cooling capabilities of the system environment. Even though any temperature 
 
Figure 9: Maximum chip temperature with and without combined DTM for (a) CANNEAL, (b) 
BODYTRACK, (c) VIPS, (d) FLUIDANIMATE, (e) SWAPTION, (f) FREQMINE, (g) FFT, (h) 
RADIX, and (i) LU traffic, for uniform time. 
 
 
greater than the ambient temperature can be chosen as the target temperature, for the experiments 
shown in this thesis a target temperature of 68ºC is chosen, which is set at the beginning of the 
experiment. The selection of an ideal target threshold is out of the scope of this thesis. 
Figures 9 and 10 (a), (b), (c), (d), (e), (f), (g), (h), (i) and (j) show the peak die 
temperature for CANNEAL, BODYTRACK, VIPS, DEDUP, FLUIDANIMATE, SWAPTION, 
FREQMINE, FFT, RADIX, and LU traffic respectively. Figure 9 shows the transient response of 
all benchmarks run for a uniform amount of time. From the figure it can be seen that since each 
benchmark is unique and has a different communication density pattern, the maximum chip 
temperature for each benchmark reaches target threshold at different times. Due to this, some 
benchmarks trigger the combined DTM scheme several times, while at the same time for some 
the maximum chip temperature may not reach target threshold at all. For this reason, the 
simulations are run based on the number of times we trigger DTM as it provides a more effective 
way to study the behavior of the combined DTM technique. The simulations in Figure 10 show 
the peak die temperature for the benchmarks such that we trigger the combined DTM scheme 
three times. From the figure it can be seen that, for the case with the combined DTM mechanism, 
the peak temperature reaching the target temperature triggers either temperature-aware 
rerouting or the task reallocation scheme to contain the peak temperature within the specified 
target temperature. For the case where the DTM scheme is not applied, it can be seen that the 
peak temperature of the chip increases exponentially.  
In all cases with the DTM scheme, the peak temperature is always below the target 
temperature. However, in case of CANNEAL, BODYTRACK, SWAPTION and RADIX traffic 
patterns, the effect of the combined DTM scheme is more prominent. This is because, for these 
benchmarks, a few cores have very high communication densities, while others have a relatively 
uniform communication pattern amongst themselves. Thus, the temperatures of those cores 
having high communication densities tend to increase, creating hotspot-like scenarios. As a 
result, redistributing this traffic to cooler cores using cooler links and switches attains a more 
uniformly distributed thermal profile of the die.  
 
 
Figure 10: Maximum chip temperature with and without combined DTM for (a) CANNEAL, (b) 
BODYTRACK, (c) VIPS, (d) FLUIDANIMATE, (e) SWAPTION, (f) FREQMINE, (g) FFT, (h) 
RADIX, and (i) LU traffic, run as long as the combined DTM scheme is triggered thrice. 
 
 
[Plots (a) to (j): maximum temperature (°C) versus time (ms), each comparing the system with combined DTM and without combined DTM.] 
 
The efficacy of the combined DTM approach is more evident from Figure 11, where the 
long term transient temperature response of the chip with the combined DTM scheme is shown. 
Also seen is the long term response of a system with just a temperature-aware rerouting scheme. 
The responses seen are for the CANNEAL benchmark. In comparison, DTM is triggered more 
times for the system with just temperature-aware rerouting than for the system with the 
combined DTM. Due to the absence of a core-level DTM policy, this results in oscillations in 
the routing paths. With time, the temperature of the cores also starts to increase, which in turn affects the 
temperature of the links and switches. Due to this, the temperature of the links and switches also 
increases more rapidly, thus making it more difficult to find alternative cooler paths, which 
results in oscillations of the routing paths. 
 
Figure 11: Maximum chip temperature with and without combined DTM scheme for 
CANNEAL traffic running for long duration. 
[Plot: maximum temperature (°C) versus time (ms); curves: without combined DTM, with combined DTM, with only temperature-aware rerouting.] 
 
 
Figure 12: Comparison of normalized performance metric of the system with only 
temperature-aware rerouting with system with combined DTM. 
 
[Bar chart: normalized performance metric for bandwidth, packet energy and latency; series: system without DTM, system with combined DTM, system with only temperature-aware rerouting.] 

For the case with only temperature-aware rerouting, the DTM mechanism is triggered 
much more frequently than for the case with combined DTM. This in turn makes packets take 
certain sub-optimal paths so as to avoid “hot” switches/links. Moreover, the flow of utilization 
information (as input to ANN) and rerouting control packets in the network consume network 
resources. Due to these factors, which stem from the frequent triggering of the DTM mechanism, the 
performance of a system with only rerouting degrades. Figure 12 shows how this frequent 
triggering of the DTM scheme for the case with only temperature-aware rerouting affects the 
performance of the system for the CANNEAL benchmark. Figure 12 shows that there is a 2.8% 
decrease in bandwidth, a 3.5% increase in packet energy and a 4% increase in latency of a 
system having only rerouting when compared to a system having the proposed combined DTM 
mechanism. 
 
Figure 13 shows the thermal response for the system with CANNEAL benchmark traffic 
for two different target temperatures. The figure shows the thermal response for a target 
temperature of 66ºC and 68ºC. It goes on to show that the combined DTM scheme is capable of 
containing the peak temperature of the system within the specified target temperature. However, 
for a lower target temperature, the DTM scheme is triggered more frequently, which in turn 
can potentially affect the performance of the system. We shall discuss the tradeoffs in 
implementing the combined DTM scheme in the next section. 
4.7. Performance tradeoffs of combined DTM scheme 
In this section, let’s look at the tradeoffs in performance of the WiNoC with the 
combined DTM scheme for application specific workloads. We consider two WiNoC systems, 
one without any DTM and one with combined DTM scheme. Figures 14, 15 and 16 show the 
 
Figure 13: Transient temperature response for CANNEAL with two different target 
temperatures. 
[Plot: maximum temperature (°C) versus time (ms); curves: with combined DTM at target temperature 68, with combined DTM at target temperature 66, without combined DTM.] 
 
normalized bandwidth, packet energy and latency for these systems for a target threshold of 
68°C.  
From Figure 14 it can be observed that the system with combined DTM scheme shows a 
3.2% decrease in peak bandwidth on average as compared to a system without DTM. This is 
because both the task reallocation heuristic and the temperature-aware rerouting 
intrinsically change the path of the packets by either re-mapping tasks to cooler cores or by 
avoiding hot switches/links. As a result, triggering combined DTM forces data to take sub-
 
Figure 14: Normalized bandwidth of system with combined DTM for different 
application-specific traffics. 
[Bar chart: normalized peak bandwidth, axis 0.93 to 1.01. Series: system without DTM; system with combined DTM.]

Figure 15: Normalized packet energy of system with combined DTM for different 
application-specific traffics. 
[Bar chart: normalized packet energy, axis 0.96 to 1.08. Series: system without DTM; system with combined DTM.]
 
optimal paths. Consequently, packets pass through more switches and links than would be 
necessary with shortest path routing. However, as our proposed task reallocation heuristic 
considers the communication density, the impact on the bandwidth of the system is always lower 
than 5%. 
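
The hop-count penalty of avoiding hot switches can be sketched with a breadth-first search. The example below assumes a plain 4x4 mesh rather than the WiNoC topology used in this work, and treats a hot switch as completely unusable; both are simplifications, and `hop_count` is a hypothetical helper, not part of the rerouting mechanism evaluated here.

```python
from collections import deque


def hop_count(rows, cols, src, dst, hot=frozenset()):
    """Minimum hops from src to dst on a rows x cols mesh, avoiding hot switches.

    Returns None if dst cannot be reached. Nodes are (row, col) tuples.
    """
    if src in hot or dst in hot:
        return None
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        (r, c), hops = queue.popleft()
        if (r, c) == dst:
            return hops
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and nxt not in hot and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Shortest path along the top row, then the same pair with one hot switch on that row.
baseline = hop_count(4, 4, (0, 0), (0, 3))
detour = hop_count(4, 4, (0, 0), (0, 3), hot={(0, 1)})
print(baseline, detour)
```

In this toy case a single hot switch on the direct path lengthens the route from 3 hops to 5, which is the kind of sub-optimal path that lowers bandwidth and raises packet energy and latency.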
On the other hand, the average packet energy and latency of the system with combined DTM 
are approximately 4-7% higher than those of the system without combined DTM for all benchmarks 
considered here, as shown in Figure 15 and Figure 16. This is because, in the system with 
combined DTM, data packets traverse more hops to avoid hot 
links/switches/cores, which increases packet energy and latency compared to the 
system without combined DTM. Additionally, the flow of utilization information (as input to the ANN) 
and rerouting control packets consumes network resources, which in turn can increase 
latency and consequently packet energy. However, these special packets are short compared to 
data packets and use only the wireless links, so the control flits account for only 
a fraction of the total number of flits flowing through the network and hence the effect is not 
 
Figure 16: Normalized latency of system with combined DTM for different 
application-specific traffics. 
[Bar chart: normalized latency, axis 0.96 to 1.08. Series: system without DTM; system with combined DTM.]
 
substantial. Although implementing a combined DTM scheme has an impact on the performance, 
the impact is not significant, achieving a good tradeoff between temperature uniformity and 
performance.  
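
The normalized values reported in this section are ratios of a metric measured with combined DTM to the same metric measured without DTM. A minimal sketch of that arithmetic follows; the raw values are placeholders chosen only to reproduce the 3.2% average bandwidth decrease quoted above, and the function names are hypothetical.

```python
def normalize(with_dtm, without_dtm):
    """Ratio of a metric under combined DTM to the same metric without DTM."""
    if without_dtm <= 0:
        raise ValueError("baseline metric must be positive")
    return with_dtm / without_dtm


def percent_change(normalized):
    """Signed percentage change relative to the no-DTM baseline (1.0)."""
    return (normalized - 1.0) * 100.0


# Placeholder raw bandwidth values in arbitrary units (not measured data).
bandwidth_norm = normalize(968.0, 1000.0)
print(round(bandwidth_norm, 3), round(percent_change(bandwidth_norm), 1))
```

A normalized value below 1 indicates a decrease (as for bandwidth), and a value above 1 indicates an increase (as for packet energy and latency).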
Figure 17 shows the effect of setting a lower target threshold temperature on 
performance in the presence of CANNEAL traffic. It plots the normalized peak 
system bandwidth and normalized packet energy for two target temperature thresholds, 
66ºC and 68ºC. Both quantities are normalized with respect 
to the system without any DTM scheme. 
From the figure it can be seen that the system with the higher target threshold temperature 
performs better than the system with the lower one. This is because, in order 
to avoid hot cores/links/switches, temperature-aware rerouting or task reallocation inherently 
changes the route of data packets. With a low temperature threshold, combined DTM triggers 
more frequently, so suboptimal routing paths are taken more often. As a result, the 
bandwidth decreases and packet energy increases, as seen in the figure. Thus, setting the target 
 
Figure 17: Effect of target temperature on performance in presence of CANNEAL 
traffic. 
[Dual-axis chart versus target temperature (ºC), 66 and 68. Series: normalized peak bandwidth, axis 0.967 to 0.977; normalized packet energy, axis 1.055 to 1.085.]
 
threshold carefully is imperative, as it directly affects the performance of the chip. As the proposed 
combined DTM scheme is capable of working for any defined target temperature, one should set 
the target temperature depending on the maximum tolerable junction temperature of the electronic 
devices on the chip and the application environment of the system. 
 
 
  
 
Chapter 5 Conclusions 
Thermal problems on multicore chips are exacerbated by aggressive scaling of system 
size and the inter-core traffic pattern over the NoC fabric. A temperature-aware adaptive 
dynamic thermal management technique that combines task reallocation and rerouting, 
designed for the WiNoC architecture, successfully restricts the temperature of both the core-level 
and NoC-level components within any specified target threshold temperature. The wireless 
interconnection with broadcast abilities, together with the prediction capability of the ANN, improves 
the reaction time of such a system, ensuring on-chip temperatures never exceed the target 
threshold. Such a system can be used to design thermally efficient multicore chips with wireless 
NoC fabrics.  
Even though the combined DTM system is efficient in maintaining the on-chip 
temperature within the specified target threshold temperature, it does tax the overall 
system performance. There is an approximately 4-7% increase in latency and average packet 
energy. The DTM system also reduces the bandwidth of the overall system by at most 5%; 
the scheme is designed so that the reduction in bandwidth never exceeds this bound. 
Although the combined DTM scheme impacts the 
performance of the system, it acts to maintain thermal stability of the chip, thus improving 
reliability and lifetime of the chip. 
The target temperature threshold value has a direct impact on the performance of the 
system: performance is better for higher target threshold values. This makes it 
imperative to set the target temperature intelligently, based on the maximum tolerable junction 
temperature of the electronic devices on the chip. 
 
This combined DTM design is particularly suitable for applications where the utilization 
of the cores and NoC components is heterogeneous, with some parts utilized heavily enough 
to create local thermal hotspots while other components remain relatively cooler. Under these 
circumstances the workload can be redistributed to relatively cooler parts dynamically to ensure 
a more homogeneous thermal profile. 
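
The idea of redistributing workload to cooler parts while accounting for communication can be sketched as a simple scoring rule. This is an illustrative assumption, not the task reallocation heuristic used in this work: the scoring formula, the `weight` parameter, the function name `pick_destination`, and all numbers are hypothetical.

```python
def pick_destination(temps, comm_cost, hot_core, weight=1.0):
    """Pick a destination core for the task running on hot_core.

    temps:     core id -> current temperature in degrees C
    comm_cost: core id -> assumed extra communication cost if the task moves there
    The chosen core minimizes temperature + weight * communication cost.
    """
    candidates = [core for core in temps if core != hot_core]
    if not candidates:
        raise ValueError("no alternative core available")
    return min(candidates, key=lambda core: temps[core] + weight * comm_cost[core])


# Illustrative values only (not measured data).
temps = {0: 69.0, 1: 63.0, 2: 61.5, 3: 64.0}
comm_cost = {0: 0.0, 1: 0.5, 2: 4.0, 3: 1.0}
hot_core = max(temps, key=temps.get)

aware = pick_destination(temps, comm_cost, hot_core)               # weighs communication
coolest = pick_destination(temps, comm_cost, hot_core, weight=0.0)  # temperature only
print(hot_core, aware, coolest)
```

With the communication term enabled, the sketch prefers a slightly warmer core that keeps traffic local over the coolest core, which mirrors the reason the bandwidth penalty of reallocation stays small.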
As future work, a self-adjusting NoC system capable 
of reacting to dynamic predictions of temperature and network congestion could lead 
to a much more efficient system in terms of maintaining temperature uniformity across the chip. 
Also, intelligent integration of such a dynamic thermal management scheme with a suitable 
power management technique such as Dynamic Voltage and Frequency Scaling (DVFS) could lead 
to more efficient management of temperature uniformity across the chip. 
 
  
   42 
 
Bibliography 
[1] Intel Research, Single-chip Cloud Computer 
http://techresearch.intel.com/ProjectDetails.aspx?Id=1. 
[2] Y. Zhang and A. Srivastava, “Accurate temperature estimation using noisy thermal sensors,” 
inProc. DAC, Jul. 2009, pp. 472–477. 
[3] Coskun, A.K.; Rosing, T.S.; Gross, K.C., "Utilizing Predictors for Efficient Thermal 
Management in Multiprocessor SoCs," in Computer-Aided Design of Integrated Circuits and 
Systems, IEEE Transactions on , vol.28, no.10, pp.1503-1516, Oct. 2009. 
[4] Jin Cui; Maskell, D.L., "A Fast High-Level Event-Driven Thermal Estimator for Dynamic 
Thermal Aware Scheduling," in Computer-Aided Design of Integrated Circuits and Systems, 
IEEE Transactions on , vol.31, no.6, pp.904-917, June 2012. 
[5] Haykin, S. and Network, N., 2004. A comprehensive foundation. Neural Networks, 2(2004). 
[6] Kakoulli, E.; Soteriou, V.; Theocharides, T., "Intelligent Hotspot Prediction for Network-on-
Chip-Based Multicore Systems," in Computer-Aided Design of Integrated Circuits and 
Systems, IEEE Transactions on , vol.31, no.3, pp.418-431, March 2012. 
[7] L. Benini and G. D. Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, 
Vol. 35, Issue 1, January 2002, pp. 70-78. 
[8] Carloni LP, Pande P, Xie Y. Networks-on-chip in emerging interconnect paradigms: 
Advantages and challenges. InProceedings of the 2009 3rd ACM/IEEE International 
Symposium on Networks-on-Chip 2009 May 10 (pp. 93-102). IEEE Computer Society. 
   43 
 
[9] K. Chang, S. Deb, A. Ganguly, X. Yu, S. P. Sah, P. P. Pande, B. Belzer, and D. Heo, 
“Performance evaluation and design trade-offs for wireless network-on-chip architectures,” J. 
Emerg. Technol. Comput. Syst., vol. 8, no. 3, pp. 23:1–23:25, Aug. 2012. 
[10] D. Zhao and Y. Wang, “SD-MAC: Design and Synthesis of A Hardware-Efficient Collision-
Free QoS-Aware MAC Protocol for Wireless Network-on-Chip,” IEEE Transactions on 
Computers, vol. 57, no. 9, September 2008, pp. 1230-1245. 
[11] A. Ganguly, K. Chang, S. Deb, P. P. Pande, B. Belzer, and C. Teuscher, “Scalable Hybrid 
Wireless Network-on-Chip Architectures for Multicore Systems,” IEEE Transactions on 
Computers, vol. 60, no. 10, pp. 1485–1502, 2011. 
[12] Mansoor N, Ganguly A. Reconfigurable Wireless Network-on-Chip with a Dynamic Medium 
Access Mechanism. InProceedings of the 9th International Symposium on Networks-on-Chip 
2015 Sep 28 (p. 13). ACM. 
[13] DiTomaso, D.; Kodi, A.; Kaya, S.; Matolak, D., "iWISE: Inter-router Wireless Scalable 
Express Channels for Network-on-Chips (NoCs) Architecture," in High Performance 
Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on , vol., no., pp.11-18, 24-26 
Aug. 2011 
[14] Abadal S, Nemirovsky M, Alarcón E, Cabellos-Aparicio A. Networking Challenges and 
Prospective Impact of Broadcast-Oriented Wireless Network-on-Chip. InProceedings of the 
ACM/IEEE International Symposium on Networks-on-Chip (NOCS'15). ACM, Vancouver, 
CA 2015 Sep 28. 
[15] D. Cuesta, J.L. Ayala, J.I. Hidalgo, D. Atienza, A. Acquaviva, and E. Macii, “Adaptive Task 
Migration Policies for Thermal Control in MPSoCs,” Proc. of ISVLSI 2010, pp. 110-115. 
   44 
 
[16] T. Ge, P. Malani, and Q. Qiu, “Distributed task migration for thermal management in many-
core systems,” Proc. of DAC 2010, pp. 579-584. 
[17] Guangshuo Liu; Jinpyo Park; Marculescu, D., "Procrustes1: Power Constrained Performance 
Improvement Using Extended Maximize-Then-Swap Algorithm," in Computer-Aided 
Design of Integrated Circuits and Systems, IEEE Transactions on , vol.34, no.10, pp.1664-
1676, Oct. 2015 
[18] M. Gomaa, M. D. Powell, and T. N. Vijaykumar, “Heat-and-run: Leveraging SMT and CMP 
to manage power density through the operating system,” in Proc. ASPLOS, 2004, pp. 260–
270. 
[19] Murali, S.; Mutapcic, A.; Atienza, D.; Gupta, R.; Boyd, S.; Benini, L.; De Micheli, G., 
"Temperature Control of High-Performance Multi-core Platforms Using Convex 
Optimization," in Design, Automation and Test in Europe, 2008. DATE '08 , vol., no., 
pp.110-115, 10-14 March 2008 
[20] L. Shang, L.-S. Peh, A. Kumar, N.K. Jha, “Temperature-Aware on-Chip Networks,” IEEE 
Micro: Micro’s Top Picks from Computer Architecture Conferences, 2006. 
[21] Dan Zhao; Ruizhe Wu, "Overlaid Mesh Topology Design and Deadlock Free Routing in 
Wireless Network-on-Chip," in Networks on Chip (NoCS), 2012 Sixth IEEE/ACM 
International Symposium on , vol., no., pp.27-34, 9-11 May 2012. 
[22] Dan Zhao; Yi Wang; Jian Li; Kikkawa, T., "Design of multi-channel wireless NoC to 
improve on-chip communication capacity!," in Networks on Chip (NoCS), 2011 Fifth 
IEEE/ACM International Symposium on , vol., no., pp.177-184, 1-4 May 2011. 
   45 
 
[23] S. B. Lee et al., “A scalable micro wireless interconnect structure for CMPs,” in Proc. ACM 
Annu. Int. Con. Mobile Comput. Network. (MobiCom), 2009, pp. 20–25. 
[24] J. Lin et al., “Communication Using Antennas Fabricated in Silicon Integrated Circuits,” 
IEEE Journal of Solid-State Circuits, vol. 42, no. 8, August 2007, pp. 1678-1687. 
[25] S. Deb, A. Ganguly, P. P. Pande, B. Belzer, and D. Heo, “Wireless NoC as Interconnection 
Backbone for Multicore Chips: Promises and Challenges,” IEEE Journal on Emerging and 
Selected Topics in Circuits and Systems, vol. 2, no. 2, pp. 228–239, 2012 
[26] J. Murray, J. Klingner, P. Pande and B. Shirazi, “Sustainable Multi-Core Architecture with 
on-chip Wireless Links”, Proceedings of ACM Great Lake Symposium on VLSI, GLSVLSI 
2012. 
[27] J. Murray, P. Pande, B. Shirazi, “DVFS-Enabled Sustainable Wireless NoC Architecture,” 
Proc. of IEEE SOC Conf., 2012 
[28] J. Murray, P. Wettin, P. Pande, B. Shirazi, N. Nerurkar and A. Ganguly, “Evaluating Effects 
of Thermal Management in Wireless NoC-Enabled Multicore Architectures”, Proceedings of 
IEEE International Green Computing Conference (IGCC), 2013 
[29] Shamim, M.S.; Mhatre, A.; Mansoor, N.; Ganguly, A.; Tsouri, G., "Temperature-aware 
wireless network-on-chip architecture," in Green Computing Conference (IGCC), 2014 
International , vol., no., pp.1-10, 3-5 Nov. 2014 
[30] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ network,” Nature, vol. 
393, 1998, pp. 440-442 
[31] T. Petermann and P. De Los Rios, “Spatial small-world networks: a wiring cost perspective”, 
2005. arXiv:cond-mat/0501420v2 
   46 
 
[32] Kurose, J., and K. Ross. "Computer Networking: A Top Down Approach, 4e." Hands on 1 
(2012): 1 
[33] Pei D, Zhang B, Massey D, Zhang L. An analysis of convergence delay in path vector 
routing protocols. Computer Networks. 2006 Feb 22;50(3):398-421 
[34] J. Duato, S. Yalamanchili, and L. NI, “Interconnection Networks-An Engineering 
Approach”, Morgan Kaufmann, 2002 
[35] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt,  A. Saidi, A. Basu, J. Hestness, D.R. 
Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M.D. Hill, and 
D.A. Wood, “The GEM5 Simulator,” ACM SIGARCH Computer Architecture News, 39(2), 
2011, pp. 1-7 
[36] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta, “The SPLASH-2 Programs: 
Characterization and Methodological Considerations,” Proc. of ISCA, 1995, pp. 24-36. 
[37] C. Bienia, “Benchmarking Modern Multiprocessors,” Ph.D. Dissertation, Princeton Univ., 
Princeton NJ, Jan. 2011. 
[38] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, and D.M. Tullsen, “McPAT: an Integrated 
Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” 
Proc. of the International Symposium on Microarchitecture, 2009, pp. 469-480 
[39] P. Pande,C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance evaluation and design trade-
offs for network-on-chip interconnect architectures," , IEEE Transactions on Computers, 
vol.54, no.8, pp.1025-1040, Aug. 2005 
[40] Chip MultiProjects (http://cmp.imag.fr) 
   47 
 
[41] Mansoor, N.; Iruthayaraj, P.J.S.; Ganguly, A., "Design methodology for a robust and energy-
efficient millimeter-wave wireless network-on-chip," in Multi-Scale Computing Systems, 
IEEE Transactions on , vol.1, no.1, pp.33-45, March 1 2015 
[42] K. Skadron, M.R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, 
“Temperature-Aware Microarchitecture,” Proc. of the International Symposium on Computer 
Architecture, 2003, pp. 2-13 
[43] Ogras, U.Y.; Marculescu, R., ""It's a small world after all": NoC performance optimization 
via long-range link insertion," Very Large Scale Integration (VLSI) Systems, IEEE 
Transactions on , vol.14, no.7, pp.693,706, July 2006 
[44] Mishra, A.K.; Das, R.; Eachempati, S.; Iyer, R.; Vijaykrishnan, N.; Das, C.R., "A case for 
dynamic frequency tuning in on-chip networks," Microarchitecture, 2009. MICRO-42. 42nd 
Annual IEEE/ACM International Symposium on , vol., no., pp.292,303, 12-16 Dec. 2009 
[45] Md Shahriar Shamim, Jagan Muralidharan, and Amlan Ganguly. 2015. An Interconnection 
Architecture for Seamless Inter and Intra-Chip Communication Using Wireless Links. 
InProceedings of the 9th International Symposium on Networks-on-Chip (NOCS '15) 
 
 
