Abstract: In spite of escalating thermal challenges imposed by high power consumption, most reported 3D network-on-chip (NoC) systems that adopt the classic 3D cube (mesh) topology are unable to tackle thermal management issues directly at the architectural level. Rather, to keep the chip from overheating, tasks running on a "hot" node have to be migrated to a "cooler" one, resulting in increased distance between communicating nodes and ultimately poor performance. In this paper, we propose a new 3D NoC architecture that genuinely supports runtime thermal-aware task management. Dubbed Hierarchical Ring Cluster (HRC), this new hierarchical 3D NoC architecture has three levels across its entire network hierarchy: 1) nodes are grouped as rings, 2) rings are then grouped into cubes, and 3) multiple cubes are connected to form the whole network. Routing in an HRC system is also performed in a hierarchical manner: paths are set up within rings using low-latency circuit switching, and data that need to cross rings or cubes are routed following dimension-order routing supported by wormhole switching. In this organization, "hot" tasks that need to migrate can move along the rings without incurring increased communication distances. Our experimental results have confirmed that the proposed HRC architecture has a much lower network latency than other known 3D NoC architectures. When working with runtime thermal-aware task migration approaches, HRC can help reduce latency by as much as 80 percent compared to thermal-aware task migration approaches applied to 3D mesh NoC topologies.
INTRODUCTION
To facilitate the interconnection needs of emerging 3D many-core systems, various 3D network-on-chip (NoC) architectures have been proposed in the literature [1], and most of these 3D NoCs adopt the 3D cube (3D mesh) as their network topology. Besides the topology, which stipulates the theoretical bandwidth, the actual performance of practical 3D NoC systems is largely dictated by heat distribution and communication latency. To prevent the chip from overheating, various runtime thermal management approaches using voltage and frequency scaling [2] and/or task migration [3] techniques have been proposed. One immediate drawback of voltage and/or frequency down-scaling, however, is that doing so can adversely impact computation performance. In addition, the voltage regulators needed in voltage scaling not only consume silicon area, but they also add extra response time and power consumption themselves. A more subtle thermal management approach is task migration, where the tasks currently running on a "hot" node whose temperature exceeds a threshold are migrated to run on a cooler node, giving the "hot" node time to cool down.
Albeit an effective thermal management scheme, task migration tends to cause performance degradation in current 3D NoCs with a 3D cube topology, as illustrated in Fig. 1. Assume the color-filled nodes are the ones running tasks with high communication volumes and also experiencing high temperatures. The four hot tasks are first mapped to nodes in close proximity with minimal communication distance, as shown in Fig. 1a. However, such a mapping will lead to rapid heat accumulation in the region where these nodes reside. When the temperature reaches the threshold, these four tasks have to be migrated to nodes far apart for the best cooling results, as in Fig. 1b. The communication distances among tasks in Fig. 1b, however, are significantly longer than those seen in Fig. 1a. Such increased communication distances caused by thermal-induced task migration are actually common in all the existing NoC architectures [3].
Escalating on-chip temperature has imposed a number of challenges on 3D architectures, including reduced reliability, increased leakage power consumption, and degraded performance. In [4], it was found that for every 10 °C rise in temperature, interconnect delay increases by 5 percent, and the delay will grow exponentially. In the literature, various thermal management approaches, including task migration, DVFS, power gating, etc., have been considered in the hope of keeping the chip temperature in check. However, state-of-the-art 3D NoC topologies are designed separately from these known thermal management approaches. To our knowledge, this study is the first of its kind to recognize the important connection between network topology and thermal management.
In this paper, we thus propose a new 3D NoC architecture that genuinely supports thermal-aware task management. Referred to as Hierarchical Ring Cluster (HRC), the proposed architecture is inspired by the famous Rubik's Cube, in which all the cubes on one face can rotate together without changing their relative positions. By the same token, if an architecture allows tasks to be migrated along some fixed paths (e.g., by rotation) without increasing the communication distance, there shall be no communication-induced performance degradation.
The rest of the paper is organized as follows. Section 2 reviews the related work and presents the motivation of this study. Section 3 introduces the HRC architecture, followed by the discussion of the application mapping related to the HRC architecture in Section 4. Section 5 experimentally evaluates different network topologies and runtime task management methods. Finally, Section 6 concludes the paper.
RELATED WORK
3D NoC systems have been deemed to be the interconnection choice for many-core systems. In this section, 3D network topologies and runtime thermal management methods for 3D many-core systems are surveyed.
3D Network Topologies
Most 3D NoC topologies belong to the 3D cube family, as in Fig. 2. Essentially, a 2D mesh topology can be extended to become a 3D cube by adding a few router ports to support data transmissions in the vertical UP and DOWN directions.
A few 3D router micro-architectures were proposed to reduce the complexity and the number of router ports needed [5], [6]. In conventional mesh/3D cube networks, when two communicating nodes are not physically adjacent to each other, the hop count between the nodes tends to be high, leading to high communication latency. To reduce such communication latency, hierarchical topology [7], express topology [8], and express channels [9] were all considered. In addition, configurable switches [10] could be used to dynamically set up the routing paths to help accelerate data transmission.
Thermal Management Methods in 3D Many-Core Systems
Various 3D runtime thermal management methods for many-core systems were proposed in the literature. They can be broadly classified into schemes based on DVFS [11], power gating [12], and task migration [13], [14], [15], [16], [17], [18]. In DVFS- and power-gating-based methods, the on-chip resources that tend to be overheated are set to run at lower frequencies (the slow-down mode) or are even completely shut down. Various strategies, including control theory, heuristics, and linear and dynamic programming, have been used to determine the best frequency/voltage setting or ON/OFF status of the nodes/routers/cache banks [19], [20], [21]. However, applying those approaches to a 3D NoC might also indiscriminately slow down all the tasks, leading to degraded performance.
Task Migration
Different from DVFS- and power-gating-based methods, task-migration-based approaches move the "hot" tasks to run on nodes with lower temperatures. There are different migration policies [3] that can be used, including random migration, migration to the globally coolest nodes, or task migration following a fixed cyclic path. Migration of tasks due to a thermal event (overheating) might lead to increased communication distance and latency. To gain insight into this problem, we compare the communication costs of two well-known task migration strategies [3]: (i) "hot" tasks are migrated to a randomly selected node (referred to as "random" in Fig. 3), and (ii) they are migrated to the globally coolest nodes (referred to as "global coolest" in Fig. 3). The communication costs of these two task migration methods are compared against the "ideal case", which assumes that the chip never overheats and no task migration occurs, i.e., the chip can tolerate any temperature. The selected topology is a 3D mesh/cube, and the mapping algorithm is the same as the one used in [22]. The highest allowed temperature in this example is set to 60 °C. Multiple random applications are generated and run on the system as dynamic workloads. The communication latency is the average latency of packets transmitted in the network. The execution time of an application is the makespan of the application's task graph, including both communication and computation times. More details of the experimental setup can be found in Section 5.1.
Fig. 1. An example showing that thermal-aware task management, when applied to a 3D cube topology, might lead to increased communication distances between tasks and thus higher communication costs. In (a), the four hot tasks are mapped in close proximity since there are high volumes of communication between them. When the temperature rises, the four tasks have to be mapped to nodes far apart for a better cooling result (b), but at the cost of significantly longer communication distances.
Fig. 2. The 3D mesh/cube and concentrated mesh/cube topologies.
Fig. 3. Performance of the two migration schemes, the global coolest and the random, is compared against that of the ideal case where no task migration occurs.
From Fig. 3, one can see that both migration strategies lead to increased network latency, which is especially true when the average communication volume of the applications is high. For example, when the average communication volume among the tasks is 10 Mbits, the latency of the "random" task migration method is 3× that of the ideal case (i.e., incurring no task migration), and the latency of "global coolest" is 5.2× longer than that of the ideal case. Such an increase in latency is caused by the increased physical distances when performing task migration. Also, the latency of "global coolest" is 1.5× over that of the "random" case. Since the "global coolest" method tries to find the globally coolest nodes to accommodate a hot task, the distances between the hot task and its communicating tasks are typically longer than those in "random".
THE HRC ARCHITECTURE FOR 3D NOC
In this section, the proposed network architecture, HRC (Fig. 4) , a hierarchical network architecture for runtime thermal-aware task migration, is presented. This HRC features a topology that meets two essential topological requirements for a 3D NoC: 1) it assures the minimal communication cost degradation associated with task migration, and 2) there is sufficient separation in space between the "hot" and the "cool" nodes for efficient heat dissipation.
The hot tasks can "rotate" (i.e., migrate cyclically) along predefined closed paths referred to as rings. Rings are further grouped together to form cubes, and finally the entire network is created from cubes.
HRC Network Topology
At the bottom level of an HRC network (Fig. 5), $n$ nodes are connected to form a planar ring in the level 1 network using circuit switching (Fig. 5a). "Hot" tasks can be migrated (or rotated) along the ring to avoid thermal overheating while ensuring low transmission latency. $m$ rings can then be grouped/stacked vertically to form a "cube", which is part of a level 2 network (Fig. 5c). $l$ cubes are next horizontally connected to make a level 3 network (Fig. 5d). The communications across rings and cubes in levels 2 and 3 need to be buffered and follow wormhole switching.
Each node in an HRC network is indexed by its unique three-tuple ID: $\langle$level 1 ID, ring ID, cube ID$\rangle$. A node's level 1 ID assumes an integer value in the range of 0 to $n - 1$. The ringID of the ring at the bottom layer is set to 1, and the ringID increments by 1 going from one layer to the layer right above it. So the ring at the top layer, layer $m$, has a ringID of $m$. All these level 1 IDs and ringIDs are fixed once the size of the network (number of nodes) is known.
A ring can be mapped to the nodes as in Fig. 5a. To avoid deadlocks in a horizontal plane, a total of $V_1$ pairs of virtual networks (VNs) are used. They can be grouped into two sets: one set that includes all the virtual networks with data flowing clockwise, and another set that includes all the virtual networks with data flowing counter-clockwise, as in Fig. 5b. The number of virtual network pairs, $V_1$, is a design parameter: a larger $V_1$ can lead to better transmission performance but also incurs a higher hardware cost.
The cubes are numbered in the same way as the nodes of a regular 2D mesh. For a network with 8 × 8 cubes, the 8 cubes in the first row are indexed as 0 through 7, the cubes in the second row are indexed as 8 through 15, and the remaining rows of cubes are indexed in a similar fashion.
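The indexing scheme above can be illustrated with a minimal Python sketch. The names NodeID, cube_xy, and cube_index are hypothetical and introduced only for illustration; the sketch assumes that a cube's column and row are recovered from its row-major index, which is one natural reading of the numbering described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodeID:
    """Three-tuple node ID (illustrative names, not from the paper)."""
    level1_id: int  # position inside the ring, 0 .. n-1
    ring_id: int    # 1 (bottom layer) .. m (top layer)
    cube_id: int    # row-major index of the cube in the 2D grid of cubes


def cube_xy(cube_id: int, cubes_per_row: int) -> Tuple[int, int]:
    """Convert a row-major cube index into (column, row)."""
    return cube_id % cubes_per_row, cube_id // cubes_per_row


def cube_index(x: int, y: int, cubes_per_row: int) -> int:
    """Inverse of cube_xy: (column, row) back to the row-major index."""
    return y * cubes_per_row + x
```

For the 8 × 8 example, cube 15 is the last cube of the second row, so cube_xy(15, 8) yields column 7 and row 1.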
In the proposed architecture, data transfer within the same ring involves no buffering, but there are buffers between a router's ports and all the rings that this router connects to. When a core wants to send out a data packet and all the virtual networks are occupied by other communications, the packet has to be placed into the core's local buffer.
Routing in HRC
Data in an HRC network can be transmitted between nodes following the routing algorithms listed in Algorithms 1 and 2. The complexities of both algorithms are constant, as they only involve a fixed number of steps for making routing decisions. To address a destination node, its ID is encoded into the packet headers. At each node along a routing path, the routing decision is made based on whether the destination and the current node are within the same ring (level 1 network), in different rings but within the same cube (level 2 network), or in different cubes (level 3 network).
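The three-way decision described above can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithms 1 and 2; the Node tuple and the function routing_level are hypothetical names.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Three-tuple node ID (illustrative field names)."""
    level1_id: int
    ring_id: int
    cube_id: int


def routing_level(curr: Node, dest: Node) -> int:
    """Return the network level (1, 2, or 3) that handles the next routing step."""
    if curr.cube_id != dest.cube_id:
        return 3  # across cubes: inter-cube network
    if curr.ring_id != dest.ring_id:
        return 2  # same cube, different rings: inter-ring network
    return 1      # same ring: circuit-switched ring
```

Since the decision only compares two fields of the IDs, it takes a fixed number of steps, which is consistent with the constant complexity stated above.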
Routing Within the Same Ring
If the source and destination nodes are within the same ring, only the level 1 network gets involved. The routing function within a level 1 network is denoted as $f_1$. At any intermediate node of a path that connects the source and the destination nodes, the routing function $f_1$ compares the destination node's level 1 ID (dest.level1ID) with the node's own level 1 ID (curr.level1ID).
If dest.level1ID has a larger value than curr.level1ID, a path including nodes $(curr, curr + 1, \ldots, dest)$ is set up in the clockwise virtual networks (VNs) using circuit switching. If the path is not available because some of the links in the path $(curr, curr + 1, \ldots, dest)$ are already taken in the clockwise VNs $1, \ldots, V_1$ (Table 1), the packets have to be stored in the local buffers. By the same token, if dest.level1ID has a smaller value than curr.level1ID, the path including nodes $(curr, curr - 1, \ldots, dest)$ will be followed using the counter-clockwise VNs.
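A minimal sketch of this intra-ring decision is given below. The function names f1_path and pick_vn are hypothetical, and the representation of VN occupancy as one set of taken links per VN is an assumption made for illustration. As in the text, the path is chosen purely by comparing IDs, so no wrap-around through node 0 is considered.

```python
from typing import List, Optional, Set, Tuple

Link = Tuple[int, int]  # directed link between two adjacent ring positions


def f1_path(curr_id: int, dest_id: int) -> Tuple[str, List[int]]:
    """Choose the direction and node sequence by comparing level 1 IDs."""
    if dest_id > curr_id:
        return "clockwise", list(range(curr_id, dest_id + 1))
    if dest_id < curr_id:
        return "counter-clockwise", list(range(curr_id, dest_id - 1, -1))
    return "local", [curr_id]


def pick_vn(path: List[int], busy: List[Set[Link]]) -> Optional[int]:
    """Return the first VN of the chosen direction whose links along the
    path are all free, or None if the packet must wait in the local buffer.

    busy[v] is the (assumed) set of links currently taken in VN v.
    """
    links = set(zip(path, path[1:]))
    for vn, taken in enumerate(busy):
        if not (links & taken):
            return vn
    return None
```

For example, a packet at node 2 destined for node 5 uses the clockwise path 2, 3, 4, 5; if link (3, 4) is taken in the first clockwise VN but free in the second, the circuit is set up in the second VN.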
Routing Within the Same Cube
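In the level 2 network (a.k.a. the inter-ring network), the routing function $f_2$ first compares the destination's ring ID (dest.ringID) with the current node's ring ID (curr.ringID). If the two differ, the packet is routed vertically toward the destination ring: downward if dest.ringID < curr.ringID, and upward if dest.ringID > curr.ringID.
Otherwise, call $f_1$ to route the packets within the ring.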
Routing Across Cubes
In the level 3 network, the routing function first compares the destination's cube ID (dest.cubeID) with the current node's cube ID (curr.cubeID). If they are different, an intermediate node needs to be found in the current cube; this node is taken from one of the cube's special node sets according to XY routing. Each cube has four special node sets, X+, X-, Y+, and Y-, for the routing process, as in Fig. 6. In cube $i$, the set X+ includes nodes that connect to nodes in set X- of cube $i + 1$, and the nodes in set X- connect to nodes in set X+ of cube $i - 1$. Similarly, the nodes in set Y+ of cube $i$ connect to nodes in set Y- of cube $k$, and the nodes in set Y- connect to nodes in set Y+ of cube $j$, where cubes $j$ and $k$ are the north and south neighbors of cube $i$, respectively.
The routing covers 4 distinct cases. 
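The dimension-order choice among the special node sets can be sketched as below. This is an illustration under stated assumptions, not the routing function of the paper: the name next_special_set is hypothetical, cubes are assumed to be indexed in row-major order, X is resolved before Y, and Y+ is assumed to lead toward the next row of cubes (larger row index).

```python
from typing import Optional


def next_special_set(curr_cube: int, dest_cube: int,
                     cubes_per_row: int) -> Optional[str]:
    """Pick the special node set to head for under XY routing.

    Returns None when the packet is already in the destination cube.
    """
    cx, cy = curr_cube % cubes_per_row, curr_cube // cubes_per_row
    dx, dy = dest_cube % cubes_per_row, dest_cube // cubes_per_row
    if dx > cx:
        return "X+"  # toward cube i + 1
    if dx < cx:
        return "X-"  # toward cube i - 1
    if dy > cy:
        return "Y+"  # assumption: Y+ leads to the next row
    if dy < cy:
        return "Y-"
    return None
```

Because the X offset is always removed before the Y offset, a packet never turns from the Y dimension back to the X dimension, which is the usual property of dimension-order routing.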
Conversion Between Circuit Switching and Wormhole
When the packets are transferred across the rings/cubes, there is a need to convert the flow control from circuit switching (inside a ring) to wormhole (inter-ring or intercube) at the intermediate nodes. The conversion from circuit switching to wormhole flit switching between cubes is performed as follows. Suppose node A in Ring 1 needs to send packets to node B in Ring 2. The packets are first forwarded to an intermediate node C in Ring 1. Then they are forwarded to another intermediate node D in Ring 2 before reaching the destination node B. Before a flit is sent from the source node A using circuit switching, it checks if the corresponding links are available, and the buffer at the intermediate node C in Ring 1 has empty space. If both conditions are met, the flit will be transferred to the intermediate node using circuit switching. Otherwise, it waits until the source node's local buffer is available.
If a flit needs to be forwarded to node D in Ring 2, it follows wormhole switching to reach node D and stays in that node's buffer. At node D, if the destination node B has free slots in its local buffer and there are available links, the flit is forwarded to node B using circuit switching. Otherwise, it stays in node D's local buffer until it can be serviced.
A Routing Example
Fig. 7 shows an example of finding a path from node s to node d. A packet is first forwarded to node i by circuit switching in ring 1 of cube 1 using $f_1$. Node i sits at the boundary of cubes 1 and 2, and it belongs to the set X+. Then the packet is forwarded to node j in cube 2 by $f_3$ using wormhole switching. Now the packet is in the same cube as the destination. It is forwarded to node k in ring 4 of cube 2 by $f_2$ using wormhole switching. Finally, the packet is forwarded to the destination by $f_1$.
Deadlock Freedom in HRC Routing
The proposed HRC routing algorithm described in Sections 3.2.1 through 3.2.3 follows one of the four pairs of VNs in Table 1 and Fig. 8. Table 1 shows the four virtual networks that are used to avoid any deadlock inside a ring, where the VNs can be divided into two groups based on whether they route packets in the clockwise or the counter-clockwise direction. Inside a cube, the VNs are separated based on whether they route packets upward or downward. For the inter-cube networks, the VNs are separated into (1) those that are not allowed to route packets to the west, and (2) those that are not allowed to route packets to the east. Fig. 8 enumerates all possible combinations made from the 4 VNs, and one can see that since no cycle can be formed, there will be no routing deadlock.
Embedding of Rings
With the design of HRC network, hot tasks are restricted to migrate (or rotate) only along the rings at runtime. Although many different ring configurations are theoretically possible inside an HRC network, the actual ring configuration is determined at design time and remains fixed afterwards (i.e., ring configuration cannot be altered at run time). Nodes form a ring using circuit switching, and they can be distributed to different layers. Then rings form a cube using wormhole routing, as in Fig. 9 .
Assume one cube has $m$ layers and each layer has $n$ nodes. The rings can be embedded into the HRC system in the following three ways.
1) Horizontal embedding (Fig. 9a), as also used in Fig. 5. In this embedding, the $n$ nodes in the same horizontal layer form a ring. The cube has one ring at each layer. The tasks in the rings (layers) physically close to the heat sink have a better cooling opportunity. Therefore, this embedding is suitable for applications with many hot tasks, so that they can rotate at the bottom layers for a better cooling result.
2) Vertical embedding. Fig. 9b shows rings that are placed vertically across the layers. In this embedding, the $n$ nodes in the same vertical plane form a ring. The cube has one ring corresponding to each vertical plane. However, since the nodes in the same vertical column typically have strong thermal correlation, nodes cannot be cooled down by moving tasks within the same vertical column.
3) Thermally unbiased embedding. To avoid the disadvantage exhibited in Fig. 9b, Fig. 9c shows a thermally unbiased embedding. Assume a cube has $m$ rings and each ring has $n$ nodes. In this case, the $n$ nodes of a ring are equally divided into $m$ segments, and these $m$ segments are housed in exactly $m$ physical layers, one segment per layer. Fig. 9c shows the segments of the rings. The $m$ segments in different layers are concatenated to form a ring. Tasks in the same ring will have an equal opportunity to traverse each of the layers during rotation, making this scheme thermally unbiased.
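The thermally unbiased embedding can be made concrete with a small sketch. The function unbiased_layer is a hypothetical name, and the rotation rule used here (ring r places its s-th segment on layer (r + s) mod m) is an assumption chosen so that the m rings do not compete for the same layer slots; the actual placement in Fig. 9c may differ.

```python
def unbiased_layer(ring: int, k: int, n: int, m: int) -> int:
    """Layer (0 .. m-1) that hosts node k of ring `ring` (0 .. m-1).

    Each ring of n nodes is cut into m equal segments, and consecutive
    segments are placed on consecutive layers (assumed rotation rule).
    """
    if n % m != 0:
        raise ValueError("n must be divisible by m for equal segments")
    segment = k // (n // m)      # which of the m segments node k belongs to
    return (ring + segment) % m  # rotate the starting layer per ring
```

With n = 8 and m = 4, every ring visits all four layers with two nodes per layer, and every layer hosts exactly eight nodes in total, so a task rotating along its ring spends equal time on each layer.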
APPLICATION MAPPING IN HRC
Architectural and Delay Models in HRC
A network-on-chip architecture can be modeled as follows. Each node consists of a processing unit, a cache, and a network interface. The whole NoC system is represented as a directed graph $G(T, L)$, where $T$ is the set of nodes and $L$ represents the connections amongst the nodes. Every node can run at different voltage/frequency levels, similar to the case seen in Intel SCC [23]. We assume multiple voltage/frequency levels are available for each node, and once a voltage level is selected, the operating frequency is also fixed.
An application $i$ is also represented as a directed graph $AG_i = (A_i, E_i)$, where $A_i$ is the set of tasks of the application and $E_i$ is the set of directed edges representing communications amongst the tasks. The execution time of a task $a \in A_i$, after being mapped onto a node, is denoted as ExecTime. The ExecTime of a task is taken from its worst-case execution time (WCET) and remains unchanged at a given frequency.
A mapping function $M(a) = t$, for $a \in A_i$, $t \in T$, binds (maps) task $a$ to node $t$.
Each edge $e \in E_i$ has a weight that is defined by the transmission time between the two communicating tasks, and this transmission time depends on (i) the communication distance between the nodes to which these tasks are mapped, (ii) the traffic volume, and (iii) temperature. That is, for each edge $e = (a_i, a_j) \in E_i$, the transmission time $T(e)$ is given as $T(e) = f(v(a_i, a_j), D(M(a_i), M(a_j)), \theta)$, where $v(a_i, a_j)$ is the traffic volume between the two tasks $a_i$ and $a_j$, $D(M(a_i), M(a_j))$ is the distance (hop count) between the two nodes to which tasks $a_i$ and $a_j$ are mapped, and $\theta$ is the average temperature of the chip. The function $f$ is a regression model, where $\alpha$, $\beta$, and $\gamma$ are its regression coefficients.
The $T(e)$ model can be trained offline by measuring the latencies of transmitted packets. The execution time of each application $i$ is the makespan of its task graph, denoted as $ET_i$. When two communicating tasks are mapped to two nodes within the same ring, the zero-load latency is negligibly small because circuit switching is used within the ring and no buffering is needed. The transmission time across rings and cubes can be determined by counting the number of hops between the source and the destination. The distance function $D(t_1, t_2)$ between two nodes $t_1$ and $t_2$ in HRC is thus defined to be
$$D(t_1, t_2) = |t_1.ringID - t_2.ringID| + |t_1.cubeID.x - t_2.cubeID.x| + |t_1.cubeID.y - t_2.cubeID.y|.$$
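The distance function translates directly into code. The sketch below is illustrative; the Node tuple and the name hrc_distance are hypothetical, and the cube ID is assumed to be already available as an (x, y) pair.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Node ID with the cube ID split into x and y (illustrative names)."""
    level1_id: int
    ring_id: int
    cube_x: int
    cube_y: int


def hrc_distance(t1: Node, t2: Node) -> int:
    """Hop distance across rings and cubes; positions inside a ring do not
    contribute because intra-ring transfers are circuit switched."""
    return (abs(t1.ring_id - t2.ring_id)
            + abs(t1.cube_x - t2.cube_x)
            + abs(t1.cube_y - t2.cube_y))
```

Two nodes in the same ring therefore have distance 0 regardless of their level 1 IDs, which is why rotating tasks along a ring leaves the distance to their communication partners unchanged.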
Application Mapping
Application mapping in HRC, as listed in Algorithm 3, works as follows. For a task $a_i$ to be mapped, if one of its communicating tasks is already mapped, $a_i$ is mapped to a node with the minimum distance to that mapped task. Otherwise, a new ring with the minimum power consumption is selected. Next, the unmapped tasks that need to communicate with $a_i$ are mapped to the nodes with the minimum distance to $a_i$. Application mapping in HRC takes two major steps, shown in Algorithm 3.
In the first step, a ring is selected for each task so that the communication distances of the tasks are optimized and all the rings that are selected to accommodate these tasks generate an equal amount of heat. In the second step, hot tasks inside a ring are physically distributed farthest apart for heat dissipation (by calling Algorithm 4). For example, when the first hot task is mapped to node 0 inside the ring in Fig. 10, the second hot task should be mapped to node 6, since doing so will put the two tasks farthest apart. If a third hot task needs to be mapped, it will be placed on node 9, which has the maximum geometric distance to both nodes 0 and 6. The complexity of the mapping algorithm is $O(g^2 h)$, where $g$ is the number of tasks in the application, and $h$ is the network size.
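The second step can be sketched as a greedy farthest-apart placement. The sketch follows the prose description above (hot tasks are placed at the maximum geometric distance from those already mapped); the function names and the circular node layout are assumptions, since the physical layout of the ring in Fig. 10 is not reproduced here.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]


def ring_on_circle(n: int) -> Dict[int, Point]:
    """Hypothetical layout: n ring nodes evenly spaced on a unit circle."""
    return {k: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)}


def map_to_ring(pos: Dict[int, Point], occupied: Iterable[int]) -> int:
    """Pick a free node that maximizes the summed geometric distance
    to the nodes already running hot tasks."""
    taken = set(occupied)
    free = [t for t in pos if t not in taken]
    if not free:
        raise ValueError("no free node left in the ring")
    if not taken:
        return min(free)  # first hot task: any node works, take the lowest ID
    return max(free, key=lambda t: sum(math.dist(pos[t], pos[u]) for u in taken))
```

On a 12-node circular layout, the first hot task goes to node 0 and the second to node 6. For the third task, nodes 3 and 9 are equally far from nodes 0 and 6 in this symmetric layout, so either may be returned.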
PERFORMANCE EVALUATION
Experimental Setup
To evaluate the performance of the network architecture presented in Section 3, experiments were performed using an event-driven C++ simulator, which is a modified version of Popnet, loaded with the DSENT power model. The parameters used for the simulations are summarized in Tables 2 and 3. Hotspot is used as the temperature simulator. In order to get the temperature profile, the floorplan of Intel SCC [23] is used as an input to Hotspot. The temperature-dependent wire delay model is adopted from [24].
In addition to random benchmarks (from TGFF), task graphs derived from real applications like the T264 decoder (Table 4) were adopted in the experiments. The T264 decoder has 4 pipeline stages, and the input to the decoder is a frame of foreman saved in quarter common intermediate format (QCIF). We followed the task partition in [25] and profiled the corresponding tasks of the AES decoder and encoder, each of which was designed with 4 pipeline stages and duplicated to accommodate a total of 16 tasks. The tasks of automotive and consumer from E3S were obtained from MiBench [26], [27]. The matrix multiplication listed in Table 4 has 5 tasks in total. We also included the task graphs of the OFDM transmitter and receiver from [28] in our benchmark suites.
Algorithm 4. Map_to_Ring function
Input: $r$: a ring; $a$: a task to be mapped to one node in this ring.
Output: $M(a)$: the mapping result for this task, i.e., a node in $r$.
Function: Find the mapping for task $a$ to optimize the temperature of the ring.
begin
    /* assume $t_1, t_2, \ldots, t_m$ are nodes in $r$ and mapped with tasks */
    find an unmapped $t \in r$ which maximizes $\sum_{j=1}^{m} D(t_j, t)$;
    $M(a) = t$;
    /* since hot tasks are separated physically, heat is distributed inside a ring */
end
In what follows, the latency of the HRC topology is first evaluated, followed by the evaluation of HRC together with different configurations of task management.
Latency of HRC
Fig. 11 compares the latency of HRC and other 3D NoC topologies, including (i) the 3D cube (denoted as Cube), (ii) the concentrated 3D cube (denoted as CCube), where four nodes are connected to one physical router, and (iii) CCube-EVC, where the concentrated 3D cube is augmented with an express virtual channel (EVC).
In Fig. 11, MC-x refers to an HRC with x pairs of virtual networks. VNs do not need buffers, and more VNs tend to help alleviate congestion. From Fig. 11, when there are one to four pairs of VNs, corresponding to the cases of MC-1 through MC-4, the latency performance is worse than that of CCube and that of Cube. But in the case of MC-6 (i.e., HRC with 6 pairs of VNs), the network performance is better than that of the CCube- or Cube-based architectures. In the following experiments, we use MC-8 as the HRC architecture for comparisons.
Fig. 12 compares the latency of two networks of 1,024 nodes that are organized with different ring sizes. The first one, referred to as HRC64x4 in Fig. 12, has 4 cubes; each cube has 4 rings, and each ring has 64 nodes. The second network, denoted as HRC256, has only 1 cube with 4 rings, and each ring has 256 nodes. From Fig. 12, one can see that HRC64x4 has a higher latency than HRC256 when the injection rate is low, but at high injection rates, the opposite result is observed. The reason is that, when a ring has more nodes, as is the case in HRC256, data traffic can take full advantage of the low-latency transmission within a ring due to circuit switching. As the injection rate increases, however, the contention within a ring of HRC256 increases as well, leading to increased waiting time.
Comparison of Network Topologies + Task Migration Methods
In this section, we compare the performance and temperature results by exploring different combinations of various 3D NoC topologies, application mapping strategies, and runtime thermal-aware task migration strategies. In what follows, we first introduce configurations combining different mapping algorithms, task migration methods, and 3D NoC topologies, as shown in Fig. 13a .
The initial task mapping strategies can be either communication-aware [22] or thermal-aware [18].
In communication-aware mapping, tasks with high communication volumes are mapped to nodes that have shorter distances (measured in hop counts) between them. The algorithm has two major steps: (Step 1) first node selection, where the first task is mapped to a suitable node, and (Step 2) region selection and mapping of the remaining tasks, as in Fig. 13c. It uses a metric called the vicinity counter to measure whether the first node has a sufficient number of free neighbor nodes. In the first node selection step, a node with the best vicinity counter value is selected. This corresponds to selecting a node with many free neighbor nodes. In the region selection step, a node region is found to run the tasks in the application. The region shape should be as close to a square as possible, since a square region has the lowest average distance between each pair of nodes inside it. Then the edges in the task graph (an edge indicates a communication between two tasks) are sorted in descending order by their communication volumes. The mapping algorithm works iteratively. The two terminal tasks of edges with larger communication volumes are mapped earlier. In each iteration, two nodes with the minimal Manhattan distance are selected to run the two communicating tasks represented in the task graph $AG_i$ as an edge.
In thermal-aware mapping, heat is balanced among the vertical stacks of nodes. Nodes in a vertical stack have the same Z coordinate. The mapping algorithm tries to keep the power consumption of each vertical stack as close to that of the others as possible when mapping the tasks to the nodes. It again runs iteratively for each unmapped task. At each iteration, two steps are involved, as in Fig. 13d: (1) a stack with the minimal aggregate power is selected as the candidate stack, and (2) the unmapped task is mapped to a node in the candidate stack.
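One iteration of the thermal-aware mapping can be sketched as follows. This is a minimal illustration and not the algorithm of [18]: the function name thermal_aware_map and the data layout are hypothetical, only stacks that still have a free node are considered, and the first free node of the chosen stack is taken because the text does not specify how the node inside the stack is picked.

```python
from typing import Dict, Hashable, List, Tuple


def thermal_aware_map(task_power: float,
                      stack_power: Dict[Hashable, float],
                      free_nodes: Dict[Hashable, List[int]]) -> Tuple[Hashable, int]:
    """Map one task: (1) select the stack with the minimal aggregate power,
    (2) place the task on a free node of that stack.

    stack_power and free_nodes are updated in place.
    """
    candidates = [s for s in stack_power if free_nodes.get(s)]
    if not candidates:
        raise ValueError("no free node left in any stack")
    stack = min(candidates, key=lambda s: stack_power[s])
    node = free_nodes[stack].pop(0)   # assumption: take the first free node
    stack_power[stack] += task_power  # account for the newly mapped task
    return stack, node
```

Calling the function once per unmapped task keeps the aggregate power of the stacks close to each other, since each new task always goes to the currently least loaded stack.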
There are four migration strategies to follow:
Strategy 1) a migration path is selected in a random fashion;
Strategy 2) the "hot" tasks are migrated to the globally coolest nodes;
Strategy 3) follow cyclic migration paths as in Fig. 13 [3]. The hot tasks are separated into the four regions. At each control time instance, if the nodes running the hot tasks are detected to be overheated, these tasks will be migrated to the next node along the predefined path, as in Fig. 13b [3];
Strategy 4) this is the strategy proposed for HRC (Section 4).
The rings in HRC can be either horizontal or thermally unbiased, as in Figs. 9a and 9c, respectively. These combinations of network topologies and task migration methods are tabulated in Table 5. Fig. 14 shows the comparison results of the methods listed in Table 5 in terms of network latency and energy, when varying (i) the average number of tasks in each application, (ii) the communication volumes, and (iii) the network size.
In Figs. 14a and 14d, these six approaches are compared when the average number of tasks in each application varies. One can see that overall, the two HRC-based approaches have the lowest network latency and consume the least amount of energy among all the approaches compared. For example, when the average task number is 54, HRC with unbiased rotation can reduce latency by 40 percent over the random migration with communication-aware task mapping, by 50 percent over cyclic migration with thermal-aware task mapping, by 82 percent over migration to global coolest nodes, and by 78 percent over cyclic migration path in CCube. Such performance improvement is attributed to the fact that HRC can effectively reduce the communication distance of the tasks during task migration. The communication energy of HRC with unbiased rotation is 0.42×, 0.48×, 0.18×, and 0.24× that of random migration with communication-aware task mapping, cyclic migration with thermal-aware task mapping, migration to global coolest nodes, and random migration path in CCube, respectively. Among all these approaches, migration to global coolest nodes in CCube has the worst communication performance, since each "hot" task is migrated to the globally coolest node, which tends to increase the communication distance. The performances of the two HRC configurations with task migration are comparable.
The parameters in Table 2 were used in conducting these experiments.
Figs. 14b and 14e compare the six approaches as each application's average communication volume varies. Overall, the HRC-based approaches have the lowest latency and network energy consumption among all the approaches. For example, when the average communication volume is 70 Kbits, HRC with unbiased rotation reduces latency by 25, 19, 70, and 56 percent over random migration with communication-aware task mapping, cyclic migration with thermal-aware task mapping, migration to global coolest nodes, and random migration path in CCube, respectively. This can again be attributed to the fact that the communication distance of the tasks is kept small during task migration in HRC. The higher the communication volume, the greater the communication performance benefit the HRC-based approaches bring.
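The claim that migrating along a ring keeps the communication distance small can be illustrated with a minimal sketch. It assumes, purely for illustration, that every task on a ring moves by the same number of positions and that distance is counted as hops along the shorter way around the ring; the function names and the task labels are hypothetical and are not taken from the HRC implementation.

```python
def ring_distance(a: int, b: int, ring_size: int) -> int:
    """Hop count between positions a and b, taking the shorter way around.

    Assumption: the ring is modeled as bidirectional for illustration only.
    """
    d = abs(a - b) % ring_size
    return min(d, ring_size - d)


def rotate(placement: dict[str, int], shift: int, ring_size: int) -> dict[str, int]:
    """Move every task `shift` positions along the ring (a uniform rotation)."""
    return {task: (pos + shift) % ring_size for task, pos in placement.items()}


def pairwise_distances(placement: dict[str, int], ring_size: int) -> dict[tuple[str, str], int]:
    """Ring distance for every unordered pair of tasks in the placement."""
    tasks = sorted(placement)
    return {
        (s, t): ring_distance(placement[s], placement[t], ring_size)
        for i, s in enumerate(tasks)
        for t in tasks[i + 1:]
    }
```

Under these assumptions, `pairwise_distances(rotate(p, k, n), n)` equals `pairwise_distances(p, n)` for any shift `k`, because a uniform rotation leaves the difference between any two positions unchanged modulo the ring size. Migrating a hot task to an arbitrary cooler node in a mesh offers no such guarantee.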
Figs. 14c and 14f compare the six approaches as the system size varies. Overall, the two HRC-based approaches have the lowest network latency and network energy consumption among all the approaches. For example, when the NoC size is 16 × 16 × 4, HRC with unbiased rotation reduces latency by 40, 53, 81, and 70 percent over random migration with communication-aware task mapping, cyclic migration with thermal-aware task mapping, migration to global coolest nodes, and random migration path in CCube, respectively.
Fig. 15 compares performance using the real benchmarks in Table 4. Overall, the two HRC-based topologies reduce latency by 33, 33, 71, and 55 percent over random migration with communication-aware task mapping, cyclic migration with thermal-aware task mapping, migration to global coolest nodes, and random migration path in CCube, respectively.
Fig. 16 compares the worst-case performance of the six methods using random benchmarks. The worst case of our proposed 3D NoC architecture occurs when (1) all the tasks inside a ring have reached their thermal limits, and (2) the communication volume is very small compared to the task computation time. In this case, no matter how we "rotate" the tasks, the entire ring remains overheated. That is, the entire chip has reached its thermal limit, and the only remaining option is to lower the voltage level or reduce the frequency through DVFS. The six approaches in Fig. 16 all employ augmented DVFS: if no node is available for task migration to avoid overheating, the voltage/frequency levels of the cores have to be reduced. The results of all the approaches are found to be comparable.
This is because when the power consumption of all the tasks is very high, the chip is already overheated irrespective of the migration method employed, and applying DVFS to reduce the voltage/frequency levels of the cores is the only viable solution, at the cost of some performance degradation.
The benchmarks in Table 4 were used in the experiments. These benchmarks were mixed to run in the system.
Thermal Profile
Fig. 17 shows the peak temperature and the thermal gradient of the six approaches. The two HRC-based approaches have the lowest peak temperatures. Migration following the cyclic path in CCube results in the worst thermal performance. In cyclic task migration in 3D mesh, the hot tasks are restricted to migrate along predefined paths. Each path includes nodes confined in a small region, leading to poorer heat dissipation. All the other approaches have comparable thermal performance. Fig. 18 shows how temperature varies with different control intervals of task migration (i.e., task migration frequency): 1M, 10M, and 100M cycles (M stands for million). In agreement with intuition, a control interval of 1M cycles results in the best performance.
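The augmented-DVFS behavior described for Fig. 16 (migrate while a usable node exists, otherwise lower the voltage/frequency level) can be sketched as a single control step that would run once per control interval. This is a simplified illustration under stated assumptions, not the controller used in the experiments: the `Node` fields, the threshold parameter, and the "coolest idle node" target choice are hypothetical placeholders, and the HRC strategy itself rotates tasks along rings rather than searching for a target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    temp_c: float          # current temperature estimate (hypothetical field)
    task: Optional[str]    # task running on this node, or None if idle
    vf_level: int          # voltage/frequency level; 0 is the lowest


def control_step(nodes: list[Node], threshold_c: float) -> list[str]:
    """Run one control instance and return a log of the actions taken."""
    actions = []
    for i, node in enumerate(nodes):
        if node.task is None or node.temp_c <= threshold_c:
            continue  # nothing to do for idle or sufficiently cool nodes
        # Placeholder policy: candidate targets are idle nodes below the threshold.
        targets = [j for j, n in enumerate(nodes)
                   if n.task is None and n.temp_c <= threshold_c]
        if targets:
            j = min(targets, key=lambda k: nodes[k].temp_c)
            nodes[j].task, node.task = node.task, None
            actions.append(f"migrate {nodes[j].task}: {i} -> {j}")
        elif node.vf_level > 0:
            # No node can absorb the task, so fall back to DVFS.
            node.vf_level -= 1
            actions.append(f"dvfs node {i} -> level {node.vf_level}")
    return actions
```

When every node is occupied and above the threshold, the step degenerates to DVFS on each overheated node regardless of the migration policy plugged in, which is consistent with the comparable worst-case results reported for all six approaches.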
Implementation Cost
Using Synopsys Design Compiler™ with a TSMC 45 nm library, an HRC router with 8 pairs of VNs has an area of 112,584 μm² and consumes 40 mW of power. The area and power of the HRC router are 0.8× and 0.4× those of a 3D CCube-EVC router with 10 ports and buffers with a depth of 4 flits, respectively.
CONCLUSION
In this paper, Hierarchical Ring Cluster (HRC), a specially designed 3D NoC architecture with genuine support for runtime task management, was presented. At the bottom of the HRC network are the circuit-switched rings; when "hot" tasks need to be redistributed or migrated within the same ring, the communication latency remains unchanged. Rings are then grouped to form cubes, and multiple cubes can be further grouped together to form the whole network. Deadlock-free routing schemes for HRC were also presented to account for intra-ring, intra-cube, and inter-cube communications. Experimental results confirmed that the proposed HRC, when combined with runtime thermal-aware task migration approaches, can reduce latency by as much as 80 percent over other 3D NoCs built on mesh topologies with task migration.
... Table 5 in terms of peak temperature and thermal gradient.
Fig. 18. Impact of control interval on task migration (i.e., task migration frequency) in HRC with planar rings.
