Three dimensional Networks-on-Chip (3D NoCs) have evolved as an ideal solution to the communication demands and complexity of future high density many core architectures. However, the design practicality of 3D NoCs faces several challenges such as thermal issues, high power consumption and area overhead of 3D routers as well as high complexity and cost of vertical link implementation. To mitigate the performance and manufacturing cost of 3D NoCs, inhomogeneous architectures have emerged to combine 2D and 3D routers in 3D NoCs producing lower area and energy consumption while maintaining the performance of homogeneous 3D NoCs. Due to the limited number of vertical links, application mapping on inhomogeneous 3D NoCs can be complex. However, application mapping has a great impact on the performance and energy consumption of NoCs. This paper presents an energy and performance aware application mapping algorithm for inhomogeneous 3D NoCs. The algorithm has been evaluated with various realistic traffic patterns and compared with existing mapping algorithms. Experimental results show NoCs mapped with the proposed algorithm have lower energy consumption and significant reduction in packet delays compared to the existing algorithms and comparable average packet latency with Branch-and-Bound.
I. INTRODUCTION
Current advances in semiconductor integration and processing capability of System-on-Chip (SoC) enable a large number of high performance cores to be integrated into a single chip. Networks-on-Chip (NoC) have been proposed to adopt the idea of networking in the macro-world for current and future SoCs but with limited resources and tighter constraints. 3D IC fabrication has also emerged to stack several layers of 2D ICs vertically, providing shorter vertical links with enhanced connectivity and lower delays. Combining 3D IC fabrication and Network-on-Chip technique, 3D NoC creates new design and research opportunities and challenges for high density SoCs. High integration and performance opportunity of 3D NoCs encourage enhanced heterogeneity of application design with implementation of multiple applications on a SoC. Conventionally, homogeneous 3D NoCs have been employed for 3D-Integration where 3D routers are employed for interlayer and planar communication. However, the 3D routers have a larger area and power consumptions than a 2D router with a similar architecture. Consequently, the homogeneous 3D router distribution may lead to significant area and power overheads if applied to applications whose communication patterns vary significantly among embedded cores. Moreover, Through Silicon Via (TSV) which has been accepted as a viable inter-layer wiring technique has a complex and expensive manufacturing process [1] , [2] . To optimize the performance and manufacturing cost of 3D NoCs with minimal distortion to the modularity, inhomogeneous architectures have been proposed to combine 2D and 3D routers in 3D NoCs [3] , [4] , [5] , [2] , [6] . Several inhomogeneous 3D architectures focusing on different NoC router architectures, minimal hopcount between 2D and 3D routers in each layer, and uniform distribution of 2D and 3D routers have been proposed [7] , [2] , [8] . 
However, due to the limited number of 3D routers and vertical links, mapping applications to an inhomogeneous 3D NoC can be challenging. Specifically, different applications have variable characteristics at design time and run-time, with possible link, node or task failures due to current technology limitations. In this paper, an energy-aware application mapping technique for inhomogeneous 3D NoCs is proposed. The proposed algorithm efficiently maps a given application with minimized communication energy while maintaining the performance. The application must be known and its communication task graph must be constructed for this approach, similar to the applied benchmarks [34] [56] [57] [58] used in the literature [3] [4] [5] [6] [7] [8] [9] [13] [14] [15] [16] [17] [18] . The mapping algorithm involves three main stages: 1) Initial NoC size determination and clustering, which assigns regions along the 3D NoC vertices to reduce the cost of 3D routers. 2) Architecture matching, which allocates the application graph to the clusters in the 3D NoC such that optimized numbers of 2D and 3D routers are assigned by enumerating the communication bandwidth and energy constraints in each cluster. Thus, this stage re-evaluates and assigns a suitable NoC size and automatically determines the suitable number of 3D and 2D routers and their assigned tiles. 3) Task to NoC region mapping, which assigns the IP cores to the tiles in different clusters with minimized total energy consumption and maximized performance. This stage efficiently exploits the area and power efficiencies of 2D routers as well as the higher bandwidth and lower latency characteristics of the vertical links associated with 3D routers.
The remainder of the paper is organized as follows: Section II reviews recent contributions on 3D NoC mapping techniques. Section III formulates the problem of generating efficient 3D NoC architectures. Sections IV and V formulate the mapping problem and describe the proposed mapping approach, respectively. Experimental results presented in Section VI emphasize the performance benefits of employing the proposed mapping algorithm to generate inhomogeneous architectures compared to other techniques. Finally, Section VII presents the concluding remarks and main findings.
II. RELATED WORK
The evolution of SoC design to the third dimension offers a lot of opportunities such as integration of inhomogeneous cores which results in several challenges including optimal inhomogeneous NoC topologies, router architectures and application mapping techniques [9] , [10] , [11] , [12] , [13] . Various 3D NoC topologies are presented and evaluated in [14] , [15] , [16] , [17] where homogeneous 3D routers are employed in each architecture. A 3D router has a larger area and power consumption than a 2D router with similar architectures [18] , [13] . Particularly, the 7 port symmetric router has an area and power overhead of 36% and 158%, respectively compared to a conventional 5 port router [8] . Li et al. [19] , [13] proposed to replace the large 7 port symmetric 3D routers with 6 port NoC-Bus hybrid 3D routers. However, the hybrid router requires an additional central arbiter per each vertical pillar in the NoC. Moreover, the hybrid router still has a large crossbar and energy consumption. Xiangyu et al [20] have demonstrated that the area overhead of TSV increases with increase in number of 3D layers. Particularly, the area overhead of the TSV for a 4 layer 3D NoC with 5 million gates can reach as high as 10% [6] . Similarly, Bartzas et al. [3] presented a study of the area, power and performance trade-off of combining 2D and 3D routers in 3D mesh and torus topologies. Xu et al. [7] performed an evaluation of the impact of reducing the number of TSVs to half and quarter on the performance of 3D NoCs [21] . Their proposed architectures, quarter/lo and half/lo (quarter/hi and half/hi), aim at generating inhomogeneous 3D NoC with 2D routers placed as close to (far from) 3D routers as possible in each layer [21] . These architectures suffer from uneven distribution of 3D routers and unpredictable delays for different applications. Liu et al. 
[22] used partition islands of routers to constitute regions for sharing the same TSV pad for inter-layer communication controlled by serialization logic [21] . However, serialization along the TSV bundle causes the average packet delay to increase exponentially as the number of routers per TSV bundle increases [21] . Moreover, the TSV pads have no direct connection to the processing cores, which is a waste of chip area compared to our proposed architectures [21] . Even when bypass links are introduced to enable adaptivity among the TSV sharing regions, these architectures suffer from high packet latencies compared to a homogeneous 3D NoC implemented with deterministic routing [23] , [24] . Also, the genetic algorithm and simulated annealing employed in [23] , [6] for the selection and placement of different TSV patterns (sharing regions) in 3D NoCs have an exponential complexity with a large design exploration space [6] . Similarly, Pasricha [25] proposed a serialization technique for reducing the number of TSVs where the link size of TSVs at selected nodes is reduced by a fraction. Thus, if the number of TSVs exceeds a threshold, serialization is adopted to reduce the bandwidth of some TSVs. However, due to the reduced bandwidth of the TSVs and the serialization logic, such architectures have high average packet latencies. Moreover, due to the higher overhead of serialization receiver and transmitter logic compared to the TSV reduction, such architectures have even higher power consumption compared to a homogeneous 3D mesh. Based on the serialization methodology, Pasricha [26] proposed a 3D NoC synthesis framework by augmenting the router and placement techniques proposed in [27] . These routers have several local ports, which leads to high power consumption due to the increased number of ports and high data rates across the crossbar. 
On the other hand, existing inhomogeneous architectures [28] , [7] , [22] , [29] , [3] , [4] , [24] , [30] , [31] , [32] , [5] assume a fully mapped NoC and do not consider the dynamics of application traffic load in their inhomogeneous architectures [13] . Applications in such 3D NoCs are not optimized as communication bandwidth and performance constraints of the applications were not considered in the architecture generation [13] . To resolve this, a systematic approach for generating inhomogeneous 3D NoC architectures where the TSV and buffer utilization of the given application are exploited is proposed in [6] , [13] . However, the systematic approach assumes the adopted mapping algorithm (which could have a high complexity) is optimized. This paper significantly complements the systematic approach by presenting an energy and performance efficient NoC dimension generation technique. In addition, an effective mapping algorithm is proposed, which generates inhomogeneous 3D NoC architectures. The generated architectures could be ported into the systematic approach for further buffer resizing. The impact of variable buffer sizes is detailed in [6] . Several IP mapping algorithms have been proposed for 2D NoCs that focus on minimizing the overall communication power [33] , [34] , [35] , [36] , [37] . However, most of these existing mapping algorithms for 2D NoCs are not suitable for 3D NoCs as communication in the third dimension introduces a new set of NoC constraints, which are not considered in the 2D designs. For example, different area and energy overheads of 3D routers and different transfer delays per dimension are some of these constraints. Moreover, such mapping algorithms will not be suitable for generation of 3D NoCs with inhomogeneous router distributions as they do not take advantage of the power and performance benefits of the router heterogeneity. The Branch-and-Bound algorithm proposed by Hu et al. 
[34] can be extended to map applications to inhomogeneous 3D NoCs [21] . However, this algorithm employs partial exhaustive search trees and has an extremely long run-time. Janidarmian et al. [38] proposed Onyx, a heuristic bandwidth-constrained mapping algorithm for tile-based NoCs. Here, nodes with higher communication data rates are mapped first. Tosun [39] presented CastNet, a heuristic algorithm for mapping tasks onto mesh-based NoCs. Based on the symmetry of the mesh, tiles are grouped into partitions, and then nodes with high communication volumes are mapped first to the partitions. Murali et al. [40] proposed Nmap, a three-phase algorithm to find a near optimal mapping solution, which has been extended to 3D NoCs [41] . Several mapping solutions have been proposed for 3D NoCs based on genetic algorithms [42] , [43] , [44] , [45] . However, these algorithms have very high computational complexities. Wang et al. [46] proposed a mapping algorithm for 3D NoCs based on a run-time incremental mapping technique [47] . Here, the algorithm tries to map applications to convex regions while utilizing as many vertical links as possible in the mapping process. This approach increases the number of 3D routers (vertical links), which are costly and have high power consumption.
In an effort to reduce the number of TSVs, the authors of [41] employ heuristics to map an application to a homogeneous 3D NoC. A number of vertical links with high traffic load are then preserved and the mapping algorithm is reapplied to improve the performance of the application in the generated inhomogeneous 3D NoC. A similar approach is adopted in [48] , where Kernighan-Lin partitioning is employed as the mapping algorithm. In both cases, the mapping technique has to be repeated in an iterative manner. Moreover, besides the limitations associated with the Kernighan-Lin algorithm, the resulting mapping solution may not be optimal. We propose a performance and energy-aware technique for mapping applications to 3D NoCs. The proposed technique automatically generates low latency inhomogeneous 3D NoCs for embedded platforms by evaluating the number and placement of 2D and 3D routers required as well as the communication characteristics of tasks assigned to the processing elements. Run-time migration and scheduling techniques could be employed to manage end-to-end communication delays of the performance efficient inhomogeneous 3D NoCs generated by our mapping technique [36] , [37] . Although mapping multiple tasks per core could improve the performance, we still need to deal with the same problem of mapping the clustered tasks to cores in a 3D NoC. It should also be noted that the generated inhomogeneous 3D NoCs rely heavily on the alignment of 3D routers to provide interlayer TSV connections. However, by introducing inhomogeneity in the network, we are able to save on power-hungry 3D routers and help reduce the number of expensive TSVs. In the following sections, the assumed models for the application, topology, and cost functions are mainly based on the literature [34] [56] [57] [58] to allow a fair comparison when evaluating the impact of the proposed solution.
III. PROBLEM FORMULATION
To efficiently exploit the non-uniform bandwidth requirements of various tasks in applications, an efficient algorithm that maps tasks to NoCs can be employed to generate inhomogeneous architectures with improved performance. Consequently, the focus of this paper is to capitalize on the architecture and topology variation of inhomogeneous 3D NoCs through an efficient application mapping technique. An example of an inhomogeneous 3D NoC architecture is shown in Fig. 1 . 
A. NoC Architecture and Application Formulation
An application $A_x = \{tsk_1, tsk_2, tsk_3, \ldots, tsk_n\}$, where $tsk_i$ represents an unmapped task in $A_x$, is modeled as a communication graph defined as follows.
Definition 1: An application Communication Graph $CG = (T, W)$ is a weighted directed graph where each node denoted by $tsk_i \in T$ refers to a task and each $w_{ij} \in W$ is the communication between tasks $tsk_i$ and $tsk_j \in T$. The weight of the edge $w_{ij} \in W$ characterizes the communication volume (bit width) between $tsk_i$ and $tsk_j$.
Definition 2: An X × Y × Z 3D NoC is said to have a symmetric NoC size if it satisfies:
Definition 3: A tile cluster is defined as a group of adjacent tiles allocated to a collection of communication dependent tasks. A tile cluster can combine tiles from any dimension.
Definition 4: A tile in a tile cluster is called a vertex if it is located at the boundaries of the tile cluster. Note that a tile may or may not be a vertex, as a tile cluster can encompass multiple tiles.
B. NoC Energy Formulation
The energy consumed by a task between a given tile pair $t_i$ and $t_j$ in application $A_x$ is given by
where $b_{ij}$ is the communication volume in number of bits transferred between $t_i$ and $t_j$, $mn$ represents all communication involving the task and $E^{ij}_{bit}$ is the communication energy for sending a single bit of data between the tile pair, which is given by (an extended version of the 2D energy model [49] ):
where $\eta_{3D}$ and $\eta_{2D}$ are the number of 3D and 2D routers traversed between the source and destination nodes, respectively. $E_{R3D_{bit}}$ and $E_{R2D_{bit}}$ are the energy consumption for sending a single bit across a 3D and a 2D router, respectively. $E^{ij}_{link_{bit}}$ is the link energy consumption, defined as:
where $E_{Hlink_{bit}}$ is the energy consumption per bit along the planar links and $E_{Vlink_{bit}}$ is the inter-layer link energy consumption per bit. Following the SMIC 90nm technology used in [50] , $E_{Hlink_{bit}}$ and $E_{Vlink_{bit}}$ can be calculated as 0.127pJ and $9.56 \times 10^{-3}$, respectively. $MD^{ij}_h$ and $MD^{ij}_v$, the Manhattan distances of the planar and inter-planar regions between $t_i$ and $t_j$, are given by Equations 5 and 6, respectively.
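The display equations of this subsection are not reproduced in this text, so the following is only a minimal Python sketch of how the per-bit energy model appears to compose, based on the symbol definitions above. It assumes that the per-bit energy is the sum of the router energies (weighted by the number of 3D and 2D routers traversed) and the link energy, that the link energy weights the planar and inter-planar Manhattan distances by the respective per-bit link energies, and that the inter-layer link energy is expressed in pJ. All function names, and the router energy values in the usage note, are hypothetical.

```python
# Link energies per bit quoted in the text (SMIC 90nm, from [50]).
E_HLINK_PJ = 0.127    # planar link energy per bit, in pJ
E_VLINK_PJ = 9.56e-3  # inter-layer link energy per bit; the pJ unit is an assumption


def manhattan_planar(a, b):
    """Planar Manhattan distance between tiles a and b, each given as (x, y, z)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def manhattan_vertical(a, b):
    """Inter-planar distance (number of layers crossed) between tiles a and b."""
    return abs(a[2] - b[2])


def link_energy_per_bit(a, b):
    """Assumed link energy: distances weighted by the per-bit link energies."""
    return (manhattan_planar(a, b) * E_HLINK_PJ
            + manhattan_vertical(a, b) * E_VLINK_PJ)


def energy_per_bit(a, b, n_3d, n_2d, e_r3d, e_r2d):
    """Assumed per-bit energy: router energy for n_3d 3D routers and n_2d
    2D routers on the path, plus the link energy between the two tiles."""
    return n_3d * e_r3d + n_2d * e_r2d + link_energy_per_bit(a, b)


def flow_energy(bits, a, b, n_3d, n_2d, e_r3d, e_r2d):
    """Energy of transferring `bits` bits between tiles a and b."""
    return bits * energy_per_bit(a, b, n_3d, n_2d, e_r3d, e_r2d)
```

For instance, `flow_energy(100, (0, 0, 0), (2, 1, 1), 2, 2, 1.0, 0.5)` evaluates the model for a 100-bit transfer with placeholder router energies of 1.0 pJ and 0.5 pJ per bit; these router values are illustrative only and are not taken from the paper.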
C. Mapping Problem Definition
We formulate the mapping problem as follows: Given the communication graph CG of an application $A_x$, find a mapping function $M : A_x \rightarrow NoC$ which maps all the tasks in CG to available tiles in the inhomogeneous architecture with minimized average packet delay such that
where $L^{ij}_{tsk}$ and $E^{ij}_{tsk}$ are respectively the latency and energy consumption of transferring packets of a task between tiles $t_i$ and $t_j$ for a given deadlock free routing algorithm. Similar to [4] , [24] , [5] , [3] , [7] , $EL_{cost}$ represents the total combined cost of energy and latency of the given mapping, as an equal weight is given to both latency and energy consumption of the NoC. To illustrate the composition of the problem, Fig. 2 shows an example of an application that needs to be mapped. The communication task graphs of the applications are given in Step 1.
Step 2 estimates the dimensions of a 3D NoC (in this case 3 × 3 × 3) based on the total number of tasks available in the application presented to the mapping tool. To reduce the average packet latency and energy consumption, tasks with high communication dependencies must be mapped as close to each other as possible. To initiate the mapping process, tasks in order of their communication volumes are assigned to the NoC regions such that high energy consumer tasks are assigned to as many 2D routers as possible while maintaining a certain number of 3D routers to enhance performance. Tasks that have communication dependencies on allocated tasks are then assigned to the task clusters such that the total hop-count between the tasks is reduced while an equivalent number of tasks is kept within each task cluster. It should be noted that tasks assigned to different task clusters can still communicate with each other after the mapping process. Next, all tasks in the application are mapped as shown in Step 4. In this paper, we use single task-to-core mapping in our evaluation. However, similar to [36] , [51] , [37] , the proposed algorithm is applicable to cases where several tasks can be grouped together and mapped to a single tile. The task grouping can be achieved by various cost functions such as the task dependencies and their communication energy [51] . In such cases, we consider this group of tasks as a single vertex in the communication task graph and the total outgoing/incoming bandwidth is taken as input to the mapping function in stage 1. Based on the above discussion, the mapping problem for inhomogeneous 3D NoCs is divided into sub-problems as detailed in the next section.
IV. NOC SIZE AND MAPPING CONSTRAINTS
To achieve our aim of optimizing inhomogeneous 3D NoCs, the mapping process is further decomposed into three main sub-problems (NoC size determination, architecture matching and task mapping) which are individually discussed in this section.
A. Initial NoC Size Determination and Partitioning
The choice of NoC size dimensions plays a vital role on the performance and energy consumption of the network. Several issues such as floor planning, temperature and area which affect NoC dimensions have been addressed over the past few years [52] , [8] , [53] , [17] , [54] .
However, the choice of 3D NoC dimensions has not received much attention in terms of delay and energy consumption. For a better understanding of the effect of NoC dimension on packet delay and energy consumption, we simulated 3D NoCs with various NoC dimensions under uniform and hotspot traffic patterns. For the router configuration, input and output buffer sizes of 3 and 2 flits are used, respectively. As mentioned before, the impact of buffer size is already discussed in [6] . Figs. 3 and 4 show the variation of average packet latency and total energy consumption of 64 node 3D NoCs with various traffic rates (uniform and hotspot) under XYZ routing. As can be seen, decreasing the regularity of the 3D NoC dimensions significantly increases the packet delays. For instance, Fig. 3 (a) shows that the regular 4 × 4 × 4 3D NoC can sustain over 45% more traffic, at a lower average packet latency, than the 2 × 16 × 2 3D NoC. It can also be observed in Fig. 4 (a) that the 4 × 4 × 4 3D NoC has lower energy consumption at low traffic rates compared to other 3D NoCs. Even under high traffic rates, the energy consumption of 4 × 4 × 4 is lower than that of 4 × 8 × 2. Similarly, in Fig. 3 (b), 4 × 4 × 4 3D NoCs have lower average packet latency compared to other NoC dimensions under hotspot traffic patterns. Moreover, Fig. 4 (b) shows that 4 × 4 × 4 3D NoCs have the lowest energy consumption. In summary, it can be inferred that the symmetry of the NoC dimensions has a significant impact on the average packet latency and total energy consumption. Hence, to optimize the performance of inhomogeneous architectures, an effective approach is required to reduce the irregularity of NoC dimensions at an early phase of design exploration.
[Figure: Step 1: Application (constraints: energy, bandwidth); Step 2: NoC dimension estimation; Step 3: Application partitioning and clustering]
Fig. 5. Variation of inter-layer propagation delay with temperature
In general, as shown in Table I [...] million-gate design; but 2 layers are considered optimal for a 100 million-gate design. Hence, the optimized size of each 3D NoC dimension is dictated by the total number of tasks available in the applications to be mapped. Assuming each task is to be mapped to a single PE and each tile has the same size, an initial estimation of a symmetrical NoC size is made considering the total number of tasks available. To minimize communication delays, task clusters are employed by assigning tasks with high communication dependencies as close to each other as possible. Task clusters are assigned to the NoC regions by grouping adjacent tiles in a regular or irregular shape. Unlike other clustering algorithms, the goal is to minimize the manufacturing cost as well as the power consumption of 3D NoCs by utilizing vertical links only when necessary. Also, unless a loop exists, each task cluster is assigned at least one task with a high communication bandwidth requirement to balance the traffic in the network and reduce hotspots as well.
Given $A_s$ as the number of tasks to be mapped, the first stage of this sub-problem is to estimate the size of a symmetrical 3D NoC (X, Y, Z) which satisfies Equation 1, such that:
In our implementation, the cube root of $A_s$ is calculated and the dimension sizes are rounded up or down to minimize the total dimension size while satisfying Equation 9.
B. Architecture matching
The main focus of this stage is to assign high bandwidth tasks in the NoC layers such that a limited number of 3D routers are assigned in the task clusters to generate an optimized inhomogeneous 3D NoC architecture. Though TSVs are expensive and complex to implement with high area overheads, TSVs are shorter than horizontal links with shorter delays. However, the crossbar of routers without TSV (2D routers) has lower power consumption and area overhead compared to that of routers with TSVs (3D routers) [8] . Moreover, 3D routers require more ports which consume a larger amount of memory resources compared to 2D routers [1] . Hence the challenge of this sub-problem is to find and generate an architecture that provides an optimized trade-off between power, energy, area and average packet delay while minimizing the manufacturing cost incurred by the TSVs. 
C. Manhattan Distance, Loops and Average Hop-Count in Task Mapping
The final sub-problem of this phase is to map the application to the tiles such that the total communication energy is minimized. Most tasks in applications tend to form loops of communication dependencies. The challenge is how to exploit such loops to reduce packet delays. For example, given a group of 4 tasks, Fig. 6 shows all the possible shapes for the tasks to be mapped to neighbouring tiles. It can be deduced that mapping the tasks to a square region (a.k.a. clustered region) produces the lowest communication cost as there is more path diversity, even under adaptive routing. For instance, both $A_1$ and $A_2$ in Fig. 7 have tasks that form communication dependency loops. If loops are not considered in the mapping algorithm, the mapping will follow the upper part of Fig. 7 , which has extra delays with higher communication cost. However, the lower part of Fig. 7 , where tasks are mapped to clustered regions of tiles, produces a lower average hop-count with lower communication cost due to the routing diversity provided by the multiple paths between any two nodes [33] . To minimize the complexity of the proposed mapping algorithm, the tasks in a loop should not be shared with any other loop.
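The hop-count argument for clustered regions can be checked with a small Python sketch. It compares the average pairwise Manhattan distance of four tiles arranged as a 2 × 2 square with the same four tiles arranged in a straight line. The sketch assumes, purely for illustration, that every pair of the four tasks communicates equally; the function name and the tile coordinates are hypothetical and not taken from Fig. 6.

```python
from itertools import combinations


def average_hop_count(tiles):
    """Average Manhattan distance over all unordered pairs of (x, y) tiles."""
    pairs = list(combinations(tiles, 2))
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs)
    return total / len(pairs)


# Four tasks mapped to a square (clustered) region versus a straight line.
square_region = [(0, 0), (0, 1), (1, 0), (1, 1)]
line_region = [(0, 0), (1, 0), (2, 0), (3, 0)]

square_avg = average_hop_count(square_region)  # 8 hops over 6 pairs
line_avg = average_hop_count(line_region)      # 10 hops over 6 pairs
```

Under this equal-traffic assumption the square region gives the lower average hop-count, which is consistent with the preference for clustered regions stated above.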
Hence this sub-problem can be summarized as: given the CG of application $A_x$, find a mapping function M :
to optimize objective function 8, where $N_{2DR}$ and $N_{3DR}$ are the total number of 2D routers and 3D routers in the inhomogeneous 3D NoC architecture, respectively. This can be achieved by minimizing the relative number of 3D routers to 2D routers due to the higher cost of 3D routers. This stage efficiently exploits the area and power efficiencies of 2D routers as well as the higher bandwidth and lower latency characteristics of the vertical links associated with 3D routers.
V. ROBUST MAPPING AND ARCHITECTURE GENERATION FOR INHOMOGENEOUS 3D NOCS
With reference to the sub-problems discussed above, this section presents the implementation details of the proposed technique. Before the mapping process begins, optimized 3D NoC dimensions have to be determined. Experiments conducted in Section IV-A confirm that a 3D NoC with symmetric dimensions has the most efficient average packet delay and energy consumption. Hence, given $|A_s|$ as the total number of tasks available in the application to be mapped, an (X, Y, Z) symmetric 3D NoC dimension is determined by:
Equation 11 generates an equal NoC size in each dimension if the total number of tasks to be mapped is cubic. If the number of tasks does not have a positive integer as the cube root, the total number of tiles generated will be smaller than $|A_s|$. Hence, the values of X, Y and Z in Equation 11 are increased in a stepwise manner until the total number of tiles available is enough to accommodate the total number of tasks. For example, when the total number of tasks is 36, 3D NoC dimensions of 4 × 3 × 3 will be generated. To reduce the number of iterations, if the fractional part of $\sqrt[3]{|A_s|}$ is greater than 0.4, the symmetric 3D NoC dimension is determined by:
For instance, an application with 26 tasks has approximately 2.96 as $\sqrt[3]{|A_s|}$. In this case a 3D NoC dimension of (3, 3, 3) will be generated.
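The dimension selection described above can be sketched in Python as follows. Equations 11 and 12 are not reproduced in this text, so the sketch assumes that Equation 11 takes the floor of the cube root in every dimension and that Equation 12 takes the ceiling; the order in which X, Y and Z are incremented (X first, then Y, then Z) is also an assumption. The function names are hypothetical.

```python
def floor_cube_root(n):
    """Largest integer r with r**3 <= n, robust to floating-point error."""
    r = round(n ** (1.0 / 3.0))
    while r ** 3 > n:
        r -= 1
    while (r + 1) ** 3 <= n:
        r += 1
    return r


def symmetric_noc_dims(num_tasks, frac_threshold=0.4):
    """Estimate (X, Y, Z) for num_tasks tasks, one task per tile."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be a positive integer")
    base = floor_cube_root(num_tasks)
    fractional = num_tasks ** (1.0 / 3.0) - base
    if fractional > frac_threshold:
        # Assumed reading of Equation 12: round every dimension up at once.
        side = base + 1
        return (side, side, side)
    # Assumed reading of Equation 11: start from the floor and grow one
    # dimension at a time until there are enough tiles for all tasks.
    dims = [base, base, base]
    axis = 0
    while dims[0] * dims[1] * dims[2] < num_tasks:
        dims[axis] += 1
        axis = (axis + 1) % 3
    return tuple(dims)
```

With these assumptions, `symmetric_noc_dims(36)` yields (4, 3, 3) and `symmetric_noc_dims(26)` yields (3, 3, 3), matching the two examples given in the text.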
To generate a suitable inhomogeneous architecture and efficiently map an application to the 3D NoC without employing time consuming exhaustive search algorithms, we present an approach which uses task clusters to balance the traffic load. Moreover, the proposed mapping approach adopts the bandwidth and loop based mapping explained in Section IV-C to increase the routing efficiency of the NoC while introducing localization along the vertices to reduce the total energy consumption. The proposed mapping technique is explained in Algorithm 1. Before architecture generation and mapping of tasks to tiles, task clusters are formed by grouping tasks which have close communication dependencies. First, tasks in the application are sorted in descending order of communication volume and assigned to an initial list (IniList) (Line 4). To simplify the task clustering and application mapping, we employ a number of terminologies which are defined as follows:
Definition 5: The cost of assigning a task to a tile $t_i$ in relation to a mapped tile $t_j$ is defined by Equation 13. Thus, the cost is a direct function of inter-tile Manhattan distance, energy, latency, and 3D to 2D router ratio:
Definition 6: Vertex.(0,0,0) is defined as the tile which is closest to the heat sink.
Initially a task cluster is created by the highest bandwidth demanding task. Afterwards, tasks that either form a loop or have a direct communication with this task are assigned to the same task cluster. To map the tasks in a given task cluster to the NoC efficiently, two lists are employed: TaskCluster Masters and TaskCluster Slaves. Every element of the TaskCluster Masters list represents a task with the highest communication volume as a cluster initiator. The elements of the TaskCluster Slaves list, in contrast, are the tasks that either form loops or have a direct communication with elements of the TaskCluster Masters list. The creation of the two lists is initiated by assigning the first element of IniList to TaskCluster Masters. The remaining elements of the initial list (from the second to the last element) are then accessed sequentially. If the element (current task) forms a loop with a member of TaskCluster Masters, it is assigned to the TaskCluster Slaves list. The current task is then tagged with the task ID of the associated element in TaskCluster Masters. On the other hand, if the current task has a direct communication with tsk, a member of TaskCluster Masters, and the number of tasks in TaskCluster Slaves associated with the task ID of tsk is less than a set threshold, it is assigned to the TaskCluster Slaves list. It is then tagged with the task ID of tsk. However, the task is automatically assigned to the TaskCluster Masters list if it does not meet any of these criteria. First, with energy, communication delay and 3D to 2D router ratio as the cost function (Equation 13), each member of the TaskCluster Masters list is mapped to a vertex in each cluster while neighbouring tiles of each vertex are reserved for tasks in the TaskCluster Slaves list. Here, the energy parameter and the 3D to 2D router ratio in the cost function ensure that a minimum number of 3D routers are employed. 
Moreover, inter-tile Manhattan distance and latency as computational arguments in the cost function ensure that high performance 3D routers (short TSV links) are employed within the clusters while satisfying Equation 10. Thus, 3D routers, when inserted in the NoC, are placed in adjacent tiles between the layers where they provide minimum Manhattan distance, and TSVs are inserted as vertical links. The TaskCluster Masters list is necessary for mapping tasks with high communication bandwidth, and vertices are used for balancing traffic while maintaining a uniform distribution across the 3D NoC architecture. On the other hand, the TaskCluster Slaves list ensures minimum communication hops to enhance localization of traffic. To begin the mapping process, the task with the highest communication volume is mapped to Vertex.(0, 0, 0), which is closest to the heat sink (Line 15). Then other tasks of each cluster are mapped to the tiles such that the total energy, communication delay and thermal hotspots are minimized. To ensure minimum inter-task Manhattan distance within the clusters, the tasks are assigned in a unit step to the remaining tiles that have the minimum cost. Thus, a step of +1 or −1 is made along the x, y or z dimension depending on the location of the vertex with minimum cost in the cluster. Thus, beginning with the layer closest to the heat sink and the cluster with the highest communicating task, each task is assigned to a tile. Consequently, tasks with high communication bandwidth are mapped to vertices to balance traffic while maintaining a uniform distribution across the 3D NoC architecture. Moreover, Equations 8 and 10 ensure that the total communication delay, TSVs and total energy constraints are met. Next, vertical links are assigned to tiles with direct interlayer communication in each cluster. To illustrate the mapping process, Fig. 
8 shows an example of an inhomogeneous architecture with a NoC size of 3×3×3 and a total of 3 TSV bundles as an optimized solution generated for the Auto-indust benchmark [56]. Considering hop-count and TSV cost alone, one might argue that an optimized solution would be to map each distinct group of tasks onto its own layer. Thus, the largest group in Fig. 8 can be optimally mapped to one layer. The second and third groups (4 and 5 tasks, respectively) can be mapped to another layer. Finally, the group with six tasks can be mapped to the last layer. Hence there would be no need for any interlayer link. However, intra-layer communication is much slower than inter-layer communication along TSVs. This mapping approach would therefore degrade performance, which in turn would also affect the energy consumption negatively. Hence, as an effective solution, the mapping tool needs to consider both energy and performance effects simultaneously while minimizing the cost of TSVs. We assume that frequently communicating tasks can be mapped within the layers with minimum delay as long as there is only one hop between the tasks. A closer look at the CG graph shows that though tasks 16, 17 and 19 have the highest communication volumes (in both ingress and egress traffic directions), they form an acyclic loop with tasks 15, 18 and 20, while task 21 communicates directly with task 20. Hence the initial mapping phase assigns task 19 as the first element of the first cluster (19, 17, 16, 15, 18, 20) and maps it to the first vertex (the first tile in the lowest layer), while tasks 15-18 and 20 are mapped to the remaining tiles. Then tasks 21-23 are mapped to the immediate planar and inter-planar tiles such that the total cost of the mapped tasks is minimum. Task 2, the first task in the next cluster, has a high communication volume and provides the minimum cost relative to the mapped tiles, so it is mapped to the vertex which has the minimum cost.
The immediate tiles are then assigned accordingly to tasks 0-1 and 3-5. The next task with a high communication bandwidth is task 7, and mapping it to the next vertex gives the minimum energy consumption. Consequently, task 7 is mapped to the next vertex on the last layer, and the neighbouring tiles, which provide a minimum hop-count between the communication-dependent tasks, are assigned to tasks 6, 8 and 9. Next, the final cluster, {10, 11, 12, 13, 14}, is assigned to a region in the remaining tiles that meets the communication bandwidth and the total number of 3D routers constraints. In this case, task 11 is mapped to the vertex instead of task 12 as it generates a more optimized solution with minimum cost. Similarly, Fig. 9 shows an example of an inhomogeneous architecture generated for the D26-dVOPD benchmark [57]. In contrast to the Auto-indust benchmark, the D26-dVOPD benchmark has more inter-task connectivity with more cyclic loops. However, by exploiting the communication loops to balance the traffic in the network, the generated architecture has 3 TSV bundles per layer to provide a good performance to power trade-off.

Algorithm 1, Initialization Phase (Lines 1 to 11):
1  Initialization Phase:
2  IniList ← Sort Ax by communication volume in descending order
3  while (!IniList.Empty()) do
4      current.task = IniList.Next()
5      if (current.task forms a loop with a task in TaskCluster_Masters) then
6          TaskCluster_Slaves.Next() ← current.task  /* add task to TaskCluster_Slaves and tag with the ID of the corresponding task in TaskCluster_Masters */
7      else if (current.task has a direct communication with a task tsk in TaskCluster_Masters and the number of tasks associated with tsk in TaskCluster_Slaves is less than threshold) then
8          TaskCluster_Slaves.Next() ← current.task
9      end
10     else
11         TaskCluster_Masters.Next() ← current.task
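The initialization phase of Algorithm 1 (building the master and slave task lists) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `reachable` and `build_clusters`, the edge-dictionary input format, and the interpretation of "forms a loop" as "lies on a common directed cycle" are all hypothetical, and the threshold is passed in explicitly because its value is a tuning parameter.

```python
from collections import defaultdict

def reachable(adj, src, dst):
    """Return True if dst can be reached from src along directed edges."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def build_clusters(edges, threshold):
    """edges maps (src, dst) -> communication volume (assumed input format).

    Returns (masters, slaves); slaves maps each slave task to the master
    whose ID it is tagged with.
    """
    adj = defaultdict(set)
    volume = defaultdict(int)
    for (src, dst), vol in edges.items():
        adj[src].add(dst)
        volume[src] += vol   # egress volume
        volume[dst] += vol   # ingress volume
    # IniList: tasks in descending order of total communication volume
    # (ties broken by task name so the result is deterministic).
    ini_list = sorted(volume, key=lambda t: (-volume[t], t))
    masters, slaves = [], {}
    for task in ini_list:
        # Assumption: "forms a loop" means both tasks lie on one directed cycle.
        loop_master = next(
            (m for m in masters
             if reachable(adj, task, m) and reachable(adj, m, task)), None)
        if loop_master is not None:
            slaves[task] = loop_master
            continue
        # Direct communication with a master that still has room for slaves.
        direct_master = next(
            (m for m in masters
             if (m in adj[task] or task in adj[m])
             and sum(1 for tag in slaves.values() if tag == m) < threshold),
            None)
        if direct_master is not None:
            slaves[task] = direct_master
        else:
            masters.append(task)  # task starts a new cluster
    return masters, slaves

# Hypothetical five-task graph used only for illustration.
example_edges = {("a", "b"): 10, ("b", "a"): 5, ("b", "c"): 3, ("d", "e"): 1}
example_masters, example_slaves = build_clusters(example_edges, threshold=3)
```

In this toy graph, task b has the highest volume and becomes the first master, a joins it through the loop a to b to a, and c joins it through a direct edge. Lowering the threshold to 1 would instead promote c to a master, since b already has one tagged slave.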

A. Complexity of the Proposed Mapping Algorithm
The aim of this paper is to present a high performance and energy efficient inhomogeneous architecture generation and mapping algorithm which has a low complexity overhead. In Algorithm 1, there are three main loops which must be considered in the worst-case analysis of the proposed mapping technique. The first while loop (Lines 3 to 14), which generates the TaskCluster_Slaves and TaskCluster_Masters lists, has a complexity of O(n), where n is the number of tasks in the given application.
The second while loop (Lines 15 to 32) in the mapping and architecture generation phase has two nested for loops (Lines 17 to 31 and Lines 21 to 28). The run time of the inner for loop depends on the threshold, whereas the outer for loop and the while loop have run times which depend on the number of tasks. Considering the worst possible case of the inner for loop, where the maximum threshold value (6) is employed, the loop (Lines 21 to 28) has a constant run time which is independent of the number of tasks n. In contrast, the outer for loop (Lines 17 to 31) and the while loop depend directly on the unoccupied vertices and TaskCluster_Masters, respectively. In the worst-case scenario, where the tasks in the given application have no communication dependencies, the number of vertices as well as the size of the TaskCluster_Masters list can be considered to be equal to the total number of tasks (n). Within the outer for loop (Lines 17 to 31), the algorithm runs a constant number of instructions which is repeated n times. This is repeated a further n times within the while loop (Lines 15 to 32). Hence the complexity of Lines 15 to 32 (the nested loop) is O(n^2). Algorithm 1 also depends partially on the complexity of the sorting function in Line 2, which is O(n log_2 n). However, in the worst case O(n log_2 n) (Line 2) and O(n) (Lines 3 to 14) are much lower than O(n^2) (Lines 15 to 32). Therefore the complexity of the proposed algorithm is O(n^2).
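The worst-case argument above can be summarized in one line. In the sketch below, T(n) denotes the total run time and c is an assumed symbol for the constant amount of work done by the inner for loop (bounded by the maximum threshold); neither symbol appears in the original analysis.

```latex
\begin{align}
T(n) &= \underbrace{O(n \log_2 n)}_{\text{sorting, Line 2}}
      + \underbrace{O(n)}_{\text{Lines 3 to 14}}
      + \underbrace{n \cdot n \cdot c}_{\text{Lines 15 to 32}} \\
     &= O(n \log_2 n) + O(n) + O(n^2) \\
     &= O(n^2)
\end{align}
```

The quadratic term dominates because n log_2 n grows more slowly than n^2 for large n.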
VI. EVALUATION
In order to evaluate the performance of the proposed technique, a cycle-accurate NoC simulator is used by extending Worm_sim [34], an existing 2D NoC simulator. Our extended simulator employs wormhole packet switching flow control to accurately simulate 3D NoCs with any configuration of 3D and 2D routers. We implemented the NoC components at the Register Transfer Level (RTL) in VHDL using a 40 nm CMOS process technology. The energy consumption of each component is then imported into the simulation platform. The energy consumption of the NoC was estimated using the E_bit energy model [49]. As this method can only be used for a known application with a communication task graph, we have selected both synthetic and realistic benchmarks to evaluate the algorithm under different possible traffic patterns, such as typical heavy traffic loads as well as realistic scenarios similar to the applied benchmarks [34], [56], [57], [58]. In addition, these applications have been used in the literature [3]-[9], [13]-[18], which makes a fair comparison of the impact of the proposed solution feasible. In the simulation, a fixed packet size of 5 flits is used in the NoC model, as the various flit sizes that were tried had a negligible impact on the results. This is expected, as changing the flit size would mainly affect the size of packets rather than the overall communication and energy consumption. To evaluate the performance sustainability and energy of the NoC in real-world scenarios, the following benchmarks are used: a complex multimedia traffic (MMS) [34], [57], Auto-indust and Telecom (from the E3S benchmark suite) [56] and an AV (audio-visual) benchmark [58]. The setup runs for a warm-up period of 2000 cycles, and performance statistics are collected after a simulation length of 200,000 cycles.
Hence, by introducing different delay and energy models of 2D and 3D routers in the system, we have compared the average packet latency and energy consumption.
A. Performance Evaluation of TSV-Aware Mapping Algorithms
First, we compare the performance of the proposed mapping technique with existing ones (Map Core Graph [41], KL Map and TSV place [48], Branch-and-Bound [33], Onyx [38], CastNet [39], Nmap [40]) and Random mapping. For simplicity, we refer to the proposed mapping technique as HetMap. In our mapping approach, 3D routers are treated as an expensive resource during the mapping to automatically generate an inhomogeneous 3D NoC architecture. Similarly, Map Core Graph and KL Map and TSV place exploit mapping algorithms to generate inhomogeneous 3D NoC architectures. Both [41] and [48] estimate the communication cost as a product of the bandwidth requirements of an application and the hop-count between associated cores. Applications are first mapped to a homogeneous 3D NoC. Vertical links are then monitored for flow, and a percentage of the highly utilized vertical links is preserved. The mapping process is then repeated to remap the application onto the newly generated inhomogeneous 3D NoC architecture. While [48] adopts Kernighan-Lin partitioning to solve the mapping problem, [41] maps the tasks in order of their bandwidth and their relative communication cost to the mapped task. The VOPD, DVOPD and 263ENC-MP3DEC benchmarks from their results are selected to evaluate the communication cost, as we have used the same benchmarks in our evaluation. VOPD, DVOPD and 263ENC-MP3DEC have 16 cores (2×4×2), 32 cores (4×4×2) and 12 cores (2×3×2), respectively. In the inhomogeneous 3D NoC architecture generation process in [41], [48], the number of routers with TSVs is restricted to 25%. We have forced HetMap to use the same number of TSVs and NoC dimensions for this set of experiments. The investigated mapping algorithms have a wide range of computational complexities. We have previously extended the original 2D NoC based Branch-and-Bound mapping algorithm to 3D NoCs [6].
Similarly, Nmap, CastNet and Onyx, which were originally proposed for 2D NoCs, have been extended to 3D NoCs for the purpose of our study. In particular, we have also extended Branch-and-Bound [33], Onyx [38], CastNet [39] and Nmap [40] to automatically generate inhomogeneous 3D NoCs by exploiting the initial NoC architecture generation stage of the systematic approach proposed in [6]. Thus the application is mapped with the 3D version of the mapping algorithm under consideration. A full system simulation is then conducted. A constrained number of 3D routers are then placed at nodes with highly utilized vertical links to generate an inhomogeneous 3D NoC. As Branch-and-Bound employs search trees in finding solutions, the complexity of the mapping problem increases exponentially with the number of variables [59], [60]. In contrast, the complexity of HetMap is O(n^2), while CastNet, Onyx and Nmap have a complexity of O(en^2), O(n^2 log_2 n) and O(n^4 log_2 n), respectively. Here, e is the number of edges representing the minimal path candidates of a mapping solution. So far we have compared the inhomogeneous 3D NoCs generated by our mapping algorithm with inhomogeneous 3D NoC architectures which have been mapped with existing mapping algorithms. Table II shows a comparison of the communication cost in terms of bandwidth (identified by the edge labels in the core graph) and hop-count [41], [48] of the above existing approaches and the proposed HetMap application mapping and inhomogeneous 3D NoC architecture generation technique.

TABLE II: COMPARISON OF COMMUNICATION COST OF DIFFERENT TSV-AWARE MAPPING ALGORITHMS

As the communication cost reflects how much data needs to be transferred and how far, the communication cost of an application is defined here as the sum, over all communicating pairs of cores, of the product of the bandwidth requirement of the pair and the number of hops between the routers to which the cores are attached.
We adopted this definition from [40]; it has been widely used, for example in [41], [48]. By extending the existing mapping algorithms with the systematic approach, the communication costs of these techniques are significantly reduced. As can be seen in Table II, CastNet', Onyx', Nmap', [41] and [48] have similar communication costs. Compared to these algorithms, HetMap generates inhomogeneous 3D NoC architectures with lower communication cost (with the improved Branch-and-Bound as its competitor) due to the optimized allocation of TSVs during the application mapping. Moreover, the load balancing approach of application partitioning in HetMap significantly reduces the communication cost among the mapped tasks. In addition to HetMap generating architectures with lower communication cost compared to the approach presented in [48], the complexity of the Kernighan-Lin partitioning algorithm adopted in [48] is much higher (O(n^3)) than that of HetMap (O(n^2)). Moreover, the complex mapping algorithm in [48] must be repeated to successfully generate the inhomogeneous 3D NoC.
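The communication cost metric used for Table II (bandwidth multiplied by hop-count, summed over all communicating pairs) can be illustrated with a short Python sketch. The function names, the input format and the example numbers are hypothetical. The hop-count is assumed here to be the Manhattan distance between tiles, which holds for minimal routing in a fully connected 3D mesh; in an inhomogeneous NoC a packet may have to detour to a tile with a vertical link, so this value should be read as a lower bound.

```python
def hop_count(tile_a, tile_b):
    """Manhattan distance between two (x, y, z) tile coordinates."""
    return sum(abs(p - q) for p, q in zip(tile_a, tile_b))

def communication_cost(bandwidth, placement):
    """Sum of bandwidth * hop-count over all communicating core pairs.

    bandwidth: maps (src_core, dst_core) -> required bandwidth
    placement: maps core -> (x, y, z) tile it is mapped to
    """
    return sum(
        bw * hop_count(placement[src], placement[dst])
        for (src, dst), bw in bandwidth.items()
    )

# Hypothetical three-core application, used only for illustration.
demo_bandwidth = {("t0", "t1"): 70, ("t1", "t2"): 30}
far_placement = {"t0": (0, 0, 0), "t1": (1, 0, 0), "t2": (1, 1, 1)}
near_placement = {"t0": (0, 0, 0), "t1": (1, 0, 0), "t2": (1, 0, 1)}

far_cost = communication_cost(demo_bandwidth, far_placement)    # 70*1 + 30*2
near_cost = communication_cost(demo_bandwidth, near_placement)  # 70*1 + 30*1
```

Moving t2 one tile closer to t1 lowers the cost from 130 to 100, which is the kind of difference between mappings that the metric is meant to capture.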
In addition, we have compared the average CPU time required by each of the techniques to successfully map the investigated benchmarks to 3D NoCs. As shown in Table III, on average, the CPU time required by HetMap is similar to that of CastNet', Onyx', Nmap' and even RandomMap, though HetMap produces 3D NoCs with much lower average packet latency. By employing vertices and load balancing, only a few tiles are considered during cost evaluation and clustering in HetMap. Hence, though several cost functions are considered, HetMap has a short computational time. Although HetMap has a lower complexity than Onyx', its average CPU time in Table III is slightly higher than that of Onyx'. This is because, for a considerably small number of nodes, the computational time for mapping the tasks in the TaskCluster_Slaves list under HetMap (Algorithm 1, Lines 17 to 31) affects the total CPU time. However, for a relatively large number of nodes, the average CPU time required by HetMap is much lower than that of Onyx'. Due to the iterative nature of their application mapping process, [41], [48] have relatively long CPU times. Although Branch-and-Bound' is comparable with HetMap in terms of average packet latency, its computational time is significantly higher (over 20 minutes [33]). Additionally, Branch-and-Bound' employs exhaustive search trees together with bandwidth and loop based mapping, which has a high complexity.

Fig. 10 shows that the proposed mapping technique has lower average packet latency compared to existing heuristic mapping algorithms. This is expected, as packets in HetMap experience shorter delays due to the even distribution of localised traffic in the inhomogeneous architectures. Moreover, the average packet latency of 3D NoCs under HetMap is similar to that of the Branch-and-Bound' mapping algorithm.
It should be noted that Branch-and-Bound' explores all possible task mappings using a search tree and selects the most cost-effective mapping solutions (in this case the cost parameters are the latency and energy). Also, by employing clusters and using vertices to balance traffic in the NoC, the average packet energy consumption of NoCs mapped with HetMap is lower than that of Branch-and-Bound', though they both employ bandwidth and loop mapping. Existing mapping algorithms take the network size and architecture as inputs and generate a suitable task-to-tile mapping based on the given application and configuration. For a fair comparison, the same NoC size and architecture were used in all cases. However, the experimental results of HetMap emphasize the significance of accounting for the NoC size and vertical links during the optimization of task mapping. Fig. 11 summarizes the average packet energy of the mapping algorithms. The lower average packet energy of HetMap is mainly due to the reduced paths and balanced traffic loads in both the planar and inter-planar regions. Fig. 12 shows that by reducing the number of 3D routers while maintaining the regularity of the 3D NoC in the first place, HetMap has a much lower power consumption compared to existing heuristic mapping techniques. Moreover, HetMap and Branch-and-Bound' have similar power consumption, though the same number of 3D routers was used in both cases.
To emphasize the performance benefits of the proposed mapping technique, we have applied a varied set of realistic benchmarks: D_36_4, D_64_4, D_26_media, D_62_pvopd and D_38_tvopd [57]. Similar to Fig. 11, it can be noticed in Fig. 13 that, with the exception of Branch-and-Bound, which has similar performance but higher complexity, HetMap generates 3D NoCs with lower average packet latency compared to existing mapping algorithms. In particular, it can be noted that as the number of nodes and communication dependencies increases, the performance advantage of HetMap over the Onyx, CastNet, Nmap and Random mapping techniques increases, though the 3D NoC architectures in HetMap have fewer 3D routers and TSVs. Moreover as shown in Fig. 14, the
B. Performance Evaluation of Inhomogeneous Architectures
To analyze the performance benefits of inhomogeneous architectures generated by the proposed technique, we performed experiments with various mapping algorithms and existing inhomogeneous architectures [7], [4], [5]. For the existing mapping algorithms, we observed consistent characteristics, with the best performance recorded with the Branch-and-Bound' algorithm. Hence, in this section we present our analysis for existing inhomogeneous architectures which have been mapped with the Branch-and-Bound' mapping algorithm and compare them with inhomogeneous architectures generated by the proposed technique. Fig. 16 summarizes the average packet latency of various inhomogeneous architectures under different realistic benchmarks. By evenly balancing the intra-layer and inter-layer traffic loads and optimizing the number of 3D routers while minimizing the average hop-count, HetMap generates inhomogeneous architectures which have much lower average packet latencies compared to existing inhomogeneous architectures. This is expected: although existing hop-count based inhomogeneous architectures have evenly distributed 3D routers, they apply fixed 3D placement algorithms, which result in varying packet delays under different application benchmarks. The proposed technique, however, considers the traffic dynamics, communication and power constraints, generates an optimized inhomogeneous architecture with a minimum number of TSVs, and efficiently maps applications to the architecture. In particular, Fig. 17 shows that the cost, in terms of the number of TSVs and 3D routers, of inhomogeneous architectures generated by the proposed technique is much lower than that of the existing inhomogeneous architectures. Moreover, Figs. 18 and 19 show that the architectures generated by the proposed technique have lower average packet energy and are more power efficient compared to the existing inhomogeneous architectures.
This is mainly due to the fact that HetMap generates inhomogeneous architectures with fewer 3D routers and more evenly distributed traffic with lower average hop-counts compared to existing inhomogeneous architectures.
C. Performance Evaluation of TSV Reduction Scheme
Inhomogeneous architectures investigated in Section VI-B reduce the area overhead and the manufacturing cost of the NoC by combining 2D and 3D routers in a 3D NoC. Thus, by introducing 2D routers in the NoC, the total number of TSVs is reduced. According to [25], if the number of TSVs exceeds a threshold, serialization of a certain degree is adopted to reduce the total number of TSVs. Here, the bandwidth of the TSVs is reduced by the degree of serialization, and transceivers are employed at the router interface to the TSVs for repacketization. In this section, we compare the average packet latency and power consumption of architectures generated by the TSV reduction technique proposed in HetMap, the serialization technique and existing inhomogeneous architectures. For a fair comparison, a serialization degree of 2 (64-bit vertical links reduced to 32-bit links) is used. Thus, the architectures contain 50% of the TSVs of a homogeneous 3D NoC of the same dimensions. Fig. 20 shows the average packet latency of various TSV reduction techniques at various traffic loads. It can be seen that when the bandwidth of the TSVs is reduced with serialization logic, the average packet latency of the NoC is much higher than that of other schemes with an equivalent number of TSVs. HetMap, on the other hand, saturates at a higher traffic load compared to other schemes. This is because HetMap generates architectures with evenly distributed 2D and 3D routers based on the dynamics of the NoC traffic pattern. It can be seen in Fig. 21 that serialization causes an increase in the total power consumption. The power consumption of serialization is even higher than that of the homogeneous 3D mesh. This is because the power consumption overhead of the serialization receiver and transmitter logic is much higher than the power savings on the TSVs.
However, other TSV reduction schemes such as HetMap, 2-columns and chess have a much lower average power consumption compared to the homogeneous 3D mesh. This is expected, as these architectures have a reduced number of 3D routers which are uniformly distributed in the NoC to reduce the total average hop-count.
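The "equivalent number of TSVs" comparison in this section can be made concrete with a small arithmetic sketch in Python. It is an illustration under simplifying assumptions, not the evaluation model used in the experiments: only data wires are counted (one TSV per bit, control signals ignored), a flit is assumed to be as wide as the original link, and the link count of 16 is a hypothetical example. The function names are likewise hypothetical.

```python
def data_tsvs(num_vertical_links, link_width_bits, serialization_degree=1):
    """Count data TSVs when every vertical link is narrowed by the given degree."""
    if link_width_bits % serialization_degree != 0:
        raise ValueError("link width must be divisible by the serialization degree")
    return num_vertical_links * (link_width_bits // serialization_degree)

def cycles_per_flit(serialization_degree):
    """Link cycles needed to push one full-width flit through a serialized link."""
    return serialization_degree

# Hypothetical NoC with 16 vertical links of 64 bits each.
homogeneous = data_tsvs(16, 64)        # every link at full width
serialized = data_tsvs(16, 64, 2)      # all links kept, each halved to 32 bits
pruned = data_tsvs(8, 64)              # half of the links kept at full width
```

Both reduction styles end up with 50% of the homogeneous TSV count, but under serialization each flit needs two link cycles to cross a layer, whereas the remaining full-width links still transfer a flit per cycle. This is one way to read the latency gap reported for the serialization scheme.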
VII. CONCLUSION
An application mapping algorithm is proposed for inhomogeneous 3D NoCs in this work to improve the efficiency in terms of energy and communication delays for emerging SoC design. The proposed algorithm employs bandwidth-constrained and loop based mapping to minimize the power consumption of NoCs. Moreover, the proposed algorithm efficiently generates optimized inhomogeneous architectures with a limited number of 3D routers without exhaustively searching all possible solutions. Experimental analysis under various realistic case studies shows that 3D NoCs mapped by the proposed approach have much lower average packet delay and energy compared to existing heuristic mapping algorithms (Map Core Graph, KL Map and TSV place, CastNet, Nmap and Onyx). Moreover, the proposed approach is more efficient than Branch-and-Bound mapping in terms of total power consumption, though they have similar average packet delays. Also, the application mapping runtime of the proposed technique is comparable to that of CastNet, Nmap and Onyx, and much lower than that of Map Core Graph, KL Map and TSV place and Branch-and-Bound. Specifically, the proposed algorithm has lower complexity compared to the existing inhomogeneous 3D NoC mapping algorithms (Map Core Graph, KL Map and TSV place). By evenly redistributing the traffic while localizing the communication dependencies, the proposed technique generates inhomogeneous 3D NoCs with a minimum number of 3D routers, which have lower average packet latencies and higher energy efficiency.
