# Design of TSV-Sharing Topologies for Cost-Effective 3D Networks-on-Chip

Poona Bahrebar Department of Electronics and Information Systems (ELIS), Ghent University Sint-Pietersnieuwstraat 41, Ghent, Belgium poona.bahrebar@ugent.be

# ABSTRACT

The Through-Silicon Via (TSV) technology has led to major breakthroughs in 3D stacking by providing higher speed and bandwidth, as well as lower power dissipation, for inter-layer communication. However, current TSV fabrication suffers from a considerable area footprint and yield loss. Thus, the number of TSVs must be restricted in order to design cost-effective 3D on-chip networks. This critical issue can be addressed by clustering the network such that all of the routers within each cluster share a single TSV pillar for vertical packet transmission. In some of the existing topologies, additional cluster routers are added to the mesh structure to handle the shared TSVs. However, they impose either performance degradation or power/area overhead on the system. Furthermore, the resulting architecture is no longer a mesh. In this paper, we redefine the clusters by replacing some routers in the mesh with cluster routers, such that the mesh structure is preserved. The simulation results demonstrate that the proposed models strike a better balance between performance and cost.

# CCS Concepts

# Keywords

Three-dimensional Network-on-Chip (3D NoC), inter-layer communication, network topology, Through-Silicon Via (TSV)

# 1. INTRODUCTION

The Network-on-Chip (NoC) paradigm, which emerged from applying interconnection networks to the realm of multiprocessors, is an efficient communication architecture for interconnecting various Intellectual Property (IP) cores implemented on a single silicon chip [1]. In 3D NoCs [2], multiple 2D planes are vertically stacked and interconnected via special inter-layer connections such as *Through-Silicon Vias (TSVs)* [3] tunneling through them. The main benefits of 3D NoCs include higher performance and lower power consumption due to the reduced global interconnect length, a smaller footprint due to the efficient utilization of the third dimension, and support for the realization of mixed-technology chips [4]. As the amount of inter-layer communication increases, the number of interconnect TSVs is also expected to grow. However, the fabrication of TSVs faces severe yield losses in different manufacturing stages [5]. Moreover, since each TSV requires a pad for bonding to a wafer layer, this leads to a scenario in which the yield loss and area footprint of TSVs can no longer be ignored [6].

*NoCArc* '15, *December 05 2015*, *Waikiki*, *HI*, *USA* © 2015 ACM. ISBN 978-1-4503-3963-6/15/12...\$15.00 DOI: http://dx.doi.org/10.1145/2835512.2835514

Dirk Stroobandt Department of Electronics and Information Systems (ELIS), Ghent University Sint-Pietersnieuwstraat 41, Ghent, Belgium dirk.stroobandt@ugent.be

In [7], two novel architectures were proposed to address this major concern in 3D platforms: *CIT (Concentrated Inter-Layer Topology)* and *CMIT (Clustered Mesh Inter-Layer Topology)*. The concept behind both architectures is to add cluster routers to each layer of the network in order to handle the vertical communication. Thus, TSVs are shared by adjacent routers and are no longer required by all of the routers. However, the resulting structure is no longer a mesh. CIT decreases the power and area overhead significantly by removing the classic routers. This, in turn, results in a major performance degradation. CMIT, on the other hand, keeps the classic routers to maintain the performance of the system. However, the power and area overhead rises due to the increased number of routers.

The main motivation of this work is to reduce the number of TSVs while providing a balanced trade-off between the performance degradation and the area/power gain in 3D NoCs. To do so, we exploit the clustering technique, similar to CIT and CMIT. However, the clusters are carefully defined such that: (1) The proposed models are highly compatible with the mesh topology, the primary structure for NoCs. Hence, there is no need to redesign the switch architectures, routing protocols, etc. (2) The proposed models offer a balanced solution between the performance loss and the cost gain.

The rest of the paper proceeds as follows. The background and related works are studied in Sections 2 and 3. The proposed models and simulation results are presented in Sections 4 and 5. The conclusions are drawn in Section 6.

# 2. BACKGROUND

#### 2.1 TSV (Through-Silicon Via) Technology

Wafer stacking relies on TSVs as a promising solution to realize high-performance 3D ICs [7,8]. The impact of using TSVs in 3D stacking is twofold [9]: (1) The implementation of long ($\approx 1000 \ \mu\text{m}$ or more) inter-layer connections with short (ranging from 5 to 50 $\mu\text{m}$) and wide TSVs reduces the total wire length [4]. This, in turn, translates into high bandwidth, low transmission delay, and low power [8]; (2) On the other hand, TSV manufacturing technologies are reported to be challenged by the yield, which drops dramatically as the fabrication density increases [5]. Worse still, the total die area is increased since the silicon area where TSVs punch through may not be utilized for building devices or connections [9]. More precisely, TSVs are capped with pads for bonding to a wafer level in order to compensate for bonding alignment inaccuracies. The dimension of the bonding pads is usually much greater than that of the TSVs, ranging from $1 \times 1 \ \mu\text{m}^2$ to $10 \times 10 \ \mu\text{m}^2$ [4]. Moreover, TSVs should be placed at a minimum distance from the nearest component in order to reduce the strain and coupling effects. This distance is referred to as the TSV pitch, which is the distance between the centers of two neighboring TSVs [2]. Fig. 1 illustrates the cross-section view of the TSV pad and pitch.

As an example, consider a 3D NoC with one hundred 64-bit TSVs between layers. Assuming TSV pad dimensions of $10 \times 10 \ \mu\text{m}^2$ and a pitch of 16 $\mu\text{m}$, the TSV footprint in each layer will be 1.6 mm<sup>2</sup>, which is equivalent to the size of a computation core. Although the TSVs, unlike cores, are spread out in each layer, their area footprint cannot be ignored [6]. This is even more pronounced with the increasing number of IP cores, which necessitates more TSVs to handle the inter-layer communication [7].
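The arithmetic behind this estimate can be sketched as follows. The sketch assumes one TSV per bit and that each TSV reserves a square of pitch × pitch; the function and constant names are illustrative and do not come from the paper.

```python
# Worked version of the footprint estimate, using the values from the example.
NUM_LINKS = 100      # vertical links between two layers
BITS_PER_LINK = 64   # assumption: one TSV per bit
PITCH_UM = 16.0      # center-to-center TSV spacing in micrometres


def tsv_footprint_mm2(num_links: int, bits_per_link: int, pitch_um: float) -> float:
    """Area reserved for TSVs in one layer, assuming a pitch x pitch square per TSV."""
    num_tsvs = num_links * bits_per_link
    area_um2 = num_tsvs * pitch_um ** 2
    return area_um2 / 1e6  # 1 mm^2 = 1e6 um^2


footprint = tsv_footprint_mm2(NUM_LINKS, BITS_PER_LINK, PITCH_UM)
print(f"{footprint:.2f} mm^2")  # 1.64 mm^2
```

Under these assumptions the 6,400 vias reserve about 1.64 mm<sup>2</sup>, in line with the 1.6 mm<sup>2</sup> quoted above.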

#### 2.2 3D NoCs

Several architectures for 3D NoCs are proposed in the literature [2, 7, 10–12] to efficiently exploit the achievable performance benefits arising out of adopting 3D technology.

The 3D Mesh is the straightforward extension of the 2D mesh architecture. It is also called 3D Symmetric since both intra- and inter-layer communication is performed by hop-by-hop traversal. Despite its simplicity, it has two major inherent drawbacks: First, it does not exploit the desirable property of negligible inter-layer distance in 3D chips since the inter- and intra-layer hops are indistinguishable. Second, it employs seven-port switches (Fig. 2(a)): one port to the IP core, one to each of the four neighboring switches in the same layer, and two to the neighboring switches above and below. As reported in [12], the power consumption of a $7 \times 7$ crossbar is approximately 2.24 times more than that of a $5 \times 5$ crossbar [10,13].

The 3D Stacked Mesh or 3D Hybrid is a hybrid between a packet-switched network and a bus. In this architecture, layers are connected using a bus spanning the entire vertical distance of the chip. The overall length of the bus is small since the distance between the individual layers is extremely small. Thus, it is an appropriate candidate for communication in the Z-dimension. As shown in Fig. 2(b), a switch in 3D Hybrid has six ports: one to the IP block, one to each neighboring switch in the same layer, and one to the bus. Hence, it is less power-hungry and occupies less area compared with the 3D Symmetric switch. Although moving from one layer to any other layer takes only one hop, this structure suffers from the inherent limitation of buses, which do not support concurrent communications [7, 10, 13].

Figure 1: TSV bonding pad and pitch.

Figure 2: Switch structure in the (a) 3D Symmetric, and (b) 3D Hybrid architectures.

As surveyed in [13], the Xbar-connected Network-on-Tiers (XNoTs), Dimensionally-Decomposed (DimDe) router, and 3D MIRA are other classes of multi-layered topologies which were designed to make the best use of the short delay and high density of inter-wafer links. The design of application-specific 3D NoCs for custom System-on-Chip (SoC) architectures is also investigated in [14].

# 3. RELATED WORK

According to [5], there is a wide utilization gap between the vertical and horizontal links in 3D Symmetric networks with limited layers. The TSV underutilization makes it suitable to adopt the clustering technique which allows the adjacent routers to share the TSVs. The following topologies [7] were proposed to place constraints on the number of TSVs.

#### 3.1 CIT (Concentrated Inter-Layer Topology)

Fig. 3(a) illustrates a CIT architecture with 36 IP cores in each layer where every four IP cores are grouped into a cluster (dashed box), forming nine clusters in each layer of the network. Unlike a mesh where each router is connected to a dedicated IP core, the *cluster router* and its corresponding TSV is shared between multiple IP cores. Thereby, the number of routers is reduced which in turn decreases the number of vertical channels and hop count. The connections between different IP cores, whether horizontal or vertical, are established through the cluster routers. If the 3D Symmetric architecture is employed, the cluster router has at most 10 ports (four to the IP cores, four to the neighbor clusters in the same layer, and two to the above and below clusters). In the 3D Hybrid architecture, the number of ports is nine [7].

The CIT cluster router consumes more area and power because of the larger number of input ports compared with the classic routers. However, since the number of routers is reduced in CIT, the power dissipation of the network is diminished. Furthermore, the area of the network is decreased not only due to the smaller number of routers, but also due to the considerable reduction of the TSV area footprint. Moreover, the packets sent between any two IP cores in the same cluster have to pass through just one router which results in a fast data transmission. On the other hand, the increased router complexity and contention probability may also lead to a performance bottleneck because there are more input ports competing for an output port inside the router [7].



Figure 3: (a) CIT, and (b) CMIT clustered architectures [7].

#### 3.2 CMIT (Clustered Mesh Inter-Layer Topology)

The CMIT architecture (Fig. 3(b)) was proposed to meet the constraints on the number of TSVs while preserving the advantages of a mesh. Unlike CIT, in which the cluster routers are responsible for both intra- and inter-layer packet transmission, the routers in CMIT are classified into classic routers and cluster routers. The intra-layer communication is performed by the classic routers. Hence, each classic router has at most six ports: one to the dedicated IP core (not shown in the figure for simplicity), one to the cluster router, and four to the neighboring classic routers. Each cluster in CMIT consists of four classic routers sharing a cluster router. Thus, the cluster routers' sole duty is to establish the vertical connections. Each cluster router has six or five ports depending upon whether the architecture being used is Symmetric or Hybrid. Employing a greater number of routers in CMIT results in lower latency, and higher area and power consumption compared to the CIT topology [7].

The specifications of CIT and CMIT are summarized in Table 1 for each layer of Fig. 3. It is worth noting that both CIT and CMIT can be implemented as 3D Symmetric or Hybrid structures.
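The per-layer router and port counts described for CIT and CMIT can be reproduced with a short sketch. It assumes the 6 × 6 layer of Fig. 3 with four IP cores per cluster; the class and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

CORES_PER_LAYER = 36    # 6 x 6 IP cores per layer, as in Fig. 3
CORES_PER_CLUSTER = 4   # every four IP cores form a cluster


@dataclass(frozen=True)
class LayerSpec:
    classic_routers: int
    cluster_routers: int
    cluster_ports_symmetric: int  # two vertical ports (up and down)
    cluster_ports_hybrid: int     # one vertical port (to the bus)

    @property
    def total_routers(self) -> int:
        return self.classic_routers + self.cluster_routers


def cit_layer() -> LayerSpec:
    clusters = CORES_PER_LAYER // CORES_PER_CLUSTER
    in_layer_ports = 4 + 4  # four IP cores + four neighboring clusters
    return LayerSpec(0, clusters, in_layer_ports + 2, in_layer_ports + 1)


def cmit_layer() -> LayerSpec:
    clusters = CORES_PER_LAYER // CORES_PER_CLUSTER
    in_layer_ports = 4  # four classic routers served by the cluster router
    return LayerSpec(CORES_PER_LAYER, clusters, in_layer_ports + 2, in_layer_ports + 1)


for name, spec in (("CIT", cit_layer()), ("CMIT", cmit_layer())):
    print(name, spec.total_routers, spec.cluster_ports_symmetric, spec.cluster_ports_hybrid)
```

The printed values (9 routers with 10/9 ports for CIT, 45 routers with 6/5 ports for CMIT) match the corresponding columns of Table 1.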

# 4. PROPOSED CLUSTERED TOPOLOGIES FOR 3D NOCS

As discussed previously, reducing the number of TSVs was the main concern in the design of CIT and CMIT. To achieve this goal, the cluster routers were embedded in each layer as a replacement for, or in addition to, the classic routers in the mesh topology. Due to the performance degradation in CIT, we concentrate our discussion on CMIT, where both cluster and classic routers contribute to routing. The main idea behind CMIT is to add the cluster routers to the conventional mesh architecture. However, this not only increases the total number of routers, but also modifies the mesh structure. Since the mesh structure is widely used for NoC designs due to its simplicity, layout efficiency, and good electrical properties [15], our main motivation was to design an architecture which (1) is more compatible with a mesh, (2) decreases the number of vertical inter-layer links, and (3) maintains a better trade-off between different design criteria. Utilizing the concept of clustering and TSV-sharing, we propose two models to fulfill these objectives.

#### 4.1 Clustered Model A

We take advantage of network clustering as an efficient approach to reduce the number of TSVs. Since we attempt to keep the modifications in the mesh topology to a minimum, we do not add the cluster routers to the conventional mesh structure. Instead, we replace some of the classic routers with the cluster routers, as shown in Fig. 4(a). As a result, the total number and physical location of the routers remains the same as a  $6 \times 6$  mesh topology.

As can be seen in the figure, each cluster is composed of nine routers: one cluster router and eight classic routers. Thus, four clusters are formed in each layer of this $6 \times 6 \times 2$ mesh-based NoC, which translates to four vertical connections. Similar to CMIT, cluster and classic routers handle the inter- and intra-layer communication, respectively.<sup>1</sup> A classic router has five ports (four to the neighboring routers, and one to the IP core), like the routers in a 2D mesh NoC. As listed in Table 1, a cluster router has six ports in the Symmetric architecture: four to the neighboring classic routers, and two to the cluster routers in the layers above and below. Otherwise, the number of ports is five in a Hybrid architecture.

<sup>1</sup> Note that the total number of IP cores can be preserved in the proposed models by changing the cluster routers to handle the intra-layer communication, as well.

Figure 4: The proposed (a) model A, and (b) model B clustered architectures.

Table 1: Specifications of the topologies in each layer

| Architecture                                  | CIT             | CMIT            | Proposed Model A | Proposed Model B |
|-----------------------------------------------|-----------------|-----------------|------------------|------------------|
| No. Classic routers                           | 0               | 36              | 32               | 32               |
| No. Cluster routers                           | 9               | 9               | 4                | 4                |
| Total No. routers                             | 9               | 45              | 36               | 36               |
| Max. No. Ports/Classic router                 | -               | 6               | 5                | 6                |
| Max. No. Ports/Cluster router in 3D Symmetric | 10              | 6               | 6                | 10               |
| Max. No. Ports/Cluster router in 3D Hybrid    | 9               | 5               | 5                | 9                |
| Intra-layer communication                     | Cluster routers | Classic routers | Classic routers  | Classic routers  |
| Inter-layer communication                     | Cluster routers | Cluster routers | Cluster routers  | Cluster routers  |

#### 4.2 Clustered Model B

The drawback of the proposed Model A is the nonuniform access of the classic routers to the cluster router. As can be seen in Fig. 4(a), in each cluster, the four classic routers located to the North, East, South, and West of the cluster router have immediate access (i.e., a one-hop distance) to the cluster router. However, the remaining four classic routers are two hops away from the cluster router, which has a negative impact on the communication delay, especially when the inter-layer traffic is high. One solution is the communication-aware placement of tasks such that the cores with heavy inter-layer communication are placed next to the cluster routers. Another alternative is the architectural modification of Model A to provide direct access to the diagonal routers through additional wiring. This topology, which will be called Model B, is illustrated in Fig. 4(b).
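The difference in access distance between the two models can be illustrated with a small sketch. It assumes a 3 × 3 cluster with the cluster router at the center, which is consistent with the four one-hop and four two-hop neighbors described above; the function names are illustrative only.

```python
from collections import Counter


def hops_to_cluster_router(dx: int, dy: int, diagonal_links: bool) -> int:
    """Hops from a classic router at offset (dx, dy) to the cluster router at (0, 0)."""
    if diagonal_links and abs(dx) == 1 and abs(dy) == 1:
        return 1  # Model B: direct wire from a diagonal router
    return abs(dx) + abs(dy)  # plain mesh (Manhattan) distance, as in Model A


def hop_histogram(diagonal_links: bool) -> Counter:
    """Count the classic routers of one cluster by their distance to the cluster router."""
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    return Counter(hops_to_cluster_router(dx, dy, diagonal_links) for dx, dy in offsets)


model_a = hop_histogram(diagonal_links=False)  # four routers at 1 hop, four at 2 hops
model_b = hop_histogram(diagonal_links=True)   # all eight routers at 1 hop
print(dict(model_a), dict(model_b))
```

In this simplified view, the average distance to the cluster router drops from 1.5 hops in Model A to 1 hop in Model B.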

As listed in Table 1, the number of routers, and thereby the number of TSVs, in Model B is the same as in Model A. The classic routers in Model B have at most six ports. The number of ports of the cluster router is increased to 10 or 9, depending on the utilized (Symmetric or Hybrid) topology.

# 5. SIMULATION RESULTS

In this section, the efficiency of the proposed models is evaluated and discussed in terms of the communication latency, power dissipation, and area overhead. The simulations were conducted using a modified version of the BookSim 2.0 cycle-based network simulator [16]. The adopted packet switching technique is wormhole. The network configuration parameters are shown in Table 2. All of the implemented networks exploit a 3D Hybrid architecture with a conventional dTDMA bus [10] for vertical connections.

The e-cube routing algorithm was modified to fit the 3D topologies. Each input port of the routers has 2 virtual channels [1] to avoid deadlock. The network was warmed up for 10,000 cycles and then the results were averaged over the next 100,000 cycles, each with a distinct random initial value to ensure a fair comparison between the methods.
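As a point of reference, a generic dimension-ordered (e-cube style) next-hop computation for a 3D mesh might look like the sketch below. This is not the modified algorithm used in the simulations, whose details are not given here: it ignores that a bus crosses any number of layers in a single hop and that, in the clustered topologies, packets must first reach a cluster router. The names `next_hop` and `route` are hypothetical.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, int]  # (x, y, z), with z as the layer index


def next_hop(current: Coord, dest: Coord) -> Optional[Coord]:
    """Next node on a dimension-ordered (X, then Y, then Z) path, or None on arrival."""
    for axis in range(3):
        if current[axis] != dest[axis]:
            step = 1 if dest[axis] > current[axis] else -1
            nxt = list(current)
            nxt[axis] += step
            return (nxt[0], nxt[1], nxt[2])
    return None


def route(src: Coord, dest: Coord) -> List[Coord]:
    """Full path from src to dest, including both endpoints."""
    path = [src]
    while (nxt := next_hop(path[-1], dest)) is not None:
        path.append(nxt)
    return path


print(route((0, 0, 0), (2, 1, 1)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 1, 1)]
```

Because each dimension is fully resolved before the next one is entered, the order in which channels are acquired is fixed, which is the property dimension-ordered schemes rely on to avoid cyclic dependencies in a mesh.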

#### 5.1 Performance Analysis

The performance of the proposed models is evaluated under the uniform, hotspot 20%, and Rentian [17] traffic patterns. For the Rentian traffic, the Communication Probability Distribution (CPD) determines the locality of the traffic in the network [17] which is set to 70% in our simulations.



Figure 5: Performance comparison under uniform (left), hotspot (middle), and Rentian (right) traffic profiles.

The average latency curves of the five topologies as a function of the network's request rate are plotted in Fig. 5. Note that the request rate is defined as the ratio of the successful read/write request injections into the network interface over the total number of injection attempts [7]. As demonstrated in the figure, the lowest average latency under low traffic load belongs to CIT. This is because CIT reduces the average hop count and, thereby, the communication latency. However, as the injection rate grows and the network becomes overloaded, the performance of CIT degrades considerably, leading to the highest average latency among the methods studied in this paper. This performance degradation is due to the lower bandwidth in CIT compared with the mesh-based architectures. More precisely, since the number of links in CIT is much smaller than in the other structures, the contention probability in CIT links is much higher, which is more pronounced under high traffic loads [7]. As can be seen in the rightmost plot, the performance of CIT is more stable under the Rentian traffic profile, making it a better candidate for local traffic where most of the communication occurs between neighboring nodes.

As can be expected, the 3D Hybrid consistently achieves the lowest average network latency across all traffic patterns under high traffic loads. This is achieved by employing a larger number of vertical communication links (i.e., 36) such that each router is directly connected to the adjacent layers, and no extra hop is required in order to vertically forward the packets. CMIT, which employs 9 vertical channels, is the second-best architecture in terms of average latency.

The ratio of cluster routers responsible for the classic routers is 1/4 in CMIT, while it is 1/8 in both of the proposed models. Thus, the cluster routers in our methods have to handle more requests which leads to an increased contention in high traffic conditions. Moreover, by replacing some of the typical routers with the cluster routers, the number of available paths between many pairs of source and destination nodes is reduced. As a result, the proposed models cannot compete with CMIT in reducing the average latency.
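The sharing ratios quoted above follow directly from the per-layer router counts in Table 1, as the following sketch shows. The dictionary and function names are illustrative only.

```python
from fractions import Fraction

# (cluster routers, classic routers) per layer, taken from Table 1
ROUTERS = {
    "CMIT": (9, 36),
    "Model A": (4, 32),
    "Model B": (4, 32),
}


def cluster_to_classic_ratio(name: str) -> Fraction:
    """Number of cluster routers available per classic router."""
    cluster, classic = ROUTERS[name]
    return Fraction(cluster, classic)


for name in ROUTERS:
    print(name, cluster_to_classic_ratio(name))
# CMIT 1/4
# Model A 1/8
# Model B 1/8
```

Each cluster router in the proposed models therefore serves twice as many classic routers as in CMIT.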

Table 2: Simulation parameters

| Parameter             | Value                              |
|-----------------------|------------------------------------|
| Network size          | $6 \times 6 \times 6$ 3D network   |
| Topology              | Mesh, CIT, CMIT, Model A & Model B |
| Data width            | 32 bits                            |
| Buffer & Message size | 8 & 16 flits                       |

However, as discussed previously, they are both superior to CIT due to their larger number of communication links. The results also confirm that Model B achieves better performance than Model A. This improvement is mainly due to the additional links in Model B, which provide direct access from the diagonal routers to the cluster router. This, in turn, eliminates the extra hop that packets from the diagonal routers must take in Model A in order to reach the cluster router.

For a realistic traffic analysis, we carried out trace-driven simulations of SPLASH-2 benchmarks on a $6 \times 6 \times 3$ network. The network is configured such that 36 processors are placed on the first layer while 72 shared L2 cache nodes are distributed in the remaining layers, with the system configuration parameters similar to [7]. The normalized results are presented in Fig. 6. It can be noticed that for some applications with lighter network loads, such as *fft* and *radix*, the performance of the proposed Model B is close to that of CMIT ($\approx 2\%$ and 1% worse, respectively).

#### 5.2 Power Analysis

The power dissipation of the topologies (including the communication channels, bus arbiters, input buffers, router control logic, and output control modules) was calculated using an extended version of the high-level NoC power simulator presented in [18] with an operating point of 200 MHz and a supply voltage of 1 V. Leakage power was included for channels, buffers, and switches. The results listed in Table 3 were obtained near the saturation points.



Figure 6: Performance for application traces.

Table 3: Average power dissipation (W)

| Architecture | Uniform | Hotspot | Rentian |
|--------------|---------|---------|---------|
| 3D Hybrid    | 11.51   | 14.96   | 9.01    |
| CIT          | 9.57    | 12.86   | 5.55    |
| CMIT         | 10.83   | 14.55   | 8.14    |
| Model A      | 9.99    | 13.42   | 6.41    |
| Model B      | 10.43   | 14.01   | 6.79    |

According to the results, CIT consumes less power than the other four topologies for several reasons. The number of routers in each CIT layer is 9, which is much smaller than in the other architectures. Although the cluster routers in CIT consume more power due to the greater number of input/output ports, the bandwidth is lower, which results in less power dissipation, especially when the source and destination nodes are close (i.e., for Rentian traffic).

The 3D Hybrid is the most power-hungry structure since it does not exploit the TSV-sharing approach and employs a large number of vertical connections. Our proposed models are both more power-efficient than CMIT. This is due to a better clustering technique, which translates to a smaller number of routers and vertical links. The additional links and ports in the cluster routers of Model B are the source of its higher power consumption compared with Model A.
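To make the comparison with CMIT concrete, the relative power savings can be derived from the values in Table 3. The sketch below is only a convenience calculation over the reported numbers; the helper name `saving_percent` is hypothetical.

```python
# Average power dissipation (W) near saturation, copied from Table 3
POWER_W = {
    "3D Hybrid": {"Uniform": 11.51, "Hotspot": 14.96, "Rentian": 9.01},
    "CIT":       {"Uniform": 9.57,  "Hotspot": 12.86, "Rentian": 5.55},
    "CMIT":      {"Uniform": 10.83, "Hotspot": 14.55, "Rentian": 8.14},
    "Model A":   {"Uniform": 9.99,  "Hotspot": 13.42, "Rentian": 6.41},
    "Model B":   {"Uniform": 10.43, "Hotspot": 14.01, "Rentian": 6.79},
}


def saving_percent(candidate: str, baseline: str, traffic: str) -> float:
    """Power saved by `candidate` relative to `baseline`, in percent."""
    return 100.0 * (1.0 - POWER_W[candidate][traffic] / POWER_W[baseline][traffic])


for model in ("Model A", "Model B"):
    for traffic in ("Uniform", "Hotspot", "Rentian"):
        print(f"{model} vs CMIT, {traffic}: {saving_percent(model, 'CMIT', traffic):.1f}%")
```

Under uniform traffic, for example, this gives a saving of roughly 7.8% for Model A and 3.7% for Model B relative to CMIT.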

#### 5.3 Area Analysis

The area overhead of a chip is strongly affected by the number of routers and vertical links. By taking advantage of an efficient clustering approach to share the TSVs in the proposed models, not only does the number of routers remain the same as in a classic mesh, but the number of TSVs is also reduced. To assess the hardware cost, the routers were modeled in VHDL and synthesized with Synopsys Design Compiler using the CMOS 65 nm LPLVT STMicroelectronics standard cells. Similar to [7], the pad size for TSVs is assumed to be 5 $\mu\text{m}^2$ with a pitch of around 8 $\mu\text{m}$.

The TSV area footprint and total area are listed in Table 4. According to the results, our proposed models can efficiently alleviate the TSV area footprint compared with the remaining methods. The area saving for the TSV footprint in Model A is around 53% and 88% compared with CMIT and 3D Hybrid, respectively. The hardware overhead of this model is approximately 23% and 47% less than that of CMIT and 3D Hybrid, as well. Although Model B is not as area efficient as Model A, it can also outperform CMIT and 3D Hybrid in terms of area overhead. Note that the total network area required by CIT is smaller than that of the other architectures since the network is formed only by the cluster routers [7]. However, the area overhead of the proposed Model A is marginal compared with CIT.
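The quoted savings can be checked against the Table 4 entries with a short calculation. The sketch below simply recomputes the percentages from the reported (rounded) areas; the names are illustrative only.

```python
# (TSV area, total area) in mm^2, copied from Table 4
AREA_MM2 = {
    "3D Hybrid": (0.61, 4.21),
    "CIT":       (0.15, 1.81),
    "CMIT":      (0.15, 2.89),
    "Model A":   (0.07, 2.21),
    "Model B":   (0.07, 2.39),
}


def area_saving_percent(candidate: str, baseline: str, total: bool = False) -> float:
    """Area saved by `candidate` relative to `baseline`, in percent.

    total=False compares the TSV footprint; total=True compares the total area.
    """
    idx = 1 if total else 0
    return 100.0 * (1.0 - AREA_MM2[candidate][idx] / AREA_MM2[baseline][idx])


for baseline in ("CMIT", "3D Hybrid"):
    tsv = area_saving_percent("Model A", baseline)
    tot = area_saving_percent("Model A", baseline, total=True)
    print(f"Model A vs {baseline}: TSV {tsv:.1f}%, total {tot:.1f}%")
```

The results agree with the percentages stated above to within the rounding of the table entries (about 53% and 88% for the TSV footprint, and about 23% and 47% for the total area).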

# 6. CONCLUSION AND FUTURE WORK

The remarkable yield loss and area overhead of TSVs emerge as a critical concern in large 3D NoCs. Clustering appears to be a promising solution to impose constraints on the number of TSVs. Although the performance is degraded as several routers share a vertical communication link, clustered architectures are able to offer better area and power efficiency for the same reason.

In this paper, two clustered architectures are proposed for 3D NoCs to develop a cost-effective design. The main advantages of the proposed structures over the existing topologies can be summarized as: (1) The modifications to the mesh topology are kept to a minimum; (2) The proposed models are able to reduce the TSV footprint by providing a pertinent compromise between the power and area overhead and the performance penalty, as confirmed by the results. Exploring the scalability and heat dissipation for the proposed approach will be the subject of future research.

Table 4: Area overhead ($\text{mm}^2$)

| Architecture | TSV Area | Total Area |
|--------------|----------|------------|
| 3D Hybrid    | 0.61     | 4.21       |
| CIT          | 0.15     | 1.81       |
| CMIT         | 0.15     | 2.89       |
| Model A      | 0.07     | 2.21       |
| Model B      | 0.07     | 2.39       |

# 7. REFERENCES

- [1] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
- [2] A. Sheibanyrad, F. Pétrot, and A. Jantsch, editors. 3D Integration for NoC-based SoC Architectures. Springer, 2011.
- [3] S. Spiesshoefer, L. Schaper, S. Burkett, and G. Vangara. Z-axis interconnects using fine pitch, nanoscale through-silicon vias: Process development. In *Proc. ECTC*, pages 466–471, 2004.
- [4] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir. Design and management of 3D chip multiprocessors using Network-in-Memory. In *Proc. ISCA*, pages 130–141, 2006.
- [5] Y. Wang, Y.-H. Han, L. Zhang, B.-Z. Fu, C. Liu, H.-W. Li, and X. Li. Economizing TSV resources in 3-D Network-on-Chip design. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, 23(3):493–506, 2015.
- [6] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. In *Proc. ACM/IEEE DAC*, pages 581–586, 2009.
- [7] M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, and H. Tenhunen. Cluster-based topologies for 3D Networks-on-Chip using advanced inter-layer bus architecture. *Journal Comput. Syst. Sci. (JCSS)*, 79(4):475–491, 2013.
- [8] I. Loi, F. Angiolini, and L. Benini. Supporting vertical links for 3D Networks-on-Chip: Toward an automated design and analysis flow. In *Proc. Conf. Nano-Networks*, pages 1–5, 2007.
- [9] X. Dong and Y. Xie. System-level cost analysis and design exploration for three-dimensional integrated circuits (3D ICs). In *Proc. ASP-DAC*, pages 234–241, 2009.
- [10] B. Feero and P. Pande. Networks-on-Chip in a three-dimensional environment: A performance evaluation. *IEEE Trans. Comput.*, 58(1):32–45, 2009.
- [11] V. Pavlidis and E. Friedman. 3-D topologies for Networks-on-Chip. *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, 15(10):1081–1090, 2007.
- [12] J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, V. Narayanan, M. Yousif, and C. Das. A novel dimensionally-decomposed router for on-chip communication in 3D architectures. In *Proc. ISCA*, pages 138–149, 2007.
- [13] A.-M. Rahmani, K. Latif, P. Liljeberg, J. Plosila, and H. Tenhunen. Research and practices on 3D Networks-on-Chip architectures. In *Proc. IEEE Norchip Conf.*, pages 1–6, 2010.
- [14] S. Yan and B. Lin. Design of application-specific 3D Networks-on-Chip architectures. In *Proc. IEEE ICCD*, pages 142–149, 2008.
- [15] R. Holsmark, M. Palesi, and S. Kumar. Deadlock free routing algorithms for irregular mesh topology NoC systems with rectangular regions. *Journal of Syst. Architect. (JSA)*, 54(3-4):427–440, 2008.
- [16] N. Jiang, D. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. Shaw, J. Kim, and W. Dally. A detailed and flexible cycle-accurate Network-on-Chip simulator. In *Proc. IEEE ISPASS*, pages 86–96, 2013.
- [17] G. Bezerra, S. Forrest, M. Moses, A. Davis, and P. Zarkesh-Ha. Modeling NoC traffic locality and energy consumption with Rent's communication probability distribution. In *Proc. ACM/IEEE Workshop SLIP*, pages 3–8, 2010.
- [18] G. Guindani, C. Reinbrecht, T. Raupp, N. Calazans, and F. Moraes. NoC power estimation at the RTL abstraction level. In *Proc. ISVLSI*, pages 86–96, 2008.