Abstract-The performance of most digital systems today is limited by the interconnect latency between logic and memory, rather than by the performance of logic or memory itself. Three-dimensional (3-D) integration using through-silicon vias (TSVs) may provide a solution to overcome the scaling limitations by stacking multiple memory dies on top of a many-core die. In this paper, we propose a Mesh-of-Trees (MoT) network to support 
HyperCore [2], and STMicroelectronics Platform 2012 [3] are the most visible examples of this trend. All of the cited architectures share a common trait: a multi-core cluster consisting of many simple cores and an on-chip shared tightly coupled data memory (TCDM). The shared TCDM enables parallel threads to cooperate with each other, facilitates extensive reuse of on-chip memory data, and greatly reduces off-chip memory accesses. However, the design of a low-latency, high-bandwidth on-chip interconnection network is crucial for such multi-core clusters with a shared TCDM for parallel processing [4, 5].
3-D integration using through-silicon vias (TSVs) is a promising option to overcome the scaling limitations of 2-D integrated circuits (ICs), including the well-known memory-wall problem [6]. However, in such integration, there is an inherent asymmetry in delay between the fast vertical interconnections and the horizontal interconnections due to the difference in wire lengths (a few tens of μm in the vertical direction, compared to a few thousand μm in the horizontal direction). Vertical interconnections also impose a larger area overhead than the corresponding horizontal wires because they require bonding pads, and they can compete with device area because the TSVs punch through the wafer when face-to-back bonding is used. Therefore, the design of a 3-D interconnection network brings new constraints and opportunities compared to that of a 2-D interconnection network.
The research presented in this paper was supported by the NanoSys project, the program ERC-AdG-246810
978-1-4673-2658-2/12/$31.00 ©2012 IEEE
In this work, we focus on the communication between multiple cores and a shared multi-banked L2 TCDM, which is stacked in multiple tiers on top of the multi-core die. By avoiding cache coherence overheads as well as cache indeterminacy, the shared L2 TCDM can be used as a frame buffer for video processing, which must handle a large amount of data within a tightly bounded time [7, 8]. The fully combinational Mesh-of-Trees (MoT) interconnection network proposed in [5] is suitable for this shared multi-banked L2 TCDM, offering high throughput and low memory access latency. However, [...]
When a packet must be arbitrated against other simultaneous packets forwarded to the same memory bank, a round-robin algorithm is used to provide starvation-free arbitration.
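The round-robin arbitration mentioned above can be illustrated with a minimal software sketch. It is a flat behavioral model written for illustration only, not the tree of arbitration switches used in the MoT network of [5]; the class name `RoundRobinArbiter` and its interface are hypothetical.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants one of n requesters per cycle, rotating priority after each grant."""

    def __init__(self, n_requesters: int) -> None:
        self.n = n_requesters
        self.next_priority = 0  # requester with the highest priority this cycle

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the granted requester, or None if nobody requests."""
        for offset in range(self.n):
            idx = (self.next_priority + offset) % self.n
            if requests[idx]:
                # The winner gets the lowest priority next time,
                # so a persistent requester cannot starve the others.
                self.next_priority = (idx + 1) % self.n
                return idx
        return None


# Cores 0, 2, and 3 keep requesting the same bank for four cycles.
arb = RoundRobinArbiter(4)
grants = [arb.grant([True, False, True, True]) for _ in range(4)]
print(grants)  # [0, 2, 3, 0]
```

Each requesting core is served once before any core is served a second time, which is the starvation-free property referred to in the text.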
[...] of the arbitration tree are directly connected to each memory bank through TSVs, which are distributed in the middle of the memory die [9]. TSVs are allocated per bank, and each bank is connected to its neighboring TSVs, as shown in Figure 2(b). Note that the stacked L2 TCDM dies do not need to be the same size as the multi-core die, although all the memory dies are assumed to have an identical layout to limit the fabrication cost.
III. 3-D MoT INTERCONNECTION NETWORK

A. Sequential Routing Switches
As mentioned before, in 3-D integration there is an inherent asymmetry in delay between the fast vertical interconnections and the horizontal interconnections due to the difference in wire lengths. The latency difference between the two directions grows even larger as the TSV delay decreases. To eliminate this wide disparity, we propose a sequential routing switch, shown in Figure 3, which can replace some of the combinational routing switches without any change to the network configuration. The sequential routing switch has two directions: forward (core ports), which sends the incoming packet from its input port on the core side to one of its output ports on the memory side; and backward (bank ports), which returns the packet from the memory side to the core side. Flip-flops are inserted in each packet direction to buffer the incoming packets as well as the memory control signal. The flip-flop in the forward direction is clocked with the main clock CLK, whereas those in the backward direction are clocked with a skewed CLK, which makes it possible to transfer data on both the rising and falling edges of the clock signal [5].
Inserting such sequential logic on the combinational paths of the MoT interconnect shortens the long horizontal wire delay and thus allows the MoT interconnect to run at a higher clock frequency, at the cost of additional clock cycles per access. For this reason, it is important to consider how to insert the sequential routing switches in the fully combinational MoT network. Figure 4 shows an example of inserting sequential routing switches for four cores with an eight-banked L2 TCDM, connected by a 3-D 4x8 MoT interconnection network. Thanks to the inherent characteristics of a binary tree, inserting $N_{core} \cdot (N_{lev}-1)$ sequential routing switches at a routing level makes half of the memory banks closer and the rest farther, where $N_{lev}$ is the current level in the routing tree. If $N_{seq}$ is the number of routing levels at which $N_{core} \cdot (N_{lev}-1)$ sequential routing switches are inserted, the number of closest banks for each core is $N_{bank}/2^{N_{seq}}$. The farthest banks are accessed from a core by passing through $N_{seq}$ sequential routing switches, so an access to the farthest bank takes $2N_{seq}+1$ clock cycles. Note that the non-uniform memory access latency, combined with an appropriate memory data placement policy such as thread-affinity-based memory data placement [10], gives a significant performance improvement, as shown in the experimental results.
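The two expressions above can be evaluated with a short sketch. The function names are hypothetical, and the example assumes one sequential routing level for the 4x8 configuration; only the closest-bank count and the farthest-bank latency stated in the text are computed.

```python
def closest_bank_count(n_bank: int, n_seq: int) -> int:
    """Banks a core reaches without crossing a sequential switch: N_bank / 2^N_seq."""
    if n_bank % (2 ** n_seq) != 0:
        raise ValueError("n_bank must be divisible by 2**n_seq")
    return n_bank // (2 ** n_seq)


def farthest_access_cycles(n_seq: int) -> int:
    """Clock cycles for an access to the farthest bank: 2 * N_seq + 1."""
    return 2 * n_seq + 1


# 4x8 MoT, assuming one sequential routing level (N_seq = 1).
small = (closest_bank_count(8, 1), farthest_access_cycles(1))
# 64 banks with N_seq = 1, as in the experimental setup.
cluster = (closest_bank_count(64, 1), farthest_access_cycles(1))
print(small, cluster)  # (4, 3) (32, 3)
```

With $N_{seq} = 1$, half of the banks stay on a purely combinational path, while the other half cost three clock cycles per access, which is why data placement matters.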
B. TSV Sharing Effects
TSVs connect multiple stacked dies with good electrical characteristics, but their area footprint is much larger than that of on-chip metal lines. Sharing TSVs among L2 TCDM banks that are stacked directly on top of each other (which we call a bank stack) is the most straightforward method to reduce the number of TSVs [9]. The total number of TSVs is reduced with respect to $N_{tier}$ as follows.
where $N_c$ is the number of clock TSVs added to one reset TSV. Given the high manufacturing cost of 3-D integration due to the high TSV failure rate and the reduced yield, the reduction in both die area and number of TSVs that results from TSV sharing raises the fabrication yield and lowers the fabrication cost compared to a 3-D stacked TCDM without TSV sharing. The stacking yield for $N_{tier}$ dies can be modeled as follows [12].
[Equation (2): $Y_{stacking}$ expressed in terms of $Y_{bonding}$ and $f_{tsv}$]
where $Y_{bonding}$ captures the yield loss of the chip due to faults in the bonding process and $f_{tsv}$ is the TSV failure rate.
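To illustrate why reducing the number of TSVs raises the stacking yield, the sketch below uses a simple independent-failure model. This model is an assumption made for illustration and is not necessarily the form of Equation (2) in [12]: every bonding step succeeds with probability $Y_{bonding}$ and every TSV fails independently with probability $f_{tsv}$. The function name and the TSV counts are hypothetical; the default parameter values 0.99 and 0.00001 are the ones assumed in the experiments.

```python
def stacking_yield(n_bonds: int, n_tsv: int,
                   y_bonding: float = 0.99, f_tsv: float = 0.00001) -> float:
    """Assumed model: all bonding steps and all TSVs must succeed independently."""
    return (y_bonding ** n_bonds) * ((1.0 - f_tsv) ** n_tsv)


# Hypothetical TSV counts, chosen only to show the trend.
without_sharing = stacking_yield(n_bonds=8, n_tsv=40_000)
with_sharing = stacking_yield(n_bonds=8, n_tsv=5_000)
print(f"without sharing: {without_sharing:.3f}, with sharing: {with_sharing:.3f}")
```

Under this assumed model the yield falls exponentially with the TSV count, so sharing TSVs within a bank stack helps more as the number of tiers grows.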
IV. EXPERIMENTAL RESULTS
We performed experiments using a 3-D multi-core cluster with a shared multi-banked L2 data memory stacked on top of the multi-core cluster. The cluster consists of 32 cores and 64 memory banks. Each core is modeled as an ARM Cortex-A5 with 16 KB/16 KB instruction and data caches. The estimated core area is 1.183 mm x 1.183 mm in 65 nm technology [16]. The core operating clock frequency ($f_{core}$) is assumed to be 1 GHz. Each L2 TCDM bank has a capacity of 64 KB and a size of 0.867 mm x 0.624 mm, estimated for 65 nm technology [17]. The access time of a bank itself (i.e., the delay due to the row decoder, sense amplifier, and multiplexer in the bank) is assumed to be 1.062 ns. The number of stacked memory tiers ($N_{tier}$) used in the experiments varies from 1 to 8.
For the simulation, an in-house simulator is used, whose details are explained below.
To estimate the MoT network performance, the latency of the longest possible link between cores and memory banks is estimated using the Elmore distributed resistance-capacitance (RC) delay model for 65 nm technology [13, 14]. We assumed a TSV pitch of 10 μm x 10 μm, a TSV diameter of 5 μm, and a TSV height of 20 μm. To evaluate the many-core system performance, we used the operations per second (OPS) metric presented in [15]. The average memory access time (i.e., the sum of the average access times of the L1 data cache, L2 TCDM, and off-cluster memory) is given as follows.
where $P_{hit}$ and $P_{TCDM}$ are, respectively, the average hit ratio of the L1 data cache and the average access ratio of the L2 TCDM for each thread; $t_{cache}$ and $t_{offcluster}$ are the latencies (in cycles) of an L1 data cache hit and an off-cluster memory access; and $t_{TCDM}$ is the latency of an L2 TCDM access (i.e., the sum of the interconnect delay from a core to the target memory bank and the access time of the bank itself). All the values related to the system performance evaluation are shown in Table I. Note that, in Table I, $P_0$ and $P_1$ represent the probabilities of accessing the TCDM bank regions divided by the sequential routing switches (when $N_{seq}$ is 1), which can generally be the case when static or dynamic data mapping methods are used in parallel processing [10]. For the fabrication cost estimation, we used the analytical models proposed in [12, 23], assuming wafer-to-wafer (W2W), face-to-back 3-D bonding. We assumed $Y_{bonding}$ and $f_{tsv}$ in Equation (2) to be 0.99 and 0.00001, respectively.
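The average memory access time described above can be sketched as a weighted sum of the three latencies. The exact weighting is an assumption here: the sketch treats $P_{TCDM}$ as the fraction of L1 data cache misses that are served by the L2 TCDM, with the remainder going off-cluster. The function name and the numeric values are hypothetical and are not taken from Table I.

```python
def average_memory_access_time(p_hit: float, p_tcdm: float, t_cache: float,
                               t_tcdm: float, t_offcluster: float) -> float:
    """Average access latency in cycles under the assumed weighting."""
    p_miss = 1.0 - p_hit
    # Assumption: a miss goes to the TCDM with probability p_tcdm,
    # otherwise to off-cluster memory.
    miss_latency = p_tcdm * t_tcdm + (1.0 - p_tcdm) * t_offcluster
    return p_hit * t_cache + p_miss * miss_latency


# Hypothetical values: 90% L1 hits, half of the misses served by the TCDM.
amat = average_memory_access_time(p_hit=0.9, p_tcdm=0.5, t_cache=1.0,
                                  t_tcdm=3.0, t_offcluster=100.0)
print(round(amat, 2))  # 6.05
```

Because $t_{TCDM}$ includes the interconnect delay, both the extra cycles added by sequential routing switches and any contention at shared TSVs would enter the result through this term.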
For architecture comparisons, we evaluated the following candidates:
2-D MoT: All the cores and memory banks are placed on a 2-D planar structure.
Plain MoT: Multiple memory tiers are stacked on the multi-core tier and memory banks are connected to cores through a plain MoT (presented in Section II).
SRS: Multiple memory tiers are stacked on the multi-core tier and memory banks are connected to cores through 3-D MoT with sequential routing switches (presented in Section III.A).
TSVshare: Multiple memory tiers are stacked on the multi-core tier and memory banks are connected to cores through 3-D MoT with TSV sharing (presented in Section III.B).
SRS+TSVshare: Multiple memory tiers are stacked on the multi-core tier and memory banks are connected to cores through 3-D MoT with both sequential routing switches and TSV sharing.
Figure 5 shows the MoT network clock frequency (i.e., the reciprocal of the MoT network latency) with respect to the number of L2 TCDM tiers, $N_{tier}$. For SRS and SRS+TSVshare, we assumed that the number of sequential routing levels, $N_{seq}$, is 1. As shown in Figure 5, [...] $N_{tier}$ (e.g., $N_{tier} = 8$ in Figure 5). Sharing TSVs among banks in a bank stack strongly affects the system performance because it reduces both the area occupied by the TSVs and the number of levels in the routing tree. These reductions shorten the horizontal wire delay and thus increase the maximum achievable clock frequency. However, the memory contention occurring at the shared TSVs may degrade the system performance despite the increase in clock frequency. As can be seen in Figure 6, TSVshare and SRS+TSVshare do not give a steady performance improvement with respect to $N_{tier}$, even though SRS+TSVshare yields the best system performance. In Figure 7, we varied the L2 TCDM access probability from 0.2 to 0.8 with $N_{tier}$ fixed at 2. TSV sharing worsens the system performance as the probability of accessing the TCDM, $P_{TCDM}$, increases because of heavy contention at the shared TSVs. Note that inserting sequential routing switches gives a larger performance improvement as $P_{TCDM}$ increases.
Figure 6. Results of operations per second (OPS) versus the number of stacked TCDM tiers. All the values are normalized with respect to the OPS of 2-D MoT.
When comparing the fabrication cost of SRS and SRS+TSVshare with respect to $N_{tier}$, SRS+TSVshare gives a lower fabrication cost than SRS, and the disparity increases with $N_{tier}$ (up to 47% when $N_{tier}$ is 8) because of the large area overhead and the high failure rate of the TSVs themselves in the case of SRS. Without TSV sharing, the large number of TSVs increases the die area and reduces the stacking yield, which results in a high fabrication cost.
V. CONCLUSION
In this paper, we presented a MoT interconnection network that can be integrated in a multi-core cluster where a 3-D multi-banked shared L2 TCDM is stacked on the multi-core die. To exploit the fast vertical interconnections in 3-D integration, we proposed a sequential routing switch that can be applied to the plain MoT interconnect without any network configuration changes. The experimental results show that the proposed sequential routing switch significantly improves the system performance. The architecture parameters of the 3-D stacked memory have also been explored with TSV sharing. TSV sharing reduces the fabrication cost and gives the highest MoT network clock frequency owing to the reduced form factor. However, since the system performance depends strongly on the memory contention at the shared TSVs, new solutions, such as adding paths using redundant TSVs, are needed to compensate for the memory contention; this will be our future work.
