A 3-D network-on-chip (NoC) enables the design of high-performance, low-power many-core chips. Existing 3-D NoCs are inadequate for meeting the ever-increasing performance requirements of many-core processors since they are simple extensions of regular 2-D architectures and they do not fully exploit the advantages provided by 3-D integration. Moreover, the anticipated performance gain of a 3-D NoC-enabled many-core chip may be compromised due to the potential failures of through-silicon vias that are predominantly used as vertical interconnects in a 3-D IC. To address these problems, we propose a machine-learning-inspired predictive design methodology for energy-efficient and reliable many-core architectures enabled by 3-D integration. We demonstrate that a small-world network-based 3-D NoC (3-D SWNoC) performs significantly better than its 3-D MESH-based counterparts. On average, the 3-D SWNoC shows 35% energy-delay-product improvement over 3-D MESH for the PARSEC and SPLASH2 benchmarks considered in this paper. To improve the reliability of 3-D NoC, we propose a computationally efficient spare-vertical link (sVL) allocation algorithm based on a state-space search formulation. Our results show that the proposed sVL allocation algorithm can significantly improve the reliability as well as the lifetime of 3-D SWNoC.
of large numbers of embedded cores in a single die. Three-dimensional NoC architectures combine the benefits of these two new paradigms to offer an unprecedented performance gain [2], [3]. With freedom in the third (vertical) dimension, NoC architectures that were previously impossible or prohibitive due to wiring constraints in planar ICs are now realizable in 3-D NoC, and many 3-D implementations can outperform their 2-D counterparts. However, existing 3-D NoC architectures predominantly follow straightforward extensions of regular 2-D NoC designs, which do not fully exploit the advantages provided by the 3-D integration technology [3]. Another challenge is that the anticipated performance gain of 3-D NoC-enabled many-core chips may be compromised due to potential failures of the through-silicon vias (TSVs) used as vertical interconnects. TSVs in a 3-D IC fail due to voids, cracks, and different kinds of fabrication challenges [4]. Additionally, the workload-induced stress increases the resistance of the TSVs, which leads to different mean-time-to-failure (MTTF) for different TSVs [5], [6].
The main focus of this paper is to explore and consequently establish performance-energy-reliability tradeoffs for 3-D small-world NoC (SWNoC) [7] , [8] . To this end, we make the following contributions. 1) We consider the design space of 3-D SWNoC architectures, where the vertical connections predominantly work as long-range shortcuts for SW networks. 3-D SWNoC architectures (as shown in Fig. 1 ) help with both energy-efficiency (small average path length) and reliability (average path length grows insignificantly due to link failures). This is the first work to exploit the advantages of 3-D integration to design a power-law-based SW network-enabled 3-D NoC architecture.
2) The design space of a 3-D SWNoC is combinatorial in nature. Hence, we leverage machine-learning techniques to intelligently explore the design space to optimize the placement of both planar and vertical communication links for high performance and energy efficiency. 3) We consider spare-vertical link (sVL) allocation to improve the reliability of the 3-D NoC. This is another combinatorial optimization problem, where we do not know the cost function. We can experimentally compute the quality (or cost) of a solution by running a simulation. We solve this problem using a state-space search formulation, where the simulations guide the search process. We leverage the structure of the problem and domain knowledge of the 3-D SWNoC to efficiently produce an sVL allocation that can significantly improve the reliability of the 3-D NoC. 
4) We perform a comprehensive experimental study by using several PARSEC and SPLASH2 benchmarks to evaluate the proposed optimized 3-D NoC architecture and sVL allocation schemes. We show that the proposed 3-D SWNoC outperforms the state-of-the-art NoC architectures for all benchmarks considered in this paper. We show the effectiveness of our greedy sVL allocation method by comparing its computation time and solution quality with those obtained via exhaustive search. Finally, we also demonstrate the soundness of domain knowledge used for pruning the search space for sVL allocation for 3-D SWNoC.
II. RELATED PRIOR WORK
We categorize the prior work on 3-D NoC design as follows.
A. 3-D NoC Architectural Space
Most existing 3-D NoC architectures are based on a conventional mesh topology [9]-[11]. However, it is well known that mesh-based NoCs suffer from high network latency and energy consumption due to multihop communication links. To exploit the reduced distance along the vertical dimension of 3-D IC, an NoC-bus hybrid architecture was proposed in [12]; it uses dynamic time division multiple access to reduce the network latency. To reduce energy consumption of the system, the 3-D dimensionally decomposed NoC router architecture [13] was developed. In an NoC, the largest percentage of energy is consumed by the routers, and energy consumption increases nonlinearly with the number of input ports. To reduce the energy consumption and the number of input ports, an improved 3-D NoC router architecture was developed [14]. All these architectures have buses in the Z-dimension; hence, with increasing network size, they are subject to traffic congestion and high latency under high traffic injection loads.
The Sunfloor 3-D was developed for synthesizing application specific 3-D NoCs [15] . The design and synthesis of application-specific 3-D NoC architectures was also investigated [16] , [17] . Later, a more general-purpose 3-D NoC was proposed in [18] using an integer linear programming (ILP)-based algorithm to insert long-range links to develop low diameter and low radix architecture. However, the reduction in energy consumption was found to be limited.
Photonic interconnects offer high bandwidth and low power for future many-core chip design. A number of hybrid 3-D/photonic NoC architectures have been designed [2] , [19] .
However, on-chip photonics still suffers from performance variation due to thermal issues [20] . In addition, the challenges of integrating two emerging paradigms, namely, 3-D IC and silicon nanophotonics, are yet to be adequately addressed.
B. Reliability Analysis
The performance of TSV-based 3-D ICs degrades due to TSV failure. To overcome such performance penalties arising from TSV failure, researchers have investigated spare TSV (sTSV) allocation and sharing schemes for 3-D ICs [4], [21]. The initial sharing algorithm was developed based on the idea of utilizing one extra TSV for each of the functional TSVs [21]. However, the reliability improvement comes at the expense of a doubled TSV count and significant area overhead.
To avoid the 100% TSV area overhead, researchers have proposed several TSV sharing schemes (e.g., 3:2, 4:1, and so on) [22], [23]. The main idea is to share one or more sTSVs among a group of functional TSVs to compensate for the performance penalty due to possible TSV failure within that particular group. However, depending on the sharing scheme, a significant amount of encoder and decoder logic is necessary to shift the signals and select the correct sTSV, which introduces additional delay and power consumption. In addition, the delay for each TSV can be different depending on the location of the failed TSV, which may result in timing violations. To address the varying delay for each TSV, a group-based 6-TSV placement scheme for four functional TSVs was proposed to improve the reliability of a 3-D DRAM [24]. With the help of a switchbox-based design for each group, correct signals were selected and transferred for functional TSVs. The main advantage was that the same amount of delay was incurred by every TSV in the box. However, this advantage comes at the expense of 50% area overhead and significant power consumption from the switch-boxes. Similarly, researchers have developed a block-based redundancy architecture and used signal-shifting techniques for fault tolerance [25], [26]. In this case, if any TSV fails, then the signal shifts toward the redundant ones. The signal-shifting technique can tolerate one TSV failure. To improve fault tolerance for more than one TSV failure, a crossbar-based redundant TSV architecture was developed in [27]. However, the number of redundant TSVs increases significantly in this case. TSV resource sharing algorithms, which can be selectively applied depending on the granularity and design complexity, were also developed. Word-level and bit-level TSV sharing was formulated as a constrained clique-partitioning problem and efficient algorithms were designed to solve it. However, these algorithms do not scale for large-scale design problems.
All the above-mentioned TSV sharing schemes improve the performance of 3-D ICs and hence, the overall reliability as well. However, the allocation of sTSVs for 3-D NoCs needs to consider additional constraints arising from the physical NoC design perspective. In a 3-D NoC, TSVs are placed in a bundle to enable a single vertical link (VL). Depending on the physical placements of switches and cores of 3-D NoCs, these VLs maintain considerable physical distance between them. Hence, sharing TSVs among these VLs is not feasible due to the physical design and timing constraints. In addition, if one TSV fails in a VL, then the achievable performance of the whole link is affected, which in turn degrades the overall NoC performance. Hence, we focus on sVL allocation instead of individual sTSVs.
In this paper, to analyze the reliability issues of 3-D NoC, we evaluate the performance of a 3-D NoC with workload-induced VL failure. We formulate sVL allocation as an optimization problem to minimize the performance penalty due to TSV-based VL failure. We demonstrate two different algorithms, viz., greedy and exhaustive search, to allocate the sVLs in a 3-D SWNoC to compensate for the performance penalty due to VL failure. We also compare the two algorithms in terms of solution quality and computation time. We show that, based on the domain knowledge of a 3-D NoC, we can develop computationally efficient algorithms whose performance is similar to that of exhaustive search, a naïve approach.
III. 3-D NOC ARCHITECTURE DESIGN
In this section, we first describe the design of an SW network-based 3-D NoC. Next, we discuss the main challenges for developing an energy-efficient 3-D NoC and the motivation for a machine-learning-based optimization algorithm.
A. Problem Description
The goal of an on-chip communication system design is to transmit data with low latencies and high throughput using the least possible power and resources. In this context, design of SW network-based NoC architectures [7] is a notable example. It has been shown that either by inserting long-range shortcuts in a regular mesh architecture to induce an SW effect or by adopting a power-law-based SW connectivity, it is possible to achieve significant performance gain and lower energy consumption compared to traditional multihop mesh networks [7] , [8] . In this paper, we advocate that the concept of small-worldness should be adopted in 3-D NoCs too. Specifically, the VLs in 3-D NoC should enable the design of long-range shortcuts necessary for an SW network. However, the appropriate placement of the planar and the long-range links along the vertical dimension is crucial for maximizing the performance benefits. Hence, our goal is to optimize the placement of the planar and the VLs in a 3-D NoC, where the overall interconnection architecture follows the SW connectivity, and improves the network latency and power consumption per message.
B. Small-World Network
An SW network lies in between a regular, locally interconnected mesh network and a completely random Erdős-Rényi topology. SW graphs have a very short average path length, defined as the average number of hops between any pair of nodes. The average shortest path length of SW graphs is bounded by a polynomial in log(N), where N refers to the number of nodes; this property makes SW graphs particularly interesting for efficient communication with minimal resources [28]. To develop the SW network, we follow the power-law-based connectivity [28]. The probability p of having a direct link between two nodes in an SW network decays as a power of the link length ℓ, i.e., $p(\ell) \propto \ell^{-\alpha}$, where the parameter α governs the nature of connectivity; e.g., a larger α means a locally connected network with a few, or even no, long-range links. By the same token, a zero value of α generates an ideal SW network following the Watts-Strogatz model [28], one with long-range shortcuts that are virtually independent of the distance between the cores. Here, we consider an SW network with connectivity parameter α equal to 2.4, which was shown to produce an energy-efficient and high-performance 3-D SWNoC [29]. This analysis is elaborated in the longer version [30].
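To make the role of α concrete, the following minimal sketch turns the power-law rule into normalized link-length probabilities and samples link lengths from them. The function names, the candidate length list, and the seed are illustrative assumptions, not part of the design flow described in this paper.

```python
import random


def power_law_link_probs(lengths, alpha=2.4):
    """Normalized probabilities p(l) proportional to l**(-alpha).

    `lengths` is a list of positive candidate link lengths (hypothetical
    input); alpha = 2.4 is the connectivity parameter quoted in the text.
    """
    weights = [length ** (-alpha) for length in lengths]
    total = sum(weights)
    return [w / total for w in weights]


def sample_link_lengths(lengths, k, alpha=2.4, seed=0):
    """Draw k link lengths according to the power-law distribution."""
    rng = random.Random(seed)
    probs = power_law_link_probs(lengths, alpha)
    return rng.choices(lengths, weights=probs, k=k)


if __name__ == "__main__":
    candidate_lengths = [1, 2, 3, 4, 5, 6]
    for a in (0.0, 2.4, 5.0):
        probs = power_law_link_probs(candidate_lengths, a)
        print(a, [round(p, 3) for p in probs])
```

With α = 0 every length is equally likely, while larger α concentrates the probability mass on short links, which matches the qualitative behavior described above.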
C. Development of 3-D SWNoC
Starting from a power-law-based connectivity, we attempt to optimize the location of the planar links and the VLs to achieve lower latency and energy consumption. We define an objective function O called communication cost, which combines the NoC performance metrics, namely, the network latency and energy consumption per message. Optimizing the communication cost ensures lower average hop count and improvement in the network performance in terms of both latency and energy consumption. However, the space of physically feasible SW-based 3-D NoC designs D is combinatorial in nature and our goal is to find the design d ∈ D that minimizes O. One could employ search algorithms such as hill-climbing and simulated annealing (SA), which are very popular in the design community for this task. However, we leverage machine-learning techniques that have been shown to improve the performance of these search algorithms by intelligently exploring the design space [31] , [32] . This optimization process is undertaken before the actual NoC implementation.
IV. NOC OPTIMIZATION BASED ON MACHINE-LEARNING
We employ an online learning algorithm called STAGE [31], which was originally developed to improve the performance of local search algorithms (e.g., hill climbing) with random restarts for combinatorial optimization problems. The high-level conceptual idea of the algorithm is shown in Fig. 2. The key insight behind STAGE is to leverage some extra features $\phi(d) \in \mathbb{R}^m$ (m is the number of features) of the optimization problem to learn an improved evaluation function E that can estimate the promise of a design d as a starting point for the local search procedure A. It employs E to intelligently select promising starting states that will guide A toward significantly better solutions. Past work in the search community concluded that many practical optimization problems exhibit a "globally convex" or "big valley" structure, where the set of local optima appears to be convex with one global optimum in the center [32]. The main advantage of STAGE over popular algorithms such as SA and ILP is that it tries to learn the structure of the solution space and uses this information to improve both the convergence time and the quality of the solution. This aspect of STAGE is very advantageous for large system sizes, both to shorten the design-validate cycle before mass manufacturing and to dynamically adapt the designs to new application workloads. To the best of our knowledge, this is the first work that applies STAGE to an NoC design optimization problem. Algorithm 1 provides the pseudocode for our NoC optimization technique.
A. Challenges
The main challenges in applying STAGE to 3-D NoC design are as follows.
1) We need to define additional features of the optimization problem that can be exploited to learn improved evaluation functions for efficient design space exploration. We provide these features for 3-D NoC designs (Table I), but they can be adapted to other types of NoC designs as well.
2) Defining appropriate search spaces by leveraging the domain knowledge can potentially improve the effectiveness of the STAGE algorithm. We need to identify a good starting state distribution (subset of initial 3-D NoC design solutions) and search operators (actions to get successor states from a given state) to navigate the design space. We have explored γ-greedy for the starting state distribution with the hope of improving over a random starting state distribution (see "starting states and successor function" below).
3) We need to find a good knowledge representation for the evaluation function E that is expressive, can be trained quickly, and makes fast predictions. We picked regression trees (RTs) because they satisfy all these requirements.
B. Instantiation for 3-D NoC Optimization
In this section, we provide all the details needed to apply the STAGE algorithm to our 3-D NoC optimization problem.
1) Design Space: Our design space depends on a set of network resources, which are given as input to the optimization algorithm. These resources are defined as follows.
where N is the total number of cores. We assume that every core is connected to at least one switch.
2) Planar Dies (P): A set of all dies P. For N = 64, we consider four dies with each die containing 16 cores.
For core placement, we follow a greedy algorithm to
where $f_{ij}$ and $d_{ij}$ are the communication frequency and Cartesian distance between the cores, respectively. In this step, we form clusters with 16 cores in each die.
We assume that F for each application is given as an input to perform application-specific network optimization.
2) Objective Function O:
We define O as the communication cost of the given 3-D NoC, which is the product of hop count, frequency of communication, and link length summed over every source and destination pair, that is
where $f_{ij}$ and $d_{ij}$ are defined as above; $h_{ij}$ is the hop count between the ith and jth nodes, and r denotes the number of switch stages. From a practical point of view, r is the number of cycles a message spends inside a switch to move from input to output port. An NoC design with low O will have low latency and energy consumption, and hence, low energy-delay-product (EDP).
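As a rough illustration of how such a communication cost can be evaluated for a candidate topology, the sketch below computes hop counts by breadth-first search and accumulates a per-pair term. The exact way the terms are combined is an assumption here: the code uses the plain product $f_{ij} \cdot r \cdot h_{ij} \cdot d_{ij}$ suggested by the verbal description, which may differ from the expression used in our experiments. The data layout (adjacency lists, dictionaries keyed by core pairs) and function names are also hypothetical.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def communication_cost(adj, freq, length, r=1):
    """Sum an assumed per-pair cost f_ij * r * h_ij * d_ij over all pairs.

    adj    : dict node -> list of neighbors (connected topology assumed)
    freq   : dict (i, j) -> communication frequency f_ij
    length : dict (i, j) -> physical distance d_ij between cores i and j
    r      : number of switch stages (cycles spent per switch)
    """
    total = 0.0
    hops_from = {}
    for (i, j), f in freq.items():
        if i not in hops_from:
            hops_from[i] = hop_counts(adj, i)
        total += f * r * hops_from[i][j] * length[(i, j)]
    return total


if __name__ == "__main__":
    line = {0: [1], 1: [0, 2], 2: [1]}
    shortcut = {0: [1, 2], 1: [0, 2], 2: [1, 0]}
    freq = {(0, 1): 1.0, (0, 2): 2.0}
    length = {(0, 1): 1.0, (0, 2): 2.0}
    print(communication_cost(line, freq, length))
    print(communication_cost(shortcut, freq, length))
```

In the toy example, adding a direct link between the frequently communicating nodes 0 and 2 lowers the cost, which is the effect the link-placement optimization aims to exploit.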
3) Network Constraints:
To explore only physically feasible 3-D NoC designs, we enforce some constraints on the placement of VLs and switch configurations. If TSVs are considered as the VLs, we only allow placing them point-to-point (regularly) between the switches. Such constraints may put additional limits on the performance of NoC designs. However, efficient optimization can overcome such limitations. The SW network has an irregular connectivity. Hence, the number of links connected to each switch is not constant. For fair comparison between our SW network and 3-D MESH, we assume that both of them use the same average number of connections per switch, $\langle k_{avg} \rangle$. This also ensures that the 3-D SWNoC does not introduce additional links compared to a 3-D MESH. For a 64-core system, $\langle k_{avg} \rangle$ is 4.5 considering all the switches, including the peripheral ones. In addition, the maximum connectivity per node, $\langle k_{max} \rangle$, is set to 7 for the SW network as found in [33].
4) Starting States and Successor Function:
For starting states, we randomly generate an SW network that satisfies the network constraints. The successor function S takes a network as input and returns a set of next states, and allows the search procedure to navigate the NoC design space. S generates one candidate state for each link connecting two nodes in the input network. It simply removes that link and places a link with the same length between two nodes in the NoC that are not directly connected.
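The successor function can be sketched as follows. This is an illustrative version under several assumptions: link length is taken as the Manhattan distance between hypothetical node coordinates, one replacement link is drawn at random per existing link, and the network constraints (connectivity, maximum degree, VL placement) are not checked.

```python
import itertools
import random


def link_length(pos, u, v):
    """Manhattan distance between two nodes (an assumed length metric)."""
    return sum(abs(a - b) for a, b in zip(pos[u], pos[v]))


def successors(links, pos, seed=0):
    """Return one candidate network per existing link.

    Each candidate removes one link and adds a link of the same length
    between two nodes that are not directly connected.
    links : iterable of (u, v) node pairs
    pos   : dict node -> coordinate tuple
    """
    rng = random.Random(seed)
    current = {frozenset(link) for link in links}
    all_pairs = [frozenset(p) for p in itertools.combinations(sorted(pos), 2)]
    result = []
    for old in sorted(current, key=sorted):
        u, v = tuple(old)
        target = link_length(pos, u, v)
        candidates = [
            pair for pair in all_pairs
            if pair not in current and link_length(pos, *pair) == target
        ]
        if candidates:
            new = rng.choice(candidates)
            result.append((current - {old}) | {new})
    return result


if __name__ == "__main__":
    grid = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
    for state in successors([(0, 1), (2, 3)], grid):
        print(sorted(sorted(link) for link in state))
```

A full implementation would additionally discard candidates that violate the constraints of Section IV-B3 before handing them to the local search.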
The STAGE algorithm can benefit if we can specify the starting state distribution using some domain knowledge. Therefore, we also consider a starting-state distribution named γ-greedy. We formulate the starting state (design) construction as a sequential decision-making task, where we select the next link to be placed at each step. In the γ-greedy distribution, we select a link greedily with probability γ based on communication frequency and a random link with probability (1 − γ). We start with γ = 1 (completely greedy) and gradually reduce γ to increase the randomness.
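A minimal sketch of the γ-greedy construction is given below. The candidate link list, the frequency dictionary, and the function name are hypothetical, and feasibility checks on the resulting network are omitted.

```python
import random


def gamma_greedy_links(candidates, freq, num_links, gamma, seed=0):
    """Build a starting design by placing links one at a time.

    With probability gamma the remaining link with the highest
    communication frequency is placed; otherwise a random remaining
    link is placed.
    """
    rng = random.Random(seed)
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < num_links:
        if rng.random() < gamma:
            pick = max(remaining, key=lambda link: freq.get(link, 0.0))
        else:
            pick = rng.choice(remaining)
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


if __name__ == "__main__":
    links = [(0, 1), (0, 2), (1, 2), (2, 3)]
    traffic = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 3.0, (2, 3): 0.5}
    # Start fully greedy, then reduce gamma to add randomness.
    for step, gamma in enumerate((1.0, 0.75, 0.5)):
        print(gamma, gamma_greedy_links(links, traffic, 2, gamma, seed=step))
```

With γ = 1 the construction is deterministic and simply picks the most heavily communicating links; lowering γ across restarts yields progressively more diverse starting designs.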
5) Local Search Procedure A:
We employed a stochastic hill-climbing procedure, where the next states are sampled stochastically.
6) Feature Function φ: The main challenge in adapting STAGE to our NoC domain is to define a set of features φ for each network that can drive the learner. We divide the whole network into several overlapping subgraphs or regions, and define a set of features that can be categorized into three types.
1) Average Hop Count (h): the average hop count for each region or subnetwork.
2) Weighted Communication: the sum of the products of hop count and communication frequency over all source-destination pairs for a particular
The highest value of k depends on the network size and topology. If the value of this feature is small, it indicates that highly communicating cores are placed in the same neighborhood.
3) Clustering Coefficient ($C_c$): captures the connectivity of one core with its neighbors [34]. While the hop count takes into account mainly long-range communication, the clustering coefficient focuses more on local connectivity among the immediate neighbors.
We found that these features capture the network characteristics sufficiently well, are efficient to compute, and allow learning a highly accurate evaluation function E. In this paper, for N = 64 cores, we divide the whole network into nine regions. For each region, we consider the average hop count as a feature. In addition, the initial network has a highest hop count of eight, and hence, we require eight features for the weighted communication cost. Finally, for each die in the network, we consider the average clustering coefficient, which gives rise to four more features. Table I lists all these features.
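For the clustering-coefficient features, a minimal sketch is shown below. It assumes the standard local clustering coefficient for undirected graphs (the fraction of a node's neighbor pairs that are themselves linked), averaged over the nodes of one die; the adjacency representation and function names are illustrative.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are directly linked.

    adj : dict node -> set of neighbors (undirected graph)
    """
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * linked / (k * (k - 1))


def average_clustering(adj, nodes):
    """Average local clustering coefficient over a group of nodes."""
    nodes = list(nodes)
    return sum(local_clustering(adj, n) for n in nodes) / len(nodes)


if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant node 3 attached to node 0.
    graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
    print(local_clustering(graph, 0))
    print(average_clustering(graph, graph))
```

Calling `average_clustering` once per die, with `nodes` restricted to the switches on that die, would give the four per-die features mentioned above.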
7) Regression Learner: The quality of our optimization methodology depends on the accuracy of the evaluation function E. We can employ any regression learning algorithm, e.g., k nearest-neighbor, linear regression, support vector regression, and RT. However, a regression learner that is nonlinear, fast in terms of training time and prediction time will improve the effectiveness of the STAGE algorithm. Therefore, analytically, the RT learner suits our needs the best.
Our training data consists of a set of input-output pairs $Z = \{(x_i, y_i)\}$, where each $x_i \in \mathbb{R}^m$ is a feature vector and $y_i \in \mathbb{R}$ is the corresponding output. The RT learning algorithm tries to learn a function E in the form of a tree (a set of if-then rules) to minimize the deviation of the predicted output $E(x_i)$ from the correct output $y_i$. The key idea in RT learning is to recursively partition the input space (as in hierarchical clustering) until we find regions that have very similar output values. The recursive partitioning is represented as a tree, where leaves correspond to the cells of the partition. Each leaf is assigned the sample mean of all the output variables in that cell as its prediction. During testing, we find the cell of the partition that input x belongs to through a series of comparison questions on the features, and return the prediction associated with that cell. RTs also allow us to identify the features that are important in making predictions.
We employed the WEKA machine-learning toolkit [35] to train RTs over training set Z, and tune the hyper-parameters using validation data.
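The recursive-partitioning idea can be illustrated with a small, self-contained regression tree. This sketch is not the WEKA learner used in our experiments; the split criterion (sum of squared errors), the stopping rules, and all names are assumptions chosen for brevity.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def build_tree(X, y, min_leaf=1, max_depth=3):
    """Recursively split on the (feature, threshold) with the lowest SSE."""
    if max_depth == 0 or len(y) < 2 * min_leaf or sse(y) == 0.0:
        return {"leaf": mean(y)}  # leaf predicts the sample mean of its cell
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:
        return {"leaf": mean(y)}
    _, f, t, left, right = best
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           min_leaf, max_depth - 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            min_leaf, max_depth - 1),
    }


def predict(tree, x):
    """Follow the comparison questions down to a leaf."""
    while "leaf" not in tree:
        go_left = x[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0]]
    y = [1.0, 1.0, 5.0, 5.0]
    tree = build_tree(X, y)
    print(tree["feature"], tree["threshold"], predict(tree, [1.5]))
```

In the STAGE setting, each row of `X` would hold the feature vector φ(d) of a visited design and the matching entry of `y` the best cost reached by local search started from that design.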
V. SPARE-VERTICAL LINK ALLOCATION
The anticipated performance gain of 3-D NoC-enabled many-core chips can be compromised due to potential failures of the TSVs that are mainly used as vertical interconnects in a 3-D IC. Workload-induced stress is one of the main reasons for the failure of TSV-based VLs in a 3-D IC. Stress increases the resistance of the TSVs, which leads to different MTTF for different TSV-based VLs [4], [5]. The TSV failure model is described in detail in the longer version of this paper [30]. The performance of the 3-D NoC degrades over time, leading to eventual failure of the chip. Therefore, we consider the allocation of sVLs as a way to improve the reliability of the 3-D NoC.
A. Spare VL Allocation Problem
Given a set F of m functional VLs and a budget of n sVLs (n > 0, n << m), we want to select the subset of n functional VLs that, when each is provided with one sVL, maximizes the reliability (lifetime) of the 3-D NoC. We can experimentally compute the quality of a given sVL allocation solution by running a simulation. This is an instance of a combinatorial optimization problem with an unknown cost function, where the quality of a given solution can be computed only by making a simulator call. Here, the term "solution" refers to a particular 3-D NoC configuration in which n functional VLs are provided with sVLs.
B. Computational Challenges
The main challenge here is that there is a huge number, $\binom{m}{n}$, of possible solutions or NoC configurations for allocating sVLs among the functional links. A naïve approach is to enumerate all possible solutions, compute the quality of each solution via a simulator call, and pick the best solution. However, the simulator call is expensive in terms of both time and memory requirements. Hence, this exhaustive search approach to quantify the performance and lifetime of each candidate configuration is infeasible for practical purposes.
C. State Space Search Formulation
We solve the sVL allocation problem using a state-space search formulation, where the simulations guide the search process. Each state in our search space is a particular NoC configuration allocated with sVLs and consists of a set S ⊆ F, where S is a partial or complete solution. Our search space is a 3-tuple <I, A, T>, where I is the initial state function that returns the initial search state S = ∅, meaning the solution set is empty; A is a finite set of actions (or search operators) corresponding to growing the partial solution S by one element from F\S; and T is the terminal state predicate that maps search nodes to {1, 0}, indicating whether the node is a terminal or not. Each terminal state in the search space corresponds to a complete solution (|S| = n, where |S| denotes the total number of candidates of S), while nonterminal states correspond to a partial solution (|S| < n). Thus, the decision process for constructing a complete solution corresponds to selecting a sequence of actions leading from the initial state (none of the sVLs are allocated) to a terminal state (all the n sVLs are allocated). In principle, we can employ any heuristic search procedure (e.g., greedy and beam search) guided by simulations.

Fig. 3. Nonhomogeneous VL utilization pattern of the 3-D SWNoC for the CANNEAL benchmark. The region between the second and third dies is denoted by VLs numbering 17–32, and carries 45% of the total VL traffic of the four-die 3-D system.

Algorithm 2 Greedy sVL Allocation
Input: F = set of m functional VLs, n = budget for spare VLs
Output: S, the best set of n functional VLs that get spares
Initialization: initialize solution set S = ∅
for each greedy step = 1 to n
    for each choice x ∈ F
        value(x) = simulator_call(S ∪ {x})
    end for
    x* = arg max_{x ∈ F} value(x)
    S = S ∪ {x*}    // Functional VL x* gets a spare
    F = F \ {x*}    // x* is removed from F
end for
return S
*simulator_call is a procedure that calculates and returns the network performance and lifetime for a given NoC, benchmark suite, and routing algorithm through extensive experiments.
D. Greedy Search for Spare-VL Allocation
This is the simplest search procedure (Algorithm 2). We start with an empty solution set S. In each greedy step, we add to S the functional VL from F\S that, when provided with a spare link, improves the reliability by the maximum amount. We repeat this greedy selection step until S is a complete solution (|S| = n). Step k evaluates the m − k + 1 remaining candidates, so greedy search makes mn − n(n − 1)/2 simulator calls in total, i.e., O(mn) for n << m.
The greedy search is able to produce highly effective sVL allocation that can significantly improve the reliability of the 3-D NoC. This effect of greedy sVL allocation was observed through experimental studies as the cost function is unknown and we need to find solutions via simulator calls. The allocation policy to allocate a spare (if sVL budget allows) to the first functional VL that fails with a given functional and sVL-based 3-D NoC configuration is highly effective. Intuitively, if we do not allocate spare to the functional VL that is expected to fail first, it will result in a cascade of VL failures reducing the lifetime of the chip drastically.
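The greedy procedure can be sketched in a few lines of Python. The `evaluate` callable stands in for the simulator call and is a hypothetical placeholder; the toy lifetime model in the example (a spare doubles a VL's MTTF and the chip lifetime equals the smallest MTTF) is an illustrative assumption, not the TSV failure model of [30].

```python
def greedy_svl_allocation(functional_vls, budget, evaluate):
    """Greedily choose which functional VLs receive a spare.

    functional_vls : iterable of VL identifiers
    budget         : number of spare VLs available (n)
    evaluate       : callable(list of spared VLs) -> quality, higher is
                     better (placeholder for the expensive simulator call)
    Returns the selected VLs and the number of evaluate() calls made.
    """
    candidates = list(functional_vls)
    selected = []
    calls = 0
    for _ in range(min(budget, len(candidates))):
        best, best_value = None, float("-inf")
        for vl in candidates:
            value = evaluate(selected + [vl])
            calls += 1
            if value > best_value:
                best, best_value = vl, value
        selected.append(best)
        candidates.remove(best)
    return selected, calls


if __name__ == "__main__":
    mttf = {1: 10.0, 2: 3.0, 3: 7.0, 4: 5.0}  # made-up MTTF values

    def toy_lifetime(spared):
        # Toy model: a spare doubles the VL's MTTF; the weakest VL decides.
        return min(t * (2 if vl in spared else 1) for vl, t in mttf.items())

    print(greedy_svl_allocation(mttf, 2, toy_lifetime))
```

In this toy run the spares go to the two VLs that would fail first, which mirrors the intuition given above, and the call count equals mn − n(n − 1)/2 for m = 4 and n = 2.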
E. Domain Knowledge for Sound Pruning
In 3-D NoC-enabled many-core chips, some VLs experience heavy traffic and high utilization as the underlying routing algorithm tries to find shortest paths between source and destination cores via these links. As a result, the VLs with high utilization undergo heavier stress, introduce additional delay in the path, and fail more quickly than the others. Moreover, this is not an independent phenomenon: one VL failure can decrease the time to failure of a neighboring VL, leading to a clustering effect as the workload of the neighboring links increases. For example, in Fig. 3, we show the traffic densities and the MTTF values of all the VLs for a 64-core, four-layer 3-D SWNoC for the CANNEAL benchmark (one of the PARSEC benchmarks, with the highest traffic injection load and skewed traffic). We can see that the traffic densities of VLs 17–32 (we call this region the critical region) are significantly higher than those of the others and, as expected, their MTTF values are significantly lower.
Our key insight is that for a small budget size n (say, less than the number of critical VLs), the spares should be allocated to some of the critical VLs only; there is no benefit in allocating spares to noncritical VLs (the chip will fail due to the failure of all critical VLs). We can use this domain knowledge to prune the search space of possible solutions for the spare-VL allocation problem. Let H ⊆ F correspond to the critical VLs, and let h = |H| be the total number of critical VLs. If we consider complete solutions from H only (i.e., subsets of size n from H), we can still retain the optimal solution. In other words, we get huge computational savings without losing any accuracy due to sound pruning. For exhaustive search, we can consider $\binom{h}{n}$ instead of $\binom{m}{n}$ candidate solutions, where h < m. For greedy search, we only consider the VLs from H for spare allocation.
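The size of the savings can be checked with a short calculation. In the sketch below, h = 16 matches the critical region of VLs 17–32, while m = 48 and n = 4 are illustrative assumptions (for instance, one VL per switch between adjacent dies), not values reported in this paper.

```python
from math import comb


def exhaustive_candidates(num_vls, budget):
    """Number of size-`budget` subsets an exhaustive search must simulate."""
    return comb(num_vls, budget)


def greedy_simulator_calls(num_vls, budget):
    """Simulator calls made by greedy search: one per remaining candidate
    in each of the `budget` steps."""
    return sum(num_vls - step for step in range(budget))


if __name__ == "__main__":
    m, h, n = 48, 16, 4  # m and n are assumed values for illustration
    print("exhaustive-full      :", exhaustive_candidates(m, n))
    print("exhaustive-restricted:", exhaustive_candidates(h, n))
    print("greedy-full          :", greedy_simulator_calls(m, n))
    print("greedy-restricted    :", greedy_simulator_calls(h, n))
```

Under these assumed numbers, restricting the search to H reduces exhaustive search from 194,580 to 1,820 candidate configurations, and greedy search from 186 to 58 simulator calls.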
For the rest of the experiments and analysis, we denote the baseline greedy and exhaustive search by greedy-full and exhaustive-full, respectively. In addition, these techniques enabled with domain knowledge-based pruning are named as greedy-restricted and exhaustive-restricted.
VI. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we first present the achievable performance and energy consumption profiles of our optimized 3-D SW NoC architecture. Then, we present a detailed reliability analysis in the presence of sVL insertion.
A. Experimental Setup
To evaluate the performance of different NoCs, we use a cycle-accurate NoC simulator that can simulate any regular or irregular 3-D architecture [36]. We consider a chip multiprocessor consisting of 64 cores and 64 network switches equally partitioned in four layers. In each die, 16 cores are placed at regular intervals in a grid pattern [3]. The length of each packet is 64 flits and each flit consists of 32 bits. The switches are synthesized from an RTL design using the TSMC 65-nm CMOS process in Synopsys Design Vision. All switch ports have a buffer depth of two flits, and each switch port has four virtual channels in the case of irregular NoCs. The NoC simulator uses wormhole routing, where the data flits follow the header flits once the router establishes a path. For the regular 3-D mesh-based NoC, XYZ-dimension order-based routing is used. For irregular architectures such as the SW network, the topology-agnostic adaptive layered shortest path routing algorithm is adopted [37]. The energy consumption of the network switches was obtained from the synthesized netlist by running Synopsys PrimePower, while the energy dissipated by wireline links was obtained through HSPICE simulations. We consider four SPLASH-2 benchmarks, namely, FFT, RADIX, LU, and WATER [38], and five PARSEC benchmarks, namely, DEDUP, VIPS, FLUIDANIMATE, CANNEAL, and BODYTRACK (BT) [39] in this performance evaluation. These benchmarks range from computation intensive to communication intensive and thus are of particular interest in this paper.
B. Performance of the Optimization Algorithm
In Section IV, we described the details of the STAGE optimization algorithm for designing the 3-D SWNoC architecture. Here, we first characterize the performance of the optimization algorithm by quantifying various performance metrics of the optimized 3-D SWNoC. To evaluate the performance of the STAGE algorithm, we compare it with two well-known combinatorial optimization algorithms, viz., SA [40] and genetic algorithm (GA) [41]. We evaluate the performance in terms of both the quality of solution and the convergence time.
1) STAGE vs. SA and GA: We create the initial network following the power law distribution shown in Section IV, where long-range links are placed randomly. Our goal is to find an optimized network starting from this random SW network. We call this initial NoC architecture 3-D SW_rand. Fig. 4 shows the communication cost of the optimized network from the STAGE, SA, and GA algorithms as a function of time.
To compare the performance of these optimization algorithms, we consider two parameters, viz., the quality of the solution and the convergence time. To make the comparison fair, we consider the same NoC configuration and apply each algorithm to optimize it. We used a machine configured with Intel Core i7-4700MQ processor and 8 GB RAM running at a clock frequency of 2.4 GHz. Fig. 4 shows the cost of the best solution obtained at any particular time for SA, GA, and STAGE. We take the best explored cost, O_best, as the measure of solution quality. It is evident that STAGE reaches O_best very fast (within 5 min). During the optimization process, the learned function E predicts an initial network configuration to start the local search procedure that can lead to lower communication cost (O). During the initial exploration phase, the error rate is nonmonotonic and high. After a few iterations the prediction error reduces to less than 1%, and after 20 iterations, the error is almost zero (0.05%). The prediction error remained more or less the same for all the subsequent iterations. These results indicate the effectiveness of our network features φ and the regression-learning algorithm. Note that the best O value decreases monotonically as the set of explored designs increases over the iterations. We also ran the same experiment with the γ-greedy starting state distribution as mentioned above. However, the communication cost O and the prediction error have similar characteristics as the random distribution for the benchmarks and the system size considered in this paper. Therefore, we present and discuss our results with a random starting-state distribution.
It is also seen that both SA and GA show similar trends in the cost function optimization. Both reach O_best more gradually than STAGE, and even after 50 min their respective O_best does not reach the same solution as STAGE. Note that we have to optimize the link locations for various applications. Hence, this additional time needed by SA and GA will be a significant overhead when we have to optimize and reconfigure the SWNoC in the field. It should also be noted that the final link distribution of the optimized 3-D SWNoC is the same for SA, GA, and STAGE. However, as shown in Fig. 4, the benefit of STAGE over SA and GA mainly comes from the much faster convergence time. We can conclude that the STAGE algorithm is more efficient in designing an optimized SWNoC with better performance. We denote the final optimized NoC as 3-D SW_opt.
2) Characteristics of the Design (Random vs. Optimized): Now we investigate why the STAGE-based optimization algorithm is suitable for developing energy-efficient NoC architectures. In Section IV, we described the details of the feature definition (φ) used to represent each network. We therefore explore how the design features change before and after the optimization process. Here, we specifically consider the role of the weighted communication feature mentioned in Section IV. Fig. 5 shows the weighted communication feature, which reveals the percentage of total communication that is constrained between two nodes separated by k hops (k ≥ 1). Careful observation of Fig. 5 shows that for 3-D SW_opt, the traffic constrained within one, two, and three hops increases compared to 3-D SW_rand. Moreover, the amount of traffic that has to traverse beyond three hops decreases.
Hence, the internode communication that takes place in less than three hops becomes more frequent. Since the average hop count of the optimized network is calculated to be 2.94, any communication below this average hop count can be considered to be efficient. Essentially, the optimized network becomes more efficient for the same objective function.
The inset in Fig. 5 shows the percentage of communication versus the number of hops, where the area under the curve denotes the weighted communication feature mentioned in Section IV. We can see that the 3-D SW_opt curve shifts toward the left, which means that, on average, any message in the optimized network traverses fewer hops than in the initial network. Hence, it spends less time inside the network and occupies fewer network resources. Therefore, the STAGE-based optimization algorithm converges to an efficient architecture.
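The hop-based view of traffic discussed above can be sketched as follows. This is only an illustration: the exact feature definition is given in Section IV, and the adjacency dictionary `adj`, the traffic dictionary `traffic`, and both function names are assumptions introduced here. The sketch also assumes a connected network, so that every destination is reachable.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def hop_counts(adj: Dict[int, List[int]], src: int) -> Dict[int, int]:
    """Breadth-first search: minimum hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def traffic_share_by_hops(adj: Dict[int, List[int]],
                          traffic: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Percentage of total traffic exchanged between node pairs that are k hops apart."""
    total = sum(traffic.values())
    share: Dict[int, float] = defaultdict(float)
    cache: Dict[int, Dict[int, int]] = {}
    for (s, d), volume in traffic.items():
        if s not in cache:
            cache[s] = hop_counts(adj, s)
        share[cache[s][d]] += 100.0 * volume / total
    return dict(share)


# Toy four-node example with made-up traffic volumes.
traffic = {(0, 1): 6, (0, 3): 2, (1, 3): 2}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}        # 0-1-2-3
shortcut = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # adds a long-range link 0-3
before = traffic_share_by_hops(chain, traffic)
after = traffic_share_by_hops(shortcut, traffic)
```

In this toy case, adding the long-range link moves the traffic of the pair (0, 3) from the three-hop bin to the one-hop bin, which is the kind of leftward shift described for 3-D SW_opt.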
C. Performance of 3-D SWNoC Compared to Other 3-D NoCs
In this section, we compare the performance of 3-D SWNoC with several existing 3-D NoC architectures. For the comparative performance evaluation, we consider 3-D MESH and two recently proposed irregular 3-D NoCs, namely, mrrm and rrrr [42] . Both the mrrm and rrrr NoCs have point-to-point vertical connections as in 3-D MESH and 3-D SW. However, their die-level planar connection pattern varies. For rrrr, all the four dies have randomly connected interconnection patterns. On the other hand, mrrm has random connection patterns in the middle two dies whereas the first and the fourth dies follow mesh-based regular connectivity. To build mrrm and rrrr, we follow the method suggested in [42] and keep the number of links equal to that of 3-D MESH and 3-D SW. All the performance metric values are normalized with respect to the 3-D MESH.
In addition, to show the effect of the optimization algorithm, we evaluate and compare the performance of the optimized NoC architecture with its un-optimized counterpart, marked as 3-D SW_opt and 3-D SW_rand, respectively.
1) Network Latency: Fig. 6(a) shows the normalized network latency of both the 3-D SW_rand and 3-D SW_opt NoC compared with other existing 3-D NoCs. The optimization improves the network latency by 3% on average over the un-optimized version and by 5.5% over the conventional 3-D MESH. The optimization process redistributes the links among the cores such that cores that have to frequently communicate with each other are either directly connected or need to traverse a small number of hops. This results in reduced average hop count and weighted communication for 3-D SW_opt NoC.
TABLE II COMPARISON OF AVERAGE HOP COUNT AND COMMUNICATION COST OF 3-D NOC ARCHITECTURES
It can be seen that among all the NoCs, 3-D MESH and 3-D SW_opt exhibit the highest and the lowest latency, respectively. The mrrm and rrrr architectures perform somewhere in the middle. As in the case of 3-D SW NoC, both mrrm and rrrr have irregularities in the horizontal planes. However, the number and the length of the links are not optimized for these architectures. For rrrr, the link distribution has a large number of long-range links that help communication among distant cores at the expense of local communication. In the case of 3-D SW_opt NoC, the link distribution follows the power law and the connection pattern is optimized to facilitate both the local and long-range communications.
The mrrm architecture maintains the link distribution in between rrrr and 3-D SW_opt NoC. Hence, its network latency lies in between rrrr and 3-D SW_opt. Finally, 3-D MESH NoC suffers from higher average hop count compared to other 3-D architectures due to multihop communication pattern; hence, it suffers from the highest network latency. Table II lists the communication costs and average hop counts for all these NoCs. As expected, 3-D SW_opt and 3-D MESH exhibit the lowest and highest communication cost and hop count, respectively, whereas mrrm and rrrr reside in between these two. The effect of these costs is reflected in the latency characteristics.
2) Energy Consumption: Energy consumption per message depends on the amount of energy consumed by the switches as well as the planar links and VLs. The STAGE-based optimization algorithm reduces the average hop count and communication cost by optimizing the objective function O specified in (1). As a result, both the switch and link energy consumption are reduced. Fig. 6(b) plots the energy consumption profile of the 3-D SWNoC before and after optimization along with the profile for other 3-D NoCs. All the energy values are normalized with respect to the corresponding values for the 3-D MESH. On average, the 3-D SW_opt NoC shows 33% and 17% energy consumption improvement over the 3-D MESH and 3-D SW_rand, respectively. Fig. 5 helps us in understanding the reasons behind the improvement in energy consumption. The area under the 3-D SW_opt curve is less than that of the un-optimized counterpart. Hence, 3-D SW_opt reduces the utilization of NoC resources for any message. As a result, both the switch and link energy decrease and the overall energy profile improves. Fig. 6(b) also demonstrates that among all other NoCs, 3-D MESH has the highest energy consumption followed by mrrm, rrrr, and 3-D SW_opt NoC. Higher network latency of any NoC increases the utilization of network resources and hence leads to higher energy consumption per message. For 3-D MESH, the switch energy consumption is significantly higher due to multihop communication, so it performs the worst among all of them. The mrrm and rrrr NoCs are capable of reducing the switch energy consumption compared to mesh and perform better than 3-D MESH. However, due to their random link distribution, they suffer from higher communication cost and average hop count compared to the optimized SW NoC. Hence, they consume more link energy and switch energy. With the least communication cost, 3-D SW_opt consumes the lowest energy among all these architectures.
The detailed breakdown of various components of the energy consumption of 3-D SWNoC is provided in the longer version [30] .
3) Energy-Delay-Product: The EDP is directly affected by the network latency and energy consumption. The architecture that performs best in terms of latency and energy consumption is expected to have lower EDP compared to the EDP of other 3-D NoCs. Fig. 6(c) presents the EDP profile of both the unoptimized and optimized 3-D SWNoCs along with other 3-D NoCs. From the EDP profile, we observe that the average EDP of 3-D SW_opt NoC is reduced by approximately 35% and 19% compared to 3-D MESH and 3-D SW_rand, respectively. In addition, among all the other 3-D NoCs, 3-D SW_opt NoC has, as expected, the lowest EDP profile followed by mrrm, rrrr, and 3-D MESH. Across all the benchmarks, 3-D SW_opt shows its largest EDP improvement, 43%, for RADIX. From the above results and analysis, we can conclude that 3-D SW_opt performs better than all other considered NoCs in terms of network latency, energy consumption, and EDP. Hence, for the rest of the experiments, we consider this optimized 3-D SW architecture and denote it by 3-D SWNoC for simplicity.
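Since the EDP is the product of energy and delay, a normalized EDP can be obtained from normalized energy and latency values. The sketch below is a rough cross-check only: it plugs in the average latency (5.5%) and energy (33%) improvements of 3-D SW_opt over 3-D MESH reported above, whereas the reported average EDP is computed per benchmark, and an average of products need not equal the product of averages. The function name is an assumption introduced for this illustration.

```python
def normalized_edp(energy: float, latency: float,
                   base_energy: float = 1.0, base_latency: float = 1.0) -> float:
    """Energy-delay product of a design relative to a baseline design."""
    return (energy * latency) / (base_energy * base_latency)


# Averages relative to 3-D MESH taken from the text:
# 33% lower energy and 5.5% lower latency.
sw_opt_energy = 1.0 - 0.33
sw_opt_latency = 1.0 - 0.055
sw_opt_edp = normalized_edp(sw_opt_energy, sw_opt_latency)
edp_reduction_percent = 100.0 * (1.0 - sw_opt_edp)
```

The product of the two averages gives a reduction in the mid-30% range, which is in line with the approximately 35% average EDP reduction reported for 3-D SW_opt.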
D. Performance of 3-D NoCs in Presence of Link Failures
In this section, we analyze the robustness of the 3-D SWNoC architecture under VL failure. The reason behind studying the scenario of VL failures is that despite the recent advancements in TSV technology, TSVs are still subject to failure due to voids, cracks, and misalignment [4] , [43] . In addition, TSVs also face the wear-out problem due to stress arising from potentially high workload. The imbalance in workload among different TSV-based VLs in the NoC also creates wide variation in TSV MTTF, where some VLs fail early compared to others. Due to all these reasons, if the TSV-based VLs fail, then the EDP and network latency of 3-D NoC increases and in the worst case, the corresponding NoC may contain disjoint source-destination pairs in the network. As a result, the performance of the 3-D NoC degrades over time. Fig. 7 demonstrates the EDP profile of 3-D MESH, mrrm, rrrr, and 3-D SWNoC with workload induced VL failure scenario with time for the CANNEAL benchmark (for up to 15 faults). All the EDP values are normalized with respect to fault free 3-D MESH at t = 0. From the figure, we can see that the EDP values of all the 3-D NoCs increase with time as the VLs fail progressively. Among all the NoCs, 3-D SWNoC shows the lowest EDP value for any particular period of time and the rate of increase in EDP is also lower than the other NoCs. As a result, 3-D SWNoC is expected to have longer lifetimes relative to other NoCs. Note that 3-D SWNoC is inherently robust against link failure and its average hop count increases only marginally in the presence of link failures due to the SW nature of the overall connectivity [28] . As a result, it shows better robustness and EDP profile in comparison to other NoCs.
To address this time-dependent failure of VLs and the EDP performance degradation, we propose to incorporate sVL allocation. As 3-D SWNoC is inherently more robust than other NoCs, we focus on allocating sVLs to 3-D SWNoC and analyze its performance with such an allocation.
VII. PERFORMANCE OF 3-D NOCS WITH SVL ALLOCATION
In this section, we analyze the sVL allocation methodology and its effects on lifetime and overall reliability of the 3-D NoC. To quantify the effects of sVL allocation, we first define the lifetime of the 3-D SWNoC. Whenever any functional VL fails in a 3-D SWNoC, the average hop count increases and hence, the network latency and EDP increase as well. Eventually, the EDP of the 3-D SWNoC may be higher than fault free 3-D MESH, where it can no longer be considered as an efficient NoC. At this point, the 3-D SWNoC loses its architectural advantages over a 3-D MESH. We consider the time required to reach this configuration as the lifetime of the 3-D SWNoC.
A. Greedy vs. Exhaustive Search for sVL Allocation
For the sVL allocation, we explored two different allocation algorithms as explained in Section IV: 1) greedy search and 2) exhaustive search. The allocation algorithms are named greedy-full and exhaustive-full (the suffix "full" is added to differentiate them from the algorithms with domain knowledge-based pruning, which we will introduce later). For brevity, we show results for three representative benchmarks with varying traffic patterns, viz., CANNEAL, DEDUP, and VIPS. These benchmarks are chosen because they have a wide variation in message injection rates, e.g., high (CANNEAL), medium (DEDUP), and low (VIPS). Fig. 8(a)-(c) plots the lifetime of the 3-D SWNoC with sVL allocation using the greedy-full and exhaustive-full algorithms for different numbers of sVLs for the CANNEAL, DEDUP, and VIPS benchmarks, respectively. From these figures, we can see that both the greedy-full and exhaustive-full sVL allocation algorithms achieve the same lifetime for the 3-D SWNoC. Note that greedy search (as expected) takes significantly less computation time to produce the solution when compared to exhaustive search. To explain this, we first need to understand the details of the sVL allocation procedure. To be more specific, the VL failure sequence and its effects on NoC performance need to be explored.
If any functional VL fails, then the workload of this particular VL negatively affects the other neighboring VLs and as a result, the EDP increases rapidly. Consequently, allocation of sVLs to the functional VL, which fails first, is expected to minimize the NoC performance penalty. If sVL is allocated without following the VL failure sequence, then the allocation effect may not be visible on both the EDP profile and lifetime at all.
To explain this behavior in more detail, we consider the case of the CANNEAL benchmark for the 3-D SWNoC and number the 48 VLs serially from 1 to 48 for a 64-core system (as shown in Fig. 3). For 8 sVLs, the sVL allocation solution from exhaustive search corresponds to assigning spares to functional VLs numbered 26, 22, 27, 10, 42, 43, 7, and 6. Somewhat surprisingly, greedy search also produced the same sVL allocation solution. Our experimental analysis showed that the greedy search produces sVL allocation solutions that can significantly improve the reliability of the 3-D NoC. The policy of allocating a spare (if the sVL budget allows) to the functional VL that fails first in the current 3-D NoC configuration, including any sVLs already allocated, is highly effective. Intuitively, if we do not allocate a spare to the functional VL that is expected to fail first, we will be faced with a cascade of VL failures, which will reduce the lifetime of the chip drastically. For example, the VL failure sequence without any spares allocated is 26, 22, 27, 10, 32, 30, 25, 18, and so on. Greedy search allocates the first spare to functional VL 26. The VL failure sequence after assigning a spare to VL 26 is 22, 26, 27, 23, 32, 30, 31, 25, 18, and so on. Greedy search allocates the second spare to functional VL 22. Continuing this policy, greedy search assigns spares to the same set of functional VLs as the exhaustive search. We found this behavior to be consistent across all the benchmarks.
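The greedy policy described above can be sketched as follows. This is a simplified illustration rather than the implementation used in our experiments: `failure_sequence` is a hypothetical stand-in for a simulator call that returns the order in which VLs are expected to fail for a given spare allocation, and the toy failure model (including its numbers) is made up purely to make the sketch runnable.

```python
from typing import Callable, FrozenSet, List, Sequence

# Hypothetical stand-in for one simulator call: given the set of VLs that
# already have a spare, return VL ids ordered by expected time of failure.
FailureSequence = Callable[[FrozenSet[int]], Sequence[int]]


def greedy_svl_allocation(failure_sequence: FailureSequence, budget: int) -> List[int]:
    """Give each spare to the VL expected to fail first that has no spare yet."""
    spared: List[int] = []
    for _ in range(budget):
        sequence = failure_sequence(frozenset(spared))
        candidate = next((vl for vl in sequence if vl not in spared), None)
        if candidate is None:  # every VL already has a spare
            break
        spared.append(candidate)
    return spared


# Toy failure model with made-up numbers (arbitrary time units): each VL has a
# nominal time to failure, and a spare extends it by a fixed amount.
TOY_MTTF = {26: 10, 22: 12, 27: 15, 10: 18, 32: 20}
SPARE_GAIN = 3


def toy_failure_sequence(spared: FrozenSet[int]) -> List[int]:
    def life(vl: int) -> int:
        return TOY_MTTF[vl] + (SPARE_GAIN if vl in spared else 0)
    return sorted(TOY_MTTF, key=life)


allocation = greedy_svl_allocation(toy_failure_sequence, budget=3)
```

With the toy model, the first spare goes to VL 26, after which VL 22 becomes the first to fail and receives the second spare, mirroring the sequence discussed in the text.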
B. Domain Knowledge for Pruning the Search Space
The time to compute the sVL allocation solution grows as the number of functional VLs (m) and the number of sVLs (n) increase for both exhaustive search and greedy search. The time complexities of exhaustive search and greedy search (in terms of the number of simulator calls) are $O\left(\binom{m}{n}\right)$ and $O(mn - n^2)$, respectively. For example, for a 64-core 3-D NoC with m = 48 and n = 8, the total solution exploration times for exhaustive and greedy search are 377,348,994q and 356q, respectively (here q corresponds to the computation time of a single simulator call, which is ∼7 min in the current experimental setup using a machine configured with Intel Core i7-4700MQ processor and 8 GB RAM running at a clock frequency of 2.4 GHz). Therefore, our sVL allocation algorithms may not scale for large-scale 3-D NoC. We consider using domain knowledge of the workload of different functional VLs to prune the solution space as described in Section V. In a 3-D NoC, the workload of some fVLs (say critical VLs) is much higher than the others and hence, their failure probabilities are higher too. Intuitively, when the sVL budget (n) is small, it is beneficial to allocate spares to some of the critical VLs only because the chip will fail due to a cascade of critical VL failures.
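The simulator-call counts quoted for m = 48 and n = 8 can be reproduced with a few lines of arithmetic. The sketch assumes that exhaustive search evaluates every n-element subset of the m functional VLs, and that greedy search evaluates every VL that does not yet have a spare at each of its n steps; the second assumption is ours and is chosen because it reproduces the count of 356. The function names are illustrative.

```python
import math


def exhaustive_calls(m: int, n: int) -> int:
    """Number of ways to choose n spare locations among m functional VLs."""
    return math.comb(m, n)


def greedy_calls(m: int, n: int) -> int:
    """Assumed count: m candidates in step 1, m - 1 in step 2, ..., m - n + 1 in step n."""
    return sum(m - i for i in range(n))


Q_MINUTES = 7  # approximate time of a single simulator call, from the text

exhaustive = exhaustive_calls(48, 8)
greedy = greedy_calls(48, 8)
greedy_hours = greedy * Q_MINUTES / 60
```

Under these assumptions the greedy count equals mn − n(n − 1)/2, which is of the same order as the mn − n² expression above, and the greedy search takes on the order of 40 h at about 7 min per call.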
We select a subset of critical VLs (say H) out of the m functional VLs that we will consider for allocating spares and prune the remaining ones. Pruning can improve the computational efficiency of solving the sVL allocation problem, but may potentially compromise the accuracy of solutions depending on the amount of pruning. We can consider varying amounts of pruning from |H| = n (only one candidate solution) to |H| = m (no pruning) to trade off speed and accuracy of producing sVL allocation solutions. A simple pruning strategy to achieve this goal is as follows: rank all the functional VLs according to their workload; select the top-|H| VLs to be considered for spare allocation; prune the remaining m − |H| VLs. We can use both exhaustive search and greedy search to find the solution from this restricted set of candidate solutions. We refer to the resulting exhaustive and greedy sVL allocation algorithms as exhaustive-restricted and greedy-restricted, respectively. Fig. 3 shows the traffic densities of all the VLs for a 64-core 3-D SWNoC consisting of four planar layers for the CANNEAL benchmark. It can be noted that the traffic densities of some VLs (critical VLs) are significantly higher than those of the others. To identify the critical VLs, we rank the VLs according to workload and select those with the highest workload. In this particular work, we consider 16 critical VLs. This number is chosen considering the worst-case VL failure scenario where all 16 critical VLs are placed in between two adjacent planar dies and if all of them fail together, then the NoC becomes completely unroutable. Therefore, we prune all the noncritical VLs, a total of 32 out of 48 (other than the 16 critical VLs). In other words, |H| = 16, corresponding to the 16 VLs carrying the highest workload, which is significantly smaller than m = 48 (the total number of VLs).
We found that with this setting, both the search algorithms with pruning produce the same sVL allocation solutions as their counterparts without any pruning (exhaustive-full and greedy-full) for different numbers of sVLs (from n = 1 up to any upper limit). In other words, we do not lose accuracy due to pruning. We do not show these results for the sake of brevity. The main benefit of pruning is that it improves the computational efficiency of producing sVL allocation solutions. As an example, Fig. 9(a) and (b) show the estimated runtime comparison of greedy-full and greedy-restricted, and exhaustive-full and exhaustive-restricted, respectively. We can see that the computational gains due to pruning are significant and come without any loss of accuracy.
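The workload-based pruning step can be sketched as follows. The workload values in the example are made up, and the function name `critical_vls` is an assumption for illustration; only the ranking rule (keep the top-|H| VLs by workload) and the sizes m = 48, |H| = 16, and n = 8 come from the text.

```python
import math
from typing import Dict, List


def critical_vls(workload: Dict[int, float], h: int) -> List[int]:
    """Rank VLs by workload (highest first) and keep the top h as candidates."""
    return sorted(workload, key=workload.get, reverse=True)[:h]


# Made-up traffic densities for six VLs, keyed by VL id.
toy_workload = {1: 0.10, 2: 0.45, 3: 0.05, 4: 0.80, 5: 0.30, 6: 0.60}
candidates = critical_vls(toy_workload, h=3)

# Size of the exhaustive search space before and after pruning for n = 8 spares.
full_space = math.comb(48, 8)        # all 48 VLs are candidates
restricted_space = math.comb(16, 8)  # only the 16 critical VLs are candidates
```

For n = 8, restricting the candidates to the 16 critical VLs shrinks the number of subsets that an exhaustive search has to consider from 377,348,994 to 12,870.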
C. Computing the Lifetime of 3-D SWNoC With sVL Allocation
In this section, we describe the procedure to compute the lifetime of any 3-D NoC configuration. For better understanding, we plot the EDP profile of 3-D SWNoC with and without sVLs incorporated into it, and graphically illustrate how to calculate the lifetime of any 3-D NoC.
As defined in the earlier section, the lifetime of any 3-D NoC is the time at which the EDP value of that particular NoC equals a certain threshold value. Since the performance requirement for the NoC is application and/or user dependent, the threshold value used to compute the lifetime of the 3-D NoC will vary. Fig. 10 illustrates the lifetime computation procedure for a 3-D SWNoC incorporating 8 sVLs for the DEDUP benchmark. This particular configuration is chosen as an example; however, the procedure is applicable to any other 3-D NoC and benchmark. For reference purposes, the EDP profile of the original 3-D SWNoC (without any sVL) is also plotted. The EDP of 3-D SWNoC is normalized with respect to the EDP of fault free 3-D MESH. To help illustrate the lifetime computation procedure more clearly, a dotted horizontal line is drawn in Fig. 10, which we call the lifetime line. This line corresponds to the 100% EDP value for the fault free 3-D MESH (at t = 0).
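The lifetime definition can be expressed as a short procedure: scan the normalized EDP profile over time and report the first time at which it reaches the threshold (the lifetime line). The sketch below is illustrative only; the sampled EDP profiles are made-up numbers, not data from Fig. 10, and the function names are assumptions.

```python
from typing import Optional, Sequence


def lifetime(times: Sequence[float], edp: Sequence[float],
             threshold: float = 1.0) -> Optional[float]:
    """First sampled time at which the normalized EDP reaches the threshold.

    A threshold of 1.0 corresponds to the EDP of the fault free 3-D MESH.
    Returns None if the profile never reaches the threshold.
    """
    for t, value in zip(times, edp):
        if value >= threshold:
            return t
    return None


def lifetime_improvement_percent(base: float, improved: float) -> float:
    """Relative lifetime gain of a configuration over a baseline, in percent."""
    return 100.0 * (improved - base) / base


# Made-up EDP profiles (normalized to fault free 3-D MESH), arbitrary time units.
times = [0, 1, 2, 3, 4, 5]
edp_without_spares = [0.65, 0.65, 0.80, 1.02, 1.20, 1.40]
edp_with_spares = [0.65, 0.65, 0.65, 0.70, 0.90, 1.05]

life_without = lifetime(times, edp_without_spares)
life_with = lifetime(times, edp_with_spares)
gain = lifetime_improvement_percent(life_without, life_with)
```

A different application or user requirement only changes the `threshold` argument; the scanning procedure itself stays the same.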
D. Effects of Spare-VL Allocation on 3-D NoC
Whenever an sVL is allocated to a functional VL, the sVL carries the traffic when the corresponding functional VL fails. This minimizes the effect of VL failure on 3-D NoC performance degradation and essentially helps in maintaining a lower EDP value over a longer period of time. However, there exists an upper limit on the number of sVLs, beyond which sVL allocation yields no further benefit. We call this number the optimum number of sVLs.
Depending on the benchmark and NoC configuration, the optimum number of sVL varies. In this paper, we consider 3-D SWNoC as the testbed for evaluating the performance of sVL allocation. However, subsequent experiments and analysis are equally applicable for other 3-D NoC architectures as well.
1) Optimum Number of Spare VLs: In this section, we evaluate the effects of different numbers of sVLs on the 3-D NoC performance. Fig. 11(a)-(c) shows the normalized EDP of 3-D SWNoC with time for the CANNEAL, DEDUP, and VIPS benchmarks, respectively. Similar to the previous experiments, we have considered these three benchmarks as representatives of high, medium, and low injection benchmarks from the PARSEC and SPLASH-2 suites. All the EDP values are normalized with respect to the EDP of fault free 3-D MESH with no sVLs allocated to it at t = 0.
From these figures, we can see that the EDP remains unchanged up to a certain point and after that, it increases when the functional VLs start failing. This happens due to the fact that initially no functional VL fails and EDP remains constant up to a certain time. Subsequently, VLs from the critical region (as defined in Section V-E) having high traffic density start failing. In such a link failure scenario, the traffic of the failed VLs is carried by the neighboring VLs along with their own traffic. This has two kinds of negative effects. First, the EDP and the network latency of the NoC increases due to a critical link failure. Second, the neighboring functional VLs also fail quickly which further degrades the NoC performance. As a result, the EDP increases at a faster rate.
Another interesting result is that as the number of allocated sVLs increases, the EDP profile shifts toward the right on the time scale. This implies that the 3-D SWNoC with sVL allocation can maintain a particular EDP level for a longer period of time. As expected, the lifetime of 3-D SWNoC also increases with sVL allocation. In addition, we can see that the difference between the EDP profiles on the time axis decreases gradually as the number of sVLs increases. For the CANNEAL benchmark, the right-most EDP profile is obtained for 8 sVLs. Even if we increase the number of sVLs beyond 8 for CANNEAL, the EDP profile does not shift to the right anymore. This implies that no further improvement of the EDP profile is possible, and we call this scenario the saturation effect of sVL allocation. Similarly, for 3-D SWNoC with the DEDUP and VIPS benchmarks, the EDP profile saturates at 14 sVLs.
2) Performance of 3-D SWNoC With Partial sVL-Allocation: In this section, we evaluate the performance of 3-D SWNoC with partial sVL allocation. With partial sVL allocation, instead of allocating an sVL (a total bundle of TSVs replacing the whole VL), we only allocate some sTSVs to an fVL and compare its performance with the full sVL allocation explored earlier. For partial sVL allocation, we need to consider the cross-coupling capacitance of the individual TSVs. If we consider a grid-based layout of the TSVs in a bundle, then the centrally located TSVs will have the highest cross coupling. We replace the TSVs that are affected most by the cross coupling in this partial allocation. As a case study, we consider this partial TSV allocation to the critical fVLs only and allocate 50% and 75% of the total TSVs in an fVL. We characterize the performance of this partial TSV allocation in comparison with the full sVL allocation. Fig. 12(a)-(c) shows the EDP profile with time of 3-D SWNoC with partial sTSV allocation. In these figures, 3-D SW-8 sVLs denotes the performance of 3-D SWNoC with 8 sVLs allocated (complete bundle allocation), whereas 3-D SW-8 sVLs_x% denotes the performance of 3-D SWNoC with individual sTSV allocation within the bundle (VL). For example, 3-D SW-8 sVLs_50% indicates that 50% of the TSVs within the bundle (for 8 VLs) have sTSVs. From the figures, it is clear that complete sVL allocation performs better than partial sTSV allocation. As the percentage of sTSV allocation increases, the EDP profile shifts right on the time scale and the lifetime improves consequently. It should be noted that allocating 100% sTSVs is equivalent to full sVL allocation (3-D SW-8 sVLs in the figure) and achieves the best EDP profile and maximum lifetime for the 3-D NoC.
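The selection of TSVs for partial allocation can be illustrated with a simple sketch. It assumes a rectangular grid of TSVs within a bundle and uses the number of adjacent TSVs (including diagonal neighbors) as a stand-in for cross-coupling strength; this proxy, the 4 × 4 grid size, and the function name are assumptions made for illustration and are not the capacitance model used in our experiments.

```python
from typing import List, Tuple


def most_coupled_tsvs(rows: int, cols: int, fraction: float) -> List[Tuple[int, int]]:
    """Pick the given fraction of TSVs with the most grid neighbors.

    The neighbor count (up to 8 per TSV) is used here only as a rough proxy for
    cross coupling, so centrally located TSVs are selected first.
    """
    def neighbours(r: int, c: int) -> int:
        return sum(
            1
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
        )

    cells = [(r, c) for r in range(rows) for c in range(cols)]
    ranked = sorted(cells, key=lambda rc: neighbours(*rc), reverse=True)
    return ranked[:round(fraction * len(cells))]


# Hypothetical 4 x 4 bundle with 50% and 75% of its TSVs given spares.
half = most_coupled_tsvs(4, 4, 0.50)
three_quarters = most_coupled_tsvs(4, 4, 0.75)
```

In the 4 × 4 example, the four central TSVs are always selected first, and the corner TSVs, which have the fewest neighbors, are the last to receive spares.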
3) Saturation of Lifetime Improvement: In this paper, we have considered a one-to-one correspondence between sVLs and functional VLs, where any sVL replaces one functional VL regardless of the workload intensity. Allocation of such sVLs increases the traffic carrying capability of the critical VLs and improves the lifetime of the 3-D NoC. As an example, Fig. 13 plots the percentage of lifetime improvement of 3-D SWNoC for the CANNEAL benchmark with different numbers of allocated sVLs. Note that similar lifetime improvements are observed for other benchmarks as well.
From the figure, we can see that as the number of allocated sVLs increases, the lifetime of the 3-D SWNoC also increases. Initially, the lifetime gain is almost linear in the number of allocated sVLs; later, the incremental gain decreases and the improvement saturates. Allocation of an sVL increases the combined lifetime of the particular VL (consisting of the sVL and the functional VL in this case), which helps to minimize the network latency and EDP degradation due to VL failure. In general, the most critical VLs fail early when compared to the other VLs. If the sVLs are allocated to the critical VLs, then they help in significantly increasing the lifetime of the NoC. However, the lifetime gain saturates as the number of allocated sVLs crosses a certain number. This happens because the combined lifetime of some critical VLs, even with the sVL allocation, is shorter than that of other noncritical VLs. Consequently, even if we allocate sVLs to these noncritical VLs, they do not improve the EDP beyond what is achieved already. Similar effects are also observed for the DEDUP and VIPS benchmarks (in these cases, the saturation effect was observed for 14 sVLs). However, we have omitted plotting such repetitive results and analysis.
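Given the lifetime obtained for each sVL budget, the optimum number of sVLs can be read off as the smallest budget beyond which the lifetime no longer improves. The following sketch shows one way to do this; the lifetime values are made up for illustration (they are not taken from Fig. 13), and the function name and the tolerance parameter `tol` are assumptions.

```python
from typing import Sequence


def optimum_spares(lifetime_by_spares: Sequence[float], tol: float = 0.0) -> int:
    """Smallest number of sVLs whose lifetime is within tol of the best lifetime.

    lifetime_by_spares[n] is the lifetime obtained with n sVLs allocated.
    """
    best = max(lifetime_by_spares)
    return next(n for n, life in enumerate(lifetime_by_spares) if life >= best - tol)


# Made-up lifetimes (arbitrary units) for 0 to 6 allocated sVLs: the gain is
# roughly linear at first and then saturates.
toy_lifetimes = [3.0, 3.6, 4.1, 4.5, 4.7, 4.7, 4.7]
n_opt = optimum_spares(toy_lifetimes)
```

A nonzero `tol` lets the designer stop earlier when the last few spares add only a marginal lifetime gain.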
VIII. CONCLUSION
We proposed a robust design optimization methodology to improve the energy efficiency of 3-D NoC architectures by combining the benefits of SW networks and machine-learning techniques to intelligently explore the design space. We showed that the optimized 3-D SWNoC architecture outperforms the existing 3-D NoCs. The optimized 3-D SWNoC achieves, on average, a 35% EDP reduction over the conventional 3-D MESH. We also demonstrated the efficacy and robustness of the 3-D SWNoC in the presence of nonhomogeneous workload induced VL failure. The proposed 3-D SWNoC shows better resilience and a better EDP profile against VL failure at any instant of time compared to state-of-the-art 3-D NoCs. We also proposed an sVL allocation mechanism to address the performance degradation and lifetime shortening problem due to VL failure. We showed that, with a small number of sVLs, we could exploit NoC domain knowledge to develop efficient and computationally inexpensive algorithms to find optimal solutions. The proposed sVL allocation significantly improves the reliability and lifetime of the 3-D NoC.
