Abstract. We present a novel 3D crossbar for future Network-on-a-Chip implementations. We introduce a routing algorithm for the 3D crossbar circuit and detail two specific 3D crossbar topologies. We evaluate the defect tolerance of the 3D crossbar and quantify the number of extra layers required to support arbitrary permutations as a function of the defect rate. Further, we estimate the circuit performance and advantages of the 3D crossbar circuit based on post-silicon devices.
Organization for 3D on-chip crossbar
We design the 3D crossbar to route any permutation between a set of N x N I/O terminals on one layer of the 3D microchip and a second set of N x N I/O terminals on another layer. Fig. 1 shows an example where one set of I/O terminals is a dense memory and the other is the datapath of a processor. For routing in the 3D on-chip crossbar, the x-wires, y-wires, and z-wires run in the x-axis, y-axis, and z-axis directions, respectively. In this paper, the circuit consists of several crossbar-switch layer sets (CLSs) sandwiched between a top I/O layer and a bottom I/O layer. One CLS consists of a layer of crossbar-switch matrix array (switch layer) and a few wire layers. We explore the implications of using both CMOS switches and post-silicon devices to implement the crossbar switching.

In the simple connection model (SC-model), the 3D crossbar circuit consists of 3N CLSs. The characteristic of this circuit is that the positions of the terminals in the top layer differ from the positions of the terminals in the bottom layer (see Fig. 2(b)). A circuit constructed this way contains fewer switches than a circuit whose top-layer and bottom-layer terminals are in the same positions. Moreover, in this model each route passes through fewer switches than in a circuit with same-position terminals. Each terminal is connected to a z-wire that transfers a bit from the terminal toward the terminal on the other side. The circuit consists of 3N sets of CLS with N x 2N z-wires. The 3N sets of CLS are obtained by stacking N layers of A-type CLS, N layers of B-type CLS, and N layers of A-type CLS. From now on, A-type CLS and B-type CLS are called CLS-A and CLS-B, respectively. The CLS-A has one wire layer consisting of N y-wires and one switch layer having 2N^2 switches. The switches in the switch layer connect y-wires and z-wires that are at the same x-coordinate. The CLS-B has one wire layer consisting of N x-wires and one switch layer having 2N^2 switches.
The switches in the switch layer connect x-wires and z-wires that are at the same y-coordinate. Fig. 2(c) and Fig. 2(d) show how the two wire types are connected in CLS-A and CLS-B, respectively, for a 2 x 2 terminal circuit. Z-top-wires, which connect to terminals in the top layer, are cut off between CLS-A and CLS-B. Z-bottom-wires, which connect to terminals in the bottom layer, are cut off between CLS-B and CLS-A. Fig. 2(b) shows the z-wire arrangement, viewed from the side of the circuit.
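As a quick check of the counts given above, the following sketch builds the A-B-A stack and tallies its CLSs and switches. The function name `sc_model_stack` and the dictionary layout are hypothetical and chosen only for illustration; the per-CLS figures (N wires per wire layer, 2N^2 switches per switch layer) are taken from the text.

```python
def sc_model_stack(n):
    """Describe the SC-model stack for an n x n terminal circuit (illustrative only)."""
    # N layers of CLS-A, then N layers of CLS-B, then N layers of CLS-A.
    kinds = ["A"] * n + ["B"] * n + ["A"] * n
    stack = []
    for kind in kinds:
        stack.append({
            "kind": kind,
            # CLS-A carries y-wires; CLS-B carries x-wires.
            "wire_axis": "y" if kind == "A" else "x",
            "wires": n,
            "switches": 2 * n * n,
        })
    return stack


stack = sc_model_stack(2)
num_cls = len(stack)                                # 3N CLSs
total_switches = sum(c["switches"] for c in stack)  # 3N * 2N^2 = 6N^3
print(num_cls, total_switches)
```

Under these figures the total switch count grows as 6N^3, which is the quantity the later models try to reduce.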
Simple Connection Model (SC-model)
Flexible Connection Model
Next we introduce the flexible connection model (FC-model), which improves on the SC-model and uses fewer CLSs and switches. In this model, one CLS provides the same function as three CLSs in the SC-model; in particular, each CLS in the FC-model acts as two CLS-A and one CLS-B from the SC-model. This model consists of N sets of CLS, 5N^2 switches, and three types of N z-wires. As in the SC-model, the terminal positions differ between the top layer and the bottom layer. Fig. 3(a) shows the location of the top-layer terminals and bottom-layer terminals. The z-wires connecting to terminals in the top layer and the bottom layer are called top-z-wires and bottom-z-wires, respectively. The remaining N z-wires are called aid-z-wires. In this model, each CLS consists of two types of wire layers and one switch layer. One wire layer is constructed of only N x-wires and the other is constructed of only 2N y-wires. The switches connect two wires in the same CLS as follows.
* Top-z-wire and y-wire at the same x-coordinate.
* Y-wire and x-wire that cross.
* X-wire and aid-z-wire at the same y-coordinate.
* Y-wire and aid-z-wire at the same x-coordinate.
* Bottom-z-wire and y-wire at the same x-coordinate.
Fig. 3(b) shows the relation of switches and wires in one CLS of the terminal circuit.
2.3 Trade-off between the number of CLSs and the number of wires
Considering the fabrication cost of the 3D crossbar, it is preferable that the 3D on-chip crossbar be able to connect any I/O terminal on the top layer to any I/O terminal on the bottom layer using the fewest CLSs. We assume that all the routes required in the SC-model and the flexible connection model have been decided. The number of CLSs in the crossbar circuit is related to the number of wires in one wire layer. When more wires can be placed in a physical CLS, the terminal locations change from those of the original two models. Fig. 4 shows the location of the top-layer terminals in a 4 x 4 I/O terminal circuit. One case has two wires in one wire layer of a CLS and the other has four wires in one wire layer of a CLS. Fig. 5 shows the relation between the number of CLSs and the number of wires for an 8 x 8 terminal circuit. As the number of wires in a wire layer increases, the number of CLSs can be decreased. Since the limit on the number of wires is determined by the fabrication process and the circuits used for the crossbar, the minimum number of CLSs can be determined using Fig. 5. It can also be seen that the minimum number of CLSs for the flexible connection model is one third of that for the simple connection model.
Fundamental Routing Algorithm for 3D Crossbar
The algorithm for routing our 3D crossbar is an adaptation of Leighton's Offline Mesh Routing algorithm (OMR) [3]. Leighton [...] The sum simplifies to O(N^2), making the entire result O(N^3.5).
Since this is the dominant work in the algorithm, the total algorithm runs in O(N^3.5) time.
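To illustrate the general idea behind three-phase offline routing on a mesh, the sketch below assigns each request an intermediate y-coordinate so that a y-move, an x-move, and a final y-move never collide. It is a textbook-style illustration, not the authors' exact procedure: the function names `assign_middle_rows` and `route` are hypothetical, the (x, y) coordinate convention is an assumption, and the matching step uses a simple augmenting-path search that makes no attempt to meet the running time stated above. The y-x-y order is chosen to mirror the A-B-A stacking of CLS-A (y-wires) and CLS-B (x-wires).

```python
import random


def assign_middle_rows(perm, n):
    """Assign each source an intermediate y so a y-x-y route is conflict-free.

    perm maps a source (x, y) to a destination (x, y) on an n x n grid and
    must be a bijection. Returns {source: intermediate_y}.
    """
    # Bipartite multigraph: one edge per request, source x -> destination x.
    edges = {x: [] for x in range(n)}
    for src, dst in perm.items():
        edges[src[0]].append((dst[0], src))

    mid_y = {}
    for y in range(n):
        match = {}  # destination x -> (source x, source terminal)

        def augment(sx, seen):
            # Augmenting-path step for the left node sx.
            for dx, src in edges[sx]:
                if dx in seen:
                    continue
                seen.add(dx)
                if dx not in match or augment(match[dx][0], seen):
                    match[dx] = (sx, src)
                    return True
            return False

        for sx in range(n):
            if not augment(sx, set()):
                raise ValueError("perm is not a permutation of the grid")
        # All requests in this perfect matching share intermediate coordinate y.
        for dx, (sx, src) in match.items():
            mid_y[src] = y
            edges[sx].remove((dx, src))
    return mid_y


def route(perm, n):
    """Return hops per request: source, after y-move, after x-move, destination."""
    mid_y = assign_middle_rows(perm, n)
    return {s: [s, (s[0], mid_y[s]), (d[0], mid_y[s]), d] for s, d in perm.items()}


n = 4
cells = [(x, y) for x in range(n) for y in range(n)]
targets = cells[:]
random.Random(0).shuffle(targets)
perm = dict(zip(cells, targets))
paths = route(perm, n)
```

Because the request graph is regular, a perfect matching exists at every iteration, so each group of requests receives a distinct intermediate coordinate and no two requests occupy the same position after any phase.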
Tolerance to Defects of Switches
In our defect model, we assume that only switches in the crossbar may be defective. In particular, we envision using nanoscale, post-silicon switches such as CNT switches and nanowires for the crosspoints. Consequently, we expect the switches to have a much higher defect rate than conventional lithographic devices. In contrast, the wires in this design remain lithographic wires, which we expect to be much less error prone. Moreover, since these are crossbars, wire defects are easily handled by providing spare wires. A free-switch is a non-defective switch that can be set to both the on-state and the off-state. A defective switch is a switch that can be set only to the off-state. We assume opens or breaks in these nanoscale devices are much more likely than shorts. Further, the high defect tolerance demonstrated with our algorithm suggests that it is reasonable to tune manufacturing to avoid shorts even at the expense of increasing open defects. We now introduce the defect-tolerant circuit for the SC-model under this defect model.
Defect tolerant strategy
Here the defect-SC-model has extra CLSs for redundancy so that routing can avoid all defects in the circuit. Fig. 6(b) shows the defect-SC-model. For the defective crossbar circuit, AL redundant layers are added to each group of CLSs. The crossbar circuit thereby consists of N + AL sets in each group: N + AL sets of A-type CLSs, N + AL sets of B-type CLSs, and N + AL sets of A-type CLSs. Based on this model, the minimum number of AL layers was estimated. The routing algorithm for the defect-SC-model is basically the same as that for the SC-model shown in Section 3. The two differ in how routes are assigned to crossbars in each step. In the defect-free algorithm, a crossbar is assigned uniquely. In the algorithm for the defect-SC-model, we use a greedy allocation strategy similar to the greedy method in [5]. In each step, the route must be assigned a crossbar whose two required switches are free-switches. First, starting from the uppermost CLS, we search for an unused crossbar at the same position in each CLS. Then we check the two switches that must be set to the on-state. If these two switches are free-switches, the algorithm selects that crossbar for the route. If not, the algorithm searches for a crossbar in another CLS. Fig. 6(b) shows an example of defect routing for a 2 x 2 terminal circuit. Because the switch of the crossbar in the second CLS is defective, the target route selects the crossbar of the third CLS.
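The greedy selection step described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `pick_cls` and the three boolean lists (one entry per candidate CLS, ordered from the top) are hypothetical, and the example values assume that the first CLS has already been taken by another route.

```python
def pick_cls(used, first_free, second_free):
    """Return the index of the first CLS, searching from the top, whose crossbar
    at the required position is unused and has both needed switches free.
    Return None if no CLS qualifies."""
    for k, (u, s1, s2) in enumerate(zip(used, first_free, second_free)):
        if not u and s1 and s2:
            return k
    return None


# Assumed scenario: first CLS already used, second CLS has a defective switch.
used = [True, False, False]
first_free = [True, False, True]
second_free = [True, True, True]
choice = pick_cls(used, first_free, second_free)
print(choice)  # index 2, i.e. the third CLS
```

A `None` result corresponds to the case where the redundant layers are exhausted and the route cannot be placed.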
Analysis of redundancy
We compute the number of redundant layers, AL, needed for routing in practice. Let R be the set of routing requests and r_i ∈ R be a routing request for 0 ≤ i ≤ N x N - 1.
Let a be the fault probability of each crossbar switch. A wire needs two free-switches to be usable as a route. Thus the probability of successfully using the wire is (1 - a)^2, and the probability that it is unusable is 1 - (1 - a)^2 = 2a - a^2. In the algorithm, the [...] log(1 - PC eN) / log(2a - a^2)
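The role of the term log(2a - a^2) can be seen in a simplified, single-route calculation. The sketch below is not the inequality of this section, which did not survive intact in the text. It assumes that one route may choose among m candidate wires whose switches fail independently, and asks for the smallest m that gives at least one usable wire with a target probability. The function name `min_candidates` and the parameter `p_route` are hypothetical.

```python
def min_candidates(a, p_route):
    """Smallest m with 1 - (2a - a^2)^m >= p_route, assuming independent defects.

    Equivalent to m >= log(1 - p_route) / log(2a - a^2) when 0 < a < 1.
    """
    if not (0.0 <= a < 1.0 and 0.0 <= p_route < 1.0):
        raise ValueError("expected 0 <= a < 1 and 0 <= p_route < 1")
    q = 2 * a - a * a  # probability that one candidate wire is unusable
    m = 1
    # Increase m until the chance that all m candidates fail is small enough.
    while q ** m > 1.0 - p_route:
        m += 1
    return m


print(min_candidates(0.1, 0.999))
```

The loop is used instead of a direct logarithm so that the defect-free case a = 0 needs no special handling.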
Computational Experiment
We implemented the defect-tolerant algorithm described in this section and calculated the number of extra layers for an 8 x 8 crossbar circuit as a function of the fault probability a of a switch. For statistical purposes, the same terminal circuit is routed 10^5 times for each a. Fig. 7 shows the maximum, minimum, and average number of extra CLSs, AL, for each fault probability a. This shows that, on average, the circuit will require no extra layers when the defect probability is at or below 100%. We calculate AL from the inequality in Section 4.2 for a routing success rate Pc of 0.999. Fig. 7 also shows this theoretical value of AL. The theoretical value of AL is almost the same as the experimental result. The maximum number of CLS layers is larger than the average case. Currently, we use a greedy method for selecting wires. Consequently, as part of our future work, we expect to be able to reduce the maximum number of layers required by using a more aggressive algorithm.

Fig. 6 Example of the routing for SC-model: (a) routing of the non-defect algorithm, (b) routing of the defect algorithm
Circuit Performance of 3D crossbar
We calculated the latency of 2D and 3D crossbar buses based on sub-90 nm CMOS technology. Fig. 8(a) shows the latencies associated with one crossbar switch and one interconnect between crossbar switches, and their dependence on the interconnect length between crossbar switches. The CMOS circuit used for the crossbar switch is shown in Fig. 8(b). For the 2D crossbar, since the maximum interconnect length is more than 1 mm, the latency of the crossbar is dominated by interconnect delay. For the 3D crossbar, since the maximum interconnect length can be decreased to less than about 100 µm by optimizing the design, the latency of the crossbar is dominated by the delay of the crossbar switch itself. The delay of the crossbar switch is naturally reduced by decreasing the transistor feature size. The delay will be further reduced, to less than half of those in Fig. 9 [2], by replacing the silicon MOSFET with a carbon-nanotube FET or Ge nanowire FET. Beyond the reduced delay, a further advantage of the 3D structure is that no repeaters are needed to reduce interconnect delay; this matters because the area and power overhead associated with repeaters increases rapidly as the transistor feature size decreases. Fig. 9 shows the estimated bandwidth of the 2D and 3D crossbars.
Frequency is the inverse of the crossbar latency calculated from Fig. 8(a), where the crossbar is composed of five crossbar switches and the interconnects between them. Here the footprint area of the crossbar circuit is assumed to be 0.1 mm x 0.1 mm. Whereas the bandwidth of a conventional 2D crossbar based on silicon CMOS is 100 Gbps to 1 Tbps, that of the 3D crossbar based on silicon CMOS and wafer-bonding technology is 10 Tbps to 1 Tbps. Furthermore, the bandwidth of the 3D crossbar based on post-silicon devices is expected to be over 1 Pbps.
Conclusion
Emerging post-silicon technologies open the way to densely integrated 3D switched interconnect. At the same time, these densely packed, nanoscale devices may be highly prone to defects. We have developed the detailed design of a 3-stage, 3D crossbar NoC and shown that it can be routed while tolerating a high defect rate with modest overhead. The 3D design keeps wires short, enabling high-bandwidth communication. NoC designs with such extremely large bandwidth can solve various data-transfer bottlenecks in future high-performance microchips.
