Abstract: With the advent of complex computing applications such as cloud computing and artificial intelligence, multicore processors have become one of the best solutions for improving computation performance. Benefiting from silicon-photonic communication and three-dimensional (3-D) integration, the 3-D optical network-on-chip (3-D ONoC) has attracted extensive attention as a new multicore architecture that provides high communication bandwidth together with low transmission delay and power consumption. As the main parts of 3-D ONoCs, the topology and the optical router (OR) structure heavily affect the transmission efficiency of the whole network. In this paper, we propose a mesh-based topology and a novel cost-efficient 6 × 6 nonblocking OR structure. Unlike the traditional 3-D ONoC topology, which needs seven-port ORs for data transmission, the OR in our solution uses only six ports. This improvement effectively reduces the number of optical switching elements and cross waveguides in ORs, lowering the power consumption and hardware cost of the system. Simulation analysis based on a modified Noxim simulator demonstrates that, compared with the benchmarks, our method mitigates the OR- and network-level insertion loss, shrinks the floorplan/chip area, and improves the network scalability.
Introduction
The demand for complex computing applications is very high and still growing. The large amount of application data requires multi-core processors to complete the computation in a short time span. The Network-on-Chip (NoC), which efficiently utilizes multiple processors for parallel computing, used to be an attractive platform [1]. However, the metal interconnects typically used within NoCs limit the clock frequency to the range of 4-5 GHz [2], which seriously worsens the computing scalability in terms of operating frequency and power consumption [3].
Instead of the traditional NoC based on electrical/metal interconnects, the Optical NoC (ONoC) [4] - [8] provides higher communication bandwidth, as well as lower transmission delay and power costs. Moreover, owing to the breakthrough of Through Silicon Via (TSV) [9] and die stacking [10] technology, the three-dimensional ONoC (3D ONoC) [11] - [18] further improves the interconnect density, shortens the interconnect distance and improves the power efficiency. In summary, the 3D ONoC will undoubtedly become the future development trend of multi-core platforms.
The topology of 3D ONoCs defines the interconnection and physical layout of computing and communication nodes on the chip. Current research on topology design mainly focuses on the 3D mesh and 3D torus [12]-[15], [17]. On the other hand, different kinds of Optical Router (OR) structures based on the optical waveguide and the Microring Resonator (MR) have been proposed for 3D ONoCs [11]-[16]. The MR is one of the most widely used wavelength-selective optical switches in ONoCs. The authors in [11] proposed a 7 × 7 MR-based non-blocking fully connected crossbar OR structure for 3D mesh ONoCs. This OR structure had 49 MRs, 67 cross waveguides and 14 Optical Terminators (OTs). Here, the OT is an important but expensive device placed at the end of the optical waveguide [13], and its function is to absorb light so that it does not return to the transmission pipeline. However, under XYZ dimension order routing, some turns, such as the turn from the 'North/South' port to the 'West/East' port, never occur. Therefore, the fully connected crossbar can be optimized. Gu et al. proposed a 7 × 7 non-blocking optimized partial crossbar OR structure in [12], which was simplified to 30 MRs, 47 cross waveguides, and 10 OTs. Ye et al. proposed a 7 × 7 non-blocking OR structure for 3D mesh ONoCs in [13], which included 24 MRs, 31 cross waveguides and 3 OTs. In [14], Zhu et al. proposed a 7 × 7 non-blocking OR structure named Votex, which contained 36 MRs, 57 cross waveguides, and 2 OTs.
Unfortunately, the aforementioned ORs require 7 ports for data transmission when the number of vertical layers is at least 3. The number of ports in the ORs of 3D ONoCs can be reduced by adding vertical ORs [15], [16]. As a result, the intra-layer OR in [15], [16] had only 6 ports, because it merely handled the data transmission within the same layer. The OR in [15] contains 16 MRs, 10 cross waveguides, and 2 OTs. The OR in [16] contains 18 MRs, 10 cross waveguides, and 2 OTs. Using these two kinds of ORs reduces the number of MRs and cross waveguides compared with the traditional 7 × 7 OR structures in [11]-[14]. However, the ORs in [15], [16] still had the obvious disadvantage of being spatially blocking, and the proposed torus topology could cause deadlock, leading to a sharp decline in on-chip resource utilization. In short, none of the existing ORs is optimal for 3D ONoCs.
To solve the problems mentioned above, we proposed a 6 × 6 non-blocking OR structure in [19]. Using this OR effectively reduces the number of MRs and cross waveguides while guaranteeing non-blocking operation. In this paper, we extend our previous work by further optimizing the OR structure and by designing the corresponding topology and communication protocol. We also compare the hardware consumption, insertion loss, and scalability of our solution with those of the benchmarks in [11]-[16]. The simulation results based on the modified Noxim simulator demonstrate that our design performs better than the benchmarks. The main contributions of this paper are summarized as follows.
• We optimize the OR structure proposed in our previous work [19] for 3D ONoCs. Compared with the traditional 6 × 6 ORs mentioned in [15], [16], our OR achieves non-blocking performance. Compared with the traditional 6 × 6 non-blocking OR structure proposed in [13], using our OR reduces the worst- and average-case insertion loss by 36.6% and 8.3%. Compared with the traditional 7 × 7 non-blocking OR structures designed in [11]-[14], using our OR reduces the worst-case insertion loss by 47.7%, 32.9%, 16.7% and 50.2%, and the average-case insertion loss by 57.8%, 30.7%, 21.6% and 43.4%, respectively.
• As to the network-level performance comparison, given the network sizes of 3 × 3 × 3, 4 × 4 × 3, 5 × 5 × 3, and 6 × 6 × 3, respectively: 1) utilizing our OR reduces the worst-case insertion loss by 55.6%, 55.3%, 55.1% and 54.9%, and the average-case insertion loss by 57.4%, 57.7%, 58.1% and 58.1%, relative to the fully connected crossbar OR in [11]; 2) it reduces the worst-case insertion loss by 32.8%, 33.4%, 37.2% and 39.5%, and the average-case insertion loss by 29.1%, 32.0%, 34.7% and 36.3%, relative to the optimized partial crossbar OR proposed in [12]; 3) it reduces the worst-case insertion loss by 29.5%, 26.6%, 24.6% and 23.1%, and the average-case insertion loss by 19.3%, 20.4%, 21.6% and 22.0%, relative to the OR designed in [13]; 4) it reduces the worst-case insertion loss by 53.8%, 55.4%, 52.1% and 57.0%, and the average-case insertion loss by 39.9%, 44.5%, 44.8% and 50.6%, relative to the Votex OR presented in [14].
The rest of the paper is organized as follows. The 6 × 6 non-blocking OR, together with the corresponding 3D topology, communication protocol, and routing algorithm, is designed in Section 2. We analyze numerical results in Section 3. Finally, we conclude this paper in Section 4.
Optimized Design of OR Structures Supporting 3D ONoCs
In this section, we first introduce the basic optical switching element of ORs. Next, we propose a novel 3D ONoC topology and a communication protocol supporting our designed OR structure. As previously discussed, the number of OR ports can be reduced by adding vertical ORs in 3D torus ONoCs, which saves on-chip resources but introduces potential deadlock. We therefore propose a new 3D X-Mesh ONoC topology that combines the advantages of the 3D mesh and the 3D torus. Based on our topology and communication protocol, a novel 6 × 6 MR-based OR structure is also designed, featuring low insertion loss and no message blocking.
MR-Based Optical Switching Elements
The Optical Switching Element (OSE) is the basic unit of the OR. In this section, we introduce the most popular MR-based OSEs, as summarized in Fig. 1(a)-(c). An OR can be built from three types of basic elements: the 1 × 2 parallel OSE, the 1 × 2 cross OSE, and the 2 × 2 cross OSE. Both the 1 × 2 parallel and 1 × 2 cross OSEs consist of two optical waveguides and one MR, while the 2 × 2 cross OSE includes two optical waveguides and a pair of MRs. All OSEs have two states, i.e., the on-state when the signal wavelength is equal to the resonance wavelength of the MR, and the off-state when the signal wavelength differs from the resonance wavelength. The resonance wavelength is determined by the material and structure of the MR. Since the structure of the MR remains unchanged, the resonance wavelength is usually adjusted by injecting carriers through a p-n junction. The data injected into the input port bypasses the MR and leaves at the through port when the OSE is in the off-state, or is coupled into the MR and leaves at the drop port when the OSE is in the on-state.
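The two-state behaviour described above can be sketched in a few lines of Python. This is only an illustrative model: the class name `MicroringOSE`, the 1550 nm resonance, and the matching tolerance are assumptions introduced here, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class MicroringOSE:
    """Two-state model of an MR-based switching element (illustrative only)."""

    resonance_nm: float  # resonance wavelength of the MR (hypothetical value)

    def output_port(self, signal_nm: float, tol_nm: float = 1e-3) -> str:
        # On-state: the signal matches the resonance and is coupled to the drop port.
        # Off-state: the signal bypasses the MR and leaves at the through port.
        on_state = abs(signal_nm - self.resonance_nm) <= tol_nm
        return "drop" if on_state else "through"


ose = MicroringOSE(resonance_nm=1550.0)
print(ose.output_port(1550.0))  # drop
print(ose.output_port(1551.0))  # through
```

Tuning the resonance by carrier injection would correspond to changing `resonance_nm` at run time.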
3D X-Mesh ONoC Topology and Communication Protocol
The 3 × 3 × 3 X-Mesh ONoC topology is shown in Fig. 1(d). Similar to the 3D X-Torus, our topology has two kinds of OR structures: the intra-layer OR and the vertical OR. Each intra-layer OR has six ports in total and is connected to one local 'IP Core'. For inter-layer data transmission, the second-layer node is additionally equipped with a vertical OR. Thus, data can be transferred from one vertical layer to the others via vertical ORs. Moreover, every optical waveguide between a pair of ORs in the topology is bidirectional. The intra-layer ORs are arranged regularly in three dimensions, and each is labeled by the address (x, y, z), 0 ≤ x, y, z ≤ 2. Each vertical OR is labeled by the address (k), 0 ≤ k ≤ 8.
XYZ dimension order routing is still used for path selection in our 3D ONoCs. Additionally, due to the lack of optical storage, an optoelectronic hybrid interconnect is used in our topology. The electronic control layer is used for path establishment, while the optical layer is used for the transmission of payload data because the optical interconnection has high capacity [20]. The communication between nodes in our optoelectronic hybrid interconnect adopts the communication protocol described in [4]. As shown in Fig. 1(e), at the source node, the corresponding 'IP Core' generates a communication demand and then sends a path-setup signal to the destination node according to XYZ dimension order routing. The path-setup signal includes the source and destination addresses of the communication demand. When the correct destination node receives the path-setup signal, it sends an Ack signal back to the source node along the reverse of the setup path. This process is conducted in the electrical layer, after which the corresponding path resources are reserved. When the source node receives the Ack signal, the payload data is converted into optical signals by the Electro-Optical (EO) converter and transmitted to the destination node along the previously established optical path without repeating, regenerating, or buffering [4]. Finally, at the destination node, the optical signal is converted back into an electrical one and delivered to the local 'IP Core'. When the data forwarding is completed, the reserved path is released.
With this communication protocol, only a small part of the data (the control signals) is transmitted in the electrical network, and most packets pass through optical waveguides, which increases the network bandwidth and reduces the transmission delay.
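As a rough illustration of the path selection used during path setup, the sketch below enumerates the routers visited under XYZ dimension order routing. The function name `xyz_route`, the mapping k = x + n·y from a column to its vertical OR, and the idea that a single pass through the vertical OR reaches the destination layer are assumptions made for this sketch; the text does not specify them.

```python
def xyz_route(src, dst, n=3):
    """List the routers visited from src to dst under XYZ dimension-order routing.

    src, dst: (x, y, z) addresses of intra-layer ORs in an n x n x layers X-Mesh.
    Assumption (not from the paper): the vertical OR serving column (x, y)
    has index k = x + n * y, and one pass through it reaches any layer.
    """
    x, y, z = src
    hops = [("OR", (x, y, z))]
    # Resolve the X offset first, one mesh hop at a time.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        hops.append(("OR", (x, y, z)))
    # Then resolve the Y offset.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        hops.append(("OR", (x, y, z)))
    # Z is resolved last, through the vertical OR of the destination column.
    if z != dst[2]:
        hops.append(("VOR", x + n * y))
        hops.append(("OR", (x, y, dst[2])))
    return hops


route = xyz_route((0, 0, 0), (2, 1, 2))
for hop in route:
    print(hop)
```

Because Z is resolved last, a signal that enters an intra-layer OR from the vertical direction has already reached its destination column, so it only needs to be delivered to the local 'IP Core'.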
Novel 6 × 6 Non-blocking MR-based OR Structure
According to the topology shown in Fig. 1(d), the intra-layer OR needs six ports for data exchange, and the corresponding 6 × 6 non-blocking MR-based OR structure is shown in Fig. 2(a). Each port, i.e., 'IP Core', 'East', 'West', 'North', 'South', or 'Vertical', has an input and an output, so it supports bidirectional data transmission. The structure in Fig. 2(a) contains only 16 MRs and 13 cross waveguides. Thus, compared with the traditional non-blocking OR, our OR uses fewer cross waveguides and MRs, which helps to mitigate insertion loss, shrink the device area, and reduce power consumption.
As listed in Fig. 2(b): 1) no MR needs to be turned on when the optical signal travels along a single dimension within the OR (labeled as 'none'); 2) communication between the input and output of the same port is forbidden, and communication from the 'North/South' port to the 'East/West' port is not allowed either; 3) under XYZ dimension order routing, data coming from the 'Vertical' port can only be transmitted to the local 'IP Core'. Therefore, 14 possible paths are excluded, and they are labeled as "-" in Fig. 2(b); 4) finally, exactly one MR is turned on for each of the other 16 possible paths, each of which is established by a specific resonant MR (labeled as "MRn", n = 1, 2, ..., 16). As a result, our 6 × 6 OR provides a specific path for each communication pair, and its non-blocking property is proved by enumerating all possible cases. In addition to this 6 × 6 non-blocking OR structure tailored for intra-layer data transmission, the non-blocking vertical OR structure presented in [15] is used for inter-layer data transmission.
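The count of 14 excluded paths can be checked by enumerating all 36 input/output pairs and applying the three exclusion rules stated above. The port abbreviations and the helper `is_forbidden` are names chosen for this sketch.

```python
from itertools import product

PORTS = ["IP", "E", "W", "N", "S", "V"]


def is_forbidden(src: str, dst: str) -> bool:
    """Apply the exclusion rules stated in the text to one input/output pair."""
    if src == dst:
        return True  # no communication between input and output of the same port
    if src in ("N", "S") and dst in ("E", "W"):
        return True  # Y-to-X turns never occur under XYZ routing
    if src == "V" and dst != "IP":
        return True  # vertical traffic only terminates at the local IP core
    return False


pairs = list(product(PORTS, repeat=2))
forbidden = [p for p in pairs if is_forbidden(*p)]
allowed = [p for p in pairs if not is_forbidden(*p)]
print(len(pairs), len(forbidden), len(allowed))  # 36 14 22
```

Of the 22 remaining pairs, the text assigns one MR to each of 16 paths, which leaves 6 paths that need no MR at all.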
Simulation Results and Discussions
In this section, we first compare the structural complexity and hardware cost among our OR and traditional ORs displayed in [11] - [17] . We then evaluate the insertion loss under different cases. Finally, we compare the maximum number of wavelengths allowed for varying topology sizes.
Simulation Environment Setup
The simulation environment is built on an open-source NoC simulator named Noxim [21] in order to demonstrate the advantages of our topology and OR structure. Since Noxim is a simulator for 2D electrical NoCs, we modified it to support the 3D optoelectronic hybrid topology. Additionally, we added insertion loss parameters to the simulator for the corresponding analysis. The considered topologies are the traditional 3D mesh and the 3D X-Mesh. The topology sizes are 3 × 3 × 3, 4 × 4 × 3, 5 × 5 × 3, and 6 × 6 × 3. The OR types are our 6 × 6 OR, the three kinds of 6 × 6 ORs in [13], [15], [16], the 7 × 7 fully connected crossbar OR [11], the 7 × 7 optimized partial crossbar OR [12], the 7 × 7 OR [13] and the 7 × 7 Votex OR [14].
Structural Complexity and Hardware Consumption Analysis
In ONoCs, the complexity of an OR structure is related to the amount of hardware consumed, the data exchange/control strategy, and the production cost. The OR is mainly composed of OE/EO units, MRs, OTs and waveguides. Since the OE/EO unit is only used for the optical-electrical-optical conversion at the 'IP Core' port, each OR requires the same number of OE/EO units. Hence, the number of MRs, OTs and waveguides mainly determines the hardware consumption, including the production cost and the chip/floorplan area. Table 1 shows the hardware used by different ORs. As listed in Table 1, the hardware consumed by our OR is less than or equal to that of the others. More importantly, our OR has the same hardware consumption as the 6 × 6 OR in [13] but a lower average insertion loss, which will be discussed in the next section. In addition, our OR achieves non-blocking transmission, so it has the same exchange/control complexity as the non-blocking ORs in [11]-[14] and lower complexity than the blocking ORs in [15], [16]. In summary, our OR achieves lower structural complexity than the traditional 6 × 6 and 7 × 7 ORs.
To ensure fairness, we also analyze the hardware consumption of different ORs at the network level. Assuming that the topology size is N × N × 3, the four kinds of existing 7 × 7 ORs are used in the N × N × 3 mesh ONoC topology, while the 6-port (6 × 6) OR is tailored for our N × N × 3 X-Mesh topology. For our 3D X-Mesh, the cost of deploying vertical ORs is also considered, and each vertical OR is equipped with 4 MRs and 2 OTs. The N × N × 3 mesh topology requires N × N × 3 ORs of size 7 × 7, while the N × N × 3 3D X-Mesh topology requires only N × N × 3 ORs of size 6 × 6 plus N × N vertical ORs. Fig. 3 shows the number of MRs and OTs used by different ORs at the network level. As shown in Fig. 3, the number of MRs and OTs consumed by our OR is smaller than that of the others, and this advantage becomes more obvious as the topology size increases.
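The network-level MR counts follow directly from the per-OR figures quoted in this paper (49, 30, 24 and 36 MRs for the 7 × 7 ORs in [11]-[14], 16 MRs for our 6 × 6 OR, and 4 MRs per vertical OR). The following sketch, with function names chosen here for illustration, reproduces the MR savings for the 6 × 6 × 3 case.

```python
# MRs per router, taken from the figures quoted in the text.
MRS_7X7 = {
    "crossbar [11]": 49,
    "partial crossbar [12]": 30,
    "OR in [13]": 24,
    "Votex [14]": 36,
}
MRS_OURS = 16      # our 6 x 6 intra-layer OR
MRS_VERTICAL = 4   # each vertical OR


def mesh_mrs(n: int, mrs_per_or: int, layers: int = 3) -> int:
    """MRs in an n x n x layers mesh built from 7 x 7 ORs."""
    return n * n * layers * mrs_per_or


def xmesh_mrs(n: int, layers: int = 3) -> int:
    """MRs in an n x n x layers X-Mesh: intra-layer ORs plus n x n vertical ORs."""
    return n * n * layers * MRS_OURS + n * n * MRS_VERTICAL


def saving_pct(n: int, mrs_per_or: int) -> float:
    """Relative MR saving of the X-Mesh over a 7 x 7 mesh, in percent."""
    return 100.0 * (1.0 - xmesh_mrs(n) / mesh_mrs(n, mrs_per_or))


print(xmesh_mrs(6))  # 1872
for name, mrs in MRS_7X7.items():
    print(f"{name}: {mesh_mrs(6, mrs)} MRs, saving {saving_pct(6, mrs):.2f}%")
```

In this counting the relative saving does not depend on N, because both totals scale with N × N; only the absolute difference in MR count grows with the network size.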
Analysis of OR-Level Insertion Loss
We analyze the OR- and network-level insertion loss of different topologies and OR structures. The insertion loss parameters are configured as follows: the waveguide crossing loss is 0.16 dB, the waveguide bend loss is 0.005 dB/90°, the loss of dropping into a ring is 0.6 dB, and the loss of passing a ring is 0.005 dB [22]-[24]. The OR's insertion loss can be calculated by (1),
where $IL^{(R)}_{(i,j)}$ is defined as the insertion loss of travelling from the i-th port to the j-th port in the OR R. For the 7 × 7 non-blocking OR: i, j ∈ {North, East, West, South, IP Core, Up, Down}; for the 6 × 6 OR: i, j ∈ {North, East, West, South, Vertical, IP Core}; for the vertical OR:
The transmission paths between different pairs of ports within the OR result in different insertion losses. We list all possible paths under XYZ routing and give the detailed data in the table of Fig. 4. Here, 'N', 'S', 'W', 'E', 'V', and 'IP' denote 'North', 'South', 'West', 'East', 'Vertical', and 'IP Core', respectively. In addition, to present the results more clearly, we compare the maximum, minimum and average insertion loss in Fig. 4. For our OR structure, the maximum, minimum and average insertion loss are 1.605 dB, 0.6 dB and 1.089 dB, respectively. For the OR structure in [15], they are 2.47 dB, 0.34 dB and 1.172 dB, respectively. For the OR structure in [16], they are 1.87 dB, 0.34 dB and 1.064 dB, respectively. For the OR structure in [13], they are 2.53 dB, 0.6 dB and 1.188 dB, respectively.
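A minimal sketch of how a single path's insertion loss can be accumulated from the component losses listed above is given below. It assumes the path loss is simply the sum of the per-element losses; the function name and the element counts in the second example are hypothetical and do not correspond to a specific path in Fig. 4.

```python
import math

# Loss parameters quoted in the text [22]-[24], in dB.
CROSSING_DB = 0.16   # per waveguide crossing
BEND_DB = 0.005      # per 90-degree bend
DROP_DB = 0.6        # dropping into a ring (on-state MR)
PASS_DB = 0.005      # passing a ring (off-state MR)


def path_loss_db(crossings: int = 0, bends: int = 0, drops: int = 0, passes: int = 0) -> float:
    """Sum the component losses along one input-to-output path of an OR."""
    return (crossings * CROSSING_DB + bends * BEND_DB
            + drops * DROP_DB + passes * PASS_DB)


# A path that only drops into one MR matches the 0.6 dB minimum reported for our OR.
print(path_loss_db(drops=1))

# Hypothetical path: 2 crossings, 1 bend, 1 drop, 3 passed rings.
example = path_loss_db(crossings=2, bends=1, drops=1, passes=3)
print(round(example, 3))  # 0.94
```

Averaging `path_loss_db` over all allowed port pairs of a router would give its average-case insertion loss, and taking the maximum would give its worst case.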
We also analyze the insertion loss of the 7 × 7 ORs, as shown in Fig. 5. For the OR in [11], the maximum, minimum and average insertion loss are 3.07 dB, 2.065 dB, and 2.553 dB, respectively. For the OR in [12], they are 2.395 dB, 0.6 dB, and 1.571 dB, respectively. For the OR in [13], they are 1.925 dB, 0.6 dB, and 1.389 dB, respectively. For the OR in [14], they are 3.22 dB, 0.6 dB, and 1.924 dB, respectively.
The simulation analysis above gives the insertion loss for all input and output combinations in each OR. At the OR level, our OR outperforms the others. Specifically, our solution has the lowest maximum insertion loss among these eight ORs, and among the non-blocking ORs it also has the lowest minimum and average insertion loss.
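The relative reductions implied by the reported OR-level values can be recomputed with a short script. The dictionary layout and function name are choices made for this sketch; the dB values are the ones reported above. Small differences from the percentages quoted in the introduction may come from rounding of the reported dB values.

```python
OURS = {"max": 1.605, "avg": 1.089}

# Reported maximum and average insertion loss (dB) of the non-blocking benchmarks.
BENCHMARKS = {
    "6x6 OR [13]": {"max": 2.53, "avg": 1.188},
    "7x7 crossbar [11]": {"max": 3.07, "avg": 2.553},
    "7x7 partial crossbar [12]": {"max": 2.395, "avg": 1.571},
    "7x7 OR [13]": {"max": 1.925, "avg": 1.389},
    "7x7 Votex [14]": {"max": 3.22, "avg": 1.924},
}


def reduction_pct(ours: float, other: float) -> float:
    """Relative reduction of `ours` with respect to `other`, in percent."""
    return 100.0 * (other - ours) / other


for name, loss in BENCHMARKS.items():
    worst = reduction_pct(OURS["max"], loss["max"])
    avg = reduction_pct(OURS["avg"], loss["avg"])
    print(f"{name}: worst-case -{worst:.1f}%, average -{avg:.1f}%")
```

For the 6 × 6 OR in [13], for instance, this gives reductions of about 36.6% in the worst case and 8.3% on average.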
Analysis of Network-Level Insertion Loss
The network-level insertion loss is analyzed for different ORs under different topology types and sizes. The Network-level Insertion Loss (NIL) can be determined by:
where $NIL(s, d)$ is the NIL of travelling from the source OR s to the destination OR d. Fig. 6(a) shows the simulation result for the average insertion loss, and it shows that our OR has a lower insertion loss than the other non-blocking ORs at every network scale considered. In addition, this advantage becomes more obvious as the network size increases. When the topology size reaches 6 × 6 × 3, the average insertion loss is reduced by 58.129%, 36.269%, 22.045%, and 50.57% compared with the 7 × 7 OR structures in [11], [12], [13], and [14], respectively. Furthermore, we use the 7 × 7 fully connected crossbar OR as the benchmark to analyze the network-level insertion loss improvement of the other four ORs. As shown in Fig. 6(b), our solution achieves the largest improvement in average insertion loss. Moreover, the improvement ratio of our solution rises with the network size, while that of the other three approaches declines.
As can be seen from Fig. 7(a), the worst-case insertion loss of all ORs grows when the topology size increases. This is because, in a regular mesh topology, a larger network size results in a longer path length. Compared with the other two kinds of 6 × 6 ORs, the NIL of our OR is not the lowest. However, these two 6 × 6 ORs are blocking, and they reduce the insertion loss by sacrificing network throughput. Compared with the four kinds of 7 × 7 non-blocking OR structures, our solution has an obvious advantage in terms of NIL, and this advantage becomes more obvious as the network scale increases. For example, when the network size is 3 × 3 × 3, our OR reduces the insertion loss by 9.84 dB, 3.83 dB, 3.29 dB, and 9.15 dB compared with the fully connected crossbar OR, the optimized partial crossbar OR, the 7 × 7 OR in [13] and the Votex OR, respectively. When the network size is 6 × 6 × 3, our OR structure reduces the insertion loss by 17.41 dB, 9.33 dB, 4.20 dB, and 18.96 dB, respectively.
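As a consistency check, the absolute worst-case NIL values can be back-calculated from the dB reductions quoted here and the percentage reductions quoted in the introduction for the 3 × 3 × 3 network. The resulting absolute values are not stated in the text; they are only implied by these two sets of numbers, and the function name is chosen for this sketch.

```python
# (reduction in dB, reduction in percent) for the 3 x 3 x 3 network, as quoted in the text.
REDUCTIONS_3X3X3 = {
    "crossbar [11]": (9.84, 55.6),
    "partial crossbar [12]": (3.83, 32.8),
    "OR in [13]": (3.29, 29.5),
    "Votex [14]": (9.15, 53.8),
}


def implied_losses(delta_db: float, pct: float) -> tuple[float, float]:
    """Back-calculate (benchmark NIL, our NIL) in dB from an absolute and a relative reduction."""
    baseline = delta_db / (pct / 100.0)
    return baseline, baseline - delta_db


ours_estimates = []
for name, (delta_db, pct) in REDUCTIONS_3X3X3.items():
    baseline, ours = implied_losses(delta_db, pct)
    ours_estimates.append(ours)
    print(f"{name}: benchmark {baseline:.2f} dB, ours {ours:.2f} dB")
```

All four benchmarks imply nearly the same worst-case NIL for our OR, roughly 7.85 dB, which suggests that the quoted dB and percentage figures are mutually consistent.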
According to [25], the maximum number of wavelengths allowed for a given topology size is obtained by (3),
where P is the optical power allowed to be injected, S is the receiver sensitivity for the desired device and bit error rate, P − S is the network-level optical power budget, $IL_{max}$ is the worst-case insertion loss of the network, and n is the number of wavelengths used for the WDM signal. For a given power budget P − S, the value of n decreases as $IL_{max}$ increases. When $IL_{max}$ becomes small, the network can accommodate more wavelengths, further improving the network bandwidth and throughput. On the other hand, given a limited number of wavelengths, a small $IL_{max}$ allows more nodes to be accessed in the network. Fig. 7(b) indicates the maximum number of wavelength channels allowed for a given number of access points, assuming that P − S is 20 dB [25]. It can be clearly seen from Fig. 7(b) that our OR structure performs better than the other solutions in terms of the number of access points and wavelength channels, which means that our method has better scalability.
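The trade-off between worst-case insertion loss and wavelength count can be illustrated numerically. Equation (3) is not reproduced in this text, so the sketch below assumes the commonly used power-budget relation P − S ≥ IL_max + 10·log10(n); this form, the function name, and the sample loss values are assumptions and should be checked against [25].

```python
import math


def max_wavelengths(budget_db: float, il_max_db: float) -> int:
    """Largest n satisfying budget >= il_max + 10*log10(n).

    Assumed relation (not taken from the text): the optical power budget P - S
    must cover the worst-case insertion loss plus a 10*log10(n) penalty for
    sharing the injected power among n wavelengths.
    """
    return math.floor(10 ** ((budget_db - il_max_db) / 10.0))


BUDGET_DB = 20.0  # P - S, as assumed for Fig. 7(b)
for il_max in (10.0, 17.0, 25.0):  # hypothetical worst-case losses in dB
    print(il_max, max_wavelengths(BUDGET_DB, il_max))
```

Under this assumed relation, every additional 10 dB of worst-case insertion loss cuts the admissible wavelength count by a factor of ten, and a network whose worst-case loss exceeds the budget cannot carry even one wavelength.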
Conclusions and Future Work
In this paper, we proposed a novel 3D X-Mesh topology and a 6 × 6 non-blocking OR structure that uses fewer MRs and cross waveguides than existing designs. Simulation results demonstrated that our scheme had lower hardware consumption, lower insertion loss and higher scalability. More specifically, when the network size was 6 × 6 × 3, our scheme reduced the number of MRs by 64.6%, 42.2%, 27.8% and 51.8%, and the network-level average insertion loss by 58.129%, 36.269%, 22.045%, and 50.57%, compared with the 7 × 7 fully connected crossbar, the 7 × 7 optimized partial crossbar, the OR structure in [13], and the 7 × 7 Votex, respectively. Nowadays, data centers are constantly increasing the number of processor cores to speed up parallel computing. Therefore, new interconnect technologies and structures need to be developed to meet the demand for high-performance interconnection among an ever-growing number of processor cores. In the future, we will extend the structure proposed in this paper and combine it with WDM technology to achieve low-loss, low-latency and high-bandwidth data center interconnects using optical switching structures.
