Abstract. The packet switching based Network-on-Chip (NoC) is an obvious interconnect design alternative to the shared bus, crossbar or ring based on-chip communication architectures used in Systems-on-Chip (SoCs). The three dimensional NoC (3D NoC) architecture attracts added interest because it offers improved performance and shorter global interconnects. In the 3D NoC architecture, the topology plays a vital role in determining the performance of the interconnect. The performance and cost metrics of a 3D NoC topology are generally evaluated by simulation. However, analytical models provide more insight into how the traffic related parameters influence the performance of the topology. In this paper, the traffic related parameters of a 3D NoC topology, namely the 3D Recursive Network Topology (3D RNT), are evaluated using a network calculus based methodology, and the results are compared against those produced by simulation.
Introduction

Billions of transistors are integrated in deep sub-micron VLSI/ULSI technology, which enables SoC designers to compose tens to hundreds of Intellectual Property (IP) blocks such as video and audio processors, memories, I/O peripherals and hardware accelerators. In present SoCs, shared medium, crossbar or ring based interconnect architectures are used to integrate the IP blocks. These interconnect architectures become restrictive as the number of IP blocks in the SoC grows beyond ten [1, 2]. In future, more IP blocks will be incorporated in SoCs to implement emerging complex computation, multimedia and network services.
N. Viswanathan et al.
Developing an efficient interconnect architecture to integrate the IP blocks in an SoC is a challenging research area. The interconnect designs used in current SoCs do not cope well as the number of IP blocks increases, owing to their poor scalability and drastic performance degradation. The packet switching based Network-on-Chip (NoC) has been recognized as an obvious interconnect design alternative that addresses these design challenges and achieves the high performance and scalability required by future nanoscale SoC architectures [3].
The advent of 3D technology, a result of advances in process and mixed signal integration technology, has provided the opportunity to expand the NoC into the third dimension (3D NoC) [4]. The design of a competent 3D NoC topology to share data among the IP blocks located in different layers is the key research area in the macro architecture of the 3D NoC [5, 6].
Generally, simulation experiments are conducted to study the performance of a 3D NoC topology, but simulation results give only limited insight into how the traffic related parameters influence the performance of the topology [8, 18]. Analytical models have recently been considered for evaluating the traffic related parameters; they allow designers to investigate trade-offs between different parameters rapidly so as to improve the performance of the 3D NoC topology [12, 13]. Among the many 3D NoC topologies proposed in [7, 8, 9], the 3D recursive network topology (3D RNT) has been identified as an appropriate topology for the 3D NoC. The topology offers a high degree of scalability and higher performance, and it is well suited to a modular approach and to implementing heterogeneous, distributed systems involving a large number of IP blocks.
In this paper, the performance and cost metrics of the 3D RNT are evaluated using a network calculus based analytical model, and the results are compared against those produced by the event driven, cycle accurate Network Simulator NS-2 to validate the effectiveness of the model. The performance and cost metrics chosen for the study are the end-to-end delay, the switch buffer size and the influence of the buffer size on the area overhead of the 3D NoC.
The rest of the paper is organized as follows. Section 2 summarizes related work, section 3 describes the mapping of the traffic related parameters onto the model, section 4 describes the end-to-end latency under different traffic patterns and section 5 explains the evaluation of the switch buffer size and its influence on the area requirement of the 3D NoC. The conclusion is given in the last section.
Performance and cost metrics analysis
Related works
Many topologies and routing algorithms have been developed for the 3D Network-on-Chip (3D NoC). To the best of our knowledge, finding an optimized topology and routing algorithm for the 3D NoC is still an open problem. The current literature [9, 10] focuses on partially, vertically connected 3D NoC topologies, which is an interesting research area in the 3D NoC. Several works have illustrated that there is a crucial need for analytical models to evaluate NoC interconnect design architectures [14, 15, 16]. Network calculus is an analytical model that has been used in [13, 17] to compute the input and output arrival curves, end-to-end delay bounds, switch buffer sizes and other performance metrics of a 2D mesh on-chip interconnect, with the results of the model compared against simulation results. In [7, 8], 3D NoC topologies are evaluated using Network Simulator-2 (NS-2), an open source simulator, where five traffic source-sink pairs are selected randomly and the simulated end-to-end delay is observed while varying both the injected data rate and the buffer size of the switches.
Mapping of the NoC parameters with the analytical model
In this section, we first introduce a recursive 2D NoC topology, from which a 3D NoC topology is then constructed. The proposed topology is a graph based recursive topology. The degree of a node in a graph or network is the number of links incident to that node. If all the nodes have the same degree, say µ, the graph is called a µ-regular graph. If µ = n-1, where n is the number of nodes in the graph, the graph is a complete graph. A complete graph with 4 nodes is our basic 2D module. A recursive network is denoted K(l, 4, t), where l is the number of tiers/layers and t is the number of levels, with l ≥ 1 and t ≥ 1. If l = 1, the topology is the 2D recursive network topology (2D RNT); if l > 1, it is the 3D RNT shown in Fig. 1 [7] [8] [9] [10] [11].
In the 3D RNT, the basic module K(1, 4, 1) is a 3-regular graph. In K(1, 4, 2), the degree of every node is 4 except for the four corner nodes of the intermediate layer, i.e. NW, NE, SE and SW, and four copies of K(1, 4, 1) are interconnected in a clockwise manner using six horizontal links, as shown in Fig. 1. In general, K(1, 4, t) is a 4-regular graph except for the four corner nodes in the intermediate layer, and it is constructed from four copies of K(1, 4, t-1). There are 2t copies of the basic module K(1, 4, 1) in the module K(1, 4, t), where t > 1. In this paper, we take K(3, 4, 2) as our topology [11]. In this topology, each layer has four clusters, each formed by grouping four nodes with one node acting as the cluster head (CH) [7, 8].
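The definitions of degree, regular graph and complete graph can be checked on the basic module with a few lines of Python. This is only an illustrative sketch: the node numbering 0..3 and the helper names are assumptions made here, not part of the topology definition in [7, 8].

```python
from itertools import combinations


def complete_graph(n):
    """Return the edge set of the complete graph on nodes 0..n-1."""
    return set(combinations(range(n), 2))


def degrees(n, edges):
    """Count the links incident to each node."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_regular(deg):
    """A graph is regular when every node has the same degree."""
    return len(set(deg)) == 1


# Basic 2D module: complete graph with 4 nodes (hypothetical labels 0..3).
basic_module = complete_graph(4)
module_degrees = degrees(4, basic_module)
print(len(basic_module), module_degrees, is_regular(module_degrees))
```

With n = 4 the degree of every node is n-1 = 3, which is the 3-regular property stated for the basic module.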
To evaluate the performance and cost metrics such as the end-to-end delay, the buffer size of the switches and the area requirement, we use network calculus, a mathematical model that allows us to specify the NoC parameters. The incoming traffic pattern of a switch is defined by the arrival curve $\alpha(t)$ of the incoming data flows. The incoming data flows consist of the cumulative flits arriving from the IP block interfaced with the switch and from the neighboring switches. If a switch $S_{xyz}$ has input data flows from n neighboring switches and its own IP block, then the arrival curve of the switch is the sum of the arrival curve of its own IP block and the output curves of the input data flows arriving from the n neighboring switches [14].
Consider a straight line $y - y_1 = m(x - x_1)$, where m is the slope of the line and $(x_1, y_1)$ is a point on the line. The maximum input data flow to a switch is constrained by an arrival curve of this form, where b is the maximum burst size of the data and r is the average rate of the data flow. In other words, the arrival curve of the switch $S_{xyz}$ is
$\alpha_{xyz}(t) = rt + b$ (1)
Similarly, the service offered by the switch is constrained by a service curve of the form $\beta(t) = R(t - T)$, where R is the guaranteed service rate and T is the maximum latency caused by a switch. In other words, when t > T, the service curve of the switch is
$\beta_{xyz}(t) = R(t - T)$ (2)
where T = k/R and k is the flit size in bytes. The output data flow of a switch is constrained by the following output curve:
$\alpha^{*}_{xyz}(t) = \alpha_{xyz}(t) + rT$ (3)
The input buffer size B of a switch is computed as follows. At time t, the number of bytes in the buffer is B = b + rt. Since the maximum number of bytes resides in the buffer at t = T, the maximum input buffer size of a switch is
$B = b + rT$ (4)
Similarly, the delay bound D of a switch is the maximum horizontal distance between the arrival curve and the service curve, which gives
$D = \frac{b}{R} + T$ (5)
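The relations between the arrival curve, the service curve and the resulting bounds can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, the service curve is assumed to be zero for t ≤ T, the bounds are assumed to require r ≤ R, and the delay bound is taken as b/R + T, which is the form consistent with the per-switch delay bounds listed later in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrivalCurve:
    r: float  # average rate of the data flow
    b: float  # maximum burst size

    def __call__(self, t):
        # alpha(t) = r*t + b
        return self.r * t + self.b


@dataclass(frozen=True)
class ServiceCurve:
    R: float  # guaranteed service rate
    T: float  # maximum latency caused by the switch

    def __call__(self, t):
        # beta(t) = R*(t - T) for t > T; assumed to be 0 otherwise
        return max(0.0, self.R * (t - self.T))


def _check_stable(alpha, beta):
    # Assumption: the bounds are only finite when the arrival rate
    # does not exceed the guaranteed service rate.
    if alpha.r > beta.R:
        raise ValueError("arrival rate exceeds service rate")


def output_curve(alpha, beta):
    """alpha*(t) = alpha(t) + r*T: the burst grows by r*T per switch."""
    _check_stable(alpha, beta)
    return ArrivalCurve(alpha.r, alpha.b + alpha.r * beta.T)


def buffer_bound(alpha, beta):
    """B = b + r*T."""
    _check_stable(alpha, beta)
    return alpha.b + alpha.r * beta.T


def delay_bound(alpha, beta):
    """D = b/R + T."""
    _check_stable(alpha, beta)
    return alpha.b / beta.R + beta.T


# Usage with arbitrary illustrative numbers (not taken from the paper).
alpha = ArrivalCurve(r=10.0, b=100.0)
beta = ServiceCurve(R=40.0, T=0.25)
print(buffer_bound(alpha, beta), delay_bound(alpha, beta))
```

Any consistent unit system can be used, for example bytes and seconds, as long as r and R share the same unit.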
The data flows considered here as a case study are generated by five different source-sink pairs that imitate the data traffic of a particular application. The data flows $f_1$, $f_2$, $f_3$, $f_4$ and $f_5$ generated by the nodes 200, 102, 101, 131 and 203 are consumed by the sinks 000, 030, 012 and 100. The sources are selected randomly and the sinks are selected according to the communication locality principle, in which 20% of the traffic takes place between neighboring nodes at distance d = 1 and the rest of the traffic is uniformly distributed among the other nodes. The locality principle assumes that the shorter the distance between a source-sink pair, the more traffic flows between them. The routing algorithm used in the 3D RNT has three steps [7, 8], namely (i) finding the destination layer, (ii) finding the destination cluster and (iii) finding the destination node in the destination cluster, and the routing paths of the five traffic flows are established in the topology by these steps.
There are four input data flows $f_1$, $f_2$, $f_3$ and $f_4$ to the switch $S_{100}$, and the following can be computed using equations (1) and (3). The arrival curve of the switch $S_{100}$ is
$\alpha_{100}(t) = 4rt + 4b + \tfrac{1}{2}rT + rT + 3rT + 2rT = 4rt + 4b + \tfrac{13}{2}rT$ (6)
From equation (6), the aggregate burst of the switch is $4b + \tfrac{13}{2}rT$ and its aggregate rate is $4r$. Substituting these for b and r in equation (4), the maximum input buffer size $B_{100}$ of the switch $S_{100}$ becomes
$B_{100} = 4b + \tfrac{21}{2}rT$ (7)
Similarly, the delay bound $D_{100}$ of the switch $S_{100}$ becomes
$D_{100} = \frac{4b}{R} + \left(\frac{13}{2}\cdot\frac{r}{R} + 1\right)T$ (8)
If the source injection rate is r = 100 Gbps, R = 400 Gbps, the maximum burst size of the data is b = 100 bytes and the flit size is k = 10 bytes, then the maximum latency caused by a switch is T = k/R = 0.191 ns. The arrival and service curves of the switch $S_{100}$ are drawn in Fig. 2. From Fig. 2, it is observed that the maximum input buffer size of the switch $S_{100}$ is $B_{100}$ = 426 bytes and the delay bound of the switch is $D_{100}$ = 8.131 ns.
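The evaluation of equations (6) to (8) for the switch S 100 can be reproduced numerically with the short sketch below. The unit conventions are assumptions made here, since the text does not state them: 1 Gbps is taken as 10^9 bit/s and one byte as 8 bits. Under these assumptions the sketch gives T = 0.2 ns, a buffer bound of about 426 bytes and a delay bound of about 8.5 ns, so the last two digits need not coincide with the values read from Fig. 2.

```python
BITS_PER_BYTE = 8   # assumption
GBPS = 1e9          # assumption: 1 Gbps = 10**9 bit/s

r = 100 * GBPS / BITS_PER_BYTE   # injection rate in bytes per second
R = 400 * GBPS / BITS_PER_BYTE   # service rate in bytes per second
b = 100.0                        # maximum burst size in bytes
k = 10.0                         # flit size in bytes

T = k / R                        # maximum latency of one switch, seconds

# Equation (6): aggregate arrival curve of S100 is rate_100*t + burst_100.
burst_100 = 4 * b + (13 / 2) * r * T
rate_100 = 4 * r

# Equation (7): B = burst + rate*T = 4b + (21/2) r T.
B_100 = burst_100 + rate_100 * T

# Equation (8): D = burst/R + T = 4b/R + ((13/2)(r/R) + 1) T.
D_100 = burst_100 / R + T

print(f"T     = {T * 1e9:.3f} ns")
print(f"B_100 = {B_100:.2f} bytes")
print(f"D_100 = {D_100 * 1e9:.3f} ns")
```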
End-to-end Latency
The arrival curve of the switch $S_{100}$ is given in equation (6). Similarly, the arrival curves of the switches participating in the data flows are as follows:
(i) $\alpha_{200}(t) = 2rt + 2b + rT$
(ii) $\alpha_{103}(t) = rt + b + 2rT$
(iii) $\alpha_{100}(t) = 4rt + 4b + \tfrac{13}{2}rT$
(iv) $\alpha_{000}(t) = 3rt + 3b + \tfrac{63}{8}rT$
(v) $\alpha_{001}(t) = rt + b + \tfrac{87}{24}rT$
(vi) $\alpha_{010}(t) = rt + b + \tfrac{111}{24}rT$
(vii) $\alpha_{012}(t) = rt + b + \tfrac{135}{24}rT$
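Having obtained the arrival curves, the data transfer delay bound $D_{xyz}$ of each switch $S_{xyz}$ is computed using equation (5) as follows:
(i) $D_{200} = \frac{2b}{R} + \left(1 + \frac{r}{R}\right)T$
(ii) $D_{103} = \frac{b}{R} + \left(1 + 2\cdot\frac{r}{R}\right)T$
(iii) $D_{100} = \frac{4b}{R} + \left(1 + \frac{13}{2}\cdot\frac{r}{R}\right)T$
(iv) $D_{000} = \frac{3b}{R} + \left(1 + \frac{63}{8}\cdot\frac{r}{R}\right)T$
(v) $D_{001} = \frac{b}{R} + \left(1 + \frac{87}{24}\cdot\frac{r}{R}\right)T$
(vi) $D_{010} = \frac{b}{R} + \left(1 + \frac{111}{24}\cdot\frac{r}{R}\right)T$
(vii) $D_{012} = \frac{b}{R} + \left(1 + \frac{135}{24}\cdot\frac{r}{R}\right)T$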
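The seven delay bounds share one pattern: a switch whose arrival curve is n·rt + n·b + c·rT has the delay bound (n·b + c·rT)/R + T. The sketch below tabulates them from the (n, c) coefficients of the arrival curves listed above. The unit conventions (1 Gbps = 10^9 bit/s, 8 bits per byte), the default R = 400 Gbps and k = 10 bytes, and all names are assumptions of this sketch.

```python
GBPS_TO_BYTES_PER_S = 1e9 / 8  # assumption: 1 Gbps = 10**9 bit/s, 8 bits/byte

# (n, c) such that alpha(t) = n*r*t + n*b + c*r*T for each switch.
ARRIVAL_COEFFS = {
    "200": (2, 1.0),
    "103": (1, 2.0),
    "100": (4, 13 / 2),
    "000": (3, 63 / 8),
    "001": (1, 87 / 24),
    "010": (1, 111 / 24),
    "012": (1, 135 / 24),
}


def delay_bounds_ns(r_gbps, b_bytes, R_gbps=400.0, k_bytes=10.0):
    """Per-switch delay bound D = (n*b + c*r*T)/R + T, in nanoseconds."""
    r = r_gbps * GBPS_TO_BYTES_PER_S
    R = R_gbps * GBPS_TO_BYTES_PER_S
    T = k_bytes / R
    return {
        switch: ((n * b_bytes + c * r * T) / R + T) * 1e9
        for switch, (n, c) in ARRIVAL_COEFFS.items()
    }


def mean(values):
    values = list(values)
    return sum(values) / len(values)


base = delay_bounds_ns(r_gbps=100.0, b_bytes=100.0)
small_burst = delay_bounds_ns(r_gbps=100.0, b_bytes=20.0)
reduction = 1.0 - mean(small_burst.values()) / mean(base.values())

for switch, d in base.items():
    print(f"D_{switch} = {d:.3f} ns")
print(f"average reduction for b = 20 bytes: {reduction:.1%}")
```

Under these assumptions, lowering b from 100 to 20 bytes at r = 100 Gbps reduces the average per-switch delay bound by roughly 72 percent.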
The assumptions needed to compute the end-to-end latency are that (i) the interconnect link delay is negligible and (ii) all switches process flits at the same speed. The delay bound $D_{xyz}$ of each switch participating in the data flows is computed and shown in Fig. 3. It is observed from Fig. 3 that the delay increases as the injection rate is increased at the fixed burst size b = 100 bytes. As the injection rate increases, more flits wait in the buffer before being serviced by a switch, which increases the delay bound. The delay bound is also observed by varying the burst size b at the fixed injection rate r = 100 Gbps, as shown in Fig. 4. It is observed from Fig. 4 that the average delay bound is reduced by 72% when the burst size b of the data flows is decreased to 20 bytes at r = 100 Gbps.
Having computed the delay bound of each switch participating in the data flows, the end-to-end delay bound of a particular data flow is computed by summing the individual delay bounds of the switches participating in that flow. Fig. 5 illustrates the end-to-end delay bound of the five data flows under five different injection rates; an increase in the injection rate increases the end-to-end delay because more flits fill the switch buffer and wait there before they are serviced by the switch. Fig. 6 shows the end-to-end delay bound as the burst size b is varied at r = 100 Gbps. A 74% reduction in the average delay is observed when the burst size of the data is decreased to 20 bytes, because fewer flits wait in the switch buffer before they are serviced by the switch.
The end-to-end delay bounds computed using the model are validated against simulation performed using the cycle accurate, event driven simulator NS-2. In the simulation, the ingress port of each switch where flits are generated or consumed has a buffer that behaves like a drop-tail queue with FIFO queue management, and the routing paths of the data flows and the participating nodes of the flows are defined using a deterministic routing protocol. An average deviation of 13% is observed between the model and the simulation, as shown in Fig. 5 and Fig. 6 [16, 19, 20].
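The summation of per-switch delay bounds along a routing path can be sketched as follows. The path used in the example is hypothetical and chosen only for illustration, because the actual routing paths of the five flows are not reproduced in this text. The injection rates 20 to 100 Gbps, the unit conventions (1 Gbps = 10^9 bit/s, 8 bits per byte) and the function names are likewise assumptions of this sketch; the (n, c) coefficients are those of the arrival curves of the three switches involved.

```python
GBPS_TO_BYTES_PER_S = 1e9 / 8  # assumption: 1 Gbps = 10**9 bit/s, 8 bits/byte

# (n, c) such that alpha(t) = n*r*t + n*b + c*r*T, taken from the
# arrival curves of switches 200, 100 and 000.
COEFFS = {"200": (2, 1.0), "100": (4, 13 / 2), "000": (3, 63 / 8)}


def switch_delay_ns(switch, r_gbps, b_bytes, R_gbps=400.0, k_bytes=10.0):
    """Delay bound of one switch, D = (n*b + c*r*T)/R + T, in ns."""
    n, c = COEFFS[switch]
    r = r_gbps * GBPS_TO_BYTES_PER_S
    R = R_gbps * GBPS_TO_BYTES_PER_S
    T = k_bytes / R
    return ((n * b_bytes + c * r * T) / R + T) * 1e9


def end_to_end_delay_ns(path, r_gbps, b_bytes):
    """Sum the delay bounds of the switches on the path.

    Link delay is neglected and all switches are assumed equally fast,
    as in the two assumptions stated in the text.
    """
    return sum(switch_delay_ns(s, r_gbps, b_bytes) for s in path)


# Hypothetical path, for illustration only.
HYPOTHETICAL_PATH = ["200", "100", "000"]

for rate in (20, 40, 60, 80, 100):  # assumed injection rates in Gbps
    total = end_to_end_delay_ns(HYPOTHETICAL_PATH, rate, b_bytes=100.0)
    print(f"r = {rate:3d} Gbps: end-to-end bound = {total:.3f} ns")
```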
Switch buffer size and its influence in the area overhead
Each switch has an input buffer where the incoming flits are temporarily stored before they are serviced by the switch. The buffer occupies the dominant part of the switch area, and hence accurate modeling of the switch buffer plays a crucial role in reducing the chip area [16]. The input buffer size $B_{100}$ of the switch $S_{100}$ is given in equation (7). The input buffer sizes of the other switches participating in the data flows are computed in the same way.
Fig. 7 shows the buffer size $B_{xyz}$ of the seven switches participating in the data flows at different injection rates, with the data burst size fixed at b = 100 bytes. The switch $S_{100}$ needs a larger buffer than the other switches because it participates in 4 data flows, whereas the switches $S_{103}$, $S_{001}$, $S_{010}$ and $S_{012}$ need less buffer space because each participates in a single data flow. It is observed from Fig. 7 that the average buffer size of a switch is about (n × 100) bytes, where n is the number of data flows arriving at the switch.
The data burst size b also plays a major role in determining the size of the switch buffer. Fig. 8 shows the buffer size of the seven switches participating in the data flows when the burst size b of the data flows is varied at the fixed injection rate r = 100 Gbps. It is observed from Fig. 8 that the buffer size is reduced by around 70% when b is decreased from 100 bytes to 20 bytes, because the switches need less buffer space when they handle less bursty data.
Further, the influence of the buffer size on the area overhead of the switch is demonstrated in this section. In the NoC, the three major sources of area overhead are the switches, the IP blocks and the links. The average area can be computed from equation (16)
where $N_s$ is the number of switches, $R_s$ is the switch silicon area required for the routing table and the logic implementing the routing algorithm, $a_s$ is the area required for one byte, $d_g$ is the average number of buffers inside the switch, $S_f$ is the flit size in bytes, $B_s$ is the average buffer size in bytes, $N_b$ is the number of IP blocks, $A_b$ is the area required for an IP block, $a_\ell$ is the area required for a link of length $L_\ell$ and $N_\ell$ is the number of bidirectional links. Equation (9) is reduced to $A = 33.50 + 0.80\,(b + rT)$, and the area overhead of the buffers of the switches participating in the data flows is computed for two cases: (i) the injection rate is varied with b fixed at 100 bytes and (ii) the maximum data burst size b is varied with r fixed at 100 Gbps.
Fig. 9 shows the area overhead of the switch buffers for the two cases. The graph is drawn by computing the buffer size of the switches participating in a particular data flow and plotting the cumulative switch buffer size of each data flow. It is observed from Fig. 9 that the area overhead is reduced by 6% when the injection rate r is reduced from 100 Gbps to 20 Gbps at fixed b = 100 bytes, and by 74% when b is decreased to 20 bytes at the fixed rate r = 100 Gbps.
It is concluded from Fig. 9 that the data burst size has more influence than the injection rate on reducing the area overhead, because bursty data flows require more buffer space than uniform data flows. The simulation is also performed for the two cases and its results are compared against the results of the model, as shown in Fig. 9; an average deviation of 16% is observed between the model and the simulation.
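The two cases can be explored with the sketch below. It computes the input buffer bound of each switch as the aggregate burst plus the aggregate rate times T, which for an arrival curve n·rt + n·b + c·rT gives n·b + (c + n)·rT, and it evaluates the reduced area expression 33.50 + 0.80 (b + rT) exactly as quoted. The unit of the area value is not stated in the text and is left unspecified here; the unit conventions for rates (1 Gbps = 10^9 bit/s, 8 bits per byte), the defaults R = 400 Gbps and k = 10 bytes, and all names are assumptions of this sketch.

```python
GBPS_TO_BYTES_PER_S = 1e9 / 8  # assumption: 1 Gbps = 10**9 bit/s, 8 bits/byte

# (n, c) such that alpha(t) = n*r*t + n*b + c*r*T for each switch.
ARRIVAL_COEFFS = {
    "200": (2, 1.0),
    "103": (1, 2.0),
    "100": (4, 13 / 2),
    "000": (3, 63 / 8),
    "001": (1, 87 / 24),
    "010": (1, 111 / 24),
    "012": (1, 135 / 24),
}


def r_times_T(r_gbps, R_gbps=400.0, k_bytes=10.0):
    """Bytes arriving from one flow during the switch latency T = k/R."""
    r = r_gbps * GBPS_TO_BYTES_PER_S
    R = R_gbps * GBPS_TO_BYTES_PER_S
    return r * (k_bytes / R)


def buffer_sizes(r_gbps, b_bytes):
    """Input buffer bound per switch in bytes: n*b + (c + n)*r*T."""
    rT = r_times_T(r_gbps)
    return {
        switch: n * b_bytes + (c + n) * rT
        for switch, (n, c) in ARRIVAL_COEFFS.items()
    }


def reduced_area(r_gbps, b_bytes):
    """Reduced area expression from the text; unit not stated there."""
    return 33.50 + 0.80 * (b_bytes + r_times_T(r_gbps))


base = buffer_sizes(r_gbps=100.0, b_bytes=100.0)
low_rate = buffer_sizes(r_gbps=20.0, b_bytes=100.0)    # case (i)
low_burst = buffer_sizes(r_gbps=100.0, b_bytes=20.0)   # case (ii)

for switch in base:
    print(switch, round(base[switch], 2),
          round(low_rate[switch], 2), round(low_burst[switch], 2))
print(reduced_area(100.0, 100.0))
```

In this sketch the burst size dominates the buffer bound of every switch, which matches the qualitative conclusion drawn from Fig. 9.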
Conclusion
In this paper, a network calculus based analytical model is used to evaluate the performance and cost metrics of the 3D RNT, namely the end-to-end delay, the buffer size and the area requirement. The arrival and service curves of a switch of the 3D topology are derived, from which the delay and the input buffer size of the switch can be computed easily for a given traffic pattern. The end-to-end delay of each of the five data flows is computed by summing the delay bounds of the switches participating in that flow. Further, the input buffer sizes of the switches participating in the data flows are computed, and the influence of the switch buffer size on the area overhead is discussed for various injection rates and data burst sizes. Simulation is also carried out using the open source simulator NS-2 to validate the effectiveness of the model. The deviation in the performance and cost metrics between the model and the simulation is about 14%. The methodology used in the paper can give designers a detailed insight into the influence of the traffic related parameters on the performance and area requirement of a 3D NoC topology, which is useful for exploring the design space of the 3D NoC architecture.
