The Network-on-Chip (NoC) is an emerging communication paradigm for System-on-Chip (SoC) designs. An NoC connects multiple processors and is usually targeted at embedded and other applications [3, 13]. The performance of a bus degrades as the number of processing elements grows and because of its transaction-oriented model [13]. This has attracted much attention to applying wireless network protocols such as CDMA, TDMA, and dTDMA in SoCs. TDMA systems use a fixed number of timeslots, which wastes bandwidth when timeslots are allocated but not used. The dynamic TDMA (dTDMA) bus arbiter dynamically grows and shrinks the number of timeslots to match the number of active transmitters [14]. In this paper, we present the design of an area-efficient switch for inter-layer communication in a 3-D NoC. The arbitration logic in the switch is based on a programmable priority encoder. A 640-bit message with a uniform random destination pattern was injected per IP per machine clock cycle. We obtained a maximum clock frequency of 2.09 GHz for 96 (4 x 8 x 3) IP cores connected in a mesh topology. The presented architecture demonstrates superior speed, latency, area, and power consumption compared with the existing implementation [14]. The maximum power consumption of the proposed area-efficient programmable arbiter is 0.625 mW. The design is synthesized using 180 nm TSMC technology.
Introduction
A System-on-Chip (SoC) consists of computational intellectual property cores (IPs), analog components, and interface and integration circuits that together implement a system on a single chip. The NoC can be defined as a communication paradigm that uses multiple processors, usually targeted at embedded applications. The NoC paradigm represents a promising solution for forthcoming complex embedded systems and multimedia applications. At the heart of current SoC technology, Moore's law expresses a continually increasing CMOS integration capability, which challenges the EDA community to deliver new design methodologies and tools that address ever-increasing system complexity [9]. The International Technology Roadmap for Semiconductors estimates that NoCs will soon contain billions of transistors running at speeds of many GHz (line rate) and operating below 1 volt [9]. A typical NoC application consists of multiple storage components (memory cores) and processing elements, such as general-purpose CPUs, specialized cores, and embedded hardware, connected together over a complex communication architecture.
In 1974, Peter Stoll integrated the LCD driver transistors and timing functions onto a single Intel 5810 CMOS chip that appeared in the Microma watch and is known as the first true SoC [5]. A survey of circuitry for wrist watches has also been reported [7]. Since 1974, design trends have been shifting towards multiple homogeneous or heterogeneous intellectual property cores on a single chip. The growing number of IPs on a single chip demands efficient interconnects, high performance, and low power consumption. A variety of interconnection networks are in use, including crossbars, rings, buses, and NoCs [10], but buses and NoCs have been dominant in SoC design. Buses are a simple shared-medium interconnect that is quite common and well understood [10]. A comparison between bus and NoC is shown in Table 1. Bus protocols are easy to understand and implement, but the performance of a bus degrades as the number of processing elements (PEs) grows and because of its transaction-oriented model.

Table 1. Comparison between bus and NoC.

| Bus | Network-on-chip |
| --- | --- |
| The bus uses a simple protocol and is well understood. | Light-weight computer network protocols are applied for communication among IPs. |
| The limited bandwidth is shared by all units attached to the bus. | The total available bandwidth scales up with the network size. |
| The delay in the bus arbiter grows with the number of masters. | Routing and arbitration decisions are distributed across the routers. |
| The bus is almost directly compatible with most of the available IPs. | NoC-oriented IPs need wrappers or a network interface to communicate. |
| The silicon cost of a bus is comparatively lower than that of an NoC. | The network requires a large silicon area for routers and links. |
| The bus has only arbitration latency; other latency is almost zero. | Internal network contention adds a significant amount of latency. |
Traditional bus architectures such as CoreConnect, AMBA, and OCP-IP have inherent disadvantages that stem from their transactional model of operation [1, 12, 14]. Read and write operations on a traditional bus are split into transactions that involve a delay of at least one clock cycle [14]. This has attracted the attention of researchers [2, 11] to applying wireless network protocols such as CDMA, TDMA, and dTDMA in NoCs. TDMA systems use a fixed number of timeslots regardless of the actual number of requests made. In this paper, we propose the architecture of an area-efficient programmable arbiter for inter-layer communication in a 3-D NoC. The proposed design can inject traffic into the switch arbiter directly and thus bypass the crossbar switch. This approach saves one cycle at the crossbar switch whenever a packet is destined for another layer.
To validate the proposed architecture, the circuit has been implemented in IC Station (Mentor Graphics). The energy consumption has been determined with a T-Spice based model using 0.18 micron TSMC technology. The rest of the paper is organized as follows. In Section 2, we present related work on dynamic TDMA. Section 3 presents the technique for dynamic timeslot allocation, the pipelining, and the architecture of the proposed arbiter. Functional simulation, latency analysis, synthesis results, and power analysis of the proposed architecture are laid out in Section 4. Finally, we conclude the paper and discuss future research plans in Section 5.
Related work
There have been several attempts to minimize long-distance interconnects from both the power and the speed perspective. Network-based communication infrastructures and three-dimensional (3-D) designs, in which multiple device layers are stacked together, offer low power and high speed. The emergence of 3-D circuits provides an opportunity to reduce the wire length of global interconnects, resulting in an increase in performance and a decrease in power consumption [4, 6]. The 3-D architecture reduces wiring length by a factor of √n, where n is the number of layers used in the NoC [8]. The inter-layer interconnect can extend the NoC into three-dimensional space. 2-D and 3-D mesh-based architectures have been proposed and evaluated in [6]. Generally, a 3-D architecture is a straightforward extension of a 2-D architecture. The authors of [6] have proposed three different architectures for inter-layer communication. The first, Stacked Mesh, is a hybrid communication paradigm between a packet-switched network and a bus. For consistency with [14], the authors considered the use of a dynamic time-division multiple-access (dTDMA) bus. However, [6] does not report any design issues concerning the proposed use of dTDMA in a 3-D Network-on-Chip architecture. The authors have also proposed a second method of constructing a 3-D NoC by placing switches at each layer. This design has no vertical bus; inter-layer communication is performed via a switch placed at a predefined layer of the 3-D NoC. The third design is called the ciliated 3-D mesh. The ciliated 3-D mesh network has multiple IP blocks per switch, so this architecture has lower bandwidth and reduced connectivity compared with a complete 3-D mesh network. However, this type of network offers an advantage in terms of energy dissipation under specific traffic patterns [6].
In the existing literature [6, 14], traffic is injected into the main switch even if the packet is destined for another layer. This costs one cycle of latency to eject the packet from the crossbar switch. In our design, the packet bypasses the crossbar switch: the routing information in the header field is extracted, and if the packet is destined for another layer, it is injected directly into the switch arbiter. The dTDMA bus offers improved performance over newer enhanced standards such as Multi-layer AMBA, the SAMBA bus, and OCP-IP. These buses, in their simplest form, waste a significant portion of wire resources and time in complex arbitration logic [14]. In addition, CDMA has been evaluated as an efficient protocol, but it has a major drawback in spreading code generation and management [11]. The CDMA channel is multivalued and must be represented in analog form. The Silicon Backplane III has achieved up to 90% bandwidth utilization using TDMA arbitration [3], and dTDMA has achieved nearly 100% bandwidth utilization [14]. However, the authors of [14] have not discussed any arbitration logic implemented in the design other than four fully-tapped feedback registers, which are able to generate only the grant signal. They have also not discussed any hardware control logic to generate the arbitration signal. In this paper, we have designed an area-efficient programmable arbiter to transfer data between layers in a 3-D NoC.
Architecture of proposed programmable arbiter
In this section, we present the background of dynamic timeslot allocation, pipelining, architecture, and arbitration logic for 3-D Network-on-Chip.
Dynamic timeslot allocation
A TDMA bus allows processing elements to transmit data on the shared bus wires in a round-robin fashion. If a bus connects N processing elements, each PE is allowed to transmit data on the bus once every N cycles. The bus access time is divided into timeslots (time slices), as opposed to being divided by code words as in CDMA. TDMA bus arbitration logic uses a fixed number of timeslots regardless of the number of actual requests made. This strategy is simple to implement but wastes bandwidth when timeslots are allocated but not used. Often, a significant portion of the timeslots is wasted because many incoming channels (South, East, West) have no packet to transmit even though a timeslot is allocated to them, as shown in Figure 1. Thus, under a time-varying traffic profile, static (fixed-timeslot) TDMA arbitration schemes will underperform. The underutilization of the TDMA bus can be removed by re-assigning the timeslots. In dTDMA, the arbitration logic is modified so that timeslots are allocated according to the number of actual requests made. The number of timeslots therefore grows and shrinks dynamically, depending on the requests made at a particular time by the participating channels. In hardware, it is difficult to modify the arbitration data within a single clock cycle.
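The difference between fixed and dynamic slot allocation can be illustrated with a small behavioural model. The following is a minimal sketch, not the hardware described in this paper: the function names and the request pattern are hypothetical, and the request vector is assumed to stay constant over the simulated window.

```python
from typing import List, Optional

# Channel order assumed for illustration: North, South, East, West, IP.

def static_tdma_schedule(requests: List[bool], cycles: int) -> List[Optional[int]]:
    """Fixed TDMA: slot t always belongs to channel t % N, whether it is used or not."""
    n = len(requests)
    owners = []
    for t in range(cycles):
        ch = t % n
        owners.append(ch if requests[ch] else None)  # None marks a wasted slot
    return owners

def dynamic_tdma_schedule(requests: List[bool], cycles: int) -> List[Optional[int]]:
    """dTDMA: the slot frame contains only the channels that are currently requesting."""
    active = [ch for ch, req in enumerate(requests) if req]
    if not active:
        return [None] * cycles
    return [active[t % len(active)] for t in range(cycles)]

def utilization(owners: List[Optional[int]]) -> float:
    """Fraction of slots that actually carry data."""
    return sum(owner is not None for owner in owners) / len(owners)

# Hypothetical pattern: North and IP request; South, East, and West are idle.
requests = [True, False, False, False, True]
static = static_tdma_schedule(requests, 10)
dynamic = dynamic_tdma_schedule(requests, 10)
print(utilization(static), utilization(dynamic))
```

With two of five channels active, the fixed schedule leaves three of every five slots empty, while the dynamic schedule shrinks the frame to two slots and keeps the bus busy in every cycle.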
Pipelined approach for slot allocation and data transfer
To achieve single-cycle arbitration and data transfer, the design exploits fine-grain parallelism in the arbiter. Figure 2 shows an example of request and grant in TDMA arbitration. In dTDMA, when a PE needs to transmit, it asserts a request signal to the arbiter to allocate a timeslot. The vertical bus arbiter (dTDMA) decides the timeslot assignment for each PE and produces new configuration data for each active transceiver before the next clock cycle [14]. This implies that arbitration is completed between the clock edges. The authors of [14] have not discussed how the requests are registered and processed simultaneously in a single clock cycle, nor is any hardware circuitry reported. This is an important aspect of arbiter design. The authors mention that, on the next clock edge, active transceivers load the new timeslot configuration data and continue transmission or reception as required. They also note that when a transceiver finishes its operation, it de-asserts the request signal and the bus arbiter de-allocates its timeslot (shrinking the number of timeslots) [14]. In [14], only a modified linear feedback shift register (LFSR) is presented to generate the grant signal; no arbitration design is presented apart from a mention of the signals coming out of the arbiter. In this work, we have designed an area-efficient programmable arbiter (switch and layer) to transfer the data. Figure 3 shows the time space for three processes (Request Monitor, Data transfer) executing at the boundaries and between the clock edges.
Proposed architecture for 3-D network-on-chip.
The proposed arbitration logic is able to arbitrate and transfer data between layers of the Network-on-Chip in a single clock cycle. The proposed area-efficient programmable switch arbiter has two levels of arbitration. The first level is at the switch port (at each IP) and is therefore called the Switch Arbiter (SA). The second level is the Layer Arbiter (LA), which is placed at the middle layer and connected to each Switch Arbiter of the column in the NoC, as shown in Figure 4. The switch arbiter consists of a Routing Computation (RC) unit, a set of virtual channels, and bypass logic. The switch arbiter has five possible ports that can transfer data in the vertical direction using the inter-layer communication mechanism. All five input channels are directly connected to the switch arbiter, which schedules data from the five sides for transfer to the other layers. A flit at a particular input channel is examined by extracting the routing information from the head flit. If the flit is destined for another layer, it is injected directly into the switch arbiter. This strategy avoids a delay of one clock cycle at the crossbar. The layer arbiter manages data coming from three switch arbiters and transfers the data to the destination layer.
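The bypass decision described above can be sketched as a simple check on the head flit. This is an illustrative model only: the `HeadFlit` fields and the function name `select_path` are hypothetical, since the paper does not specify the header layout.

```python
from dataclasses import dataclass

@dataclass
class HeadFlit:
    # Hypothetical header fields; the actual header format is not given in the text.
    dest_x: int
    dest_y: int
    dest_layer: int

def select_path(flit: HeadFlit, current_layer: int) -> str:
    """Route inter-layer flits straight to the switch arbiter, bypassing the crossbar."""
    if flit.dest_layer != current_layer:
        return "switch_arbiter"  # vertical traffic: skip the crossbar cycle
    return "crossbar"            # same-layer traffic: normal path through the switch

# A flit on layer 0 headed for layer 2 bypasses the crossbar.
print(select_path(HeadFlit(dest_x=1, dest_y=3, dest_layer=2), current_layer=0))
```

Only the layer field is inspected for the bypass decision; routing within the destination layer is left to the ordinary routing computation.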
Arbitration logic
The proposed dynamic arbiter is based on round-robin scheduling and allocates timeslots based on the number of active requests. To identify the priority of an active request, we use a programmable priority encoder (PPE). The main components of the dTDMA arbiter are a decoder, an adder, registers, and the PPE. Initially, a reset is applied so that the registers produce 000. This initial output is fed to the DATA input of the decoder. The enable signal of the decoder automatically becomes 1 whenever there is at least one request. Since no grant has been generated previously, the adder forwards 000 to the PPE, which means that the first side (North) is given the highest priority. The generated requests are also provided to the PPE, which determines which request will be served based on the priority value at input C. Input C indicates which side (port) currently has the highest priority. Table 2 shows the values of C and MASK. We generate two sets of requests for the programmable priority. The first set, called 'lower', is derived from the mask and the original requests (R). The second set, called 'upper', is derived from the inverted mask and the requests (R). Our design uses an eight-bit request vector although only six requests are used. The unused inputs are connected to ground, so these requests are permanently low. Grounding the unused requests prevents unnecessary grant generation, because the system generates 000 when no request is present. The lower and upper inputs are then forwarded to two priority encoders, which produce one encoded output each. If there is any request among the upper inputs, the output code of the upper priority encoder is chosen; otherwise the output of the lower priority encoder is chosen. The use of a multiplexer enables us to change the priority externally.
We can change the pre-assigned priority sequence by enabling the multiplexer's external control input. The generated grant signal is forwarded to the adder, which adds it to the code of the request currently being served. The output of the adder is forwarded to the C input of the PPE. This process rotates the priority dynamically in a round-robin fashion.
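The mask-based selection and the priority rotation can be modelled in a few lines. The sketch below is a behavioural approximation under stated assumptions, not the circuit itself: the mask is assumed to be a thermometer code with ones below position C (the values of Table 2 are not reproduced here), bit 0 is assumed to have the highest fixed priority, and the adder is assumed to set the next C to the granted index plus one, modulo the vector width. All function names are hypothetical.

```python
from typing import Optional, Tuple

WIDTH = 8  # eight-bit request vector; unused request bits are tied low

def fixed_priority_encode(bits: int) -> int:
    """Index of the lowest set bit (bit 0 has the highest priority). Requires bits != 0."""
    return (bits & -bits).bit_length() - 1

def ppe(requests: int, c: int) -> Optional[int]:
    """Programmable priority encoder: requests at or above position c win first."""
    mask = (1 << c) - 1                             # assumed mask: ones below C
    lower = requests & mask                         # mask AND R
    upper = requests & ~mask & ((1 << WIDTH) - 1)   # inverted mask AND R
    if upper:
        return fixed_priority_encode(upper)
    if lower:
        return fixed_priority_encode(lower)
    return None                                     # no request pending

def arbitrate(requests: int, c: int) -> Tuple[Optional[int], int]:
    """Return (granted index, next value of C); C is unchanged when nothing is granted."""
    grant = ppe(requests, c)
    if grant is None:
        return None, c
    return grant, (grant + 1) % WIDTH               # assumed rotation rule

# Requests pending on bits 0, 1, and 4; priority starts at 0 after reset.
requests, c, grants = 0b010011, 0, []
for _ in range(4):
    grant, c = arbitrate(requests, c)
    grants.append(grant)
print(grants)
```

Starting from C = 0, the three pending requests are served in turn and the priority then wraps around to the first requester again, which is the round-robin behaviour the adder feedback is meant to provide.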
Simulation setup and experimental results
The switch arbiter can send and receive a flit on each channel (North, South, East, West, and IP) in a single cycle. The flit size is 128 bits, and the physical channel is also capable of transmitting data over 128-bit lines.
In every clock cycle, the router can transmit 640 bits to and receive 640 bits from neighboring routers, compared with 512 bits for the Hybrid SoC interconnect [14]. We have applied the Balanced Dimension Order Routing algorithm, so the traffic load is distributed uniformly. It is unlikely that all channels (North, South, East, West, and IP) will transmit data to neighboring layers at the same time. Therefore, the width of the link between the switch and the layer arbiter is fixed at 128 bits. The average message latency at the switch arbiter can be written as follows:
The average latency of the arbiter is derived using the parameters shown in Table 3 and Table 4. The latency D_LA at the layer arbiter can also be calculated using Equation 1 by reducing the number of channels to three. The average latency D for a message in the network can be calculated as follows: 
We have developed a Register Transfer Level (RTL) simulation model to evaluate the performance of the proposed switch arbiter. The area-efficient programmable switch arbiter is implemented in Verilog RTL using 0.18 micron TSMC libraries. A 640-bit message with a uniform random destination pattern was injected per IP per machine clock cycle. We obtained a maximum clock frequency of 2.09 GHz for 96 (4 x 8 x 3) IP cores connected in a mesh topology, which is higher than that of the existing design [14]. The worst-case latency of the proposed design is 35 clock cycles, compared with 200 clock cycles for [14], as shown in Figure 6.
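The reported figures can be combined into a quick back-of-the-envelope check. The sketch below only restates arithmetic on numbers given in the text; the derived peak rate and the latency in nanoseconds are not results reported in this paper, and they assume that all five channels carry a flit in every cycle at the maximum clock frequency.

```python
CHANNELS = 5            # North, South, East, West, IP
FLIT_BITS = 128         # flit and physical channel width
F_MAX_HZ = 2.09e9       # reported maximum clock frequency
WORST_CASE_CYCLES = 35  # reported worst-case latency in clock cycles

# Bits moved per router per cycle when every channel is active.
bits_per_cycle = CHANNELS * FLIT_BITS

# Derived upper bound on raw per-router transfer rate (an assumption-laden peak figure).
peak_bits_per_second = bits_per_cycle * F_MAX_HZ

# Worst-case latency expressed in time at the maximum clock frequency.
worst_case_latency_ns = WORST_CASE_CYCLES / F_MAX_HZ * 1e9

print(bits_per_cycle, peak_bits_per_second, round(worst_case_latency_ns, 2))
```

The 640-bit figure matches the per-cycle message size used in the simulation; the other two values are rough upper and lower bounds rather than measured quantities.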
Schematic and layout generation
Synthesis has been performed using the LeonardoSpectrum tool, which generates the netlist file. The schematic has been generated using ASIC technology instead of FPGA technology. We have used an automatic floor planner to place the cells. The layout has passed the Design Rule Check (DRC) using Mentor Graphics EDA tools (IC Station, Eldo, Calibre DRC). To verify the functionality after netlist generation, a post-layout simulation was conducted and verified to be correct, as shown in Figure 8. The area of the programmable arbiter is 896 µm². The layout obtained is shown in Figure 7. The power consumption of the proposed high-speed dynamic TDMA switch arbiter is shown in Table 5. The maximum power consumption was found to be 0.625 mW and the minimum 0.406 mW at VDD equal to 5 volts. The post-layout simulation was performed using the ModelSim simulator. As shown in Figure 8, the channels assert their request signals at different instants of time, and the switch arbiter sends a grant signal back to the channel for data transmission. The switch is able to transmit the data to the layer arbiter in one clock cycle if only one request is present, and in five cycles in the worst case. The average message latency of the switch arbiter is three clock cycles. The request for data transmission from the layer is forwarded to the layer arbiter, which delivers the data to the destination switch arbiter.
Power consumption analysis.

