With the shrinking technology, reduced scale and power-hungry chip IO 
limited to topology, channel width, buffer size, floorplan, routing, switching, scheduling, and IP mapping [2]. Additionally, [3] lists the research issues as application modeling and optimization, NOC communication architecture analysis and optimization, NOC communication architecture evaluation, and NOC design validation and synthesis.
The most important metrics for NOCs are application runtime, silicon area, power consumption, and latency. All of these are to be minimized, and usually an appropriate trade-off is sought [4]. The required silicon area is the most commonly reported value (77%), followed by latency (55%) and maximum operating frequency (50%). The other metrics are reported less often [1].
In this regard, the current work addresses the optimization of buffers in the router design so as to achieve lower silicon area, lower power, and higher operating frequency.
The input block in the design of the router consists of six major components: the packet array, the linked list array, the destination head array, the destination tail array, the free-list FIFO, and a shift register, see Fig. 3.1. Four of these six components are conventional memory elements. In a standard cell based design, memory elements are realized using D flip flops from the standard SYNOPSYS library. If we consider a NAND gate implementation of a D flip flop with no RESET or SET inputs, we require 28 MOS transistors to realize one D flip flop, see Fig. 1.1. A more area efficient implementation of memory is through the use of SRAM cells. Each SRAM cell is implemented using 6 transistors, see Fig. 1.2. Therefore, memory realization using SRAM is more efficient compared to D flip flops. However, the standard cell based approach to ASIC design does not provide SRAM standard cells because of the many possible configurations of width and depth. SRAM design is instead carried out using the full custom approach to ASIC design. By combining standard cell based and full custom ASIC design, D flip flops can be replaced by SRAM, improving the area efficiency of the input block (fewer transistors mean less area for an equal amount of memory).
The queues are maintained in a packet array. Queue size is determined dynamically, depending on the arrival pattern of the data. If more data is destined for output port "m", then more buffer space, and hence a longer queue, is maintained for data packets to be routed to output port "m", subject to the maximum space available in the packet array (Fig. 1.3).
Fig. 1.3 Each input port has its own buffer
An alternative design is based on using a common packet array for all the input ports. For example, if the crossbar switch consists of four input ports, then the original design calls for four packet arrays. The proposed design would utilize one common packet array for all four input ports (Fig. 1.4).
Fig. 1.4 Common buffer shared between all input ports
It is intended to study and quantify the behavior of the single packet array design in relation to the multiple packet array design. Intuitively, a common packet buffer would result in better utilization of the available buffer space. This in turn would translate into lower delays in transmission.
Routers direct data over several links (hops), see Fig. 2.1. Each router further consists of a scheduler, a buffer to store the incoming data packets, and the crossbar. Topology defines the logical layout (connections) of the routers, whereas the floorplan defines the physical layout. The function of a network interface (adapter) is to decouple computation (the resources) from communication (the network). Routing decides the path taken from the source to the destination, whereas switching and flow control policies define the timing of transfers [2] [5]. Task scheduling refers to the order in which the application tasks are executed, and task mapping defines which processing element (PE) executes a certain task. IP mapping, on the other hand, defines how PEs and other resources are connected to the NoC.
Introduction to
For illustrative purposes, Fig. 2.1 shows an example SOC with a NOC and nine heterogeneous IP blocks that are CPUs, memories, input/output devices, and HW accelerators.
Router Structure
NOC architectures are based on packet-switched networks, see Fig. 2.2(a). This has led to new and efficient principles for the design of routers for NOC [6]. Assume that a router for the mesh topology has four inputs and four outputs from/to other routers, and another input and output from/to the Network Interface (NI). Routers can implement various functionalities, from simple switching to intelligent routing. Since embedded systems are constrained in area and power consumption, but still need high data rates, routers must be designed with hardware usage in mind. For circuit-switched networks, routers may be designed with no queuing (buffering). For packet-switched networks, some amount of buffering is needed to support bursty data transfers. Such data originate in multimedia applications such as video streaming. Buffers can be provided at the input, at the output, or at both input and output [7].
Various designs and implementations of router architectures based on different routing strategies have been proposed in the literature. Wolkotte et al. proposed a circuit switched router architecture for NOC [8], while Dally and Towles proposed a packet switched router architecture [9]. Albenes and Frederico provided a wormhole-based packet forwarding design for a NOC switch [10].
Fig 2.2 (a) NOC Router (b) Router Components
In this paper, the buffers in the router design are based on the principle of virtual output queuing, since it is simple and reduces the risk of Head of Line Blocking [11] [12] [13].
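As an illustration of virtual output queuing, the following minimal Python sketch models the storage components of the input block named earlier (packet array, linked list array, destination head and tail arrays, free-list FIFO) as per-destination linked lists that share one packet array. It is a behavioral sketch under stated assumptions, not the RTL of the router: the class name, method names, and the default sizes (32 slots, 8 destinations) are hypothetical choices made for the example.

```python
from collections import deque


class InputBlock:
    """Per-destination FIFO queues (virtual output queues) in one shared packet array."""

    def __init__(self, slots=32, dests=8):  # sizes are illustrative assumptions
        self.packet = [None] * slots      # packet array: payload storage
        self.next_slot = [None] * slots   # linked list array: next slot of the same queue
        self.head = [None] * dests        # destination head array: first slot per queue
        self.tail = [None] * dests        # destination tail array: last slot per queue
        self.free = deque(range(slots))   # free-list FIFO: unused slots

    def enqueue(self, dest, payload):
        """Store a packet for output `dest`; return False if the packet array is full."""
        if not self.free:
            return False
        slot = self.free.popleft()
        self.packet[slot] = payload
        self.next_slot[slot] = None
        if self.head[dest] is None:
            self.head[dest] = slot                    # queue was empty
        else:
            self.next_slot[self.tail[dest]] = slot    # append behind the current tail
        self.tail[dest] = slot
        return True

    def dequeue(self, dest):
        """Remove and return the oldest packet for output `dest`, or None if empty."""
        slot = self.head[dest]
        if slot is None:
            return None
        payload = self.packet[slot]
        self.head[dest] = self.next_slot[slot]
        if self.head[dest] is None:
            self.tail[dest] = None
        self.packet[slot] = None
        self.free.append(slot)                        # recycle the slot
        return payload


blk = InputBlock(slots=4, dests=2)
blk.enqueue(0, "a")
blk.enqueue(1, "b")
blk.enqueue(0, "c")
print(blk.dequeue(1))  # the packet for output 1 is not blocked behind "a"
```

Because every destination has its own head and tail pointer, a packet waiting for a busy output does not block packets bound for other outputs, and each queue can grow until the shared packet array is full.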
In this paper, the scheduling policy embodied in the router is based on the iterative SLIP (iSLIP) algorithm. iSLIP uses round-robin arbitration to choose one port among those contending. This permits a simpler hardware implementation and also makes iSLIP faster. iSLIP achieves close to maximal matches after just one or two iterations. It achieves 100% throughput under uniform traffic, and the round-robin policy ensures fairness among contenders.
Even though its behavior may be unstable under bursty traffic, iSLIP is commonly implemented in commercial switches due to its simplicity [14]. This algorithm becomes more silicon area efficient if it is implemented with its folding concept [15].
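The request, grant, and accept phases of iSLIP can be sketched in a few lines of Python. This is a behavioral sketch for illustration only, assuming the commonly described form of the algorithm in which the round-robin pointers advance only for grants accepted in the first iteration; the function name and data layout are hypothetical, and the folded hardware implementation of [15] is not modeled.

```python
def islip(requests, grant_ptr, accept_ptr, iterations=1):
    """One scheduling slot of iSLIP.

    requests[i][j] is True when input i holds a packet for output j.
    grant_ptr[j] and accept_ptr[i] are round-robin pointers, updated in place.
    Returns a dict mapping each matched input to its output.
    """
    n = len(requests)
    match = {}             # input -> output
    matched_out = set()
    for it in range(iterations):
        # Grant phase: each unmatched output grants the first requesting,
        # unmatched input at or after its grant pointer.
        grants = {}        # input -> outputs that granted to it
        for j in range(n):
            if j in matched_out:
                continue
            for k in range(n):
                i = (grant_ptr[j] + k) % n
                if i not in match and requests[i][j]:
                    grants.setdefault(i, []).append(j)
                    break
        if not grants:
            break
        # Accept phase: each input accepts the first granting output at or
        # after its accept pointer.
        for i, outs in grants.items():
            for k in range(n):
                j = (accept_ptr[i] + k) % n
                if j in outs:
                    match[i] = j
                    matched_out.add(j)
                    if it == 0:  # pointers move only on first-iteration accepts
                        grant_ptr[j] = (i + 1) % n
                        accept_ptr[i] = (j + 1) % n
                    break
    return match


n = 4
req = [[True] * n for _ in range(n)]      # every input has traffic for every output
g, a = [0] * n, [0] * n
for slot in range(4):
    print(slot, islip(req, g, a))
```

With all pointers starting at zero and every queue occupied, the single-iteration match grows from one pair in the first slot to four pairs in the fourth, as the grant pointers drift apart. This desynchronization is the mechanism behind the high throughput under uniform traffic mentioned above.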
Implementation of input module with D FF based array vs CUSTOM SRAM.
The traditional design of the input module consists of register arrays based on D FFs [16], see Fig. 3.1. The same function is intended to be achieved by replacing the register arrays with full custom SRAMs, see Fig. 3.2.
The input block is synthesized using the Synopsys 90 nm EDK standard cell library. Memory elements are synthesized as D flip flops from the standard cell library. In order to save silicon area, the D flip flop memory is replaced by custom built SRAM. The SRAM macros provided by Synopsys in the 90 nm process are available in sizes (width x depth) of 8x16, 8x32, and 128x64.
The input block, however, requires a packet array memory of size 72x32. This is realized using an SRAM of size 128x64. The free-list FIFO requires a 6x32 memory. This is implemented using an SRAM of size 8x32. The destination head and destination tail arrays are of size 6x8. These are replaced by SRAMs of size 8x16. The SRAM macros are therefore underutilized. Nevertheless, because the SRAM implementation is more compact, the overall design area is greatly reduced, as shown by the results in the following section.
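The mapping of required memories onto the available macros, and the resulting underutilization, can be checked with a short Python sketch. The sizes are taken from the text; the selection rule (smallest macro that covers both width and depth) and the function names are assumptions made for this example, and the sketch counts storage bits only, not layout area.

```python
# Available SRAM macro sizes (width, depth), as listed in the text.
MACROS = [(8, 16), (8, 32), (128, 64)]

# Memories required by the input block (width, depth), as listed in the text.
REQUIRED = {
    "packet array": (72, 32),
    "free-list FIFO": (6, 32),
    "destination head array": (6, 8),
    "destination tail array": (6, 8),
}


def pick_macro(width, depth, macros=MACROS):
    """Return the smallest macro (by bit count) that covers width and depth."""
    fits = [m for m in macros if m[0] >= width and m[1] >= depth]
    if not fits:
        raise ValueError(f"no macro fits {width}x{depth}")
    return min(fits, key=lambda m: m[0] * m[1])


def utilization(width, depth, macro):
    """Fraction of the macro's bits actually used by the required memory."""
    return (width * depth) / (macro[0] * macro[1])


for name, (w, d) in REQUIRED.items():
    m = pick_macro(w, d)
    print(f"{name}: {w}x{d} -> {m[0]}x{m[1]}, {utilization(w, d, m):.1%} of bits used")
```

Under this rule the sketch selects the same macros as the text: 128x64 for the packet array, 8x32 for the free-list FIFO, and 8x16 for the head and tail arrays.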
Fig.3.1 Input module with D FF based register arrays [17]
Custom SRAM IP is generated by full custom mask layout. Synopsys provides the SRAM macros in LEF format. In order to integrate the SRAM into the input block, Synopsys Milkyway is used to process this LEF file and create the CEL, FRAM, and LM views. These views combine to produce a reference library for IC Compiler.
The layout of the input block using SRAM requires the CEL, FRAM, and LM views of the 90 nm standard library as well as of the SRAM macro. The design file containing the gate level netlist of the input block is a DDC file. This file is generated by Design Compiler from the RTL code by applying appropriate constraints. The input and output delays are derived from the clock using a time budget of 40%.
The layout is generated from the gate level netlist and the reference libraries. Additionally, the TLU+ capacitance models are required by IC Compiler.
Steps on IC Compiler [18]
The following steps are executed on IC Compiler to derive the layout:
1. A new Milkyway library database is created for the input block.
2. The initial floorplan is created, where the start row and row orientation are specified. Additionally, the spacing between the IC core and the input/output pad ring is specified.
3. Power grid is initialized for VDD and VSS.
4. Rectangular rings are created for VDD and VSS. For each segment, the metal layer, segment width, and segment offset are specified.
5. Floorplan placement is executed. All cells and IP cores are placed within the IC core boundary.
6. Power straps are generated for VDD and VSS. The direction, width, and metal layer of the power straps are specified.
7. Standard cells are prerouted for horizontal connection.
8. Clock tree is synthesized.
9. Detailed routing is performed.
The layout is ready to be streamed out in GDSII format.
Single Buffer of size 128 packets versus 4 buffers of size 32 packets:
The block diagram of the Simulink model is given below.
Fig. 3.3 Simulink model for single buffer vs 4-buffers.
The first block is labeled "Exponential Distribution1." This block specifies the packet arrival time; the packet arrival pattern is an exponential distribution. The block labeled "Packet Source1" generates packet events. The "Set Attribute1" block combines the effects of "Exponential Distribution1" with "Packet Source1" to generate packet entities at time intervals specified by the exponential distribution mentioned above. The arrival of a packet event at the "Start Timer2" block causes the simulation timer to start. Generated packets are stored in a common 128 packet buffer designated "Common Buffer." The packets leave the common buffer when they are serviced by the scheduler vector. The scheduler vector is generated for four input ports by "4x Scheduler Vector." The total number of packets served is recorded by the "Number Served1" block. Whenever a packet leaves the buffer, the departure time is recorded by the "Read Timer." The packet exits the simulation flow through the "Entity Sink1" block. The average time spent by the packet in the buffer is captured by the "Average Delay1" block.
The behavior of the dedicated 32 packet buffer model differs only in two components, "Output Switch" and "Path Combiner." The "Output Switch" block demultiplexes the generated packets into their respective input port packet buffers. The "Path Combiner" aggregates the output stream to help calculate the total number of packets served and the average time spent by a packet waiting for service.
The simulation was run for 50000 packets. The packet generation rates for both models are identical, using the exponential distribution for inter-arrival times. Multiple simulation runs were performed to verify the average delays observed.
RESULTS:
Table 1 The results of the SIMULINK model for distributed and common buffer [table body missing in the source; only the label "Clock period" survives]
The input block is optimized for area and power by incorporating custom SRAM blocks into the design. The custom SRAM blocks replace the DFF (D flip flop) memory implementations. The amount of hardware required to store one bit of information using an SRAM cell (6 transistors) is much less than that required to store one bit using a DFF (28 transistors).
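The per-bit comparison can be made concrete with a short calculation. The Python sketch below counts storage-cell transistors only, using the 28 and 6 transistors per bit quoted above and the 72x32 packet array as the example size; it deliberately ignores SRAM periphery (decoders, sense amplifiers), unused bits in oversized macros, and layout density, so it is an assumption-laden estimate rather than an area figure.

```python
DFF_TRANSISTORS_PER_BIT = 28   # NAND-based D flip flop without SET/RESET (from the text)
SRAM_TRANSISTORS_PER_BIT = 6   # 6-transistor SRAM cell (from the text)


def storage_transistors(bits, per_bit):
    """Transistors needed for the storage cells alone."""
    return bits * per_bit


bits = 72 * 32  # packet array, width x depth
dff = storage_transistors(bits, DFF_TRANSISTORS_PER_BIT)
sram = storage_transistors(bits, SRAM_TRANSISTORS_PER_BIT)
ratio = sram / dff
print(f"DFF: {dff}, SRAM: {sram}, ratio: {ratio:.3f}")
```

For equal bit counts the ratio is 6/28, or roughly 0.21, independent of the memory size.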
The use of SRAM macros in place of standard cell D flip flops has resulted in an area reduction and a corresponding reduction in power consumption. The improved design occupies approximately 30% of the area of the original design. This is consistent with the ratio of the area of an SRAM cell to the area of a D flip flop, which is approximately 6:28. The power consumption is almost halved to 1.5 mW. The maximum operating frequency is improved from 50 MHz to 200 MHz.
The utilization efficiency of the packet buffer array improves when a common buffer is used instead of individual buffers in each input port. This is manifested in the form of lower delay in transferring a packet from the input to the output. The delay is improved by approximately 40% through the use of a common buffer.
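The qualitative effect of sharing one buffer can be reproduced with a small queueing sketch in Python. This is not the Simulink model: the service discipline (exponential service times with mean 1, four servers draining the common buffer versus one server per dedicated buffer), the arrival rate of 3.2 packets per time unit, the uniform choice of port, and all function names are assumptions made for the example, so the numbers it prints are not expected to match the figure of approximately 40% reported above.

```python
import heapq
import random


def make_arrivals(n, rate, ports=4, seed=0):
    """Generate n (time, port) pairs with exponential inter-arrival times."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)
        out.append((t, rng.randrange(ports)))
    return out


def mean_delay(arrivals, n_queues, servers_per_queue, capacity, seed=1):
    """Mean time from arrival to departure for accepted packets.

    Each queue is FIFO, holds at most `capacity` packets (including those in
    service), and is drained by `servers_per_queue` servers.
    """
    rng = random.Random(seed)
    free_at = [[0.0] * servers_per_queue for _ in range(n_queues)]  # server-free times (heaps)
    in_buffer = [[] for _ in range(n_queues)]  # departure times of stored packets (heaps)
    total, served = 0.0, 0
    for t, port in arrivals:
        q = port % n_queues
        buf = in_buffer[q]
        while buf and buf[0] <= t:           # release packets that have already left
            heapq.heappop(buf)
        if len(buf) >= capacity:
            continue                         # buffer full: packet is dropped
        start = max(t, heapq.heappop(free_at[q]))  # wait for the earliest free server
        depart = start + rng.expovariate(1.0)
        heapq.heappush(free_at[q], depart)
        heapq.heappush(buf, depart)
        total += depart - t
        served += 1
    return total / served


arrivals = make_arrivals(50000, rate=3.2)
common = mean_delay(arrivals, n_queues=1, servers_per_queue=4, capacity=128)
dedicated = mean_delay(arrivals, n_queues=4, servers_per_queue=1, capacity=32)
print(f"common buffer: {common:.2f}, dedicated buffers: {dedicated:.2f}")
```

In this sketch the common buffer gives the lower mean delay, because a packet in the shared queue can use whichever server frees up first, whereas a packet in a dedicated queue waits even when other ports are idle.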
