Present-day ASIC technology supports Network-on-Chip designs with as many as 100 million gates on a single chip. The latest FPGAs can support only about 10 million gates to accommodate all logic and the associated routing. To implement a competitive Network-on-Chip (NoC) architecture on FPGAs, the area occupied by the network must be kept to a minimum. This ensures that the maximum area can be used by the logic while maintaining the performance of the router network. Reducing the area also reduces the power consumption. In this paper we report an implementation of a parallel router that can support five simultaneous routing requests. We introduce optimizations in the XY routing and decoding logic, thereby gaining in area and performance. The header overhead is 8 bits per packet, and the packet size can vary between 16 and 128 bits.
INTRODUCTION
ASICs are increasingly being replaced by FPGAs for applications with low to medium volume, because of the longer design cycles and higher cost of ASICs. Modern FPGAs with embedded processors can be used for System-on-Chip (SoC) designs. An SoC is a complex interconnection of various functional elements. The interconnection of cores in an SoC in the gigabit era presents a communication bottleneck [11]. Existing bus-based interconnect architectures do not offer a scalable solution to this problem. Network-on-Chip (NoC) has been proposed as a new design paradigm to solve the communication bottleneck [3]. The basic idea is to interconnect the various IP cores using an on-chip network instead of the traditional shared bus. The interconnection is achieved by means of routers, and information is transferred using a packet switching mechanism. NoCs offer a host of advantages, such as IP reuse and a scalable, modular architecture.
Area is at a premium on an FPGA, and the communication network should therefore be as small as possible. Hence the router, which is the central component of any NoC, must also be small. In this paper we present a lightweight router for a NoC implemented on FPGAs, which can support five parallel connections simultaneously. The router uses store and forward flow control and deterministic XY routing. Reducing the size of the finite state machine (FSM) for XY routing and performing a simple logical OR of the Select/Gnt lines significantly reduces the number of slices [5]. The area savings have a significant impact on performance and power consumption.
We describe an on-chip interconnection network in terms of its interconnection topology, switching mechanism, routing, flow control, queuing (buffering) and scheduling. Network topology refers to the arrangement and type of interconnection of the nodes. Network topologies include mesh, torus, hypercube and fat-tree [6]. Switching refers to the mechanism of moving data from the source to the destination node. The two extremes are circuit switching and packet switching [2] [3]. In circuit switching, channel reservations are made along the entire path. In NoCs, packets are transferred between cooperating routers, with no channel reservation along the path and with independent routing decisions. Flow control deals with the allocation of channels and buffers to a message as it travels from source to destination. The popular flow control mechanisms include store and forward, wormhole and virtual channel [9].
The simplest flow control mechanism that can be used is store and forward, whose latency is proportional to the packet size [6] [1]. Wormhole routing demands increased decoding logic in each router. Virtual channel routing is very costly in terms of the number of buffers and the associated decoding and arbitration logic, so it cannot be used in a lightweight router. Store and forward flow control is used in the router presented in this paper.
ROUTER ARCHITECTURE
The router has a set of ports, namely Local, East, West, North and South, through which it communicates with the local logic element and the neighboring routers. It receives incoming packets and forwards them to the appropriate ports. Buffers are present at the ports to store the packets temporarily. Control logic takes the routing and arbitration decisions.
In this work we design a lightweight parallel router. The motivation is to reduce the area, which also reduces the power consumed. We choose one of the popular buffering methods, store and forward. The motivation behind choosing this scheme is to have the simplest possible decoding logic, thereby reducing both area and power. Connections are established automatically, without any complex decoding logic. The router switches, with their sets of inter-communicating ports, define the physical layer of the NoC system. There are two types of ports for establishing communication, namely input and output ports.
IMPLEMENTATION OF THE ROUTER
The router has three main blocks, namely the input channel, the crosspoint matrix and the output channel.
Algorithm
Step 1: The router consists of an input channel, an output channel and a crossbar switch (the crosspoint matrix).
Step 2: The router is instantiated and operated using the acknowledgement signals provided by the user logic.
Step 3: The input channel is designed around an FSM controller, and XY routing is used as the routing algorithm.
Step 4: The output channel is designed using a synchronous FIFO and an arbiter.
Step 5: The crossbar switch forwards the packets from the input channel to the selected output channel.
Step 6: The whole architecture is driven by the clock input and the request inputs.
Input Channel
There is one input channel at each port, each running its own control logic. Each input channel has a FIFO of depth 16 and data width 8 bits, and control logic implemented as an FSM. The input channel accepts requests from the neighboring routers. On receiving a request, if it is free, it acknowledges the request. The first flit is the header and the following flits constitute the data. The input channel accepts data as long as the request signal is held high. The previous router's output channel ensures that the request line is held high until it has emptied the packet of data that is being accepted by the input channel. The input channel keeps the acknowledgement line high as long as a transfer is taking place (indicated by the request line).
When the transfer is complete, the request and acknowledgement lines go low in sequence. The packet received from the previous router is stored locally in the FIFO, thereby implementing store and forward dataflow. Next, the control logic reads the header of the packet, decides which output channel the packet must leave the router through, and sends a request to that output channel. Note that each input channel runs an independent FSM, so up to five parallel connections can be initiated at the same time. Once the input channel gets a grant from the requested output channel, the control bits of the crosspoint matrix are set appropriately by the granting output channel.
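The input channel behavior described above can be sketched as a small cycle-level model. This is only an illustrative sketch, not the authors' Verilog FSM: the state names, the one-cycle handshake before the first flit is captured, and the `route` callback (which maps a header flit to an output port name) are all assumptions made for the example.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    # State names are hypothetical; the paper does not name the FSM states.
    IDLE = auto()      # FIFO empty, waiting for a neighbor's request
    RECEIVE = auto()   # acknowledging and storing incoming flits
    REQUEST = auto()   # whole packet stored, requesting an output channel
    FORWARD = auto()   # grant received, draining the FIFO


class InputChannel:
    def __init__(self, route, depth=16):
        self.route = route        # assumed callback: header flit -> port name
        self.depth = depth        # FIFO depth (16 in the paper)
        self.fifo = deque()
        self.state = State.IDLE
        self.ack = False
        self.target = None        # output channel currently requested

    def step(self, req=False, flit=None, grant=False):
        """Advance one clock cycle; return the flit sent onward, if any."""
        if self.state is State.IDLE:
            if req:                       # channel is free: acknowledge
                self.ack = True
                self.state = State.RECEIVE
        elif self.state is State.RECEIVE:
            if req:                       # store flits while request is high
                if len(self.fifo) >= self.depth:
                    raise OverflowError("packet longer than the FIFO depth")
                self.fifo.append(flit)
            else:                         # request dropped: transfer complete
                self.ack = False
                if self.fifo:
                    self.target = self.route(self.fifo[0])  # read the header
                    self.state = State.REQUEST
                else:
                    self.state = State.IDLE
        elif self.state is State.REQUEST:
            if grant:                     # output channel granted the request
                self.state = State.FORWARD
        elif self.state is State.FORWARD:
            out = self.fifo.popleft()
            if not self.fifo:             # packet fully forwarded
                self.target = None
                self.state = State.IDLE
            return out
        return None
```

Because the whole packet sits in the FIFO before the header is examined, the model makes the store and forward property explicit: no flit leaves the channel until the request line has gone low.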
Crosspoint Matrix
The crosspoint matrix is a set of multiplexers and demultiplexers whose interconnection allows all possible connections between the five input channels and the five output channels.
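As a rough behavioral sketch, the crosspoint matrix can be modeled as one multiplexer select per output channel. The class and method names below are hypothetical, and the rule that an input feeds at most one output at a time is an assumption of this example rather than something stated in the paper.

```python
PORTS = ("local", "east", "west", "north", "south")


class CrosspointMatrix:
    """One multiplexer per output channel, selecting among the input channels."""

    def __init__(self):
        # select[out] is the input currently routed to `out`, or None if idle.
        self.select = {out: None for out in PORTS}

    def connect(self, out, inp):
        """Set the control bits; done by the granting output channel."""
        if inp not in PORTS:
            raise ValueError(f"unknown input port {inp!r}")
        if self.select[out] is not None:
            raise RuntimeError(f"output {out} is busy")
        if inp in self.select.values():
            # Assumption: no multicast, one output per input at a time.
            raise RuntimeError(f"input {inp} is already connected")
        self.select[out] = inp

    def release(self, out):
        """Reset the control bits once the transfer is complete."""
        self.select[out] = None

    def transfer(self, flits):
        """Move one flit per active connection; `flits` maps input -> flit."""
        return {out: flits[inp] for out, inp in self.select.items()
                if inp is not None and inp in flits}
```

With five disjoint input/output pairs connected, a single call to `transfer` moves five flits, which is the sense in which the router supports five parallel connections.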
Output Channel
There is one output channel at each port, with an 8-bit FIFO of depth 16 and control logic that makes the arbitration decisions. The output channel receives requests from the different input channels, grants one of them and sets the control bit lines of the crosspoint matrix. It accepts flits into its FIFO as long as the sending input FIFO is not empty, which keeps the decoding logic simple. When the transfer is complete, the crosspoint matrix controls are reset. The FSM then initiates the process of sending the data to the neighboring router using a handshake mechanism. The empty status of its FIFO triggers the next inter-channel transfer.
XY ROUTING
At the input channel, once the packet has been stored in the FIFO, the X coordinate of the destination router (say Hx) is first compared with the locally stored X coordinate of the router (X) to decide on the horizontal displacement. If Hx > X the packet is forwarded to the East port of the router, and if Hx < X the packet goes out through the West port. If Hx equals X, the Y coordinate of the destination (Hy) is compared with the locally stored Y coordinate (Y) to decide on the vertical displacement. If Hy > Y the packet is forwarded to the North port, and if Hy < Y it is forwarded to the South port. When Hy equals Y the packet is at the destination router and is forwarded to the local port. In XY routing, a packet is therefore forwarded horizontally until the target column is reached and is then forwarded vertically to the destination router. This means that the North and South input ports never request the East or West output ports. We exploit this fact to simplify the FSMs of the East and West output channels, as they need not service the North and South input ports. This translates into a significant area saving and a reduction in the number of clock cycles needed to service requests. It helps in implementing a lightweight router with minimal area overhead and an acceptable level of performance.
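The routing decision above can be written out as a short function. The sketch below is a software illustration, not the router's Verilog FSM; the helper `trace` and the lookup tables are hypothetical names added so that the claim about North/South inputs never requesting East/West outputs can be checked by walking a packet across a mesh.

```python
from enum import Enum


class Port(Enum):
    LOCAL = "local"
    EAST = "east"
    WEST = "west"
    NORTH = "north"
    SOUTH = "south"


def xy_route(hx, hy, x, y):
    """Output port for a packet addressed to (hx, hy) at router (x, y)."""
    if hx > x:                 # horizontal displacement is resolved first
        return Port.EAST
    if hx < x:
        return Port.WEST
    if hy > y:                 # same column: resolve vertical displacement
        return Port.NORTH
    if hy < y:
        return Port.SOUTH
    return Port.LOCAL          # both coordinates match: deliver locally


# A packet leaving through one port arrives on the neighbor's opposite port.
OPPOSITE = {Port.EAST: Port.WEST, Port.WEST: Port.EAST,
            Port.NORTH: Port.SOUTH, Port.SOUTH: Port.NORTH}
STEP = {Port.EAST: (1, 0), Port.WEST: (-1, 0),
        Port.NORTH: (0, 1), Port.SOUTH: (0, -1)}


def trace(src, dst):
    """List the (input port, output port) pair used at each router on the path."""
    (x, y), in_port = src, Port.LOCAL
    hops = []
    while True:
        out = xy_route(dst[0], dst[1], x, y)
        hops.append((in_port, out))
        if out is Port.LOCAL:
            return hops
        dx, dy = STEP[out]
        x, y = x + dx, y + dy
        in_port = OPPOSITE[out]
```

For example, `trace((0, 0), (2, 1))` moves East twice and then North once, so the only vertical input port that appears (South, at the destination) requests the local output.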
PACKET DESCRIPTION
In this work, we have used the FIFO available in Xilinx LogiCORE [13]. The depth of the FIFO is 16. Since we use a store and forward scheme, it makes sense to have a larger buffer size. The packet specification in our router is very simple. The flit size is fixed at 8 bits. The first flit is always a header carrying the coordinates of the destination router. With 8 header bits (4 bits per coordinate), we can address at most a 16x16 NoC system. The flit size has to be increased if we want to build a bigger system. With the available FPGAs, building a 16x16 system is impractical: even if a single router takes only 0.2% of the total FPGA area, the 256 routers of the NoC system would occupy more than 50% of it, leaving less area available for the user logic.
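The area argument above is simple arithmetic, worked out below. The function name is hypothetical, the 0.2% per-router figure is the illustrative value quoted in the text, and the estimate deliberately ignores inter-router wiring, so it should be read as a lower bound.

```python
def noc_area_percent(rows, cols, router_percent=0.2):
    """Share of the FPGA (in percent) taken by the routers of a rows x cols mesh.

    Assumes every router costs `router_percent` of the device and ignores
    the inter-router links, so the real figure would be higher.
    """
    return rows * cols * router_percent


for size in (3, 4, 16):
    print(f"{size}x{size} mesh: {noc_area_percent(size, size):.1f}% of the FPGA")
```

A 16x16 mesh has 256 routers, so under this assumption the routers alone take about 51% of the device, whereas a 4x4 mesh takes only about 3%.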
In this work we fix the number of X and Y bits at 2 each. Hence the router can support at most a 4x4 NoC system. The remaining four bits are reserved for implementing a High Level Protocol (HLP). We are building an advanced router prototype that incorporates the HLP as part of our future work. There is no trailer flit, and hence the maximum data size is 120 bits per packet (for a FIFO of depth 16), which in practical terms suffices for communication between cores. If a bigger packet size is required, it can easily be obtained by increasing the depth of the FIFO buffer provided by Xilinx LogiCORE [13].
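To make the packet format concrete, the sketch below packs and unpacks a header flit and builds a packet. The paper gives only the field widths (2 bits X, 2 bits Y, 4 bits reserved for HLP); the bit positions chosen here (X in the top two bits, then Y, then HLP) and all function names are assumptions made for illustration.

```python
X_BITS, Y_BITS, HLP_BITS = 2, 2, 4   # field widths from the text
FLIT_BITS = 8
FIFO_DEPTH = 16                      # flits per packet, header included


def pack_header(x, y, hlp=0):
    """Build the 8-bit header flit. Field order is an assumed layout."""
    if not (0 <= x < 1 << X_BITS and 0 <= y < 1 << Y_BITS
            and 0 <= hlp < 1 << HLP_BITS):
        raise ValueError("field out of range")
    return (x << (Y_BITS + HLP_BITS)) | (y << HLP_BITS) | hlp


def unpack_header(header):
    """Return (x, y, hlp) from a header flit built by pack_header."""
    hlp = header & ((1 << HLP_BITS) - 1)
    y = (header >> HLP_BITS) & ((1 << Y_BITS) - 1)
    x = (header >> (Y_BITS + HLP_BITS)) & ((1 << X_BITS) - 1)
    return x, y, hlp


def build_packet(x, y, payload):
    """Header flit followed by 1 to 15 data flits (bytes); no trailer flit."""
    if not 1 <= len(payload) <= FIFO_DEPTH - 1:
        raise ValueError("payload must be 1 to 15 flits")
    return [pack_header(x, y)] + list(payload)


MAX_DATA_BITS = (FIFO_DEPTH - 1) * FLIT_BITS   # 15 data flits of 8 bits
```

With one header flit and between 1 and 15 data flits, the packet length ranges from 16 to 128 bits, and the payload is at most 120 bits, matching the figures given in the text.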
ROUND ROBIN ARBITER (RRA)
The RRA is implemented as an FSM at each output channel. It arbitrates and decides which input channel is given access to that output channel when several input channels request the same output. In general, the output channel must follow some priority-based arbitration. If a fixed priority scheme is followed, the same input channel gets access repeatedly. Hence, in our arbiter the priorities of the input ports are changed dynamically, taking the last input port serviced into account. The priorities rotate in clockwise fashion, i.e., if the last input port serviced was North, then for the next service the priority order is East, South, West, Local and North. It should be noted that no clock cycles are wasted in our scheme, as a grant is issued only if there is a request from the corresponding input channel.
As each input channel has its own XY routing FSM and each output channel has its own RRA FSM, there is no latency in establishing the connections. This allows five different requests to be granted simultaneously when they are for five different output channels, which significantly improves the performance of our router. Note that the router coordinates are stored in two registers inside each router, which can be accessed from the primary inputs. This makes it easy to reconfigure the router coordinates when the system changes, in contrast to hard-coded coordinates, where one has to re-synthesize with the new coordinates. We extend our work to build a 1x2 and a 3x3 mesh-type router network.
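The rotating priority scheme can be illustrated with a few lines of code. This is a behavioral sketch rather than the FSM used in the router: the class name is hypothetical, and the port serviced "last" at reset (Local here) is an assumption, since the text does not specify the initial priority.

```python
# Clockwise ring inferred from the example in the text:
# after North is serviced, the order is East, South, West, Local, North.
RING = ["north", "east", "south", "west", "local"]


class RoundRobinArbiter:
    def __init__(self, last="local"):
        # Assumed reset value; the paper does not state the initial priority.
        self.last = last

    def priority_order(self):
        """Input ports from highest to lowest priority for the next grant."""
        start = RING.index(self.last) + 1
        return [RING[(start + i) % len(RING)] for i in range(len(RING))]

    def grant(self, requests):
        """Grant the highest-priority requesting input, or None if no requests."""
        for port in self.priority_order():
            if port in requests:
                self.last = port      # the serviced port drops to lowest priority
                return port
        return None                   # no request: priorities stay unchanged
```

For instance, with North serviced last and both North and West requesting, West is granted first and North on the following arbitration, so neither input can starve the other.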
SIMULATION AND RESULTS
We use a Xilinx Spartan-3 board, which has an xcs400 FPGA, to functionally verify the standalone router and the NoC system. We use Xilinx 10.1 [13] to synthesize the system and ModelSim 6.3 [15] to simulate the model and generate the activity data of the place and route (PAR) model.
The router is implemented in Verilog HDL in a modular fashion. The data width and the FIFO depth are parameterizable. In this work the data width is fixed at 8 bits.
The flow control mechanism is handshake based, with minimal decoding logic. Both the input and output channels are buffered, so as to minimize blocking in the store and forward buffering scheme. We employ XY routing, and the FSM and decoding logic have been optimized accordingly. The arbitration scheme is dynamic, as the round-robin arbiter is implemented with a dynamic priority scheme.
CONCLUSION
We present a lightweight parallel router architecture for implementing a Network-on-Chip on FPGAs. We have implemented optimizations in the decoding logic and the FSMs, thereby saving significant area. We show a functional validation of the standalone router and of a 3x3 mesh network. In future we intend to build an advanced prototype that supports the HLP with low area overhead.
