This research article presents the simulation and FPGA synthesis of mesh, torus and ring Network on Chip (NoC) topologies. The network is based on the Multiprocessor System on Chip (MPSoC) structure for a network cluster of 256 nodes. The paper focuses on a comparative analysis of hardware design parameters, memory utilization and timing parameters such as minimum and maximum period and supported frequency. The interprocess communication among nodes is verified on a Virtex-5 FPGA with arbitration logic. The designs are developed in Xilinx ISE 14.2 and simulated in ModelSim 10.1b using the VHDL programming language. Network topological structures support on-chip intercommunication, routing, switching, flow control, queuing and scheduling, as well as communication among different networks.
Introduction
Today, designers face the problem of on-chip interconnects as the number of nodes increases. Systems built on a traditional bus face scalability problems and cannot fulfill the requirements of future SoCs in terms of power, timing parameters, hardware utilization, performance and predictability. To address the design productivity gap, cost and signal integrity of future SoCs, a scalable NoC structure helps to solve the on-chip communication problems. Chip multiprocessors (CMP) 2 and MPSoC 1,5 use a bus structure for on-chip communication and integrate nodes on a single die to meet the requirements of transistor density, higher throughput, lower delay, shorter time to market and operating frequency.
Internode communication in a multiprocessor system is based on either memory sharing or message passing. Message passing among nodes depends on APIs such as transmit() and receive(), and these APIs need a protocol to connect the nodes to each other. Many multiprocessor NoC systems use a shared memory architecture, in which data transfer takes place through memory access points. The MPSoC architecture is based on a processor memory hierarchy and a topological structure that supports interprocess communication in the network. A shared memory architecture provides high throughput because of the shared or cache memory between processors and the pipelined processing of data transactions. Lucent developed a single-chip multiprocessor called Daytona 6 to perform on-chip communication based on packet routing. In that work, eight Alpha processors were integrated on a single-chip multiprocessor.
A shared memory multiprocessor 7 consists of several nodes, processors or processing elements (PEs) that form an on-chip interconnected network. All PEs 4 have their own CPU and their own memory hierarchy, which may include one or two levels of cache memory. The multiprocessor system 3,4 has one large physical memory unit that is shared and accessed globally by the different processors. A data packet arrives at a particular node in response to a request from that node: the memory returns a reply packet containing the requested data to the requesting node. Reading data from the requested node and writing data to the destination node are performed through cache references. In an MPSoC system, the major problem is cache coherence 7, 8, because data held in different caches must be updated, otherwise one data item can exist in multiple inconsistent copies. The problem can be resolved by cache updating, which updates all node memories whenever new data is written to memory. A multiprocessor system can use different network topologies, each of which may be targeted at a specific application to enhance NoC performance and throughput. The MPSoC network structure can be a direct network or an indirect network. In the direct form, nodes are connected directly to each other through network links only, and arbitration and data flow are handled by each node. In the indirect form, data flows through an intermediate switch, and switching and routing are performed by the switch between the processors. Multistage network configurations are formed using indirect networks. Orthogonal topological structures 7, 10 are examples of direct topologies. The nodes in an orthogonal topology can form a mesh structure (k-ary n-dimensional, or k-ary n-cube) or a torus (k-ary n-dimensional).
Pipelined operations and parallel processing can be performed with mesh or torus 9, 10 structures because these structures provide easy connection and simple routing, and the interconnection length between nodes can be the same.
NoC Design Consideration
The design considerations for the mesh and torus structures (256 x 256) are shown in fig. 2 (a) and (b), in which 256 nodes can intercommunicate. Each node is identified by its assigned address: N0 (00000000), N1 (00000001), N2 (00000010), N3 (00000011), N4 (00000100), N5 (00000101), ..., N255 (11111111). Row and column addresses are also assigned for node identification based on row and column processing; the addresses have 8 bits because 2^8 = 256. The functionality of the mesh and torus NoC structures can be understood with the help of the table. For example, node 18 is identified by row address (00000001) and column address (00000010), and it can communicate with any other node in the NoC. The topological structure of the ring NoC for 256 nodes is shown in fig. 2 (c). The structure has 256 nodes arranged in a ring configuration. The functionality of the ring NoC can be understood with the help of table 2. All 256 nodes are numbered sequentially from N0 to N255, with 8-bit node addresses running from "00000000" to "11111111". Node N0 is assigned source_address "00000000" and node N1 has address "00000001". In the same way, every node is assigned its 8-bit address, and node N255 is assigned source_address "11111111". Moreover, the nodes use a priority mechanism to communicate in the multiprocessor system. The arrival of a data packet at the source node and its delivery to the destination node are handled by an arbiter, which assigns the priority for interconnection with the destination node in the mesh, torus and ring NoC.
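The addressing scheme described above can be illustrated with a short Python sketch. It is not the authors' VHDL design. It assumes the 256 nodes of the mesh and torus are laid out as a 16 x 16 grid in row-major order, which is consistent with node 18 having row address 00000001 and column address 00000010; the constant `GRID` and all function names are hypothetical.

```python
# Illustrative sketch only: assumes a 16 x 16 row-major layout for mesh/torus.
GRID = 16             # assumed grid side (16 * 16 = 256 nodes)
NODES = GRID * GRID   # 256 nodes, addressable with 8 bits (2**8 = 256)


def node_address(n):
    """Return the 8-bit binary address string of node n."""
    if not 0 <= n < NODES:
        raise ValueError("node index out of range")
    return format(n, "08b")


def row_col(n):
    """Return (row, column) of node n under the assumed row-major layout."""
    if not 0 <= n < NODES:
        raise ValueError("node index out of range")
    return divmod(n, GRID)


def mesh_neighbors(n):
    """Neighbors in a mesh: no wrap-around at the grid edges."""
    r, c = row_col(n)
    out = []
    if r > 0:
        out.append((r - 1) * GRID + c)
    if r < GRID - 1:
        out.append((r + 1) * GRID + c)
    if c > 0:
        out.append(r * GRID + c - 1)
    if c < GRID - 1:
        out.append(r * GRID + c + 1)
    return sorted(out)


def torus_neighbors(n):
    """Neighbors in a torus: edges wrap around in both dimensions."""
    r, c = row_col(n)
    return sorted({
        ((r - 1) % GRID) * GRID + c,
        ((r + 1) % GRID) * GRID + c,
        r * GRID + (c - 1) % GRID,
        r * GRID + (c + 1) % GRID,
    })


def ring_neighbors(n):
    """Neighbors in a ring: previous and next node, wrapping at the ends."""
    if not 0 <= n < NODES:
        raise ValueError("node index out of range")
    return sorted({(n - 1) % NODES, (n + 1) % NODES})


if __name__ == "__main__":
    r, c = row_col(18)
    print(node_address(18), format(r, "08b"), format(c, "08b"))
    print(mesh_neighbors(0), torus_neighbors(0), ring_neighbors(0))
```

Under this assumed layout a corner node has two neighbors in the mesh and four in the torus, while every ring node has exactly two.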
Results & Discussions
The RTL view describes the inputs and outputs of the developed chip. The RTL view of the NoC is shown in fig. 4, and the functionality of each individual pin is described in table 3. The functional ModelSim simulation in fig. 5 shows the data transfer from node N3 to node N4. The functional simulation uses the following step inputs.
Step input 1: Set Reset = '1' and run; all node data outputs will be zero.
Step input 2: Set Reset = '0', apply a rising-edge clock pulse, the source_address and destination_address values, and the data for the destination node on input_data_packet, then run.
Step input 3: Apply the source address and destination address of another pair of nodes and the data on input_source, and run.
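The step inputs above can be mimicked with a small behavioral model in Python. This is a sketch under stated assumptions, not the VHDL test bench used in the paper: the class `NocModel`, the method `clock` and the packet value 0xAB are hypothetical, the packet width is not specified in the text, and the arbitration logic is not modeled. Only the signal names source_address, destination_address and input_data_packet come from the paper.

```python
# Behavioral sketch only: models reset and a single packet delivery per clock.
class NocModel:
    def __init__(self, nodes=256):
        self.nodes = nodes
        self.node_data = [0] * nodes  # data held at each node output

    def clock(self, reset, source_address=0, destination_address=0,
              input_data_packet=0):
        """Model one rising clock edge."""
        if reset == 1:
            # Step input 1: reset clears every node output to zero.
            self.node_data = [0] * self.nodes
            return
        for addr in (source_address, destination_address):
            if not 0 <= addr < self.nodes:
                raise ValueError("address out of range")
        # Step inputs 2 and 3: deliver the packet to the destination node.
        # source_address is only range-checked here, since arbitration
        # between competing sources is outside this sketch.
        self.node_data[destination_address] = input_data_packet


if __name__ == "__main__":
    noc = NocModel()
    noc.clock(reset=1)                                  # step input 1
    noc.clock(reset=0, source_address=3,
              destination_address=4,
              input_data_packet=0xAB)                   # step input 2
    print(noc.node_data[4] == 0xAB)
```

Repeating the last call with another pair of addresses corresponds to step input 3.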
FPGA Synthesis Results
The device utilization report gives the percentage utilization of device hardware for the chip implementation. Device hardware includes the number of slices, flip flops, input LUTs, bonded IOBs and gated clocks (GCLKs) used in the implementation of the design. The timing details provide the delay, minimum period, maximum frequency, minimum input arrival time before clock and maximum output required time after clock. Table 4 and table 5 show the synthesis results as device utilization and timing parameters for the mesh, torus and ring NoC. The total memory utilization required to complete the design is also listed for each individual stage. The target device is the Virtex-5 FPGA xc5vlx20t-2-ff323. 6 describes the timing variations in the design of the mesh, torus and ring NoC. From the device utilization and timing parameters, it is clear that the ring NoC has the most optimized parameters. In the torus structure, the minimum period is 2.74 % greater, the minimum input arrival time before clock is 8.45 % greater and the maximum output required time after clock is 11.65 % greater than in the mesh structure. In the ring NoC structure, the minimum period is 32.25 % less, the minimum input arrival time before clock is 62.72 % less and the maximum output required time after clock is 3.58 % less than in the mesh structure. The hardware and memory utilization of the torus and ring NoC is less than that of the mesh NoC. The supported frequency for the same target device is 600.00 MHz, 589.00 MHz and 780 MHz for the mesh, torus and ring NoC respectively, which signifies that the ring NoC is faster than the mesh and torus and requires significantly less hardware to support a particular network configuration.
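The relative comparison of the supported frequencies quoted above can be worked out with a short Python sketch. It uses only the three frequency figures stated in this section; the helper name `percent_change` is hypothetical, and the sketch does not attempt to reproduce the period and arrival-time percentages, which come from the synthesis tables.

```python
# Supported frequencies quoted in the text, in MHz.
FREQ_MHZ = {"mesh": 600.0, "torus": 589.0, "ring": 780.0}


def percent_change(value, baseline):
    """Signed percentage difference of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    for name in ("torus", "ring"):
        delta = percent_change(FREQ_MHZ[name], FREQ_MHZ["mesh"])
        print(f"{name}: {delta:+.2f} % relative to mesh")
```

With the mesh as the baseline, a positive value means a higher supported clock frequency and a negative value a lower one.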
Conclusions
The NoC designs for mesh (256 x 256), torus (256 x 256) and ring (256) are implemented successfully on a Virtex-5 FPGA. The architecture is based on a shared memory architecture, and an optimal routing scheme is suggested. The design is tested with different test cases. In each NoC configuration, the data transfer with the arbitration scheme is verified successfully in ModelSim 10.1b and on the FPGA. The synthesis report contains the hardware utilization in terms of the number of slices, flip flops, input LUTs, bonded IOBs and gated clocks (GCLKs) used in the implementation of the design. Timing analysis is also carried out for the staged network and provides the delay, minimum period, maximum frequency, minimum input arrival time before clock and maximum output required time after clock. A comparative study of the hardware and timing parameters of the mesh, torus and ring NoC structures indicates that the ring NoC has the most optimized results.
