Emulation and functional validation are essential for assessing the correctness and performance of networks-on-chip architectures. A flexible hardware/software networks-on-chip open platform (NoCOP) emulation framework is designed and implemented for exploring on-chip interconnection network architectures. An instruction set simulator and a universal serial bus communicator, both running on the host computer, act as active elements in the framework to control and configure the emulation parameters and process. The experimental results show that the proposed emulation/verification framework speeds up simulation, preserves cycle accuracy and reduces the usage of field programmable gate array resources.
Introduction
With the growing complexity of integrated circuits, there is a strong tendency to adopt multiprocessor systems-on-chip (MPSoC) architectures. An MPSoC consisting of many cores is not feasible with a single shared bus or a hierarchy of buses, because bus structures scale poorly with system size. To overcome these problems, networks-on-chip (NoC) (Benini and De Micheli, 2002; Dally and Towles, 2001) have been proposed as a promising replacement for buses and dedicated interconnections. However, NoC-based architecture involves new design challenges, such as topology selection, router design, routing algorithm, communication protocols, system tools and so on (Owens et al., 2007). NoC verification has received less attention than other design aspects (Marculescu et al., 2009), although verification is an indispensable component of NoC design flows. NoC design methodology needs to be complemented by efficient mechanisms to validate the NoC building blocks. Together, these challenges make the design of power-efficient, high-performance on-chip interconnects a very time-consuming and error-prone process (Genko et al., 2005b, 2007; Ogras et al., 2007; Wolkotte et al., 2007).
Simulation and functional validation are essential to the assessment of the correctness and performance of MPSoC and NoC architectures. After decisions regarding the communication paradigm, intellectual property (IP) core selection and infrastructure are made, simulation and emulation of the system are used to validate the design (Liu et al., 2009a).
A hardware/software networks-on-chip open platform (NoCOP) emulation framework has been devised (Liu et al., 2009b) and implemented on a field programmable gate array (FPGA) platform based on hardware acceleration technology (Xiang, 2008). An instruction set simulator (ISS) and a universal serial bus (USB) communicator running on the host computer are used as active elements in the emulation software layer to control and configure the emulation parameters and process. The NoCOP emulation framework is able to test actual physical realisations of NoC on silicon up to four orders of magnitude faster than an HDL simulator (ModelSim, for example) while preserving cycle accuracy.
The remainder of the paper is organised as follows. In Section 2, we briefly discuss some related works. In Section 3, we describe the NoCOP emulation/verification method and framework. In Sections 4 and 5, we describe the hardware and software layer of the NoCOP emulation framework. In Section 6, we present the experimental results and conclude the paper in Section 7.
Related work
A number of cycle-accurate simulation frameworks in VHDL or SystemC have been proposed in the literature (Bertozzi et al., 2005; Goossens et al., 2005; Siguenza-Tortosa and Nurmi, 2002). Though these simulators have flexibility in NoC design space exploration, they cannot use real-life traces to extensively evaluate the entire NoC system at high speed. The major issue with simulation-based approaches is the trade-off between the level of implementation detail and the slow simulation time. Detailed models can deliver very accurate results, but the simulation time can be prohibitive.
An NoC emulation platform implemented on FPGA is presented in the literature (Genko et al., 2005a, 2005b, 2005c, 2007). The NoC hardware platform, which consists of network injection, reception and controller components, is implemented on a Virtex-II FPGA. A PowerPC processor core is integrated into the hardware platform and functions as a controller. Instead of merely being the platform where the circuit is prototyped, this method can speed up functional validation and add flexibility to the NoC configuration exploration. An FPGA emulation-based NoC prototyping framework is also presented, where the main goal is to speed up the synthesis process by partial reconfiguration of hard cores (Krasteva et al., 2008). However, these methods have one major drawback: they need a processor core in the hardware to control and monitor the network, which consumes the limited resources of the FPGA.
A simulator implemented on an FPGA models a homogeneous wormhole switching network with virtual channel (VC) flow control and a torus topology (Wolkotte et al., 2007). The system consists of an SoC board and an FPGA board. A scalable multi-FPGA platform is designed for NoC emulation and debugging (Kouadri-Mostéfaoui et al., 2007, 2008). The platform is built from XUP Virtex-II Pro boards, using high-speed serial links to connect multiple FPGA boards. Four real applications mapped onto an NoC and prototyped in an FPGA are presented (Ogras et al., 2007). These methods (Kouadri-Mostéfaoui et al., 2008; Ogras et al., 2007; Wolkotte et al., 2007) lack flexibility and suffer from the limited resources of the FPGA in the case of data extraction. In order to solve these problems and keep the emulation flexible, the proposed NoC emulation framework consists of an ISS and a USB communicator running on a host computer, and a hardware platform. The ISS and USB communicator are used as active elements to control and configure parameters during the emulation process. The hardware platform has been used to validate the NoC design.
NoC emulation framework
The emulation framework of NoCOP is a combined hardware/software platform. As shown in Figure 1, the framework is divided into two layers: a hardware layer and a software layer.
Hardware layer
The hardware layer of NoCOP consists of four elements to emulate NoCs: network to be emulated, packet generator (PG) and packet receptor (PR), packet controller and result analysis (PCRA) module, and interface to the host computer.
1 Network to be emulated: The routers and network interfaces (NIs) are organised into different topologies. The router design is parameterised in terms of bit width, flow control, routing algorithm, arbiter scheme, etc.
2 PG and PR: Every node in the network has one PG and one PR. The PG is controlled by a packet controller. The PG generates packets with designated control information, such as the source, destination, packet length, packet number and interval cycle. The PR receives packets from the network.
3 PCRA module: The PCRA unit initialises the network communication pattern and analyses the network performance in terms of latency.
4 Interface between the host computer and NoC: The designer can configure the network through the advanced microcontroller bus architecture (AMBA). The advanced high-performance bus (AHB) and advanced peripheral bus (APB) are implemented on the platform. The AHB has an arbiter, a USB master module and a slave module (APB bridge). The APB is designed to connect peripherals. A universal asynchronous receiver/transmitter (UART) controller and the PCRA module are connected to the APB. The USB module is connected to the AHB. We have developed a soft reduced instruction set computer (RISC) IP core (Xiao et al., 2006) that is compatible with the MIPS 32 instruction set architecture (http://www.mips.com). The RISC core is linked with the emulated NoC through the NI. If the processor core is instantiated in the emulated system, the designer can use third-party software to debug the NoC through the enhanced joint test action group (EJTAG) interface.
Software layer
The software layer is used to configure the emulation parameters and control the emulation process of NoC architecture. The software layer of NoCOP contains three parts: USB software, ISS and personal computer (PC) UART software.
1 USB software is used to access the hardware based on the USB 1.1 protocol. The entire address space of the AMBA bus can be accessed by the USB software.
2 ISS is an instruction-accurate simulator with a MIPS 32 compatible instruction set. The ISS is connected with the USB software to access the AMBA bus. The ISS gains access to the AMBA APB bus when it enters the special memory space. The soft processor IP core can replace the ISS for emulation, in a manner similar to the method of Genko et al. (2007).
3 The communication between the emulated network and the UART is completely controlled by the UART software. We can use the commercial PC UART software tools to communicate with NoCOP hardware.
Emulation methodology
All the emulated NoC modules are implemented in Verilog. The designer uses ModelSim to simulate the NoC and emulates the NoC on the FPGA to validate its functions. The monitor is used to configure the NoC and extract data from the emulated network. The designer writes a programme that generates packets based on the traffic model. The ISS sends the programme to the emulated NoC system. The statistics are fetched and sent to the monitor module. The NoCOP emulation flow is shown in Figure 2. The advantage of our approach is that it establishes a general hardware/software emulation framework to validate and explore NoC implementations, allowing flexibility and accuracy in the emulation simultaneously.
Hardware layer
The hardware layer is designed in a modular way, which consists of the building blocks and three modules relating to the network packet: PGs, PRs and PCRA module. The building blocks include the router and the NI.
Router architecture
The router is a key component of the interconnection architecture, routing information from a source to a destination. The overall block diagram of the router is shown in Figure 3. The router is composed of three architectural blocks: input port, crossbar switch and output port. The input port includes a link controller (LC), a virtual channel buffer (VCB), a routing controller (RTC) and a virtual channel controller (VCC). The crossbar switch provides connections between all VCs and all output channels. The output port includes the arbitration unit and the output channel controller. In order to support several different application demands, a flit format is defined. The flit format shown in Figure 4 consists of two areas: the header and the payload. The header field includes the type of flit, the source address and the destination address. The payload section carries the data. The flit types are request, feedback, data header, data, multicast data header, multicast data and other. The structure of the flit and the length of each field are configurable based on application demands.
Figure 4 Flit format
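The flit layout can be illustrated with a short packing sketch. It uses the default widths reported for the router (an 11-bit header and a 64-bit payload on a 75-bit channel). The split of the header into a 3-bit type and two 4-bit addresses, the field order and the numeric type codes are assumptions made for this illustration only; the paper does not specify them.

```python
# Assumed header split: 3-bit type + 4-bit source + 4-bit destination = 11 bits.
TYPE_BITS, ADDR_BITS, PAYLOAD_BITS = 3, 4, 64
HEADER_BITS = TYPE_BITS + 2 * ADDR_BITS       # 11 bits
FLIT_BITS = HEADER_BITS + PAYLOAD_BITS        # 75 bits

# The seven flit types named in the text; the numeric codes are hypothetical.
FLIT_TYPES = {
    "request": 0, "feedback": 1, "data_header": 2, "data": 3,
    "mcast_data_header": 4, "mcast_data": 5, "other": 6,
}

ADDR_MASK = (1 << ADDR_BITS) - 1
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_flit(ftype, src, dst, payload):
    """Pack the fields into one integer, header in the most significant bits."""
    code = FLIT_TYPES[ftype]
    if not (0 <= src <= ADDR_MASK and 0 <= dst <= ADDR_MASK):
        raise ValueError("address does not fit in the address field")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in the payload field")
    header = (code << (2 * ADDR_BITS)) | (src << ADDR_BITS) | dst
    return (header << PAYLOAD_BITS) | payload


def unpack_flit(flit):
    """Return (type code, source, destination, payload)."""
    payload = flit & PAYLOAD_MASK
    header = flit >> PAYLOAD_BITS
    dst = header & ADDR_MASK
    src = (header >> ADDR_BITS) & ADDR_MASK
    code = header >> (2 * ADDR_BITS)
    return code, src, dst, payload
```

Because the field lengths are configurable in the actual design, the three width constants would change with the application; the packing logic itself stays the same.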
The router design parameters, namely the channel number, the VC number, the first-in first-out (FIFO) buffer depth and the routing algorithm, can be selected based on the designer's needs. The detailed block diagram of the router's input port is shown in Figure 5. The routing unit supports unicast and multicast routing, selected according to the packet type. The options for the routing algorithm include deterministic routing, the west-first routing algorithm and Duato's protocol (Duato et al., 2003), which are named router_D, router_P and router_A, respectively. The router has five ports and each port has two VCs. The physical channel is 75 bits wide by default. The header field is 11 bits and the payload field is 64 bits.
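The difference between the deterministic and the partially adaptive option can be sketched on a 2D mesh. The paper does not state which deterministic scheme router_D uses, so dimension-order (XY) routing is assumed here; the coordinate convention (x grows eastwards, y grows northwards) and the port labels are likewise assumptions. The west-first function follows the usual turn-model rule: all westward hops are taken first, after which the packet may choose adaptively among the remaining productive directions.

```python
def xy_route(cur, dst):
    """Deterministic dimension-order routing: finish X, then Y.

    Returns one output port: 'E', 'W', 'N', 'S' or 'L' (local)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"


def west_first_route(cur, dst):
    """Minimal west-first routing: returns the list of permitted ports.

    If the destination lies to the west, only 'W' is allowed.
    Otherwise every productive direction among E, N, S is allowed,
    and the router may pick one adaptively (e.g. by buffer occupancy)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ["W"]
    ports = []
    if dx > cx:
        ports.append("E")
    if dy > cy:
        ports.append("N")
    if dy < cy:
        ports.append("S")
    return ports or ["L"]
```

For a packet travelling from (0, 0) to (2, 1), the XY function always returns the east port, whereas the west-first function permits both east and north.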
Figure 5 Block diagram of input port
The block diagram of the router's output port is shown in Figure 6. The output port consists of an arbiter, an output channel controller and a multiplexer. The arbiter is composed of a programmable priority encoder, a right shifter and an encoder. The encoder design uses the thermometer encoding method (Gupta and McKeown, 1999).
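A behavioural sketch may help to show how a programmable priority encoder can be built from the parts listed above. How the shifter, encoder and thermometer code are actually wired together in the NoCOP arbiter is not described, so both variants below are assumptions: the first rotates the request vector so that the highest-priority requester lands at bit 0, and the second masks the requests with a thermometer-coded pointer. The function names are hypothetical.

```python
def lowest_set_bit(x):
    """Index of the least significant set bit (fixed-priority encoder)."""
    return (x & -x).bit_length() - 1


def grant_by_rotate(requests, pointer, n):
    """Grant the first requester at or after `pointer`, wrapping around.

    `requests` is an n-bit mask (bit i = requester i)."""
    mask = (1 << n) - 1
    requests &= mask
    if requests == 0:
        return None
    # Right-rotate by `pointer` so the preferred requester sits at bit 0.
    rotated = ((requests >> pointer) | (requests << (n - pointer))) & mask
    return (pointer + lowest_set_bit(rotated)) % n


def thermometer(pointer, n):
    """Thermometer code with ones at bit positions >= pointer."""
    return ((1 << n) - 1) & ~((1 << pointer) - 1)


def grant_by_mask(requests, pointer, n):
    """Same decision using a thermometer mask instead of a rotation."""
    requests &= (1 << n) - 1
    if requests == 0:
        return None
    masked = requests & thermometer(pointer, n)
    # Fall back to the unmasked requests when nothing is at or above the pointer.
    return lowest_set_bit(masked if masked else requests)
```

Both functions return the same grant for every request vector and pointer; they differ only in the kind of logic a hardware implementation would need.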
Network interface
The NI is responsible for adapting resource elements to the on-chip network. It is often described as the glue logic necessary to enable a resource to communicate with the network. We propose an NI design that decouples the resource dependent parts (RDP) from the resource independent parts (RIP) (Jantsch and Tenhunen, 2003). The block diagram of the NI based on the mutual interface protocol is shown in Figure 7.
Packet generator
The PG consists of three main parts: a linear feedback shift register (LFSR) random generator, a configure module and a packet generate controller. Figure 9 illustrates the block diagram of the PG. The LFSR random generator produces a pseudo-random number, which is used as the destination node number. The Xilinx FPGA series have an IP core named LFSR that can generate the LFSR pseudo-random number. We currently use this LFSR IP core for simplicity. For detailed information about the LFSR generator, please refer to the Xilinx datasheet (http://www.xilinx.com). The configure module receives the configuration information from the PCRA module. The configuration information includes PG start up, destination node, packet length, packet number and transmitting interval. The configure module has three counters: a packet length counter, a packet number counter and an interval timer. The packet generate controller generates the control signals based on the contents of these registers.
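The role of the LFSR can be illustrated in software. The width and feedback taps of the Xilinx LFSR core used in the PG are not given, so the sketch below assumes a 16-bit Fibonacci LFSR with the well-known maximal-length polynomial x^16 + x^14 + x^13 + x^11 + 1. Taking the low bits of the state as the destination node number for a 16-node network is also an assumption.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def random_destinations(seed, count, num_nodes=16):
    """Return `count` pseudo-random destination node numbers.

    `seed` must be non-zero, because the all-zero state never changes.
    `num_nodes` is assumed to be a power of two so that masking is uniform."""
    if seed == 0:
        raise ValueError("LFSR seed must be non-zero")
    state, out = seed & 0xFFFF, []
    for _ in range(count):
        state = lfsr16_step(state)
        out.append(state & (num_nodes - 1))
    return out


def lfsr_period(seed):
    """Count the steps until the LFSR returns to its seed state."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state, steps = lfsr16_step(state), steps + 1
    return steps
```

With this polynomial, any non-zero seed cycles through all 65,535 non-zero states before repeating, so the generated destinations are deterministic and reproducible from run to run.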
The packet generate controller is designed as a four-state finite state machine, which is illustrated in Figure 10.
1 IDLE is the idle state. When cfg_start is asserted, the machine starts the PG, leaves the idle state and enters the PKG_TRANS state.
2 PKG_TRANS controls the PG while it injects packets into the network. After a packet transfer finishes, the machine moves to either WAIT or STOP, depending on the packet number counter.
3 WAIT is the state between transfers: one packet has been transferred, but not all packets have yet been sent. When the interval timer reaches the predefined number of waiting cycles, the machine returns to the PKG_TRANS state to transfer the next packet.
4 STOP is the end state of a transfer. The machine returns from STOP to IDLE after cfg_start is deasserted.
When a flit is injected into the network, the PG inserts a time stamp in its body. This time stamp is used by the PR to calculate the transfer cycles.
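The four states above can be sketched as a cycle-by-cycle behavioural model. The state names and the cfg_start signal come from the text; the class name, the one-flit-per-cycle injection, and the exact cycle at which each counter is loaded are assumptions, since the paper does not give that level of detail.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    PKG_TRANS = "PKG_TRANS"
    WAIT = "WAIT"
    STOP = "STOP"


class PacketGenerator:
    """Behavioural model of the packet generate controller (assumed timing)."""

    def __init__(self, packet_length, packet_number, interval):
        if min(packet_length, packet_number, interval) < 1:
            raise ValueError("all parameters must be at least 1")
        self.packet_length = packet_length    # flits per packet
        self.packet_number = packet_number    # packets to send
        self.interval = interval              # idle cycles between packets
        self.state = State.IDLE
        self.flits_left = self.packets_left = self.wait_left = 0

    def step(self, cfg_start):
        """Advance one clock cycle; return True if a flit is injected."""
        if self.state is State.IDLE:
            if cfg_start:
                self.packets_left = self.packet_number
                self.flits_left = self.packet_length
                self.state = State.PKG_TRANS
            return False
        if self.state is State.PKG_TRANS:
            self.flits_left -= 1
            if self.flits_left == 0:          # packet finished
                self.packets_left -= 1
                if self.packets_left == 0:
                    self.state = State.STOP
                else:
                    self.wait_left = self.interval
                    self.state = State.WAIT
            return True
        if self.state is State.WAIT:
            self.wait_left -= 1
            if self.wait_left == 0:           # interval elapsed
                self.flits_left = self.packet_length
                self.state = State.PKG_TRANS
            return False
        # STOP: hold until cfg_start is deasserted.
        if not cfg_start:
            self.state = State.IDLE
        return False
```

In this sketch a node injects packet_length flits every packet_length + interval cycles, so the interval setting gives direct control over the offered load of each PG.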
Packet receptor
The PR receives flits from the network. Every flit carries a time stamp in its body that records the time at which the flit was injected into the network. The PR extracts the time stamp and subtracts it from the time at which the flit is received. The result is the number of cycles the flit spent in the network. The time stamp is 32 bits long.
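The subtraction performed by the PR can be written out as follows. The sketch assumes that the time stamp comes from a free-running 32-bit cycle counter that wraps around on overflow; the paper states only the 32-bit width, so the wrap-around handling is an assumption.

```python
TIMESTAMP_BITS = 32
TIMESTAMP_MASK = (1 << TIMESTAMP_BITS) - 1


def flit_latency(receive_cycle, time_stamp):
    """Cycles spent in the network, computed modulo 2**32.

    The modular subtraction stays correct even if the counter wrapped
    once between injection and reception."""
    return (receive_cycle - time_stamp) & TIMESTAMP_MASK
```

For example, a flit stamped at cycle 100 and received at cycle 120 yields 20 cycles, and a flit stamped five cycles before the counter wraps and received five cycles after it yields 10 cycles.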
Packet controller and result analysis
The PCRA module fulfils two functions:
1 controls all PGs to generate packets
2 keeps an account of the total number of flits that the PRs received and the total number of cycles that these flits spent in the network.
Figure 11 illustrates the PCRA, which consists of three parts: an APB bus interface connecting to the APB bus, a packet controller and a mean cycle statistics module. The mean cycle statistics module is used to calculate the average message latency of flits transmitted in the network. The packet controller includes two packet injection control registers, a system control register, a system status register and a latency register. Programmers use these registers to configure and monitor the network.
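The bookkeeping done by the mean cycle statistics module amounts to two accumulators and one division. The following minimal sketch shows that calculation; the class and method names are hypothetical, and the way the hardware exposes the result through the latency register is not modelled.

```python
class MeanCycleStatistics:
    """Accumulates flit counts and in-network cycles reported by the PRs."""

    def __init__(self):
        self.total_flits = 0
        self.total_cycles = 0

    def record(self, latency_cycles):
        """Account for one received flit and its measured latency."""
        self.total_flits += 1
        self.total_cycles += latency_cycles

    def mean_latency(self):
        """Average cycles per flit, or 0.0 before any flit has arrived."""
        if self.total_flits == 0:
            return 0.0
        return self.total_cycles / self.total_flits
```

Keeping only the two running totals means the statistics need a fixed amount of storage regardless of how many flits are observed during an emulation run.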
Software layer
Three tools are located in the software layer: the USB communicator, the ISS and the UART monitor. The USB communicator is used to communicate with the USB IP core in slave mode. The interface of the USB communicator is illustrated in Figure 12. We select 'IPC only' to connect the USB communicator with the ISS. If the ISS is not used, the USB communicator can be used to download the programme into the FPGA. The application programme is simulated on the ISS. The simulator accesses the APB bus through the USB communicator at the specified address. The UART monitor handles communication between the PC and the hardware layer.
To facilitate the NoC emulation, designers use an application programme running on the ISS to log and control the network. This programme, which is built from several driver functions, performs initialisation and controls the emulation process. The flowchart of this programme is shown in Figure 13. The programme initialises the UART, prints initial information, waits for the user's input and chooses the transfer mode. If the script mode is selected, a file in a designated format must be loaded through the UART. The programme then configures the network according to the script file and displays the result. If the internal mode is chosen, the network is automatically configured using internal parameters, a transfer is started and the result is displayed.
If the user mode is chosen, the programme waits for the next input to decide what to do. The network configuration provides three choices: 'configure node', 'start up configuration' or 'display result'. If one chooses 'configure node', one first chooses a source node and then configures the destination, packet number, packet length and packet interval time of that source node. If one chooses 'start up configuration', one types all the source nodes that need to start, and the network then starts. If one chooses 'display result', the node configuration information and the mean cycle are displayed.
Experimental results
The NoCOP FPGA board (Wu, 2007) is illustrated in Figure 14. The central part of the board is a Xilinx Virtex-4 LX160. The FPGA synthesis tool is Synplify Pro 8.1. The FPGA resource usage of the building blocks is listed in Table 1. The FPGA delay information of the building blocks is listed in Table 2. The NoCOP emulation system can run at a clock rate of 110 MHz. We use a typical workload to evaluate the network latency. The workload model is defined by three parameters: distribution of destinations, injection rate and message length. Figure 15 shows the average message latency versus injection rate on a 4 × 4 mesh when using the deterministic, partially adaptive and fully adaptive routing algorithms under different workloads. Several destination distributions are used to evaluate the network latency, including uniform, hot spot, bit reversal and matrix transpose.
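The four destination distributions can be made concrete for a 4 × 4 mesh with the following sketch. It uses the common textbook definitions of these patterns; the node numbering (node = 4·y + x), the choice of hot-spot node and the hot-spot fraction are assumptions, because the paper does not report the exact settings used in the experiments.

```python
import random

MESH_DIM = 4
NUM_NODES = MESH_DIM * MESH_DIM   # 16 nodes
ADDR_BITS = 4                     # bits needed to number 16 nodes


def uniform(rng):
    """Every node is equally likely to be the destination."""
    return rng.randrange(NUM_NODES)


def hot_spot(rng, hot_node=0, hot_fraction=0.2):
    """A fraction of the traffic targets one node; the rest is uniform.

    `hot_node` and `hot_fraction` are illustrative values only."""
    if rng.random() < hot_fraction:
        return hot_node
    return rng.randrange(NUM_NODES)


def bit_reversal(src):
    """Destination is the source address with its bits reversed."""
    return int(format(src, "0{}b".format(ADDR_BITS))[::-1], 2)


def matrix_transpose(src):
    """Node (x, y) sends to node (y, x)."""
    x, y = src % MESH_DIM, src // MESH_DIM
    return x * MESH_DIM + y
```

Bit reversal and matrix transpose are permutations, so each source always sends to the same destination, which concentrates traffic on particular links and tends to favour adaptive routing over deterministic routing.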
As seen from Figure 15, when the router uses an adaptive routing algorithm under a heavy network load and a non-uniform distribution, the average latency decreases by more than 50%. Under a uniform distribution at a high injection rate, the deterministic routing algorithm performs better than the adaptive routing algorithms.
Conclusions
With the NoCOP, designers can validate and test different NoC implementations in MPSoC. A method for overcoming the drawbacks of emulation systems that stem from inflexibility and the lack of debugger software has been presented. The emulator gives detailed insight into the behaviour of the building blocks in the network. The software tool controls the emulation process and the hardware tracks the packet information. As a demonstration of the approach, the NoCOP framework has been presented and mapped onto our platform. As a next step, we will develop a multi-FPGA system and build more powerful tools using the proposed method to explore the NoC architecture design space.
