Abstract-The Network-on-Chip (NoC) approach to designing System-on-Chip (SoC) devices is used to overcome the scalability and efficiency problems of traditional on-chip interconnection schemes, such as shared buses and point-to-point links. NoC design draws on concepts from computer networks to interconnect Intellectual Property (IP) cores in a structured and scalable way, promoting design re-use. This paper presents the design and evaluation of a parameterizable NoC router for FPGAs. Low area overhead for NoC components is pivotal in FPGAs, which have limited logic and routing resources. We obtain a low-area router design by applying optimizations in the switching fabric and by using dual-purpose buffer/connection signals. We use store-and-forward flow control with input and output buffering. We provide a component library to increase re-use and allow tailoring of parameters for application-specific NoCs of various sizes. The proposed router supports the mesh architecture, which is well known for its scalability and simple XY routing algorithm. We present IP-core-to-router mapping strategies for multi-local port routers that offer ample opportunity to optimize the NoC for application-specific data traffic. A set of experiments was conducted to explore the design space of the proposed NoC router using different values of key router parameters: channel width (flit size), arbitration scheme and IP-core-to-router mapping strategy. Area and latency results from the experiments are presented and analyzed. These results will be useful to designers who want to implement a NoC on FPGAs.
I. INTRODUCTION
The complexity of a system on silicon is comparable to that of macro systems such as space shuttles or skyscrapers when measured by the number of basic elements intricately connected together, but at a micro level. Moore's law describes an important trend in the history of the integrated circuit (IC): the number of transistors that can be placed on an IC increases exponentially, doubling approximately every two years. This trend has continued for more than half a century. Increasing transistor density, higher operating frequencies, shorter time-to-market and reduced product life cycles characterize today's semiconductor industry [8].
As semiconductor technology evolves, electronic industries continually push the envelope for greater functional and performance capabilities in new electronic systems. This is creating a continuing need for new design methodologies and design space exploration.
An embedded system is a special-purpose computer system designed to perform one or a few dedicated functions, often with real-time computing constraints. Embedded systems range from portable devices such as digital watches, cameras and MP3 players, to large stationary units like traffic lights and factory controllers. Complexity varies from low, with a single microcontroller chip, to very high, with multiple intellectual property (IP) cores and peripherals. The exponential growth in chip density is opening the door to the implementation of even larger and more complex systems, where complete embedded systems can be built on a single chip. This paradigm shift is known as System-on-Chip (SoC), and such designs are becoming increasingly common and complex. SoCs may contain many hardware and/or software blocks, such as processors, DSPs, memories, peripheral controllers, gateways, and other custom logic blocks.
The on-chip interconnect architecture used in SoCs is a key factor in overall performance. Since the introduction of the SoC concept, designers have relied on a custom-designed, ad hoc mixture of buses and dedicated wires for on-chip interconnection. Dedicated wires are effective for systems with a small number of cores, but available routing resources are quickly used up as system complexity grows. They also provide poor reusability and flexibility. A shared bus is a set of wires common to multiple cores, which increases both reusability and scalability. This scheme works well for master-slave communication patterns, where peripherals (slaves) wait for data to be received or requested from a more complex IP core (master). However, when there are several masters in the system, contention creates a bottleneck that worsens as complexity grows. Although hierarchical bus models connected by bridges may ease some of these constraints, they also complicate protocols while failing to fully eliminate the scalability problem. Design and verification times also grow with SoC complexity [13].
Infrastructure determines the network architecture and includes topology, channel width, buffering and floor planning. These parameters are all application specific and should be left to the designer's discretion. Channel width describes the size of the data passed between routers. It is important because it directly affects bandwidth, but a wider channel also increases area and power consumption. Our library allows for a parameterizable channel width. Topology refers to the way routers are connected in the network. It should be chosen to minimize area while maximizing utilization without causing bottlenecks. Saldana et al. evaluate different topologies in terms of area and routing resources [3]. Ring and star achieve slightly better results, although both fail to provide solutions to the scalability problem.
As the number of nodes increases, ring suffers from large end-to-end delay and star suffers from a central bottleneck. Narasimhan et al. compare the performance of 2D torus to mesh, showing a slight edge for 2D torus [4]. However, they do not compare the extra routing resources needed or the increased area of each router due to a more complex routing algorithm. We restrict the topology to mesh, which is the most common among FPGA networks, but allow for various implementation sizes up to an 8 x 8 network. With available FPGAs, it would be impractical to build anything larger due to area and routing resource constraints.
Buffering defines the approach used to store messages while they cannot be scheduled. It has a serious impact on the area overhead of the network, but it can also substantially reduce network latency. We use input and output buffering to prevent head-of-line blocking, which occurs when one or more blocked packets at the head of a queue stall later packets that could otherwise be processed. The inclusion of an output buffer allows the blocked packet to move out of the input buffer, unblocking the later packets for processing. Buffer allocation should be based on traffic patterns. The authors of Hermes [8] design a generic router with a parameterizable buffer depth. They also provide insight by testing various buffer sizes for area and performance.
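The head-of-line blocking argument above can be illustrated with a small cycle-based sketch. This is not the router's implementation; the function name `forwarded_packets`, the one-packet-per-cycle timing and the port names are assumptions made only for this illustration. It models a single input FIFO whose head packet targets a blocked output, with and without an output buffer.

```python
from collections import deque

def forwarded_packets(packets, blocked_port, out_depth, cycles=4):
    """Return names of packets that leave the router within `cycles` cycles.

    packets: list of (name, output_port) in arrival order at ONE input FIFO.
    blocked_port: output port whose downstream link accepts nothing.
    out_depth: output-buffer depth per port (0 models input buffering only).
    All timing here is a hypothetical simplification, not the paper's design.
    """
    input_fifo = deque(packets)
    out_buffers = {}  # output port -> deque of packet names
    sent = []
    for _ in range(cycles):
        # 1) Drain every output buffer whose downstream link is free.
        for port, buf in out_buffers.items():
            if buf and port != blocked_port:
                sent.append(buf.popleft())
        # 2) Only the head of the input FIFO may move (strict FIFO order).
        if input_fifo:
            name, port = input_fifo[0]
            if out_depth == 0:
                # No output buffer: the head must go straight onto the link.
                if port != blocked_port:
                    sent.append(name)
                    input_fifo.popleft()
            else:
                buf = out_buffers.setdefault(port, deque())
                if len(buf) < out_depth:
                    buf.append(name)
                    input_fifo.popleft()
    return sent

traffic = [("p0", "East"), ("p1", "North")]
input_only = forwarded_packets(traffic, blocked_port="East", out_depth=0)
with_output_buffer = forwarded_packets(traffic, blocked_port="East", out_depth=1)
```

In this sketch `p1` never leaves when only the input FIFO exists, because `p0` sits at the head. With an output buffer, `p0` parks in the East buffer and `p1` proceeds. Note that the relief lasts only until the output buffer itself fills.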
Floor planning involves the placement of network components on the chip. It is more important for ASIC implementations than for FPGA implementations.
Communication mechanism deals with flow control, switching mode, switching mechanism, and routing algorithm. These parameters are usually set when designing the NoC platform. Flow control deals with the allocation of channels and buffers to data as it travels from source to destination. The two extremes are packet switching and circuit switching. In circuit switching, there is a dedicated connection between the two modules over which raw data can be transmitted freely. This technique requires time to build and tear down connections, and its channel reservation often leads to idle time and unpredictable blocking. The only upside of this method is its ability to provide guaranteed bandwidth while a connection is held. This method does not scale as well. In packet switching, data is broken into packets which carry routing information. Packets can further be broken down into flow control units (flits). Modules can send packets at any time and there are often many different packets in flight at a given time. The routers must process and redirect each packet accordingly.
The switching mode defines how packets move through the network. The most important modes are store-and-forward (SAF), virtual cut-through (VCT), and wormhole (WH). In SAF, a switch cannot forward a packet until all its flits have been received. Therefore, latency is proportional to packet size. In WH, the first flit (header) determines the next hop and all remaining flits follow. Therefore, latency is proportional to flit size. This method combines packet-switched and circuit-switched ideas but also leads to channel reservation. It also requires a complex routing algorithm. VCT combines both ideas to provide latency based on flit size without idle time by guaranteeing buffering before setting up the connection. However, this method uses a large number of buffers and very complex routing algorithms, making it unsuitable for light-weight networks.
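The latency contrast between SAF and WH can be made concrete with a first-order model. The sketch below is an idealized textbook estimate, not a measurement of the proposed router: it assumes one flit crosses one link per cycle, no contention, and zero router processing delay. The function names are chosen for this illustration only.

```python
def saf_latency_cycles(hops: int, flits_per_packet: int) -> int:
    """Store-and-forward: each hop receives the whole packet before forwarding,
    so the full packet transfer time is paid on every link."""
    return hops * flits_per_packet

def wormhole_latency_cycles(hops: int, flits_per_packet: int) -> int:
    """Wormhole: the header advances one hop per cycle and the remaining
    flits stream behind it in pipeline fashion."""
    return hops + flits_per_packet - 1

# Example: an 8-flit packet crossing 4 links.
saf = saf_latency_cycles(hops=4, flits_per_packet=8)
wh = wormhole_latency_cycles(hops=4, flits_per_packet=8)
```

Under these assumptions the SAF estimate grows with the product of hop count and packet length, while the WH estimate grows with their sum, which is why SAF is most attractive when packets are short.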
We have chosen SAF for its light-weight algorithm and to prevent channel reservation. Future testing may extend flexibility to include WH as well. Switching refers to how connections are made inside a router. We use a partial crossbar scheme to save area. The routing algorithm determines the path the packet will take. We use XY routing for its simplicity and low area overhead. This scheme also prevents livelock. Routing schemes can also require congestion control and recovery mechanisms, which can lead to added area overhead. We allow this to be handled by the application layer.
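XY routing as described above can be sketched in a few lines. The coordinate convention (x grows toward East, y grows toward North) and the function names are assumptions for this example; the source does not specify the router's coordinate encoding.

```python
def xy_next_port(current, destination):
    """Return the output port chosen by XY (dimension-order) routing.

    The packet first travels along X until the column matches, then along Y.
    Assumed convention: x increases toward East, y increases toward North.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "East"
    if dx < cx:
        return "West"
    if dy > cy:
        return "North"
    if dy < cy:
        return "South"
    return "Local"  # already at the destination router

def xy_path(source, destination):
    """List the output port taken at every router along the route."""
    step = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}
    position, ports = source, []
    while True:
        port = xy_next_port(position, destination)
        ports.append(port)
        if port == "Local":
            return ports
        position = (position[0] + step[port][0], position[1] + step[port][1])

route = xy_path((0, 0), (2, 1))
```

Every hop in this sketch reduces the remaining distance to the destination by one, so a packet always reaches its destination in a bounded number of hops, which is consistent with the livelock-freedom noted above.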
Mapping determines how to integrate a given application into the NoC platform and includes scheduling and module mapping. Scheduling is a traditional computer science topic, but most work neglects interprocessor communication. Arbitration schemes determine the priority of packets in routers across the network and can be static or dynamic. Dynamic arbitration makes its decision at run-time and is more flexible, but it also requires a larger area. Our library provides a few different components to allow for area and performance trade-offs. Arbitration can also deal with preventing deadlock. Module mapping aims at assigning IP modules to different locations to minimize traffic. These parameters are application specific and are both tested later.
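The difference between static and dynamic arbitration can be sketched as follows. Both arbiters are generic illustrations, not the components of the library described here: a fixed-priority arbiter stands in for the static case and a round-robin arbiter for the dynamic case, with names chosen only for this example.

```python
def fixed_priority_grant(requests):
    """Static arbitration: the lowest-numbered requesting port always wins."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None  # nobody is requesting

class RoundRobinArbiter:
    """Dynamic arbitration: the search starts just after the last winner,
    so every requesting port is eventually served."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last_granted = num_ports - 1  # so port 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.num_ports + 1):
            index = (self.last_granted + offset) % self.num_ports
            if requests[index]:
                self.last_granted = index  # state that costs extra area in hardware
                return index
        return None

requests = [True, True, False]
static_grants = [fixed_priority_grant(requests) for _ in range(3)]
arbiter = RoundRobinArbiter(num_ports=3)
dynamic_grants = [arbiter.grant(requests) for _ in range(3)]
```

With ports 0 and 1 requesting continuously, the fixed-priority arbiter starves port 1, while the round-robin arbiter alternates between them. The stored `last_granted` state hints at why dynamic schemes tend to need more area.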
III. RELATED WORK
Our router has been designed and synthesized on an Altera Stratix II FPGA. Therefore, although there are a number of ASIC and custom IC implementations, we restrict our discussion of related work to FPGA implementations. This section is intended to review the state of the art for NoC implementation on FPGAs, although the authors do not make claims about its completeness.
The first working NoC implementation on FPGAs was presented by Marescaux et al. [6]. It has several drawbacks, mainly its large size and a one-dimensional architecture that fails to provide a high degree of scalability. They extend their work in [7], allowing a more flexible architecture, but the design still suffers from a large area. They use VCT flow control, which is now considered too area intensive for FPGA platforms [14] because it requires complex routing logic without eliminating any buffer constraints.
Moraes et al. present Hermes [8], a router with parameterizable data width and buffer depth. They perform simulations on a 5 x 5 mesh to explore the buffer depth parameter. They conclude that increasing buffer size reduces latency, but only up to a saturation point. Their design uses centralized arbitration and routing units, which decreases area but stalls performance because routing requests are queued and handled one at a time. Their design also suffers from a very low clock speed. They later extend their work to provide automatic router generation and a traffic analyzer [9].
A comparable router, RASoC [10], was presented by Zeferino et al. The main difference is that they use WH flow control. Performance differences are yet to be compared and may be considered for future work, as a downfall of WH is that it reserves channels, which can cause blocking. WH also requires complex routing logic as well as extra bits in the datapath for framing. They also used an Altera FPGA to synthesize their 5-port, 8-bit router, which occupies 486 LEs and has a clock frequency of approximately 57 MHz. This area is quite large for a router whose buffers are limited to 4 per port.
PNoC, proposed by Hilton et al. in [13], is a router with circuit-switched flow control. They test their router against bus-based approaches to show improvements. However, routing complexity grows as the number of ports or the number of routers increases, which reduces scalability. It also suffers from the typical circuit-switching setup and teardown latencies and from possible idle time, which could block other needed communication.
Sethuraman et al. propose LiPaR in [14], which was the starting point of our design and to which we added significant improvements. They use SAF, input and output buffering, and decentralized components. Optimizations are made in the crossbar matrix to reduce area through careful analysis of the XY routing algorithm. We extend these optimizations to the arbitration unit. They use a single 5x5 crossbar matrix for switching rather than five 5x1 partial crossbars. Their complex crossbar design results in a slower clock speed and increased area.
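The area saving available from analyzing XY routing can be estimated by counting crossbar connection points. The sketch below is generic dimension-order reasoning, not the exact crossbar of LiPaR or of the proposed router. It assumes a 5-port router, no U-turns, no Local-to-Local loopback, and that a packet already travelling in Y (arriving on the North or South input) never turns back into X.

```python
PORTS = ("North", "East", "South", "West", "Local")

def legal_inputs(output):
    """Inputs that may be connected to `output` under XY routing.

    East/West outputs carry only X-dimension traffic, so they are fed by the
    opposite X input (packet continuing straight) or by the Local input.
    North/South/Local outputs may be fed by any input except themselves.
    """
    if output == "East":
        return ("West", "Local")
    if output == "West":
        return ("East", "Local")
    return tuple(port for port in PORTS if port != output)

full_crossbar_points = len(PORTS) * len(PORTS)
partial_crossbar_points = sum(len(legal_inputs(out)) for out in PORTS)
```

Under these assumptions each output needs only a 2-input or 4-input selector instead of a 5-input one, and the number of connection points falls from 25 to 16.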
They later propose multi-local port routers (MLPR) in [15], which have the potential to improve area and performance metrics. However, the authors do not provide any synthesis results to support their proposal. Another extension the authors propose is Optimap [16], an exhaustive CAD tool for mapping IP cores and choosing network size.
Vestias et al. propose GNoC in [17] , a generic router which supports a range of routing, switching and arbitration protocols. They create a tool for exploring the sharing of some decentralized components to reduce area that is based on the injection rate of ports. Unfortunately, they lock all protocols to certain values and do not explore them further. Their tool shows how they can save area when injection rates are low but does not test to see if performance is degraded.
MoCres, designed by Janarthanan et al. in [18], uses complex VCT flow control and attempts to reduce area by centralizing components at the cost of performance. They create multiple clock domains to enable high clock frequencies during transfers. Optimizations from XY routing in the crossbar matrix have been extended to the routing algorithm, which gave us the idea for a further extension to the arbitration unit. We have also used their idea of creating VHDL wrappers to simulate the stand-alone router or routing configurations to compare parameters.
Our paper attempts to combine the best router characteristics from the work above, making as many area optimizations as possible while concentrating on system performance. We notice a lack of evaluation and comparison of network parameters on FPGAs and test accordingly. Most work has focused on dynamic arbitration schemes, mainly round robin (RRA), which may be too area consuming when implementing decentralized components. The data width is often set to 8-bit flits, as many papers assume a size without analysis. Most importantly, we agree that MLPR offers an opportunity to optimize data traffic. Our plan is to present area utilization and performance values for the above network parameters to help future designers make accurate decisions for their computing needs.
IV. ROUTER ARCHITECTURE
In this section we describe the architecture of the proposed parameterizable router for NoC implementation on FPGAs. The router has 4 ports, North, East, South, and West, for communication with neighboring routers. There can also be anywhere from 1 to 4 Local ports for connecting to IP cores. FIFO buffers are placed at the input and output ports for temporary storage. Communication between them is established by a two-way handshake of request/grant signals driven by logic controllers. These signals are also used to configure the partial crossbar switch, which reduces area overhead. The partial crossbar switch serves as the data connection from input port to output port. We also use the empty/full status signals from the FIFO buffers as request/grant signals for inter/intra-router communication (Figure 2). Further details are provided in the following paragraphs.
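The dual use of FIFO status signals as handshake signals can be sketched in software. This is a behavioral illustration under assumptions, not the router's VHDL: the class `Fifo` and the function `transfer_one` are hypothetical names, and the mapping "request = source not empty, grant = destination not full" is one plausible reading of the description above.

```python
from collections import deque

class Fifo:
    """Bounded FIFO exposing the two status flags a hardware FIFO provides."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.depth

def transfer_one(source, destination):
    """Move one flit if the handshake succeeds; return True on a transfer.

    Assumed mapping: the source's 'not empty' flag acts as the request and
    the destination's 'not full' flag acts as the grant, so no separate
    handshake registers are modeled.
    """
    request = not source.empty
    grant = not destination.full
    if request and grant:
        destination.items.append(source.items.popleft())
        return True
    return False

upstream = Fifo(depth=8)      # depth 8 follows the buffer depth used in the paper
upstream.items.extend(["flit0", "flit1"])
downstream = Fifo(depth=1)    # deliberately tiny to show back-pressure
first = transfer_one(upstream, downstream)
second = transfer_one(upstream, downstream)   # refused: downstream is full
downstream.items.popleft()                    # downstream consumes a flit
third = transfer_one(upstream, downstream)
```

Reusing status flags this way means back-pressure propagates without extra control state, which fits the stated goal of reducing area overhead.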
Buffer size has been set to a depth of 8 bytes. This parameter has been explored in related work [8], which showed that a buffer size of 8 bytes gives reasonably good area and latency results. Flit size is parameterizable, with 8 bits being the smallest possible size. The first flit, known as the header, contains routing coordinates used to identif [...] port the packet is destined [...] having up to 4 local ports [...] used to identify between [...] router identification, whic [...] implementation does not [...] (HLP), but could easily be [...] level if necessary.
The block diagram of t [...] Figure 3. Each port mod [...] therefore, includes the re [...] even though an input cha [...] output channel. Figure 4 [...] input and output channels [...] and output channel runs it [...] request and set up concurr [...] include details on the I/ [...] switch designs.
A. Input Co
All input [...] buffer unit o [...] signal from [...] Once the w [...] signal from [...] decide its nex [...] sends a requ [...] logic control [...] the output p [...] transferred, [...] empties, the [...]
B. Switchin
The cross [...] an interconn [...] between inpu [...] have been m [...] partial schem [...] output rather [...] port router. E [...] Next, there a [...]
[...] rity to the po [...] he first schem [...] ut port is assig [...] n, has priority, [...] assed to the n [...] n is not making [...] port closes t [...] ch like round [...] lementations.
V. EXPER
In this sectio [...] mework for ex [...] sent a summ [...] ulation results [...] NoC-based sy [...] ues [20]. We c [...] ily, device [...] phics Modelsi [...] vity. All route [...] e been imple [...] inally tested f [...] router coordi [...] ues of the fol [...] e, flit size, an [...] pping). Resul [...] ncy in terms [...] ncy was later [...] erage through [...] wn) using tota [...]
Follow-up research can use the developed infrastructure to implement and evaluate NoC architectures with different communication mechanism parameters (switching mode, routing algorithm) to further decrease area and/or increase performance. Currently, another member of our research group is working on the design of a network interface that allows an IP core running the Wishbone protocol to connect to our router, which would enable router evaluation using real-world applications.
