The regular 2-D mesh topology has been used for most Networks-on-Chip (NoCs) on FPGAs. Because some applications generate spatially biased traffic, some links have low utilization, which makes customization by link removal effective. In this paper, a link removal strategy that customizes the routers in an NoC is proposed for reconfigurable systems in order to minimize the required hardware amount. Based on pre-analyzed traffic information, links that carry little communication are removed to reduce the hardware cost while maintaining adequate performance. Two policies are proposed to avoid deadlocks, and they outperform up*/down* routing, a representative deadlock-free routing for irregular topologies. For the image recognition application susan, the proposed method saves 30% of the hardware amount without performance degradation.
Introduction
Multi-core systems have become a common architecture in reconfigurable systems mounted on FPGAs as well as in ASIC Systems-on-Chip (SoCs). A number of computing modules that communicate with each other are implemented in a single FPGA chip. While traditional buses can be used for connecting multiple modules in small systems, packet-based Network-on-Chip (NoC) has received attention as a promising interconnect for large-scale reconfigurable systems [1]. With the NoC approach, system modules communicate by sending packets to one another over the network. Since multiple messages can be transferred simultaneously, the total bandwidth is much higher than that of a shared bus and scales with the system size. Since the wiring resource is shared, the area for the network is much smaller than that of dedicated wiring between modules.
Although most discussion of NoCs has targeted ASICs, NoCs are also well suited to FPGAs, which have large wiring delay, and several studies have been reported recently [2]. For most NoCs on an FPGA, a 2-D mesh has been used, since it is efficient in terms of area and power consumption compared with other topologies, and it also matches the two-dimensional nature of the island-style FPGA interconnect. However, some parallel embedded applications generate spatially biased traffic, which makes the link utilization of the 2-D mesh unbalanced. By making the best use of the flexibility of FPGAs, the NoC architecture can be optimized for such unbalanced traffic. When some links are rarely used, their corresponding channel buffers are also rarely used, even though the channel buffers often dominate the router size.
To decrease the amount of hardware for the router, such ports can share a channel buffer without much reduction in throughput. Customization methods have been reported recently [3], and we also proposed a port combining method optimized for the target traffic in [4]. However, these previous studies only reduce the ports of routers, and links between routers are kept even if the amount of traffic on them is quite small. For FPGAs, the wiring resource often dominates the total hardware that can be mounted on a chip, and removing such links can help reduce the total cost of the NoC. Changing the topology by removing links, however, may introduce the possibility of packet deadlocks and may require a change of routing policy, which can degrade performance.
In this paper, a novel customization method for removing NoC links with small traffic based on pre-analyzed traffic information is proposed. Our link removal strategy reduces the total hardware amount of the NoC with minimal performance loss while avoiding deadlocks. Given the traffic characteristics of the target application and the expected number of links to be removed, the algorithm automatically generates the link removal plan for the network. The proposed method is intended for applications whose traffic is spatially biased, since links with little traffic can then be removed to reduce the hardware cost. This paper is organized as follows: the link removal methodology is shown in Sect. 2. The evaluation of the required hardware amount and performance of the generated NoCs is given in Sect. 3. Section 4 discusses the power consumption of our methodology. After a survey of related work in Sect. 5, we draw the conclusion in Sect. 6.
Link Removal Methodology
In this section we propose our link removal methodology. It is composed of two main algorithms: the link removal algorithm, which determines the links to be removed, and the path seeking algorithm, which finds the new deadlock-free routing path set.
Copyright © 2009 The Institute of Electronics, Information and Communication Engineers
Motivation
Most studies of NoCs on FPGAs have chosen the mesh topology because it is reported to be efficient in terms of area and power consumption compared with other topologies, and it also matches the two-dimensional layout of the FPGA. Simple XY routing, which transfers packets in the y-direction after completing the x-direction transfer, is commonly used for mesh-based NoCs because it achieves high performance with a simple control mechanism. The area of the on-chip network should be as small as possible, since the remaining area can be used for computational modules.
In a mesh topology, each router has four (two or three on the periphery) bidirectional ports connecting to the neighboring routers (North, South, West and East) and one connecting to the processing element (Local). Figure 1(a) shows two neighboring 5 × 5 routers connected by a bidirectional link. The router has three main components: the buffers, the arbitration module, and the crossbar. We assume wormhole flow control in the router, in which a packet is divided into small flow control digits called flits. The input port is pipelined and each stage has one flit buffer. Since the router commonly provides four stages, each port has a four-flit buffer in total. The arbitration module contains routing control logic for deciding the routing direction of the incoming flit and an arbiter to configure the crossbar.
Since the router is symmetric, the transfer capacity is equal for every input/output port. However, if the traffic on some links is infrequent compared with others, those links can be removed from the network without much harm to the total performance. The traffic on a removed link can be redirected to other paths. As links are removed, the regular mesh topology is customized into an irregular topology.
Fig. 1 Hardware mechanism for router port removal.
Assume that a link can be removed as shown in Fig. 1(b). In such a small router with only a 5 × 5 crossbar, buffers account for a large part of the area. By removing a link, a buffer can be removed. The size of the arbitration module and the crossbar can also be reduced. We have implemented a tool for generating Verilog HDL descriptions of NoC architectures with several removed links. Figure 2 shows the size of the routers when 5, 10, 15 and 20 links are removed, synthesized by Xilinx ISE 8.2 with the Virtex4 xc4vlx200 as the target device. Supposing the link length is l and the data width is w, since there are 48 links in a 4 × 4 mesh, the total wire resource can be expressed as 48lw. The total wire resource for each NoC is also shown in Fig. 2.
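The link count and the wire resource expression above can be checked with a short calculation. The following is a minimal sketch, not part of the original tool flow; the function names are hypothetical, and it assumes that every removed unidirectional link saves exactly lw of wire resource.

```python
def mesh_unidirectional_links(n):
    """Number of unidirectional router-to-router links in an n x n mesh.

    There are n * (n - 1) horizontal and n * (n - 1) vertical neighbor
    pairs, and each pair is connected by one link in each direction.
    """
    return 2 * 2 * n * (n - 1)


def wire_resource(n, removed, link_length, data_width):
    """Total wire resource after removing `removed` links.

    Assumption: each removed link saves link_length * data_width.
    """
    return (mesh_unidirectional_links(n) - removed) * link_length * data_width
```

For n = 4 the first function returns 48, matching the 48lw expression for the full mesh.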
The synthesis results show that the area reduction by link removal is achieved. Compared with the port combination methodology [4] , the link removal will achieve even more hardware savings since port combination requires extra control logic.
In general, communication between computational modules does not use all of the links equally. Especially in stream processing for multimedia applications, the links along the stream may carry heavy traffic while others carry little or none. Figure 3(a) shows the task flow graph and the mapping result onto a 4 × 4 mesh of the susan image recognition application [5]. We can see that the traffic depends strongly on the link location, and that there are many links with very small traffic. By removing such links, a notable amount of hardware is expected to be saved without degrading performance.
To remove links based on traffic analysis, a systematic methodology is required. Here we propose an algorithm that determines the links to be removed, and a path searching algorithm that finds new paths to replace the removed links based on an analysis of the target trace file.
Link Removal Strategy
The link removal algorithm is based on the analysis of the link traffic quantity. A parameter n is defined to express the number of links that are going to be removed. For example, in a regular 4 × 4 mesh, there are 48 unidirectional router-to-router links in total, while the network connectivity can be provided by an embedded tree topology that consists of all routers with 15 links. We call a link (from router a to router b) a Removable Link (RL) if packets from a to b can be redirected to another path after the link is removed.
We propose the following removal strategy, whose flow is shown in Fig. 4. The application traffic pattern is pre-analyzed in order to obtain the traffic quantity of each link, and n is given as a parameter specifying the number of links to be removed from a two-dimensional mesh.
1. Apply the simple link analyzer: if there are links with no traffic during the target application execution, remove them and decrease the value of n by the number of no-traffic links.
2. Apply the link removal algorithm.
3. Apply the path seeking algorithm and re-calculate the traffic amount on each link.
4. Decrease n by one (n → n − 1), and go back to step 2 if n is larger than zero. Otherwise finish.
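The first step of this flow, the simple link analyzer, can be sketched as follows. This is an illustrative sketch rather than the authors' tool: the trace format `(src, dst, flits)`, the `(x, y)` router coordinates and all function names are assumptions, and the per-link traffic is accumulated along the XY routing path of each source-destination pair.

```python
from collections import Counter


def xy_links(src, dst):
    """Yield the directed links used from src to dst when the x-direction
    transfer is completed before the y-direction transfer (XY routing)."""
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        yield ((x, y), (nx, y))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        yield ((x, y), (x, ny))
        y = ny


def link_traffic(trace):
    """Accumulate the number of flits carried by each directed link.

    trace: iterable of (src, dst, flits) tuples (assumed format).
    """
    traffic = Counter()
    for src, dst, flits in trace:
        for link in xy_links(src, dst):
            traffic[link] += flits
    return traffic


def zero_traffic_links(n, traffic):
    """List the directed links of an n x n mesh that carry no traffic."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    return sorted(
        (a, b)
        for a in nodes
        for b in nodes
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
        and traffic.get((a, b), 0) == 0
    )
```

The links returned by `zero_traffic_links` are the candidates removed in step 1 before the iterative removal starts.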
Link Removal Algorithm
The link removal algorithm is as follows.
1. Mark all links as RL, and form the RL set.
2. If the RL set is empty, then finish. Otherwise, choose the link with the least traffic quantity from the RL set, and remove it from the topology. Also remove it from the RL set.
3. Update the RL property of all links in the RL set. Remove links that do not satisfy the RL property from the RL set.
Since the hardware reduction by link removal is almost linear in the number of removed links, we can use the Target Removal Number (TRN) of links to express the expected hardware reduction. Given the architecture parameters, such as the topology information, together with the expected number of links to be removed and the target application, the link analyzer analyzes the traffic on every RL. Based on this analysis, the link remover removes one link and lets the path seeker find a replacement route. This procedure continues until the number of removed links is equal to the TRN. Then the final link removal plan is generated.
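The greedy selection described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: a link from a to b is treated as an RL when b is still reachable from a after the link is deleted, the traffic table is kept fixed instead of being re-calculated by the path seeker after each removal, and all names (`build_mesh`, `reachable`, `remove_links`) are hypothetical.

```python
from collections import deque


def build_mesh(n):
    """Return all unidirectional router-to-router links of an n x n mesh."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    return {(a, b) for a in nodes for b in nodes
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1}


def reachable(links, src, dst):
    """Breadth-first search: can dst be reached from src over `links`?"""
    succ = {}
    for a, b in links:
        succ.setdefault(a, []).append(b)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in succ.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def remove_links(links, traffic, target):
    """Remove up to `target` links, always picking the least-loaded RL."""
    remaining = set(links)
    candidates = set(links)          # the RL set
    removed = []
    while candidates and len(removed) < target:
        # Least traffic first; ties are broken by link coordinates.
        link = min(candidates, key=lambda l: (traffic.get(l, 0), l))
        candidates.discard(link)
        a, b = link
        # RL check: packets from a to b must still have another path.
        if reachable(remaining - {link}, a, b):
            remaining.discard(link)
            removed.append(link)
    return removed, remaining
```

A link that fails the RL check is dropped from the candidate set for good, because removing further links can never restore an alternative path.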
Path Seeking Algorithm
Removing links changes the topology, so the original routing algorithm, such as XY routing, can no longer guarantee both deadlock freedom and connectivity. For each source-destination router pair (a, b), if any link in the original routing path from a to b is removed, a new path along another deadlock-free routing must be defined.
A high-performance routing model called the west-first turn model, which requires no additional virtual channels on a mesh [6], can ensure deadlock-free paths. The west-first turn model forwards packets along a path that first transfers zero or more hops in the west (x−) direction and then routes adaptively in the other directions. In the west-first model, some turns along the routing path are prohibited, as shown in Fig. 5. However, due to the link removal it is sometimes impossible to find a path that does not violate the turn model. In such cases, we adopt a buffering strategy called the In-Transit Buffer (ITB) mechanism [7] to obtain paths between all pairs of routers. A path that violates the turn model can still be used without deadlock by applying the ITB mechanism at the router where the prohibited turn occurs.
As shown in Fig. 5, router c is given the ITB mechanism, which ejects a packet from the network, stores it in the PE, and then re-injects it into the network from the PE. As shown for router c in Fig. 6, a packet turning from North to West is first stored in the in-transit buffer. Since this packet transfer violates the rule of the west-first turn model, it is temporarily stored in the in-transit buffer so as not to introduce deadlocks. That is, we first route the packet to the in-transit buffer in the PE at the intermediate router, and then route it from the PE to the destination. Here, router c is called an ITB router. Since ejected packets are buffered in the buffer space of the PE cores, no extra overhead is added to the ITB router itself. The ITB mechanism can make any network deadlock-free, because it breaks the cyclic channel dependency. However, it requires packet buffers in the PE node for storing the packets ejected from the network, which imposes a special requirement on the PEs. Although the hardware requirement for PE cores may become severe, the buffers can be shared with intra-PE communication and modules, so this approach is feasible in real systems. In recent tile processors such as the RAW processor [8], each PE core usually has a buffer space (e.g., cache or scratch pad memory) that can be used for the in-transit buffer. In order to minimize the overhead to the PE cores, we use a 2-flit buffer size in our evaluation environment. When a packet to be ejected arrives while the in-transit buffer is full, a NACK signal is automatically transmitted to the original source node, and the source node then retransmits the same packet to the destination. Whether this buffer size is sufficient is discussed in Sect. 3.4.
Up*/down* routing [9] is a commonly used routing algorithm for irregular topologies. Although up*/down* routing can be applied to this mesh-based irregular topology, its performance tends to be quite low compared with regular routing such as XY routing. We thus propose a path seeking algorithm that makes the best use of the path set of XY routing, because XY routing distributes the paths uniformly and has a strong regularity that improves throughput and latency. The path seeking algorithm is described as follows.
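The check for turns that violate the west-first model can be sketched as below. This is a hedged illustration, not the authors' code: it assumes routers are identified by `(x, y)` coordinates with west being the decreasing x direction, it treats any hop to the west that follows a non-west hop as a prohibited turn, and the function name `itb_routers` is hypothetical.

```python
def itb_routers(path):
    """Return the routers on `path` that need the ITB mechanism.

    path: list of (x, y) router coordinates, one entry per visited router.
    A router needs an in-transit buffer when the packet arrives there on a
    non-west hop and leaves on a west hop, which west-first prohibits.
    """
    routers = []
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        arrived_going_west = cur[0] < prev[0]
        leaves_going_west = nxt[0] < cur[0]
        if leaves_going_west and not arrived_going_west:
            routers.append(cur)
    return routers
```

Because an ejected packet is re-injected at the ITB router as if it started there, each prohibited turn costs exactly one ITB router in this model.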
1. For the original routing path between each source-destination pair (a, b) in the target application trace, check whether it goes through any removed link.
2. If it goes through no removed link, the procedure is complete for that pair. Otherwise, assign a new path that avoids the removed links as follows.
3. Use a recursive method to find all possible paths from a to b, and select one of them according to the chosen policy.
• In the Minimum Path policy, choose the path with the lowest hop count. If multiple paths have the same hop count, choose the one with the fewest routers equipped with the ITB mechanism.
• In the Minimum ITB policy, choose the path with the fewest routers equipped with the ITB mechanism. If the number of such routers is the same, choose the path with the lower hop count.
The recursive method adopts depth-first search, and its time complexity is 4^n for an n × n mesh network. On our Linux machine with a 2.4 GHz processor and 4 GB of memory, the method implemented in Perl takes less than 0.02 seconds to find all possible paths for our target 4 × 4 mesh network.
The details of the recursive method are shown below. After all the possible paths have been found by the recursive method, one of two path seeking policies, Minimum Path or Minimum ITB, is selected to decide the desired path.
Algorithm 1 find path(s: source node, d: destination node)
With the Minimum Path policy, the path with the lower hop count has priority, regardless of the number of ITB routers that will be introduced. However, due to the constraints of the PE cores, we sometimes need to limit the number of ITB routers. With the Minimum ITB policy, the path with the minimal number of ITB routers is selected, although it may be a non-minimal path. Although the latency may be higher due to the larger hop count, the overhead imposed on the PE cores becomes smaller.
For example, based on the west-first turn model, two turns are prohibited to break the cyclic channel dependency, as shown in Fig. 5. When the link from router b to d is removed, the original path defined by XY routing from a to d can no longer be used and we must find a new path. With the Minimum Path policy, the new path is a → c → d, which is the minimal path containing only 2 hops. However, this path includes a turn that violates the west-first model, and the ITB mechanism must be added to node c to avoid deadlock. Alternatively, the non-minimal path a → b → e → f → d, which does not need the ITB mechanism, can be adopted.
There is a trade-off between the number of ITB routers and the throughput. When the buffers of most PE cores are fully used, which means the budget for in-transit buffers is tight, the Minimum ITB policy is selected to reduce the number of ITB routers. On the other hand, from the viewpoint of throughput, the Minimum Path policy is better because of its lower hop count. We therefore choose one of them according to the buffer budget in the PE cores or the performance requirement.
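The exhaustive search and the two selection policies can be sketched as follows. This is an illustrative sketch and not a reproduction of Algorithm 1: it assumes `(x, y)` router coordinates with west as decreasing x, enumerates only simple paths (no router visited twice), counts one ITB router for every hop to the west that follows a non-west hop, and uses hypothetical names (`all_paths`, `itb_count`, `choose_path`) and policy labels.

```python
def all_paths(links, src, dst):
    """Enumerate every simple path from src to dst by depth-first search.

    links: set of directed links ((x1, y1), (x2, y2)) still in the topology.
    """
    succ = {}
    for a, b in links:
        succ.setdefault(a, []).append(b)
    found = []

    def dfs(node, path):
        if node == dst:
            found.append(list(path))
            return
        for nxt in sorted(succ.get(node, [])):
            if nxt not in path:          # never revisit a router
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(src, [src])
    return found


def itb_count(path):
    """Number of west-first violations, i.e. ITB routers needed on path."""
    return sum(
        1
        for prev, cur, nxt in zip(path, path[1:], path[2:])
        if nxt[0] < cur[0] and not (cur[0] < prev[0])
    )


def choose_path(paths, policy):
    """Pick one path according to the Minimum Path or Minimum ITB policy."""
    def hops(path):
        return len(path) - 1

    if policy == "min_path":
        return min(paths, key=lambda p: (hops(p), itb_count(p)))
    if policy == "min_itb":
        return min(paths, key=lambda p: (itb_count(p), hops(p)))
    raise ValueError("unknown policy: " + policy)
```

As a usage example, on a 3 × 3 mesh with the link from (1, 1) to (1, 2) removed, routing from (2, 1) to (1, 2) yields a 2-hop path with one ITB router under `"min_path"` and a 4-hop path without any ITB router under `"min_itb"`, which mirrors the trade-off between the two policies.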
Experimental Results
Here, we apply our algorithm to several application models. A 4 × 4 mesh with XY routing is assumed as the target network. A flit-level simulator written in C++ [10] was developed for evaluating the usage of each port and the throughput of the network. The simulation parameters used in this evaluation are shown in Table 1, and the input-buffered router architecture shown in Fig. 1 is used. The evaluation is made under the two policies of the path seeking algorithm.
Uniform Traffic
Uniform traffic is distributed equally over the network, so it is not suitable for our proposed methodology. Here the uniform traffic is used only to show the different effects of the two policies in the path seeking algorithm. Figure 7 shows the regular mesh network when the link from router 6 to router 10 is removed. Figure 8 shows the performance comparison between the original network and the network with the removed link. A 30% performance degradation occurs with just one link removed under such equally distributed traffic. When we adopt the Minimum Path policy for the path seeking algorithm, the performance is a little better than that of the Minimum ITB policy because of the lower hop count. However, it requires router 11 to be an ITB router, while no ITB router is needed with the Minimum ITB policy.
Hot Spot Traffic
Under hot spot traffic, one or more nodes are chosen as hot spots, which receive an extra proportion of traffic in addition to the regular uniform traffic. Since traffic with hot spot nodes is unbalanced, the proposed link removal method is expected to be advantageous. Here, a traffic pattern including 4 hot spot nodes is used. Figure 10 shows the performance of the network with 11, 12, 13 and 14 links removed. In this case, the two policies of the path seeking algorithm generate the same link removal plan. Although nearly one quarter of the total links are removed, the performance is almost maintained, while the hardware amount of the network infrastructure is reduced by more than 30%, as shown in Fig. 9. However, 4 ITB routers are required. Notice that this hardware amount covers only the network infrastructure and does not include the in-transit buffers in the PE cores for the ITB mechanism. In this paper we focus on the hardware amount of the NoC infrastructure, assuming that the PE cores have enough local memory for the in-transit buffers.
We also evaluated the performance of up*/down* routing. Figure 11 shows the evaluation results for the original mesh topology and for the irregular mesh with only 2 links removed. The comparison shows that even after 11 or more links are removed, the routing generated by our link removal methodology outperforms up*/down* routing on the full mesh. The performance of up*/down* routing also drops noticeably with only 2 removed links. In this case, our methodology is much more efficient than adopting up*/down* routing for the irregular topology formed by removing links.
Susan Traffic
Finally, we applied the proposed methodology to real-world traffic from the susan image recognition application, as shown in Fig. 3.
The case shown in Fig. 13 removes 12 links, and its performance is shown in Fig. 12.
3 ITB routers are needed when the Minimum ITB policy is used, while 4 are needed with the Minimum Path policy. Both policies keep almost the full performance while the hardware amount of the network infrastructure is reduced by 30%. The performance of up*/down* routing on the full mesh is also evaluated here. Even compared with up*/down* routing on the full mesh, our proposed method achieves better performance. Thus, using up*/down* routing as the routing strategy in the irregular topology formed by link removal will unavoidably cause considerable performance degradation.
In-Transit Buffer Size
In the simulation environment, the size of the in-transit buffer has been set to 2 flits in order not to add too much overhead to the PE cores. Table 2 shows the ratio of the number of NACK signals to the total number of simulated packets for all the previous traffic patterns: uniform traffic (1 link removed), hot spot traffic (11, 12, 13 and 14 links removed), and susan traffic (12 links removed).
For each traffic pattern, we set the packet injection rate close to its critical point (that is, close to the bending point in the performance simulation figure).
Since the probability of a NACK signal is low (or exactly zero), the in-transit buffer rarely becomes full in our simulation environment, which indicates that the buffer size we set is sufficient. The performance evaluation results also show the effectiveness of link removal with the help of the ITB mechanism.
Power Consumption Discussion
Since power consumption is one of the most crucial factors in recent embedded devices, here we discuss the energy efficiency of the proposed link removal method. The average energy consumption needed to transmit a single flit from source to destination can be estimated as [11]
$$E_{flit} = w H_{ave} (E_{sw} + E_{link})$$
where $w$ is the flit width, $H_{ave}$ is the average hop count, $E_{sw}$ is the average energy to switch 1-bit data inside a router, and $E_{link}$ is the energy of a 1-bit transfer consumed on a link. Since our link removal methodology introduces non-minimal paths as well as the in-transit buffer mechanism during packet transfer, the average hop count increases, which increases the transmission energy. Table 3 shows the average hop counts of the uniform and susan traffic under the same evaluation conditions as in the previous section. The original hop counts come from the regular mesh. After removing links, the hop counts are slightly increased. The Minimum Path policy yields smaller hop counts than the Minimum ITB policy.
However, the hop count increase is quite small. For example, for the susan application, after 12 links are removed (an almost 30% decrease in hardware amount), the performance is maintained and the average hop count increases by less than 1%. This is because the removed links carry only small traffic, so the quantity of packets that need to be redirected to non-minimal paths is also small. In addition, $E_{sw}$ can also be decreased by the router port removal, so the total energy increase is negligible, or there may even be no energy increase.
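The effect of a small hop count increase on the flit energy can be illustrated numerically. The sketch below assumes the estimate takes the form w · H_ave · (E_sw + E_link); the function names are hypothetical, and any numbers passed in are placeholders chosen by the caller, not measured values from the evaluation.

```python
def flit_energy(w, h_ave, e_sw, e_link):
    """Average energy to transfer one flit: w * H_ave * (E_sw + E_link)."""
    return w * h_ave * (e_sw + e_link)


def energy_ratio(h_before, h_after, e_sw_before, e_sw_after, e_link):
    """Flit energy after link removal divided by the energy before.

    The flit width is the same in both cases, so it cancels out.
    """
    after = h_after * (e_sw_after + e_link)
    before = h_before * (e_sw_before + e_link)
    return after / before
```

With an unchanged switching energy, the ratio equals the hop count ratio, so a hop count increase below 1% raises the flit energy by less than 1%. Any reduction of the switching energy from port removal lowers the ratio further.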
Related Work
Customization techniques for NoCs have been studied both for ASICs and for reconfigurable systems. In [12], a buffer space allocation method that fits the traffic was proposed. In this approach, the router architecture is customized to fit the application, and the input buffers in a router do not all have the same depth. An algorithm allocates the buffer space to each input channel of each router according to the target application, such that the communication performance is maximized or the same performance is achieved with less area. In [13], a generic NoC architecture for reconfigurable systems was proposed. In this study, multiple ports with small traffic are combined in order to reduce the hardware cost. This port combining technique only reduces the multiplexers comprising the crossbar in the router and keeps the same number of packet buffers. Thus, the total hardware in a router is not much reduced, especially when the router is small. Moreover, this customization technique simply used the total port bandwidth utilization as the metric for the combination, without considering the temporal change of the traffic. In [4], we also proposed a temporal correlation based port combination methodology for reconfigurable systems. The ports of the router can be combined so as to achieve the smallest hardware requirement. An efficient temporal correlation algorithm was proposed, which automatically makes the combination decision to reduce the area while keeping the performance. The temporal correlation based methodology is effective for some applications, including multimedia stream processing, which involves intermittent data transfer. The port combination method benefits from the unchanged topology, so the routing algorithm for the regular topology can still be used. Although these methods reduce the hardware in each router, links are not actually removed. Thus, unlike the link removal method proposed here, they do not save wiring resource.
Some methodologies for developing application-specific irregular topologies for NoCs have also been proposed. A methodology to explore a network topology customized to the application traffic has been developed [14]. In this methodology, the initial network is constructed from a single switch connecting to all processors, called a "megaswitch", and it is partitioned recursively until the specified design constraints are met by all switches. We also tried to find the optimal router structure for this kind of method under given traffic for fitting on FPGAs [15]. In [16], a method that reduces the network area overhead by making each router handle multiple logic cores was proposed, together with an algorithm that optimally maps the cores onto the routers. All of them can reduce the wiring resource, but the generated topology becomes completely irregular, and deadlock-free topology-agnostic routing is needed, such as up*/down* routing, whose performance is usually inferior to dimension-order routing on k-ary n-cube topologies [17]. Since our method basically keeps the mesh topology and ITB routers provide well-distributed minimal paths, the overhead in performance or hardware amount is expected to be small.
In order to reduce the power consumption of NoCs, network links and channels can dynamically be powered on and off in response to network load; thus the network topology sometimes becomes irregular due to the link shutdown. In [18] , a thorough discussion about on/off link networks in terms of connectivity and deadlock-free routing is presented. Although this method is not for generating customized irregular topology for target application, it shows us another train of thought for topology customization methodology.
Conclusions
A link removal method that automatically selects the links to be removed while maintaining adequate performance is proposed for reconfigurable systems on FPGAs. Two policies, Minimum Path and Minimum ITB, are proposed for avoiding deadlocks, and they achieved better performance than up*/down* routing on the mesh with some links removed. For the image recognition application susan, the proposed method saves 30% of the hardware amount of the network infrastructure while keeping the performance, with a negligible energy increase. In addition to FPGAs, this methodology can be extended to ASICs that use a regular interconnection as the baseline topology.
The proposed method can be easily extended to other topologies, including tree structures. The generalization of the method is our future work. We would also like to study the possibility of combining the previously proposed port combination methodology with the link removal methodology to obtain a further reduction in hardware amount. Future work also includes improving the current path seeking method, which has exponential complexity, in order to deal with future large-scale networks.
