In this paper, we consider the problem of synthesizing custom Networks-on-Chip (NoC) architectures that are optimized for a given application. Both unicast and multicast traffic flows are considered in the input specification. We formulate the joint multicast routing and network design problem using a rip-up and reroute procedure, where each multicast routing step is formulated as a minimum directed spanning tree problem, and we propose an efficient algorithm called Ripup-Reroute-and-Router-Merging (RRRM). The rip-up and reroute concept provides a heuristic iterative mechanism for finding progressively better solutions, and the minimum directed spanning tree formulation efficiently captures the best routing solutions for multicast flows during topology synthesis. Our design flow integrates floorplanning, and our solutions consider deadlock-free routing. Experimental results on a variety of NoC benchmarks show that RRRM substantially improves on our previously proposed algorithms CLUSTER and DECOMPOSE. On average, RRRM achieves a 9% reduction in power consumption over CLUSTER and a 17% reduction over DECOMPOSE, with execution times that are 1786× and 57× faster, respectively. Performance also improves, with average hop-count reductions of 3% over CLUSTER and 7% over DECOMPOSE across all benchmarks.
Introduction
Network-on-Chip (NoC) architectures have been proposed as a scalable solution to the global communication challenges in nanoscale Systems-on-Chip (SoC) designs [1], [2]. The use of NoCs with standardized interfaces facilitates the reuse of previously designed and third-party modules (e.g., processor cores) in new designs. Besides design and verification benefits, NoCs have also been advocated to address increasingly daunting clocking, signal integrity, and wire delay challenges.
NoC architectures can be designed as regular or custom network topologies. Regular topologies, such as mesh or folded-torus networks, have been successfully employed in a number of tile-based chip-multiprocessor projects, e.g. [3] , [4] , which are appropriate because of processor homogeneity and application traffic variability. On the other hand, for custom SoC applications, the design challenges are different in terms of varied module sizes, irregularly spread module locations, and different communication data rate requirements. Therefore, a custom network architecture optimized to the needs of the application is more appropriate. This synthesis problem is the focus of this paper.
The NoC synthesis problem is challenging for a number of reasons. First, for a large complex SoC design, an optimal solution will likely involve multiple networks, since each module will likely communicate with only a small subset of modules. Therefore, a single network that spans all nodes is often unnecessary. Part of the synthesis problem is to partition cores into groups and connect each group to the same router so that they can share network resources. It is hard to decide which cores should be placed in the same group. In general, cores may be grouped together and connected to the same routers even though they are not common sources or destinations of the same group of flows, because they may be able to beneficially share common intermediate network resources. It is also hard to decide the sizes of the partitions beforehand, namely whether a design with a few larger routers would be more cost-efficient than a design with more, smaller routers. Second, besides deciding on the connectivity of cores to routers, the synthesis problem must also decide on the connectivity and the physical network topology between routers. The goal is a network topology that is tailored to the specific application and optimized for the specific design goals. Finally, depending on the optimization goals and the implementation backend, the appropriate cost function may be quite complex. In particular, in this paper, we consider a power minimization problem that accounts for both leakage power and dynamic switching power. It is well known that leakage power is becoming increasingly dominant [26], [28]. Therefore, it is important to properly account for leakage power when adding routers and network links to the synthesized architecture. Other optimization goals may include minimizing hop counts along with power minimization.
In this paper, we consider the problem of synthesizing custom Networks-on-Chip (NoC) architectures that support both unicast and multicast traffic flows. SoC applications vary widely, and for many of them support for multicast flows is necessary. Examples include the passing of global states, the management and configuration of the network, and the implementation of cache coherency protocols. The work presented in this paper improves on our previous work presented in [19], [20]. Our previous NoC synthesis algorithms were based on formulating the problem as a set partitioning of traffic flows, finding a good network topology for each flow set using a Steiner tree formulation, and providing an optimized network implementation for the derived topologies. All possible set partitions of flows are investigated in an intelligent way, and a rectilinear Steiner tree problem is solved for each intermediate set partition, which makes those algorithms less efficient for the large future applications envisioned, with hundreds of cores.
In this paper, we formulate the joint multicast routing and network design problem using a rip-up and reroute procedure, where each multicast routing step is formulated as a minimum directed spanning tree problem. A key idea in our new formulation is a rip-up and reroute concept that has been successfully used in the VLSI routing problem [5] - [7] . The rip-up and reroute concept provides us with a heuristic iterative mechanism to identify increasingly improving solutions. There are two central differences between our on-chip network routing and design problem and the VLSI routing problem. The first is the ability to share network resources in our problem, and the second is the difference in cost models. In the latter case, the costs of routers and links are not simple linear costs, and the sharing of network resources further complicates the optimization process.
In particular, we propose an efficient algorithm called Ripup-Reroute-and-Router-Merging (RRRM) that synthesizes custom NoC architectures supporting both unicast and multicast traffic flows. The algorithm is based on a rip-up and reroute formulation for routing flows to find a suitable network topology, followed by a router merging procedure to optimize the network topology. The key part of the algorithm is a rip-up and reroute procedure that routes multicast flows by finding the optimum multicast tree on a condensed multicast routing graph, using the directed minimum spanning tree formulation and the efficient algorithms of [8], [9]. A router merging procedure then further optimizes the network topology. In order to obtain the best topology solutions with minimum power consumption, accurate power models for interconnects and routers are derived. The RIPUP-REROUTE algorithm for routing flows and the ROUTER-MERGING algorithm for optimizing topologies both use these power costs of network links and router ports as evaluation criteria. Our design flow integrates floorplanning, and our synthesis process is aware of both performance and power consumption. Our solutions also consider several ways to ensure deadlock-free routing.
As shown in our earlier work, our previous Steiner-tree-based formulation already significantly outperformed regular mesh and optimized mesh topologies. In comparison to our previous work, our new algorithm achieves a relative reduction of up to 45% in power consumption, up to 21% in hop counts, and up to 39% in router area. More importantly, the execution times of our new algorithm are 2 to 3 orders of magnitude faster than those of the previous algorithms, even for very large benchmarks.
The rest of the paper is organized as follows. Section 2 outlines related work. Section 3 presents our design flow, which incorporates floorplanning. Section 4 presents the problem description and our formulation. Section 5 presents the power models. Section 6 describes the details of the RRRM algorithm. Section 7 addresses deadlock considerations. Finally, experimental results and the conclusion are presented in Sections 8 and 9, respectively.
Related Work
The NoC design problem has received considerable attention in the literature. Towles and Dally [1] and Benini and De Micheli motivated the NoC paradigm. Several existing NoC solutions have addressed the problem of mapping to a regular mesh-based NoC architecture [10], [11]. Hu and Marculescu [10] proposed a branch-and-bound algorithm for mapping computation cores onto mesh-based NoC architectures. Murali et al. [11] described a fast algorithm for mesh-based NoC architectures that considers different routing functions, delay constraints, and bandwidth requirements.
On the problem of designing custom NoC architectures without assuming an existing network architecture, a number of techniques have been proposed [12] - [17] . Pinto et al. [14] presented techniques for the constraint-driven communication architecture synthesis of point-to-point links by using heuristic-based k-way merging. Their technique is limited to topologies with specific structures that have only two routers between each source and sink pair. Ogras et al. [12] , [13] proposed graph decomposition and long link insertion techniques for application-specific NoC architectures. Srinivasan et al. [15] , [16] presented NoC synthesis algorithms that consider system-level floorplanning, but their solutions only considered solutions based on a slicing floorplan where router locations are restricted to corners of cores and links run around cores. Murali et al. [17] presented an innovative deadlock-free NoC synthesis flow with detailed backend integration that also considers the floorplanning process. The proposed approach is based on the min-cut partitioning of cores to routers. In [19] , [20] , Yan et al. formulated the custom NoC synthesis problem based on set partitioning of traffic flows and finding good network topologies using a Steiner tree formulation.
Multicasting in wormhole-switched networks has been explored in the context of chip multiprocessors, based on methods from parallel machines, for supporting cache coherency, acknowledgement collection, synchronization, and similar functions [41], [42]. The NoC works of [21], [22] report that multicast service can be implemented in their NoC architectures. However, the methods for providing multicast routing and services were not presented in detail. In [23], a novel multicast scheme for wormhole-switched NoCs was proposed to support SoC applications; it uses a connection-oriented technique to realize QoS-aware multicasting in a best-effort network. In [24], a router architecture supporting unicast and multicast services was proposed, using a mechanism for managing broadcast flows so that the communication links in an on-chip network can be shared. In [25], the dual-path multicast algorithm, used in multicomputers, was adapted to wormhole-switched NoCs to support deadlock-free multicast routing.
This paper presents an improved synthesis algorithm over our previous work. The approach is based on a flow rip-up and rerouting formulation and a router merging scheme that considers both unicast and multicast traffic, which to the best of our knowledge has not been considered in custom NoC synthesis formulations other than our own. Our approach considers deadlock-free routing with multicast traffic and also considers floorplanning in the design flow. It represents a different way of formulating the custom NoC synthesis problem. Given that custom NoC synthesis is still a relatively new problem, we believe our work provides an interesting direction in this research area.
Design Flow
Our NoC synthesis design flow is depicted in Figure 1 and the details were discussed in [20] . The major elements in the design flow are briefly summarized below.
1) Input Specification:
The input specification to our design flow consists of a list of modules and their communications. Modules can correspond to a variety of different types of intellectual property (IP) cores in a variety of sizes and can be either hard or soft macros. Packet-based communication with standard network interfaces is considered, and custom NoC architectures are addressed in this paper as a scalable solution. Traffic flows with required data rates between modules are specified as part of the input specification. For our synthesis problem, we consider both unicast and multicast traffic flows. In general, a mixture of network-based communications and conventional wiring may be utilized as appropriate, and not all inter-module communications are necessarily over the on-chip network. Our design flow and input specification allow for both interconnection models.
2) Floorplanning: The floorplanning problem has been extensively studied, with many mature solutions (e.g., [35]-[37]). In our design flow, we have adopted the open-source floorplanner Parquet [37]. An initial floorplanning step is performed before NoC synthesis to obtain a placement of modules. This is important because the floorplanning of modules is often influenced by non-network-based interconnections, and the floorplan locations of modules can have a significant influence on the NoC architecture. With the module locations available from the initial floorplanning step, NoC synthesis can better account for wiring delays and power consumption during the exploration of NoC architectures. During NoC architecture synthesis, routers are positioned close to the network interfaces of the IP cores. After NoC synthesis, the actual routers and links in the synthesized NoC architecture can be fed back to the floorplanner to update the floorplan. The refined floorplan information can be used to obtain more accurate power and area estimates. After the floorplan has been updated, NoC synthesis can be re-invoked to consider more accurate placement information. As shown experimentally in Section 8, our NoC synthesis algorithms are fast, making it feasible to iterate NoC synthesis with floorplanning.
3) Networks-on-Chip Synthesis: Given floorplanning information, the NoC synthesis step then proceeds to synthesize an NoC architecture that is optimized for the given specification and floorplan. Consider Figure 2. An example floorplan is shown in Figure 2(b). As noted earlier, modules in a design do not necessarily have to be attached to the on-chip network. Modules can also be connected by conventional wiring, as shown by the unlabeled rectangles in Figure 2.
4) NoC Objective and Constraints: Our NoC synthesis design flow allows different user-defined objectives and constraints. As power dissipation becomes a critical issue in future IC designs due to increased design complexity, we focus in this paper on the problem of minimizing network power consumption under performance constraints. Another possible design objective is the minimization of hop counts for data routing under power consumption constraints. Other possible constraints include design area, total wire length, or some combination of them.
5) NoC Design Parameters:
In addition to user-defined objectives and constraints, NoC design parameters such as the operating voltage, target clock frequency, and link widths are provided to the NoC synthesis step as well. If the design allows for different voltages or clock frequencies, or if the IP modules allow for different link widths, then NoC synthesis can be invoked to synthesize solutions for a range of design parameters specified by the user.
6) Detailed Design: Finally, the synthesized NoC architecture with the rest of the design specification can be fed to a detailed RTL design flow where design tools like RTL optimization and detailed place and route are well established.
Problem Description and Formulation

A. Problem Description
The input to our NoC synthesis problem is a communication demand graph (CDG), defined as follows:
Definition 1: The CDG is a directed hypergraph $H(V, E, \pi, \lambda)$, where each node $v_i \in V$ corresponds to a module, and each directed hyperedge $e_k = s \to D \in E$ represents a traffic flow from source $s \in V$ to one or more destinations $D \subseteq V$.
The data rate requirement for each communication flow $e_k$ is given by $\lambda(e_k)$.
In general, traffic flows can be either unicast or multicast flows. Multicast flows are flows with $|D| > 1$. For example, in Figure 2(c), $e_7$ corresponds to a multicast flow from source $v_4$ to destinations $v_2$, $v_5$, and $v_6$.
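To make the CDG definition concrete, the following is a minimal Python sketch of one possible in-memory representation. The class name `Flow`, its field names, and all data rates are assumptions made for illustration; only the shape of the multicast flow e7 (from v4 to v2, v5, and v6) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One directed hyperedge e_k = s -> D with data rate lambda(e_k)."""
    name: str
    source: str
    dests: frozenset
    rate: float  # data rate requirement; the unit is left unspecified here

    @property
    def is_multicast(self) -> bool:
        # A flow is multicast when it has more than one destination (|D| > 1).
        return len(self.dests) > 1


# e7 mirrors the multicast flow described in the text; the unicast flow e1
# and both data rates are made-up values used only for illustration.
cdg = [
    Flow("e7", "v4", frozenset({"v2", "v5", "v6"}), rate=100.0),
    Flow("e1", "v1", frozenset({"v2"}), rate=50.0),
]
multicast = [f.name for f in cdg if f.is_multicast]
```

Representing each flow as a single hyperedge, rather than as one edge per destination, keeps the multicast structure visible to the routing step.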
Based on the optimization goals and cost functions specified by the user, the output of our NoC architecture synthesis problem is an optimized custom network topology with pre-determined routes for the specified traffic flows on the network such that the data rate requirements are satisfied. For example, Figures 2(d) and 2(e) show two different topologies for the CDG shown in Figure 2 (c).
Figure 2(d) shows a network topology where all flows share a common network. In this topology, the pre-determined route for the multicast flow $e_7$ first travels from $v_4$ to $v_2$, and then bifurcates at $v_2$ to reach $v_5$ and $v_6$. Figure 2(e) shows an alternative topology comprising two separate networks. In this topology, the multicast flow $e_7$ bifurcates at the source node to reach $v_6$, is then transferred over the network link between $v_4$ and $v_2$ to reach $v_2$, and then bifurcates again to reach $v_5$. Observe that in both cases, the amount of network resources consumed by routing the multicast traffic is less than what would be required if the traffic were sent to each destination as a separate unicast flow.
B. Problem Formulation
In general, the solution space of possible application-specific network architectures is quite large. Depending on the communication demand requirements of the specific application under consideration, the best network architecture may indeed comprise multiple networks, each of which is shared by many flows that use the same network resources.
The goal of the proposed work in this paper is to find an optimized network topology such that the communication bandwidth requirements are satisfied and the power consumption of the network is minimized. In order to obtain the best topology solutions with minimum power consumption, accurate power models for interconnects and routers are derived. They are provided to the synthesis design flow as a library and utilized by the synthesis algorithm as evaluation criteria.
The application-specific NoC synthesis problem can be formulated as follows:
Input:
• The communication demand graph H(V, E, π, λ) of the application.
• The NoC network component library Φ(I, J), where I provides the power and area models of routers with different sizes, and J provides power models of physical links with different lengths.
• The target clock frequency, which determines the delay constraint for links between routers.
• The floorplanning of the cores.
Output:
• An NoC architecture with components (R, L, C), where R denotes the set of routers in the synthesized architecture, L represents the set of links between routers, and C : V → R is a function that represents the connectivity of a core to a router.
• A set of ordered paths P, where each $p_{ij} = (r_i, r_j, \ldots, r_k) \in P$, with $r_i, \ldots, r_k \in R$, represents a route for a traffic flow.
Objective:
• The minimization of power consumption for the synthesized NoC architecture.
Power Models
In nanoscale technologies, minimizing power consumption is a very important design goal along with performance maximization. In this paper, the design goal of NoC synthesis problem is to construct an optimized interconnection architecture such that the communication requirements are satisfied and the power consumption is minimized.
The total power consumption of the communication architecture includes both leakage power and dynamic switching power of the routers and links. The dynamic switching power is a function of data rate passing through each component and the leakage power is related to the type and the characteristics of the components in the NoC architecture.
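As a rough illustration of this decomposition, the sketch below models each router or link as a fixed leakage term plus a dynamic term that scales with the data rate. The linear form (bit energy times bit rate), the function names, and all numeric values are assumptions for illustration and are not taken from the paper's tables.

```python
def component_power(leakage_w, bit_energy_j, rate_bps):
    """Leakage plus dynamic power; dynamic = energy per bit x bits per second."""
    return leakage_w + bit_energy_j * rate_bps


def network_power(components):
    """Sum over (leakage_w, bit_energy_j, rate_bps) tuples for routers and links."""
    return sum(component_power(*c) for c in components)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
router = (0.010, 1.0e-12, 2.0e9)   # 10 mW leakage, 1 pJ/bit, 2 Gb/s
link = (0.002, 0.5e-12, 2.0e9)     # 2 mW leakage, 0.5 pJ/bit, 2 Gb/s
idle_link = (0.002, 0.5e-12, 0.0)  # carries no traffic, still leaks

total = network_power([router, link])
idle = component_power(*idle_link)
```

The idle link shows why leakage matters for topology decisions: a component that is installed but carries no traffic still contributes its full leakage power.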
We will discuss the details of modelling these components in the following sections.
A. Modelling Routers
It is well known that leakage power is becoming increasingly dominant [28]. In the on-chip network studied in [28], leakage power represented only about 0.6% and 1.8% of the total power consumption at 180nm and 100nm, respectively, but it increased to 25% at 70nm. Studies of high-performance microprocessors show an even larger leakage power component [26]. Therefore, it is important to properly account for leakage power when adding routers and channels to the synthesized architecture. However, when considering leakage power, the cost function may need to account for discrete cost increments of links and routers, whereas dynamic switching power may be best modelled as a function of cumulative data rates. This non-linear characteristic of NoC power consumption makes it hard to model accurately using MILP or LP formulations.
To evaluate the power of the routers in the synthesized NoC architecture, we use a state-of-the-art NoC power-performance simulator called Orion [27], [28], which provides detailed power characteristics for the different power components of a router under different input/output port configurations. It accurately considers leakage power as well as dynamic switching power. The power-per-bit values are used as the basis for estimating the dynamic power of the entire router under different configurations. The leakage power and switching bit energy of some example router configurations with different numbers of ports in 70nm technology are shown in Table I.
B. Modelling Interconnects
In the NoC architecture, interconnects can be modelled as distributed RC wires. As discussed in Section 3, the target clock frequency is provided to our NoC synthesis design flow as a design parameter. Depending on the network topology, long interconnects may be required to implement network links between routers, and their wire delays may exceed the target clock period. To achieve the target frequency, repeaters may need to be inserted. Thus we use the state-of-the-art repeated on-chip interconnect model [29], [30], where the interconnect is evenly divided into $k$ segments with repeaters inserted between them that are $s$ times as large as a minimum-sized repeater. The delay and power consumption per bit of this interconnect can be modelled using the Elmore model, as in [29], [30]. When minimizing power consumption is the objective, the optimum size $s_{opt}$ and number $k_{opt}$ of repeaters that minimize power consumption while satisfying the delay constraint can be determined for the interconnect using the method proposed in [29].
In our experiments, the physical and electrical parameters in 70nm technology are used and are listed in Table II . The wires are implemented on the global metal layers and their parameters are extracted from ITRS [31] .
In our NoC synthesis design flow, we use the above interconnect model to evaluate optimum power consumption of interconnects with different wire lengths under the given design frequency and delay constraints. These results are provided to the design flow in the form of a library. Since the floorplanning is performed in advance of NoC synthesis, wirelength is known for each on-chip interconnect when evaluating the power consumption. Table III lists the static power and switching bit energy parameters of some example interconnects with different wirelengths in 70nm technology under 1GHz frequency constraints.
Design Algorithms
In this section, we present algorithms for the NoC topology synthesis process. The entire process is a joint multicast routing and network design procedure that consists of the inter-related steps of constructing an initial network topology, rip-up and rerouting multicast flows to design the network topology, inserting the corresponding network links and router ports to implement the routing, and merging routers to optimize network topology based on design objectives. In particular, we propose an algorithm called Ripup-Reroute-and-Router-Merging (RRRM). The details of the algorithm are discussed in this section. 
A. Initial network construction
The details of RRRM are described in Algorithm 1. RRRM takes a communication demand graph (CDG) and an evaluation function as inputs and generates an optimized network architecture as output. It starts with initializing a network topology by a simple router allocation and flow routing scheme. Then it uses a procedure of rip-up and rerouting flows to refine and optimize the network topology. After that, a router merging step is done to further optimize the topology to obtain the best result.
In the initialization, every flow is routed using its own network. To construct an initial network topology, a router is allocated at each core and placed close to the location of network interface. These routers are not actual routers that will be included in the network topology. Only those that have traffic either multiplexed from more than two ports to the same port or de-multiplexed from one port to more than two ports at the end of the RRRM procedure will be included. After router allocation, a Routing Cost Graph (RCG) is generated (Algorithm 1 line 2) . RCG is a very important graph used in the whole rip-up and reroute procedure of the RRRM algorithm.
Definition 2: The RCG(R, E) is a weighted directed complete graph in which each vertex $r_i \in R$ represents a router, and each directed edge $e_{ij} = (r_i, r_j) \in E$ corresponds to a connection from $r_i$ to $r_j$. A weight $w(e_{ij})$ is attached to each edge and represents the incremental cost of routing a flow $f$ through $e_{ij}$.
Please note that RCG does not represent the actual physical connectivity between different routers and its edge weights change during the whole RIPUP-REROUTE procedure for different flows. Also, the actual physical connectivity between the routers is established during RIPUP-REROUTE procedure, which is explained in the following sections.
Before RIPUP-REROUTE, the initial network topology is constructed using the InitialNetworkConstruction() procedure. Each flow $e_k = (s_k, d_k)$ in the CDG is routed using a direct connection from router $r_{s_k}$ to router $r_{d_k}$, where $r_i$ is the router that core $i$ connects to, and the path is saved in path($e_k$). Multicast flows are routed as a sequence of unicast flows from the source to each of their destinations. The links and router ports are configured and saved. If a connection between routers cannot meet the delay constraints, its corresponding edge weight in the RCG is set to infinity. This guides the RIPUP-REROUTE procedure to reroute the flows over other, valid links instead.
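The initial construction step can be sketched as follows. This is a simplified illustration, not the InitialNetworkConstruction() procedure itself: the function name `initial_network`, the `link_ok` predicate standing in for the delay check, and the choice of which link violates the delay constraint are all assumptions.

```python
import math


def initial_network(flows, router_of, link_ok):
    """Route every flow over direct router-to-router connections.

    flows:     {flow name: (source core, [destination cores])}
    router_of: {core: router allocated at that core}
    link_ok:   callable (r_a, r_b) -> bool, True if a direct link between
               the two routers meets the delay constraint (hypothetical).
    """
    paths, rcg_weight = {}, {}
    for name, (src, dsts) in flows.items():
        # A multicast flow is initially treated as a sequence of unicast
        # flows, one direct connection per destination.
        paths[name] = [(router_of[src], router_of[d]) for d in dsts]
        for edge in paths[name]:
            if not link_ok(*edge):
                # Mark the connection as unusable for later rerouting.
                rcg_weight[edge] = math.inf
    return paths, rcg_weight


flows = {"e7": ("v4", ["v2", "v5", "v6"])}
router_of = {f"v{i}": f"R{i}" for i in range(1, 7)}
# Assume, purely for illustration, that only the R4 -> R5 link is too long.
paths, rcg_weight = initial_network(
    flows, router_of, lambda a, b: (a, b) != ("R4", "R5"))
```

Edges that fail the delay check keep an infinite weight, so a shortest-path search in the later rip-up and reroute step never selects them.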
As an example, after initial network construction, the connectivity of routers for the example shown in Figure 2 (a) is shown in Figure 3 (a). 
Algorithm 1 RRRM(G(V, E, π, λ), C, L)
Input: G(V, E, π, λ): CDG,
B. Flow Ripup and Rerouting
Once the initial network is constructed and the initial flow routing is done, the key procedure of the algorithm, RIPUP-REROUTE, is invoked to route flows and find an optimized network topology.
The details of RIPUP-REROUTE are described in Algorithm 2. In the RIPUP-REROUTE procedure, each multicast routing step is formulated as a minimum directed spanning tree problem. Two important graphs, Multicast Routing Graph (MRG) and Multicast Routing Tree (MRTree), are used to help facilitate the rip-up and rerouting procedure. They are defined as follows.
Definition 3: Let $f$ be a multicast flow with source $s \in V$ and one or more destinations $D \subseteq V$.
A Multicast Routing Graph (MRG) is a complete graph $\Gamma(N, A)$ defined for $f$ as follows:
• There is a directed arc between every pair of nodes $(i, j)$ in $N$. Each arc $a_{i,j} \in A$ corresponds to a shortest path $p_{i,j}$ between the same nodes in the corresponding RCG, $p_{i,j} = e_1 \to e_2 \to \cdots \to e_k$.
• The weight $w(a_{i,j})$ of arc $a_{i,j}$ corresponds to the path weight of the corresponding shortest path $p_{i,j}$ in the RCG, i.e., $w(a_{i,j}) = \sum_{l=1}^{k} w(e_l)$.
When a flow is ripped up and rerouted, its current path is deleted and the link and router port resources it occupies are released (line 3). Then, based on the current network connectivity and resource occupation, the RCG related to this flow is built and the weights of all edges in the RCG are updated (line 4). In particular, for every pair of routers in the RCG, the cost of using those routers and the link connecting them is evaluated. This cost depends on the sizes of the routers, the traffic already routed on the routers, and the connectivity of the routers to other routers. It also depends on whether an existing physical link will be used or a new physical link needs to be installed. If there are already router ports and links that can support the traffic, the marginal cost of reusing those resources is calculated. Otherwise, the cost of opening new router ports and installing a new physical link to support the traffic is calculated. The cost is assigned as the edge weight of the edge connecting the pair of routers in the RCG. If the physical links used to connect the routers cannot satisfy the delay constraints, a weight of infinity is assigned to the corresponding edges in the RCG.
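The edge weight update described above can be sketched as a small cost function. This is a deliberately simplified model under stated assumptions: the function name and parameters are hypothetical, a new link is assumed to need exactly two new router ports (one at each end), and the effect of router size on per-bit energy is ignored.

```python
import math


def rcg_edge_weight(rate, link_exists, spare_bw, meets_delay,
                    bit_energy, new_link_leakage, new_port_leakage):
    """Incremental power cost of routing a flow of the given rate over one edge."""
    if not meets_delay:
        # A link that violates the delay constraint must never be chosen.
        return math.inf
    dynamic = bit_energy * rate
    if link_exists and spare_bw >= rate:
        # Reuse: the leakage is already paid, only switching power is added.
        return dynamic
    # New resources: switching power plus leakage of one link and two ports.
    return dynamic + new_link_leakage + 2 * new_port_leakage


# Hypothetical values in arbitrary but consistent units.
reuse = rcg_edge_weight(100.0, True, 400.0, True, 0.5, 10.0, 3.0)
fresh = rcg_edge_weight(100.0, False, 0.0, True, 0.5, 10.0, 3.0)
blocked = rcg_edge_weight(100.0, True, 400.0, False, 0.5, 10.0, 3.0)
```

Because reusing an existing link avoids the leakage terms, rerouted flows are naturally drawn toward resources that other flows have already opened.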
Once the RCG is constructed, the multicast routing graph (MRG) for the flow is generated from the RCG (line 5). The MRG is built by including every source and destination router of the flow as its nodes. For each pair of nodes in the MRG, the least-cost directed path on the RCG, i.e., the path with the least power consumption, is found for the corresponding routers using Dijkstra's shortest path algorithm, and its cost is assigned as the weight of the arc connecting the two nodes in the MRG. Then the Chu-Liu/Edmonds algorithm [8], [9] is used to find the rooted directed minimum spanning tree of the MRG with the source router as root. A rooted directed spanning tree of a graph connects all $n$ nodes, without any cycle, using $n - 1$ arcs, such that every node except the root has exactly one incoming arc; a minimum one minimizes the sum of the weights of its arcs. This directed minimum spanning tree is taken as the multicast routing tree (MRTree), so that the routes of the multicast flow follow the structure of this tree. The details of the Chu-Liu/Edmonds algorithm are summarized in Algorithm 3. The multicast routing for flow $f$ in the RCG is obtained by projecting the MRTree back to the RCG, expanding the corresponding arcs into paths. A special case is when $f$ is a unicast flow with source $s$ and destination $d$. In this case, the MRG consists of just two nodes, namely $s$ and $d$, and one directed arc from $s$ to $d$. Therefore, the routing between $s$ and $d$ in the RCG is simply a shortest path between $s$ and $d$.
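The three steps above (shortest paths on the RCG, a minimum arborescence on the MRG, and projection back to the RCG) can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names, the dictionary-based graph encoding, and the RCG weights are all made up, and the contraction step uses the common textbook cost adjustment (arc cost minus the cheapest arc entering the same cycle node).

```python
import heapq


def dijkstra(rcg, src):
    """Least-cost distances and predecessors from src in rcg = {u: {v: w}}."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in rcg.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def find_cycle(best, root):
    """Return the nodes of one cycle among the cheapest entering arcs, or None."""
    seen = {}
    for start in best:
        path, v = [], start
        while v != root and v in best and v not in seen:
            seen[v] = start
            path.append(v)
            v = best[v]
        if seen.get(v) == start:  # walked back onto the current path
            return path[path.index(v):]
    return None


def min_arborescence(arcs, root):
    """Chu-Liu/Edmonds on arcs = {(u, v): w}; returns {child: parent}."""
    best = {}  # cheapest entering arc for every non-root node
    for (u, v), w in arcs.items():
        if v != root and u != v and (v not in best or w < arcs[(best[v], v)]):
            best[v] = u
    cycle = find_cycle(best, root)
    if cycle is None:
        return best
    cyc, c = set(cycle), ("cycle",) + tuple(cycle)  # c is the pseudo-node
    new_arcs, origin = {}, {}
    for (u, v), w in arcs.items():
        if u in cyc and v in cyc:
            continue
        if v in cyc:  # arc entering the cycle: reduce by the arc it replaces
            key, w2 = (u, c), w - arcs[(best[v], v)]
        elif u in cyc:
            key, w2 = (c, v), w
        else:
            key, w2 = (u, v), w
        if key not in new_arcs or w2 < new_arcs[key]:
            new_arcs[key], origin[key] = w2, (u, v)
    parent = {}
    for v, u in min_arborescence(new_arcs, root).items():
        ou, ov = origin[(u, v)]  # map contracted arcs back to real arcs
        parent[ov] = ou
    for v in cyc:  # keep the cycle arcs except the one that was replaced
        parent.setdefault(v, best[v])
    return parent


def route_multicast(rcg, source, dests):
    """Build the MRG, find the MRTree, and expand its arcs into RCG paths."""
    terminals = [source] + list(dests)
    sp = {t: dijkstra(rcg, t) for t in terminals}
    mrg = {(a, b): sp[a][0][b] for a in terminals for b in terminals
           if a != b and b in sp[a][0]}
    routes = {}
    for child, par in min_arborescence(mrg, source).items():
        path, v = [child], child
        while v != par:
            v = sp[par][1][v]
            path.append(v)
        routes[(par, child)] = path[::-1]
    return routes


# Hypothetical RCG weights; only the flow shape (R4 to R2, R5, R6) follows the text.
rcg = {"R4": {"R2": 2.0, "R5": 6.0, "R6": 1.0},
       "R2": {"R4": 2.0, "R5": 1.0, "R6": 5.0},
       "R5": {"R4": 6.0, "R2": 1.0, "R6": 5.0},
       "R6": {"R4": 1.0, "R2": 5.0, "R5": 5.0}}
routes = route_multicast(rcg, "R4", ["R2", "R5", "R6"])
```

With these weights the cheapest entering arcs of R2 and R5 point at each other, so the contraction step is exercised; the resulting tree branches at R4 toward R6 and R2 and then continues from R2 to R5.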
After the path is determined, the routers and links on the chosen path are updated.
As an example, Figure 3(b) shows the RCG for rerouting the multicast flow $e_7$. For clarity, only some of the edges of the RCG are shown. The MRG and MRTree for $e_7$ are shown in Figure 3(c) and (d), respectively. By projecting the MRTree back to the RCG, the routing path for $e_7$ is determined: $e_7$ bifurcates at the source router $R_4$ to reach $R_6$ and $v_6$, is then transferred over the network link between $R_4$ and $R_2$ to reach $v_2$, and bifurcates again to reach $R_5$ and $v_5$. The physical connectivity between routers before and after ripping up and rerouting $e_7$ is shown in Figure 3(e) and (f). We observe that the link between $R_4$ and $R_5$ and the corresponding ports are saved, so power consumption is reduced after rerouting $e_7$, which now reuses network resources that already carry other flows.
This RIPUP-REROUTE process is repeated for all the flows. The results of this procedure depend on the order in which the flows are considered, so the entire procedure can be repeated several times to reduce the dependency of the results on flow ordering.¹ Once the path of each flow is decided, the size of each router and the links that connect the routers are determined. Routers that perform no traffic multiplexing or de-multiplexing are deleted and their links are reconnected. The remaining routers and links constitute the network topology. The total implementation cost of all the routers and links in this topology is then evaluated.
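The rip-up and reroute loop can be illustrated with a small Python sketch. It is deliberately simplified and rests on several assumptions that are not part of the paper: flows are unicast only, each candidate link has a hypothetical fixed cost (paid once when the link is used at all) plus a cost per unit of data rate, and the first pass doubles as the initial routing. The names `shortest_path` and `ripup_reroute` are illustrative.

```python
import heapq


def shortest_path(weights, src, dst):
    """Dijkstra over weights[(u, v)] -> cost; returns the node list src..dst."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def ripup_reroute(links, flows, max_passes=5):
    """links: (u, v) -> (fixed_cost, cost_per_unit_rate)  [toy cost model]
    flows: name -> (src, dst, rate)
    """
    users = {e: set() for e in links}  # flows currently routed over each link
    route = {}                         # flow name -> list of links

    def total_cost():
        return sum(f + p * sum(flows[n][2] for n in users[e])
                   for e, (f, p) in links.items() if users[e])

    best = float("inf")
    for _ in range(max_passes):
        # Consider flows in increasing order of data rate.
        for name, (src, dst, rate) in sorted(flows.items(), key=lambda kv: kv[1][2]):
            for e in route.pop(name, []):  # rip up: release occupied links
                users[e].discard(name)
            # Edge weight = marginal cost of carrying this flow on the link.
            weights = {e: p * rate + (0.0 if users[e] else f)
                       for e, (f, p) in links.items()}
            nodes = shortest_path(weights, src, dst)
            route[name] = list(zip(nodes, nodes[1:]))
            for e in route[name]:
                users[e].add(name)
        cost = total_cost()
        if cost >= best:  # stop once a full pass no longer improves
            break
        best = cost
    return route, total_cost()


links = {("A", "B"): (10.0, 1.0), ("B", "C"): (10.0, 1.0), ("A", "C"): (15.0, 1.0)}
flows = {"f1": ("A", "B", 2.0), "f2": ("B", "C", 2.0), "f3": ("A", "C", 1.0)}
route, cost = ripup_reroute(links, flows)
```

In this toy instance f3 is routed first and opens the direct link A to C; on the second pass it is ripped up and rerouted over the links that f1 and f2 have opened in the meantime, which is the ordering effect that repeating the procedure is meant to mitigate.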
Algorithm 2 RIPUP-REROUTE(G(V, E, π, λ), RCG, C, L)
 1: while … do
 2:   for all flows e_k ∈ E in increasing order of λ(e_k) do
 3:     delete path(e_k) and release the link and router resources it occupied
 4:     update all edge weights in RCG for flow e_k, according to the power consumption of the corresponding link and router resources
 5:     MRG(e_k) = ConstructMulticastRoutingGraph(RCG)
        …
        path(e_k) = Find paths from …
        Update link, BW_avail, routers for path(e_k)
 9:   end for
10: end while
11: return links, routers

Algorithm 3 DirectedMinimumSpanningTree(G(N, A))
1: Discard the arcs entering the root, if any. For each node other than the root, select the entering arc with the smallest cost. Let the selected n − 1 arcs be the set S.
2: If no cycle is formed, G(N, S) is an MST. Otherwise, continue.
3: For each cycle formed, contract the nodes in the cycle into a pseudo-node k, and modify the cost of each arc that enters a node j in the cycle from some node i outside the cycle according to the following equation:
   $c(i,k) = c(i,j) - \bigl(c(x(j),j) - \min_{j} c(x(j),j)\bigr)$
   where c(x(j), j) is the cost of the arc in the cycle that enters j.
4: For each pseudo-node, select the entering arc with the smallest modified cost. Replace the arc that enters the same real node in S by the newly selected arc.
5: Go to step 2 with the contracted graph.

C. Router Merging
After the physical network topology has been generated using RIPUP-REROUTE, a router merging step is used to further optimize the topology and reduce the power consumption cost. The router merging step was first proposed by Srinivasan in [16], where merging was based on the distance between routers. In this paper, we propose a new router merging algorithm for reducing the power consumption of the network and improving the performance. As has been observed, routers that connect with each other can be merged to eliminate router ports and links, and thus possibly the corresponding costs. Routers that connect to the same common routers can also be merged to reduce ports and costs. We propose a greedy router merging algorithm, which is shown in Algorithm 4. The algorithm works iteratively by considering all possible mergings of two routers connected with each other. In each iteration, the list of adjacent routers of each router is constructed and sorted by distance in increasing order. These are the candidate mergings. The routers are then considered for merging in decreasing order of their number of neighbors. For each candidate merging, if the topology resulting from the merging is valid, the total power consumption of the resulting topology is evaluated using the power models. Two routers are merged if neither has already been merged in this iteration and the cost improves. After all routers have been considered in the current iteration, the router set is updated by replacing the merged routers with the newly generated ones, which are reconsidered in the next iteration. The algorithm keeps merging routers until no further improvement can be made. After router merging, the optimized topology is generated and the routing paths of all flows are updated. Since router merging always reduces the number of routers in the topology, it does not increase the hop count of any flow and thus does not worsen the performance of the application.
The topology generated after router merging represents the best solution with the minimum power consumption. It is returned as the final solution for our NoC synthesis algorithm.
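The greedy merging loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: router power is modeled as a hypothetical base cost plus a term quadratic in the port count (the paper uses Orion-derived power models), a merging is considered valid if the merged router stays within a hypothetical `max_ports` limit, the merged router is placed at the midpoint of the two originals, and link power is ignored.

```python
from itertools import count


def power(routers, base=1.0, coeff=0.1):
    """Toy cost: base + coeff * ports^2 per router (assumed model)."""
    return sum(base + coeff * (r["cores"] + len(r["nbrs"])) ** 2
               for r in routers.values())


def distance(routers, a, b):
    """Manhattan distance between two routers."""
    return sum(abs(p - q) for p, q in zip(routers[a]["pos"], routers[b]["pos"]))


def merged(routers, a, b, new_id):
    """Return a copy of the topology with routers a and b fused into new_id."""
    out = {k: {"cores": r["cores"], "nbrs": set(r["nbrs"]), "pos": r["pos"]}
           for k, r in routers.items() if k not in (a, b)}
    ra, rb = routers[a], routers[b]
    nbrs = (ra["nbrs"] | rb["nbrs"]) - {a, b}
    out[new_id] = {"cores": ra["cores"] + rb["cores"], "nbrs": nbrs,
                   "pos": tuple((p + q) / 2 for p, q in zip(ra["pos"], rb["pos"]))}
    for n in nbrs:  # redirect the neighbors' links to the merged router
        out[n]["nbrs"] -= {a, b}
        out[n]["nbrs"].add(new_id)
    return out


def router_merging(routers, max_ports=8):
    ids = count()
    improved = True
    while improved:
        improved = False
        done = set()  # routers already merged in this round
        # Routers with more neighbors are considered first.
        order = sorted(routers, key=lambda r: -len(routers[r]["nbrs"]))
        for ri in order:
            if ri in done:
                continue
            # Candidate partners: adjacent routers, nearest first.
            for rj in sorted(routers[ri]["nbrs"], key=lambda r: distance(routers, ri, r)):
                if rj in done:
                    continue
                new_id = f"M{next(ids)}"
                cand = merged(routers, ri, rj, new_id)
                ports = cand[new_id]["cores"] + len(cand[new_id]["nbrs"])
                if ports <= max_ports and power(cand) < power(routers):
                    routers = cand
                    done |= {ri, rj, new_id}
                    improved = True
                    break
    return routers


topology = {
    "R0": {"cores": 2, "nbrs": {"R1"}, "pos": (0.0, 0.0)},
    "R1": {"cores": 1, "nbrs": {"R0", "R2"}, "pos": (1.0, 0.0)},
    "R2": {"cores": 2, "nbrs": {"R1"}, "pos": (5.0, 0.0)},
}
result = router_merging(topology, max_ports=4)
```

With the four-port limit, R1 is merged with its nearest neighbor R0, and the second candidate merging is rejected as invalid, so the sketch ends with two routers instead of three.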
As an example, the connectivity graphs before and after ROUTER-MERGING procedure for the example of Figure 2(a) are shown in Figure 4(a) and (b). It is shown that after router merging, the network resources are reduced from 4 routers to 3 routers and the total power consumption is reduced as well.

Algorithm 4 ROUTER-MERGING(R, T)
    …
 3: for all r_i ∈ R do
 4:   adj(r_i) = generate all 1-hop adjacent router list and sort it by their distance to r_i in increasing order
 5: end for
 6: sort routers in R by their number of adjacent routers in decreasing order
 7: for all r_i ∈ R in this order do
 8:   for all r_j ∈ adj(r_i) do
 9:     if neither r_i nor r_j is merged in this round then
10:       evaluate the total power consumption cost of merging r_i and r_j
11:       merge r_i, r_j to r if merging is valid and total power consumption is improving
12:       delete r_i, r_j from R and add r to R
13:     end if
14:   end for
15: end for
16: if no merging is done in this iteration then
    …

… + 1)^2 |V|^2) by finding shortest path between each pair of nodes. Then it takes O(|V|^2) to find the rooted directed minimum spanning tree as the multicast tree by using the Chu-Liu/Edmonds algorithm. So the overall complexity of our algorithm is O(|E||V|^2).
Deadlock Considerations
Deadlock-free routing is an important consideration for the correct operation of custom NoC architectures. In our previous work [19], [20], we proposed two mechanisms to ensure deadlock-free operation in our NoC synthesis results. In this paper, we adopt the same mechanisms in our new NoC synthesis algorithm to ensure deadlock freedom in the deterministic routing problem we consider.
The first method is Statically Scheduled Routing. For our NoC solutions, the required data rates are specified and the routes are fixed. In this setting, data transfers can be statically scheduled along the pre-determined paths with resource reservations to ensure deadlock-free routing [43], [44]. The second method is Virtual Channel insertion. As shown in [40], a necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. In particular, we use an extended channel dependency graph construction to find resource dependencies between multicast trees² and break the cycles by splitting a channel into two virtual channels (or by adding another virtual channel if the physical channel has already been split). The added virtual channels are implemented in the corresponding routers. We applied this method to our NoC synthesis procedure and found that virtual channels are rarely needed to resolve deadlocks in practice for custom networks. In all the benchmarks that we tested in Section 8, no deadlocks were found in the synthesized solutions. Therefore, we did not need to add any virtual channels.
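The cycle check on a channel dependency graph can be sketched as below. This is a minimal illustration for fixed paths only; it does not reproduce the extended construction for multicast trees used in the paper. The encoding of a channel as a directed link (u, v) and the function names are assumptions made for the example; a multicast tree could be approximated by supplying each of its root-to-leaf paths.

```python
def channel_dependency_graph(routes):
    """Build the dependency graph from fixed routes.

    routes is a list of node paths. A channel is a directed link (u, v), and
    channel c1 depends on c2 when some route uses c2 immediately after c1.
    """
    cdg = {}
    for path in routes:
        chans = list(zip(path, path[1:]))
        for c in chans:
            cdg.setdefault(c, set())
        for c1, c2 in zip(chans, chans[1:]):
            cdg[c1].add(c2)
    return cdg


def find_cycle(cdg):
    """Return a list of channels forming a cycle, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in cdg}
    stack = []

    def visit(c):
        color[c] = GREY
        stack.append(c)
        for nxt in cdg[c]:
            if color[nxt] == GREY:  # back edge: dependency cycle found
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[c] = BLACK
        return None

    for c in cdg:
        if color[c] == WHITE:
            found = visit(c)
            if found:
                return found
    return None


# Three routes chasing each other around a ring create a dependency cycle.
ring = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
cycle = find_cycle(channel_dependency_graph(ring))

# Two routes that only share a prefix create none.
tree_like = [["A", "B", "C"], ["A", "B", "D"]]
no_cycle = find_cycle(channel_dependency_graph(tree_like))
```

Any channel on a reported cycle is a candidate for splitting into virtual channels; when `find_cycle` returns `None`, no virtual channel is needed under this simplified model.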
Results
A. Experimental Setup
We have implemented our proposed algorithm RRRM in C++. As discussed in the design flow outlined in Section 3, we use Parquet [37] for the initial floorplanning step.
In all our experiments, we aim to evaluate the performance of our algorithm RRRM on all benchmarks with the objective of minimizing the total power consumption of the synthesized NoC architectures. The total power consumption includes both the leakage power and the dynamic switching power of all network components. As discussed in Section 3, we use a power-performance simulator called Orion [27], [28] to estimate the power consumption of the generated router configurations. We applied the design parameters of 1 GHz clock frequency, 4-flit buffers, and 128-bit flits. For the link power parameters, we use the state-of-the-art on-chip repeated interconnect model [29], [30] to evaluate the optimum power for links of different lengths under the given delay constraint of 1 ns. Both routers and links are evaluated using the 70 nm technology, and the results are provided in a library.
As shown in our earlier work, our previous Steiner-tree-based formulation and the four proposed algorithms significantly outperform regular mesh and optimized mesh topologies. Specifically, the two heuristic algorithms CLUSTER and DECOMPOSE achieve results similar to those of the probabilistic algorithms but with faster execution times.
Therefore, to evaluate the effectiveness of our new algorithm, we applied RRRM to the same sets of benchmarks used in [19], [20] and compared its synthesis results with those of CLUSTER and DECOMPOSE. We do not repeat here the comparisons with mesh-based topologies since our new formulation already outperforms our earlier work. In particular, to emphasize the benefit and efficiency of our new algorithm on large benchmarks, we selected the benchmarks with more than 15 cores and report their results in this paper. The results show that RRRM outperforms CLUSTER and DECOMPOSE in both power consumption and performance, with execution times two to three orders of magnitude faster. The details of the results are discussed in the following sections.
The same two groups of benchmarks were used. The first group was used to evaluate the performance of our algorithm on applications with only unicast flows. It consists of a generic MultiMedia System (MMS) and several combinations of four different video processing applications obtained from [18], namely VOPD, MPEG4, PIP and MWD. The benchmark names abbreviate the applications they include, e.g., V+M denotes the benchmark that includes the VOPD and MWD applications, and so on. The second group was used to evaluate the performance of our algorithm on benchmarks with multicast traffic flows. In the absence of published benchmarks with multicast traffic, we generated a set of synthetic benchmarks using the NoC-centric bandwidth version of Rent's rule proposed in [38]. The authors of [38] showed that the traffic distribution models of NoC applications should follow a Rent's rule distribution similar to that of conventional VLSI netlists. The bandwidth version of Rent's rule states that the relationship between the external bandwidth B across a boundary and the number of blocks G within the boundary obeys $B = kG^{\beta}$, where k is the average bandwidth of each block and β is Rent's exponent. The benchmark generation procedure proposed in [39] is adopted and modified in accordance with the NoC-centric Rent's rule to generate multicast benchmarks. The average bandwidth k of each block and Rent's exponent β are specified by the user. In our experiments, we generated large NoC benchmarks by varying k from 100 kb/s to 500 kb/s and β from 0.65 to 0.75. We formed multicast traffic with varying group sizes for about 10% of the flows. Thus our multicast benchmarks cover a large range of applications with mixed unicast/multicast flows and varying hop count and data rate distributions.
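As a quick numerical illustration of the bandwidth version of Rent's rule, the following sketch evaluates B = kG^β for a region of blocks. The function name `external_bandwidth` and the choice of a 16-block region are assumptions for the example; only the ranges of k and β come from the experimental setup above.

```python
def external_bandwidth(k, G, beta):
    """Bandwidth version of Rent's rule: B = k * G**beta.

    k    : average bandwidth per block (kb/s here)
    G    : number of blocks inside the boundary
    beta : Rent's exponent
    """
    return k * G ** beta


# A hypothetical region of 16 blocks at both ends of the parameter ranges.
b_low = external_bandwidth(100, 16, 0.65)
b_high = external_bandwidth(500, 16, 0.75)
```

Because β is below 1, the external bandwidth grows sublinearly with the number of blocks enclosed, which reflects traffic locality: a region of 16 blocks exchanges far less than 16k with the rest of the chip.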
All experimental results were obtained on a 1.5 GHz Intel P4 processor machine with 512 MB of memory running Linux.
B. Comparison of results
The floorplans for the custom topologies synthesized by our tool using different algorithms for one of the benchmarks, VOPD, are shown in Figure 5. Figure 5(a) shows the topology generated by RRRM. It consists of three routers and consumes 0.040 W. Figure 5(b) shows the topology generated by CLUSTER and DECOMPOSE. Those two algorithms generated the same topology for VOPD, consisting of four smaller routers. Its total power consumption is 0.042 W. Although the topology generated by RRRM has larger routers, it benefits from eliminating one router, leading to lower power for the overall network.
The synthesis results of our algorithm RRRM on all benchmarks at 70nm with comparison to results using CLUSTER and DECOMPOSE are shown in Table IV . For all benchmarks, the power results and the execution times of each algorithm, and power ratios and execution time ratios of CLUSTER and DECOMPOSE over RRRM are reported. The power results of all algorithms relative to RRRM are graphically compared in Figure 6 . The results show that RRRM can efficiently synthesize NoC architectures that minimize power consumption as well as achieve good performance. Among all 14 benchmarks tested, RRRM can achieve better results than CLUSTER and DECOMPOSE for 12 benchmarks. On average RRRM can achieve a 9% reduction in power consumption over CLUSTER and a 17% reduction in power consumption over DECOMPOSE, respectively.
Moreover, due to its low complexity, RRRM runs much faster and more efficiently than CLUSTER and DECOMPOSE. The execution times of all algorithms relative to RRRM are graphically compared in Figure 7. As can be seen from the results, RRRM obtains results for all benchmarks in under 1 minute. Even for the largest benchmark tested, with 64 cores and 164 flows, RRRM finishes within 35 seconds, while CLUSTER takes over 5 hours. On average RRRM is 1786 times faster than CLUSTER and 57 times faster than DECOMPOSE. Its low complexity and very short execution time make RRRM more suitable and efficient for large benchmarks.
To evaluate the performance of the synthesized topologies, the average hop counts of the benchmarks on the synthesized topologies are reported in Table V, and the results of all algorithms relative to RRRM are graphically compared in Figure 8. Hop counts correspond to the number of intermediate routers that a packet needs to pass through from the source to the destination. The results show that RRRM can improve the performance of the synthesized topologies as well. In particular, the solutions obtained using RRRM achieve on average a 3% reduction in average hop counts over CLUSTER and a 7% reduction in average hop counts over DECOMPOSE. In a number of benchmarks, some modules have only a single incoming flow or a single outgoing flow. For example, for the VOPD application, 6 out of the 12 modules have at most one incoming flow as well as one outgoing flow, and 10 out of the 12 modules have either at most one outgoing flow or one incoming flow. For these benchmarks, the most efficient architectures are actually the ones that provide direct network links between network interfaces for some of the traffic flows, without going through intermediate routers.³ For these benchmarks, the average hop count may be less than one since not all flows necessarily pass through intermediate routers. Our algorithms are able to arrive at these implementations by correctly capturing those properties of the benchmarks.
Finally, to evaluate the area costs of the synthesized solutions, we also used Orion [27] , [28] to estimate the areas of the routers in the synthesized architectures, using the same 70nm technology used for power estimation. The area cost of a solution corresponds to the sum of the router areas in the solution. The results are presented in Table VI and their relative results over RRRM are compared in Figure 9 . Total area costs of all solutions produced by RRRM are better than those produced by CLUSTER and DECOMPOSE. In particular, on average, total area costs produced by RRRM are 12% better than those of CLUSTER and 1% better than those of DECOMPOSE.
Conclusions
In this paper, we proposed a very efficient algorithm called RRRM for the custom NoC synthesis problem. Our algorithm takes into consideration both unicast and multicast traffic, and our objective is to construct an optimized interconnection architecture such that the communication requirements are satisfied and the power consumption is minimized. The entire process is formulated as a joint multicast routing and network design problem using a rip-up and reroute procedure. Each multicast routing step is formulated as a minimum directed spanning tree problem. Our new formulation adopts the rip-up and reroute concept, which has been successfully used in the VLSI routing problem, as an optimization strategy to identify increasingly improving solutions. The minimum directed spanning tree formulation efficiently captures the best routing solutions for multicast flows during the topology synthesis procedure. We have described several ways to ensure deadlock-free routing of both unicast and multicast flows. Experimental results on a variety of benchmarks using a power consumption cost model show that our algorithm produces more effective solutions with much faster execution times compared to our previously proposed algorithms CLUSTER and DECOMPOSE on both unicast and multicast applications. Therefore, it also significantly outperforms regular mesh and optimized mesh topologies.
