The addition of express channels to a traditional mesh network-on-chip (NoC) has emerged as a viable solution to the problem of high latency. In this article, we address the problem of integrated mapping and synthesis for express channel-based mesh NoC topologies. An integer linear programming-based formulation is presented for the mapping problem, followed by a constructive heuristic for simultaneous application mapping and synthesis for an express channel-based NoC. The static and dynamic simulation results indicate that the obtained mappings lead to a significant reduction in both average packet delay and network energy consumption. The synthesized topologies were also found to be much more power efficient than conventional express channel topologies.
INTRODUCTION
With the growth in complexity of VLSI systems and increasing functionality, the number of cores on a single chip continues to grow. In today's systems, it is common to find multiple intellectual property (IP) cores integrated onto a single chip. These cores may perform different functionalities in such a system-on-chip (SoC) environment. Traditional SoCs use a bus to interface the different cores. With the increase in the number of cores, bus-based SoCs suffer degradation in performance. Thus, in present and upcoming VLSI systems, where highly parallelized systems are the norm and fabrication technology is at its limits, optimizing on-chip communication has become critical to ensure performance increments in line with Moore's law. The network-on-chip (NoC) paradigm [Dally and Towles 2001; Atienza et al. 2008; Marculescu and Bogdan 2009] has evolved as a viable approach to solve the communication problem. In an NoC, multiple cores communicate through a fabric of routers interconnected according to a given topology. One or more IP cores may be attached to a single router. The nature of the network allows it to scale up to possibly hundreds of cores.

Authors' addresses: S. D'souza (corresponding author), Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, West Bengal, India, 721302; email: smdsouza@ece.iitkgp.ernet.in; Soumya J., Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science Pilani, Hyderabad Campus, Hyderabad, Telangana, India, 500078; email: soumyaj@hyderabad.bits-pilani.ac.in; S. Chattopadhyay, Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, West Bengal, India, 721302; email: santanu@ece.iitkgp.ernet.in.
To interconnect a large number of cores, multiple network topologies have been proposed for NoCs. Due to their simplicity, low design complexity, and easy mapping onto 2D silicon substrates, the ring topology [Pham et al. 2006] and 2D mesh topology have become the most popular. Up to a few cores, the ring network is simple and efficient to use. For example, the Intel Sandy Bridge Architecture [Intel Sandy Bridge 2011] uses a ring interconnection network to interface the CPU cores with the GPU. For a core count of the order of a few tens, the mesh architecture provides better scalability. It provides a regular and symmetric structure that is easy to design and replicate. Most industrial designs with a core count up to a hundred, such as the 64-core TILE processor from Tilera [Wentzlaff et al. 2007] and the 80-core Tera-scale chip from Intel [Vangal et al. 2008], use the mesh topology. However, these topologies face scalability issues when the number of cores is of the order of a few hundred [Grot et al. 2009]. As the number of cores grows, the intercore communication latency increases tremendously.
For large and parallel multiprocessor systems-on-chip (MPSoCs), the mesh topology faces many issues and is inherently slower as messages are forwarded hop by hop. To alleviate the problem of a large network diameter, multiple variants of the mesh topology have been proposed, such as concentrated mesh (CMesh) [Balfour and Dally 2006], butterfly fat tree [Pande et al. 2003], and mesh-of-tree [Leighton 1992]. In Balfour and Dally [2006], increasing core concentration at a single router has been proposed as a means to reduce the network diameter. CMesh [Balfour and Dally 2006] achieves this reduction by co-locating multiple terminals at each network interface with a crossbar interconnect. Mesh and CMesh are illustrated in Figure 1.
The addition of express channels [Grot et al. 2009; Kim et al. 2007; Kumar et al. 2007] to mesh-based NoCs has been proposed in recent literature. By means of express channels, nonadjacent routers are directly connected, thus reducing network latency due to multiple hops. Significant among them are the topologies of flattened butterfly (FB) [Kim et al. 2007] and multidrop express channels (MECS) [Grot et al. 2009]. The FB topology involves flattening a conventional butterfly network for use on a chip with the aim of reducing the hop count through the use of high-radix routers. Each router has a dedicated link to all routers sharing the same row or column as the source router. However, this causes large interconnection complexity, with a channel count that is quadratic with respect to the number of nodes [Grot et al. 2009]. To overcome this [...] concentration (i.e., multiple IP cores are mapped to a single router). In this work, this concentration has been taken as four.
- An integrated mapping and synthesis technique for express channel-based NoCs has been proposed. The proposed approach gives the system designer the ability to restrict the router radix and maps the IP cores onto the simultaneously synthesized topology.
The article is organized as follows. Section 2 presents the related work. Section 3 presents a mathematical formulation of the problem as well as delay models for express channel-based NoCs. Our proposed approach (for both the mapping and the synthesis problems), as well as an integer linear programming (ILP)-based mapping formulation, has been elaborated in Section 4. Section 5 enumerates the experimental results. Section 6 concludes the article.
RELATED WORK
The design of systems that meet different requirements of an application is a complex task. From the point of view of an NoC, custom topologies may be developed to meet the requirement of a given application. This is a complex process involving floorplan generation, custom network routing components, and application-specific routing algorithms. Several works covering synthesis of complex application-specific NoCs have been presented in academia [Benini 2006; Srinivasan et al. 2006].
A simpler approach involves mapping an application onto regular topologies to minimize a given cost function. The cost function can be in terms of several metrics, such as communication latency, network throughput, network power, and area. Several algorithms have been proposed in the literature to solve the application mapping problem. A comprehensive survey can be found in Sahu and Chattopadhyay [2013]. Several mixed ILP-based formulations have been proposed in Faruque et al. [2008], Ozturk et al. [2007], and Ghosh et al. [2009]. ILP-based approaches provide the optimal mapping solution for a given application and network. However, these approaches require a large amount of computational resources and do not scale easily to large core counts. Some methods propose constraint relaxation in the ILP formulation to provide accurate results with a reduction in computation time. In Tosun [2011], a clustering-based relaxation for ILP formulations has been proposed. However, to enable scaling to larger designs with lower computation costs, heuristics such as NMAP [Murali and Micheli 2004a] and metasearch techniques have been proposed in academia. NMAP is a communication-aware three-stage iterative mapping technique that has been proposed in Murali and Micheli [2004a]. It consists of an initial mapping phase, followed by minimum path computation and a final phase where pair-wise swapping of vertices is undertaken to refine the solution. SUNMAP [Murali and Micheli 2004b] is another such tool that performs RTL-level NoC topology exploration by minimizing area and power consumption. In Hu and Marculescu [2003], a branch-and-bound algorithm to minimize communication energy of mapping has been proposed. A two-stage KL partitioning-based mapping heuristic for mesh and mesh-of-tree topologies has been proposed by Sahu et al. [2014]. A discrete particle swarm optimization (DPSO)-based approach has been proposed in Pham et al. [2006]. Energy- and buffer-aware application mapping (EBAM) [Celic and Bazlamacci 2014] handles the application mapping issue as a joint optimization problem for minimizing the energy consumption and buffer utilization simultaneously by employing a genetic algorithm. In Moein-Darbari et al. [2009], the CGMAP method has been proposed, which employs a chaos-genetic-based algorithm that obtains close results compared to other metaheuristic algorithms.
The presence of express channels adds interconnections between nonadjacent routers. Hence, the preceding methods cannot be applied directly. For a mesh-based NoC, the latency of communication between two tiles depends purely on the Manhattan distance between the nodes. In an express channel-based mesh network, every router has a direct connection to the other routers in the same row or column. Therefore, in the absence of a turn, a packet can directly travel from the source to the destination. Thus, the express channels introduce an extra dependence on the number of turns in the routing path. Zhu et al. [2014] have proposed TRAM, an approach for mapping applications onto express channel-based NoCs (MECS). The TRAM approach [Zhu et al. 2014] uses the KL heuristic to place heavily communicating cores in the same row or column of the network topology, thus reducing the number of turns required for communication. Assigning cores to routers within the row is an assignment problem. TRAM employs the Hungarian method [Kuhn 2005] to optimally perform this assignment. A packet delay model and mathematical formulation of the application mapping problem for express channel-based NoCs have also been proposed in Zhu et al. [2014]. We elaborate on this model further in Section 3.
For an express channel-based mesh network with a large network diameter, the link latency to the routers farther away in the same row or column may sometimes be more than the total latency to the nodes in adjacent rows or columns. The TRAM approach causes suboptimal mappings for such networks with a large diameter. The initial row-based core clustering in TRAM provides a poor starting solution. This problem is illustrated using an example 8 × 8 network with 64 routers (Figure 3). In the example, let the nodes be numbered row-wise from left to right, with the top left node numbered as 1. The number within the node is used to indicate the packet delay incurred by a packet traveling from the first node to that particular node. It can be seen that the latency from node 1 to node 7 is 12. The latency from node 1 to node 10 is 11. This is lower than the latency from node 1 to node 7, even though nodes 1 and 10 do not share the same row or column. However, the TRAM [Zhu et al. 2014] algorithm will prefer node 7 over node 10 for mapping a core with a greater degree of communication with the core mapped to node 1. This causes a poor mapping solution. In addition, for core graphs with a lower degree of connectivity (as is the case for most real applications), placing several cores in a single row may not make sense, especially when not all of them communicate. Hence, in such situations, a net increase in the number of turns may occur using TRAM.
In this article, a constructive heuristic has been proposed to solve the mapping problem. The proposed single-stage heuristic effectively maps cores with a high degree of communication close to each other and generates a mapping solution in a constructive fashion. The method scales efficiently to larger designs providing efficient mapping solutions with low computation time.
Another approach toward system design can be the synthesis of a semicustom topology [Ogras and Marculescu 2006] . Semicustom topologies can be designed by taking a general-purpose topology and adding additional express channels as per the requirements of the designer. For a given mapping onto a mesh-based topology, an iterative link addition approach has been proposed in Ogras and Marculescu [2006] . Between each pair of cores, long distance links are added, and the benefit of link addition is calculated. Based on the benefit, the required links are added. It has been found that the addition of such links improves both the static and dynamic properties of the network [Ogras and Marculescu 2006] .
Based on the constructive heuristic proposed in this article, we also present an integrated mapping and synthesis approach. The approach intelligently adds dedicated express links based on the router radix limitations specified by the system designer.

Definition 3. A mapping solution is defined as the mapping of each core $c_i \in C$ to a router $r_j \in R$. Each router may have one or more nodes. A mapping can be one-to-one, where a single core $c_i \in C$ is mapped to a router $r_j \in R$, or many-to-one, where a cluster of cores is mapped to the nodes of a single router. The size of this cluster is the same for each router and is referred to as the concentration factor of the network. For an application graph with $n$ cores, the set of mapping solutions are the permutations
Definition 4. The average packet delay (APD) is the average latency faced by the data packets while traversing the network. For a given core graph $G(C, E)$ and the topology delay graph $T(R, L)$, the APD depends on the mapping solution. For a given mapping solution, the APD is calculated by
where $comm_{ij}$ represents the bandwidth requirement of the communication from $c_i$ to $c_j$ and $d_{map(i)\,map(j)}$ represents the network communication delay between the routers to which cores $c_i$ and $c_j$ are mapped.
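The expression referred to above is not reproduced here. A reconstruction consistent with the symbol definitions, and with the later description of the objective in Section 4.1.2 (communication delay multiplied by bandwidth, averaged over the bandwidth of all edges), would be the following; it should be read as an assumption rather than as the authors' exact equation:

```latex
\mathrm{APD} =
  \frac{\sum_{e_{ij} \in E} comm_{ij} \cdot d_{map(i)\,map(j)}}
       {\sum_{e_{ij} \in E} comm_{ij}}
```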
The delay offered to a packet by the network depends on the mapping of the source and destination cores, as well as the interconnections of the topology. The delay experienced by a packet is a combination of router delay, link delay, and delay due to router contention. The unit-length link delay $T_L$ is the number of clock cycles between adjacent tiles. The delay offered by a link can be assumed to be directly proportional to its length. Router delay $T_R$ is the number of clock cycles a packet takes to go through a router and depends on the number of router pipeline stages. The router contention latency $T_C$ is proportional to the traffic in a network. For each router traversed, the packet experiences router delay as well as contention delay. For a packet traveling from source core $c_i$ to destination core $c_j$, the delay can be given by
where a is the number of routers encountered on the path taken by the packet and length is the link distance traversed by the packet. Both a and length depend on the topology as well as the path chosen by the routing algorithm.
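One form that matches this description, with router and contention delay charged at each of the $a$ routers and link delay charged per unit of length, is sketched below. It is a reconstruction from the surrounding text, not a quotation of the original equation:

```latex
d_{ij} = a \, (T_R + T_C) + \mathit{length} \cdot T_L
```

Under this form, a hop-by-hop mesh path with Manhattan distance $M_{ij}$ has $a = M_{ij} + 1$ and $\mathit{length} = M_{ij}$.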
In an express channel-based network, the routing algorithm must favor the use of express channels as much as possible. Normal hop-by-hop forwarding is used only in the case of contention. Hence, dimension-order routing is typically used for express channel-based networks. Techniques such as adaptive routing [Palesi et al. 2009; Qian et al. 2012] may generate a large number of turns [Zhu et al. 2014], causing most packets to go through in a hop-by-hop fashion, thus causing greater packet latency. Dimension-order routing routes the packet first to the correct position in a higher dimension before attempting to route in the next dimension.
For a mesh-based NoC employing dimension order routing, the number of routers traversed by a packet is one more than the Manhattan distance $M_{ij}$ between the source and destination routers. All routers from source to destination contribute to the delay experienced by the packet. This is due to the hop-by-hop packet forwarding of a mesh NoC. The delay experienced by a packet traveling from router $r_i$ to $r_j$ is given in Zhu et al. [2014].
For an express channel-based mesh NoC employing dimension order routing, the number of routers traversed by a packet from source to destination is either of two cases:
- Two routers (source and destination) when the source and destination are either in the same row or column (i.e., there exists a direct express channel). The source router will directly send the packet to the destination router, except in the case of router contention.
- Three routers (source, destination, and turning point) when both source and destination are not in the same row or column (i.e., a turn needs to be made).
When the source and destination routers are not in the same row or column, a turn is necessary. The turn function $\delta_{ij}$ [Zhu et al. 2014] is used to determine whether packets from router $r_i$ to $r_j$ need to make a turn. For an express channel-based mesh network with $M$ rows and $N$ columns and nodes numbered from left to right and top to bottom, the turn function can be calculated as follows:

Using the turn function, the delay experienced by a packet traveling from router $r_i$ to $r_j$ in an express channel-based NoC is given in Zhu et al. [2014].
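As an illustration of the two delay models, the following Python sketch implements a turn function and the mesh and express channel delays as described in the text. The 0-based router numbering, the function names, and the combined per-router cost `t_hop` (standing for $T_R + T_C$) are assumptions of this sketch. The values `t_hop = 3` and `t_link = 1` are chosen only because they reproduce the delays of 12 and 11 quoted for the 8 × 8 example of Figure 3; they are not taken from the source.

```python
def position(router, n_cols):
    """Row and column of a router numbered left to right, top to bottom (0-based)."""
    return divmod(router, n_cols)


def manhattan(i, j, n_cols):
    """Manhattan distance between routers i and j."""
    (ri, ci), (rj, cj) = position(i, n_cols), position(j, n_cols)
    return abs(ri - rj) + abs(ci - cj)


def turn(i, j, n_cols):
    """Turn function: 0 if routers i and j share a row or column, else 1."""
    (ri, ci), (rj, cj) = position(i, n_cols), position(j, n_cols)
    return 0 if ri == rj or ci == cj else 1


def mesh_delay(i, j, n_cols, t_hop, t_link):
    """Hop-by-hop mesh: M + 1 routers and M unit-length links."""
    m = manhattan(i, j, n_cols)
    return (m + 1) * t_hop + m * t_link


def express_delay(i, j, n_cols, t_hop, t_link):
    """Express channels: 2 routers, plus 1 if a turn is needed (assumes i != j)."""
    return (2 + turn(i, j, n_cols)) * t_hop + manhattan(i, j, n_cols) * t_link


# 8 x 8 example from the text; nodes 1, 7 and 10 are indices 0, 6 and 9 here.
N_COLS, T_HOP, T_LINK = 8, 3, 1   # assumed values, not taken from the source
print(express_delay(0, 6, N_COLS, T_HOP, T_LINK))  # 12
print(express_delay(0, 9, N_COLS, T_HOP, T_LINK))  # 11
print(mesh_delay(0, 6, N_COLS, T_HOP, T_LINK))     # 27
```

With these assumed parameters, the router two hops away across a turn is cheaper than the router six columns away in the same row, which is the effect discussed for Figure 3.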
For an express channel-based CMesh NoC with concentration $conc_{node}$, the delay function is slightly different. Concentration implies that $conc_{node}$ cores are mapped to a single router. Increasing concentration leads to an increase in the router radix, thus increasing its complexity. In most designs, the concentration $conc_{node}$ is taken as 4. The concentration in the X direction is given by $conc_x$ and in the Y direction by $conc_y$. Hence, the concentration can be given by $conc_{node} = conc_x \times conc_y$.
In this case, when dimension order routing is used, the number of routers traversed by a packet from the source node to destination node is one of the three cases:
- One router (shared router) when the source and destination nodes share the same router.
- Two routers (source and destination) when the source and destination nodes have corresponding routers in either the same row or column (i.e., there exists a direct express channel). The source router will directly send the packet to the destination router, except in the case of router contention.
- Three routers (source, destination, and turning point) when the routers of the source and destination nodes are in neither the same row nor the same column (i.e., a turn needs to be made).
For an express channel-based CMesh network with $M \times N$ cores, the router topology will form a mesh with $M/2$ rows and $N/2$ columns. If the nodes are numbered from left to right and top to bottom such that $i, j \in \{0, 1, 2, 3, \ldots, M \times N - 1\}$, the turn function can be calculated as follows:
When the source and destination nodes share a router, the packet encounters only one router delay. The locality function $\beta_{ij}$ is used to determine whether node $i$ and node $j$ share a router. For an express channel-based CMesh network with $M$ rows and $N$ columns and the nodes numbered from left to right and top to bottom such that $i, j \in \{0, 1, 2, 3, \ldots, M \times N - 1\}$, the locality function can be calculated as follows:
Using the turn function and the locality function, the delay experienced by a packet traveling from node $i$ to node $j$ in an express channel-based CMesh NoC can be given by
where $M_{map(i)\,map(j)}$ is the Manhattan distance between the routers to which the source node $i$ and the destination node $j$ are mapped.
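The three cases for the concentrated network can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the source: nodes are laid out on an $M \times N$ grid with node $i$ at row `i // n_cols`, each router serves a 2 × 2 block of nodes ($conc_x = conc_y = 2$), the locality function returns 1 for a shared router, `t_hop` stands for $T_R + T_C$, and link length is counted in router-to-router hops.

```python
def router_of(node, n_cols, conc_x=2, conc_y=2):
    """(row, col) of the router serving a node on an M x N node grid (0-based)."""
    row, col = divmod(node, n_cols)
    return row // conc_y, col // conc_x


def locality(i, j, n_cols):
    """1 if nodes i and j share a router, else 0 (sign convention assumed)."""
    return 1 if router_of(i, n_cols) == router_of(j, n_cols) else 0


def turn(i, j, n_cols):
    """0 if the routers of nodes i and j share a row or column, else 1."""
    (ri, ci), (rj, cj) = router_of(i, n_cols), router_of(j, n_cols)
    return 0 if ri == rj or ci == cj else 1


def cmesh_delay(i, j, n_cols, t_hop, t_link):
    """Delay between nodes i and j: one, two, or three routers are traversed."""
    (ri, ci), (rj, cj) = router_of(i, n_cols), router_of(j, n_cols)
    distance = abs(ri - rj) + abs(ci - cj)      # Manhattan distance between routers
    if locality(i, j, n_cols):
        routers = 1                             # shared router
    else:
        routers = 2 + turn(i, j, n_cols)        # direct channel, or one turn
    return routers * t_hop + distance * t_link


# 8 x 8 nodes, hence a 4 x 4 router mesh; t_hop = 3 and t_link = 1 are assumed.
print(cmesh_delay(0, 9, 8, 3, 1))    # shared router
print(cmesh_delay(0, 2, 8, 3, 1))    # routers in the same row
print(cmesh_delay(0, 18, 8, 3, 1))   # a turn is required
```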
Application Mapping Problem Formulation
For an NoC, the application mapping problem can be formulated in terms of minimizing the APD of the network. As given in Zhu et al. [2014] , the problem can be formulated as follows:
Given: an application with a core graph $G(C, E)$ with $n$ cores and a network topology with a topology delay graph $T(R, L)$.
Find: the optimal solution $map(i) = j$, where $i, j \in \{1, 2, 3, \ldots, n\}$.
To minimize the APD
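To make the size of the search space concrete, the following Python sketch solves the problem by exhaustive enumeration for a very small instance. It assumes a bandwidth-weighted APD and a precomputed router-to-router delay matrix; the function names and the example traffic values are hypothetical. Since all $n!$ permutations are visited, this is only practical for a handful of cores.

```python
from itertools import permutations


def apd(comm, delay, mapping):
    """Bandwidth-weighted average delay; mapping[i] is the router of core i."""
    total = sum(comm.values())
    weighted = sum(bw * delay[mapping[i]][mapping[j]]
                   for (i, j), bw in comm.items())
    return weighted / total


def brute_force_mapping(n, comm, delay):
    """Try every assignment of n cores to n routers and keep the lowest APD."""
    best = min(permutations(range(n)), key=lambda p: apd(comm, delay, p))
    return list(best), apd(comm, delay, best)


# 2 x 2 express channel mesh with assumed delays: 7 within a row or column,
# 11 across the diagonal (one turn).
delay = [[0, 7, 7, 11],
         [7, 0, 11, 7],
         [7, 11, 0, 7],
         [11, 7, 7, 0]]
# Hypothetical core graph: {(source core, destination core): bandwidth}.
comm = {(0, 1): 100, (1, 2): 10, (2, 3): 100, (0, 3): 5}

best_mapping, best_apd = brute_force_mapping(4, comm, delay)
print(best_mapping, best_apd)
```

In this example the identity mapping places two communicating pairs on diagonals, whereas the exhaustive search finds a placement in which every communicating pair shares a row or column.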
OUR APPROACH
In this section, we present an integrated mapping and synthesis technique for express channel-based mesh networks. The application mapping problem described in Section 3 has the form of a quadratic assignment problem (QAP) [Zhu et al. 2014]. In general, a QAP is found to be NP-hard [Leighton 1992]. Such problems can be solved using ILP techniques. However, ILP techniques require a large amount of computational resources and time, and as the number of cores increases, these methods do not scale. Thus, there is a need for efficient methods that generate good mapping solutions, keeping in mind the architecture of the network topology. Two such express channel-based network topologies have been considered in our work:
- The express channel-based mesh network (topology 1) with a concentration factor of one ($conc_{node} = 1$) (i.e., each router in the network fabric has one core mapped to it).
- The express channel-based CMesh network (topology 2) with a concentration factor of four ($conc_{node} = 4$) (i.e., each router in the network fabric has four cores mapped to it).
Both of the preceding topologies share similarities in their structure; however, increase in router radix in topology 2 necessitates proper clustering of cores and optimization of intercluster communication latency, as compared to intercore communication latency in topology 1. The CMesh network is preferred in the case of a large number of cores, where increased router complexity is preferred over a large network diameter. However, in many situations, design constraints do not allow the use of high-radix routers. In such a scenario, a custom topology must be synthesized that meets the prescribed design constraints. Using the formulations proposed in Section 3, we propose to provide good mapping solutions for the given topologies using the following methods:
- An ILP-based formulation for application mapping in an express channel-based NoC
- A constructive heuristic-based approach (for topology 1)
- A modified KL clustering followed by a constructive heuristic-based approach (for topology 2)
- A constructive heuristic-based approach for mapping onto a simultaneously synthesized custom express channel topology (integrated mapping and synthesis approach) based on the maximum router radix specified by the system designer.
ILP Formulation for Express Channel-Based NoCs
This section presents an ILP formulation for the problem of mapping.
4.1.1. Parameters and Variables. The parameters and variables used in the ILP formulation are noted in Table I.

4.1.2. Objective Function. Our objective is to minimize the APD by selecting suitable routers in the NoC for mapping. The objective function can be formulated as follows. If cores $i$ and $j$ are mapped to routers $s$ and $t$ and a path exists between them in the network, $P^{st}_{ij}$ is equal to 1. This multiplied by $d_{st}$ gives the communication delay between routers $s$ and $t$ to which cores $i$ and $j$ have been mapped. Communication delay multiplied by bandwidth, averaged with bandwidth of all edges, gives the APD, which has to be minimized over all edges in the core graph. $E$ is the set of edges in the core graph, $R$ is the set of routers, and $C$ is the set of cores.
4.1.3. Constraints. The following is the set of constraints framed to solve the mapping problem:
(1) Mapping constraints:
- Each core has to be mapped onto only one router.
- Each router can have at most one core mapped onto it.
(2) Constraints for core graph edges:
- Each edge present in the core graph has to be mapped onto a path in the NoC considered.
and,
This completes the formulation. The objective function along with the constraint set can be fed to any ILP solver to get a mapping that minimizes the APD. We have used the CPLEX tool [CPLEX 2013] for this purpose. However, except in the case of very small NoCs, it takes a huge amount of computation time to arrive at the solution. Hence, we have proposed a constructive heuristic to obtain mappings for larger core graphs, which is explained in the next section.
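Since the expressions of the formulation are not reproduced here, one possible encoding of the objective and the verbal constraints of Section 4.1 is sketched below. The binary variable $x_{is}$ (equal to 1 when core $i$ is mapped to router $s$) is a hypothetical name introduced for this sketch, as Table I is not available; $P^{st}_{ij}$, $d_{st}$, and $comm_{ij}$ follow the text. The last two lines are a standard linearization and are an assumption, not necessarily the form used by the authors.

```latex
\begin{align}
\min \quad
  & \frac{\sum_{e_{ij} \in E} \sum_{s \in R} \sum_{t \in R}
          comm_{ij} \, d_{st} \, P^{st}_{ij}}
         {\sum_{e_{ij} \in E} comm_{ij}} \\
\text{s.t.} \quad
  & \sum_{s \in R} x_{is} = 1 && \forall i \in C \\
  & \sum_{i \in C} x_{is} \le 1 && \forall s \in R \\
  & \sum_{s \in R} \sum_{t \in R} P^{st}_{ij} = 1 && \forall e_{ij} \in E \\
  & P^{st}_{ij} \le x_{is}, \quad P^{st}_{ij} \le x_{jt}
    && \forall e_{ij} \in E,\ \forall s, t \in R
\end{align}
```

The denominator is a constant for a given core graph, so the objective remains linear in $P^{st}_{ij}$.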
Constructive Heuristic for Express Channel-Based Mesh NoCs
We now propose a constructive heuristic-based approach for the problem of application mapping in express channel-based NoCs assuming a single core mapped to each router. The basic idea behind the approach is to constructively generate a mapping solution such that highly communicating cores are mapped close to each other. The order of cores to be mapped is decided by considering the edge communication weights $comm_{ij}$ of the core graph.
The first step involves sorting the edges of the core graph in descending order of bandwidth requirement $comm_{ij}$. Let $e_{ij}$ be the edge with the maximum bandwidth requirement. The mapping process starts with this edge. Let $c_i$ and $c_j$ be the cores attached to this edge of the core graph. For each core, the total bandwidth is calculated, and the core with the higher communication bandwidth is selected as the first core to be mapped. The other core is mapped to a node one hop away from the first node. For a network with $n$ nodes, there exist $n$ possible start positions where this core can be mapped. However, as a mesh-based NoC is symmetric, there exist $n/4$ unique starting points. For each of these starting points, we carry out the mapping, and the solution with the lowest APD is chosen as the final mapping.

At any point during execution of the constructive heuristic, a subset $C'$ of the cores $C$ is already mapped. Let the cores belonging to $C'$ be mapped to the set of nodes $R'$. Given the nature of the algorithm, $R'$ always forms a continuous set; that is, for each $r_i \in R'$, there exists a neighboring $r_j \in R'$. The next core to be chosen is the core having the highest communication bandwidth to any core in $C'$. The possible mapping positions for this core are one hop away from $R'$. Let the unmapped chosen core of the corresponding edge be $c_k$. The core $c_k$ is added to $C'$ and is iteratively mapped to all nodes that are one hop away from the routers in the set $R'$. For each of the possible positions, the local APD of the mapped cores $C'$ is calculated. The position with the lowest local APD is chosen as the mapping, and the selected node is added to $R'$. The procedure is iterated until all of the cores have been mapped.

During the execution of the algorithm, a scenario may arise where multiple positions have the lowest local APD. For example, let the mapping of core $c_k$ to the $m$ positions of the set Min_Positions = $\{r_1, r_2, \ldots, r_m\}$ have the same local APD. In such a scenario, a mapping considering $R' \cup \{r_l\}$ is carried out separately for each of the $m$ positions $r_l$ of the Min_Positions set. For this tie-breaking mapping call, in case of a tie in APD during a subsequent local APD calculation, the core is mapped to the first element of its corresponding set. At the end, the position with the lowest APD is chosen, and the core is mapped to the corresponding position. The algorithm then iteratively goes on to map the other unmapped cores. The proposed algorithm is described next.
ALGORITHM 1: Constructive Heuristic

Constructive Mapping
Input: Core Graph G, Topology Delay Graph T
Output: Optimal [...]

The complexity of the proposed constructive heuristic is governed by the number of entries in the Mapping_Positions set; that is, the number of positions that have to be considered for mapping each core. Initially, when the number of already mapped cores is small, the set will have a small number of positions. The cardinality will be maximum when around half of the cores have been mapped. Subsequently, it will again start to reduce. The worst-case scenario for the algorithm occurs when each core communicates equally with every other core. It can be proved that for an $n$-core application, the proposed heuristic has a complexity of $O(n^5 \log n)$. Details of the proof have been omitted for the sake of brevity.
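The main loop of the heuristic can be illustrated with a simplified Python sketch. It is not the authors' Algorithm 1: ties in local APD are broken by taking the lowest-numbered position instead of the look-ahead tie-breaking described above, all $n$ start positions are tried instead of the $n/4$ unique ones, and the delay model with `t_hop = 3` and `t_link = 1` is assumed. All function names are hypothetical, and the number of cores is assumed not to exceed the number of routers.

```python
def express_delay(a, b, n_cols, t_hop=3, t_link=1):
    """Delay between routers a and b with full row/column express channels."""
    (ra, ca), (rb, cb) = divmod(a, n_cols), divmod(b, n_cols)
    routers = 2 if ra == rb or ca == cb else 3
    return routers * t_hop + (abs(ra - rb) + abs(ca - cb)) * t_link


def neighbours(router, n_rows, n_cols):
    """Routers one mesh hop away from `router`."""
    row, col = divmod(router, n_cols)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < n_rows and 0 <= c < n_cols:
            yield r * n_cols + c


def local_apd(comm, mapping, n_cols):
    """APD restricted to edges whose two cores are already mapped."""
    edges = [(i, j, bw) for (i, j), bw in comm.items()
             if i in mapping and j in mapping]
    total = sum(bw for _, _, bw in edges)
    if total == 0:
        return 0.0
    return sum(bw * express_delay(mapping[i], mapping[j], n_cols)
               for i, j, bw in edges) / total


def affinity(core, comm, mapping):
    """Highest bandwidth between `core` and any already mapped core."""
    return max((bw for (i, j), bw in comm.items()
                if (i == core and j in mapping) or (j == core and i in mapping)),
               default=0)


def constructive_map(comm, n_rows, n_cols):
    cores = sorted({c for edge in comm for c in edge})
    load = {c: sum(bw for edge, bw in comm.items() if c in edge) for c in cores}
    (ci, cj), _ = max(comm.items(), key=lambda item: item[1])
    first = ci if load[ci] >= load[cj] else cj    # busier end of the heaviest edge
    best_score, best_mapping = None, None
    for start in range(n_rows * n_cols):          # all starts, no symmetry pruning
        mapping = {first: start}
        while len(mapping) < len(cores):
            nxt = max((c for c in cores if c not in mapping),
                      key=lambda c: affinity(c, comm, mapping))
            used = set(mapping.values())
            frontier = sorted({p for r in used
                               for p in neighbours(r, n_rows, n_cols)} - used)
            # Ties resolve to the lowest-numbered position (no look-ahead).
            mapping[nxt] = min(frontier, key=lambda p: local_apd(
                comm, {**mapping, nxt: p}, n_cols))
        score = local_apd(comm, mapping, n_cols)
        if best_score is None or score < best_score:
            best_score, best_mapping = score, mapping
    return best_mapping, best_score


# Hypothetical 4-core graph mapped onto a 2 x 2 express channel mesh.
comm = {(0, 1): 100, (1, 2): 10, (2, 3): 100, (0, 3): 5}
mapping, score = constructive_map(comm, 2, 2)
print(mapping, score)
```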
KL Partitioning-Induced Constructive Heuristic for Express Channel-Based CMesh NoCs
For an express channel-based mesh network with concentration, multiple cores are mapped to the same router. This causes a single router delay between the cores sharing a router and multirouter delay among cores mapped to distinct routers. The nature of this delay makes it necessary to map a cluster of cores with a greater degree of intercommunication onto a single router. To achieve this clustering, the KL partitioning algorithm [Kernighan and Lin 1970] is applied to the core graph. The KL partitioning strategy was originally developed to partition modules between hardware and software in VLSI physical design. It creates two partitions such that highly communicating and connected modules are kept in one partition. As a preconditioning to the mapping step, we propose using a modified KL bipartitioning strategy to identify the cores that can be assigned to a single partition by analyzing their bandwidth requirements. This bipartitioning is applied recursively until the number of cores in each partition is equal to the node concentration. In most practical designs, the concentration is chosen to be four. The modified KL partitioning heuristic used is elaborated next.

Input: Core Graph G, Current Partition partitioned into two nonempty equal-sized partitions p1 and p2 such that p1 ∪ p2 = C and p1 ∩ p2 = NULL
Output: swapped_partition, swap_pair
START
cost_log = NULL
For each unlocked (c_k ∈ p1 and c_j ∈ p2)
    temporary_partition = Swap (c_k and c_j between p1 and p2)
    Append Comm_between_partitions(temporary_partition) to cost_log
End for
swapped_partition = Partition made by swap with lowest cost
swap_pair = Swap that causes lowest cost
Output swapped_partition, swap_pair
END
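The swap-evaluation step above can be written as a short Python sketch. The representation of the partitions as sets, the `locked` argument, and the helper `cut_cost` (playing the role of Comm_between_partitions) are assumptions made for illustration; the sketch covers only this single step, not the recursive bipartitioning.

```python
def cut_cost(comm, p1, p2):
    """Total bandwidth crossing between partitions p1 and p2, in both directions."""
    return sum(bw for (i, j), bw in comm.items()
               if (i in p1 and j in p2) or (i in p2 and j in p1))


def best_swap(comm, p1, p2, locked=frozenset()):
    """Evaluate every unlocked pair swap and return the cheapest one.

    Assumes at least one unlocked core remains in each partition.
    """
    cost_log = []
    for ck in sorted(p1 - locked):
        for cj in sorted(p2 - locked):
            new_p1 = (p1 - {ck}) | {cj}
            new_p2 = (p2 - {cj}) | {ck}
            cost_log.append((cut_cost(comm, new_p1, new_p2),
                             (ck, cj), (new_p1, new_p2)))
    cost, swap_pair, swapped_partition = min(cost_log, key=lambda entry: entry[0])
    return swapped_partition, swap_pair, cost


# Hypothetical core graph: a-b and c-d communicate heavily, a-c only lightly.
comm = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1}
p1, p2 = {"a", "c"}, {"b", "d"}
swapped_partition, swap_pair, cost = best_swap(comm, p1, p2)
print(swap_pair, cost, swapped_partition)
```

In this example the initial cut carries 20 units of bandwidth, and the selected swap brings each heavily communicating pair into the same partition.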
Following the clustering by the KL approach, the cluster graph $G'(Cl, E')$ is constructed. For a network with concentration $conc_{node}$, each node $cl_i \in Cl$ of the graph represents a cluster of cores $\{c_{i1}, c_{i2}, \ldots, c_{i\,conc_{node}}\} \in cl_i$ belonging to a single KL partition. The directed edges $e_{ij} \in E'$ represent the communication between the clusters $cl_i$ and $cl_j$. Each edge $e_{ij}$ has a weight denoted by $cl\_comm_{ij}$, which represents the bandwidth requirement of the communication from cluster $cl_i$ to $cl_j$. The constructive heuristic reported in Section 4.2 is subsequently applied to map the cluster graph $G'(Cl, E')$ to the network topology. Subsequently, each core belonging to the cluster is mapped to the router to which the cluster has been mapped.
Integrated Mapping and Synthesis for Express Channel-Based Mesh NoCs
From the point of view of a system designer, the high router radix required by the mentioned express channel topologies [Kim et al. 2007; Grot et al. 2009] may make them practically infeasible. Based on the constructive heuristic described in Section 4.2, we now propose an integrated mapping and synthesis technique. The idea behind the approach is to simultaneously perform mapping and synthesis of the topology. For each router, the radix is limited by the upper bound set by the system designer. The constructive nature of the algorithm gives priority for highly communicating cores to have express channels between them. The proposed approach is similar to the constructive heuristic described in Section 4.2, with some differences that are elaborated next.
The traditional mesh topology is taken as the starting point. As the mapping proceeds, express channels are added based on the communication requirements of the application. As mentioned in Section 4.2, at any point during execution of the constructive heuristic, a subset of the cores C is already mapped. The cores belonging to C are mapped to the set of routers R. Given the nature of the algorithm, R always forms a continuous set; that is, for each r_i ∈ R, there exists a neighboring r_j ∈ R. For the next core to be mapped, the possible mapping positions are one hop away from R. For each of these positions, the local APD calculation depends on the following factors:
-The express channels that have already been synthesized until that point of execution
-Whether the router radix at each router permits the addition of express channels between the position being considered and another core (with which the core being mapped communicates)
-If an express channel cannot be added or does not exist, then hop-by-hop forwarding must be considered.
The position with the lowest local APD is chosen as the position of mapping, and the selected node is added to R . Subsequently, the Make_Connections subroutine is invoked, and the express channels are added based on the router radix specified and the communication requirement between the mapped core and the previously mapped cores. The Make_Connections subroutine updates the Router_Interconnect_Matrix and the Router_Radix_Array, which respectively keep track of the network interconnections and the individual router radices. The rest of the procedure is similar in all respects to the heuristic proposed in Section 4.2. For greater clarity, the entire procedure is elaborated next.
ALGORITHM 3: Integrated Mapping and Synthesis
Constructive Mapping and Synthesis
Input:
Core
If the upper bound for the router radix is set equal to the maximum possible router radix (for a given express channel network, assuming all express links in the same row and column), we would obtain a mapping that is identical to the one using the constructive heuristic proposed in Section 4.2. However, in this case, the synthesized topology would contain only the express channels that are essential to the application (assuming dimension order routing). Hence, we obtain a topology with the same theoretical delay as the topology that contains all of the express channels.
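The link-addition step performed by Make_Connections can be sketched as follows. This is only an illustration under stated assumptions: adjacency sets stand in for the Router_Interconnect_Matrix, a dictionary stands in for the Router_Radix_Array, express channels are restricted to routers in the same row or column, and candidate partners are served in decreasing order of traffic. The helper `mesh_init` and all parameter names are hypothetical.

```python
def mesh_init(rows, cols):
    """Plain mesh: adjacency sets (stand-in for Router_Interconnect_Matrix)
    and per-router radix (stand-in for Router_Radix_Array)."""
    interconnect, radix = {}, {}
    for r in range(rows):
        for c in range(cols):
            nbrs = {(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols}
            interconnect[(r, c)] = nbrs
            radix[(r, c)] = 1 + len(nbrs)   # one local port plus the mesh ports
    return interconnect, radix

def make_connections(new_pos, new_core, mapped, bw, interconnect, radix, radix_limit):
    """Add express channels between new_pos and previously mapped partners.

    mapped : dict core -> router position of the previously mapped cores
    bw     : dict (src_core, dst_core) -> bandwidth
    Returns the list of router positions that received a new express channel.
    """
    partners = []
    for core, pos in mapped.items():
        traffic = bw.get((new_core, core), 0) + bw.get((core, new_core), 0)
        same_line = pos[0] == new_pos[0] or pos[1] == new_pos[1]
        if traffic > 0 and same_line and pos != new_pos and pos not in interconnect[new_pos]:
            partners.append((traffic, pos))
    added = []
    for traffic, pos in sorted(partners, reverse=True):   # heaviest traffic first
        if radix[new_pos] < radix_limit and radix[pos] < radix_limit:
            interconnect[new_pos].add(pos)
            interconnect[pos].add(new_pos)
            radix[new_pos] += 1
            radix[pos] += 1
            added.append(pos)
    return added

interconnect, radix = mesh_init(1, 4)
mapped = {"A": (0, 0), "B": (0, 1)}
bw = {("C", "A"): 10, ("B", "C"): 5}
added = make_connections((0, 3), "C", mapped, bw, interconnect, radix, radix_limit=3)
```

With a radix limit of 3, the single spare port at router (0, 3) goes to the partner with the higher traffic, and communication with the other core falls back to hop-by-hop forwarding.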
EXPERIMENTAL RESULTS
In this section, we first present the simulation methodologies and setup used for the experimentation. Subsequently, we present the simulation results for both the application mapping and the integrated mapping and synthesis for express channel networks. The simulations have been performed on several benchmark applications available in the literature [Leighton 1992]. To demonstrate the efficiency of the proposed heuristic on larger NoCs, synthetic benchmarks with 64 and 128 cores have been generated using the TGFF tool [Dick et al. 1998]. These synthetic application core graphs are named G1 through G6 in the tabulated results. We organize the results as follows. First, we present the results of the mapping onto the mesh-based express channel topology. This is followed by the results of the mapping onto the CMesh-based express channel topology. Last, we present the results of the integrated mapping and synthesis technique proposed in this work.
Simulation Methodologies
To demonstrate the robustness of the proposed approaches, both static and dynamic simulations have been undertaken.
5.1.1. Static Performance Analysis. In the static simulations, the APD of the obtained mapping solution is computed using the relevant delay models presented in Section 3. The mappings are obtained using the proposed methods, and they are compared on the basis of the computed APD. The APD calculation is performed by considering a three-stage pipelined router. Hence, the router delay T_R has been considered as three cycles and the unit link delay T_L as one cycle. We assume that the links of the network have sufficient bandwidth to accommodate the communication requirement of the application. Therefore, we can safely assume that the network does not suffer a high degree of contention. Thus, the contention delay T_C has been considered as one cycle. The runtime of the mapping methods has also been compared on a personal computer with an Intel Core-i5 processor.
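To make the role of these parameters concrete, the following Python sketch computes a bandwidth-weighted APD for a given mapping. The delay model used here is a simplifying assumption and not the model of Section 3: each router on the path is charged T_R + T_C cycles, each unit of link length is charged T_L cycles, and with express channels a packet visits only the source router, the turning router, and the destination router. The function names and the input format are hypothetical.

```python
T_R, T_L, T_C = 3, 1, 1   # cycles, as stated in the text

def path_delay(src, dst, express=True):
    """Delay in cycles between router positions (row, col) under the assumed model."""
    dr, dc = abs(src[0] - dst[0]), abs(src[1] - dst[1])
    distance = dr + dc                       # total link length in unit links
    if express:
        routers = 1 + (dr > 0) + (dc > 0)    # source, optional turn, destination
    else:
        routers = distance + 1               # every intermediate mesh router
    return routers * (T_R + T_C) + distance * T_L

def average_packet_delay(bw, position, express=True):
    """Bandwidth-weighted average of the path delays over all communicating pairs."""
    total = sum(bw.values())
    weighted = sum(w * path_delay(position[s], position[d], express)
                   for (s, d), w in bw.items())
    return weighted / total

position = {"a": (0, 0), "b": (0, 3), "c": (2, 3)}
bw = {("a", "b"): 30, ("a", "c"): 10}
apd_express = average_packet_delay(bw, position, express=True)
apd_mesh = average_packet_delay(bw, position, express=False)
```

Under these assumptions the express channel topology gives the lower APD for the same mapping, because long paths bypass the intermediate routers.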
5.1.2. Dynamic Performance Analysis. The dynamic simulations are performed using a modified version of the cycle-accurate open source BookSim interconnection network simulator [Jiang et al. 2013]. The BookSim simulator has been modified to perform simulation of a defined NoC topology with an application mapped onto it. The modified simulator generates custom traffic patterns based on the communication graph between the application cores and their mapped positions onto the given topology. Through these simulations, the APD, average packets injected (API or throughput), and total network power are obtained. The APD and API values are obtained using a three-stage pipelined router model. The network power estimation is performed using the Orion 2.0 [Kahng et al. 2009
Comparisons for Mapping onto Mesh-Based Express Channel NoCs
For the express channel-based mesh NoC, we compare our proposed mapping technique to the TRAM approach [Zhu et al. 2014] and the ILP formulation described in Section 4.1. To the best of our knowledge, TRAM is the only method that takes into account the nature of express channels. In Zhu et al. [2014], it has been shown that TRAM performs better than the Monte Carlo and simulated annealing methods. Hence, TRAM [Zhu et al. 2014] has been implemented and chosen as the benchmark for comparison with our proposed approach. For small core graphs (up to 16 cores), the results obtained using the proposed method have been compared to the mappings obtained by the ILP formulation. The ILP problem has been solved using the CPLEX solver [CPLEX 2013]. This comparison is used to demonstrate the quality of the solution obtained by the proposed constructive heuristic. For larger benchmarks, the ILP formulation failed to run. The static simulation comparisons of APD, runtime, and the percentage of traffic making turns for various benchmark applications can be found in Table II, and the dynamic simulation comparisons for APD, API, and network power can be found in Table III.
From the results of Table II, it can be noted that for the mesh-based express channel NoC, compared to TRAM [Zhu et al. 2014], the proposed heuristic on average results in an 8.69% reduction in APD cycles, with the maximum reduction in APD being around 27.67%. The average reduction in the percentage of packets that make turns is about 1.08%. The maximum reduction in the percentage of turns is about 8.01%. This establishes the superiority of the proposed heuristic over TRAM [Zhu et al. 2014]. For the application graph G4, we observe a decrease in APD together with an increase in the percentage of traffic that takes a turn. This validates the notion stated in the example of Figure 3 that mapping communicating cores to different rows or columns may increase the percentage of traffic making turns but reduce the APD. The dynamic simulation results in Table III further support the superiority of the mappings obtained using the proposed approach.
Comparisons for Mapping onto CMesh-Based Express Channel NoCs
For the CMesh-based express channel network, to ensure fair comparison, the proposed approach has been compared to the KL partitioning-guided TRAM [Zhu et al. 2014]. For the CMesh NoCs, as applications with lower core count have a trivial mapping, we have compared the methods using 64- and 128-core synthetic benchmarks. The static simulation comparisons of APD, runtime, and the percentage of traffic making turns for various benchmark applications can be found in Table IV, and the dynamic simulation comparisons for APD, API, and network power can be found in Table V. From Table IV, for the CMesh-based express channel NoC, it can be noted that compared to TRAM [Zhu et al. 2014], the proposed heuristic on average results in a 1.87% reduction in APD cycles, with the maximum reduction being about 3.90%. The average reduction in the percentage of packets that make turns is about 1.08%. The maximum reduction in the percentage of turns is about 2.91%.
The dynamic simulation results demonstrate the superiority of the mapping solutions obtained using the proposed approach compared to the existing TRAM approach [Zhu et al. 2014]. The proposed constructive heuristic yields a mapping solution that significantly reduces both APD and network energy consumption.
Comparisons for Integrated Mapping and Synthesis for Express Channel Networks
In this section, we present the results of the proposed integrated mapping and synthesis technique. To the best of our knowledge, this work is the first of its kind. Hence, we cannot compare it directly to TRAM [Zhu et al. 2014]. Therefore, we have performed an indirect comparison with the TRAM approach. Since TRAM is not constructive in nature, no provision can be made for simultaneous synthesis of a custom topology. Thus, we generate a mapping solution using TRAM and remove the express channels that would not be used, assuming perfect dimension order routing between the cores. We take this to be the custom topology synthesized using TRAM. To ensure fair comparison, we generate a solution using our approach by setting the router radix limit to the maximum possible radix for a given topology. For example, in a 4 × 4 express channel mesh network, each router would have a radix of 7 (one port to the local core and six ports to the other routers in the same row and column). We then compare this synthesized topology with the one generated using TRAM.
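The two ingredients of this comparison, the maximum radix of an all-to-all (row and column) express channel mesh and the pruning of unused express channels, can be sketched in Python as follows. The sketch assumes X-then-Y dimension order routing, bidirectional links, and that a link of unit length is an ordinary mesh link rather than an express channel; the function names and input format are hypothetical.

```python
def max_radix(rows, cols):
    """One local port plus a port to every other router in the same row and column."""
    return 1 + (cols - 1) + (rows - 1)

def used_express_links(bw, position):
    """Express links actually traversed under X-then-Y dimension order routing.

    bw       : dict (src_core, dst_core) -> bandwidth
    position : dict core -> (row, col) of the router it is mapped to
    Returns a set of frozensets, each holding the two endpoints of a used link.
    """
    used = set()
    for (s, d), w in bw.items():
        if w <= 0:
            continue
        (r1, c1), (r2, c2) = position[s], position[d]
        turn = (r1, c2)                        # move along the row, then the column
        for a, b in (((r1, c1), turn), (turn, (r2, c2))):
            length = abs(a[0] - b[0]) + abs(a[1] - b[1])
            if length > 1:                     # length 1 is a regular mesh link
                used.add(frozenset((a, b)))
    return used

position = {"a": (0, 0), "b": (0, 3), "c": (2, 3), "d": (0, 1)}
bw = {("a", "b"): 5, ("a", "c"): 2, ("a", "d"): 1}
links = used_express_links(bw, position)
```

Every express channel of the all-to-all topology that does not appear in the returned set would be removed when forming the pruned topology.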
In Table VI, we compare the two synthesized topologies on the basis of the maximum router radix required, the average router radix required, and the number of synthesized express links required. We also compare these topologies with the mesh-based express channel NoC with all-to-all express links (within the same row and column). This comparison has been done to illustrate the benefit of using a semicustom topology compared to general-purpose topologies [Kim et al. 2007]. For the semicustom topologies, the static simulation comparisons of APD, runtime, and the percentage of traffic making turns for various benchmark applications can be found in Table VII, and the dynamic simulation comparisons for APD, API, and network power can be found in Table VIII.
Finally, we present some simulation results for the integrated mapping and synthesis technique on several benchmarks. Based on the benchmark, the APD and number of express channels have been compared for different router radices in Table IX. The smallest router radix considered is 5 (assuming one port to the local core and four ports to the adjacent routers). The largest router radix considered depends on the Table X.
From Table VI, for the semicustom synthesized NoCs, it can be noted that compared to the modified TRAM [Zhu et al. 2014], the proposed integrated mapping and synthesis technique on average results in a 45.21% reduction in the number of express channels required, with the maximum reduction being about 91.18%. The average reduction in the average router radix is about 9.75%. The maximum reduction in the average router radix is about 18.61%.
From Table VII, it can be observed that the APD and the percentage of packets that make turns are identical to the results reported in Table II. This implies that for a given application and mapping technique, the semicustom synthesized NoCs have the same mapping and theoretical APD as the mesh-based express channel NoC if we assume perfect dimension order routing. When runtimes are compared, we observe a slight increase compared to those reported in Table II. The results of Table VII also indicate that the integrated mapping and synthesis technique gives a lower theoretical APD, despite requiring lower radix routers and fewer express channels compared to modified TRAM. The dynamic simulation results from Table VIII demonstrate the superiority of the integrated mapping and synthesis approach compared to the modified TRAM approach [Zhu et al. 2014]. We observe that the reduction of router radix and number of express channels (as shown in Table VI) leads to a significant reduction in network power consumption. Compared to the NoCs synthesized using modified TRAM, on average we obtain a power reduction of 19.18% for the NoCs synthesized using the integrated approach. The maximum reduction in power consumption was found to be 40.37%. From Table VIII, we can observe that the power reduction is especially significant for larger NoCs. In Section 5.2, the dynamic simulation results for the mesh-based express channel NoCs with all-to-all links (within the same row or column as the source router) for several benchmarks were reported in Table III. Comparing the results from Tables III and VIII, we observe that the semicustom express channel NoCs synthesized using the integrated mapping and synthesis technique show a large reduction in power consumption with a negligible increase in APD. This substantiates the results of Table VII, which indicated no theoretical increase in APD. Comparing the results, we find that on average a 65.55% reduction in network power consumption can be observed. The maximum reduction in power is noted to be 94.39%. This power reduction can be attributed to the significant reduction in the router radix and number of express channels in the synthesized semicustom express channel NoCs.
Fig. 4. A 16-core VOPD application NoC with all-to-all express channels (within the same row or column as the source router) (a), a synthesized NoC with router radix limit = 6 (b), and a synthesized NoC with router radix limit = 5 (c). Black blocks indicate the mapped cores.
The static and dynamic results of Tables IX and X indicate a slight degradation in network performance when the router radix is reduced. However, the degradation in performance is minimal, and it can be observed that reducing the router radix achieves a significant reduction in network power.
In Figure 4, the NoCs synthesized using the integrated mapping and synthesis technique for the 16-core VOPD application are illustrated for different router radices. The change in APD and network power with respect to router radix for the 128-core G5 application (generated using the TGFF tool [Dick et al. 1998]) is illustrated in Figure 5.
CONCLUSION AND FUTURE WORK
Express channel-based NoCs have emerged as a viable alternative to solve the inherent deficiencies of a mesh-based NoC. In this article, an efficient constructive heuristic has been proposed for both the application mapping problem and the semicustom NoC synthesis problem for express channel-based networks. The proposed method outperforms the previously proposed TRAM approach [Zhu et al. 2014]. The constructive heuristic effectively maps highly communicating cores close to each other, thus providing a mapping solution with significantly lower APD and fewer turns. The integrated mapping and synthesis technique provides for simultaneous mapping and link synthesis based on the router radix restrictions set by the designer. The constructive nature of the approach favors link addition between highly communicating cores when the radix is restricted. This minimizes degradation in performance due to system design constraints. Significant power savings have also been obtained using the proposed approach. This work can be extended to other topologies such as 3D NoCs and topologies using application-specific routing techniques. Network congestion models can also be added to the cost function to make the proposed technique congestion aware.
