A core mapping method for reconfigurable network-on-chip (NoC) 
allow the network to dynamically change its connections and thus, its topology.
In this work, we first specify the set of applications integrated onto the NoC. The physical core mapping is performed based on the average communication pattern of all applications. Afterwards, for each application, we use a branch-and-bound algorithm to find a network topology that optimizes the NoC parameters of interest, such as power and performance. This topology, which is formed by switch configurations, is loaded into the network when the corresponding application starts.
The topologies proposed for on-chip networks vary from regular tile-based structures [5], [6] to fully customized structures [7], [10], [11]. Since fully customized NoCs are designed and optimized for a specific application, they give the best performance and power results for that application. On the other hand, regular NoC architectures provide standard structured interconnects, which ensure well-controlled electrical parameters. Moreover, usual physical design problems such as crosstalk, timing closure, and wire routing, as well as architectural issues such as routing and switching strategies and network protocols, can be solved and optimized once for a regular NoC and reused in several SoCs. The NoC architecture proposed in this paper lies between these two extremes of NoC design. While this type of NoC can be designed and optimized like a regular NoC, it can be dynamically configured to the topology that best matches the current application.
In the following sections, we introduce the reconfigurable NoC architecture and evaluate it in terms of power and performance gains and imposed area overheads.
Dynamically reconfigurable NoCs
The NoC used in this research is a reconfigurable mesh-based interconnection network [12]. In this topology, the routers are not connected directly to each other, but rather through simple switches. These switches can be configured to establish a direct static connection between two or more incident data links. Figure 1.a shows such a network, while Figures 1.b and 1.c display the network configured as a mesh and as a binary tree, respectively, by properly setting the simple switches. In general, dynamic hardware reconfiguration can only be implemented on dynamically reconfigurable devices; hence, most reconfigurable architectures are implemented using FPGAs. Since the reconfigurable part of the proposed NoC is limited to the network switches, this NoC can be implemented not only on FPGAs but also on ASIC platforms.
The structure of the simple switch is depicted in Figure 2. The switch box consists of six small switches, each of which can connect two incoming links. Tri-state gates, transmission gates, and single-transistor switches are three options for implementing such small switches. A detailed power analysis of these switching options can be found in [13]. Here, we use transmission gates as the configuration switches. Transmission gates add two NMOS and two PMOS drain capacitances to the capacitance of the link into which they are embedded. These extra capacitances, calculated using the Orion power model [14] for 70 nm technology (the smallest feature size available in Orion), show that the capacitance of a default-size transmission gate is 6×10⁻¹⁵ F. We set the length of a wire segment to 500 µm (Figure 2). With this wire length, which is realistic in current NoCs, the switch overhead is affordable.
Another consideration in the proposed topology is the long wire links that may be generated by merging a number of wire segments. Such long wires may decrease the interconnection clock frequency and result in a performance overhead. To solve this problem, we can put a 1-flit buffer at the end of each wire segment in each switch box. Such a buffer (which is inspired by pipelined circuit switching methods in conventional interconnection networks [15]) provides pipelining over the wire and also acts as a repeater for it.
We used Orion to calculate the power consumption of these 1-flit buffers for 32-bit links in 70 nm technology. Each buffer consumes a static power of 3.1×10⁻⁵ W and, on average, a dynamic power of 6.4×10⁻⁵ W. Since an n×n network has (2×(n-1))² wire segments, inserting a buffer in each segment of a 4×4 mesh, for example, results in a total power of 3.42×10⁻³ W, which is considerably lower than the total NoC power values we present in the next section.
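The buffer power figure above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the per-buffer values and the segment-count formula quoted in the text; the function and constant names are illustrative and not part of the original work.

```python
# Per-buffer values quoted in the text (Orion, 70 nm, 32-bit links).
STATIC_POWER_W = 3.1e-5    # static power of one 1-flit buffer
DYNAMIC_POWER_W = 6.4e-5   # average dynamic power of one 1-flit buffer


def wire_segments(n: int) -> int:
    """Number of wire segments in an n x n reconfigurable mesh: (2(n-1))^2."""
    return (2 * (n - 1)) ** 2


def total_buffer_power(n: int) -> float:
    """Total power in watts with one 1-flit buffer per wire segment."""
    return wire_segments(n) * (STATIC_POWER_W + DYNAMIC_POWER_W)


if __name__ == "__main__":
    print(wire_segments(4))                  # 36
    print(f"{total_buffer_power(4):.2e} W")  # 3.42e-03 W
```

For the 4×4 case this gives 36 segments at 9.5×10⁻⁵ W each, which matches the 3.42×10⁻³ W stated above.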
Mapping and reconfiguration method
In this section, we address the mapping and routing problems in the proposed reconfigurable mesh.
We first specify the set of different applications integrated onto the NoC. Any application running on a NoC is described as a Communication Task Graph (CTG), where each vertex represents an IP-core, and a directed edge characterizes the data transfer between the two cores. The communication volume corresponding to every edge is also provided. It is assumed that the designer has selected a set of IPs and assigned/scheduled the tasks onto these IP cores. We also assign a weight to every task-graph based on the percentage of time that the corresponding application is run on the NoC. Assigning weights enables the designer to focus more fully on major or more critical applications of the NoC. The problem is to map the cores onto different tiles of a reconfigurable mesh network and then find a topology for each task-graph and a path for every communication in the graph such that the power consumption of the NoC is minimized. The found paths should not violate the bandwidth of the network links (router input ports) to avoid congestion.
In the first step, our objective is to map the IP cores physically onto the tiles of a mesh network such that the distance between communicating cores is minimized. This step is performed on a task graph obtained as a weighted average of all the application task graphs. The vertices of the average task graph are the IP cores, and each edge value is calculated by taking the weighted average of the corresponding edge values in all task graphs. If an edge does not exist in a task graph, its value is taken as 0 for that task graph.
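The construction of the average task graph can be sketched as follows. This is an illustrative sketch only: the dictionary representation of a task graph, the names `TaskGraph` and `average_task_graph`, and the normalization by the sum of the weights are assumptions made for the example, not details given in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]           # (source core, destination core)
TaskGraph = Dict[Edge, float]    # edge -> communication volume


def average_task_graph(graphs: List[TaskGraph], weights: List[float]) -> TaskGraph:
    """Weighted average of edge volumes; a missing edge contributes 0."""
    if len(graphs) != len(weights):
        raise ValueError("one weight per task graph is required")
    total = sum(weights)  # assumption: weights are normalized by their sum
    avg: Dict[Edge, float] = defaultdict(float)
    for graph, weight in zip(graphs, weights):
        for edge, volume in graph.items():
            avg[edge] += weight * volume / total
    return dict(avg)


if __name__ == "__main__":
    # Hypothetical volumes; the weights mirror the 0.5 / 0.3 / 0.2 split
    # used for the benchmarks in the evaluation.
    base = {("A", "B"): 100.0, ("B", "C"): 40.0}
    variant1 = {("A", "B"): 20.0}
    variant2 = {("C", "A"): 50.0}
    print(average_task_graph([base, variant1, variant2], [0.5, 0.3, 0.2]))
```

An edge that appears in only one graph, such as C to A here, still enters the average graph, but with a value scaled down by the weight of the graph that contains it.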
The core mapping can be accomplished using an existing method based on the average graph and without taking the reconfiguration into account. Here, we have used NMAP, a power-aware core mapping algorithm presented in [6]. NMAP uses a heuristic approach that maps the task graph nodes onto a mesh-based network and generates a route for every task-graph edge.
Each route generated for a communication (a task-graph edge) is composed of a combination of wires and routers. In this step, we try to reduce the number of network hops (routers) between the source and destination nodes of a high-volume communication trace, or ideally connect them directly, by bypassing one or more intermediate routers with a wire. If some switch configurations (set by previous routes) do not allow the nodes to be connected directly, the route can be made of a combination of routers and wires.
The dynamic power consumption of a router and of a wire of length 500 µm are compared in Table 1 for different traffic loads. The wire power is the sum of the power of the wire, the transmission gate, and a 1-flit buffer. The router power is calculated based on a typical router in a mesh-based interconnection network. Network load denotes the probability of receiving a flit in a cycle. Although the table compares the power of routers and links under the same load, the load of a router is the sum of the loads of its incoming links and is therefore larger than the load of any individual link connected to it. As the table indicates, the power consumption of a wire segment is 4 to 5 times lower than that of a router with the same load. Consequently, from the point of view of power consumption, it is advantageous to decrease the router loads and shift the loads onto the links.
In this work, route generation is done using a branch-and-bound algorithm. In this phase, the task graph edges are selected in order of their communication volumes. Selecting the unmapped edge with the maximum communication volume and starting from the source router of the edge, the algorithm makes a new branch by adding to the path a router or switch adjacent to the current node. If the current node is a router, the path can be extended through each of its four neighboring switches (or routers, if the router was connected to another router when the paths for previous edges were established). However, the switches may have been configured by previous edges and so can extend the path only in the directions that their configuration allows. Every path has a cost that depends on the number of routers and switches the path contains. According to the aforementioned power analysis, we assign a cost of 1 to a wire segment and a cost of 5 to a router. After finding the best path for an edge, the switches of the network are set according to the found path and the algorithm continues with the next edge.
A branch (a path) is bounded (discarded) under two conditions. First, it is bounded if adding the node violates the bandwidth constraint of the newly added link. Second, the path is discarded if the cost of a partial path ending at a node is larger than the minimum cost of the partial paths that have already ended at that node, or larger than the minimum cost of the completed paths.
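The route search and its two bounding rules can be illustrated with a small depth-first sketch. This is a simplified sketch under stated assumptions, not the implementation used in this work: the adjacency-list graph model, the names `find_route`, `free_bw`, and `allowed`, the choice to charge the router cost only for intermediate routers, and the toy topology in which two switches are directly adjacent are all hypothetical. Only the costs of 1 per wire segment and 5 per router come from the text.

```python
from typing import Dict, List, Optional, Set, Tuple

WIRE_COST = 1     # cost of one wire segment (from the text)
ROUTER_COST = 5   # cost of one router (from the text)


def find_route(
    adj: Dict[str, List[str]],
    routers: Set[str],
    src: str,
    dst: str,
    volume: float,
    free_bw: Dict[Tuple[str, str], float],
    allowed: Optional[Dict[Tuple[str, str], Set[str]]] = None,
) -> Tuple[Optional[List[str]], float]:
    """Depth-first branch-and-bound search for the cheapest src -> dst path."""
    allowed = allowed or {}
    best_path: Optional[List[str]] = None
    best_cost = float("inf")
    best_at: Dict[str, float] = {src: 0.0}  # cheapest partial path per node

    def extend(node: str, prev: Optional[str], path: List[str], cost: float) -> None:
        nonlocal best_path, best_cost
        if node == dst:
            best_path, best_cost = list(path), cost
            return
        nexts = adj[node]
        if prev is not None and (node, prev) in allowed:
            # Switch configured by an earlier edge: only permitted exits remain.
            nexts = [n for n in nexts if n in allowed[(node, prev)]]
        for nxt in nexts:
            if nxt in path:
                continue  # no cycles
            if free_bw.get((node, nxt), 0.0) < volume:
                continue  # bound 1: link bandwidth would be violated
            new_cost = cost + WIRE_COST
            if nxt in routers and nxt != dst:
                new_cost += ROUTER_COST  # assumption: intermediate routers only
            if new_cost > best_at.get(nxt, float("inf")) or new_cost >= best_cost:
                continue  # bound 2: not better than known partial/complete paths
            best_at[nxt] = new_cost
            path.append(nxt)
            extend(nxt, node, path, new_cost)
            path.pop()

    extend(src, None, [src], 0.0)
    return best_path, best_cost


if __name__ == "__main__":
    # Hypothetical toy network: three routers in a row, two switches between them.
    adj = {"R0": ["S0"], "S0": ["R0", "R1", "S1"], "R1": ["S0", "S1"],
           "S1": ["R1", "S0", "R2"], "R2": ["S1"]}
    bw = {(a, b): 1000.0 for a in adj for b in adj[a]}
    print(find_route(adj, {"R0", "R1", "R2"}, "R0", "R2", 100.0, bw))
    # (['R0', 'S0', 'S1', 'R2'], 3.0)
```

In the toy network the search first completes the path through router R1 at cost 9, then finds the path that bypasses R1 at cost 3, which illustrates why bypassing a router with wire segments is preferred under the 1-to-5 cost ratio.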
Although the branch-and-bound algorithm finds the best route for an edge, its global strategy, which finds routes for edges in order of their communication volume, is greedy. In general, the path generated for a higher-volume edge can set the switch connections in such a way that the path cost of a large number of other edges is increased. Therefore, this algorithm may not find the optimal configuration that minimizes the total energy consumption for a task graph. Since the problem of finding the optimal set of routings is NP-hard [17], we use a simple evolutionary algorithm to further improve the quality of the solutions. In the evolutionary algorithm, diversity in the population is created by different routing options between the nodes. More precisely, for every edge, we select a random path among the possible shortest paths found. Each solution is represented by the set of paths generated for the edges of the task graph. By keeping the paths, we implicitly keep the configuration of the switches as well. The population size is a tunable parameter, p, which is currently set to 5. Crossover is the only evolutionary operation we apply. This operator selects two solutions at random as parents. Afterwards, it considers the task graph edges in order of communication volume and, for every edge, chooses the better of the two parental routes (the one with the minimum communication cost between the source and sink nodes) and copies it to the new solution.
In case of a conflict between the configuration of the already configured switches in the offspring (the new solution) and the currently selected best route, the offspring takes the route from the other parent. If the conflict persists, a new path for this edge is generated by the branch-and-bound algorithm explained before. A solution is evaluated based on a fitness metric, which is the cumulative power generated by the traffic traversing the network. To this end, the loads of all links and routers are calculated, and the power consumption of every router and link is computed according to its load using the Orion power library. The fitness is the sum of the power of all routers and links. Afterwards, if the offspring consumes less power than its parents, it replaces the lower-quality parent.
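The crossover step described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the solution representation, the function name `crossover`, and the three callables `route_cost`, `conflicts`, and `reroute` are hypothetical placeholders for the path cost, the switch-conflict check, and the branch-and-bound fallback, whose details are not given here.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]
Route = List[str]
Solution = Dict[Edge, Route]   # one route per task-graph edge


def crossover(
    parent_a: Solution,
    parent_b: Solution,
    volumes: Dict[Edge, float],
    route_cost: Callable[[Route], float],
    conflicts: Callable[[Route, Solution], bool],
    reroute: Callable[[Edge, Solution], Route],
) -> Solution:
    """Build an offspring edge by edge, highest communication volume first."""
    child: Solution = {}
    for edge in sorted(volumes, key=volumes.get, reverse=True):
        # Prefer the cheaper of the two parental routes for this edge.
        first, second = sorted((parent_a[edge], parent_b[edge]), key=route_cost)
        if not conflicts(first, child):
            child[edge] = first
        elif not conflicts(second, child):
            child[edge] = second  # fall back to the other parent's route
        else:
            child[edge] = reroute(edge, child)  # e.g. a fresh branch-and-bound search
    return child


if __name__ == "__main__":
    # Hypothetical routes; path length stands in for the route cost.
    vol = {("A", "B"): 10.0, ("C", "D"): 5.0}
    pa = {("A", "B"): ["A", "s1", "B"], ("C", "D"): ["C", "s1", "s2", "D"]}
    pb = {("A", "B"): ["A", "r", "s3", "B"], ("C", "D"): ["C", "s4", "D"]}
    print(crossover(pa, pb, vol, len, lambda r, c: False, lambda e, c: []))
```

After the offspring is built, its fitness would be computed as the summed router and link power and compared with that of the parents, as the paragraph above describes.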
The generation of new solutions continues for a specified number of generations (10 in this work). Afterwards, the best solution, which consists of the switch configurations and paths, is saved and will be loaded into the network when the corresponding application starts. After a configuration is loaded, the switch configuration part of the solution configures the network switches, and the paths in the solution set the routing tables of the network routers.
Evaluation
To validate the performance of the proposed mapping methodology, we perform simulations on a set of application task-graphs used as benchmarks. The benchmark set includes two random graphs (random graph 1 with 25 nodes and 30 edges and random graph 2 with 25 nodes and 20 edges) together with some existing SoC designs that have been used in a number of other papers: an MPEG4 decoder [10], a multi-window display (MWD) [10], an object plane decoder (OPD) [6], and a 263 decoder mp3 decoder [18]. These benchmarks describe rather small NoCs, but they are taken from other papers on NoC mapping. Although these SoCs have a single task graph, we generate two additional task graphs for each of them by modifying the base task graph and integrate the three graphs into a NoC. We also assign a weight of 0.5 to the base task-graph and weights of 0.3 and 0.2 to the other graphs. For example, the task-graph of the object plane decoder is depicted in Figure 5.a, and the two additional graphs we generated are depicted in Figures 5.b and 5.c. The edge tags represent the communication volume in Mbit/s. The physical mapping is performed based on the average graph, and then a network topology is generated for each task-graph. Figures 5.d, 5.e, and 5.f display the topologies generated for the graphs in Figures 5.a, 5.b, and 5.c, respectively.
We map the task-graphs onto reconfigurable meshes with a flit size of 32 bits in 70 nm technology. The NoC working frequency is 250 MHz. More precisely, we set these parameters in the Orion power library integrated in our simulator. Although the actual bit-width of NoC links may be higher than 32, we used this bit-width since most of the power components scale equally with the bit-width, so the bit-width has no effect on the comparison results. The bit-width can only affect the static power of the transmission-gate switches. However, this static power changes linearly with the number of transmission gates, and when the number of gates is increased from 32 to larger values (256, for example), the static power is still considerably lower than the total NoC power. As mentioned before, we estimate the power of a NoC based on the loads of its routers and links. Table 2 displays the power consumption of the NoC for three cases. The first case maps the cores based on the average task graph but does not use reconfiguration and performs the routing as in conventional fixed-connection meshes. The second and third cases use the proposed reconfigurable mapping and routing without and with evolutionary optimization, respectively (values are in watts). Since we intend to compare the proposed architecture with the traditional mesh, we ignore power components that are common to the two mechanisms, such as router static power. An analysis of the static power of the links was presented in a previous subsection.
The parameters of the evolutionary algorithm are set as described in the previous section. The results show that reconfiguration can adapt the topology to the application and effectively reduce the power consumption of the NoC by 32%, on average. As the difference among the traffic patterns of the applications integrated into a NoC increases, the proposed reconfiguration method reduces the power consumption more effectively. For example, the two additional graphs generated for the MPEG benchmark result from a slight modification of the original task-graph, so the impact of reconfiguration on its power consumption is lower than for the other benchmarks, whose additional graphs are generated with more changes. Moreover, the two random task-graphs show that the impact of reconfiguration increases with the number of graph edges. When the number of edges increases, placing the source and destination of every connection near each other becomes difficult, a problem that can be alleviated by reconfiguration.
Obviously, although we focus on power consumption, reducing the distance between the cores communicating frequently can reduce the average packet latency and improve the NoC performance as well.
However, this power reduction is obtained at the price of adding extra wires. The proposed reconfigurable n×m mesh architecture has (2m-1)×(n-1) + (2n-1)×(m-1) links (we consider two 500 µm wire segments as one link in order to compare them to conventional meshes), while the conventional mesh has n×(m-1) + m×(n-1) links. To evaluate this area overhead, we estimate the area of the routers used in this research using Orion and calculate the link area based on the analysis in [19]. We scale the areas reported in [19] to 70 nm technology and to 32-bit 1 mm links. The results show that the areas of a 1 mm link and of a router are 0.0204 mm² and 0.0634 mm², respectively. Based on these results, the area overhead is 19% for a 4×4 NoC and 17% for a 3×4 NoC. As with FPGAs, this overhead can be compensated by the flexibility obtained.
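The link counts behind this overhead estimate can be evaluated directly from the two formulas above. The sketch below is illustrative only: the function names are hypothetical, and it stops at the absolute extra link area because the exact normalization used for the reported percentages is not spelled out in the text.

```python
LINK_AREA_MM2 = 0.0204    # area of one 32-bit 1 mm link (from the text)
ROUTER_AREA_MM2 = 0.0634  # area of one router (from the text)


def reconfigurable_links(n: int, m: int) -> int:
    """Links in the reconfigurable n x m mesh: (2m-1)(n-1) + (2n-1)(m-1)."""
    return (2 * m - 1) * (n - 1) + (2 * n - 1) * (m - 1)


def mesh_links(n: int, m: int) -> int:
    """Links in a conventional n x m mesh: n(m-1) + m(n-1)."""
    return n * (m - 1) + m * (n - 1)


def extra_link_area(n: int, m: int) -> float:
    """Additional link area in mm^2 of the reconfigurable mesh."""
    return (reconfigurable_links(n, m) - mesh_links(n, m)) * LINK_AREA_MM2


if __name__ == "__main__":
    for n, m in ((4, 4), (3, 4)):
        print(n, m, reconfigurable_links(n, m), mesh_links(n, m),
              f"{extra_link_area(n, m):.4f}")
    # 4 4 42 24 0.3672
    # 3 4 29 17 0.2448
```

For a 4×4 network the reconfigurable mesh therefore has 42 links against 24 in the conventional mesh, and for a 3×4 network 29 against 17.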
Conclusions
In this paper, we presented a core mapping mechanism for reconfigurable network-on-chip (NoC) architectures. This mechanism addresses the problem of most of the existing NoC mapping methods which are based on the traffic characteristics of a single application. The reconfigurable NoC was formed by embedding programmable switches between routers of a mesh-based NoC.
Our evaluation results showed that, compared to a conventional mesh network, this method reduces the power consumption of the NoC by 32% on average. Future work in this line can further reduce the power consumption by scaling the voltage and frequency of the routers and links based on their loads. Moreover, better algorithms for mapping and routing in reconfigurable NoCs can be developed.
