One of the key steps in Network-on-Chip (NoC) based design is spatial mapping of cores and routing of the communication between those cores. Known solutions to the mapping and routing problem first map cores onto a topology and then route communication, using separate and possibly conflicting objective functions. In this paper we present a unified single-objective algorithm, called Unified MApping, Routing and Slot allocation (UMARS). As the main contribution we show how to couple path selection, mapping of cores and TDMA time-slot allocation such that the network required to meet the constraints of the application is minimized. The time-complexity of UMARS is low and experimental results indicate a run-time only 20% higher than that of path selection alone. We apply the algorithm to an MPEG decoder System-on-Chip (SoC), reducing area by 33%, power by 35% and worst-case latency by a factor of four over a traditional multi-step approach.
INTRODUCTION
Systems-on-Chip (SoC) grow in size with the advance of semiconductor technology enabling integration of dozens of cores on a chip. The continuously increasing number of cores calls for a new communication architecture as traditional architectures are inherently non-scalable, making communication a bottleneck [1, 21] .
System architectures are shifting towards a more communication-centric methodology [21] . Growing SoC complexity makes communication subsystem design as important as computation subsystem design [2] . The communication infrastructure must efficiently accommodate the communication needs of the integrated computation and storage elements. In application domains such as multi-media processing, the bandwidth requirements are already in the range of several hundred Mbps and are continuously growing [17] .
Networks-on-Chip (NoC) have emerged as the design paradigm for design of scalable on-chip communication architectures, providing better structure and modularity [1, 3, 7, 21]. Although NoCs solve the interconnect scalability issues, SoC integration is still a problem.
To enable cores to be designed and validated independently, computation and communication must be decoupled [20] . Decoupling requires well defined communication services [13] . Service guarantees are essential in many SoCs as numerous application domains require real-time performance [20] . Quality-of-Service (QoS) guarantees enable independent design and validation of every part of the SoC by ensuring that real-time application requirements are met under all circumstances [7] .
Creating a NoC-based system with guaranteed services requires efficient mapping of cores and distribution of NoC resources. Design choices include core port to network port binding, routing of communication between cores and allotment of network channel capacity over time. These choices have significant impact on energy, area and performance metrics of the system.
Existing solutions rely on a multi-step approach where mapping is carried out before routing [7, 12, 19]. The routing and mapping objectives therefore do not necessarily coincide. The routing phase must adhere to decisions taken in the mapping phase, which invariably limits the routing solution space. Mapping therefore significantly impacts energy and performance metrics of the system [12].
We propose a unified algorithm, called Unified MApping, Routing and Slot allocation (UMARS), that couples mapping, path selection and time-slot allocation, using a single consistent objective. The time-complexity of UMARS is low and experimental results indicate a run-time only 20% higher than that of path selection alone. We apply the algorithm to an MPEG decoder SoC, reducing area by 33%, power by 35% and worst-case latency by a factor of four over a traditional multi-step approach.
The problem domain is described in Section 3 and formalized in Section 4. The UMARS algorithm, which solves the unified allocation problem under application constraints, is described in Section 5. Experimental results are shown in Section 6. Finally, conclusions are drawn in Section 7.
RELATED WORK
QoS routing objectives are discussed in [9, 22] and implications with common-practice load-balancing solutions are addressed in [16]. Temporal characteristics, in addition to spatial ones, are included in path selection in [8, 10].
The problem of mapping cores onto NoC architectures is addressed in [7, 11, 12, 17, 18, 19] .
In [11] a branch-and-bound algorithm is used to map cores onto a tile-based architecture, aiming to minimize energy while bandwidth constraints are satisfied. Static xy routing is used in this work. In [12] the algorithm is extended to route with the objective of balancing network load. In [17, 18, 19] a heuristic improvement method is used. An initial mapping is derived with objectives such as minimizing communication delay, area or power dissipation. This is succeeded by routing according to a predefined routing function. Routing and evaluation is repeated for pair-wise swaps of nodes in the topology, thereby exploring the design space in search for an efficient mapping. In [19] the algorithm integrates physical planning and QoS guarantees. Design space exploration is improved with a robust tabu search.
In all these approaches [11, 12, 17, 18, 19] , multiple mapping and routing solutions are evaluated iteratively to mitigate the negative effects mapping decisions may have on routing.
A greedy non-iterative algorithm is presented in [7] . Mapping is done based on core clustering whereafter communication is routed using static xy routing.
Known mapping and routing algorithms that incorporate QoS guarantees [10, 19] assume static communication flows, where traffic does not vary with input data.
In this work, our methodology unifies the three resource allocation phases: spatial mapping of cores, spatial routing of communication, and the restricted form of temporal mapping that assigns time-slots to these routes. We consider the communication real-time requirements, and guarantee that application constraints on bandwidth and latency are met. The proposed solution is fundamentally different from [7, 11, 12, 17, 18, 19] in that mapping is no longer done prior to routing but instead during it. However, we compare UMARS only to [7] , and a more extensive comparison with traditional algorithms [11, 12, 17, 18, 19] is of value.
PROBLEM DESCRIPTION
We assume that the application is mapped onto cores. The bandwidth and latency constraints of the application flows are determined beforehand by means of static analysis or simulation.
Our problem is to: 1) map those cores onto any given NoC topology, 2) statically route the communication and 3) allocate TDMA time-slots on network channels so that application constraints are met. Services are provided on the level of flows where a flow is a sequence of packets being sent from a source to a destination. Regular, as well as irregular topologies are supported to enable dedicated solutions.
Two important requirements can be identified and the onus is, in both cases, on the mapping and routing phases. Firstly, the constraints of individual flows must be satisfied. These constraints must hence be reflected in the selection of mapping, path and time slots such that proper resources are reserved. Secondly, all flows must fit within the available network resources. Failure in allocating a flow is attributable to non-optimal previous allocations or insufficient amounts of network resources. This calls for conservation of the finite pool of resources, namely the channels and their time-slots. This paper shows how path selection can be extended to span also mapping and time-slot allocation. This enables the aforementioned requirements to be formulated as path selection constraints and optimization goals.
PROBLEM FORMULATION
The application is characterized by an application graph.
Definition 1.
An application graph is a directed multigraph, A(P, F), where the vertices P represent the set of cores, and the arcs F represent the set of flows between cores. More than a single flow is allowed to connect a given pair of cores and no core is isolated. Each flow f ∈ F is associated with a minimum bandwidth constraint measured in number of slots, b(f), and a maximum latency constraint, l(f). Let s(f) denote the source node of f and d(f) the destination node.
To be able to constrain mapping according to physical layout requirements, we group the cores in P and map groups instead of individual cores. UMARS is thereby forced to map certain cores to the same spatial location. The mapping groups correspond to a partition PM of P , where the elements of PM are jointly exhaustive and mutually exclusive. The equivalence relation this partition corresponds to, considers two elements in P to be equal if they must be mapped to the same spatial location. The equivalence class of a core p is hereafter denoted by [p] .
NoCs are represented by interconnection network graphs.
Definition 2. An interconnection network graph I is a strongly connected directed multigraph, I(N, C).
The set of vertices N is composed of three mutually exclusive subsets, NR, NNI and NP, containing routers, network interfaces (NI) and core mapping nodes, as shown in Figure 1. The latter are dummy nodes to allow unmapped cores to be integrated in the interconnection graph. The number of core mapping nodes is equal to the number of core subsets to be mapped, |NP| = |PM|.
The set of arcs C is composed of two mutually exclusive subsets, CR and CP containing physical network channels and virtual mapping channels. Channels in CR interconnect nodes in NR and NNI according to the physical router network architecture. Channels in CP interconnect every node in NP to all nodes in NNI.
More than a single physical channel is allowed to connect a given pair of routers. However, an NI nNI is always connected to a single router through one egress channel cE(nNI) ∈ CR and one ingress channel cI (nNI) ∈ CR, as depicted in Figure 1 .
The time division of network channel capacity is governed by slot tables. These tables are used to set up pipelined virtual circuits and divide bandwidth between flows [20] . A slot table is a sequence of elements in T = F ∪ {∅}. Slots are either occupied by a flow f ∈ F or empty, represented by ∅.
The number of residual slots in a slot table t is denoted σ(t).
The same slot table size ST is used throughout the entire network.
Each channel c ∈ C is associated with the bandwidth not yet reserved (residual bandwidth) measured in number of slots, β(c), and a slot table, t(c). Let s(c) denote the source node of c and d(c) the destination node.
As residual bandwidth and slot tables change over iterations, I is subscripted with an index. I0 denotes the initial network where β(c) = ST and every slot in t(c) is empty for every channel c ∈ C.
Definition 3.
A path π ∈ seq C from source ns ∈ N to destination nd ∈ N is a non-empty sequence of channels c1, . . . , ck such that: 1) d(c_i) = s(c_{i+1}) for 1 ≤ i ≤ k − 1, and 2) s(c1) = ns and d(ck) = nd.
A path π = c1, . . . , c k is associated with an aggregated slot table t(π). Every channel slot table t(ci), i = 1 . . k, is shifted cyclically i − 1 steps left and a slot in t(π) is empty iff it is empty in all shifted slot tables [20] . The NIs and core mapping nodes together form the set of mappable nodes NM = NNI ∪ NP as shown in Figure 1(a) . NM contains all nodes to which the elements of PM can be mapped. We define a mapping function, mapi : PM → NM , that maps sets of cores (the elements in PM ) to mappable nodes. Like I, this function is iterated over, hence the index. Our starting point is an initial mapping, map0, where every [p] ∈ PM is mapped to a unique nP ∈ NP .
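The aggregated slot table t(π) described above can be illustrated with a minimal Python sketch. It assumes that a slot table is modelled as a list in which `None` stands for ∅; the names `shift_left`, `aggregate` and `residual_slots` are hypothetical and not part of UMARS.

```python
from typing import List, Optional

Slot = Optional[str]  # a flow name, or None for an empty slot (the paper's ∅)

def shift_left(table: List[Slot], steps: int) -> List[Slot]:
    """Cyclically shift a slot table `steps` positions to the left."""
    steps %= len(table)
    return table[steps:] + table[:steps]

def aggregate(path_tables: List[List[Slot]]) -> List[bool]:
    """Per slot of t(pi): True iff the slot is empty in all shifted tables.

    path_tables[i] holds t(c_{i+1}); it is shifted i steps left, which is
    the 'i - 1 steps' rule expressed with 0-based list indices.
    """
    size = len(path_tables[0])
    shifted = [shift_left(t, i) for i, t in enumerate(path_tables)]
    return [all(t[s] is None for t in shifted) for s in range(size)]

def residual_slots(empty: List[bool]) -> int:
    """sigma(t): the number of empty (residual) slots."""
    return sum(empty)
```

For a two-channel path with tables `[None, 'a', None, None]` and `[None, None, 'b', None]`, the second table is shifted one step left, so only slot 1 of t(π) is occupied and three slots remain available.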
As seen in Figure 1 (a), the range of map0 initially covers only NP . As the algorithm progresses (b), the range of mapi covers both NP and NNI partially. Successive iterations of mapi progressively replace elements of NP with elements of NNI until a final mapping is derived (c), where the range of map k contains elements of NNI exclusively.
Let the set of mapped cores P i denote those elements of P where mapi([p]) ∈ NNI. From our definition of map0 it follows that P 0 = ∅.
UMARS contribution
We now introduce a major change from previous work and formulate the mapping and path selection problems as a pure path selection problem.
Given an interconnection network I0 and an application graph A, we must select a path π for every flow f ∈ F such that bandwidth (1) and latency (2) requirements of the flow are met without overallocating the network channels (3).
The theory required to derive worst-case bandwidth and latency from a slot table is covered in [5] .
UNIFIED MAPPING AND ROUTING
The outermost level of UMARS is outlined in Algorithm 5.1.
Flow traversal order
We order flows by bandwidth requirements as it: 1) helps in reducing bandwidth fragmentation [16] , 2) is important from an energy consumption and resource conservation perspective since the benefits of a shorter path grow with communication demands [12] , 3) gives precedence to flows with a more limited set of possible paths [12] .
Ordering by b(f) alone may affect resource consumption negatively as clusters of communicating cores are disregarded. Consideration is taken by limiting the selection to flows having s(f) or d(f) mapped to a node in NNI. As a result, every cluster of communicating cores has its flows allocated in sequence. A similar approach is used in [17, 18] where the next core is selected based on communication to already mapped cores.
Due to the nature of the least-cost path selection algorithm, explained in Section 5.2.2, we restrain the domain even more and only consider flows where s(f ) ∈ P i . This restriction can be removed if path selection is done also in the reverse direction, from destination to source.
The next flow is chosen according to Equation (4), where
When the latter condition is not fulfilled by any flow, the entire F i is used as domain.
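The traversal rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that the flow with the largest b(f) is taken first, and the `Flow` record and the `next_flow` helper are hypothetical names. Tie-breaking between flows of equal bandwidth is not specified in the text; the sketch simply keeps the first one encountered.

```python
from dataclasses import dataclass
from typing import Iterable, Set

@dataclass(frozen=True)
class Flow:
    name: str
    src: str        # s(f), a core
    dst: str        # d(f), a core
    bandwidth: int  # b(f), in slots

def next_flow(unallocated: Iterable[Flow], mapped_cores: Set[str]) -> Flow:
    """Pick the next flow to allocate from the unallocated set F_i.

    Prefer flows whose source core is already mapped to an NI (s(f) in P_i);
    if no such flow exists, fall back to the entire set of unallocated flows.
    Raises ValueError if there are no unallocated flows.
    """
    flows = list(unallocated)
    candidates = [f for f in flows if f.src in mapped_cores]
    domain = candidates or flows
    # Assumption: largest bandwidth requirement first.
    return max(domain, key=lambda f: f.bandwidth)
```

With no cores mapped yet, the first call returns the flow with the highest bandwidth requirement overall; later calls stay within the cluster of cores that are already placed whenever possible.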
Path selection
When a flow f is chosen, we proceed to Step 2b of Algorithm 5.1 and select a path for f. This is done according to Algorithm 5.2, briefly presented here, followed by in-depth discussions in the following subsections.
Algorithm 5.2, Step 4:
(b) Reserve egress bandwidth for all flows emanating from [s(f)] by subtracting Σ_{fE ∈ FE} b(fE) from β(cE(d(head πs))), where fE ∈ FE iff s(fE) ∈ [s(f)] and fE ≠ f
(c) Reserve ingress bandwidth for all flows incident to [s(f)] by subtracting Σ [...]
Algorithm 5.2, Step 6:
(b) Reserve egress bandwidth for all flows emanating [...]
(c) Reserve ingress bandwidth for all flows incident to [d(f)] by subtracting Σ_{fI ∈ FI} b(fI) from β(cI(s(last πd))), where fI ∈ FI iff d(fI) ∈ [d(f)] and fI ≠ f
Bandwidth reservation
When s(f ) for a flow f is mapped to an NI, the communication burden placed on the ingress and egress channels of the NI is not determined by f only. As every p in [s(f )] is fixed to this NI, the aggregated communication burden of all flows incident to those cores is placed on the ingress channel. The egress channel similarly has to accommodate all flows emanating from those cores. When d(f ) is mapped, all flows to or from [d(f )] must be accounted for accordingly.
Failing to acknowledge the above might result in overallocation of network resources. Numerous flows, still not allocated, may be forced to use the ingress and egress channel due to an already fixed mapping. An NI would thereby be associated with an implicit load, not accounted for when evaluating possible paths. We make this load explicit by exploiting knowledge of ingress-egress pairs. Although we have no knowledge of exactly what time slots will be needed by future flows, we can estimate the bandwidth required by b(f ) and incorporate average load β(c) in the cost function, further discussed in Section 5.2.3.
Steps 1 and 2 of Algorithm 5.2 restore the speculative reservations for f on egress and ingress channel to have Ii reflect what resources are available prior to its allocation.
The corresponding bandwidth reservations on egress and ingress channels are carried out in Steps 4b, 4c and Steps 6b, 6c for source and destination NI respectively.
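A minimal sketch of the speculative reservation, assuming flows are modelled as plain tuples and residual bandwidth β(c) is kept in a dictionary keyed by channel name. The helper `reserve_for_group` and the channel names are hypothetical; the sketch only illustrates how the implicit load of a newly mapped group is made explicit on the NI's egress and ingress channels.

```python
from typing import Dict, Iterable, Set, Tuple

# A flow is modelled as (name, source core, destination core, b(f) in slots).
FlowT = Tuple[str, str, str, int]

def reserve_for_group(
    flows: Iterable[FlowT],
    group: Set[str],           # the mapping group, e.g. [s(f)]
    current: str,              # name of the flow f being allocated
    residual: Dict[str, int],  # beta(c) per channel, updated in place
    egress: str,               # egress channel of the chosen NI
    ingress: str,              # ingress channel of the chosen NI
) -> Tuple[int, int]:
    """Subtract the bandwidth of all other flows touching `group`.

    Flows emanating from the group load the egress channel, flows incident
    to the group load the ingress channel. The flow being allocated is
    excluded, since it receives its actual slots separately.
    """
    flows = list(flows)
    out_bw = sum(b for (n, s, _d, b) in flows if s in group and n != current)
    in_bw = sum(b for (n, _s, d, b) in flows if d in group and n != current)
    residual[egress] -= out_bw
    residual[ingress] -= in_bw
    return out_bw, in_bw
```

When one of the other flows is later selected for allocation, its share of this reservation would be added back first, mirroring Steps 1 and 2.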
Selecting constrained least-cost path
Steps 3 and 5 of Algorithm 5.2 select a constrained least-cost path using Dijkstra's algorithm.
Two minor modifications are done to the standard relaxation procedure, where πp denotes the partial path from s(f ) to the current node: 1) Search space is pruned by discarding emanating channels where β(c) < b(f ) or σ(t(πp c )) < b(f ). Channels that cannot meet bandwidth constraints are thereby omitted. 2) As the final path must contain only physical network resources, channels in CP may only be the first or last element of a path. Hence, if d(last πp) ∈ NP then all emanating channels are discarded.
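The two modifications can be sketched on top of a textbook Dijkstra implementation. The sketch below is a simplification under stated assumptions: channel costs are fixed numbers given up front, and only the residual-bandwidth test β(c) < b(f) is applied, so the path-dependent slot table test on σ is left out. The function name and the tuple layout of a channel are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

# (channel id, source node, destination node, non-negative cost)
Channel = Tuple[str, str, str, float]

def constrained_least_cost_path(
    channels: List[Channel],
    residual: Dict[str, int],   # beta(c) per channel id
    start: str,
    goal: str,
    demand: int,                # b(f) in slots
    mapping_nodes: Set[str],    # the core mapping nodes N_P
) -> Optional[List[str]]:
    """Return the channel ids of a least-cost path, or None if none exists."""
    outgoing: Dict[str, List[Channel]] = {}
    for ch in channels:
        outgoing.setdefault(ch[1], []).append(ch)

    best: Dict[str, float] = {start: 0.0}
    prev: Dict[str, Tuple[str, str]] = {}  # node -> (channel id, predecessor)
    heap: List[Tuple[float, str]] = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        if node == goal:
            break
        # Modification 2: a mapping node reached mid-path is a dead end.
        if node in mapping_nodes and node != start:
            continue
        for cid, _src, dst, ch_cost in outgoing.get(node, []):
            # Modification 1: prune channels that cannot carry the flow.
            if residual[cid] < demand:
                continue
            new_cost = cost + ch_cost
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                prev[dst] = (cid, node)
                heapq.heappush(heap, (new_cost, dst))
    if goal not in prev:
        return None
    path: List[str] = []
    node = goal
    while node != start:
        cid, node = prev[node]
        path.append(cid)
    path.reverse()
    return path
```

Note that the sketch returns `None` when start and goal coincide, since the empty path is never recorded.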
The NI architecture requires a path to incorporate at least one physical channel as packets cannot turn around inside an NI. From a least-cost perspective the best path from an NI to itself would be the empty path and we force the algorithm into leaving the NI by doing path selection in two steps.
The first part of the path πs is selected in Step 3 of Algorithm 5.2. We start at s(f ) and find the router with the lowest cost. If several such routers exist, then arity is used to distinguish between them. Routing flexibility is thereby maximized and the flows with the highest communication volume have their s(f ) and d(f ) mapped to NIs connected to high arity routers as suggested in [18] .
The second part of the path π d is selected in Step 5, starting where πs ended. From there we continue to the location where d(f ) is currently mapped. The complete path is then just the two parts concatenated, π = πs π d . Deriving π as two separate least-cost parts might, without further care, lead to a path which is not the least-cost path in Π(s(f ), d(f )) as minimization is done on the parts in isolation. However, if a flow f has s(f ) ∈ P i then there is only one possible least-cost router and hence only one possible πs. As this πs is a part of any path in Π(s(f ), d(f )) and π d is a least-cost path, π must be a least-cost path in Π(s(f ), d(f )). We therefore prefer allocating flows where s(f ) ∈ P i , as discussed in Section 5.1.
Choice of cost function
The cost function plays a critical role in meeting the requirements discussed in Section 3. It therefore reflects both resource availability and resource utilization. We select a path with a low contention (high probability of successful allocation) and at the same time try to keep the path length short, not to consume unnecessarily many resources. Similar heuristics are suggested in [14, 15, 22] . Double objective path optimization in general is an intractable problem [9] . Combining objectives in one cost function allows for tractable algorithms at the cost of optimality. We therefore use a linear combination of the two cost measures, where two constants Γc and Γ h control the importance (and normalization) of contention and hop-count respectively.
Contention is traditionally incorporated by making channel cost inversely proportional to residual bandwidth. We additionally consider how much t(c) reduces the number of available slots compared to t(πp) if c is traversed. Both measures are incorporated by taking the maximum of the two as the contention measure, according to Equation (5).
$$\Gamma_c \max\{S_T - \beta(c),\; \sigma(t(\pi_p)) - \sigma(t(\pi_p\, c))\} + \Gamma_h \tag{5}$$
Channels in CP must not contribute to the path cost, as they are not physical interconnect components. We therefore make them zero-cost channels.
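The channel cost can be written as a small function. This sketch reads the first argument of the maximum in Equation (5) as the slot table size minus the residual bandwidth β(c), which is an interpretation; the parameter names and the `is_mapping_channel` flag are hypothetical.

```python
def channel_cost(
    slot_table_size: int,      # S_T
    residual_bw: int,          # beta(c)
    slots_before: int,         # sigma(t(pi_p)): free slots on the partial path
    slots_after: int,          # sigma of the partial path extended with c
    gamma_c: float = 1.0,      # weight of contention
    gamma_h: float = 1.0,      # weight of hop-count
    is_mapping_channel: bool = False,
) -> float:
    """Cost of traversing channel c after the partial path pi_p."""
    if is_mapping_channel:
        # Channels in C_P are not physical interconnect and cost nothing.
        return 0.0
    contention = max(slot_table_size - residual_bw, slots_before - slots_after)
    return gamma_c * contention + gamma_h
```

With a slot table of 16 slots, β(c) = 10 and a drop from 12 to 9 free slots on the partial path, the contention term is max(6, 3) = 6 and the cost with unit weights is 7.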
Refining mapping function
When a path πs has been selected for a flow f, we check in Step 4 of Algorithm 5.2 whether s(f) is already mapped to an NI. If it is not, πs decides the NI to which the core is to be mapped. We therefore refine the current mapping function with the newly determined mapping to a node in NNI, as seen in Step 4a. This refinement is fixed and every core in [s(f)] is thereby mapped to that NI. Similarly, we check whether d(f) is not yet mapped to an NI in Step 6 and, if not, refine the mapping according to πd in Step 6a.
Resource reservation
When the entire path π is determined in Step 7 of Algorithm 5.2, we deduce the slots available to f by looking at t(π). From the empty slots we select a set of slots TS such that bandwidth and latency requirements of f are met [5] . All channels c ∈ π are then updated with a new t(c) and β(c). Slot tables hereafter reflect what slots are reserved to f and β(c) is updated with the actual number of slots used.
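The reservation step can be sketched as follows. The sketch is deliberately simplified: it takes the first b(f) empty slots of t(π) and ignores the latency requirement, whose treatment follows [5] and is not reproduced here. Slot tables are lists with `None` for an empty slot, residual bandwidth is a list with one entry per channel on the path, and the name `reserve_slots` is hypothetical.

```python
from typing import List, Optional

def reserve_slots(
    path_tables: List[List[Optional[str]]],  # t(c_1), ..., t(c_k), updated in place
    residual: List[int],                     # beta(c_1), ..., beta(c_k), updated in place
    flow: str,
    demand: int,                             # b(f) in slots
) -> Optional[List[int]]:
    """Reserve `demand` slots of t(pi) for `flow`; return them, or None."""
    size = len(path_tables[0])
    # Slot s of t(pi) corresponds to slot (s + i) mod size on the i-th
    # channel (0-based), because that table is shifted i steps left.
    empty = [
        s for s in range(size)
        if all(t[(s + i) % size] is None for i, t in enumerate(path_tables))
    ]
    if len(empty) < demand:
        return None
    chosen = empty[:demand]  # assumption: no latency-aware slot spreading
    for i, table in enumerate(path_tables):
        for s in chosen:
            table[(s + i) % size] = flow
        residual[i] -= demand
    return chosen
```

Each chosen slot of t(π) is written back to every channel at its shifted position, which is what sets up the pipelined virtual circuit along the path.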
Algorithm termination
With each refinement of mapi, zero, one or two additional sets of cores will be mapped to elements of NNI instead of NP, hence P i+1 ⊇ P i, as depicted in Figure 1.
Theorem 1. ∃k such that all cores are mapped to NIs, P k = P.
Proof. When a flow f is allocated, mapi will be refined so that s(f) and d(f) are guaranteed to be in P i. Hence, for every allocated flow f ∉ F i we know that s(f), d(f) ∈ P i. When all flows are allocated, F k = ∅, and s(f) and d(f) will be in P k for all f ∈ F. As no isolated cores are allowed in A it follows that P = P k.
Algorithm complexity
Due to the greedy nature of UMARS the time-complexity is very low, as seen in Equation (6). The expression is dominated by the first term, which is attributable to Dijkstra's algorithm, used for path selection. Experiments indicate that algorithm run-time is only 20% higher than that of load-balancing path selection alone.
$$O\big(|F|\,(|C| + |N| \log |N|)\big) + O\big(|F|\,(|F| + |P| + S_T)\big) \tag{6}$$
EXPERIMENTAL RESULTS
A cost function where Γc = 1 and Γh = 1 is used throughout the experiments. Those values favor contention-balancing over hop-count as the slot table size is an order of magnitude larger than network diameter in all use-cases. All results are compared with the traditional multi-step algorithm in [7], referred to as original.
For comparison, only mesh topologies are evaluated. For a given slot table size ST , all unique n × m router networks with less than 25 routers were generated in increasing size order. For every such router network, up to three NIs were attached to each router until all application flows were allocated, or allocation failed. Slot table size was incremented until allocation was successful.
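The enumeration of candidate router networks can be sketched as below. The sketch assumes that "unique" means an n × m mesh and its transpose m × n count as one network, and that single-row meshes are included; both are assumptions, and `mesh_sizes` is a hypothetical helper name.

```python
from typing import List, Tuple

def mesh_sizes(max_routers: int = 25) -> List[Tuple[int, int]]:
    """All n x m meshes (n <= m) with fewer than `max_routers` routers,
    ordered by increasing router count."""
    sizes = [
        (n, m)
        for n in range(1, max_routers)
        for m in range(n, max_routers)
        if n * m < max_routers
    ]
    return sorted(sizes, key=lambda s: (s[0] * s[1], s))
```

An outer loop would then walk this list for a given slot table size, attach up to three NIs per router, and stop at the first mesh on which all flows can be allocated.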
Each design was simulated for 3 × 10^6 clock cycles in a flit-accurate SystemC simulator of our NoC, using traffic generators to mimic core behavior.
The mpeg use-case is an MPEG codec SoC, further described in Section 6.2. The uniform use-case features all-to-all communication with 20 cores and a total aggregated bandwidth of 750 Mbps per core. The remaining use-cases are internal designs, all having a hot-spot around a limited set of cores.
Evaluation experiments
Silicon area requirements are based on the model presented in [6], assuming a 0.13 μm CMOS process. Figure 2 shows that area requirements can be significantly reduced: up to 33% total area reduction is observed for the experiment applications. Because slot table sizes are reduced, the buffer requirements, analytically derived as described in [7], decrease, and area savings of up to 31% are observed for the NIs. The router network is reduced by between 30% and 75%, but the impact on total area is much smaller.
The relative energy consumption of the router network, calculated according to the model in [4], is depicted in Figure 3. As the application remains the same and essentially the same bits are being communicated, the savings in energy consumption are attributable to flows being allocated on paths with fewer hops. There is a clear correlation between energy saving ratio and relative reduction in number of routers. However, as the smaller router network is used more extensively, energy is reduced less than the number of routers.
Figure 4 shows the average utilization of channels emanating from NIs and routers respectively. As expected, utilization increases as router network size is reduced, and UMARS consequently improves both NI and router utilization. Time-division-multiplexed circuits imply bandwidth discretization, leading to inevitable over-allocation and complicating the task of achieving high utilization. This, together with unbalanced hot-spot traffic that leaves some parts of the network lightly loaded and others congested, leads to inherently low utilization in some of the example use-cases. Note that utilization is only to be optimized after all constraints are met.
An MPEG application
An existing MPEG codec SoC with 16 cores constitutes our design example and results are shown in Table 1. The architecture uses a single external SDRAM with three ports to implement all communication between cores. A total of 42 flows tie the cores together. Using the design flow presented in [7] (clustered mapping, xy routing and greedy slot allocation) results in a 2 × 3 mesh, referred to as clustering in Table 1, with a total estimated area of 2.35 mm². For comparison, a naive mapping with one core partition per NI is almost double in size, whereas the worst-case write latency remains more or less unaffected.
A manually optimized mapping manages to reduce the network area by 21% and an almost four-fold reduction of the average worst-case write latency is observed [7].
UMARS arrives at a mesh of equal size to what is achieved using the manually optimized mapping. Fewer NIs are needed, leading to reductions in router area. Smaller buffer requirements, attributable to less bursty time-slot allocation, result in reduced NI area. Total area is reduced by 17% and average worst-case latency by 4% compared to the optimized handcrafted design. The solution is achieved in less than 100 ms on a 500 MHz Solaris UltraSparc IIe. Only a 20% increase in run-time is observed when compared to pure load-balancing path selection, without mapping and slot allocation.
CONCLUSION AND FUTURE WORK
In this work we have presented the UMARS algorithm which integrates the three resource allocation phases: spatial mapping of cores, spatial routing of communication and TDMA time-slot assignment. The algorithm is decomposed into a hierarchical structure where mapping is no longer done prior to routing but instead during it. UMARS improves over existing mapping and routing algorithms by using a single consistent objective-function.
The time-complexity of UMARS is low and experimental results indicate a run-time only 20% higher than that of path selection alone.
We apply the algorithm to an MPEG decoder SoC, reducing area by 33%, power by 35% and worst-case latency by a factor of four over a traditional multi-step approach.
The importance of the flow traversal order and the objective function are not yet fully evaluated and both play a critical role in improving on the moderate results achieved in some use-cases.
To allow a more extensive design space exploration for both mapping and routing, UMARS can be extended to a k-path algorithm, enabling a trade-off between complexity and optimality.
