ABSTRACT Network-on-chip (NoC) has emerged as a promising replacement for the conventional communication paradigm in modern very large scale integration (VLSI) embedded systems. Among its many design challenges, application mapping on the NoC system is one of the most intractable optimization problems. In this paper, we propose a hybrid, branch & bound (BB)-based exact mapping (BEMAP) algorithm for mapping real-time embedded applications on the NoC architecture. BEMAP optimizes the latency and throughput of the NoC system and minimizes power consumption under the bandwidth constraint. The method combines modular exact and systematic search optimization techniques to obtain a multi-objective optimized solution to the NoC mapping problem. The proposed algorithm builds on the state-of-the-art BB algorithm to obtain better results than its competitors. Experimental results on the benchmarks of several real-time embedded applications show that the proposed algorithm achieves up to 19.93% savings in power consumption and up to 61.10% improvement in network latency for two-dimensional (2D) mesh and torus topologies.
I. INTRODUCTION
The growing demand for complex embedded systems and the shrinking feature size of Very Large Scale Integration (VLSI) technology led to System-on-Chip (SoC) designs [1]. An SoC integrates a large number of Intellectual Property (IP) blocks on a single silicon die to optimize the power, latency, and area utilization of the system [2]. In the last decade, SoC has proven its importance in every aspect of life, ranging from industrial to consumer electronics. Because of this demand, SoC designs based on the traditional shared bus face complex design challenges in scalability, modularity and fault tolerance. Network-on-Chip (NoC) has emerged as a potential solution to the communication bottleneck in embedded systems. The Processing Elements (PEs) in a NoC system exchange data packets through routers over communication links under a specific network topology [3]-[6]. NoC inherits many features from macro computer networks, with the changes required for on-chip networks [7], [28].
Among the design challenges of NoC systems, application mapping is one of the structural design requirements for achieving optimal system performance. It is an NP-hard (non-deterministic polynomial-time hard) combinatorial optimization problem in which the tasks of an application are assigned to the available NoC tiles of a specific NoC topology, together with an efficient routing algorithm [8]. An exhaustive search for the optimal mapping of n tasks must evaluate n! candidate mappings. For example, mapping an application of 100 tasks on the PEs of a NoC architecture with a 10 × 10 2D mesh topology requires 100! ≈ 9.33 × 10^157 evaluations to find the optimal solution. It is impractical to solve such a large combinatorial optimization problem in polynomial time by linear programming methods on currently available commercial computing systems. For mapping smaller applications on a NoC architecture, the problem can be solved with the exact optimization models of Integer Linear Programming (ILP) and Mixed Integer Linear Programming (MILP) to obtain an exact optimal solution [10], [11], [27]. Fast search based optimization techniques and heuristics are generally used for such NP-hard problems [12]-[14]. These methods are fast because they are based on intelligent search; however, they usually generate an infeasible solution because they become trapped in local minima. To address this dilemma, we introduce a new hybrid optimization technique composed of both exact and search based optimization methods, as shown in the middle branch of Fig. 1.
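The size of the exhaustive search space quoted above can be checked directly. The short Python sketch below only evaluates n!; the function name `exhaustive_mappings` is a hypothetical helper introduced for illustration.

```python
import math


def exhaustive_mappings(n_tasks: int) -> int:
    """Number of one-to-one mappings of n tasks onto n tiles, i.e. n!."""
    return math.factorial(n_tasks)


# 100 tasks on a 10 x 10 mesh: a 158-digit number, about 9.33e157.
count_100 = exhaustive_mappings(100)
# 10 tasks: small enough to enumerate exhaustively.
count_10 = exhaustive_mappings(10)

print(f"{count_100:.2e}")
print(count_10)
```

The contrast between the two values illustrates why an exact search is only practical on small groups of tasks.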
Exact and search based optimization are the two main approaches to finding a feasible solution to an NP-hard mapping problem. Exact optimization relies on mathematical modeling and is generally used on small search spaces to find the true optimum. To overcome the long convergence time of exact optimization, search based optimization techniques are used to find a near-optimal solution to the application mapping problem [23]-[26]. Search based optimization is categorized into systematic search and heuristic search methods. A heuristic search method is either transformative or constructive, with or without iterative improvement. We present a new branch of hybrid optimization, named Hybrid Exact & Search (HES) optimization. HES optimization is categorized into HES systematic and HES heuristic optimization methods. HES heuristic is further divided into HES constructive and HES transformative optimization techniques. HES constructive optimization comprises HES constructive with iterative improvement and HES constructive without iterative improvement. Our two previous mapping algorithms, the Segmented Brute-force Mapping algorithm (SBMAP) and the Optimized Near-optimal Mapping algorithm (ONMAP), fall into these two categories, respectively.
In this paper, we propose a hybrid, BB-based Exact Mapping (BEMAP) algorithm. BEMAP belongs to the category of HES systematic optimization methods and combines the properties of systematic search optimization and the exact method. To speed up the exact part of the algorithm, a modular, partition-based approach is adopted. The BEMAP algorithm is designed for the topological placement of IPs on the network nodes of the NoC platform. It uses the BB and the modular exact algorithms for mapping embedded applications. The BB algorithm is fast; however, it usually generates a suboptimal solution to the mapping problem. We combine it with our modular exact mapping algorithm to produce an efficient and optimized solution to the mapping problem. The algorithm divides the initial solution into modules and searches for the best solution using network cost functions. The cost function is the product of the bandwidth and the hop count from the source to the destination node of the network. When the algorithm finds the best solution for a module, it retains that mapping and proceeds to the next module. The cumulative best mapping is computed in the final stage of the mapping technique. We have implemented the BEMAP algorithm in the NoCTweak simulator [22] to compute network performance parameters under the constraint of bandwidth reservation.
The rest of the paper is structured as follows. Section II presents background work on NoC application mapping, and Section III describes the problem formulation. In Section IV, we present the proposed BEMAP algorithm. Section V evaluates the mapping results of the BEMAP algorithm using real-world embedded benchmark applications. Section VI presents concluding remarks and future extensions of this research work.
II. RELATED RESEARCH WORK
Low-power NoC-based Multiprocessor System-on-Chip (MPSoC) designs are a key requirement of current and future generation embedded systems. Among the many design challenges, application mapping optimization is one of the key research areas in the field of NoC systems. Power consumption and network traffic latency depend heavily on how the application is mapped on the target NoC architecture, which is an NP-hard optimization problem. By mapping heavily communicating tasks on topologically neighboring IPs of the NoC platform, the average message delay and power consumption can be reduced significantly. To address this important research area, a number of mapping algorithms have been proposed in the literature.
Hu et al. [9] presented an energy-aware mapping algorithm for NoC architectures under the bandwidth constraint. The authors introduced the BB algorithm to map the application tasks on the NoC tiles. The algorithm shortened the run time of the mapping process; however, it produced lower-quality results. Lei and Kumar [15] proposed a two-step Genetic Algorithm (GA) for mapping task graphs on NoC architecture. The authors minimized the execution time of the scheduling and mapping process by reducing the delay between messages, without considering reductions in power and energy consumption. Murali and de Micheli [16] presented the Near-optimal Mapping (NMAP) algorithm for NoC application mapping under the constraint of bandwidth reservation. This algorithm places heavily communicating tasks as nearest neighbors in the network to minimize the communication delay, and then performs pairwise swapping to improve the results. The traffic is routed across different paths in the network to satisfy the bandwidth requirements. This approach performed better; however, it traded the optimality of the results for simulation time. To speed up the computation time of adhoc Simulated Annealing (SA), Lu et al. [17] proposed a fast Cluster-based Simulated Annealing (CSA) algorithm, at the cost of solution quality.
Radu and Vintan [18] presented an Optimized Simulated Annealing (OSA) algorithm for mapping embedded applications on the 2D mesh topology. In this work, the authors optimized the annealing parameters to speed up the computation time; however, this compromised the performance results as compared to the adhoc SA algorithm. Ascia et al. [19] proposed a Multi-objective Genetic Algorithm for mapping the IP cores on NoC tiles to optimize network performance and power consumption. Jena [20] proposed a GA for the mapping of IP cores on the 2D mesh topology of the NoC architecture to optimize network power consumption, link bandwidth and network performance. GA is a good heuristic in terms of convergence time; however, it usually produces an infeasible solution to the optimization problem. Sepulveda et al. [21] presented the Multi-Objective Adaptive Immune Algorithm (MAIA) for optimizing the network power and latency of NoC applications. The research work in [30]-[33] also presented some novel ideas; however, NoC application mapping is still an open problem for the research community due to its NP-hard nature.
Most of the research work cited above focused mainly on simulation speed and paid less attention to the quality of the resulting feasible solution, which is the main objective of application mapping, particularly for power and energy minimization. The proposed BEMAP algorithm primarily optimizes the quality of the feasible mapping solution with respect to the performance parameters of the NoC system, in addition to optimizing the simulation time.
III. PROBLEM FORMULATION
To formulate the mapping problem, we present the following definitions and mathematical models for the NoC systems.
Definition 1: A Network Task Graph (NTG), $N = N(T, C)$, is a directed acyclic graph in which each vertex represents a task ($T = \{T_1, T_2, T_3, \ldots, T_n\}$) with the associated communication information and deadlines. A directed arc ($c_{i,j} \in C$, $i = 1, 2, 3, \ldots$, $j = 1, 2, 3, \ldots$) between two tasks represents the data volume and interdependencies between the application nodes.
Definition 2: A Network Core Graph (NCG), $G = G(P, A)$, is a directed graph in which the vertices represent the available PEs ($P = \{P_1, P_2, P_3, \ldots, P_n\}$) for task execution. A directed arc ($a_{i,j} \in A$) carries the characteristic parameters and required bandwidth between the IP cores ($p_i$ to $p_j$).
Definition 3: A NoC Architecture Graph (NAG), $A = A(R, H)$, is a topology graph in which each node ($R = \{R_1, R_2, R_3, \ldots, R_n\}$) is a network router and each directed arc ($h_{i,j} \in H$) represents the routing channel between two routers. The router transmits and receives the data packets of the associated IP core. The routing channel ($h_{i,j}$) provides the physical path for the transmission of packets from source to destination PEs. The routing channel is associated with the data bandwidth requirement $B_{t_i,t_j}$.
The mathematical models used in BEMAP algorithm are presented as follows:
According to the Bit Energy Model [9], the total energy consumption ($E_T$) of the NoC architecture is given by:
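A sketch of this expression, assuming the per-bit form of the bit energy model in [9] (a bit that passes through $N_h$ switches crosses $N_h - 1$ links) and assuming that every arc is weighted by its bandwidth. Here $N_h$ is read as the hop count of [9], and the summation limits are an assumption:

```latex
E_T = \sum_{i,j=1}^{n} B_{t_i,t_j} \left[ N_h \, E_S + (N_h - 1) \, E_L \right]
```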
where $B_{t_i,t_j}$ is the arc bandwidth from tile $t_i$ to tile $t_j$, $E_S$ is the switch energy and $E_L$ is the link energy. $N_h$ is the Manhattan distance in the NoC architecture, whereas $n$ represents the number of nodes in the architecture for the target application. The Manhattan distance from the source node $(x_i, y_i)$ to the destination node $(x_j, y_j)$ of the NoC architecture is given by:
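For two tiles of a 2D mesh this is the standard Manhattan (L1) distance, written with the coordinates defined above:

```latex
N_h = |x_i - x_j| + |y_i - y_j|
```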
The communication cost, which is also the performance indicator for application mapping, is computed using the following equation.
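Following the description in Section I, where the cost function is the product of bandwidth and hop count between source and destination, the expression can be sketched as follows; the summation limits are an assumption:

```latex
\mathrm{Cost} = \sum_{i,j=1}^{n} B_{t_i,t_j} \times N_h
```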
For an optimized solution, the objective functions $\min\{E_T\}$, $\min\{\mathrm{Cost}\}$, or both must be satisfied.
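As an illustration of the cost objective, the Python sketch below evaluates the bandwidth-times-hops cost of two candidate placements on a 2D mesh. The three-task graph, its bandwidth values and the names `manhattan` and `communication_cost` are hypothetical and are not taken from the benchmarks.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]


def manhattan(a: Tile, b: Tile) -> int:
    """Hop count between two tiles of a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(
    bandwidth: Dict[Tuple[str, str], float],
    placement: Dict[str, Tile],
) -> float:
    """Sum of bandwidth x hop count over all arcs of the task graph."""
    return sum(
        bw * manhattan(placement[src], placement[dst])
        for (src, dst), bw in bandwidth.items()
    )


# Hypothetical task graph: arc -> required bandwidth.
bandwidth = {("A", "B"): 100.0, ("B", "C"): 40.0, ("A", "C"): 10.0}

# Heavily communicating tasks A and B are adjacent in `near` but not in `far`.
near = {"A": (0, 0), "B": (0, 1), "C": (1, 1)}
far = {"A": (0, 0), "B": (1, 1), "C": (0, 1)}

cost_near = communication_cost(bandwidth, near)
cost_far = communication_cost(bandwidth, far)
print(cost_near, cost_far)
```

Placing the heaviest arc on neighboring tiles gives the lower cost, which is the behavior the objective function rewards.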
The CMOS (Complementary Metal Oxide Semiconductor) Standard Cell Library Model is used in the BEMAP algorithm to compute the average latency, throughput, power and energy consumption of the network. This model uses standard CMOS cell library data to estimate the power and performance parameters of the NoC system. In the NoCTweak simulator, the Verilog RTL designs of all router components were synthesized with Synopsys Design Compiler and then placed and routed with Cadence SoC Encounter, using the CMOS standard cell library. Post-layout data of these components was fed to the simulator for NoC performance estimation, based on the activity of the components while running a given traffic pattern [22]. The average latency of the network under this model is given by:
where $Lt_{ij}$ is the latency of packet $j$, $N_i$ is the number of packets received by processor $i$ after the warm-up time, and $N$ is the number of processors in the platform. The average throughput of the network is given by:
where $T_{sim}$ is the simulation time and $T_{wrm}$ is the warm-up time of the simulations. The average power of the network is calculated by:
where $Pw_{(act,j)}$ is the post-layout active power and $Pw_{(inact,j)}$ is the post-layout inactive power of component $j$, and $\alpha_{(i,j)}$ is the active percentage of component $j$ in router $i$ (after $T_{wrm}$). Finally, the average energy consumed by each packet in the network is given by:
where, N P is the total number of packets, traversed across the network of the NoC architecture. 
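The four metrics can be sketched in Python from the symbol definitions above. The exact formulas are assumptions made for illustration: latency is taken as total packet latency over total packets, throughput as packets per processor per unit of measured time, power as the activity-weighted sum of active and inactive power over all router components, and energy per packet as power times measured time divided by $N_P$. All function names are hypothetical.

```python
from typing import List


def average_latency(latencies: List[List[float]]) -> float:
    """latencies[i][j] is Lt_ij, the latency of packet j at processor i."""
    total_packets = sum(len(per_pe) for per_pe in latencies)
    total_latency = sum(sum(per_pe) for per_pe in latencies)
    return total_latency / total_packets


def average_throughput(packets: List[int], t_sim: float, t_wrm: float) -> float:
    """packets[i] is N_i; the result is packets per processor per time unit."""
    n_processors = len(packets)
    return sum(packets) / (n_processors * (t_sim - t_wrm))


def average_power(
    alpha: List[List[float]], pw_act: List[float], pw_inact: List[float]
) -> float:
    """alpha[i][j] is the active fraction of component j in router i."""
    return sum(
        a * pw_act[j] + (1.0 - a) * pw_inact[j]
        for row in alpha
        for j, a in enumerate(row)
    )


def energy_per_packet(power: float, t_sim: float, t_wrm: float, n_p: int) -> float:
    """Average energy per packet over the measured (post warm-up) interval."""
    return power * (t_sim - t_wrm) / n_p


# Hypothetical numbers, chosen only to exercise the functions.
lat = average_latency([[10.0, 20.0], [30.0]])
thr = average_throughput([2, 1], t_sim=110.0, t_wrm=10.0)
pwr = average_power([[0.5, 1.0]], pw_act=[2.0, 4.0], pw_inact=[1.0, 0.0])
epp = energy_per_packet(pwr, t_sim=110.0, t_wrm=10.0, n_p=3)
print(lat, thr, pwr, epp)
```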
IV. BEMAP ALGORITHM
The proposed BEMAP algorithm combines the BB and exact mapping algorithms for the topological placement of the PEs on the NoC architecture. The BB algorithm (Algorithm 1) solves combinatorial optimization problems by enumerating feasible solutions through the exploration of a search tree. The algorithm traverses the exploration tree that represents the solution space, forming a tree of subproblems as it advances. It constructs upper and lower bounds for the root problem. A solution is feasible if the boundary conditions are satisfied at a particular node; otherwise, the algorithm partitions the node into child nodes. The search continues until all remaining nodes have been pruned and the best solution has been found.
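The branching and pruning idea can be illustrated with a small Python sketch. It is not the BB algorithm of [9]: it is a minimal depth-first branch and bound that places tasks one by one, uses the cost of the arcs between already placed tasks as a lower bound (valid when all bandwidths are non-negative), and prunes any branch whose bound cannot beat the best complete mapping found so far. The task graph and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]


def manhattan(a: Tile, b: Tile) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def partial_cost(bw: Dict[Tuple[int, int], int], placed: List[Tile]) -> int:
    """Cost of the arcs whose two endpoint tasks are both already placed."""
    k = len(placed)
    return sum(
        w * manhattan(placed[s], placed[d])
        for (s, d), w in bw.items()
        if s < k and d < k
    )


def branch_and_bound(
    n_tasks: int, tiles: List[Tile], bw: Dict[Tuple[int, int], int]
) -> Tuple[List[Tile], float]:
    """Return (mapping, cost), where mapping[t] is the tile of task t."""
    best_cost = float("inf")
    best: List[Tile] = []

    def explore(placed: List[Tile], free: List[Tile]) -> None:
        nonlocal best_cost, best
        bound = partial_cost(bw, placed)  # unplaced arcs can only add cost
        if bound >= best_cost:
            return  # prune: this subtree cannot improve on the incumbent
        if len(placed) == n_tasks:
            best_cost, best = bound, list(placed)
            return
        for idx, tile in enumerate(free):  # branch on the next task's tile
            explore(placed + [tile], free[:idx] + free[idx + 1:])

    explore([], list(tiles))
    return best, best_cost


# Hypothetical 4-task graph on a 2 x 2 mesh.
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
bw = {(0, 1): 100, (1, 2): 50, (2, 3): 20, (0, 3): 5}
mapping, cost = branch_and_bound(4, tiles, bw)
print(mapping, cost)
```

Because the bound never overestimates the final cost, pruning does not discard the optimum in this sketch; faster variants typically trade that guarantee for speed.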
Hu and Marculescu [9] introduced the BB mapping algorithm, which is available in the NOCMAP simulator. The simulator in [29] is an extended version of NOCMAP with the addition of a reliability algorithm for the NoC architecture.
Our proposed algorithm (Algorithm 2) integrates the BB algorithm of NOCMAP with the modular exact optimization method to form an efficient hybrid mapping algorithm, named BEMAP. Because we assume one-to-one mapping, the algorithm takes its input data from the Network Task Graph (NTG) and the Network Core Graph (NCG). The BB algorithm generates the initial solution for mapping the application on the target NoC topology. This mapping serves as the input to the exact part of the algorithm. The algorithm splits this mapping into small modules, each containing at most ten IPs, depending on the speed and size of the search space. The value of ten per module is chosen because beyond it the simulation speed drops drastically, owing to the NP-hard nature of the problem, as shown in Fig. 2.
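The modular exact step can be sketched as follows, under stated assumptions: modules are formed from consecutive task indices, each module is refined by exhaustively permuting its own tiles while the rest of the mapping stays fixed, and the cost is always evaluated on the whole mapping. How BEMAP actually forms its modules is not specified here, so `refine_in_modules` and the example graph are hypothetical. With the default module size of ten, one module costs at most 10! = 3,628,800 evaluations.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Tile = Tuple[int, int]


def manhattan(a: Tile, b: Tile) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(bw: Dict[Tuple[int, int], int], placement: List[Tile]) -> int:
    """Bandwidth x hop count summed over all arcs; placement[t] is a tile."""
    return sum(w * manhattan(placement[s], placement[d]) for (s, d), w in bw.items())


def refine_in_modules(
    placement: List[Tile], bw: Dict[Tuple[int, int], int], module_size: int = 10
) -> Tuple[List[Tile], int]:
    """Exhaustively re-arrange each module, keeping the best mapping so far."""
    best = list(placement)
    best_cost = cost(bw, best)
    for start in range(0, len(best), module_size):
        module = list(range(start, min(start + module_size, len(best))))
        module_tiles = [best[t] for t in module]
        for perm in permutations(module_tiles):
            trial = list(best)
            for task, tile in zip(module, perm):
                trial[task] = tile
            trial_cost = cost(bw, trial)
            if trial_cost < best_cost:  # retain the improvement, then continue
                best, best_cost = trial, trial_cost
    return best, best_cost


# Hypothetical 6-task graph on a 2 x 3 mesh with a deliberately poor start.
bw = {(0, 1): 100, (1, 2): 80, (3, 4): 60, (4, 5): 40, (2, 3): 10}
initial = [(0, 0), (0, 2), (0, 1), (1, 0), (1, 2), (1, 1)]
initial_cost = cost(bw, initial)
refined, refined_cost = refine_in_modules(initial, bw, module_size=3)
print(initial_cost, refined_cost)
```

The example uses a module size of three only to keep the run short; the refinement never increases the cost, because a module's original arrangement is always among the permutations tried.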
The proposed BEMAP algorithm and the BB mapping algorithm are both implemented in the open-source NoCTweak simulator for analysis and evaluation of the results. NoCTweak is a SystemC-based NoC simulator developed by A. T. Tran and B. M. Baas [22]. BEMAP is a multi-objective mapping algorithm that computes and optimizes power, latency and throughput under the constraint of link bandwidth. Fig. 3 shows the flow of the algorithm, using an arbitrary example. The initial mapping is performed on the NTG/NCG supplied by the user in the input file of the simulator. The algorithm then generates an application mapping using the BB algorithm. The BEMAP algorithm splits the BB mapping into modules and applies the modular systematic mapping optimization method. The size of the module is user defined and can be changed in the user configuration settings. Fig. 4 shows the mapping of the Context Adaptive Variable Length Coding (CAVLC) benchmark by the BEMAP algorithm. The vertices show the processing tasks and the arcs represent the communication bandwidth with the directional traffic flow. In this case, the number of tasks in the NTG equals the number of PEs in the NCG, so scheduling is not required. The NCG is therefore mapped directly on a 4 × 4 NAG with 2D mesh topology. The resulting mapping shows that the algorithm places heavily communicating tasks close to each other for optimized performance. For example, PE10 and PE9 have a traffic volume of 1428 MB/s; they are therefore placed next to each other on Tile (0, 0) and Tile (0, 1), respectively, with a hop count of one, i.e., with minimum communication cost. Similarly, PE7 is mapped on Tile (1, 1), close to PE9, and so on. The algorithm uses the Bit Energy Model to find the optimized mapping by the minimum cost computation method. The NoC performance parameters are calculated using the CMOS Cell Library Model.
V. RESULTS AND ANALYSIS
The NoCTweak simulator is used to evaluate the performance of the BEMAP algorithm. NoCTweak already includes NMAP for application mapping of the NoC benchmarks. We have designed and implemented the BEMAP algorithm in the NoCTweak simulator, along with the BB algorithm of the NOCMAP simulator. The benchmark applications are collected from the literature and used for the analysis and evaluation of the BEMAP algorithm [9], [22].
A. 2D MESH TOPOLOGY
We have selected the 2D mesh topology and wormhole XY dimension-order routing, with a 2 GHz clock frequency and a traffic injection rate of 0.5 flits/cycle/node, as the configuration of the simulator, as shown in Table 1. The absolute results for power, communication cost and latency of the BB, BEMAP and NMAP algorithms for the listed applications are shown in Table 2. The percentage savings for these parameters are shown in Table 3. The absolute values in Table 2 show that BEMAP performs better than the BB and NMAP algorithms and produces improved results for network power consumption and communication cost. Compared with BB, BEMAP improves power consumption by 0.83%, 13.97%, 1.83%, and 0.29% for the CAVLC, AUTO-IND, TELECOM and MMS applications, respectively, as shown in Table 3. The power improvements of BEMAP over NMAP are 1.86%, 3.97%, 6.63% and 0.14%, respectively, for the same applications. The network cost savings of the BEMAP algorithm are 1.15%, 6.11%, 4.64%, and 1.10% as compared to BB, and 3.98%, 0.76%, 13.64%, and 0.45% as compared to NMAP, for the CAVLC, AUTO-IND, TELECOM and MMS applications, respectively. The cost and power consumption of BB and BEMAP are identical for the VOPD application (Table 2), because this is the lowest possible communication cost for VOPD, which is also confirmed by the ILP method. The cost and power improvements of the BEMAP algorithm for VOPD are 3.54% and 1.38%, respectively, as compared to the NMAP algorithm. The results normalized with respect to the BB algorithm are shown in Fig. 5. The power consumption and network cost of the BEMAP algorithm are better for most of the applications, as shown in Fig. 5 (a) and Fig. 5 (b).
The latency improvements of the BEMAP algorithm over the BB algorithm are 0.99%, 45.50% and 14.23% for the CAVLC, AUTO-IND, and MMS applications, respectively. BEMAP has 1.48% and 60.90% lower latency than the NMAP algorithm for the CAVLC and AUTO-IND applications, as shown in Table 3 and Fig. 5 (c). The latency of the NMAP algorithm is better for some applications, such as VOPD; however, its communication cost and power consumption are worse for most of the applications.
The simulation time and throughput of the BEMAP, BB and NMAP algorithms are compared in Table 4. The simulation time of the BEMAP algorithm is comparable with that of the BB and NMAP algorithms and is less than half a minute for all of these benchmarks. The network has no congestion because of the uniform traffic injection rate; therefore, throughput is almost constant for these benchmark applications, as shown in Fig. 5 (d). The comparative analysis in this section shows that the proposed algorithm achieves better performance parameters than the BB and NMAP algorithms with comparable simulation time.
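The percentage savings reported in this section can be read as relative reductions with respect to the baseline algorithm. The Python sketch below assumes that definition, which is not stated explicitly in the text; `percent_saving` and the numbers used are hypothetical and are not values from Table 2.

```python
def percent_saving(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical power figures (arbitrary units), for illustration only.
saving = percent_saving(baseline=200.0, proposed=160.0)
no_change = percent_saving(baseline=50.0, proposed=50.0)
print(saving, no_change)
```

Under this definition, identical results for two algorithms give a saving of zero.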
B. TORUS TOPOLOGY
We have also used the torus topology to obtain additional results for the quantitative analysis of the proposed algorithm under a different topology. The absolute values of the performance parameters of the BB, NMAP and BEMAP algorithms for the CAVLC, AUTO-IND, TELECOM, MMS and VOPD applications are shown in Table 5. The proposed algorithm produces optimized results for most of the embedded application benchmarks, as shown in Table 6 and Fig. 6.
The power savings of the proposed algorithm are up to 13.97% as compared to the BB algorithm and up to 19.93% as compared to the NMAP algorithm. The communication cost savings are up to 4.64% and 3.05% as compared to the BB and NMAP algorithms, respectively. The latency improvement of the BEMAP algorithm is substantial, reaching up to 45.50% and 61.10% as compared to the BB and NMAP algorithms, respectively. The improvement in latency is due to the inherent characteristics of the bypass links in the torus topology. In addition, the simulation time and throughput results are comparable to those of the BB and NMAP algorithms, as shown in Table 7. The throughput produced by the proposed algorithm is equal to or better than that of the BB and NMAP algorithms. The simulation results show that the proposed algorithm outperforms its competitors on most of the performance parameters and can be used effectively to obtain better and more optimized mapping results.
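The effect of the torus links on hop count can be sketched as below, assuming the bypass links are the usual wrap-around links that connect the two ends of every row and column. The function names are hypothetical.

```python
from typing import Tuple

Tile = Tuple[int, int]


def mesh_hops(a: Tile, b: Tile) -> int:
    """Hop count on a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a: Tile, b: Tile, rows: int, cols: int) -> int:
    """Hop count on a 2D torus: each dimension may use the wrap-around link."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, rows - dx) + min(dy, cols - dy)


# Opposite corners of a 4 x 4 network.
corner_mesh = mesh_hops((0, 0), (3, 3))
corner_torus = torus_hops((0, 0), (3, 3), rows=4, cols=4)
print(corner_mesh, corner_torus)
```

For neighboring tiles the two distances coincide, so the benefit appears only for pairs that are far apart on the mesh.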
C. TRAFFIC ANALYSIS
To check the response of the proposed BEMAP algorithm to traffic variations, traffic is injected into the network at different rates and intervals. We have mapped the Picture In Picture (PIP) real-time application on a NoC architecture with 2D mesh topology, using the BB, BEMAP and NMAP algorithms. The experimental results show that the power consumption of the BEMAP algorithm is lower than that of its competitors even at 100% traffic workload, as shown in Fig. 7 (a). This is because the proposed algorithm distributes the traffic in the network and maps the heavily communicating IPs close to each other to avoid traffic congestion. The throughput of the BB, BEMAP and NMAP algorithms is almost uniform and identical below 50% of the injected traffic. At higher traffic rates, the BEMAP algorithm performs better than the BB and NMAP algorithms, as shown in Fig. 7 (b). The improvement in throughput is due to the lower Manhattan distance selected for the overloaded IPs, which shortens packet traversal in the network.
The algorithm is also validated for different types of network traffic, namely Constant Bit Rate (CBR) and Variable Bit Rate (VBR). The power consumption of the BEMAP algorithm is much lower than that of its competitors and is identical for CBR and VBR traffic for the PIP application, as shown in Fig. 7 (c). The BEMAP algorithm also obtains better throughput than its competitors for the different traffic streams (Fig. 7 (d)).
These results show that the algorithm provides Quality of Service (QoS) with Guaranteed Throughput (GT), because it ensures no packet loss even at heavy traffic loads. We have evaluated the algorithm with the XY dimension-ordered routing algorithm; however, the proposed algorithm in the NoCTweak simulator can also be used with the Negative-First (NF), West-First (WF), North-Last (NL) and Odd-Even (OE) minimal adaptive routing algorithms, which guarantee QoS and avoid packet collisions. The BEMAP algorithm has deadlock-free and livelock-free traffic flow with these adaptive routing algorithms. The BB algorithm provides QoS with Best Effort (BE) service: at lower traffic injection rates it performs better, whereas at higher traffic workloads its performance degrades for both power consumption and network throughput.
VI. CONCLUSION
In this paper, we proposed BEMAP, a hybrid algorithm for mapping embedded applications on the NoC system. The BEMAP algorithm is composed of the BB algorithm and a modular exact systematic search method. In this technique, the fast Branch & Bound (BB) algorithm is used for the initial mapping, which is then optimized by the modular exact optimization method. The Bit Energy Model and the CMOS Cell Library Model are used to compute a cost-effective mapping and the NoC performance parameters under the constraint of link bandwidth reservation. The algorithm efficiently maps the selected real-world application workloads on the available NoC tiles of the 2D mesh and torus topologies. BEMAP generates more efficient mappings of the real benchmark applications than its competitors, NMAP and BB, in terms of power, throughput and latency, with comparable simulation time. The algorithm performs effectively even at higher traffic workloads, with low power consumption and high throughput. In future research work, we will use the proposed algorithm for mapping real-world benchmark applications on other NoC topologies such as cmesh, butterfly and binary tree.
ACKNOWLEDGMENT
The present research has been conducted by the Research Grant of Kwangwoon University in 2018. 
