ABSTRACT Network-on-chip (NoC) is an emerging alternative for addressing the communication problem in embedded system-on-chip designs. A key issue is the optimized mapping of embedded applications onto the underlying NoC platform. In this paper, we propose the bandwidth-constrained multi-objective segmented brute-force mapping (SBMAP) algorithm, which minimizes the communication energy consumption and reduces the computational complexity of NoC designs. The algorithm generates an efficient mapping of embedded applications onto the processing elements of the NoC system by dividing the application into multiple segments. It uses a modular systematic search, which produces high-performance results with optimized simulation time. We compared the SBMAP algorithm with state-of-the-art mapping techniques, namely branch and bound (BB), near-optimal mapping (NMAP), and random mapping, on real-world application workloads. The experimental results validate the efficiency of the proposed algorithm against its competitors for most of the NoC performance parameters. The improvement in energy consumption of the SBMAP algorithm is up to 53% for 2-D mesh and 62% for torus topology compared with the NMAP, BB, and random algorithms for the video object plane decoder, picture in picture, Wi-Fi receiver, and multimedia system real-time application benchmarks.
I. INTRODUCTION
The recent trend toward power-efficient System-on-Chip (SoC) designs has prompted the research community to develop NoC-based designs for power and performance improvements. Classical bus-based, power-hungry systems have scalability and performance issues. Network-on-Chip (NoC) has emerged as a promising solution for embedded System-on-Chip designs [1], [2]. An NoC is a packet-based, on-chip switching network designed for communication among the Intellectual Property (IP) cores of SoC systems [3]. NoCs use packets to exchange data between processing elements (PEs) via a network fabric that consists of Resource Network Interfaces (RNIs), routers, and interconnecting links, as shown in Figure 1.
There are various open research issues [4] in NoC design, and researchers are addressing them with distinct methodologies [5]-[7]. The design flow of the NoC architecture consists of Application Partitioning, Task Scheduling, and Application Mapping on the target NoC platform. Application Partitioning and Task Scheduling are related to task deadlines and execution time. These are traditional CAD problems and have been addressed by the CAD community in the hardware/software co-design phase [28], [29]. In NoC designs, however, one of the most important issues is the mapping of applications onto the underlying NoC platform [8]. Application mapping determines the topological placement of the IPs onto the NoC platform in order to optimize performance metrics such as energy, latency, throughput, and power. A hard NoC is an NoC architecture with pre-designed computation and communication infrastructure; there is no flexibility in assigning IP cores to their placeholders. Firm NoCs have a pre-designed communication architecture, but the topological placement of the IPs is still to be decided. Application mapping is an NP-hard (nondeterministic polynomial-time hard) problem [9], because the search space grows factorially with the system size. To map k Intellectual Property (IP) cores on n network nodes (k ≤ n), the number of possible core arrangements (S) on the NoC network is given by:
$$S = \frac{n!}{(n-k)!} \tag{1}$$
When the number of IP cores equals the number of network nodes (n = k), the number of possible IP mappings on the network nodes becomes n!. Application mapping is therefore a combinatorial optimization problem that requires efficient heuristic algorithms. Integer Linear Programming (ILP) produces the best solution for small real-world problem sizes [10], [11]. As the NoC size scales up, ILP methods can no longer solve the problem in reasonable time, as is evident from (1). Various heuristics proposed in the literature speed up the computation [12]-[16], but compromise the quality of the solution.
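To give a feel for how quickly this search space grows, the following minimal sketch counts the placements of k distinct cores on n distinct nodes (one core per node), which is the standard permutation count n!/(n−k)!. The function name `mapping_count` is illustrative and not part of any simulator.

```python
import math


def mapping_count(n_nodes: int, k_cores: int) -> int:
    """Number of one-to-one placements of k distinct cores on n distinct nodes."""
    if not 0 <= k_cores <= n_nodes:
        raise ValueError("need 0 <= k_cores <= n_nodes")
    # math.perm(n, k) computes n! / (n - k)! exactly with integers
    return math.perm(n_nodes, k_cores)


# Fully populated square meshes (k = n): the count is n!
for side in (3, 4, 5):
    nodes = side * side
    print(f"{side}x{side} mesh: {mapping_count(nodes, nodes)} possible mappings")
```

Even a 4×4 mesh already has more than 2 × 10^13 candidate mappings, which is why an exhaustive search over the whole application is impractical.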
Modern Android and embedded System-on-Chip designs are battery powered and consume high power due to their complexity, so optimizing energy and performance parameters is an important aspect of their development. Application mapping optimization is one of the most important parts of the embedded system design phase because it can greatly affect energy consumption and performance [31]-[35]. In this work we propose a Bandwidth-Constrained Multi-Objective Segmented Brute-Force Mapping algorithm (SBMAP) for NoC application mapping. SBMAP uses a modular systematic search to reduce the computational time complexity and to improve the quality of the feasible solution for the performance parameters compared to other heuristics. The algorithm divides the input stream of the application into multiple segments and solves each with a permutation-based modular systematic search. The input stream is the collection of IP cores and tasks, together with the required bandwidth, of the embedded application. Tasks with high communication demands toward their neighbors are grouped in distinct segments for the initial mapping. The initially mapped segments are then iteratively optimized by the algorithm toward the best possible solution. For each partially generated mapping, the algorithm calculates the energy and cost of the network and retains the mapping with minimum cost and energy consumption. The algorithm keeps a record of the previous segmented streams of data to generate a cumulative optimized solution. SBMAP is embedded in the NoCTweak simulator [26] to generate optimized power, throughput, and latency under the constraint of bandwidth reservation.
The rest of this paper is organized as follows. In Section II, we present related research work on NoC application mapping. Section III briefly describes the problem formulation and the mathematical models for the NoC performance parameters. Section IV presents the proposed Bandwidth-Constrained Multi-Objective SBMAP algorithm. Section V analyzes the simulation results, and Section VI presents concluding remarks and future work.
II. RELATED WORK
Application mapping is an NP-hard problem and can be handled through heuristic or systematic search techniques. Various algorithms, namely Branch and Bound (BB), Near-optimal Mapping (NMAP), Random mapping, Simulated Annealing (SA), Tabu Search (TS), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO), are used for energy, power, latency, throughput, and bandwidth optimization of NoCs [17], [30].
Hu [18] presented the Branch and Bound (BB) mapping algorithm for topological placement of IPs onto the NoC platform to minimize total energy consumption under a link bandwidth constraint. The author compared the algorithm with an ad hoc Simulated Annealing (SA) method and showed that BB is faster, although it produced suboptimal results compared to the SA technique. Lei and Kumar [19] proposed a delay-based two-step Genetic Algorithm (GA) for on-chip communication in NoCs. Minimization of the total execution time of the tasks is the objective function for both mapping and scheduling of the IPs. The application is mapped on a 2D mesh topology to minimize the execution time. For mapping the IP cores on the NoC, message delays from source to destination node are estimated using a delay model. To find the critical path and schedule the vertices of the task graph on NoC nodes, As Late As Possible (ALAP) and As Soon As Possible (ASAP) scheduling is used. This approach did not consider power and energy consumption, which are important aspects of application mapping.
Murali and De Micheli [20] presented an algorithm for mapping IP cores onto NoC architectures. The network traffic is divided among the IPs across multiple links on a 2D mesh topology under the constraint of bandwidth reservation. The authors proposed a heuristic approach with a minimal-path routing algorithm for mapping the cores on a 2D mesh topology. This approach includes initialization of the mapping, minimal-path computation, and pairwise swapping of the vertices. The resulting NMAP algorithm splits the traffic along different minimum paths such that the bandwidth constraint is satisfied. The results obtained by NMAP were compared with the BB algorithm: for small numbers of cores BB produced better solutions, but as the system size scaled up, NMAP performed better than BB. Lu et al. [21] proposed Cluster-based Simulated Annealing (CSA) to reduce the simulation time of the annealing process for large system sizes. CSA clusters the IPs according to pre-defined rules and treats each cluster as a single entity, which reduces the effective system size, speeds up the simulation, and yields a near-optimal solution. In the clustering process, the edge tiles are combined in the same group, while the central tiles are shared with a different group based on their neighborhood. The clustering technique sped up the computation but compromised the optimality of the results. Radu and Vintan [9] presented an Optimized Simulated Annealing (OSA) algorithm for NoC application mapping on a 2D mesh topology. In this method, the authors tuned the annealing parameters to produce optimal results in a shorter time than conventional simulated annealing. The annealing schedule, number of iterations per temperature level, acceptance function, Probability Distribution Function (PDF) based swapping, and stopping conditions are modified to speed up the simulation. The simulation time of this method is still higher than that of other heuristic techniques.
Ascia et al. [22] proposed a multi-objective mapping heuristic for the 2D mesh topology of the NoC architecture to optimize network performance and power consumption, using a Multi-Objective Genetic Algorithm to map the IP cores. Jena [23] proposed an algorithm for IP mapping on the 2D mesh topology, in which a multi-objective GA-based heuristic maps the IP cores and finds mappings that optimize network power consumption, link bandwidth, and network performance. Sepulveda et al. [24] proposed MAIA (Multi-Objective Adaptive Immune Algorithm), a multi-application evolutionary algorithm for the NoC mapping problem. Harmanani [25] presented a Simulated Annealing (SA) algorithm for task assignment to network nodes on a 2D mesh topology. The author also proposed a routing algorithm that optimizes message blocking, bandwidth, and throughput of the network. Most of the mapping algorithms above compromise either simulation time or the optimality of the results. The proposed SBMAP algorithm optimizes both the computation time and the resulting performance parameters of the NoC system.
III. PROBLEM FORMULATION
In NoC designs, an application is represented by Network Task Graph (NTG) which is subsequently scheduled by a scheduler on the available IP cores through Network Core Graph (NCG). NCG is then transformed and mapped by an efficient mapping algorithm on NoC topology through NoC Architecture Graph (NAG).
Definition 1: A Network Task Graph (NTG) is a directed acyclic graph, $N = N(T, C)$, in which each vertex represents a task ($t_i \in T$, $i = 1, 2, 3, \ldots$) for the computational resource of the application. Each task is associated with an execution time, energy consumption, and resource deadlines. The directed arc ($c_{i,j} \in C$, $i = 1, 2, 3, \ldots$, $j = 1, 2, 3, \ldots$) represents either the data volume or the communicated information between the communicating tasks ($t_i$, $t_j$).
Definition 2: A Network Core Graph (NCG), $G = G(P, A)$, is a directed graph in which each vertex ($p_i \in P$) represents an intellectual property (IP) core or Processing Element (PE). The directed arc ($a_{i,j} \in A$) carries the characteristic parameters and required bandwidth between the IP cores ($p_i$ to $p_j$).
Definition 3: A NoC Architecture Graph (NAG), $A = A(R, H)$, is an architecture graph in which each vertex ($r_i \in R$) represents a router node and the directed arc ($h_{i,j} \in H$) represents the routing channel or link from router $r_i$ to router $r_j$. The router transmits and receives the data volume of the associated IP cores, and the routing channels ($N_h$) provide the routing paths for the communicated packets. Each routing channel is associated with data bandwidth information, $B_{t_i,t_j}$.
The following mathematical models are used for energy, power, throughput and latency calculations of the SBMAP algorithm for the optimized application mapping on the NoC architecture.
A. BIT ENERGY MODEL
The bit energy [8] of the NoC platform is given by:
$$E_{Bit} = E_S + E_L$$
$E_{Bit}$ is the energy consumed in sending one bit of data from the source to the destination node and includes the switch energy ($E_S$) and link energy ($E_L$) of the NoC network. The average network energy ($E_{t_i,t_j}$) consumed in transmitting one bit of data from source tile ($t_i$) to destination tile ($t_j$) is given by:
Where $N_h$ is the Manhattan distance from the source node $(x_i, y_i)$ to the destination node $(x_j, y_j)$ of the NoC architecture and is given by:
$$N_h = |x_i - x_j| + |y_i - y_j|$$
The total energy consumption of the network is therefore given by:
$B_{t_i,t_j}$ is the arc bandwidth from tile $t_i$ to tile $t_j$. Therefore,
The cost can be computed using the following equation:
Different mappings will generate different cost and energy solutions, and our objective is to find a mapping function that produces minimum cost and energy for the entire network operation. In this research work cost and energy of the NoC applications are used as the performance indicators for different applications, mapped to their placeholders.
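As a minimal sketch of how such a cost and energy evaluation can be coded, the example below scores one candidate mapping. It assumes that every hop costs one switch traversal plus one link traversal, so the per-bit energy between two tiles is the Manhattan distance times (E_S + E_L), and that the communication cost is the bandwidth-weighted hop count. Both weightings are assumptions made for illustration; formulations in the literature differ, for example in counting one more switch than links. All names are hypothetical.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]          # (x, y) coordinate of a tile
Arc = Tuple[str, str]           # (source task, destination task)


def manhattan(a: Tile, b: Tile) -> int:
    """Hop count N_h between two tiles of a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost_and_energy(
    bandwidth: Dict[Arc, float],
    placement: Dict[str, Tile],
    e_switch: float,
    e_link: float,
) -> Tuple[float, float]:
    """Return (cost, energy) of a mapping under the assumed per-hop model."""
    cost = 0.0
    energy = 0.0
    for (src, dst), bw in bandwidth.items():
        hops = manhattan(placement[src], placement[dst])
        cost += bw * hops                           # assumed: bandwidth x hops
        energy += bw * hops * (e_switch + e_link)   # assumed: per-hop bit energy
    return cost, energy


# Illustrative three-task application on a 2x2 mesh (made-up numbers)
bandwidth = {("a", "b"): 100, ("b", "c"): 50}
good = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
bad = {"a": (0, 0), "b": (1, 1), "c": (0, 1)}
print(mapping_cost_and_energy(bandwidth, good, e_switch=1.0, e_link=0.5))
print(mapping_cost_and_energy(bandwidth, bad, e_switch=1.0, e_link=0.5))
```

Placing the heavily communicating pair on adjacent tiles lowers both figures, which is the effect a mapping algorithm tries to exploit.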
B. CMOS CELL LIBRARY MODEL
CMOS Cell Library Model utilizes post layout cell data of standard CMOS libraries for calculation of timing and power estimation [26] . Our proposed SBMAP algorithm uses CMOS standard Cell Library Model for computations of average latency, throughput, power and energy consumption of the NoC system. The average latency of the network under this model is given by:
Where $Lt_{i,j}$ is the latency of packet $j$, $N_i$ is the number of packets received by processor $i$ after the warmup time, and $N$ is the number of processors in the platform. The average throughput of the network is given by:
Where $T_{sim}$ is the simulation time and $T_{wrm}$ is the warmup time of the simulations. The average power of the network is calculated by:
Where $Pw_{act,j}$ is the post-layout active power and $Pw_{inact,j}$ is the post-layout inactive power of component $j$. The parameter $\alpha_{i,j}$ is the active percentage of component $j$ in router $i$ (after $T_{wrm}$). Finally, the average energy consumed by each packet in the network is given by:
Where $N_P$ is the total number of packets traversing the network of the NoC architecture.
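The quantities defined above can be combined in several ways. The sketch below shows one plausible reading and should be treated as an assumption rather than the simulator's exact formulas. Latency is averaged per processor and then across processors, throughput is packets per processor per unit of measured time, power sums the activity-weighted active and inactive power of every component in every router, and energy per packet is power times measured time divided by the packet count. All function names are hypothetical.

```python
from typing import List, Sequence


def average_latency(latencies_per_proc: Sequence[Sequence[float]]) -> float:
    """Mean over processors of the mean packet latency; idle processors are skipped."""
    per_proc = [sum(l) / len(l) for l in latencies_per_proc if l]
    return sum(per_proc) / len(per_proc)


def average_throughput(packets_per_proc: Sequence[int],
                       t_sim: float, t_wrm: float) -> float:
    """Packets received per processor per time unit after warmup."""
    n = len(packets_per_proc)
    return sum(packets_per_proc) / (n * (t_sim - t_wrm))


def network_power(active_fraction: Sequence[Sequence[float]],
                  pw_active: Sequence[float],
                  pw_inactive: Sequence[float]) -> float:
    """Sum over routers i and components j of alpha*Pw_act + (1 - alpha)*Pw_inact."""
    total = 0.0
    for alphas in active_fraction:            # one row per router
        for j, alpha in enumerate(alphas):    # one entry per component
            total += alpha * pw_active[j] + (1.0 - alpha) * pw_inactive[j]
    return total


def energy_per_packet(power: float, t_sim: float, t_wrm: float,
                      n_packets: int) -> float:
    """Energy spent during the measured window, divided by the packets delivered."""
    return power * (t_sim - t_wrm) / n_packets


# Made-up numbers for a two-processor, one-router example
lat = average_latency([[10, 20], [30]])
thr = average_throughput([4, 6], t_sim=110, t_wrm=10)
pwr = network_power([[0.5, 1.0]], pw_active=[2.0, 4.0], pw_inactive=[1.0, 0.0])
print(lat, thr, pwr, energy_per_packet(pwr, 110, 10, 10))
```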
C. ORION MODEL
ORION Model [27] calculates the power and energy at the discrete and component level of the network and can be embedded within the simulation environment for total energy calculations of the network. ORION model is not used in our proposed SBMAP algorithm but can be utilized as an extended part of this research work.
IV. SBMAP ALGORITHM
The proposed Segmented Brute-Force Mapping (SBMAP) algorithm takes the Network Task Graph (NTG) and Network Core Graph (NCG) as inputs and performs the topological placement of the tasks on the available tiles of the NoC platform to generate an efficient NoC Architecture Graph (NAG). The transformation from NTG to NCG intrinsically requires scheduling the tasks (T) on the available processors (P) for execution. When the number of tasks is equal to or less than the number of processors (T = P or T < P), each task can be assigned to an individual PE; if T > P, two or more tasks have to be scheduled on a single PE to accommodate all the tasks of the NTG. For this purpose, a scheduler is required before performance simulations. The scheduler handles the control and data dependencies and manages the execution times, deadlines, and priorities of the processing tasks. In this work, we consider one-to-one mapping of tasks onto the IP cores, i.e., T = P; scheduling is therefore not required at this stage, and the transformation of NTG to NCG has no timing bounds or deadline constraints for the simulations.
In the proposed algorithm (Algorithm 1), the initial mapping is based on the bandwidth requirements and communication workload between PEs. The communication-intensive tasks are grouped into segments of ten tasks or fewer to minimize computation time and network energy consumption. These segments are then iteratively checked by the SBMAP algorithm for minimum cost using the modular systematic search method. The mapping and cost of the preceding groups are retained and used sequentially in the cost calculation of the application, as shown in Figure 2. The SBMAP algorithm systematically searches the problem space for an optimized mapping of an application on the NoC platform and proceeds as follows:
• SBMAP initializes the task graph by keeping the most heavily communicating tasks in the same segment.
• Divides the search space into small segments of IPs to minimize execution time.
• Searches for the best solution for each segment by permutation, using the modular systematic search method.
• Retains and updates the best mapping of each segment to minimize energy consumption.
• Explores all the segments for optimized mapping, based on the cumulative cost comparison.
• Calculates power, energy, latency and throughput of the best mapping found.
The proposed algorithm is multi-objective in nature, with the constraint of bandwidth reservation. Its main objectives are to optimize performance parameters such as power, energy, latency, and throughput of the embedded application. Figure 3 shows the flow of the mapping algorithm using an arbitrary example with 16 tasks. The application mapping is performed on the NTG and NCG input files supplied by the user to the simulator. In the first phase, the algorithm generates an initial mapping by grouping heavily communicating tasks into distinct segments. The SBMAP algorithm then splits the initial mapping into modules and applies the modular systematic mapping optimization technique. The module size is user-selectable and can be customized in the user configuration settings. The algorithm uses the bit energy model to find the optimized mapping by minimum cost computation. The NoC performance parameters are calculated using the CMOS cell library model.
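The steps above can be illustrated with a short sketch. This is one possible interpretation of the segmented search, not the authors' SystemC implementation. It assumes a greedy traffic-based ordering for the initial grouping, a row-major assignment of segments to tiles, a bandwidth-weighted hop count as the cost, and a cumulative cost that only counts arcs between already placed tasks. The link bandwidth constraint is omitted for brevity, and all names (`sbmap_sketch`, `order_by_traffic`, `comm_cost`) are hypothetical.

```python
from itertools import permutations


def hops(a, b):
    """Manhattan distance between two mesh tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def comm_cost(bandwidth, placement):
    """Bandwidth x hops, summed over arcs whose endpoints are both placed."""
    return sum(bw * hops(placement[s], placement[d])
               for (s, d), bw in bandwidth.items()
               if s in placement and d in placement)


def order_by_traffic(tasks, bandwidth):
    """Greedy ordering that keeps heavily communicating tasks next to each other."""
    def traffic(task, group):
        return sum(bw for (s, d), bw in bandwidth.items()
                   if (s == task and d in group) or (d == task and s in group))

    remaining = list(tasks)
    ordered = [max(remaining, key=lambda t: traffic(t, set(tasks)))]
    remaining.remove(ordered[0])
    while remaining:
        nxt = max(remaining, key=lambda t: traffic(t, set(ordered)))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered


def sbmap_sketch(tasks, bandwidth, width, height, seg_size=4):
    """Brute-force each segment over its own tiles, keeping earlier segments fixed."""
    tiles = [(x, y) for y in range(height) for x in range(width)]
    if len(tasks) > len(tiles):
        raise ValueError("more tasks than tiles")
    ordered = order_by_traffic(tasks, bandwidth)
    placement = {}
    for start in range(0, len(ordered), seg_size):
        seg_tasks = ordered[start:start + seg_size]
        seg_tiles = tiles[start:start + seg_size]
        best_cost, best_perm = None, None
        # Exhaustive search inside the segment: at most seg_size! candidates
        for perm in permutations(seg_tiles, len(seg_tasks)):
            trial = {**placement, **dict(zip(seg_tasks, perm))}
            cost = comm_cost(bandwidth, trial)
            if best_cost is None or cost < best_cost:
                best_cost, best_perm = cost, perm
        placement.update(zip(seg_tasks, best_perm))  # retain the best segment mapping
    return placement, comm_cost(bandwidth, placement)


# Illustrative four-task chain on a 2x2 mesh (made-up bandwidths)
tasks = ["a", "b", "c", "d"]
bandwidth = {("a", "b"): 100, ("b", "c"): 10, ("c", "d"): 100}
placement, cost = sbmap_sketch(tasks, bandwidth, width=2, height=2, seg_size=2)
print(placement, cost)
```

With a segment size of s, each segment costs at most s! evaluations, so the total work grows roughly linearly in the number of segments instead of factorially in the number of tasks. The trade-off is that tasks cannot move between segments after the initial grouping.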
The proposed algorithm is written in SystemC and embedded in the NoCTweak simulator. NoCTweak is an open-source SystemC-based simulator developed by Anh T. Tran for NoC design simulations [26]. The simulator integrates two algorithms (NMAP and Random) for application mapping of embedded systems. In addition, the Branch and Bound (BB) algorithm of the NOCMAP simulator [18] is also integrated into NoCTweak for comparison and analysis of the proposed algorithm and its competitors. The NoCTweak platform provides a fair and uniform simulation environment for comparing the energy, power, throughput, and latency of all the mapping algorithms under study.
V. RESULTS AND ANALYSIS
To verify the effectiveness of the SBMAP algorithm, two topologies, 2D mesh and torus, are used for mapping and comparative analysis. As a case study, four real-time benchmarks, Multimedia System (MMS), Video Object Plane Decoder (VOPD), Picture-in-Picture (PIP), and Wi-Fi receiver (80211arx), shown in Figure 4 [17], [26], are used for mapping and evaluation of the performance parameters. The Network Task Graphs (NTGs) of the benchmarks in Figure 4 show the tasks, the communication workloads (MB/s), the interdependencies, and the traffic flow of the communicating tasks. For example, the MMS application shown in Figure 4(a) requires 25 processing tasks, T0 to T24. The communication bandwidth from T0 to T1 is 38106 MB/s over a unidirectional link. Similarly, the remaining nodes are represented in the graph with their required bandwidth, traffic flow, and interdependencies. As mentioned in Section IV, we consider one-to-one mapping of tasks onto the IP cores of the network; the MMS application is therefore mapped on 5×5 2D mesh and torus topologies for analysis. The Wi-Fi application has 24 tasks and is therefore also mapped on 5×5 mesh and torus topologies. VOPD has 16 tasks and is mapped on 4×4 topologies, while PIP requires 9 tasks and is mapped on 3×3 2D mesh and torus topologies. The simulation settings used for comparing the application mappings under the uniform NoCTweak platform are shown in Table 1. These configuration settings are chosen only for a fair comparison of the algorithms and have no effect on the design structure of the proposed algorithm.
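The network sizes quoted above are consistent with choosing the smallest square mesh that has at least one tile per task. The following small sketch makes that rule explicit; the rule itself is an inference from the four listed benchmarks, and the function name is hypothetical.

```python
import math


def smallest_square_mesh(num_tasks: int) -> int:
    """Side length of the smallest square mesh with at least num_tasks tiles."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be positive")
    side = math.isqrt(num_tasks)      # integer floor of the square root
    if side * side < num_tasks:
        side += 1                     # round up when not a perfect square
    return side


for name, n_tasks in [("MMS", 25), ("80211arx", 24), ("VOPD", 16), ("PIP", 9)]:
    side = smallest_square_mesh(n_tasks)
    print(f"{name}: {n_tasks} tasks on a {side}x{side} network")
```

For the 24-task Wi-Fi receiver this leaves one tile of the 5×5 network unused.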
A. 2D MESH TOPOLOGY
The results obtained by the proposed SBMAP algorithm for the 2D mesh topology are compared with the NMAP, BB, and Random algorithms in Tables 2 and 3. Table 2 shows the simulated results and the percentage savings of SBMAP for power, energy, and cost. Similarly, Table 3 shows latency, simulation time, throughput, and the percentage improvements of SBMAP over the BB, NMAP, and Random mapping algorithms. NMAP is a standard baseline for comparing mapping algorithms, so we normalized the results to NMAP, as shown in Figure 5 and Figure 6. For the VOPD application, SBMAP improves power consumption by 28.8%, 0.5%, and 38.7% over the BB, NMAP, and Random algorithms, respectively, as shown in Table 2 and Figure 5(a). For the PIP application, it saves 24.5% power relative to NMAP and 32.2% relative to the Random algorithm. For the MMS application, SBMAP shows a 23.7% improvement over the Random algorithm. The power consumption improvement for 80211arx is 15.4%, 28%, and 45.3% compared to the BB, NMAP, and Random algorithms, respectively. The reduction in energy consumption for the VOPD application is 28.8% and 36.4% compared to the BB and Random algorithms, as shown in Table 2 and Figure 5(b). For the PIP application, the energy savings are 24.5% and 31.3% compared to the NMAP and Random algorithms, respectively. For MMS, mapping with the Random algorithm results in 24.1% higher energy consumption. SBMAP has 20%, 35.6%, and 53.5% lower energy consumption for the 80211arx application compared to the BB, NMAP, and Random algorithms, respectively. The absolute values and savings of the communication cost are also shown in Table 2. Figure 5(c) shows the normalized cost of SBMAP for the different applications, which is lower than that of the BB, NMAP, and Random algorithms.
The network latency of SBMAP is lower for these applications than that of the BB, NMAP, and Random algorithms, as shown in Table 3 and Figure 6(a). The simulation time of SBMAP, shown in Figure 6(b), is slightly higher than that of the prior algorithms, but remains within a few seconds (Table 3). However, the quality of the solution, and therefore the energy consumption of portable Android devices, is more important than a slight increase in simulation time. Simulation time is only incurred in the design phase of the system and can be reduced using faster computers or accelerators. Throughput is nearly constant except for the 80211arx application, which shows a 9% improvement over the NMAP and Random algorithms, as shown in Figure 6(c). The throughput is constant because the network is not saturated by the injected traffic. The mapping results for communication cost, latency, power, and energy consumption show that SBMAP is more efficient than the BB, NMAP, and Random algorithms for all the listed applications. The improvements are due to the modular systematic search of the algorithm, as opposed to the blind search or random heuristic nature of the other algorithms.
B. TORUS TOPOLOGY
To analyze and compare the SBMAP algorithm against its competitors, we also considered the torus topology for the aforementioned embedded applications. Table 4 shows the absolute values and percentage savings of SBMAP in terms of power, energy, and cost. The results normalized to NMAP are shown in Figure 7. The latency, simulation time, and throughput results are shown in Table 5 and Figure 8. The results show that SBMAP outperforms the BB, NMAP, and Random algorithms in all the performance parameters under different embedded application workloads. The results also confirm that the proposed mapping algorithm is more efficient on the torus topology than on the 2D mesh. The improvements in power consumption of SBMAP for the VOPD application are 2.8%, 4.9%, and 32.3% compared to the BB, NMAP, and Random algorithms, respectively. The SBMAP power savings for the PIP, MMS, and 80211arx applications are up to 55.7%, as shown in Table 4 and Figure 7(a). Similarly, the reduction in energy consumption of the proposed algorithm ranges from 0.6% to 62% compared to the BB, NMAP, and Random algorithms (see Table 4 and Figure 7(b)). The cost improvements are up to 44.4%, 11.1%, and 66.7% compared to the BB, NMAP, and Random algorithms, respectively, as shown in Figure 7(c). Table 5 shows latency, simulation time, and throughput comparisons of the listed algorithms for the torus topology. The results reveal that the proposed mapping algorithm incurs lower latency on the torus topology than on the 2D mesh. The latency improvements for the VOPD, PIP, MMS, and 80211arx applications are up to 1.7%, 31.4%, and 27% compared to the BB, NMAP, and Random algorithms, respectively (see Table 5 and Figure 8(a)). The simulation time shown in Figure 8(b) is slightly compromised in order to obtain optimized results, but it is still very low and is of limited concern because the mapping process is customarily carried out prior to design implementation. The simulation time can be further improved using state-of-the-art fast computers or accelerators. Throughput is almost constant because the network has no congestion at this traffic load, as shown in Figure 8(c). The optimized results obtained for SBMAP confirm its efficiency on both the 2D mesh and torus topologies against all of its competitor algorithms.
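One general reason a torus can yield shorter paths than a mesh of the same size is its wraparound links. The sketch below compares minimal hop counts on the two topologies, assuming minimal routing and a wraparound link in each dimension. It illustrates the topological property only and does not model the simulator's routers; the function names are hypothetical.

```python
def mesh_hops(a, b):
    """Minimal hop count between tiles a and b on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, width, height):
    """Minimal hop count on a 2D torus, where each dimension wraps around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # In each dimension, take the shorter of the direct and wraparound routes
    return min(dx, width - dx) + min(dy, height - dy)


corner_a, corner_b = (0, 0), (4, 4)
print("mesh :", mesh_hops(corner_a, corner_b))
print("torus:", torus_hops(corner_a, corner_b, width=5, height=5))
```

Opposite corners of a 5×5 network are far apart on a mesh but close on a torus, so a bandwidth-weighted hop cost can only stay the same or decrease when the same placement is evaluated on a torus.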
VI. CONCLUSION
This work addressed an active and demanding issue: application mapping of real-time embedded applications in NoC-based systems. We presented the Bandwidth-Constrained Multi-Objective Segmented Brute-Force Mapping (SBMAP) algorithm and compared it with the Branch and Bound (BB), NMAP, and Random algorithms in terms of average latency, throughput, power, and energy consumption of on-chip networks. SBMAP adds segmentation to exact mapping techniques for placing IP cores on NoC tiles. The algorithm iteratively searches for the optimal mapping of the different segments of the task graph and achieves the best results by applying the modular systematic search method. Experimental results show a significant reduction in power and energy consumption for the proposed mapping algorithm compared to its competitors on real-world application benchmarks. Improved results are also obtained for throughput, latency, and cost on the 2D mesh and torus topologies of the NoC architecture. The improvements are due to the fact that the proposed algorithm systematically searches for the best solution, in contrast to the blind search of the other algorithms. Alongside these performance gains, the simulation time of the algorithm remains comparable to that of existing algorithms. Hence, our mapping algorithm can be used as an effective mapping heuristic, given its better performance in terms of average latency, power, and energy consumption.
This algorithm can also be applied to application mapping of embedded benchmarks on other topologies such as folded torus, butterfly, and fat tree, which remains a future extension of this work.
