The paper addresses the problem of topological mapping of intellectual properties (IPs) on the tiles of a mesh-based network-on-chip (NoC) architecture. The aim is to obtain the Pareto mappings that maximize performance and minimize power consumption. As the problem is NP-hard, we propose a heuristic technique based on evolutionary computing to obtain an optimal approximation of the Pareto-optimal front in an efficient and accurate way. In addition, two of the most widely known approaches to mapping in mesh-based NoC architectures are extended in order to explore the mapping space in a multi-criteria mode. The approaches are then evaluated and compared, in terms of both accuracy and efficiency, on a platform based on an event-driven trace-based simulator, which makes it possible to take account of important dynamic effects that have a great impact on mapping. The evaluation performed on real applications (an MPEG-4 codec and a cellular phone application) confirms the efficiency, accuracy and scalability of the proposed approach.
I. Introduction
Continuous improvements in semiconductor technology mean that a whole processing system comprising processors, memories, accelerators, peripherals, etc. can now be integrated in a single silicon die. In addition, a reduction in the time-to-market has led researchers to define methods based on the reuse of pre-designed, pre-tested modules in the form of intellectual properties (IPs). Despite this, hardware designers are not yet able to fully exploit the abundance of transistors that can be integrated with current technology. Designer productivity, in fact, is growing by just 20% a year, as compared to an increase of over 60% a year by technology [33]. This gap will have to be reduced in order to respond to future requests by the consumer applications market (smart phones, automotive electronics, home networks, entertainment systems, etc.). Possible solutions to these problems can be sought in platform-based design (PBD), which is based on the reuse of components, architectures, applications and implementations [8, 21, 26]. Of course the aim is always to obtain a good trade-off between generality and performance. Generality makes it possible to reuse hardware, software, development flows, etc., while performance (latency, cost, power, etc.) can be guaranteed by using specific dedicated architectures.
Without doubt, the on-chip interconnection system is today one of the major elements that have to be optimized in designing a complex digital system. The International Technology Roadmap for Semiconductors [33] foresees that it will represent the limiting factor for performance and power consumption in next-generation systems-on-a-chip (SOCs). The continuous reduction in the time-to-market required by the telecommunications, multimedia and consumer electronics market makes full-custom design of an interconnection system inappropriate and has led to the definition of design methodologies focusing on design reuse. This is confirmed by the great standardization effort made by the VSI Alliance [42] and the development, by the major EDA and semiconductor companies, of on-chip interconnection systems that are easy to integrate and scale [20, 3, 29, 35, 36]. Although they are good solutions for current SOCs integrating fewer than 5 processors and rarely more than 10 master buses, their use in next-generation systems, which are likely to integrate hundreds of modules, seems hardly feasible.
The limiting factor is mainly the topological organization of the interconnection between the various units, which will substantially remain bus-based. As regards performance, the continuous reduction in gate delays and increase in wiring delays will cause significant synchronization problems. In 50 nm technology, the projected chip die edge will be around 22 mm, with a clock frequency of 10 GHz. An optimistic estimate of the propagation delay for a signal crossing a chip diagonally ranges between 6 and 10 clock cycles [38]. At any rate, Moore's law will remain valid for the next 10 years and single processors will not be able to use all the transistors on a chip. Synchronous regions will occupy an increasingly small fraction of a chip [37], giving rise to locally synchronous, globally asynchronous solutions [16]. Applications will be modelled as a set of communicating tasks with different characteristics (e.g. control-dominated, data-dominated) and origins (reused from previous projects or acquired from third parties), which will make implementations extremely heterogeneous.
Giuseppe Ascia, Vincenzo Catania and Maurizio Palesi

A. Network on Chip
A type of architecture which lays emphasis on modularity and is intrinsically oriented towards supporting such heterogeneous implementations is represented by Network-on-Chip (NoC) architectures [12]. These architectures loosen the bottleneck due to delays in signal propagation in deep-submicron technologies and provide a natural solution to the problem of core reuse by standardising on-chip communications. The NoC architectural topology most frequently referred to can be represented by an n × m mesh. Each tile of the mesh contains a resource and a switch. Each switch is connected to a resource and the four adjacent switches. A resource is generally any core: a processor, a memory, an FPGA, a specific hardware block or any other IP compatible with the NoC interface specifications. More generally, a resource may be represented by a complex multi-master and multi-slave system using an interconnection network based on a shared bus. The design flow for an architecture of this kind involves several steps. First the application has to be split up into a set of concurrent communicating tasks. Then the IPs are selected from the IP portfolio and the tasks are assigned and scheduled. Finally, the IPs have to be mapped onto the mesh in such a way as to optimise the metrics of interest.
The last phase is attracting more and more interest in the scientific community [17, 28], as it has a strong impact on the typical performance indexes to be optimized. Unfortunately, the mapping problem is an instance of the constrained quadratic assignment problem, which is known to be NP-hard [14]. The search space of the problem increases factorially with the system size. It is therefore of strategic importance to define methods to search for a mapping that will optimise the desired performance indexes (performance, power consumption, quality of service, etc.) with a good trade-off between accuracy and efficiency. This represents the main focus of this paper. In addition, these strategies have to allow a multi-criteria exploration of the space of possible architectural mapping alternatives. The objectives to be optimised are, in fact, frequently multiple rather than single, and are almost always in contrast with each other. There is therefore no single solution to the problem of exploration (i.e. a single mapping) but a set of equivalent (i.e. non-dominated) possible architectural alternatives, featuring a different trade-off between the values of the objectives to be optimised (Pareto set).
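To give a feel for the factorial growth mentioned above, the following minimal sketch counts the one-to-one placements of cores on tiles for a few square meshes. The function name `mapping_space_size` is introduced here for illustration only and assumes one core per tile.

```python
import math

def mapping_space_size(rows: int, cols: int) -> int:
    """Number of one-to-one placements of rows*cols cores on rows*cols tiles."""
    return math.factorial(rows * cols)

for n in (3, 4, 5):
    size = mapping_space_size(n, n)
    # log10 gives the order of magnitude of the search space
    print(f"{n}x{n} mesh: {size} mappings (about 10^{math.log10(size):.1f})")
```

Even for a 5x5 mesh, exhaustive enumeration is clearly out of reach, which is why heuristic exploration is needed.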
B. Contribution
In this paper we present a multi-objective exploration approach for the mapping space of a mesh-based NoC architecture. The approach, based on evolutionary computing techniques, is an efficient and accurate way to obtain the Pareto mappings that optimize performance and power consumption. In addition, two of the most widely known approaches to topological mapping of IPs in a mesh-based NoC architecture [17, 28] have been extended to achieve multi-criteria optimization and have been compared with the approach proposed here. In contrast with the approaches in the existing literature, which use static analysis to evaluate a mapping, here we use an event-driven trace-based simulator, which makes it possible to take account of important dynamic effects that have a great impact on the performance indexes to be optimised. To the best of our knowledge, this work is the first attempt to attack the topological mapping problem for NoC architectures from a multi-objective point of view while modelling important dynamic effects such as contention for outgoing links, backpressure effects, and the influence of buffer size, packet size, etc.
C. Paper Organization
The rest of the paper is organized as follows. Section II summarizes some of the most important contributions in the field of topological mapping of IPs/cores in mesh-based NoC architectures. Section III presents the simulation and evaluation framework used and the impact of the architectural and application parameters on the performance indexes considered. In Section V our approach for exploration of the mapping space is presented. In the same section we discuss the multi-objective extension of two other algorithms proposed in the literature, against which we compare our approach. Experimental results are reported in Section VI. Finally, Section VII summarizes our contribution and outlines some directions for future work.
II. Previous Work
The problem of mapping in mesh-based NoC architectures has been addressed in several previous papers. Hu and Marculescu [17] present a branch-and-bound algorithm for mapping IPs/cores in a mesh-based NoC architecture that minimizes the total amount of power consumed in communications, with the performance constraint handled via bandwidth reservation. The same authors in [19] extend the approach to construct a deadlock-free deterministic routing function such that the total communication energy is minimized. Murali and De Micheli [28] address the problem under the bandwidth constraint with the aim of minimizing communication delay by exploiting the possibility of splitting traffic among various paths. Lei and Kumar [25] present an approach that uses genetic algorithms to map an application, described as a parameterized task graph, on a mesh-based NoC architecture. The algorithm finds a mapping of the vertices of the task graph on the available cores so as to minimize the execution time.
These papers do not, however, solve certain important issues. The first relates to the mapping evaluation model used, which can be defined as "static". The exploration algorithm decides which mapping to explore without taking important dynamic effects of the system into consideration. For example, failure to model the effects of bus contention causes components which communicate with each other more frequently to be clustered, whereas it may be more effective to separate components whose traffic flows overlap in time so as to increase the degree of concurrency. In the above-mentioned works, in fact, the application to be mapped is described using task graphs, as in [25], or simple variations such as the core graph in [28] or the application characterization graph (APCG) in [17]. These formalisms do not, however, capture important dynamics of communication traffic. They hypothesize worst-case conditions, which leads to several mappings being discarded and thus a highly conservative exploration. The second problem relates to the optimization method used. It refers in all cases to a single performance index (power in [17], performance in [28, 25]). As we will see in the section devoted to experiments, optimization of one performance index may lead to unacceptable values for another performance index (e.g. high performance levels but unacceptable power consumption). We therefore think that the problem of mapping can be more usefully solved in a multi-objective environment, i.e. one in which there is no single solution but a set of mapping alternatives (which we will indicate as Pareto mappings), each featuring a different trade-off between performance indexes, from which the designer (or decision maker) will choose the most suitable.
The contribution we intend to make in this paper is to propose a multi-objective approach to solving the problem of mapping IPs/cores in mesh-based NoC architectures. The approach uses evolutionary computing techniques to explore the mapping space with the goal of optimizing performance and power consumption. The mappings visited during the exploration process are evaluated using a trace-based approach, which gives an excellent combination of accuracy and efficiency.
III. Evaluation of a Mapping
Figure 1 shows the NoC topology we will refer to. It is a two-dimensional mesh of processing resources. Each processing resource is connected to the communication network by a switch. We will call the pair formed by a resource and a switch a tile. The term mapping will be used to indicate assignment of an IP/core to each tile in the NoC. Each switch in the NoC is connected to the four adjacent switches, except for those at the network boundaries. Switches send data from one network interface to the other by means of packets. A packet consists of one or more flow control digits (or flits), where a flit is the minimal transmission unit. On each side of a switch there is an output and an input port. The input port has a finite-length FIFO buffer in which flits to be routed are queued. The use of the FIFO is regulated by a back-pressure mechanism [18]. Under this scheme, a flit is held in the buffer until the downstream router has empty space in the corresponding input FIFO. Thus, the network will not drop any packet in transit. This is extremely important for NoC architectures, which may not implement a very advanced end-to-end protocol.
The routing algorithm features static XY routing, in which a flit is first routed in the horizontal direction (X) and then, when it reaches the column where the destination tile is located, in the vertical direction (Y). XY routing is a minimal-path routing algorithm and is free of deadlock and livelock [15]. As a transmission scheme we use wormhole routing because of its low cost (the buffer capacity can be less than the length of a packet) and low latency (the router can start forwarding the first flit of a packet without waiting for the tail).
To describe the functioning of the various components of the simulation framework we will use a representation based on a variation of a finite-state machine, which we will indicate as a behavioral annotated graph (BAG). Each machine state is identified by a name, a set of operations (op 1, ..., op n) and two attributes which we will call latency and power (see Figure 2). Transition from one state to another is represented by a directed arc associated with a condition (transition only occurs when the condition is met). The conditions are evaluated after a time equal to the value of the latency attribute, starting from the instant at which the state is entered. If none of the conditions on the arcs is met, the machine remains in the current state and the process is repeated. Otherwise there is a state transition, and the total amount of energy consumed while the machine remained in this state is measured. This can be summed up as follows: [...]
[...] process. The operation performed in both these states is to consume the flit at the head of the queue and then, when the latency time ends, to switch unconditionally back to the Idle state. Figure 4(a) shows the interface of a switch. Each of the five input ports has an associated queue (buffer). Each output port is associated with an input signal (with the suffix Ready) which is asserted whenever the element connected to the corresponding port is ready to accept a flit. [...] transmit, which models the power consumed on the interconnection buses between the switches.
The simulation is event-based and is performed by stimulating the network with concurrent trace files. Each trace file is a sequential list of communication patterns. Each pattern comprises three fields: a source identifier, a destination identifier, and the amount of information exchanged. The amount of traffic sent by the source core to the destination core is subdivided into packets and each packet is routed according to the routing scheme and BAGs described above.
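As a minimal sketch of how a single trace pattern could be turned into routed packets, the code below computes the XY path between two tiles and splits the exchanged data into packets. The names `xy_route` and `packetize`, the (column, row) tile convention, and the 48-byte packet size are assumptions made for illustration; they are not taken from the simulator described here.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (column x, row y), an illustrative convention

def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Tiles visited by static XY routing: first along X, then along Y."""
    x, y = src
    path = [src]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:          # horizontal leg
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:          # vertical leg
        y += step_y
        path.append((x, y))
    return path

def packetize(n_bytes: int, packet_bytes: int) -> List[int]:
    """Split a transfer into full packets plus an optional shorter last one."""
    full, rest = divmod(n_bytes, packet_bytes)
    return [packet_bytes] * full + ([rest] if rest else [])

# One trace pattern: source tile, destination tile, amount of data in bytes.
src, dst, amount = (0, 0), (2, 1), 128
print(xy_route(src, dst))
print(packetize(amount, 48))
```

The number of hops on the path equals the Manhattan distance between the two tiles, which is what makes XY routing minimal.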
A. Motivation
In this section we wish to demonstrate (using an experiment-based approach) that accurate modeling of the communication dynamics is essential in order to evaluate a network.
We will begin our analysis by considering as our performance parameter the speed at which a network handles a certain amount of incoming traffic. This mainly depends on the speed at which the switches route packets. If, for example, a switch A has to forward the packet at the head of the input queue from its port α to the port β in the adjacent switch B, two events can occur: (i) the input queue in the port β is not full, or (ii) the input queue in the port β is full. In the former case, A can forward the packet, thus freeing a slot in the queue in port α. In the latter case, A has to wait for B to remove at least one packet from the input queue in port β before it can forward the packet. In general, therefore, the overall performance of the network (measured as the time required to handle all the incoming traffic) improves if the size of the switch input queues increases. With an increased input queue capacity, in fact, a generic switch needing to forward a packet to another switch will have a greater probability of being able to queue the packet in the input port of the other switch. Figure 5 shows the time required to handle traffic versus the size of input queues in the switch ports. The values were obtained on a 5x5 network. The latency and power attributes of the core BAGs were randomly set between 0 and 1 for each core and to 0.1 for all the switches. The traffic was generated considering communication between the network nodes to be equally probable (that is, the probability that node A will communicate with node B is equal to the probability that node C will communicate with node D, however A, B, C and D are chosen). The flow of data exchanged between two nodes has a Gaussian distribution with an average of 128 bytes and a variance of 64 bytes. Eight different traces formed by 100 patterns were injected in parallel, so as to simulate 8 concurrent communications at each instant.
Each point in the graph was obtained by measuring the time taken to handle the traffic in 100 different mappings and calculating the average value. It can, however, be observed that in some cases an increase in the size of the switch queues may increase the traffic handling time. Figure 6 shows this possibility. It gives the traffic handling time for 10 different mappings with switches having input queues that allow a maximum of two and four packets to be queued. The traffic handling times for the second network are generally shorter than those for the first network, with one exception. With mapping 6, in fact, the traffic is handled faster in the first network. This behavior can only be detected via a dynamic analysis of the system, that is by taking into account the dynamic interaction between the various traffic flows, which is only possible by performing trace-based simulations.
It should also be observed that the optimal mapping is greatly affected by the architectural parameters of the network. Let us consider, for example, the size of the switch input buffers. In Figure 6 it can be seen that a mapping may be optimal for one network but not for another. Of the 10 mappings considered, in fact, mapping 5 is by far the best for the second network but the second worst for the first network.
To evaluate the impact of mapping and relate it to the traffic characteristics, the following experiment was performed. 1,000 mappings were randomly generated for each n × n network, n ∈ {3, 4, 5}. For each mapping, n²/2 simulations were run, relating to different traffic scenarios. These scenarios differed in the number of pairs of cores simultaneously communicating with each other. They range from an absolute lack of concurrency (that is, one and only one pair of cores is communicating at any one time) to maximum concurrency (at any one time there are n²/2 pairs of cores communicating with each other). Figure 7 shows the relationship between the maximum and minimum traffic draining times for 1,000 random mappings in the traffic scenarios described above. As can be seen, when the size of the network increases, so does the impact of mapping on performance. For a 5x5 network, for example, choosing a suitable mapping can improve performance by over 40%. It should be pointed out that these values are extremely conservative. They were obtained considering only 1,000 random mappings as compared with the 25! ≈ 10²⁵ that are possible. It should also be noted that the impact of mapping depends greatly on the traffic characteristics. In all the cases considered, the maximum impact is obtained in traffic scenarios in which the number of pairs of cores communicating concurrently is equal to half the maximum number of pairs that can communicate concurrently.
IV. Exploration Framework
Figure 8 shows the framework for exploration of the space of possible mappings in mesh-based NoC architectures. It comprises two macro blocks: a NoC simulator (to evaluate the performance indexes to be optimized for any mapping), and an Exploration engine (which determines the next mapping to be evaluated). The inputs to the framework are:
• Architectural parameters: for example, topology, network size, communication protocols, size of buffers in switches, priority assignment schemes, etc.
• Application parameters: these mainly refer to the characteristics of the communication traffic involved in the application being considered. They may relate to both the characterization of statistical models of the traffic exchanged between the various network resources, and real traces obtained by measuring the communication traffic during execution of the application. Useful application parameters to specify traffic in statistical models are: packet generation rate (packets can be generated at random or periodical intervals, or in a bursty or uniform [...]
• Set of BAGs: these specify the functional behavior of each element in the NoC and also contain characterization information for estimation of the timing and power consumption parameters.
The flow of operations involved in exploration generally consists of repeating two phases: evaluation of one or more mapping alternatives, and determination of the next mapping/s to be evaluated. The first phase is carried out using a NoC simulator, which evaluates the performance indexes to be optimized. These represent the input for the second phase, which implements the exploration algorithm and produces the next mapping/s to be evaluated. The mappings evaluated are stored and can be used by the exploration algorithm to decide the next step. This iterative process is concluded when a stop criterion is met. Then the non-dominated mappings (Pareto mappings) are extracted from the mappings evaluated.
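The two-phase loop described above can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual code: `evaluate` stands in for the NoC simulator, `propose` stands in for the exploration engine, the stop criterion is a fixed number of iterations, and the random engine and toy objectives are placeholders.

```python
import random
from typing import Callable, Dict, List, Tuple

Mapping = Tuple[int, ...]        # mapping[tile] = core placed on that tile
Objectives = Tuple[float, ...]   # every objective is to be minimised

def dominates(a: Objectives, b: Objectives) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def explore(evaluate: Callable[[Mapping], Objectives],
            propose: Callable[[Dict[Mapping, Objectives]], List[Mapping]],
            iterations: int) -> Dict[Mapping, Objectives]:
    evaluated: Dict[Mapping, Objectives] = {}   # store of visited mappings
    for _ in range(iterations):                 # stop criterion: fixed budget
        for m in propose(evaluated):            # phase 2: next mappings
            if m not in evaluated:
                evaluated[m] = evaluate(m)      # phase 1: simulator stand-in
    # extract the non-dominated (Pareto) mappings from everything evaluated
    return {m: o for m, o in evaluated.items()
            if not any(dominates(p, o) for p in evaluated.values())}

def random_engine(n_tiles: int, batch: int, seed: int = 0):
    rng = random.Random(seed)
    def propose(_history: Dict[Mapping, Objectives]) -> List[Mapping]:
        return [tuple(rng.sample(range(n_tiles), n_tiles)) for _ in range(batch)]
    return propose

def toy_evaluate(m: Mapping) -> Objectives:
    # two placeholder objectives that pull in different directions
    return (float(sum(i * c for i, c in enumerate(m))),
            float(sum(abs(i - c) for i, c in enumerate(m))))

front = explore(toy_evaluate, random_engine(n_tiles=4, batch=5), iterations=10)
print(len(front), "non-dominated mappings found")
```

A smarter engine would use the history it receives to decide where to look next; the random engine ignores it on purpose to keep the sketch short.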
In this paper we will focus on the second phase of the framework, the one referring to the mapping space exploration algorithms.
V. Multi-Objective Exploration of the Mapping Space
The mapping problem is an instance of a constrained quadratic assignment problem, which is known to be NP-hard [14]. The search for an optimal mapping (henceforward referred to as exploration) is further complicated when the concept of optimality is not limited to a single performance index (or objective) but comprises several contrasting indexes. The traditional approach to multi-objective optimization is to aggregate the objectives into a single one by means of a weighted mean. The main drawback to this approach is that it does not cover the non-convex regions of the Pareto front and requires several instances of the optimization algorithm to be run with different weights. In this section we present: 1) an approach to multi-objective mapping space exploration that uses evolutionary algorithms as the optimization strategy; 2) a multi-objective extension of the exploration algorithm based on branch-and-bound proposed in [17]; and 3) a multi-objective extension of a variation of the exploration algorithm proposed in [28].
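The limitation of weighted aggregation can be seen on a tiny made-up example. The three objective vectors below are illustrative values only (both objectives minimised): all three are mutually non-dominated, but point C lies in a non-convex region of the front, so no choice of weights ever selects it.

```python
from typing import Dict, Tuple

# Hypothetical (delay, energy) pairs, both to be minimised.
points: Dict[str, Tuple[float, float]] = {
    "A": (1.0, 10.0),
    "B": (10.0, 1.0),
    "C": (6.0, 6.0),   # lies above the segment joining A and B
}

def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def weighted_sum_winner(w: float) -> str:
    """Point minimising w * delay + (1 - w) * energy."""
    return min(points, key=lambda k: w * points[k][0] + (1 - w) * points[k][1])

# Sweep the weight from 0 to 1 and record which points are ever selected.
winners = {weighted_sum_winner(i / 100) for i in range(101)}
print(sorted(winners))
```

The aggregated cost of C is 6 for every weight, while the better of A and B never costs more than 5.5, so C is missed even though it is a legitimate trade-off.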
A. Problem Formulation
The mapping problem is illustrated in Figure 9. Given a target application described as a set of concurrent tasks which have been assigned and scheduled, to exploit such an architecture the fundamental questions to answer are: i) which tile each IP should be mapped to, and ii) what routing algorithm is suitable for directing the information among tiles, such that the metrics of interest are optimized. More precisely, in order to get the best power/performance trade-off, the designer needs to determine the topological placement of these IPs onto different tiles. Referring to Figure 9, this means determining, for instance, onto which tile [e.g. (3, 1), (1, 3), etc.] each IP (e.g. DSP2, ASIC1, etc.) should be placed. While task assignment and scheduling problems have been addressed before [9], the mapping and routing problems described above represent a new challenge, especially in the context of the regular tile-based architecture, as this significantly impacts the energy and performance metrics of the system. Formally, if C is the set of cores and T the set of tiles, we will use the term mapping to indicate an injective and surjective (i.e. bijective) function M : C → T that associates with each core c ∈ C the tile t ∈ T on which c is mapped.
Evaluating a mapping means obtaining the related performance indexes for a specific traffic scenario. If S indicates a traffic scenario, we define the evaluation function
$$E(M, S) = \big(E_1(M, S), \ldots, E_n(M, S)\big)$$
which yields the values of the n performance indexes relating to the mapping M for the traffic scenario S. In our case study, for example, the evaluation function corresponds to the simulation framework (described in [5]) and the performance indexes are those the platform is capable of measuring (power, communication latency, bandwidth, throughput, etc.). Evaluation of an incomplete mapping made up of a set of cores $C' \subset C$ with a traffic scenario S is performed by evaluating the mapping on a traffic $S'$ obtained by filtering out all communication flows in which the source or destination is a core $c \notin C'$.
Figure 9: Graphic explanation of the mapping problem.
Given a traffic scenario S and two mappings $M_1$ and $M_2$, $M_1$ can be said to dominate $M_2$ (which will be indicated as $M_1 \prec M_2$) if
$$E_i(M_1, S) \le E_i(M_2, S) \quad \forall i \in \{1, \ldots, n\}$$
and there exists at least one $j \in \{1, \ldots, n\}$ such that $E_j(M_1, S) < E_j(M_2, S)$.
The set of Pareto mappings is a set of mappings that do not dominate each other. The Pareto front is the image of the evaluation function for the set of Pareto mappings. If $\mathcal{M}$ is the set of all possible mappings, the Pareto-optimal set P is the set of Pareto mappings such that
$$\forall M \in P, \ \nexists\, M' \in \mathcal{M} : M' \prec M.$$
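The dominance relation and the extraction of the non-dominated set can be sketched as follows, assuming every performance index is to be minimised. The mapping names and the (completion time, energy) values are made up for illustration.

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, ...]

def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b: no worse on every index and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_set(evaluated: Dict[str, Objectives]) -> List[str]:
    """Names of the mappings that no other evaluated mapping dominates."""
    return [name for name, obj in evaluated.items()
            if not any(dominates(other, obj) for other in evaluated.values())]

# Hypothetical (completion time, energy) values for five mappings.
evaluated = {
    "M1": (10.0, 5.0),
    "M2": (8.0, 7.0),
    "M3": (12.0, 4.0),
    "M4": (9.0, 9.0),    # dominated by M2
    "M5": (11.0, 6.0),   # dominated by M1
}
print(pareto_set(evaluated))
```

Note that a mapping never dominates itself, since the strict inequality cannot hold against an identical vector.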
The aim of the approach we propose is to obtain as accurate an approximation as possible of the Pareto-optimal front by evaluating (visiting) as few mappings as possible. The optimization metrics we consider are the completion time and the total energy consumption. Formally, given a set of cores C, a set of tiles T (such that |C| = |T|), and a communication traffic scenario S (which models the communication between the cores c ∈ C), the topological mapping problem is the problem of finding the Pareto-optimal set P of mappings that optimise both completion time and total energy consumption.
B. GA-based Multi-Objective Exploration of the Mapping Space
The use of evolutionary algorithms (EAs) as a multi-objective optimization technique is of increasing appeal. The fields of application are numerous, including among others computer science, engineering, economics, finance, industry, physics, chemistry, and ecology. EAs have been demonstrated to be very powerful and generally applicable for solving difficult multi-objective problems. Such algorithms offer an interesting alternative to other approaches since they scale with the problem size and can easily be run on parallel computer systems. In VLSI design, EAs have been applied to a very broad range of problems: layout problems such as partitioning [1] and placement and routing [40], and design problems including low-power synthesis [7], technology mapping [23] and netlist partitioning [2]. There are various approaches to GA-based multi-objective optimization, divided into three main categories [10]:
1. Approaches using aggregation functions,
2. Approaches not based on the notion of Pareto optimum,
3. Pareto-based approaches.
The first type (those that use aggregating functions) reduce the problem of multi-objective optimization to one of scalar optimization by aggregation of the objective functions [41, 32] . The main disadvantage to aggregation functions is that they do not generate proper Pareto-optimal solutions in the presence of non-convex search spaces, which is a serious drawback in most real-world applications.
The second approach (not based on the notion of Pareto optimum) solves some of the difficulties encountered in the first, such as finding suitable weights to combine (aggregate) the objectives. One of the most famous approaches in this class is VEGA [32] . The drawback to this approach is that even if the user defines the objective functions independently of each other, the algorithm combines them. Thus, under certain conditions (i.e., when proportional selection is used) the resulting fitness function turns out to be a linear combination of the objective functions in which the weights depend on the distribution of the population in each generation. The problems are thus the same as those of the first approach i.e. not finding certain points in concave regions.
Currently, the third class of approaches (Pareto-based approaches) is the most promising. The basic algorithm consists of selecting Pareto non-dominated individuals from the rest of the population. These individuals are then assigned the highest rank and eliminated from further contention. Another set of Pareto non-dominated individuals are determined from the remaining population and are assigned the next highest rank. The procedure is repeated until the whole population is suitably ranked. The most widely used approaches belonging to this class are NSGA-II [13] , PESA [11] and SPEA2 [43] . A simple steady-state Pareto-based evolutionary algorithm is presented in [39] that uses an elitist strategy for replacement and simple uniform scheme for selection. Here, no fitness calculations, ranking, sub-populations, niches or auxiliary populations are required.
In this paper we propose the use of a heuristic technique based on EAs for multi-objective mapping space exploration. More specifically, we chose SPEA2, which is very effective in sampling along the entire Pareto-optimal front and distributing the solutions generated over the trade-off surface. SPEA2 is an evolution of SPEA [44] and incorporates a fine-grained fitness assignment strategy, a density estimation technique, and an enhanced archive truncation method. The chromosome is a representation of the solution to the problem, which in this case is described by the mapping. Each tile in the mesh has an associated gene which encodes the identifier of the core mapped in the tile. In an n × m mesh, for example, the chromosome is formed by n × m genes. The i-th gene encodes the identifier of the core in the tile in row ⌊i/m⌋ and column i % m (where the symbol % indicates the modulus operator).
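The chromosome encoding can be sketched as a flat list indexed by tile. The sketch assumes a mesh of n rows and m columns stored in row-major order with 0-based gene indices, so the row is the integer quotient by the number of columns; the helper names are introduced here for illustration.

```python
from typing import List, Tuple

def gene_to_tile(i: int, m: int) -> Tuple[int, int]:
    """(row, column) of the tile associated with gene i in a mesh with m columns."""
    return divmod(i, m)

def tile_to_gene(row: int, col: int, m: int) -> int:
    """Inverse of gene_to_tile."""
    return row * m + col

def is_valid_chromosome(chromosome: List[int], n: int, m: int) -> bool:
    """A valid mapping places each of the n*m cores on exactly one tile."""
    return sorted(chromosome) == list(range(n * m))

# A 2 x 3 mesh: chromosome[i] is the identifier of the core on tile i.
chromosome = [4, 0, 5, 2, 1, 3]
for i, core in enumerate(chromosome):
    print(f"core {core} -> tile (row, col) = {gene_to_tile(i, 3)}")
```

The validity check matters because ordinary crossover and mutation operators can duplicate or lose cores, which is why the genetic operators have to be redefined for this encoding.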
The crossover and mutation genetic operators have been suitably redefined. More specifically, a crossover between two mappings M f and M m generates two new mappings M s1 and M s2 constructed as follows. Let t x,y ∈ T be the tile in row y and column x. Given a mesh of H rows and W columns, two random numbers x ∈ {1, 2, ..., W − R} and y ∈ {1, 2, ..., H − R} (where R is a user-defined parameter) are extracted. Then, the crossover operator simply swaps the two regions consisting of tiles from t x,y to t x+R,y+R. Figure 10 describes the crossover operator. [...]
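A region-swapping crossover of this kind can be sketched as below. Copying a region from one parent into the other would in general duplicate some cores, so a repair step is needed; the swap-based repair used here is an assumption made for illustration and is not necessarily the one the authors use. The function names and 0-based indexing are also illustrative.

```python
import random
from typing import List, Tuple

def swap_region(parent_f: List[int], parent_m: List[int],
                width: int, x: int, y: int, r: int) -> Tuple[List[int], List[int]]:
    """Children whose tiles (x..x+r, y..y+r) carry the other parent's cores.

    Chromosomes are row-major lists: chromosome[row * width + col] = core id.
    """
    region = [row * width + col
              for row in range(y, y + r + 1)
              for col in range(x, x + r + 1)]

    def child_from(base: List[int], donor: List[int]) -> List[int]:
        child = list(base)
        pos = {core: i for i, core in enumerate(child)}
        for g in region:
            # Assumed repair: bring the donor's core to tile g by swapping it
            # with whatever currently sits there, so no core is duplicated.
            h = pos[donor[g]]
            child[g], child[h] = child[h], child[g]
            pos[child[g]], pos[child[h]] = g, h
        return child

    return child_from(parent_f, parent_m), child_from(parent_m, parent_f)

def crossover(parent_f: List[int], parent_m: List[int], width: int,
              height: int, r: int, rng: random.Random):
    x = rng.randrange(0, width - r)    # 0-based counterpart of x in {1..W-R}
    y = rng.randrange(0, height - r)   # 0-based counterpart of y in {1..H-R}
    return swap_region(parent_f, parent_m, width, x, y, r)

f = list(range(16))                    # 4 x 4 mesh
m = list(reversed(range(16)))
s1, s2 = swap_region(f, m, width=4, x=1, y=1, r=1)
print(s1)
print(s2)
```

Because the cores wanted in the region are all distinct, each swap leaves the previously placed region tiles untouched, so both children remain valid one-to-one mappings.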
C. Pareto-based Branch-and-Bound Approach
In [17] Hu and Marculescu present an approach using branch-and-bound as the mapping space exploration strategy. The approach is, however, a mono-objective one. In this subsection we will extend their approach in order to perform multi-objective exploration of the mapping space. We will call our approach Pareto-based Branch-and-Bound (PBBB).
Let $\{c_1, c_2, \ldots, c_{N^2}\}$ be the set of cores in the system, in decreasing order of communication traffic. The core $c_1$ can be mapped on any of the $N^2$ tiles in the mesh. These $N^2$ mappings generate the first level of a tree which is the starting point for the branch-and-bound algorithm. For each first-level mapping, the core $c_2$ can be mapped on any of the $N^2 - 1$ free tiles, thus generating a second level of $N^2 \times (N^2 - 1)$ mappings. This is the branch phase of the algorithm, described in pseudo-code in Figure 12. Here, the MakeMappings(M, c) function, given a mapping template M and a core c, yields the set of mappings obtained by mapping c on each free tile in M.
Each mapping at this level is evaluated (simulated) and then characterized according to the optimization objectives, which in our case are power and delay. The dominated mappings are discarded, while the branch and bound phases are reiterated on the survivors. This is the bound phase of the algorithm, described in pseudo-code in Figure 13. Here, the ExtractPareto(M) function extracts the non-dominated mappings from the set M.
Figure 12: Branch phase of the branch-and-bound algorithm.
To prevent the number of surviving mappings from growing out of control, the bound phase is followed by a further pruning phase. Let us indicate the set of mappings generated by the bound phase as M. If $|M| > T_{pbbb}$ (where $T_{pbbb}$ is a user-defined threshold), $|M| - T_{pbbb}$ mappings are eliminated at random from M. This is done by the Pruning(M, $T_{pbbb}$) function, which randomly eliminates mappings from a set M whose cardinality exceeds $T_{pbbb}$ so as to make the cardinality of M equal to $T_{pbbb}$. The branch and bound phases are reiterated until all the cores have been mapped. For example, indicating the mappings obtained in the bound phase as $M_1, M_2, \ldots, M_n$, the core $c_3$ will be mapped for each of them onto the $N^2 - 2$ possible tiles. The $n \times (N^2 - 2)$ mappings will be the third level of the tree. The algorithm terminates when all the cores have been mapped, and the leaves of the tree will be the Pareto mappings. A pseudo-code description of PBBB is given in Figure 14. Here, the SortByTraffic(C) function orders the set of cores C according to the communication traffic.
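The branch, bound and pruning phases can be put together in a short sketch. The function names mirror those used in the text (MakeMappings, ExtractPareto, Pruning, SortByTraffic), but the bodies are our own reading of the description and not the pseudo-code of Figures 12 to 14. In particular, `toy_evaluate` is a hypothetical stand-in for the trace-based simulation: it scores a partial mapping with hop-weighted traffic as "power" and the longest hop count as "delay".

```python
import random
from typing import Callable, Dict, List, Tuple

Mapping = Dict[int, int]            # core id -> tile index (partial or complete)
Objectives = Tuple[float, float]    # (power, delay), both minimized
Traffic = Dict[Tuple[int, int], float]


def dominates(a: Objectives, b: Objectives) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sort_by_traffic(cores: List[int], traffic: Traffic) -> List[int]:
    """Order cores by decreasing total (incoming plus outgoing) traffic."""
    total = {c: 0.0 for c in cores}
    for (src, dst), volume in traffic.items():
        total[src] += volume
        total[dst] += volume
    return sorted(cores, key=lambda c: total[c], reverse=True)


def make_mappings(m: Mapping, core: int, n_tiles: int) -> List[Mapping]:
    """Branch: map `core` on every tile that is still free in m."""
    used = set(m.values())
    return [{**m, core: tile} for tile in range(n_tiles) if tile not in used]


def extract_pareto(ms: List[Mapping],
                   evaluate: Callable[[Mapping], Objectives]) -> List[Mapping]:
    """Bound: keep only the non-dominated mappings."""
    scored = [(m, evaluate(m)) for m in ms]
    return [m for m, s in scored if not any(dominates(o, s) for _, o in scored)]


def pruning(ms: List[Mapping], threshold: int, rng: random.Random) -> List[Mapping]:
    """Randomly cut the set down to at most `threshold` mappings."""
    return ms if len(ms) <= threshold else rng.sample(ms, threshold)


def pbbb(cores: List[int], traffic: Traffic, n_tiles: int,
         evaluate: Callable[[Mapping], Objectives],
         threshold: int, rng: random.Random) -> List[Mapping]:
    frontier: List[Mapping] = [{}]
    for core in sort_by_traffic(cores, traffic):
        branched = [c for m in frontier for c in make_mappings(m, core, n_tiles)]
        frontier = pruning(extract_pareto(branched, evaluate), threshold, rng)
    return frontier


def toy_evaluate(m: Mapping, traffic: Traffic, width: int) -> Objectives:
    """Hypothetical cost model (NOT the simulator used in the paper)."""
    power, delay = 0.0, 0.0
    for (src, dst), volume in traffic.items():
        if src in m and dst in m:
            hops = (abs(m[src] % width - m[dst] % width)
                    + abs(m[src] // width - m[dst] // width))
            power += volume * hops
            delay = max(delay, float(hops))
    return power, delay


# Hypothetical 4-core application on a 2 x 2 mesh.
traffic = {(0, 1): 10.0, (1, 2): 5.0, (2, 3): 1.0}
evaluate = lambda m: toy_evaluate(m, traffic, width=2)
pareto_mappings = pbbb([0, 1, 2, 3], traffic, 4, evaluate,
                       threshold=8, rng=random.Random(1))
```

In this sketch the first core yields equally scored mappings on every tile, so all of them survive the bound phase; this is precisely the situation that the pruning threshold is meant to contain.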
D. Pareto-based NMAP Approach
In [28] Murali and De Micheli propose NMAP, an algorithm that maps the cores in a mesh NoC architecture with the aim of minimizing the average communication delay. In this subsection we will extend NMAP to perform a multi-objective exploration of the mapping space. Unlike [28], however, we will assume XY routing. We will call this approach Pareto-based NMAP (PBNMAP). [...] Figure 16, while Figure 17 describes the main program.
VI. Experiments
A. MPEG-4 Codec
In order to evaluate the various approaches in real traffic scenarios, an MPEG-4 simple profile @ level 2 codec was used as a case study [34]. A general block diagram of the encoder and decoder is shown in Figure 18. For the hardware/software partitioning, reference was made to the MoVa architecture described in [22]. It adopts a macroblock-based pipeline with 4 stages for the encoder and 3 for the decoder. More specifically, the encoding section performs coarse motion estimation in the first stage, fine motion estimation and motion compensation in the second stage, discrete cosine transform and quantization in the third stage, and finally reconstruction and production of the stream in the fourth stage. In the decoding section, the first stage involves variable length decoding of each data stream; the second stage performs sequential inverse cosine transformation, inverse quantization and motion compensation; the third and final stage is reconstruction.
To obtain the traffic traces, the C application implementing the codec [24] was modified with the addition of monitoring code to record the volume of incoming and outgoing traffic in the various functional blocks into which the application is partitioned. Table 1 shows the 16 cores implementing the codec. They were characterized in terms of timing by using the clock cycle data in [22] for the execution of each operation (DCT, MC, etc.). For power characterization, we used the mean values given in the datasheets [27, 31]. For the interconnection system we used an approach similar to the one presented in [17]. To characterize the switches, a 5x5 switch was implemented in VHDL following the architecture described in [6]. It was synthesized with Synopsys Design Compiler using the Virtual Silicon 0.13µm, 1.2V technology library and analyzed with Synopsys Design Power using different random data streams at the inputs of the switch. The energy consumed by a flit for one switch hop was estimated as being 0.181nJ. We assumed the tile size to be 2mm × 2mm and that the tiles were arranged in a regular fashion on the floorplan. The load wire capacitance was set to 0.50 fF per micron, so, considering an average of 25% switching activity, the energy consumed by a flit for one interconnect hop is 0.384nJ. Figure 19 shows the application characterization graph of the MPEG-4 codec. Each vertex of the graph represents a core. An edge that connects a core i to a core j defines a communication flow from core i to core j. Each edge is characterized by a set of attributes such as the traffic volume ($T_{i,j}$) and the minimum bandwidth requirement for the communication ($B_{i,j}$). The latter is used as an exploration constraint. More precisely, a mapping is rejected if it violates at least one of these constraints.
These constraints are set by performing a profiling of the application and annotating the traffic volume exchanged between the various application components. For example, to decode N frames at X fps we have
The following values were used for the free parameters of the exploration algorithms. For GAMAP we chose a population of 50 mappings, a crossover probability of 0.7 and a mutation probability of 0.1. The R parameter of the crossover operator was set to 2. These values were chosen after numerous simulations and were the values that on average led to better solutions or shorter convergence times. The number of generations was set at run time by means of a stop criterion based on analysis of the convergence of the Pareto front [4]. For PBBB, the parameter $T_{pbbb}$ was set to 100. Figure 20 gives the power values and traffic clearing times for 10,000 random mappings. It also shows the Pareto fronts obtained by GAMAP, PBNMAP, and PBBB, and the solutions found by BB [17] and NMAP [28]. As can be seen, the solutions obtained by GAMAP dominate those obtained by the other approaches. The figure also shows a wide trade-off between delay and power (spanning a factor of 3 for delay and 2.5 for power). Figure 21(a) gives the number of simulations (i.e. mappings evaluated by GAMAP) for varying numbers of generations. It reports both the number of simulations actually performed and the number that would have been performed had no caching mechanism been used. Figure 21(b) gives the normalized delay and energy. Finally, Figure 22 shows a point (the minimum energy mapping) in the Pareto set obtained by GAMAP. The cores specific to the encoding section are shown against a dark gray background, whereas those specific to the decoding section are shown against a white background. The cores shared by the encoder and decoder are shown against a light gray background and have been mapped (in this case) in the centre of the NoC. In the decoding section, the cores VOM and DB are topologically separated from VLD, MEMD and ISC as there is no direct communication flow between these sets: they communicate by means of a ring represented by the core REC.
In the encoding section there are also two separate parts which do not communicate directly but through the set of shared cores.
B. Cell Phone
Figure 23(a) is a block diagram of a mobile phone application in which it is possible not only to hold a normal conversation but also to listen to an MP3, surf the web, receive and send images, and listen to emails. The application example used is the airport scenario described in [30]. In this example the traffic flows are generated under certain synchronization constraints. For example, as can be seen from Figure 23(b), which shows a fragment of the communication timeline, it is not possible to read an email and perform MP3 streaming at the same time.
The application was partitioned into 13 cores (one for each block shown in Figure 23(a)) and mapped onto a 4 × 4 NoC. Cores for a concurrent synthesized application in which each core communicates at random with the others were mapped onto the remaining 3 tiles. Figure 24 shows the solutions obtained by GAMAP and PBBB together with the evaluation of 10,000 randomly generated mappings. In this case it was not possible to complete the exploration using PBNMAP due to the great number of Pareto mappings obtained at each iteration during the first phase of the algorithm (Figure 15). The main reason for this behavior lies in the characteristics of the traffic considered. More specifically, in the first phase of the algorithm the mapping of a core that does not communicate with any of the other cores already mapped generates as many Pareto mappings as there are free tiles. In such situations the ExtractPareto(M) function returns the same set M, the mappings of which will be extended in the following iteration by the MakeMappings(M, c) function to map the core c, generating a new set of mappings |M| × f in size (where f indicates the number of free tiles in the incomplete mapping). Obviously, the more often this situation arises, the more quickly the number of mappings to be evaluated (and thus the number of simulations to be performed) grows. In this example it happens quite frequently because the application was partitioned using a coarser granularity. The traffic flows, in fact, involve on average fewer cores than in the previous examples, thus reducing the probability that a core being mapped will communicate with at least one of the cores already mapped.
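To give a feel for this growth, the following sketch computes the size of the mapping set in the worst case, under the assumption that each of the first few cores exchanges no traffic with the cores already mapped, so that no mapping is discarded and each step multiplies the set size by the number of free tiles. The function name is hypothetical and no pruning is modelled.

```python
from typing import List


def worst_case_set_sizes(n_tiles: int, n_silent_cores: int) -> List[int]:
    """Set size after mapping each core when nothing is ever dominated."""
    sizes = []
    size = 1
    for mapped in range(n_silent_cores):
        free_tiles = n_tiles - mapped   # f in the text
        size *= free_tiles              # |M| x f
        sizes.append(size)
    return sizes


# 4 x 4 mesh, first four cores with no traffic towards already mapped cores.
sizes = worst_case_set_sizes(16, 4)
```

After only four such cores the set already exceeds 40,000 mappings, each of which would require a simulation.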
Going back to Figure 24, we can observe a wide dispersion of the points (2.3x for delay and 2.5x for energy consumption), which once again calls for efficient techniques to explore the mapping space. In this example GAMAP and PBBB yield the same solution, but the former requires only 1,227 simulations as compared with the 9,893 required by the latter.
VII. Conclusions
In this paper we have proposed a strategy for topological mapping of IPs/cores in a mesh-based NoC architecture. The approach uses heuristics based on multi-objective genetic algorithms to explore the mapping space and find the Pareto mappings that optimize performance and power consumption. At the same time, two of the most widely-known approaches to mapping in mesh-based NoC architectures have been extended in order to explore the mapping space in a multi-criteria mode. The approaches have then been evaluated and compared, in terms of both accuracy and efficiency, on a platform based on an event-driven trace-based simulator which makes it possible to take account of important dynamic effects that have a great impact on mapping. The experiments carried out on real applications (an MPEG-4 encoder/decoder system and a cellular phone application) confirm the efficiency, accuracy and scalability of the proposed approach. Future developments will mainly address the definition of more efficient genetic operators to improve the precision and convergence speed of the algorithm. We will also evaluate the possibility of optimizing the mapping by acting on other architectural parameters such as routing strategies, switch buffer sizes, etc.
