Abstract-Multi-processor Systems-on-chip are currently designed by using platform-based synthesis techniques. In this approach, a wide range of platform parameters are tuned to find the best trade-offs in terms of the selected system figures of merit (such as energy, delay and area). This optimization phase is called Design Space Exploration (DSE) and it generally consists of a Multi-Objective Optimization (MOO) problem.
The design space of a multi-processor architecture is too large to be evaluated comprehensively. So far, several heuristic techniques have been proposed to address the MOO problem, but they are characterized by low efficiency in identifying the Pareto set. In this paper we propose a methodology for heuristic platform-based design that relies on evolutionary algorithms and multi-level simulation techniques. In particular, we extend NSGA-II with an approximate neural network meta-model for multi-processor architectures in order to replace expensive platform simulations with fast meta-model evaluations. The model accuracy and efficiency are improved by exploiting high-level platform simulation techniques. High-level simulation allows us to reduce the overall complexity of the neural network and improve its prediction power.
Experimental results show that the proposed technique is able to reduce the number of simulations needed for the optimization without decreasing the quality of the obtained Pareto set. Results are compared with state-of-the-art techniques to demonstrate that the optimization time due to simulation can be reduced by adopting multi-level simulation techniques.
I. INTRODUCTION
In recent years, Multi-Processor Systems-on-Chip (MPSoCs) and Chip-Multi-Processors (CMPs) have become the de facto standard for embedded and general-purpose architectures. The platform-based design methodology [1] represents the winning paradigm for designing optimized architectures while meeting time-to-market constraints. In this context, parametric System-on-Chip (SoC) simulation models are built and evaluated to accurately optimize the architecture so that it meets the target application requirements in terms of execution time, power consumption and other performance indexes. The main difficulty encountered in performing this design space exploration is the very long simulation time required to evaluate a single platform configuration. In fact, it can vary from several hours to several weeks, depending on the application and platform complexity and on the system resources dedicated to the simulation. For this reason, researchers are focusing on techniques able to create analytic meta-models of the target objective functions from simulation data collected during optimization [2], [3].
In this paper we propose a methodology for heuristic platform-based design that relies on evolutionary algorithms and multi-level simulation techniques. In particular, the contributions of this paper are the following:
• We tackle the problem of designing multi-processor systems-on-chip by using a state-of-the-art evolutionary algorithm (NSGA-II) [4].
• We extend the above algorithm with an approximate analytic meta-model for multi-processor architectures in order to replace long platform simulations with fast meta-model evaluations. The model is based on Artificial Neural Networks.
• We improve the accuracy and efficiency of the neural network meta-model by exploiting high-level simulation techniques. High-level simulation allows us to reduce the overall complexity of the neural network and improve its prediction power.
We finally present a set of experimental results obtained by applying the proposed platform-based design heuristic to the exploration of a Chip Multi-Processor architecture [5] (with a number of cores ranging from 2 to 16).
II. RELATED WORK
Several methods have recently been proposed in the literature to reduce the design space exploration complexity by using traditional statistical techniques and advanced exploration algorithms. Among the most recent heuristics for power/performance architectural exploration we can find [6]-[8]. In [6], the authors compare Pareto Simulated Annealing, Pareto Reactive Taboo Search and Random Search exploration to identify energy-performance trade-offs for a parametric super-scalar architecture running a set of multimedia kernels. In [7], a combined Genetic-Fuzzy system approach is proposed. The technique is applied to a highly parameterized SoC platform based on a VLIW processor in order to optimize both power dissipation and execution time. The technique is based on a Strength Pareto Evolutionary Algorithm coupled with fuzzy system rules in order to speed up the evaluation of the system configurations. In [8], domain knowledge about the platform architecture is used in the kernel of a design space exploration framework. The exploration problem is converted into a Markov Decision Process (MDP) problem whose solution corresponds to the sequence of optimal transformations to be applied to the platform. The need for domain knowledge is the main difference with respect to the proposals in [6], [7].
Meta-model assisted optimization based on Evolutionary Strategies (ESs) has been studied in depth by researchers in the last decade [9]-[13]. The most traditional approach for assisting ESs is the application of the meta-model approximation directly to the fitness computation [10] of a certain percentage of individuals. In particular, in [13], the authors stress the need to balance the use of the meta-model with that of the actual simulator/accurate model in order to obtain the highest speed-up without impacting solution quality. The work in [14], instead, proposes to integrate the meta-model only in a pre-selection operator which excludes the worst configurations from the optimization.
To the best of our knowledge, this paper represents the first approach merging multi-level simulation techniques into a holistic, meta-model assisted design space exploration flow for MPSoCs.
III. A MULTI-LEVEL MPSOC EXPLORATION HEURISTIC
The IP reuse and platform-reconfigurability approaches are converging into a new design paradigm (platform-based design [1]), which is strongly influencing today's automatic system synthesis. In this context, a microprocessor-based platform is composed of a number of intellectual property (IP) blocks which are integrated, extended and customized for a particular application. The IP-based design flow dramatically lowers the risk of subsystem integration and configuration errors, reducing the platform design time by up to 60% [15] while achieving the highest quality of results (QoR) in the implementation of the design.
In general, Design Space Exploration (DSE) consists of an optimization process which takes into account a typical set of IP parameters mainly associated with the memory subsystem configuration (e.g., cache size), the parallelism of the processor (e.g., number of processors and issue width) and the on-chip interconnect configuration. The optimization problem involves the minimization (or maximization) of multiple objectives (such as latency, energy and area), which makes the definition of optimality not unique [16]. In fact, a system which is the best from the performance point of view can be the worst in terms of power consumption, and vice versa. Optimal configurations for which no direct dominance can be stated are part of the so-called Pareto set [16].
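As an illustration of the dominance relation behind the Pareto set, the following minimal Python sketch filters a list of (energy, delay) pairs, assuming that both objectives are minimized. The function names and the sample values are hypothetical and are not taken from the experiments in this paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_set(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (energy, delay) values for five configurations.
configs = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (6.0, 1.0), (7.0, 7.0)]
front = pareto_set(configs)
print(front)  # [(1.0, 9.0), (2.0, 4.0), (6.0, 1.0)]
```

In this example (3.0, 5.0) and (7.0, 7.0) are both dominated by (2.0, 4.0), while the three remaining points are mutually non-dominated trade-offs.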
The methodology proposed in this paper enables the efficient identification of an approximate Pareto set of candidate architectures by minimizing the number of simulations of system configurations. This is a notable achievement, since, nowadays, evaluating the system level figures of merit (e.g., time and energy) of a single system configuration means hours or days of simulations under a realistic workload for complex SoCs.
The chip-multi-processor (CMP) architecture we target is a platform composed of a variable number of out-of-order processors with design-time configurable, private L1 and L2 caches. Inter-processor communication is based on a high-bandwidth split-transaction bus supporting a write-invalidate snoop-based MESI coherence protocol acting directly between L2 caches (see Figure 1). To ensure the coherency of shared data in the memory hierarchy, this protocol generates invalidate/write requests between L1 and L2 caches. To estimate system-level metrics, we leveraged the SESC [5] simulation tool, a fast MIPS instruction set simulator for CMPs providing the dynamic energy (indicated as η(x)) and the execution cycles (indicated as τ(x)) associated with the execution of user-selected applications on a system configuration x. In this paper, we focus our analysis on applications derived from the SPLASH-2 [17] parallel benchmark suite.
Our exploration methodology aims at finding the system configurations that minimize both the energy η(x) and the execution time τ(x) of a given application:

$$\min_{x} \Omega(x) = \min_{x} \begin{bmatrix} \eta(x) \\ \tau(x) \end{bmatrix}$$

To solve such a multi-objective problem we propose an exploration methodology based on the NSGA-II [4] evolutionary strategy. Figure 2 shows the basic structure of a conventional, NSGA-II based MPSoC exploration flow. The algorithm maintains a current population of chromosomes which is processed by mutation and cross-over operators (GEN block) to generate a new population X. The objective function Ω is then evaluated by means of a low-level simulator (in our case an almost cycle-accurate architectural simulator), and the result is filtered by means of a SEL block to generate the population of the next iteration. The iteration stops whenever the target stopping criterion (such as a maximum number of total generations) is met.
The chromosome structure used in this paper is shown in Figure 3. The first part of the chromosome expresses the available processor/task parallelism by using an integer number associated with the available processor configurations (or levels). This parameter also determines the degree of coarse-grained parallelism of the application, which is suitably selected during the simulation (note that, in this paper, we do not consider task mapping as part of the problem). The second part expresses the actual number of instructions which can be dispatched to the processor functional units simultaneously. Also in this case, a level is associated with the processor issue widths. The final part expresses the actual configuration of the memory subsystem in terms of block size, associativity and overall cache size of the system caches. Each cache parameter is encoded with an integer level which is associated with feasible cache configurations (typically, a power of two). Overall, the chromosome encoding covers the set of parameters shown in Table I, which amounts to a grand total of 2^17 architecture configurations.
A. Multi-level modeling
The main difficulty encountered by using a plain NSGA-II algorithm is the very long simulation time required to evaluate the system-level objective function Ω(x) and, in turn, the fitness of each chromosome since it depends on actual simulations of the target architecture. In fact, for applications of commercial interest, a single simulation can take even several hours, depending on the platform complexity and the system resources dedicated to the simulation.
To solve the above problem, researchers typically introduce analytic meta-models such as Artificial Neural Networks (ANN) [2] for the target objective function:

$$\hat{\Omega}(x) \approx \Omega(x)$$

such that the components of Ω̂(x) (i.e., η̂(x) and τ̂(x)) are efficiently evaluated and represent a reasonable approximation of the metrics measured with an actual simulation of the system. The model is constructed and updated with simulation data collected during optimization, and it is a very effective tool for analytically predicting the behavior of the system platform without resorting to a system simulation. The design space exploration techniques which are based on such analytic models are called model-assisted and have shown promising results in the field of MPSoC exploration [2]. However, we must point out some drawbacks:
• Complexity. The number and topology of the neural networks depend on the target system-level metrics. For example, in our case, we need two neural networks, trained in such a way that the prediction of the execution time and energy consumption is accurate enough for every solution in the current population. However, the precision required by this process can be unattainable by reasonably sized networks and, at the same time, it is overkill, since our goal is to identify only the Pareto dominant solutions.
• Bootstrap and evolution control. The time needed to bootstrap the neural network is directly dependent on the number of samples which should be simulated with a low-level, accurate simulator. This time period can be extremely long, depending on the target architecture to be simulated. Moreover, the evolution control strategy (i.e., the one which selects which individuals should be further refined with the low-level simulator) is forced to choose between a coarse, analytic prediction and a low-level simulation, without intermediate choices.
In this paper, we propose to approach the above problems by combining two techniques:
• Simplified models for Pareto ranking. Since our target is to identify dominant solutions, we propose to modify the classic model-assisted approach by using a neural network only to predict the non-domination rank of candidate solutions with respect to the configurations simulated so far. The non-domination rank ρ(x) can be thought of as the number of times the Pareto set must be peeled off from the original population before no remaining solution dominates x. It is a measure of how deeply configuration x is nested among sub-optimal solutions: the higher ρ(x), the worse the configuration x. More formally, given a set of simulated configurations Λ_0, we define an operator Π such that Π(Λ_0) is its Pareto set. We define a sequence of subsets Λ_n ⊆ Λ_0 such that

$$\Lambda_{n+1} = \Lambda_n \setminus \Pi(\Lambda_n)$$

We define the non-domination rank of a solution x ∈ Λ_0 as:

$$\rho(x) = n \quad \text{such that} \quad x \in \Pi(\Lambda_n)$$

• High-level simulation. As noted above, the traditional model-assisted methodologies cannot choose intermediate-accuracy models which can speed up the training of the neural network and reduce the overall optimization time. We thus introduce high-level simulation into the design space exploration process. In contrast to a low-level simulation, which entails (almost) cycle-accurate modeling techniques (such as event-based simulation), a high-level simulator utilizes some form of abstraction to speed up the evaluation of the system figures of merit.
An example of such abstraction is the transaction, a form of communication of information between the elements of the system. A transaction is typically annotated with timing and energy consumption information and speeds up the simulation by hiding the details of communication hand-shaking. A simulator using this kind of abstraction is called a transaction-level simulator. We assume that, for the architecture under consideration, we have the possibility to choose between a low-level and a high-level simulator. Thus we introduce an extended evolution control strategy which takes into account both the low-level and the high-level simulator. The techniques presented above are combined to form the main contribution of this paper: the Multi-Level NSGA-II exploration strategy. This strategy is detailed in the following subsection.
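The peeling procedure that defines the non-domination rank can be sketched in a few lines of Python. This is only an illustrative sketch, assuming that both objectives are minimized and that ranks start at 0 for the Pareto set of the whole population; the helper names and the sample (energy, delay) values are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b when it is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_domination_ranks(objs: List[Sequence[float]]) -> List[int]:
    """Rank of each point: how many Pareto sets must be peeled off
    before the point itself becomes non-dominated."""
    remaining = set(range(len(objs)))
    ranks = {}
    n = 0
    while remaining:
        # Pareto set of the configurations that are still present.
        front = {i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining)}
        for i in front:
            ranks[i] = n
        remaining -= front  # peel the current Pareto set off
        n += 1
    return [ranks[i] for i in range(len(objs))]


# Hypothetical (energy, delay) values for five simulated configurations.
simulated = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (6.0, 1.0), (7.0, 7.0)]
ranks = non_domination_ranks(simulated)
print(ranks)  # [0, 0, 1, 0, 2]
```

Here (3.0, 5.0) becomes non-dominated only after the first Pareto set is removed, and (7.0, 7.0) only after the second, so their ranks are 1 and 2 respectively.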
B. The proposed multi-level exploration flow
The proposed exploration flow is called Multi-Level NSGA-II (ML-NSGA-II) and it is shown in Figure 4 . Basically it consists of a variation of the conventional, NSGA-II based design flow shown in Figure 2 ; the flow is parameterizable with respect to a parameter β which is used for discriminating good solutions from bad solutions.
1) The GEN block performs mutation and cross-over on the current population, creating a set of configurations X.
2) X is fed to the ANN (we hypothesize that the ANN has been already trained). The ANN predicts only the rank ρ(x) associated with each configuration x ∈ X.
3) The imprecise evaluation filter (IPE-filter) splits X into two sets, X_LS and X_HS. To do so, the IPE-filter identifies the lowest rank value t such that β% of the configurations simulated so far have rank ρ̂ ≤ t (note that β is a parameter of the exploration flow: β = 0% implies that all the configurations go to the high-level simulator, while β = 100% redirects all the configurations to the low-level simulator). X_LS is the set of good candidates with predicted rank ρ̂ ≤ t. X_HS contains the bad candidates having ρ̂ > t.
4) The low-level simulator provides a slow but accurate evaluation of the objective function Ω(x), for each x ∈ X_LS.
5) The high-level simulator provides a fast but rough estimation of the objective function Ω̂(x), for each x ∈ X_HS.
6) The Pareto set is consolidated (or further refined) by means of the low-level simulator. This process iteratively launches the low-level simulator to compute the actual value of Ω(x), for each x ∈ X_HS in the Pareto set of Ω̂(x).
7) Each configuration in X_LS and X_HS is tagged with the value of Ω(x) (or Ω̂(x) if Ω(x) has not been computed) and returned to the SEL block of the NSGA-II algorithm.
8) The ANN model is re-trained with updated information on the rank of the configurations simulated so far. This improves the estimated rank ρ̂(x) of future configurations to be evaluated.
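The splitting performed by the IPE-filter in step 3 can be sketched as follows. This is a minimal Python sketch under stated assumptions: ranks are non-negative integers, the threshold t is computed from the ranks of the configurations simulated so far, and "β% of the configurations" is read as "at least β%". The function names, the configuration labels and the sample ranks are hypothetical.

```python
from typing import Dict, List, Tuple


def ipe_threshold(simulated_ranks: List[int], beta: float) -> int:
    """Lowest rank t such that at least beta percent of the simulated
    configurations have rank <= t; t = -1 means that no rank passes."""
    total = len(simulated_ranks)
    for t in [-1] + sorted(set(simulated_ranks)):
        covered = sum(1 for r in simulated_ranks if r <= t)
        if covered * 100 >= beta * total:
            return t
    return max(simulated_ranks)  # only reached if beta exceeds 100


def ipe_filter(predicted: Dict[str, int], simulated_ranks: List[int],
               beta: float) -> Tuple[List[str], List[str]]:
    """Split candidates into X_LS (low-level simulation) and
    X_HS (high-level simulation) by their predicted rank."""
    t = ipe_threshold(simulated_ranks, beta)
    x_ls = [x for x, r in predicted.items() if r <= t]
    x_hs = [x for x, r in predicted.items() if r > t]
    return x_ls, x_hs


# Hypothetical ranks of ten simulated configurations and four candidates.
history = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4]
candidates = {"a": 0, "b": 2, "c": 1, "d": 0}
print(ipe_threshold(history, 20))           # 0
print(ipe_filter(candidates, history, 20))  # (['a', 'd'], ['b', 'c'])
```

With β = 0 the threshold is -1, so every candidate is sent to the high-level simulator, which matches the behavior described in step 3.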
To bootstrap the neural network model, we perform an analysis of a set of randomly chosen configurations corresponding to 0.1% of the design space. This analysis is performed by using the high-level simulator plus a consolidation phase with the low-level simulator (as shown in step 6). The collected data is used for the initial training of the ANN model.
IV. EXPERIMENTAL RESULTS
In this section, we present a set of experimental results of the application of the proposed optimization methodology to the design space exploration of the CMP architecture as described in Section III.
A. Simulation Set-Up
Before delving into the experimental results, let us describe the simulators we have used for validating our methodology.
As a low-level simulator, we leveraged the SESC [5] tool, an almost cycle-accurate architectural-level simulator for chip multiprocessor architectures. This simulator estimates both the energy consumption and the execution time of a given application by using a number of consolidated modeling techniques (among which CACTI [18] and WATTCH [19]). To perform the analysis, we used a set of benchmarks derived from the Stanford Parallel Applications for Shared Memory (SPLASH-2) [17] suite. The suite is organized into two sections: kernels and applications. We selected a subset of the suite composed of the following benchmarks:
• Kernels: Complex 1D FFT (fft), Blocked LU Decomposition (lu), Blocked Sparse Cholesky Factorization (cholesky), Integer Radix Sort (radix).
• Applications: Barnes-Hut (barnes), Adaptive Fast Multipole (fmm), Ocean Simulation (ocean), Volume Rendering (volrend).
All the experimental results presented in the following sections have been averaged over the above SPLASH-2 benchmarks. Each benchmark's data-set has been chosen such that the average low-level simulation time is less than 5 minutes.
As a high-level simulator, we have built a transaction-level model of the multiprocessor architecture. Each core is modeled by using instruction set simulation, where each instruction is annotated with a rough estimate of its base execution time and energy consumption. Each time a core accesses the memory, we use transaction-level modeling to obtain the overall latency and energy consumption of the transaction, which are added to the base instruction cost. The resulting transaction-level simulator is two orders of magnitude faster than the low-level simulator while showing a mean relative error of about 32%.
B. Rank model selection and validation
The neural model used for rank prediction is a feedforward ANN with a single hidden layer. We used a back-propagation algorithm to perform the on-line training of the model. To avoid over-fitting (i.e., a decrease in the generalization power of the model), we divided the training set into two partitions: an 80% learning set and a 20% test set. We iterated the training algorithm over the learning set as long as the error on the test set decreased (a criterion known as early stopping). According to [20], we set the size of the hidden layer of the network to 8, which lies within the range defined by the number of input and output parameters.
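The early stopping criterion described above can be illustrated with a short Python sketch. To keep the example self-contained, a one-variable linear model trained by stochastic gradient descent stands in for the feedforward ANN and back-propagation; only the 80%/20% split and the stopping rule reflect the text. All names, the learning rate and the synthetic data are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Sample = Tuple[float, float]


def split_80_20(samples: List[Sample], seed: int = 0):
    """Shuffle and split the samples into 80% learning and 20% test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def mse(w: float, b: float, data: List[Sample]) -> float:
    """Mean squared error of the stand-in model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_early_stopping(samples: List[Sample], lr: float = 0.01,
                         max_epochs: int = 1000):
    """Train on the learning set while the test-set error keeps decreasing."""
    learn, test = split_80_20(samples)
    w, b = 0.0, 0.0
    best_err, best_w, best_b = mse(w, b, test), w, b
    epochs = 0
    for epochs in range(1, max_epochs + 1):
        for x, y in learn:  # one gradient-descent pass over the learning set
            err = w * x + b - y
            w -= lr * err * x
            b -= lr * err
        test_err = mse(w, b, test)
        if test_err >= best_err:  # test error stopped decreasing: stop
            break
        best_err, best_w, best_b = test_err, w, b
    return best_err, best_w, best_b, epochs


# Synthetic data drawn from y = 2x + 1 (hypothetical, noise-free).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(50)]
learn_set, test_set = split_80_20(data)
initial_err = mse(0.0, 0.0, test_set)
final_err, w_fit, b_fit, n_epochs = train_early_stopping(data)
```

The parameters returned are those of the last epoch before the test error stopped improving, which is the usual way of applying this criterion.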
Model accuracy: initial behavior. To evaluate the quality of the rank approximations obtained with the ANN model, we begin with a visual inspection of the error when it is trained with a low number (50) of samples. This case should be representative of the worst-case scenario which happens when bootstrapping the neural network in the initial phase of the optimization flow.
To compare the predicted vs. the actual rank we use a bubble plot. A bubble plot is a way of representing the relationship between three variables on a scatter-plot. Observations on two variables are plotted in the usual way using circles as symbols; the radii of the circles are made proportional to the associated values of the third variable. In our case, the centers of the circles are located at the coordinates (ρ, ρ̂), while the radii are proportional to the number of architectures x with those specific (ρ, ρ̂) coordinates. Larger circles positioned on the line with unitary slope mean a more precise model. Figure 5 shows a bubble plot of the actual rank ρ versus the approximated rank ρ̂ predicted by the neural model for all the SPLASH-2 benchmarks considered in this paper. Figure 5 shows that we can reasonably assume that the selected ANN model is accurate enough for predicting, at a coarse level, the ranks of the architectural configurations of the considered architecture.
Model accuracy: steady state behavior. To further investigate the predictive capabilities of our model during the steady state of the optimization algorithm, we evaluate the coefficient of determination R^2 by using an increasing set of observations as a training set. R^2 is the proportion of variability in a data set that is accounted for by the statistical model. As a rule of thumb, the higher R^2, the better the accuracy of the model. Figure 6 shows the trend of the average R^2 for all the SPLASH-2 benchmarks, as the set of training samples increases up to 2000. As can be seen, the model reaches a very good accuracy as the training set grows in size. As will be seen in the next section, the average number of training observations for a typical optimization run is around 400. For that scenario, the average R^2 is around 0.9, which supports our confidence in the ANN model.
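For reference, the coefficient of determination can be computed as sketched below, assuming the usual definition R^2 = 1 − SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean of the actual values. The function name and the sample ranks are hypothetical.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted ranks for five configurations.
rho = [0, 1, 2, 3, 4]
rho_hat = [0, 1, 2, 2, 4]
score = r_squared(rho, rho_hat)
print(round(score, 3))  # 0.9
```

In this example a single rank is mispredicted by one level, giving SS_res = 1 against SS_tot = 10.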
C. Optimization Results
In this section, we present the results obtained by using the proposed ML-NSGA-II methodology on the target architecture. The experimental results are averaged over the given set of benchmarks. First, we describe the tuning of the IPE-filter parameter β in the proposed algorithm, then we compare it with the plain version of NSGA-II. In both cases (ML-NSGA-II and plain NSGA-II), the population size has been set to 50 individuals, while the GEN block (Figures 2 and 4) steadily produces 80% of the new population by means of cross-over and 20% by means of mutation. We remark that the elitism of the NSGA-II algorithm is guaranteed by the SEL block, which maintains the best individuals over generations.
To fairly compare ML-NSGA-II and NSGA-II, we use the following, correlated criteria:
• Exploration time. This criterion corresponds to the total amount of time employed by the optimization heuristics to run until obtaining a desired quality of the heuristic solution.
• Quality of the solution set. This criterion evaluates how close the solutions found by the heuristics are to the exact Pareto set of the problem after a fixed amount of time. We measure the quality in terms of the Average Distance from Reference Set (ADRS) [21]. The ADRS is essentially a measure of the distance of a heuristic Pareto solution from the exact Pareto solution; it is usually expressed as a percentage, and the higher the ADRS, the worse the heuristic solution. The exact Pareto solution has been computed by exhaustively simulating the entire design space with the low-level simulator.
Figure 7 presents the results obtained with the proposed ML-NSGA-II strategy by varying the IPE-filter parameter β in the range {50%, 30%, 20%, 10%, 0%}. In particular, Figure 7(a) presents the results for the exploration time when the desired ADRS (percentage w.r.t. full-search) is changed, while Figure 7(b) presents the results for the ADRS when the desired exploration time (percentage w.r.t. full-search) is changed.
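The ADRS metric can be sketched as follows. The text does not spell out the formula, so this sketch assumes one common formulation for minimization problems with strictly positive objectives: for each point of the reference (exact) Pareto set, take the closest point of the heuristic set, where the distance is the largest relative worsening over the objectives (clipped at zero), and average over the reference set. The function name and the sample values are hypothetical.

```python
from typing import List, Sequence


def adrs(reference: List[Sequence[float]],
         approx: List[Sequence[float]]) -> float:
    """Average distance of an approximate Pareto set from a reference set."""

    def delta(ref: Sequence[float], cand: Sequence[float]) -> float:
        # Largest relative worsening of cand with respect to ref, or 0.
        return max(0.0, *((c - r) / r for r, c in zip(ref, cand)))

    return sum(min(delta(r, a) for a in approx) for r in reference) / len(reference)


# Hypothetical (energy, delay) Pareto sets.
exact = [(1.0, 8.0), (2.0, 4.0), (4.0, 2.0)]
heuristic = [(1.0, 8.0), (2.2, 4.0), (4.0, 3.0)]
value = adrs(exact, heuristic)
print(f"{100 * value:.1f}%")  # 20.0%
```

The three reference points contribute distances of 0, 0.1 and 0.5, so the average is 0.2; an ADRS of 0 would mean that the heuristic set covers the whole reference set.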
As can be seen from Figure 7, both criteria present a convex behavior with respect to the value of β. In particular, we can find a sweet spot (minimum value) for β = 20%, which presents a good trade-off between exploration speed and accuracy. In fact, increasing the value of β increases the number of architectures to be simulated at low level (and thus the exploration accuracy) at the expense of the exploration time. Reducing β increases the exploration speed, since a large number of architectures are simulated rapidly by using the high-level simulator, but decreases the exploration accuracy. Thus, in the following discussion, we set the IPE-filter parameter β to 20%.
Figure 8(a) shows the simulation time saving for ML-NSGA-II with respect to the plain NSGA-II. For all the values of the ADRS we considered ({5%, 2.5%, 1%, 0.5%}), the time saving of the proposed methodology is consistently around 50%. Figure 8(b) shows the ADRS improvement of the proposed ML-NSGA-II with respect to the plain NSGA-II, when varying the percentage of exploration time with respect to the full-search. The ML-NSGA-II heuristic is able to provide a ten-fold improvement with respect to the plain NSGA-II algorithm.
Finally, Figure 9 shows the breakdown of the exploration time among the different components of the ML-NSGA-II heuristic with β = 20%. As expected, most of the exploration time (95.4%) is spent simulating architectural configurations at low level, while only 4.3% of the time is spent on high-level simulations. The time needed for the remaining components (ANN model training for the rank prediction) is
V. CONCLUSIONS
In this paper we proposed Multi-Level NSGA-II (ML-NSGA-II), a methodology for heuristic platform-based design that relies on evolutionary algorithms and multi-level simulation techniques. In particular, we extended NSGA-II with an approximate neural network meta-model for multi-processor architectures in order to replace expensive platform simulations with fast meta-model evaluations. The model accuracy and efficiency are improved by exploiting high-level platform simulation techniques. High-level simulation allowed us to reduce the overall complexity of the neural network and improve its prediction power.
Experimental results showed that the proposed technique is able to reduce the simulation time by up to 50% with respect to traditional heuristic techniques while also increasing the quality of the final Pareto set.
