Abstract - High-Level Synthesis (HLS) is the process of developing digital circuits from behavioral specifications. It involves three interdependent and NP-complete optimization problems: (i) operation scheduling, (ii) resource allocation, and (iii) controller synthesis. Evolutionary Algorithms have already been applied effectively to HLS to find good solutions in the presence of conflicting design objectives. In this paper, we present an evolutionary approach to HLS that extends previous works in three respects: (i) we exploit NSGA-II, a multi-objective genetic algorithm, to fully automate the design space exploration without the need for any human intervention, (ii) we replace the expensive evaluation process of candidate solutions with a quite accurate regression model, and (iii) we reduce the number of evaluations with a fitness inheritance scheme. We tested our approach on several benchmark problems. Our results suggest that all the enhancements introduced improve the overall performance of the evolutionary search.
Introduction
High-Level Synthesis (HLS) [8] is concerned with the design and implementation of digital circuits starting from a behavioral description, a set of goals and constraints, and a library of different types of resources. HLS typically consists of three steps: the scheduling, the resource allocation and the controller synthesis. The scheduling assigns each operation to one or more clock cycles (or control steps) for the execution. The resource allocation assigns the operations and the produced values to the hardware components and interconnects them using connection elements. Finally, the controller synthesis provides the logic to issue datapath operations, based on the control flow. Unfortunately, it is non-trivial to solve these problems as they are NP-complete and strongly interdependent. In addition, the high-level synthesis problem is multi-objective and most of the design objectives are contrasting by nature. Therefore, developers usually apply an iterative refinement cycle: at each step they (i) manually apply transformations, (ii) synthesize the design, (iii) examine the results coming from the synthesis of the design solution, and (iv) modify the design to trade off the design objectives. This process is usually called design space exploration.
Evolutionary algorithms (EAs) have been successfully applied [12, 21] to such complex explorations, since their behavior is very similar to that of a designer: they iteratively improve a set of solutions (i.e., alternative designs) using the results of their evaluations as feedback to guide the search in the solution space. In addition, EAs have proved to work well on large optimization problems even if (i) the search space is constrained and (ii) little information is available on the problem. EAs can also easily deal with different objectives, without the need to combine them into a single objective function. The main drawback of EA approaches is the need to evaluate a huge number of design alternatives. This is a serious concern, as in HLS problems the solution evaluation is a very expensive process. To meet time-to-market constraints we need to shorten the design process without reducing the quality of the solutions discovered.
In this paper we present an evolutionary framework to perform a fully automated design space exploration for HLS problems. In addition, to compute the fitness of the evolved solutions we replace the usual expensive evaluation process with a cost model coupled with a fitness inheritance scheme. In particular, our approach extends previous works on the application of EAs to HLS [14, 21, 24, 25] in three respects: (i) while previous works focused on evolutionary approaches to optimize a human-designed objective function, we exploit NSGA-II [9], a multi-objective genetic algorithm, to perform a fully automated design space exploration; (ii) we exploit a regression model to perform a fast and quite accurate evaluation of the candidate solutions; (iii) to our knowledge, this is the first work that applies a fitness inheritance scheme to HLS in order to reduce the number of evaluations. We validated our approach on several benchmark problems. Our empirical results suggest that both the regression model and the fitness inheritance scheme improve the design space exploration process.
This chapter is organized as follows. After discussing relevant work in Section 1.2, we describe our approach in Section 1.3. In Section 1.4 we discuss the issues of the solution evaluation. Then, we present two different techniques to reduce expensive evaluations: cost modeling and fitness inheritance [29, 34]. The former technique, presented in Section 1.5, consists of replacing a part of the expensive solution evaluation with an estimation model that, given some relevant features of the solution, provides an estimation of its objective values, dramatically decreasing the cost of the fitness computation. The latter technique, detailed in Section 1.6, reduces the number of fitness evaluations by replacing some of them with a surrogate based on the fitness values of other, previously evaluated individuals. Experimental evidence on a set of historical benchmarks for the HLS problem, both in terms of quality of the solutions w.r.t. the design objectives and overall execution time of the exploration, is presented and discussed for each technique.
Related Work
The common techniques used in high-level synthesis can be classified into three categories: exact, heuristic and non-deterministic approaches.
The exact approaches [7, 18] exploit mathematical formulations of the problem and may find the optimal solution. Unfortunately, their computational requirements grow exponentially with the size of the problem and are impractical for large designs.
The heuristic approaches [8, 28, 35] work on a single operation or resource at once and perform continuous refinements on the set of solutions. The decision process is deterministic, so they do not explore all the design alternatives, possibly leading to sub-optimal solutions. Furthermore, most of these techniques perform the scheduling and the allocation sub-tasks separately, with the scheduling usually performed as the first step.
To support scalability and to explore a larger set of alternative designs, several non-deterministic approaches (e.g., [20]), and in particular GAs [12, 14, 21, 24, 25], have been efficiently applied to HLS. Most of them focused on only one of the HLS sub-tasks. In [24], GAs are used to schedule the operations, while in [25] they are used to allocate and bind a scheduled graph. Grewal et al. [14] implemented a hierarchical genetic algorithm, where genetic module allocation is followed by a genetic scheduling. Araújo et al. [1] used a genetic programming approach, where solutions are represented as tree productions (rephrased rules in the hardware description language grammar) to directly create Structured Function description Language (SFL) programs. This work presents a different approach w.r.t. the previous ones, but it is difficult to control the optimizations. Krishnan and Katkoori [21] proposed a priority-based encoding, where solutions are represented as a list of priorities that defines the order in which the operations should be chosen by the scheduling algorithm. However, they performed a single-objective optimization using a weighted average of the design objectives, which has been shown to be ineffective [39]. Several works [12, 25] introduced a binding-based encoding, also used for system-level synthesis [27, 36], where solutions are represented as the binding between operations and the functional units where they will be executed. In some of these approaches the exploration can generate unfeasible solutions that have to be recovered or discarded, wasting time and computational resources.
EAs often require the evaluation of a number of candidate solutions that can easily become computationally unfeasible. This generally happens in real-world problems, and it is also the case for HLS. Accordingly, several evaluation relaxation [13] techniques have been introduced in the literature to speed up EAs: an accurate, but computationally expensive, fitness function is replaced by a less accurate, but inexpensive, surrogate. Following the early empirical design, theories have been developed to understand the effect of approximate surrogate functions on population sizing and convergence time and to enhance speedups (see [31] for further details). The surrogate can be either endogenous [34] or exogenous [2, 19, 23]. Fitness inheritance [34] is one of the most promising endogenous approaches to evaluation relaxation: the fitness of some proportion of individuals in the population is inherited from the parents. Sastry et al. [33] use a model based on least squares fitting, applied in particular to the extended compact genetic algorithm (eCGA [16]). Chen et al. [6] present their studies on fitness inheritance in multi-objective optimization as a weighted average of parent fitness, decomposed into the n different objectives. Recent studies investigated the impact of fitness inheritance on real-world applications [11] and different exploration algorithms [30]. Exogenous surrogates are typically used in engineering applications [2, 10] and consist of developing a simplified model of the real problem to provide an inexpensive surrogate of the fitness function. In particular, in HLS several simplified models for area and timing have been proposed in the literature. In [26], simple metrics are proposed to drive the optimization algorithms, even if some elements are not correctly considered (e.g., steering logic or effects of optimizations performed by the logic synthesis tools).
In [3] the area is estimated with a linear regression approach that is also able to model the effects of the logic optimizations. Unfortunately, most of the proposed models provide poor guidance to the optimization process as they do not take into account the resource binding and the interconnections [5]. In this work we focus on data-flow applications that involve only area models; however, we refer the interested reader to [4, 22] for timing estimation models.
Design Space Exploration with Multi-Objective Evolutionary Computation
The proposed methodology is shown in Figure 1.1(a). The inputs are the behavioral description of the problem in C language, a library of resource descriptions and a set of constraints to be met, specified in XML format. We exploit a customized interface to the GNU GCC compiler to generate the related GIMPLE, representing the behavioral specification. From this, a combined control and data dependencies data structure (CDFG) is built. The CDFG allows the identification of the operations that should be mapped and scheduled on the functional units described in the resource library provided as input, as well as of the precedences among them. The core of our methodology, shown in Figure 1.1(b) and detailed in Section 1.3.1, is the design space exploration that exploits a multi-objective GA (NSGA-II [9]) to concurrently optimize the different design objectives: the area and the latency. This design space exploration iteratively improves a set of candidate solutions to explore the most promising design subspaces. Finally, a Register-Transfer Level (RTL) specification in a hardware description language (e.g., VHDL, Verilog or SystemC) is generated for each of the non-dominated solutions contained in the final population resulting from the exploration algorithm.
Design Space Exploration Core
The design space exploration core is shown in Figure 1.1(b). Initially a population of N candidate solutions is randomly created and evaluated through a complete high-level synthesis, with respect to the design objectives. In the current implementation, area and performance have been considered. Once the evaluation is completed, the solutions are sorted. After the initialization step, a new population of N candidate solutions is created. In particular, each element of the new population, called offspring, is generated by applying the common genetic operators (i.e., crossover and mutation) to the existing solutions, called parents. Finally, each offspring created is evaluated as well and added to the population. The resulting population of size 2N is then sorted again and the worst N solutions are discarded. All the steps described above, except the initialization, are iteratively repeated until the following stopping criterion is met. Whenever the set of best solutions has not improved in the last 10 iterations, the size of the population, N, is increased by 50%. When, even after increasing the population size, no better solutions are found, the optimization process is stopped. At the end, the non-dominated solutions found by the exploration algorithm are returned.
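The stopping rule described above can be illustrated with a small Python sketch. It is only a minimal illustration under stated assumptions: the class name `StagnationController` and its fields are hypothetical, and the reading that the search stops when a grown population also fails to improve within another 10 iterations is one possible interpretation of the text. Only the patience of 10 iterations and the 50% growth come from the description.

```python
from dataclasses import dataclass


@dataclass
class StagnationController:
    """Decides whether the exploration continues, grows N, or stops.

    Hypothetical helper: names and the exact stop rule are assumptions.
    """
    pop_size: int
    patience: int = 10      # iterations without improvement (from the text)
    growth: float = 1.5     # population increased by 50% (from the text)
    stale: int = 0          # iterations since the last improvement
    grown: bool = False     # True if the last reaction was a size increase

    def update(self, improved: bool) -> str:
        """Return 'continue', 'grow', or 'stop' after one iteration."""
        if improved:
            self.stale = 0
            self.grown = False
            return "continue"
        self.stale += 1
        if self.stale < self.patience:
            return "continue"
        # Patience exhausted: either enlarge the population or give up.
        self.stale = 0
        if self.grown:
            return "stop"   # enlarging the population did not help either
        self.grown = True
        self.pop_size = int(self.pop_size * self.growth)
        return "grow"
```

In a full exploration loop, `update` would be called once per generation with a flag telling whether the non-dominated set changed, and the returned action would drive the generation of the next population.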
Genetic Algorithm Design
In this section, we present the design of the NSGA-II that drives our design space exploration.
Solution encoding. In this methodology, the chromosome is simply a vector where each gene describes the mapping between the related operation in the behavioral specification and the functional unit where it will be executed. With this formulation both the resource allocation (i.e., the total number of required functional units) and the operation binding (i.e., the assignment of the operations to the available units) are encoded at the same time, together with all the information necessary to generate the structural description of the design solution. This encoding was first introduced in [12] and is inspired by the approach proposed in [25]. The main advantage of this encoding is that all genetic operators (see Section 1.3.2) create feasible solutions. In fact, the recombination of the operation bindings simply results in a new allocation or binding. In this way, good solutions can be obtained using only common genetic operators, without the need for procedures to recover unfeasible solutions.
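As an illustration of how a single vector can carry both allocation and binding, consider the following Python sketch. The representation of a gene as a `(unit_type, instance)` pair and the function names `decode` and `allocation` are assumptions made for this example; the text only states that each gene maps an operation to a functional unit.

```python
from collections import defaultdict


def decode(chromosome, op_names):
    """Binding: group the operations by the functional unit they are mapped to."""
    binding = defaultdict(list)
    for op, unit in zip(op_names, chromosome):
        binding[unit].append(op)
    return dict(binding)


def allocation(chromosome):
    """Allocation: number of distinct unit instances used, per unit type."""
    instances = defaultdict(set)
    for unit_type, instance in chromosome:
        instances[unit_type].add(instance)
    return {unit_type: len(ids) for unit_type, ids in instances.items()}


# Hypothetical example: four operations, gene i binds operation i.
ops = ["add1", "add2", "mul1", "add3"]
chromosome = [("adder", 0), ("adder", 1), ("mult", 0), ("adder", 0)]

binding = decode(chromosome, ops)
units = allocation(chromosome)
```

Here `add1` and `add3` share the same adder instance, so the allocation counts two adders and one multiplier. Any vector whose genes point to admissible units decodes to a valid binding, which is the property that lets plain crossover and mutation produce feasible solutions.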
Initial population. At the beginning of each run, an initial population of admissible resource bindings is created. It can be created by random generation or, to cover a larger design space and to speed up the exploration process, by generating some known solutions (e.g., the one with the minimum number of functional units or the one with the minimum latency). This allows the algorithm to start from some interesting points and then to explore around them to improve them.
Fitness function. To evaluate the solutions we used the following multi-objective fitness function:
\[ F(x) = \big( \mathit{Area}(x),\ \mathit{Time}(x) \big) \]
where Area(x) is an estimation of the area occupied by the solution x and Time(x) is an estimation of the latency of the solution x, computed as the worst-case execution time of the scheduled solution in terms of clock cycles. The goal of the genetic algorithm is to find the best trade-offs with respect to this cost function.
Ranking and selection. The ranking of solutions is an iterative process. At iteration k, all the solutions are first sorted according to the fast non-dominated sort. Then, the non-dominated solutions are classified as solutions at level k and removed from the solutions to be ranked. The process is repeated until all the solutions have been ranked. At the end of the evolutionary process, the whole set of solutions ranked as the best ones will be the outcome of the optimization. We refer to this set as the approximation of the Pareto-optimal set discovered by the evolutionary process.
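The level-by-level ranking can be sketched in Python as follows. This is a deliberately simple "peeling" version that repeatedly extracts the non-dominated solutions; it yields the same fronts as the fast non-dominated sort of NSGA-II but without its bookkeeping optimizations. Both objectives are assumed to be minimized, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def rank_fronts(points):
    """Return the fronts as lists of indices into points, best front first."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        # Level k: solutions not dominated by any other solution still unranked.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


# Hypothetical (area, latency) pairs.
designs = [(10, 5), (8, 7), (12, 4), (11, 6), (13, 8)]
fronts = rank_fronts(designs)
```

The first front returned corresponds to the approximation of the Pareto-optimal set; the remaining fronts are only needed to decide which solutions survive into the next generation.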
Genetic operators. To explore and exploit the design space, the usual genetic operators are used: the unary mutation and the binary crossover. The two operators are applied with probability P_m and P_c, respectively. Mutation is an operator used for finding new points in the search space. Mutation has been implemented with a relatively low rate (e.g., P_m = 10%) and it is applied as follows: each gene is modified with probability P_µ, changing the corresponding binding information. Crossover is a reproduction technique that mates two parent chromosomes and produces two offspring chromosomes. Given two chromosomes, a standard single-point crossover is applied with a high probability (e.g., P_c = 90%). The crossover mechanism mixes the binding information of the two parent solutions.
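The two operators can be sketched in Python as shown below. This is a minimal sketch, not the actual implementation: the argument `candidates`, which lists the admissible functional units for each gene, and the function names are assumptions. The per-gene probability plays the role of P_µ; deciding whether an operator is applied at all (P_m, P_c) is left to the caller.

```python
import random


def mutate(chromosome, candidates, p_gene, rng):
    """Rebind each gene, with probability p_gene, to a random admissible unit."""
    return [rng.choice(candidates[i]) if rng.random() < p_gene else gene
            for i, gene in enumerate(chromosome)]


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)   # requires at least two genes
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


# Hypothetical usage with five operations that can be bound to units 0, 1 or 2.
rng = random.Random(42)
parent_a = [0, 0, 0, 0, 0]
parent_b = [1, 1, 1, 1, 1]
child_1, child_2 = single_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child_1, [[0, 1, 2]] * 5, 0.1, rng)
```

Because every gene of an offspring is either copied from a parent or drawn from the admissible units of that gene, the offspring is always a valid binding.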
Performance Measure
The outcome of multi-objective EAs is a set of solutions that represents the best estimate of the Pareto front in the objective space. Accordingly, evaluating and comparing the outcome of different EAs is not as trivial as it is in single-objective optimization. In particular, several metrics have been introduced in the literature [38] with different features and aims. In general a performance metric can provide either a relative measure (e.g., Non Dominated Combined Set Ratio [37]) or an absolute measure (e.g., S metric [38]). Metrics of the former type are devised to compare only two sets of solutions, while those of the latter type allow ranking several sets of solutions on a specific problem. In this work we used a performance metric that is equivalent to the S metric as (i) we need to compare several sets of solutions and (ii) it is a scale-independent metric. In minimization problems with two objectives the S metric can be computed as the hypervolume between the set and the anti-ideal objective vector [17]. Unfortunately, in HLS the anti-ideal objective vector is not always defined. Accordingly, we set each objective of the anti-ideal vector to the worst value discovered during all the evolutionary runs. In addition, for better readability we defined the performance measure as the area complementary to the S metric in the positive quadrant. Accordingly, the smaller the metric, the better the set of solutions.
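For two minimization objectives, the measure can be computed as in the following Python sketch. It assumes that "complementary in the positive quadrant" means the area of the box spanned by the origin and the anti-ideal vector minus the hypervolume; the function names and the sample values are hypothetical.

```python
def hypervolume_2d(front, ref):
    """Area dominated by the front and bounded by the anti-ideal point ref."""
    # Keep only points inside the box and sweep them by increasing first objective.
    points = sorted(p for p in front if p[0] <= ref[0] and p[1] <= ref[1])
    area = 0.0
    best_second = ref[1]
    for first, second in points:
        if second < best_second:          # dominated points add no area
            area += (ref[0] - first) * (best_second - second)
            best_second = second
    return area


def complementary_area(front, ref):
    """Area of the box [0, ref] NOT dominated by the front (smaller is better)."""
    return ref[0] * ref[1] - hypervolume_2d(front, ref)


# Hypothetical (area, latency) front and anti-ideal vector.
front = [(2, 8), (4, 4), (8, 2)]
anti_ideal = (10, 10)
s_metric = hypervolume_2d(front, anti_ideal)
measure = complementary_area(front, anti_ideal)
```

With these sample values the front dominates 44 of the 100 area units of the box, so the complementary measure is 56. A front that moves closer to the origin dominates more of the box and therefore lowers the measure.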
Solution Evaluation
The crucial point to obtain a fast and effective convergence of the exploration is the quality of the solution evaluation. In particular, the values of the fitness function should be as close as possible to the real values that would be obtained through the actual implementation of the design on the target technology and the evaluation of the desired characteristics (e.g., area or latency). For this reason, the best fitness is obtained with a complete synthesis of each design solution. A complete synthesis includes the following two steps. First, a high-level synthesis flow is performed from a fully specified design solution to a structural description of the circuit. Then a logic synthesis step is applied to generate the circuit for the target technology from its structural description. Our approach targets the Field Programmable Gate Array (FPGA) technology. An FPGA is a semiconductor device that can be configured by the customer or the designer after manufacturing. FPGAs are becoming an interesting alternative to Application Specific Integrated Circuits (ASICs) as they allow the system to be customized without the need for an expensive development process. In particular, FPGAs fit the needs of embedded systems design, where they are used to develop specific accelerators that improve the performance of the applications that will run on the systems. Nevertheless, the choice of this technology introduces additional difficulties in the design process that tools for high-level synthesis and design space exploration need to address. FPGAs are composed of a set of configurable logic units, typically Look Up Tables (LUTs) with four inputs and one output, that are used to represent the logic functions, and a series of flip-flops. These elements are organized in Configurable Logic Blocks (CLBs) that communicate through a programmable interconnection network. Modern FPGAs may also feature dedicated blocks for some types of operations (e.g.,
hardware multipliers) and embedded memories. The generation of an FPGA-based design requires processing the specification of the circuit in a hardware description language with a synthesizer, such as the Integrated Software Environment (ISE) for Xilinx devices or Quartus for Altera solutions. An FPGA synthesis tool follows several stages. The first stage (synthesis) transforms the specification into a set of logic primitives and memory elements for the reconfigurable device. The second stage (mapping) maps these basic components to the specific device available. In the last stage (routing) the blocks are connected together and with the input/output pins. The initial data on the occupation are available after the first stage. This process, however, is quite expensive in terms of time. Depending on the complexity of the design, a single logic synthesis may require hours to be completed. As a result, the solution cannot be evaluated by simply adding the contribution of the allocated components, but the effects of the logic synthesis step have to be somehow considered. In previous works, only the area of the functional units (i.e., the resources that perform the operations) and registers was included in the solution evaluation, as they were considered much more relevant than interconnection elements (e.g., multiplexers). However, recent studies [5, 15] demonstrated that the area of the interconnection elements has by far outweighed the area of the functional units. In ASICs, this brings undesirable side-effects, like an unacceptable propagation delay due to the long wires determined by an inefficient component placement. In FPGAs, this situation is critical for area calculation, since a large number of LUTs may be used to connect and wire the functional blocks. This strongly motivates the design of techniques that take into account the number and size of interconnection elements.
Not considering them could lead to an inaccurate area estimation and to a final solution that does not meet the area constraints. Unfortunately, due to the complexity of the analysis and the interdependence of the synthesis steps, all the information is available only after the complete synthesis of the design solutions. Some examples of the computational effort required to produce a complete synthesis for various designs with our HLS tool and Xilinx ISE version 10.1 are reported in Table 1.1. We used a system with an Intel Core 2 Duo T7500 CPU (2.2 GHz, 4 MB of second-level cache) and 2 GB of memory. These results clearly show that the complete synthesis cannot be efficiently included in any black-box optimization algorithm that usually performs a huge number of design evaluations. This motivates us to investigate different solutions to reduce the execution time of the solution evaluation, limiting the impact on the quality of the final solutions. If we reduce the time required to evaluate a design solution, more alternatives can be analyzed in the same time and a larger portion of the design space can be explored.
In the following sections we introduce two different techniques to speed up the evaluation process. In Section 1.5 we discuss how to replace the fitness computation with a cost model to avoid the expensive logic synthesis step when evaluating candidate solutions. In particular, we show how the accuracy of the cost model affects the overall performance. Then, in Section 1.6 we investigate the application of a fitness inheritance mechanism to reduce the number of evaluations performed without degrading the performance of the evolutionary process.
Building Cost Models
In this section, the time-consuming logic synthesis step is substituted with a model of performance and area, based on relevant features of the structural descriptions obtained by the high-level synthesis step. To compute the performance, it is necessary to count the control steps required by the design to execute all the operations, which correspond to the number of clock cycles required to execute the design. To compute the exact area, instead, it would be necessary to perform the logic synthesis of the specifications produced by the HLS flow. As described in Section 1.2, the typical approach in the literature is to build a fitness surrogate that, considering some features of the design, is able to estimate its occupation. In the following, we present two possible cost models for the area: one linearly combines the number of functional units present in the design and their area and counts the memory elements; the other is a linear regression that also takes into account interconnections.
However, even if solution modeling reduces the time required to evaluate a solution, it introduces an approximation that could affect the explorations. For this reason, the accuracy of the models and the quality of the designs obtained by the exploration using these models will be discussed and analyzed in Section 1.5.2.1 and Section 1.5.2.2, respectively.
Cost Models
One of the simplest models used in HLS flows counts the number of functional units and memory elements. An area estimation in terms of LUTs for each type of functional unit (e.g., adders, subtractors, multipliers) can be easily obtained through the synthesis of such elements. The linear combination of these values provides an initial estimation of the overall occupation of the design [21]. The HLS flow can instead estimate the number of single-bit flip-flops by counting the number of registers required by the design, for both the data-path and the state encoding registers of the control FSM. Consequently, the first estimation model we adopted to compute the area of a design is shown in Fig. 1.2. This model is easy to develop and allows a very fast estimation of the area occupation of the design. However, it can only model how the exploration affects the number of functional units or registers, and does not account for the effects of the interconnection elements. The contribution of the controller is limited to the number of memory elements required to encode the state. The logic to compute the outputs or the transition function is ignored. Such a solution was proposed, several years ago, mainly for data-intensive designs targeting ASIC technology, considering that the interconnections and the controller had a reduced impact on these designs.
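A model of this kind can be sketched in a few lines of Python. The sketch below follows the description in the text rather than the exact formulas of the figure: the function name, the per-unit LUT costs and the binary state encoding are assumptions introduced for illustration.

```python
def simple_area(fu_counts, lut_per_fu, register_widths, num_states):
    """Estimate (LUTs, flip-flops) from functional units and registers only.

    fu_counts:       number of allocated units per type, e.g. {"adder": 2}
    lut_per_fu:      LUTs of one unit of each type, obtained by synthesizing
                     each unit type once (hypothetical figures below)
    register_widths: bit width of every data-path register
    num_states:      number of states of the control FSM
    """
    luts = sum(count * lut_per_fu[kind] for kind, count in fu_counts.items())
    # Assumption: binary state encoding, i.e. ceil(log2(num_states)) bits.
    state_bits = max(1, (num_states - 1).bit_length())
    flip_flops = sum(register_widths) + state_bits
    return luts, flip_flops


# Hypothetical design: two adders, one multiplier, three registers, 12 states.
luts, flip_flops = simple_area(
    fu_counts={"adder": 2, "mult": 1},
    lut_per_fu={"adder": 32, "mult": 600},
    register_widths=[32, 32, 16],
    num_states=12,
)
```

Note that nothing in this estimate changes when two designs use the same units and registers but different multiplexer structures, which is exactly the limitation discussed above.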
Nevertheless, as discussed in Section 1.4, recent studies demonstrated that this approach is not applicable with FPGAs [5], and it is becoming inefficient also for ASICs [15]. We thus investigated more detailed models for generating the required values to verify if it is possible to obtain better approximations. We started with an already existing area model for FPGAs [3], and generalized it for several reasons. First, the vendors (e.g., Xilinx or Altera) offer tools with different approaches to translate the structural descriptions into the logic functions and to interconnect the logic blocks. Second, the devices, even if provided by the same vendor, can use different architectures (e.g., LUTs with a different number of inputs). So, we introduced a generic model and a methodology to specialize it, in order to address different vendors' tools and different devices. The final area model we used for fast estimation is shown in Figure 1.3. For each architecture A the model divides the area into two main parts: the Flip-Flop part and the LUT part. While the Flip-Flop part is easy to estimate using the same formula of the previous approach, the LUT part is a little more complex. Four main parts contribute to the global area in terms of LUT: FU, FSM, MUX and Glue. The FU part corresponds to the contribution of the functional units and so its value is still the sum of the area value of each functional unit. The other three parts (FSM, MUX, Glue) are obtained by using a regression-based approach:
• the FSM contribution is due to the combinatorial logic used to compute the output and next state;
• the MUX contribution is due to the number and size of multiplexers used in the datapath;
• the Glue contribution is due to the logic to enable writing in the flip flops and to the logic used for the interaction between the controller and the datapath.
\[ \#LUT_{Glue} = \lceil \alpha_7 \cdot \#LUT_{FSM} + \beta_7 \cdot A.DataPath.NumRegisters + \gamma_7 \rceil \]
\[ A.Area.LUT = \alpha_8 \cdot \#LUT_{FSM} + \beta_8 \cdot \#LUT_{FU} + \gamma_8 \cdot \#LUT_{MUX} + \delta_8 \cdot \#LUT_{Glue} + \epsilon_8 \]
\[ A.Area = \alpha_8 \cdot A.Area.LUT + \alpha_9 \cdot A.Area.FF \]
Fig. 1.3 Linear regression model to estimate area occupation for the structural design A.
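The Glue and LUT equations of the regression model translate directly into code. The Python sketch below evaluates only these two equations; the FSM and MUX contributions are taken as inputs because their formulas are not reproduced here, and all coefficient values used in the example are placeholders, not the fitted coefficients of the model.

```python
import math


def lut_glue(lut_fsm, num_registers, alpha7, beta7, gamma7):
    """#LUT_Glue = ceil(alpha7 * #LUT_FSM + beta7 * NumRegisters + gamma7)."""
    return math.ceil(alpha7 * lut_fsm + beta7 * num_registers + gamma7)


def area_lut(lut_fsm, lut_fu, lut_mux, glue,
             alpha8, beta8, gamma8, delta8, epsilon8):
    """A.Area.LUT as a linear combination of the four LUT contributions."""
    return (alpha8 * lut_fsm + beta8 * lut_fu
            + gamma8 * lut_mux + delta8 * glue + epsilon8)


# Placeholder inputs and coefficients, chosen only to exercise the formulas.
glue = lut_glue(lut_fsm=40, num_registers=10,
                alpha7=0.5, beta7=0.25, gamma7=1.0)
total_luts = area_lut(lut_fsm=40, lut_fu=664, lut_mux=120, glue=glue,
                      alpha8=1.0, beta8=1.0, gamma8=1.0, delta8=1.0,
                      epsilon8=0.0)
```

Keeping the coefficients as explicit parameters mirrors the idea of a generic model: moving to another vendor tool or device only requires a new set of fitted values, not a change to the formulas.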
The model is then specialized for the particular vendor's tools and devices by using a linear regression approach similar to [3], obtaining an accurate estimation of the design objectives, if properly adapted. For this reason, one of the main drawbacks is that each time the designer changes the experimental setup, an initial tuning phase is required, which could be time-consuming and error-prone.
Experimental Evaluation
In this section, the models are validated by using the set of benchmarks presented in [7] and targeting a Virtex XC2VP30 FPGA. The logic synthesis is executed with Xilinx ISE ver. 10.1. We performed the coefficient extraction for the model based on linear regression and its validation using two datasets, each one composed of different hardware architectures of the benchmarks. The resulting model is shown in Fig. 1.4.
The error of the two models is discussed in Section 1.5.2.1, while their impact on the final estimates of the Pareto-optimal set is analyzed in Section 1.5.2.2. In particular, we show which model is better suited to drive the optimization process carried out by the genetic algorithm. We validated the models on a dataset composed of 73 designs that represent different architectures of the benchmarks described in [7]. The model based on linear regression approximates the real values with good accuracy. In particular, the simplified model shows an average error of 43.39±20.00%, while the maximum error is 73.35%. The model based on linear regression, instead, has an average error equal to 2.22±2.20%, with a maximum error of 11.85%. Thus, we can confirm that it is able to accurately estimate all the area contributions of a structural description and that it can be effectively integrated in the proposed methodology to drive the exploration algorithm.
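Error figures of this form (average ± deviation, plus maximum) can be obtained from pairs of estimated and synthesized areas as in the following Python sketch. The use of the relative error with respect to the synthesized value and of the sample standard deviation are assumptions, and the numbers in the example are made up for illustration only.

```python
from statistics import mean, stdev


def relative_errors(estimated, actual):
    """Percentage error of each estimate with respect to the synthesized value."""
    return [100.0 * abs(e - a) / a for e, a in zip(estimated, actual)]


def summarize(errors):
    """Return (average, sample standard deviation, maximum) of the errors."""
    return mean(errors), stdev(errors), max(errors)


# Illustrative values only: estimated vs. synthesized LUT counts of three designs.
estimated = [110, 95, 200]
synthesized = [100, 100, 200]
avg, dev, worst = summarize(relative_errors(estimated, synthesized))
```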
Performance of the Methodology
The error information is insufficient to determine which model should be preferred. We need to evaluate the effects of the adoption of the models on the resulting estimates of the Pareto-optimal set. The more accurate the model, the better it should drive the design space exploration, resulting in a better estimate of the Pareto-optimal set. However, even a simple model might be enough to perform an effective design space exploration, if it is able to identify and consider the most relevant features of the design. Consequently, we performed different experiments, alternatively adopting different area models. Each experiment consists of 100 generations and involves a population of 100 candidate solutions. The results averaged over 10 runs are shown in Table 1.2, where the column Area measures the quality of the non-dominated set discovered. In particular, the lower this value, the better the outcome of the optimization process. NSGA-II DSE and Synthesis values represent, respectively, the Pareto points coming out from the exploration algorithm and the results after their actual synthesis. The results show that the linear regression model systematically outperforms the simplified model. The reason is that the linear regression model is more accurate, and it is able to consider the effects on the solution evaluation of all the components contained in the final architecture. Furthermore, having a model that generates a more accurate fitness function results in a larger number of points in the estimate of the Pareto-optimal set. Some interesting approximations of the Pareto-optimal curve are also graphically compared in Fig. 1.6. In Fig. 1.6(a) and Fig. 1.6(b) we see that the linear regression model systematically outperforms the simplified one in terms of quality of the Pareto-optimal set. In Fig. 1.6(c), for large designs, the model that considers only functional units and registers obtains better results.
In fact, in this region of the design space, the impact of the multiplexers is limited (about 15-20%) and a fitness function focused only on functional components and registers is more suitable to drive the exploration algorithm. In the other region of the space, where the designs contain few functional units, the multiplexers have a larger impact (about 70-75%) and the fitness function that takes their occupation into account obtains better results. Finally, in Fig. 1.6(d), the multiplexers are not so relevant for the design. As a result, the two models are almost equivalent, as also shown by the similar values in Table 1.2. Table 1.1 shows that the HLS step is also computationally intensive, even if much less so than the logic synthesis one. The impact of HLS grows with the size of the problem. We thus expect that, when applied to larger problems, it could become another significant bottleneck of the methodology.
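The Area column of Table 1.2 can be read as the area delimited by the approximated Pareto-optimal curve. The Python sketch below shows one possible way to compute such a value for two minimized objectives. The choice of a bounding box with a corner at the origin and at a reference point `ref`, as well as the function names and sample points, are assumptions made for illustration; the exact definition used in the experiments may differ.

```python
def non_dominated(points):
    """Keep the points not dominated by any other (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))

def area_below_front(front, ref):
    """Area of the box [0, ref[0]] x [0, ref[1]] that no front point dominates.

    All front points are assumed to lie inside the box. A lower value means
    that the front is closer to the origin, i.e., a better approximation.
    """
    dominated_area = 0.0
    previous_y = ref[1]
    for x, y in sorted(front):  # x ascending, hence y descending on a front
        dominated_area += (ref[0] - x) * (previous_y - y)
        previous_y = y
    return ref[0] * ref[1] - dominated_area

# Hypothetical (area, latency) pairs; (3, 3) is dominated by (2, 2).
candidates = [(1, 4), (2, 2), (4, 1), (3, 3)]
front = non_dominated(candidates)
print(front, area_below_front(front, ref=(5, 5)))
```

With this definition, two explorations can only be compared if the same reference point is used for both.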
Fitness Inheritance
For this reason, in this section we exploit fitness inheritance to replace all the steps of the complete synthesis with an interpolation of the fitness of previously evaluated individuals. Fitness inheritance is a technique totally orthogonal to the solution proposed in the previous section. The individuals used for fitness inheritance can, in fact, be evaluated with any approach (e.g., actual synthesis or modeling with one of the proposed models). The key idea is that, with this approach, we try to limit the overall number of evaluations rather than reducing the time required for a single evaluation (i.e., the synthesis steps, HLS or logic synthesis). Interpolation is usually much less time consuming, thus we can save some of the time required for a complete synthesis.
Note that this technique is less dependent on the problem than solution modeling. In fact, to build the model, the designer should identify the relevant features of the design solutions, synthesize the related hardware descriptions and establish a correspondence. On the contrary, fitness inheritance is only based on the definition of the chromosome encoding and the fitness of previously evaluated individuals.
However, to produce an effective surrogate, we needed to carefully take into account some aspects. In particular, we focused our attention on the percentage of individuals to be estimated, on the parents to choose and on how to combine their fitness. We present and discuss these aspects in Section 1.6.1, and then compare the quality of some different solutions in Section 1.6.2.
Provided that these aspects are properly analyzed, the results show that fitness inheritance is able to consistently reduce the execution time of the whole methodology. We also demonstrate that, if the parameters are not set correctly, the method can even degrade, rather than improve, the performance of the exploration algorithm.
Inheritance Model
In the proposed approach, the fitness of all the individuals is evaluated only in the initial population. In the subsequent populations, only the fitness of a portion of the population is evaluated, while the remaining individuals inherit their fitness through interpolation of the values already computed. In particular, the fitness of individual $Ind_i$ is inherited with probability $p_i$. To compute the fitness estimate for $Ind_i$, we need to calculate the distance between it and all the individuals that can be used for the estimation. The estimation can be based on the ancestors, i.e., all the individuals that have been actually evaluated since the first generation, or on the parents, i.e., the individuals that have been actually evaluated in the latest generation only. In both cases, we call this set $S$ in the rest of the section. The fitness value of $Ind_i$ is thus estimated as follows. The chromosome of $Ind_i$ is compared with the chromosome of each $Ind_j \in S$ and mapped onto a binary vector of size $N$, where each variable of the vector is uniquely related to a gene of the chromosome. The vector is instantiated by the following delta function:

\[
\delta_{i,j}[k] =
\begin{cases}
0 & \text{if } Ind_i[k] = Ind_j[k] \\
1 & \text{otherwise}
\end{cases}
\tag{1.1}
\]
where $Ind_i[k]$ is the value associated with the $k$-th gene of individual $Ind_i$. After the delta function has been computed for all the $N$ genes of the chromosome, the distance $d_{i,j}$ between individual $Ind_i$ and individual $Ind_j$ is calculated as follows:

\[
d_{i,j} = \frac{1}{N} \sum_{k=1}^{N} \delta_{i,j}[k]
\tag{1.2}
\]
This function is normalized by the size of the chromosome, so its value is always between 0 and 1. The distance $d_{i,j}$ measures how dissimilar two individuals are. If they are totally different (there is no matching gene), the value is 1. On the other hand, if the two individuals are identical, the value is 0. Only individuals that are considered neighbors in this space are kept for the fitness estimation. We call $r$ the maximum distance that an individual can have to be kept. The name $r$ recalls the term radius, since the region delimited by this value can be imagined as an $N$-dimensional hypersphere centered at individual $Ind_i$. All the individuals $Ind_j \in S$ having a distance smaller than the radius $r$ can be considered as points inside this hypersphere. Therefore, all these individuals are considered for the estimation and the distance value is modified as follows:

\[
d'_{i,j} =
\begin{cases}
d_{i,j} & \text{if } d_{i,j} < r \\
1 & \text{otherwise}
\end{cases}
\tag{1.3}
\]
where all individuals outside the hypersphere are equivalent to points at infinite distance and are not considered for the estimation. To perform the estimation, we require a minimum number of points in this region. If there are not enough points, there is not sufficient local information to estimate an individual, which is therefore actually evaluated. If there are enough points, instead, the estimation is performed on the set $S'$ of points, selected as follows:

\[
S' = \{\, Ind_j \in S \mid d'_{i,j} < 1 \,\}
\tag{1.4}
\]

and the fitness of $Ind_i$ is computed as the weighted average

\[
Fit^z_i = \frac{\sum_{Ind_j \in S'} f(Fit^z_j)\, g(1 - d'_{i,j})}{\sum_{Ind_j \in S'} g(1 - d'_{i,j})}
\tag{1.5}
\]
for each objective $z$. $Fit^z_k$ is the value of objective $z$ for individual $Ind_k$, and $(1 - d'_{i,j})$ is used as a measure of closeness between individuals. $f$ and $g$ are functions that change the contribution of the two terms. We formulated the term $(1 - d'_{i,j})$ in this way because the distance $d'_{i,j}$ does not go to infinity, but has a value between 0 and 1. Therefore, we consider a value of 1 equivalent to an infinite distance (i.e., no contribution to the fitness). As explained above, this weighted average is computed for all the objectives considered in the optimization. The resulting value is then returned to the genetic algorithm, which can then proceed. A flag is also associated with individual $Ind_i$ to record that its fitness has been estimated and not actually evaluated. This allows the algorithm to identify the estimated individuals when needed. In particular, in the last generation all the individuals are checked: those that have already been evaluated are skipped, while the estimated ones are actually evaluated. Thus, when the exploration ends, all the individuals on which the final non-dominated set is computed have a real fitness value associated.
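The inheritance model described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: chromosomes are plain lists of gene values, $f$ is taken as the identity, $g$ defaults to the linear weighting, and all the names (`distance`, `estimate_fitness`, `assign_fitness`, `min_neighbors`, `p_inherit`) are hypothetical.

```python
import random

def distance(a, b):
    """Fraction of genes that differ between two equal-length chromosomes."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def estimate_fitness(candidate, evaluated, radius, min_neighbors, g=lambda c: c):
    """Interpolate every objective from the evaluated individuals within radius.

    evaluated is a list of (chromosome, fitness_tuple) pairs, i.e., the set S.
    Returns None when there are too few neighbors, so that the caller falls
    back to a real evaluation.
    """
    neighbors = [(distance(candidate, chrom), fit) for chrom, fit in evaluated]
    neighbors = [(d, fit) for d, fit in neighbors if d < radius]
    if len(neighbors) < max(min_neighbors, 1):
        return None
    weights = [g(1.0 - d) for d, _ in neighbors]  # closeness-based weights
    total = sum(weights)
    n_objectives = len(neighbors[0][1])
    return tuple(
        sum(w * fit[z] for w, (_, fit) in zip(weights, neighbors)) / total
        for z in range(n_objectives)
    )

def assign_fitness(candidate, evaluated, evaluate, p_inherit, radius,
                   min_neighbors, rng=random):
    """Return (fitness, estimated_flag) for one individual."""
    if rng.random() < p_inherit:
        estimate = estimate_fitness(candidate, evaluated, radius, min_neighbors)
        if estimate is not None:
            return estimate, True   # flagged as estimated
    return evaluate(candidate), False  # really evaluated

# Hypothetical set S: two close individuals and a distant one.
S = [
    ([0, 1, 2, 3, 4, 5, 6, 7, 8, 0], (100.0, 10.0)),  # one gene differs
    ([9, 1, 2, 3, 4, 5, 6, 7, 8, 9], (200.0, 20.0)),  # one gene differs
    ([9, 9, 9, 9, 9, 5, 6, 7, 8, 9], (900.0, 90.0)),  # five genes differ
]
candidate = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# The two neighbors are equally distant, so the estimate is their plain average.
print(estimate_fitness(candidate, S, radius=0.2, min_neighbors=2))
```

The flag returned by `assign_fitness` is what allows the last generation to re-evaluate only the individuals whose fitness was inherited.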
Experimental Evaluation
In this section, we evaluate different aspects related to fitness inheritance and compare several parameter settings. The parameters for the GA are the same used in Section 1.5.2.2. In all the experiments, the fitness evaluation uses the linear regression model. In Section 1.6.2.1 we present, discuss, and compare different functions to weight the fitness contributions of the evaluated individuals. In Section 1.6.2.2 we apply fitness inheritance both to the ancestors and to the parents and compare the results. Finally, we analyze the effects of different inheritance percentages ($p_i$) and distance rates ($r$).
Weighting Functions
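We considered three weighting functions (i.e., function $g$ in Eq. 1.5) for inheritance: linear, quadratic and exponential. The first model is computed as follows:

\[
Fit^z_i = \frac{\sum_{Ind_j \in S'} Fit^z_j \, (1 - d'_{i,j})}{\sum_{Ind_j \in S'} (1 - d'_{i,j})}
\tag{1.6}
\]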
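where the fitness values of the evaluated individuals are linearly combined according to the related closeness $1 - d'_{i,j}$ to the candidate individual $Ind_i$. The second model is computed as follows:

\[
Fit^z_i = \frac{\sum_{Ind_j \in S'} Fit^z_j \, (1 - d'_{i,j})^2}{\sum_{Ind_j \in S'} (1 - d'_{i,j})^2}
\tag{1.7}
\]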
where the quadratic function in $(1 - d'_{i,j})$ is used to increase the weight of the distance, similarly to the physics equations for gravity or magnetism. However, we adopt a proportion with $(1 - d)^2$ and not $(1/d)^2$, which allows dealing with infinite distance as described above. The last model is computed as:
where the distance is exponentially weighted, emphasizing even more the contribution of the nearest individuals to the fitness estimation of $Ind_i$. These functions have been applied both to the ancestors and to the parents. The distance rate has been set to $r = 0.20$ and the inheritance rate to $p_i = 0.5$. With the ancestors, the set $S$ of individuals considered grows generation by generation, while, with the parents, its size is constant and related to the size of the population. When the ancestors are used, the inheritance model analyzes all the elements of the set for the distance calculation, and the time required for fitness inheritance could exceed the time required by the function evaluation itself. Thus, in this case, fitness inheritance reduces the number of evaluations, but may also degrade the overall execution time of the methodology. Conversely, if the methodology is applied only to the parents, both the number of evaluations and the execution time of the methodology are significantly reduced. However, since fewer individuals are available for computing the inheritance information (see Eq. 1.4), the number of evaluations is larger than with the ancestors. Table 1.3 reports the number of evaluations and the overall execution time.
Finally, Table 1.4 compares the quality of the results obtained with the different weighting functions. As in Section 1.5.2.2, the area delimited by the approximated Pareto-optimal curve gives a measure of the quality of the explorations. The results show that the quadratic function is the most effective solution to weight the fitness contributions. In fact, this function emphasizes the individuals closer to the candidate more than the linear function does. Compared with the exponential function, which (strongly) emphasizes only very similar individuals, it also considers more distant contributions (always inside the radius).
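The different emphasis given by the weighting functions can be illustrated with a short Python sketch that normalizes the weights assigned to a few neighbors. Only the linear and quadratic functions are shown, since they are fully determined by the closeness $1 - d'_{i,j}$; any other function of the closeness can be passed as `g` in the same way. The function names and the sample distances are hypothetical.

```python
def linear(closeness):
    """Weight proportional to the closeness 1 - d'."""
    return closeness

def quadratic(closeness):
    """Weight proportional to the squared closeness (1 - d')^2."""
    return closeness ** 2

def normalized_weights(distances, g):
    """Share of the estimate contributed by each neighbor (shares sum to 1)."""
    raw = [g(1.0 - d) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical neighbors inside a radius r = 0.20.
distances = [0.0, 0.10, 0.19]
for g in (linear, quadratic):
    shares = normalized_weights(distances, g)
    print(g.__name__, [round(s, 3) for s in shares])
```

With the quadratic function, the nearest neighbor receives a larger share of the estimate than with the linear one, while the farthest neighbor inside the radius still contributes.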
Parameter Analysis
In this section, different inheritance rates ($p_i$) and different distance rates ($r$) are studied. The parameters for the GA are the same used in Section 1.5.2.2. In all the experiments, the fitness evaluation uses the linear regression model and exploits inheritance on the parents with the quadratic weighting function. Table 1.5 shows the results of explorations where fitness inheritance is applied with different inheritance rates. Note that values of $p_i$ between 0.40 and 0.55 provide a good trade-off between the quality of the exploration and the related execution time. The reason is that, with lower values, few individuals are chosen for inheritance. On the other hand, with higher values, the number of actually evaluated individuals is limited. When there are not enough similar individuals (at least 10), we fall back to the HLS flow and the area model for the fitness evaluation. Therefore, the execution time is not reduced as expected. The results obtained in our experiments are also consistent with the optimal proportion for inheritance derived in [32]. Finally, Table 1.6 reports the results obtained with $p_i = 0.5$ while changing the distance rate $r$. Almost all the considered rates give good results. However, values between 0.20 and 0.25 perform best. In fact, with lower values, limited information is available for inheritance, while, with higher values, additional noise is introduced in the interpolation.
Conclusions
In this work, we presented an evolutionary approach to the HLS design space exploration problem based on NSGA-II, a multi-objective evolutionary algorithm. We exploited two orthogonal techniques, surrogate fitness and fitness inheritance, to reduce the time required by the expensive solution evaluations. The fitness surrogate was computed with a linear regression model that takes into account the contributions of all the components of the design (e.g., interconnections or glue logic) and the effect of the optimizations introduced by the logic synthesis tool: by replacing the logic synthesis process with such a surrogate model, we can save a large amount of computational time. Fitness inheritance was used to reduce the number of evaluations, by evaluating only a portion of the population. We validated our approach on several benchmarks and our results suggest that both the proposed techniques speed up the evolutionary search without degrading its performance. To the best of our knowledge, this is the first framework for HLS design space exploration that exploits a surrogate fitness model and a fitness inheritance scheme at the same time.
