Abstract: High Level Synthesis (HLS) frameworks allow hardware designs to be described in a high-level language (C/C++), without burdening developers with the error-prone task of specifying their implementation in detail. The HLS process is usually controlled by user-specified directives, which influence the area and latency of the implementation. Nonetheless, the correlation between directives and performance is often difficult to foresee and to quantify. Addressing this challenge, we herein propose a heuristic that, by exploring only a subset of the possible configurations of an HLS design, retrieves a close approximation of its Pareto frontier of non-dominated implementations. Our framework identifies regions of interest in the design space, and iteratively searches for new solutions within such regions, or in their combinations. Experimental evidence across multiple benchmarks shows that our approach to HLS design space exploration reaches better Pareto approximations, with fewer synthesis runs, than state-of-the-art alternatives.
I. INTRODUCTION
The common practice for the design of digital Integrated Circuits (ICs) relies on Hardware Description Languages (HDLs), such as VHDL or Verilog. HDL-based . In addition, these approaches require the concurrent specification of a circuit functionality and of its implementation. These two issues compound, making the exploration of different design variants, having diverse costs and performances, an error-prone and time-consuming process.
High Level Synthesis (HLS) tackles these issues by raising the abstraction level at which hardware circuits are defined, allowing an automated translation of e.g. C/C++ code to hardware. Moreover, HLS frameworks enable, through the use of synthesis directives, the rapid generation of different versions of the same IC, without requiring any further source code modification. HLS is particularly useful for the development of Application Specific Integrated Processors (ASIPs), which combine software-programmable processors and dedicated accelerators, and are widely used in the embedded systems domain. By sharing a common high-level language (C/C++) across the hardware/software divide, HLS workflows enable a fast exploration of ASIP systems having accelerators with different characteristics, as well as a direct path to their implementation.
Different directive settings can lead to designs with widely different area and latency. On the other hand, typically only a few of the resulting implementations are Pareto-optimal, and HLS tools give little guidance on which directive settings lead to Pareto solutions. Identifying such settings is far from obvious, since the rich portfolio of possible synthesis optimizations causes an exponential growth of feasible configurations for a given design. Moreover, directives tend to be inter-dependent according to complex patterns, and their effectiveness is highly influenced by low-level application characteristics such as data dependencies. In this light, an approach based on trial-and-error is clearly sub-optimal, while an exhaustive search is unfeasible beyond very simple cases, because of the explosion in the number of possible configurations.
Herein, we present a novel heuristic that closely approximates the area/latency Pareto frontier of hardware designs developed with HLS, while synthesizing only a small subset of the possible directive configurations. Our framework combines a) a smart choice of the initial exploration points and b) an iterative cluster-based refinement of the estimated Pareto front. Our approach is agnostic both to the set of directives offered by the employed HLS tool, and to the underlying system characteristics. Different HLS frameworks can therefore be interfaced without adaptation. Our contribution is two-fold:
• We propose the use of a probabilistic distribution that favours the selection of extreme values for each directive. The proposed distribution allows the design space exploration to start closer to the Pareto frontier.
• We introduce a cluster-based heuristic, able to effectively divide the design space into regions, and focus the exploration only on promising ones. The heuristic is composed of two stages: an intra-cluster exploration stage, which aims to refine the search within a region, followed by an inter-cluster exploration stage, which discovers new, unexplored areas of the design space.
Our strategy outperforms state-of-the-art methodologies in terms of the number of required synthesis runs, the closeness of the retrieved (estimated) Pareto frontier to the true one, and the algorithm run-time. In the example presented in Figure 1, referring to the Autocorr function from the gsm CHStone benchmark [1], an Average Distance from Reference Set (ADRS) of 0.0128 was achieved with only 234 synthesis runs (out of 1728 possible configurations), and an average algorithm run-time of ∼201 seconds.
The paper continues as follows: in Section II we compare our contribution with related efforts in the field. Section III provides a high-level description of the main features characterizing our strategy. Section IV introduces the mathematical foundations and the notation used in this paper, and Section V details the various steps composing our cluster-based heuristic. Sections VI and VII provide the experimental evidence highlighting the performance of our approach. Concluding remarks are presented in Section VIII.
The source code of the framework implementation, as well as a database of the area-latency values derived from the exhaustive explorations for the considered benchmarks, are available at http://www.inf.usi.ch/phd/ferretti/cluster-based-DSE.html [2].
II. STATE OF THE ART
A number of academic and industrial High-level Synthesis frameworks exist, including LegUp [3] , ROCCC [4] , SPARK [5] and Xilinx VivadoHLS [6] . All of them allow users to guide the synthesis process through specific synthesis directives.
Proposed strategies to automate the ensuing design space exploration can be broadly divided into two categories. A first approach, investigated in [7], [8], [9], [10] and [11], is to derive a high-level model of a target application, and estimate the effect of changing the values of its synthesis directives. Since they strive to approximate the behaviour of HLS tools, these works require a small number of synthesis runs (which are only used to fine-tune the results), but are also tightly linked to specific HLS frameworks and to their employed architectural assumptions. Moreover, they usually take into account only
An alternative strategy, which we also adopt, is to treat HLS suites as black boxes, only observing their inputs (directive values) and outputs (area and latency of a design). Machine learning techniques can then be used to infer the behaviour of a synthesis tool, and generate a model for the effect of directives. To this end, the authors of [12] introduced a framework based on spectral analysis, while [13] proposes an algorithm based on Simulated Annealing and [14], [15] propose algorithms based on Response Surface Modelling. Shafer et al. [11] propose instead to use learning-based methods to implement local search techniques. Liu et al. [16] investigate a methodology based on Random Forest [17], whose results compare favourably with the other black-box alternatives mentioned above. Therefore, we adopt this methodology as a baseline for our experimental evaluation in Section VII.
III. OVERVIEW
Our exploration strategy is motivated by the observation that some combinations of values, when assigned to directives, result in high-quality implementations, while others are sub-optimal, leading to designs with high cost and low performance. Starting from a small initial set of area/latency points in a design space, we therefore explore it by clustering solutions with a high degree of similarity, and discarding clusters which are distant from the Pareto curve.
Clustering allows the Design Space Exploration (DSE) problem to be decomposed into many smaller sub-problems, effectively lowering its complexity. Solutions are clustered considering their similarity both in the synthesis output and in the values of the input directives, as shown in Figure 2. Two points (e.g. ϕ_3 and ϕ_4 in Figure 2) may be assigned to the same cluster if they present similar design parameters, even if they have quite different area and delay. The grouping of points into clusters is performed each time a new solution is synthesized, letting the correlation between values of different directives emerge naturally. Therefore, our strategy waives the need for a model of the effect of each directive on a target design.
An approach relying only on intra-cluster exploration may never reach regions of the design space which do not include any point in the initial set. To avoid this pitfall, clusters are combined to generate new ones. This inter-cluster step enables the exploration of points whose characteristics are in-between two previously considered design space regions. Since it searches for intermediate solutions, this strategy is most effective when points with extreme feature values (high or low) are included in the DSE as part of the initial set. To ensure this condition, the features in the initial sampling set are generated according to a U-shaped probabilistic distribution (detailed in Section V-A).
Figure 3 summarises the clustering-based heuristic presented in this work. Starting from the initial sampling of the design space, the Clustering, Cluster selection, Intra-cluster exploration (expansion of each cluster) and Inter-cluster exploration (generation of new clusters) steps are iteratively performed until either no new solution is found, or a user-defined budget of synthesis runs expires.
A detailed description of the DSE framework steps is provided in the following sections.
IV. PROBLEM FORMULATION AND NOTATION
We treat the DSE of possible implementations of an HLS design as a multi-objective optimisation, with area (A) and latency (L) as objective functions. In fact, the DSE process can be expressed as:
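A compact way to write this mapping, consistent with the definitions of D and S that follow (the function name HLS is an assumption made here for illustration), is:

$$\mathrm{HLS} : D \rightarrow S, \qquad \vec{f} \mapsto (a, l)$$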
The synthesis tool receives as input the code describing a functionality to be implemented in hardware and a set of HLS directive values f⃗ ∈ D, and generates as output a pair of real numbers representing area and latency: (a, l) ∈ S ⊂ R×R. S and D are termed the Synthesis space and the Design space, respectively.
We aim at finding an approximate Pareto frontier P̄ of the best performing implementations, as close as possible to the one deriving from exhaustive search (P), while minimizing the number of synthesis runs. The Pareto frontier is defined as the set of non-dominated solutions in S:
In other words, p ∈ P if and only if no other solution in S has simultaneously less area and less latency than p. The initial sampling step defines the starting set of points belonging to P̄, which is then iteratively refined, resulting in the final output of our framework.
The Design space D contains the set of all legal combinations of directives, and it is composed of the union of all feature vectors f⃗. We name the set of admissible values for a directive its feature set F. A feature vector is then an element of the Cartesian product among all feature sets:
To obtain an equally distributed representation of the features in F, we represent them in a discretised form f⃗*, in which each element f_i has the same distance from its previous and following elements in F, with F ⊂ [0, 1]. For the synthesis space, instead, a normalized representation is adopted, which is derived by dividing each element in S by the maximum area and latency values among the explored configurations.
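Written out under the usual dominance convention (assumed here: a solution is dominated when another one is no worse in both objectives), and with A(·) and L(·) denoting the area and latency of a solution, the set can be expressed as:

$$P = \{\, p \in S \;\mid\; \not\exists\, s \in S,\ s \neq p :\ A(s) \le A(p) \,\wedge\, L(s) \le L(p) \,\}$$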
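As an illustration of this definition, the following minimal Python sketch filters a list of (area, latency) pairs down to its non-dominated subset. The function name and the sample values are hypothetical, and ties are treated as domination, which is an assumption rather than something stated in the text.

```python
def pareto_front(points):
    """Return the non-dominated (area, latency) pairs; both objectives are minimised."""
    front = []
    for p in points:
        # p is dominated if some other point is no worse in both objectives.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical synthesis results: (area, latency)
designs = [(10, 50), (12, 40), (15, 45), (20, 20), (25, 20)]
front = pareto_front(designs)
```

The quadratic scan is adequate for the few hundred points explored per design; a sort-based sweep would be preferable for much larger sets.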
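With n directives and feature sets F_1, ..., F_n (the indexing is an assumption introduced here for readability), this membership can be written as:

$$\vec{f} = (f_1, f_2, \ldots, f_n) \in F_1 \times F_2 \times \cdots \times F_n$$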
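The two representations can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the helper names and the example unroll factors are hypothetical, and the legal values of each directive are assumed to be given in increasing order.

```python
def discretise(feature_set):
    """Map the ordered legal values of one directive to equally spaced points in [0, 1]."""
    n = len(feature_set)
    if n == 1:
        return {feature_set[0]: 0.0}
    return {value: i / (n - 1) for i, value in enumerate(feature_set)}


def normalise(results):
    """Divide each (area, latency) pair by the maxima among the explored points."""
    max_area = max(a for a, _ in results)
    max_latency = max(l for _, l in results)
    return [(a / max_area, l / max_latency) for a, l in results]


unroll_factors = [1, 2, 4, 8, 16]          # hypothetical legal values of one directive
codes = discretise(unroll_factors)          # 1 -> 0.0, 2 -> 0.25, ..., 16 -> 1.0
scaled = normalise([(200, 1000), (400, 500), (800, 250)])
```

Equal spacing keeps a directive with power-of-two values from dominating the distance computations used later for clustering.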
Example. Figure 4 shows a simple HLS code snippet and three possible HLS directives that can be applied: function inlining, input bundling and loop unrolling. Inlining can either be performed or not, so the inline directive has two legal values. Similarly, the two arrays can either be bundled on the same port or each assigned to a different one.
Each data point in the DSE problem is completely characterized by its values in S and D. We call the concatenation of these two spaces the Clustering space Φ, and its elements ϕ⃗:
During the exploration, only a part of Φ is visited. We name Φ̄ the space containing only the solutions evaluated by our heuristic. Similarly, S and D denote the entire Synthesis and Design spaces, and S̄ and D̄ their subsets for which a synthesis has been performed during the algorithm execution.
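Using the component ordering (a, l, f_1, f_2, ...) adopted later for cluster centroids, and assuming n directives, the concatenation can be written as:

$$\Phi = S \times D, \qquad \vec{\phi} = (a, l, f_1, f_2, \ldots, f_n) \in \Phi$$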
The example in Figure 2 shows four different synthesized Figure 3 
A. Initial sampling
This step generates the initial set of configurations D̄, and derives the first approximation of the design space Pareto curve P̄. For an initial sampling size n, the D̄ space is composed of n unique feature vectors f⃗. The elements of each f⃗ originate from the probabilistic sampling of a symmetric Beta distribution, which is characterized by a density function, defined over the interval 0 ≤ x ≤ 1, as:
$$f(x) = \frac{x^{\alpha-1}(1-x)^{\alpha-1}}{B(\alpha)}$$
where B is the Beta function:
$$B(\alpha) = \int_0^1 t^{\alpha-1}(1-t)^{\alpha-1}\,dt = \frac{\Gamma(\alpha)^2}{\Gamma(2\alpha)}$$
As shown in Figure 5, this U-shaped distribution assigns, for a value of α lower than 1, a higher probability to the boundary values of x. By adopting it, the initial explored set F̄ will contain elements whose features have, with high probability, extreme values from the respective set. This is a desired property in our framework, so that Pareto solutions with in-between feature values will be explored during the refinement steps.
To derive the set of initial exploration points F̄, which are synthesized by the HLS tool, the probabilistic samples are converted to feature values in the F domain, approximating them to the nearest legal configuration. The retrieved area and latency numbers generate the first instance of S̄. Finally, S̄ and F̄ are concatenated to generate Φ̄.
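The sampling step can be sketched with the standard-library Beta generator. This is a minimal illustration, not the framework's implementation: the function name, the value α = 0.2, the seed and the example feature sets are all assumptions, and each sample is mapped to the nearest legal value by rounding its position among the equally spaced values.

```python
import random


def sample_initial_set(feature_sets, n, alpha=0.2, seed=0):
    """Draw n unique feature vectors, favouring extreme values of each directive."""
    rng = random.Random(seed)
    space_size = 1
    for values in feature_sets:
        space_size *= len(values)
    n = min(n, space_size)                      # cannot draw more unique vectors than exist
    chosen = set()
    while len(chosen) < n:
        vector = []
        for values in feature_sets:
            x = rng.betavariate(alpha, alpha)   # symmetric Beta, U-shaped when alpha < 1
            index = round(x * (len(values) - 1))  # nearest equally spaced legal value
            vector.append(values[index])
        chosen.add(tuple(vector))
    return sorted(chosen)


# Hypothetical directives: inline (0/1), bundle (0/1), unroll factor
feature_sets = [[0, 1], [0, 1], [1, 2, 4, 8, 16]]
initial = sample_initial_set(feature_sets, n=6)
```

Each returned vector would then be passed to the HLS tool to obtain its area and latency.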
B. Clustering
Once the initial set of Φ̄ is generated, its elements (ϕ) having common characteristics are aggregated into clusters. To this end, we relied on the Hierarchical Clustering algorithm [18], [19]. As a similarity metric, we considered the squared Euclidean distance among the points in the Φ̄ space. To ensure a good balance between intra- and inter-cluster exploration, multiple clusters should be present, while most of them should aggregate multiple points. We governed this trade-off via a clustering factor, which sets the number of clusters to a percentage of the number of explored design points. An exploration of different settings for this parameter is presented in Section VII.
The clustering operation partitions the Φ̄ space into multiple clusters C_i, and defines C̄ as:
Each cluster is characterised by its centroid c⃗, which is, for each (a, l, f_1, f_2, ...) component, the average value among the points belonging to the cluster. Moreover, clusters possess a boundary in the S space, corresponding to a four-element tuple containing the maximum and minimum values of area and latency of the cluster points.
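The clustering step, together with the centroid and boundary of each cluster, can be sketched in plain Python as follows. This is a simplified stand-in for a library routine: the linkage rule (merging the two clusters whose centroids are closest in squared Euclidean distance) is an assumption, as are the function names and the sample points.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points of the clustering space."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(cluster):
    """Component-wise average over (a, l, f1, f2, ...)."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def boundary(cluster):
    """(min area, max area, min latency, max latency) of the cluster points."""
    areas = [p[0] for p in cluster]
    latencies = [p[1] for p in cluster]
    return (min(areas), max(areas), min(latencies), max(latencies))


def hierarchical_clusters(points, clustering_factor=0.15):
    # Number of clusters as a fraction of the explored points (at least one).
    target = max(1, round(clustering_factor * len(points)))
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # Assumed linkage: merge the pair of clusters with the closest centroids.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                                 centroid(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical explored points: (area, latency, f1), already normalised
explored = [
    (0.10, 0.90, 0.0), (0.15, 0.85, 0.0),
    (0.90, 0.10, 1.0), (0.85, 0.20, 1.0),
    (0.50, 0.50, 0.5), (0.55, 0.45, 0.5),
]
clusters = hierarchical_clusters(explored, clustering_factor=0.5)
```

With six points and a clustering factor of 0.5, the sketch stops at three clusters, each pairing the two points that share the same feature value.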
C. Clusters selection
This step selects which clusters will be considered for the generation of new points in the Synthesis space.
To perform it, as shown in Figure 6a, we consider: the Pareto frontier of the explored design space P̄ (pictured as diamonds), the Pareto frontier of the centroids of the clusters P_C (pictured as a line among centroids) and the cluster boundaries (pictured as dashed rectangles). This data is employed to select candidate clusters corresponding to design space regions which contain promising solutions. Only the points belonging to these clusters are considered in the following intra- and inter-cluster exploration steps, while the rest are discarded.
Candidate clusters are selected according to three criteria:
1) if a cluster C_i contains at least one point belonging to P̄, then C_i is a candidate cluster.
2) if a cluster C_i belongs to the Pareto frontier of centroids P_C, then C_i is a candidate cluster.
3) for each c⃗ ∈ C̄ \ P_C, if A(c⃗) and L(c⃗) are inside the boundaries of an element of P_C, then C_i is a candidate cluster.
Example: Figure 6 exemplifies the application of the cluster selection criteria. In Figure 6b, A, B and C are all selected because each of them contains elements which belong to P̄. Figure 6c shows an example where the application of the second criterion leads to the selection of clusters A and B. Finally, Figure 6d shows an example where the third criterion is applied, since the centroid of cluster C is within the boundary of cluster B.
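The selection logic illustrated by Figure 6 can be sketched as follows. This is a minimal illustration under assumptions: clusters are reduced to lists of (area, latency) pairs, ties count as domination, the first test follows the Figure 6b description (a cluster holding a point of the explored Pareto frontier is kept), and all names and sample values are hypothetical.

```python
def dominates(q, p):
    """q dominates p if it differs from p and is no worse in area and latency."""
    return q != p and q[0] <= p[0] and q[1] <= p[1]


def pareto(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]


def centroid(cluster):
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)


def boundary(cluster):
    areas = [p[0] for p in cluster]
    latencies = [p[1] for p in cluster]
    return (min(areas), max(areas), min(latencies), max(latencies))


def select_candidates(clusters):
    """Return the indices of the clusters kept for further exploration."""
    global_front = set(pareto([p for c in clusters for p in c]))
    centroids = [centroid(c) for c in clusters]
    centroid_front = set(pareto(centroids))
    # Boundaries of the clusters whose centroid lies on the centroid frontier.
    front_boxes = [boundary(c) for c, ctr in zip(clusters, centroids)
                   if ctr in centroid_front]
    selected = []
    for idx, (cluster, ctr) in enumerate(zip(clusters, centroids)):
        has_pareto_point = any(p in global_front for p in cluster)
        on_centroid_front = ctr in centroid_front
        inside_front_box = any(
            lo_a <= ctr[0] <= hi_a and lo_l <= ctr[1] <= hi_l
            for lo_a, hi_a, lo_l, hi_l in front_boxes
        )
        if has_pareto_point or on_centroid_front or inside_front_box:
            selected.append(idx)
    return selected


# Hypothetical clusters of (area, latency) points
A = [(1, 9), (2, 8)]
B = [(3, 3), (9, 1), (5, 5)]
C = [(6, 4), (7, 4.5)]      # no Pareto point, but centroid inside B's boundary
D = [(8, 9), (9, 9.5)]      # far from the frontier
kept = select_candidates([A, B, C, D])
```

In this example C is retained only through the boundary test, mirroring the situation described for Figure 6d, while D is discarded.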
D. Intra-cluster exploration
Intra-cluster exploration identifies new solutions by examining candidate clusters individually. For each cluster, the algorithm considers the points belonging to its local Pareto frontier P_Ci. These points are pair-wise combined in the Φ space by performing a vector addition relative to the cluster centroid, generating new estimated solutions φ (Figure 7). Combinations which do not improve the global Pareto frontier are discarded without performing a synthesis run.
Estimated φ elements may have feature values which do not correspond to valid settings for the HLS directives (e.g., they are not integer numbers). Up to three valid configurations are therefore derived according to the following rules:
1) each component is cast to the closest feature value.
2) all components are rounded up to a valid feature value.
3) all components are rounded down to a valid feature value.
After this pass, estimated points which have already been synthesized are also discarded. The resulting set of feature vectors is then used to invoke the HLS tool and retrieve the corresponding (non-estimated) area and latency. Finally, the newly obtained ϕ are added to the Φ̄ space and an updated Pareto frontier is retrieved.
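The combination and casting steps can be sketched as follows. The interpretation of "vector addition relative to the centroid" used here, namely c + (ϕ_i - c) + (ϕ_j - c), is an assumption, as are the function names, the sample points and the clamping of out-of-range values to the nearest legal extreme.

```python
import bisect


def combine(phi_i, phi_j, centroid):
    """Assumed reading: add the two offsets from the centroid, i.e. phi_i + phi_j - c."""
    return tuple(a + b - c for a, b, c in zip(phi_i, phi_j, centroid))


def cast_feature(x, legal, mode):
    """Cast x to a value of the sorted list `legal` (nearest, up or down)."""
    if mode == "nearest":
        return min(legal, key=lambda v: abs(v - x))
    if mode == "up":
        i = bisect.bisect_left(legal, x)
        return legal[min(i, len(legal) - 1)]   # clamp above the largest value
    if mode == "down":
        i = bisect.bisect_right(legal, x) - 1
        return legal[max(i, 0)]                # clamp below the smallest value
    raise ValueError(mode)


def valid_configurations(features, legal_sets):
    """Return up to three distinct legal feature vectors for one estimate."""
    configs = []
    for mode in ("nearest", "up", "down"):
        cfg = tuple(cast_feature(x, legal, mode)
                    for x, legal in zip(features, legal_sets))
        if cfg not in configs:
            configs.append(cfg)
    return configs


# Hypothetical cluster data: points are (area, latency, f1, f2), already normalised
legal_sets = [[0.0, 1.0], [0.0, 0.25, 0.5, 0.75, 1.0]]
phi_i = (0.25, 0.75, 0.0, 0.25)
phi_j = (0.5, 0.5, 1.0, 0.75)
cluster_centroid = (0.375, 0.625, 0.25, 0.6875)

estimate = combine(phi_i, phi_j, cluster_centroid)
configs = valid_configurations(estimate[2:], legal_sets)
```

Only the feature part of the estimate is cast; the estimated area and latency serve to decide whether the combination is worth a synthesis run at all.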
Example: Figure 7 shows an example of intra-cluster exploration. The dark filled dots are the points of the cluster before 
E. Inter-cluster exploration
In order to discover unexplored regions of the design space, this stage combines design points belonging to different clusters. This step considers all pairings of the clusters in P_C, merging them and calculating their common centroid.
Vector addition, relative to the centroid, is applied to the Pareto points of the merged cluster. The resulting set of estimated φ vectors is cast to valid configurations as in the intra-cluster stage. Finally, configurations which are not yet part of Φ̄ are synthesized, and the Pareto frontier is updated.
Example: Figure 8 shows an example of an inter-cluster exploration. The Pareto points of cluster A, ϕ_1 and ϕ_2 (dark filled dots), are pairwise summed with the ones of cluster B, ϕ_3 and ϕ_4 (light filled dots). The results of their vector sum generate three estimated design points φ_1,3, φ_2,3 and φ_2,4 (dashed empty dots). The vector sum between ϕ_1 and ϕ_4 is estimated not to improve the Pareto frontier and is therefore discarded.
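A minimal sketch of the inter-cluster combination is given below. It assumes the same reading of centroid-relative vector addition as ϕ_i + ϕ_j - c, treats ties as domination, and uses hypothetical names and sample points of the form (area, latency, f1).

```python
from itertools import combinations


def dominates(q, p):
    """Compare on area and latency only (the first two components)."""
    return q[:2] != p[:2] and q[0] <= p[0] and q[1] <= p[1]


def pareto(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]


def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def inter_cluster_estimates(cluster_a, cluster_b):
    """Merge two clusters and combine the Pareto points of the merged cluster."""
    merged = cluster_a + cluster_b
    c = centroid(merged)
    front = pareto(merged)
    return [tuple(x + y - z for x, y, z in zip(p, q, c))
            for p, q in combinations(front, 2)]


# Two hypothetical single-point clusters at opposite ends of the design space
cluster_a = [(0.25, 1.0, 0.0)]
cluster_b = [(1.0, 0.25, 1.0)]
estimates = inter_cluster_estimates(cluster_a, cluster_b)
```

With one point per cluster, the single estimate coincides with the midpoint of the two points, which illustrates how this stage proposes solutions lying between two previously explored regions.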
VI. EXPERIMENTAL SETUP
The HLS exploration framework presented in the previous sections has been implemented in Matlab, which, as part of its Statistics and Machine Learning Toolbox [19] , provides an implementation of the Hierarchical-Clustering algorithm employed in the clustering stage. We used VivadoHLS from Xilinx [6] as a high-level synthesis tool, the Kintex7 FPGA as the target architecture, and a clock constraint of 10ns. HLS benchmarks are derived from the CHStone suite [1] . Table I reports them, as well as the directives employed in each case and their considered values, derived by manual inspection of the benchmarks. The table also indicates the resulting design space size.
To assess the performance of our methodology, we compare our result with the heuristic proposed by Liu et al. [16] , which we re-implemented. It is based on the Random Forest algorithm [17] , which refines an initial design space sampled with the Transductive Experimental Design (TED [20] ) method. Such combination has been shown in [16] to outperform other strategies based on different machine learning algorithms [9], [12], [14] and [15] . As a further baseline, we also considered our implementation and that of Liu et al. when a random initial sampling is adopted.
As a quality metric, we employed the Average Distance from Reference Set (ADRS), which measures the difference between two Pareto curves. For the case of two objective functions (in our case, area and latency) ADRS is defined as:
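In the form commonly used in the design space exploration literature (assumed here, with A(·) and L(·) denoting the area and latency of a point), the metric reads:

$$\mathrm{ADRS}(P, \bar{P}) = \frac{1}{|P|} \sum_{p \in P} \min_{\bar{p} \in \bar{P}} \delta(\bar{p}, p), \qquad \delta(\bar{p}, p) = \max\left\{ 0,\ \frac{A(\bar{p}) - A(p)}{A(p)},\ \frac{L(\bar{p}) - L(p)}{L(p)} \right\}$$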
A low value of ADRS indicates that P̄ approximates P well, while a high one indicates a low-quality approximation.
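A small worked example may help in reading ADRS values. The sketch below assumes the definition commonly used in the literature: for every point of the reference frontier, take the closest approximated point in terms of the worst relative excess in area or latency, then average over the reference points. Function name and sample frontiers are hypothetical.

```python
def adrs(reference, approximation):
    """Average Distance from Reference Set for (area, latency) frontiers."""
    total = 0.0
    for a_ref, l_ref in reference:
        # Distance of the reference point to its best-matching approximated point.
        total += min(
            max(0.0, (a - a_ref) / a_ref, (l - l_ref) / l_ref)
            for a, l in approximation
        )
    return total / len(reference)


# Hypothetical frontiers: the second reference point is missed by 10% in both objectives
reference = [(100, 40), (200, 20)]
approximation = [(100, 40), (220, 22)]
score = adrs(reference, approximation)
```

Here the first reference point is matched exactly and the second is missed by 10%, giving an ADRS of 0.05, i.e. an average 5% gap from the reference frontier.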
For all experiments, P is derived from an exhaustive search of all possible directive configurations. Such brute-force exploration required multiple days of computation on each of the considered benchmarks, highlighting the importance of exploration heuristics targeting HLS. Since both our methodology and that of Liu et al. have a probabilistic component governed by a seed value, for each experimental setting we ran the algorithms 100 times, averaging the results.
VII. EXPERIMENTAL RESULTS
Tuning. We have evaluated the effect of varying the parameters required by our framework: the clustering factor (governing the size and number of clusters generated at runtime) and the number of points evaluated in the initial sampling phase. Figure 9 shows the impact of adopting different clustering factors, for the considered benchmarks, on the achieved ADRS. The data reported in the figure considers a number of clusters equal to 5%, 10%, 15% and 20% of the number of explored design points. It highlights that this parameter has a small impact on the quality of the results, with a value of 15% leading, on average, to marginally better results.
The size of the initial sampled set plays instead a more important role, as shown in Figure 10. The figure reports the performance of our heuristic, in terms of mean ADRS achieved, for different initial sampling sizes (5%, 10%, 15% and 20% of the design space) for the Decode function from the gsm benchmark. It can be observed that the larger the initial sampling size, the better the approximation the algorithm finally converges to, i.e., the lower the final ADRS value. On the other hand, if the number of synthesis runs is limited a priori, then smaller initial sampling sizes can outperform larger ones. For example, for a budget of 100 synthesis runs, an initial sampling of 10% reaches a lower ADRS than an initial sampling of 15%. Indeed, if the number of synthesis runs is limited, there is a trade-off between how many runs are spent initially, and how many are then left for exploration. We show results for an initial sampling size of 10% and a clustering factor of 15%, but we further experimented with all initial sampling sizes and clustering factors, obtaining consistent results.
ADRS comparisons. We now perform a comparative evaluation of our methodology with respect to the one proposed by Liu et al. [16]. The comparison is illustrated in Figures 11 through 15. These figures report the ADRS achieved by our framework (Clust-Beta), with a maximum synthesis budget equal to 40% of the total design space and an initial sampling budget equal to 10%. Results are compared with five other combinations of initial sampling and refinement exploration strategies: the intra- and inter-cluster exploration combined with random or TED initial sampling, and Random Forest (RF) exploration of a Beta, random or TED initial sampling. Across all benchmarks, Clust-Beta consistently outperforms alternative methodologies, for both low and high numbers of synthesis runs, with the only exception of Autocorr for a high synthesis budget. We believe that the competitive advantage of Clust-Beta lies in our design space decomposition, together with the intra- and inter-cluster exploration and the use of a Beta distribution for the initial sampling. The combination of these factors focuses the exploration only on the most promising regions of the design space. We further experimented with all initial sampling sizes reported in Figure 10, sweeping the synthesis budget until both our method and the best performing alternative converge or reach 40% of the design space size. Results are consistent with the ones shown in Figures 11-15: Clust-Beta outperforms the other considered methodologies in most cases (87%), resulting in smaller ADRS.
Run-time comparisons. Lastly, we have evaluated the algorithm run-time (without considering the time required for the synthesis) of Clust-Beta with respect to RF-TED. For this comparison we have considered the execution time of our methodology and of RF-TED, running both algorithms until no new synthesizable configurations are generated, or 40% of the design space is explored. The run-time is then divided, in both cases, by the number of synthesis runs effectively performed, to obtain a measure of the time spent in the exploration engine itself. Results are reported in Table II. Note that, besides reaching a better approximation of the Pareto curves, our methodology is also faster than the one proposed by Liu et al. [16]. The highest speedup is achieved in the Reflection case, which is the largest among the considered applications, hinting at a better scalability of our methodology.
VIII. CONCLUSIONS
The High Level Synthesis exploration methodology illustrated in this paper discovers high-quality implementations of a hardware functionality, while only requiring the synthesis of a small percentage of the design space. The paper addresses a major challenge posed by HLS tools, which usually offer a rich collection of configuration options, but provide little guidance on how these options in turn affect the area and latency of a resulting design.
Our methodology can effectively discover favourable combinations of synthesis directives, by clustering solutions which have a high degree of similarity. By then combining only the most effective directives characterising each cluster, the methodology is able to focus the search towards globally Pareto-optimal solutions.
The proposed methodology is agnostic to the low-level characteristics of the explored design and to the employed HLS tool. In fact, it does not require an analysis of the application data/control dependencies (as in [11], [7]), nor a model of the considered directives set (as in [10]). It advances the state of the art in black-box HLS exploration frameworks [16] by providing a better Pareto approximation of the best-performing implementations of digital designs, while requiring fewer synthesis runs.
