Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several performance-cost trade-off implementations for each component of a complex hardware accelerator. However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators that coordinates both HLS and memory optimization tools in a compositional way. First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator. Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search, but it reduces the number of invocations to the HLS tool by up to 14.6×.
INTRODUCTION
High-performance systems-on-chip (SoCs) are increasingly based on heterogeneous architectures that combine general-purpose processor cores and specialized hardware accelerators [4, 8, 22]. Accelerators are hardware devices designed to perform specific functions. They have become popular because they guarantee considerable gains in both performance and energy efficiency with respect to the corresponding software executions [9-11, 20, 23, 29, 41, 48].
However, the integration of several specialized hardware blocks into a complex accelerator is a difficult design and verification task. In response to this challenge, we advocate the application of two key principles. First, to cope with the increasing complexity of SoCs and accelerators, most of the design effort should move away from the familiar register-transfer level (RTL) by embracing system-level design (SLD) [18, 42] with high-level synthesis (HLS) [32, 39]. [...] and automate the DSE process [36, 37]. Several studies, however, highlight the importance of private memories to sustain the parallel datapath of accelerators: on a typical accelerator design, memory takes from 40% to 90% of the area [16, 30]; hence, its optimization cannot be an independent task. Second, HLS tools are based on heuristics, whose behavior is not robust and often hard to predict [24]. Small changes to the knobs, e.g., changing the number of iterations unrolled in a loop, can cause significant and unexpected modifications at the implementation level. This increases the DSE effort because small changes to the knobs can take the exploration far from Pareto optimality.
Contributions
To address these limitations, we present COSMOS: an automatic methodology for the DSE of complex hardware accelerators, which are composed of several components. COSMOS is based on a compositional approach that coordinates both HLS tools and memory generators. First, thanks to the datapath and memory co-design, COSMOS produces a large set of Pareto-optimal implementations for each component, thus increasing both performance and cost spans. These spans are defined as the ratios between the maximum value and the minimum value for performance and cost, respectively.
Second, COSMOS leverages compositional design techniques to significantly reduce the number of invocations to the HLS tool and the memory generator. In this way, COSMOS focuses on the most critical components of the accelerator and quickly converges to the desired trade-off point between cost and performance for the entire accelerator. The COSMOS methodology consists of two main steps (Figure 1). First, COSMOS uses an algorithm to characterize each component of the accelerator individually by efficiently coordinating multiple runs of the HLS and memory generator tools. This algorithm finds the regions in the design space of the components that include the Pareto-optimal implementations (Component Characterization in Figure 1). Second, COSMOS performs a DSE to identify the Pareto-optimal solutions for the entire accelerator by efficiently solving a linear programming (LP) problem instance (Design-Space Exploration).
We evaluate the effectiveness and efficiency of the COSMOS methodology on a complex accelerator for wide-area motion imagery (WAMI) [3, 38], which consists of approximately 7000 lines of SystemC code. While exploring the design space of WAMI, COSMOS returns an average performance span of 4.1× and an average area span of 2.6×, as opposed to 1.7× and 1.2× when memory optimization is not considered and only standard dual-port memories are used.
Further, COSMOS achieves the target data-processing throughput for the WAMI accelerator while reducing the number of invocations to the HLS tool per component by up to 14.6× with respect to an exhaustive exploration approach.
Organization
The paper is organized as follows. Section 2 provides the necessary background for the rest of the paper. Section 3 describes a few examples to show the effort required in the DSE process. Section 4 gives an overview of the COSMOS methodology, which is then detailed in Sections 5 (Component Characterization) and 6 (Design-Space Exploration). Section 7 presents the experimental results. Section 8 discusses the related work. Finally, Section 9 concludes the paper.
PRELIMINARIES
This section provides the necessary background concepts. We first describe the main characteristics of the accelerators targeted by COSMOS in Section 2.1. Then, we present the computational model we adopt for the DSE in Section 2.2.
Hardware Accelerators
Several accelerator designs have been proposed in the literature to realize hardware implementations that execute important computational kernels more efficiently than corresponding software executions [9, 10, 23, 29, 41, 48]. The accelerators can be located either inside (tightly-coupled) or outside (loosely-coupled) the processing cores [16]. The former class of accelerators is more suitable for fine-grain computations on small data sets, while the latter is better for coarse-grain computations on large data sets. We focus on loosely-coupled accelerators in this paper because the complexity of their design requires a compositional approach. WAMI is representative of a set of classes of applications that can benefit from the adoption of the loosely-coupled accelerator model and a compositional design approach.
Architecture. We design our accelerators in SystemC. Figure 2 illustrates their typical architecture. They are made of multiple components that are designed individually to cope with the current limitations of HLS tools in optimizing complex components. Partitioning the accelerators into multiple components allows HLS tools to handle them separately, thus reducing the synthesis time and improving the quality of results. Each component is specified as a separate SystemC module and represents a computational block within the accelerator. The components communicate by exchanging data through an on-chip interconnect network that implements transaction-level modeling (TLM) [19] channels. These channels synchronize the components by absorbing the potential differences in their computational latencies with a latency-insensitive communication protocol [7]. This ensures that the components of an accelerator can always be replaced with different Pareto-optimal implementations without affecting the correctness of the accelerator implementation. COSMOS employs channels with a fixed bitwidth (256 bits) and does not explore different design alternatives to implement the communication among the components. It can be extended, however, to support this type of DSE by using, for example, the XKnobs [35] or buffer-restructuring techniques [13].
Each component includes a datapath, which is organized in a set of loops, to read and store input and output data and to compute the required functionality. There are also private local memories (PLMs), or scratchpads, where data resides during the computation. PLMs are multi-bank memory architectures that provide multiple read and write ports to allow accelerators to perform parallel accesses. We generate optimized memories for our accelerators by using the Mnemosyne memory generator [37]. Several analyses highlight the importance of the PLMs in sustaining the parallel datapath of accelerators [16, 30].
PLMs play a key role in the performance of accelerators [25], and they occupy from 40% to 90% of the entire area of the components of a given accelerator [30].
Execution. Figure 3 reports an example of the execution of an accelerator made of multiple components. The execution of each component of the accelerator is divided into three phases (shown at the top of the figure for Component #1). In the load phase the components communicate with the on-chip interconnect network to read the input data and store it in the PLMs. In the compute phase the components execute the given functions on the data currently stored in the PLMs. In the store phase the components communicate with the on-chip interconnect network to store the output data available in the PLMs. These three phases can be pipelined by using techniques such as ping-pong or circular buffers [16], as shown at the top of the figure. After having identified the minimum block of data that is sufficient to realize the required function in each component, e.g., a frame, the execution of the components can be: (i) completely overlapped when there are no dependencies (e.g., Component #1 and #K), or (ii) serialized when a component needs input data from another component to start its computation (e.g., Component #1 and #2).
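To make the benefit of overlapping the three phases concrete, the following minimal sketch compares a fully serialized execution with an idealized pipelined one. It assumes fixed per-block phase latencies and a perfect overlap in steady state, as ping-pong buffering would ideally allow; the function names and the numbers are hypothetical and are not taken from the WAMI accelerator.

```python
def serialized_latency(load, compute, store, blocks):
    """Total time when load, compute, and store never overlap."""
    return blocks * (load + compute + store)


def pipelined_latency(load, compute, store, blocks):
    """Idealized total time when the three phases overlap across blocks.

    Assumption: in steady state one block completes every
    max(load, compute, store) time units; the first block still pays
    the full load + compute + store latency.
    """
    if blocks == 0:
        return 0
    bottleneck = max(load, compute, store)
    return (load + compute + store) + (blocks - 1) * bottleneck


# Hypothetical phase latencies (in clock cycles) for one block of data.
seq = serialized_latency(load=100, compute=300, store=100, blocks=4)
pipe = pipelined_latency(load=100, compute=300, store=100, blocks=4)
print(seq, pipe)
```

Under these assumptions the slowest phase bounds the steady-state rate, which is why shortening the compute phase (for example with more PLM ports) only helps until another phase becomes the bottleneck.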
Computational Model
To formally model the loosely-coupled accelerators we use timed marked graphs (TMGs), a subclass of Petri nets (PNs) [34]. TMGs are commonly used to perform compositional performance analysis of discrete-event systems [6].
While TMGs cannot capture data-dependent behaviors, they are a practical model to analyze stream processing accelerators for many classes of applications, e.g., image and signal processing applications. A PN is a bipartite graph defined as a tuple (P, T, F, w, M_0), where P is a set of m places, T is a set of n transitions, F ⊆ (P × T) ∪ (T × P) is a set of arcs, w : F → N^+ is an arc weighting function, and M_0 ∈ N^m is the initial marking, i.e., the number of tokens at each p ∈ P. A PN is strongly connected if for every pair of places p_i and p_j there exists a sequence of transitions and places such that p_i and p_j are mutually reachable in the net. A PN can be organized in a set of strongly-connected components, i.e., the maximal sets of places that are strongly connected. A TMG is a PN such that (i) each place has exactly one input and one output transition, and (ii) w : F → 1, i.e., every arc has a weight equal to 1. To measure performance, TMGs are extended with a transition firing-delay vector τ ∈ R^n, which represents the duration of each particular firing.
The minimum cycle time of a strongly-connected TMG is defined as max{D_k / N_k | k ∈ K}, where K is the set of cycles of the TMG, D_k is the sum of the transition firing delays in cycle k, and N_k is the number of tokens in cycle k [40]. In this paper, we use the TMG model to formally describe the accelerators. We use the term system to indicate a complex accelerator that is made of multiple components. Each component of the system is represented with a transition in the TMG whose firing delay is equal to its effective latency. The effective latency λ of a component is defined as the product of its clock cycle count and target clock period. The maximum sustainable effective throughput θ of the system is then the reciprocal of the minimum cycle time of its TMG, if the TMG is strongly connected. Otherwise, it is the minimum θ among its strongly-connected components. We use λ and θ as performance figures for the single components and the system, respectively. We use the area α as the cost metric for both the components and the system.
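The cycle-time definition above can be illustrated with a short sketch. It is a brute-force illustration for small graphs, not the analysis used by COSMOS: it enumerates the simple cycles of the TMG and takes the maximum ratio D_k / N_k. The encoding is an assumption made for this example (at most one place per ordered pair of transitions, stored as a dictionary), and the two-component system is hypothetical.

```python
def simple_cycles(successors):
    """Enumerate the simple cycles of a small directed graph.

    Each cycle is reported once, starting from its smallest transition.
    """
    cycles = []

    def visit(start, node, path):
        for nxt in successors[node]:
            if nxt == start:
                cycles.append(list(path))
            elif nxt > start and nxt not in path:
                visit(start, nxt, path + [nxt])

    for start in sorted(successors):
        visit(start, start, [start])
    return cycles


def min_cycle_time(delays, places):
    """Return max over cycles k of D_k / N_k.

    delays: {transition: firing delay}
    places: {(source transition, destination transition): initial tokens}
    """
    successors = {t: [] for t in delays}
    for src, dst in places:
        successors[src].append(dst)
    worst = 0.0
    for cycle in simple_cycles(successors):
        d_k = sum(delays[t] for t in cycle)
        n_k = sum(places[(cycle[i], cycle[(i + 1) % len(cycle)])]
                  for i in range(len(cycle)))
        if n_k == 0:
            raise ValueError("cycle without tokens: the TMG would deadlock")
        worst = max(worst, d_k / n_k)
    return worst


# Hypothetical system: t1 feeds t2 through a double buffer (2 tokens on the
# feedback place); each component processes one block at a time (self-loop).
delays = {"t1": 2.0, "t2": 3.0}          # effective latencies in ms
places = {("t1", "t2"): 0, ("t2", "t1"): 2,
          ("t1", "t1"): 1, ("t2", "t2"): 1}

cycle_time = min_cycle_time(delays, places)
theta = 1.0 / cycle_time                  # effective throughput in 1/ms
print(cycle_time, theta)
```

In this example the slowest component (t2) determines the throughput, so spending area to speed up t1 would not improve θ. Enumerating cycles grows exponentially with the graph size, so this sketch is only meant for small examples.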
MOTIVATIONAL EXAMPLES
Performing a DSE that is accurate and as exhaustive as possible for a complex hardware accelerator is a difficult task for three main reasons: (i) HLS tools do not always support PLM generation and optimization (Section 3.1), (ii) HLS tools are based on heuristics that make it difficult to configure the knobs (Section 3.2), and (iii) HLS tools do not handle the simultaneous optimization of multiple components (Section 3.3). Next, we detail these issues with some examples.
Memories
The joint optimization of the accelerator datapath and PLM architecture is critical for an effective DSE. Figure 4 depicts the design space of G, a component we designed for WAMI. The graph reports different design points, each characterized in terms of area (mm²) and effective latency (milliseconds), synthesized for an industrial 32nm ASIC technology library. The points with the same color (shape) are obtained by partially unrolling the loops for different numbers of iterations. The different colors (shapes) indicate different numbers of ports for the PLM. By increasing the number of ports, we notice a significant impact on both latency and area. In fact, multiple ports allow the component to read and write more data in the same clock cycle, thus increasing the hardware parallelism. Multi-port memories, however, require much more area since more banks may be used depending on the given memory-access pattern. Note that ignoring the role of the PLM considerably limits the design space. By changing the number of ports of the PLM, we obtain a latency span of 7.9× and an area span of 3.7×. By using standard dual-port memories, we have only a latency
HLS Unpredictability
Dealing with the unpredictability of the HLS tool outcomes is necessary to remain in the Pareto-optimal regions of the design space [24]. This is highlighted by the magnified graph in Figure 4 that reports the number of iterations unrolled for each design point of G. By increasing the number of iterations unrolled in a loop for a particular configuration of the PLM ports we expect to obtain design points that have more area and less latency. In fact, unrolling a loop increases the number of hardware resources to allow more parallel operations. However, an effective parallelization is not always guaranteed. Some combinations of loop unrolling have a negative effect on both latency and area due to the application of HLS heuristics (e.g., points 7u, 8u and 9u in Figure 4). In fact, HLS tools need to insert additional clock cycles in the body of a loop when (i) operation dependencies are present or (ii) the area is growing too much with respect to the scheduling metrics they adopt (HLS tools often perform latency-constrained optimizations to minimize the area). This motivates the need to deal with HLS unpredictability in the DSE process. COSMOS applies synthesis constraints to account for the high variability and partial unpredictability of the HLS tools.
Compositionality
Complex accelerators need to be partitioned into multiple components to be efficiently synthesized by current HLS tools. This reduces the synthesis time and improves the quality of results, but significantly increases the DSE effort. Figure 5 reports a simple example to illustrate this problem. On the top, the figure reports two graphs representing a small subset of Pareto-optimal points for G and G, two components of WAMI. Assuming that they are executed sequentially in a loop, their aggregate throughput is the reciprocal of the sum of their latencies. On the bottom, the figure reports all the possible combinations of the design points of the two components, differentiating the Pareto-optimal combinations from the Pareto-dominated combinations. These design points are characterized in terms of area (mm²) and effective throughput (1/milliseconds). In order to find the Pareto-optimal combinations at the system level, an exhaustive search method would apply the following steps: (i) synthesize different points for each component by varying the settings of the knobs, (ii) find the Pareto-optimal points for each component, and (iii) find the Pareto-optimal combinations of the components at the system level. This approach is impractical for complex accelerators. First, step (i) requires trying all the combinations of the knob settings (e.g., different numbers of ports and unrolls). Second, step (iii) requires evaluating an exponential number of combinations at the system level to find those that are Pareto-optimal. In fact, if we have n components with k Pareto-optimal points each, then the number of combinations to check is O(k^n). This example motivates the need for a smart compositional method that identifies the most critical components of an accelerator and minimizes the invocations to the HLS tool. In order to do that, COSMOS reduces the number of combinations of knob settings that are used for synthesis and prioritizes the synthesis of the components depending on their level of contribution to the effective throughput of the entire accelerator.

Fig. 5. Example of composition for G and G, two components of WAMI. The graphs on the top report some Pareto-optimal points for the two components. The graph on the bottom shows all the possible combinations of these components, assuming they are executed sequentially in a loop. In the graph of the composition, the effective throughput is used as the performance metric.
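Step (iii) of the exhaustive search can be sketched in a few lines. The example below enumerates every combination of per-component design points and filters out the Pareto-dominated ones. It assumes, as in the example discussed above, that the components run sequentially in a loop, so the areas add up and the throughput is the reciprocal of the latency sum. The design points and the function names are hypothetical.

```python
from itertools import product


def pareto_front(points):
    """Keep the points that are not dominated in (area, throughput).

    A point dominates another if it has no more area and no less
    throughput, and is strictly better in at least one of the two.
    """
    front = []
    for p in points:
        dominated = any(
            q["area"] <= p["area"] and q["theta"] >= p["theta"]
            and (q["area"] < p["area"] or q["theta"] > p["theta"])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def compose_sequential(components):
    """Build all combinations that pick one design point per component."""
    combos = []
    for choice in product(*components):
        area = sum(a for a, _ in choice)
        latency = sum(lat for _, lat in choice)
        combos.append({"choice": choice, "area": area, "theta": 1.0 / latency})
    return combos


# Hypothetical Pareto-optimal (area in mm^2, latency in ms) points.
comp_a = [(1.0, 8.0), (2.0, 4.0), (4.0, 2.0)]
comp_b = [(1.0, 6.0), (3.0, 3.0)]

combos = compose_sequential([comp_a, comp_b])   # k^n = 3 * 2 combinations
front = pareto_front(combos)
print(len(combos), sorted(p["area"] for p in front))
```

Even in this toy case two of the six combinations are Pareto-dominated although every input point is Pareto-optimal for its own component. With n components the list built by `compose_sequential` grows as O(k^n), which is the cost a compositional method tries to avoid.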
THE COSMOS METHODOLOGY
As shown in Figure 1 , COSMOS consists of the following steps:
(1) Component Characterization (Section 5): in this step COSMOS analyzes each component of the system individually; for each component it identifies the boundaries of the regions that include the Pareto-optimal designs; starting from the HLS-ready implementation of each component (in SystemC), COSMOS applies an algorithm that generates knob and memory configurations to automatically coordinate the HLS and memory generator tools; the algorithm takes into account the memories of the accelerators and tries to deal with the unpredictability of HLS tools;
(2) Design-Space Exploration (Section 6): in this step COSMOS analyzes the design space of the entire system; the system is modeled with a TMG to find the most critical components for the system throughput; then, COSMOS:
• formulates an LP problem instance to identify the latency requirements of each component that ensure the specified system throughput and minimize the system cost; this step is called Synthesis Planning (Section 6.1);
• maps the solutions of the LP problem to the knob-setting space of each component and runs additional synthesis to get the RTL implementations of the components; this step is called Synthesis Mapping (Section 6.2).

Example 1. Figure 6 shows an example of using the λ-constraint. The loop (reported on the left) contains two read operations to two distinct arrays, i.e., γ_r = 1, and one write operation, i.e., γ_w = 1. We assume that all the operations that are neither read nor write operations can be performed in one clock cycle, i.e., η = 1. The two diagrams (on the right) show the results of the scheduling by using two ports for the PLM and by unrolling the loop two or three times, respectively. In the first case (unrolls = 2), the HLS tool can schedule all the operations in a maximum of h_2(2) = 3 clock cycles. Thus, this point would be chosen by Algorithm 1 to be used as the upper-left extreme. In the second case (unrolls = 3), the HLS tool is not able to complete the schedule within h_2(3) = 4 clock cycles (it needs at least 5 clock cycles). Thus, this point is discarded.
Note that the λ-constraint is not guaranteed to obtain a Pareto-optimal point due to the intrinsic variability of the HLS results. Still, this point can serve as an upper bound of the region in the design space. Note also that the λ-constraint cannot be applied to loops that (i) require data from sub-components through blocking interfaces or (ii) do not present memory accesses to the PLM. In these cases, in fact, it is necessary to extend the definition of the estimation function given in Equation (1) to handle such situations. Alternatively, COSMOS can optionally run some synthesis in the neighbourhood of the maximum number of unrolls and use a local Pareto-optimal point as the upper-left extreme.
Manuscript submitted to ACM
Memory Generation
After the two extreme points of a region have been determined, the algorithm instructs the memory generator to create the PLM architecture (line 9). COSMOS uses Mnemosyne [37] to generate optimized PLMs for the components.
Mnemosyne has been integrated with the commercial HLS tool we use for the experimental results (Section 7). The CDFG, created by the HLS tool, is analyzed to find the arrays specified in the code and their access patterns. Then, a memory is generated according to these specifications and the area required for the PLM is added to the logic area reported by the HLS tool (line 10). The memory architecture is tailored to the component needs and is optimized with respect to the required number of ports and access patterns. In particular, given a certain number of ports, Mnemosyne combines several SRAMs, or BRAMs in the case of FPGA devices, into a multi-bank architecture (Figure 2). Each SRAM (BRAM) provides 2 read/write ports; thus, by combining them in a multi-bank architecture, Mnemosyne allows the component to perform multiple accesses in parallel [2].
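The idea of serving several parallel accesses from dual-port banks can be illustrated with a small sketch. The code below uses cyclic interleaving, a common partitioning scheme chosen here only for illustration; it is an assumption and not necessarily the strategy implemented by the memory generator. All function names are hypothetical.

```python
import math

PORTS_PER_BANK = 2  # each SRAM/BRAM offers 2 read/write ports


def banks_needed(ports):
    """Minimum number of dual-port banks that together expose `ports` ports."""
    return math.ceil(ports / PORTS_PER_BANK)


def bank_of(address, num_banks):
    """Cyclic interleaving: return (bank index, offset inside the bank)."""
    return address % num_banks, address // num_banks


def conflict_free(addresses, num_banks):
    """True if no bank receives more accesses than it has ports in one cycle."""
    load = [0] * num_banks
    for address in addresses:
        bank, _ = bank_of(address, num_banks)
        load[bank] += 1
    return all(count <= PORTS_PER_BANK for count in load)


num_banks = banks_needed(ports=4)                     # 2 banks
unit_stride = conflict_free([0, 1, 2, 3], num_banks)  # 2 accesses per bank
stride_two = conflict_free([0, 2, 4, 6], num_banks)   # all 4 hit bank 0
print(num_banks, unit_stride, stride_two)
```

The second access pattern shows why the number of banks depends on the access pattern and not only on the number of ports: with a stride of two, the same four accesses would need a different partitioning, or more banks, to proceed in one cycle.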
DESIGN-SPACE EXPLORATION
After the characterization of the single components of a given accelerator, COSMOS uses an LP formulation to find the Pareto-optimal design points at the system level. The DSE problem at the system level can be formulated as follows:
Problem 1. Given a TMG model of the system where each component has been characterized, an HLS tool, and a target granularity δ > 0, find a Pareto curve α versus θ of the system, such that:
(i) given two consecutive points d, d′ on the Pareto curve, they have to satisfy: [...]; this ensures a maximum distance between two design points on the curve;
(ii) the HLS tool must be invoked as few times as possible.
This formulation is borrowed from [28], where the authors propose a solution that requires the manual effort of the designers to characterize the components. In contrast, COSMOS solves this problem by leveraging the automatic characterization method in Section 5 and by dividing it into two steps: Synthesis Planning and Synthesis Mapping.
Synthesis Planning
Given a strongly-connected system TMG, COSMOS uses the following θ-constrained cost-minimization LP formulation: [...]
where the function f_i returns the implementation cost (α) of the i-th component given the firing delay τ_i of transition t_i, σ ∈ R^n is the transition-firing initiation-time vector, M_0 ∈ N^m is the initial marking, τ^- ∈ R^m is the input-transition firing-delay vector, i.e., τ^-_i is the firing delay of the transition t_k entering place p_i (note that τ^-_min and τ^-_max correspond to the extremes λ_min and λ_max of the components), and A is the m × n incidence matrix defined as: [...]
The objective function minimizes the implementation costs of the components, while satisfying the system throughput requirements. Given the component extreme latencies λ_min and λ_max, it is possible to determine the values of θ_min and θ_max by labeling the transitions of the TMG of the system with such latencies. By iterating from θ_min to θ_max with a ratio of (1 + δ), we can then find the optimal values of λ for the components that solve Problem 1. This formulation guarantees that the components that are not critical for the system throughput are selected to minimize their cost. The cost functions f_i in Equation (2) are unknown a priori, but they can be approximated with convex piecewise-linear functions. This LP formulation can be solved in polynomial time [5], and it can be extended to the case of non-strongly-connected TMGs.
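For reference, a formulation that is consistent with the symbols defined above can be sketched as follows. It is a reconstruction under an assumption, namely the standard cycle-time constraint for TMGs in which the cycle time equals 1/θ; the exact form and the sign convention of A used for Equation (2) may differ.

```latex
\begin{aligned}
\min \quad & \sum_{i=1}^{n} f_i(\tau_i) \\
\text{s.t.} \quad & A\sigma + \frac{M_0}{\theta} \ge \tau^{-}, \\
& \tau^{-}_{\min} \le \tau^{-} \le \tau^{-}_{\max},
\end{aligned}
\qquad
A[i,j] =
\begin{cases}
+1 & \text{if } t_j \text{ is the output transition of } p_i, \\
-1 & \text{if } t_j \text{ is the input transition of } p_i, \\
0 & \text{otherwise.}
\end{cases}
```

Under this assumed convention, row i of the first constraint reads σ_out − σ_in + M_0(p_i)/θ ≥ τ^-_i: the transition leaving place p_i cannot start before the transition entering p_i has completed its firing, shifted by one cycle time for each token initially stored in p_i.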
Synthesis Mapping
Given the optimal values of λ of each component that solve Problem 1, it is necessary to determine the knob settings that [...] included in any region, COSMOS uses the slowest point of the next region that has a larger number of ports. This does not require a synthesis run (because that point has been synthesized during the characterization), and it is a conservative solution because, as in the case of failure of the λ-constraint, we are willing to trade area to preserve the throughput.
EXPERIMENTAL RESULTS
We implement the COSMOS methodology with a set of tools and scripts to automate the DSE. Specifically, COSMOS includes: (i) Mnemosyne [37] to generate multi-bank memory architectures as described in Section 5, (ii) a tool to extract the information required by Mnemosyne from the database of the HLS tool, (iii) a script to run the synthesis and the memory generator according to Algorithm 1, (iv) a program that creates and solves the LP model by using the GLPK library (Section 6.1), and (v) a tool that maps the LP solutions to the HLS knobs and runs the synthesis (Section 6.2).
We evaluate the effectiveness and efficiency of COSMOS by considering the WAMI application [38] as a case study.
The original specification of the WAMI application is available in C in the PERFECT Benchmark Suite [3]. Starting from this specification, we design a SystemC accelerator to be synthesized with a commercial HLS tool, i.e., Cadence C-to-Silicon. We use an industrial 32nm ASIC technology as the target library. We choose the WAMI application as our case study due to (i) the different types of computational blocks it includes and (ii) its complexity. The heterogeneity of its computational blocks allows us to develop different components for each block and show the broad applicability of COSMOS. The C specification is roughly 1000 lines of code. The specification of our accelerator design is roughly 7000 lines of SystemC code.
Computational Model
We model the WAMI application as a loosely-coupled accelerator. Figure 8 [...] executed in software to preserve the floating-point precision. Therefore, it is modeled with a fixed effective latency during the DSE process.
Component Characterization
COSMOS applies Algorithm 1 (Section 5) to characterize the components of the system. Table 1 reports the results of the characterization for the WAMI accelerator: the algorithm used by COSMOS (COSMOS) is compared with the case in which memory is not considered in the characterization (No Memory). In the latter case, we assume that only standard dual-port memories are available. For each component, the table reports the latency span (λ_span), i.e., the ratio between the maximum latency and the minimum latency, and the area span (α_span), i.e., the ratio between the maximum area and the minimum area. For COSMOS, the table also shows the total number of regions identified by the algorithm (re ). For Algorithm 1 we use a number of ports in the interval [1, 16] and a maximum number of unrolls in the interval [8, 32], depending on the components. COSMOS guarantees overall a richer DSE, as evidenced by the average results. For some components the algorithm extracts only one region because multiple ports can incur additional area for no latency gains. This happens when (i) the algorithm cannot exploit multiple accesses in memory, or (ii) the data is cached into local registers which can be accessed in parallel in the same clock cycle, e.g., for C D. On the other hand, in most cases COSMOS provides significant gains in terms of area and latency spans compared to a DSE that does not consider the memories.
[...] points of the regions, the graphs also show the intermediate points that could be selected by the mapping function. The small graphs on the right magnify the corresponding regions reported on the left. As in the examples discussed in Section 3, increasing the number of ports has a significant impact on the DSE, while loop unrolling has a local effect within each region. Another aspect that is common among many components is that the regions become smaller as we keep increasing the number of ports. For example, for G in Figure 9(c), we note that by increasing the number of ports, we reach a point where the gain in latency is not significant. This effect, called diminishing returns [1], is the same effect that can be observed in the parallelization of software algorithms. In some cases, changing the ports increases only the area with no latency gains, as discussed in the previous paragraph. This is highlighted in Figure 9(d), where for C D we report two additional regions with respect to those specified in Table 1.
The diminishing-return effect can also be observed by increasing the number of unrolls inside a region, e.g., Figure 9(b). This is why COSMOS exploits Amdahl's Law (Section 6.2). On the other hand, we notice some discontinuities of the Pareto-optimal points within some regions, e.g., the region in the bottom-right corner of Figure 9(a). Even by applying the λ-constraints (Section 5) it is not possible to completely discard the Pareto-dominated implementations. In fact, by further restricting the imposed constraints, i.e., by reducing the number of states that the HLS tool can insert in each loop, we observe that the Pareto-optimal implementations are also discarded. Thus, it is not always possible to obtain a curve composed only of Pareto-optimal points within a certain region. Finally, the Pareto-optimal points outside the regions are not discarded by COSMOS. They can be chosen when it is necessary to perform the mapping (Section 6.2).
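The span metrics used in this characterization are simple ratios, as the following sketch shows. The design points are hypothetical and are not taken from Table 1; the comparison with a fixed dual-port memory only mimics the idea of the No Memory case under that assumption.

```python
def span(values):
    """Ratio between the maximum and the minimum of a metric."""
    return max(values) / min(values)


# Hypothetical characterization of one component:
# (ports, unrolls, area in mm^2, effective latency in ms).
points = [
    (2, 1, 0.10, 8.0),
    (2, 4, 0.12, 6.0),
    (4, 4, 0.16, 4.0),
    (8, 8, 0.25, 2.0),
]

latency_span = span([lat for _, _, _, lat in points])
area_span = span([area for _, _, area, _ in points])

# Keeping only the dual-port points mimics a DSE that ignores the PLM.
dual_port = [p for p in points if p[0] == 2]
latency_span_dual = span([p[3] for p in dual_port])
area_span_dual = span([p[2] for p in dual_port])
print(latency_span, area_span, latency_span_dual, area_span_dual)
```

In this toy data set, varying only the unrolls with a fixed dual-port memory covers a much narrower range of latencies and areas than varying the ports as well, which is the qualitative trend the comparison between COSMOS and No Memory is meant to capture.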
Design-Space Exploration
After the characterization of the single components, COSMOS applies the DSE approach explained in Section 6. It rst nds the optimal solutions at the system level by using Equation ( 2) (Section 6.1). It then applies the mapping function to determine the knob settings of the single components and runs the necessary synthesis (Section 6.2). Figure 10 shows the resulting Pareto curve that includes the planned points (from Equation (2)) and the mapped points (returned by the mapping function). These design points are characterized in terms of e ective throughput (frame/s) and area (mm 2 ). To quantify the mismatch between the planned points and the mapped points we calculate the following ratio: Figure 10 is labeled with its corresponding σ % value. Note that the curve obtained with LP is a theoretical curve because the points found at the system level do not guarantee the existence of a corresponding set of implementations for the components. The error is mainly due to the impact of the memory, which determines a signi cant distance between two consecutive regions (e.g., the points with more than 10% of mismatch in Figure 10 ). In fact, if a point is mapped between two regions it must be approximated with the lower-right point of the next region with lower e ective latency. This choice permits to satisfy the throughput requirements almost always, but at the expense of additional area.
In fact, even if Equation (2) is constrained by the system throughput, it is not always guaranteed to obtain the same throughput because it is not always the case that there exists a mapped point that has exactly the same latency of a planned point. To solve this issue, one could try to reduce the clock period and satisfy the throughput requirements.
Finally, to demonstrate the e ciency of COSMOS, Figure 11 shows the number of invocations to the HLS tool. For each component of WAMI, the right bars report the breakdown of the synthesis calls performed in each phase of the algorithm. At least two invocations are necessary for each region to characterize a component. Then, we have to consider the invocations that fail due to the λ − constraints, and nally, the invocations required at system level on the most critical components (mapping). Some components do non play any role in the e ciency of the system. For example, for M M , there are no invocations after the characterization because only the slowest version has been requested by Equation ( 2) (to save area). This component is not important to guarantee a high throughput for the entire system. Moreover, some synthesized points belong to multiple solutions of the LP problem, as in the case of D . Therefore, COSMOS avoids performing an invocation of the HLS with the same knobs more than once. On the other hand, the left bars in Figure 11 report the number of invocations required for a exhaustive exploration. Such exploration requires to (i) synthesize all the possible con gurations of unrolls and memory ports for each component, (ii) nd the Pareto-optimal design points for each component, and (iii) compose all the Pareto-optimal designs to nd the Pareto curve at the system level (Section 3). The left bars in Figure 11 show the number of invocations to the HLS tool required in step (i). COSMOS reduces the total number of invocations for WAMI by 6.7× on average and up to 14.6× for the single components, compared to the exhaustive exploration. Further, while COSMOS returns the Pareto-optimal implementations at the system level, to nd the combinations of the components that are Pareto optimal with an exhaustive search method, one has to combine the huge number of solutions for the single components. 
In the case of WAMI, the number of combinations, i.e., the product of the number of Pareto-optimal points of each component, is greater than 9 × 10¹². This motivates the need for a compositional method like COSMOS for the DSE of complex accelerators.
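The cost of the exhaustive flow can be sketched on a toy example. Everything below is hypothetical: the component names, the knob grid, and the `toy_synthesize` cost model stand in for a real HLS run and do not reproduce the WAMI numbers. The sketch only shows two ideas from the text: caching results so that the same knobs are never synthesized twice, and the multiplicative growth of the system-level combinations.

```python
from math import prod
from typing import Callable, Dict, List, Tuple

Knobs = Tuple[int, int]      # (unrolls, memory ports): hypothetical knob pair
Point = Tuple[float, float]  # (latency, area), lower is better for both


class CachedHLS:
    """Wrap a synthesis function so identical knob settings run only once."""

    def __init__(self, synthesize: Callable[[str, Knobs], Point]):
        self._synthesize = synthesize
        self._cache: Dict[Tuple[str, Knobs], Point] = {}
        self.invocations = 0  # number of actual (non-cached) synthesis runs

    def run(self, component: str, knobs: Knobs) -> Point:
        key = (component, knobs)
        if key not in self._cache:
            self.invocations += 1
            self._cache[key] = self._synthesize(component, knobs)
        return self._cache[key]


def toy_synthesize(component: str, knobs: Knobs) -> Point:
    """Toy cost model (assumption): parallelism is limited by memory ports."""
    unrolls, ports = knobs
    latency = 64.0 / min(unrolls, ports)
    area = 10.0 * unrolls + 4.0 * ports
    return (latency, area)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats in both latency and area."""
    front: List[Point] = []
    best_area = float("inf")
    for latency, area in sorted(set(points)):
        if area < best_area:  # strictly cheaper than every faster point
            front.append((latency, area))
            best_area = area
    return front


components = ["A", "B", "C", "D4"]  # hypothetical component names
grid = [(u, p) for u in (1, 2, 4) for p in (1, 2, 4)]

hls = CachedHLS(toy_synthesize)
# Steps (i) and (ii): synthesize every configuration, then filter per component.
fronts = {c: pareto_front([hls.run(c, k) for k in grid]) for c in components}
hls.run("A", (2, 2))  # repeated request: served from the cache, no new run

# Step (iii): the number of system-level combinations is a product.
combinations = prod(len(front) for front in fronts.values())
```

Even in this tiny setting the synthesis runs grow additively with the number of components (9 per component), while the combinations to compose grow multiplicatively (3 per component, multiplied together).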
Summary
We report a brief summary of the achieved results:
• COSMOS guarantees a richer DSE with respect to the approaches that do not consider the memory as an integral part of the DSE: for WAMI, COSMOS guarantees an average performance span of 4.06× and an average area span of 2.58× as opposed to 1.73× and 1.22×, respectively, when only standard dual-port memories are used; COSMOS obtains a richer set of Pareto-optimal implementations thanks to memory generation and optimization;
• COSMOS guarantees a faster DSE compared to exhaustive search methods: for WAMI, COSMOS reduces the number of invocations to the HLS tool by 6.7× on average and by up to 14.6× for the single components; COSMOS is able to reduce the number of invocations thanks to the compositional approach discussed in Section 6;
• COSMOS is an automatic and scalable methodology for DSE: the approach is intrinsically compositional, and thus with larger designs the performance gains are expected to be as good as with smaller ones, if not better. While an exhaustive method has to explore all the alternatives, COSMOS focuses on the most critical components.
RELATED WORK
This section describes the methods to perform DSE that are most closely related to ours. We distinguish the methods that explore single-component designs (Section 8.1) from those that are compositional like COSMOS (Section 8.2).
Component DSE
Several methods have been proposed to drive HLS tools for DSE. There exist probabilistic approaches [43], search algorithms based on heuristics, such as simulated annealing [44], iterative methodologies that exploit particle-swarm optimization [33], as well as genetic algorithms [17], and machine-learning-based exploration methodologies [26, 31, 45].
All these methods try to quickly predict the relevance of the knobs and determine the Pareto curves of the scheduled RTL implementations in a multi-objective design space. None of these methods, however, considers the generation of optimized memory subsystems for hardware accelerators. Conversely, other methods focus on creating efficient memory subsystems, but without exploring the other HLS knobs. For instance, Pilato et al. [36] propose a methodology to create optimized memory architectures, partially addressing the limitations of current HLS tools in handling memory subsystems. This enables a DSE that also takes into account the memory of accelerators. However, that work focuses on optimizing the memory architectures and not on proposing efficient DSE methods. Similarly, Cong et al. [12] explore memory reuse and non-uniform partitioning for minimizing the number of banks in multi-bank memory architectures for stencil computations. Differently from these works, COSMOS coordinates both memory generators, like the one proposed in [37], and HLS tools to find several Pareto-optimal implementations of accelerators. Other methodologies apply both loop manipulations and memory optimizations. For instance, Cong et al. [14, 15] adopt polyhedral-based analysis to apply loop transformations with the aim of optimizing memory reuse or partitioning. Differently from these works, COSMOS focuses on configuring the knobs provided by HLS, after applying such loop transformations.
Indeed, COSMOS realizes a compositional methodology, and thus it finds Pareto-optimal implementations of the entire system, and not only of the single components. The first step of COSMOS consists in the characterization of components to identify regions of the multi-objective design space where feasible RTL implementations exist. This step differs from previous works [27, 28, 43] in two main aspects. First, COSMOS includes memory generation and optimization in the DSE process. Second, COSMOS applies synthesis constraints to account for the high variability and partial unpredictability of the HLS tools. Such constraints consider both the dependency graph of the specification and the memory references in each loop. Thus, COSMOS identifies larger regions of Pareto-optimal implementations.
Other methods, such as Aladdin [47], perform a DSE without using HLS tools and without generating the RTL implementations, estimating the performance and costs of high-level specifications (C code for Aladdin). COSMOS differs from these methods because it aims at generating efficient RTL implementations by using HLS and memory generator tools. Indeed, such methods can be used before applying COSMOS to pre-characterize the different components of an accelerator that is not ready to be synthesized with HLS tools. Since the design of HLS-ready specifications requires significant effort [39], this can help the designers to focus only on the most critical components, i.e., those that are expected to return good performance gains over software executions. After this pre-characterization, COSMOS can be used to perform a DSE of such components and obtain the Pareto-optimal combinations of their RTL implementations.
System DSE
While the previous approaches obtain Pareto curves for single components, only a few methodologies adopt compositional design methods for the synthesis of complex accelerators. The approach used by COSMOS predicts the Pareto curve at the system level, similarly to those proposed by Liu et al. [28] and Haubelt and Teich [21]. Differently from these works, COSMOS also correlates the planned design points, which are purely theoretical (the LP solutions), with feasible high-level knob settings and memory configuration parameters. Further, COSMOS focuses on optimizing the HLS knobs, e.g., loop manipulations, and memory subsystems, rather than tuning low-level knobs, e.g., the target clock period.
CONCLUDING REMARKS
We presented COSMOS, an automatic methodology for compositional DSE that coordinates both HLS and memory generator tools. COSMOS takes into account the unpredictability of the current HLS tools and considers the PLMs of the components as an essential part of the DSE. The methodology of COSMOS is intrinsically compositional. First, it characterizes the components to define the regions of the design space that contain Pareto-optimal implementations. Then, it exploits an LP formulation to find the Pareto-optimal solutions at the system level. Finally, it identifies the knobs for each component that can be used to obtain the corresponding implementations at RTL. We showed the effectiveness and efficiency of COSMOS by considering the WAMI accelerator as a case study. Compared to methods that do not consider the PLMs, COSMOS finds a larger set of Pareto-optimal implementations. Additionally, compared to exhaustive search methods, COSMOS reduces the number of invocations to the HLS tool by up to one order of magnitude.
