Abstract-High-level synthesis (HLS) tools have been increasingly used within the hardware design community to bridge the gap between productivity and the need to design large and complex systems. When targeting heterogeneous systems, where the CPU and the field-programmable gate array (FPGA) fabric are both available to perform computations, a design space exploration (DSE) is usually carried out to decide which parts of the initial code should be mapped to the FPGA fabric such that the overall system's performance is enhanced by accelerating its computation via dedicated processors. As the targeted systems become larger and more complex, leading to a large DSE, a fast estimate of the acceleration that can be obtained by mapping certain functionality onto the FPGA fabric is of paramount importance. Loop pipelining, which is responsible for the majority of HLS compilation time, is a key optimization toward achieving high-performance acceleration kernels. A new modulo scheduling algorithm is proposed, which reformulates the classical modulo scheduling problem and leads to a reduced number of integer linear problems solved, resulting in large computational savings. Moreover, the proposed approach has a controlled tradeoff between solution quality and computation time. Results show that the scalability improves from quadratic, for the state-of-the-art method, to linear, for the proposed approach, while the optimized loop suffers a 1% (geomean) increase in the total number of cycles.
Scaling Up Modulo Scheduling for High-Level Synthesis
of the application are best suited for each technology, followed by their actual implementation on the corresponding technology. Currently, for the mapping of the application to the FPGA element, modern HLS tools require user-provided directives to guide the compilation mapping process, aiming to increase a performance metric. These directives usually target memory partitioning, loop unrolling, loop tiling, and loop pipelining.
Defining the most appropriate directive settings is a challenging problem, as the optimization of the code that is mapped to the FPGA leads to a combinatorial design space exploration (DSE) problem [1], which is often explored only partially by the user over many iterations [2]. Regarding the automation of the DSE, [1], [3]-[5] propose the analysis of many optimization directives, employing heuristics and statistical models to reduce the number of options in the design space, achieving a reduction of 19.5× on average over brute-force search [3], [5].
Complementary to the above approaches, a direct way to reduce the total DSE time is to reduce the time taken to apply each optimization to the code. As loops are usually responsible for the bulk of computation in many applications, loop optimizations are considered to have the most impact on performance, and as such, loop pipelining is a key optimization to achieve high throughput. However, most of the HLS compilation time is taken by the modulo scheduling algorithms used to create the loop pipelines [6], which restricts efficient DSE in systems that contain loops with a large number of instructions. This paper proposes a new modulo scheduling problem formulation that achieves a better asymptotic complexity with respect to the number of instructions in the loop body than the state-of-the-art approaches. Additionally, the presented methodology allows controlling the balance between the time available for solving the problem and the achieved quality of results, enabling fast exploration at the initial DSE stages and higher quality designs in the final stages.
This paper is structured as follows. Section II presents the modulo scheduling problem, and Section III presents the state-of-the-art modulo schedulers in the literature. Section IV presents the new formulation proposed for the modulo scheduling problem, which creates an explicit separation between the scheduling and allocation parts, exploring the latter with a genetic algorithm (GA) presented in Section V. Section VI presents a theoretical comparison and benchmark results for the proposed and state-of-the-art schedulers. Finally, Section VII concludes this paper.
II. MODULO SCHEDULING PROBLEM
The modulo scheduling problem is defined as follows: given a piece of code that contains a loop, find the minimum possible initiation interval II (i.e., the number of clock cycles after which a new loop iteration can be started), a schedule of the loop instructions, and an allocation onto the available resources, without violating any dependencies.
An example of the above problem is captured in Code 1, which also introduces the basic modulo scheduling concepts. The number of loop iterations is captured by the loop trip count tc, defined as tc = UB − LB + 1. Each instruction $i$ has an associated delay $D_i$ representing the number of cycles required for its execution.
The loop latency $l$ is defined as the number of clock cycles that one loop iteration takes to complete. It is calculated considering the execution of a single loop iteration as $l = \max_i(t_i + D_i)$, where $t_i$ is the cycle in which instruction $i$ starts its execution (referred to as its "starting time"). The loop latency $l$ depends on the number and type of the available resources and on the instructions' schedule. Table I presents four possible schedules for Code 1, assuming a single type of resource (op1), where $I_i^j$ represents the execution of instruction $I_i$ in iteration $j$, each row represents a clock cycle, and each column a resource instance.
In Table I, the first two schedules are not pipelined. Thus, all instructions in a loop iteration are scheduled to be completed before the next loop iteration starts. The first schedule uses only one resource, resulting in $l = 3$, while the second one uses two resources and has $l = 2$. A loop without pipelining takes total cycles = $tc \times l$ clock cycles to complete its execution.
The last two schedules represent cases where pipelining has been performed and two and three resources are available, respectively. Using two resources, $I_2^k$ can start its execution together with $I_3^k$, thus II = 2. Analogously, with three resources, $I_1^{k+1}$ can start its execution together with $I_2^k$ and $I_3^k$, thus II = 1. A pipelined loop finishes an iteration every II cycles, except for the first one, which takes $l$ cycles, resulting in total cycles = $l + II \times (tc - 1)$ clock cycles to finish its execution.
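The two cycle-count expressions above can be compared with a minimal sketch. The function names and the numeric values in the usage lines are illustrative assumptions and are not taken from Table I.

```python
def total_cycles_sequential(tc: int, latency: int) -> int:
    """Cycles for a non-pipelined loop: every iteration takes `latency` cycles."""
    return tc * latency


def total_cycles_pipelined(tc: int, latency: int, ii: int) -> int:
    """Cycles for a pipelined loop: the first iteration takes `latency` cycles,
    and each of the remaining tc - 1 iterations completes II cycles later."""
    return latency + ii * (tc - 1)


# Illustrative values (assumed): 100 iterations, latency 3, II = 1.
print(total_cycles_sequential(100, 3))    # 300
print(total_cycles_pipelined(100, 3, 1))  # 102
```

For a fixed II and trip count, the latency is the only term that a scheduler can still influence, which is why the later sections use it as the quality measure.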
Current approaches for solving the modulo scheduling problem, and finding the minimum II that results in the fastest loop execution, first try to find a solution for an estimated lower bound on II and, if they fail, iteratively increase the target II and try again until a solution is found. The lower bound for II is defined as MII = max(RecMII, ResMII), where RecMII is the recurrence minimum II, and ResMII is the resource-constrained minimum II [6]. Thus, given a candidate II, the problem of minimizing total cycles can be expressed as finding a schedule for the candidate II with minimum latency $l$, which can be obtained through an as soon as possible (ASAP) schedule.
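The outer search over candidate II values can be sketched as follows. This is only an illustration of the iterate-and-retry scheme described above: `try_schedule` is a hypothetical callback standing in for any scheduler that handles a fixed II, and `max_ii` is an assumed upper limit that the text does not specify.

```python
from typing import Callable, List, Optional, Tuple


def search_ii(
    rec_mii: int,
    res_mii: int,
    max_ii: int,
    try_schedule: Callable[[int], Optional[List[int]]],
) -> Optional[Tuple[int, List[int]]]:
    """Start from MII = max(RecMII, ResMII) and increase II until the
    (hypothetical) scheduler returns a schedule or max_ii is exceeded."""
    mii = max(rec_mii, res_mii)
    for ii in range(mii, max_ii + 1):
        schedule = try_schedule(ii)
        if schedule is not None:
            return ii, schedule
    return None


# Toy scheduler (assumed): succeeds only for II >= 3.
def toy_scheduler(ii: int) -> Optional[List[int]]:
    return [0, 1, 2] if ii >= 3 else None


print(search_ii(rec_mii=1, res_mii=2, max_ii=8, try_schedule=toy_scheduler))
```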
In this paper, we divide the modulo scheduling problem into its scheduling and allocation parts. The scheduling part consists of defining the starting time $t_i$ for all instructions $I_i$ in the loop such that no dependencies are violated. The allocation part consists of choosing the congruence class and the resource instance (defined as variables $m_i$ and $r_i$, respectively) in which instruction $I_i$ will be executed. Note that the congruence class $m_i$ of instruction $I_i$ is defined as $m_i = t_i \% II$. The above information is represented through a modulo reservation table (MRT), which is an $II \times \sum_{k \in R} a_k$ matrix, where $a_k$ is the number of instances of resource type $k$ and $R$ is the set of all resource types.
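As a small illustration of these definitions, the sketch below derives the congruence classes from a list of starting times and fills a dense MRT for a single resource type. The helper names are assumptions; the starting times are those quoted later in the text for Fig. 1(c) with II = 3 and two resource instances, while the resource instance indices are illustrative.

```python
from typing import List, Optional


def congruence_classes(start_times: List[int], ii: int) -> List[int]:
    """m_i = t_i % II for every instruction."""
    return [t % ii for t in start_times]


def build_mrt(ii: int, n_instances: int, m: List[int], r: List[int]) -> List[List[Optional[str]]]:
    """Dense MRT with II rows (congruence classes) and one column per
    resource instance; raises if two instructions claim the same slot."""
    mrt: List[List[Optional[str]]] = [[None] * n_instances for _ in range(ii)]
    for i, (mi, ri) in enumerate(zip(m, r)):
        if mrt[mi][ri] is not None:
            raise ValueError(f"slot ({mi}, {ri}) is already occupied")
        mrt[mi][ri] = f"I{i + 1}"
    return mrt


m = congruence_classes([0, 1, 1, 3, 2], ii=3)
mrt = build_mrt(ii=3, n_instances=2, m=m, r=[0, 0, 1, 1, 0])
for row in mrt:
    print(row)
```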
Throughout the rest of this paper, we consider the modulo scheduling problem for a given II, assuming there is an outer loop searching for a feasible II, as is the case in all state-of-the-art approaches.
III. STATE-OF-THE-ART MODULO SCHEDULING APPROACHES

A. SDC Modulo Scheduler
Zhang and Liu [7] proposed the system of difference constraints scheduler (SDCS), an approach that divides the modulo scheduling problem into its scheduling and allocation parts. The scheduling part includes the dependency, timing, and chaining constraints, captured in the form of a system of difference constraints (SDC). As such, a totally unimodular constraint matrix can be constructed, whose linear relaxation has an integral solution that is therefore optimal for the original problem as well [8]. Consequently, the original integer-programming problem, with exponential solve complexity, is mapped to its relaxed version, which exhibits polynomial solve complexity.
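To make the polynomial-time claim concrete, the sketch below solves a small system of difference constraints with a Bellman-Ford style longest-path relaxation. This is a swapped-in textbook technique used only for illustration; it is not the LP-based solver of [7] or [6]. The function name is an assumption, and the constraint form $t_i + D_i \le t_j + b_{ij} \times II$ used in the example is the usual modulo scheduling dependency form, assumed here rather than quoted from the text.

```python
from typing import List, Optional, Tuple

# A constraint (i, j, c) encodes t[i] - t[j] <= c.
Constraint = Tuple[int, int, int]


def solve_sdc_asap(n: int, constraints: List[Constraint]) -> Optional[List[int]]:
    """Return the smallest non-negative integer solution of the system
    (an ASAP schedule), or None if the system is infeasible."""
    t = [0] * n
    for _ in range(n):
        changed = False
        for i, j, c in constraints:
            # t[i] - t[j] <= c  is equivalent to  t[j] >= t[i] - c
            if t[i] - c > t[j]:
                t[j] = t[i] - c
                changed = True
        if not changed:
            return t
    return None  # still changing after n passes: positive cycle, infeasible


# Chain I1 -> I2 -> I3 with unit delays, plus a back-edge I3 -> I1 with
# distance 1: t3 + 1 <= t1 + II, i.e. t3 - t1 <= II - 1.
chain = [(0, 1, -1), (1, 2, -1)]
print(solve_sdc_asap(3, chain + [(2, 0, 3 - 1)]))  # II = 3
print(solve_sdc_asap(3, chain + [(2, 0, 2 - 1)]))  # II = 2
```

With II = 3 the recurrence can be satisfied, whereas with II = 2 the relaxation never converges and the sketch reports infeasibility.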
Definition 1: An MRT (an allocation) is said to be "valid" if every position (slot) is assigned to at most one instruction.
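Definition 1 can be checked directly when an allocation is stored as one (congruence class, resource instance) pair per instruction. The sketch below is a minimal illustration for a single resource type; the function name and the example allocations are assumptions.

```python
from typing import List, Tuple

Slot = Tuple[int, int]  # (congruence class m_i, resource instance r_i)


def is_valid_mrt(slots: List[Slot], ii: int, n_instances: int) -> bool:
    """An allocation is valid if every slot lies inside the II x n_instances
    table and no slot is claimed by more than one instruction."""
    in_range = all(0 <= m < ii and 0 <= r < n_instances for m, r in slots)
    return in_range and len(set(slots)) == len(slots)


# Two instructions on slot (1, 0): not valid.
print(is_valid_mrt([(0, 0), (1, 0), (1, 1), (1, 0), (2, 0)], ii=3, n_instances=2))
# Every instruction on its own slot: valid.
print(is_valid_mrt([(0, 0), (1, 0), (1, 1), (0, 1), (2, 0)], ii=3, n_instances=2))
```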
The allocation part is solved with a greedy heuristic, which starts solving the scheduling problem without considering the resource constraints. If the resulting MRT is not valid, the heuristic adds new SDC constraints to the SDC problem, in an attempt to force the conflicting instructions to start at different times. The process is repeated until there are no conflicts in the MRT, or until a certain number of attempts are performed. If the algorithm fails to find a valid allocation, the candidate II is increased, and the whole process is repeated.
The above iterative heuristic approach [7] is improved in [6] by introducing the possibility of backtracking on the conflicting instructions, allowing them to be sent back to the scheduling queue, which significantly improves the latency of the solutions and also finds solutions for II candidates for which the previous approach would have failed. However, the number of iterations of the method is increased, and consequently so is the number of SDC problems solved.
[Algorithm 1: SDCS algorithm presented in [6]]
The example below illustrates the number of SDC problems that are required to be solved by this approach in the worst-case. The scheduler (Algorithm 1) and backtracking (Algorithm 2) algorithms are replicated from [6] .
Definition 2: "Loop size" is defined as the number of instructions in the intermediate representation (IR) code of the loop body. "Problem size" is defined as the number of variables and constraints in an optimization problem formulation.
First, in Algorithm 1, the nonresource-constrained schedule is calculated once, resulting in the ASAP schedule (line 1) that serves as the basis for setting the instructions' starting times. In lines 2-4, the instructions are queued to be processed in order. The number of iterations is constrained by the budget (line 3), which is defined as budget = budgetRatio * n, where n is the loop size and budgetRatio is a user-defined constant. As such, this limit makes the worst-case number of iterations of Algorithm 1 equal to budget, regardless of how many instructions are sent back to the scheduling queue by the backtracking policy. In each iteration, the solver, i.e., the function responsible for solving the SDC scheduling problem, is invoked in lines 11 and 17 as well as a number of times within the BackTracking function.
[Algorithm 2: BackTracking function presented in [6], lines 15-19]
15 Remove all SDC constraints for S;
16 Remove S from MRT;
17 Add S to schedQueue;
18 Add SDC constraint: $t_I$ = evictTime;
19 Update MRT and prevSched for I;
As illustrated in the BackTracking function (Algorithm 2), an attempt is made to schedule an instruction I at every time step from its ASAP time to its current scheduled time (lines 1-3). In the worst case, its ASAP time is zero and its current scheduled time is the length of the whole schedule, which, in the worst case, is a sequential path. Therefore, in the worst case, the number of solver invocations within a single BackTracking call grows linearly with the loop size n. As such, since the number of iterations of Algorithm 1 is bounded by budget = budgetRatio * n, the total number of solver calls in the worst case grows quadratically with n.
The rest of Algorithm 2 deals with the case where the provided schedule is not feasible. In that case, a time slot is selected (lines 4-8), an instruction from that slot is removed from the SDC problem formulation (lines 9-12), and the necessary updates are performed in the corresponding MRT structure (lines 13-17). As such, a time slot becomes available for an instruction to be scheduled (lines 18 and 19), and control returns to Algorithm 1.
B. ILP Modulo Scheduler
As the objective is to create a schedule for the loop instructions for a given II, the problem can be naturally formulated as a resource-constrained scheduling-allocation integer linear problem (ILP). In addition to resource constraints, other HLS constraints can be added to the ILP to improve or ensure hardware quality, such as timing, chaining [8], and soft [9] constraints. Instead of separating the scheduling and allocation parts of the problem, Oppermann et al. [10] proposed to use the complete formulation, which always finds the optimal solution for a given II, if one exists. This formulation requires $O(n^2)$ variables, called overlap variables [11], and constraints to ensure MRT validity, where n is the loop size. Moreover, as this approach cannot be expressed in the SDC form, its solver scales exponentially with the problem size. However, the formulation needs to be solved only once, rather than multiple times as in the SDCS approach, which counterbalances the computation time required for obtaining a single solution. Moreover, the ILP approach is also capable of quickly proving that the problem generated by a candidate II is infeasible, which is a secondary way to counterbalance its exponential complexity. We will refer to it as the ILP scheduler (ILPS) in this paper.
Formulation 1 presents a simplified version of the ILP formulation presented in [10] , where the variables domain is described in Table II (replicated from [10] ).
The objective function (line 2) captures the sum of the starting times of all instructions, which needs to be minimized, leading to an ASAP schedule.
For each dependency in the data flow, a constraint is created (line 4), where $b_{ij} \ge 0$ is the loop-carried dependency distance (which is represented as a back-edge in the data-flow graph), and $b_{ij} = 0$ for data dependencies (forward edges in the data flow). Note that Oppermann et al. [10] considered $b_{ij} \in \{0, 1\}$, while this paper adopts the more general case where $b_{ij} \in \mathbb{Z}_0^+$, as proposed by Canis et al. [6].
Formulation 1, lines 1-3: minimize $\sum_{i \in O} t_i$, subject to the constraints described next.
To guarantee that two instructions will not occupy the same MRT slot, the binary overlap variables $\epsilon_{ij}$ (lines 5-7 and 11) and $\mu_{ij}$ (lines 8-11) are defined for each pair of instructions $I_i$ and $I_j$ that share the same resource type k, making the number of overlap variables grow quadratically with the loop size. Lines 4 and 15 capture the dependency and timing constraints used in [10]. Lines 13 and 14 provide the upper bounds for $r_i$ and $m_i$, respectively. Line 12 links the values of $t_i$ to $m_i$ using the auxiliary variables $y_i$. We use the notation in Table II throughout the remainder of this paper.
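The quadratic growth can be illustrated by counting the instruction pairs that share a resource type, since overlap variables are created per such pair. The sketch below counts unordered pairs only; how many variables the formulation in [10] creates per pair is not assumed here, and the function name and resource labels are illustrative.

```python
from collections import Counter
from math import comb
from typing import List


def count_overlap_pairs(resource_types: List[str]) -> int:
    """Number of unordered instruction pairs (i, j) that use the same
    resource type; each such pair needs overlap variables."""
    per_type = Counter(resource_types)
    return sum(comb(count, 2) for count in per_type.values())


# Doubling the number of instructions on one resource type roughly
# quadruples the number of pairs.
for n in (5, 10, 20, 40):
    print(n, count_overlap_pairs(["op1"] * n))
```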
Codina et al. [12] presented a comparison between the existing modulo schedulers in the literature, including iterative, slack, swing, and stage modulo schedulers, concluding that the iterative modulo scheduler (which is the basis for SDCS) achieves better II values, despite requiring longer execution times. As the achieved value of II is critical in the HLS context when targeting high-performance designs, the iterative method has been chosen to act as a baseline in this paper.
IV. SDC FORMULATION FOR SEPARATING SCHEDULING AND ALLOCATION
The main idea behind the proposed approach is to separate the ILP formulation scheduling and allocation parts, and to use an SDC formulation to solve the scheduling part and another method to explicitly traverse through valid MRTs (i.e., the allocation space). This approach differs from [6] , which modifies the SDC problem (scheduling part) by adding constraints trying to force its solution to have a valid MRT indirectly.
Section IV-A presents how Formulation 1 is transformed to allow the scheduling and allocation parts to be separately explored. Section IV-B presents a relaxation of the proposed formulation, that is useful due to its feasibility properties.
A. Separating Scheduling and Allocation
In Formulation 1, the overlap variables (lines 5-11) guarantee that two instructions are not allocated to the same MRT slot, thus ensuring a valid MRT. In the proposed approach, the allocation part ensures the construction of a valid MRT, and as such the overlap variables are eliminated from the formulation. The rest of the problem formulation comprises the dependency constraints (line 4), other HLS constraints (line 15), and the linking constraints between $t_i$ and $m_i$ (line 12). As such, the remaining problem can be rewritten as Formulation 2.
Other HLS constraints in the SDC can also be added in Formulation 2 in a similar way.
Assuming a valid MRT has been produced by the allocation stage, the m i and r i are known, and by solving the SDC Formulation 2, the y i values are calculated in polynomial time, leading to the t i values. Furthermore, the latency of the solution is used as an MRT measure of quality, since solving Formulation 2 for different MRTs results in schedules that differ only in latency, which is the only nonconstant factor in total cycles = l + II × (tc − 1), for a given II.
In summary, Formulation 2 allows us to divide the modulo scheduling problem explicitly into its allocation and scheduling parts. The proposed method traverses the MRT space and produces possible allocations, which then drive the scheduling part through the SDC Formulation 2. The traversal of the MRT space is guided by the latency of the obtained solutions.
The traversal of the MRT space can be performed by various methods, such as heuristics, meta-heuristics, evolutionary algorithms, or machine learning techniques, while the scheduling part can be solved optimally in polynomial time using Formulation 2.
Nevertheless, Formulation 2 is not feasible for every MRT.
Definition 3: We define "feasible MRT" and "infeasible MRT" as MRTs to which Formulation 2 is feasible and infeasible, respectively.
The feasibility and infeasibility of Formulation 2 are demonstrated by the example in Fig. 1(a). Let us assume that only two resource instances are available to execute instructions $I_i$, $i \in \{1, 2, 3, 4, 5\}$; thus MII = 3 and the MRT has three rows and two columns. Fig. 1(b) and (e) show the nonresource-constrained ASAP (NRCASAP) and the optimal MRTs, respectively. Fig. 1(c) shows a feasible MRT, with scheduling times t = {0, 1, 1, 3, 2}, for which the back-edge constraint is satisfied. Fig. 1(d) shows an infeasible MRT, with scheduling times t = {0, 3, 1, 1, 5}, for which the back-edge constraint is violated.
As such, the allocation space exploration needs to also consider infeasible MRTs, for which the quality of the obtained solution (i.e., latency) is not available. As the MRT is infeasible, the operations cannot be scheduled, and as such, there is no quality metric (i.e., schedule latency) associated with the MRT. In this fashion, a possible solution is to discard the infeasible MRTs and keep generating MRTs and analyzing only the feasible ones, as summarized in the flow presented in Fig. 2 . However, experimental work indicated that finding a feasible MRT through the above algorithm is rare. More specifically, using benchmark cp (described in Section VI-C), 145 872.23 random MRTs on average (50 repetitions) need to be created before finding a feasible one.
Instead of discarding infeasible MRTs, this paper proposes an approach that would allow the association of an infeasible MRT with a latency value, and as such would allow its incorporation in the MRT exploration stage. The proposed approach is based on relaxing Formulation 2. The relaxed formulation solution results in a schedule for the loop with an associated latency, even though this may not be an actual solution for the nonrelaxed problem. However, it can be used to compare infeasible MRTs with feasible ones, as illustrated in the flow in Fig. 3 .
Section IV-B presents a relaxation for Formulation 2 that is guaranteed to be feasible even for infeasible MRTs.
B. Relaxation of Formulation 2
In Formulation 3, there is no guarantee that x i values are multiples of II, and as such, y i are not guaranteed to be integer. Thus, there is no guarantee that t i %II = (x i + m i )%II = m i . In other words, the MRT that corresponds to the solution (calculated as t i %II) is not guaranteed to be the same valid MRT used to construct the problem in Formulation 3, thus, it is not guaranteed to be valid. This has no consequences for the flow proposed in Fig. 3 , since the schedule values (t i ) are only used for comparison.
In the following paragraphs, it is proved that if a feasible MRT exists for Formulation 2, then Formulation 3 is always feasible (regardless of the MRT). As such, by employing Formulation 3, the proposed methodology ensures that a schedule can be derived that can be used to guide the DSE process.
Notation: Let $\delta x_{ab}$ represent $x_a - x_b$. For the following proof, the following bounds need to be introduced.
Bound 1 is a consequence of the fact that $0 \le m_i \le (II - 1)$ for any instruction $I_i$ by definition. Bound 2 implies that the delay of instruction $I_i$ is smaller than II, which holds when the functional units are not themselves pipelined: every instruction has to finish its execution before being executed again. Bound 3 is a consequence of bound 2.
The first consideration to make is that the dependencies created by the forward edges (I i → I j ) can always be satisfied regardless of the MRT. This is because the right-hand side (RHS) in Formulation 2 can be only 0 or −1, since b ij = 0, and bounds 1 and 2 make −(II − 1) (II −1) . The forward edge constraints are also always valid for Formulation 3 since it is a relaxation of Formulation 2. Thus, we will only consider the back-edges constraints from now on.
Using bound 1, we can expand the RHS of Formulation 2 to enclose all possible values of the MRT, including the ones that would make it infeasible, resulting in the following equation:
In the same fashion, for Formulation 3 to be feasible for any MRT, is equivalent to rewrite the RHS on Formulation 3 as inequality (3) . That is, δx ij has to be smaller than the lower bound of
Thus, the condition for the feasible space of Formulation 3 to encompass all values of MRT is if inequality (2) is encompassed by inequality (3), resulting in the following equation:
If II = 1, inequality (4) always holds. Thus, we only need to analyze the case when II > 1.
Using bound 3, we can see that the left-hand side (LHS) of inequality (4) is equal to b ij , implying that the condition in inequality (4) can be rewritten as
Finally, using bound 3, and that b ij ≥ 1 by definition, we conclude that inequality (5) always holds, meaning also that inequality (4) always holds.
As such, if Formulation 2 is feasible for an MRT, then Formulation 3 is always feasible regardless of the MRT.
V. SDC-BASED GENETIC ALGORITHM
As noted in Section IV-A, any method of choice can explore the allocation space, as long as the method produces valid and feasible MRTs, for which Formulation 2 is used to solve the scheduling problem. In this paper, a GA is selected, as it is known to be applicable to problems without information about the solution structure [13].
The first step in the creation of a GA is the design of the GA chromosomes. Formulations 2 and 3 allow us to select the MRTs as chromosomes (individuals). In this way, we reduce the number of problem variables (described in lines 5-11 on Formulation 1) by capturing these constraints in the GA chromosome encoding. As will be explained later, the chromosome evolution guarantees a valid MRT, and as such, lines 5-11 are removed (with the overlap variables).
A. Genetic Algorithm
The purpose of the GA is to evolve MRTs (individuals) according to the final schedule latency (fitness). That is, since II is given and tc is constant, the GA evolves individuals to reduce the solution latency. MRTs are implemented as lists of $(m_i, r_i, I_i)$ triplets, where $(m_i, r_i)$ corresponds to the MRT slot of instruction $I_i$, which is a common way to represent sparse matrices such as the MRT.
1) Main Algorithm:
The proposed GA is described in Algorithm 3, which we will refer to as the genetic algorithm scheduler (GAS). The parameters impacting GAS are the population size nPop, the insemination rate $i_r$, the number of new individuals created at each generation offspringSize, the minimum number of generations minGen, and the mutation probability mutationProb.
The first step is the creation of a random population (lines 1-4). The initial population is inseminated with individuals created by randomly legalizing the NRCASAP schedule MRT (lines 5 and 11) using the legalization process (described in detail in Algorithm 5). Since the NRCASAP schedule is the unreachable ideal, we expect that its MRT contains "good genes" and that the optimal MRT will be close to a slight modification of it.
The population insemination introduces some knowledge about the solution structure into the GA, which has been shown to improve GA results [14]. Fig. 4 shows the number of generations a typical GAS execution takes to find a feasible MRT as a function of the population size, with and without the insemination. For Fig. 4, we used the benchmark complex (described in detail in Section VI-C).
Consider the example in Fig. 1(a), with $a_0 = 2$ resource instances and MII = 3. Fig. 1(b) presents the NRCASAP MRT, which has three instructions with congruence class m = 1. The legalization process can generate the MRTs presented in Fig. 1(c), (d), or (e), which will be used to inseminate the initial population.
The population is then evolved for minGen generations (lines 13-23). The evolution process consists of creating offspringSize new individuals, called cubs (lines 14-22), from two different individuals randomly selected from the population (line 15) through a cross-over process (described in detail in Algorithm 6). The mutation occurs by attributing a new random empty MRT slot to instructions, according to the mutation probability mutationProb.
Finally, the algorithm stops only once it has reached the minimum number of generations minGen and has found a feasible MRT (line 27). Furthermore, we define budget as the maximum number of generations in case a feasible MRT is not found (lines 25 and 26), after which GAS returns a failure for the given II, meaning that the candidate II should be increased.
2) Evaluation Algorithm: Algorithm 4 describes the evaluation function, which considers two cases. The first case is when the MRT is feasible, and thus we have a solution for the problem. The second case is when the MRT is infeasible, and thus Formulation 3 is used to calculate the fitness with a penalty given by how many conflicts there were in the solution of Formulation 3.
First, the algorithm tries to solve Formulation 2 (line 1), and the resulting schedule latency is returned if it is feasible (lines 2-5). If Formulation 2 is infeasible for the given MRT, Formulation 3 is solved (line 7), which is guaranteed to be feasible (Section IV-B). The MRT corresponding to the relaxed problem solution, $MRT_{relaxed}$, is calculated using $m_i^{relaxed} = t_i \% II$ (line 9), where $t_i = x_i + m_i$. The latency is calculated using the $t_i$ values (line 9). Since $MRT_{relaxed}$ is not guaranteed to be valid, we modify it using the validation function (line 11), which returns how many resource conflicts the nonvalid MRT had and randomly fixes the MRT (described in Algorithm 5). Finally, the function returns the latency $l$ of the relaxed schedule plus a penalty equal to the number of resource conflicts in the MRT times II (line 12). This penalty captures that the relaxed schedule can be a solution for the nonrelaxed problem, which happens when no resource conflicts occur, and also punishes relaxed schedules with conflicts, making them less likely to survive through generations.
As an example, consider Fig. 1(c), which is feasible; Formulation 2 returns the schedule $t_{1(c)} = \{0, 1, 1, 2, 3\}$, with latency $l_{1(c)} = 4$, resulting in $fitness_{1(c)} = 4$. Now, consider Fig. 1(d), which is infeasible; solving Formulation 3 results in $x_{1(d)} = \{0, 1, 0, 0, 0\}$, thus $t_{1(d)} = \{0, 1, 1, 1, 2\}$. Note that $MRT_{relaxed}$ has instructions $\{I_2, I_3, I_4\}$ in the same congruence class, thus it is not valid, and the validation function will return the latency $l_{1(d)} = 3$ and the number of conflicts n_outs = 1, resulting in $fitness_{1(d)} = 3 + 1 \times 3 = 6$. Furthermore, $MRT_{relaxed}$ needs to be validated before being added back to the population.
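The penalized fitness of the infeasible case can be reproduced with a short sketch. Several points are assumptions made for illustration: all delays are taken as 1 (which matches the latencies quoted above), the congruence classes m = {0, 0, 1, 1, 2} are chosen so that $x_i + m_i$ reproduces $t_{1(d)}$, a single resource type with two instances is used, and a conflict is counted as one instruction in excess of the available instances of its congruence class. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def latency(t: List[int], delays: List[int]) -> int:
    """l = max(t_i + D_i) over all instructions."""
    return max(ti + di for ti, di in zip(t, delays))


def count_conflicts(t: List[int], ii: int, n_instances: int) -> int:
    """Instructions in excess of the available instances, per congruence class."""
    per_class = Counter(ti % ii for ti in t)
    return sum(max(0, count - n_instances) for count in per_class.values())


def relaxed_fitness(x: List[int], m: List[int], delays: List[int],
                    ii: int, n_instances: int) -> int:
    """Latency of the relaxed schedule plus a penalty of II per conflict."""
    t = [xi + mi for xi, mi in zip(x, m)]
    return latency(t, delays) + count_conflicts(t, ii, n_instances) * ii


# Relaxed solution quoted for Fig. 1(d): x = {0, 1, 0, 0, 0}, II = 3.
print(relaxed_fitness(x=[0, 1, 0, 0, 0], m=[0, 0, 1, 1, 2],
                      delays=[1] * 5, ii=3, n_instances=2))  # 6
```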
3) Validation Algorithm: Algorithm 5 describes the function used to modify nonvalid MRTs into valid ones, randomly handling conflicts, and also counting how many conflicts there were in the MRT. This function is necessary for the insemination and individual evaluation processes since invalid MRTs can be generated in both procedures.
Each resource-constrained instruction I is attempted to be allocated in its current MRT slot (lines 2 and 3), and if there is already another instruction allocated to that slot, I is attempted to be scheduled in another resource instance with the same congruence class (lines 4-8). This solves conflicts where two instructions are assigned to the same resource instance, but there are still other available instances in the same congruence class.
If all resources are busy, a randomly chosen allocated instruction $I_r$ in the congruence class is selected (line 10), and the number of conflicts is incremented (line 16). Finally, a coin toss between the two instructions decides which one will be allocated to a random empty MRT slot (lines 11-15). This random selection is made as there is no information on which allocation is the best to perform.
To illustrate the validation process, consider Fig. 1(b). First, $I_1$, $I_2$, and $I_3$ are successfully allocated in their congruence classes. Then $I_4$ fails to be allocated with $m_4 = 1$. A randomly chosen allocated instruction with m = 1 is selected, which can be $I_2$ or $I_3$. Let us assume that $I_2$ is chosen; then a random selection between $I_2$ and $I_4$ is performed. If $I_2$ is selected, it remains where it is, and $I_4$ is randomly allocated to an empty slot, which can result in the MRTs depicted in Fig. 1(c) or (e). If $I_4$ is selected, $I_2$ is randomly allocated, which leads to the MRT depicted in Fig. 1(d).
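The validation steps described above can be sketched as follows. This is a reading of the prose description rather than a reproduction of Algorithm 5: the function name, the data layout (one (m, r) pair per instruction, single resource type), and the use of a seeded random generator are assumptions, and the sketch requires that the instructions fit in the II × n_instances table.

```python
import random
from typing import Dict, List, Tuple

Slot = Tuple[int, int]  # (congruence class, resource instance)


def legalize(alloc: List[Slot], ii: int, n_instances: int,
             rng: random.Random) -> Tuple[List[Slot], int]:
    """Turn a possibly non-valid allocation into a valid one and count conflicts."""
    slots: Dict[Slot, int] = {}  # occupied slot -> instruction index
    conflicts = 0
    for i, (m, r) in enumerate(alloc):
        if (m, r) not in slots:          # current slot is free
            slots[(m, r)] = i
            continue
        free = [k for k in range(n_instances) if (m, k) not in slots]
        if free:                         # another instance in the same class
            slots[(m, free[0])] = i
            continue
        conflicts += 1                   # whole congruence class is busy
        empty = [(a, b) for a in range(ii) for b in range(n_instances)
                 if (a, b) not in slots]
        if not empty:
            raise ValueError("more instructions than MRT slots")
        target = rng.choice(empty)
        rival_slot = (m, rng.randrange(n_instances))  # random occupant of class m
        if rng.random() < 0.5:           # coin toss: the occupant is moved
            slots[target] = slots[rival_slot]
            slots[rival_slot] = i
        else:                            # coin toss: the new instruction is moved
            slots[target] = i
    result: List[Slot] = [(0, 0)] * len(alloc)
    for slot, i in slots.items():
        result[i] = slot
    return result, conflicts


# NRCASAP-like allocation: I2, I3, and I4 all want congruence class 1.
print(legalize([(0, 0), (1, 0), (1, 0), (1, 0), (2, 0)], 3, 2, random.Random(0)))
```

Whatever the random choices, the example yields one conflict and an allocation in which every instruction owns a distinct slot.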
4) Cross-Over Algorithm: Algorithm 6 describes the single point cross-over function used. The cross-over point is a randomly chosen number of instructions (line 1), which are copied to cub 1 and cub 2 , respectively, from parents p 1 and p 2 (lines 2-4). The rest of the instructions are copied from the opposite parent (lines 5-7). This simple process might cause MRT conflicts since an instruction of the first part of p 1 can occupy the same slot as an instruction in the second part of p 2 . Possible conflicts are solved using the validation process described in Algorithm 5.
To illustrate the cross-over, consider Fig. 1, with $p_1 = MRT_{1(c)} = \{(0, 0, I_1), (1, 0, I_2), (1, 1, I_3), (0, 1, I_4), (2, 0, I_5)\}$ and $p_2 = MRT_{1(d)} = \{(0, 0, I_1), (2, 0, I_2), (1, 1, I_3), (1, 0, \ldots$
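A minimal sketch of the single-point cross-over on slot lists is given below. The function name is an assumption, as is the choice to draw the cross-over point between 1 and n − 1 so that each cub inherits from both parents. The second parent in the usage lines is hypothetical in its last entry. Conflicts in the cubs are not resolved here, since that is the role of the validation process.

```python
import random
from typing import List, Tuple

Slot = Tuple[int, int]  # (congruence class, resource instance) of instruction i


def single_point_crossover(p1: List[Slot], p2: List[Slot],
                           rng: random.Random) -> Tuple[List[Slot], List[Slot]]:
    """The first `point` instructions come from one parent, the rest from the other."""
    point = rng.randrange(1, len(p1))  # assumed range: 1 .. n - 1
    cub1 = p1[:point] + p2[point:]
    cub2 = p2[:point] + p1[point:]
    return cub1, cub2


p1 = [(0, 0), (1, 0), (1, 1), (0, 1), (2, 0)]
p2 = [(0, 0), (2, 0), (1, 1), (1, 0), (2, 1)]  # last entry is hypothetical
cub1, cub2 = single_point_crossover(p1, p2, random.Random(0))
print(cub1)
print(cub2)
```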
VI. COMPARISON WITH SDC AND ILP MODULO SCHEDULERS
A. Performance Models
The total number of individuals evaluated by the GA is:
Under the proposed formulation, a maximum of two SDC problems are solved for each individual. As such, for GAS to be faster than SDCS, totalIndividuals needs to be less than the number of problems solved by SDCS.
Let nPop = offspringSize = α × n and maxGen = β, where α and β are parameters defined by the user, and n is the loop size. Thus, totalIndividuals = n × (α(1 + inseminationRate) + β). Since in the worst-case two SDC problems are solved per individual (Formulations 2 and 3), GAS is set to solve O(n) SDC problems.
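The linear growth stated above can be checked numerically with a small sketch that evaluates the expression for totalIndividuals exactly as given in the text. The function names and the parameter values are illustrative assumptions, not the tuned settings of the evaluation.

```python
def total_individuals(n: int, alpha: float, beta: float,
                      insemination_rate: float) -> float:
    """totalIndividuals = n * (alpha * (1 + inseminationRate) + beta)."""
    return n * (alpha * (1 + insemination_rate) + beta)


def max_sdc_problems(n: int, alpha: float, beta: float,
                     insemination_rate: float) -> float:
    """At most two SDC problems (Formulations 2 and 3) per individual."""
    return 2 * total_individuals(n, alpha, beta, insemination_rate)


# Doubling the loop size doubles the worst-case number of SDC problems.
for n in (100, 200, 400):
    print(n, max_sdc_problems(n, alpha=1.0, beta=2.0, insemination_rate=0.5))
```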
The condition for GAS to be faster than the SDCS worst case is
Inequality (6) provides a sufficient condition for GAS to be faster than the SDCS worst case, and indicates that there exists a large enough n for which GAS will be faster than the SDCS worst case. The impact of the above condition will be revisited in the performance evaluation section, where the parameters α and β are tuned.
B. Characterization
A high-level comparison between SDCS, ILPS, and GAS is presented in Table III. The table shows that ILPS has the worst solver complexity and also the worst scaling with the loop size n.
In the case of GAS, the problem size scales better than in the case of SDCS: GAS creates only one variable for each instruction I_i, instead of D_i (as proposed by Oppermann et al. [10]), and it does not add the allocation-related constraints that SDCS does (Section III-A). As such, GAS's main sources of speed-up compared to SDCS are: 1) the smaller number of problems solved, which is achieved through the use of Formulation 3 and the efficient MRT exploration by the GA and 2) the smaller size of the generated problems.
In the case of SDCS, the main source of inefficiency lies in its iterative heuristic search for a solution with a valid MRT. A secondary source of inefficiency is its inability to prove infeasibility for a given II, so that the allocated time budget must expire before the II is abandoned. Such infeasible IIs are explored by Oppermann et al. [10] using a different RecMII estimation than [6], which results in a smaller, infeasible MII.
The main source of inefficiency of ILPS lies in its quadratic problem size scaling and in the exponential complexity of its solver. In contrast, the GAS search solves a number of problems that scales linearly with n, as opposed to the quadratic scaling of the number of problems in SDCS. Furthermore, GAS problems are in the SDC form, and thus its solver exhibits polynomial complexity.
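The polynomial complexity of problems in SDC form can be made concrete with a small sketch. A system of difference constraints x_i − x_j ≤ c is feasible exactly when its constraint graph has no negative cycle, which a Bellman-Ford style relaxation detects in polynomial time. The code below is a generic textbook illustration under that assumption; it is not the solver used in this work (Gurobi is used for all schedulers), and the function name `solve_sdc` and the toy constraints are hypothetical.

```python
def solve_sdc(n_vars, constraints):
    """Solve a system of difference constraints with Bellman-Ford relaxation.

    constraints: list of (i, j, c) triples, each meaning x_i - x_j <= c.
    Returns a list of values satisfying all constraints, or None if infeasible.
    """
    # Starting every variable at 0 is equivalent to a virtual source with
    # zero-weight edges to all variables.
    dist = [0] * n_vars
    for _ in range(n_vars):
        changed = False
        for i, j, c in constraints:
            if dist[j] + c < dist[i]:
                dist[i] = dist[j] + c
                changed = True
        if not changed:
            return dist  # converged: every constraint holds
    return None  # still relaxing after n_vars passes: negative cycle

# Toy dependence chain: x1 >= x0 + 2 and x2 >= x1 + 1,
# written as x0 - x1 <= -2 and x1 - x2 <= -1.
chain = [(0, 1, -2), (1, 2, -1)]
feasible = solve_sdc(3, chain)

# Adding x2 - x0 <= 2 contradicts the chain (which needs x2 >= x0 + 3).
infeasible = solve_sdc(3, chain + [(2, 0, 2)])
print(feasible, infeasible)
```

The relaxation performs at most n_vars passes over the constraints, so the work is bounded by the number of variables times the number of constraints, in contrast with the branch-and-bound search needed for a general ILP.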
Concerning the quality of the achieved solution, SDCS does not guarantee latency or II optimality due to its heuristic nature. If a solution cannot be found within a preallocated time budget, the II is increased, and a new search is performed. In contrast, ILPS guarantees both latency and II optimality, since the solver is guaranteed either to find the optimal solution for the problem at a given II or to declare infeasibility. GAS does not guarantee latency or II optimality, since the GA may fail to find a feasible MRT before it reaches its stopping criterion.
C. Benchmark Selection
The proposed approach is evaluated using 13 HLS benchmarks, whose details and characteristics are summarized in Table IV. All suitable loops of these benchmarks were submitted for scheduling using the SDCS, ILPS, and GAS approaches. Benchmarks mt, dv, fat, cp, and ac are the same ones used in [6]. Note that mt, fat, and ac are synthetic benchmarks that were designed specifically to test SDCS's scalability. These benchmarks use more than 50% resource-constrained nodes.
The rest of the benchmarks exhibit various data access patterns, loop dependencies, logic operations, nested loops, and induction variables, in order to validate the approaches under different code constructions.
Moreover, as benchmarks rs, cv, gp, ft, and sh contain more than one loop in their source code, the entries in Tables IV-VI reflect the combined loop characteristics.
All schedulers were implemented as part of the LegUP [21] infrastructure, where the SDCS is natively implemented. As LegUP imposes restrictions on which loops can be pipelined, the same restrictions are also applied for ILPS and GAS. As such, all loops have bounds and array sizes modified to statically determined values, and loops with unknown bounds and more than one induction variable were set not to be pipelined. Since only innermost loops in the main C file can be pipelined, all benchmarks were adapted to a single C file, and only the innermost loops were selected to be pipelined. Furthermore, full inlining is applied to the whole code, since LegUP implements local memories only for arrays that are used exclusively within the context of one, and only one, function.
It should be noted that the above code modifications are applied due to LegUP's restrictions. The discussed schedulers can be implemented on other compilers without these restrictions, as has been demonstrated by Oppermann et al. [10].
D. Performance Evaluation
The results presented in this section were obtained on an Ubuntu 14.04 machine with 16 GB of RAM and an Intel Core i7-2600 CPU @ 3.40 GHz. All schedulers are implemented within the LegUP 4.0 infrastructure as Low Level Virtual Machine (LLVM) 3.5 opt passes. Gurobi 7.5 [22] is chosen as the solver for all schedulers.
The SDCS budget is set as budgetRatio = 6 (LegUP default) and INCREMENTAL_SDC = 1 (speeds up SDCS [23]). The ILPS time budget is set to 5 min. The parameters of GAS were empirically set as follows: α = 0.10, β = 10, and i_r = 0.05, for all benchmarks, satisfying α(β + 1 + i_r) = 1.105 < 6 = budgetRatio, with the expectation that GAS will, in most cases, solve fewer problems than the SDCS worst case. The mutation rate is empirically defined as mutationProb = 1%.
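The parameter check quoted above is plain arithmetic and can be reproduced directly. In the sketch below, the function name `individuals_per_instruction` is hypothetical, and the interpretation of α(β + 1 + i_r) as a per-instruction coefficient is an assumption based on the condition as stated in the text.

```python
def individuals_per_instruction(alpha, beta, insemination_rate):
    """Coefficient alpha * (beta + 1 + i_r) from the condition in the text."""
    return alpha * (beta + 1 + insemination_rate)

# Parameter values reported for all benchmarks.
alpha, beta, i_r = 0.10, 10, 0.05
budget_ratio = 6  # LegUP default, as stated in the text

coeff = individuals_per_instruction(alpha, beta, i_r)
print(round(coeff, 3), coeff < budget_ratio)
```

Changing α or β in this snippet gives a quick way to see how much headroom a different GA configuration leaves with respect to budgetRatio.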
The three schedulers are tested with respect to the obtained quality of solution and the time required to reach that solution on the above benchmarks. The obtained results are presented in Tables V and VI, corresponding to the average of 50 repetitions. Overall, all schedulers achieved similar results (II and loop latency), but with a clear time improvement for the proposed GAS scheduler. Table V presents the number of problems solved, and the number of variables and constraints for all schedulers. As the results demonstrate, GAS solves significantly simpler problems (i.e., fewer variables and constraints) than SDCS. Furthermore, the larger (i.e., more instructions) the code is, the fewer problems are solved by GAS when compared to SDCS. Table VI presents the obtained II, the solution's latency, the total number of cycles, and the scheduling time to reach the solution, for all loops of all benchmarks. Results show that all schedulers reached the minimum II.
With respect to the obtained solution latency, SDCS produces solutions with 12% longer latency when compared to ILPS, while GAS produces solutions with 13% longer latency when compared with SDCS. However, the impact of the latency is amortized in the total number of cycles, given by total cycles = l + II × (tc − 1), as tc is usually much larger than l. As illustrated in Table VI, the solutions produced by GAS exhibit on average a 1% penalty in the total cycles.
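The amortization argument can be checked numerically with the total-cycles expression given above. The latency, II, and trip count values in this sketch are hypothetical and chosen only to show the effect; they are not taken from Table VI.

```python
def total_cycles(latency, ii, trip_count):
    """total cycles = l + II * (tc - 1), as in the text."""
    return latency + ii * (trip_count - 1)

# Hypothetical loop: same II and trip count, latency 20 versus 23 cycles.
base = total_cycles(latency=20, ii=2, trip_count=1000)    # 2018
longer = total_cycles(latency=23, ii=2, trip_count=1000)  # 2021

penalty = (longer - base) / base
print(base, longer, f"{penalty:.4%}")
```

In this example a 15% longer latency changes the total cycle count by well under 1%, because the II × (tc − 1) term dominates when tc is large.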
Finally, Table VI shows the computation time taken by SDCS, ILPS, and GAS. ILPS is not comparable to the other approaches due to its exponential scaling with the problem size. GAS presents 55% improvement over SDCS in our tests (geomean).
Fig. 7. Latency evolution according to the number of generations for benchmark cp.
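For readers unfamiliar with geomean aggregation, the sketch below shows one plausible way such a figure could be computed from per-benchmark scheduling times. The timing values are invented for illustration, and defining the improvement as one minus the geometric mean of the GAS/SDCS time ratios is an assumption, since the exact definition is not spelled out here.

```python
import math

def geomean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-benchmark scheduling times in seconds (not from Table VI).
sdcs_times = [4.0, 10.0, 2.5]
gas_times = [2.0, 4.0, 1.25]

ratios = [g / s for g, s in zip(gas_times, sdcs_times)]
improvement = 1.0 - geomean(ratios)
print(f"{improvement:.1%}")
```

The geometric mean is commonly preferred over the arithmetic mean for ratios because it is not dominated by a single benchmark with a very large or very small ratio.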
To exemplify scalability, Figs. 5 and 6 present the computation time and achieved II as a function of the loop size, for the benchmarks in Table IV when loop unrolling is applied. The unrolling factor is given by the number above the GAS marks. Benchmarks gp, ft, sh, and j2 are excluded from this evaluation due to the small unrolling factors that they support (i.e., their loops can be unrolled only by factors 1 and 2). Fig. 5 illustrates that the proposed GAS approach scales better with the loop size than SDCS. Moreover, the results show that GAS can provide solutions for long codes in cases where SDCS fails (missing points).
Moreover, Fig. 6 shows that GAS can find smaller IIs than SDCS in the case of long codes, which indicates an improved performance in the total number of cycles for large codes, in contrast to the results obtained for the smaller codes presented in Table VI. Fig. 7 shows the GAS solution latency for the benchmark cp (the most challenging benchmark, as indicated by Tables V and VI) as a function of the number of generations and parameterized according to the population size. Fig. 7 shows how the population size and number of generations can be increased to obtain a smaller latency, illustrating the tradeoff between computation time and solution quality.
It should be noted that the reported results are obtained by applying the above modulo scheduling approach to a small number of loops. However, in the case of a DSE in a large application, multiple kernels need to be considered for loop pipelining, resulting in a combinatorial number of possible configurations. As such, the gains obtained from the proposed approach will translate into a significant reduction of the required absolute DSE time. For example, the DSE method presented by Schafer [5] evaluates up to 250 designs per benchmark. Some of these evaluations vary the loops unrolling [...] Table VII, show that all schedulers spend more time to provide a solution as compared to Table VI, since they now also handle cases where the II is infeasible; however, the schedulers' relative performance has not changed.
VII. CONCLUSION
The work in this paper presents a novel formulation for the modulo scheduling problem, as well as a proposed solution that is based partially on the use of a GA. Results have shown that a significant speedup can be achieved compared to the current state-of-the-art approaches, and that the proposed formulation also enjoys better scalability properties, paving the way for attacking larger source codes.
In summary, the results show that both ILPS and GAS can produce valid schedules, which are not guaranteed to be optimal if a limited time budget is enforced. The ILPS produces nonoptimal solutions during the branch and bound process in the solver, and GAS produces, for each individual, a valid solution that is not proven to be optimal. Furthermore, the resulting nonminimum latency contributes weakly to the total number of cycles for the complete loop execution, which minimizes the GAS impact on the overall loop performance.
The obtained results illustrate that the proposed scheduler achieves O(n) scalability in the number of SDC problems solved, where n is the number of LLVM IR instructions in the loop code. This is a significant improvement compared to the worst-case scalability of O(n^2) provided by the current state-of-the-art SDCS approach.
Furthermore, the scaling test results indicate that GAS is capable of finding solutions for large codes where SDCS fails to do so, and also suggest that GAS is capable of finding smaller IIs than SDCS, indicating a more efficient exploration of the solution space.
