Hardware accelerators generated by polyhedral synthesis techniques make extensive use of affine expressions (affine functions and convex polyhedra) in control and steering logic. Since the control is pipelined, these affine objects must be evaluated at the same time for different values, which forbids aggressive reuse of operators. In this article, we propose a method to factorize a collection of affine expressions without preventing pipelining. Our key contributions are (i) to use semantic factorizations exploiting arithmetic properties of addition and multiplication and (ii) to rely on a cost function whose minimization ensures correct usage of FPGA resources. Our algorithm is totally parameterized by the cost function, which can be customized to fit a target FPGA. Experimental results on a large pool of linear algebra kernels show a significant improvement compared to traditional low-level RTL optimizations. In particular, we show how our method reduces resource consumption by revealing hidden strength reductions. 
INTRODUCTION
Since the end of Dennard scaling, computer architects have strived to build energy-efficient computers. The trend is to trade generality for energy efficiency by using specialized hardware accelerators such as GP-GPU or Xeon-Phi [18] , to quote a few. Recently, reconfigurable FPGA circuits [10] have appeared to be a competitive alternative [35] . With FPGAs, the program is the circuit: genericity is ultimately traded for energy efficiency. However, designing a circuit is far more complex than writing a C program. Disruptive compiler technologies are required to automatically generate a circuit configuration from an algorithmic description while finding an appropriate trade-off between parallelism and I/O bandwidth. Polyhedral compilation techniques have a long-term history of success in automatic parallelization for HPC [17] . Roughly, loop iterations are represented with polyhedra (hence the name), then code optimizations are specified with geometric operations and integer linear programming. Polyhedral analysis enables reasoning about massively parallel computations with a compact representation. Powerful analysis were designed for extracting 52:2 C. Alias and A. Plesco parallelism [11] , scheduling pipelined circuits [3] , resizing optimally the buffers [1] , or tuning I/O requirements to fit memory bandwidth [2, 27] , to quote a few. Polyhedral analysis is used successfully in high-level circuit synthesis [4, 31] . The result is a high-level description of the circuit whose control logic involves a large collection of piecewise affine functions. Minimizing the resource usage of affine control while guaranteeing the throughput is a major challenge in polyhedral synthesis. Few approaches address low-level affine control synthesis in the context of polyhedral circuit synthesis. Actually, most of the research effort in the polyhedral model has focused on source-level transformations. For instance, Alias et al. [2] propose a source-level approach at C level before high-level synthesis to produce an optimized I/O system for a circuit. Zuo et al. [38] optimize the control structure at source level on a C program before using VivadoHLS. In this article, we will not follow the same guidelines. Rather, we show how, for a given control, and without any attempt to optimize its structure, we can produce dedicated hardware machinery that outperforms by 30% the generic optimizations applied on RTL by a state-of-the-art synthesis tool. To do so, we rely on semantic factorizations, a generalization of common subexpression elimination, which proves to be particularly effective on affine control. Semantic factorizations take profit of associativity and commutativity of addition and multiplication to simplify the control. Note that semantic transformations are not new in the polyhedral model. To quote a few, semantic properties of operators are already exploited to recognize algorithms [22] or to extract instruction patterns at source level [37] . As far as we know, this is the first time that semantic factorizations are used to optimize affine control while keeping it exact. However, when approximation is allowed, the complexity of the controller can be reduced [19, 23] and control algorithms can be simplified [6, 29] . In our case, this would not apply: control has to be exact; no approximation is allowed.
In this article, we propose a technique to compact a collection of affine objects (affine expressions and affine constraints) by exploiting semantic properties of addition and multiplication. More specifically,
• The compaction is driven by a cost function whose minimization ensures a proper usage of FPGA resources. The cost can be customized to target a given FPGA.
• The result is a DAG pipelinable at will and ready to be mapped on the FPGA, whose resource usage minimizes the cost function.
• Experimental results show that our algorithm significantly outperforms the low-level optimizations applied on RTL by a state-of-the-art synthesis tool.
This article is structured as follows. Section 2 gives a short introduction to polyhedral synthesis and introduces the concepts used in the remainder of our work. Section 3 presents our compaction algorithm. Section 4 presents the experimental results. Section 5 reviews the related work. Finally, Section 6 concludes the article and draws perspectives.
PRELIMINARIES
This section presents the basic math concepts required to understand the notion of affine control (convex polyhedra, piecewise affine functions), then polyhedral control synthesis is briefly introduced. In the remainder of this section, n, p, and q are positive natural integers: n, p, q ∈ N − {0}.
Convex Polyhedra
Given a linear form a * : R n → R and a scalar α ∈ R, the set H ≥ (a * , α ) = {x ∈ R n , a * (x ) ≥ α } is said to be a closed half-space. A convex polyhedron P is a finite intersection of closed half spaces:
A is the matrix whose rows are τ 1 , . . . , τ q , and b = (α 1 , . . . , α q ) T ,
Polyhedral Synthesis
A parallelizing compiler analyzes the input program and maps the computation to a parallel architecture. The new execution order must reproduce the original computation: each operation must be fed with the same data, and the original data dependences must be respected. However, checking data dependence between two operations is undecidable. Even the sequence of operations executed on a given input-the execution trace-is undecidable. Usually, compiler analysis overapproximates the execution trace as well as the data dependences. However, the approximation made is usually rough and the compiler may miss many opportunities of parallelization. Another approach is to restrict the compiler analysis to programs whose execution trace and data dependences are input invariant and can be expressed with decidable sets.
Affine control loops. The polyhedral model focuses on kernels with affine control loops manipulating arrays [17] . The control is exclusively made of for loops, if, and sequence. Data types allowed are arrays, structures, and scalar variables (seen as dimension 0 arrays); there are no pointers. In addition, loop bounds, conditions, and array indices must be affine functions of surrounding loop counters and structure parameters (e.g., array size). This ensures that execution trace may always be expressed as a union of integer polyhedra. Most linear algebra and signal processing kernels can fit into this model. Figure 1 (a) depicts such a kernel iteratively computing the heat equation on a 1D mesh [21] stored in the array u 0 . α is a rational constant depending on discretization parameters, and β = 1 − 2α. With this program model, the execution of a single assignment S is always controlled by a nest of affine for loops guarded by affine conditions. Such an iteration is uniquely represented by the vector of surrounding loop counters i. The execution of S at iteration i is denoted by S, i . The set D S of iteration vectors is called the iteration domain of S. By construction, the iteration domain D S is always an integer polyhedron, hence the name of 52:4 C. Alias and A. Plesco the framework. The original execution order is given by the lexicographic order over D S , which is also computable. Figure 1 (b) depicts the iteration domains for the different assignments of the heat-1D kernel. As mentioned on the code (a), the initialization iterations are represented with •; the kernel iterations with •, ⊗, and ; and the use iterations with •. Red arrows represent data dependences, as discussed in the next paragraph.
Data dependences. With the polyhedral model, execution traces can be summarized exactly with integer polyhedra. This make it possible to build precise compiler analysis (data dependences, scheduling, data/computation allocation, etc.) thanks to integer linear programming and geometric operations [1, 2, 11, 15, 16] . For instance, array dataflow analysis [15] computes exact data dependences-in other words, a function h S,r , called source function, which maps each read r of each assignment execution S, i to the assignment execution defining the read value h S,r ( i). On the running example,
Source functions are always integer piecewise affine modulo the encoding of assignments •, ⊗, × with integers and the padding of iteration vectors so they have the same dimension. In polyhedral HLS, source functions are often used to multiplex the data for each read of each assignment and to handle synchronizations and communications between parallel units [4, 31] . The complexity of the source function h S,r (number of clauses, number of affine constraints per clause) may increase exponentially with the dimension of the iteration domain D S . Hence, efficient compaction techniques are required.
Scheduling and code generation. Provided the data dependences, the next step is to change the execution order to improve quality criteria (parallelism, data reuse, etc.). This is done by computing a scheduling function θ S that maps each execution S, i to a timestamp θ S ( i). In the polyhedral model, we seek for affine schedules θ S ( i) = A i + b, the timestamps θ S (D S ) being vectors ordered by their lexicographic order. In a way, θ S : R n → R p translates a nest of n loops to a target nest of p loops, each component of (t 1 , . . . , t p ) = θ S (i 1 , . . . , i n ) being the iteration of the operation S, i 1 , . . . , i n in the transformed loop nest. A simple criterion to maximize parallelism is to minimize p, so a maximum number of operations will share the same date [16] (and thus will be scheduled to be executed in parallel). Once the schedule is found, it remains to generate the control that executes the assignments in the order prescribed by the schedule. Many approaches have been developed [5, 12] . The best approach for HLS is to produce a control automaton per assignment S that issues a new iteration vector i of S at each clock cycle [12] . Two integer piecewise affine functions are required: a function First S that issues the first iteration of S with respect to θ S (initial state), and a function Next S that maps each iteration of S to the next iteration of S to be executed with respect to θ S (transition function). On the running example, we would have
To improve reuse, affine scheduling is usually combined with affine partitioning (or tiling) [11] . Each relevant iteration domain D S is partitioned into parallelepipeds by translating a collection of cutting hyperplanes
being the coordinates of the partition containing the original iteration vector (i 1 , . . . i n ). Again, the resulting domain D T S is an integer polyhedron that can be scheduled thanks to affine scheduling. However, affine partitioning highly complexifies the control on the generated program. Figure 1 
The final execution order is depicted with grey arrows. For the assignment •, the functions First • () and Next • are
.
Remark that this Next function is simplified to avoid the exponential blowup of clauses. When several domains overlap, the first clause is chosen. This is another reason techniques to simplify generic piecewise affine functions do not apply here. All in all, the multiplexing and the control involved in this example have a total of 669 affine constraints and 137 affine expressions. Clearly, they should be compacted before being mapped to an FPGA. This article provides an efficient algorithm to compact several integer piecewise affine functions, provided as a pool of affine constraints and expressions, as a DAG efficiently using FPGA resources.
OUR ALGORITHM
In this section, we present our algorithm to turn a collection of affine expressions and affine constraints to a compact DAG. Section 3.1 discusses the cost function to be minimized. Then Section 3.2 defines the semantic factorizations considered to optimize the control: expression factorization and constraint factorization. Section 3.3 explains how all possible combinations of semantic factorizations (expression and constraint) can be summarized with a graph. Finally, Section 3.4 shows how to select the best composition with respect to the cost function.
Cost Model
Our algorithm leverages a cost function to derive a resource-efficient DAG from a pool of affine control functions. Our algorithm is fully parametrized by the cost function, which could be customized at will to fit a given target. In this section, we provide an example of cost function, which happens to be relevant for FPGA targets.
Cost of a DAG. An FPGA consists of reconfigurable building blocks with lookup tables, 1-bit adders, and 1-bit registers (ALM with Altera, CLB with Xilinx). In addition, RAM blocks and DSP blocks are usually provided. Our DAGs use only integer operators (integer addition, integer multiplication by an integer constant), which require an amount of building blocks proportional to the bitwidth of the result. Hence, the resource usage of a DAG D = (N , E) can be modeled as a simple weighted sum:
where bw(n) denotes the bitwidth of the result computed by the operator n of the DAG and w(n) denotes, roughly, the number of building blocks required by n to compute 1 bit of result. bw is simply computed for each node of the DAG by a bottom-up application of the rules bw(x + y) = 1 + max(bw(x ), bw(y)) and bw(x * y) = bw(x ) + bw(y) starting from the bitwidth of the input variables. Furthermore, w can be customized at will to fit a target FPGA. In the following, we will assume that w(+) = 1 and w( * ) = 100. With that choice, our algorithm will tend to decompose affine expressions with multiplications by a power of 2. Note that the cost model is not intended to reflect the actual resources requirement. Rather, it should be viewed as an objective function whose minimization leads to desired properties.
This model is refined to handle two special cases: (i) |a * x | = 0 if a is 0, 1, or a power of 2, and (ii) when n is a multiplication by a negative constant. In the latter case, the extra cost of complement-by-2 must be taken into account. With a > 0, −a * x = a * x + a, where x denotes the logic complement of x. The cost is
The first term is the cost of the logic complement x, the second term is the cost of the multiplication (when a is 1 or a power of 2, this term is removed), and the third term is the cost of the addition.
Affine forms. Our algorithm builds the DAG by adding affine forms incrementally and needs an upper bound on the cost of the sub-DAG computing an affine form. Consider an affine form u = n i=1 a i x i + b, where the x i are integer variables and the coefficients a i and b are integer constants. In the worst case, the term a i x i (or b) with the largest bitwidth is evaluated first, each addition increasing the size of the result of 1 bit. Hence, the worst possible bitwidth for the result of u is Again, the cost function |.| is a parameter of our algorithm. It could perfectly be refined/redefined to fit a different target.
Motivating Examples
Consider affine expressions E 1 = i + 2j + k and E 2 = 5i + 2j + 3k, where i, j, and k are input variables. Common subexpression elimination would produce the DAG sketched in Figure 2 (a). The resources used are four adders, two multipliers by a constant, and one shifter. Now remark that E 2 = E 1 + 4i + 2k. This leads to the DAG in Figure 2 (b). With that expression factorization, the resources required are now four adders and three shifters, which is better than the first solution.
A similar factorization scheme can be applied to affine constraints. Consider the normalized affine constraints C 1 : 4i + 3j < 0 and C 2 : −4i − 3j + 1 < 0. Writing C 2 : 4i + 3j ≥ 0, it is easy to detect that C 2 = ¬C 1 . Now consider the affine constraint C 2 : −5i − 3j − 1 < 0. There is no direct connection with C 1 . But if we write C 2 : 5i + 3j ≥ 0, the affine expression of C 2 (5i + 3j) can be obtained from the affine expression of C 1 (4i + 3j), giving the improved DAG depicted in Figure 2 
With that constraint factorization, the resources used are reduced to two adders, one multiplier by a constant, and one shifter.
Expression and constraint factorization raise several issues that must be handled carefully:
• Expression factorization is not always beneficial. If one tries to derive E 1 from E 2 , with E 1 = E 2 + (−4i − 2k ), the resource usage would be worse than the direct solution (four adders and five multipliers by a constant). A best combination of factorizations must be found among all possible combinations.
• Constraint factorization (C 2 from C 1 ) is a terminal transformation. Indeed, the expression of C 2 (e 2 s.t C 2 : e 2 < 0) is never computed. Hence, subsequent factorizations involving the expression e 2 are not possible. For this reason, expression factorizations will be preferred over constraint factorizations.
This article proposes a unified way to represent the possible sequence of factorizations of affine expressions and constraints and to select combination of factorizations minimizing the resource consumption.
Realization Graph
All possible combinations of semantic factorizations will be summarized in a realization graph G r .
Basically, the nodes of G r are affine expressions and constraints, and an edge u Δ − → v means that v can be realized from u with a cost of Δ. Intuitively, a rooted path in G r would give a realization of the reached nodes. Depending on the factorization (expression or constraint), a specific edge is issued, as explained in the following sections.
Expression factorization. Given a DAG node computing an expression u, an expression v can be computed by applying the factorization rule v = u + (v − u). In that case, we would add to the DAG the following components:
• A sub-DAG computing v − u (to be optimized as well)
• An adder taking the output nodes of u and v − u. The additional resource cost would then be the cost of the operator + plus the cost of the affine form v − u:
We register this possible design choice to a realization graph G r , whose nodes are expressions and constraints to be computed and whose edges u Δ − → v express that v can be computed from u with an additional resource cost of Δ. When the target node is a constraint v < 0, the edge has the same meaning. In general, the incoming edge with the smallest cost u Δ − → v will be preferred to design v.
Constraint factorization. Given a DAG node computing an affine constraint normalized as u < 0, the constraint v < 0 can be derived from u < 0 with a simple logic negation when v < 0 ≡ ¬(u < 0), which means that v < 0 ≡ −u − 1 < 0, or more simply that u + v = −1. This gives a first simple test to detect negations. Otherwise, remark that
. Hence, v < 0 can be computed from u by adding the following components to the DAG:
• A sub-DAG computing −1 − u − v (to be optimized).
• An adder taking the output nodes of u and −1 − u − v.
• The result of the adder is checked by connecting the most significant bit (to have < 0) to a negation.
The additional resource cost would then be the cost of the operator +, plus the cost of the affine
Final algorithm. Figure 3 depicts our algorithm to build the realization graph G r from a pool of expressions E and constraints C. Expressions and constraints are inserted incrementally in the graph G r (lines 3 through 7), and constraint nodes are marked to be distinguished from expression nodes (line 6). Finally, a special node initial_node is added to the graph G r (line 8) and connected to each node u with an expression factorization edge labeled by |u | (lines 9 and 10). Thus, initial_node will serve as a starting point to select the best realization, as explained in the next section. Indeed, edges initial_node |u | − − → u suggest a direct realization of u, whereas edges u Δ − → v suggest that v can be realized from u with a cost Δ.
Each expression is inserted with procedure insert (lines 11 through 14). Prior to inserting the expression e, the maximal strict subexpression with each node n of G r is inserted. This ensures a maximal subexpression factorization between the expressions of E and C. Indeed, expression factorization is not able to factor strict subexpressions, only cases where u is a subexpression of v are detected: if u = 3i + j and v = 3i + j + 5k, then v is naturally expressed as u + 5k. However, if u = 3i + j + 4k, v = 3i + j + k, then the best solution is a factorization by the strict subexpression 3i + j, which does not appear with pure expression factorization. Then expression factorization edges are inserted between each pair of nodes whenever it is beneficial (lines 15 through 23) using the rule described earlier. For presentation reasons, we use the notation ϕ (u, v) for (1 + max{bw worst (u), bw worst (−1 − u − v)}) · w(+). Expression factorization is beneficial when the circuitry added for v is strictly less expensive than computing v directly (line 19). Then the symmetric case (computing u from v) is considered for completeness.
Each constraint e < 0 is inserted in the graph G r (lines 5 through 7). The expression e is inserted as described earlier (line 6). Then negation edges between e < 0 and constraint nodes of G r are added (line 7) by using procedure insert_neg_edge (lines 24 through 32). For each constraint node u < 0 of G r (line 25) without expression factorization edge to e = v < 0 (line 26), the negation edge is added whenever constraint factorization is beneficial. For presentation reasons, we use the notation ψ (u, v) for (1 + max{bw worst (u), bw worst (−1 − u − v)}) · w(+). Cases with expression factorization edge are preferred, as expression factorization is always more beneficial than constraint factorization (see the discussion in Section 3.2). As for expressions, constraint factorization is beneficial when the circuitry added for v < 0 is strictly less expensive than computing v < 0 directly (line 27). Similarly, the symmetric case (computing u < 0 from v < 0) is considered for completeness.
Example. Consider the affine constraints C depicted in the following table with the input bitwidths bw(i) = 2 and bw(j) = bw(k ) = 8. build_realization_graph (∅,C) produces the graph depicted in Figure 4(a) . The common subexpression between constraints 1 and 2 (inserted at line 13) produces the node 2j depicted in white. Constraint factorization edges are dashed. Each edge is labeled by its cost Δ computed according to the rules given in Section 3.1. Again, these rules can be parameterized to fit the target. Consider constraints 1 and 2 and their nodes in G r . G r suggests that constraint 1 can be realized either directly with cost 22 (edge from initial_state), or from subexpression 2j with a cost 19. In turn, 2j can serve as a basis to realize other expressions as constraint 3 with cost 19 (edge from 2j to 4i + 3j). Constraint 3 can be used to realize constraint 4 thanks to a negation factorization of cost 12. This shows that choices need to be done on G r to find the best combination of factorization. This is the purpose of the next section.
Finding an Efficient Realization
An expression factorization edge u Δ − → v of the realization graph G r means that expression of v (if v is a constraint e < 0, the expression is e) may be realized from the expression of u with a cost Δ. 
each u i being realized from u i−1 at cost Δ i . The total cost is Δ 1 + · · · + Δ n + Δ. If v and u n are constraints, then v may be realized with a constraint factorization edge from u n :
In that case, the expression of v would not be available. Indeed, v < 0 would be evaluated without computing v. Then no realization could start from v: the negation edges are terminal. In addition, nodes along the path can be used to compute others nodes of G r . However, each node must have a single predecessor in the obtained subgraph, which is then a tree. These remarks lead to the following definition.
Definition 3.1 (Realization). Let G r be the realization graph of expressions E and constraints C.
A realization is a subgraph T ⊆ G r that satisfies the following conditions:
(1) Each expression/constraint is realized correctly: T is a tree rooted at initial_node, and T spans E ∪ C. In other words, a realization is a particular spanning tree of G r . Condition (2) avoids useless computation of common subexpressions (see white nodes in Figure 4 (a)): common subexpressions are forced to be intermediate results in the final realization. The cost of a realization is the sum of the weights Δ on its edges. Hence, finding an efficient realization amounts to compute a minimal spanning tree of G r , under the constraints specified in Definition 3.1.
The algorithm for finding a minimum realization is given in Figure 5 . The algorithm proceeds into two steps. First, a minimum spanning tree rooted on initial_node is found among the expression factorization edges of G r by using a variant of Prim's greedy heuristic (lines 4 through 7). The search is stopped once all expression/constraint nodes are covered. Second, constraint factorization is considered for orphan constraints (lines 8 through 15 ). An orphan constraint v is neither factorized (the father is initial_node) nor involved in an expression factorization (leaf) (lines 9 and 10) nor involved in a previous constraint factorization (line 11). A best local constraint factorization is found for v and added to the realization (lines 12 and 13). As mentioned earlier, constraint factorizations are terminal. This is enforced by excluding the source u from the nodes to be considered for subsequent constraint factorizations (line 14). The edge from initial_node to v is removed from the realization, as v is now realized from u with a constraint factorization (line 15).
We choose to restrict constraint factorization to orphan constraints, as constraint factorization is always less beneficial than expression factorization: it is terminal, and the gain for the constraint is likely to be comparable to an expression factorization. Hence, constraints involved in expression factorization (thus nonorphan) are excluded. Figure 4 (b) depicts the realization tree T obtained from the realization graph of Figure 4 (a). The edges chosen for the realization tree are in red. The total cost of the design (72) is greatly improved compared to direct realization (487) (only arcs from initial_node, no factorization at all) and compared to common subexpression factorization (391) (edges from node 2j and direct realization of −5i − 3j − 1). Among the four factorization edges of T (edges not coming from initial_node), there are three expression factorizations (labeled by costs 19, 19, and 22) and one constraint factorization (dashed edge labeled by 12). Among the two expression factorizations, 1 is a subexpression factorization (targeting i + 2j + k), and the two others are not (targeting 4i + 3j and 5i + 2j + 3k). The latter two are said to be semantic: they are obtained by playing on semantic properties of addition and multiplication and could not be found by subexpression factorization. Constraint factorization (dashed edge labeled by 12) is also semantic: it could not be found directly on the negation with subexpression factorization. All in all, this example shows the important role played by semantic factorization for expression and constraints in reducing the resource cost of the set of constraints. Figure 6 depicts our algorithm to build the DAG from the realization tree found in G r . The inputs are the realization tree T and the set of expressions E and constraints C to be realized. They are not specified to simplify the presentation. The output is the DAG and a mapping node[.] linking expressions/constraints of E and C to their implementation in the DAG. The algorithm is a recursive depth traversal of T . Each time an edge of T is traversed, the DAG is updated accordingly. Two additional inputs are used to traverse T (t) and to build the DAG (d). The invariant is as follows: when calling build_dag(t,d), t is already realized in the DAG, and the realization root in the DAG is pointed by d. If t is an expression of E, node[.] is updated with d (line 3). If t is the expression of a constraint c ∈ C, the circuitry to check t < 0 is added to the DAG, and the root is linked to c (lines 4 through 6). The recursive traversal is handled in the remaining lines. Initially, build_dag is called with t = initial_node and d = null. A DAG dag(u) is built for each target node u. Its root serves as starting point for the traversal (lines 7 through 10). The circuitry is added to the DAG for selected expression factorizations (lines 11 and 12) and constraint factorizations (lines 13 through 20) by following the rules described in Section 3.3. Since constraint factorization is terminal, no recursive call is required. Consequently, node [.] should be updated in that place (lines 16 and 20) .
Example (Cont'd).

Building the DAG
Rules for expression and constraint factorization produce a pool of new expressions E new in the DAG (u − t for expression factorization line 12, and −1 − t − u for constraint factorization line 19). In turn, these new expressions may be further optimized. Then our algorithm is applied recursively on E new . The output of the new DAG is used in the current DAG in place of the expressions of E new . Notice that recursive calls are optional. The recursive depth can be used as a tuning knob to customize the degree of reuse in the DAG. Figure 4(d) . For the sake of clarity, we have tagged realization roots in the DAG. i + k has tag 1, 4i + j had tag 2, 4i + 2k has tag 3, and i has tag 4. Here, the compaction has detected that 4i is a common subexpression. Multiplications by a power of 2 are represented by shifts ( 1 for ×2 and 2 for ×4). These realizations serve as building blocks for the final DAG on Figure 4 (c). Again, the nodes have been tagged for the sake of clarity. Here, the tags are the ranks of constraints C given in Section 3.3.
Example (Cont'd).
EXPERIMENTAL EVALUATION
In this section, we present the results obtained by applying our algorithm on a large benchmark of applications, with and without polyhedral optimization. Section 4.1 describes the experimental setup. Then Section 4.2.1 presents synthesis results on the FPGA. Section 4.2.2 discusses the semantic factorizations found in the benchmarks. Finally, Section 4.2.3 presents statistics on the behavior of our algorithm (execution time, recursive depth).
Experimental Setup
We have applied our algorithm to simplify the affine control generated for the kernels of the benchmark suite PolyBench/C v3.2 [26] . Table 1 depicts the kernels and the synthesis results obtained on the FPGA.
For each kernel, a DPN process network is generated using the Dcc tool [4] . Dcc optimizes the data transfers and the pipeline execution with an affine schedule based on a loop tiling [2, 3] . The execution order is deeply restructured, and the affine control per process (control automaton,mux/demux) can be quite complex. Before deriving the DAGs, we simplify the control polyhedra with various heuristics including gist and integer set coalescing [32] . Then for each process, we collect the affine control and apply our algorithm to produce a DAG. Table 1 presents the sum of the criteria collected for each process: #dags is the total number of DAG produced, #C is the total number of affine constraints, and #E is the total number of affine expressions. All in all, we have produced and analyzed a total of 261 DAGs from 4,990 constraints and 2,464 expressions.
The main innovation of this work is to explore semantic factorizations for simplification. Indeed, the expression 3i + j could be factorized by 2i + j because of semantic properties of addition and multiplication, whereas common subexpression factorization 3i + j would be restricted to syntactic subterms 3i, j, and 3i + j. The latter is also referred to as nonsemantic factorization. The factorizations found by our algorithm are either semantic or not, depending on the arcs chosen by build_realization_tree. Thus, we want to make sure that through a FPGA synthesis tool, a DAG optimized with semantic factorization will effectively use fewer resources. This would ensure that semantic factorizations allow a significant improvement compared to the optimizations applied by an industrial tool.
Experimental Results
This section analyzes the results obtained on Polybench/C according to two criteria defined in the experimental setup. Section 4.2.1 presents synthesis results on the FPGA and shows the effectiveness of our approach. Section 4.2.2 discusses the semantic factorizations found in the benchmarks. Finally, Section 4.2.3 shows how often our algorithm is applied recursively and presents the execution times.
Synthesis Results.
We have implemented a VHDL generator for our DAGs and a direct generator that puts the affine expressions in VHDL and let the synthesis tool do the optimizations-typically, common subexpression elimination and Boolean optimizations. This way, we can compare our approach to the optimizations applied by the synthesis tool. The DAGs are generated using the hierarchical approach. Both direct and optimized designs are pipelined at ALM level by adding a sufficient number of registers to the outputs. This way, the synthesis tool will perform low-level logic optimizations and retiming to redistribute the registers through the design. The synthesis was performed using Quartus Prime TM 16.1.2 from Intel on the platform on the Arria 10 10AX115S2F4I1SG FPGA with default synthesis options (optimization level: balanced). Intel Quartus Prime is capable of applying highly advanced optimizations automatically including common subexpression factorization and many other advanced Boolean optimizations. The DAGs were tested using the GHDL simulation tool over uniformly distributed random stimuli.
The synthesis results are presented in Table 1 . The synthesis results for our DAGs are provided in the SEM+Quartus column (semantic factorizations + Quartus). The synthesis results for the direct implementation are given in the Quartus column (Quartus only). The gain in ALMs compared to the direct implementation is given in the Gain column. Both implementations run at the maximum FPGA frequency of 645.16MHz. The frequency is limited by the target MAX delay limited by the signal hold timing and other physical constraints. Both versions use a significant amount of five to six input ALMs, thus achieving a higher compression ratio. There is no ALM overhead because of registers, as for all of the examples, they are entirely packed inside the ALMs containing logic. Experimental results show a significant gain in ALM compared to the direct implementation optimized by Quartus (common subexpression elimination). The average gain is 30%, with a deviation of 5%. Remark that the kernel gesummv shows slightly less gain than the other kernels. The kernel gesummv has many simple polyhedra with small bitwidths in the computations. In that case, low-level Boolean optimizations are more effective than semantic factorizations: arithmetic operations are merged with Boolean ∧ operations from the polyhedra using large (up to 7 bit) LUT. For the DAGs parts involved, this results in three to four times less ALMs than the optimized DAGs. The optimized DAGs can benefit less from these optimizations, as the bitwidth of the operations increases through the computations. Nonetheless, even for that kernel, the benefit of semantic factorizations is significant. The synthesis results confirms the validity of our models and approach. Semantic factorization appears to complement nicely the optimization applied by Quartus and may be used profitably as an optimizing preprocessing for affine control. Table 2 provides the actual number of factorizations (arcs) found for each kernel. The #e-arcs column gives the number of expression factorizations. The next column, #sem, gives the number of semantic expression factorizations. The same goes for constraints: the #c-arcs column gives the number of constraint factorizations, and the next column, #sem, gives the number of semantic constraint factorizations. Figure 7 (b) depicts the proportion of semantic factorizations u → v that replace v by an expression with less multiplications and more multiplications with a power of 2. These factorizations produce a strength reduction. With our cost model, our algorithm tends to maximize these factorizations. Table 2 gives the number of such factorizations for each kernel (#pow2 columns). Again, we distinguish between expression factorizations and constraints factorizations. We observe that 12% of the factorizations (36% of the semantic factorizations) enable a strength reduction. In general, strength reductions are very effective to reduce hardware resources Among the numerous combinations of semantic factorizations found in the examples, a frequent pattern is a semantic factorization with a strength reduction u → v producing a term v Figure 7(a) ). Two thirds of the terms are very simple (e.g., −1 * c, c − 1, t − 1 for gesummv) and do not exhibit factorization opportunities. They are derived directly, without factorization. The remaining third contains more complex terms with factorization opportunities. For these terms, the interesting factorizations are semantic. In contrast, kernels heat3d and fdtd2d are dominated by nonsemantic factorizations. These are kernels with a large number of constraints. We found many occurrences of our factorization pattern where constraints w i differ only by a constant shift v → w i = v + λ and can be produced by a nonsemantic factorization.
Distribution of Semantic Factorizations.
Recursive Calls and Execution
Time. Semantic factorizations applied by our algorithm produces fresh expressions (denoted by E new in Section 3.5), which are in turn optimized by applying our algorithm recursively. The rec-depth column of Table 3 provides the maximum number of nested recursive calls for each kernel. Many kernels need one recursive call: semantic factorizations were applied, producing fresh expressions E new , in turn optimized with our algorithm. A quarter of the kernels even require two recursive calls: the semantic factorizations required by E new produce fresh expressions E new new in turn optimized with our algorithm. We have run 8 370 111  407  386  53  35  21  10  4  lu  7 213 87  177  167  28  15  10  3  0  mvt  11 118 70  69  60  25  10  9  7  0  seidel-2d  5 226 63  277  270  39  18  7  3  1  symm  13 213 116  129  115  43  19  14  8  0  syr2k  10 135 90  74  69  27  12  5  4  1  syrk  9 118 81  66  61  24  9  5  4  1  trisolv  9 93 56  58  51  19  10  7  4  0  trmm  8 118 79  62  57  26  12  5  5  0 our algorithm on a Intel Core i5 CPU M 540 @ 2.53GHz with 3,072KB L2 cache and 3GB RAM, which is a pretty light configuration compared to the usual requirements of circuit synthesis. Table 3 provides the execution time in seconds for the construction of the realization graph G r (build_realization_graph, insert-time column), and for the computation of the best realization, the generation of the DAG and the subsequent recursive calls (build_realization_tree; build_dag, dag-time column). Most of the time is spent in the construction of the realization graph G r , the significant operation being the insertion of an affine expression (insert, Figure 3 ). Even with this light configuration, the overall execution time tends to be small with a median execution time of 3 seconds.
RELATED WORK
Few approaches address the mapping of affine control in the context of polyhedral circuit synthesis. With Compaan/Laura, the control frequently executed is synthesized as a DAG with common subexpression factorization [14] , the control less frequently executed being left to a sequential controller. This generates bubbles at each start of the innermost loop. When loops are restructured in such a way that the innermost loops often have a few iterations [11] , this limits the throughput of the controller. For instance, high-degree stencils often used in HPC require very sharp tiles whose corners have a few innermost loop iterations. In addition, the sequential controller requires a microprogram to be stored in a ROM. As storage resources are limited on FPGAs, this would limits the control, and hence the parallelism and finally the performances of the circuit. The authors also propose a runtime distance approach, which splits the iteration domain into phases where the multiplexing is constant (variant domains). The iterations spent in each phase are counted thanks to polyhedral analysis [13] , then the control iterates through the phases with a counter. As far as we know, this approach was not evaluated. However, the amount of clock cycles is usually expressed with a piecewise affine pseudopolynomial that is usually far more complex than the original control. Additionally, it requires full multipliers (variable times variable), which are also quite limited on today's FPGA (DSP units). Again, this approach would limit the parallelism of the application. Sometimes the control can involve integer divisions by a constant [15] , and it is then said to be quasiaffine. Zissulescu et al. [36] propose a set of recipes to get rid of integer divisions and modulos (emulated by integer divisions). Among the recipes, strength reduction adds data dependences that may hinder parallelism. Additional (but light) control is also required. However, this is an important optimization that could be profitably used as a complement to our approach. Zuo et al. [38] propose several source-level transformations to simplify the control for affine loop nests in front of an HLS tool. This approach is relevant when the outcome of a polyhedral optimization is a single unperfect loop nest with all program statements. As stated in Section 2, our front-end polyhedral optimizations split the control between processes communicating through channels. This way, the control per process is simpler-a simple perfect loop nest-and does not require such optimizations. This approach complements ours: we are not optimizing the control structure; we derive an optimized hardware structure for a given affine control. Piecewise affine functions have received a lot of attention in the control community since Bemporad et al. [9] showed that explicit solutions of model predictive control (MPC) can be expressed with piecewise affine functions. Since then, many approaches have been designed to map piecewise affine functions to FPGAs using binary search trees [25, 30] , lattice-based representation [24, 28] , a mix thereof [8] , or hash functions [7] . Explicit solutions to MPC can be approximated to reduce the complexity of piecewise affine controllers [19, 23] . In addition, search algorithms can be simplified and give an approximate solution [6, 29] . In our case, this would not apply: control has to be exact; no approximation is possible. Tondel et al. [30] rely on a binary search tree to seek the right affine function to apply. The construction minimizes the depth of the tree by grouping in the same branch domains sharing the same affine function. Then the circuit walks through the tree by using a sequential controller [25] . However, the sequential controller is not directly pipelinable. This leads to throughputs of several cycles per iteration, which is not desirable for our purpose. Additionally, storage resources are required to store the tree. This limits the duplication of these units, hence the parallelism of the final circuit. Lattice-based representation [20, 28, 33] is an alternative representation of piecewise affine functions as a min of max of elementary affine functions f (x ) = min 1≤i ≤k max j ∈I i a i j (x ), each min term representing a convex part of f . Wen [34] provides an algorithm for generating the lattice based representation of an affine function. Then the lattice-based representation can be mapped directly on the FPGA [24] or mixed with an improved binary search tree [8] . The direct mapping leads to a throughput of five cycles per point on a Links Spartan 3 FPGA, which is not sufficient for our purpose (we expect one cycle per point). As well, it is not clear that the min/max representation would be more compact than our DAGs. Anyway, lattice-based representation assumes piecewise affine functions (hence continuous), which is generally not the case for affine control, as explained in Section 2. Bayat et al. [7] use a hash function to locate the affine function to be applied. Basically, the function domain is subdivided in cells with a grid. The hash function maps each cell to the intersecting function clauses. Then a few iterations find out the relevant clauses. The trade-off is this: the bigger the cell, the smaller the storage requirement, the bigger the cycles per point (throughput). The throughput can only be increased at the price of a bigger storage. As mentioned previously, using storage resources of the FPGA is not desirable to implement affine control.
CONCLUSION
In this article, we have proposed an efficient algorithm to compact a collection of affine constraints and expressions by exploiting semantic properties of addition and multiplication. The compaction is driven by a customizable cost function whose minimization ensures proper usage of FPGA resources. The result is a DAG ready to be mapped on the target FPGA. Synthesis results on the FPGA show that our approach complements the optimizations applied by Quartus very well and can be used as a preprocessing step. We also show that, according to our cost model, our method is significantly better than the classical common subexpression factorization.
Thus far, the technique has been used to optimize the control at the process level, each process running in parallel. If we try to optimize the control involved in all processes as a single DAG, the control would be serialized and we would miss the benefit of parallelization. In the future, we plan to extend this technique to factorize the control common to processes without hindering the parallelism. Our method is bounded to affine control, but polyhedral control may include integer divisions. In addition to preprocessing, new rules and patterns can be defined. In addition, our method deals with constraints with inequalities only. When equalities are explicitly stated, other factorizations may apply. Finally, nothing forces strength reductions; they occur only when the pool of affine expressions happens to allow it. We believe that special expressions could be systematically added to allow more strength reductions.
