Abstract: Loop pipelining is one of the most important optimization methods in high-level synthesis (HLS) for increasing loop parallelism. There has been considerable work on improving loop pipelining, which mainly focuses on optimizing static operation scheduling and parallel memory accesses. Nonetheless, when loops contain complex memory dependencies, current techniques cannot generate high performance pipelines. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain dependencies (i.e., parameterized by an undetermined variable) and/or nonuniform dependencies (i.e., varying between loop iterations). Our optimization allows a pipeline to be statically scheduled without the aforementioned memory dependencies, but an associated controller will change the execution speed of loop iterations at runtime. This allows the augmented pipeline to process each loop iteration as fast as possible without violating memory dependencies. We use a parametric polyhedral analysis to generate the control logic for when to safely run all loop iterations in the pipeline and when to break the pipeline execution to resolve memory conflicts. Our techniques have been prototyped in an automated source-to-source code transformation framework, with Xilinx Vivado HLS, a leading HLS tool, as the RTL generation backend. Over a suite of benchmarks, experiments show that our optimization can implement optimized pipelines at almost the same clock speed as without our transformations, running approximately 3.7-10× faster, with a reasonable resource overhead.
Polyhedral-Based Dynamic Loop Pipelining for High-Level Synthesis
I. INTRODUCTION
High-level synthesis (HLS) tools are now a stable technology, enabling high hardware design productivity. State-of-the-art HLS tools like Xilinx Vivado HLS [1], Intel FPGA SDK for OpenCL [2], and LegUp [3] are able to synthesize programs written in high-level languages like C/C++/OpenCL into hardware designs described in VHDL/Verilog. Hardware architectures are automatically optimized and synthesized in the process. For many applications, there is still a considerable gap between the quality of results produced by HLS tools and those obtained through manual optimization of an RTL hardware design. Computational bottlenecks are typically located in some critical loops of high-level programs, and hence loop pipelining has emerged as one of the preeminent optimization techniques in HLS. Loop-pipelining techniques work by automatically detecting when a loop iteration does not depend on its predecessors, and hence can begin executing before its predecessors have completed. However, complex interiteration dependencies can hinder this process, and cause existing HLS tools to take an overly conservative approach to scheduling. The optimization method presented in this paper aims to make high-performance loop pipelining possible for loops with uncertain dependencies (i.e., parameterized by an undetermined variable) and/or nonuniform dependencies (i.e., varying between loop iterations).
The motivational loop shown in Listing 1 contains a parameterized affine recurrence equation [4]. In this loop, there is an undetermined variable m in the write access pattern of array A. The loop iterator i ranges from zero to N − 1, where N is constant. The value of m is not known at compile time. Therefore, the sequence of write accesses to elements of array A cannot be completely determined. Indeed, whether the loop can be pipelined actually depends on the value of the parameter m, as illustrated in Fig. 1. When m = 0, there is no memory dependency in the loop execution as shown in Fig. 1(a). When m = 1, the result of each iteration has to be generated before the start of the next iteration, which implies an interiteration dependency (also known as a recurrence). As shown in Fig. 1(b), loop pipelining with an initiation interval (II) of one cycle would violate the read-after-write (RAW) dependency. When m ≥ 3, there will be no recurrence violation in the pipeline as shown in Fig. 1(c). This uncertain data dependency prevents existing HLS tools from exploiting loop pipelining by default, because they only support a fixed II. As a result, a sequential pipeline schedule will be synthesized for this loop.
Our optimization presented in this paper enables the statically scheduled pipeline to run at dynamic speed. This is the basic idea of our approach: implement the pipeline scheduled for the smallest II and throttle the execution of loop iterations according to a compile-time dependency analysis. To understand when the pipeline needs to slow down, we use parametric polyhedral analysis to first synthesize a lightweight runtime check, such as 1 ≤ m ≤ 2 for Listing 1 according to Fig. 1. A preliminary version of this analysis was presented in [5] as parametric loop pipelining. When there exist memory conflicts that have to be resolved, the polyhedral analysis is further used to synthesize the pipeline break points. As demonstrated in Fig. 2, we can break the pipeline execution at the (m + 1)th iteration (i = m) to resolve RAW conflicts like those shown in Fig. 1(b); nevertheless, the next potential conflict will happen at the (2m + 1)th iteration (i = 2m). To keep the pipeline of Listing 1 as busy as possible, we need to halt its execution after every m iterations when the memory conflicts appear.
The strategy of breaking the pipeline execution can also optimize loops with nonuniform memory dependencies, which can appear in many applications, such as matrix decomposition and triangular matrix computation. In these applications, the critical loops have memory dependencies that are statically analyzable but vary with the value of the induction variable. An example of such a loop is shown in Listing 2. These loops can be optimized by loop splitting, first proposed in [6].
0278-0070 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
We implement the proposed optimization as a source-to-source code transformation applied before invoking a commercial HLS tool. The lightweight runtime throttle check and the pipeline breaks can be introduced, alongside appropriate loop-pipelining directives, to guide HLS to implement the desired pipeline architecture. Because it operates purely at the source level, our transformation is also flexible enough to be applied to different HLS tools. The rest of this paper explains how our optimization approach can be generalized and automated in a prototype flow.
In particular, we make the following contributions.
1) We formulate the problem with a general parametric polyhedral model, allowing us to precisely characterize the interiteration dependencies from both uncertain and nonuniform memory access patterns.
2) We develop an algorithm that can generate the conflict region of parameters, which is used as the runtime check to decide when a high-throughput pipelined schedule (i.e., loop pipelining with a low II) can be achieved without causing memory conflicts.
3) We develop a polyhedral transformation that realizes the efficient insertion of pipeline breaks for HLS.
4) We implement our entire optimization as a fully automated source-to-source code transformation framework, which is compatible with, and builds on top of, Vivado HLS. This tool is open-source in a public GitHub repository.
The remainder of this paper is organized as follows. Section II presents related work in HLS. Section III gives a motivational example and its analysis for loop pipelining with uncertain variables. Section IV describes the formulation of our parametric polyhedral analysis, transformation and its implementation details. Section V presents the benchmarks and experimental results, and conclusions are drawn in Section VI.
II. RELATED WORK
Loop pipelining, known in software compilers as "software pipelining," was originally designed for very long instruction word (VLIW) processor architectures [7]. When loops can be shown to be free of interiteration dependencies, the instructions from several iterations can be unrolled and interleaved to mitigate the impact of long intraiteration dependencies between instructions. The technique can ensure memory bandwidth is effectively utilized by keeping multiple memory operations in flight at once. For VLIW machines, independent operations can be scheduled on a fixed number of parallel computational units. Classical compilation techniques like iterative modulo scheduling [8] can find an effective time-space mapping to the fixed computational units within a processor.
A. Static Scheduling for Loop Pipelining
Where loop pipelining is applied in the context of an HLS tool, we have the additional freedom to select how many computational units we wish to implement. A tradeoff can be made between the number of dependent operations chained within a single clock cycle, and the minimum clock-period of an implementation. Zhang and Liu [9] proposed a sophisticated approach to exploit this, which captures the dependent operations and their associated latency, and models resource and clock frequency requirements. The "system of difference constraints" that they establish can be solved efficiently to explore a range of schedules achieving different area-time tradeoffs. The approach by Canis et al. [10] further improves the method by trying to reduce the latency between interiteration dependent memory accesses. Their recurrence minimization helps to increase the likelihood of achieving higher parallelism. In both of these works, the authors rely on knowing, at compile time, all the dependencies that exist between operations. Where parameters are uncertain and there is the possibility of loop-carried dependencies, their approaches must adopt a conservative schedule that assumes iterations contain recurrences. This paper overcomes this conservatism by selecting from different schedules at runtime, when the values of all parameters are known.
B. Polyhedral-Based Transformation for Loop Pipelining
Among other recent efforts to optimize loop pipelining for HLS, polyhedral analysis has frequently been used.
Morvan et al. [11] proposed a method using polyhedral analysis to improve nested loop pipelining. To overcome conflicts of memory dependencies in a pipeline, their approach first flattens the nested loop and then inserts wait states ("bubbles") to resolve memory conflicts. However, their bubble insertion requires that there is no conflict of memory dependencies in the innermost loop. Unlike their approach, our optimization can be applied at the innermost loop level, and it is developed for nested loops with uncertain and/or nonuniform memory dependencies. Li and Pouchet [12] introduced an index-set splitting technique on top of classical affine loop transformations [13] to improve inner-loop parallelism. The index-set splitting is used for nonuniform memory dependencies and limited memory ports. In the first case, their approach directly separates the innermost loop into several subloops according to its dependence patterns. Then, fast pipelining is applied on those subloops without any dependency. Similarly, in the second case, the parallel innermost loop is split into subloops according to different memory port conflict properties. The generated subloops can be pipelined at the inner loop with the best possible parallelism, so that the execution speed of the entire loop will not be limited by the worst property. However, our splitting technique is different from separating out loop iterations with irregular memory dependencies, because the purpose of our splitting is to insert the pipeline breaks for resolving memory conflicts when necessary. After our fine-grained transformation, we can apply fast pipelining on all the subloops split from the original loop.
C. Irregular Loop Pipelining
Besides regular loop structures, there are active HLS research efforts investigating pipelining for loops with irregular structures. Tan et al. [14] describe an approach called ElasticFlow to apply loop pipelining on a class of irregular loop nests. In their proposed pipeline architecture, multiple pipeline instances of a dynamic-bound inner loop are scheduled to execute in parallel, so this approach prefers no interiteration dependencies in the outer loops. This paper targets a different set of irregular loops from those of ElasticFlow, where we improve the pipeline parallelism by analyzing interiteration dependencies. Alle et al. [15] implemented a compilation method that transforms loops with dynamic data dependencies into specialized pipeline architectures. They add disambiguation logic in the hardware pipelines that can fully analyze the interiteration dependency at runtime. Dai et al. [16] proposed the integration of a template hazard resolution unit in HLS to resolve runtime conflicts on memory ports and data dependencies caused by indirect or conditional memory accesses. The pipeline is executed speculatively with a small II, and it will replay some iterations when a memory conflict is detected. For both [15] and [16], the hardware complexity of the runtime detection circuits is proportional to the number of dependent memory accesses and the depth of the pipeline stages. Although these techniques are also able to optimize our target loops, we apply more comprehensive static analysis to generate efficient and lightweight logic to control the pipeline execution at runtime.
D. Polyhedral Model for Memory Optimization
Polyhedral optimization has been widely studied as an optimization tool for modern software compilers [17], [18]. In recent years, it has also been applied for optimizing custom memory systems of loop programs in HLS research. Liu et al. [19] proposed a mathematical formalization and an algorithm to implement minimized on-chip data reuse buffers in FPGA designs. Considering SDRAM as the off-chip memory in a common FPGA system, Bayliss and Constantinides [20] implemented a polyhedral tool to produce address sequencers for an SDRAM interface to optimize off-chip memory bandwidth. Pouchet et al. [21] proposed another automated polyhedral HLS framework for better data reuse that combines loop transformations. Wang et al. [22] introduced a polyhedral theory and algorithm for generalized memory partitioning. These previous works apply polyhedral analysis to study memory reuse or partitioning problems for improving loop latency and parallelism. In this paper, we focus on developing a new parametric polyhedral analysis for both uncertain and nonuniform memory dependencies, enabling us to pursue aggressive loop pipelining that is not yet possible in modern HLS tools.
E. Extending the Polyhedral Model for Software Compilation
Polyhedral analysis and transformations can realize powerful optimizations because the underlying loop manipulation can be precisely expressed by algebraic representations. Unfortunately, using the polyhedral model also restricts the input program to be statically analyzable. To overcome the limitation of static analysis, there are several ways to extend the applicability of the polyhedral model. Benabderrahmane et al. [23] developed an approach that extends polyhedral expressibility by overapproximation. Predication statements are used to handle nonaffine conditionals and loop bounds, so that general programs can be converted into the polyhedral model for analysis and transformations. To eliminate pessimistic conservatism, some approaches leverage runtime information to enable dynamic exploitation of loop parallelism. Jimborean et al. [24] and Sukumaran-Rajam and Clauss [25] have developed comprehensive speculative execution frameworks, where polyhedral transformations are used to promote parallelism at runtime. Alternatively, Venkat et al. [26] proposed to use a dedicated inspector to support nonaffine transformations at runtime. Index arrays in loop bounds and memory accesses are represented by uninterpreted functions in static analysis, and both affine and nonaffine loop transformations are composed to increase loop parallelism effectively. More recently, loop versioning using polyhedral techniques has been shown to improve compiler-based loop transformations with low overhead. Doerfert et al. [27] have developed a framework to derive a minimized set of preconditions, so that a variety of complex loop transformations can be enabled at runtime according to the validation of these preconditions. Similarly, Sampaio et al. [28] proposed a quantifier elimination scheme that combines static test generation and runtime evaluation to trigger appropriate loop transformations.
III. MOTIVATION

A. Loop Pipelining
Loop pipelining is implemented by overlapping the execution of loop iterations. The logical operations within successive loops are mapped to hardware resources. The mapping must ensure that each hardware resource only executes one operation in each clock cycle. Where RAW loop-dependencies exist in the original code (a value is written in one iteration and read in a subsequent iteration), a static pipeline schedule must be constrained to preserve these dependencies. The constant interval between the start of successive iterations is called the II, and reflects the degree of parallelism, in the sense that for the same latency, a pipeline with smaller II has more iterations running in parallel at any given clock cycle.
If we denote the latencies of the operations executed before the loop body and of a single loop iteration by $L_{\mathrm{pre}}$ and $L_{\mathrm{iter}}$ respectively, and the loop trip count is N, then the latency of the entire loop is equal to

$$L_{\mathrm{pre}} + \mathit{II} \times (N - 1) + L_{\mathrm{iter}}. \tag{1}$$
When N is large enough, this latency is approximately equal to N × II. Therefore, the performance of a loop is mainly determined by its II. To achieve a small II for loop pipelining, HLS tools need to solve complex scheduling problems [9] , [10] . Unlike resource constraints that may vary with the requirements of different hardware implementations, iteration-dependency constraints are quite intrinsic. A complex dependency constraint could significantly constrain our ability to reduce the II of a loop pipeline.
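The latency argument can be checked numerically. The following is a minimal sketch that assumes the usual pipelined-loop model: the first iteration starts after the pre-loop latency, each later iteration starts II cycles after its predecessor, and the last iteration needs one full iteration latency to drain. The function and argument names are illustrative only.

```python
def pipelined_loop_latency(l_pre: int, l_iter: int, ii: int, n: int) -> int:
    """Cycles for n iterations under the assumed model L_pre + II*(n-1) + L_iter."""
    if n <= 0:
        return l_pre
    return l_pre + ii * (n - 1) + l_iter


def relative_error_of_n_times_ii(l_pre: int, l_iter: int, ii: int, n: int) -> float:
    """Relative gap between the N x II approximation and the modelled latency."""
    exact = pipelined_loop_latency(l_pre, l_iter, ii, n)
    return abs(exact - n * ii) / exact
```

Under this model, a three-cycle iteration with II = 1 and a sequential schedule with II = 3 differ by roughly a factor of three once N is large, which is why the II dominates loop performance.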
As an example, Fig. 3(a) and (b) illustrate two potential loop pipelining strategies with fixed IIs for the motivational loop shown in Listing 1. We still assume that the latency of each iteration is three clock cycles. Since there is an uncertain variable in the memory access pattern, the write reference of array A in the current iteration may be the same as the read reference of array A in a future overlapped iteration, which happens in Fig. 1(b). Due to this uncertain memory dependency, the best case of loop pipelining shown in Fig. 3(a) cannot be achieved with modern HLS tools. Current HLS tools will choose a conservative solution, in order to ensure correctness for all cases. We show this conservative solution in Fig. 3(b). Here, the possibility of a memory dependency between successive iterations prevents any overlap of iterations under a fixed II.
B. Memory Dependence Analysis
With our parametric polyhedral analysis in Section IV, we wish to formally describe the memory dependencies in a nested loop so that we can determine when and in which pattern memory conflicts may happen. Here, we present an intuitive illustration of the analysis process with the 1-D loop shown in Listing 1. This process will be generalized and formalized in Section IV.
To analyze the memory dependencies of a loop, we need to formally model the memory access sequence. These patterns are described by array indexing functions and loop bounds. For each separate array from the source code, we can form two sets of indexing functions, one containing all the read accesses and the other all the write accesses. The Cartesian product of these two sets is a set of paired indexing functions. Two paired accesses are dependent if and only if the address written in the current iteration will be read in a future iteration. The dependence iteration distance d(p, v) is the smallest number of iterations between the execution of two such dependent data accesses, which can be derived from their affine indexing functions. Since the dependence iteration distance may vary in our target loops, we can evaluate the conflict region of d(p, v), i.e., the values for which a read access would run before the completion of its dependent write access during the pipeline execution.
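As an illustration of how a dependence iteration distance follows from affine indexing functions, the sketch below handles only the 1-D case with one write and one read per iteration. The `(a, b)` encoding of an index expression `a*i + b` and the function name are assumptions made for this example, not part of the analysis described in this paper.

```python
from typing import Optional, Tuple

Affine = Tuple[int, int]  # (a, b) encodes the index expression a*i + b


def dependence_distance(i: int, n: int, write: Affine, read: Affine) -> Optional[int]:
    """RAW distance from write iteration i to the nearest later read of the
    same address, or None if no later iteration in [0, n) reads it."""
    (aw, bw), (ar, br) = write, read
    address = aw * i + bw
    if ar == 0:
        # Constant read index: the very next iteration reads that address.
        return 1 if (address == br and i + 1 < n) else None
    if (address - br) % ar != 0:
        return None  # no integer iteration reads this address
    j = (address - br) // ar  # the unique iteration whose read hits `address`
    return j - i if i < j < n else None
```

With a write to `A[i+m]` and a read of `A[i]`, the distance is m for every iteration (an uncertain but uniform dependency). With a write to `A[2*i]` and a read of `A[i]`, the distance equals i, so it varies between iterations (a nonuniform dependency).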
To analyze the memory dependency in Listing 1, we can obtain d(m, i) = m. According to the given loop scheduling, the latency L is the period when the execution of the dependent read access will violate the interiteration memory dependency. In other words, A[i+d(m, i)] cannot be any gray read access shown in Fig. 3(c). If the target initiation interval is II, there will be ⌈L/II⌉ iterations being processed in the pipeline during the latency L. Thus, we could derive the cases in which the dependent read access will be executed in this period under the current pipeline schedule. In these cases, d(m, i) will satisfy the conditions in (2), which denotes its conflict region

$$1 \le d(m, i) \le \left\lceil \frac{L}{\mathit{II}} \right\rceil - 1. \tag{2}$$
Intuitively, when d(m, i) does not satisfy these conditions, no memory conflicts will happen in the given pipeline schedule. There will be either no memory dependency between a write and a future read [such as Fig. 1(a) ] or enough iterations between them [such as Fig. 1(c) ]. According to Fig. 3(c) , we obtain the conflict region as 1 ≤ m ≤ 2 based on (2), where L = 3 and II = 1.
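The conflict region for the motivational loop can be reproduced with a few lines of code. This sketch assumes that a dependence is violated exactly when the dependent read is issued while the write is still in flight, i.e., when the distance lies between 1 and ceil(L/II) − 1. That reading is consistent with the worked result 1 ≤ m ≤ 2 for L = 3 and II = 1, but the helper names are hypothetical.

```python
def in_conflict_region(distance: int, latency: int, ii: int) -> bool:
    """True if a read `distance` iterations later starts before the write lands."""
    in_flight = -(-latency // ii)  # integer ceil(latency / ii)
    return 1 <= distance <= in_flight - 1


def conflict_values(candidates, latency: int, ii: int):
    """Subset of candidate distances that fall inside the conflict region."""
    return [d for d in candidates if in_conflict_region(d, latency, ii)]
```

For Listing 1, where d(m, i) = m, calling `conflict_values(range(6), 3, 1)` keeps only m = 1 and m = 2. Raising the II to the full latency (II = 3) empties the region, which corresponds to the conservative sequential schedule.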
C. Proposed Loop Transformation
In current HLS tools, only the worst case of uncertain and nonuniform memory dependencies is considered for loop pipelining. This leads a static pipeline schedule to have a large and conservative II. As illustrated in Listing 3, we propose a source-to-source code transformation, which will guide HLS tools to implement the pipeline as shown in Fig. 4 .
Before the loop starts, the conflict region is first evaluated by the if-condition derived from (2). These conditions will be synthesized into lightweight detection logic by HLS. The output of this detector will enable different pipeline execution modes. When the conflict region is not satisfied, the loop will be executed in the else-branch, which is realized as a pipeline with II = 1. Otherwise, the loop will be executed with pipeline breaks in the then-branch. The pipeline breaks are realized by inserting a loop dimension outside the original loop. The step size of the new outer loop is determined by the dependence iteration distance d(m, i) = m. The inner loop, which is the original loop, is also forced to be scheduled with II = 1. Like the runtime execution shown in Fig. 2, the split controller will still run the loop at a fast speed but pause the pipeline input after every m iterations are issued. The analysis can prove that there will be no memory conflict because the data written within the inner loop will be read only after the pipeline break.
In Fig. 4, the data paths of both execution modes are all statically scheduled with the smallest II. The related HLS directives (pragmas in Vivado HLS) are inserted in the real code. Their associated address generators (Addr Gen) are in charge of calculating array indices. Although the hardware logic appears to be duplicated in the pipeline body, the two copies are in different branches of an if-condition. We therefore let the HLS tools decide how to exploit resource sharing for better timing or less resource overhead in the physical implementation. This also keeps the transformation independent of any tool-specific code tuning for resource sharing, and flexible enough to support different HLS tools.
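The behaviour of the two execution modes can be mimicked with a toy cycle model. The sketch below is not the generated hardware or the code of Listing 3. It assumes that the loop body is `A[i+m] = A[i] + 0.5` (the listing itself is not reproduced here), that iterations are issued with II = 1, and that a written value becomes visible to reads issued `latency` cycles after the writing iteration. All function names are hypothetical.

```python
def run_sequential(a, m, n):
    """Reference semantics of the assumed loop body A[i+m] = A[i] + 0.5."""
    a = list(a)
    for i in range(n):
        a[i + m] = a[i] + 0.5
    return a


def run_pipelined(a, m, n, latency=3, block=None):
    """Toy II = 1 pipeline. If `block` is set, the pipeline input pauses
    after every `block` iterations until in-flight writes have landed."""
    a = list(a)
    in_flight = []  # (visible_cycle, index, value), kept in issue order
    cycle = 0
    for i in range(n):
        if block and i > 0 and i % block == 0:
            cycle += latency - 1  # pipeline break
        landed = [w for w in in_flight if w[0] <= cycle]
        in_flight = [w for w in in_flight if w[0] > cycle]
        for _, index, value in landed:
            a[index] = value
        # The read of A[i] happens at issue time; the write lands later.
        in_flight.append((cycle + latency, i + m, a[i] + 0.5))
        cycle += 1
    for _, index, value in in_flight:
        a[index] = value
    total_cycles = cycle - 1 + latency if n > 0 else 0
    return a, total_cycles


def run_transformed(a, m, n, latency=3):
    """Runtime check followed by the broken or the fast pipeline."""
    if 1 <= m <= latency - 1:  # conflict region for II = 1
        return run_pipelined(a, m, n, latency, block=m)
    return run_pipelined(a, m, n, latency)
```

In this model, the unguarded pipeline returns stale results for m = 1 and m = 2, whereas `run_transformed` matches the sequential reference for every m while still issuing m iterations back to back between breaks.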
IV. POLYHEDRAL OPTIMIZATION
In this section, we first give an introduction to our polyhedral model formulation in Section IV-A. After the introduction, we further formulate the parametric polyhedral analysis of the memory dependencies in Section IV-B. To generate the conflict region, we present an algorithm based on the analysis in Section IV-C. Then in Section IV-D, we propose a polyhedral transformation for loop splitting. Finally, a source-to-source code transformation framework is introduced in Section IV-E, which is created to integrate our optimization method into existing HLS tools.
A. Preliminaries
The input of our analysis is a nested loop with $d_I$ dimensions. In the previous sections, we use the 1-D loop shown in Listing 1 to illustrate the idea of our analysis and transformation. In modern HLS tools like Vivado HLS, loop flattening (also known as loop coalescing) is enabled before the pipeline scheduling by default [1]. This transforms a multidimensional loop into a single-dimensional loop, so that the entire nested loop can be pipelined to achieve better throughput than just pipelining the innermost loop. Therefore, we aim to optimize the pipelining of the entire nested loop that will be flattened in the HLS backend. The theoretical background of our parametric polyhedral analysis can be found in [29]. Beyond the previous work, our optimizations in this paper are developed with the specific use of parameter properties in the polyhedral model.
In a given nested loop, there are $N_{\mathrm{pair}}$ pairs of memory accesses visiting the same arrays. Our analysis is described below as capturing RAW memory conflicts, but can also support other memory dependencies. The undetermined variables in the memory accesses can be represented by a parameter vector $p \in P$, where $P \subseteq \mathbb{Z}^{d_P}$ represents potentially known ranges of these variables and $d_P$ is the number of undetermined variables. It is noteworthy that a parameter can also be an indirect array access whose index is not related to any induction variable of the given loop nest. If such an indirect array access can be profiled statically, we can obtain its $P$ to have a more accurate analysis. In this paper, we use the superscript $p$ to indicate a dependence on parameters.
Definition 1 (Iteration Domain): Given a $d_I$-dimensional loop nest, the iteration domain $D^p$ is a parametric set of iteration vectors $v \in \mathbb{Z}^{d_I}$, where $v$ is the vector of the $d_I$ induction variables. The set is defined by an inequality system that represents the bounds for all loop levels, with a rational coefficient matrix $A$ and a rational vector $b$.
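Definition 2 (Lexicographic Order): Given two iteration vectors $v = (v_0, \ldots, v_{d_I-1})$ and $v' = (v'_0, \ldots, v'_{d_I-1})$ in $D^p$, $v \prec v'$ holds if and only if

$$\exists\, i \in [0, d_I) : \; v_i < v'_i \;\wedge\; \forall\, j < i : v_j = v'_j .$$

This lexicographic order on $v$ represents the execution order of loop iterations, which means that $v'$ is executed after $v$ in the pipeline.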
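The lexicographic order can be written down directly. The sketch below is a plain illustration of the definition with hypothetical helper names. It also relies on the fact that Python compares tuples lexicographically, so sorting a list of iteration vectors yields the execution order of the loop nest.

```python
def lex_less(v, w):
    """True if iteration vector v is executed strictly before w."""
    assert len(v) == len(w)
    for vi, wi in zip(v, w):
        if vi != wi:
            # The outermost differing loop level decides the order.
            return vi < wi
    return False  # identical vectors are not strictly ordered


def execution_order(domain):
    """Iteration vectors sorted into lexicographic (execution) order."""
    return sorted(domain)
```

For a triangular nest with `0 <= i < 3` and `i <= j < 3`, `execution_order` returns the vectors in the same sequence in which the nested loop would visit them.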
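Definition 4 (Iteration Dependency Map): Given the kth pair of write and read accesses to the same array, the iteration dependency map $Q^p_k$ links write and read iterations in $D^p$: it contains the pairs $(v, v')$ with $v, v' \in D^p$ and $v \prec v'$ for which the write access in iteration $v$ and the read access in iteration $v'$ address the same array element. This equality between the two indexing functions is the equality constraint (3) of $Q^p_k$.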
An element in Q p indicates that two iterations access the same data element in D p , i.e., signifies the possible presence of an RAW memory dependence. For example, the equality constraint of the iteration dependency map in Listing 4 is 2 0 0 1
In practice, the equality constraint in Q p k may be piecewise affine when there are conditions in loop bounds or around memory accesses. For space reasons, we assume the equality constraint is always affine in this paper, but our implementation also supports the piece-wise case.
B. Parametric Polyhedral Analysis
1) Memory Conflicts in Loop Pipelining:
As mentioned in Section III, the loop transformation is affected by both interiteration memory dependencies and pipeline scheduling. The information about pipeline scheduling is assumed to be available for our analysis. To formally evaluate memory conflicts in loop pipelining, we need to determine which iterations will violate memory dependencies.
Definition 5 (Conflict Domain): Given an iteration dependency map $Q^p_k$, target initiation interval II and scheduled latency $L_k$ between the kth pair of dependent memory accesses, the conflict domain $S^p_k$ is a parametric set of iteration vectors in $D^p$ such that

$$S^p_k = \left\{ v \in D^p \;\middle|\; 1 \le I^p\!\left(Z^p_k(v)\right) - I^p(v) \le \left\lceil \frac{L_k}{\mathit{II}} \right\rceil - 1 \right\}$$

where

$$Z^p_k(v) = \operatorname{lexmin} \left\{ v' \mid (v, v') \in Q^p_k \right\} \tag{4}$$

which generates the lexicographically minimum point of $v'$ linked to $v$ based on the equality constraint (3) in $Q^p_k$, and

$$I^p(v) = \# \left\{ u \in D^p \mid u \prec v \right\} \tag{5}$$

which counts the number of iterations that are executed before the iteration $v$. In general, $I^p(v)$ can be expressed as a parametric pseudo-polynomial, also known as an Ehrhart polynomial, that counts the integer points of a parametric polytope [30]. Similar to (2), the inequality condition checks the existence of memory conflicts based on the given pipeline scheduling. As shown in Definition 5, the conflict domain $S^p_k$ includes the iterations that will violate memory dependencies when the nested loop is flattened and pipelined with the target II. To include the dependencies implied from all pairs of dependent accesses, the global conflict domain, $S^p = \bigcup_{k=1}^{N_{\mathrm{pair}}} S^p_k$, is the union of all local conflict domains.
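For small, fully known loop bounds, the conflict domain can be enumerated by brute force, which helps to see what the symbolic analysis has to capture. The sketch below is an assumption-laden illustration: it ranks iterations by their position in lexicographic order (playing the role of the iteration count), takes the lexicographically smallest sink of each write, and flags the write iteration if the sink is issued fewer than ceil(L/II) iterations later. The 2-D example nest and all names are hypothetical.

```python
from itertools import product


def conflict_domain(domain, write, read, latency, ii):
    """Write iterations whose nearest dependent read starts too early."""
    order = sorted(domain)                        # lexicographic execution order
    before = {v: k for k, v in enumerate(order)}  # iterations executed before v
    window = -(-latency // ii) - 1                # ceil(latency / ii) - 1
    conflicts = []
    for v in order:
        sinks = [w for w in order if w > v and read(w) == write(v)]
        if sinks:
            nearest = min(sinks)                  # lexicographically minimal sink
            if before[nearest] - before[v] <= window:
                conflicts.append(v)
    return conflicts


def row_shift_nest(n_rows, n_cols):
    """Hypothetical nest: for i, for j: A[i+1][j] = A[i][j] + 0.5."""
    domain = list(product(range(n_rows), range(n_cols)))
    write = lambda v: (v[0] + 1, v[1])
    read = lambda v: (v[0], v[1])
    return domain, write, read
```

In the flattened nest the dependent read is issued `n_cols` iterations after the write, so with a latency of three cycles and II = 1 the conflict domain is non-empty only when the inner trip count is at most two.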
2) Constructing the Conflict Domain:
In this paper, we use the integer set library (isl) [31] to construct and analyze the parametric polyhedral models. The general form of $I^p(v)$ is a parametric pseudo-polynomial, which is representable with isl. However, sophisticated analysis of parametric pseudo-polynomials is limited in isl. The difference $I^p(Z^p_k(v)) - I^p(v)$ calculates the smallest number of iterations between $v$ and $v'$. Morvan et al. [11] estimated the lower bound of counting iterations between $v$ and $Z^p_k(v)$ to check pipeline legality for nested loops, but this bound is reported to be not always tight and is not parametric. As a compromise, we limit $I^p(Z^p_k(v)) - I^p(v)$ to be an affine expression so that isl can be used to count iterations parametrically for sophisticated integer set analysis.
In our implementation, the memory dependencies incurred by $v, v' \in D^p$ lead (5) to count the integer points in a rectangular subset of $D^p$. Writing $\delta^p_{k,i}$ for the dependence difference of the kth access pair at loop level $i$, two cases are supported.
1) Rectangular Case: If no loop level has a trip count that varies with an outer loop, then there must be at most one dimension (say, $j$) with a parametric trip count, and the dependence difference at every dimension outside $j$ must be constant (i.e., $\forall\, 0 \le i < j$, $\delta^p_{k,i}$ is constant).
2) Nonrectangular Case: Otherwise, let $j$ be the innermost level with a trip count varying with some outer loops. Then every level inside $j$ must have a constant trip count, the dependence difference at level $j - 1$ must be 0 or 1 (i.e., $\delta^p_{k,j-1} \le 1$), and the dependence difference at every level outside $j - 1$ must be 0 (i.e., $\forall\, 0 \le i < j - 1$, $\delta^p_{k,i} = 0$).
Algorithm 1 (lines 7-9):
7: MAff_snk ← TakeMultiAff(Q^p)
8: MAff_src ← CreateMultiAff(v)
9: MAff_δ ← SubMultiAff(MAff_snk, MAff_src)
For unsupported cases, we can alternatively analyze the inner loops of the given nested loop. In the worst case, the innermost loop can always be analyzed by our approach, because a single loop level has no outer loop for its trip count to vary with.
C. Conflict Region Generation
Following the analysis in Section III-B, we generate the conflict region by constructing its complement set in $\mathbb{Z}^{d_P}$, i.e., the safe region. The safe region of the kth pair of dependent memory accesses is a set of parameter vectors $P_k \subseteq P$ such that

$$P_k = \left\{ p \in P \mid S^p_k = \emptyset \right\}$$

which includes all possible parameter values that make the conflict domain empty. The global safe region, $P_{\mathrm{safe}} = \bigcap_{k=1}^{N_{\mathrm{pair}}} P_k$, is the intersection of all local safe regions, and allows conflict-free pipelining of the entire loop nest.
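The relation between local and global safe regions can be illustrated over a small, explicitly enumerated parameter range. The sketch below assumes that each access pair has a uniform dependence distance given as a function of the parameter, and that a pair conflicts when this distance lies between 1 and ceil(L_k/II) − 1. Both the second access pair in the usage note and the helper names are hypothetical.

```python
def local_safe_region(params, distance, latency, ii):
    """Parameter values for which one access pair never conflicts."""
    window = -(-latency // ii) - 1  # ceil(latency / ii) - 1
    return {p for p in params if not (1 <= distance(p) <= window)}


def global_safe_region(params, pairs, ii):
    """Intersection of the local safe regions of all (distance, latency) pairs."""
    safe = set(params)
    for distance, latency in pairs:
        safe &= local_safe_region(params, distance, latency, ii)
    return safe
```

For a single pair with distance m and latency 3, the safe region over `range(8)` excludes only m = 1 and m = 2. Adding a second, hypothetical pair with distance m − 2 (for instance, a read of `A[i+2]`) also removes m = 3 and m = 4, so the runtime check must cover 1 ≤ m ≤ 4.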
The main algorithm for generating the conflict region for a given nested loop is described in Algorithm 1, where we simplify many operations of the isl library into abstract functions. This algorithm requires a given pipeline scheduling and a target II that is relatively small for the short execution time of the input loop. The iteration domain $D^p$ is also extracted from the loop beforehand. First, the global safe region $P_{\mathrm{safe}}$ is initialized as $\mathbb{Z}^{d_P}$ and further constrained as the algorithm progresses. Next, in the for-loop (lines 4-14), we analyze all pairs of possible dependent memory accesses labeled with k.
In lines 5 and 6, the iteration dependency map $Q^p_k$ is generated and simplified. Function ComputeFlow($w_k$, $r_k$, $D^p$) creates the equality constraint in (3), which maps the write access (source) to the read accesses (sink) visiting the same data point. Both write and read iterations should satisfy $v \in D^p \wedge v' \in D^p$. We only need to check the dependency of the read access in the earliest sink iterations. Therefore, function LexMin is applied to select the lexicographically minimal sink iteration, which is equivalent to $Z^p_k(v)$ in (4), so that a simplified dependency map is assigned to $Q^p$.
The dependence difference δ

Vivado HLS generates hardware architectures from the original and transformed C code, serving as the RTL generation back-end in our flow. The HLS tool is first used to synthesize the original loop without considering interiteration memory dependencies. The scheduling information for this pipeline is used for further analysis. Since Vivado HLS is a commercial tool, we can only use it as a black box, without detailed internal scheduling information. This also means that our approach can be applied to other RTL generation back-ends. Currently, the II achieved in the first synthesis is used as the target II in Algorithm 1. We also extract the pipeline latency achieved in the first synthesis and assign it to all $L_k$ in Algorithm 1, instead of the scheduled latency between the kth pair of write and read. This makes each $L_k$ an upper bound, so the generated conflict region is conservative and could potentially be tightened.
As shown in Fig. 6, the loop information is captured by two open-source tools. The Clang front-end parser [33] generates an abstract syntax tree (AST) from the input C code. The polyhedral extraction tool [34] uses isl to extract the loops as SCoPs from the Clang AST. Finally, the transformed C code is generated by PoTHoLeS [35]. PoTHoLeS is a polyhedral compilation tool developed by us, which conducts user-specified loop analysis, transformation, and code generation based on isl. This tool is available in a public GitHub repository.
In Fig. 7, we demonstrate the code transformation produced by our tool. The detection of the conflict region is realized by the outermost if-condition, which is generated by Algorithm 1. The fast loop in the else-case is the same as the fast execution shown in Listing 3. The then-case includes three subloops split from the original loop according to (7), which will be synthesized into a split controller as shown in Fig. 4. Because the conflict domain of the original loop is parameterized, the bounds of the subloops contain m and code macros like min() and max(). From the original loop, the dependence difference is correctly recognized as m+i. Memory conflicts are only related to the iterator i, and thus our tool splits the original loop at the outer loop dimension. According to (8), a new loop level is inserted in subloop 2 with an induction variable k, which realizes block-wise loop splitting. Since Vivado HLS cannot apply flattened pipelining to a nested loop with variable bounds, only the loop dimension inside the inserted one in subloop 2 can be pipelined. Therefore, we leverage this feature to implement block-wise loop pipelining.
In Vivado HLS, resource sharing can be forced by replacing the duplicated loop bodies with calls to the same function and disabling function inlining. Such a complementary transformation could effectively reduce some of the resource overhead, but it may also sacrifice sharing opportunities across the boundary of function calls. The design tradeoff of resource sharing is both application- and tool-specific, which is out of the scope of this paper, and we leave it for future investigation.
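The overall shape of such a transformed loop can be sketched in C for a deliberately simpler case than Fig. 7: a 1-D loop whose dependence distance is the uniform value m, rather than the nonuniform m+i. The constant `SAFE_DIST`, the function names, and the loop body are hypothetical, and the pipelining directives appear only as comments so that the sketch compiles with an ordinary C compiler. It is not the output of PoTHoLeS.

```c
#include <assert.h>

/* Assumed: smallest dependence distance at which the fast pipeline is safe. */
#define SAFE_DIST 14

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference loop. A must hold at least n + m floats, with m >= 0. */
void shift_add_ref(float *A, int n, int m, float c)
{
    for (int i = 0; i < n; i++) {
        A[i + m] = A[i] + c;
    }
}

/* Transformed loop: a runtime check selects between block-wise
 * execution (conflict region) and a single fast pipeline (safe region). */
void shift_add_transformed(float *A, int n, int m, float c)
{
    assert(n >= 0 && m >= 0);
    if (m >= 1 && m < SAFE_DIST) {
        /* Conflict region: blocks of m consecutive iterations touch
         * disjoint elements, so each block can be pipelined on its own.
         * Blocks execute one after another, which breaks the pipeline. */
        for (int k = 0; k < n; k += m) {
            int hi = min_int(k + m, n);
            for (int i = k; i < hi; i++) {
                /* HLS: pipeline this inner loop, ignoring
                 * inter-iteration dependences on A. */
                A[i + m] = A[i] + c;
            }
        }
    } else {
        /* Safe region: the whole loop runs as one fast pipeline. */
        for (int i = 0; i < n; i++) {
            A[i + m] = A[i] + c;
        }
    }
}
```

In software both functions compute the same result for any valid m. The difference only appears in hardware, where the inner loops can be scheduled without the memory dependence on A.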
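A minimal sketch of this rewriting is given below, assuming a hypothetical loop body `body()` and two subloops produced by splitting at an index `split`. In Vivado HLS the function would carry the `inline off` directive; it is shown as a comment here so that the sketch stays plain C. Whether the tool then instantiates a single shared module depends on the tool and the design.

```c
#include <assert.h>

/* Shared loop body (hypothetical). In Vivado HLS, placing
 *   #pragma HLS inline off
 * inside this function keeps it as a separate module instead of
 * duplicating its operators at every call site. */
float body(float x, float c)
{
    return x * c + 1.0f;
}

/* Two subloops split from one original loop; both call the same function. */
void two_subloops(float *A, int n, int split, float c)
{
    assert(0 <= split && split <= n);
    for (int i = 0; i < split; i++) {
        A[i] = body(A[i], c);
    }
    for (int i = split; i < n; i++) {
        A[i] = body(A[i], c);
    }
}
```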
V. EXPERIMENTAL RESULTS
A. Experimental Setup
In this paper, our code transformation framework uses Xilinx Vivado HLS 2017.2 as the RTL generation backend. The target FPGA device is a Virtex 7 XC7VX485T. In all experiments, the target clock period is set to 3 ns, which is expected to produce a balanced tradeoff between clock speed and resource usage. We export the generated RTL code to Xilinx Vivado Design Suite 2017.2 to collect clock and resource usage results after RTL synthesis, placement, and routing. Furthermore, all generated pipelines are tested by C/RTL cosimulation with dedicated testbenches to confirm functional equivalence with the original code.
B. Benchmarks
We choose eight benchmark loops used in our previous works [5], [6] for our experimental study. These benchmarks reflect some typical uncertain and nonuniform memory dependencies, which are usually not covered in a full benchmark suite like Polybench [36]. All memory arrays contain single-precision floating-point numbers. All uncertain variables are int values, i.e., they lie between INT_MIN and INT_MAX as defined in <limits.h>. The source code of the benchmarks, the testbenches, and their transformations used in the experiments is available in a public GitHub repository. The details of the benchmark loops are summarized in Table I. tri_sp_slv is a 1-D loop obtained from a triangular sparse matrix solver, which has one undetermined iteration causing a memory conflict. adi_int is a 2-D loop from Kernel 8 in the Livermore benchmark suite [38]. floyd_warshall is a 3-D loop for finding shortest paths from Polybench [36], which has one fixed iteration causing a memory conflict in its innermost loop.
C. Performance Improvement
As shown in Table I, the various memory dependence patterns of the benchmarks lead to different optimization strategies being applied in the transformation. One special case is found in pivot, where the analysis guarantees that the loop can always be executed in the fast pipeline. In addition, the conflict dimension q of adi_int is at the innermost loop, where $\delta^p_{k,q}$ is found to be 1, and thus it is not necessary to split this loop in its conflict region. Table II provides the detailed results of pipeline performance. In this table, columns with the title "Orig" indicate characteristics of the original pipeline and columns with the title "Tran" indicate characteristics of the transformed pipeline implementation. Columns with the title "Fast" indicate the pipeline performance achieved when the generated lightweight checks determine that lower IIs are safe. Columns with the title "Split" indicate the performance when pipeline breaks have to be inserted to avoid memory conflicts. Furthermore, "Pre-Loop Cycles" represents the number of cycles executed before the start of the loop body ($L_{\text{pre}}$) and "Iteration Cycles" represents the number of cycles for one loop iteration ($L_{\text{iter}}$).
Following the architecture of the transformed pipeline shown in Fig. 4, the detector logic should be executed before the start of the loop body. These additional operations are observed to double $L_{\text{pre}}$ on average, but they only cost a few cycles to finish. This indicates that the detector logic is lightweight. $L_{\text{iter}}$ is also slightly increased to support higher parallelism during the pipeline scheduling. Since the latency of executing the entire loop is calculated by (1), the impact of $L_{\text{pre}}$ and $L_{\text{iter}}$ is often negligible, especially when the loop body has a large trip count N.
After our proposed transformation, almost all the nested loops or subloops can be safely pipelined by the HLS backend tool without considering any interiteration dependency. Across our benchmarks, this allows HLS scheduling to achieve much smaller IIs, ranging from just 1 to 3 cycles in the fast mode. These achieved IIs lead to 10× higher peak performance of the transformed pipelines. To evaluate the runtime performance of the benchmark loops in their conflict region, we measured loop execution latency with further experiments in RTL cosimulation. For each loop with uncertain memory accesses, we generated 100 test cases with random parameter values that were ensured to be within the conflict region. The random tests already cover all combinations of the parameters in the conflict region. For each test with each benchmark, we also collected the corresponding loop trip count and execution latency in clock cycles. The tested trip counts are also summarized in Table I. The average cycles per iteration in Table II show a 3.7× speed-up of the pipeline throughput in the conflict region.
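The following C sketch illustrates why the pre-loop and per-iteration cycles matter little for long loops. Equation (1) is not reproduced in this section, so the sketch assumes the usual latency model of a pipelined loop, L_pre + II × (N − 1) + L_iter. The function name and all numbers in the usage note are invented for illustration and are not measurements from Table II.

```c
#include <assert.h>

/* Total cycles of a pipelined loop under the assumed model:
 *   l_pre   cycles before the loop body starts,
 *   ii      initiation interval between consecutive iterations,
 *   l_iter  cycles for one iteration to complete,
 *   n       trip count. */
long pipeline_cycles(long l_pre, long l_iter, long ii, long n)
{
    assert(l_pre >= 0 && l_iter >= 1 && ii >= 1 && n >= 1);
    return l_pre + ii * (n - 1) + l_iter;
}
```

For example, with II = 1, L_iter = 20, and N = 10000, doubling L_pre from 2 to 4 cycles changes the total from 10021 to 10023 cycles under this model, whereas lowering II from 10 to 1 at L_pre = 4 reduces it from 100014 to 10023 cycles.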
D. Analysis of Runtime Performance
According to Table I, when the loops are split by conflict domain ($S^p_{\text{conf}}$), the transformed loops have an average cycles/iteration close to II, as shown in Table II. In particular, the second subloops of tri_sp_slv and floyd_warshall (v) ) in these loops. The transformed tri_sp_slv has a relatively large cycles/iteration due to its undetermined loop bounds. When the trip count is too small, the runtime performance of the transformed pipeline cannot benefit from the improved parallelism.
For dist_param and row_col, the entire loop can be treated as subloop 2. In these benchmarks, only splitting by dependence difference is applied, and pipeline performance changes with the parameters determined at runtime. We further evaluated the runtime performance of dist_param to illustrate the speed-up of this splitting stage. Fig. 8 compares two types of pipeline architectures, where our polyhedral-based dynamic loop pipeline is denoted by PolyDLP. We also collected the maximal performance of dist_param at different values of m, which is plotted as a dashed line. Each point of the maximal performance is collected from the pipeline synthesized from the loop whose m is replaced by a constant value. The behavior of these pipelines is only correct for their specific values of m, so their performance represents the upper bound of the runtime parallelism. The conflict region of dist_param is 1 ≤ m ≤ 13, where the transformed pipeline has dynamic breaks inserted at runtime. When m is 1, all iterations have to be executed sequentially. Due to the extra operations added to support PolyDLP, the transformed pipeline is slower than the original one only in this case. When m becomes larger, fewer break points are inserted in the pipeline execution, and its runtime performance becomes much closer to the maximal one. When the loop is executed in the fast mode (where m = 0 or m ≥ 14), the transformed pipeline can achieve the maximal performance as expected. Therefore, our static analysis and transformation allow the pipeline to dynamically adjust its throughput to avoid any memory conflict.
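The trend described above can be reproduced with a back-of-the-envelope model, sketched below in C. It assumes that, in the conflict region, the loop runs in blocks of m iterations, that each block is pipelined with initiation interval II, and that the pipeline drains completely between blocks. The function name and this cost model are illustrative assumptions. The model ignores the pre-loop cycles and any controller overhead, and it is not the measurement method behind Fig. 8.

```c
#include <assert.h>

/* Estimated cycles for n iterations executed in blocks of m iterations.
 * Each block of b iterations costs ii * (b - 1) + l_iter cycles, and
 * blocks execute back to back (assumed model, overheads ignored). */
long blockwise_cycles(long n, long m, long ii, long l_iter)
{
    assert(n >= 0 && m >= 1 && ii >= 1 && l_iter >= 1);
    long full = n / m;   /* number of complete blocks          */
    long rest = n % m;   /* iterations in the final short block */
    long cycles = full * (ii * (m - 1) + l_iter);
    if (rest > 0) {
        cycles += ii * (rest - 1) + l_iter;
    }
    return cycles;
}
```

With m = 1 the estimate degenerates to n × L_iter, i.e., fully sequential execution, and it approaches II × (n − 1) + L_iter as m grows toward n, which matches the qualitative behavior reported for dist_param.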
E. Timing and Resource Overhead
As shown in Table III , our transformation has very little impact on the achievable clock period, but it generally increases the hardware resource usage to achieve higher parallelism. We also evaluate the design choice of the highest pipeline parallelism, which is obtained by synthesizing the original loop without considering any interiteration dependency. Its results are shown under the columns with the title "HP," which helps us to better understand the effect of resource sharing. After our transformation, the average increase of lookup tables (LUTs), flip-flops (FFs), and DSP blocks is 87%, 73%, and 14%, respectively. However, resource overhead is still less significant than performance improvement even in the conflict region, as witnessed by a 45% average reduction of the area-time product.
Due to the higher parallelism achieved after the transformation, more operations are required to work at the same time in the pipeline bodies shown in Fig. 4. First, besides the detector logic and the more complex finite state machine, the increase of LUTs and FFs is mainly caused by the unshared address generators. This addressing logic mainly consists of integer arithmetic operators, such as adders and multipliers. Their implementation in our relatively small benchmarks causes little resource pressure for modern high-density FPGAs. Thus, the HLS backend tends to duplicate these operators across mutually exclusive conditionals in favor of using fewer multiplexers for less routing complexity. In order to eliminate some unnecessary duplication, resource constraints on integer multipliers have been added in dist_itr_param and tri_sp_slv, which do not affect other resources or timing. Second, the resource sharing between the floating-point data paths is well supported by the HLS backend. This can be observed from the small difference in DSP usage between HP and Tran, and thus the increase of DSP blocks is mainly due to the higher parallelism of the data path.
VI. CONCLUSION
In this paper, we proposed a new optimization method for a class of loops with uncertain and nonuniform memory dependencies. The method combines compiler-based analysis and runtime optimization. The optimized pipelines execute the loop iterations as fast as possible when safe conditions are detected; otherwise, pipeline breaks are inserted at runtime. We formulate a general parametric polyhedral analysis and transformation for resolving complex memory conflicts in these pipelines. A source-to-source code transformation framework is prototyped for evaluating our proposed optimizations. With experiments over a suite of benchmarks, we show that the transformed pipelines can achieve a 3.7-10× speed-up with a reasonable resource overhead. In future work, we intend to lift the restriction to affine expressions in the analysis, allowing better support for indirect memory accesses. Furthermore, the static pipeline scheduling for HLS can be co-optimized with our techniques to minimize the resource overhead and further improve the performance.
