Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from software design are no longer sufficient to implement high-performance codes, due to fundamental differences between software and hardware architectures. In this work, we propose a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures with little off-chip data movement. To quantify the effect of our transformations, we use them to optimize a set of high-throughput FPGA kernels, demonstrating that they are sufficient to scale up parallelism within the hardware constraints of the target device. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS.
MOTIVATION
Since the recent end of Dennard scaling, when the power consumption of digital circuits stopped scaling with their size, compute devices have become increasingly limited by their power consumption [80]. In fact, the shrinking feature size even increases the loss in the metal layers of modern microchips. The load/store architectures in use today suffer mostly from the cost of data movement and addressing [30]. Other approaches, such as dataflow architectures, have not been widely successful, due to the varying granularity of applications [22]. However, application-specific dataflow can be used to lay out memory, such as registers and buffers, to fit the specific structure of the computation and minimize data movement. Reconfigurable architectures, such as FPGAs, can be used to implement application-specific dataflow [10, 65, 71], but they are too hard to program [6]. Traditional hardware design languages, such as VHDL and Verilog, do not benefit from the rich set of software engineering techniques that improve programmer productivity and code reliability. For these reasons, the community is beginning to embrace hardware development techniques based on traditional procedural languages such as C or C++. These tools and languages are commonly called high-level synthesis (HLS) [13, 45]. In this way, HLS bridges the gap between hardware and software development, and enables a basic degree of performance portability implemented in the compilation systems. For example, HLS programmers do not have to worry about how exactly a floating point operation is implemented on the target hardware. For the same source code, a good compiler will generate the necessary circuit when compiled for an Intel Stratix V FPGA, and will transparently use optimized floating point cores when compiled for a Stratix 10. However, compiler optimizations are fundamentally limited. Numerous HLS systems [46, 48] synthesize hardware designs from C/C++ [11, 24, 32, 47, 54, 85], OpenCL [52, 72, 82], and others [3, 4, 23, 28, 49]. All in all, HLS provides a viable pathway for the software and hardware communities to meet and address each other's concerns.
For many applications, compute performance is a primary goal, which is achieved through careful tuning by highly specialized performance engineers. To guide these engineers, optimizing transformations for CPU [5] and GPU [61] are well understood. For HLS, a comparable collection of guidelines and principles for code optimization has yet to be established. Optimizing codes for hardware implementations is drastically different from optimizing codes for a fixed architecture. In fact, the optimization space is larger, because it contains the known software optimizations and, in addition, programmers can change the microarchitecture and design application-specific circuits in HLS. Thus, the established set of transformations is not sufficient, because it does not consider aspects of optimized hardware design, such as pipelining.
In this work, we define a set of optimizing transformations that compilers or performance engineers can apply in order to improve the performance of hardware layouts generated from HLS codes. For this, we discuss how code transformations known from tuning for fixed hardware apply to HLS. Furthermore, we introduce a set of optimizing transformations at the HLS level that generate pipelined hardware layouts with optimized buffer distributions. We show that these key transformations mainly aim at laying out the buffers into an application-specific dataflow architecture that efficiently uses the available distributed storage and computation.
Key transformations for high-level synthesis
We propose a set of optimizing transformations that are fundamental to designing scalable and efficient hardware kernels in HLS. These transformations are often composed of multiple basic source code transformations, such as strip-mining and loop interchange, that achieve the desired patterns, and we will list these when relevant. We divide them into four categories, as given below:
Pipeline-enabling transformations:
(1) Transposition: resolve loop-carried dependencies by transposing the iteration space.

Scalability transformations:
(1) Vectorization: single instruction, multiple data (SIMD) parallelization.
(2) Replication: increase the amount of compute logic to scale up performance without spending additional bandwidth, by exploiting on-chip memory.
(3) Streaming dataflow: partition the kernel into multiple processing elements to separate scheduling, improve placement and routing results, and optimize memory performance.
(4) Tiling: fit large domain sizes into available fast memory.

Secondary transformations:
(1) Memory access extraction: extract memory accesses from computations, allowing them to be optimized separately.
(2) Memory oversubscription: amortize bandwidth from nondeterministic data sources by accessing memory at a higher rate than required by the kernel.
(3) Memory striping: stripe memory onto multiple banks to multiply access bandwidth.
(4) Type demotion: demote to cheaper data types when allowed by precision requirements.

Software transformations: traditional software transformations that apply directly to HLS.
We will show how transformations can be applied manually by a performance engineer by directly modifying the source code, giving examples before and after a transformation is applied, but many are also amenable to automation in an optimizing compiler. Before diving into the transformations, however, we establish the metrics for performance in a pipelined design, which serve as the targets of optimization in the following.
Basics of pipelining
Pipelining is the essence of efficient hardware architectures. The primary advantage of custom hardware over fixed architectures is that expensive instruction decoding and data movement between memory, caches and registers can be avoided, by sending data directly from one computational unit to the next. We quantify pipeline performance using two primary characteristics, described below.
• Latency (L): the number of cycles it takes for an input to propagate through the pipeline and arrive at the exit, i.e., the number of pipeline stages. For a directed acyclic graph of dependencies between computations, this is the critical path.
• Initiation interval or gap (I): the number of cycles that must pass before a new input can be accepted to the pipeline. A perfect pipeline has I = 1 cycle, as this is required to keep all stages in the pipeline busy. Consequently, the initiation interval can be considered the inverse throughput of the pipeline; e.g., I = 2 cycles implies that the pipeline stalls every second cycle, reducing the throughput of all pipeline stages by a factor of 1/2. To quantify the importance of pipelining, we consider the number of cycles C it takes to execute a pipeline with latency L (both in [cycles]), taking N inputs, with an initiation interval of I [cycles], assuming a reliable producer and consumer at either end, which is exactly:

C = L + I · (N − 1)  [cycles].  (1)
The time to execute all N iterations with clock rate f [cycles/s] of this pipeline is thus C/f . By formulating our program as a pipeline, optimization can be condensed to three primary goals:
(A) Perfect pipelining: achieve I = 1 cycle for all essential components, i.e., ensure that all pipelines run at maximum throughput. (B) Scaling/folding: fold N by scaling up the parallelism of the design, thus cutting the total number of pipeline iterations required to execute the program. (C) Saturation: saturate pipelines for the majority of the runtime to avoid stalls.
The rest of this paper is organized as follows. Section 2 will cover transformations that enable (A), and Section 3 covers transformations that achieve (B). Together, these make up the core of hardware optimization, as all these transformations will apply to nearly every HPC program. Section 4 covers transformations that contribute to (C), as well as more situational optimizations. Section 5 covers the relationship between well-known software optimizations and HLS, and accounts for which of these apply directly to HLS code. Finally, Section 7 includes performance results for a selection of kernels optimized using the transformations presented here, and we conclude in Section 8.
PIPELINE-ENABLING TRANSFORMATIONS
This category of transformations covers detecting and resolving issues that prevent pipelining of computations. When analyzing a basic block of a program, the HLS tool determines the dependency graph between computations, and pipelines operations accordingly to achieve the target initiation interval. There are two classes of problems that hinder this process:
(1) Interface contention (intra-iteration): a hardware resource with limited ports is accessed multiple times in the same iteration of a loop. This could be a FIFO queue or RAM block that only allows a single read and write per cycle, or an interface to external memory, which only supports sending one request per cycle. (2) Loop-carried dependency (inter-iteration): an iteration of a pipelined loop depends on a result produced by a previous iteration. If the latency of the operations producing this result is L, the minimum initiation interval of the pipeline will be L. For each of the following transformations we will give examples of programs exhibiting properties that prevent them from being pipelined, and how the given transformation can resolve this.
All examples use C++ syntax, which allows objects (in particular FIFO buffers) and templating. We perform pipelining and unrolling using a pragma-based syntax, where loop-oriented pragmas always refer to the following loop/scope, which is the convention used by Intel/Altera HLS tools (as opposed to applying to the current scope, which is the convention for Xilinx HLS tools).
Iteration space transposition
For multi-dimensional iteration spaces, loop-carried dependencies arising from accumulation can often be resolved by reordering the loops, adding additional buffers to store intermediate results. This also affects the memory access pattern, which can significantly impact memory performance. We will see these effects by applying the transformation to a concrete example.
Consider the matrix multiplication code in Listing 1a, computing C = A · B + C, with matrix dimensions N, M, and P. The inner loop m ∈ M accumulates into a temporary register, which is written back to C at the end of each iteration p ∈ P. The multiplication of elements of A and B can be pipelined, but the addition on Line 8 requires the result of the addition in the previous iteration of the loop, resulting in an initiation interval of L_+, where L_+ is the latency of an addition for the given data type (for integers L_+ = 1 cycle, and the loop can be pipelined without further modifications). To avoid this, we can transpose the iteration space, swapping the P-loop with the M-loop, with the following consequences:
• Rather than a single register, we now require an accumulation buffer of depth P and width 1.
• The loop-carried dependency is resolved, as we only update each location every P cycles.
• A, B, and C are all read in a contiguous fashion, achieving perfect spatial locality (we assume row-major memory layout; for column-major, we would interchange the P-loop and N-loop).
• Elements from A are only read once per iteration of the M-loop.
The modified code is shown in Listing 1b. We leave the accumulation buffer defined on Line 2 uninitialized, and implicitly reset it on Line 7, avoiding P extra cycles to reset.
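As an illustration of the transposed structure, a minimal sketch could look as follows (the bounds, variable names, and pragma syntax are illustrative, and may differ from the actual Listing 1b):

constexpr int N = 512, M = 512, P = 512;

void MatMulTransposed(const float A[N][M], const float B[M][P], float C[N][P]) {
  for (int n = 0; n < N; ++n) {
    float acc[P];  // accumulation buffer of depth P, left uninitialized
    for (int m = 0; m < M; ++m) {
      const float a = A[n][m];  // each element of A is read once per M-iteration
      #pragma PIPELINE  // illustrative pragma applying to the following loop
      for (int p = 0; p < P; ++p) {
        // the carried dependency on acc[p] now spans P cycles, so I = 1 is possible;
        // on the first pass (m == 0), the buffer is implicitly reset from C
        const float prev = (m == 0) ? C[n][p] : acc[p];
        acc[p] = prev + a * B[m][p];
      }
    }
    for (int p = 0; p < P; ++p)
      C[n][p] = acc[p];  // write back one row of the result
  }
}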
Accumulation interleaving
For loop-carried dependencies on an accumulation variable where it is undesirable to transpose the full iteration space, we can interleave accumulations to resolve the dependency, either by 1) partially folding an outer loop, or by 2) accumulating partial sums, then collapsing them in a separate module. We distinguish between the two cases below.
Nested accumulation interleaving.
For accumulations done in a nested loop, we can resolve loop-carried dependencies due to accumulation by pipelining across multiple instances of the outer loop, using a buffer to store intermediate results.
Listing 2 shows this transformation on an N-body simulation code. We strip-mine the outer loop by a factor K ≥ L_acc, where L_acc is the latency of the accumulation operation (in this case, double precision addition), and absorb it into the inner loop. This allows I = 1 cycle by interleaving the accumulation of K instances of the outer loop in parallel, at the cost of a saturation and drain phase, and a buffer of depth K. This has the additional benefit of reducing memory bandwidth usage, as every external particle loaded is reused K times, cutting the total memory transferred by a factor of K.
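A minimal sketch of this pattern follows, with an assumed Vec type and a placeholder Force function standing in for the N-body physics, and illustrative values for N and K:

struct Vec { float x, y, z; };
Vec Add(Vec a, Vec b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec Force(Vec a, Vec b) { return {b.x - a.x, b.y - a.y, b.z - a.z}; }  // placeholder physics

constexpr int N = 16128;
constexpr int K = 16;  // K >= L_acc, the latency of the accumulation

void ComputeForces(const Vec pos[N], Vec force[N]) {
  for (int i = 0; i < N / K; ++i) {    // outer loop strip-mined by K
    Vec acc[K] = {};                   // partial accumulators for K resident bodies
    for (int j = 0; j < N; ++j) {
      const Vec other = pos[j];        // every loaded particle is reused K times
      #pragma PIPELINE  // in practice the j- and k-loops are coalesced (Section 2.6),
                        // so the pipeline is not drained for every j
      for (int k = 0; k < K; ++k) {
        // each acc[k] is only updated every K cycles, hiding the floating
        // point addition latency and allowing I = 1
        acc[k] = Add(acc[k], Force(pos[i * K + k], other));
      }
    }
    for (int k = 0; k < K; ++k)
      force[i * K + k] = acc[k];       // write out the K finished accumulations
  }
}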
Single-loop accumulation interleaving.
If no outer loop is present, we have to perform the accumulation in two separate stages, at the cost of extra resources. For the first stage, we perform a transformation similar to the nested accumulation interleaving, but strip-mine the inner (and only) loop into blocks of size K ≥ L_acc, accumulating partial results into a buffer of depth K. On the last pass over the partial results, values will be streamed to the second phase (for more on streaming, see Section 3.3). The second phase is responsible for collapsing the partial results, and must be pipelined with an initiation interval less than or equal to the total number of iterations of the first phase to avoid pipeline stalls. For large input sizes, a single additional reduction unit thus suffices.
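A sketch of the two-stage structure could look as follows, using an assumed FIFO channel class with Push/Pop operations (see Section 3.3) to stream the partial results to the collapse stage; the blocking semantics and the choice of K are illustrative:

#include <cstddef>

template <typename T>
struct FIFO {   // assumed channel abstraction with blocking semantics
  void Push(T val);  // implementation provided by the tool or a vendor library
  T Pop();
};

constexpr int K = 16;  // K >= L_acc

// Stage 1: accumulate partial sums into a buffer of depth K.
// Assumes n is a multiple of K and n >= K.
void PartialSums(const float in[], size_t n, FIFO<float> &to_collapse) {
  float partial[K] = {};
  #pragma PIPELINE
  for (size_t i = 0; i < n; ++i) {
    const int k = i % K;              // rotate through the K partial sums
    const float sum = partial[k] + in[i];
    partial[k] = sum;
    if (i >= n - K)                   // last pass: stream the partial results out
      to_collapse.Push(sum);
  }
}

// Stage 2: collapse the K partial results into the final sum. This stage only
// needs to keep up with one value per K iterations of the first stage.
void Collapse(FIFO<float> &from_partial, float &result) {
  float res = 0;
  for (int k = 0; k < K; ++k)
    res += from_partial.Pop();
  result = res;
}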
It is important to note that native accumulation units, if available, should be favored over either method due to higher resource efficiency (e.g., a single-adder floating point accumulator [9]).
Listing 3. Pipeline across multiple inputs to maximize throughput despite loop-carried dependency.
Cross-input accumulation interleaving
For algorithms with loop-carried dependencies (e.g., due to a non-commutative reduction), we can still maintain high throughput by pipelining across multiple inputs to the algorithm. This procedure is similar to the interleaving done in Section 2.2, but requires altering the behavior of the program to accept multiple elements that can be interleaved.
The code in Listing 3a shows an iterative solver code with an intrinsic loop-carried dependency on state, with a minimum initiation interval corresponding to the latency L_Step of the (inlined) function Step. There are no loops to interchange, and we cannot change the order of loop iterations due to the carried dependency. While there is no way to improve the latency of producing a single result, we can improve the overall throughput of the circuit by a factor of L_Step by pipelining across N ≥ L_Step different inputs, i.e., overlap solving for different starting conditions. This effectively corresponds to injecting another loop over inputs, then performing transposition or nested accumulation interleaving with the inner loop. The result of this transformation is shown in Listing 3b.
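A sketch of the interleaved solver: rather than iterating a single state to completion, N independent states are kept resident and advanced in a rotating fashion (Vec, the placeholder Step function, and N are illustrative):

constexpr int N = 16;  // N >= L_Step, the latency of one Step evaluation

struct Vec { float x, y; };
Vec Step(Vec s) { return {0.5f * s.x + s.y, s.y}; }  // placeholder update rule

void IterSolverInterleaved(const Vec initial[N], Vec out[N], int T) {
  Vec state[N];
  for (int i = 0; i < N; ++i)
    state[i] = initial[i];
  #pragma PIPELINE  // the dependency on state[i] now spans N cycles, so a new
                    // (t, i) pair can enter the pipeline every cycle
  for (int t = 0; t < T; ++t)
    for (int i = 0; i < N; ++i)
      state[i] = Step(state[i]);
  for (int i = 0; i < N; ++i)
    out[i] = state[i];
}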
Inlining
In order to successfully pipeline a code section, all function calls within must be absorbed into the pipeline. The simplest way to achieve this is inlining, which instantiates the called function as dedicated hardware as part of the pipeline. As a preprocessing step, this transformation is no different from the software equivalent and is handled transparently by most compilers when possible, but results in additional hardware being generated for every inlined function call. Inlining is thus desirable in all contexts that don't otherwise allow significant reuse of hardware resources. We implicitly assumed inlining in Listing 2, for example when assigning vectors on Line 5, when performing vector addition on Line 9, or when calling the Force function, also on Line 9. Both the member functions and the free function call must thus be inlinable, as well as pipelineable in the inlined context.
Cyclic buffering
When iterating over regular domains in a pipelined fashion, it is often sufficient to express buffering patterns using cyclic FIFO buffers. A common set of applications that adhere to this pattern are stencil applications such as partial differential equation solvers [19, 66, 70], image processing pipelines [29, 59], and convolutions in deep neural networks [7, 38], all of which are typically traversed using a sliding window buffer. These applications have been shown to be a good fit for FPGA architectures [20, 21, 33, 50, 51, 76, 87], as FIFO buffers (also referred to as just "FIFOs") are natively supported, either as shift registers or RAM blocks configured as FIFOs.
Opportunities for cyclic buffering often arise naturally from transforming programs to a pipelineable state. If we consider the transposed matrix multiplication code in Listing 1b, we notice that the read from acc on Line 7 and the write on Line 8 are both sequential, and cyclical with a period of P cycles. We can therefore substitute the array with a FIFO buffer of depth P, replacing the read and write with the FIFO queue operations Pop and Push, respectively. The resulting code is shown in Listing 4. The same transformation can be applied to the accumulation codes in Listings 2b and 3b.
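Expressed with an assumed FIFO class exposing Push and Pop, the transposed loop nest could take the following form (a sketch of the pattern described, not the exact Listing 4):

template <typename T>
struct FIFO {   // assumed cyclic buffer abstraction
  void Push(T val);  // implementation provided by the tool or a vendor library
  T Pop();
};

constexpr int N = 512, M = 512, P = 512;

void MatMulFIFO(const float A[N][M], const float B[M][P], float C[N][P]) {
  FIFO<float> acc;   // FIFO of depth P replacing the accumulation array
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m) {
      const float a = A[n][m];
      #pragma PIPELINE
      for (int p = 0; p < P; ++p) {
        const float prev = (m == 0) ? C[n][p] : acc.Pop();  // sequential, cyclic read
        const float next = prev + a * B[m][p];
        if (m < M - 1)
          acc.Push(next);  // recirculate the partial result for the next m-iteration
        else
          C[n][p] = next;  // last pass: write back instead of recirculating
      }
    }
}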
Listing 5 shows two examples of applying cyclic buffering to simple sliding window stencil code, namely a 2D Jacobi stencil, which updates each point on a 2D grid to the average of its four neighbors: north, west, east and south. To achieve perfect data reuse, we buffer every element read in sequential order from memory until it has been used for the last time: after processing two rows (illustrated in Figure 1 ), when the same value has been used as all four neighbors.
In Listing 5a we explicitly instantiate two FIFO line buffers on lines 1-2. We only read the south element from memory in each iteration of the stencil (Line 8), which we store in a FIFO buffer (Line 13). This element is then reused after M cycles, when it is used as the east value (Line 10), shifted in registers for two cycles until it is used as the west value (Line 14), after which it is pushed to the north buffer (Line 13), and reused for the last time after M cycles on Line 9. This scheme is illustrated in Figure 1 . For more detail we refer to other works on the subject [15, 76] .
Listing 5b includes an alternative pattern to express a sliding window buffering scheme in HLS. Rather than explicitly creating the FIFOs and registers required to propagate the values, a single array is used, which is shifted by one element every cycle using unrolling (Line 14). The compute elements access elements of this array directly, relying on the tool to infer the partitioning into FIFOs and registers (loop idiom recognition [5] ) that we did explicitly in Listing 5a. While this method is less verbose, its implicit nature makes it more tool-dependent, as it can compile to inefficient hardware if the pattern is not recognized.
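A sketch of this second, shift-register style variant for the 4-point Jacobi stencil (bounds, boundary handling, and pragma syntax are illustrative):

constexpr int N = 8192, M = 8192;

void Jacobi(const float in[], float out[]) {
  float window[2 * M + 1];  // two rows plus one element of the domain
  #pragma PIPELINE
  for (long i = 0; i < long(N) * M; ++i) {
    #pragma UNROLL  // shift the window by one element every cycle; the tool is
                    // relied upon to partition this into FIFOs and registers
    for (int j = 0; j < 2 * M; ++j)
      window[j] = window[j + 1];
    window[2 * M] = in[i];          // only the south element is read from memory
    const float north = window[0];
    const float west = window[M - 1];
    const float east = window[M + 1];
    const float south = window[2 * M];
    if (i >= 2L * M)                // skip until the window is full; domain
      out[i - M] = 0.25f * (north + west + east + south);  // boundaries omitted
  }
}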
Pipelined loop flattening/coalescing
To minimize the number of cycles spent in saturating and draining pipelines (i.e., not streaming at full throughput), we can flatten nested loops. A pipelined loop has a saturation, streaming, and drain phase, with the total number of cycles given by Equation 1. Listing 6a shows a code with two nested loops, along with the total number of cycles to execute the program. The drain phase of the inner loop must be paid on every iteration of the outer loop; for applications where N_1 is comparable to L_1, even if N_0 is large, this means that the drain of the inner pipeline can significantly impact performance. By coalescing the two loops into a single loop (shown in Listing 6b), the next iteration of the outer loop can be executed immediately after the previous one finishes, leaving only a combined draining phase of L_0 + L_1 cycles at the end of the program.
To perform the transformation in Listing 6, we had to absorb any code present after each execution of the inner loop (Line 5 in Listing 6a) into the coalesced loop, adding a loop guard (Line 4 in Listing 6b). This contrasts with the loop peeling transformation, which is used by CPU compilers to regularize loops to avoid branch mispredictions and increase amenability to vectorization. While loop peeling can also be beneficial in hardware, e.g., by avoiding deep conditional logic in a pipeline, small inner loops can see a significant performance improvement by eliminating the draining phase. It should additionally be noted that the modulo operation used in the loop guard is amenable to strength reduction, e.g., for values of N_0 that are a power of two, where it reduces to a binary AND. Alternatively, the more intrusive transformation of re-introducing individual loop counters for each iteration variable present before the flattening (an example of such code is given in Section 4.1) will also preserve the desired pipeline properties.
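A sketch of the transformation on a simple example, where the inner loop is followed by trailing code that must be absorbed into the coalesced loop via a loop guard (bounds and the loop body are illustrative):

constexpr int N0 = 1024, N1 = 32;

// Before: the inner pipeline is saturated and drained N0 times.
void Nested(const float in[N0 * N1], float out[N0 * N1]) {
  for (int i = 0; i < N0; ++i) {
    #pragma PIPELINE
    for (int j = 0; j < N1; ++j)
      out[i * N1 + j] = 2.f * in[i * N1 + j];
    out[i * N1] += 1.f;             // code executed after each inner loop
  }
}

// After: a single coalesced pipeline, drained only once at the end.
void Coalesced(const float in[N0 * N1], float out[N0 * N1]) {
  #pragma PIPELINE
  for (int ij = 0; ij < N0 * N1; ++ij) {
    out[ij] = 2.f * in[ij];
    if (ij % N1 == N1 - 1)          // loop guard absorbing the trailing code;
      out[ij - (N1 - 1)] += 1.f;    // % reduces to an AND when N1 is a power of two
  }
}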
Pipelined loop fusion
We can exploit fine-grained dependencies between consecutive loops to fuse them into a single pipeline using loop guards. This transformation is closely related to loop fusion [36] from software optimization. For two consecutive loops with latencies/bounds {L_0, N_0} and {L_1, N_1}, respectively, that are both pipelined with initiation interval I, the total runtime according to Equation 1 is (L_0 + I · (N_0 − 1)) + (L_1 + I · (N_1 − 1)) cycles. If we can fuse the two loops without breaking dependencies between them, this can be reduced to a single pipeline over both iteration spaces with only one drain phase, i.e., max(L_0, L_1) + I · (N_0 + N_1 − 1) cycles.
Listing 7 shows an example of pipeline fusion applied to the GEMM code from Listing 9, fusing both the buffering of A and the write back to C into the inner loop, using loop guards and exploiting the fine-grained dependencies between the three loops. In addition to saving clock cycles, the code now constitutes a perfect loop nest, and can be coalesced similarly to Listing 6. An alternative way of performing pipeline fusion is to instantiate each stage as a separate processing element, and stream fine-grained dependencies between them (Section 3.3).
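The general pattern of fusing two consecutive pipelined loops with a guard can be sketched as follows (a generic illustration of the technique, not the exact Listing 7):

constexpr int N0 = 512, N1 = 4096;

void FusedPipelines(const float coeff_in[N0], const float x[N1], float y[N1]) {
  float coeff[N0];
  #pragma PIPELINE
  for (int i = 0; i < N0 + N1; ++i) {  // N0 + N1 iterations, only one drain phase
    if (i < N0) {
      coeff[i] = coeff_in[i];          // first phase: buffer the coefficients
    } else {
      // second phase: the fine-grained dependency on coeff[] is respected,
      // since all writes complete before the first read
      y[i - N0] = coeff[(i - N0) % N0] * x[i - N0];
    }
  }
}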
SCALABILITY TRANSFORMATIONS
Parallelism in HLS revolves around the folding of loop nests, which is achieved through unrolling. In Sections 2.1 and 2.2, we used strip-mining and reordering to avoid loop-carried dependencies by changing the schedule of computations in the pipelined loop nest. In this section, we similarly strip-mine and reorder loops, but additionally unroll the strip-mined chunks. Pipelined loops constitute the iteration space, the size of which determines the number of cycles it takes to execute the program. Unrolled loops, in a pipelined program, correspond to the degree of parallelism in the program, as every expression in an unrolled statement is required to exist as hardware. We can thus move nested loop iterations from the sequential schedule into the parallel schedule. This corresponds to folding the sequential iteration space, as the number of cycles taken to execute the program is effectively reduced by the inverse of the unrolling factor.
Vectorization
We implement SIMD parallelism with HLS by partially unrolling loop nests in pipelined sections. This is the most straightforward way of folding our iteration space to obtain parallelism, as it can often be applied directly to the inner loop, without further reordering.
Listing 8 shows two functionally equivalent ways of vectorizing a loop over N elements by a factor of W: Listing 8a strip-mines a loop into chunks of the vector size and unrolls the chunk, while Listing 8b uses partial unrolling by specifying the unroll factor. OpenCL additionally includes built-in vector types, such as float4, float8, and int16, which similarly replicate registers and compute logic by the specified factor, but with less flexibility in choosing the vector type and length. The vectorization factor W [operand/cycle] is constrained by the available bandwidth B [Byte/s] to external memory according to

W ≤ B / (f · S),
where f [cycle/s] is the clock frequency of the vectorized logic and S [Byte/operand] is the operand size. While vectorization is a straightforward way of parallelization, it is bottlenecked by external memory bandwidth, and is thus not sufficient to achieve a scalable design. Furthermore, because the energy cost of I/O is orders of magnitude higher than moving data on the chip, it is desirable to exploit on-chip memory and pipeline parallelism instead (this follows in Sections 3.2 and 3.3).
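The two variants can be sketched as follows (the pragma spellings are illustrative; Intel and Xilinx tools each have their own unrolling syntax):

constexpr int N = 1024, W = 8;  // W bounded by B / (f * S)

// Variant 1: explicit strip-mining into chunks of W, unrolling the chunk.
void ScaleStripMined(const float in[N], float out[N]) {
  #pragma PIPELINE
  for (int i = 0; i < N / W; ++i) {
    #pragma UNROLL
    for (int w = 0; w < W; ++w)
      out[i * W + w] = 2.f * in[i * W + w];  // W operations issued per cycle
  }
}

// Variant 2: partial unrolling of the original loop by a factor of W.
void ScalePartialUnroll(const float in[N], float out[N]) {
  #pragma PIPELINE
  #pragma UNROLL_FACTOR(8)  // tool-specific partial unrolling directive
  for (int i = 0; i < N; ++i)
    out[i] = 2.f * in[i];
}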
Replication
We can achieve scalable parallelism in HLS without relying on memory bandwidth by exploiting data reuse, distributing input elements to multiple computational units replicated through unrolling. This is the most potent source of parallelism on hardware architectures, as it can conceptually scale indefinitely with available silicon. Viewed from the paradigm of cached architectures, the opportunity for this transformation arises from temporal locality in loops. Replication draws bandwidth from on-chip fast memory by buffering elements temporally, combining each new element loaded from external memory with multiple buffered elements, allowing more computational units to run in parallel at the expense of buffer space. This is distinct from vectorization, which requires us to widen the data path that passes through the processing elements.
To demonstrate this process, we will look at how this can be done for the GEMM code from Listing 1. In Section 2.1, we saw that reordering loops allowed us to move reads from matrix A out of the inner loop, re-using the loaded value P times for P streamed columns of matrix B. To obtain the final result, every column of A is combined with every row of B. If we consider that every loaded value of B will contribute to all N rows of A, we realize that we can perform more computations in parallel by keeping multiple values of A in local registers. By buffering K elements of A prior to streaming the full B-matrix, we can fold the outer loop over rows by a factor of K, using unrolling to multiply the amount of compute (as well as buffer space required for the partial sums), by a factor of K. The result of this transformation is shown in Listing 9.
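A sketch of the replicated kernel (assuming N is a multiple of K; names, bounds, and pragmas are illustrative):

constexpr int N = 512, M = 512, P = 512, K = 32;

void MatMulReplicated(const float A[N][M], const float B[M][P], float C[N][P]) {
  for (int n = 0; n < N; n += K) {     // outer loop folded by a factor of K
    float acc[K][P];                   // partial sums, scaled by K
    for (int m = 0; m < M; ++m) {
      float a_buffer[K];               // K elements of A kept in registers
      for (int k = 0; k < K; ++k)
        a_buffer[k] = A[n + k][m];
      #pragma PIPELINE
      for (int p = 0; p < P; ++p) {
        const float b = B[m][p];       // each loaded element of B feeds K units
        #pragma UNROLL                 // K multiply-add units instantiated
        for (int k = 0; k < K; ++k) {
          const float prev = (m == 0) ? C[n + k][p] : acc[k][p];
          const float next = prev + a_buffer[k] * b;
          acc[k][p] = next;
          if (m == M - 1)
            C[n + k][p] = next;        // write back on the last m-iteration
        }
      }
    }
  }
}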
Streaming dataflow
For complex codes it is common to partition functionality into multiple modules or processing elements (PEs), streaming data between them through explicit interfaces according to the dataflow between them. In contrast to conventional pipelining, PEs arranged in a streaming dataflow architecture are scheduled separately. There are multiple benefits to this:
• Different functionality runs at different schedules. For example, issuing, performing, and servicing memory requests require different pipelines, state machines, and even clock rates.
Fig. 3. Fold the time dimension of an iterative stencil by streaming across replicated processing elements.
• Modularity and testing: smaller components are easier to reuse, debug and verify.
• Synchronization, such as pipeline stalls, only need to propagate within the PE.
• Large fanout/fanin is challenging to translate into hardware (1-to-N/N-to-1 connections for large N). This can be resolved by partitioning components into smaller subparts, thus adding more pipeline stages to the design.
• The effort to schedule loops increases with the number of statements that need to be considered for the dependency and pipelining analysis. Scheduling logic in smaller chunks can be beneficial for both runtime and result.
To move data between PEs, channels with a handshaking mechanism are used. These data channels double as synchronization points, as they imply a consensus on the program state. In practice, channels are (with the exception of I/O) always FIFO interfaces, and support standard queue operations Push, Pop, and optionally Empty/Full, and Size operations. For higher depth requirements, channels can occupy the same resources as regular FIFO buffers (see Section 2.5).
Mapping from code to PEs differs slightly between tools, but is manifested when functions are connected using channels. In the following, we will use the syntax from Xilinx Vivado HLS to instantiate PEs, where each non-inlined function corresponds to a PE, and these are connected by channels that are passed as arguments to the functions from a top-level entry function. In Intel OpenCL, this is instead expressed as multiple __kernel functions, which are connected by global channel objects prefixed with the channel keyword. To see how streaming can be an important tool to express scalable hardware, we apply it in conjunction with replication (Section 3.2) to implement an iterative version of the stencil example from Listing 5. Unlike the GEMM code, the stencil code has no scalable source of parallelism in the spatial dimension. Instead, we can fold the outer time-loop to treat B_T timesteps in parallel, each computed by distinct PEs connected via channels [21, 62], as illustrated in Figure 3. We replace the memory interfaces to the PE with channels, such that the memory accesses on lines 8 and 11 become Pop and Push operations, respectively. The resulting code is shown in Listing 10a. We then use unrolling to make B_T replications of the PE, effectively increasing the throughput of the kernel by a factor of B_T, and consequently reducing the runtime by folding the outermost loop by a factor of B_T, shown in Listing 10b. Such architectures are sometimes referred to as systolic arrays [37, 44].
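A sketch of how the replicated PEs could be wired up with channels (the Channel class, the PE signature, and the memory modules are assumed abstractions; the PE body follows the sliding window scheme of Section 2.5 with array accesses replaced by Pop/Push):

template <typename T>
struct Channel {   // assumed FIFO channel with blocking Push/Pop
  void Push(T val);  // implementation provided by the tool or a vendor library
  T Pop();
};

constexpr int B_T = 4;  // number of timesteps computed in flight (B_T >= 2)

// One processing element computing a single timestep of the stencil; its body
// is the sliding-window pipeline of Section 2.5, reading with Pop and writing
// with Push instead of accessing memory (implementation omitted here).
void JacobiPE(Channel<float> &in, Channel<float> &out);

// Top-level module: a chain of B_T replicated PEs connected by channels,
// fed by a memory reader and drained by a memory writer (both omitted).
void JacobiTop(Channel<float> &from_memory, Channel<float> &to_memory) {
  Channel<float> stages[B_T - 1];       // channels between consecutive PEs
  JacobiPE(from_memory, stages[0]);
  for (int t = 1; t < B_T - 1; ++t)     // unrolled: instantiates the middle PEs
    JacobiPE(stages[t - 1], stages[t]);
  JacobiPE(stages[B_T - 2], to_memory);
}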
For platforms/HLS tools where large fanout is an issue, the principle of streaming between replicated PEs can also be applied to the GEMM example from Listing 9. We can move the K-fold unroll out of the PE code and replicate the entire PE instead, again replacing reads and writes with channel accesses. B is then streamed into the first PE, and passed downstream every cycle. A and C should no longer be accessed by every PE, but rather be handed downstream similar to B, requiring a careful implementation of the drain and saturation phases, where the behavior of each PE will vary with its depth in the sequence.
Tiling
Loop tiling in HLS is commonly used to fold arbitrarily large problem sizes into chunks that fit into fast on-chip memory, in an already pipelined program. This contrasts loop tiling on CPU and GPU, where tiling is used to make a working program faster, rather than making a fast program work for large domains. Common for both paradigms is that they ultimately aim to meet fast memory constraints. As with vectorization and replication, tiling relies on strip-mining loops to gain useful properties by altering the iteration space.
As an example, consider the GEMM code from Listing 9. The buffer on Line 8 is required to pipeline the inner loop, but increases in size with P (columns of B). Because of this, the code cannot support arbitrarily large matrices. Similar to the loop on Line 1, we can strip-mine the P-loop on Line 6 by a factor B_P and move it outside the M-loop, reducing the buffer size to K · B_P, which is independent of the matrix dimensions. B_P can be as small as the latency L_+ of the addition used to accumulate, without re-introducing a loop-carried dependency.
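Continuing the replicated GEMM sketch from Section 3.2, tiling the P-loop could look as follows (B_P and the other constants are illustrative; P is assumed to be a multiple of B_P):

constexpr int N = 512, M = 512, P = 65536;   // P can now grow arbitrarily
constexpr int K = 32, B_P = 64;              // B_P >= L_+ to keep I = 1

void MatMulTiled(const float A[N][M], const float B[M][P], float C[N][P]) {
  for (int n = 0; n < N; n += K) {
    for (int tp = 0; tp < P; tp += B_P) {    // strip-mined P-loop moved outward
      float acc[K][B_P];                     // buffer size independent of P
      for (int m = 0; m < M; ++m) {
        float a_buffer[K];
        for (int k = 0; k < K; ++k)
          a_buffer[k] = A[n + k][m];
        #pragma PIPELINE
        for (int p = 0; p < B_P; ++p) {
          const float b = B[m][tp + p];
          #pragma UNROLL
          for (int k = 0; k < K; ++k) {
            const float prev = (m == 0) ? C[n + k][tp + p] : acc[k][p];
            const float next = prev + a_buffer[k] * b;
            acc[k][p] = next;
            if (m == M - 1)
              C[n + k][tp + p] = next;
          }
        }
      }
    }
  }
}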
OTHER TRANSFORMATIONS
Once a design has been pipelined and scaled up to the desired degree of parallelism and hardware resource consumption, we can perform a number of additional optimizations to tune the design further. The transformations covered in this section are more situational and/or more amenable to compiler automation than the previous two classes, but are important to consider for maximizing pipeline, bandwidth and clock frequency results.
Condition flattening
Flattening the depth of combinational logic due to conditional statements can improve timing results for pipelined sections. Conditional statements in a pipelined section that depend on a loop variable must be evaluated in a single cycle (i.e., they cannot be pipelined), and are thus sensitive to the latency of these operations.
Listing 11a shows an example of computing nested indices in a two-dimensional iteration space, similar to how a loop is executed in software: the iterator of the inner loop is incremented until it exceeds the loop bounds, at which point the loop is terminated, and the iterator is incremented for the outer loop. This requires two integer additions and two comparisons to be executed before the final value of j is propagated to a register, where it will be read the following clock cycle to compute the next index. Because we know that i and j will always exceed their loop bounds in the final iteration, we can remove the additions from the critical path by bounds-checking the iterators before incrementing them, shown in Listing 11b. Note that these semantics differ from the software loop at termination, as the iterator is not incremented to the out-of-bounds value before terminating.
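The two styles of index computation can be sketched as follows for a flattened two-dimensional iteration space (a generic illustration, not the exact Listing 11):

constexpr int N0 = 1024, N1 = 1024;

// Software-style update: two additions and two comparisons lie on the critical
// path before the new value of j can be written to its register.
void IndexSoftwareStyle(int &i, int &j) {
  if (++j == N1) {
    j = 0;
    if (++i == N0)
      i = 0;
  }
}

// Flattened variant: bounds are checked against the current values, removing
// the additions from the critical path. Note that the iterators are never
// incremented to their out-of-bounds values.
void IndexFlattened(int &i, int &j) {
  if (j == N1 - 1) {
    j = 0;
    i = (i == N0 - 1) ? 0 : (i + 1);
  } else {
    ++j;
  }
}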
Memory access extraction
By extracting accesses to external memory from the computational logic, we enable the two aspects to be pipelined and optimized separately. Accessing the same interface multiple times within the same pipelined section is a common cause for increased initiation interval due to interface contention, since the interface can only service a single request per cycle. In many cases, such as for independent reads, this is not an intrinsic memory bandwidth or latency constraint, but arises from the tool scheduling iterations according to program order. This can be relaxed when allowed by inter-iteration dependencies (this can in many cases be determined automatically, e.g., using polyhedral analysis [25] ).
In Listing 12a, the same memory is accessed twice in the inner loop, preventing pipelining due to interface contention on A. By inserting buffered streams A_0 and A_1 of depth M, we can alternate between reading each section of A, allowing the HLS tool to infer burst accesses to A of length M, shown in Listing 12c. Since the schedules of memory and computational modules are independent, ReadA can run ahead of PE by up to 2M iterations, ensuring that memory is always read at the maximum bandwidth of the interface. From the point of view of the computational PE, both A_0 and A_1 are read in parallel, as shown on Line 6 in Listing 12b, hiding initialization time and inconsistent memory producers in the synchronization implied by the data streams.
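A sketch of the extracted memory module and the consuming PE, using the same assumed Channel abstraction as in Section 3.3 (the access pattern is an illustration of the scheme described for Listing 12, not its exact code):

template <typename T>
struct Channel {   // assumed FIFO channel with blocking Push/Pop
  void Push(T val);  // implementation provided by the tool or a vendor library
  T Pop();
};

constexpr int N = 1024, M = 1024;

// Memory module: alternates burst reads of length M from the two halves of A,
// and can run ahead of the compute module by up to the channel depth (M each).
void ReadA(const int A[2 * N * M], Channel<int> &a0, Channel<int> &a1) {
  for (int n = 0; n < N; ++n) {
    #pragma PIPELINE
    for (int m = 0; m < M; ++m)
      a0.Push(A[n * M + m]);            // burst from the first half
    #pragma PIPELINE
    for (int m = 0; m < M; ++m)
      a1.Push(A[(N + n) * M + m]);      // burst from the second half
  }
}

// Compute module: reads both halves in parallel from the channels, so the
// single external memory interface is no longer accessed twice per iteration.
void PE(Channel<int> &a0, Channel<int> &a1, int out[N * M]) {
  for (int n = 0; n < N; ++n) {
    #pragma PIPELINE
    for (int m = 0; m < M; ++m)
      out[n * M + m] = a0.Pop() + a1.Pop();
  }
}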
A second use case for memory access extraction is to perform data layout transformations in fast memory, such as transposing column-wise burst reads into a row-wise stream. Such a transformation could be applied after tiling the GEMM code in Listing 9, reading in a full tile of A and streaming it to the kernel in column-major order.
Memory oversubscription
When dealing with nondeterministic memory interfaces such as DRAM, it can be beneficial to request accesses at a more aggressive pace than what is consumed or produced by the computational elements. This can be done by reading ahead into a deep buffer instantiated between memory and computations, by either 1) accessing wider vectors from memory than required by the kernel, narrowing or widening data paths when piping to and from computational elements, respectively, or 2) increasing the clock rate of modules accessing memory with respect to the computational elements.
The memory access function in Listing 12c allows long bursts on the interface of A, but receives the data on a narrow bus, at W · S_int = (1 · 4) Byte/cycle. In general, this limits the bandwidth consumption to f · W · S at frequency f, which is likely to be less than what the external memory can provide. To better exploit the bandwidth, we can either read wider vectors (increase W) or clock the circuit at a higher rate (increase f). The former consumes more resources, as additional logic is required to widen and narrow the data path, but the latter is more likely to be constrained by timing closure on the device.
Memory striping
When multiple memory banks with dedicated channels (e.g., multiple DRAM modules) are available, the bandwidth at which a single array is accessed can be increased by a factor corresponding to the number of available interfaces by striping it across the banks. This optimization is commonly known from RAID configurations.
We can perform striping explicitly in HLS by inserting modules that join or split data streams from two or more memory interfaces. Reading can be implemented with two asynchronous memory modules requesting memory from a mapped interface, then pushing to FIFO buffers that are read in parallel and combined by a third module, or vice versa for writing, exposing a single data stream to the computational kernel.
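A sketch of explicit two-way striping with one reader per bank and a combiner that exposes a single, wider stream to the kernel (the Channel and Pair types are assumed abstractions):

template <typename T>
struct Channel {   // assumed FIFO channel with blocking Push/Pop
  void Push(T val);  // implementation provided by the tool or a vendor library
  T Pop();
};

struct Pair { float a, b; };  // two elements per cycle on the kernel-facing stream

// One reader per DDR bank, each mapped to its own physical memory interface.
void ReadBank(const float bank[], int count, Channel<float> &out) {
  #pragma PIPELINE
  for (int i = 0; i < count; ++i)
    out.Push(bank[i]);
}

// Combine the two streams: one element is popped from each bank every cycle,
// doubling the effective bandwidth seen by the computational kernel.
void CombineBanks(Channel<float> &bank0, Channel<float> &bank1,
                  Channel<Pair> &to_kernel, int count) {
  #pragma PIPELINE
  for (int i = 0; i < count; ++i)
    to_kernel.Push({bank0.Pop(), bank1.Pop()});
}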
Type demotion
We can reduce resource and energy consumption, bandwidth requirements and operation latency by demoting data types to less expensive alternatives that still meet precision requirements. In particular, this can lead to significant improvements on architectures that are specialized for certain data types, such as FPGAs, which have traditionally been optimized for integer and fixed point computations. Because integer/fixed point and floating point computations on these architectures compete for the same reconfigurable logic, using a data type with lower resource requirements increases the total number of arithmetic operations that can be instantiated on the device. While reduced energy consumption from using lower precision operations or integer operations over floating point operations is a benefit in general, other benefits of type demotion, namely area usage, bandwidth requirement and operational latency, vary greatly in effectiveness depending on the target architecture and the application bottleneck. The largest benefits are seen in the following three scenarios:
• In a compute bound scenario, the data type can be changed to a type that occupies less of the same resources. This in particular applies to FPGAs, that traditionally implement floating point operations using general purpose resources such as LUTs, FFs and DSPs.
• In a compute bound scenario, the data type can be moved to a type that is natively supported by the target architecture, such as 16 bit integers on Xilinx' 7 series DSP blocks [31] , or single-precision floating point on Intel's Arria 10 and Stratix 10 devices [64] .
• In a bandwidth bound scenario, performance can be improved by up to the same factor that the size of the data type can be reduced by.
• In a latency bound application, the data type can be reduced to a lower latency operation, such as from floating point, which requires multiple pipeline stages, to an integer type, which can typically be evaluated in a single cycle. In the most extreme case, it has been shown that collapsing the data type of weights and activations in deep neural networks to binary [7, 14, 74] can provide sufficient speedup for inference that the loss of precision can be made up for with the increase in the number of weights.
SOFTWARE TRANSFORMATIONS IN HLS
In addition to the transformations described in the sections above, we include a comprehensive overview of well-known software transformations and how they apply to HLS. We base this on the compiler transformations compiled by Bacon et al. [5] . The transformations are split into the following tables:
• Table 1 describes transformations that are essential components of the transformations presented in this paper, and notes how they relate.
• Table 2 lists transformations that apply to HLS in the same way that they apply to software.
• Additional transformations that we deemed to have little or no relevance to HLS, due to fundamental differences between software and hardware paradigms, are included in Appendix A.

It is interesting to note that the majority of well-known transformations from software apply to HLS. This implies that we can leverage much of the decades of research into high-performance computing transformations to also optimize hardware programs, including many that can be applied directly (i.e., without further adaptation to HLS) to the imperative source code or intermediate representation before synthesizing for hardware (in particular, the transformations from loop-based strength reduction through scalar replacement in Table 2). Despite not receiving much attention in this paper, we stress the importance of support for these pre-hardware generation transformations in HLS compilers, as they lay the foundation for the hardware-specific transformations proposed here.

Loop interchange [2, 36]: Used to resolve loop-carried dependencies throughout Section 2.
Strip-mining [77], loop tiling [36, 40], cycle shrinking [56]: Central components of many HLS transformations, including accumulation interleaving (Section 2.2), vectorization (Section 3.1), replication (Section 3.2), and tiling (Section 3.4).
Loop distribution/fission [35, 36]: Useful for separating differently scheduled computations to allow pipelining (see Section 3.3).
Loop fusion [36, 79, 83]: Used for merging pipelines (see Section 2.7).
Loop unrolling [18]: Essential tool for scaling up performance by generating more computational hardware (Sections 3.1 and 3.2).
Software pipelining [39]: Used by the HLS tool to schedule loop bodies according to the interdependencies of operations.
Loop coalescing/flattening [55], loop collapsing: Used to save pipeline drains in nested loops (Section 2.6).
Reduction recognition: Prevents loop-carried dependencies in accumulation codes (Sections 2.1 and 2.3).
Loop idiom recognition: Relevant for HLS backends, for example to recognize sliding-window buffers (Section 2.5) in Intel OpenCL [72].
Procedure inlining: Required to pipeline code sections with function calls (Section 2.4).
Procedure cloning: Every occurrence of a function is always specialized to all variables that can be statically inferred.
Loop unswitching [17]: Often the opposite is beneficial (see Sections 2.6 and 2.7).
Loop peeling: Often the opposite is beneficial, to allow coalescing (Section 2.6).
Graph partitioning: Streaming is central to hardware algorithms (Section 3.3).
SIMD transformations: Covered in Section 3.1.
Table 1. Software transformations that relate directly to the proposed HLS transformations.
RELATED WORK
Much work has been done in optimizing C/C++/OpenCL HLS codes for FPGA, such as stencils [33, 75, 76, 78, 87], deep neural networks [69, 74, 84], matrix multiplication [16, 75], and Smith-Waterman protein sequencing [60, 63]. These works optimize the respective applications using cyclic buffering, vectorization, replication, and streaming, which we describe as general transformations here. Zohouri et al. [86] use the Rodinia benchmark to evaluate the performance of OpenCL codes on FPGA, employing optimizations such as SIMD vectorization, sliding-window buffering, accumulation interleaving, and compute unit replication across multiple kernels. We present a general description of a superset of these transformations, along with concrete code examples that show how they are applied in practice. Kastner et al. [34] go through the implementation of many HLS codes in Vivado HLS, focusing on algorithmic optimizations for FPGA, and apply some of the transformations found here. Lloyd et al. [43] describe optimizations specific to Intel OpenCL, and include a variant of memory access extraction, as well as the single-loop variant of accumulation interleaving.
Loop-based strength reduction [8, 12, 68]: Benefits from eliminating code are larger, as this results in less generated hardware.
Induction variable elimination [1]
Unreachable code elimination [1]
Useless-code elimination [1]
Dead-variable elimination [1]
Common-subexpression elimination [1]
Constant propagation, constant folding [1]
Copy propagation, forwarding substitution [1]
Reassociation
Algebraic simplification, strength reduction
Bounds reduction
Redundant guard elimination
Loop-invariant code motion (hoisting) [1]: Hoisting code from loops does not save hardware in itself, but can save memory operations.
Loop normalization: Used as an intermediate transformation.
Loop reversal [1]: Same arguments apply to HLS.
Array padding, array contraction
Scalar expansion, scalar replacement
Loop skewing [1]: Used in multi-dimensional wavefront codes.
Function memoization: Requires explicitly instantiating fast memory.
Tail recursion elimination: Eliminating dynamic recursion can enable a code to be implemented in hardware.
Regular array decomposition: Applies to partitioning of fast memory in addition to partitioning of external memory.
Message vectorization, message coalescing, message aggregation, collective communication, message pipelining, guard introduction, redundant communication: We do not consider implications of distributed settings and message passing in this paper, but these optimizations should be implemented in dedicated message passing hardware when relevant.
Table 2. Software transformations that have equivalent or similar meaning in HLS.
High-level, directive-based frameworks such as OpenMP and OpenACC have been proposed as alternative abstractions for generating FPGA kernels. Leow et al. [42] implement an FPGA code generator from OpenMP pragmas, primarily focusing on correctness in implementing a range of OpenMP pragmas. Lee et al. [41] present an OpenACC to OpenCL compiler, using Intel OpenCL as a backend. The authors implement vectorization, replication, pipelining and streaming by introducing new OpenACC clauses. As an alternative to OpenCL, Papakonstantinou et al. [53] generate HLS code for FPGA from directive-annotated CUDA code.
Mainstream HLS compilers automatically apply many of the transformations in Table 2 [3, 26, 27] , but can also employ more advanced FPGA transformations. Intel OpenCL [72] performs memory access extraction into load store units (LSUs), does memory striping between DRAM banks, and detects and auto-resolves some cyclic buffering and accumulation patterns.
Polyhedral compilation is a popular framework for optimizing CPU and GPU programs [25] , and has also been applied to HLS for FPGA for optimizing data reuse [57] . Such techniques may prove valuable in automating, e.g., the tiling transformation.
Implementing programs in domain specific languages (DSLs) can make it easier to detect and exploit opportunities for advanced transformations. Darkroom [29] generates optimized HDL for image processing codes, and the popular image processing framework Halide [59] has been extended to support FPGAs [58] . Additionally, Luzhou et al. [44] propose a framework for generating stencil codes for FPGAs. These frameworks rely on optimizations such as cyclic buffering, streaming and replication, which we cover here. Using DSLs to compile to structured HLS code can be a viable approach to automating a wide range of transformations, as proposed in the FROST [67] DSL framework.
EXPERIMENTS
To demonstrate the effects of the set of optimizing transformations proposed here, we apply them to a set of HLS kernels and report the resulting performance when targeting an FPGA platform. These kernels are written in C++ for the Xilinx Vivado HLS [81, 85] tool. We target the TUL KU115 [73] board, which hosts a Xilinx Kintex UltraScale XCKU115-2FLVB2104E FPGA and four 2400 MT/s DDR4 banks, although we only use two banks for these experiments. The chip hosts two smaller dies with limited interconnect between them, where each die is connected to two of the DDR4 pinouts. This multi-die design is used in all of Xilinx' larger UltraScale and UltraScale+ devices, and while it allows multiplying the amount of available logic resources (2 × 331,680 LUTs and 2 × 2760 DSPs for the TUL KU115) with the number of connected dies, crossing between them is challenging for the routing process, which impedes the achievable clock rate and resource utilization for a monolithic kernel attempting to span the full chip. To interface with the host computer we use version 4.0 of the board firmware provided with the SDx 2017.2 [82] Development Environment, which provides memory and PCIe controllers on the device, and allows access to device memory and execution of the kernel through an OpenCL interface on the host side (this interface is compatible with kernels written in C++). For each example, we will describe the sequence of transformations applied, and give the resulting performance at each major stage.
Stencil code
As one of the most popular target applications for FPGAs in HPC, we apply the proposed transformations to a stencil code to optimize and scale up the design within the hardware constraints set by the FPGA platform. We implement the Jacobi 2D 4-point stencil from Listing 5. The experiments use single precision floating point types, iterate over an 8192×8192 domain, and avoid memory conflicts by using a double-buffering scheme. We begin from a naive implementation with all explicit memory accesses, which has heavy interface contention on the input array, then perform the optimization steps described below; the effect of each stage is quantified in Table 3. Enabling pipelining with cyclic buffers allows the kernel to throughput ∼1 cell per cycle. Improving the memory performance to add vectorization (using W = 8 operands/cycle for the kernel) exploits spatial locality through additional bandwidth usage. The replication and streaming step scales the design to only be limited by placement and routing due to high resource utilization.
Table 3. Performance progression of applying transformations to the stencil kernel.
GEMM code
We implement a scalable GEMM kernel based on Listing 7. For the experiments, we build for single precision floating point types, and benchmark 8192 × 8192 matrices. The optimization stages performed, starting from the naive code in Listing 1a, are quantified in Table 4. Allowing pipelining and regularizing the memory access pattern brings a dramatic improvement of 40×, throughputting ∼1 cell per cycle. Vectorization multiplies the performance by W, set to 8 in the benchmarked kernel. The replicated and streaming kernel is only limited by placement and routing due to high resource usage on the chip. Compute utilization is lower than for the stencil code, due to 1) a different distribution of floating point multiplications to additions, and 2) more control logic overhead from the multiple data streams between the processing elements with respect to the computational logic.
Table 4. Performance progression of applying transformations to a matrix multiplication kernel.
N-body code
Finally, we show the optimization process of an N-body code based on the implementation in Listing 2. We use single precision floating point types and iterate over 16,128 bodies. Since Vivado HLS does not allow memory accesses of a width that is not a power of two, it was necessary to include memory access extraction in the first stage. The steps taken were as follows: first, memory access extraction and pipelining; second, accumulation interleaving; and third, replication of compute units. The effect of each stage is quantified in Table 5. The second stage gains a factor of 7×, corresponding to the latency of the interleaved accumulation, followed by a further factor of 39× from the replicated units across the chip. The memory bandwidth requirement is regulated by L. In fact, we can further reduce the bandwidth requirements by storing more resident particles on the chip, scaling up to the full fast memory usage of the FPGA. In this case, the accumulation interleaving transformation thus enables not just pipelining the compute, but also minimization of bandwidth consumption, and thus energy consumption due to I/O.
With these examples, we have demonstrated the effect of our transformations on a reconfigurable hardware platform, showing that we can scale up kernels until constrained by high resource utilization on the device. In particular, enabling pipelining, regularizing memory accesses, and replicating were shown to be central components of scalable hardware architectures. Using these principles, we can continue to exploit new platforms as the hardware landscape evolves, adapting the transformation parameters to accommodate available resources.
Table 5. Performance progression of applying transformations to an N-body simulation kernel.
CONCLUSION
Programming specialized hardware architectures has been brought to a much wider audience with the adoption of high-level synthesis (HLS) tools. To facilitate the development of HPC kernels using HLS, we have proposed a set of optimizing transformations that enable efficient and scalable hardware architectures, and which can be applied directly to the source code or automatically by an optimizing compiler. We hope that software and hardware programmers, performance engineers, and compiler developers will be able to benefit from this set, with the goal of serving as a common toolbox for developing high performance hardware using HLS.