FPGAs excel at low-power, high-throughput computation, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect with a vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or to design proper abstractions.
INTRODUCTION
Field Programmable Gate Arrays (FPGAs) consist of a network of reconfigurable digital logic cells that can be configured to implement any combinational logic or sequential circuit. This allows the design of custom application-tailored hardware. In particular, memory-intensive applications benefit from FPGA implementations by exploiting fast on-chip memory for high throughput. These features make FPGA implementations orders of magnitude faster or more energy-efficient than CPU implementations in these areas. However, FPGA programming poses challenges to programmers unacquainted with hardware design.
FPGAs are traditionally programmed at Register-Transfer Level (RTL). This requires modeling digital signals, their timing, and their flow between registers, as well as the operations performed on them. Hardware Description Languages (HDLs) such as Verilog or VHDL allow for the explicit description of arbitrary circuits but require significant coding effort and verification time. This makes design iterations time-consuming and error-prone, even for experts: the code needs to be rewritten for different performance or area objectives.
High-Level Synthesis (HLS) [7] increases the abstraction level to an untimed high-level specification similar to imperative programming languages and automatically solves low-level design issues such as clock-level timing, register allocation, and structural pipelining. HLS languages such as Vivado HLS or Catapult-C are usually based on various C dialects [28]. However, HLS code that is optimized for the synthesis of high-performance circuits is fundamentally different from a software program delivering high performance on a CPU. This is due to the significant gap between the programming paradigms. An HLS compiler must optimize the memory hierarchy of a hardware implementation and parallelize its data paths [8]. In order to achieve good performance, HLS languages require programmers to also specify the hardware architecture of an application instead of just its algorithm. For this reason, HLS languages offer hardware-specific pragmas. This ad-hoc mix of software and hardware features makes it difficult for programmers to optimize an application.
Fig. 1. We decouple the algorithm description sobel_x from its realization in hardware make_local_op. The hardware realization is a function that specifies important transformations for exploitation of parallelism and memory architecture. The function generate(vhls) selects the backend for code generation, which is Vivado HLS in this case. Ultimately, an optimized input code for HLS is generated by partially evaluating the algorithm and realization functions.
Pragmas allow programmers to exploit different design choices, but they cannot be used in a modular way because the preprocessor already resolves them. In addition, most HLS tools rely on their own C dialect, which prevents code portability. For example, Xilinx Vivado HLS [45] uses C++ as its base language, while Intel SDK [19] (formerly Altera) uses OpenCL C. These severe restrictions make it hard to use existing HLS languages in a portable and modular way.
In this paper, we advocate describing FPGA designs using functional abstractions and partial evaluation to generate optimized HLS code. Consider Figure 1 for an example from image processing: with a functional language, we separate the description of the sobel_x operator from its realization in hardware. The hardware realization make_local_op is a function that specifies the data path, the parallelization, and the memory architecture. Thus, the algorithm and the hardware architecture are both described by a set of higher-order functions. A partial evaluator ultimately combines these functions to generate HLS code that delivers high-performance circuit designs when compiled with HLS tools. Since the initial descriptions are high-level, compact, and functional, they are reusable and distributable as a library. We leverage the AnyDSL compiler framework [24] to perform partial evaluation and extend it to generate input HLS code for Intel and Xilinx FPGA devices. We claim that this approach leads to cleaner and more composable code than existing HLS approaches and is able to produce highly efficient hardware implementations.
Contributions: In summary, this paper makes the following contributions:
• We extend the AnyDSL framework by hardware-specific memory types and parallelization constructs to target FPGAs (see Section 2).
• We present AnyHLS, a set of major abstractions that significantly eases the design of libraries for HLS-based FPGA implementations. These functional abstractions include important loop transformations for performance optimizations as well as control structures such as Finite State Machines (FSMs), multiplexers (MUXs), and reduction operators (see Section 2). AnyHLS allows defining all abstractions for a domain in a language called Impala and relies on partial evaluation for code specialization. This ensures maintainability and extensibility of the provided domain-specific library (for image processing in this example).
• As a case study, we present a library for image processing that is based on AnyHLS's abstractions (see Section 3). We demonstrate that the performance of the circuits synthesized from this library is on par with or even exceeds that of circuits generated by existing state-of-the-art domain-specific compilers (see Section 4).
THE AnyHLS LIBRARY
Efficient and resource-friendly FPGA designs require application-specific optimizations. These optimizations and transformations are well known in the community. For example, de Fine Licht et al. [11] discuss the key transformations of HLS code such as loop unrolling and pipelining. In our setting, the programmer defines and provides these abstractions using AnyDSL for a given domain in the form of a library (see Figure 2). They describe the whole hardware design from the low-level memory layout to the operator implementations with support for low-level loop transformations throughout the design. We rely on partial evaluation to combine those abstractions and to remove overhead associated with them. This is in contrast to other domain-specific approaches like Rigel [18] or Hipacc [37], which rely on domain-specific compilers to instantiate predefined templates or macros.
In the following, we briefly present the relevant key concepts of AnyDSL (Section 2.1) before presenting our extensions to AnyDSL, key abstractions for FPGA designs, and HLS code generation (Section 2.2). Then, Section 3 introduces an image processing library based on these key abstractions. Finally, Section 4 evaluates the overall approach.
AnyDSL Compiler Framework
AnyDSL [23, 24] is a compiler framework for designing high-performance, domain-specific libraries. It provides the imperative and functional language Impala, whose syntax is inspired by Rust. We now briefly discuss the features of Impala that AnyHLS relies on most.
Partial Evaluation. Impala enables programmers to partially evaluate [16] their program at compile time. Programmers control the partial evaluator via filters [9]. These are Boolean expressions of the form @(expr) that annotate function signatures. Each call site instantiates the callee's filter with the corresponding argument list. If the expression evaluates to true, the call will be specialized. Additionally, the expression ?expr yields true if expr is known at compile time; the expression $expr is never considered constant by the evaluator. For example, the following @(?n) filter will only specialize calls to pow if n is statically known at compile time:
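The Impala definition of pow is not reproduced in this text. As a rough analogue, the following Python sketch resolves the recursion over a known exponent n once, up front, and returns a function that depends on x only. The name specialize_pow and the square-and-multiply recursion are assumptions made for illustration; Impala's partial evaluator performs this kind of specialization at compile time rather than through closures.

```python
def specialize_pow(n):
    """Resolve the recursion over a known exponent n; return a function of x only."""
    if n == 0:
        return lambda x: 1
    if n % 2 == 0:
        half = specialize_pow(n // 2)

        def even_case(x):
            y = half(x)
            return y * y

        return even_case
    rest = specialize_pow(n - 1)
    return lambda x: x * rest(x)


# All decisions that depend on n are taken here, once.
pow5 = specialize_pow(5)
print(pow5(3))
```

The returned function no longer inspects n, which mirrors the residual code that remains after specialization for a statically known exponent.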
Thus, the calls let z = pow(x, 5); and let z = pow(3, 5); are both specialized because n is known in each of them: the first is reduced to a short sequence of multiplications on x, and the second, whose arguments are all constant, is evaluated completely at compile time. As syntactic sugar, @ is available as shorthand for @(true). This causes the partial evaluator to always specialize the annotated function.
In a for loop over iter, the loop body becomes an anonymous function that is passed to iter as the last argument. We call functions that are invokable like this generators. Domain-specific libraries implemented in Impala make heavy use of these features, as they allow programmers to write custom generators that take advantage of both domain knowledge and certain hardware features, as we will see in the next section.
Generators are particularly powerful in combination with partial evaluation. Consider the two generators unroll and range: both iterate from a (inclusive) to b (exclusive) while invoking body each time. The filter of unroll tells the partial evaluator to completely unroll the recursion if both loop bounds are statically known at a particular call site. The function range wraps a call to unroll but prevents the partial evaluator from unrolling, because its argument $a is always considered dynamic.
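The calling convention of the two generators can be modeled in Python as a minimal sketch. Python has no partial evaluator, so the difference between an unrolled and a preserved loop is not observable here; only the shared interface (lower bound, upper bound, body) is. The name range_loop is an assumption, chosen to avoid shadowing Python's built-in range.

```python
def unroll(a, b, body):
    """Recursive generator: calls body(i) for every i in [a, b)."""
    if a < b:
        body(a)
        unroll(a + 1, b, body)


def range_loop(a, b, body):
    """Loop-shaped generator with the same iteration space as unroll."""
    i = a
    while i < b:
        body(i)
        i += 1


visited_unrolled = []
visited_looped = []
unroll(0, 4, visited_unrolled.append)
range_loop(0, 4, visited_looped.append)
print(visited_unrolled, visited_looped)
```

Because both functions share one signature, code that receives a generator as an argument does not need to know which iteration strategy it was given.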
Building Abstractions for FPGA Designs
In the following, we present abstractions for the key transformations and design patterns that are common in FPGA design. These include (a) important loop transformations, (b) control flow and data flow descriptions such as FSMs and reductions, (c) MUXs, and (d) the explicit utilization of different memory types. Approaches like Spatial [21] expose these patterns within the language; new patterns require dedicated support from the compiler. Hence, these languages and compilers are restricted to the specialized application domain they have been designed for. In AnyHLS, Impala's functional language and partial evaluation allow us to design the abstractions needed for FPGA synthesis in the form of a library. New patterns can be added to the library without dedicated support from the compiler.
Passing W for the tiling size, unroll for the inner loop, and range for the outer loop yields a generator that is identical to the loop nest at the beginning of this paragraph. With this design, we can reuse or explore iteration techniques without touching the actual body of a for loop. For example, consider the processing options for a two-dimensional loop nest as shown in Figure 3: when just passing range as inner and outer loop, the partial evaluator keeps the loop nest and hence does not unroll body, which is instantiated only once. Unrolling the inner loop replicates body and increases the bandwidth requirements accordingly. Unrolling the outer loop also replicates body, but in a way that benefits data reuse from temporal locality of an iterative algorithm. Unrolling both loops replicates body for increased bandwidth and for data reuse from temporal locality.
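The idea of a tiling generator that is parameterized by its inner and outer iteration strategy can be sketched in Python as follows. This is a software model under assumptions: the name tile, its parameter order, and the handling of a partial last tile are illustrative and not taken from the AnyHLS sources.

```python
def unroll(a, b, body):
    """Recursive generator over [a, b)."""
    if a < b:
        body(a)
        unroll(a + 1, b, body)


def range_loop(a, b, body):
    """Loop-shaped generator over [a, b)."""
    for i in range(a, b):
        body(i)


def tile(size, inner, outer):
    """Build a generator that walks [a, b) in tiles of `size` elements."""
    def tiled(a, b, body):
        def per_tile(t):
            start = a + t * size
            # Assumption: a partial last tile is clipped to b.
            inner(start, min(start + size, b), body)

        n_tiles = (b - a + size - 1) // size
        outer(0, n_tiles, per_tile)

    return tiled


order = []
tile(4, unroll, range_loop)(0, 10, order.append)
print(order)
```

Swapping unroll and range_loop in the call to tile changes how the iteration space is traversed without touching the body that is passed in.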
C/C++-based HLS solutions often use a pragma to mark a loop as amenable to pipelining. This means that the loop iterations are executed in parallel in hardware. For example, the following code on the left uses an initiation interval (II) of 3:
Instead of a pragma (on the left), AnyHLS uses the intrinsic generator pipeline (on the right). This enables the programmer to invoke and pass around pipeline just like any other generator.
AnyHLS models computations that depend not only on the inputs but also on an internal state with an FSM. In order to define an FSM, programmers need to specify states and a transition function that determines when to change the current state based on the machine's input. This is especially beneficial for modeling control flow. To describe an FSM in Impala, we start by introducing types to represent the states and the machine itself:
An object of type FSM provides two operations: adding a state with add and running the computation with run. The add method takes the name of the state, an action to be performed for this state, and a transition function associated with this state. Once all states are added, the programmer runs the machine by passing the initial state. The following example adds 1 to every element of an array:
Just like the other abstractions introduced in this section, the constructor for an FSM is not a built-in function of the compiler but a regular Impala function (see Appendix A). In some cases, we want to execute the FSM in a pipelined way. For this scenario, we add a second method run_pipelined (see Section 3 and Appendix A for details).
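The FSM interface described above (add a state with an action and a transition function, then run from an initial state) can be modeled in Python as a minimal sketch. The state names read, compute, and write, the EXIT sentinel, and the class layout are assumptions for illustration; the Impala constructor in Appendix A is not reproduced here.

```python
EXIT = "exit"  # hypothetical sentinel state that stops the machine


class FSM:
    def __init__(self):
        self._states = {}

    def add(self, name, action, transition):
        """Register a state with its action and its transition function."""
        self._states[name] = (action, transition)

    def run(self, start):
        """Execute actions and follow transitions until EXIT is reached."""
        state = start
        while state != EXIT:
            action, transition = self._states[state]
            action()
            state = transition()


def add_one(arr):
    """Add 1 to every element of arr using a three-state machine."""
    ctx = {"idx": 0, "buf": 0}

    def read():
        ctx["buf"] = arr[ctx["idx"]]

    def compute():
        ctx["buf"] += 1

    def write():
        arr[ctx["idx"]] = ctx["buf"]
        ctx["idx"] += 1

    fsm = FSM()
    fsm.add("read", read, lambda: "compute")
    fsm.add("compute", compute, lambda: "write")
    fsm.add("write", write, lambda: EXIT if ctx["idx"] >= len(arr) else "read")
    fsm.run("read" if arr else EXIT)
    return arr


print(add_one([1, 2, 3]))
```

The machine itself is ordinary library code, which is the point made in the text: no compiler support is needed to add such a control structure.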
Reductions.
Reductions are useful in many contexts. The following function takes an array of values, a range within it, and an operator:
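A reduction over a sub-range of an array can be sketched in Python as follows. The pairwise, tree-shaped recursion and the name reduce_range are assumptions; they illustrate one way such a function can be written so that a specializer would produce a balanced operator tree for a statically known range.

```python
def reduce_range(values, lo, hi, op):
    """Combine values[lo:hi] pairwise; requires lo < hi."""
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    left = reduce_range(values, lo, mid, op)
    right = reduce_range(values, mid, hi, op)
    return op(left, right)


total = reduce_range([1, 2, 3, 4, 5], 0, 5, lambda a, b: a + b)
largest = reduce_range([3, 9, 2], 1, 3, max)
print(total, largest)
```

The operator is an argument, so the same function serves sums, maxima, and other associative combinations.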
Multiplexers.
MUXs are devices that take n inputs and a selection index i ∈ [0, n − 1]. They yield the i-th input. AnyHLS models this with the following function that takes an array of inputs as the argument:
If the number of inputs n is known, the partial evaluator will generate a sequence of if statements for each possible case because the constant n will be propagated to the call to unroll. This will in turn trigger the evaluator to unroll the loop. For example, the partial evaluator will specialize the call let sel = value & 3; mux(sel, 4, [multiple_four, odd, even, odd]) to the following sequence of if statements:
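The behavior of the multiplexer in this example can be modeled in Python. This is a sketch under assumptions: the loop stands in for the unrolled chain of comparisons, the inputs are plain strings, and the helper classify is hypothetical.

```python
def mux(sel, n, inputs):
    """Return inputs[sel] through one comparison per possible index."""
    result = inputs[0]
    for i in range(n):  # with a known n, a specializer could unroll this loop
        if sel == i:
            result = inputs[i]
    return result


def classify(value):
    """Select a label from the two least significant bits of value."""
    sel = value & 3
    return mux(sel, 4, ["multiple_four", "odd", "even", "odd"])


print(classify(8), classify(5), classify(6), classify(7))
```

For value = 6, the selection index is 6 & 3 = 2, so the third input is chosen.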
Memory Types and Memory Abstractions.
FPGAs have different memory types of varying sizes and access properties. Impala supports four memory types specific to hardware design (see Figure 5): global memory, on-chip memory, registers, and streams. Global memory (typically DRAM) is allocated on the host using our runtime and accessed through regular pointers. On-chip memory (e.g., BRAM or M10K/M20K) for the FPGA is allocated using the reserve_onchip compiler intrinsic. Memory accesses using the pointer returned by this intrinsic will map to on-chip memory. Standard variables are mapped to registers, and a specific stream type is available to allow for the communication between FPGA kernels. Memory-wise, a stream is mapped to registers or on-chip memory by the HLS tools.
Vivado HLS uses a pragma to annotate register arrays:
Since normal variables are stored in registers, a recursive structure that holds one register per recursion level constitutes a register array:
In the base case, this function generates an empty register array that always reads as 0 and cannot be written to. When the size is not zero, the function allocates a register by declaring a variable named reg, and creates a smaller register array with one element less named others. The read and write functions test if the index i is equal to the index of the current register. In the case of a match, the current register is used. Otherwise, the search continues in the smaller array.
Note that we have annotated make_regs1d, read, and write for partial evaluation. Thus, any call to these functions will be inlined recursively. This means that the search for the register to read from or write to will be performed at compile time, and no test will remain in the specialized code. By doing this, we created a generator for register arrays that we instantiate with the following code:
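The recursive register array described above can be modeled in Python with closures. This is a minimal sketch: the tuple return value and the choice that level size holds index size - 1 are assumptions, and the index tests that Impala's partial evaluator removes at compile time remain as run-time tests here.

```python
def make_regs1d(size):
    """Recursive register array: one closure-held variable per recursion level."""
    if size == 0:
        # Base case: an empty array that always reads 0 and ignores writes.
        return (lambda i: 0), (lambda i, v: None)

    reg = [0]  # the register owned by this level
    read_others, write_others = make_regs1d(size - 1)

    def read(i):
        return reg[0] if i == size - 1 else read_others(i)

    def write(i, v):
        if i == size - 1:
            reg[0] = v
        else:
            write_others(i, v)

    return read, write


read, write = make_regs1d(4)
write(2, 7)
print(read(2), read(0))
```

Each level owns exactly one variable, so an array of four registers corresponds to four nested closures.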
The generated code will not contain any compiler directives, and thus targets Vivado HLS or Intel FPGA SDK for OpenCL (AOCL) alike. Moreover, these registers will be optimized by the AnyDSL compiler, just like any other variables: unnecessary assignments are avoided, and cleaner HLS code is generated. Based on this one-dimensional register array, AnyHLS provides generators for two-dimensional register arrays. Generators for arrays of on-chip memory and streams work the same way (see Figure 6).
With opencl, we use a grid and block size of (1, 1, 1) to generate a single work-item kernel, as the official AOCL documentation recommends. To provide an abstraction over both HLS backends, we create a wrapper generate that expects a code generation function:
Switching backends is now just a matter of passing an appropriate function to generate:
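The backend selection can be pictured with a small Python model. Everything in it is hypothetical: the dictionaries only record which backend was chosen, and the functions vhls, opencl, and generate mimic the roles named in the text rather than any real code generation API.

```python
def vhls(body):
    """Hypothetical Vivado HLS backend: run body and tag the result."""
    return {"backend": "vhls", "kernel": body()}


def opencl(body, grid=(1, 1, 1), block=(1, 1, 1)):
    """Hypothetical AOCL backend with a single work-item launch configuration."""
    return {"backend": "opencl", "grid": grid, "block": block, "kernel": body()}


def generate(backend):
    """Return a function that hands a kernel body to the chosen backend."""
    return lambda body: backend(body)


def kernel():
    return "sobel_x"


for_xilinx = generate(vhls)(kernel)
for_intel = generate(opencl)(kernel)
print(for_xilinx["backend"], for_intel["backend"])
```

The kernel body is identical in both calls; only the function passed to generate differs.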
DESIGN OF A LIBRARY FOR IMAGE PROCESSING ON FPGA
In this section, we present a library for image processing based on the fundamental abstractions introduced in Section 2.2. Our low-level implementation is similar to existing domain-specific languages targeting FPGAs [18, 37]. For this reason, we focus on the interface of our abstractions as seen by the programmer.
Vectorization
Image processing applications consist of loops that possess a very high degree of spatial parallelism. This should be exploited to reach the bandwidth of the memory technology. A resource-efficient approach, so-called vectorization or loop coarsening, is to aggregate the input pixels to vectors and process multiple input data at the same time to calculate multiple output pixels in parallel [40, 33, 43]. This replicates only the arithmetic operations applied to data (the so-called datapath) instead of the whole accelerator, similar to Single Instruction Multiple Data (SIMD) architectures. Vectorization requires a control structure specialized to the considered hardware design. We support the automatic vectorization of an application by a given factor v when using our image processing library. For example, the make_local_op function has an additional parameter to specify the desired vectorization and will propagate this information to the functions it uses internally: make_local_op(op, v). For brevity, we omit the parameter for the vectorization factor for the remaining abstractions in this section.
Memory Abstractions for Image Processing

Memory Accessor. In order to optimize memory access and encapsulate the contained memory type (on-chip memory, etc.) into a data structure, we decouple the data transfer from the data use via the following memory abstractions:
Similar to hardware design practices, these memory abstractions require the memory address to be updated before the read/write operations. The update function transfers data from/to the encapsulated memory to/from staging registers using vector data types. Then, the read/write functions access an element of the vector. This increases data reuse and DRAM-to-on-chip memory bandwidth [5].
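The update-then-access protocol can be sketched in Python. This is a software model under assumptions: the names make_mem1d and flush are hypothetical, the write-back is split into a separate flush step for clarity, and the storage length is assumed to be a multiple of the vector width v.

```python
def make_mem1d(storage, v):
    """Wrap a flat list; update(addr) stages v elements, read/write touch one lane."""
    staging = [0] * v
    state = {"addr": 0}

    def update(addr):
        # Transfer one vector from the encapsulated memory to the staging registers.
        state["addr"] = addr
        staging[:] = storage[addr * v:(addr + 1) * v]

    def read(lane):
        return staging[lane]

    def write(lane, value):
        staging[lane] = value

    def flush():
        # Transfer the staged vector back to the encapsulated memory.
        base = state["addr"] * v
        storage[base:base + v] = staging

    return update, read, write, flush


memory = list(range(1, 9))
update, read, write, flush = make_mem1d(memory, 4)
update(1)
seventh = read(2)
write(0, 50)
flush()
print(seventh, memory)
```

One call to update moves a whole vector, while read and write only touch the staging registers, which is the decoupling of data transfer from data use that the text describes.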
Stream Processing. Inter-kernel dependencies of an algorithm should be accessed on-the-fly in combination with fine-granular communication in order to pipeline the full implementation with a fixed throughput. That is, as soon as a block produces a data item, the next block consumes it. In the best case, this requires only a single register or a small buffer instead of reading/writing to temporary images:
We define a stream between two kernels as follows:
fn make_mem_from_stream(size: int, data: stream) -> Mem1D;
Line Buffers.
Storing an entire image to on-chip memory before execution is not feasible since on-chip memory blocks are limited in FPGAs. On the other hand, feeding the data on demand from main memory is extremely slow. Still, it is possible to leverage fast on-chip memory by using it as FIFO buffers containing only the necessary lines of the input images (W pixels per line).
This enables parallel reads at the output for every pixel read at the input. We model a line buffer as follows:
type LineBuf1D = fn(Mem1D) -> Mem1D;
fn make_linebuf1d(width: int) -> LineBuf1D;
// similar for LineBuf2D
Akin to Regs1D (see Section 2.2.5), a recursive call builds an array of line buffers.
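The behavior of a line buffer can be modeled in Python as a minimal sketch. The function make_linebuf, its parameters, and the list-based storage are assumptions; the model only shows that each input pixel yields a column made of the pixels stored at the same x position in the previous lines plus the new pixel.

```python
def make_linebuf(width, lines):
    """Keep the last `lines` image rows; return a column of lines + 1 pixels per input."""
    rows = [[0] * width for _ in range(lines)]

    def push(x, pixel):
        column = [row[x] for row in rows] + [pixel]  # oldest line first
        for r in range(lines - 1):
            rows[r][x] = rows[r + 1][x]  # move stored pixels up by one line
        if lines > 0:
            rows[lines - 1][x] = pixel
        return column

    return push


push = make_linebuf(3, 2)
image = ([1, 2, 3], [4, 5, 6], [7, 8, 9])
columns = [push(x, p) for row in image for x, p in enumerate(row)]
print(columns[7])
```

For a 3 × 3 window, two stored lines suffice: when pixel 8 arrives, the buffer already holds 2 and 5 at the same column.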
Sliding Window.
Registers are the most amenable resources to hold data for highly parallelized access. A sliding window of size w × h updates the constituting shift registers by a new column of h pixels and enables parallel access to w · h pixels.
[Figure: sliding window between Mem2D(1, h, v) and Mem2D(w, h, 1)]
This provides high data reuse for temporal locality and avoids waste of on-chip memory blocks that might be utilized for a similar data bandwidth. Our implementation uses make_regs2d for an explicit declaration of registers and supports pixel-based indexing at the output.
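A sliding window of shift registers can be modeled in Python as follows. The sketch assumes a vectorization factor of 1 and uses hypothetical names (make_sliding_window, shift_in, read); it is not the make_regs2d-based implementation of the library.

```python
def make_sliding_window(w, h):
    """w x h window of shift registers, updated by one column of h pixels at a time."""
    win = [[0] * w for _ in range(h)]

    def shift_in(column):
        for y in range(h):
            win[y] = win[y][1:] + [column[y]]  # drop the oldest pixel, append the new one

    def read(x, y):
        return win[y][x]

    return shift_in, read


shift_in, read = make_sliding_window(3, 2)
for column in ([1, 2], [3, 4], [5, 6]):
    shift_in(column)
snapshot = [[read(x, y) for x in range(3)] for y in range(2)]
print(snapshot)
```

After three columns have been shifted in, all w · h = 6 pixels are available at once, and every further column replaces only the oldest one.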
The following figure illustrates the AnyHLS point operator:
The total latency is
where W and H are the width and height of the input image, and L_arith is the latency of the data path.
Local Operators. Algorithms such as Gaussian blur and Sobel edge detection calculate an output pixel by considering the corresponding input pixel and a certain neighborhood of it in a local window. Thus, a local operator with a w × h window requires w · h pixel reads for every output. The same (w − 1) · h pixels are used to calculate results at the image coordinates (x, y) and (x + 1, y). This spatial locality is transformed into temporal locality when input images are read in raster order for burst mode, and subsequent pixels are sequentially processed with a streaming pipeline implementation. The local operator implementation in AnyHLS (shown below) consists of line buffers and a sliding window to hold dependency pixels in on-chip memory and calculates a new result for every new pixel read.
3 Appendix A depicts an alternative point operator implementation that is based upon the FSM abstraction in combination with run_pipelined. Both variants generate similar HLS code.
This provides a throughput of v pixels per clock cycle at the cost of an initial latency (v is the vectorization factor)
that is spent for caching neighboring pixels of the first calculation. The final latency is thus:
Compared to the local operator in Figure 1, we also support boundary handling. We specify the extent of the local operator (filter size / 2) as well as functions specifying the boundary handling for the lower and upper bounds. Then, row and column selection functions apply border handling correspondingly in x- and y-directions by using one-dimensional multiplexer arrays similar to Özkan et al. [33].
Algorithm Specification
We design applications by decoupling their algorithmic description from their schedule and memory operations. For instance, typical image operators, such as the following Sobel filter, just resort to the make_local_op generator.
Similarly, we implement a point operator for RGB-to-gray color conversion as follows:
The image data structure is opaque. The target platform mapping determines its layout. AnyHLS provides common border handling functions as well as point and global operators such as reductions (see Section 2.2.3). These operators are composable to allow for more sophisticated ones.
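The separation between an algorithm and its realization can be illustrated with a plain Python model. This sketch is not the AnyHLS implementation: make_local_op here is a simple software loop over a list-of-lists image, the read(dx, dy) accessor is an assumed interface, and clamping at the image border is one hypothetical choice of boundary handling.

```python
def sobel_x(read):
    """Horizontal Sobel derivative; read(dx, dy) returns a neighboring pixel."""
    return (-read(-1, -1) + read(1, -1)
            - 2 * read(-1, 0) + 2 * read(1, 0)
            - read(-1, 1) + read(1, 1))


def make_local_op(op):
    """Turn a window function into a function over whole images."""
    def run(img):
        height, width = len(img), len(img[0])

        def clamp(value, limit):
            return max(0, min(value, limit - 1))

        def pixel(x, y):
            # Assumption: out-of-image accesses are clamped to the border.
            return op(lambda dx, dy: img[clamp(y + dy, height)][clamp(x + dx, width)])

        return [[pixel(x, y) for x in range(width)] for y in range(height)]

    return run


edges = make_local_op(sobel_x)([[0, 0, 10, 10] for _ in range(3)])
print(edges)
```

The operator sobel_x contains no loops or memory accesses of its own, so the same function could be handed to a different realization without change.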
EVALUATION AND RESULTS
In the following, we present experimental results for a Cyclone V GT 5CGTD9 FPGA and a Zynq XC7Z020 FPGA using Intel FPGA SDK for OpenCL 18.1 and Xilinx Vivado HLS 2017.2, respectively. We compare the results of AnyHLS to other state-of-the-art domain-specific approaches including Halide-HLS [35] and Hipacc [37] .
Applications
We evaluate the following applications:
• Gaussian (Gauss) blurring an image with a 5 × 5 integer kernel
• Harris corner detector (Harris) consisting of 9 kernels that resort to integer arithmetic and horizontal/vertical derivatives
• Jacobi smoothing an image with a 3 × 3 integer kernel
• filter chain (FChain) consisting of 3 convolution kernels as a pre-processing algorithm
• bilateral filter (Bilateral), a 5 × 5 floating-point kernel as an edge-preserving and noise-smoothing function based on the exponential function
• mean filter (MF), a 5 × 5 filter that finds the average within a local window via 8-bit arithmetic
• SobelLuma, an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding
Evaluation of the Implementation Results
This section evaluates the generated hardware designs based on their throughput, latency, and resource utilization. FPGAs possess two types of resources:
(1) computational: LUTs and DSP blocks;
(2) memory: flip-flops (FFs) and on-chip memory (BRAM/M20K).
A SLICE/ALM comprises look-up tables (LUTs) and flip-flops; its count thus indicates the resource usage when considered together with the DSP blocks and on-chip memory blocks. The implementation results presented for Vivado HLS feature only the kernel logic, while those for Intel OpenCL include PCIe interfaces. The execution time of an FPGA circuit (Vivado HLS implementation) equals T_clk · latency, where T_clk is the clock period of the maximum achievable clock frequency (lower is better). We measured the timing results for Intel OpenCL by executing the applications on a Cyclone V GT 5CGTD9 FPGA. This is the case for all analyzed applications. We have neither the intention nor the license rights [10, §4] [20, §2] to benchmark and compare the considered FPGA technologies or HLS tools.
Library Optimizations
AnyHLS exploits stream processing and performs implicit parallelization. The following subsections show the impact of those optimizations. The throughput of both streaming pipeline implementations is indeed determined by their slowest individual kernel, which is a local operator. Consider Table 1, which displays the Vivado HLS reports. The latency results correspond to Equation (3).
Vectorization. Many FPGA implementations benefit from parallel processing in order to increase memory bandwidth. AnyHLS implicitly parallelizes a given image pipeline by a vectorization factor v. As an example, Figure 8 shows the Post Place and Route (PPnR) results, along with the achieved memory throughput for different vectorization factors for the mean filter on a Cyclone V. The memory bound of the Cyclone V is reported by Intel's diagnosis tool. The speedup is almost linear, whereas resource utilization grows sub-linearly with the vectorization factor, as Figure 8 depicts. AnyHLS exploits the data reuse between consecutive iterations of the local operators. Data is read and written with the vectorized data types. The line buffers and the sliding window are extended to hold dependency pixels for vectorized processing. Thus, only the datapath is replicated instead of the whole accelerator implementation (see Section 3.1). All the considered applications except Bilateral in Figure 10 reach the memory bound. Bilateral is compute-bound due to its large number of floating-point operations.
Hardware Design Evaluation
We evaluate the generated hardware designs based on their throughput, latency, and resource utilization. As a reference, we use the designs generated by Halide-HLS [35] and Hipacc [37], two state-of-the-art image processing DSLs that generate better results than previous approaches (e.g., Xilinx OpenCV). In contrast to these, which implement dedicated HLS code generators, AnyHLS is essentially implemented as a library within the AnyDSL framework, as illustrated in Figure 2.
Our focus is to show that higher-order abstractions, together with partial evaluation, are powerful enough to design a library targeting different HLS compilers.
Experiments using Xilinx Vivado HLS.
We evaluate the results of circuits generated using AnyHLS in comparison with the domain-specific language approaches Hipacc and Halide-HLS. We consider two representative applications from the Halide-HLS repository with different configurations (border handling mode and vectorization factor): Gauss and Harris. The applications are rewritten for Hipacc and AnyHLS by respecting their original descriptions. This ensures that Halide-HLS applications have been implemented with adequate scheduling primitives. Hipacc and AnyHLS implementations require only the algorithm descriptions as input.
As Tables 2 and 3 report, AnyHLS provides the implementations with the lowest latency (number of clock cycles) amongst the considered approaches. The execution time of an implementation equals T_clk · latency, where T_clk is the clock period of the maximum achievable clock frequency (lower is better). Overall, AnyHLS processes a given image faster than the other DSL implementations.
Let us consider the number of BRAMs utilized for the Gaussian blur: The line buffers need to hold 4 image lines for the 5 × 5 kernel, as explained in Section 3.3.2. The image width is 1020 and the pixel size is 32 bits. Therefore, eight 18K BRAMs are required, as shown in Table 2 . The number of BRAMs is sub-linear to the vectorization factors (v).
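The BRAM count above can be checked with a back-of-the-envelope capacity calculation. The sketch assumes that an 18K BRAM stores 18 × 1024 bits and that the line buffer data can be packed without loss; real tools map memories to fixed depth and width configurations, so the result is a lower bound, and the helper name is hypothetical.

```python
BRAM18_BITS = 18 * 1024  # capacity of one 18K block RAM in bits


def bram18_lower_bound(lines, width, bits_per_pixel):
    """Smallest number of 18K BRAMs whose total capacity holds the buffered lines."""
    total_bits = lines * width * bits_per_pixel
    return -(-total_bits // BRAM18_BITS)  # ceiling division


gauss_brams = bram18_lower_bound(lines=4, width=1020, bits_per_pixel=32)
print(gauss_brams)
```

Four lines of 1020 pixels at 32 bits amount to 130560 bits, slightly more than seven blocks of 18432 bits, which rounds up to the eight BRAMs stated in the text.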
Unlike Hipacc and AnyHLS, Halide-HLS pads input images according to the border handling mode. This requires more on-chip memory and increases latency. In the worst case, the number of BRAMs doubles as shown in Table 3 (1024 integer pixels require 16 18K BRAMs to buffer four image lines).
For almost all applications in Tables 2 and 3, AnyHLS implementations demand fewer resources and deliver higher performance. The DSLs have been developed by FPGA experts and perform better than many other existing libraries. The performance difference between AnyHLS and Hipacc is not significant since the latency of both implementations is close to optimal. Equation (3) shows the minimum number of clock cycles required for a local operator iteration. For instance, the theoretical latency of Gauss is 1042442 cycles for v = 1. The latencies of the Gauss implementations generated by AnyHLS and Hipacc exceed this bound by only 14 and 2058 clock cycles, respectively. In conclusion, we claim that AnyHLS is on par with or exceeds existing approaches.
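The theoretical latency quoted above can be reproduced with a short calculation. Equation (3) is not reproduced in this text, so the sketch rests on two assumptions: that the bound has the form ⌊h/2⌋ · ⌈W/v⌉ + ⌊w/2⌋ + ⌈W/v⌉ · H (initial fill of the line buffers and the window, plus one cycle per vector of pixels, without the data path latency), and that the Gauss input image is 1020 × 1020 pixels (only the width of 1020 is stated). The function name is hypothetical.

```python
def local_op_min_latency(W, H, w, h, v=1):
    """Assumed lower bound in cycles for a w x h local operator on a W x H image."""
    vectors_per_line = -(-W // v)  # ceil(W / v)
    initial = (h // 2) * vectors_per_line + (w // 2)  # fill line buffers and window
    return initial + vectors_per_line * H


gauss_cycles = local_op_min_latency(W=1020, H=1020, w=5, h=5, v=1)
print(gauss_cycles)
```

Under these assumptions, the result is 2 · 1020 + 2 + 1020 · 1020 = 1042442 cycles, which agrees with the figure given for Gauss at v = 1.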
Experiments using Intel FPGA SDK for OpenCL (AOCL). Table 4 presents the implementation results for an edge detection algorithm provided as a design example by Intel. The algorithm consists of RGB to Luma color conversion, Sobel filters, and thresholding. Intel's implementation consists of a single work-item kernel that utilizes shift registers according to the FPGA design paradigm. Such techniques are recommended by Intel's optimization guide [19], even though the same OpenCL code performs drastically worse on other computing platforms. We implemented this algorithm with Hipacc and AnyHLS using algorithmic abstractions (as in Section 3.4). Both Hipacc and AnyHLS provide a higher throughput even without vectorization. In order to reach the memory bound, we would have to rewrite Intel's hand-tuned design example to exploit further parallelism. AnyHLS uses slightly fewer resources, whereas Hipacc provides slightly higher throughput for all the vectorization factors. Similar to Figure 8, both frameworks yield throughputs very close to the memory bound of the Intel Cyclone V.
The OpenCL NDRange kernel paradigm conveys multiple concurrent threads for data-level parallelism. OpenCL-based HLS tools exploit this paradigm to synthesize hardware. AOCL provides attributes for NDRange kernels to transform their iteration space. The num_compute_units attribute replicates the kernel logic, whereas num_simd_work_items vectorizes the kernel implementation. Combinations of those provide a vast design space for the same NDRange kernel. However, as Figure 9 demonstrates, AnyHLS achieves implementations that are orders of magnitude faster than using attributes in AOCL.
Fig. 9. Design space for a 5 × 5 mean filter using an NDRange kernel (using the num_compute_units / num_simd_work_items attributes) and AnyHLS (using the vectorization factor v) for an Intel Cyclone V.
Finally, Figures 10 and 11 present a comparison between AnyHLS and the AOCL backend of Hipacc. As shown in Figure 2, Hipacc has an individual backend and template library written with preprocessor directives to generate high-performance OpenCL code for FPGAs. In contrast, the application and library code in AnyHLS stays the same. The generated AOCL code consists of a loop that iterates over the input image. Compared to Hipacc, AnyHLS achieves similar performance but outperforms Hipacc for multi-kernel applications such as the Harris corner detector. This shows that AnyHLS optimizes the inter-kernel dependencies better than Hipacc (see Section 3.2.2).
[Figures 10 and 11: panels for MF, Gauss, Jacobi, Bilateral, FChain, and Harris]
RELATED WORK
Languages at Register-Transfer Level (RTL). HDL-based design is time-consuming and error-prone. Modifying an existing design is often tedious: Programmers must rewrite the code several times to meet performance and/or area constraints. Recent languages such as Chisel [1], VeriScala [25], and MyHDL [12] let programmers create a functional description of their design but remain at the RTL.
High-Level Synthesis. HLS raises the abstraction level from a fully-timed RTL description to an untimed high-level specification in a language such as C++ or Java. This eases the hardware design problem by eliminating low-level issues such as clock-level timing, register allocation, and gate-level pipelining [34, 7]. Moreover, it enables software-like development for library design and verification. Modern HLS tools such as AOCL and Xilinx SDX also offer system synthesis from the same code, utilizing hardware/software co-design to map program parts to either software or hardware.
After some early attempts, HLS tools of the second wave [26] are now able to generate high-quality results for DSP and datapath-oriented applications. Several authors [e.g., 7, 26, 2] have argued that the following points are key to this success: (i) advancements in RTL design tools, (ii) device-specific code generation, (iii) domain-specific focus on the target applications, and (iv) generating both software and hardware from the same code. However, the lack of standardization of HLS languages and compilers hinders progress in this area [2]. In contrast, our approach allows developing HLS libraries in a portable and modular way.
C-based HLS. There is an ongoing discussion about whether C-based languages are good candidates for HLS [14, 39, 7, 2, 21]. The problem is that high-performance C-based code written for HLS is entirely different from a software program. The developer has to express the FPGA implementation of an application using the language abstractions of software (i.e., arrays and loops to specify the memory hierarchy and hardware pipelining). Language extensions like pragmas fill the gap left by the missing FPGA-centric features. Since HLS compilers optimize the Intermediate Representation (IR) and not the source program [7], different programming styles may lead to different results [32, 38]. This ad-hoc mix of software and hardware is hard to optimize [21, 32]. The modularity and readability of C/C++ descriptions often conflict with best coding practices of HLS compilers [15, 41]. FPGA implementations must be defined statically for high performance: types, loops, functions, and interfaces must be resolved at compile time [38, 15, 31, 41]. One option to achieve this in a generic way is C++ template metaprogramming. However, creating high-level abstractions in C++ using template metaprogramming is challenging [44, 38]. As a solution, we use partial evaluation to synthesize optimized HLS code from a given functional description of an algorithm. The generated code is then passed to the selected HLS tool. Consequently, we achieve competitive performance for different HLS tools from the same algorithm description (as shown for Vivado HLS and AOCL in Section 4).
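To make the idea of resolving everything statically more concrete, the following sketch mimics the effect of partial evaluation with hand-written staging in Python. It is not the mechanism used by AnyDSL or AnyHLS, and the names generic_stencil and specialize_stencil are hypothetical. Given a mask that is known ahead of time, the specializer emits a residual function in which the loops over the mask are gone, zero coefficients have disappeared, and only straight-line arithmetic remains.

```python
def generic_stencil(mask, img, y, x):
    """Interpretive version: loops over the mask on every call."""
    half_h, half_w = len(mask) // 2, len(mask[0]) // 2
    acc = 0
    for j, row in enumerate(mask):
        for i, coeff in enumerate(row):
            acc += coeff * img[y + j - half_h][x + i - half_w]
    return acc


def specialize_stencil(mask):
    """Emit and compile a residual function with the mask folded into the code."""
    half_h, half_w = len(mask) // 2, len(mask[0]) // 2
    terms = []
    for j, row in enumerate(mask):
        for i, coeff in enumerate(row):
            if coeff == 0:
                continue  # zero taps vanish from the residual code
            dy, dx = j - half_h, i - half_w
            pixel = f"img[y + {dy}][x + {dx}]"
            terms.append(pixel if coeff == 1 else f"({coeff}) * {pixel}")
    body = " + ".join(terms) if terms else "0"
    source = f"def stencil(img, y, x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # compile the residual program
    return namespace["stencil"], source


# Usage: specialize for an assumed 3x3 horizontal Sobel mask.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
sobel_x_at, sobel_x_source = specialize_stencil(SOBEL_X)
```

The residual source for this mask contains six pixel reads and no loop, which is the shape of code an HLS tool can pipeline without further hints. A partial evaluator derives such residual code from the generic function automatically instead of requiring a separate code generator.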
DSLs for FPGAs. Developing efficient designs for FPGAs is difficult. DSLs use domain-specific knowledge to parallelize algorithms and generate low-level, optimized code [29]. Programming accelerators using DSLs is thus easier, in particular for FPGAs, because the compiler performs scheduling. A prominent example is the FPGA version of Spiral [27], which generates HDL for digital signal processing applications. In the domain of image processing, recent projects include Darkroom [17], Rigel [18], and the work of Pu et al. [35] based on Halide [36]. Hipacc [37], PolyMage [6], SODA [4], and RIPL [42] create image processing pipelines from a DSL. Rigel/Halide, PolyMage, and RIPL are declarative DSLs, whereas Hipacc is embedded into C++. All of these compilers, except Rigel, generate HLS code in order to simplify their backends. Other examples include Lift, which targets FPGAs via algorithmic patterns [22], and Tiramisu [3] for data-parallel algorithms on dense arrays. Tiramisu takes as input a set of scheduling commands from the user and feeds them to the polyhedral analysis of the compiler. However, a considerable portion of these scheduling primitives remains platform-specific [13]. Spatial [21] is a language for programming Coarse-Grained Reconfigurable Architectures (CGRAs) and FPGAs. Koeplinger et al. [21] argue against the usual mix of hardware and software abstractions in existing HLS languages. Instead, they provide language constructs to express control, memory, and interfaces of a hardware implementation.
Unlike these approaches, AnyHLS allows programmers to build the basic blocks and abstractions necessary for their application domain by themselves. Since programmers can design custom control structures, there is no need to change the compiler when adding support for a new application domain. Previously, we have shown how to abstract image border handling implementations for Intel FPGAs using AnyDSL [30]. In this paper, we present AnyHLS and an image processing library to synthesize FPGA designs in a modular and abstract way for both Intel and Xilinx FPGAs. Partial evaluation removes the overhead of custom functional abstractions: As discussed in this paper, the final generated designs are competitive with those generated by DSL compilers such as Halide or Hipacc.
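The kind of user-defined abstraction meant here can be sketched with higher-order functions. The example below is written in Python rather than in the language used by AnyHLS, and the names clamp, mirror, and local_operator are hypothetical; it only illustrates how a border-handling strategy can be passed as an ordinary function so that a new strategy requires no compiler change. The mirror function assumes that an index overshoots the image by less than the image size.

```python
def clamp(idx, size):
    """Border strategy: repeat the nearest edge pixel."""
    return min(max(idx, 0), size - 1)


def mirror(idx, size):
    """Border strategy: reflect indices at the edge without repeating it."""
    if idx < 0:
        return -idx
    if idx >= size:
        return 2 * (size - 1) - idx
    return idx


def local_operator(mask, border):
    """Build a local operator from a mask and a border-handling function."""
    half_h, half_w = len(mask) // 2, len(mask[0]) // 2

    def apply(img):
        h, w = len(img), len(img[0])
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0
                for j, row in enumerate(mask):
                    for i, coeff in enumerate(row):
                        # The border function maps any index into the image.
                        yy = border(y + j - half_h, h)
                        xx = border(x + i - half_w, w)
                        acc += coeff * img[yy][xx]
                out[y][x] = acc
        return out

    return apply


# Usage: the same 1x3 box mask with two different border strategies.
box_clamped = local_operator(((1, 1, 1),), clamp)
box_mirrored = local_operator(((1, 1, 1),), mirror)
```

In an interpreted setting, each call to border costs time at run time. The point made in the text is that partial evaluation can remove this indirection when the mask and the strategy are known at compile time.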
CONCLUSIONS
In this paper, we advocate the use of modern compiler technologies for high-level synthesis. We combine functional abstractions with the power of partial evaluation to decouple a high-level algorithm description from the hardware design that implements it. This process is entirely driven by code refinement, generating input code to HLS tools, such as Vivado HLS and AOCL, from the same code base. To specify important abstractions for hardware design, we have introduced a set of basic primitives. Library developers can rely on these primitives to create domain-specific libraries. As an example, we have implemented an image processing library for synthesis to FPGAs. Finally, we have shown that our results are on par with, or even outperform, state-of-the-art approaches.
