High-level FPGA synthesis tools aim to increase the productivity of FPGAs and to adopt them among software developers and domain experts. OpenCL is a specification introduced for parallel programming across heterogeneous platforms. In this paper, an automated compilation flow to generate customized application-specific hardware descriptions from OpenCL computation kernels is reported. The flow uses Xilinx AutoESL tool to obtain the design specification for OpenCL kernel cores. The provided architecture integrates generated cores with memory and OpenCL host application interfaces. The host program in the OpenCL application is compiled and executed to demonstrate a proof-of-concept implementation towards achieving an end-to-end flow that provides abstraction of hardware at the front-end.
INTRODUCTION
Applications are embracing heterogeneous High Performance Computing (HPC) solutions. These range from Graphic Processing Units (GPU), Field Programmable Gate Arrays (FPGA), heterogeneous multicore processors like Cell to hybrid architectures like Convey HC-1. Different classes of applications employ different targets best suited for the problem at hand. Each of the accelerator technologies possesses advantages that lead to possibilities of a heterogeneous mix of architectures [1] [2] [3] . Reconfigurable systems like FPGAs provide opportunities for acceleration in many fields due to their inherent flexibility and massive parallel computation capabilities.
The multitude of hardware devices require comprehensive approaches to solution which include the knowledge of underlying architecture along with methods of designing the algorithm. This translates to an increase in implementation effort, high learning curves and architecture aware programming. DARPA proposed a "time to solution" as a metric that includes the time to develop a solution as well the time taken to execute it [4] . CPUs and GPUs are programmed in high-level languages like C/C++ and CUDA. This level of abstraction enables faster deployment of solutions. In the case of FPGAs, implementations require tedious hardware design and debug which greatly impact the development time.
FPGAs can deliver significant acceleration due to the flexibility and parallelism provided by the fine-grained architecture. FPGA applications implementation involves cumbersome RTL programming and manual optimizations. Besides domain expertise and software design skills, developers are required to understand intricate details of hardware design including timing closure, state machine control and cycle-accurate implementations. As a result the effort in drastically reducing the execution time translates to increased development time. Chase et al. implemented a tensor-based real time optical flow algorithm on FPGA and GPU platforms [31] . Performance results are comparable, while the FPGA power consumption is significantly lower. However, the FPGA implementation took more than 12x longer to develop.
There has been significant research and industry related work in providing high level programming solutions for FPGA to reduce these design efforts [5] [6] [7] [8] [9] [10] [11] 14, [16] [17] [18] . None of it yet succeed in abstracting all implementation details as the knowledge of hardware concepts is required, to an extent, to appropriately design the software solution for maximum performance.
OpenCL is a cross platform standard for parallel applications. The programming model provides a standard that enhances portability across devices, rendering it platform independent. The developer can focus on the algorithm while abstracting the implementation details. Being able to develop applications in OpenCL for FPGA platforms would result in faster implementation times and time-to-market.
This work reports a proof of concept system that enables OpenCL application development on FPGAs. An efficient method for compiling OpenCL kernel tasks to hardware is demonstrated, without having to build a compiler from scratch. Also, existing open-software tools are reused. A system architecture is presented for the target Convey device with an automated generation of interfaces for the kernel. Also, a library that supports the OpenCL host API is provided. The detailed description of reported work can be found in [32] .
The rest of the paper is organized as follows. Section 2 presents a background on high-level synthesis for FPGA platforms. Section 3 introduces the source-to-source translation, compilation processes and discusses the execution model. Results are discussed in Section 4. Section 5 concludes the paper and proposes future work.
II. EASE OF USE
In this section, the existing high-level synthesis technologies for addressing FPGA productivity concerns are discussed.
A. High-Level Synthesis
FPGAs are being used in embedded systems and high performance computing solutions to deploy complete systems or application-specific coprocessor. FPGAs provide significant performance and power advantages. However, application development on FPGA requires expertise in hardware design.
Several commercial tools and academic research projects have risen to reduce the design efforts. Table 1 shows few such tools categorized according to their source inputs.
In the text-based approach, the most common source input is the C/C++ derivative with syntax restrictions on recursions and pointer manipulations. The tools facilitate faster prototype, implementation and debugging through the familiar C environment. Handel-C [5] targets synchronous hardware design. Parallel constructs are used to express parallelism in an otherwise sequential C code. Impulse-C [6] is yet another Cbased language, derived from Streams-C [7] and targets dataflow-oriented applications. Impulse-C includes a C-to-FPGA compiler and platform support packages for various HPC systems. Other tools like C2H [8] , Catapult-C [9] , Mitrion-C [10] , C-to-Verilog [11] explore different techniques and mechanisms for efficient compilation of traditional C into optimized hardware. While the input specifications into these tools are close to the C language, the code is often annotated with additional constructs that control the specifics of the circuit implementation. The developer has to follow a hardware aware programming approach in order to generate optimized implementations, thus the knowledge of digital design is required. OpenCL is a platform independent framework for parallel programming [12] that is functionally portable across platforms and various accelerator technologies. Commercial compilers from AMD, NVIDIA, Intel and Apple enable development of OpenCL applications on CPUs and GPUs. For years, CUDA has been used to develop many GPU-accelerated implementations. CUDA to OpenCL translators like CU2CL [13] were created in an effort to utilize these designs on non-Nvidia GPUs. Providing support for OpenCL on other hard-ware architectures including multi-core models, FPGAs and heterogeneous environments with different combinations of CPU, GPU and FPGA is another path being tread. OpenCL computation kernels are described at the finest granularity making it an inherently parallel language. In case of C-to-HDL compilers, the input language is sequential. Thus extraction parallelism at the instruction level is cumbersome. Optimizations at a fine-grain level is achieved through the use of appropriate directives. Due to this, a learning curve is associated with every tool for writing C code in a manner that is efficient for compilation to hardware. The OpenCL kernel mapping onto an accelerator is abstracted from the front end, thus ensuring that the programmer is agnostic of the underlying hardware.
B. Related Work
In Altera's OpenCL for FPGAs [14] kernel functions are implemented as dedicated pipelined hardware and replicated. The OpenCL host program is compiled using the ACL compiler. goHDR reported to have achieved a substantial reduction in the development time and a significant increase in performance using Altera's OpenCL for FPGA [15] .
FCUDA explores FPGA programming using CUDA [16] . The design flow includes source-to-source compilation from the annotated CUDA code to C program for Autopilot. Autopilot is a high-level synthesis tool that converts input specifications in C, C++ or SystemC into RTL [17] . The translator coarsens the granularity and extracts parallelism at the level of thread blocks. [18] extends this work to provide flexible parallelism at different levels of granularity.
Lin et al. designed the OpenRCL system for low-power high-performance reconfigurable computing [19] . An LLVMbased compiler converts kernel functions into low-level descriptions. LLVM is an open source compiler infrastructure [20] . Experimental results showed the performance of FPGA is comparable to GPU while the power metric is considerably in favor of FPGAs. This work is generalized in [21] .
Silicon-OpenCL (SOpenCL) [22] is an architecture synthesis tool and follows a template-based hardware generation for the FPGA. Source-to-source code trans-formations coarsen the granularity of the kernel functions from a work-item level to that of work-groups. SOpenCL extends the LLVM compiler and generates HDL code for these functions.
Falcao et al. [3] proposed a framework for accelerating simulations. The OpenCL programming model is used to target a CPU, GPU and an FPGA without any modifications to the input code and using SOpenCL for mapping OpenCL kernels onto FPGA reconfigurable fabric. Results have shown the GPU+CPU and FPGA+CPU computations outperform the pure CPU in terms of throughput, while the performance of the FPGA-based as compared to that of the GPU-base systems, but depends on the size of the design and number of iterations.
Most of the OpenCL to FPGA projects include development of a dedicated tool flow that converts the C or OpenCL C specifications to RTL. This project aims to build an end-to-end flow leveraging the existing tools and to use current technologies to the best advantage in converting high-level algorithms to RTL descriptions so that other aspects like architecture aware optimizations can be concentrated on. Most approaches coarsen the granularity to the work-group level, and follow a sequential execution for all the work-items within a workgroup. This work increases the concurrency by maintaining the fine-grained parallelism of the language.
III. APPROACH AND IMPLEMENTATION
This section introduces the approach and discusses the implementation details involved in enabling development of OpenCL application on FPGA platforms.
A. Approach and introduction to AutoESL
The OpenCL application exists in two parts -an OpenCL C kernel that define the algorithm for a single instance in the index space on the device and a C/C++ host program that uses OpenCL API for configuring and managing the kernel execution.
In this work, the conversion of the kernel code into hardware utilizes Xilinx AutoESL C-to-HDL tool [23] . A source-to-source translator is built to convert the OpenCL kernel to AutoESL C code, thereby offloading the directive based programming on the code translator. The granularity is maintained at the level of a work-item. The HDL core generated from AutoESL represents a single kernel instance. Multiple instannces are integrated with memory and dispatch interfaces on the FPGA devices. A subset of the OpenCL API for the host has been supported to enable testing of applications on the accelerator hardware. The target platform in this implementation is the Convey HC-1 hybrid core computer [24] . The AutoESL tool integrates with the LLVM compiler [25] and optimizes the input to reduce code complexity, maximize data locality and extract parallelism. It also provides a method for functional verification of the generated hardware description using the RTL-cosimulation.
B. Implementation Specifics
The steps involved in generating the host executable and a hardware implementation for the FPGA accelerator device is shown in Figure 1 . The parts of the flow enclosed in dotted lines indicate the source-to-source translation from OpenCL to AutoESL C. AutoESL synthesis translates the C code to HDL. In the interface generation and integration step, interfaces between the kernel HDL and Convey PDK framework are generated. The entire design is then implemented using Xilinx ISE tools.
1) OpenCL to AutoESL Source-to-Source Translation
Various mechanisms are in use for source-to-source translations between languages at the same abstraction level. One of the methods adopted is to convert the input source to an intermediate representation, perform required transformations, and generate code in the output language. Numerous compilation frameworks [26] [27] [28] are already available. In this work the feasibility of managing all transformations using simple graphs is demonstrated. Clang framework [29] is used to obtain the Abstract Syntax Tree (AST) of the input code and Graphtool [30] is used for AST graph manipulations.
Clang is an open source compiler front end API, with libraries for parsing, lexing and code analysis. Clang is easier to embed into applications as compared to gcc, which is a monolithic static compiler binary. Clang emits AST files for the source inputs and is used to generate a directed graph for the kernel functions in a dot format. The custom AST traversal method visits all required statements and includes the variables as well as the operators into the dot file producing input for the Graphtool. Figure 2 shows the OpenCL C kernel program for a vector addition application. The instance identifies the index of this element in the vector using its global ID. The dot file visualization for the abstract syntax tree of vector addition kernel is shown in Figure 3 . Graphtool is a Python module for graph analysis and manipulations, with functions for many graph algorithms. The subgraph_isomorphism function performs structural pattern matching and is used in the translator to identify function calls. In the vector addition example get_global_id(0) is used to obtain the global ID of the instance. In the hardware implementation these values become the input parameter to the core modules. In order to enforce additional handshake signals in RTL the pointers are annotated with AutoESL ap_bus interface directive.. The resulting graph after performing all required transformations is reparsed to generate an AutoESL C code shown in Figure 4 . 
2) AutoESL Synthesis, Integration and Mapping
AutoESL synthesizes the translated C program and generates customized FSM with data path RTL implementation that corresponds to a single processing element (PE) onto which work-items are mapped (see Figure 5 ).
The target platform in the implementation is the Convey HC-1 hybrid core. The coprocessor hosts four Xilinx Virtex-5 XC5VLX330 FPGAs (Application Engines), eight memory controllers (MC) etc. Each AE is connected to the system through a dispatch interface, memory controller interfaces, AE-AE interface and a management interface.
The Convey coprocessor is an OpenCL compute device, with AE's being compute units as shown in Figure 5 . A single work-group is mapped onto an AE. Scheduling of the workgroup tasks among compute units is done by the host CPU. Each AE contains multiple kernel instances onto which workitems are mapped. The dispatch unit sends the appropriate IDs and control signals to the cores. Round-robin arbiters control the load/store requests from the cores to the memory controller interfaces. Two arbiters for every MC facilitate up to sixteen parallel memory accesses. The generation of the interface modules and their integration are automated and transparent to the user. The global memory, which can be read/written to by all work-items, is mapped onto the external DDR2 modules on the coprocessor. Local memory, (smaller and faster) is implemented using on-chip BRAMs on the AE. Registers within the kernel core modules are used for implementing private memory.
3) Host Library
The host library in this work contains definitions for a subset of the API required to test the execution of the kernel tasks on the hardware. The definitions are targeted specifically for the Convey platform. One of the aspects of OpenCL is online compilation where the OpenCL C programs are built at run-time. Since FPGA implementation times on Convey run into hours, a pre-compiled binary kernel (bitstream) is loaded onto the hardware. The disadvantage of offline compilation model is that the number of dimensions and the size of a workgroup in each dimension has to be fixed at compile-time as the hardware implementation varies depending on these numbers.
IV. RESULTS
The objective of this work is to develop a proof of concept system that enables the development of OpenCL applications on FPGA platforms leveraging existing tools, e.g. Clang and AutoESL compilers. In this section, a vector addition application is studied. The performance and resource utilization numbers for different input parameters are presented and evaluated for different sizes of the work-group. Full results can be found in [32] . Table 2 compares the performance results between the vector addition example from Convey and the OpenCL implementation for the same application. The bitstreams for the OpenCL accelerator devices are generated for three different values of the work-group size (per AE). The total number of cores on the coprocessor device is 64, 128 and 192 respectively. Convey's example design consists of 16 adder modules per AE. Modules access memory in a continuous manner over the entire range as opposed to the OpenCL implementations where batches of tasks are scheduled by the host. The scheduling at the work-group level calls for additional overhead which is prominent in smaller designs as can be seen from the execution times in the table. With the vector size constant, as the work-group size is increased, the number of work-group tasks to be scheduled decreases thus reducing the total execution time. The performance numbers for different vector sizes is shown in Figure 6 .
1) Performance and Resource Utilization
The total device utilization for each of the implementations is shown in Figure 7 . The resource utilization for the OpenCL implementations are observed to be much lesser than the Convey example. With enhancements to the memory access patterns and differentiation between the physical and logical number of cores in a work-group, performance improvements over the current implementation can be achieved. Table 2 shows a comparison of the compilation flow and architecture presented in this work with other implementations.
B. Comparison of Methodologies
The compilation flow presented in this paper avoids the reinvention of a C-to-HDL compiler by using an existing tools. Both OpenRCL and SOpenCL build their compilers using the LLVM framework. A simluation environment is available in the current implementation using which the verilog files can be simulated along with the host program to ensure functional correctness or explore alternatives. Also, the fine-grained parallelism of the application is maintained and support for multiple FPGA devices is demonstrated. There has not been any explicit mention of this feature in the previous work implementations.
A common factor in comparisons against the Altera's tool for supporting OpenCL applications on FPGAs is that both implementations are platform dependant. Altera's tool implements the kernel logic as deeply pipelined hardware circuits, which are then replicated to increase parallelism. The implementation presented in this paper is specific to Xilinx devices due to the use of AutoESL tool. 
V. CONCLUSIONS
The main aim of this work was to successfully demonstrate the compilation and execution an OpenCL application on FPGA platform. High-performance applications require high design efforts for FPGA implementations. This project improve the design productivity of FPGAs using OpenCL as the high-level programming language for development. A method of conversion of OpenCL C kernels into hardware accelerators has been discussed and demonstrated using existing open-source tools and the Convey HC-1 hybrid computer.
OpenCL abstracts the low level details of hardware implementation through its virtual architecture. This provides functional portability across different platforms. Various architecture aware optimizations are needed for every hardware device, in order to obtain maximum performance. The biggest advantage of FPGAs is the ability to use different bit widths for the data. This advantage is nullified OpenCL coding, as the developer is limited to the numerical data types provided in the language. Future work will focus on domain-specific performance improvements and support for special data types.
