Abstract-FPGA vendors now include hardened IPs to form a system-on-chip (SoC) making it easier to build embedded systems. However programming and integrating hardware accelerators (devices) into these systems present a challenge.
I. INTRODUCTION
OpenCL has garnered much interest recently because it is a language based on an open standard that can run on many different heterogeneous platforms. In the FPGA community, Altera introduced an OpenCL compiler in 2013 [1] that truly raised the FPGA design abstraction to a much higher level than hardware description languages with the significant advantage that the code can be easily migrated between different kinds of computing platforms. Xilinx has recently announced its OpenCL compiler for FPGAs [20] , which raises the commercial interest in OpenCL for FPGAs even more because of the potential for portability between FPGA platforms.
Currently, there is no open-source solution for exploring an OpenCL-compatible framework and investigating platform designs for an embedded system implementable on an FPGA. This paper describes UT-OCL, an OpenCL framework for embedded systems using Xilinx FPGAs. The goal is to make this an open-source framework that can be used to experiment with all aspects of OpenCL, primarily targeting FPGAs, including testing possible modifications to the standard as well as exploring the underlying computing architecture. And from a systems perspective, the framework can be used to explore architectures that can ultimately be captured in an ASIC. Although the current trend is to use OpenCL in high-level synthesis targeting FPGAs, it is not the focus of this paper, but it can be explored using the proposed framework.
A brief OpenCL overview is covered in Section III. The UT-OCL framework is described in Section IV, and related work is presented in Section V. Finally, to demonstrate the practicality of the framework, architectural changes applied to the platform and to a CRC application are evaluated in Section VI.
II. MOTIVATION
Compared to common practices of programming for embedded systems for FPGAs, the higher level of abstraction provided by OpenCL should make application development for FPGAs much easier for software engineers. For example, during the testing/verification phase of the custom accelerator (devices in the OpenCL platform model), minor changes to the source code are required for the user to compare the output of the device under test with that of other functioning devices.
During testing, the developer can perform more extensive tests with greater ease since test cases are software. And when evaluating multiple devices using an open-source framework like UT-OCL, the environment and the testbenches are constant, leaving the devices as the only variable in the system. Therefore, the evaluation and the comparison between multiple devices are fair and easy to setup.
Furthermore, by using an OpenCL-compatible framework as a basis for building complex heterogeneous embedded systems, an application programmer can start the implementation using an OpenCL framework in a workstation environment, where the tools for debugging and development/prototyping are easier to use and more readily available. Once the programmer is satisfied that the application is functionally correct on the workstation, the application can be migrated to the embedded platform with more confidence that the application will work. With good high-level synthesis support, some of the code could be turned into hardware devices. The goal is to do most of the development in the friendlier workstation environment and make the embedded design more of a porting To support such a development process, the embedded system needs to provide the architectural support needed to model the workstation environment so that any changes required to do the migration are minimized.
Targeting FPGAs from OpenCL has its unique challenges because FPGAs are not simply processors using a typical software design flow. The FPGA architecture is much different than the common platforms (e.g. CPUs, GPUs) targeted by OpenCL implementations. For example, FPGA vendors have recently introduced programmable system-on-chip (SoCs), where an SoC is coupled with FPGA fabric creating a ready-touse platform for an embedded environment, such as the Zynq platform [21] . Furthermore, the flexibility of FPGAs, the long compile times and the potential for partial reconfiguration [18] are some of the challenges that FPGAs introduce, leaving much opportunity for OpenCL to adapt to this type of platform.
While OpenCL defines a particular programming model, there is still lots of opportunity to explore the implementation of the hardware and software that supports the programming model as well as the programming model itself. This is especially important given that the standard continues to evolve and will. For example, in the OpenCL 2.0 specification [7] , streaming capability amongst kernels is now present. It will be important to study the best ways to support streaming, also known as Pipes. To do this research, it is necessary to have a completely accessible framework that allows experimentation on both the hardware architecture and the software. Furthermore, when using an open-source framework, comparison between related studies using the framework would be easier and more fair.
III. OPENCL OVERVIEW
This section will provide a brief overview of an OpenCL framework. For additional information, technical details of the OpenCL framework can be found in the OpenCL specification [7] . Figure 1 shows a brief overview of an OpenCL framework. The bottom layer of Figure 1 shows the Platform Model of an OpenCL framework. In this model, the Host is connected to one or more Devices. In the example of Figure 1 , the Host is connected to two devices, Device A and Device B. For the remainder of this paper, a device is analogous to a hardware accelerator in an embedded system. Traditionally, when developing with OpenCL, the Host application is executed on a CPU running an Operating System (OS), whereas the Kernels executed on the devices do not have OS support. In an OpenCL framework, the Host application uses a compiler (JIT, interpreter, etc..) and an OpenCL implementation to execute the source on the Host, and the Kernel is compiled and executes on the Device using a driver.
In the remainder of this paper, the term OpenCL framework refers to the OpenCL API, compiler (JIT, interpreter, etc..) and driver allowing the OpenCL application to execute on the platform. Figure 2 shows a block diagram of the hardware system of the UT-OCL framework. Much like the standard OpenCL platform model, the hardware system is composed of two subsystems: a subsystem executing the host application (Host subsystem), and another subsystem executing the kernels (Device subsystem). The hardware system is implemented on the ML605 development platform [17] . The remainder of this section will describe details of these subsystems.
IV. THE UT-OCL FRAMEWORK

A. Host subsystem
The vast majority of OpenCL implementations run on CPUs running an Operating System (OS), thus providing the host application with OS support. To allow for a host application to be easily ported to our framework, which is targeted for embedded systems on FPGAs, our host subsystem should be capable of running an OS.
The main processing unit of the host subsystem is the Host implemented using a MicroBlaze microprocessor [16] . It is augmented to run Linux. It connects to the Device SubSystem Manager using an AXI Stream interconnect. System peripherals that are only used during the OS boot process (e.g. the GPIO, FLASH controller, debug module and ethernet controller) are not shown in Figure 2 .
When the MicroBlaze is configured to run Linux, the instructions to access the stream ports are privileged, meaning these instructions can only be executed in the kernel space of the operating system. A device driver, in the form of a Loadable Kernel Module (LKM) was implemented to allow a user process running on the Linux OS to execute these instructions. From hereafter, this LKM will be referred to as the stream driver.
A good OpenCL implementation requires the host subsystem to send to and receive from the device subsystem concurrently. Hence, the driver is implemented with two virtual buffers, one for each direction of the stream port, and two kernel threads managing these buffers. By having two threads, the communication between the host subsystem and the device subsystem can occur concurrently, however OS overhead for managing these threads is introduced.
In the Linux Operating System, memory accesses use virtual memory and paging. Unfortunately, the device subsystem does not have access to the memory management mechanism in Linux. However, in most OpenCL applications, the host application and the kernels share data. Therefore, it is necessary for the addressing scheme between the host and device subsystems to be compatible, so they can reference the same data.
To solve this issue, we partitioned the physical memory into two equal sized partitions. The first partition (Linux partition) is used and managed by the Linux OS. The second partition (shared partition) is accessed by the host application and the kernels. Both parts access the same data in the shared partition by using physical addresses. By using this approach, both subsystems, i.e. the host application and the kernels, will use the same addressing scheme to access data, thus solving the issue.
For the host application to access the physical addresses of the shared partition, another LKM (device driver) was implemented. The driver maps the address of the shared partition to a virtual memory so the shared partition can be accessed by the OS. From hereafter, this LKM will be referred to as the iomem driver.
The Host MicroBlaze has its data and instruction caches enabled to increase the performance of the OS running on the processor. Since there is no coherency scheme between the two subsystems, the host MicroBlaze was modified to only cache the address range from the Linux partition. As a consequence of this modification, only the memory requests for the Linux partition are sent to the memory bus, so we added an AXIto-AXI connector [14] between the peripheral bus and the global memory bus, permitting the host MicroBlaze to access the contents in the shared partition. The port connected to the peripheral bus does not have a mechanism for memory burst as does the port connected to the memory bus. Thus, with the new changes, accessing continuous addresses through the peripheral bus requires more time to complete compared to accessing continuous addresses through the memory bus. In our experiments, we compare this current setup to a setup with a core that performs burst accesses to the shared partition using the stream driver.
B. Device subsystem
The device subsystem is composed of three major components: the Device SubSystem Manager (DevSubSys), the Kernel Database Manager (KernelDB) and a Device Manager (DevManager). These components communicate with each other using a message passing paradigm.
1) The Device SubSystem Manager (DevSubSys):
The DevSubSys is implemented using a MicroBlaze microprocessor [16] . It is the communication portal for control and requests between the Host and the devices in the system. In addition, it is responsible for routing messages from the Host to the appropriate DevManager and sending requests to the KernelDB.
2) The Kernel Database (KernelDB): The KernelDB is also implemented using a MicroBlaze microprocessor [16] , and is responsible for accessing kernel information. The kernel information for individual kernels is stored in a file that the KernelDB accesses.
3) The Device Manager (DevManager): The DevManager is also implemented using a MicroBlaze microprocessor [16] . One DevManager is required per device in the system. The DevManager receives the work dimensions specified in the clEnqueueNDRangeKernel, the kernel body from the KernelDB and the kernel arguments set by clSetKernelArg in this order. It also contains a table with the device attributes used during queries and is responsible for notifying the DevSubsys when a kernel has completed its execution on the device.
In addition to the hardware system, the framework also includes a hardware debugging system that allows for the user to have more insight into the internals of the hardware for testing purposes. The framework also implements the profiling interface in the OpenCL specification used to provide fair and accurate measurements when executing kernels.
V. RELATED WORK
While there have been several efforts exploring the use of OpenCL as a description language for high-level synthesis, that is not the goal of this work. Here, we describe prior work that has the goal of developing platforms that can execute OpenCL applications.
Tomiyama [13] presents SMYLE OpenCL: A Programming Framework for Embedded Many-core SoCs. In their framework, they analyze the host source program to identify the type and the size of the various OpenCL objects. Then, they statically reserve memory space for the objects in shared memory. As a consequence, they must statically map the kernels onto the devices in the system. In our framework, the kernels can be mapped to the devices dynamically. Furthermore, memory for the objects is also allocated dynamically. Such features make our framework more versatile during runtime.
Similar to SMYLE OpenCL, Ma [9] presents a design flow that analyzes the source code of an OpenCL application and generates the hardware platform, as well as the corresponding executable running on the platform. In their design flow, OpenCL constructs are mapped to components from a predefined system model [3] . For example, the kernel instances are transformed into "HybridThreads" (Hthreads) [11] , so they can execute in their SoC designed for FPGAs.
Contrary to Ma [9] , the design of the UT-OCL's hardware system is independent of the source code of an OpenCL application. Moreover, the host application is executed on an OS. As mentioned, enabling the host application to execute on an OS provides the user with abstractions available when developing with OpenCL in a workstation environment. This makes the portability of an OpenCL application to our framework much easier. Then, the user can profile the application and harden tasks that are computationally intensive.
As shown with Ma [9] , an OpenCL application exhibits properties of a threaded paradigm. For example, a kernel instance can be modelled by a thread. Some works merge a software threaded paradigm onto reconfigurable platforms [8] [2] [6] . Applications running on these systems have the context of a thread, which is managed and scheduled by the OS.
In an OpenCL framework, a kernel instance is not an entity managed by the part of the OpenCL application running on the host, but an entity managed by the device. Therefore, unlike the threaded systems for reconfigurable platforms [8] [2] [6] , the UT-OCL framework manages and schedules the kernel instances outside the scope of the OS, more specifically in the device subsystem.
In the work presented by Ahmed [12] , the FPGA platform was incorporated into an OpenCL framework to enable the execution of kernels on an FPGA. The host application is executed on a CPU running the Linux OS, and the kernels are executed on the FPGA. In our framework, the host application is executed on a processor that resides in the FPGA, bringing the OpenCL framework into embedded systems.
In contrast to the work presented thus far, some work [4] [10] [20] extend the OpenCL framework to a high-level-synthesis tool. Such an application of the OpenCL framework is not the intent of this work, but the addition of high-level synthesis into the framework for generating custom accelerators would significantly improve the means for creating the custom devices and, as mentioned, a prospective enhancement to the framework.
VI. EXPERIMENTS
UT-OCL is an ideal framework for creating a benchmark suite for custom devices. The benchmarks are written as kernels with their host application counterparts, where the benchmark uses the profiling interface to retrieve its runtime. By using a benchmark suite built with the UT-OCL framework, the evaluation methodology is consistent and allows for fairer comparison amongst the custom devices.
From a research perspective, the benefit of an open-source OpenCL framework like UT-OCL allows for exploration in a broad range of research topics. Potential research directions using UT-OCL include: hardware system architecture exploration and custom device exploration, as well as the conformance of OpenCL on custom devices and on FPGAs as a platform.
The purpose of this paper is to describe the many features of the UT-OCL framework as well as research potential when using this open-source framework 1 . Hence, we demonstrate the practicality of the framework by evaluating architectural changes applied to the hardware system and to an application performing cyclic redundancy check (CRC). All experiments have been executed using the ML605 development board with designs targeting a 100 MHz system clock. The runtimes were recorded using the profile interface in UT-OCL. To account for OS overhead and noise such as context switching and scheduling, ten runs of the same experiment were executed and the average of these runs was taken.
A. Architectural changes applied to the hardware system
As mentioned in Section IV, the host subsystem is designed with an iomem driver to access the physical address of the shared partition. Such a design enables the addressing scheme between the host and device subsystems to be compatible, so they can reference the same data. A consequence of using this design is that the host processor is unable to perform burst accesses to the shared partition.
The Datamover core from Xilinx [15] is a peripheral that can be configured to perform burst memory accesses. To communicate with the Datamover core, the host MicroBlaze uses the stream driver. In this Section, we compare the tradeoffs of accessing the shared partition using the iomem driver to the Datamover core from Xilinx [15] accessed by the stream driver.
For the experiment, we implemented a host application using the read (clEnqueueReadBuffer) and write (clEnqueueWriteBuffer) functions from the OpenCL implementation. These functions are designed to copy data between the Linux partition and the shared partition. We executed the read and write functions with five different data sizes (64KB, 256KB, 1MB, 4MB and 16MB) and three different maximum burst lengths (16, 64 and 256) . The runtimes using the stream From Figure 3 , we observe that, for both the read and write operations, the runtime using the datamover core with the stream driver is larger than the runtime using the iomem driver. Across all experiments performing the read operation (Figure 3a) , the runtime with the Datamover core and the stream driver averages 1.4 times the runtime with the iomem driver, and for the write operation (Figure 3b ), the runtime with the Datamover core and the stream driver averages 3.0 times the runtime with the iomem driver. The additional runtime is a result of the overhead of the virtual buffers and threads in the stream driver. Although the virtual buffers and threads in the stream driver are needed for the host and device subsystem to communicate concurrently, such a requirement is not needed for transferring data between the shared partition and the Linux partition.
As a result, we modified the stream driver to bypass the virtual buffers and access the stream ports directly for the ports connected to the Datamover core. Figure 4 shows the runtime using the modified stream driver with the Datamover core normalized to the runtimes using the iomem driver for the read and write operations. In Figure 4 , labels Stream-Direct-16, Stream-Direct-64 and Stream-Direct-256 refer to experiments with the modified stream driver that accesses the stream ports directly with a burst length of 16, 64 and 256 respectively.
From Figure 4 , we observe that the modified stream driver with the Datamover core performs faster than the iomem driver. The average runtime for the read operation with modified stream driver and the Datamover core is 0.8 times the runtime with the iomem driver, shown in Figure 4a . Similarly, the average runtime for the write operation with modified stream driver and the Datamover core is 0.8 times the runtime with the iomem driver, shown in Figure 4b . When using the data from Figure 3 and 4, we can conclude that the virtual buffer and the threads in the stream driver represent 43% of the runtime for the read operation and 73% of the runtime for the write operation.
In addition, for all experiments in Figures 3 and 4 , we observe that the burst length does not have an impact on the runtime. The reason being that the host processor is not sending data to the Datamover core quick enough to leverage the bandwidth of the memory controller. Table I shows the absolute resource utilization of the Datamover core with the different burst lengths. The differences between the systems with the Datamover Core relative to UT-OCL's hardware system is shown in parenthesis. From the data in Table I , we observe that as the burst length increases the Datamover core uses more resources. The additional resources are used to implement the mechanism to support the additional burst lengths.
There is not much variance in the additional amount of FF and LUT used by the Datamover core systems. However, there is a significant increase in BRAM usage for the Datamover core with burst length of 256, which should be considered. Much like other embedded systems, dedicated memory primitive types on FPGAs are scarce, thus their allocation should be carefully examined. The designer is responsible for evaluating the trade-off. And, as demonstrated in our experiments, the Datamover core with a burst length of 16 performs equally well compared to the other burst lengths. Hence, although the resource utilization of the Datamover core with a burst length of 16 is similar to the Datamover core with a burst length of 64, there is no performance benefit when using a burst length of 64, therefore we will use the Datamover core with a burst length of 16 in our hardware system.
B. Architectural changes applied to a CRC application
In this Section, we explore the effects of architectural changes on a CRC application. We have selected a CRC application, since it is commonly used in telecommunication in an embedded environment. For this experiment, we used the CRC implementation from MI Bench [5] , a commercially representative embedded benchmark suite. We extracted the core computation from the benchmark to create an OpenCL kernel.
The kernel was implemented in software (crc-sw) and in hardware (crc-hw). The software version of the kernel is [19] . By comparing these two implementations, we can evaluate the performance of the kernel using a general purpose processor and using custom hardware. The kernels were executed with five different input sizes: 64KB, 256KB, 1MB, 4MB and 16MB. The runtime of crc-sw normalized to crc-hw is shown in Figure 5 .
For all input sizes, we observe that crc-hw performs faster than crc-sw. The performance benefit is a result of parallelism extracted from the kernel by Vivado HLS. In addition, the HLS Tool was able to collapse many instructions from the software implementation that execute in many cycles on a processor into a custom operation by leveraging the FPGA architecture to execute in a single cycle. Hence, we have extracted parallelism from the kernel to create custom hardware that increases the performance of an application in an OpenCL environment.
VII. CONCLUSION
FPGA vendors have recently developed FPGA platforms with an integrated SoC to easily build embedded systems using FPGAs. However, support for integrating and managing custom accelerators, devices in the OpenCL model, is significantly lacking. In this paper, we presented UT-OCL an opensource OpenCL framework for embedded systems on FPGAs. UT-OCL is composed of a hardware system and its software counterparts that can execute OpenCL applications compliant with the 2.0 specifications.
By making the framework open-source, continuing effort on adapting and improving OpenCL for FPGAs can be performed, including testing possible modifications to the standard. This is very important as the standard continues to evolve, which is the primary motivation for developing this framework.
To demonstrate the practicality of the framework, architectural changes applied to the hardware system and to a CRC application have been evaluated. In the hardware system, we quantified the overhead of virtual buffers and threads in the stream driver, and developed a burst mechanism that increases the performance of data transfer between the Linux partition and the shared partition. For a CRC application, we have shown that a commercial HLS tool can be applied to a kernel to create custom hardware to increase the performance of an application. By using UT-OCL as an evaluation environment, future CRC implementations can be compared fairly with our implementations. In the end, the release of UT-OCL into the research community would permit the exploration of a broad range of research topics, as well as fuel other prospective research topics in the context of OpenCL in an embedded environment using FPGAs.
