Programming Heterogeneous Systems from an Image Processing DSL
Jing Pu, Steven Bell, Xuan Yang, Jeff Setter, Stephen Richardson
Jonathan Ragan-Kelley†, Mark Horowitz
Stanford University †UC Berkeley
{jingpu,sebell,xuany,setter,steveri,horowitz}@stanford.edu, †jrk@berkeley.edu
Abstract
Specialized image processing accelerators are necessary
to deliver the performance and energy efficiency required
by important applications in computer vision, computational
photography, and augmented reality. But creating, “programming,”
and integrating this hardware into a hardware/software
system is difficult. We address this problem by extending the
image processing language Halide so users can specify which
portions of their applications should become hardware accel-
erators, and then we provide a compiler that uses this code
to automatically create the accelerator along with the “glue”
code needed for the user’s application to access this hard-
ware. Starting with Halide not only provides a very high-level
functional description of the hardware, but also allows our
compiler to generate the complete software program including
the sequential part of the workload, which accesses the hard-
ware for acceleration. Our system also provides high-level
semantics to explore different mappings of applications to a
heterogeneous system, with the added flexibility of being able
to map at various throughput rates.
We demonstrate our approach by mapping applications to
a Xilinx Zynq system. Using its FPGA with two low-power
ARM cores, our design achieves up to 6× higher performance
and 38× lower energy compared to the quad-core ARM CPU
on an NVIDIA Tegra K1, and 3.5× higher performance with
12× lower energy compared to the K1’s 192-core GPU.
1. Introduction
The performance and energy efficiency of image processing
tasks are becoming increasingly important as cameras be-
come ubiquitous, and as our ability to extract information
from images improves. These tasks are extremely computa-
tionally intensive, requiring, for example, 120 gigaops/sec to
process 1080p/60fps raw video [13]. To efficiently process
so many pixels, designers historically built custom hardware
engines specialized to the task. For example, a typical im-
age signal processor (ISP) in a mobile SoC operates at 1.2
gigapixels/sec [26] and hardware video codecs perform an
equally immense amount of processing, both with power bud-
gets low enough for a smartphone to run for hours on a small
battery. This efficiency is possible because the applications
have extreme data locality, and matching the hardware to their
computation patterns can yield enormous energy savings.
The problem with these platforms is the brittle nature of
the functions provided by the accelerators. These functions
are relatively fixed, in order to keep their performance and
[Figure 1 plot: energy (vs. CPU alone) against FPGA size (relative to XC7Z020), with one curve for fixed throughput and one for variable throughput. Annotations: small kernels mapped; large kernel mapped at lower throughput; large kernel mapped; entire design fits on FPGA.]
Figure 1: Energy savings from accelerating a large application
composed of one large kernel computing depth from stereo
plus 19 small convolution kernels on a CPU/FPGA system at
different FPGA area limits. When accelerators run at unit rate
or not at all, it takes significant FPGA resources to get energy
savings (blue). By contrast, our variable-rate system provides
benefits even from very small FPGA fabrics (orange).
efficiency high, so their utility to an application programmer
is limited to predefined library calls. However, application
demands are rapidly evolving, and a more flexible approach
is needed. Configurable hardware, using either coarse-grained
reconfigurable arrays (CGRAs) [15, 9] or FPGAs [12], is one
approach to providing a flexible machine that can be config-
ured to match the data flow of different algorithms. Although
these architectures promise much better energy efficiency com-
pared to CPUs or GPUs, programming and integrating them
into complete real world systems remains a formidable task
for application developers.
To help address the programming part of this challenge,
C-based high-level synthesis (HLS) has been widely studied
in past decades [14, 6, 37]. C HLS tools raise the design
level by decoupling clock timing and automatically scheduling
pipelines and other resources. However, designers still need
to create a good microarchitecture in their C-code in order to
develop high performance implementations, which requires
hardware expertise.
So, to further reduce the hardware knowledge a designer
needs, researchers created domain specific languages (DSLs),
which can embed microarchitecture knowledge for a specific
application domain in the compiler. These DSL systems, including
Darkroom [13] and HIPAcc [29] for imaging, can then
generate efficient FPGA and ASIC designs from high-level
image codes [3, 18, 8, 25].
arXiv:1610.09405v1 [cs.SE] 28 Oct 2016
This paper builds on current DSL-to-FPGA systems like
Darkroom and HIPAcc and extends them in three important
ways. First, rather than creating a specialized language
for our DSL, we use a widely popular open-source DSL,
Halide [27, 28], to describe our applications. In addition
to giving us a large collection of existing image-related algo-
rithms, it forces us to extend the microarchitecture template
and compiler techniques used in previous systems to support
these real-world applications. For example, we handle the
affine indices and data dependent reduction often found in
kernels like downsampling or histogram.
Second, in addition to the one pixel/cycle pipelines of prior
systems, we can specify and generate kernels with variable
throughput rates, exploiting space-time tradeoffs often needed
to map real applications on FPGAs with finite hardware. Figure 1
dramatically illustrates the advantage of this flexibility
for accelerating large applications. At the previous
one-pixel/cycle fixed throughput (blue), no significant energy savings
are seen until the FPGA is large enough to accelerate the
application’s largest kernel, stereo. Accelerating stereo at a
much lower throughput rate requires significantly less area,
so the savings are seen even with very small FPGAs (orange
“variable” line). As resources increase, our system gracefully
tunes the throughput and includes more stages for acceleration.
Finally, our system not only generates the FPGA kernels,
it also creates all the software needed to connect that hard-
ware to the user’s application. Thus in addition to the FPGA
configuration file, our system creates the CPU portion of the
algorithm, the Linux kernel drivers for the accelerator, and the
software glue that maps user Halide calls onto kernel driver
calls that access the hardware. The automatic CPU/FPGA
integration greatly helps to explore the workload partitioning
between CPU and FPGA for system-level optimization, as we
saw in Figure 1.
This paper makes the following contributions:
• We demonstrate that the popular image DSL Halide is suf-
ficiently restrictive that much of its computation can be
“compiled” into efficient FPGA implementations. In fact
Halide’s scheduling language is powerful enough that we
needed to add only two new commands to help define what
and how the hardware is generated. This opens a large class
of applications to acceleration.
• We extend the line buffer pipeline template of prior image-
DSL-to-hardware systems to suit the variety of computation
possible in Halide. This includes creating pipelines of differ-
ent throughputs and dealing with higher dimension input and
output stencils. In addition, our compiler implements a new
loop transformation optimization, called loop perfection, to
support these features.
• We create the first end-to-end system that takes Halide user
code and creates an FPGA bitstream along with a multi-
threaded software program that controls the new hardware.
This end-to-end system, coupled with Halide’s schedule
language, allows a user to seamlessly explore the effects
of moving function execution between the CPU and the
FPGA. We have used this system to implement a range of
applications on a Xilinx Zynq platform.
The following section describes tools and techniques of im-
age processing and domain-specific languages that undergird
this work. Sections 3 and 4 describe our extensions to the
Halide language, and our compiler system that implements
these extensions to produce blended CPU/FPGA designs. Fi-
nally, we present our test platform in Section 5, followed
by an experimental comparison with other methods, and a
quantitative evaluation of potential optimizations.
2. Background and Prior Work
Hardware designers have continually moved toward higher
level language descriptions to help deal with the growing com-
plexity of their target machines. Since the success of early
hardware description languages (HDL) such as Verilog in the
late 80’s, considerable work has focused on hardware synthe-
sis from high-level languages [14, 6]. Recent commercially-
available tools include Vivado (previously AutoPilot [37]) and
Catapult [10]. These tools generate hardware designs from
high-level specifications in C/C++/SystemC, decoupling the
input description from issues of hardware resource allocation,
clock-level timing, and pipelining. As a result, they are see-
ing increasing use, especially for FPGAs. Yet users of these
tools must still have a fairly good understanding of the mi-
croarchitecture of the hardware they want to build, since these
systems can’t (yet) do global restructuring of the input code to
improve energy or performance. This same limitation occurs
in research efforts like OpenCL-to-FPGA [7], SOpenCL [21]
and FCUDA [23], which explore the feasibility of using GPU
languages like CUDA and OpenCL as hardware description
languages.
To address the limitation, we can restrict the hardware target
to a single domain, like image processing, and, by examin-
ing the characteristics of applications in that domain, we can
create reasonable microarchitectural templates for automatic
hardware generation.
2.1. Image Processing
Most image processing algorithms consist of kernels operating
on a small window of the image. For example, sharpening and
blurring operations can be expressed as convolutions, which
use a fixed window of pixels to compute each result pixel.
Likewise, corner detection, edge enhancement, image sensor
demosaicking, and color transformation all calculate their
output pixels using small nearby regions of data. In Halide,
these kernels are defined as functions. A separable 3×3 box
blurring filter can be expressed as a chain of two functions in
x,y as follows.
Func blury, blurx;
blury(x, y) = (input(x, y-1) + input(x, y) + input(x, y+1)) / 3;
blurx(x, y) = (blury(x-1, y) + blury(x, y) + blury(x+1, y)) / 3;
The composition of these kernels can be expressed as a di-
rected acyclic graph (DAG), where the result of one computa-
tion is fed forward into the next.
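For readers less familiar with Halide, the same two-stage chain can be written as ordinary loops. The following is a minimal C++ sketch, not code from the paper: the Image type, the function names, and the clamp-to-edge boundary handling are assumptions made only for illustration. It runs the first stage over the whole image before the second stage starts, which is only one of many possible execution orders.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper type (not part of Halide): a row-major grayscale image.
struct Image {
    int width, height;
    std::vector<int> data;
    Image(int w, int h, int fill = 0) : width(w), height(h), data(w * h, fill) {}
    // Reads outside the image clamp to the nearest edge pixel (assumed policy).
    int at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return data[y * width + x];
    }
    int& ref(int x, int y) { return data[y * width + x]; }
};

// Vertical 3-tap pass, mirroring blury(x, y) in the text.
Image blur_y(const Image& in) {
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            out.ref(x, y) = (in.at(x, y - 1) + in.at(x, y) + in.at(x, y + 1)) / 3;
    return out;
}

// Horizontal 3-tap pass, mirroring blurx(x, y) in the text.
Image blur_x(const Image& in) {
    Image out(in.width, in.height);
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            out.ref(x, y) = (in.at(x - 1, y) + in.at(x, y) + in.at(x + 1, y)) / 3;
    return out;
}

// The two kernels form a two-node DAG: the output of blur_y feeds blur_x.
Image box_blur_3x3(const Image& in) { return blur_x(blur_y(in)); }
```

A constant image is left unchanged by box_blur_3x3, which is a quick sanity check for the chain.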
This data locality, combined with the fact that image pro-
cessing algorithms work on millions of pixels, means that
[Figure 2 diagram: stencil functions and line buffers. Stencil functions consume sliding windows of data. If the intermediate is buffered, the producer only needs to compute the new data each iteration rather than a full window. The line buffer is the hardware block that implements this buffer.]
Figure 2: A line buffer captures the intermediate working set
between convolution functions using minimum storage.
image processing is a fertile area for code optimization and
hardware acceleration. All of the data necessary to compute a
result pixel can fit within a small (and therefore near and low-
power) memory block. Moreover, because pixels are typically
computed in sequence, shared stencil data can be re-used from
one pixel to the next with minimal data fetching.
However, taking full advantage of this locality while main-
taining sufficient parallelism on a CPU/GPU is often difficult.
The algorithm could be computed in many ways: for example,
the first kernel could process the whole image, producing an
entire output image to be processed by the second kernel, and
so forth. This minimizes recomputation, but has poor data
locality. Alternatively, all the kernels could be fused into a
single giant computation, which is then applied pixel-by-pixel
to produce the outputs directly from the inputs. This has much
better locality, but may perform many redundant computations,
since every pixel is recomputed from the source, including
shared values where stencils overlap. In practice, the best
performance is usually achieved with some combination of
tiling (slicing the image into tiles for cache locality) and kernel
fusion (computing multiple kernels before moving to the next
tile), but the exact parameters are difficult to determine.
Custom hardware for these imaging pipelines tends to max-
imize data reuse and minimize memory traffic by carefully
designing pipeline stages with balanced throughput and plac-
ing specialized buffers between kernels. This organization
keeps all the intermediate data in local buffers, and ensures
that the image data is only fetched once at the beginning of
the pipeline, and written back once at the end of the pipeline.
In Figure 2, if the rate of new pixels produced by kernel
StencilA equals the rate of 3×3 windows consumed by Sten-
cilB, the intermediate working set can be reduced to its mini-
mum size of approximately two rows of pixels. A specialized
buffer that captures this working set using the minimum re-
quired storage is called a line buffer. This line-buffered col-
lection of deeply pipelined kernels is the microarchitectural
template used by most prior image processors and image DSL-
to-hardware systems, and will, with some extensions, form the
basis of our design as well.
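The behavior of a line buffer can be modeled in software. The sketch below is a simplified illustration under stated assumptions (a fixed 3×3 window, stride 1, pixels arriving in raster order); the class name LineBuffer3x3 and its interface are hypothetical and are not the paper's hardware template. It stores two image rows plus a 3×3 register window, matching the "approximately two rows" working set described above.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Window3x3 = std::array<std::array<int, 3>, 3>;

// Hypothetical software model of a 3x3 line buffer for an image `width` wide.
class LineBuffer3x3 {
public:
    explicit LineBuffer3x3(int width) : width_(width), rows_(2 * width, 0) {}

    // Accepts one pixel in raster order. Returns true when `window` holds a
    // complete 3x3 neighborhood whose bottom-right corner is this pixel;
    // when it returns false the window contents are not yet meaningful.
    bool push(int pixel, Window3x3& window) {
        int top = rows_[x_];           // pixel at (x, y-2)
        int mid = rows_[width_ + x_];  // pixel at (x, y-1)

        // Shift the register window left by one column, then load the new column.
        for (int r = 0; r < 3; ++r) {
            win_[r][0] = win_[r][1];
            win_[r][1] = win_[r][2];
        }
        win_[0][2] = top;
        win_[1][2] = mid;
        win_[2][2] = pixel;

        // Age the row storage: row y-1 becomes row y-2, the new pixel becomes row y-1.
        rows_[x_] = mid;
        rows_[width_ + x_] = pixel;

        bool valid = (y_ >= 2 && x_ >= 2);
        window = win_;
        if (++x_ == width_) { x_ = 0; ++y_; }
        return valid;
    }

private:
    int width_;
    int x_ = 0, y_ = 0;
    std::vector<int> rows_;  // two rows of storage, the only per-row state
    Window3x3 win_{};
};
```

Each input pixel is pushed exactly once, so the producer never recomputes a value even though every pixel appears in up to nine windows.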
2.2. Image DSLs
In the past decade, domain specific languages have become
a popular approach to help reduce the amount of detailed
knowledge required for application creation. These systems
use knowledge of a specific domain to create more efficient
applications. Spiral [18] synthesizes signal processing applica-
tions from mathematical descriptions, initially for generating
high-performance x86 code, and later for creating efficient
hardware. George [8] and Prabhakar [25] used OptiML [31],
a machine learning DSL, to compile applications into FPGA
instances.
The two prior projects most similar to our effort, Dark-
room [13, 4] and HIPAcc [29], both created image process-
ing DSLs and provided compiler tools using a line buffered
pipeline microarchitecture to guide their hardware generation.
They demonstrated it was possible to take a function coded in
a DSL and implement it as an FPGA or ASIC. Recent HIPAcc
followup work [22] supports vectorizing the whole pipeline.
Interestingly, HIPAcc’s compiler emits HLS C and feeds that
output to Vivado HLS to generate the hardware. This flow lets
them leverage the resource and datapath optimizations in the
HLS tools.
Our system builds on this earlier work. Like HIPAcc, our
compiler generates synthesizable C and leverages existing
HLS tools to create the final Verilog. To support the broad
range of computation in Halide, we needed to extend the pre-
vious work in a few ways. First, to efficiently support large
programs, we needed to be able both to set the throughput
rate of the hardware, and to block the image data before pro-
cessing it. To support the wider application class available
to Halide, we needed to generalize the line buffered pipeline
microarchitecture to include: higher dimension input stencils
(> 2D), affine indexing into these arrays, and data dependent
reductions. These changes required some additional analysis
and optimization that is described in Section 4.
2.3. End-to-end Systems
Other groups have worked on creating complete systems to
utilize hardware accelerators in a seamless way. For example,
CUDA [19] and OpenCL [30] provide a C-based program-
ming environment which is similar for host and device code,
along with a runtime API to manage memory transfers and
synchronization.
Lime [2] goes a step further by providing a unified language
for CPU, GPU, and FPGA, with semantics to delineate bound-
aries between computation blocks. The blocks are compiled to
one or more target architectures, and then the runtime system
selects a set of implementations and automatically handles
data transfers and synchronization at those block boundaries.
We adopt one of the key strengths of these systems: when
the compiler knows the boundary between CPU and acceler-
ator code, it can automatically fill in the gap between them.
In our case, this is a stack of Linux executables and kernel
drivers, discussed in Section 5.
Our new scheduling primitives extend Lime’s concept of
“relocation brackets,” enabling the developer to easily specify
whether code should run on the CPU or accelerator, separate
1  Func unsharp(Func in) {
2    Func gray, blurx, blury, sharpen, ratio, unsharp;
3    Var x, y, c, xi, yi;
4
5    // The algorithm
6    gray(x, y) = 0.3*in(0, x, y) + 0.6*in(1, x, y) + 0.1*in(2, x, y);
7    blury(x, y) = (gray(x, y-1) + gray(x, y) + gray(x, y+1)) / 3;
8    blurx(x, y) = (blury(x-1, y) + blury(x, y) + blury(x+1, y)) / 3;
9    sharpen(x, y) = 2 * gray(x, y) - blurx(x, y);
10   ratio(x, y) = sharpen(x, y) / gray(x, y);
11   unsharp(c, x, y) = ratio(x, y) * in(c, x, y);
12
13   // The schedule
14   unsharp.tile(x, y, xi, yi, 256, 256).unroll(c)
15          .accelerate({in}, xi, x)
16          .parallel(y).parallel(x);
17   in.fifo_depth(unsharp, 512);
18   gray.linebuffer().fifo_depth(ratio, 8);
19   blury.linebuffer();
20   ratio.linebuffer();
21
22   return unsharp;
23 }
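[Figure 3 diagram: DAG of the accelerator, with nodes in, gray, blury, blurx, sharpen, ratio, and unsharp. Following the algorithm code: in feeds gray and unsharp; gray feeds blury, sharpen, and ratio; blury feeds blurx; blurx feeds sharpen; sharpen feeds ratio; ratio feeds unsharp. The accelerator interface is marked around the DAG.]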
Figure 3: Algorithm and schedule code for the unsharp function,
and its corresponding DAG. The accelerate primitive defines
the accelerator scope from in to unsharp.
from the algorithm itself. However, the scheduling primitives
also provide more control over the generated hardware, as
described in the following sections.
3. Language
The Halide language tackles the problem of finding the most
efficient implementation for an application by separating the
computation to be performed (the algorithm) from the order
in which it is done (the schedule). The language provides
scheduling primitives which control high-level scheduling de-
cisions like tiling, loop reordering, and parallel execution,
making it easy to experiment with various tradeoffs between
locality, parallelism and redundant re-computation.
Our task is to extend these semantics to cover heterogeneous
systems, which mostly involves mapping Halide functions
onto a specialized hardware engine. Specifically, the
schedule should include:
• The scope and interface of the hardware accelerator pipeline.
• The granularity of the accelerator launch task, i.e. the size
of output image block the hardware produces per launch.
• The amount of parallelism implemented in the hardware da-
tapath, which affects the throughput of each pipeline stage.
• The allocation of buffers, specifically line buffers, that opti-
mally trades storage resources for less re-computation.
• The number of delay register slices needed to match varying
computation latencies.
Many hardware scheduling choices have analogues in CPU
scheduling, and Halide already has primitives to describe them.
For example, both CPU and hardware schedules must describe
computation order and memory allocation. In such cases, we
reuse as many of the existing primitives as possible. Ulti-
mately, we were able to achieve efficient hardware mapping
and hybrid CPU/accelerator execution using only two new
primitives and a bit of syntactic sugar.
The language of scheduling is best explained in the context
of an example. Figure 3 shows a simple unsharp mask filter
implemented in Halide. Unsharp masking is an image sharp-
ening technique often used in digital image processing. We
will use this as a running example throughout the paper, as it
demonstrates many important features of our system. The code
first computes a blurred gray-scale version of the input image
using a chain of three functions (gray, blury, and blurx), and
then amplifies the input based on the difference between the
original image and the blurred image.
The hardware schedule begins on line 14. unsharp.tile is
a standard Halide operation, which breaks an ordinary row-
major traversal (defined by the Vars x and y) into a blocked
computation over tiles (here, 256×256 pixels). The variables
xi and yi represent the inner loops of the blocked computation
which work pixel by pixel, while x and y then become the
outer loops for iterating over blocks.
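The loop structure that tiling produces can be made concrete with a small sketch. The C++ below is an illustration only: the function name tiled_order is hypothetical, and it assumes the image dimensions are exact multiples of the tile size, which sidesteps the question of how partial tiles at the image edge are handled. It records the order in which pixels are visited by the four-deep loop nest (x and y over tiles, xi and yi within a tile).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (x, y) pixel coordinates in the order a tiled traversal visits them.
// Assumption for this sketch: width and height are multiples of the tile size.
std::vector<std::pair<int, int>> tiled_order(int width, int height,
                                             int tile_w, int tile_h) {
    assert(width % tile_w == 0 && height % tile_h == 0);
    std::vector<std::pair<int, int>> order;
    for (int y = 0; y < height / tile_h; ++y)            // outer loop: tile rows
        for (int x = 0; x < width / tile_w; ++x)         // outer loop: tile columns
            for (int yi = 0; yi < tile_h; ++yi)          // inner loop: rows in a tile
                for (int xi = 0; xi < tile_w; ++xi)      // inner loop: pixels in a row
                    order.emplace_back(x * tile_w + xi, y * tile_h + yi);
    return order;
}
```

For a 4×4 image with 2×2 tiles, the traversal finishes all four pixels of the top-left tile before touching pixel (2, 0), the first pixel of the next tile.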
With the image now broken into constant-sized pieces, we
can apply hardware acceleration. Our first new primitive is
f.accelerate(inputs, innerVar, blockVar), which defines
both the scope and the interface of the accelerator and the
granularity of the accelerator task. The first argument, inputs,
specifies a list of Funcs for which data will be streamed in. The
accelerator will use these inputs to compute all intermediate
Funcs to produce the result f. In this example, this is the
sequence of computation through gray, blury, blurx, sharpen,
and ratio that produces unsharp from in (Figure 3, bottom).
The block loop variable blockVar defines the granularity
of the computation: the hardware will compute an entire tile
of the size that blockVar counts; in this case, 256×256 pix-
els. The inner loop variable innerVar controls the throughput:
innerVar will increment each cycle, in this case producing one
pixel each time. To create higher-throughput hardware, we
could use Halide’s split primitive to split the innerVar loop
into two, and accelerate with the outer one as the hardware
stride size.
Our second new primitive is src.fifo_depth(dest, n). It
specifies a FIFO buffer with a depth of n, instantiated between
function src and function dest. In the unsharp example, both
ratio and unsharp consume multiple data streams (sharpen is
fused into ratio), so the latency needs to be balanced across
the inputs. The optimal FIFO depths in the DAG can be
solved automatically as an integer linear programming (ILP)
problem [13], so we can eventually automate this decision, but
for now we specify and tune it by hand.
[Figure 4 diagram: compilation flow. Blocks: Algorithm & Schedule; Initial Lowering; HW Pipe. Extract. (produces architecture parameters); IR Transformation (Halide IR to Dataflow IR); Loop Perfection; Common Optimizations; HLS Codegen (produces the FPGA design); LLVM Codegen (produces the Linux application).]
Figure 4: Compilation flow. Blue blocks are new, green blocks
are unchanged/existing Halide compilation passes.
f.linebuffer(), our syntactic sugar for a combination of ex-
isting Halide primitives, is designed to instantiate a line buffer
for function f (see footnote 1). Without these primitives, functions would
be fused directly into other downstream functions, potentially
causing re-computation when their values were reused.
The existing Halide primitive f.unroll(var, factor) is
useful for optimizing hardware. In a CPU schedule, unroll is
used to eliminate short loops and to enable optimizations on
cross-iteration sharing of data. However, in terms of hardware,
since the HLS tool schedules resources for one loop iteration,
having a larger loop body through unrolling also increases the
parallelism of the datapath. It effectively duplicates the com-
pute units in the pipeline, potentially scaling up the throughput.
In the example, unroll on unsharp causes three multipliers to
be instantiated for computing three color channels simultane-
ously, which scales the throughput of the pipeline from 1/3
pixel/cycle to one pixel/cycle.
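The throughput arithmetic behind this example can be written out explicitly. The helper below is a back-of-the-envelope sketch, not part of the compiler: the function name is hypothetical, and it assumes that one (possibly unrolled) loop iteration issues per cycle and that pipeline fill latency is negligible.

```cpp
#include <cassert>

// Cycles needed to produce `pixels` output pixels when each pixel has
// `channels` color channels and the channel loop is unrolled by `unroll`.
// Assumes one unrolled iteration issues per cycle; ignores pipeline fill.
long long cycles_for_pixels(long long pixels, int channels, int unroll) {
    assert(pixels >= 0 && channels > 0 && unroll > 0);
    int iterations_per_pixel = (channels + unroll - 1) / unroll;  // ceiling division
    return pixels * iterations_per_pixel;
}
```

Under these assumptions a 256×256 tile with three channels takes 196,608 cycles without unrolling (1/3 pixel/cycle) and 65,536 cycles with the channel loop fully unrolled (one pixel/cycle).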
All other existing Halide primitives (e.g., tile, vectorize,
parallel) remain unchanged for the portion of the program
mapped to software, where Halide already provides state-of-
the-art performance on ARM and x86 CPUs. In our example,
the parallel primitives on line 16 schedule multiple tile pro-
cessing tasks concurrently onto multiple CPU cores. Further
details about parallel execution are discussed in Section 5.
4. Compiler Implementation
Figure 4 describes our compiler design. The inputs to our
system are an application’s algorithms and schedules written
in Halide. An analysis pass extracts parameters for the archi-
tecture template. A transformation pass re-writes the hardware
Footnote 1: Halide’s native analog of the line buffer, the sliding window pattern, is
achieved by specifying different compute and storage levels with compute_at
and store_at primitives, and letting the compiler apply a storage folding
optimization. We overload the semantics of this same pattern when it is
used in the accelerated portions. As accelerate already defines the compute
and storage levels, we add the linebuffer sugar which needs no additional
arguments.
[Figure 5 contents, recovered from the diagram]
DAG nodes: in, gray, blury, ratio, unsharp
Kernels:
  Kernel gray:  input stencils: in;  output stencil: gray
                computation: gray(x, y) = 0.3*in(0, x, y) + 0.6*in(1, x, y) + 0.1*in(2, x, y);
                schedules: fifo_depth(ratio, 8)
  Kernel blury: input stencils: gray;  output stencil: blury;  computation: ...;  schedules: ...
  Kernel ratio: input stencils: gray, blury;  output stencil: ratio;  computation: ...;  schedules: ...
Stencil streams:
  Stencil in of kernel gray:     Window: [3, 1, 1]   Stride: [3, 1, 1]   Domain: [0:2, -1:256, -1:256]
  Stencil gray of kernel gray:   Window: [1, 3]      Stride: [1, 1]      Domain: [-1:256, -1:256]
  Stencil gray of kernel blury:  Window: [1, 3]      Stride: [1, 1]      Domain: [-1:256, -1:256]
  Stencil gray of kernel ratio:  Window: [1, 3]      Stride: [1, 1]      Domain: [0:255, -1:256]
Figure 5: Architecture template parameters for unsharp. The
blue circles are stencil kernels corresponding to line buffered
functions (blurx and sharpen were fused into ratio). Middle
and right columns show parameters of selected kernels and
stencil streams. The domain of the output stencil stream gray
is the union of the domains of the stencil gray in kernels blury
and ratio.
parts of the Halide IR in a dataflow style. After the loop per-
fection optimization and some common scalar optimization
passes, like constant propagation, common sub-expression
elimination etc., the final IR is passed to an HLS code genera-
tor and an LLVM code generator, which produce the hardware
designs and the software programs, respectively.
4.1. Architecture Parameter Extraction
Our system generates specialized hardware accelerators by
instantiating architectural templates from a scheduled program.
The architectural template approximates the line-buffered
pipeline of Darkroom [13], with extensions to support a wider
range of algorithm and performance targets.
In this template, an accelerator is a DAG whose edges are
streams of windows of pixels, or stencil streams, and whose
nodes are stencil kernels. Each kernel is a Halide function
scheduled for a line buffer, into which one or more non-line-
buffered functions can be fused. A stencil stream is parame-
terized by the window size, the sliding stride, and the range
of the image domain. Note that because an output stencil
may serve as input to more than one downstream kernel, the
producer kernel must compute a union of the stencils required
by all consumers. Figure 5 shows some of the architectural
parameters for the unsharp application of Figure 3.
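The union rule for a shared output stencil can be sketched as a per-dimension interval merge. The C++ below is illustrative only: the Interval and Domain types and the function names are hypothetical, and representing the union as a bounding box is an assumption of this sketch rather than a statement about the compiler's internals. The example values are the gray domains listed in Figure 5.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive integer range [lo, hi] along one image dimension.
struct Interval { int lo; int hi; };
using Domain = std::vector<Interval>;  // one interval per dimension

// Smallest box containing both consumer domains (assumed union representation).
Domain bounding_union(const Domain& a, const Domain& b) {
    assert(a.size() == b.size());
    Domain out;
    for (std::size_t d = 0; d < a.size(); ++d)
        out.push_back({std::min(a[d].lo, b[d].lo), std::max(a[d].hi, b[d].hi)});
    return out;
}

// Number of points covered by an inclusive interval.
int extent(const Interval& i) { return i.hi - i.lo + 1; }
```

Merging the domain needed by blury, [-1:256, -1:256], with the one needed by ratio, [0:255, -1:256], gives [-1:256, -1:256], an extent of 258 in each dimension.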
Since the scope of the pipeline and the line buffered func-
tions are defined in a high-level language, composing the DAG
is straightforward. To extract the parameters of each stencil
stream in the template, we apply bounds inference analysis re-
cursively back from the output. This is similar to Halide [28],
except that we now have line buffers between stencil kernels
that capture data reuse, so the output stencil size of upstream
alloc gray [258, 258]
for (unsharp.y, 0, 256)
  for (unsharp.x, 0, 256)
    // compute function "in" here...
    for (y, gray.y.loop_min, gray.y.loop_extent)
      for (x, gray.x.loop_min, gray.x.loop_extent)
        gray(x,y) = 0.3*in(0,x,y) + 0.6*in(1,x,y) + 0.1*in(2,x,y)
    // compute function "blury" here...
(a) Halide IR of function gray
// kernel "in" here...
def_stream gray_stream [1, 3]
def_stream gray_step_stream [1, 1]
for (scan_y, 0, 258)
  for (scan_x, 0, 258)
    alloc in [3, 1, 1]
    alloc gray [1, 1]
    pop (in, in_stream)
    for (y, 0, 1)
      for (x, 0, 1)
        gray(x,y) = 0.3*in(0,x,y) + 0.6*in(1,x,y) + 0.1*in(2,x,y)
    push (gray, gray_step_stream)
linebuffer(gray_step_stream, gray_stream, [258, 258])
dispatch(gray_stream, [-1:256, -1:256],
         "blury", 1, [-1:256, -1:256],
         "ratio", 8, [0:255, -1:256])
// kernel "blury" here...
(b) Dataflow IR of kernel gray
Figure 6: IR transformation allocates additional storage for lo-
cal stencils, and separates pipeline stages with data streams.
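[Figure 7 diagram: accelerator pipeline. Blocks: PE gray, PE blury, PE ratio, PE unsharp, two line buffers (LB), two stream dispatchers (DP), and tap registers (runtime configurable). Interfaces: in (AXI4-Stream), unsharp (AXI4-Stream), config bus (AXI4-Lite).]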
Figure 7: Pipeline implementation for unsharp. Processing
elements PE implement kernel DAGs. Line buffers LB capture
the data reuse in the stencil pattern. Stream dispatchers DP
fork a stencil stream to multiple consumers.
kernels doesn’t cumulatively increase. We also apply aggres-
sive constant propagation at this stage in order to derive any
constant bounds as early as possible.
4.2. IR Transformation
Given a schedule, the Halide compiler generates an arch-
itecture-independent imperative representation of the algo-
rithm, with loop nests and storage allocations injected for each
function. Figure 6a shows part of the Halide loop IR for func-
tion gray. Because the schedule specified a sliding window
order for computing gray, the storage for gray is allocated
outside the loop nest that iterates over the 256×256 block
of unsharp, while the computation of gray, along with other
functions, is interleaved inside the loop nest.
To map this scheduled loop IR into the line buffer pipeline
shown in Figure 5, we further lower it into our own dataflow
IR, using the architecture template parameters previously ex-
tracted. First, additional storage for input and output stencils
is allocated locally to a kernel computation, and references
1 for (scan_x, 0, 256)
2   pop (in, in_stream)
3   out(0) = 0
4   for (d, 0, 5)
5     out(0) += mask(d)*in(d)
6   push (out, out_stream)
(a) Original IR.
1 for (scan_x, 0, 256)
2   for (d, 0, 5)
3     if (d == 0)
4       pop (in, in_stream)
5       out(0) = 0
6     out(0) += mask(d)*in(d)
7     if (d == 4)
8       push (out, out_stream)
(b) IR after the optimization.
Figure 8: Loop perfection optimization creates a larger perfect
loop nest by pushing operations from the outer loop body into
the innermost loop.
to the original storage are replaced with references to stencils.
Then, the original storage allocation is replaced with declara-
tions of stencil streams, and stream operations (pop and push)
are inserted before and after the kernel computation. Finally,
loops iterating over the image domain are inserted for each
compute stage, along with calls to linebuffer and dispatch
IR primitives if a line buffer or a stream dispatcher is needed.
Figure 6b shows the IR of kernel gray after the transforma-
tion. For each scanning step (here, each scan_x loop iteration),
the inner loop nest computes a 1×1 gray stencil and pushes
it to gray_step_stream, which is later line-buffered and dis-
patched to the kernels blury and ratio. The unit length loops
x and y will be eliminated in a later simplification pass.
After the IR transformation, the dataflow IR presents a bit-
accurate representation of the pipeline, with different stages
explicitly separated by data streams. For example, the unsharp
pipeline is composed of four kernel stages, two line buffers,
and two stream dispatchers, as shown in Figure 7. Many
HLS tools can infer coarse grain pipelined designs from such
code structure, including Vivado HLS with its DATAFLOW direc-
tive. However, the throughput of stages producing less than
1 pixel/cycle is not optimal, due to limits on the automatic loop
pipelining of the current HLS tool. We address this problem
next.
4.3. Loop Perfection Optimization
Figure 8a shows the IR of a 5-point convolution kernel scheduled
at a rate of 1/5 pixel/cycle. Loop pipelining in Vivado HLS applies
only to a perfect loop nest, in which only the innermost loop
contains operations. Here, loops scan_x and d do not form a perfect
loop nest because of the instructions on lines 2, 3, and 6, so loop d
and lines 2, 3, and 6 will run sequentially in the resulting design.
To generate a fully pipelined convolution stage with 1/5 pixel/cycle
throughput, the content of loop scan_x must be pushed into the
innermost loop, as shown in Figure 8b.
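
The two loop structures of Figure 8 can be compared in ordinary C++ to check that the restructuring preserves results. This is only a software sketch, not the compiler's IR pass: the names Stencil, make_stream, convolve_original, and convolve_perfected are hypothetical, and a std::queue stands in for the stencil stream.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <vector>

constexpr int kTaps = 5;
// Hypothetical stand-in for one 5-element input stencil.
using Stencil = std::array<int, kTaps>;

// Figure 8a: imperfect nest. The pop, the reset of out, and the push
// sit in the outer loop body, outside the innermost loop.
std::vector<int> convolve_original(std::queue<Stencil> in_stream,
                                   const Stencil& mask) {
    std::vector<int> out_stream;
    const std::size_t steps = in_stream.size();
    for (std::size_t scan_x = 0; scan_x < steps; ++scan_x) {
        const Stencil in = in_stream.front();  // pop (in, in_stream)
        in_stream.pop();
        int out = 0;                           // out(0) = 0
        for (int d = 0; d < kTaps; ++d) {
            out += mask[d] * in[d];
        }
        out_stream.push_back(out);             // push (out, out_stream)
    }
    return out_stream;
}

// Figure 8b: perfect nest. Every statement lives in the innermost loop,
// guarded by a predicate on d.
std::vector<int> convolve_perfected(std::queue<Stencil> in_stream,
                                    const Stencil& mask) {
    std::vector<int> out_stream;
    const std::size_t steps = in_stream.size();
    Stencil in{};
    int out = 0;
    for (std::size_t scan_x = 0; scan_x < steps; ++scan_x) {
        for (int d = 0; d < kTaps; ++d) {
            if (d == 0) {
                in = in_stream.front();
                in_stream.pop();
                out = 0;
            }
            out += mask[d] * in[d];
            if (d == kTaps - 1) {
                out_stream.push_back(out);
            }
        }
    }
    return out_stream;
}

// Builds a test stream whose x-th stencil holds x, x+1, ..., x+4.
std::queue<Stencil> make_stream(int steps) {
    std::queue<Stencil> s;
    for (int x = 0; x < steps; ++x) {
        Stencil st{};
        for (int d = 0; d < kTaps; ++d) {
            st[d] = x + d;
        }
        s.push(st);
    }
    return s;
}
```

Both functions take the stream by value, so the same input can be fed to each and the outputs compared element by element.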
We apply restructuring automatically in the accelerated
region of IR through a recursive descent algorithm. The impact
of the optimization is analyzed in Section 6.3. Note that the
loop perfection is an inverse operation of loop peeling [32],
which moves predicates out of the innermost loop to improve
the performance on traditional processors.
[Figure 9 diagram: a 1-D LB built from a shift register; a 2-D LB built
from a RAM (3 rows) and 1-D LBs; a 3-D LB built from a RAM (3 planes)
and 2-D LBs.]
Figure 9: Multi-dimensional line buffer. The 1-D LB uses a shift
register. The 2-D LB uses RAM to buffer rows of pixels, pushes a
column to the 1-D LBs, and outputs 2-D stencils. The 3-D LB
instantiates RAMs and 2-D LBs similarly.
4.4. Code Generation
After some common optimizations, the final IR of the pipeline
is passed to two different code generator back-ends, HLS and
LLVM.
The HLS code generator translates the hardware accelerator
portions of IR into HLS-synthesizable C code, and the rest
of the IR is translated into a C++ testbench wrapper. The
code generator inserts HLS directives (pragmas) automatically
to assist the HLS compiler in applying loop pipelining and
array partitioning. To simplify the code generator, we built
an HLS-synthesizable C++ template library implementing an
abstract line buffer interface:
template<int IMG_SIZE_0, int IN_SIZE_0, int OUT_SIZE_0, ...>
void linebuffer(stream<IN_SIZE_0, ...> &in,
stream<OUT_SIZE_0, ...> &out);
We design the multi-dimensional (up to 4D) line buffer tem-
plates hierarchically, as shown in Figure 9.
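
As a rough software illustration of the innermost level of this hierarchy, a 1-D line buffer can be modeled as a shift register that emits a full stencil once enough pixels have arrived. The class below is a hypothetical sketch, not the template library's actual code; its name and push interface are assumptions made for the example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model of a 1-D line buffer: an N-pixel shift register.
template <typename T, std::size_t N>
class ShiftRegisterLineBuffer {
public:
    // Shifts one pixel in. Returns true and writes the current N-wide
    // stencil to `stencil` once at least N pixels have been pushed.
    bool push(const T& pixel, std::array<T, N>& stencil) {
        for (std::size_t i = 0; i + 1 < N; ++i) {
            window_[i] = window_[i + 1];
        }
        window_[N - 1] = pixel;
        if (filled_ < N) {
            ++filled_;
        }
        if (filled_ < N) {
            return false;  // still warming up, no complete stencil yet
        }
        stencil = window_;
        return true;
    }

private:
    std::array<T, N> window_{};
    std::size_t filled_ = 0;
};
```

A 2-D buffer in the same spirit would keep rows in a RAM-like array and feed one column per step into several such 1-D buffers, as Figure 9 describes.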
Enabling more design optimizations, the compiler can also
statically evaluate constant functions (e.g. lookup tables), and
generate the code that later synthesizes to ROMs.
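
The idea of evaluating a constant function ahead of time can be sketched in plain C++14 with constexpr. This is an analogy only: the function square_curve and the table layout are hypothetical, and whether a given HLS tool maps such a table to a ROM depends on that tool.

```cpp
#include <cstdint>

// Hypothetical constant function over 8-bit inputs.
constexpr std::uint8_t square_curve(int x) {
    return static_cast<std::uint8_t>((x * x) / 255);
}

struct Lut256 {
    std::uint8_t v[256];
};

// Evaluates the function for every possible input at compile time.
constexpr Lut256 build_lut() {
    Lut256 t{};
    for (int i = 0; i < 256; ++i) {
        t.v[i] = square_curve(i);
    }
    return t;
}

constexpr Lut256 kCurveLut = build_lut();
static_assert(kCurveLut.v[255] == 255, "table is built at compile time");

// At run time the function reduces to a single table read.
inline std::uint8_t apply_curve(std::uint8_t pixel) {
    return kCurveLut.v[pixel];
}
```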
Generating machine code for the CPU portion of the soft-
ware program is left to the LLVM compiler infrastructure. We
largely use the existing Halide ARM back-end, which includes
a highly optimized ARM NEON SIMD vectorizer, thread pool-
based parallel runtime, etc. The final IR describes a complete
pipeline with both CPU and accelerator, from which the code
generator emits platform-specific device driver calls to ac-
cess the hardware when it visits the boundary of the hardware
pipeline during IR traversal. Moreover, the tool recognizes all
the data buffers accessed by any hardware pipeline, and thus
can emit special allocation routines for these buffers, inserting
data transfers where required by the platform setup.
5. Platform Development
We implemented our system on a Xilinx Zynq-7000 SoC plat-
form [36], which contains a pair of ARM Cortex A9 cores
and FPGA fabric. We run Linux on the ARMs, giving us the
many conveniences of an operating system (e.g. a file system,
networking, and core utilities), and realistically modeling the
challenges of integrating an accelerator into a larger hetero-
geneous system. The generated Halide program appears to
the user as a single C ABI function, callable from ARM CPU
userland; all interaction with the FPGA fabric is automatically
managed within that function.
We use AXI DMA from Xilinx [33] to connect the stream-
ing interface of the accelerators to the CPU cache hierarchy
through an accelerator coherency port (ACP). The DMA en-
gines use 2D transfer mode to access data row-by-row with
a fixed offset between the start of each row. By adjusting the
DMA configuration, we can easily send a sub-image to the
accelerator without having to copy or move any data.
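
The row-by-row addressing can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the Transfer2D struct and its field names are hypothetical and do not mirror the AXI DMA register map; it only shows how a sub-image maps to a start offset, a row length, and a fixed stride.

```cpp
#include <cstddef>

// Hypothetical description of a 2-D transfer (all sizes in bytes).
struct Transfer2D {
    std::size_t start_offset;  // offset of the first row from the image base
    std::size_t row_bytes;     // bytes transferred per row
    std::size_t stride_bytes;  // distance between starts of consecutive rows
    std::size_t rows;          // number of rows to transfer
};

// Describes a tile_width x tile_height window at (x0, y0) inside a
// row-major image that is image_width pixels wide.
Transfer2D describe_sub_image(std::size_t image_width,
                              std::size_t bytes_per_pixel,
                              std::size_t x0, std::size_t y0,
                              std::size_t tile_width,
                              std::size_t tile_height) {
    const std::size_t stride = image_width * bytes_per_pixel;
    return Transfer2D{y0 * stride + x0 * bytes_per_pixel,
                      tile_width * bytes_per_pixel, stride, tile_height};
}

// Byte offset at which a given row of the transfer starts.
std::size_t row_start(const Transfer2D& t, std::size_t row) {
    return t.start_offset + row * t.stride_bytes;
}
```

Because only the start offset and row length change between tiles, selecting a different sub-image needs no copy of the pixel data itself.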
We built a parametrized Linux kernel module to drive the
DMA engines and the accelerators. The driver provides a
simple interface used by the generated Halide CPU code, in-
cluding an asynchronous launch for the hardware pipeline and
a synchronization barrier to wait for the hardware to complete.
We overlap the execution of CPU cores and accelerators
using Linux threads. After the output image has been tiled into
smaller blocks, the processing pipeline of the small block is
wrapped by a loop iterating over the tiles. A user can schedule
the loop to run in parallel using Halide’s parallel primitive,
as shown in line 16 of Figure 3. After this, a typical software
program looks like:
parallel foreach tile_index:
    retrieve pinned_buffers
    compute values in pinned_buffers...
    task_id = launch_accelerator(pinned_buffers)
    wait_sync(task_id)
    consume values in pinned_buffers...
    release pinned_buffers
During execution, a thread pool is created for launching the
workers that run the loop body. Some threads use CPU cores
to compute values in the buffer, launch the accelerator task,
and then quickly get blocked in the wait_sync calls. While
these threads sleep and the accelerator is running, other active
threads can use the CPU cores. To avoid cache thrashing by
too many contexts running concurrently, the number of active
threads is limited to three.
6. Evaluation
We now describe our experimental setup, followed by evalu-
ation on four separate fronts: 1) hardware generation, com-
paring our generated hardware versus optimized kernels from
an HLS library and generated hardware from the HIPAcc
compiler; 2) impact of the loop perfection optimization; 3)
heterogeneous system performance, where we describe our
efforts to optimize the hardware and software running on the
target platform; and 4) efficient programmability, where we
show that we can generate competitive hardware and software
for a variety of Halide applications.
Application      Description
gaussian         9-point 2-D Gaussian filter
harris           Harris corner detector
unsharp          unsharp masking filter
stereo           stereo depths using block matching
bilateral grid   fast bilateral filtering algorithm
camera           camera pipeline from Frankencamera
Table 1: Applications used in this paper.
6.1. Experimental Setup
Compiler: Our compiler is based on open source Halide [11]
built using GCC 4.8 and LLVM 3.6. The evaluated imple-
mentations for the Zynq platform, ARM CPU and CUDA
GPU are all generated by our compiler, although the CPU
and GPU code is the same as produced by the original Halide
compiler. We also compare against the open source HIPAcc-Vivado
compiler [16].
Applications: Table 1 lists the applications we use in this
paper, including Gaussian, a basic stencil kernel; Harris cor-
ner detector, a pipeline of stencil kernels; unsharp mask,
an example of a DAG of kernels; stereo, a compute inten-
sive algorithm calculating stereo correspondence using block
matching; bilateral grid, a fast bilateral filtering algorithm (an
edge-preserving smoothing filter) containing rich patterns of
histogram, downsampling and data gathering [5, 24]; and a
simple camera pipeline from Frankencamera [1]. The last two
applications are adapted from Halide’s open source repository.
Platforms: We use a Xilinx ZC702 evaluation board hold-
ing a Zynq XC7Z020 SoC as the target heterogeneous plat-
form, running a Linux 4.0 kernel built from Xilinx Open
Source Linux 2015.4 [35] along with our custom built device
driver module. We use Xilinx Vivado Design Suite 2015.4 for
synthesizing HLS, RTL simulation and generating an FPGA
bitstream.
For comparison to a CPU+GPU platform, we use an
NVIDIA Jetson TK1 board with JetPack 2.0 [20]. The Tegra
K1 SoC is fabricated in the same 28nm technology as the
Zynq SoC. We evaluate both ARM and CUDA implemen-
tations on TK1 for each application using the same Halide
algorithm but different schedules, optimized by Halide experts.
For stereo, where Halide-generated CUDA performs poorly,
we compared to the Tegra-optimized OpenCV CUDA kernel
shipped by NVIDIA with JetPack.
Measuring power and performance: We pull statistics
from the Texas Instruments UCD9248 power controllers on the
ZC702, which report the power of each subsystem including
FPGA, DRAM, and CPU. For the TK1 board, we measure the
current on the 12V DC supply. To derive the energy efficiency
for TK1, we subtract the board idle power (about 2 Watts) to
exclude the board’s uncore components, which should give the
TK1 an advantage as it also excludes SoC and DRAM static
power. We use gettimeofday to measure software program
execution time and report the average over 20 runs. For the
CUDA target, we exclude the data transfer time between CPU
and GPU (again giving the TK1 an advantage).

[Figure 10 charts: panels for LUT, Flip-flop, BRAM, and DSP usage;
legend: Arithmetic, Line Buffer, Extra FIFO, Other; designs shown:
HLS Library, HIPAcc, and Ours at several pixel rates (1/4x, 1x, 2x).]
Figure 10: Resource utilization comparison of conv2d (top)
and harris (bottom) in our system at various pixel rates vs.
HLS library and HIPAcc implementations. Our system's designs
can target multiple throughput rates, while HIPAcc
and library kernels are only optimized at a 1 pixel/cycle rate.
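
The TK1 energy-efficiency derivation described in this subsection amounts to a short calculation, sketched below. The function names are hypothetical and the exact formula is our reading of the text, not a quoted definition: board power is taken as supply voltage times measured current, and the idle power is subtracted before dividing.

```cpp
// Board power from the measured supply current (12 V DC supply).
double board_power_w(double supply_current_a) {
    return 12.0 * supply_current_a;
}

// Energy efficiency in megapixels per joule: throughput in MP/s divided
// by the power attributed to the workload (measured minus idle), in W.
double energy_efficiency_mp_per_j(double throughput_mp_per_s,
                                  double measured_power_w,
                                  double idle_power_w) {
    return throughput_mp_per_s / (measured_power_w - idle_power_w);
}
```

For instance, a hypothetical workload running at 60 MP/s while the board draws 5 W, with 2 W idle, would be credited with 60 / 3 = 20 MP/J.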
6.2. Hardware Generation
From very compact high-level code, using a 5×5 2-D con-
volution (conv2d) and Harris as test cases, we find that our
system can generate hardware quality similar to that of the
Xilinx HLS video library [34] and HIPAcc-Vivado system.
Figure 10 shows the FPGA resource utilization of our gen-
erated designs vs. the library designs for conv2d and harris on
a 1080p image. We generated designs at different pixel rates
using the same Halide algorithm but different unrolling schedules,
while the library and HIPAcc (footnote 2) only provide designs at a
single pixel per cycle rate. All the designs achieve similar peak
frequencies, around 180MHz for conv2d and 150MHz for harris.
Harris designs run slower because they are bounded by a
floating-point-to-integer conversion.
Our line buffer components use less BRAM for two reasons.
First, the line buffer instance from the library is not optimal in
terms of storage usage: in 5×5 convolution, the library design
buffers five rows whereas the minimum required is four, so
both our and HIPAcc’s designs start with 20% less BRAM.
Second, the library and HIPAcc both instantiate per-kernel line
buffers for each input stencil, while we place a line buffer for
each output stencil stream, which can be shared among kernels
consuming the same stream. In harris, our design instantiates
fewer line buffer instances thanks to this buffer sharing, for an
extra 6% BRAM savings as compared to HIPAcc.
In harris, our design and HIPAcc use fewer DSPs and LUTs
because the code generation (meta-programming) approach
is more flexible and does better constant propagation and
simplification as compared to the C++ template solution used
by the library. Our unit-rate conv2d design has fewer LUTs
because the library solution creates four separate streams for
the RGBA channels, whereas we package RGBA as a single
structure and pass it through a wider stream along the pipeline,
which simplifies the control logic for managing data streams.

2 The recent HIPAcc Altera-OpenCL backend [22] can generate >1 pixel/cycle
pipelines through kernel vectorization, but cannot generate <1 pixel/cycle
designs. We did not evaluate this system as it is not open-source.

          HLS Lib   HIPAcc   Halide   Generated HLS
Conv2D    209       5+5      2+2      885
Harris    520       47+26    23+11    619
Table 2: Lines of code (LoC) for conv2d and harris in HLS
library, HIPAcc, Halide and generated HLS code from Halide.
HLS Lib excludes basic data structure code; HIPAcc counts
are DSL parameter+algorithm [17]; Halide counts are
algorithm+schedule; Generated HLS counts exclude 900 LoC in
the line buffer template library.

Application     Sched. rate (pix/cyc)   Meas. rate (pix/cyc)   Resource (LUT+FF)
histogram       0.016                   0.015                  1.1%+0.6%
histogram+opt   0.016                   0.016                  2.0%+1.0%
stereo          0.25                    0.062                  55%+25%
stereo+opt      0.25                    0.25                   55%+32%
Table 3: Scheduled and measured throughput and resource
utilization of histogram and stereo with and without the loop
perfection optimization.

Multi-rate designs need higher compute throughput and
buffer bandwidth. Fortunately, the BRAM banks of the line
buffers provide more than enough bandwidth for 1080p im-
ages. Therefore, the multi-rate designs use more LUTs and
DSPs for the arithmetic datapath, while the line buffer re-
sources do not change significantly.
One source of overhead in our design comes from un-
necessary FIFOs inserted between pipeline stages. We use
hls::stream objects to connect different stages in the gener-
ated HLS code, and the current HLS compiler creates a hard-
ware FIFO for each stream object. However, in our design,
stages can be directly connected through a handshake interface
(e.g. AXI4-stream) because the latencies in the pipeline are
already balanced. The extra FIFOs add 30% FF and 20% LUT
overhead. Future optimizations in HLS compilation could
eliminate these unnecessary FIFOs.
Table 2 summarizes the code length of conv2d and harris
using different systems. Moving from less to more domain
specific, the Halide and HIPAcc DSLs are roughly one to two orders
of magnitude more compact than both the HLS C library and our
generated HLS code. Because Halide uses a functional representation,
its application code is about 2x shorter than HIPAcc's.
6.3. Impact of Loop Perfection
As discussed in Section 4.3, the loop perfection optimization
helps fully pipeline the stages that have sequential loops
(running at <1 pixel/cycle rate). Such pipeline stages are
important in applications that require data-dependent reductions,
e.g. the histogram stage in bilateral grid, or are intentionally
scheduled at a slower rate due to FPGA resource constraints, e.g.
stereo.

[Figure 11 plot: runtime (ms) versus image size (number of 600×400
tiles), with linear fits y = 12.79x + 0.02 for 1 thread and
y = 7.96x + 5.29 for 3 threads.]
Figure 11: Runtime of stereo with different sizes of input image.
All runs use a hardware accelerator that processes a
600×400 image tile in 7.93ms. Our multithreading technique
helps overlap CPU and accelerator execution time, so that per-tile
processing cost plus CPU overhead becomes just 7.96ms
instead of 12.76ms.

Table 3 lists the scheduled and measured throughput and
resource utilization of histogram and stereo with and without
the loop perfection optimization. The optimization improves
the performance by 7% and 300% with low resource overhead
for histogram and stereo, respectively. stereo gains more
because the innermost loop is short (4 iterations) and the
overhead of entering and exiting the short loop is relatively
large if not pipelined with the outer loops.
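
The quoted improvements follow directly from the measured rates in Table 3. A minimal sketch of the arithmetic, with a hypothetical helper name:

```cpp
// Relative throughput improvement, in percent, going from the measured
// rate without the optimization to the measured rate with it.
double improvement_percent(double rate_before, double rate_after) {
    return (rate_after / rate_before - 1.0) * 100.0;
}
```

With the Table 3 values, histogram goes from 0.015 to 0.016 pix/cyc (about 7%), and stereo goes from 0.062 to 0.25 pix/cyc (about 300%, i.e. roughly a 4x speedup).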
6.4. Heterogeneous System Performance
System performance can be greatly improved if the CPU and
the accelerator run concurrently while processing a hetero-
geneous pipeline. Figure 11 illustrates the effectiveness of
our multi-threading technique from Section 5, overlapping
CPU and accelerator workloads for stereo. The figure plots
the program runtime using the same accelerator to process
different size images with and without multiple threads. For
each launch, the accelerator processes a 600×400 image tile
in 7.63ms. The software program breaks the image into such
tiles, prepares an input image tile for the accelerator includ-
ing padding the boundary of the original image, launches and
waits for the accelerator, and then repeats for the next tile.
When running the program sequentially with one thread,
each tile takes 12.8ms, indicating a CPU workload of about
5ms/tile. However, using 3 threads to process 3 tiles concur-
rently, the per-tile processing cost reduces to 7.96ms, as most
of the CPU workload now overlaps with accelerator execution.
With this overlapping, the system is bounded by its slowest
part which in this case is the accelerator portion of the work-
load. When the CPU workload dominates, we may not see
such nice behavior, as we find out below.
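
The behavior described above can be approximated with a very simple steady-state model. This is our own idealization, not a model from the measurements: it ignores startup cost, thread contention, and driver overhead, and the function names are hypothetical.

```cpp
#include <algorithm>

// One thread: CPU work and accelerator work for a tile run back to back.
double sequential_ms_per_tile(double cpu_ms, double accel_ms) {
    return cpu_ms + accel_ms;
}

// Enough threads to overlap fully: the slower of the two parts sets
// the steady-state cost per tile.
double overlapped_ms_per_tile(double cpu_ms, double accel_ms) {
    return std::max(cpu_ms, accel_ms);
}
```

Taking about 5 ms of CPU work per tile and the remaining roughly 7.8 ms as accelerator time, the model gives 12.8 ms per tile sequentially and 7.8 ms when overlapped, close to the measured 7.96 ms. When the CPU part is the larger term, the same model predicts that the accelerator no longer sets the bound.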
On our target platform, the accelerators are cache coherent
and share the L2 cache with the CPU cores. If intermediate
buffers passing data between CPU kernels and accelerators fit
in the L2, the number of DRAM accesses will decrease, and
the memory access latency will improve as well. To this end,
blocking the image via the tile primitive effectively reduces
the size of the intermediate buffers.

Rate      Read BW   Block size   HW time, sim.   HW time, measured   DRAM Edyn
(px/cy)   (MB/s)    (KB)         (µs)            (µs)                (nJ/px)
1         300       48           193             199                 0.0
1         300       192          710             721                 2.5
1         300       768          2730            2764                3.3
2         600       48           99              111                 0.0
2         600       192          360             510                 2.3
2         600       768          1378            2444                3.3
Table 4: Performance and energy cost measurements of the
gaussian accelerators for different block sizes and pixel rates
running at 100MHz. Larger blocks cause L2 cache misses to
DRAM, increasing DRAM dynamic energy.

Table 4 summarizes the performance and energy cost of
gaussian compiled for different block sizes and pixel rates.
We show the accelerator execution time for both Verilog simu-
lation and real time as measured when running the accelerator
repeatedly on the Zynq. Memory accesses start to miss in the
L2 when the block size exceeds 48KB, causing the DRAM
dynamic energy to increase from zero to around 3.3nJ/pixel
(i.e. 34pJ/bit-access; see footnote 3).
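
The per-bit figure can be reproduced from the per-pixel energy using the assumptions stated in footnote 3 (6 bytes of memory access per RGB pixel, two DRAM accesses per L2 miss). The helper below is a hypothetical sketch of that arithmetic.

```cpp
// Converts DRAM dynamic energy per pixel (nJ) into energy per bit
// accessed (pJ), given the bytes accessed per pixel and the number of
// DRAM accesses assumed for each access that misses in the L2.
double dram_pj_per_bit(double nj_per_pixel, double bytes_per_pixel,
                       double dram_accesses_per_miss) {
    const double bits = bytes_per_pixel * 8.0 * dram_accesses_per_miss;
    return nj_per_pixel * 1000.0 / bits;
}
```

For 3.3 nJ/pixel this gives 3300 pJ / (6 × 8 × 2) bits, or about 34 pJ per bit access.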
When we compare simulated to measured accelerator per-
formance (sim vs. measured in Table 4), we see that, for low
rate configurations, the measured runtime does not degrade
when large block sizes cause L2 misses, since DMA buffers
can hide the latency. However, for high rate configurations,
although the accelerators are twice as fast as the unit-rate
designs in simulation, peak speed is only achieved for the
48KB block configuration (i.e., when memory accesses all hit
in the L2). Because of cache misses, the actual runtime is 77%
longer than the simulated accelerator runtime for the much
larger 768KB block configuration.
However, breaking images into many small blocks introduces
two problems of its own: first, more overlapping block
boundaries cause more re-computation; and second, the host
CPU has to schedule more accelerator launches through the
device driver interface. On the current platform, we observe
scheduling overhead of around 100∼200 microseconds per
launch in the device driver, mostly caused by context switches
and synchronizations between the user thread and kernel
background threads responsible for managing accelerator launch
and completion queues. As a result of these overheads, for
most of the evaluated applications, fine-grained blocking to fit
in the L2 is not an efficient strategy overall. Instead, we chose
to block the image in larger sizes (e.g. 480×640) for our full
system evaluation.

3 Computing an RGB pixel in gaussian requires 6 bytes of memory access.
If the memory access misses in the L2, we assume it causes two DRAM
accesses.

Application      Rate (pix/cyc)   Read BW (MB/s)   Resource (%)
gaussian         2                200              9/5/3/33
harris           2                284              22/19/7/30
unsharp          1                375              7/5/7/31
stereo           0.25             62               55/32/5/0
bilateral        1                182              23/20/14/13
camera           2                568              9/6/5/5
camera+unsharp   1                250              14/10/11/34
Table 5: Specifications of generated accelerators for the
evaluated applications. The resource utilization is reported as
percentage of LUT, FF, BRAM, and DSP used by the given
application on the Zynq XC7Z020.

With further engineering of our existing system, it would
be possible to increase memory bandwidth for streaming large
blocks. However, if we were to design a new platform, we
would choose a faster CPU core (the Zynq’s 667MHz A9
was slow even at its introduction in 2012), a larger shared
L2 cache, and a hardware launch engine that pulls accelerator
tasks directly from a memory buffer without CPU intervention,
as opposed to having a background thread pushing tasks to the
accelerator.
6.5. Programmability and Efficiency
We implemented six individual applications and an application
that combined the camera pipeline with an unsharp mask, all
in Halide, and generated the hardware and software for Zynq
using our compiler.
Table 5 lists the specifications of generated accelerators for
these applications. For most of the applications, the CPU just
does tiling, i.e. calculating the coordinates of each image tile
and scheduling the accelerator for processing a tile. In stereo
the CPU computes padding using a repeat edge condition, and
in bilateral grid it shuffles data into 8×8 grid order. We run
8-megapixel images through each application, but the BRAM
usage for internal buffering in each accelerator is kept low
thanks to the image tiling.
Figure 12 shows the throughput and energy efficiency of the
Zynq implementation versus TK1’s ARM and GPU cores. On
average, applications on Zynq achieve 2.6× and 1.9× higher
throughput, and 14.2× and 6.1× higher energy efficiency
compared to the CPUs and GPU on the TK1, respectively.
Harris achieves the most energy reduction of 38× and 12×,
as well as the highest throughput speedups of 6× and 3.5×,
compared to the TK1 CPUs and GPU. The energy efficiency is
achieved by high locality and data reuse exploited in the line-
buffered accelerator pipeline, and the greatly reduced memory
requests as compared to programmable cores. Low-precision
fixed-point arithmetic is also very efficient in LUTs and DSPs
on FPGA fabric.
[Figure 12 charts: two bar charts over gaussian, harris, unsharp,
stereo, bilateral grid, camera, and camera + unsharp. Top: throughput
(MP/s) for ARM, CUDA, Zynq, and accelerator (sim). Bottom: energy
efficiency (MP/J) for ARM, CUDA, and Zynq.]

Data labels recovered from the energy efficiency chart (MP/J):
        gaussian  harris  unsharp  stereo  bilateral grid  camera  camera + unsharp
ARM     19        4.5     8.7      0.5     5.7             19      6.6
CUDA    22        14      13       2.6     23              18      7.7
Zynq    88        169     69       14      42              91      73

Figure 12: Throughput and energy efficiency comparison of
Zynq platform versus the four ARM cores and CUDA GPU on
TK1. The accelerator RTL simulation throughput represents
the theoretical ideal for the synthesized accelerator in isolation,
if it were never bottlenecked on other parts of the system
(namely input and output bandwidth and latency). Additional
system optimization, and improved CPU performance in future
SoCs, could push realized Zynq performance closer to this
level.

All Zynq-based applications except camera achieve the
peak throughput of the accelerator in Verilog simulation. The
camera accelerator requires much higher read bandwidth, and
thus suffers the most from bandwidth problems caused by
reading data that misses in L2 cache (see Section 6.4).
Once the application fits in the FPGA fabric, the throughput
of Zynq implementations is generally bound by memory band-
width and clock frequency. Therefore, speedups compared
to the ARM CPUs or GPU on TK1 are proportional to the
number of operations accelerated on the FPGA (approximately
proportional to the LUTs and the DSPs used). For this reason,
harris, stereo, and camera+unsharp get the most speedup. In
bilateral grid, the accelerator also uses a lot of LUTs, but most
of them implement control logic and multiplexers for building
histograms and data-gathering for interpolation, which do rel-
atively little real computation. Moreover, parallelism in the
kernels of the bilateral grid is limited by data dependencies,
making it more favorable to execute on high frequency and
high memory bandwidth processors (i.e., the Tegra CPUs and
GPU).
An important takeaway is that the acceleration using the
FPGA becomes more effective as the image processing
pipeline grows deeper, which matches the trend of new appli-
cations from computational photography and computer vision.
For example, on CPU or GPU, the execution time of the cam-
era+unsharp combination is the sum of the execution times of
the two individual applications. However, pipelining two appli-
cations onto the FPGA fabric simultaneously doesn’t increase
the required memory bandwidth or slow the clock frequency.
Therefore, the throughput for any composition of pipelines
which fit on the FPGA is bounded by the slowest one (in this
case, unsharp, which is around 110 megapixels/second).
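
The contrast between the two composition rules can be written down as a small model. It is a simplification under stated assumptions: each application is summarized by a single throughput figure, the CPU/GPU case runs the two applications back to back, and the FPGA case fuses them into one pipeline. The function names are hypothetical.

```cpp
#include <algorithm>

// CPU or GPU: execution times add, so the combined throughput is the
// harmonic combination of the two individual throughputs (in MP/s).
double sequential_composition_mp_s(double a_mp_s, double b_mp_s) {
    return 1.0 / (1.0 / a_mp_s + 1.0 / b_mp_s);
}

// FPGA: both stages run concurrently in one pipeline, so the combined
// throughput is bounded by the slower stage.
double pipelined_composition_mp_s(double a_mp_s, double b_mp_s) {
    return std::min(a_mp_s, b_mp_s);
}
```

With two hypothetical applications that each sustain 100 MP/s, running them in sequence yields 50 MP/s, while the pipelined composition stays at 100 MP/s.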
6.6. Extensibility
Our dataflow IR provides an untimed, bit accurate specifica-
tion of image processing pipelines with explicit coarse grain
pipeline parallelism information, so the system can be easily
extended to target other HLS compilation tools simply by pro-
viding new code generators. To demonstrate this extensibility,
we ported the system to Catapult HLS [10], a popular com-
mercial tool for ASIC technology, by changing just around
1000 lines of code in the existing code generator module. We
evaluated each Table 1 application in a 14nm ASIC technol-
ogy using the new backend. The hardened pipelines in ASIC
get approximately 12× higher throughput and 8∼20× better
energy efficiency than those programmed on the Zynq FPGA
(fabricated in 28nm).
7. Conclusion
The Halide image processing DSL provides an ability to
quickly create and optimize new image processing applica-
tions. The availability of complex SoC chips with large FPGA
fabrics provides a potential platform for exploring new hetero-
geneous architectures for these applications, but implementing
the hardware accelerator and the interface software is a huge
barrier for many designers. We extended Halide to remove
this barrier, allowing it to generate both the design of the accel-
erator and the software that communicates with the hardware.
Our results demonstrate significant gains in both performance
and energy efficiency, which increase as the imaging computa-
tion becomes more complex. We will open source our system
to share with the community for others to use and build on (footnote 4).
We also plan to extend our system in several ways. First,
we will incorporate more automatic optimization, to help
the designer find optimal low-level buffering and blocking.
Next, given the generality of the underlying computational
model, we want to extend this system to generate “code”
for different underlying image processing engines including
custom hardware and the specialized SIMD and coarse-grain-
reconfigurable-array architectures optimized for image pro-
cessing appearing in new SoCs. We hope this system will help
the community develop and use efficient programmable ISPs.
4 The source is available at https://github.com/jingpu/Halide-HLS.
References
[1] Andrew Adams, Eino-Ville Talvala, Sung Hee Park, David E. Jacobs,
Boris Ajdin, Natasha Gelfand, Jennifer Dolson, Daniel Vaquero, Jong-
min Baek, Marius Tico, Hendrik P. A. Lensch, Wojciech Matusik, Kari
Pulli, Mark Horowitz, and Marc Levoy. The Frankencamera: An exper-
imental platform for computational photography. ACM Transactions
on Graphics, 29(4):29:1–29:12, July 2010.
[2] Joshua Auerbach, David F. Bacon, Ioana Burcea, Perry Cheng,
Stephen J. Fink, Rodric Rabbah, and Sunil Shukla. A compiler and
runtime for heterogeneous computing. In Proceedings of the 49th An-
nual Design Automation Conference, DAC ’12, pages 271–276, New
York, NY, USA, 2012. ACM.
[3] Joshua Auerbach, David F Bacon, Perry Cheng, and Rodric Rabbah.
Lime: a Java-compatible and synthesizable language for heterogeneous
architectures. In ACM Sigplan Notices, volume 45, pages 89–108.
ACM, 2010.
[4] John S. Brunhaver. Design and Optimization of a Stencil Engine. PhD
thesis, Stanford University, 2015.
[5] Jiawen Chen, Sylvain Paris, and Frédo Durand. Real-time edge-aware
image processing with the bilateral grid. In ACM Transactions on
Graphics (TOG), volume 26, page 103. ACM, 2007.
[6] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang.
High-level synthesis for FPGAs: From prototyping to deployment.
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 30(4):473–491, April 2011.
[7] Tomasz S Czajkowski, Utku Aydonat, Dmitry Denisenko, John Free-
man, Michael Kinsner, David Neto, Jason Wong, Peter Yiannacouras,
and Deshanand P Singh. From OpenCL to high-performance hardware
on FPGAs. In Field Programmable Logic and Applications (FPL),
2012 22nd International Conference on, pages 531–534. IEEE, 2012.
[8] Nivia George, HyoukJoong Lee, David Novo, Tiark Rompf, Kevin J
Brown, Arvind K Sujeeth, Martin Odersky, Kunle Olukotun, and Paolo
Ienne. Hardware system synthesis from domain-specific languages.
In Field Programmable Logic and Applications (FPL), 2014 24th
International Conference on, pages 1–8. IEEE, 2014.
[9] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankar-
alingam. Dynamically specialized datapaths for energy efficient com-
puting. In High Performance Computer Architecture (HPCA), 2011
IEEE 17th International Symposium on, pages 503–514. IEEE, 2011.
[10] Mentor Graphics. Catapult high-level synthesis.
https://www.mentor.com/hls-lp/catapult-high-level-synthesis/.
[11] Halide. Halide, a language for image processing and computational
photography. http://halide-lang.org/.
[12] Scott Hauck and Andre DeHon. Reconfigurable computing: the theory
and practice of FPGA-based computation. Morgan Kaufmann, 2010.
[13] James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-
Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, and
Pat Hanrahan. Darkroom: Compiling high-level image processing
code into hardware pipelines. ACM Trans. Graph., 33(4):144:1–11,
July 2014.
[14] Grant Martin and Gary Smith. High-level synthesis: Past, present, and
future. IEEE Design & Test of Computers, 26(4):18–25, 2009.
[15] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and
Rudy Lauwereins. ADRES: An architecture with tightly coupled VLIW
processor and coarse-grained reconfigurable matrix. In Field Pro-
grammable Logic and Application, pages 61–70. Springer, 2003.
[16] Richard Membarth and Oliver Reiche. Fork of HIPAcc generating code
for Vivado HLS. https://github.com/hipacc/hipacc-vivado.
[17] Richard Membarth, Oliver Reiche, Frank Hannig, Jürgen Teich, Mario
Körner, and Wieland Eckert. HIPAcc: A domain-specific language and
compiler for image processing. IEEE Transactions on Parallel and
Distributed Systems, 27(1):210–224, 2016.
[18] Peter Milder, Franz Franchetti, James C Hoe, and Markus Püschel.
Computer generation of hardware for linear digital signal processing
transforms. ACM Transactions on Design Automation of Electronic
Systems (TODAES), 17(2):15, 2012.
[19] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scal-
able parallel programming with CUDA. Queue, 6(2):40–53, 2008.
[20] NVIDIA. Jetpack for L4T.
https://developer.nvidia.com/embedded/jetpack.
[21] Muhsen Owaida, Nikolaos Bellas, Konstantis Daloukas, and Christos D
Antonopoulos. Synthesis of platform architectures from OpenCL pro-
grams. In Field-Programmable Custom Computing Machines (FCCM),
2011 IEEE 19th Annual International Symposium on, pages 186–193.
IEEE, 2011.
[22] M Akif Özkan, Oliver Reiche, Frank Hannig, and Jürgen Teich.
FPGA-based accelerator design from a domain-specific language. In
International Conference on Field Programmable Logic and Applications,
2016.
[23] Alexandros Papakonstantinou, Karthik Gururaj, John A Stratton, Dem-
ing Chen, Jason Cong, and Wen-Mei W Hwu. FCUDA: Enabling
efficient compilation of CUDA kernels onto FPGAs. In Application
Specific Processors, 2009. SASP’09. IEEE 7th Symposium on, pages
35–42. IEEE, 2009.
[24] Sylvain Paris and Frédo Durand. A fast approximation of the bilateral
filter using a signal processing approach. International journal of
computer vision, 81(1):24–52, 2009.
[25] Raghu Prabhakar, David Koeplinger, Kevin Brown, HyoukJoong Lee,
Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Gen-
erating configurable hardware from parallel patterns. arXiv preprint
arXiv:1511.06968, 2015.
[26] Qualcomm Inc. Snapdragon 800 series mobile processors.
https://www.qualcomm.com/products/snapdragon/processors/800-series.
[27] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy,
Saman Amarasinghe, and Frédo Durand. Decoupling algorithms from
schedules for easy optimization of image processing pipelines. ACM
Transactions on Graphics (TOG), 31(4):32, 2012.
[28] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain
Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language
and compiler for optimizing parallelism, locality, and recomputation in
image processing pipelines. In Proceedings of the 34th ACM SIGPLAN
Conference on Programming Language Design and Implementation,
PLDI ’13, pages 519–530, New York, NY, USA, 2013. ACM.
[29] Oliver Reiche, Moritz Schmid, Frank Hannig, Richard Membarth, and
Jürgen Teich. Code generation from a domain-specific language for C-
based HLS of hardware accelerators. In Hardware/Software Codesign
and System Synthesis (CODES+ ISSS), 2014 International Conference
on, pages 1–10. IEEE, 2014.
[30] John E Stone, David Gohara, and Guochun Shi. OpenCL: A parallel
programming standard for heterogeneous computing systems.
Computing in Science & Engineering, 12(1-3):66–73, 2010.
[31] Arvind Sujeeth, HyoukJoong Lee, Kevin Brown, Tiark Rompf, Has-
san Chafi, Michael Wu, Anand Atreya, Martin Odersky, and Kunle
Olukotun. OptiML: an implicitly parallel domain-specific language for
machine learning. In Proceedings of the 28th International Conference
on Machine Learning (ICML-11), pages 609–616, 2011.
[32] Michael Wolfe. Beyond induction variables. In Proceedings of the
ACM SIGPLAN 1992 Conference on Programming Language Design
and Implementation, PLDI ’92, pages 162–174, New York, NY, USA,
1992. ACM.
[33] Xilinx. AXI DMA v7.1 LogiCORE IP product guide.
http://www.xilinx.com/support/documentation/ip_documentation/axi_dma/v7_1/pg021_axi_dma.pdf.
[34] Xilinx. Vivado high-level synthesis.
http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[35] Xilinx. Xilinx wiki - open source Linux.
http://www.wiki.xilinx.com/Open+Source+Linux.
[36] Xilinx. Zynq-7000 all programmable SoC overview.
http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf.
[37] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and
Jason Cong. AutoPilot: A platform-based ESL synthesis system. In
High-Level Synthesis, pages 99–112. Springer, 2008.
