HipaccVX: Wedding of OpenVX and DSL-based Code Generation by Özkan, M. Akif et al.
HipaccVX: Wedding of OpenVX and DSL-based Code
Generation
M. Akif Özkan · Burak Ok · Bo Qiao · Jürgen Teich · Frank Hannig
This is the author’s version of the work. Personal use of this material is permitted. Permission from Springer must be obtained for all other
uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract Writing programs for heterogeneous plat-
forms optimized for high performance is hard since
this requires the code to be tuned at a low level with
architecture-specific optimizations that are often
based on fundamentally different programming paradigms
and languages. OpenVX promises to solve this issue for
computer vision applications with a royalty-free indus-
try standard that is based on a graph-execution model.
Yet, OpenVX’s algorithm space is constrained to a
small set of vision functions. This hinders accelerating
computations that are not included in the standard.
In this paper, we analyze OpenVX vision functions
to find an orthogonal set of computational abstrac-
tions. Based on these abstractions, we couple an ex-
isting Domain-Specific Language (DSL) back end to the
OpenVX environment and provide language constructs
to the programmer for the definition of user-defined
nodes. In this way, we enable optimizations that are
not possible to detect with OpenVX graph implemen-
tations using the standard computer vision functions.
These optimizations can double the throughput on an
Nvidia GTX GPU and decrease the resource usage of a
Xilinx Zynq FPGA by 50% for our benchmarks. Finally,
we show that our proposed compiler framework, called
HipaccVX, can achieve better results than the state-of-
the-art approaches Nvidia VisionWorks and Halide-HLS.
Keywords OpenVX · Domain-specific language ·
Image processing · GPU · FPGA
M. Akif Özkan · Burak Ok · Bo Qiao · Jürgen Teich · Frank
Hannig
Hardware/Software Co-Design, Department of Computer
Science, Friedrich-Alexander University Erlangen-Nürnberg
(FAU), Germany
E-mail: {akif.oezkan, burak.ok, bo.qiao, teich, hannig}@fau.de
1 Introduction
The emergence of cheap, low-power cameras and embedded
platforms has boosted the use of smart systems with
Computer Vision (CV) capabilities in a broad spectrum of
markets, ranging from consumer electronics, such as mobile,
to real-time automotive applications and industrial
automation, e.g., semiconductors, pharmaceuticals, and
packaging. The global machine vision market size was valued
at $16.0 billion in 2018 and is expected to reach a value of
$24.8 billion by 2023 [2].
A CV application might be implemented on a great
variety of hardware architectures ranging from Graphics
Processing Units (GPUs) to Field Programmable Gate
Arrays (FPGAs) depending on the domain and the as-
sociated constraints (e.g., performance, power, energy,
and cost). Yet, for sophisticated real-life applications,
the best trade-off is often achieved by heterogeneous
systems incorporating different computing components
that are specialized for particular tasks.
Optimizing CV programs to achieve high perfor-
mance on such heterogeneous systems usually goes along
with sacrificing readability, portability, and modularity.
The programs need to be tuned at a low level with
architecture-specific optimizations that are typically
based on drastically different programming paradigms
and languages (e.g., parallel programming of multicore
processors using C++ combined with OpenMP; vector
data types, libraries, or intrinsics to utilize the SIMD1
units of CPU; CUDA or OpenCL for programming
GPU accelerators; hardware description languages such
as Verilog or VHDL for targeting FPGAs). Partitioning
a program across different computing units, and accord-
ingly, synchronizing the execution is difficult. In order to
1 Single Instruction, Multiple Data (SIMD) units are CPU
components for vector processing, i.e., they execute the same
operation on multiple data elements in parallel.
arXiv:2008.11476v1 [cs.CV] 26 Aug 2020
Table 1: Available features in OpenVX (VX), DSL compiler Hipacc (H), and our joint approach HipaccVX (HVX).
Features | VX | H | HVX
Industrial standard (open, royalty-free) | ✓ | ✗ | ✓
Community driven open-source implementations | ✗ | ✓ | ✓
Well-known CV functions (e.g., optical flow) | ✓ | ✗ | ✓
High-level abstractions that adhere to distinct memory access patterns (e.g., local) | ✗ | ✓ | ✓
Custom node execution on accelerator devices (i.e., OpenCL) | ✓ | ✗ | ✓
Acceleration of the custom nodes that are based on high-level abstractions | ✗ | ✓ | ✓
achieve these ambitious goals, high development effort
and architecture expert knowledge are required.
In 2014, the Khronos Group released OpenVX as a
C-based API to facilitate cross-platform portability not
only of the code but also of the performance for CV appli-
cations [29]. This is momentous since OpenVX is the first
(royalty-free) standard for a graph-based specification
of CV algorithms. Yet, OpenVX’s algorithm space is
constrained to a relatively small set of vision functions.
Users are allowed to instantiate additional code in the
form of custom nodes, but these cannot be analyzed
at the system level by the graph-based optimizations
applied by an OpenVX back end. Additionally, this
requires users to optimize their own implementations,
although an OpenVX user is not supposed to be concerned
with performance optimizations. Standard programming languages such as
OpenCL do not offer performance portability across dif-
ferent computing platforms [24,4]. Therefore, the user
code, even optimized for one specific device, might not
provide the expected high-performance when compiled
for another target device. These deficiencies are listed
in Table 1.
A solution to the problems mentioned above is of-
fered by the community working on Domain-Specific
Languages (DSLs) for image processing. Recent works
show that excellent results can be achieved when high-
level image processing abstractions are specialized to a
target device via modern metaprogramming, compiler,
or code generation approaches [18,8,10]. These DSLs
are able to generate code from a set of algorithmic ab-
stractions that lead to high-performance execution for
diverse types of computing platforms. However, existing
DSLs lack a formal verification step and therefore do not
ensure the safe execution of a user application, whereas
OpenVX is an industrial standard that mandates graph verification.
In this paper, we couple the advantages of DSL-
based code generation with OpenVX (summarized in
Table 1). We present a set of abstractions that are used
as basic building blocks for expressing OpenVX’ stan-
dard CV functions. These building blocks are suitable
for generating optimized, device-specific code from the
same functional description, and are systematically uti-
lized for graph-based optimizations. In this way, we
achieve performance portability not only for OpenVX’
CV functions but also for user-defined kernels2 that are
expressed with these computational abstractions. The
contributions of this paper are summarized as follows:
– We systematically categorize and specify OpenVX’
CV functions by high-level abstractions that adhere
to distinct memory access patterns (see Section 4).
– We propose a framework called HipaccVX, which
is an OpenVX implementation that achieves high
performance for a wide variety of target platforms,
namely, GPUs, CPUs, and FPGAs (see Section 5).
– HipaccVX supports the definition of custom nodes
(i.e., user-defined kernels) based on the proposed
abstractions (see Section 5.1).
– To the best of our knowledge, our approach is the first
one that allows for graph-based optimizations that
incorporate not only standard OpenVX CV nodes
but also user-defined custom nodes (see Section 5.2),
i.e., optimizations across standard and custom nodes.
2 Related Work
Unlike OpenCL and OpenMP, the OpenVX specification is not
constrained to a certain memory model and therefore
enables better performance portability than traditional
libraries such as OpenCV [19]. It has been implemented
by a few major vendors, including Nvidia, Intel, AMD,
and Synopsys [30]. The authors of [5,33,9,34,27] fo-
cus on graph scheduling and design space exploration
for heterogeneous systems consisting of GPUs, CPUs,
and custom instruction-set architectures. Unlike the
prior work, [26] suggests static OpenVX compilation for
low-power embedded systems instead of runtime-library
implementations. Our work is similar to this since we
statically analyze a given OpenVX application and com-
bine the benefits of domain-specific code generation
approaches [18,8,10,21,15,3].
2 A kernel in OpenVX is the abstract representation of a
computer vision function [32].
[Figure 1: A source-to-source compiler (Clang/LLVM) takes the C++ embedded DSL and OpenVX code as input, combines domain knowledge and architecture knowledge, and generates CUDA (GPU), OpenCL (x86/GPU), C/C++ (x86), Renderscript (x86/ARM/GPU), OpenCL (Altera FPGA), or Vivado C++ (Xilinx FPGA) code for the CUDA/OpenCL/Renderscript runtime library, AOCL, or Vivado HLS.]
Fig. 1: HipaccVX overview.
Halide [18], Hipacc [8], and PolyMage [10] are image
processing DSLs that provide language constructs and
scheduling primitives to generate code that is optimized
for the target device, e.g., CPUs or GPUs. Halide [18]
decouples the algorithm description from scheduling
primitives, e.g., vectorization and tiling, while Hipacc [8] and
PolyMage [10] implicitly apply these optimizations on a
graph-based description similar to OpenVX. CAPH [22],
RIPL [25], and Rigel [6] are image processing DSLs that
generate optimized code for FPGAs. Hipacc-FPGA [21]
supports HLS tools of both Xilinx and Intel, while
Halide-HLS [15], PolyMage-HLS [3], and RIPL only tar-
get Xilinx devices. CAPH relies upon the actor/dataflow
model of computation to generate VHDL or SystemC
code. Our approach could also be used to implement
OpenVX by these image processing DSLs.
There is no publicly available OpenVX implementa-
tion for Xilinx FPGAs to the best of our knowledge. Intel
OpenVino [7] provides a few example applications that
are specific to Arria-10 FPGAs. Taheri et al. [28] provide
some initial results for FPGAs, where the main attention
is the scheduling of statistical kernels (i.e., histogram).
The image processing DSLs in [21,3] use similar tech-
niques to implement user applications as a streaming
pipeline. Section 5.2.1 shows how to instrument these
techniques for the OpenVX API. Omidian et al. [12]
present a heuristic algorithm for the design space explo-
ration of OpenVX graphs for FPGAs. This algorithm
could be simplified by using HipaccVX’ abstractions
(see Section 4) instead of OpenVX’ CV functions. Then
it could be used in conjunction with HipaccVX to ex-
plore the design space of hardware/software platforms.
Moreover, Omidian et al. [11] suggest an overlay archi-
tecture for FPGA implementations of OpenVX. The
proposed overlay implementation requires the optimized
implementation of OpenVX’ CV functions, which could
be generated by HipaccVX. Furthermore, an overlay
architecture based on HipaccVX’s abstractions, which
is a smaller set of functions compared to OpenVX CV
functions, could reduce resource usage in [11].
Intel’s OpenVX implementation [1] is the first work
extending the OpenVX standard with an interoperability
API for OpenCL. This is supported in OpenVX v1.3 [32].
Yet, performance portability still cannot be assured for
the custom nodes. OpenCL code tuned for a specific
CPU might perform very poorly on FPGA and GPU
architectures [24,4]. In contrast to our approach, the
performance of this approach relies on the user code.
3 OpenVX and Image Processing DSLs
In the following Sections 3.1 and 3.2, we briefly ex-
plain the programming models of OpenVX and image
processing DSLs, respectively. Then, we discuss the com-
plementary features of these approaches in Section 3.3,
which are the motivation of this work.
3.1 OpenVX programming model
OpenVX is an open, royalty-free C-based standard for
the cross-platform acceleration of computer vision ap-
plications. The specification does not mandate any op-
timizations or requirements on device execution; in-
stead, it concentrates on software abstractions that are
freed from low-level platform-specific declarations. The
OpenVX API is totally opaque; that is, the memory
hierarchy and device synchronization are hidden from
the user. Typically, platform experts of the individual
hardware vendors provide optimized implementations
of the OpenVX API [30].
Listing 1 shows an example OpenVX code for a sim-
ple edge detection algorithm, for which the application
graph is shown in Figure 2. An application is described
[Figure 2: OpenVX code (application graph): img0 → Channel Extract → virt0 → Gaussian → virt1 → Sobel → virt2, virt3; virt3 → Magnitude → virt4 → Threshold → img1.]
Fig. 2: Graph representation for the OpenVX code given in Listing 1. The output image (img1 ) contains
solely the horizontal edges extracted from the input image (img0 ). The virt2 image is defined only
because OpenVX’ Sobel function returns both horizontal and vertical edges. This redundant computation
is eliminated during the optimization passes of our HipaccVX compiler framework (see Section 5.2.2).
as a Directed Acyclic Graph (DAG), where nodes repre-
sent CV functions (see Lines 14 to 18) and data objects,
i.e., images, scalars (see Lines 4 to 12), while edges show
the dependencies between nodes. All OpenVX objects
(i.e., graph, node, image) exist within a context (Line 1).
A context keeps track of the allocated memory resources
and promotes implicit freeing mechanisms at release
calls (Line 24). A graph (Line 2) solely operates on the
data objects attached to the same context.
The data objects that are used only for the interme-
diate steps of a calculation, which can be inaccessible for
the rest of the application, should be specified as virtual
by the users. Virtual data objects (i.e., virtual images de-
fined in Lines 9 to 12) cannot be accessed via read/write
operations. This paves the way for system-level opti-
mizations applied in a platform-specific back end, i.e.,
host-device data transfers or memory allocations [19].
The execution is not eager; an OpenVX graph must
be verified (Line 20) before it is executed (Line 22).
The verification ensures the safe execution of the graph
and resolves the implementation types of virtual data
objects. The OpenVX standard mandates that a verifi-
cation procedure must (i) validate the node parameters
(i.e., presence, directions, data types, range checks), and
(ii) assure the graph connectivity (detection of cycles),
at the minimum [31]. Optimizations of an OpenVX back
end should be performed during the verification phase.
The verification is considered to be an initialization
procedure and might restructure the application graph
before the execution. A verified graph can be executed
repeatedly for different input parameters (i.e., a new
frame in video processing).
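The connectivity check in step (ii) can be illustrated with a topological sort. The following is a minimal sketch and not the code of any OpenVX implementation; the `Graph` type and the `isAcyclic` function are hypothetical names introduced only for this example, and vertices are assumed to be plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical minimal graph: vertices are indices 0..numVertices-1 and
// each edge (from, to) stands for a data dependency.
struct Graph {
    std::size_t numVertices;
    std::vector<std::pair<std::size_t, std::size_t>> edges;
};

// Kahn's algorithm: repeatedly remove vertices without incoming edges.
// The graph is acyclic exactly when every vertex can be removed this way.
bool isAcyclic(const Graph& g) {
    std::vector<std::vector<std::size_t>> successors(g.numVertices);
    std::vector<std::size_t> inDegree(g.numVertices, 0);
    for (const auto& e : g.edges) {
        assert(e.first < g.numVertices && e.second < g.numVertices);
        successors[e.first].push_back(e.second);
        ++inDegree[e.second];
    }

    std::queue<std::size_t> ready;
    for (std::size_t v = 0; v < g.numVertices; ++v)
        if (inDegree[v] == 0) ready.push(v);

    std::size_t removed = 0;
    while (!ready.empty()) {
        const std::size_t v = ready.front();
        ready.pop();
        ++removed;
        for (std::size_t w : successors[v])
            if (--inDegree[w] == 0) ready.push(w);
    }
    return removed == g.numVertices;
}
```

Because the order in which vertices are removed is a valid execution order, a back end could reuse the same pass to schedule the nodes of a verified graph.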
3.1.1 Deficiencies of OpenVX
As mentioned above, the OpenVX standard relieves an
application programmer from low-level, implementation-
specific descriptions, and thus enables portability across
a variety of computing platforms. In OpenVX, the small-
est component to express a computation is a graph node
(e.g., vxGaussian3x3Node) from the set of base CV func-
tions. However, these CV functions are restricted to
1 vx_context context = vxCreateContext();
2 vx_graph graph = vxCreateGraph(context);
3
4 vx_image img[] = {
5   vxCreateImage(context, width, height, VX_DF_IMAGE_UYVY),
6   vxCreateImage(context, width, height, VX_DF_IMAGE_U8)};
7
8 vx_image virt[] = {
9   vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT),
10   vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT),
11   vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT),
12   vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT), vxCreateVirtualImage(graph, 0, 0, VX_DF_IMAGE_VIRT)};
13
14 vxChannelExtractNode(graph, img[0], VX_CHANNEL_Y, virt[0]);
15 vxGaussian3x3Node(graph, virt[0], virt[1]);
16 vxSobel3x3Node(graph, virt[1], virt[2], virt[3]);
17 vxMagnitudeNode(graph, virt[3], virt[3], virt[4]);
18 vxThresholdNode(graph, virt[4], thresh, img[1]);
19
20 status = vxVerifyGraph(graph);
21 if (status == VX_SUCCESS)
22   status = vxProcessGraph(graph);
23
24 vxReleaseContext(&context);
Listing 1: OpenVX code for an edge detection
algorithm. The application graph derived for this
OpenVX program is shown in Figure 2.
a small set since OpenVX has a tight focus on cross-
platform acceleration [32]. Custom nodes can be added
to extend this functionality3, but they leave the following
issues unresolved: (i) Users are responsible for the
performance of a custom node, although they are not
supposed to be concerned with performance optimizations.
(ii) Portability of performance cannot be ensured for the
cross-platform acceleration of user code. (iii) The graph
optimization routines cannot analyze custom nodes.
For instance, consider Figure 4 that depicts an OpenVX
application graph with three CV function nodes (red)
and a user-defined kernel node (blue). A GPU back end
would offer optimized implementations of the vxNodes
3 The support for the execution of a user code (custom
node) as part of an application graph on an accelerator device
was introduced only recently (August 2019) with the release of
OpenVX v1.3 [32]. Previous versions [31] constrained the usage
of the user-defined kernels to the host platform and required
them to be implemented as C++ kernels.
(e.g., Gauss), but the user code (custom node) is a black
box for the graph optimizations.
Programming models such as OpenCL can be used to
implement custom nodes. This enables functional porta-
bility across a great variety of computing platforms.
However, the user should have expertise in the target
architecture in order to optimize an implementation for
high performance. Furthermore, OpenCL cannot assure
the portability of the performance since the code needs
to be tuned according to the target device, i.e., usage of
device-specific synchronization primitives, exploitation
of texture memory if available, usage of vector opera-
tions, or different numbers of hardware threads [24,4].
In fact, OpenCL code optimized for an Instruction
Set Architecture (ISA) ultimately has to be rewritten
for an FPGA implementation in order to deliver high
performance [13].
3.2 Image Processing DSLs
Recently proposed DSL compilers for image processing,
such as Halide [18], Hipacc [8], and PolyMage [10],
enable performance portability across varying
computing platforms. All of them take as input a high-
level, functional description of the algorithm and gener-
ate platform-specific code tuned for the target device.
In this work, we use Hipacc to present our approach.
Hipacc provides language constructs that are em-
bedded into C++ for the concise description of compu-
tations. Applications are defined in a Single Program,
Multiple Data (SPMD) context, similar to kernels in
CUDA and OpenCL. For instance, Listing 2 shows the
description of a discrete Gaussian blur filter application.
First, a Mask is defined in Line 7 from a constant ar-
ray. Then, input and output Images are defined as C++
objects in Lines 12 and 13, respectively. Clamping is
selected as the image boundary handling mode for the
input image in Line 16. The whole input and output
images are defined as Region of Interest (ROI) by the
Accessor and IterationSpace objects that are specified
in Lines 17 and 20, respectively. Finally, the Gaussian
kernel is instantiated in Line 23 and executed in Line 24.
Listing 3 describes the actual operator kernel for the
Gaussian shown in Listing 2. The LinearFilter is a
user-defined class that is derived from Hipacc’s Kernel
class, where the kernel method is overridden. There, a
user describes a convolution as a lambda function using
the convolve() construct, which computes an output
pixel (output()) from an input window (input(mask)).
Hipacc’s compiler utilizes Clang’s Abstract Syntax Tree
(AST) to specialize the lambda function according to
the selected platform and generates device-specific code
that provides high-performance implementations when
1 // filter mask for Gaussian blur filter
2 const float filter_mask[3][3] = {
3   { 0.057118f, 0.124758f, 0.057118f },
4   { 0.124758f, 0.272496f, 0.124758f },
5   { 0.057118f, 0.124758f, 0.057118f }
6 };
7 Mask<float> mask(filter_mask);
8
9 // input and output images
10 size_t width, height;
11 uchar *image = read_image(&width, &height, "input.pgm");
12 Image<uchar> in(width, height, image);
13 Image<uchar> out(width, height);
14
15 // reading from in with clamping as boundary condition
16 BoundaryCondition<uchar> cond(in, mask, Boundary::CLAMP);
17 Accessor<uchar> acc(cond);
18
19 // output image (region of interest is the whole image)
20 IterationSpace<uchar> iter(out);
21
22 // instantiate and launch the Gaussian blur filter
23 LinearFilter Gaussian(acc, iter, mask);
24 Gaussian.execute();
Listing 2: Hipacc application code for a Gaussian
filter. It instantiates the LinearFilter Kernel given
in Listing 3.
1 class LinearFilter : public Kernel<uchar> {
2   // ...
3 public:
4   LinearFilter(Accessor<uchar> &input,      // input image
5                IterationSpace<uchar> &out,  // output image
6                Mask<float> &mask)           // mask
7     : {/* ... */}
8
9   void kernel() {  // convolve -> local operator
10     output() = convolve(mask, Reduce::SUM, [&] () -> uchar {
11       return mask() * input(mask);
12     });
13   }
14 };
Listing 3: Hipacc kernel code for an FIR filter.
compiled with the target architecture compiler. We re-
fer to [8,21] for more detailed explanations, further
programming language constructs of Hipacc as well as
corresponding code generation techniques.
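To make the semantics of such a local operator concrete, the following plain C++ sketch computes what the convolution of Listing 3 describes for a 3x3 mask with clamping at the image border. It is only a sequential reference under our own assumptions (row-major 8-bit images, rounding and saturation of the result) and does not reflect the code Hipacc generates; `clampedAt` and `convolve3x3` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a pixel; out-of-range coordinates are clamped to the image border.
inline std::uint8_t clampedAt(const std::vector<std::uint8_t>& img,
                              int width, int height, int x, int y) {
    x = std::min(std::max(x, 0), width - 1);
    y = std::min(std::max(y, 0), height - 1);
    return img[static_cast<std::size_t>(y) * width + x];
}

// Local operator: every output pixel is the mask-weighted sum of the
// 3x3 neighborhood around the corresponding input pixel.
std::vector<std::uint8_t> convolve3x3(const std::vector<std::uint8_t>& in,
                                      int width, int height,
                                      const float (&mask)[3][3]) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += mask[dy + 1][dx + 1] *
                           clampedAt(in, width, height, x + dx, y + dy);
            // Round to nearest and saturate to the 8-bit range (assumption).
            const float rounded = std::min(std::max(sum + 0.5f, 0.0f), 255.0f);
            out[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>(rounded);
        }
    }
    return out;
}
```

The DSL description leaves the loop nest, the boundary handling, and the memory layout to the compiler, which is what allows the same kernel to be specialized for different targets.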
3.3 Combining OpenVX with Image Processing DSLs
Our solution to the posed challenges in Section 3.1.1 is
introducing an orthogonal set of so-called computational
abstractions that enables high-performance implemen-
tations for a variety of computing platforms (such as
CPUs, GPUs, FPGAs), similar to the DSLs discussed
in Section 3.2. These abstractions should be used to
implement OpenVX’ CV functions and, at the same
time, be offered to the user for the definition of custom
nodes.
Assume that the geometric shapes in Figure 4 repre-
sent the abstractions above. By implementing both the
OpenVX CV functions and the custom node using the
[Figure 3: HipaccVX implementation graph built from point and local operators on UYVY, U8, and s16 images, and its optimized version after dead computation elimination and node aggregation (UYVY → point + local → U8 → local + point + point → U8). Code generation targets shown: CUDA, OpenCL, C/C++, and HLS.]
Fig. 3: The application graph in Figure 2 is implemented by using high-level abstractions called point
and local (explained in Section 4) instead of OpenVX vision functions. This enables high-performance code
generation for various targets when coupled with a DSL compiler and additional optimizations such as
dead computation elimination and node aggregation (see Sections 5.1.1 and 5.2).
[Figure 4: three VxNodes and one custom node, all composed of the same computational abstractions.]
Fig. 4: HipaccVX enables performance portability for
user-defined code by representing OpenVX’ CV func-
tions as well as custom nodes by a small set of compu-
tational abstractions.
same basic building blocks (the different geometric shapes
in the figure), a consistent graph is constructed for the
implementation. Consequently, the problem of instantiating
the user code as a black box is eliminated. Likewise,
assume that all the CV functions of the OpenVX code
in Listing 1 are implemented by using the computational
abstractions called point and local (explained in
Section 4). Then, its application graph (Figure 2)
transforms into the implementation graph shown in Figure 3.
This implementation graph could be used for target-
specific optimizations and code generation similar to
the DSL compiler approaches for image processing.
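One optimization that such an implementation graph enables is dead computation elimination (cf. Figure 3). The following sketch illustrates the idea as a backward liveness pass. It is a simplified illustration, not HipaccVX's implementation: the `Op` record and `eliminateDead` are hypothetical, operators are assumed to be listed in topological order, and the Sobel node is assumed to be split into two local operators, one per output.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical operator of an implementation graph: one output image
// computed from a list of input images, all identified by name.
struct Op {
    std::string name;
    std::vector<std::string> inputs;
    std::string output;
};

// Keep only operators whose output is observable or feeds a kept operator.
// Operators are assumed to be listed in topological order.
std::vector<Op> eliminateDead(const std::vector<Op>& ops,
                              const std::set<std::string>& observable) {
    std::set<std::string> needed = observable;
    std::vector<bool> live(ops.size(), false);
    // Walk backwards so that consumers are visited before their producers.
    for (std::size_t i = ops.size(); i-- > 0;) {
        assert(!ops[i].output.empty());
        if (needed.count(ops[i].output) != 0) {
            live[i] = true;
            needed.insert(ops[i].inputs.begin(), ops[i].inputs.end());
        }
    }
    std::vector<Op> kept;
    for (std::size_t i = 0; i < ops.size(); ++i)
        if (live[i]) kept.push_back(ops[i]);
    return kept;
}
```

Applied to the edge detection example with `img1` as the only observable image, the operator that produces `virt2` is dropped, since virtual images cannot be read by the application.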
In this paper, we implement the OpenVX standard
by the computational abstractions explained in Section 4.
We accomplish this task by developing a back end for
OpenVX using Hipacc (as an existing image processing
DSL) instead of standard programming languages. In
this way, we get the best of both worlds (OpenVX and
DSL works). Our approach relies on OpenVX’ industry-
standard graph specification and enables DSL-based
code generation. The user is offered well-known CV
functions as well as DSL elements (i.e., programming
constructs, abstractions) for the description of custom
nodes. As a result, programmers can write functional
descriptions for custom nodes without being concerned
about performance, which allows writing
performance-portable OpenVX programs
for a larger algorithm space.
4 Computational Abstractions
We have analyzed OpenVX’ CV functions and categorized
them into the computational abstractions summarized in
Table 2. The categorization is mainly based on three groups
of operators (presented in Figure 5): (i) point operators
that compute an output pixel from one input pixel,
(ii) local operators whose output depends on neighboring
pixels within a certain region, and (iii) global operators
where the output might depend on the whole input image.
We have identified the following patterns for the global
operators: (a) reduction: traverses an input image to compute
one output (e.g., max, mean), (b) histogram: categorizes
(maps) input pixels to bins according to a binning (reduce)
function, (c) scaling: downsizes or expands input images by
interpolation, (d) scan: each output pixel depends on the
previous output pixel. Warp, transpose, and matrix
multiplication are denoted as global operator blocks.
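The access patterns behind these categories can be illustrated with a few sequential C++ functions. This is a minimal sketch under our own assumptions (single-channel, row-major 8-bit images); the function names are hypothetical and were chosen only to mirror the pattern names, and the code is not taken from HipaccVX.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::uint8_t>;  // row-major, one channel

// point: each output pixel depends on exactly one input pixel.
Image threshold(const Image& in, std::uint8_t t) {
    Image out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] > t ? 255 : 0;
    return out;
}

// reduction (global): the whole image is traversed to compute one value.
std::uint8_t maxValue(const Image& in) {
    assert(!in.empty());
    std::uint8_t m = in[0];
    for (std::uint8_t p : in)
        if (p > m) m = p;
    return m;
}

// histogram (global): every pixel is mapped to a bin, bins are counted.
std::array<std::size_t, 256> histogram(const Image& in) {
    std::array<std::size_t, 256> bins{};
    for (std::uint8_t p : in) ++bins[p];
    return bins;
}

// scan (global): each output depends on the previous output, shown here
// as a running sum along every image row.
std::vector<std::uint32_t> rowScan(const Image& in, std::size_t width) {
    assert(width > 0 && in.size() % width == 0);
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] + (i % width == 0 ? 0u : out[i - 1]);
    return out;
}
```

The point operator has no dependencies between pixels, while the reduction and the histogram need a combining step and the scan carries a dependency along each row, which is why these patterns call for different parallelization strategies on each target.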
Through the introduction of the node-internal com-
putational abstractions, our approach enables additional
optimizations that manipulate the computation (see
Sections 5.1.1 and 5.2). This is also illustrated in Fig-
[Figure 5: three panels, each showing an input image and an output image: (a) Point Operator, (b) Local Operator, (c) Global Operator.]
Fig. 5: The considered computational abstractions (listed in Table 2) are based on three groups of operators.
Table 2: Categorization of the OpenVX Kernels according to data access patterns
OpenVX Kernels | HipaccVX Abstractions | Hipacc Abstractions
AbsDiff, Copy, Add, Subtract, And, Xor, Or, Not, ChannelCombine, ChannelExtract, ColorConvert, ConvertDepth, Magnitude, Phase, Multiply, ScaleImage, Threshold, TensorAdd, TensorSubtract, TensorConvertDepth, TensorMultiply, ScalarOperation, Select, Remap | point | Kernel
NonMaxSuppression, Dilate3x3, Erode3x3, NonLinearFilter, Median3x3, BilateralFilter, Sobel3x3, Box3x3, Convolve, Gaussian3x3, LBP, FastCorners | local | Kernel
MinMaxLoc, MeanStdDev, Min, Max | reduce (global) | Reduction
Histogram | histogram (global) | Histogram
scale-image | scale (global) | Interpolation
GaussianPyramid, LaplacianPyramid, LaplacianReconstruct | pyramid (global) | Pyramid
IntegralImage | scan (global) | Software
WarpAffine, WarpPerspective | warp (global) | Software
TensorTranspose, TensorMatrixMultiply | transpose, matrixMult (global) | Software
HarrisCorners | point + local + custom | Kernel, Software
EqualizeHist | histogram + point | Kernel, Histogram
OpticalFlowPyrLK | point + local + pyramid + custom | Kernel, Pyramid, Software
HOGCells | custom + local + histogram | Kernel, Software
CannyEdge | point + local + custom | Kernel, Software
ure 3, where redundant computations are eliminated,
and nodes are aggregated for better exploitation of
locality. Memory access patterns of our abstractions
entail system-level optimization strategies motivated
by the OpenVX standard, such as image tiling [27]
and hardware-software partitioning [28]. An abstraction-
based implementation allows expressing aggregated com-
putations as part of the reconstructed graph. In this
way, an implementation graph, as well as an application
graph can be expressed using the same graph structure.
Furthermore, using the proposed set of abstractions re-
duces code duplication compared to typical approaches,
where the libraries are implemented using hand-written
CV functions. For instance, 36 of OpenVX’ CV func-
tions can be implemented solely with the description of
point and local operators as shown in Table 2; that is, a
few highly optimized building blocks for a single target
platform (e.g., GPU) can be reused.
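The benefit of aggregating nodes can be sketched for the simplest case, a chain of point operators. The code below is an illustration under our own assumptions and mimics the effect at run time with `std::function`, whereas a code generator would fuse the operator bodies at compile time; `PointOp`, `applySeparately`, `aggregate`, and `applyAggregated` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Pixel = std::int32_t;
using PointOp = std::function<Pixel(Pixel)>;

// Without aggregation: one pass over the image and one intermediate
// image per operator.
std::vector<Pixel> applySeparately(std::vector<Pixel> img,
                                   const std::vector<PointOp>& ops) {
    for (const auto& op : ops) {
        assert(static_cast<bool>(op));
        std::vector<Pixel> next(img.size());
        for (std::size_t i = 0; i < img.size(); ++i) next[i] = op(img[i]);
        img = std::move(next);
    }
    return img;
}

// Aggregation: compose all point operators into a single point operator.
PointOp aggregate(const std::vector<PointOp>& ops) {
    return [ops](Pixel p) {
        for (const auto& op : ops) p = op(p);
        return p;
    };
}

// With aggregation: a single pass, intermediate values never leave the
// loop body and no intermediate image is allocated.
std::vector<Pixel> applyAggregated(const std::vector<Pixel>& img,
                                   const std::vector<PointOp>& ops) {
    const PointOp fused = aggregate(ops);
    std::vector<Pixel> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) out[i] = fused(img[i]);
    return out;
}
```

Both variants compute the same result. The aggregated one is again a point operator, so it can be represented in the same graph structure as the nodes it replaces.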
5 The HipaccVX Framework
In this paper, we developed a framework, called HipaccVX,
which is a DSL-based implementation of OpenVX.
We extended the OpenVX specification by Hipacc code
interoperability (see Section 5.1) such that programmers
can register Hipacc kernels as custom nodes in
OpenVX programs. The HipaccVX framework consists
of an OpenVX graph implementation and optimization
routines that verify and optimize input OpenVX
applications (see Section 5.2). Ultimately, it generates
device-specific code for the target platform using Hipacc’s code
generation. The tool flow is presented in Figure 1.
5.1 DSL Back End and User-Defined Kernels
OpenVX mandates the verification of parameters and
of the relationship between input and output parameters,
as presented in Listing 4. There, first, a user kernel
1 vx_node vxGaussian3x3Node(vx_graph graph,
2                           vx_image arr,
3                           vx_image out) {
4
5   // Extension: An OpenVX kernel from a Hipacc kernel
6   vx_kernel cstmk = vxHipaccKernel("gaussian3x3.cpp");
7
8   /*** The code below is the standard OpenVX API ***/
9   // Create vx_matrix for mask
10   const float coeffs[3][3] = /* ... */;
11   vx_matrix mask = vxCreateMatrix(vxGetContext((vx_reference)graph),
12                                   VX_TYPE_FLOAT32,
13                                   3, 3);
14   vxCopyMatrix(mask, (void*)coeffs, VX_WRITE_ONLY,
15                VX_MEMORY_TYPE_HOST);
16
17   // Set input/output parameters for a kernel
18   vxAddParameterToKernel(cstmk, 0, VX_OUTPUT,
19                          VX_TYPE_IMAGE,
20                          VX_PARAMETER_STATE_REQUIRED);
21   vxAddParameterToKernel(cstmk, 1, VX_INPUT,
22                          VX_TYPE_IMAGE,
23                          VX_PARAMETER_STATE_REQUIRED);
24   vxAddParameterToKernel(cstmk, 2, VX_INPUT,
25                          VX_TYPE_MATRIX,
26                          VX_PARAMETER_STATE_REQUIRED);
27   vxFinalizeKernel(cstmk);
28
29   // Create generic node
30   vx_node node = vxCreateGenericNode(graph, cstmk);
31   vxSetParameterByIndex(node, 0, (vx_reference) out);
32   vxSetParameterByIndex(node, 1, (vx_reference) arr);
33   vxSetParameterByIndex(node, 2, (vx_reference) mask);
34
35   return node;
36 }
Listing 4: DSL code interoperability extension (only
Line 6).
and all of its parameters should be defined (Lines 6
to 26). Then, a custom node should be created by
vxCreateGenericNode (Line 30) after the user kernel
is finalized by a vxFinalizeKernel call (Line 27). The
kernel parameter types are defined, and the node
parameters are set by vxAddParameterToKernel (Lines 18
to 26) and vxSetParameterByIndex (Lines 31 to 33),
respectively.
We extended OpenVX by the vxHipaccKernel function
(Line 6) to instantiate a Hipacc kernel as an OpenVX
kernel. The Hipacc kernels should be written in a separate
file and added as a generic node according to the
OpenVX standard [32]. Programmers do not have to
describe the dependency between Hipacc kernels as in
Listing 2; instead, they write a regular OpenVX program
to describe an application graph. This preserves
the custom node definition procedure of OpenVX. Ultimately,
the HipaccVX framework verifies and optimizes
a given OpenVX application, generates the corresponding
Hipacc code, and employs Hipacc for device-specific
code generation.
OpenVX’ CV functions are implemented as a li-
brary by using our extension for Hipacc code instantia-
tion. For instance, the HipaccVX implementation of the
vxGaussian3x3Node API is shown in Listing 4. Users can
simply use these CV functions as in Listing 1. A minor-
ity of OpenVX functions are implemented as OpenCV
kernels since they cannot be fully described in Hipacc.
These are listed in Table 2 with a Software label instead
of a Hipacc abstraction type. As future work, we can
extend Hipacc to support these functions.
5.1.1 Optimizations Based on Code Generation
We inherited many device-specific optimization tech-
niques by implementing a Hipacc back end for OpenVX.
Hipacc internally applies several optimizations for the
code generation from its DSL abstractions. These in-
clude memory padding, constant propagation, utiliza-
tion of textures, loop unrolling, kernel fusion, thread-
coarsening, implicit use of unified CPU/GPU memory,
and the integration with CUDA Graph [8,20,16,17]. At
the same time, Hipacc targets Intel and Xilinx FPGAs
using their High-Level Synthesis (HLS) tools. There, an
input application is implemented through application
circuits derived from the DSL abstractions and opti-
mized by hardware techniques such as pipelining and
loop coarsening [21,13,14].
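To illustrate what constant propagation and loop unrolling can mean for a local operator, the following hand-written C++ sketch applies a 3 × 3 Gaussian mask whose coefficients are compile-time constants. It is not code emitted by Hipacc; the names `kGauss3x3` and `gauss3x3` are hypothetical, and clamping at the image border is an assumption made only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 Gaussian coefficients known at compile time (they sum to 16).
constexpr int kGauss3x3[3][3] = {
    {1, 2, 1},
    {2, 4, 2},
    {1, 2, 1},
};

// Applies the mask to a row-major 8-bit image. Reads outside the image are
// clamped to the nearest valid pixel (an assumption of this sketch).
std::vector<std::uint8_t> gauss3x3(const std::vector<std::uint8_t>& in,
                                   int width, int height) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Fixed trip counts and constant coefficients allow a compiler
            // to unroll both loops and fold the multiplications.
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), height - 1);
                    const int xx = std::min(std::max(x + dx, 0), width - 1);
                    sum += kGauss3x3[dy + 1][dx + 1] * in[yy * width + xx];
                }
            }
            out[y * width + x] = static_cast<std::uint8_t>(sum / 16);
        }
    }
    return out;
}
```

Because the mask is a constant rather than a run-time argument, the nine multiplications can be specialized at compile time, which is the effect a code generator aims for when the mask is known from the DSL description.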
5.2 OpenVX Graph and System-Level Optimizations
As mentioned before, an OpenVX application is repre-
sented by a DAG Gapp = (V,E), where V is a set of
vertices, and E is a set of edges E ⊆ V × V denoting
data dependencies between nodes. The set of vertices
V can further be divided into two disjoint sets D and
N (V = D ∪N , D ∩N = ∅) denoting data objects and
CV functions, respectively.
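The graph model above can be sketched with a plain adjacency list. The types and functions below (`Kind`, `Graph`, `is_bipartite`, `is_acyclic`) are illustrative and are not the HipaccVX classes; they only show how the two structural properties of such a graph, namely that every edge connects a data object with a CV function and that no cycle exists, could be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Data, Node };  // D (data objects) and N (CV functions)

struct Graph {
    std::vector<Kind> kind;              // kind[v] for every vertex v in V
    std::vector<std::vector<int>> succ;  // succ[v] = { w | (v, w) in E }
};

// True if every edge connects a data object with a CV function.
bool is_bipartite(const Graph& g) {
    assert(g.kind.size() == g.succ.size());
    for (std::size_t v = 0; v < g.succ.size(); ++v)
        for (int w : g.succ[v])
            if (g.kind[v] == g.kind[w]) return false;
    return true;
}

// True if the graph has no directed cycle (Kahn's topological sort).
bool is_acyclic(const Graph& g) {
    const std::size_t n = g.succ.size();
    std::vector<int> indeg(n, 0);
    for (const auto& out : g.succ)
        for (int w : out) ++indeg[w];
    std::vector<int> ready;
    for (std::size_t v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push_back(static_cast<int>(v));
    std::size_t visited = 0;
    while (!ready.empty()) {
        const int v = ready.back();
        ready.pop_back();
        ++visited;
        for (int w : g.succ[v])
            if (--indeg[w] == 0) ready.push_back(w);
    }
    return visited == n;  // vertices on a cycle are never released
}
```

For a chain image → CV function → image, both checks succeed; adding an edge from the second image back to the CV function makes `is_acyclic` fail.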
Both data (i.e., Image, Scalar, Array) and node (i.e.,
CV functions) objects are implemented as C++ classes
that inherit the OpenVX Object class. Vertices v ∈ V of
our OpenVX graph implementation consist of OpenVX
Object pointers. The verification phase first checks that
an application graph Gapp (derived from the user code,
see, e.g., Listing 1) does not contain any cycles. Then,
it verifies that the description is a bipartite graph, i.e.,
∀(v, w) ∈ E : (v ∈ D ∧ w ∈ N) ∨ (v ∈ N ∧ w ∈ D). Finally,
the verification phase applies the following optimizations:
5.2.1 Reduction of Data Transfers
Data nodes of an application graph that are not virtual
must be accessible to the host, while the intermediate
(virtual) points of a computation should be stored in
the device memory. We distinguish these two data node
types by the set of non-virtual data nodes Dnv and the
set of virtual data nodes Dv, where D = Dnv ∪ Dv,
Dnv ∩Dv = ∅. HipaccVX keeps this information in its
graph implementation and determines the subgraphs
between non-virtual data nodes, which can be kept in
the device memory. In this way, data transfers between
host and device are avoided.
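The effect can be illustrated by counting copies. The sketch below assumes that a naive host/device scheme uploads every kernel input and downloads every kernel output, whereas a graph-aware scheme keeps virtual images on the device. All names (`Image`, `Kernel`, `Transfers`, `count_naive`, `count_graph_aware`) are hypothetical and do not reflect the HipaccVX implementation.

```cpp
#include <cassert>
#include <vector>

struct Image  { bool is_virtual; };                      // true: device-only
struct Kernel { std::vector<int> inputs; int output; };  // image indices

struct Transfers { int uploads = 0; int downloads = 0; };

// Naive scheme: every kernel launch copies its inputs to the device and
// its output back to the host.
Transfers count_naive(const std::vector<Kernel>& schedule) {
    Transfers t;
    for (const Kernel& k : schedule) {
        t.uploads += static_cast<int>(k.inputs.size());
        t.downloads += 1;
    }
    return t;
}

// Graph-aware scheme: an image is uploaded at most once, and only
// non-virtual results are copied back to the host.
Transfers count_graph_aware(const std::vector<Image>& images,
                            const std::vector<Kernel>& schedule) {
    Transfers t;
    std::vector<bool> on_device(images.size(), false);
    for (const Kernel& k : schedule) {
        for (int in : k.inputs) {
            // A virtual image must have been produced by an earlier kernel.
            assert(on_device[in] || !images[in].is_virtual);
            if (!on_device[in]) {
                on_device[in] = true;
                ++t.uploads;
            }
        }
        on_device[k.output] = true;
        if (!images[k.output].is_virtual) ++t.downloads;
    }
    return t;
}
```

For a chain of three kernels connected by two virtual images, the naive count is three uploads and three downloads, while the graph-aware count is one of each.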
5.2.2 Elimination of Dead Computations
An application graph may consist of nodes that do not
affect the results. Inefficient user code or other com-
piler transformations might cause such dead code. A
less apparent reason could be the usage of OpenVX
compound CV functions for smaller tasks. Consider
Sobel3x3 as an example, which computes two images,
one for the horizontal and one for the vertical deriva-
tive of a given image. As the OpenVX API does not
offer these algorithms separately, programmers have to
call Sobel3x3, even when they are only interested in
one of the two resulting images. Our implementation is
based on abstractions and allows a better analysis of
the computation compared to OpenVX’ CV functions,
i.e., the Sobel API is implemented by two parallel local
operators as shown in Figure 3. HipaccVX optimizes a
given application graph using the procedure described
in Algorithm 1. Conventional compilers cannot detect
this redundancy under the host/device execution
paradigm (e.g., OpenCL, CUDA), that is, when
OpenVX kernels are offloaded to an accelerator device
and the host launches the device kernels according
to the application dependencies (see Section 6.2).
Algorithm 1 assumes that the non-virtual data nodes
whose input and output degrees are zero must be the
inputs (Din) and the results (Dout) of an application,
respectively. Other non-virtual data nodes could be
input, output, or intermediate points of an application
depending on the number of connected virtual data
nodes. These sets are initialized in Line 2 and filled in
Lines 3 to 14. Then, all of the nodes in the same component
between the node vstart and the set Din are traversed via
the depth-first visit function (Line 18) and marked as
alive (Lines 15 to 20). Finally, in Line 21, a filtered view
of an application graph is created from the set of alive nodes.
The complexity of the functions transpose (Line 15)
and depth-first visit (Line 18) is O(|V| + |E|) and
O(|E|), respectively. The filter graph function (Line 21)
is only an adaptor that requires no change in the application
graph [23]. In the worst case, the graph has |V| − 2
output data nodes. That is, the complexity of Algorithm 1
becomes O(|V|² + |E|) in time and O(|V| + |E|)
in space.

Algorithm 1: Graph Analysis for Dead Computation Elimination
input : Gapp – application graph
        Dnv – set of non-virtual data nodes
output: Gfilt – optimized application graph
 1 function eliminate_dead_nodes(Gapp, Dnv)
     /* Find candidate non-virtual roots and leaves */
 2   Din ← ∅, Dout ← ∅
 3   forall v ∈ Dnv do
 4     if deg−(v) = 0 then
 5       Din ← Din ∪ {v}      // input non-virtual data nodes
 6     end
 7     else if deg+(v) = 0 then
 8       Dout ← Dout ∪ {v}    // output non-virtual data nodes
 9     end
10     else
11       Din ← Din ∪ {v}
12       Dout ← Dout ∪ {v}
13     end
14   end
     /* Mark the nodes between roots and leaves as alive */
15   Gtrans ← transpose_graph(Gapp)
16   Valive ← ∅
17   forall vstart ∈ Dout do
18     Vv ← depth-first_visit(vstart, Din, Gtrans)
19     Valive ← Valive ∪ Vv
20   end
     /* Filter, keep only the alive nodes and their edges */
21   Gfilt ← filter_graph(Gapp, KEEP_EDGES, Valive)
22   return Gfilt
23 end
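A minimal C++ sketch of Algorithm 1 is given below. It assumes an adjacency-list graph with integer vertex ids and returns the alive flags instead of a filtered graph view; `AppGraph` and `find_alive_nodes` are hypothetical names, and the traversal is written with an explicit stack rather than the graph library cited above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct AppGraph {
    std::vector<std::vector<int>> succ;  // succ[v]: successors of vertex v
    std::vector<bool> non_virtual;       // true for non-virtual data nodes
};

// Returns alive[v] == true for every vertex that contributes to a result.
std::vector<bool> find_alive_nodes(const AppGraph& g) {
    const std::size_t n = g.succ.size();
    assert(g.non_virtual.size() == n);

    // Transposed graph: pred[w] lists all v with an edge v -> w.
    std::vector<std::vector<int>> pred(n);
    for (std::size_t v = 0; v < n; ++v)
        for (int w : g.succ[v]) pred[w].push_back(static_cast<int>(v));

    // Classify the non-virtual data nodes into inputs and results.
    std::vector<bool> in_din(n, false), in_dout(n, false);
    for (std::size_t v = 0; v < n; ++v) {
        if (!g.non_virtual[v]) continue;
        if (pred[v].empty()) {
            in_din[v] = true;                 // in-degree zero: input
        } else if (g.succ[v].empty()) {
            in_dout[v] = true;                // out-degree zero: result
        } else {
            in_din[v] = true;                 // intermediate: both roles
            in_dout[v] = true;
        }
    }

    // Backward depth-first visit from every result node.
    std::vector<bool> alive(n, false);
    for (std::size_t start = 0; start < n; ++start) {
        if (!in_dout[start]) continue;
        std::vector<bool> seen(n, false);
        std::vector<int> stack(1, static_cast<int>(start));
        seen[start] = true;
        while (!stack.empty()) {
            const int v = stack.back();
            stack.pop_back();
            alive[v] = true;
            // Do not expand past input nodes other than the start itself.
            if (in_din[v] && v != static_cast<int>(start)) continue;
            for (int u : pred[v]) {
                if (!seen[u]) {
                    seen[u] = true;
                    stack.push_back(u);
                }
            }
        }
    }
    return alive;
}
```

For the Sobel example, with one input image feeding two local operators of which only the horizontal result is non-virtual, the vertical operator and its virtual output are not marked alive and would be dropped by the filter step.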
6 Evaluation and Results
We present results for a Xilinx Zynq zc706 FPGA
using Xilinx Vivado HLS 2019.1 and an Nvidia GeForce
GTX 680 with CUDA driver 10.0. We evaluate the
following applications. As image smoothers, we consider
a Gaussian blur (Gauss) and a Laplacian filter with
a 5 × 5 and a 3 × 3 local node, respectively. The filter
chain (FChain) is an image pre-processing algorithm
consisting of three convolution (local) nodes. SobelX
determines the horizontal derivative of an input image
using the OpenVX vxSobel function. The edge detector
in Figure 2 (EdgFig2) finds horizontal edges in an
input image, while Sobel computes both horizontal and
vertical edges using three CV nodes. The Unsharp filter
sharpens the edges of an input image using one Gauss
node and three point operator nodes. Both Harris and
Tomasi detect corners of a given image using 13 (4 local
+ 9 point) and 14 (4 local + 10 point) CV nodes,
respectively. These applications are representative of the
optimization techniques discussed in this paper. The
performance of a simple CV application (e.g., Gauss)
depends solely on the quality of code generation, while
graph-based optimizations can further improve the
performance of more complex applications (e.g., Tomasi).
[Figure 6: bar chart of throughput [MB/s] over the number of user-defined nodes (0, 1, 3, 5, 6, 9); series: "Support is disabled" and HipaccVX]
Fig. 6: Throughput for different versions of the same
corner detection application (consisting of 9 kernels) on
the Nvidia GTX680 (higher is better). The blue bars
denote an increasing number of CV functions imple-
mented as user-defined nodes using C++. In OpenVX,
these user-defined functions have to be executed on the
host CPU, which leads to a performance degradation;
whereas, HipaccVX accelerates all user-defined nodes
on the GPU.
Laplacian uses the OpenVX’ custom convolution API
and EdgFig2 consists of redundant kernels.
6.1 Acceleration of User-Defined Nodes
User-defined nodes can be accelerated on a target plat-
form (e.g., GPU accelerator) when they are expressed
with HipaccVX’ abstractions (see Section 5.1). A C++
implementation of these custom nodes results in exe-
cuting them on the host device. This is illustrated in
Figure 6 for a corner detection algorithm that consists of
nine kernels. The CPU codes for these custom nodes are
also acquired using Hipacc. As can be seen in Figure 6,
HipaccVX’ throughput is invariant to the number of
user-defined nodes, whereas using the OpenVX API
decreases the throughput severely since each
user-defined node has to be executed on the host
CPU.
6.2 System-Level Optimizations based on OpenVX
Graph
Reduction of Data Transfers. HipaccVX eliminates the
data transfers between the execution of subsequent
functions on a target accelerator device, as explained in
Section 5.2.1. This is disabled for naive implementations.
The improvements for the two applications are
shown in Figure 7. HipaccVX’ throughput optimizations
reach a speedup of 13.5.
[Figure 7: bar charts of normalized execution time for FChain and Harris, naive vs. optimized; (a) Xilinx Zynq FPGA, (b) Nvidia GTX 680 GPU]
Fig. 7: Normalized execution time (lower is better) for
1024 × 1024 images. HipaccVX eliminates redundant
transfers by analyzing OpenVX’ graph-based application
code.
[Figure 8: bar charts of normalized execution time for SobelX and EdgFig2, naive vs. optimized; (a) Xilinx Zynq FPGA, (b) Nvidia GTX 680 GPU]
Fig. 8: Normalized execution time (lower is better) for
1024× 1024 images.
[Figure 9: bar chart of resource usage (%) for SobelX and EdgFig2, shown for BRAM and number of slices, naive vs. optimized]
Fig. 9: Post Place and Route (PPnR) results for the
Xilinx Zynq FPGA. Elimination of dead computation
significantly reduces the area.
Elimination of Dead Computation. HipaccVX eliminates
the computations that do not affect the results of an
application (see Section 5.2.2). This is illustrated in
Figure 8. HipaccVX improves the throughput by a factor
of 2.1 on the GTX 680. On the Zynq FPGA, the throughput
improves only slightly since the applications fit into
the target device and thus run in parallel.
Yet, HipaccVX’ FPGA implementation for the same
application reduces the number of FPGA resources
(elementary programmable logic blocks called slices and
on-chip block RAMs, BRAMs for short) significantly (around
50% for SobelX) on the Zynq (see Fig. 9).
[Figure 10: bar chart of throughput [MPixel/s] for Harris, Tomasi, Gauss, Laplacian, Sobel, and Unsharp; series: VisionWorks and HipaccVX]
Fig. 10: Comparison of Nvidia VisionWorks v1.6 and
HipaccVX on the Nvidia GTX 680. Image sizes are
2048× 2048.
6.3 Evaluation of the Performance
In Figure 10, we compare HipaccVX with VisionWorks
(v1.6), Nvidia’s optimized commercial implementation
of OpenVX. HipaccVX, as well as typical library
implementations, exploits the graph-based OpenVX API
to apply system-level optimizations [19], such as
reduction of data transfers (see Section 5.2). Additionally,
HipaccVX generates code that is specific to the target
GPU architecture and applies optimizations such as
constant propagation, thread coarsening, and Multiple
Program, Multiple Data (MPMD) [8]. As shown in
Figure 10, HipaccVX can generate implementations that
provide higher throughput than VisionWorks. Here, the
speedups for applications that are composed of multiple
kernels (Harris, Tomasi, Sobel, Unsharp) are higher than
for those consisting of a single OpenVX CV function
(Gauss and Laplacian).
This performance boost is, to a large extent, due to
the locality optimization achieved by fusing consecutive
kernels at the compiler level [16]. This requires code
rewriting and a resource analysis of the target GPU
architectures.
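The locality benefit of kernel fusion can be seen in a toy example with two consecutive point operators. The sketch below is plain C++ rather than generated GPU code, the operators `scale` and `offset` are made up for illustration, and fusing local operators, which is more involved, is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical point operators.
inline int scale(int p)  { return 2 * p; }
inline int offset(int p) { return p + 1; }

// Unfused: the intermediate image is written to memory and read back.
void run_unfused(const std::vector<int>& in, std::vector<int>& out) {
    assert(out.size() == in.size());
    std::vector<int> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = scale(in[i]);
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = offset(tmp[i]);
}

// Fused: each intermediate pixel stays in a register, so the temporary
// image and one pass over memory disappear.
void run_fused(const std::vector<int>& in, std::vector<int>& out) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = offset(scale(in[i]));
}
```

Both variants compute the same result; the fused one touches each pixel in memory only once on input and once on output.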
There was no publicly available FPGA implemen-
tation of OpenVX at the time this paper was writ-
ten. Therefore, in Table 3, we compare HipaccVX with
Halide-HLS [15], which is a state-of-the-art DSL tar-
geting Xilinx FPGAs. As can be seen, HipaccVX uses
fewer resources and achieves a higher throughput for the
benchmark applications. HipaccVX transforms a given
OpenVX application into a streaming pipeline by replac-
ing virtual images with FIFO semantics. Thereby, it uses
an internal representation in Static Single Assignment
(SSA) form. Furthermore, it replicates the innermost
kernel to achieve higher parallelism for a given factor
v. For practical purposes, we present results only for
Table 3: PPnR results for the Xilinx Zynq for images of
1020 × 1020 and Ttarget = 5 ns (corresponds to ftarget
= 200 MHz).
App     v  Framework   BRAM  SLICE  DSP  Latency [cyc.]
Gauss   1  HipaccVX       8    473   16         1044500
Gauss   1  Halide-HLS     8   1823   50         1052673
Gauss   4  HipaccVX      16   1519   64          261649
Gauss   4  Halide-HLS    16   4112  180          266241
Harris  1  HipaccVX      20   1457   34         1042466
Harris  1  Halide-HLS    16   2688   35         1052673
Harris  2  HipaccVX      20   2326   68          521756
Harris  2  Halide-HLS    16   4011   70          528385
[Figure 11: bar chart of throughput [MPixel/s] for Gauss, Laplacian, Sobel, Unsharp, Tomasi, and Harris; series: Intel i7, Nvidia GTX680, Xilinx Zynq]
Fig. 11: Comparison of throughput for the Nvidia
GTX680, Xilinx Zynq, and Intel i7-4790 CPU. The same
OpenVX application code is used to generate different
accelerator implementations. The HipaccVX framework
allows for both code and performance portability by gen-
erating optimized implementations for a diverse range
of accelerators.
Xilinx technology. Prior work [13,21] shows that Hipacc
can achieve a performance similar to handwritten exam-
ples provided by Intel for image processing. This also
indicates that the memory abstractions given in Table 2
are suitable to generate optimized code for HLS tools.
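As a rough aid for reading Table 3, a latency in cycles can be converted into a throughput estimate. The helper below is a sketch under two assumptions that the table does not state explicitly: the design runs at the target clock of 200 MHz, and the reported latency covers exactly one image. The function name `throughput_mpixel_per_s` is hypothetical.

```cpp
#include <cassert>

// Estimated throughput in MPixel/s for one image of width x height pixels
// that takes latency_cycles clock cycles at clock_mhz MHz.
double throughput_mpixel_per_s(long width, long height,
                               long latency_cycles, double clock_mhz) {
    assert(width > 0 && height > 0 && latency_cycles > 0);
    const double pixels =
        static_cast<double>(width) * static_cast<double>(height);
    // Pixels per cycle times cycles per microsecond gives pixels per
    // microsecond, which equals MPixel/s.
    return pixels / static_cast<double>(latency_cycles) * clock_mhz;
}
```

For example, `throughput_mpixel_per_s(1020, 1020, 1044500, 200.0)` estimates the rate of the Gauss design with v = 1; a replication factor of v = 4 cuts the latency to roughly a quarter and raises the estimate accordingly.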
Figure 11 compares the throughputs that were achieved
from the same OpenVX application code for different
accelerators. Here, we generated OpenCL, CUDA, and
Vivado HLS (C++) code to implement a given applica-
tion on an Intel i7-4790 CPU, an Nvidia GTX680 GPU,
and a Xilinx Zynq FPGA, respectively. GPUs and FP-
GAs can exploit data-level parallelism by processing
a significantly higher number of operations in parallel
compared to CPUs. This makes them very suitable for
computer vision applications. Modern GPUs operate
at a higher clock frequency than existing FPGAs;
therefore, they can provide higher throughput
for abundantly parallel applications. This is the
case for Gauss and Unsharp. FPGAs, in contrast, can
exploit temporal locality by using pipelining and eliminate
unnecessary data transfers to global memory between
consecutive kernels. Therefore, all the FPGA
implementations in Figure 11 achieve a similar throughput.
7 Conclusion
In this paper, we presented a set of computational ab-
stractions that are used for expressing OpenVX’ CV
functions as well as user-defined kernels. This enables
the execution of user nodes on a target accelerator sim-
ilar to the CV functions and additional optimizations
that improve the performance. We presented HipaccVX,
an implementation for OpenVX using the proposed ab-
stractions to generate code for GPUs, CPUs, and FP-
GAs.
References
1. Ashbaugh, B., et al.: OpenCL interoperability with
OpenVX graphs. In: Proc. of the 5th Intern. Workshop
on OpenCL, p. 26. ACM (2017)
2. BCC Research: Global markets for machine vision tech-
nologies. Tech. rep. (2018)
3. Chugh, N., et al.: A DSL compiler for accelerating image
processing pipelines on FPGAs. In: Proc. of the Intern.
Conf. on Parallel Architecture and Compilation Techniques
(PACT), pp. 327–338. IEEE (2016)
4. Du, P., et al.: From CUDA to OpenCL: Towards a
performance-portable solution for multi-platform GPU
programming. Parallel Computing 38(8), 391–407 (2012)
5. Elliott, G.A., et al.: Supporting real-time computer vision
workloads using OpenVX on multicore+GPU platforms.
In: Proc. of the Real-Time Systems Symp. (RTSS), pp.
273–284. IEEE (2015)
6. Hegarty, J., et al.: Rigel: Flexible multi-rate image pro-
cessing hardware. ACM Trans. on Graphics (TOG) 35(4),
85:1–85:11 (2016)
7. Intel: Intel’s OpenVX developer guide
8. Membarth, R., et al.: Hipacc: A domain-specific language
and compiler for image processing. Trans. on Parallel and
Distributed Systems (TPDS) 27(1), 210–224 (2016)
9. Mori, J.Y., et al.: A design methodology for the next gen-
eration real-time vision processors. In: Proc. of the Intern.
Symp. on Applied Reconfigurable Computing (ARC), pp.
14–25. Springer (2016)
10. Mullapudi, R.T., et al.: Polymage: Automatic optimization
for image processing pipelines. In: Proc. of the Intern.
Conf. on Architectural Support for Programming Languages
and Operating Systems (ASPLOS), pp. 429–443.
ACM (2015)
11. Omidian, H., et al.: An accelerated OpenVX overlay for
pure software programmers. In: Proc. of the Intern. Conf.
on Field Programmable Technology (FPT) (2018)
12. Omidian, H., et al.: JANUS: A compilation system for balancing
parallelism and performance in OpenVX. Journal
of Physics: Conf. Series 1004(1), 012011 (2018)
13. Özkan, M.A., et al.: FPGA-based accelerator design from
a domain-specific language. In: Proc. of the 26th Intern.
Conf. on Field-Programmable Logic and Applications
(FPL). IEEE
14. Özkan, M.A., et al.: Hardware design and analysis of
efficient loop coarsening and border handling for image
processing. In: Proc. of the Intern. Conf. on
Application-specific Systems, Architectures and Processors (ASAP).
IEEE (2017)
15. Pu, J., et al.: Programming heterogeneous systems from
an image processing DSL. ACM Trans. on Architecture
and Code Optimization (TACO) 14(3), 26:1–26:25 (2017)
16. Qiao, B., et al.: From loop fusion to kernel fusion: A
domain-specific approach to locality optimization. In:
Proc. of the Intern. Symp. on Code Generation and Opti-
mization (CGO) (2019)
17. Qiao, B., et al.: The best of both worlds: Combining
CUDA graph with an image processing DSL. In: Proc. of
the 57th Annual Design Automation Conf. (DAC) (2020)
18. Ragan-Kelley, J., et al.: Halide: A language and compiler
for optimizing parallelism, locality, and recomputation in
image processing pipelines. In: Proc. of the Conf. on
Programming Language Design and Implementation (PLDI),
pp. 519–530. ACM (2013)
19. Rainey, E., et al.: Addressing system-level optimization
with OpenVX graphs. In: Proc. of the Conf. on Computer
Vision and Pattern Recognition Workshops, pp. 644–649.
IEEE (2014)
20. Reiche, O., et al.: Auto-vectorization for image processing
DSLs. In: ACM SIGPLAN Notices, vol. 52, pp. 21–30.
ACM (2017)
21. Reiche, O., et al.: Generating FPGA-based image process-
ing accelerators with Hipacc. In: Proc. of the Intern. Conf.
on Computer Aided Design (ICCAD), pp. 1026–1033.
IEEE (2017)
22. Sérot, J., et al.: CAPH: a language for implementing
stream-processing applications on FPGAs. In: Embed-
ded Systems Design with FPGAs, pp. 201–224. Springer
(2013)
23. Siek, J., et al.: The boost graph library: User guide and
reference manual. Addison-Wesley (2002)
24. Steuwer, M., et al.: Generating performance portable code
using rewrite rules: From high-level functional expressions
to high-performance OpenCL code. ACM SIGPLAN
Notices 50(9), 205–217 (2015)
25. Stewart, R., et al.: A dataflow IR for memory efficient
RIPL compilation to FPGAs. In: Proc. of the Intern. Conf.
on Algorithms and Architectures for Parallel Processing
(ICA3PP), pp. 174–188. Springer
26. Tagliavini, G., et al.: Enabling OpenVX support in mW-
scale parallel accelerators. In: Proc. of the Intern. Conf.
on Compilers, Architectures and Synthesis for Embedded
Systems (CASES), pp. 1–10. IEEE (2016)
27. Tagliavini, G., et al.: Optimizing memory bandwidth ex-
ploitation for OpenVX applications on embedded many-
core accelerators. Journal of Real-Time Image Processing
15(1), 73–92 (2018)
28. Taheri, S., et al.: Acceleration framework for FPGA im-
plementation of OpenVX graph pipelines. In: Proc. of the
Intern. Symp. on Field-Programmable Custom Comput-
ing Machines (FCCM), pp. 227–227. IEEE (2018)
29. The Khronos Group: Khronos finalizes and releases
OpenVX 1.0 specification for computer vision acceleration.
Press Release (2014)
30. The Khronos Group: OpenVX resources (2018)
31. The Khronos Vision Working Group and others: The
OpenVX specification v1.2.1 (2018)
32. The Khronos Vision Working Group and others: The
OpenVX specification v1.3 (2019)
33. Yang, M., et al.: Making OpenVX really “real-time”. In:
Proc. of the Real-Time Systems Symp. (RTSS) (2018)
34. Zhang, J., et al.: DS-DSE: Domain-specific design space
exploration for streaming applications. In: Proc. of the
Conf. on Design, Automation and Test in Europe (DATE),
pp. 165–170. IEEE (2018)
