Module-per-Object: a Human-Driven Methodology for C++-based High-Level
  Synthesis Design by da Silva, Jeferson Santiago et al.
arXiv:1903.06693v2 [cs.DC] 9 Apr 2019
Module-per-Object: a Human-Driven Methodology
for C++-based High-Level Synthesis Design
Jeferson Santiago da Silva, François-Raymond Boyer and J.M. Pierre Langlois
Polytechnique Montréal, Canada
{jeferson.silva, francois-r.boyer, pierre.langlois}@polymtl.ca
Abstract—High-Level Synthesis (HLS) brings FPGAs to audiences
previously unfamiliar with hardware design. However,
achieving the highest Quality-of-Results (QoR) with HLS is
still unattainable for most programmers, because it requires detailed
knowledge of FPGA architecture and hardware design in order
to produce FPGA-friendly code. Moreover, such code
normally conflicts with best coding practices, which favor code
reuse, modularity, and conciseness.
To overcome these limitations, we propose Module-per-Object
(MpO), a human-driven HLS design methodology intended
for both hardware designers and software developers with
limited FPGA expertise. MpO exploits modern C++ to raise
the abstraction level while improving QoR, code readability
and modularity. To guide HLS designers, we present the five
characteristics of MpO classes. Each characteristic exploits the
power of HLS-supported modern C++ features to build C++-
based hardware modules. These characteristics lead to high-
quality software descriptions and efficient hardware generation.
We also present a use case of MpO, where we use C++ as
the intermediate language for FPGA-targeted code generation
from P4, a packet processing domain specific language. The
MpO methodology is evaluated using three design experiments:
a packet parser, a flow-based traffic manager, and a digital up-
converter. Based on experiments, we show that MpO can be
comparable to hand-written VHDL code while keeping a high
abstraction level, human-readable coding style and modularity.
Compared to traditional C-based HLS design, MpO leads to more
efficient circuit generation, both in terms of performance and
resource utilization. Also, the MpO approach notably improves
software quality, augmenting parameterization while eliminating
the incidence of code duplication.
I. INTRODUCTION
High-level synthesis (HLS) has opened doors to an audience
unfamiliar with FPGA hardware design methodology. Indeed,
HLS tools can convert high-level and untimed C-based code
into a synthesizable register-transfer level (RTL) description,
a task that once had to be manually done by hardware (HW)
designers. The RTL design flow is known to be much slower
than its counterparts in software (SW) [1], since it requires
a detailed description of the desired micro-architecture, in-
cluding synchronization schemes, pipelining, and parallelism.
HLS tools, on the other hand, abstract away these micro-
architecture aspects allowing a faster design space exploration
(DSE) through a SW development flow.
This work was supported by the Brazilian National Council for Scientific
and Technological Development - CNPq.
Fig. 1. HLS design approaches: (a) traditional unidimensional HLS design view, a single axis running from high HW quality (high QoR, low code modularity, low code readability) to high SW quality (low QoR, high code modularity, high code readability); (b) proposed bi-dimensional HLS design view, with HW quality (QoR, low to high) and SW quality (code modularity and code readability, low to high) as separate axes locating the traditional approaches and the proposed approach (MpO).
However, achieving good Quality-of-Results (QoR) in HLS
environments is sometimes unintuitive and, in some cases, not
straightforward at all. In the HW design context, the ratio
between performance and design cost normally defines the
QoR standard for a given circuit. In FPGA design, high per-
formance is normally associated with throughput and latency,
while design cost refers to circuit area, energy consumption,
and development time.
Efforts have been made to improve QoR in HLS through
source-to-source transformations and code restructuring [1],
[2]. While improving QoR, such approaches lower abstraction
and make code maintenance and reuse more difficult. The
latter two aspects are well-known problems in HLS design
and they have been the subject of research as well [3], [4].
Satisfactory HW QoR with HLS-based design and good SW
engineering practices are often seen as incompatible [5], [6].
Indeed, the majority of HLS users are HW developers who
translate RTL code into sometimes awkward HW-oriented
C-based descriptions. They attempt to reproduce RTL-level
microarchitectural expressiveness while still accelerating the
FPGA design cycle through an HLS design flow. Such HW-oriented
C descriptions lead to incomprehensible code that is
difficult for other designers to reuse.
Although existing HLS approaches can sometimes deliver
good code readability and modularity, and still produce good
results, this is most often not the case. Normally, HLS development
trades off HW QoR against SW quality, following a sort
of unidimensional view, as illustrated in Fig. 1a. However,
a bi-dimensional HLS approach is required. Indeed, a bi-
dimensional perspective highlights independence between HW
QoR and SW quality. Fig. 1b shows the design space of
this novel bi-dimensional HLS view. In fact, in the course of
this work, we show that using our approach, it is possible
to increase HW QoR and SW quality simultaneously by
employing modern and high-quality C++ constructs, which
lead to cleaner code and reduce duplication.
In this context, we present design guidelines for C++-
based HLS design targeting both HW and SW designers.
We present several C++ high-level constructs and, whenever
possible, we show their correspondence in the generated HW.
The HLS methodology we propose is called Module-per-
Object (MpO). It is meant to be human-driven and used by
ordinary programmers with limited HW expertise, not only
by FPGA experts. We aim to close the gap between QoR and
code modularity and readability. We use the results obtained
by traditional HLS design as the reference for HW QoR. We focus
on code modularity and readability as SW quality metrics.
Code modularity is evaluated by how readily a given module
can be reused, while code readability is related to the
expressiveness and conciseness of the code.
As a final goal, we intend to widen FPGA usage by SW
programmers by raising the FPGA development abstraction.
Indeed, higher design abstractions allow programmers to use
a single version of their code to run on an x86 CPU or be
synthesized for an FPGA device [7]. To do so, we propose
to exploit high-level modern constructs and the Standard
Template Library (STL). Such constructs are well known by
SW developers to improve code readability [6]. We target QoR
and code readability and modularity by extensively employing
templated classes and structures that can tune the C++ objects
according to design needs. In addition, we discuss the possi-
bility of adopting templated C++ classes as an intermediate
language to be used alongside a Domain Specific Language
(DSL). The main contributions of this work are as follows:
• A methodology called Module-per-Object, a design pat-
tern for HLS design that simultaneously achieves high
modularity, readability, and QoR (§ III);
• The extensive use of synthesizable templated C++ data
structures and constructs to improve QoR and modularity
with HLS (§ III);
• A case-study on using C++ as an intermediate language
for automatic code generation of a packet parser written
in the P4 language targeting FPGAs (§ IV-A);
• An identification, based on three specific use cases, of
HLS tool deficiencies that prevent exploiting the full
capabilities of high-level constructs, together with
guidance for HLS designers and hints for future HLS
tool releases (§ IV-B); and
• An evaluation of the benefits brought by the MpO approach
on three design examples: a packet parser, a flow-based
traffic manager, and a digital up-converter (§ V).
II. RELATED WORK
A. QoR Improvements in HLS-based Design
Liang et al. [8] conducted a study on how to restructure C
codes in order to improve QoR with HLS for several different
benchmarks. Their results showed up to 126× performance
improvement over a pure software implementation. These results
were obtained after various rounds of code refactoring and
#pragma insertions, a process that requires extensive HW expertise.
In addition, compared to hand-crafted RTL design, their
results are up to 20× worse. Also, the authors affirm that, in
some cases, improving QoR conflicts with good SW engineer-
ing practices. Matai et al. [1] presented a methodology for
code restructuring with HLS targeting FPGA devices. How-
ever, the transformed codes are unintuitive and not portable.
Similar research was conducted by Homsirikamol and Gaj [9]
and Liu et al. [10]. Zhou et al. have presented Rosetta [11].
Rosetta is a benchmark suite for HLS-driven FPGA design.
The benchmarks have been meticulously coded and tuned
for state-of-the-art HLS tools. While such practices improve
performance and reduce FPGA area, in most cases, the source
code is unreadable for a non-FPGA expert.
Source-to-source transformations have been explored by
Winterstein et al. [2], [12]. The authors have proposed a frame-
work that performs source-to-source transformations on the
original C code in order to ensure synthesizability. The authors
claim that the produced code is human-readable. Automated
source-to-source transformations can result in descriptions that
might not exactly match the original code. Gao et al. [13] and
Cong et al. [14] have done similar research.
B. Raising the Abstraction Level in HLS
Cong et al. [7] have conducted a thorough study on HLS
methods and tools. They have also evaluated the performance
of the former AutoESL's HLS tool. The authors have
presented a design methodology for HLS-driven FPGA design,
which includes code reuse practices through C++ templates.
Muck and Frohlich [3], [4] have exploited advanced and
metaprogrammed C++ constructs to create compatible codes
for both CPUs and FPGA devices. The authors present guide-
lines for FPGA-friendly pointer handling and static polymor-
phism implementation [15]. According to the authors, the
resulting overhead in having reusable and modular unified C++
codes is worthwhile. The area and performance overhead are
up to 30% and 50%, respectively, compared to HW-oriented
C++ design. Our work leverages their ideas by employing sev-
eral other C++11 constructs and by comparing the achievable
results with RTL implementations.
Thomas [16] has presented a DSL library targeting recursion
with C++ HLS tools described using C++11 constructs. The
author has shown how compile-time metaprogramming and
lambda expressions can leverage HLS-driven HW design. In-
deed, in our work, we have confirmed that such constructs can
be used by HLS designers, eventually leading to higher QoR,
while raising the abstraction. Similar research was conducted
by Richmond et al. [17]. Recently, Eran et al. [18] have
proposed HLS-friendly design patterns for packet processing
exploiting the capabilities of modern (post-C++11) C++.
Zhao and Hoe [19] have assessed HLS-based flow in structural
design. Their results for a network-on-chip implementation
are comparable with a self-generated RTL approach. The
area and performance results vary according to the network
topology, ranging from +1%∼+23% in lookup tables (LUTs),
−71%∼−54% in flip-flops (FFs), and −14%∼+24% in clock
frequency. Their approach does not explore in depth the
capabilities of C++ constructs supported by the HLS tool,
which could improve code modularity and readability.
TABLE I
SUMMARY OF C++ FEATURES USED IN THIS WORK
Constructs                        | Benefits                                              | Version
Fixed-point types                 | Fixed-point arithmetic                                | C++98, vendor dependent
(Variadic) Templates              | Parameterizable design                                | (C++11), C++98
Classes                           | OO paradigm, encapsulation, inheritance, polymorphism | C++98
Template metaprogramming          | Compile-time calculation, performance improvement     | C++98
STL                               | Modularity, code reuse, standardization               | > C++98, in constant evolution
Data containers                   | Data storage and encapsulation                        | > C++98, in constant evolution
Algorithms                        | Standardization, code reuse                           | > C++98, in constant evolution
Iterators, range-based for loops  | Syntax sugaring, easier container iteration           | C++11
Lambda expressions                | Function pointer properties                           | C++11
constexpr variables and functions | Compile-time calculation, performance improvement     | C++11
auto, decltype                    | Automatic type inference                              | C++11
Oezkan et al. [20] have also exploited templated C++
classes to build an image processing library targeting FPGA
devices. The authors make extensive use of templates to
generate highly parameterizable C++ classes. One of their final
remarks is that the more the code is written in a “hardware
design manner”, the better its synthesis is. This “hardware
manner” coding style lowers the abstraction, which could be
alleviated by exploiting the potential of the available high-level
constructs of the STL, thus improving code readability,
avoiding code duplication, and easing code maintenance.
C. HLLs as Intermediate Representation in FPGA Design
Other researchers have pointed to the use of DSLs for FPGA
design [21]. Although increasing the development abstraction,
such languages need to be converted into synthesizable RTL
code, a process similar to what is done by HLS tools. Examples
of such DSLs can be found in the most varied domains,
ranging from signal/image processing to network applications.
In the network domain, several works have used HLLs,
such as P4 [22], for FPGA implementation. P4FPGA [23]
is a framework for fast prototyping of network functions described
in P4. P4FPGA uses BlueSpec Verilog as its intermediate
representation, which requires a proprietary compiler to
generate synthesizable RTL. The approach proposed by Khan
[24] uses off-the-shelf HLS tools; however, it is difficult to
evaluate the real impact of this work due to the lack of details
provided. While Emu [25] is not used alongside a higher level
network DSL, it could have been, since it comprises a set of
standard network libraries written in C# in an object-oriented
fashion that are compiled to Verilog using Kiwi [26]. This
approach is similar to what Silva et al. [27] have done for a
P4-compatible packet parser.
III. MPO HLS METHODOLOGY
A. Overview of the MpO Methodology
We propose the Module-per-Object (MpO) HLS methodol-
ogy, in which we define the concept of “module” as a C++
object that logically represents a self-contained functional unit.
To do so, this work exploits high-level constructs available
in C++11, and that are supported by Xilinx Vivado HLS, to
improve QoR while keeping a very high level of abstraction.
Inspired by Cong et al. [7], Table I summarizes the synthesiz-
able C++ constructs used in this work.
To increase code modularity and readability, our approach
uses the concept of an MpO base class, which abstracts
common functionalities between different modules. Consequently,
this approach allows the same source code to be reused
to describe functional modules with similar behavior. The
five characteristics of an MpO class are: 1) Templates: class
parameterization and code modularity (§ III-C); 2) Systematic
utilization of const and constexpr variables for static
objects (§ III-D); 3) STL constructs: zero-overhead abstraction,
code reuse and modularization (§ III-E); 4) Inheritance
and static polymorphism (when appropriate): code reuse and
modularization (§ III-F); and, 5) Smart constructors: constant
class member initialization (§ III-G).
The main idea is to write generic code that is specialized at
compile time. Generic code, exploiting templates (1), STL
constructs (3), and inheritance and static polymorphism (4),
allows writing more compact and reusable code, reducing code
duplication. Specialized objects also help reduce resource
usage by allowing specific pieces of hardware to be precisely
inferred. Indeed, const and constexpr variables (2) give
hints to the compiler to perform constant propagation that can
be used in conjunction with smart constructors (5) for class
member initialization.
B. Illustrative Use Case: a Packet Parser
We demonstrate the viability of the proposed methodology
with the design of a packet parser as a use case. A packet
parser determines the set of valid protocols supported by a
network device and extracts the required header fields that are
to be matched in the packet processing pipeline.
A packet parser can be modeled at a high level with a
directed acyclic graph (DAG), where nodes represent protocols
and edges are protocol transitions [28]. A parser is implemented
as an abstract state machine (ASM), performing state
transition evaluations at each parser state. States belonging to
the path connecting the first state to the last state in the ASM
compose the set of supported protocols of a network equipment.
A packet-processing language, such as P4 [22], can be
used to describe such an ASM. Details on the implementation
of a packet parser in FPGA can be found in [27]. Fig. 2a
illustrates a parser graph for a layer-4 network device while
Fig. 2b shows its possible hardware realization.
Fig. 2. Representation of a packet parser: (a) example of a parser graph, with nodes ETH, IPv4, IPv6, UDP, TCP, and END; (b) pipeline organization for the example packet parser, from Data In to Data Out, with pipeline registers after the ETH, IPv4/IPv6, and UDP/TCP stages.
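As an illustration only, the parser graph of Fig. 2a can be written down as a small table of edges. The sketch below is not the representation used in this work: the names `Proto`, `Edge`, `kEdges`, and `next_protocol` are hypothetical, and the key values are the standard EtherType and IP protocol numbers for the protocols in the figure.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical node identifiers for the graph of Fig. 2a.
enum class Proto : std::uint8_t { ETH, IPv4, IPv6, UDP, TCP, END };

// One edge of the DAG: in state `from`, a key equal to `key` leads to `to`.
struct Edge {
    Proto from;
    std::uint16_t key;
    Proto to;
};

// Edges of the example graph. 0x0800 and 0x86DD are the EtherTypes of
// IPv4 and IPv6; 17 and 6 are the IP protocol numbers of UDP and TCP.
constexpr std::array<Edge, 6> kEdges = {{
    {Proto::ETH,  0x0800, Proto::IPv4},
    {Proto::ETH,  0x86DD, Proto::IPv6},
    {Proto::IPv4, 17,     Proto::UDP},
    {Proto::IPv4, 6,      Proto::TCP},
    {Proto::IPv6, 17,     Proto::UDP},
    {Proto::IPv6, 6,      Proto::TCP},
}};

// Returns the next protocol, or END when no edge matches.
inline Proto next_protocol(Proto current, std::uint16_t key) {
    assert(current != Proto::END && "END has no outgoing edges");
    for (const auto& e : kEdges)
        if (e.from == current && e.key == key)
            return e.to;
    return Proto::END;
}
```

Walking the table from ETH with the keys found in a packet yields one path of the DAG, for instance ETH, IPv4, UDP, END.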
C. Specializing Classes with Templates
Templates are fundamental to correctly parameterize an
MpO base class. Indeed, class templates allow generic code
to be fine-tuned for different design instances, favoring code
reuse, reducing duplication while generating results compara-
ble to hand-tuned codes.
Referring to Fig. 2a, the nodes of the parser graph share
common properties and may share the same code, which makes
them a good starting point for an MpO base class. Listing 1 presents
an example of an MpO base class that describes a node of the
parser. For simplicity, only relevant code fragments are shown,
so the listing cannot be compiled as is.
The class presented in Listing 1 is parameterized with four
template parameters (line 1). The first two parameters, omitted
in the listing, are integers and they are used to configure the
arbitrary-sized integers. T_HeaderLayout is a struct type
derived from a template. This type is used to declare the
class member HeaderLayout on line 5, which represents
the expected header layout to be processed. The last template
parameter, T_DHeader, is also a type. However, this type is
used to allow static polymorphism of methods of the Header
class; therefore, it represents a type that is derived from the
Header class itself [15].
Consequently, with the extensive use of templates, an MpO
base class provides a high degree of configurability to MpO
class objects. Thus, MpO base classes contribute to more
reusable and compact code. The graph described in Fig. 2a
is an example where each node is a different C++ object, sharing
the same source code, described in Listing 1, using different
template parameters.
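A minimal sketch of this kind of parameterization, under stated assumptions, is given next. It is not the Header class of Listing 1: `KeyExtractor` and `Layout` are hypothetical names, and `std::uint64_t` stands in for a vendor arbitrary-precision integer so that the code compiles with a standard C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical header layout descriptor: where the key sits and how wide it is.
struct Layout {
    std::size_t key_offset_bits;  // distance of the key from the LSB
    std::size_t key_size_bits;    // key width, assumed to be 1..63 here
};

// Generic extractor specialized by bus width and by layout type.
// std::uint64_t stands in for a vendor arbitrary-precision integer.
template <std::size_t N_BusSize, class T_Layout>
class KeyExtractor {
    static_assert(N_BusSize <= 64, "this sketch models buses up to 64 bits");
    const T_Layout layout_;
public:
    explicit KeyExtractor(const T_Layout& layout) : layout_(layout) {
        assert(layout.key_size_bits > 0 && layout.key_size_bits < 64);
        assert(layout.key_offset_bits + layout.key_size_bits <= N_BusSize);
    }
    // Shift the bus word right and mask out the key field.
    std::uint64_t extract(std::uint64_t bus_word) const {
        const std::uint64_t mask =
            (std::uint64_t{1} << layout_.key_size_bits) - 1;
        return (bus_word >> layout_.key_offset_bits) & mask;
    }
};
```

Two objects such as `KeyExtractor<64, Layout>` and `KeyExtractor<32, Layout>` share one source description but are distinct types, which is the property the methodology relies on to derive one module per object.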
D. Specializing Operands with constexpr
Listing 1: The Header MpO C++ base class.
 1 template<· · · , class T_HeaderLayout, class T_DHeader>
 2 class Header {
 3 protected:
 4   typedef ap_uint<numbits(B2b(N_Size))> RXBitsType;
 5   const T_HeaderLayout HeaderLayout;
 6   const ShiftType stateTransShiftVal;
 7   const array<bool, ARR_SIZE> HeaderBusCompVal;
 8   RXBitsType rxBits;
 9 public:
10   template<typename T, typename F>
11   const T init_array(const F& func) const {
12     typename remove_cv<T>::type arr {};
13     for (auto i = 0; i < arr.size(); ++i)
14       arr[i] = func(i);
15     return arr;
16   }
17   Header (const headerIDType instance_id, const T_HeaderLayout& HLayout) :
18     · · ·
19     HeaderLayout(HLayout),
20     stateTransShiftVal{shift_def(B2b(N_Size), N_BusSize,
21       (HLayout.KeyLocation.first + HLayout.KeyLocation.second))
22     },
23     HeaderBusCompVal(
24       init_array<decltype(HeaderBusCompVal)>(
25         [HLayout](size_t i) {
26           return (HLayout.ArrLenLookup[i] >> numbits(N_BusSize)) > 0;
27         }
28       )
29     ),
30   { · · · } // end of constructor
31   void StateTransition(const PktDataType& PktIn);
32   void PipelineAdjust(· · · );
33   void HeaderAnalysis(const PktDataType& PktIn, PHVDataType& PHV,PktDataType& PktOut);
34 };
In MpO, we use constexpr functions and variables to
set accurate bus sizes in a generic fashion, which leads
to faster and more compact circuits while keeping the C++
descriptions configurable yet synthesizable. Also, constexpr
functions are more comprehensible than their
equivalents in older C++ versions. Indeed, they allow
template specialization and alleviate a task that before C++11
was only possible through template metaprogramming and
partial template specialization.
The type RXBitsType in Listing 1 line 4 is such an example.
The functions numbits(), B2b(), and shift_def()
in Listing 1 are examples of constexpr functions. In [29],
we present the implementation of the numbits() function
along with its verbose equivalent described in C++03. This function
returns the number of bits needed to represent an arbitrary-sized integer.
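To make the contrast concrete, here is a hedged sketch of a bit-width helper in both styles. It is not the numbits() implementation from [29]; the names `bits_needed` and `BitsNeeded` are hypothetical, and the behavior for zero (one bit) is an assumption of this sketch.

```cpp
#include <cstdint>

// C++11: number of bits needed to represent the value n (at least 1 bit).
// A single return statement keeps the function valid as C++11 constexpr.
constexpr unsigned bits_needed(std::uint64_t n) {
    return (n < 2) ? 1u : 1u + bits_needed(n >> 1);
}

// C++03-style equivalent: the recursion is expressed through template
// instantiation and terminated by explicit specializations.
template <unsigned long long N>
struct BitsNeeded {
    static const unsigned value = 1 + BitsNeeded<(N >> 1)>::value;
};
template <> struct BitsNeeded<1> { static const unsigned value = 1; };
template <> struct BitsNeeded<0> { static const unsigned value = 1; };

// Both forms are evaluated by the compiler, so they can size a type
// or an array without any run-time cost.
static_assert(bits_needed(255) == 8, "255 fits in 8 bits");
static_assert(bits_needed(256) == BitsNeeded<256>::value, "forms agree");
```

The constexpr form reads like an ordinary function and accepts any constant expression as its argument, whereas the template form requires one specialization per termination case.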
One can benefit from compilers’ ability to propagate constants
by using constexpr functions to initialize class
members in constructors. An example is the protected member
stateTransShiftVal of the Header class in Listing 1
line 20, whose value is resolved at compile time when the
class constructor is called (line 17), becoming a hardwired
value in the HW implementation.
Listing 2: The StateTransition() method
 1 template<· · · > void Header<· · · >::StateTransition(const PktDataType& PktIn){
 2   typedef decltype(HeaderLayout.Key.front().KeyVal) KeyType;
 3   const KeyType DataInMask = createMask(HeaderLayout.KeyLocation.second);
 4   KeyType packetKeyVal = (PktIn.Data >> stateTransShiftVal) & DataInMask;
 5   if (!NextHeaderValid && (rxBits > HeaderLayout.KeyLocation.first))
 6     for (auto key : HeaderLayout.Key)
 7       if (key.KeyVal == (packetKeyVal&key.KeyMask)) {
 8         NextHeader = key.NextHeader;
 9         NextHeaderValid = true;
10       }
11 }
E. Exploiting STL Constructs
STL constructs raise the development abstraction and ease
code readability and maintenance, characteristics favored by
the MpO methodology.
Listing 2 shows how such constructs can be used to describe
a possible implementation of a state evaluation function in
a parser ASM. Its goal is to search the incoming packet
stream to determine if there is a valid protocol transition
for a given ASM state. To do so, the members Key and
KeyLocation of the HeaderLayout struct are used.
Key is an STL array container, composed of another data
structure that holds information regarding the value to be
matched and which is the next header transition in case of
a match. KeyLocation is an STL pair type, where the
first member is the key location in the incoming data stream
and the second member is the key size in bits.
An array<Type, Size> Array container is a
fixed-sized array similar to the array declaration Type
Array[Size] in ISO C. However, since it belongs to
the STL, it includes some useful built-in methods, such
as size() and front(). These method calls can be
resolved at compile time, and therefore they can be used to
parameterize types and to set fixed loop bounds. One example
of such utilization is shown in Listing 2 in the KeyType
type definition on line 2. To define this new type, we use
the decltype keyword. Again, one constexpr function
is used, createMask(), to allow constant propagation on
variable DataInMask.
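The compile-time aspects of this paragraph can be sketched in standard C++11 as follows. The types `KeyEntry` and `Layout` and the helper `make_mask` are hypothetical stand-ins, not the definitions behind Listing 2, and `std::tuple_size` is used here because it reports the same value as `size()` as a constant expression.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical key entry: value to match, mask, and next-state identifier.
struct KeyEntry {
    std::uint16_t KeyVal;
    std::uint16_t KeyMask;
    std::uint8_t  NextHeader;
};

// Hypothetical layout holding a fixed-size container of keys.
struct Layout {
    std::array<KeyEntry, 2> Key;
};

// Hypothetical constexpr helper: mask with the `bits` low-order bits set.
constexpr std::uint64_t make_mask(unsigned bits) {
    return (bits >= 64) ? ~std::uint64_t{0}
                        : ((std::uint64_t{1} << bits) - 1);
}

// decltype recovers the key type from the container element.
using KeyType = decltype(Layout{}.Key.front().KeyVal);
static_assert(std::is_same<KeyType, std::uint16_t>::value,
              "key type follows the container element");

// The container size is a constant expression and can size other objects.
constexpr std::size_t kNumKeys =
    std::tuple_size<decltype(Layout::Key)>::value;

// A mask computed at compile time from a constant key width.
constexpr KeyType kKeyMask = static_cast<KeyType>(make_mask(16));
```

Changing the element type or the array size in `Layout` updates `KeyType`, `kNumKeys`, and any dependent declarations without further edits.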
Also, STL arrays, such as the HeaderLayout.Key,
allow the use of iterators in a range-based for loop to iterate
over the array. Such constructs lead to safer and more compact
code since it is not required to calculate the iteration indexes
or to specify loop bounds. Such an example is the for
statement shown in Listing 2 line 6. In addition, automatic type
resolution can be used with the auto keyword to determine
the type of the loop iterator, simplifying the code as well.
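A self-contained approximation of the lookup in Listing 2 is sketched below, assuming plain integer types instead of the vendor types. `KeyEntry`, `Transition`, and `match_key` are hypothetical names introduced for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical key entry, mirroring the fields used in Listing 2.
struct KeyEntry {
    std::uint16_t KeyVal;
    std::uint16_t KeyMask;
    std::uint8_t  NextHeader;
};

// Result of one lookup: whether a transition matched and where it leads.
struct Transition {
    bool valid;
    std::uint8_t next;
};

// Range-based for over a std::array: no index arithmetic and no explicit
// loop bound, since the bound is part of the container type.
template <std::size_t N>
Transition match_key(const std::array<KeyEntry, N>& keys,
                     std::uint16_t packet_key) {
    Transition t{false, 0};
    for (const auto& key : keys) {
        if (key.KeyVal == (packet_key & key.KeyMask)) {
            t.valid = true;
            t.next = key.NextHeader;
        }
    }
    return t;
}

// Usage example with two EtherType keys.
inline void match_key_example() {
    const std::array<KeyEntry, 2> keys = {{
        {0x0800, 0xFFFF, 1},
        {0x86DD, 0xFFFF, 2},
    }};
    assert(match_key(keys, 0x0800).next == 1);
    assert(!match_key(keys, 0x1234).valid);
}
```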
According to our experiments, using STL constructs did not
introduce overhead in terms of QoR. However, the increased
code readability and modularity is noticeable, especially when
dealing with data containers, such as array, by minimizing
the need for raw pointer manipulation as required in C [30].
Listing 3: Example of static polymorphism
1 template<· · · , class T_DHeaderFormat>class HeaderFormat {
2   ap_uint<HSIZE_BITS> getHeaderSize(const ap_uint<HSIZE_BITS>& expr_val) const
3   { return static_cast<T_DHeaderFormat*>(this)->getSpecHeaderSize(expr_val); }
4 };
5 template<· · · >
6 class varHeaderFormat : public HeaderFormat< · · · , const varHeaderFormat< · · · >> {
7   ap_uint<HSIZE_BITS> getSpecHeaderSize(const ap_uint<4>& ihl) const
8   { return ((0x4*ihl)*0x8); }
9 };
F. Inheritance and Static Polymorphism
The MpO methodology favors code reuse by employing
inheritance whenever possible. Inheritance greatly improves
code modularity and maintainability by reducing code repli-
cation. MpO exploits C++ to leverage the DRY (don’t repeat
yourself) design guideline. Indeed, C++ offers adequate mechanisms
that complement inheritance, such as polymorphic methods
and virtual classes.
Virtual classes are, to date, not supported by HLS vendors.
However, inheritance and static polymorphism are allowed.
For the packet parser, it is of interest to keep the same
method calls even if variable- and fixed-size headers are
processed in a different manner. To do so, static polymorphism
is a C++ mechanism that can be used with MpO.
To parse fixed-sized headers, all needed information is
known at compilation time. When processing variable-sized
headers, the header length must be retrieved from the header
information itself. To do so, the T_HeaderLayout type in
Listing 1 implements static polymorphism to retrieve both
fixed- or variable-sized header length information using the
same method call. The T_HeaderLayout definition is
shown in Listing 3.
In Listing 3, the HeaderFormat is the base struct. The
structs varHeaderFormat and fixedHeaderFormat
(not shown in the code extract) are derived from
HeaderFormat. Note that to allow static polymorphism,
we use the Curiously Recurring Template Pattern (CRTP)
technique [15] as in [3], [4], where the derived class is passed
as a template parameter to the base class (lines 5-6). By
doing so, the compiler is able to statically resolve pointer
conversions, which results in a synthesizable description. In
this example, the implementation of the getHeaderSize()
method (line 2) is done in the derived struct (line 7).
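The CRTP mechanism can be reproduced with standard types in a few lines. In the sketch below, `HeaderFormatBase`, `FixedFormat`, `VarFormat`, and `size_of` are hypothetical names, and `std::uint32_t` replaces the arbitrary-precision types of Listing 3; the variable-size formula is the one shown on line 8 of that listing.

```cpp
#include <cassert>
#include <cstdint>

// Base class: forwards the call to the derived type at compile time (CRTP).
template <class Derived>
struct HeaderFormatBase {
    std::uint32_t header_size_bits(std::uint32_t field) const {
        return static_cast<const Derived*>(this)->spec_header_size_bits(field);
    }
};

// Fixed-size header: the size is a template constant, the field is ignored.
template <std::uint32_t SizeBits>
struct FixedFormat : HeaderFormatBase<FixedFormat<SizeBits>> {
    std::uint32_t spec_header_size_bits(std::uint32_t) const { return SizeBits; }
};

// Variable-size header, IPv4 style: the IHL field counts 32-bit words.
struct VarFormat : HeaderFormatBase<VarFormat> {
    std::uint32_t spec_header_size_bits(std::uint32_t ihl) const {
        return (4u * ihl) * 8u;
    }
};

// Same call for both kinds of header, resolved without virtual dispatch.
template <class F>
std::uint32_t size_of(const HeaderFormatBase<F>& fmt, std::uint32_t field) {
    return fmt.header_size_bits(field);
}

// Usage example: a 14-byte Ethernet header and a 20-byte IPv4 header.
inline void crtp_example() {
    assert(size_of(FixedFormat<112>{}, 0) == 112);
    assert(size_of(VarFormat{}, 5) == 160);
}
```

Because the derived type is a template argument of the base, the cast and the call are resolved during compilation and no virtual table is involved.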
The base class Header from Listing 1 also supports CRTP
to implement static polymorphism. Two classes are derived
from the Header class: the FixedHeader class and the
VariableHeader class. Similarly to what is done with the
HeaderFormat from Listing 3, the classes derived from the
Header class have their own implementation for the method
PipelineAdjust() (Listing 1 line 32). This method is
responsible for keeping the output data bus aligned for the next
processed header. To process fixed-sized headers, fixed bit-
shift operations suffice for this alignment while barrel-shifters
are required when dealing with variable-sized headers.
A naive barrel-shifter implementation in FPGAs is based on
a chain of multiplexers, which results in O(N log(N)) area
complexity and O(log(N)) delay. Contrary to ASIC design,
implementing wide multiplexers can be costly in FPGAs,
normally having the same complexity as an adder [7]. Thus,
avoiding wide multiplexers is desirable when designing efficient
FPGA HW. In the parser, the number of bits to be shifted is a
function of the current header size. Since we are dealing with
wide data buses and the size of the processed headers is
well-constrained by a formula (Listing 3 line 8) for which only a
small set of values is valid, a natural choice is to use a
small lookup table storing only the set of valid shift operands.
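One possible shape for such a table is sketched below for IPv4, where the IHL field is only valid from 5 to 15. This is an assumption-laden illustration: the paper does not state what exactly its table stores, and `ipv4_size_bits`, `kShiftLut`, and `shift_for_ihl` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// IPv4 header size in bits for a given IHL (number of 32-bit words).
constexpr std::uint16_t ipv4_size_bits(unsigned ihl) {
    return static_cast<std::uint16_t>((4u * ihl) * 8u);
}

// Valid IHL values are 5..15, so only 11 shift operands can ever occur.
constexpr std::array<std::uint16_t, 11> kShiftLut = {{
    ipv4_size_bits(5),  ipv4_size_bits(6),  ipv4_size_bits(7),
    ipv4_size_bits(8),  ipv4_size_bits(9),  ipv4_size_bits(10),
    ipv4_size_bits(11), ipv4_size_bits(12), ipv4_size_bits(13),
    ipv4_size_bits(14), ipv4_size_bits(15),
}};
static_assert(kShiftLut.size() == 11, "one entry per valid IHL value");

// Hypothetical lookup: returns the shift operand, or 0 for an invalid IHL.
inline std::uint16_t shift_for_ihl(unsigned ihl) {
    return (ihl >= 5 && ihl <= 15) ? kShiftLut[ihl - 5] : std::uint16_t{0};
}

// Usage example: the shortest and the longest IPv4 headers.
inline void shift_lut_example() {
    assert(shift_for_ihl(5) == 160);
    assert(shift_for_ihl(15) == 480);
}
```

With the operands fixed at compile time, the shifter only has to select among eleven constant shifts instead of supporting every possible shift amount.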
G. Smart Constructors
Class constructors can be used to initialize constant class
members, which leads to more efficient circuits, as in the
constant lookup table of shift values in the previous section.
An example of a smart constructor that makes use of a
templated function is shown in Listing 1 line 10. The function
is called in the constructor in line 23 to initialize the const
class member HeaderBusCompVal. Note that the templated
function uses a lambda expression as a callable parameter.
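A compilable approximation of this pattern is given next. It simplifies Listing 1: `init_array` is written as a free function with explicit element type and size, `LengthCheck` is a hypothetical module, and the comparison inside the lambda is a simplified stand-in for the expression used in the listing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds an array by calling func(i) for every index.
template <typename T, std::size_t N, typename F>
std::array<T, N> init_array(const F& func) {
    std::array<T, N> arr{};
    for (std::size_t i = 0; i < N; ++i)
        arr[i] = func(i);
    return arr;
}

// Hypothetical module whose constant table is filled in the constructor.
template <std::size_t N_BusSize, std::size_t N_Entries>
class LengthCheck {
    const std::array<bool, N_Entries> exceeds_bus_;
public:
    // The lambda is the callable parameter; it captures the input table
    // and compares each length with the bus width.
    explicit LengthCheck(const std::array<std::uint16_t, N_Entries>& lengths)
        : exceeds_bus_(init_array<bool, N_Entries>(
              [&lengths](std::size_t i) { return lengths[i] > N_BusSize; })) {}

    bool exceeds_bus(std::size_t i) const {
        assert(i < N_Entries);
        return exceeds_bus_[i];
    }
};
```

Since `exceeds_bus_` is const and fully determined by the constructor arguments, a compiler that sees constant arguments is free to fold the whole table into constants.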
In C++, templated functions and objects allow callable
objects to be passed as parameters to functions. Callable
parameters allow functions to be reused, thus reducing code
duplication. Such callable parameters can be function pointers,
functors (function objects), or lambda expressions, which were
introduced in C++11. Functors are objects of a class that defines
operator(), so that, once constructed, they can be called like a function.
Modern compilers have the ability to optimize the object
construction, inlining the code within the scope it is called.
More interestingly, lambdas are local functions that can be
stored as variables, while allowing parameter passing and
context capturing. In fact, lambdas are syntactic sugar for
functors [16]. Indeed, for the same functionality, both the
functor and the lambda implementation generate the same
assembly (and LLVM) code [31].
Function pointers are not synthesizable by most
HLS tools. Thus, functors and lambdas are alternative yet
synthesizable ways to emulate function pointers. Besides being
convenient and elegant, lambdas can contribute to more effi-
cient HW generation by enabling constant propagation when
initializing constant class members in class constructors.
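The equivalence between functors and lambdas as callable parameters can be illustrated with a short sketch. All names here (`apply_twice`, `AddOffset`, `add_twice_lambda`, `add_twice_functor`) are hypothetical and chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>

// Generic routine taking any callable; no function pointer is involved.
template <typename Op>
std::uint32_t apply_twice(std::uint32_t x, const Op& op) {
    return op(op(x));
}

// Functor: a class whose operator() makes its objects callable.
struct AddOffset {
    std::uint32_t offset;
    std::uint32_t operator()(std::uint32_t x) const { return x + offset; }
};

// Lambda version: the capture plays the role of the functor's data member.
inline std::uint32_t add_twice_lambda(std::uint32_t x, std::uint32_t offset) {
    return apply_twice(x, [offset](std::uint32_t v) { return v + offset; });
}

// Functor version of the same computation.
inline std::uint32_t add_twice_functor(std::uint32_t x, std::uint32_t offset) {
    return apply_twice(x, AddOffset{offset});
}

// Usage example: both forms compute x + 2 * offset.
inline void callable_example() {
    assert(add_twice_lambda(1, 3) == 7);
    assert(add_twice_functor(1, 3) == 7);
}
```

In both cases the callable type is a template argument of `apply_twice`, so the call target is known at compile time and can be inlined.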
IV. PACKET-PARSER GENERATION FROM P4
A. Top-Level Pipeline
Until now, we have described how a single HW module
can be described using the proposed MpO approach. Several
instances of the generic Header class from Listing 1 can
be specialized to generate different HW modules. Therefore,
the proposed MpO methodology from § III can be used to
implement a complete packet parser.
Listing 4 shows a possible implementation for the packet
parser illustrated in Fig. 2. The code in this listing is
automatically generated from a P4 description [22]. Details on the
internal parser micro-architecture and the optimization steps
for code generation are the subject of previous work [27].
The generated HW architecture from Listing 4 is in accordance
with the parser pipeline organization shown in Fig. 2b.
This is ensured by the static declaration of the parser node
objects (lines 3 and 5), in a similar approach to what Zhao
and Hoe have proposed [19]. The static keyword is used
to declare stateful header objects. The pipeline is therefore
inferred according to the data dependency graph. Conditional
inputs in a given pipeline stage or in the output are resolved
with the ternary (?:) C operator (lines 17, 20, and 23), which
generates a multiplexer in the final HW [32].
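The two ingredients named here, static stage objects and ternary selection, can be shown in a reduced software model. The sketch below is not derived from Listing 4: `Stage` and `top` are hypothetical, the tag-matching behavior is invented for the example, and whether a given tool maps it to the same structure is not claimed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pipeline stage that keeps state between calls.
class Stage {
    const std::uint32_t tag_;
    std::uint32_t total_ = 0;  // running total, preserved across calls
public:
    explicit Stage(std::uint32_t tag) : tag_(tag) {
        assert(tag <= 0xFFu && "the tag is compared with the low byte only");
    }
    // Accepts the word when its low byte equals the tag, accumulates the
    // remaining bits, and reports the running total.
    void analyse(std::uint32_t in, bool& valid, std::uint32_t& out) {
        valid = (in & 0xFFu) == tag_;
        if (valid) total_ += in >> 8;
        out = total_;
    }
};

// Top-level function: the static stages are constructed once and keep
// their state, and the ternary operator selects which stage output is
// returned.
inline std::uint32_t top(std::uint32_t word) {
    static Stage a(0x04), b(0x06);
    bool va = false, vb = false;
    std::uint32_t oa = 0, ob = 0;
    a.analyse(word, va, oa);
    b.analyse(word, vb, ob);
    return va ? oa : ob;
}
```

Both stages read the same input and neither depends on the other, so they form two parallel branches whose outputs meet at the selection.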
B. Adapting MpO to Current HLS tools
Vivado HLS supports C, C++, and SystemC for synthesis
and simulation. The most recent C++ version supported by
Vivado HLS dates from 2011. However, Vivado HLS does
not fully support this C++ version, limiting the spectrum of
standard high-level constructs that can be used to raise the
development abstraction.
This work makes extensive use of the C++11 STL. While
some constructs available in the library are expected to fail
during synthesis, such as lists and maps, fixed-bounded con-
structs are well supported. These constructs, such as the stan-
dard array and pair, are described as classes in the STL
and their operators are defined as functors in these classes.
During synthesis, when facing each of these operators, Vivado
HLS performs automatic function inlining for the method
describing one operator, which leads to longer synthesis time
and higher memory usage. The decision to use these STL constructs
is, therefore, a trade-off between the synthesis time and the
flexibility provided by these constructs.
During this work, we have struggled to correctly implement
dynamic polymorphism with Vivado HLS. Static polymorphism
through CRTP was the only solution we found for polymorphism
in this work. However, even static polymorphism
is limited. Derived classes can access neither local members
nor base class members. Such data accesses cause an invalid
pointer reinterpretation error during synthesis. The workaround for
such errors is to pass the necessary operands as function
parameters to the callee methods in the derived class. Accessing
static members in the base class does not cause any error.
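The workaround described above can be sketched as follows. The class
names HeaderBase, FixedHdr, and VarHdr and the numeric constants are
hypothetical; the point is only that the base class forwards its state
as explicit arguments, so the derived-class method reads no data member.

```cpp
#include <cstdint>

// Base class template: the call is resolved at compile time through CRTP.
template <typename Derived>
class HeaderBase {
public:
    explicit HeaderBase(std::uint32_t offset) : offset_(offset) {}

    // The base passes its own state (offset_) as a function parameter
    // instead of letting the derived class read base members directly.
    std::uint32_t PayloadStart(std::uint32_t lengthField) const {
        return static_cast<const Derived*>(this)->ComputeStart(offset_, lengthField);
    }

private:
    std::uint32_t offset_;
};

// Hypothetical fixed-size header: the length field is ignored.
class FixedHdr : public HeaderBase<FixedHdr> {
public:
    FixedHdr() : HeaderBase<FixedHdr>(14) {}
    std::uint32_t ComputeStart(std::uint32_t offset, std::uint32_t /*lengthField*/) const {
        return offset + 20;
    }
};

// Hypothetical variable-size header: size given in 32-bit words.
class VarHdr : public HeaderBase<VarHdr> {
public:
    VarHdr() : HeaderBase<VarHdr>(14) {}
    std::uint32_t ComputeStart(std::uint32_t offset, std::uint32_t lengthField) const {
        return offset + 4 * lengthField;
    }
};
```

Both ComputeStart methods operate exclusively on their parameters, which
is the restriction the text reports for derived classes under synthesis.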
Modern compilers are able to devirtualize virtual methods
of dynamically polymorphic classes at compile time and to
inline the code in derived classes. Clang, for instance, is
capable of devirtualizing with the compiler optimization flag
set to -O2 [33]. However, Vivado HLS does not support the
compiler optimization flags. Since the optimization flag has no
effect on Vivado HLS, and its own synthesis pass is not able
to infer the virtual type, dynamic polymorphism cannot be
synthesized. Thus, as already concluded by other researchers [34],
borrowing some front-end optimization techniques from
modern compilers may be useful in the HLS world.
Listing 4: The Parser pipeline
1 void Parser(const PktDataType& PktIn, EthPHVDataType& eth_PHV, · · · , PktDataType& PktOut) {
2   array<PktDataType, 5> tmpPIn, tmpPOut;
3   static FixedHeader<· · · > eth (· · · );       // stateful header object
4   static EthPHVDataType tmpEthPHV;
5   static VariableHeader<· · · > ipv4(· · · );    // stateful header object
6   static Ipv4PHVDataType tmpIpv4PHV;
7   · · ·
8   tmpPIn[0] = PktIn;
9   eth.HeaderAnalysis(tmpPIn[0],tmpEthPHV,tmpPOut[0]);
10  eth_PHV = tmpEthPHV;
11  tmpPIn[1] = tmpPOut[0];
12  ipv4.HeaderAnalysis(tmpPIn[1],tmpIpv4PHV,tmpPOut[1]);
13  ipv4_PHV = tmpIpv4PHV;
14  tmpPIn[2] = tmpPOut[0];                        // same input as the ipv4 stage
15  ipv6.HeaderAnalysis(tmpPIn[2],tmpIpv6PHV,tmpPOut[2]);
16  ipv6_PHV = tmpIpv6PHV;
17  tmpPIn[3] = (tmpIpv4PHV.Valid)?tmpPOut[1]:tmpPOut[2];
18  udp.HeaderAnalysis(tmpPIn[3],tmpUdpPHV,tmpPOut[3]);
19  udp_PHV = tmpUdpPHV;
20  tmpPIn[4] = (tmpIpv4PHV.Valid)?tmpPOut[1]:tmpPOut[2];
21  tcp.HeaderAnalysis(tmpPIn[4], tmpTcpPHV, tmpPOut[4]);
22  tcp_PHV = tmpTcpPHV;
23  PktOut = (tmpUdpPHV.Valid)?tmpPOut[3]:tmpPOut[4];
24 }
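The devirtualization argument made for software compilers can be
illustrated with a small sketch. The types Operation and Doubler are
hypothetical. The C++ standard permits, but does not require, a compiler
to replace the virtual call below by a direct one; whether a given
compiler and optimization level actually does so is an assumption that
should be checked on the generated code.

```cpp
// Hypothetical interface with a virtual method (dynamic polymorphism).
struct Operation {
    virtual ~Operation() {}
    virtual int Apply(int x) const = 0;
};

// Marking the class final states that no further override can exist.
struct Doubler final : Operation {
    int Apply(int x) const override { return 2 * x; }
};

// The dynamic type of `op` is visible in this scope, so an optimizing
// software compiler may turn the virtual call into a direct, inlinable one.
int RunDoubler(int x) {
    Doubler d;
    const Operation& op = d;
    return op.Apply(x);
}
```

An HLS front end that performed the same analysis could treat
RunDoubler as if Apply had been called statically, which is the kind of
borrowed optimization the text suggests.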
V. EXPERIMENTAL RESULTS
A. Experimental Setup
In order to demonstrate the efficacy of the proposed MpO
methodology, we conducted three design experiments: 1) a
configurable packet parser [27]; 2) a flow-based traffic man-
ager (TM) [35]; and, 3) a digital up-converter [36].
The first design experiment is a configurable packet parser
briefly introduced in § III-B. To enable reproducibility, the
code of this experiment is open-source [37].
The second design experiment is a flow-based TM archi-
tecture proposed by Benacer et al. [35] in the context of
SDN. The architecture is made up of a traffic policer, a packet
scheduler, a systolic priority queue, and a traffic shaper. This
source code is proprietary.
The third design experiment is a digital up-converter re-
trieved from an application note from Xilinx [36]. The up-
converter design is composed of multi-stage FIR filters, a
direct digital synthesizer, and a mixer. This implementation
is open-source [38].
All experiments targeted a Xilinx Virtex-7 FPGA, part
number XC7VX690TFFG1761-2. Vivado HLS 2015.4 was
used to generate synthesizable RTL code. While we have
tested more recent versions of Vivado HLS, according to our
experiments, version 2015.4 is the one that best supports
modern C++ constructs. Xilinx Vivado 2015.4 was used for
synthesis and place and route (P&R). Code complexity is
presented in terms of equivalent lines of code (eLOC). The
eLOC metric ignores blank and commented lines. LOC, when
presented, represents the actual number of lines of code. We
measure code reuse with CCFinderX [39], an open-source tool
based on the work by Kamiya et al. [40], to detect code clones.
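To make the eLOC definition concrete, the following is a simplified
sketch of such a counter. It is not the tool used in this work: the
function CountELOC is hypothetical, and it assumes that only blank lines
and lines starting with // are ignored (block comments are not handled).

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts lines that are neither blank nor comment-only (// style).
// Simplifying assumption: /* ... */ block comments are treated as code.
std::size_t CountELOC(const std::string& source) {
    std::istringstream in(source);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;         // blank line
        if (line.compare(first, 2, "//") == 0) continue;  // comment-only line
        ++count;
    }
    return count;
}
```

A line holding code followed by a trailing comment still counts as one
equivalent line, since only fully commented lines are skipped.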
TABLE II
PACKET PARSER RESULTS. ADAPTED FROM [27]
Work        Freq. [MHz]   Lat. [ns]   LUTs     FFs      Slices
VHDL [41]   195.3         27          N/A      N/A      8000
[41]        195.3         46.1        10103    5537     15640
[41] MpO    312.5         41.6        6450     10308    16758
MpO         312.5         25.6        6046     8900     14946
TABLE III
FLOW-BASED TM RESULTS
Work        Freq. [MHz]   LUTs      FFs      Slices   eLOC
Systolic Slice Size = 3
[35]        91.5          37581     13723    9833     784
[35] MpO    102.4         33575     13536    9182     1001
Systolic Slice Size = 4
[35] MpO    74.5          55625     13891    14669    1001
Systolic Slice Size = 8
[35] MpO    44.9          116666    13585    31884    1001
Systolic Slice Size = 16
[35] MpO    31.0          200450    13930    57876    1001
B. Results
1) Configurable Packet Parser: Table II presents the results
of the configurable packet parser experiment. In terms of
throughput (omitted from the table) and latency, this work
performs as well as the hand-crafted VHDL implementation
reported in [41]. This work outperforms automatically gen-
erated VHDL code in all aspects except in the number of
FFs. The LUT reduction can be explained by the degree of
parameterization that our specialized C++ classes offer. The
operations are therefore fine-tuned for each header instance.
We have conducted a different experiment where we mimic
Benacek’s architecture using our MpO methodology. This
experiment is labelled “[41] MpO” in Table II. Architectural
aspects aside, this hybrid implementation delivers better
results than the original Benacek implementation, significantly
reducing the latency (-10%) and the number of LUTs (-35%).
One takeaway from this experiment is that VHDL lacks the
abstraction needed to serve as a direct conversion target for
a high-abstraction DSL, such as P4. On the other hand, C++
offers an adequate dialect to represent network semantics that
can be described using P4.
2) Flow-based traffic manager: Table III presents the re-
sults of the TM implementation. This TM implementation can
process 1024 different packet flows. The queue depth of this
TM is 128. To provide a fair comparison for this experiment,
we did not perform any algorithmic or architectural optimization
on the original code. Also, we kept the same optimization
directives as the original design.
Besides code modernization using C++11 constructs, we
augmented the degree of parameterization of the TM design.
The core component of this TM is a systolic implementation
of a priority queue. In the original design, each systolic slice
implemented a micro queue of two or three elements. The
MpO implementation fully parameterizes these micro queues,
not limiting to two or three. This can be seen in Table III, in
which we show the TM implementation results, with 4, 8, and
16 elements in each systolic slice.
TABLE IV
DIGITAL UP-CONVERTER HW QOR RESULTS
Work        Freq. [MHz]   Lat. [cycles]   LUTs    FFs     Slices
[36]        371.6         3394~3395       3472    7388    1641
[36] MpO    404.0         3375~3376       3010    5723    1568
TABLE V
DIGITAL UP-CONVERTER SW QUALITY RESULTS
Work            eLOC   Clones   Instances/Clone   LOC/Clone
Original [36]   383    6        2.33              83.5
[36] MpO        323    0        ∅                 ∅
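The parameterization of the micro queues discussed for the TM can be
sketched with a class template whose capacity is a compile-time
parameter. This is not the architecture of [35]: the class MicroQueue,
its sorted-insertion policy, and the use of the largest 32-bit value as
an "empty" marker are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical micro queue holding N priorities in ascending order.
template <std::size_t N>
class MicroQueue {
public:
    // Marker for an unused slot (assumption: never used as a real priority).
    static constexpr std::uint32_t Empty() {
        return std::numeric_limits<std::uint32_t>::max();
    }

    MicroQueue() { slots_.fill(Empty()); }

    // Inserts a priority and returns the element pushed out of the slice
    // (Empty() when nothing is evicted); a neighbouring slice would take it.
    std::uint32_t Push(std::uint32_t prio) {
        std::uint32_t carry = prio;
        for (std::size_t i = 0; i < N; ++i) {  // fixed trip count
            if (carry < slots_[i]) {
                const std::uint32_t tmp = slots_[i];
                slots_[i] = carry;
                carry = tmp;
            }
        }
        return carry;
    }

    // Smallest stored priority, or Empty() if the queue holds nothing.
    std::uint32_t Front() const { return slots_[0]; }

private:
    std::array<std::uint32_t, N> slots_;
};
```

Changing the slice size then amounts to changing one template argument,
for example MicroQueue<4> or MicroQueue<16>, without touching the class
body.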
The MpO version of the TM improved HW QoR. Notably,
the circuit frequency was improved by more than 10%.
The area consumption was improved as well, with a reduction
of more than 10% in LUTs and 7% in occupied slices. No
effects on latency, II, DSPs, and BRAMs were observed.
The MpO implementation increased eLOC by 27%. This
was expected because we generalized a hardwired
implementation of the systolic queue slice to support
arbitrary systolic slice sizes. Moreover, a significant contributor
to the increased eLOC is a library that can be reused elsewhere.
This library, in which we implemented type trait classes and
generic helper functions, accounts for roughly 10% of the total eLOC.
In both the original and the MpO-based implementations, CCFinderX
did not find code clones.
3) Digital up-converter: Table IV shows the results of
the digital up-converter implementation. We did not perform
optimizations on the original code. We only modified the code
for the FIR filters. While HW QoR results consider the whole
design, the SW quality analysis applies only to the filters.
As shown in Table IV, the MpO approach improves QoR
metrics compared to the original digital up-converter im-
plementation from Xilinx. There were improvements in the
maximum frequency, latency, and area consumption, notably
for FFs. The FFs reduction can be explained by the reduced
latency, which means that a shorter pipeline was required in
the MpO implementation. No effects on BRAM, DSP, and II
were observed, so they are not reported in Table IV.
The MpO approach significantly improves software quality,
as presented in Table V. The eLOC measure in Table V
shows how expressive MpO is compared to the original
design. The MpO-based code is 16% more concise than
its original counterpart. We also evaluate code reuse by
measuring code clone patterns, as reported in Table V.
CCFinderX found 6 code clone patterns
in the original design, while no clones were found in our
implementation. Indeed, the MpO methodology favors code
reuse and STL usage, following a don't-repeat-yourself (DRY)
approach to avoid code duplication. In addition, in the original
design, CCFinderX found an average of 2.33 replicated instances per
clone pattern, with a maximum of 3. CCFinderX also reported
an average of 83.5 LOC per clone, with a maximum of 115.
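The way templates help remove clones such as those found in the FIR
filters can be sketched as follows. This is not the code of [36] or of
our implementation [38]: the class Fir, its direct-form structure, and
its interface are assumptions chosen to show how one generic definition
can replace several near-identical filter variants.

```cpp
#include <array>
#include <cstddef>

// Hypothetical generic FIR filter: the sample type and the number of
// taps are template parameters, so each filter stage is an instance of
// the same class rather than a copy of the same code.
template <typename T, std::size_t Taps>
class Fir {
    static_assert(Taps > 0, "a filter needs at least one tap");

public:
    explicit Fir(const std::array<T, Taps>& coeffs) : coeffs_(coeffs) {
        delay_.fill(T());
    }

    // Shifts in one sample and returns the filtered output.
    T Step(T sample) {
        for (std::size_t i = Taps - 1; i > 0; --i) {
            delay_[i] = delay_[i - 1];
        }
        delay_[0] = sample;
        T acc = T();
        for (std::size_t i = 0; i < Taps; ++i) {
            acc += coeffs_[i] * delay_[i];
        }
        return acc;
    }

private:
    std::array<T, Taps> coeffs_;
    std::array<T, Taps> delay_;  // delay line, newest sample at index 0
};
```

Feeding a unit impulse into an instance returns its coefficients in
order, which is a convenient sanity check for any stage built this way.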
C. Analysis and Discussions
Zhao and Hoe [19] present quantitative results for the design
of a network-on-chip. The authors compare their methodology
to an auto-generated RTL implementation, while comparisons
to hand-crafted RTL are not shown. On average, their results
show an overhead of 11% in LUT consumption and 8% in
clock period. FF usage is reduced by 58%. Latency
results are not presented. Their experiment is similar to the
comparison between this work applied to the packet parser
and [41]. Using our methodology, the maximum frequency
is 1.6× higher, the latency is reduced by 45%, and LUTs
by 40%, while the number of FFs increases by 60%. These
improvements in LUT consumption and maximum
frequency are due to our design’s ability to specialize
operations, leading to faster and more compact circuits.
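The relative figures quoted for the packet parser can be recomputed from
the "[41]" and "MpO" rows of Table II. The two helper functions below
are hypothetical and only make the arithmetic explicit.

```cpp
// Relative change of `value` with respect to `baseline`, in percent.
double RelativeChangePercent(double baseline, double value) {
    return 100.0 * (value - baseline) / baseline;
}

// Ratio between two quantities, e.g. two maximum frequencies.
double Ratio(double value, double baseline) {
    return value / baseline;
}
```

With the Table II values, Ratio(312.5, 195.3) is about 1.6, and
RelativeChangePercent gives about -44.5% for latency (46.1 ns to
25.6 ns), about -40.2% for LUTs (10103 to 6046), and about +60.7% for
FFs (5537 to 8900).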
Oezkan et al. [20] present comparative results between
their HLS-based image processing library and the results of
an image processing DSL that generates C++ code. In that
comparison, their results outperform the auto-generated code,
which is expected, since their library is directly hand-crafted
in C++. Therefore, their results cannot be used as a baseline
for a fair comparison against our proposed methodology.
While similar works have exploited modern C++ with HLS
design [3], [4], [16], [17], no generalized methodology has
been presented to date. Muck and Frohlich [3], [4] have
focused on unified CPU-FPGA C++ code-base. Thomas [16]
and Richmond et al. [17] have exploited the power of modern
C++ to implement features not natively supported by HLS
tools, such as recursion and higher-order functions. None of
these works have presented the benefits of using C++ in the
generated HW, as we have shown. Also, no SW quality metrics
have been presented in these works.
VI. CONCLUSION
HLS is a game changer to spread FPGA usage outside the
HW world. However, achieving high QoR with HLS design
still relies on detailed FPGA knowledge to generate FPGA-
friendly low-level code, an uncommon skill for software
developers. Such code lowers the design abstraction level,
making its comprehension and maintenance tedious even for
experienced programmers. This HLS design approach follows
a uni-dimensional design perspective, trading off HW QoR
against SW quality.
In this work, we introduced a bi-dimensional HLS de-
sign view by proposing the MpO methodology. The MpO
methodology targets FPGA development with HLS exploiting
standard C++11 constructs. The proposed MpO methodology
builds on the concept of an MpO base class. The five presented
characteristics of an MpO base class leverage HLS design,
improving HW QoR, code readability and modularity while
raising the abstraction development level. Through three de-
sign examples, we showed that using the MpO methodology,
a C++ code can deliver results comparable to hand-crafted
VHDL design. We also showed that code complexity
can be reduced using the zero-overhead characteristic of C++.
REFERENCES
[1] J. Matai et al., “Enabling FPGAs for the Masses,” CoRR, vol.
abs/1408.5870, 2014. [Online]. Available: http://arxiv.org/abs/1408.5870
[2] F. Winterstein et al., “High-level synthesis of dynamic data structures:
A case study using Vivado HLS,” in 2013 International Conference on
Field-Programmable Technology (FPT), Dec 2013, pp. 362–365.
[3] T. R. Muck and A. A. Frohlich, “Toward Unified Design of Hardware
and Software Components Using C++,” IEEE Transactions on Comput-
ers, vol. 63, no. 11, pp. 2880–2893, Nov 2014.
[4] ——, “A metaprogrammed C++ framework for hardware/software
component integration and communication,” Journal of Systems
Architecture, vol. 60, no. 10, pp. 816 – 827, 2014. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1383762114001143
[5] C. Hoare, “The quality of software,” Software: Practice and Experience,
vol. 2, no. 2, pp. 103–105, 1972.
[6] B. Stroustrup and H. Sutter, “C++ Core Guidelines,”
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines, 2018.
[7] J. Cong et al., “High-Level Synthesis for FPGAs: From Prototyping
to Deployment,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 30, no. 4, pp. 473–491, April 2011.
[8] Y. Liang et al., “High-level synthesis: productivity, performance, and
software constraints,” Journal of Electrical and Computer Engineering,
vol. 2012, p. 1, 2012.
[9] E. Homsirikamol and K. Gaj, “Can high-level synthesis compete against
a hand-written code in the cryptographic domain? A case study,” in 2014
International Conference on ReConFigurable Computing and FPGAs
(ReConFig14), Dec 2014, pp. 1–8.
[10] X. Liu et al., “High Level Synthesis of Complex Applications: An
H.264 Video Decoder,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, ser.
FPGA ’16. New York, NY, USA: ACM, 2016, pp. 224–233. [Online].
Available: http://doi.acm.org/10.1145/2847263.2847274
[11] Y. Zhou et al., “Rosetta: A Realistic High-Level Synthesis Benchmark
Suite for Software Programmable FPGAs,” in Proceedings of the 2018
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, ser. FPGA ’18. New York, NY, USA: ACM, 2018, pp. 269–278.
[Online]. Available: http://doi.acm.org/10.1145/3174243.3174255
[12] F. Winterstein et al., “Separation Logic-Assisted Code Transforma-
tions for Efficient High-Level Synthesis,” in 2014 IEEE 22nd Annual
International Symposium on Field-Programmable Custom Computing
Machines, May 2014, pp. 1–8.
[13] X. Gao et al., “Automatically Optimizing the Latency, Area,
and Accuracy of C Programs for High-Level Synthesis,” in
Proceedings of the 2016 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, ser. FPGA ’16. New
York, NY, USA: ACM, 2016, pp. 234–243. [Online]. Available:
http://doi.acm.org/10.1145/2847263.2847282
[14] J. Cong et al., “Bandwidth Optimization Through On-Chip
Memory Restructuring for HLS,” in Proceedings of the 54th
Annual Design Automation Conference 2017, ser. DAC ’17. New
York, NY, USA: ACM, 2017, pp. 43:1–43:6. [Online]. Available:
http://doi.acm.org/10.1145/3061639.3062208
[15] J. O. Coplien, “Curiously Recurring Template Patterns,” C++
Rep., vol. 7, no. 2, pp. 24–27, Feb. 1995. [Online]. Available:
http://dl.acm.org/citation.cfm?id=229227.229229
[16] D. B. Thomas, “Synthesisable recursion for C++ HLS tools,” in 2016
IEEE 27th International Conference on Application-specific Systems,
Architectures and Processors (ASAP), July 2016, pp. 91–98.
[17] D. Richmond, A. Althoff, and R. Kastner, “Synthesizable Higher-Order
Functions for C++,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, pp. 1–1, 2018.
[18] H. Eran, L. Zeno, Z. Istvanz, and M. Silberstein, “Design Patterns
for Code Reuse in HLS Packet Processing Pipelines,” in To appear
at the 2019 IEEE 27th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM), April 2019, pp.
1–10.
[19] Z. Zhao and J. C. Hoe, “Using Vivado-HLS for Structural Design:
a NoC Case Study,” Carnegie Mellon University, ECE Department,
Pittsburgh, PA USA, Tech. Rep., 2017.
[20] M. A. Oezkan et al., “A Highly Efficient and Comprehensive Image
Processing Library for C++-based High-Level Synthesis,” in FSP 2017;
Fourth International Workshop on FPGAs for Software Programmers,
Sept 2017, pp. 1–10.
[21] N. Kapre and S. Bayliss, “Survey of domain-specific languages for
FPGA computing,” in 2016 26th International Conference on Field
Programmable Logic and Applications (FPL), Aug 2016, pp. 1–12.
[22] P. Bosshart et al., “P4: Programming Protocol-independent
Packet Processors,” SIGCOMM Comput. Commun. Rev.,
vol. 44, no. 3, pp. 87–95, Jul. 2014. [Online]. Available:
http://doi.acm.org/10.1145/2656877.2656890
[23] H. Wang et al., “P4FPGA: A Rapid Prototyping Framework for P4,”
in Proceedings of the Symposium on SDN Research, ser. SOSR ’17.
New York, NY, USA: ACM, 2017, pp. 122–135. [Online]. Available:
http://doi.acm.org/10.1145/3050220.3050234
[24] J. Khan and P. Athanas, “Creating Custom Network Packet Processing
Pipelines on HMC-Enabled FPGAs.”
[25] N. Sultana et al., “Emu: Rapid Prototyping of Networking
Services,” in 2017 USENIX Annual Technical Conference
(USENIX ATC 17). Santa Clara, CA: USENIX
Association, 2017, pp. 459–471. [Online]. Available:
https://www.usenix.org/conference/atc17/technical-sessions/presentation/sultana
[26] S. Singh and D. J. Greaves, “Kiwi: Synthesis of FPGA Circuits from
Parallel Programs,” in 2008 16th International Symposium on Field-
Programmable Custom Computing Machines, April 2008, pp. 3–12.
[27] J. Santiago da Silva, F.-R. Boyer, and J. M. P. Langlois, “P4-
Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming
Packet Parsers in FPGAs,” in Proceedings of the 2018 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, ser.
FPGA ’18. New York, NY, USA: ACM, 2018, pp. 147–152. [Online].
Available: http://doi.acm.org/10.1145/3174243.3174270
[28] G. Gibb et al., “Design principles for packet parsers,” in Architectures
for Networking and Communications Systems, Oct 2013, pp. 13–24.
[29] Jeferson Santiago da Silva, “constexpr Example,”
https://godbolt.org/z/fpme-0, 2019.
[30] ——, “C++ Zero Abstraction Example,” https://godbolt.org/z/xkKXON,
2019.
[31] ——, “Lambda versus Functor Example,” https://godbolt.org/g/uQqU65,
2019.
[32] J. H. Kim et al., “FPGA-based CNN inference accelerator synthesized
from multi-threaded C software,” in 2017 30th IEEE International
System-on-Chip Conference (SOCC), Sept 2017, pp. 268–273.
[33] Jeferson Santiago da Silva, “Frontend Optimizations Example,”
https://godbolt.org/g/dheE2Q, 2019.
[34] D. H. Noronha et al., “Rapid circuit-specific inlining tuning for FPGA
high-level synthesis,” in 2017 International Conference on ReConFig-
urable Computing and FPGAs (ReConFig), Dec 2017, pp. 1–6.
[35] I. Benacer et al., “Design of a Low Latency 40 Gb/s Flow-Based Traffic
Manager Using High-Level Synthesis,” in 2018 IEEE International
Symposium on Circuits and Systems (ISCAS), May 2018, pp. 1–5.
[36] A. Paek and J. Wu, “Designing a Digital Up-Converter using Modular
C++ Classes in Vivado High Level Synthesis,” 2016.
[37] Jeferson Santiago da Silva, “Packet Parser Code,”
https://github.com/engjefersonsantiago/P4HLS, 2019.
[38] ——, “Digital-up Converter Code,”
https://github.com/engjefersonsantiago/MpO/tree/master/DUC, 2019.
[39] Peter Senna Tschudin, “CCFinderX Code,”
https://github.com/petersenna/ccfinderx-core, 2019.
[40] T. Kamiya et al., “Ccfinder: a multilinguistic token-based code clone
detection system for large scale source code,” IEEE Transactions on
Software Engineering, vol. 28, no. 7, pp. 654–670, Jul 2002.
[41] P. Benáček et al., “P4-to-VHDL: Automatic Generation of 100 Gbps
Packet Parsers,” in 2016 IEEE 24th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), May 2016,
pp. 148–155.
