Hardware Acceleration Using Functional Languages by Hodaňová, Andrea
VYSOKE´ UCˇENI´ TECHNICKE´ V BRNEˇ
BRNO UNIVERSITY OF TECHNOLOGY
FAKULTA INFORMACˇNI´CH TECHNOLOGII´
U´STAV POCˇI´TACˇOVY´CH SYSTE´MU˚
FACULTY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SYSTEMS
VYUZˇITI´ FUNKCIONA´LNI´CH JAZYKU˚
PRO HARDWAROVOU AKCELERACI
HARDWARE ACCELERATION USING FUNCTIONAL LANGUAGES
DIPLOMOVA´ PRA´CE
MASTER’S THESIS
AUTOR PRA´CE Bc. ANDREA HODANˇOVA´
AUTHOR
VEDOUCI´ PRA´CE Doc. Dr. Ing. OTTO FUCˇI´K
SUPERVISOR
BRNO 2013
Abstrakt
Cílem této práce je prozkoumat možnosti využití funkcionálního paradigmatu pro hard-
warovou akceleraci, konkrétně pro datově paralelní úlohy. Úroveň abstrakce tradičních
jazyků pro popis hardwaru, jako VHDL a Verilog, přestáví stačit. Pro popis na algorit-
mické či behaviorální úrovni se rozmáhají jazyky původně navržené pro vývoj softwaru a
modelování, jako C/C++, SystemC nebo MATLAB. Funkcionální jazyky se s těmi imper-
ativními nemůžou měřit v rozšířenosti a oblíbenosti mezi programátory, přesto je předčí v
mnoha vlastnostech, např. ve verifikovatelnosti, schopnosti zachytit inherentní paralelis-
mus a v kompaktnosti kódu. Pro akceleraci datově paralelních výpočtů se často používají
jednotky FPGA, grafické karty (GPU) a vícejádrové procesory. Praktická část této práce
rozšiřuje existující knihovnu Accelerate pro počítání na grafických kartách o výstup do
VHDL. Accelerate je možno chápat jako doménově specifický jazyk vestavěný do Haskellu s
backendem pro prostředí NVIDIA CUDA. Rozšíření pro vysokoúrovňovou syntézu obvodů
ve VHDL představené v této práci používá stejný jazyk a frontend.
Abstract
The aim of this thesis is to research how the functional paradigm can be used for hard-
ware acceleration with an emphasis on data-parallel tasks. The level of abstraction of
the traditional hardware description languages, such as VHDL or Verilog, is becoming to
low. High-level languages from the domains of software development and modeling, such
as C/C++, SystemC or MATLAB, are experiencing a boom for hardware description on
the algorithmic or behavioral level. Functional Languages are not so commonly used, but
they outperform imperative languages in verification, the ability to capture inherent par-
alellism and the compactness of code. Data-parallel task are often accelerated on FPGAs,
GPUs and multicore processors. In this thesis, we use a library for general-purpose GPU
programs called Accelerate and extend it to produce VHDL. Accelerate is a domain-specific
language embedded into Haskell with a backend for the NVIDIA CUDA platform. We use
the language and its frontend, and create a new backend for high-level synthesis of circuits
in VHDL.
Klíčová slova
funkcionální jazyky, funkcionální paradigma, jazyky pro popis hardwaru, vysokoúrovňová
syntéza, Haskell, VHDL, Accelerate, GPGPU, CUDA, datově paralelní výpočty
Keywords
functional languages, functional paradigm, hardware description languages, HDLs, high-
level synthesis, Haskell, VHDL, Accelerate, GPGPU, CUDA, data-parallel computing
Citace
Andrea Hodaňová: Hardware Acceleration Using Functional Languages, diplomová práce,
Brno, FIT VUT v Brně, 2013
Hardware Acceleration Using Functional Languages
Prohlášení
Prohlašuji, že jsem tuto diplomovou práci vypracovala samostatně pod vedením pana
Doc. Dr. Ing. Otty Fučíka.
. . . . . . . . . . . . . . . . . . . . . . .
Andrea Hodaňová
22. května 2013
Poděkování
Ráda bych poděkovala svému vedoucímu Doc. Fučíkovi za vedení celého projektu, odbornou
pomoc a rady, zejména z oblasti návrhu a syntézy číslicových systémů. Dále bych chtěla
poděkovat kolegovi Bc. Lukáši Kuklínkovi za konzultace z oblasti funcionálních jazyků a
programování v Haskellu.
c© Andrea Hodaňová, 2013.
Tato práce vznikla jako školní dílo na Vysokém učení technickém v Brně, Fakultě in-
formačních technologií. Práce je chráněna autorským zákonem a její užití bez udělení
oprávnění autorem je nezákonné, s výjimkou zákonem definovaných případů.
Contents
1 Introduction 3
2 Hardware Synthesis 5
2.1 Abstraction Levels of the Design Process . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Synthesizable Subsets of Hardware Description Languages . . . . . . 5
2.1.2 Motivation for Higher Levels of Abstraction . . . . . . . . . . . . . . 6
2.2 Evolution of High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Early Research Tools – the First Generation . . . . . . . . . . . . . . 7
2.2.2 Reasons Behind the Slow Adoption of the First Tools . . . . . . . . 8
2.2.3 Behavioral Synthesis from HDLs – the Second Generation . . . . . . 8
2.2.4 Limitations of Behavioral Synthesis . . . . . . . . . . . . . . . . . . . 8
2.3 Modern High-Level Synthesis Tools . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Synthesis from C/C++ and SystemC . . . . . . . . . . . . . . . . . 10
2.3.2 Functional Verification . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 High-Level Synthesis Process . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Decomposition into Subtasks . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 Complexity of the Subtasks . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Fundamental Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Advantages and Disadvantages of High-Level Synthesis . . . . . . . . . . . . 18
3 Functional Languages 19
3.1 Haskell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Higher-Order Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Map and ZipWith . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Fold and Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Lava . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 The Original Lava . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.2 Observable Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 reFLect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 ForSyDe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Data-Parallel Computations 26
4.1 Inspiration from the domain of GPUs . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 NVIDIA CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Languages for General-Purpose GPU Programming . . . . . . . . . 28
4.2 Functional Languages for Data-Parallel Computation . . . . . . . . . . . . . 29
4.2.1 Obsidian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 The Accelerate Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1
4.3.1 Accelerated Arrays and Expressions . . . . . . . . . . . . . . . . . . 32
4.3.2 Collective Array Operations . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.4 Frontend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3.5 CUDA Backend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.6 Other Backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Suitable EDSL for this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 VHDL Backend for Accelerate 41
5.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Input from the Frontend . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.2 VHDL Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Traversal of the AST . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.2 User-Defined Constraints for Synthesis . . . . . . . . . . . . . . . . . 45
5.2.3 Text.PrettyPrint . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.4 Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Testing – dotproduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusion 47
2
Chapter 1
Introduction
The aim of this project is to research the field of hardware description languages (HDLs)
with an emphasis on functional languages that can be used to generate VHDL. Verilog and
VHDL are still the most popular HDLs, although the level of integration and the number of
gates available on chips create demand for higher-level description approaches. There are
three levels of abstraction in HDLs: the behavioral level, the register-transfer level (RTL),
and the structural level. Most commercially successful synthesis tools can synthesize RTL
and structural HDLs, but this level of abstraction is becoming too low and is being replaced
by high-level synthesis of circuits from languages that were created for software rather than
hardware development [15].
C language is becoming popular with hardware designers because it is a widely used gen-
eral purpose language. Not only electronic systems designers, but also many professionals
from other non-hardware backgrounds are familiar with it. Professional tools like Calypto’s
Catapult C or Xilinx Vivado HLS Suite let designers synthesize code written in C, C++,
SystemC or Matlab. This major paradigm shift in electronic systems design speeds up the
development and verification of designs targeted at FPGAs or even at complicated SoCs.
Apart from imperative languages, which are sequential in nature, other paradigms have
been receiving attention from both the academia and the industry. Functional languages
have many interesting characteristics applicable to the hardware description domain: the
main building blocks of functional programs are functions – in the mathematical rather
than the procedural sense. They describe what something is instead of how to do it. A
combinational circuit is a function of inputs that produces outputs, exactly like a function in
functional languages. With sequential circuits, the situation is more difficult, as functional
languages are generally stateless, but a synchronized delay can be modelled as well.
Several functional languages have been developed or modified for hardware description.
One of the most popular languages in research projects is Haskell [34]. It was created in
1990 and has been gaining popularity among researchers ever since. Many domain-specific
languages have been derived from it. A lot of these languages have been embedded in the
language itself without any additional extensions. Such code can then be compiled by any
Haskell compiler in the same way as standard Haskell code. The domain-specific features
rely on specialized libraries.
In this thesis, the topic of high-level synthesis of digital circuits is viewed from the
perspective of data-parallel applications. A domain-specific language embedded into Haskell
called Accelerate with a library for data-parallel computing on GPUs have been developed
at the University of New South Wales, Australia. It was created to perform accelerations
of compute-intensive programs on the NVIDIA CUDA platform, although other backends
3
exist as well. Currently, there are experimental backends for OpenCL and the Repa library.
In this project, we use the Accelerate language and the frontend as a basis for the
development of a Haskell high-level synthesis tool. The tool is implemented as an Accelerate
backend with an output in VHDL for FPGA chips. The design and the development of the
backend is summarized in this document.
4
Chapter 2
Hardware Synthesis
The number of transistors available on a chip increases rapidly. In 1965, Intel’s co-founder
Gordon Moore predicted that the capacity of a chip would double every 2 years. Five
decades later, the so-called Moore’s law is still relevant, although the rate has increased
and the chip capacity has doubled approximately every 18 months recently [15].
With more transistors on chips, designs get more and more complex. More time and
more designers are needed to accomplish a design task in time to catch the market win-
dow. This phenomenon drives the demand for design tools that operate on higher levels of
abstraction.
2.1 Abstraction Levels of the Design Process
There are several levels of abstraction in the electronic systems design: the system level,
the behavioral or algorithmic level, the register-transfer level (RTL), the gate level, the
transistor level, the layout level [33].
Until the late 1960s, designs of integrated circuits were developed, optimized and laid out
by hand. First tools for developers were oriented at simulation – first gate-level simulation
tools became available in the 1970s, followed by cycle-based simulation tools a few years
later. Simulation tools were followed by automated place & route, schematic capture, formal
verification and static timing analysis in the 1980s.
When it became no longer feasible to manually design a system on the gate level, let
alone on the transistor or physical level, manual work was substituted by an automated
process called logic synthesis. Electronic design automation (EDA) tools were used to gen-
erate netlists from RTL descriptions captured in hardware description languages (HDLs),
such as Verilog (1986) and VHDL (1987).
2.1.1 Synthesizable Subsets of Hardware Description Languages
Since their appearance in the late 1980s, VHDL and Verilog have become
”
standard“ in the
industry. They enabled wide adoption of simulation tools and later RTL synthesis tools as
well.
There are three levels of abstraction in HDLs [23]. Each of them is characterized by its
timing concept:
• The structural level is used to capture the structure of a design from gates to com-
ponents and can be directly visualized as a schematic. The basic timing unit is the
5
delay. Gates, devices and interconnecting wires are characterized by their delays. The
minimum clock period is constrained by the longest combinational path between two
storage elements.
• The register-transfer level can be viewed as the clock-cycle level: events like signal
value assignments are usually related to the edges of the clock signal. As the name
suggests, an RTL description is generally formed of a set of registers and operators.
Datapaths and control logic are represented by synthesis tools as boolean equations,
finite state machines (FSMs) and binary decision diagrams (BDDs).
• In contrast, the design at the behavioral level is specified in terms of computational
steps — it is event-driven with timing in seconds rather than clock cycles. A com-
putational step can take several clock cycles. Behavioral subsets of HDLs include
language conctructs for timing information that are not synthesizable by standard
RTL synthesis tools (for example after 1ns or wait 1ns). A typical behavioral de-
scription is a set of protocols for communication with other systems. In
”
standard“
RTL design, the behavioral features are used primarily for simulation and testing (in
testbenches) – to produce inputs for other components described in RTL.
Traditionally, only the structural and the RTL HDLs were synthesizable, however, the
possibility of lower-level code generation from the behavioral HDLs was researched as well.
The process is called behavioral synthesis. Hopeful at the beginning (see for example Behav-
ioral Synthesis and Component Reuse With Vhdl [23]), it failed to meet the expectations,
see section [?].
2.1.2 Motivation for Higher Levels of Abstraction
Although the capacity of chips has been growing in accordance with the Moore’s law, EDA
tools have not kept up with the pace. In the early 2000s, it was becoming evident that the
biggest threat to the growth of the semiconductor industry was the increasing design cost
caused mainly by the so-called productivity gap [37]. The capacity of chips was growing
58 % per year, but the designers productivity was growing only 21 % per year. According
to Intel, a state-of-the-art chip of 1981 contained 10,000 transistors and took 100 designer-
months to develop at the cost of $1M. That is 100 transistors per month with 100 dollars
spent on 1 transistor. A 2002 leading edge chip contained 150,000,000 transistors and took
30,000 designer-months and cost $300M to develop. That is 5,000 transistors per month –
50 times more transistors designed per month than in 1981. However, the cost incresed 300
times over the same period.
If the companies wanted to push their designs to the limits of what was possible to
achieve with the available capacity of chips, they had to boost their productivity with a
different, more cost-effective approach to their design process.
2.2 Evolution of High-Level Synthesis
The general idea of an automatic transformation of a high-level functional specification
into a circuit implementation arose naturally with the increasing design complexity. The
process is called high-level synthesis (HLS) [31, 15, 33, 14].
The inputs of HLS are a behavioral specification of the system and a set of constraints
(speed, space, cost, power-consumption, available resources, testability requirements and so
6
on), and the aim of the synthesis process is to generate an RTL description of the circuit that
implements the specified behaviour and satisfies all the constraints. The input specification
describes the required mappings from sequences of inputs to sequences of outputs, which
define the behavior of the system, i.e. the way the system interacts with the outside world.
The design of the internal structure should be constrained as little as possible to make the
description high-level.
The functional specification is usually untimed: The function receives all its input data
simultaneously, performs all computations without any delay and outputs all its results
simultaneously. A high-level synthesizer then transforms the untimed (or partially timed)
specification into a fully timed implementation. Datatypes used at this level are not re-
lated to the hardware design domain. Typically, integers, floating point types, arrays and
structures are used instead of bits and bit vectors. To obtain a realistic hardware implemen-
tation, a synthesizer has to perform a conversion of these types into bit-accurate datatypes
with a fixed bit-length and acceptable accuracy.
The output of the synthesis process is an RTL structure formed of a datapath (a network
of registers, functional units and interconnecting wires, buses and multiplexers) and control
logic (usually in the form of a FSM) that drives the data through the datapath. The next
step after high-level synthesis is done is usually logic synthesis with place & route.
A high-level synthesizer has to take into account a lot of different constraints. There is
not a single
”
right“ way how to generate the micro-architecture of the design. The lower the
latency of the circuit implementation, the higher the amount of resources needed, and vice
versa. There are many criteria to assess the optimality of a generated circuit. Therefore it
is a multi-objective optimization problem and it is up to the user/designer to decide, which
of the solutions produced by an EDA tool is considered superior.
High-level synthesis was inspired by successful expansion of high-level programming
languages. When the low abstraction level of Assembly languages became insufficient, they
were overtaken by higher-level platform-independent languages like Pascal or C. The same
idea drove the developers of early HLS tools.
2.2.1 Early Research Tools – the First Generation
Before VHDL and Verilog even existed, early efforts were made on HLS. In the late 1970s,
researches at Carnegie Mellon University built a pioneering tool, Carnegie-Mellon Univer-
sity design automation (CMU-DA). It used the ISPS language (instruction set processor
specification) as the input language and the so-called Value Trace for intermediate repre-
sentation of the data flow. The innovative approach provoked a lot of interest from the
research community. In the 1980s and early 1990s, a lot of other teams contributed with
their attempts on HLS. Examples of academic projects include ADAM, HAL, MIMOLA or
Hercules/Hebe, commercialized efforts include Cathedral, Cathedral-II, Yorktown Silicon
Compiler and many other tools. They form what is called the first generation of high-level
synthesis tools [29]. An in-depth overview of their contribution to the HLS knowledge base
can be found in the contemporary paper Tutorial on High-Level Synthesis [31] from 1988.
Although the outcomes of these early research projects were not turned into commer-
cially successful tools at the time, they formed the basic know-how, upon which the later
tools built their future success. The original partition of the synthesis process into subtasks
is still preserved and many pioneering algorithms are still used.
7
2.2.2 Reasons Behind the Slow Adoption of the First Tools
There are several reasons why the first generation tools failed to gain wider adoption: a
lack of demand due to immature RTL synthesis at the time, unfamiliar input languages,
domain-specific nature, and poor quality of results [29].
During the 1980s and early 1990s, hardware design was undergoing significant changes.
Automatic place & route made a revolution in backend design, and the acceptance of Verilog
and VHDL enabled wide adoption of simulation tools. RTL synthesis was just beginning
to receive popular attention. Under these circumstances, it was unprobable that hardware
designers would embrace a new approach when they were just learning to use RTL, all the
less so when the early HLS tools used obscure input languages that the designers were not
familiar with, such as the ISPS language in CMU-DA, HardwareC in Hercules, or Silage in
Cathedral and Cathedral-II [13].
The early HLS tools were targeted at ASIC design, but the quality of their results did
not meet the standards expected by ASIC designers. The tools had simple architectures,
expensive allocation and simplistic schedulers. They took away the controllability, but
failed to produce convincing results.
2.2.3 Behavioral Synthesis from HDLs – the Second Generation
The expansion of RTL-based design flows and hardware description languages in the 1990s
made the situation in hardware design more favorable for new high-level tools. New synthe-
sizers (inspired by the wide adoption of HDLs) focused on descriptions written in behavioral
VHDL and Verilog. Several behavioral synthesis tools have emerged between 1994 and the
early 2000s. Although they received wide attention from the community, they were not as
commercially successful as expected [13, 29].
In 1994, Synopsys introduced their tool called Behavioral Compiler, which produced
RTL implementations from behavioral VHDL and Verilog. It used partially timed pro-
cesses as the abstraction level. It was able to determine the types and quantities of resources
needed to implement an algorithm and minimize the requirements by indentifying candi-
dates for resource-sharing. It also included many novel approaches (like timing constraints
handling and others) [1].
Other behavioral synthesizers included Monet from Menthor Graphics from 1997 and
Visual Architect from Cadence. All three of them were eventually withdrawn and replaced
by the next generation of tools, which abandoned behavioral HDLs in favor of C-like high-
level languages.
2.2.4 Limitations of Behavioral Synthesis
The limitation of this generation of synthesis tools came from the fact that they expected
specifications written in behavioral HDLs as input, although system models and initial im-
plementations of algorithms were often written in C, Matlab or other high-level languages.
Models and prototypes therefore had to be manually re-written to behavioral HDLs. Sec-
ondly, the origin of HDLs is closely connected with simulation – they were primarily designed
as simulation languages and later converted into RTL synthesis languages, but still they
were more of an input language for a discrete event simulator than a high-level language
for algorithm capture [15]. And thirdly, as behavioral HDLs were not common among algo-
rithm and system designers, they required steep learning curves and the level of abstraction
they offered was not that much higher than that of RTL HDLs [13].
8
Not only the languages but also the tools required a considerable level of expertise: many
additional inputs from the designer were needed for a transition to RTL. As a consequence,
designers found it difficult to control the synthesis process and the results produced by
the tools seemed unpredictable and often unsatisfactory. The second generation tools were
mostly dataflow-oriented and produced poor results for control-dominated designs.
Furthemore, validation and verification were not targeted in these tools, which again
contributed to poor applicability to the ASIC design domain.
Although the tools received wide attention at first, the steep learning curves and the
complexity of the EDA tools in combination with only marginally raised level of abstraction
deterred their potential users. The developers instead sticked to their well-known RTL. As
a result, a different approach was taken: instead of a transition from one level of an HDL
to a lower level of the same language, the developers of synthesis tools decided to replace
behavioral HDLs by high-level programming languages such as C/C++ and Matlab, which
had already been popular for modeling and prototyping.
2.3 Modern High-Level Synthesis Tools
The ongoing development in the field have brought to light a wide range of industry and
academia-initiated HLS tools that can generate RTL descriptions from many different high-
level languages: C/C++, SystemC, SystemVerilog, MATLAB, Haskell, and many domain-
specific languages. The biggest boost in the field came in the early 2000s with the onset and
commercial success of C-based synthesis tools. As of 2013, major commercially successful
HLS tools include:
• Catapult C by Calypto1 (formerly by Menthor Graphics) uses pure untimed ANSI
C++ for algorithm capture (the code is restricted to be statically determinable),
arbitrary-length bit-accurate datatypes (ac_int<W,S>, ac_fixed<W,I,S,Q,O>) for
interfaces and directives for user-defined constraints. More recently, a support for
SystemC has been added.
• Vivado Design Suite by Xilinx2 provides tools for HLS of input code in C, C++
and SystemC (Vivado High-Level Synthesis), and for digital signal processing with
models based on MATLAB/Simulink (System Generator for DSP). The latest release
(2013.1) also includes IP Integrator tool for high-level design of IP cores. Vivado HLS
know-how is based on AutoPilot by AutoESL, acquired by Xilinx in 2011, while MAT-
LAB/Simulink synthesis functionality comes from former AccelDSP by AccelChip,
acquired by Xilinx in 2006.
• C-to-Silicon Compiler by Cadence3 synthesizes untimed C/C++/SystemC with con-
structs including templates, classes, user-defined types, and certain types of pointers.
• HDL Coder by MathWorks4 synthesizes Matlab scripts into RTL.
• Cynthesizer by Forte5 specializes in ESL-to-RTL with TLM models in SystemC.
1 http://calypto.com/en/products/catapult/catapult overview
2 http://www.xilinx.com/products/design-tools/vivado/index.htm
3 http://www.cadence.com/products/sd/silicon compiler/pages/default.aspx
4 http://www.mathworks.co.uk/products/hdl-coder/index.html
5 http://www.forteds.com/products/cynthesizer.asp
9
• Synphony C Compiler and Synphony Model Compiler by Synopsys6 compile C/C++
and high-level fixed-point models in Simulink/MATLAB, respectively.
• CyberWorkBench by NEC7 synthesizes ANSI C and SystemC.
• Bluespec HLS toolset8 produces RTL Verilog and SystemC from specifications in BSV
(Bluespec SystemVerilog), an explicitly parallel language based on atomic transactions
and oriented on complex concurrent behavior in modern SoCs.
Although these commercial tools usually protect details about their inner implementa-
tion and even restrict their users from publicly comparing benchmark results, some (possi-
bly slightly outdated) information about the synthesis process in Catapult C, Cynthesizer,
NEC and Bluespec can be found in High-Level Synthesis: from Algorithm to Digital Circuit
(chapters 3, 5, 7, 8) [15]. The book also covers PICO, the predecessor of Synphony C
Compiler, and AutoPilot, the predecessor of Vivado HLS (chapters 4, 6).
2.3.1 Synthesis from C/C++ and SystemC
Modern tools’ focus on C-like languages makes high-level synthesis more accessible to al-
gorithm and system designers. High-level languages like C/C++, Matlab/Simulink or –
especially – SystemC also enable easier hardware/software co-design and system parti-
tioning, because both hardware and software can be built using the same model and the
designer can quickly evaluate, which hardware/software partitioning of functions is more
convenient. Synthesizers of C-based languages can also make use of the latest advancements
in technologies for parallelization and optimization in software compilers.
On the other hand, C/C++ face criticism [13] for not being suitable for algorithms with
task and data parallelism. They also lack built-in support for bit-accurate datatypes, timing
issues, concurrency and synchronization, or hierarchy definition. Unfortunately, they not
only miss some hardware-critical constructs, but also contain some problematic constructs,
which have no reasonable hardware counterparts and complicate synthesis, like pointers,
dynamic memory management, virtual functions, recursion and polymorphism.
Modern C-based high-level synthesis languages try to cope with these shortcomings
with additional extensions and restrictions. Concurrency, timing and other constraints
are addressed by either language extensions to C/C++ (HardwareC, SpecC, Handel-C),
specialized libraries (SystemC), or compiler directives. Many commercial HLS tools accept
a subset of pure ANSI C/C++ along with pragmas and directives to define constraints,
which can be not only synthesized into hardware, but also compiled by standard software
compilers into software. An advantage of this approach is that the code can be used for
hardware/software co-simulation without rewriting. Simulation of ANSI C/C++ is also
much quicker than, for example, that of SystemC.
SystemC is a language embedded into C++. It is implemented as a library of C++
classes and templates for system-level design and verification. It spans multiple levels of
abstraction – it can be used for cycle-accurate modeling as well as for transaction-level
modeling (TLM) – and both hardware and software domain. SystemC is particularly use-
ful for system partitioning, where it helps the designer evaluate which blocks should be
implemented in software and which should be synthesized into hardware. Modeling and
6 http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/default.aspx
7 http://www.nec.com/en/global/prod/cwb/index.html
8 http://www.bluespec.com/high-level-synthesis-tools.html
10
exploration of different architectures is one of the key purposes of TLM. Virtual hard-
ware/software platforms described in SystemC TLMs are also used for early development
of embedded software (before the hardware part is manufactured and physically available)
and for functional verification.
2.3.2 Functional Verification
In RTL synthesis, designers have to produce a testbench for every designed circuit to verify
its functionality. A tested circuit is called DUT (device under test), and its description
is written in RTL HDL, whilst the testbench is written in behavioral HDL. Sequential
statements such as wait are used to set input signals of the DUT to all possible values, and
for each combination, an output is read. Both values are compared. In C-based high-level
synthesis, however, the testbench is written in the same high-level language as the DUT,
i.e. C/C++.
Firstly, the testbench is used for a direct simulation of the C/C++ input code. The
results obtained in the simulation are either used for validation of the C/C++ code and
compared against a golden reference set of outputs, or for generation of such a reference
set from an already validated C/C++ code. An HLS tool then uses the same C/C++
testbench to verify the generated RTL code through equivalence checking.
Figure 2.1: Catapult C: The same C++ testbench can be used both for the C++ code and
for the synthesized RTL implementation [15].
For example in Xilinx Vivado, both the C validation and the RTL verification take
place. The testbench is a C file comprised of a main() function and at least one other
function (possibly more). One of the non-main functions is the
”
top-level function“ for
synthesis, i.e. the function that is being synthesized. Every function below the top-level
function in the function hierarchy is synthesized as well. The rest of the testbench file helps
test the top-level function. In the main() function, the top-level function is called and its
results are saved into an output file. This file is than compared against a golden reference
output file to validate the C implementation.
When the top-level function is synthesized into a corresponding RTL code (in VHDL,
Verilog and cycle-accurate SystemC), the user selects one of the implementations and the
11
rest of the verification process is executed automatically. The RTL code is wrapped by
transactors that enable communication with the testbench. The RTL implementation is
co-simulated with the testbench using an external tool for RTL simulation. The testbench
produces stimuli for the simulation and the DUT responds with result, which are collected
into an output file and compared with the values from the golden set to verify that the
functionality has stayed the same after the synthesis.
2.4 High-Level Synthesis Process
The synthesis process consists of several tasks. This section contains a description of the
key subtasks, their complexity (a lot of them are NP-hard problems) and fundamental
algorithms used to implement these subtasks (with an emphasis on scheduling and binding
tasks).
Figure 2.2: A decomposition of the synthesis process into subtasks, based on An Introduc-
tion to High-Level Synthesis [14] with modifications [33].
2.4.1 Decomposition into Subtasks
Generally, the synthesis task is decomposed into a series of steps:
• code compilation and transformation (optimizations),
• operation scheduling,
• resource allocation (or selection),
• binding (or assignment),
• and controller generation.
This decomposition was introduced by the first researches for problem solving purposes, but
the core subtasks are highly interdependent in nature (especially scheduling and allocation,
and allocation and binding), and should be performed concurrently.
12
Internal Representation
The first step is usually the compilation of the functional specification into an internal
partially-ordered formal description that can expose concurrency in the input code. The
internal representation is usually a parse tree or, more often, a graph. Data flow graphs
(DFG), control flow graphs (CFG) or their combination, control-data flow graphs (CDFG)
are used. A control flow graph is derived directly from the explicit order of statements in the
input code. A data flow graph represents the ordering imposed by the data relations between
operations. Nodes in a DFG represent operations, and edges represent input, output and
intermediate variables (i.e. the data). Control dependencies can be removed at compile
time: For-loops can be unrolled to form a noniterative block of code, and branches can be
resolved by multiplexed values. However, DFGs cannot represent loops with unbounded
iterations (while-loops) or control statements such as goto. Also, data-dependency graphs
tend to be disjoint, which complicates the graph traversal during the synthesis process. For
these reasons, data flow graphs are often combined with nodes and edges representing flow
of control to form control-data flow graphs.
Figure 2.3: The HAL benchmark [33] (a) and its unscheduled DFG (b).
When the internal representation is build, the tool can perform some optimizations on it.
Techniques commonly used in software compilers such as dead-code elimination, constant
propagation, common subexpression extraction, inline expansion of procedures and loop
unrolling can be performed.
Scheduling
Scheduling means assigning operations (nodes in a CDFG) to so-called control steps. A
control step is a basic timing unit in synchronous systems. It usually corresponds to a
clock-cycle in hardware. The aim of scheduling is to find the minimal sequence of control
steps needed to complete the desired algorithm with a limited number of available hardware
resources. When an operation depends on the result of another operation, they cannot be
scheduled to the same control step. On the other hand, when two operations are indepen-
dent, they can share the same control step as long as there are available hardware resources
for both operations to be performed at the same time. As different operations take dif-
ferent amounts of time, an operation can be scheduled to one clock-cycle, over multiple
clock-cycles or it can be chained with other operations within the same clock-cycle.
13
Allocation
Allocation or selection is the process of choosing the types and the numbers of hardware
resources needed to implement operations in the input CDFG according to the design
constraints. Memory units (registers) and interconnects are also allocated at this stage,
although some components may be added during scheduling and binding tasks. The com-
ponents are selected from an RTL library (module selection) and component characteristics
(such as latency, throughput or power consumption) provided in the library are used by
other synthesis subtasks (especially by the scheduler). Communication paths can be imple-
mented as simple wires with multiuplexers, buses or a combination of both.
In fixed resource constrained designs, allocation is performed before scheduling. Re-
sources are allocated first to meet the constraints and then scheduling minimizes the latency
and maximizes the throughput. Resource-constrained approaches are used for applications
where the datapath architecture has to be defined in a certain way, for example to fit on
an FPGA.
In fixed latency constrained designs, allocation is performed during scheduling and
additional units are allocated according to how many units of what type are needed to get a
certain throuhtput/latency, for example in multimedia and telecommunication applications.
Binding
Binding is the process of mapping operations onto instances of functional units, assigning
intermediate variables to memory units and mapping data transfers onto a set of inter-
connection units for data routing (wires, buses, multiplexors). The aim is to minimize the
amount of hardware resources needed. The types and numbers of resources determined dur-
ing allocation are taken as input by the binding algorithm. However, additional resources
(for example registers or multiplexers) may be allocated during the binding process as well.
In the functional units binding, operations scheduled to different control steps can share
the same functional unit. Naturally, some functional units are specialized to perform only
some operations (for example adders, multipliers, dividers), therefore these operations can
only be bound to functional units capable of executing the operations. If more units of that
type are available, the algorithm has to optimize the choice. Mutually exclusive operations
performed by some common functional unit form a group, and the binding algorithm tries
to find the minimum number of groups needed to cover all operations, thus minimizing the
number of functional units.
The aim of storage binding is to minimize the number of memory units used to store
intermediate results between control steps. Each intermediate result (variable) has to be
assigned to a storage element. Variables whose lifespans do not overlap can share the same
memory unit.
Another criterion for functional units assignment and storage binding is the complexity
of the communication paths between the functional units and the intermediate registers.
Output Generation
A schedule can be viewed as temporal partitioning of an input data-flow graph, while
binding can be viewed as spatial partitioning of a DFG. The result of scheduling, allocation
and binding of a DFG is a structure of the datapath. Apart from the datapath and the
components on it, a controller that will drive the data through the datapath needs to
be synthesized as weel. Each control step of the schedule corresponds to a state in the
14
Figure 2.4: First, an input data-flow graph (a) is divided into control steps, which form a so-
called schedule (b). For each operation in the DFG, a functional unit is allocated. Registers
are allocated for variables carrying value across control steps (c). Finally, instances of
functional units and registers are chosen by the binding angorithm during the binding
phase (d) [33].
controlling FSM, and the controller produces signals for data routing and possibly receives
status signals from the components on the datapath.
2.4.2 Complexity of the Subtasks
The whole process of high-level synthesis can be described as a transition between a pro-
cedural high-level code and an RTL description consisting of a datapath and a datapath
controller. The major problem of this process is the enormously large number of design
possibilities, from which the synthesizer tries to pick the solution that is as near as possible
to the optimal design. Unfortunately, many synthesis subtasks are known to be NP-hard,
for example scheduling with limited resources or register binding with limited number of
registers. Yet there are several such tasks, and they are interdependent.
To be able to find the best possible schedule, the scheduler needs to know which opera-
tions will be executed on the same functional unit and what the delay of this unit is. During
allocation, however, one must know which operations will be scheduled to the same control
step (and performed in parallel) to determine, how many functional units are needed. This
circular depencency complicates the whole process.
Generally, the first generation tools introduced several approaches [31, 15] to reduce the
complexity of the cooperation: The easiest approach to implement is to limit the number of
functional units the scheduler can use at one time (fixed resource constrained designs) or to
limit the maximal latency on the longest path (fixed latency constrained designs). Iteration
of the whole scheduling process yields better results. After each iteration, the limits are
changed according to the results of the previous iteration, and the design is rescheduled
until a satisfactory design has been found. The reset of the limit and the assessment of the
results can be done under the supervision of the user (as in MIMOLA) or an expert system
with a feedback loop (as in Chippe).
The schedule and resource requirements can be developed simultaneously from an empty
schedule and an empty list of resources by adding new units when an operation scheduled
to a certain control step cannot be allocated to an existing unit. Again, this process can be
iterated (as in the HAL system). The opposite approach is taken in the Yorktown Silicon
15
Compiler, where each operation is allocated to a separate unit and one control step at first.
Then additional control steps are added to meet all dependency constraints, and resource
sharing is introduced by coupling operations from different control steps together.
2.4.3 Fundamental Algorithms
The main occupation of HLS researchers was (and probably still is) the scheduling process,
therefore scheduling algorithms have got a prominent place in most research papers on HLS
[31, 15]. A nice overview of both early and modern techniques can be found in the chapter
2 (High-Level Syntehsis Fundamentals) of Low-Power High-Level Synthesis for Nanoscale
CMOS Circuits by Mohanty et al. [33] and in the chapter 13 (Operation Scheduling:
Algorithms and Applications) of High-Level Synthesis: from Algorithm to Digital Circuit
[15].
Exhaustive Search
Exhaustive search is one of the oldest methods developed for scheduling. It begins with a
default schedule, usually either maximally parallel or maximally serial, and transforms it by
moving serial operations in the same control step or parallel operations in different control
steps. It tries all possible combinations of these transformations and obtain all possible
schedules, from which the optimal solution is chosen. The time-consuming assessment
process can be somewhat sped up by branch-and-bound techniques, which cut off the search
along any paths that are found to be suboptimal. Although exhaustive search guarantees
to find the optimum, it is not feasible for non-trivial designs.
A different approach to the schedule-space search is to use a heuristic that will choose
promising transformations to reach the best solution more quickly. This method was imple-
mented in the Yorktown Silicon Compiler, where paths were scheduled using backtracking
and then overlapped by a greedy heuristic. Later, an exhaustive version of the path-based
approach was developed and called as-fast-as-possible (AFAP) scheduling [5]. The method
is useful for graphs with conditional branches in control dominated applications. It produces
the fastest possible solution for a resource-constrained schedule.
Since scheduling is in fact an optimization problem, standard optimization methods can
be used as well. Integer linear programming (ILP) can be used to find the exact optimal
solution for both time and resource constrained problems. It uses a branch-and-bound
search algorith with backtracking and to reduce the search space, it uses ASAP and ALAP
algorithms (see below) to calculate the mobility range for each operation. Its worst-case
time complexity is exponentinal, though.
ASAP and ALAP Algorithms
The simplest case of scheduling occurs when the number of resources is not constrained, and
the only goal is to minimize the latency. Operations are scheduled to the earliest possible
control step according to the topological order of the corresponding nodes in a data flow
graph. The method is called as-soon-as-possible (ASAP) scheduling. An operation is
scheduled to the next control step when all of its predecessors have already been scheduled.
The longest path from the root of the DFG to a node determins the earliest moment when
the operation corresponding to the node is allowed to start (the lower bound for its starting
time). Moreover, the longest path through the schedule corresponds to the lower bound of
the overall application latency.
16
The opposite approach is called as-late-as-possible (ALAP) scheduling. With a given
latency limit, each operation is scheduled to the latest possible control step. The latency
limit can be computed by the ASAP algorithm first. The longest path between an operation
node and the end node (often called the sink node) limits the latest opportunity to schedule
the operation (the upper bound for its starting time).
Figure 2.5: The ASAP (a) and the ALAP (b) algorithms find the upper and lower bounds
of each operation.
The ASAP and the ALAP time stamps for each node determine the lower and the upper
bounds on possible control steps to which each operation can be scheduled. The difference
between these bounds is called the operation mobility.
Although these two algorithms are too simple to be used as the main scheduling method,
they form an essential part of many advanced algorithms. A resource-constrained version of
the ASAP algorithm exists as well, and was used in CMU-DA and MIMOLA. The resource-
constrained version, however, does not guarantee to find the shortest schedule. Operations
on the critical path have no priority. As a consequence, less critical operations can occupy
all available resources and force the next-on-line critical operation to wait for the next
control step.
List Scheduling
The lack of priority in the constrained version of the ASAP algorithm is solved in the list
scheduling algorithm. For each control step, a list of operations available to be scheduled
(a
”
ready list“) is kept. An operation becomes available, when all of its predecessors have
already been scheduled, as in ASAP. The list is ordered by a priority function. Operations
with higher priorities are scheduled first. When two operations from the ready list share
the same priority, one is picked randomly. When all resources needed by an operation
have already been occupied, it is deferred to the next step. At the beginning of the next
step, available operations are collected into a new list and ordered. The process continues
until all the operations have been scheduled and an optimal schedule has been found. The
scheduler always generates feasible solutions.
The result of list scheduling strongly depends on its priority function. Different priority
functions have been used, for example the ALAP value (the length of the path between the
operation and the end node) or the mobility.
17
Force-Directed Scheduling
Force-directed scheduling (FDS) was developed for the HAL system. The aim of FDS is to
reduce the amount of resources. It is time-constrained. It tries to achieve a uniform distri-
bution of operations on functional units to maximize unit utilization. FDS is a constructive
method — it starts with an empty schedule and iteratively fills it, one operation at a time.
It does not perform any backtracking. Scheduling decisions are made in a greedy manner.
The method uses the ASAP and the ALAP algorithms to find the range of feasible con-
trol steps for each operation. These control steps form the time frame of the operation and
are considered equally probable. The probabilities are used to form a so-called distribution
graph. The graph shows, for each control step, an estimation on the number of resources
needed in that control step.
The FDS algorithm tries to balance the distribution as much as possible. It computes
the effects of scheduling each unscheduled operation into each control step from its time
frame and picks the operation and the control step with the smallest negative effect on
the distribution. A tentative assignment of an operation to a certain control step affects
the time frames of its predecessor and successor operations. A new time frame has to be
computed for all possibly affected operations.
As the method computes the effects of scheduling each unscheduled operation to each
control step from its time frame and the effects of that scheduling on each predecessor/suc-
cessor, and the whole process is repeated in every iteration, the worst-case time complexity
of the algorithm is cubic. As a result, the algorithm is too time-consuming for larger designs.
Binding Algorithms — Constructive Heuristics vs. Graph Problems
Binding involves three subtasks: binding operations to instances of functional units, binding
values to instances of storage units and binding data transfers to instances of buses, wires
and multiplexors. The common optimization goals of allocation and binding are to minimize
total cost of functional units, registers, bus drivers and multiplexors, total interconnect
length, critical path delays and power dissipation.
Binding techniques can be divided into two groups:
• iterative/constructive heuristics, which select candidates for assignment one by one
according to some global metric or local selection rules
• and global optimization methods, usually borrowed from the graph theory (clique
finding problem, graph coloring problem) or mathematical programming (ILP).
2.5 Advantages and Disadvantages of High-Level Synthesis
High-level synthesis has many advantages over RTL synthesis: a shorter design cycle, fewer
errors and a better readability of code to name a few [31]. The main advantage (which
implies the advantages named above) is probably quicker and more straight-forward verifi-
cation.
18
Chapter 3
Functional Languages
Functional languages use a different paradigm than imperative and object oriented lan-
guages like C, C++ or Java. The theory is based on a formalism called lambda calculus
and the first practical language based on it was Lisp back in 1958. Since than, functional
languages are gaining on popularity. The basic language constructs the programs are com-
posed of are functions. Functions in pure functional languages are different from functions
in procedural languages, as they have no side effects and have a form of equations, where
the left-hand side of the equation is equal to the right-hand side of the equation. As with
mathematical equations, it is not important in which order the equations are going to be
evaluated. In terms of hardware description, this behaviour is similar to VHDL signals,
that are as well evaluated
”
simultaneously“, in other words they are all considered valid at
the same time.
Another resemblance to HDLs comes from the parallel nature of such constructs: In
hardware, all signals have a valid value at the same time and multiple gates are evaluated
in parallel. In functional programming, there are functions that take a list of inputs and
return a list of outputs. The whole list of outputs is
”
ready“ at the same time.
The mathematical nature of functions in functional programming also means that a
proof of correctness of the program can be obtained, which is a key feature in functional
HDLs development and the main motivation for the research in the field. In terms of
hardware synthesis, a proof can be made that standard evaluation of a functional program
describing a circuit yields the same results as symbolic evaluation of the same program,
similar to what Catapult C and other tools do during the verification of RTL code generated
by high-level synthesis (see figure 2.1). The output of hardware synthesis can then be
compared with results obtained through simulation.
Suitability of functional languages as HDLs is discussed in Hardware Design and Func-
tional Programming: A Perfect Match[41].
3.1 Haskell
This section is based on the book
”
Real World Haskell“ from O’Reilly [34].
Haskell is a purely functional language – standard Haskell functions do not have side
effects. However, when side effects are needed (basic IO functions, network communica-
tion. . .), a construct called
”
Monad“ can be used.
It is a strongly typed language and it uses type classes for a hierarchy between types.
User can define both types and type classes. A type class is a group of types with a common
19
characteristic, for example enumerable types belong to Enum type class or types with a
total ordering belong to Ord type class.
Another key feature of the language is called
”
lazy evaluation“. Expressions are eval-
uated only when the result of the evaluation is actually needed. Thus, infinite lists and
other structures can
”
exist“, and as long as we do not try to get all items in them, we can
get just a portion (via standard functions like take, head and similar).
3.2 Higher-Order Functions
Higher-order functions are functions that take a function as a parameter. Some of these
functions are particularly useful for operations over collections, because they apply the input
function to each element of a given input collection, usually a list. The most common higher-
order functions can help the programmer get rid of recursive function definitions, because for
many recursive functions, there is a suitable higher-order function in the standard library.
The four most common functions for data-parallel aplications are decribed below.
3.2.1 Map and ZipWith
The map function takes an unary function as an argument and applies it to all elements of
a list, while zipWith does the same with a binary function and two input lists.
f f f f map f xs
xs
xs'
f zipWith f xs ys
xs
zs
f f f
ys
Figure 3.1: Map and ZipWith
3.2.2 Fold and Scan
Reduction fold{l,r} and prefix sum scan{l,r} reduce a list to a scalar value, or list of
sums for all prefixes of the input list, respectively. There are usually two variants of folds
and scans: the left and the right variants (denoted by r or l). Figure 3.2 shows the right
variant of both functions.
These four functions are frequently used in high-level functional languages for data-
parallel programming and will be present in all following parts of this thesis.
The next three sections are devoted to functional hardware description languages. There
are several languages for descriptions on different levels of abstraction: sobe of them are
aimed at the low-level RTL and structural descriptions (such as Lava), other languages
specialize in functional verification (such as reFLect) and try to take advantage of their
pure functional nature, which makes verification and validation of circuits described in
20
ffoldr f 0 xs
xs
y
f
f
f
0
}
f
scanr f 0 xs
xs
ys
f
f
f
0
}
Figure 3.2: Fold and Scan
functional HDLs more straightforward than with C/C++. There are also a few system-
level functional languages for the electronic system level design (ESL). One of these projects
has already been mentioned: Bluespec is a tool which uses Bluespec System Verilog (BSV),
although it was originally based on Haskell. Another system level tool and language is
ForSyDe, briefly presented in the last section of this chapter.
3.3 Lava
Lava is a pure functional language embedded in Haskell. It is also a library of Haskell
functions and datatypes for hardware description, tools for simulation and generators of
VDHL and netlists. It was developed at the Chalmers University of Technology in Sweden
[2]. It supports two different types of circuit interpretations: standard and symbolic. These
interpretations implement three types of circuit analyses: simulation, verification and syn-
thesis, although the language was designed mainly to become an experimental platform for
formal verification. The output of the tool is either a result of the simulation of the tested
circuit with a given set of inputs, or a set of descriptions in VHDL or a format readable by
formal verification tools.
There are two versions of Lava by the Chalmers University of Technology team: The
newer version of Lava by the team at Chalmers is sometimes referred to as Chalmers Lava
[40] (the original version being called simply
”
Lava“). The difference between these two
versions is in the way they implement the detection of re-used subcircuits. In the monadic
approach, a circuit is encapsulated in a monad and a composition of circuits is based
on monad compositors. The newer version uses ordinary Haskell functions and a newly
introduced approach called Observable Sharing (ObS).
3.3.1 The Original Lava
The original approach is based on monads. A monadic type contains pipelined statements
that are evaluated as a sequence, not in parallel. A simple half adder circuit (as described
in the most cited paper on the original Lava [2]) looks like this:
halfadd :: Circuit m => (Bit , Bit) -> m (Bit , Bit)
halfadd (a, b) = do
carry <- and2 (a, b)
21
sum <- xor2 (a. b)
return (carry , sum)
Although the computations of carry and sum are written in a sequential form, the
resulting VHDL does not reflect it and the expressions are translated to parallel signal
mapping. VHDL processes are not used.
However, the newer version described in [8] (a
”
hackage“ package exists online: [9] with
a recommended tutorial [11]) uses ordinary Haskell functions rather than monadic types:
halfadd :: (Signal Bool , Signal Bool) -> (Signal Bool , Signal Bool)
halfadd (a, b) = (sum , carry)
where
sum = xor2 (a, b)
carry = and2 (a,b)
The Lava library offers a few combinators that can be used to combine circuits to form
a bigger circuit with a regular structure (a row of blocks, a sequence of blocks, a grid. . .),
for example the half adder circuit can be used as a building block in an N-bit adder:
nbitadder = row halfadder
The function row describes a pattern where circuit blocks are placed in parallel with
a connection between neighbours, as shown on the Figure [?]. In terms of Haskell, it is a
higher-order function that takes a block function f as a parameter and recursively forms a
chain of blocks implementing the block function:
row f (carry , []) = ([], carry)
row f (carryIn , a:as) = (b:bs , carryOut)
where
(b, carry) = f (carryIn , a)
(bs, carryOut) = row f (carry , as)
The function above contains pattern matching, a key feature in Haskell, and recursive
definition, therefore it is a good example of how Haskell features can be used in hardware
desctiption tasks.
Figure 3.3: The regularity of circuits can be captured by higher-order functions in Haskell.
The schematic above shows a row of blocks implementing function f that can be modeled
by the row construct in Lava.
Lava is a pure structural language, it does not include syntax for behavioral descrip-
tion of hardware. Certain high-level features are provided by a library of higher-order
components.
22
The original Lava became inspiration to other projects. Satnam Singh [43] extended
the original monadic version to a version called Xilinx Lava, which is a specialised low-level
dialect for Xilinx FPGAs. It is strongly Xilinx-hardware oriented and takes into account
the layout of components on FPGAs (slices, LUTs. . .) for the generation of EDIF netlists.
Lava is used as a quick alternative to standard logic synthesis tools for the generation of
FPGA programming bitstreams for reconfigurable systems [42].
3.3.2 Observable Sharing
The problem with standard Haskell is that it is a lazy programming language that allows po-
tentially infinite structures. A circuit is a finite graph. When modelled by a lazy datatype,
it is indistinguishable from potentially infinite regular trees. Claessen and Sands [10] said
that:
”
implementations of Haskell do indeed represent cyclic structures by graphs. The
problem is that the sharing of nodes that creates such cycles is not observable by any func-
tion which traverses such a structure.“ They proposed a non-conservative (meaning some
properties of Haskell are potentially lost) library-driven extension to Haskell, the so-called
Observable Sharing that keeps the information about node sharing in immutable
”
reference
cells“. It detects cycles in data structures – feedback loops in circuits. An example of a
circuit with a feedback loop can be a latch that toggles its output (this example comes
from [10] where it is compared to a half adder circuit similar to the circuit presented at the
beginning of this section, which – unlike the sequential toggle circuit – does not encounter
the problems addressed in the paper):
toggle :: Signal
toggle = let output = int (latch output) in output
where Signal is a stream of bits implemented as a potentially infinite list of Booleans
and where the inverter and the latch are functions from a Signal to a Signal:
type Signal = [Bool]
inv :: Signal -> Signal
inv bits = map not bits
latch :: Signal -> Signal
latch bits = False : bits
When simulating the toggle circuit, the result is an infinite list [True, False, True, False,
True. . .]. Thanks to the lazy evaluation, we can view as many values as we want:
ghci > take 10 toggle -- view the first ten
[True ,False ,True ,False ,True ,False ,True ,False ,True ,False]
On the other hand, when we want to get just a symbolic representation (an expression
representing the function rather than concrete values of outputs) to generate VHDL or to
verify the circuit, it becomes a problem. If we redefine Signal to be either a variable name
(a wire) or the result of a component that has received its input values in a list
type Signal = Var String | Comp String [Signal]
And if we redefine the inverter and the latch to become components:
inv b = Comp "inv" [b]
latch b = Comp "latch" [b]
We will get an infinite structure:
ghci > toggle
Comp "inv" [Comp "latch" [Comp "inv" [Comp "latch" [Comp "inv" ...
23
And so on. . . There are, of course, only two components, one inverter and one latch, but
this fact cannot be
”
observed“ from outside. The Observable Sharing extension has been
designed to make the finite structure observable. Each component has its
”
reference“ and
although the circuit still creates a cyclic structure, it is now possible to create a function
that observes the identity of the components and generates the symbolic output accordingly.
The biggest advantage of this method is that standard Haskell functions can be used
instead of monads, which are considered a more advanced feature (and therefore less adopt-
able by wider audience), to describe circuits.
There are other extensions to Lava. A comparison of Lava dialects can be found in
[40]. The Observable Sharing method of the Chalmers Lava is replaced by Type Directed
Observable Sharing in Kansas Lava [18, 16], which inspired the creators of the Accelerate
library [7].
3.4 reFLect
The reFLect language [20] is a strongly typed functional language similar to ML with added
meta-programming constructs to represent expressions as unevaluated terms. Quotation
and antiquotation constructs are used to compose and decompose expressions in the reFLect
language itself, thus providing a form of reflection. These meta-programming operators
together with pattern matching over quoted expressions make the abstract syntax trees
(ASTs) of reFLect expressions available to the programmer for transformations.
reFLectt is aimed at applications for hardware design and verification: Hardware designs
are modeled as reFLect programs and the simulation of the designs is done by execution
of the programs. reFLect’s reflective ability can be used to modify or transform the AST
of the circuit model at runtime before evaluation, and for formal reasoning about the
circuit’s properties. The language is used both for the specification of properties and for
the implementation of a theorem prover for these specifications. reFLect was designed as
a part of the Forte hardware verification tool used by Intel [32], which combined model
checking with a theorem prover similar to the HOL system [19].
Expressions in the reFLect language can be quoted by enclosing them in ’〈〈’ and ’〉〉’
symbols. This quotation contruct provides the programmer with an access to the abstract
syntax tree of the expression in between the quotation marks, for example 〈〈 1 + 2 〉〉
captures the addition of 1 and 2. Although 1 + 2 in semantically equal to 3 (both expressions
denote the same integer value of 3), the expression 〈〈 1 + 2 〉〉 is semantically different from
〈〈 3 〉〉, because 〈〈 1 + 2 〉〉 and 〈〈 3 〉〉 are two different abstract syntax trees.
The antiquotation construct has quite the opposite meaning. By prexifing a term with
the ’ˆ’ symbol, the expression is raised one level outside the quotation marks. For example
in 〈〈 1 + ˆ〈〈2〉〉〉〉 the AST of 〈〈 2 〉〉 is splices into the position where the quotation of 〈〈 2
〉〉 have been, and reduced to 〈〈 1 + 2 〉〉.
Antiquotation is used in combination with pattern matching to write functions for term
manipulation (as needed in a theorem prover and for circuit transformations). For example
〈〈 ˆx + ˆy 〉〉 can be matched with 〈〈 1 + 2 〉〉 in a way that 〈〈 1 〉〉 goes for x and 〈〈 2
〉〉 for y. A decomposition function (with {| and |} for quotation and ‘ for antiquotation)
can look like this:
let decompose {| ‘x + ‘y |} = (x, y);
The previous example with splicing of term 〈〈2〉〉 into another quoted expression can be
used for embedding of terms:
24
let inc x = {| 1 + ‘x |};
reFLect’s quotation and antiquotation features provide a flexible framework for embed-
ding of both logic terms and domain-specific languages [35]. Although the project seems
to be abandoned since 2007, it was used as a base for an embedded DSL called Shade in
2008 [36].
3.5 ForSyDe
ForSyDe (Formal System Design)1 is a system-level design methodology for heterogenous
systems with hardware/software co-design. It has a formal basis and is aimed at modeling
of SoCs and embedded systems with the possibility of verification. There are two versions of
system-level modeling tools, one for Haskell and one for SystemC. Declarative specification
models can be refined into implementation models using transformations that preserve the
correctness of the specification [38]. The resultant model will be correct by construction.
High-level models consist of signals and processes. After the refinement process, they can
be synthesized into VHDL code or compiled to C for a microprocessor. An experimental
tool for CUDA C code generation is also available [4]. ForSyDe is a system-level tool, and
as such can be compared to Bluespec or SpecC.
1https://forsyde.ict.kth.se/trac/wiki/ForSyDe
25
Chapter 4
Data-Parallel Computations
Data-parallel computations are those computations where operations are executed in par-
allel over collections of data. It is a simple concept which first arose in connection with
SIMD (single instruction, multiple data) processors, where the same instruction was ex-
ecuted over several independent operands at once (for example over a part of a vector),
yielding a corresponding number of independent results. Data parallelism is deterministic
[27] – the same output will always yield the same result – as opposed to process or task
parallelism where asynchronous updates of shared data can result in nondeterministic be-
haviour and the output depends on external factors, for example relative timing between
tasks/processes.
Declarative data-parallel languages make use of the deterministic nature of computa-
tions and offer the programmer a pure, high-level view of the computation. In this chapter,
several functional data-parallel languages are presented, two of them (namely Obsidian and
Accelerate) in depth. As they are aimed at general-purpose GPU programming (GPGPU),
this chapter also includes a concise introduction to GPUs and the NVIDIA CUDA platform
and programming model.
4.1 Inspiration from the domain of GPUs
GPUs (Graphics processing units) are massively parallel multicore processors developed for
computer graphics. At the beginning, they were used for acceleration of texture mapping
and polygon rendering, and for quick manipulation with geometric primitives (translation
and rotation of vertices). Recently, programable shaders have been added for shading - the
production of appropriate levels of light and color within an image - and for special effects
and postprocessing. As these operations involve a lot of matrix and vector computations,
engineers and GPU vendors began to explore the possibility of general-purpose program-
ming of GPUs and their usage for scientific and engineering data-parallel compute-intensive
operations. The term
”
GPU computing“1 describes the cooperation of GPUs and CPUs
for acceleration. Sequential parts of an application run on a CPU, which consists of a small
number of cores optimized for sequential execution, while parallel compute-intensive parts
are oﬄoaded into a GPU. A GPU contains thousands of small cores optimized for efficient
execution of many threads in parallel.
1http://www.nvidia.com/object/what-is-gpu-computing.html
26
4.1.1 NVIDIA CUDA
In 2006, NVIDIA introduced the CUDA platform (Compute Unified Device Architecture)
[39], a general-purpose parallel computing platform and programming model for efficient
solving of complex computational tasks on NVIDIA GPUs. The word
”
unified“ means
that all the processing units of the architecture are of the same type, as oposed to the
previous generation of GPUs, which typically had two types of units – fragment and vertex
processors.
All the recent GeForce, Quadro and Tesla cards are CUDA-capable and can be used for
general-purpose programming. The implementation language for the CUDA environment is
C with extensions2. The programming model is based on thread hierarchy, shared memory
hierarchy and barrier synchronization.
CUDA programs consist of a CPU part – the host program – and one or more GPU
parts – kernels. Kernels are C functions defined using the __global__ declaration specifier.
When called, kernels are executed N times in parallel by N different CUDA threads (the
number of CUDA threads that execute a kernel for a given kernel call is specified using the
syntax extension <<<...>>> after the kernel function name). Threads are organised into
one-, two- or three-dimensional blocks, and multiple blocks form a grid. A thread can be
identified by its threadIdx within a block and by its blockIdx within the grid. Threads can
store values into their local memory or communicate with other threads within their own
block through a shared memory. Threads from different blocks can communicate through
the same global memory visible to all threads. Furthemore, all threads can access a read-
only constant memory and a read-only texture memory. The execution of all the threads
in a block can be synchronized with a barrier mechanism specified by the built-in function
__syncthreads().
Example
A kernel code for the familiar dotproduct function can look like this:
__global__ void dotproduct( int *a, int *b, int *c) {
__shared__ int product[N];
product[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
__syncthreads ();
if( 0 == threadIdx.x ) {
int sum = 0;
for( int i = 0; i < N; i++ ) sum += product[i];
*c = sum;
}
}
and is invoked in the host program by
dotproduct <<<numBlocks , threadsPerBlock >>>(dev_a , dev_b , dev_c );
where dev_a, dev_b and dev_c are pointers into the device memory. First, each thread
computes a product of two elements from a and b and saves it into a shared memory space.
Then all threads are synchronized at a barrier to ensure that the whole product vector is
ready. And then one thread sums up all intermediate results and stores the final result in
scalar c.
2http://docs.nvidia.com/cuda/cuda-c-programming-guide/
27
CUDA Threading Model
Computing on GPUs can be viewed as streaming or throughput computing and GPU cores
function as stream processors [47]. NVIDIA GPUs have a number of multiprocessors, which
execute in parallel. Each multiprocessor consists of several stream processors (cores) and a
low-latency, high-bandwidth shared memory. Each core can execute one sequential thread
(i.e. a list of sequential instructions), but the cores do not work independently. Rather, they
are grouped into so-called warps, which execute in what NVIDIA calls SIMT fashion (Single
Instruction, Multiple Threads). All cores in the same warp execute the same instruction at
the same time, in a similar way to classical SIMD processors. Warps from a single block are
executed on the same multiprocessor, sharing the same software data cache. As there are
possibly more threads than cores, there is a scheduler in each multiprocessor that switches
between warps.
When the threads in a warp perform a memory operation, it takes time due to memory
latency, and the whole warp stalls. Unlike CPUs, GPUs do not depend on a sofisticated
cache hierarchy for memory performance, because caching is ineffective for stream comput-
ing. Instead, GPUs rely on multithreading: To keep the cores busy when a warp needs to
wait for an operand, the scheduler selects another ready warp and switches to it.
As a consequence, best results are obtained, when there is enough parallelism in the
program to populate all the multiprocessors and to keep the cores busy. If a data-parallel
computation contains many more arithmetic operations than memory operations, memory
latencies can be effectively hiden.
SIMT processing also suffers from a phenomenon called thread divergence. Flow control
statements, such as if, for or while, are turned into branch instructions, which may cause
a situation where different threads follow different paths of execution. The execution of
both branches then might be serialized, which significantly degrades the performance of
the whole program. All the languages for GPU computing mentioned in this thesis (below)
have to deal with this problem. Fortunately, the situation in hardware synthesis is different,
as forks on datapaths just cause the datapaths to be divided. The division has impact on
the number of resources needed, but the overall latency of the circuit might not be affected.
To get the best performance out of a GPU, programs have to be written in a certain way
to leverage the potential of SIMT processing and to conform to the GPU memory model.
The device memory uses a very wide datapath, allowing multiple consecutive words to be
fetched at the same time [47]. As a consequence, strided accesses cause severe bandwidth
degradation, so the memory accesses and data organization should be optimized as much
as possible for transfers of contiguous data.
4.1.2 Languages for General-Purpose GPU Programming
All of the GPU-specific problems mentioned above have a common characteristic: they
require a lot of background knowledge about low-level hardware arrangement, and their
solving involves a lot of tedious manual work. And that is exactly what modern high-level
languages try to eliminate.
There are several libraries and domain specific languages targeted at general-purpose
GPU computing, offering different levels of abstraction: apart from the low-level CUDA,
OpenCL or DirectX, there are several libraries above them that try to provide a more con-
venient interface. GPU computing can be accessed from C++ , Python or Matlab. There
are also a few projects dedicated to the functional paradigm and implemented as embedded
domain-specific languages with Haskell as the host language, namely Vertigo (for DirectX),
28
Nicola (for CUDA) and Obsidian (for CUDA) and Accelerate (for CUDA, OpenCL and
other acceleration technologies)3. In these projects, different approaches have been taken.
Obsidian[44] is a quite low-level language aimed at single kernel generation. High-level
Nikola[28] is in its aim and approach to embedding quite similar to even-higher-level and
more expressive Accelerate[7], but as Obsidian, it can only express array computations that
can be implemented as a single kernel.
4.2 Functional Languages for Data-Parallel Computation
From the functional GPU-oriented languages mentioned above, Obsidian and Accelerate
stand out, and one of the reasons most relevant to the topic of this thesis is that they are
closely related to other data-parallel projects, which makes them potentially interesting for
a functional high-level synthesis tool implementation.
Obsidian is closely connected to the Lava hardware description language for RTL circuit
specification. Accelerate is developed in a close symbiosis with the Repa Haskell library for
high-performance array operations on multicore processors [24] and they share the same
frontend philosophy. Both the low-level Obsidian’s and high-level Accelerate’s approaches
to data-parallel computing are presented in the following sections. They are going to be
compared and contrasted to give the reader a better idea of why Accererate was chosen for
the tool implemented in this thesis.
4.2.1 Obsidian
In the CUDA programming, all threads execute the same program parametrized over thread
identity. Each thread computes which elements from an input array belong to it from its
threadIdx and blockIdx, but the indexing computation depends on the overall architecture
of the kernel. Choices such as how many elements will be processed and by how many
threads affect each other, and once a kernel is designed with a particular number of input
elements per thread, it is not easy to change it as the indexing function depends on it and
has to be rewritten.
Obsidian4 [44, 45, 46] is an embedded domain-specific language intended to simplify the
development of GPU kernels. Currently, it specializes on single kernel development, but
the possibility of multiple kernels cooperation is open for the future as well. A kernel in
Obsidian is a computation over arrays, rather than a description of what a single thread
does with directly accessed elements from the arrays. The degree of parallelism is specified
by dimensions of output arrays: for a flat array, the number of threads corresponds with the
number of elements in the resulting array. In the case of nested arrays, however, only the
outermost dimension is parallelized, and the nested arrays inside are computed sequentially.
Obsidian manages details such as indexing and layout of data in a memory in place
of the GPGPU programmer, thus raising the level of abstraction. The programmer can
quickly produce working prototypes and then experiment with different partitionings and
performance-affecting implementation choices without major rewriting of the code.
Obsidian has been developed by the same team as the Lava HDL language5 In the
first version, GPU functions were represented by sequential monads [46] and combined by
3For an overview of existing libraries and languages, see for example the
”
Related work“ sections in
[7, 45]
4http://hackage.haskell.org/package/Obsidian
5Even its name is derived from lava, as obsidian is a volcanic mineral.
29
monadic combinators. In the current version [44], monads have been replaced by arrows
[22]. GPU functions are now represented as transitions from inputs to outputs. Obsidian
can be viewed as a composition of two sublanguages: a language for expressing arrays and
operations over them, and a language for mapping of the array language programs onto the
GPU.
Arrays Representation and Collective Array Operations
Arrays in Obsidian are represented by an abstract datatype called Arr together with an
indexing function (!), a function returning its length len and a constructor mkArr. Inter-
nally, an array constists of a function from indices to elements and an integer representing
its length:
data Arr a = Arr (indexE -> a) Int
The type variable a makes the array definition polymorphic. An array can contain not
only ints, floats or bools, but also nested arrays and tuples. (The possibility of array nesting
is something that substantially distinguishes Obsidian arrays from Accelerate arrays.) The
language offers a number of functions for collective array operations, which serve as building
blocks to implement GPU kernel functions, for example fmap, foldr, pair, unpair, zipp
and other higher-order functions inspired by Haskell’s Prelude.
The newest version of Obsidian [12] splits arrays into two groups. The original Arr type
is called a Pull Array and renamed to Array, and a new array type Push Array is introduced
under the name ArrayP. However, as it gets a bit more complicated with two different types
of arrays, this section will describe the original version, which is more illustrative.
GPU Program Representation
The second sublanguage is dedicated to GPU program representation. Collective operations
over arrays are lifted to GPU programs with the pure function,
pure :: (a -> b) -> (a :-> b)
which changes the function datatype from standard Haskell’s (a -> b) to an arrow-like
type (a :-> b), that represents the GPU program as a transition from input a to output b.
GPU programs can be composed to form a program sequence with synchronization barriers
in between, as can be seen on Figure 4.1. Synchronization barriers are implemented using
a primitive GPU program called sync.
Figure 4.1: Pure GPU program components and synchronization components Sync are
composed with the composition operator (->-). The function in the flow diagram can be
described as gpu = pure f ->- sync ->- pure g ->- sync ->- pure h
30
The composition operator (->-) takes a function from a to b and a function from b to
c and composes them to a function from inputs a to outputs c.
(->-) :: (a :-> b) -> (b :-> c) -> (a :-> c)
The datatype (a :-> b) is an algebraic data type with two constructors: Pure and
Sync.
data a :-> b
= Pure (a :-> b)
| Sync (a -> Arr FData) (Arr FData :-> b)
It is in fact implemented as a cons list of GPU computations (Sync ... (Sync ...
(Sync ... (Pure ...)))).
For kernel execution on a GPU, there is a function called execute, that can be called
from GHCI. It takes a standard Haskell list and turns it into an input array for the kernel.
Then it executes the kernel on the GPU and presents the result as a standard Haskell list.
Only a single kernel can be executed.
As can be seen from the presented description of Obsidian, the language is aimed at
GPU kernel implementation and as such is very GPU-specific. The programmer develops
GPU function components and chains them with combinators. The resulting code is similar
in nature to CUDA code in sequential C, apart from the fact that Obsidian operates on
arrays and CUDA C programs are related to one thread.
This characteristic limits the employment of Obsidian for other purposes than the de-
velopment of GPU kernels and programs for the SIMT computing model in general. In
contrast, the Accelerate library, which will be described in detail in the next section, is
meant as a high-level tool for acceleration of data-parallel applications without a specific
choice of target architecture.
4.3 The Accelerate Library
Accelerate6 [7, 30, 25] is a domain-specific high-level language for data-parallel compu-
tations developed at the University of New South Wales, Australia. It is aimed at flat
multidimensional regular arays. Accelerate is deeply embedded into Haskell, which means
that an abstract syntax tree (a term tree) is build to represent operations of the embedded
language, instead of a direct evaluation of the terms. Data-parallel nature of computations
is expressed by collective array operations over types Array sh e, where Shape sh is the
shape of the array and Elt e is the type of the elements in it.
Shapes (as well as indices) are represented using an inductive notation of tuples as
heterogenous snoc lists, where the constructor Z represents a unit array and the operator
(:.) appends a new dimension to the right of the shape, for example a three-dimensional
array has type (Z:.Int:.Int:.Int) and a three-dimensional shape of an array of 2x2x2
elements is (Z:.2:.2:.2). There are also synonyms for the most common shapes:
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
type Scalar e = Array DIM0 e
type Vector e = Array DIM1 e
6http://hackage.haskell.org/package/accelerate
31
Array sh e can be created from a standard Haskell list:
let xs = fromList (Z:.3:.5) [1..] :: Array DIM2 Int
Although the dimension of xs is (Z:.3:.5), the array is internally represented as a
simple one-dimensional vector with a shape anotation (from the type class Shape) attached.
Arrays can contain only simple atomic types, for example signed Int and its fixed-width
variants, unsigned Word and variants, Bool, Char, Float, Double, shape types formed
from Z with (:.) and others, and tuples thereof. Tuples are constructed as nested pairs,
for example (Int, Int, Bool) is in fact (((((), Int), Int), Bool). All these types
belong to the type class Elt.
4.3.1 Accelerated Arrays and Expressions
Accelerate distinguished between two types of arrays: vanilla Haskell arrays and accelerated
arrays. When an Accelerate program is executed, operations on vanilla arrays are performed
by the CPU and operations on accelerated arrays are performed by an accelerator (GPU).
In the CUDA backend, accelerated arrays are transfered into a GPU device memory and
operations on them are performed by one or more CUDA kernels.
The Accelerate frontend creates an abstract syntax tree for accelerated collective array
operations over accelerated arrays. The
”
accelerated“ nature is represented by the type
Acc. The type Acc a of an operation means that the result of the operation is of type a.
An accelerated array can be created in two different ways. First, a vanilla Haskell array
can be turned into an accelerated array by the function use7, which promoted the array to
the Acc type:
use :: (Shape sh , Elt e) => Array sh e -> Acc (Array sh e)
While arrays of type Array sh e are stored in the host memory, arrays of type Acc
(Array sh e) are stored in a GPU device memory. In the CUDA backend, use implies an
asynchronous host-to-device data transfer of the accelerated array, which will then serve as
an input array for an accelerated collective array operation.
Second, an accelerated array can be generated using a function that computes a default
value for each index in a given shape (with collective operation generate) or using the
same default values for all indices (with operation fill).
generate :: (Shape ix, Elt a)
=> Exp ix -> (Exp ix -> Exp a)
-> Acc (Array ix a)
fill :: (Shape sh, Elt e)
=> Exp sh -> Exp e
-> Acc (Array sh e)
Similarly, in addition to Acc, which marks accelerated (embedded) array computations,
there is also the type Exp for accelerated scalar computations. For example in the type
signature of generate, function (Exp ix -> Exp a) is an embedded (accelerated) function
that takes a Shape argument of dimension ix for the index and returns a scalar value of
type a that is the new element in the generated array on that index.
The distinction between collective and scalar operations has been introduced to prohibit
nested parallelism. Collective operations can inlude scalar operations, but not vice versa.
7In reality, the type of the function use is a bit more general (it is based on typeclass Arrays) to
accomodate for tuples of arrays, because Acceleate represents all arrays of tuples as tuples of arrays instead
– as an optimization for contiguous memory accesses
32
Thus, nested irregular parallelism is excluded statically by the type system and Accelerate
is limited only to flat data-parallelism involving only regular arrays.
To lift en expression to Exp, there are several lift/unlift functions. A scalar can also be
turned into an accelerated array to take part in a collective array operation as an input to
it:
unit :: Elt e => Exp e -> Acc (Scalar e)
Accelerated expressions support Haskell’s standard arithmetic and bitwise operations
by overloading. As Bool operations cannot be overloaded, equality, comparison and logic
operators are new, formed from the standard operators’ symbols and ’*’ (==*, /=*, <*, <=*,
>*, >=*, &&*, ||*). For conditional expressions, there is the (?) operator, and for array
indexing the (!) operator. Other functions for expression manipulation are also available
(min, max, round, abs, mod and many others).
4.3.2 Collective Array Operations
An array can be modified with functions reshape, which changes the shape without altering
the contents (the size stays the same, only the dimension is altered, for example from 3x4
to 2x6 or 2x2x3), replicate, which enlarges and array by replicating it across one or more
dimensions as specified by the first argument (for example replicate (Z :.2 :.All :.3)
arr yields a three dimensional array, where ’arr’ is replicated twice across the first dimension
and three times across the third dimension), and slice, which extracts a subarray from
the input array and can act like the opposite operation to replicate.
reshape :: (Shape ix , Shape ix ’, Elt e)
=> Exp ix -> Acc (Array ix’ e)
-> Acc (Array ix e)
replicate :: (Slice slix , Elt e)
=> Exp slix -> Acc (Array (SliceShape slix) e)
-> Acc (Array (FullShape slix) e)
slice :: (Slice slix , Elt e)
=> Acc (Array (FullShape slix) e) -> Exp slix
-> Acc (Array (SliceShape slix) e)
To apply a function on all elements in an array, map is used for unary operations and
zipWith for binary operations. There are also other variants of zipWith for operations with
more than two inputs (zipWith3, . . . ).
map :: (Shape ix , Elt a, Elt b)
=> (Exp a -> Exp b) -> Acc (Array ix a)
-> Acc (Array ix b)
zipWith :: (Shape ix , Elt a, Elt b, Elt c)
=> (Exp a -> Exp b -> Exp c)
-> Acc (Array ix a) -> Acc (Array ix b)
-> Acc (Array ix c)
While map and zipWith are have the same semantic as the functions of the same name
in Prelude, reduction fold is different from Prelude.foldl and Prelude.foldr, because
it requires an associative binary operation and the order in which elements are reduced
is unspecified, which allows for a parallel tree reduction instead of a left-to-rigth or a
right-to-left sequential traversal. Standard fold requires a default value, but there is also
fold1, which does not need a default value, only a non-empty input array. There are
several types of reductions, not only general (fold, fold1, foldAll, fold1All), but also
33
segmented (foldSeg, fold1Seg) and specialized (sum, product, minimum, maximum, and,
or, any, all).
Apart from reductions, Accelerate also includes prefix sums (scans), which are similar
to reductions in the sense that they combine one dimension of the input array into a scalar
value8, but they also provide all intermediate results. As a consequence, the left/right
variants need to be specified. scanl and scanr expect a default value, scanl1 and scanr1
expect a non-empty array, and scanl’ and scanr’ return a pair, where the final reduction
result is stored separately from the intermediate results). Prescan, postscan and segmented
variants are also available.
fold :: (Shape ix , Elt a)
=> (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (ix:.Int) a)
-> Acc (Array ix a)
fold1 :: (Shape ix , Elt a)
=> (Exp a -> Exp a -> Exp a) -> Acc (Array (ix:.Int) a)
-> Acc (Array ix a)
scanl :: Elt a
=> (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a)
-> Acc (Vector a)
scanl ’ :: Elt a
=> (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a)
-> (Acc (Vector a), Acc (Scalar a))
Forward and backward permutations (permute, backpermute) are specified by an index
mapping. In permute, the result array is initialised with default values and any further
values that are permuted into the result array are added to the current value using a given
associative binary combination function. In backpermute, the index mapping function
specifies for each element of the output array which element from the input array is read.
permute :: (Shape ix , Shape ix ’, Elt a)
=> (Exp a -> Exp a -> Exp a) -- ^combination function
-> Acc (Array ix ’ a) -- ^array of default values
-> (Exp ix -> Exp ix ’) -- ^permutation
-> Acc (Array ix a) -- ^array to be permuted
-> Acc (Array ix ’ a)
backpermute :: (Shape ix, Shape ix’, Elt a)
=> Exp ix’ -- ^shape of the result array
-> (Exp ix ’ -> Exp ix) -- ^permutation
-> Acc (Array ix a) -- ^source array
-> Acc (Array ix ’ a)
These collective array operations are based on the so-called scan-vector model [3, Chap-
ter 4]. Last but not least, support for stencil convolution was added in later versions
of Accelerate. One, two and three-dimensional convolution kernels (stencils) with 3 or 5
elements in each dimension (plus 7 and 9 elements in DIM1) are supported, for example:
type Stencil3 a = (Exp a, Exp a, Exp a)
type Stencil5 a = (Exp a, Exp a, Exp a, Exp a, Exp a)
type Stencil7 a = (Exp a, Exp a, Exp a, Exp a, Exp a, Exp a, Exp a)
type Stencil3x3 a = (Stencil3 a, Stencil3 a, Stencil3 a)
type Stencil5x5x5 a = (Stencil5x5 a, Stencil5x5 a,
8actually they are not shape polymorphic as reduction, they require a one-dimensional vector
34
Stencil5x5 a, Stencil5x5 a, Stencil5x5 a)
...
This functionality was inspired by the Repa library, where stencils were introduced in
version 2 [26].
4.3.3 Example
Consider the dotproduct function that computes the sum of the products of elements in two
input vectors, pairwise. In standard Haskell, the function would use higher-order functions
foldl (or fordr) and zipWith from Prelude. It would accept a list of numbers (for example
of type Int, as in the case of this example) and return a single scalar value of the same
numeric type.
dotproduct :: [Int] -> [Int] -> Int
dotproduct xs ys = foldl (+) 0 (zipWith (*) xs ys)
Figure 4.2: The dotproduct function consists of the pairwise product of corresponding
elements in vectors xs and ys and the sum of the intermediate values (called zs, just for
illustration).
The accelerated version is quite similar, only the type signature is different and a sym-
metric fold instead of a left/right variant is used, as neither foldr nor foldl are supported
for performance reasons (they would have to be sequential to guarantee a left-to-right or a
right-to-left traversal).
dotproduct :: (IsNum a, Elt a)
=> Acc (Vector a) -> Acc (Vector a) -> Acc (Scalar a)
dotproduct xs ys= fold (+) 0 (zipWith (*) xs ys)
35
To lift a Vector a to an accelerated Acc (Vector a), the function use is used. If the
CUDA backend is used, use implies a host-to-device data-transfer. It is an entry point to
the accelerated program for input data.
dotp :: (IsNum a, Elt a)
=> Vector a -> Vector a -> Acc (Scalar a)
dotp xs ys
= let
xs ’ = use xs
ys ’ = use ys
in dotproduct xs ’ ys ’
In Haskell’s interpreter GHCi, dotp can be evaluated in Accelerate’s interpreter9.
> let xs = (fromList (Z :. 4) [1,3,5,7]) :: Vector Int
> let ys = (fromList (Z :. 4) [2,4,6,8]) :: Vector Int
>
> run $ dotp xs ys
Array (Z) [100]
The result is still an array, although it is a unit array with the rank-0 dimension (denoted
by (Z)) and a single element, integer 100.
4.3.4 Frontend
As Accelerate is a deeply embedded language, Accelerate operations are not directly exe-
cuted, but they build a term tree to represent the accelerated computation. These trees
use the so-called higher-order abstract syntax (HOAS) to embed function-valued scalar ex-
pressions (λ-abstractions and let bindings). As the HOAS representation is not convenient
for program transformations, it is turned into a nameless de Bruijn representation. Both
representations rely on GADTs10 (Generalized Algebraic Data Types), which allow for a
restriction of the type variables in the constructor’s final result. Accelerate uses them in
combination with type families and typeclass overloading for arithmetic expressions. As
a result, full type information of the embedded language can be maintained in the term
tree. Since the frontend uses type-preserving transformations only, the type information
is available for the backend as well. The type-preserving conversion from HOAS terms to
a nameless representation using typed de Bruijn indices is described in previous work by
Manuel Chakravarty[6] from the Accelerate’s team.
Sharing Recovery
During the conversion, Accelerate’s frontend recover sharing introduced by let bindings, for
example
let xs ’ = map f xs in zipWith g xs ’ xs ’
includes a let binding for subterm map f xs. Without sharing recovery, the expression
would be translated inefficiently as
zipWith g (map f xs) (map f xs)
9function run in Data.Array.Accelerate.Interpreter
10GADTs are currently implemented in the GHC compiler as a non-standard extension and have to be
explicitly enabled, for example by a LANGUAGE pragma {-# LANGUAGE GADTs #-}.
36
The sharing recovery algorithm11 used in Accelerate is a variant of the observable-sharing
technique presented by Gill [17] for Lava, although Accelerate preserves types and produces
a nested AST with minimal flattening instead of a graph.
Figure 4.3: Producer/producer and consumer/producer fusion. (The picture is sourced
from [30].)
Fusion of Collective Array Operations
Apart from the sharing recovery, the frontend can also perform other backend-independent
program analyses and optimizations. Lately, a fusion transformation has been introduced
[30]. Array fusion eliminates the intermediate values and additional kernels that would
have to be added when succesive collective array operations are performed on an array.
The method was inspired by Delayed arrays in Repa [24]. Delayed arrays are represented
by their index function. A series of succesive collective array operations in Repa results in
a composition of functions with the index function, rather than in repeated modification of
the underlying array. Accelerate distinguishes between two types of collective operations:
• producers, where each element of the resultant array depends on at most one element
from each input array, such as map, zipWith, backpermute, replicate, slice or
generate,
• and consumers, where each output element depends on multiple input elements, for
example fold, scanl, scanr, permute or stencil.
First, producer/producer fusion combines sequences of producers into a single producer.
Second, consumer/producer fusion embeds the producer into the consumer code (this hap-
11More information about sharing recovery in Accelerate can be found in Optimising Purely Functional
GPU Programs [30], the latest paper on Accelerate from March 2013.
37
pens during code generation). Third, the opposite fusion process (producer/consumer),
may be performed as well, for example the map operation in
map g . fold f z
can be fused into the fold function by applying g to each element produced by the reduction.
However, this type of fusion is not supported.
Apart from sharing and fusion, the frontend is also responsible for the transformation
of arrays of tuples to tuples of arrays.
Abstract Syntax Tree
The nameless de Bruijn representation backends operate on is implemented as a combination
of several GADTs for accelerated Acc and Exp terms. It is located in Data.Array.Accelerate.AST
(while the HOAS abstract syntax tree visible for the user is located in Data.Array.Accelerate.Smart).
Nameless abstractions are introduced by the Lam constructor, while the Var constructor
wraps the indices, as can be seen in figure 4.4 – the AST of the dotproduct function. The
top-level node of dotproduct is Fold. The left branch below the Fold node represents the
function used for reduction – in this case an addition of two parameters, denoted by two
Lam nodes. The rightmost branch contains the ZipWith operation with multiplication (the
branch with the two Lam nodes) and two input arrays introduced by Use.
Sharing recovery is performed by the function convertAcc at the same time as the
HOAS-to-de-Bruijn conversion. A backend then performs a traversal of the AST and gen-
erates low-level code for a particular architecture. In addition, it can perform architecture-
specific optimizations.
4.3.5 CUDA Backend
Accelerate’s primary backend targets the NVIDIA CUDA platform12. It is based on algo-
rithmic parametrised skeletons encapsulating specific data-parallel behaviour of Accelerate’s
collective array operations. Skeletons are basically hand-tuned C code templates for indi-
vidual computational patterns parametrised by types and scalar expressions to be injected
at predefined points.
The CUDA backend operates in two passes (two traversals of the de Bruijn AST). First,
it produces CUDA kernels by instantiating skeletons for collective operations encountered
in the tree. When a Use node is encountered, the backend initiates an asynchronous data-
transfer of the accelerated array from the host memory to the shared device memory. The
second pass is the evaluation pass – the backend executes the precompiled kernels (from
pass 1), manages intermediate data storage and returns a result computed by the GPU.
The invocation of kernels and allocation of intermediate storage on the device is performed
via the Haskell CUDA C API binding13. Kernels are cached to avoid repeated compilation.
The CUDA backend operates dynamically: kernels are generated and compiled on the fly.
From user’s point of view, the whole process of oﬄoading kernels to the GPU and their
execution is encapsulated and thus invisible. All the user does is run the Cuda.run function.
Although run internally calls foreign functions and performs IO operations, from the outside
it is a pure function of type
run :: Arrays a => Acc a -> a
12http://hackage.haskell.org/package/accelerate-cuda
13available as a separate package http://hackage.haskell.org/package/cuda
38
Fold
Lam Const 0 ZipWith
Lam
Body
PrimApp
PrimAdd Tuple
SnocTup
SnocTup Var 0
NilTup Var 1
Lam Use Use
Lam
Body
PrimApp
PrimMul Tuple
SnocTup
SnocTup Var 0
NilTup Var 1
ArraysRpair
ArraysRunit ArraysRarray
ArrayZ :. 10
ArraysRpair
ArraysRunit ArraysRarray
ArrayZ :. 10
Figure 4.4: The abstract syntax tree of the dotproduct function.
Even though kernels are cached and data-transfers overlapped by asynchronous execution,
the overhead is still present and therefore it is only worthwhile to oﬄoad the computations
that are compute-intensive.
4.3.6 Other Backends
The Accelerate EDSL is not GPU-specific, so it can be used for other platforms than
NVIDIA CUDA or GPUs in general as well. Apart from the already mentioned CUDA
backend, there is an experimental partially-implemented OpenCL backend (initially devel-
oped as a graduate project at the University of Copenhagen).
Apart from GPU backends, there is also a backend for multicore processors, namely for
the already-mentioned Repa library, developed at the same university as Accelerate. The
Repa library shares many features with the Accelerate language (Repa usually being the
first to introduce them), but Accelerates evolves quicker than the backend, so now it can
be used only with older versions of Accelerate.
39
4.4 Suitable EDSL for this thesis
The aim of this thesis is not only to reseach the field of functional hardware description
languages and the possibility of their employment for high-level synthesis of circuits, but
also to develop a functional high-level synthesis tool for data-parallel applications as a proof
of concept, based on the findings from the field research. There are basically three different
approaches to the tool design and development:
• The tool can be developed from scratch with different design decisions inspired by
different functional HDLs approaches. Although the tool could be better tailored
to specific needs of data-parallel applications than if a general functional HDL was
used, it would require to develop both the backend and the frontend and the result
would be a stand-alone tool without for example any support for GPUs or multicore
processors.
• Another possibility is to use an existing lower-level functional HDL and develop a
high-level frontend that would compile high-level constructs to low-level primitives
of the chosen functional HDL. From the languages presented in Chapter 3, Lava
seems to be the best choice, because of its close connection with Obsidian for GPU
computing. A single high-level specification could be therefore translated to either a
circuit described in RTL Lava, or a CUDA kernel described in Obsidian.
• Finally, instead of a high-level frontend with backend for existing lower-level lan-
guages, an opposite approach could be taken and an existing high-level language
could be extended with a new backend for VHDL/Verilog.
As this thesis is oriented more on hardware synthesis than on computer languages
and compilers, we decided to take the third path and extend an already existing high-level
language – Accelerate – with an ability to produce VHDL. The possibility of a VHDL/FPGA
backend was already mentioned by the Accelerate team in the library’s introductory paper
[7], but so far the only existing backends other than the main CUDA backend have been
for OpenCL and the Repa library.
40
Chapter 5
VHDL Backend for Accelerate
This chapter is focused on the development of a VHDL backend for the Accelerate library.
Requirements for the design are:
• The output should be presented as a VHDL file, unlike the dynamic invocation of
CUDA kernels on GPUs in the CUDA backend.
• The VHDL file should be in a human-readable format, and as high-level, as possibly
synthesizable.
• The backend should be implemented as a library in Haskell, Accelerate’s host lan-
guage.
• The project should be available under the BSD3 licence (the same as the Accelerate
frontend).
The project have been implemented according to the requirements, although with some
restrictions (limitations). For example, it does not support floating point datatypes (float,
double), only signed and unsigned integers of a given byte-width, char and bool. Only
a subset of the Accelerate language is supported. In the next three sections, the design,
implementation and testing phases are described.
5.1 Design
The Accelerate library is quite complicated on the inside. It uses several non-standard and
non-trivial extensions of the Glasgow Haskell Compiler. The type system of the Accelerate
language is very elaborate, with GADTs and type-preserving term tree transformations.
As a consequence, it was necessary to look for inspiration in existing backends. The most
similar approach was taken in the Repa backend, which is aimed at oﬄine generation of
Repa files, rather than runtime evaluation (as in the CUDA backend).
5.1.1 Input from the Frontend
The input for the backend is an abstract syntax tree – a nameless representation with de
Bruijn indices for let-bound variables. An example of the term tree format can be seen in
figure 4.4 on page 39. The VHDL backend performs a similar set of operations on the tree.
The backend is invoked by the function VHDL.compile. To turn an accelerated expression
Acc acc into a de Bruijn AST, the function Sharing.convertAcc True True True acc
41
is called. A the name of its module suggests, it is responsible of sharing recovery. Although
fusion transformation is also available in the latest version of the library, it has not been
incorporated into the backend yet. As Acc terms derive Show, they can be printed out:
fold
(\x0 x1 -> x0 + x1)
0
(zipWith
(\x0 x1 -> x0 * x1)
(use (Array (Z :. 4) [1,3,5,7]))
(use (Array (Z :. 4) [2 ,4 ,6,8])))
The code above clearly shows lambda abstractions for addition (\x0 x1 -> x0 + x1) and
multiplication (\x0 x1 -> x0 * x1), which correspond to the Lam branches in figure 4.4
of the dotproduct.
Sharing Recovery vs. Fusion
The de Bruijn AST matches the precise structure of the source code with operations such
as map or fold. In the fused version, many collective operations are squeezed into one and
the resultant term tree with delayed arrays is visually different from the standard de Bruijn
AST after sharing recovery (and less human-readable). For example the function q may
encounter a sharing problem with the bound variable x, which is effectively used twice,
q = map (+2) $ zipWith (*) x y
where x = use $ (fromList (Z:.3:.5) [1..] :: Array DIM2 Int)
y = fill (shape(x)) 1
the corresponding term trees are shown below: (1) without sharing recovery, (2) with
sharing recovery, without fusion, and (3) with both sharing and fusion
map
(\x0 -> x0 + 2)
(zipWith
(\x0 x1 -> x0 * x1)
(use (Array (Z :. 3 :. 2) [1,2,3,4,5,6]))
(generate (shape (use (Array (Z :. 3 :. 2) [1,2,3,4,5,6])))
(\x0 -> 1)))
map
(\x0 -> x0 + 2)
(let a0 = use (Array (Z :. 3 :. 2) [1,2,3,4,5,6])
in zipWith (\x0 x1 -> x0 * x1) a0 (generate (shape a0)
(\x0 -> 1)))
let a0 = use (Array (Z :. 3 :. 2) [1,2,3,4,5,6])
in generate (shape a0) (\x0 -> 2 + (a0!x0))
Although the fused version may be optimal with regard to the number of resources
needed, it might not be optimal for pipelined architectures, as it builds a single tightly
packed combinational path. Therefore, the variant (2) with sharing recovery, without fusion
was chosen. Abstract Syntax trees for (1) and (2) are included in Appendix A.
42
5.1.2 VHDL Skeletons
The VHDL generation is based on skeletons of VHDL code, an approach similar to that
of the CUDA backend. This section explains the approach by illustrations. The VHDL
skeleton of fold shows a generate construct, which generates the reduction tree, as shown
in figure 5.1. A generated functional unit with corresponding indices is shown in figure 5.2.
The generate construct generates a series of functional units, not actually a tree. The tree
structure is emulated by a tmp vector, which stores all inputs, then all intermediate results
and then the final reduction, in tmp(0).
Figure 5.1: fold tree for the VHDL skeleton.
entity fold1_adder is
generic (N : integer := 10);
port ( x : in int_array (0 to (N -1));
y : out int );
end fold1_adder;
architecture acc of fold1_adder is
component adder is
port ( x0 , x1 : in int;
y : out int );
end component;
constant M : integer := (2*N-1);
constant K : integer := integer (2 ** (floor(log2(real(M))))) -1;
signal tmp : int_array (0 to (M-1)) := (others => 0);
begin
generate_net:
for I in 0 to (N-2) generate
fu_x : adder port map (tmp (2*I+1), tmp (2*I+2), tmp(I));
end generate generate_net;
43
tmp(N-1 to (K-1)) <= x((M-K) to (N -1));
tmp(K to (M-1)) <= x(0 to (M-K -1));
y <= tmp (0);
end acc;
Figure 5.2: Functional unit for fold – a reducer.
Figure 5.3: fold tree is mapped onto tmp signal.
5.2 Implementation
As described in the previous section, the VHDL backend uses skeletons for code generation.
These skeletons are not hardcoded as texts, they are formed by functions and combinators
from short vhdl primitives.
5.2.1 Traversal of the AST
The translation from Accelerate to VHDL is performed in one pass through the Accelerate
term tree. The procedure is inspired by functions evalPreOpenAcc for Acc tree nodes, and
evalOpenExp for Exp tree nodes in Data.Array.Accelerate.Repa.Evaluations. Similar
traversal functions with the same names can be found in Data.Array.Accelerate.Interpreter.
Both sets of functions use pattern-matching over type contructors of Acc or Exp nodes, for
example:
evalPreOpenAcc (Alet acc1 acc2) ... = ...
evalPreOpenAcc (Avar idx) ... = ...
evalPreOpenAcc (Use arr) ... = ...
evalPreOpenAcc (Map f acc) ... = ...
evalPreOpenAcc (Fold f e acc) ... = ...
...
evalOpenExp (Let exp1 exp2) ... = ...
evalOpenExp (Var idx) ... = ...
evalOpenExp (Const c) ... = ...
evalOpenExp (PrimApp p arg) ... = ...
evalOpenExp (Cond c t e) ... = ...
...
44
Each type of node is matched with a different handler function and evaluated in a
different manner. The VHDL backend uses the same naming conventions as the Repa
backend and the Interpreter, although it utilizes a RWS Monad instead of passing the state
environment in arguments and return values from one node to another.
RWS Monad1 combines three different Monads: a Reader (a read-only storage space for
values visible from all nodes, for example for synthesis constraints), a Writer (a write-only
data-logger for debugging purposes) and a State Monad for modifiable state (for example
a counter for generation of fresh names for signals and functional unit instances). With
a Monad in place of argument-passing, the resultant Haskell code is less cluttered, as less
arguments are needed in the process.
5.2.2 User-Defined Constraints for Synthesis
Generated VHDL code contains integer datatypes with values limited to a certain range,
which can be set by a user constraint.
subtype int is integer range 0 to 255;
type int_array is array (integer range <>) of int;
5.2.3 Text.PrettyPrint
The resultant VHDL code is printed using a specialized Haskell library for formatted output,
Text.PrettyPrint2 [21]. The library contains several combinators and operations for new
lines, indentation, pretty-printing of symbols and other useful features. It was used in the
Repa backend as well.
5.2.4 Modules
The library follows the naming conventions of the Accelerate library and backends. All mod-
ules begin with Data.Array.Accelerate.VHDL followed by the name of the module. The
top-level module, which includes the function compile, is called only Data.Array.Accelerate.VHDL.
Other modules are:
• Traverse contains all the traversal functions with actions specific for each node of
the input AST.
• Stategen consists of a Monad Stategen and operation to handle state, new names
generation and such.
• Codegen contains helper functions for VHDL code generation.
5.3 Testing – dotproduct
The dotproduct example is a popular testing case in many data-parallel projects. The
code in Accelerate is very compact, but it produces quite a large amount of VHDL and
quite a big, but structured RTL schematic. As a result, both the code and the schematic
were placed into Appendix B.
1http://hackage.haskell.org/packages/archive/mtl/1.1.0.2/doc/html/Control-Monad-RWS-Strict.html
2http://hackage.haskell.org/packages/archive/pretty/1.0.1.1/doc/html/Text-PrettyPrint-
HughesPJ.html
45
The schematic consists of a block of multipliers (in a single column) and an adjacent
block of adders combined in a parallel reduction tree. For an input array of 10 32-bit
integers, the synthesis tool (Xilinx ISE) used 10 32-bit adders and 10 32-bit multiplier,
which occupied 66 4-input LUTs and 42 Slices of a Xilinx Spartan-3 FPGA.
46
Chapter 6
Conclusion
Functional languages feature several qualities useful for hardware description. Many domain-
specific languages are implemented as embedded functional languages, with either a deep
or a shallow embedding. In this thesis, research in three different domains was carried out:
in the first part of the thesis, high-level synthesis from C-like languages (C/C++, SystemC,
MATLAB), evolution of synthesis techniques and important algorithms for scheduling were
presented. In the second chapter, several functional languages for different levels of abstrac-
tion were described, together with an introduction to functional paradigm and Haskell in
general. In this thesis, we specialize on data-parallel applications. The aim of the thesis was
not only to research the field, but also to develop a tool for hardware acceleration in FPGAs,
that would synthesize VHDL from a description written in a high-level language (specific
for the domain of data-parallel computations). As computations on GPUs are similar in
nature to the data-parallel computations in hardware, the domain of GPU computing was
researched as well.
In this thesis, a new backend for Accelerate, a high-level library for general-purpose
GPU programming, was designed and developed. There are currently three other backends
available: a CUDA backend, an OpenCl backend and a Repa backend for multicore proces-
sors. With a new VHDL backend, Accelerate will have covered all three main subdomains
of data-parallel computing: GPUs, multicore processors and, newly, FPGAs.
This thesis is unique in the sense that it views the topic from hardware designer’s point
of view, rather than from a point of view of a language theory expert. Although the tool
developed for this thesis is more of a proof of concept than a ready product, it shows an
interesting amount of code reduction, as a one-liner in Haskell (for the dotproduct function)
yields several VHDL source files and a compact schematic with a human-readable structure.
47
Bibliography
[1] Brian Bailey, Grant Martin, and Andrew Piziali. ESL design and verification a pre-
scription for electronic system-level methodology, 2007.
[2] Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware Design
in Haskell. SIGPLAN Not., 34:174–184, September 1998.
[3] Guy E. Blelloch. Vector models for data-parallel computing. MIT Press, Cambridge,
MA, USA, 1990.
[4] Gabriel H. Blindell. Synthesizing Software from a ForSyDe Model Targeting GPGPUs.
Master’s thesis, Royal Institute of Technology, Stockholm, January 2012.
[5] Raul Camposano. Path-Based Scheduling for Synthesis. In IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 1991.
[6] Manuel M. T. Chakravarty. Converting a HOAS term GADT into a de Bruijn term
GADT. Last modified 3 July 2009, Cited 22 May 2013, 2009.
[7] Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod
Grover. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the
sixth workshop on Declarative aspects of multicore programming, DAMP ’11, pages
3–14, New York, NY, USA, 2011. ACM.
[8] Koen Claessen. Embedded Languages for Describing and Verifying Hardware. PhD
thesis, Chalmers University of Technology and Göteborg University, Göteborg, Sweden,
April 2001.
[9] Koen Claessen. The chalmers-lava2000 package. Cited 1.1.2013, September 2012.
[10] Koen Claessen and David Sands. Observable sharing for functional circuit description.
In In Asian Computing Science Conference, pages 62–73, 1999.
[11] Koen Claessen and Mary Sheeran. A Slightly Revised Tutorial on Lava: A Hardware
Description and Verification System, May 2007.
[12] Koen Claessen, Mary Sheeran, and Bo J. Svensson. Expressive array constructs in
an embedded GPU kernel programming language. In Proceedings of the 7th workshop
on Declarative aspects and applications of multicore programming, DAMP ’12, pages
21–30, New York, NY, USA, 2012. ACM.
[13] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru
Zhang. High-Level Synthesis for FPGAs: From Prototyping to Deployment. Computer-
Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30(4):473–
491, April 2011.
48
[14] Philippe Coussy, Daniel D. Gajski, Michael Meredith, and Andres Takach. An Intro-
duction to High-Level Synthesis. IEEE Design & Test of Computers, 26(4):8–17, July
2009.
[15] Philippe Coussy and Adam Morawiec. High-level synthesis : from algorithm to digital
circuit. Springer, 2008.
[16] Andrew Farmer, Garrin Kimmell, and Andy Gill. What’s the Matter with Kansas
Lava? In Rex Page, Zoltán Horváth, and Viktória Zsók, editors, Trends in Functional
Programming, volume 6546 of Lecture Notes in Computer Science, pages 102–117.
Springer Berlin Heidelberg, 2011.
[17] Andy Gill. Type-safe observable sharing in Haskell. Proceedings of the 2nd ACM
SIGPLAN symposium on Haskell, pages 117–128, 2009.
[18] Andy Gill, Tristan Bull, Garrin Kimmell, Erik Perrins, Ed Komp, and Brett Werling.
Introducing Kansas Lava. In MarcoT Morazán and Sven-Bodo Scholz, editors, Imple-
mentation and Application of Functional Languages, volume 6041 of Lecture Notes in
Computer Science, pages 18–35. Springer Berlin Heidelberg, 2011.
[19] Michael J. Gordon. Introduction to HOL : a theorem proving environment for higher
order logic. Cambridge University Press, 1993.
[20] Jim Grundy, Tom Melham, and John O’leary. A reflective functional language for
hardware design and theorem proving. J. Funct. Program., 16(2):157–196, March
2006.
[21] John Hughes. The Design of a Pretty-printing Library. In Advanced Functional Pro-
gramming, First International Spring School on Advanced Functional Programming
Techniques-Tutorial Text, pages 53–96, London, UK, UK, 1995. Springer-Verlag.
[22] John Hughes. Generalising monads to arrows. Science of Computer Programming,
37(1-3):67–111, May 2000.
[23] Ahmed A. Jerraya, Hong Ding, Polen Kission, and Maher Rahmouni. Behavioral
Synthesis and Component Reuse with VHDL. Springer US, Boston, MA, 1997.
[24] Gabriele Keller, Manuel M. T. Chakravarty, Roman Leshchinskiy, Simon P. Jones, and
Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. SIGPLAN
Not., 45(9):261–272, September 2010.
[25] Sean Y. Lee. Accelerating Haskell Array Codes with Algorithmic Skeletons on GPUs.
PhD thesis, The University of New South Wales, 2011.
[26] Ben Lippmeier and Gabriele Keller. Efficient parallel stencil convolution in Haskell.
SIGPLAN Not., 46(12):59–70, September 2011.
[27] Björn Lisper. Data parallelism and functional programming. In Guy-René Perrin and
Alain Darte, editors, The Data Parallel Programming Model, volume 1132 of Lecture
Notes in Computer Science, pages 220–251. Springer Berlin Heidelberg, 1996.
[28] Geoffrey Mainland and Greg Morrisett. Nikola: embedding compiled GPU functions
in Haskell. In Proceedings of the third ACM Haskell symposium on Haskell, Haskell
’10, pages 67–78, New York, NY, USA, 2010. ACM.
49
[29] Grant Martin and Gary Smith. High-Level Synthesis: Past, Present, and Future.
Design & Test of Computers, IEEE, 26(4):18–25, July 2009.
[30] Trevor L. McDonell, Manuel M. T. Chakravarty, Gabriele Keller, and Ben Lippmeier.
Optimising Purely Functional GPU Programs. [preprint draft], 2013.
[31] Michael C. McFarland, Alice C. Parker, and Raul Camposano. Tutorial on high-level
synthesis. In Proceedings of the 25th ACM/IEEE Design Automation Conference, DAC
’88, pages 330–336, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.
[32] Tom Melham. Integrating Model Checking and Theorem Proving in a Reflective Func-
tional Language. In EerkeA Boiten, John Derrick, and Graeme Smith, editors, In-
tegrated Formal Methods, volume 2999 of Lecture Notes in Computer Science, pages
36–39. Springer Berlin Heidelberg, 2004.
[33] Saraju P. Mohanty, Nagarajan Ranganathan, Elias Kougianos, and Priyadarsan Patra.
Low-Power High-Level Synthesis for Nanoscale CMOS Circuits, 2008.
[34] Bryan O’Sullivan, Don Stewart, and John Goerzen. Real world Haskell. O’Reilly, 2009.
[35] Gordon J. Pace and Christian Tabone. Embedding a Hardware Description Language
in a Functional Meta-Programming Language. 2008.
[36] Gordon J. Pace and Christian Tabone. Meta-functional Languages for Hardware Design
and Verification. In Advances in Circuits, Electronics and Micro-Electronics (CEN-
ICS), 2010 Third International Conference on, pages 45–50. IEEE, July 2010.
[37] Jaan Raik and Raimund Ubar. Deliverable D4.4: Semiconductor Technology, Design
and Test Roadmap. Technical report, Department of Computer Engineering, Tallinn
Technical University, Tallinn, Estonia, October 2003.
[38] Ingo Sander. System Modeling and Design Refinement in ForSyDe. PhD thesis, Royal
Institute of Technology, Stockholm, April 2003.
[39] Jason Sanders. CUDA by example : an introduction to general-purpose GPU program-
ming. Addison-Wesley, 2011.
[40] Stefan Schulze and Sergei Sawitzki. Processor design using a functional hardware
description language. Microprocessors and Microsystems, 36(8):676–694, November
2012.
[41] Mary Sheeran. Hardware Design and Functional Programming: a Perfect Match.
Journal of Universal Computer Science, 11(7):1135–1158, July 2005.
[42] Satnam Singh. Designing reconfigurable systems in Lava. In VLSI Design, 2004.
Proceedings. 17th International Conference on, pages 299–306. IEEE, 2004.
[43] Satnam Singh and Phil James-Roxby. Lava and JBits: From HDL to Bitstream in
Seconds. 2001.
[44] Joel Svensson. Obsidian: GPU Kernel Programming inHaskell. Master’s thesis,
Chalmers University of Technology and Göteborg University, Göteborg, Sweden, 2011.
50
[45] Joel Svensson, Koen Claessen, and Mary Sheeran. GPGPU kernel implementation and
refinement using Obsidian. Procedia Computer Science, 1(1):2065–2074, May 2010.
[46] Joel Svensson, Mary Sheeran, and Koen Claessen. Obsidian: A Domain Specific Em-
bedded Language for General-Purpose Parallel Programming of Graphics Processors.
In In Proc. of Implementation and Applications of Functional Languages (IFL), Lec-
ture Notes in Computer Science, 2008.
[47] Michael Wolfe. Understanding the CUDA Data Parallel Threading Model. online,
February 2010.
51
Appendix A
Map
Lam Alet
Body
PrimApp
PrimAdd Tuple
SnocTup
SnocTup Const 2
NilTup Var 0
Use ZipWith
ArraysRpair
ArraysRunit ArraysRarray
Array(Z :. 3) :. 2
Lam AVar 0 Generate
Lam
Body
PrimApp
PrimMul Tuple
SnocTup
SnocTup Var 0
NilTup Var 1
Shape Lam
AVar 0 Body
Const 1
Figure 1: The abstract syntax tree of the function q from section 5.1.1 with the sharing
recovery option enabled.
52
Map
Lam ZipWith
Body
PrimApp
PrimAdd Tuple
SnocTup
SnocTup Const 2
NilTup Var 0
Lam Use Generate
Lam
Body
PrimApp
PrimMul Tuple
SnocTup
SnocTup Var 0
NilTup Var 1
ArraysRpair
ArraysRunit ArraysRarray
Array(Z :. 3) :. 2
Shape Lam
Use
ArraysRpair
ArraysRunit ArraysRarray
Array(Z :. 3) :. 2
Body
Const 1
Figure 2: The abstract syntax trees of the same function q from section 5.1.1 the sharing
recovery option disabled. The middle branch from the node ZipWith shares the same
subtree with the left branch from the node Generate. In the previous version, the common
subtree was bound with Alet and AVar was used as a placeholder at both branches.
53
Appendix B
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
use work.types.all;
entity top is
generic (N : integer := 10);
port (output : out int;
input_23 : in int_array (0 to N-1);
input_24 : in int_array (0 to N-1));
end top;
architecture acc of top is
component fold_0 is
generic (N : integer );
port (y_1 : out int;
x_2 : in int_array (0 to N -1));
end component;
signal w_13 : int;
component zipwith_15 is
generic (N : integer );
port (y_15 : out int_array (0 to N-1);
x_16 : in int_array (0 to N-1);
x_17 : in int_array (0 to N -1));
end component;
signal w_22 : int_array (0 to N-1);
begin
fold_0_generated: fold_0 generic map (N) port map (w_13 , w_22);
zipwith_15_generated: zipwith_15 generic map (N)
port map (w_22 , input_23 , input_24 );
output <= w_13;
end acc;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
54
use work.types.all;
entity zipwith_15 is
generic (N : integer );
port (y_15 : out int_array (0 to N-1);
x_16 : in int_array (0 to N-1);
x_17 : in int_array (0 to N -1));
end zipwith_15;
architecture acc of zipwith_15 is
component func_18 is
port (y_19 : out int;
x_20 : in int;
x_21 : in int);
end component;
begin
tree: for I in (0) to (N - 1) generate
zipper: func_18 port map (y_15(I), x_16(I), x_17(I));
end generate tree;
end acc;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
use work.types.all;
entity fold_0 is
generic (N : integer );
port (y_1 : out int;
x_2 : in int_array (0 to N -1));
end fold_0;
architecture acc of fold_0 is
component fold1_5 is
generic (N : integer );
port (y_6 : out int;
x_7 : in int_array (0 to N -1));
end component;
signal din_4 : int_array (0 to N);
begin
inner: fold1_5 generic map (N+1) port map (y_1 , din_4 );
din_4(N) <= 0;
din_4 (0 to N-1) <= x_2;
end acc;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
55
use work.types.all;
entity fold1_5 is
generic (N : integer );
port (y_6 : out int;
x_7 : in int_array (0 to N -1));
end fold1_5;
architecture acc of fold1_5 is
component func_9 is
port (y_10 : out int;
x_11 : in int;
x_12 : in int);
end component;
constant M : integer := (2 * N - 1);
constant K : integer := integer (2 ** floor(log2(real(M)))) - 1;
signal tmp_8 : int_array (0 to M-1) := (others => 0);
begin
tree: for I in (0) to (N - 2) generate
reducer: func_9 port map (tmp_8(I), tmp_8 (2*I+1), tmp_8 (2*I+2));
end generate tree;
tmp_8(N-1 to K-1) <= x_7(M-K to N-1);
tmp_8(K to M-1) <= x_7(0 to M-K-1);
y_6 <= tmp_8 (0);
end acc;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
use work.types.all;
entity func_9 is
port (y_10 : out int;
x_11 : in int;
x_12 : in int);
end func_9;
architecture acc of func_9 is
begin
y_10 <= (x_11 + x_12);
end acc;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.math_real.all;
library work;
use work.types.all;
56
entity func_18 is
port (y_19 : out int;
x_20 : in int;
x_21 : in int);
end func_18;
architecture acc of func_18 is
begin
y_19 <= (x_20 * x_21);
end acc;
57
88
8
8
8
8
8
rr
r
r
rr
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
r8
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
r)
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
rr
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
rr
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
rr
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
rr
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
r)
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
r8
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
88
8
8
8
8
8
rr
r
r
r)
rr
r
rr
r
r
r
)8
)
r
))
)r
)
)8
)
8
))
)r
)
)
8
8
)
))
)r
)
r
rr
1
rr
1
8
8
r
)8
r
rr
1
rr
1
8
8
r
8
d
r
8
r
rd
rr
d
)8
8
r
)r
))
)
)r
)
)8
8
r
)8
))
)
)r
)
)8
8
r
))
))
)
)r
)
)8
8
r
)r
))
)
)r
)
)8
8
r
)r
))
)
)r
)
)8
8
r
)r
))
)
)r
)
)8
8
r
)r
))
)
)r
)
)8
8
r
))
))
)
)r
)
)8
8
r
)8
))
)
)r
)
)8
8
r
))
))
)
)r
)
)8
8
)
)r
))
)
)r
)
)8
8
)
)8
))
)
)r
)
)8
8
)
))
))
)
)r
)
)8
8
)
)r
))
)
)r
)
)8
8
)
)r
))
)
)r
)
)8
8
)
)r
))
)
)r
)
)8
8
)
)r
))
)
)r
)
)8
8
)
))
))
)
)r
)
)8
8
)
)8
))
)
)r
)
)8
8
)
))
))
)
)r
)
)
8
8
r
)r
))
)
)r
)
)
8
8
r
)8
))
)
)r
)
)
8
8
r
))
))
)
)r
)
)
8
8
r
)r
))
)
)r
)
)
8
8
r
)r
))
)
)r
)
)
8
8
r
)r
))
)
)r
)
)
8
8
r
)r
))
)
)r
)
)
8
8
r
))
))
)
)r
)
)
8
8
r
)8
))
)
)r
)
)
8
8
r
))
))
)
)r
)
88
8
8
8
)
rr
r
r
r)
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
r8
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
r)
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
d
8
d
D
D
D
8
D
D
D
D
88
8
8
8
)
rr
r
r
rr
rr
rr
d
8
8
r
r
)8
88
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
rr
rr
rr
d
8
8
r
r
)8
88
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
rr
rr
rr
d
8
8
r
r
)8
88
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
rr
rr
rr
d
8
8
r
r
)8
88
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
r8
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
r)
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)
8
8
r
))
)r
)
88
8
8
8
)
rr
r
r
rr
rr
rr
d
8
8
r
r
)8
8
8
))
)r
)
)8
8
)
))
)r
)
)8
8
r
))
)r
)
81
1d
8
8
r
)8
r8
8
r
r
)8
)
)r
))
)
)r
)
)8
)
)8
))
)
)r
)
)8
)
))
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
))
))
)
)r
)
)8
)
)8
))
)
)r
)
)8
)
))
))
)
)r
)
)8
)
)8
r
))
)
)r
)
)
8
r
))
)r
)
81
1d
8
r
)8
81
1d
8
r
8
d
r
8
r
rd
rr
d
)8
)
)r
))
)
)r
)
)8
)
)8
))
)
)r
)
)8
)
))
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
)r
))
)
)r
)
)8
)
))
))
)
)r
)
)8
)
)8
))
)
)r
)
)8
)
))
))
)
)r
)
)
8
8
))
)r
)
r1
r
)8
r1
r
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)8
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
r8
r
8
r8
)
r
)8
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)8
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
)r
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
r8
r
8
r8
)
r
)8
))
)
)r
)
r8
r
8
r8
)
r
))
))
)
)r
)
1
8
rr
8
r)
)
)r
)
Figure 3: Synthesized dotproduct.
58
