A framework for FPGA functional units in high performance computing by Koltes, A. & O'Donnell, J.T.
Enlighten – Research publications by members of the University of Glasgow 
http://eprints.gla.ac.uk 
 
 
 
 
 
 
 
Koltes, A., and O'Donnell, J.T. (2010) A framework for FPGA functional 
units in high performance computing. In: IEEE International Symposium 
on Parallel and Distributed Processing, 19-23 April 2010, Atlanta, GA. 
 
http://eprints.gla.ac.uk/43727 
 
Deposited on: 17 August 2012 
 
 
A Framework for FPGA Functional Units
in High Performance Computing
Andreas Koltes
Department of Informatics and Mathematics
University of Passau
Passau, Germany
Email: koltes@ieee.org
John T. O’Donnell
Department of Computing Science
University of Glasgow
Glasgow, United Kingdom
Email: jtod@dcs.gla.ac.uk
Abstract—FPGAs make it practical to speed up a program
by defining hardware functional units that perform calcu-
lations faster than can be achieved in software. Specialised
digital circuits avoid the overhead of executing sequences of
instructions, and they make available the massive parallelism
of the components. The FPGA operates as a coprocessor
controlled by a conventional computer. An application that
combines software with hardware in this way needs an interface
between a communications port to the processor and the signals
connected to the functional units. We present a framework that
supports the design of such systems. The framework consists
of a generic controller circuit defined in VHDL that can be
configured by the user according to the needs of the functional
units and the I/O channel. The controller contains a register file
and a pipelined programmable register transfer machine, and
it supports the design of both stateless and stateful functional
units. Two examples are described: the implementation of
a set of basic stateless arithmetic functional units, and the
implementation of a stateful algorithm that exploits circuit
parallelism.
Keywords-FPGA; interface;
I. INTRODUCTION
Many programs perform large numbers of time-consuming
operations. One way to run such programs faster is to split
them into several tasks to be executed in parallel on different
processor cores. Another approach is to make the basic
operations themselves faster using hardware accelerators.
One example of this is to provide floating point operations
in hardware, rather than performing them in software.
Many computations can be performed faster by a spe-
cialised digital circuit than by a general purpose circuit (i.e.
processor) running a program.
There are two fundamental reasons that circuits may
be faster. The first is that the actual computation that is
needed can be performed directly, without also requiring the
overheads of fetching instructions, decoding them, and so on.
For highly repetitive calculations, this can make hardware
significantly faster than a corresponding program, and the
hardware is relatively easy to design.
An even more fundamental factor is that digital circuits
contain an extraordinary degree of parallelism. All the com-
ponents operate in parallel, although the useful parallelism
in a synchronous circuit is limited by the critical path depth.
The ratio between the number of components and the critical
path depth may be between 10^3 and 10^5. With careful circuit
design, much of this large factor can be converted into useful
parallelism.
Particular programs may require specialised operations,
and it is impossible to support all of these in fixed hardware
coprocessors. FPGAs offer the programmer the ability to
define new hardware implementations of key operations used
in a program. This makes it possible to use efficient hardware
to avoid the overhead of executing sequences of instructions,
and it offers an extremely high degree of parallelism.
Reconfigurable circuits, such as FPGAs, allow specialised
circuit designs to be implemented quickly and cheaply [1]
[2]. They offer the possibility of supporting slow operations
in hardware at speeds much higher than can be achieved
using standard processors. Reconfigurable hardware gives
much of the benefit of fabricating a new circuit design at a
much lower cost.
In order to use an FPGA to speed up a program, it is
necessary first to identify a set of operations to be performed
in hardware. These must be implemented as digital circuits,
called functional units. Finally, an interface needs to be
constructed that allows the processor to communicate with
the new circuit.
Designing the interface is a significant challenge. It has
to communicate with a processor, using an input/output
channel, and it also has to communicate with a set of
circuits via digital signals. The interface must handle the
handshaking protocols required by the processor, as well
as the buffering and timing requirements of the circuit. In
some cases, it is useful for the interface to coordinate the
operation of the functional units, treating them as microop-
erations in order to perform a larger calculation. To meet
all these requirements, it is useful to organise the interface
as a programmable register transfer machine (essentially a
small RISC processor) with a register file. Furthermore, the
interface cannot be a fixed circuit: parts of it need to be
changed as the application circuits change.
This paper presents the design of a generic interface that
addresses these challenges. The interface is a digital circuit,
defined in VHDL, that can be embedded on an FPGA along
with functional units designed by a programmer to accelerate
key operations. The interface circuit is a programmable
register transfer machine, which can collect data from the
processor, buffer it, run the functional units, obtain their
results, and deliver them back to the processor. The work
aims to improve portability, by providing a generic controller
that can be adapted to a wide variety of computer systems.
The paper also discusses two distinct methods for using an
FPGA: implementing stateless functional units, and imple-
menting data parallel operations where the functional units
hold persistent data in a state. We show how both methods
are supported by the controller, and give an example of each.
The architecture of the controller is specified as a set of
generics in VHDL. It contains several subsystems; some can
be used without modification, while others are templates
that will generate the actual circuits, under the control of
parameters supplied by the user.
The controller and the case studies have been imple-
mented and tested on an Altera Cyclone FPGA [3], using
VHDL to specify the circuits. A complete description of the
system, including full documentation of the interface and
protocols, as well as the case studies, appears in Koltes’
dissertation [4]. The dissertation also contains the VHDL
code, and provides the information needed to use the system
for practical applications.
Our results do not make the use of hardware accelerators
as easy as ordinary programming. The user still needs to be
able to design circuits as well as to write software. However,
the work presented here does make the task significantly
easier and more portable.
Several previous systems have used FPGAs to provide
new operations to enhance a system’s instruction set. Eisen-
ring and Platzner present a theoretical model for describ-
ing such systems [5]. The CHIMAERA system [6] uses
a processor tightly coupled to a reconfigurable array that
implements operations used by the instruction set. The main
difference with our work is that CHIMAERA is not a generic
framework aimed at portability.
Wirthlin and Hutchins show how to use partial reconfigu-
ration of an FPGA to allow an instruction set to be modified
dynamically [7]. This is useful when the functional unit
circuits require too much space to fit simultaneously in an
FPGA, although the time required to load a new instruction
(i.e. to read a functional unit circuit into the FPGA) is
substantial. Related approaches are described in [8], [9] and
[10].
One of the strengths of the framework presented here
is its flexibility: it can work with a broad spectrum of
microcontrollers and interconnection systems, and this does
not require any modification to the processor architecture
itself, while allowing custom instructions to be introduced
directly into the microcontroller.
[Figure 1 image: block diagram in which CPU #1 to CPU #m communicate through the Interface with Functional Units #1 to #n.]
Figure 1. High level organisation. The main program is written in C or
any other programming language, and runs in one or more CPUs which
communicate via the interface with a set of functional units. The interface
and the functional units are programmed using VHDL. The Interface
comprises VHDL modules described in this paper, and the Functional Units
communicate with the Interface according to protocols. The Interface can
be configured by editing its VHDL definition.
Section II gives an overview of the system, discussing how
the FPGA, the interface, and the CPU fit together. Section
III describes the central component of the architecture, a
Register Transfer Machine. This is a RISC processor that
provides a register file and a simple instruction set. Section
IV then discusses how the programmer can develop an
application, for both stateless and stateful functional units.
Section V concludes.
II. OVERVIEW OF THE FRAMEWORK
The programmer identifies a set of operations suitable for
hardware implementation. These should have the following
characteristics: they require a relatively long sequence of
ordinary instructions to perform; they can be performed
much more quickly using circuit techniques (e.g. by exploit-
ing the parallelism inherent in circuits); they are executed
frequently. The programmer then designs a dedicated cir-
cuit, called a functional unit, that implements each of the
operations. Each functional unit is designed to interact with
the central interface using a standard signal protocol, which
is defined by the framework.
The aim is to speed up a program running on one or
more processors by augmenting the processors with a set of
functional units. A functional unit is a circuit that performs
some computation significantly faster than can be done in
software. The entire solution consists of a software compo-
nent running in the processors and a hardware component
comprising the functional units. Figure 1 shows the high
level organisation of the system; the CPUs are in a standard
computer while the functional units and the interface are
embedded in an FPGA. The interface is generic, making it
reusable across projects.
The interface needs to be able to execute instructions that
control the functional units. It also needs to retain informa-
tion, enabling a sequence of functional unit operations to be
performed, and to package operands and results according
to the communications protocols.
These requirements are satisfied by organising the in-
terface as a register transfer machine. This is a simple
programmable datapath that contains a register file, and that
has an instruction set for communications.
The entire system is controlled by the host computer. To
perform an accelerated operation, the host sends one or more
packets of data to the controller on the FPGA. The controller
then coordinates the execution of the operations and returns
the final results to the processor. From the processor’s
point of view, the FPGA acts like a coprocessor comprising
one or more functional units. The host computer can send
instructions to be performed on any of the functional units.
Within the FPGA, the instructions may be executed out of
order, but the stream of results returned to the processor
will be consistent with the stream of instructions that were
issued. This is similar to the effect of out-of-order execution
within a sequential superscalar processor.
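The ordering guarantee described above can be illustrated with a small software model. The Python sketch below is only an illustration under stated assumptions, not the mechanism used inside the controller: it tags each issued instruction, buffers results that complete early, and releases them in issue order. The names `InOrderResults`, `issue` and `complete` are hypothetical.

```python
from typing import Any, Dict, List


class InOrderResults:
    """Hypothetical model: release results in the order instructions were issued."""

    def __init__(self) -> None:
        self._next_tag = 0          # tag given to the next issued instruction
        self._next_to_release = 0   # oldest tag whose result is still outstanding
        self._done: Dict[int, Any] = {}  # results that completed early

    def issue(self) -> int:
        """Register a new instruction and return its sequence tag."""
        tag = self._next_tag
        self._next_tag += 1
        return tag

    def complete(self, tag: int, result: Any) -> List[Any]:
        """Record a completion and return every result that may now be sent."""
        self._done[tag] = result
        released = []
        while self._next_to_release in self._done:
            released.append(self._done.pop(self._next_to_release))
            self._next_to_release += 1
        return released


rob = InOrderResults()
a, b, c = rob.issue(), rob.issue(), rob.issue()
first = rob.complete(c, "r2")    # finishes early, so it has to wait
second = rob.complete(a, "r0")   # oldest instruction, released at once
third = rob.complete(b, "r1")    # also unblocks the buffered result
```

In this model the host sees `r0`, `r1`, `r2` in that order even though the third instruction finished first.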
The register transfer machine communicates with the
host processor using a transceiver circuit. There are many
different physical interfaces that the FPGA might need
to interact with. In some cases a predefined transceiver
interface module may be available, and this can be combined
with the VHDL definition of the controller. Depending on
the system, it may be necessary to create a new transceiver
circuit.
The controller is a digital circuit which is specified in
VHDL, an industry standard language for designing digital
circuits. The interface is customisable, and contains parame-
ters that can be modified easily. For example, the word size
used for the register file is adjustable, so the interface can
meet the requirements of the functional units while requiring
as small a portion of the FPGA as possible.
The system contains several units: the interface to the
CPUs, the central control of the FPGA (a Register Transfer
Machine), and the functional units (Figure 2).
Figure 3 shows the structure of the interface from the pro-
grammer’s point of view. To use the system, the programmer
needs to
• Partition the algorithm into a software part, to run in
the processor(s);
• Define the specialised operations and implement them
as functional units, using VHDL;
• Configure the interface framework by specifying size
parameters for the register file, and selecting the ap-
propriate transmitter and receiver modules.
III. REGISTER TRANSFER MACHINE
The core of the interface is a register transfer machine
(RTM). This is a microcontroller with a RISC style archi-
tecture, based on register files and instructions that act on
[Figure 2 image: block diagram with the components Interface Circuitry, Message Buffer, Register Transfer Machine, Message Serializer, and Functional Units #1 to #4.]
Figure 2. Structure of the system. The subsystems shown in this figure
are specified in VHDL and run on the FPGA chip. The interface circuitry
is a low level transceiver that communicates directly with the port pins on
the chip. Incoming and outgoing messages go via hardware buffers to the
central Register Transfer Machine, which controls the functional units.
[Figure 3 image: VHDL top-level module containing the fixed components of the coprocessor framework, the reusable interface components (Receiver and Transmitter, from a COTS library), and Functional Units #1 to #n.]
Figure 3. Top level VHDL module. The main components of the system are
shown from the programmer’s point of view. The receiver and transmitter
come from libraries. The main Register Transfer Machine is a fixed VHDL
definition, although it has a number of size parameters. The functional units
are specific to the application.
the registers. The architecture contains two register files.
The main register file holds data, and its word size is
configurable in multiples of 32 bits. There is a secondary
register file holding vectors of flags, which are often useful
for controlling the functional units. The RTM instructions
may have up to three operands to be fetched from the register
file, and up to two results may be loaded into the register
file.
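The size parameters mentioned above can be pictured as a small configuration record. The following Python sketch is a software analogy, not the VHDL generics of the framework: it checks only the one constraint stated in the text (a data word size that is a multiple of 32 bits). The class name, the register count and the flag width are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFileConfig:
    """Hypothetical size parameters for the data and flag register files."""
    word_bits: int = 32       # data word size, a multiple of 32 bits
    num_registers: int = 256  # assumed; the text does not state a count
    flag_bits: int = 8        # assumed width of one flag vector

    def __post_init__(self) -> None:
        if self.word_bits <= 0 or self.word_bits % 32 != 0:
            raise ValueError("word size must be a positive multiple of 32 bits")
        if self.num_registers <= 0 or self.flag_bits <= 0:
            raise ValueError("register count and flag width must be positive")

    def data_storage_bits(self) -> int:
        """Total storage needed by the main register file."""
        return self.word_bits * self.num_registers


cfg = RegisterFileConfig(word_bits=64, num_registers=16)
total = cfg.data_storage_bits()
```

Choosing the smallest word size that satisfies the functional units keeps the register file, and therefore the FPGA area taken by the interface, as small as possible.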
The RTM interacts with the host computer through a
message buffer for input and a message serialiser for output,
and it interacts directly with the functional units using digital
signals. The message buffer and serialiser communicate
with the host using standard FPGA circuits from a COTS
library.
The register transfer machine executes instructions using a
pipeline (Figure 4), in order to attain concurrency among the
instructions and to reduce the clock period. The pipeline was
designed with most registers at the end of the pipeline stages,
because most FPGAs have their registers after the function
generators. Handshaking is used to control transmission of
data between pipeline stages. This allows local control to
[Figure 4 image: block diagram. Main pipeline: Interface Circuitry (input), Message Buffer, Decoder, Dispatcher, Execution, Message Encoder, Message Serializer, Interface Circuitry (output). Supporting modules: Functional Unit Table, Register Usage Table, Lock Manager, Register File, Flag Register File, Write Arbiter, and Functional Units #1 to #4. All connections are point-to-point; the diagram distinguishes data signals from acknowledgement/idleness signals.]
Figure 4. Organisation of Register Transfer Machine. The RTM is a
RISC processor with a pipeline (the column on the left side of the figure)
comprising a decoder, dispatcher, and execution stage. The processor keeps
state in a register file, which provides a source and sink for data transmitted
to the functional units.
stall the transmission when necessary; there is no global
control for stalling the pipeline. The pipeline contains the
following stages:
• Message buffer. The first stage receives data from the
FPGA input port connected to the host processor, and
converts it to a form usable by the decoder. This stage
needs to be implemented according to the communica-
tion protocol used by the host processor.
• Decoder. The current instruction is decoded into a
vector of signals that control the execution stage.
• Dispatcher. Reads from the register file take place in
the dispatcher stage, and instructions that initiate a
functional unit operation transmit data to the functional
unit through a register in this stage.
• Execution. Instructions that operate on the state of the
RTM are executed.
• Message encoder. There are several types of message
that can be sent from the RTM to the host, including
data records and flag vectors, and these are multiplexed
into a single standard vector of signals.
• Message serialiser. The signal vector is converted to the
form required by the communication port to the host,
and is transmitted on the port.
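The locally controlled stalling of the stages listed above can be sketched in software. The Python model below is a simplified illustration, assuming one holding register per stage and a single "successor has room" condition as the handshake; it is not derived from the VHDL source. The function name `step` and the message labels are hypothetical.

```python
from typing import List, Optional, Tuple

STAGES = ["message buffer", "decoder", "dispatcher",
          "execution", "message encoder", "message serialiser"]


def step(slots: List[Optional[str]], incoming: Optional[str],
         sink_ready: bool) -> Tuple[Optional[str], bool]:
    """Advance the pipeline by one clock cycle.

    Returns (item sent to the host, whether `incoming` was accepted).
    """
    sent = None
    # The last stage hands its item over only if the host side is ready.
    if slots[-1] is not None and sink_ready:
        sent, slots[-1] = slots[-1], None
    # Walk backwards so that each stage looks only at its own successor:
    # there is no global stall signal.
    for i in range(len(slots) - 2, -1, -1):
        if slots[i] is not None and slots[i + 1] is None:
            slots[i + 1], slots[i] = slots[i], None
    accepted = incoming is not None and slots[0] is None
    if accepted:
        slots[0] = incoming
    return sent, accepted


slots: List[Optional[str]] = [None] * len(STAGES)
outputs = []
for cycle in range(10):
    msg = f"i{cycle}" if cycle < 3 else None
    sent, _ = step(slots, msg, sink_ready=True)
    if sent is not None:
        outputs.append(sent)
```

When the output side stops accepting data, the stages fill up one after another and the first stage eventually refuses new input, which is the behaviour a purely local handshake produces.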
The speed of the system is determined by two factors: the
latency of the communication interface to the host computer,
and the clock speed of the FPGA. Our implementation used a
prototyping board which is intended for experimentation and
software development, but not for high speed. In particular,
only a very slow connection from the FPGA board to the
processor was available. However, this is not a limitation
of the approach: there are FPGAs that are tightly integrated
with processors, offering extremely high transfer rates. In
such a system, the main limitation on performance would
be the speed of the circuit on the FPGA.
The generic controller is designed to minimise the clock
period; this is achieved by pipelining, so the critical path
in the controller is short. In general, FPGAs have slower
clocks than processors, and the RTM controller should allow
the fastest clock speed that the FPGA allows. The main
limitation on performance will be the functional unit circuits.
IV. DEVELOPING AN APPLICATION
The main task for the programmer is to design the func-
tional units. They must interact with the controller according
to the framework’s protocol, but apart from that requirement,
the designer has complete freedom in the internal structure
of a functional unit.
Figure 5 shows the architecture of a minimal stateless
functional unit. The purpose of the unit is to perform a
calculation, which is implemented by a black box circuit.
The unit interacts with the controller according to the
protocol, which is documented in detail in [4].
An application program running on a host computer uses
the FPGA, with its functional units, similarly to the way it
would use any conventional coprocessor, such as a dedicated
floating point unit. Naturally there will not be an instruction
in the processor’s instruction set that uses the newly created
operation. Typically the FPGA would be treated as a fast
I/O device. The mechanism for executing an operation in
a functional unit depends on the system, but in general it
would be the same as for any other coprocessor operation.
The interface framework allows several functional units
to be incorporated on the FPGA, and these units may have
different designs. Thus it is possible to provide a set of
operations.
Each functional unit interacts with the register transfer
machine according to a protocol expressed as a finite state
machine (an example is shown in Figure 6). The register
transfer machine has an instruction set that is used by the
programmer to control transmission of data between the
registers and the functional units.
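A protocol of this kind can be written down as a transition function. The Python sketch below is a deliberately reduced version, assuming only three states (idle, executing, result pending) and the events dispatch, completion, acknowledge and reset; the complete protocol is the one documented in [4], and the identifiers used here are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    EXECUTE = auto()
    SEND = auto()  # a result is pending until the controller acknowledges it


def next_state(state: State, *, reset: bool = False, dispatch: bool = False,
               done: bool = False, has_output: bool = True,
               acknowledge: bool = False) -> State:
    """Return the state after one clock edge (simplified protocol)."""
    if reset:                       # reset wins from every state
        return State.IDLE
    if state is State.IDLE and dispatch:
        return State.EXECUTE
    if state is State.EXECUTE and done:
        return State.SEND if has_output else State.IDLE
    if state is State.SEND and acknowledge:
        return State.IDLE
    return state                    # otherwise hold the current state


trace = [State.IDLE]
trace.append(next_state(trace[-1], dispatch=True))
trace.append(next_state(trace[-1], done=True))
trace.append(next_state(trace[-1]))                     # not yet acknowledged
trace.append(next_state(trace[-1], acknowledge=True))
```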
There are two major classes of functional units: stateless
and stateful, discussed in more detail in the following
sections.
[Figure 5 image: schematic of a minimal functional unit. Inputs: clock, reset, dispatch, data_acknowledge, data_output_reg_1[7..0], data_input_1[31..0], variety_code[7..0]. Outputs: idle, data_ready, data_output[31..0], data_output_reg[7..0]. Internals: a CombinationalLogic block, the registers reg_data_ready, reg_data_output (32 bits) and reg_data_output_reg (8 bits), and a few gates (OR2, AND2, NOT).]
Figure 5. Minimal functional unit. This is an example of a functional unit circuit, showing the signals that connect it to the controller. The circuit computes
a pure Boolean logic function, using logic gates. A real functional unit would have a similar interface to the controller, although there would normally be
more signals, and the internal computational circuitry would be much more complex.
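The handshake of such a minimal unit can be modelled at cycle level. The Python sketch below reuses the signal names visible in Figure 5 and makes two assumptions that should be checked against [4]: the dispatch signal loads the output registers and sets the data-ready flag, and the unit reports idle when no result is pending or when the pending result is acknowledged in the current cycle. The class `MinimalUnit` and its method names are hypothetical.

```python
from typing import Callable


class MinimalUnit:
    """Hypothetical cycle-level model of a minimal stateless functional unit."""

    def __init__(self, logic: Callable[[int], int]) -> None:
        self.logic = logic          # stands in for the combinational block
        self.data_ready = False     # registered flag: a result awaits write-back
        self.data_output = 0        # registered 32-bit result
        self.data_output_reg = 0    # registered destination register number

    def idle(self, data_acknowledge: bool) -> bool:
        # Free when nothing is pending, or when the pending result is
        # being acknowledged in this very cycle (assumed rule).
        return (not self.data_ready) or data_acknowledge

    def clock(self, *, dispatch: bool, data_input: int, dest_reg: int,
              data_acknowledge: bool, reset: bool = False) -> None:
        """Apply one rising clock edge."""
        if reset:
            self.data_ready = False   # only the flag needs clearing
            return
        if data_acknowledge:
            self.data_ready = False
        if dispatch:
            self.data_output = self.logic(data_input) & 0xFFFFFFFF
            self.data_output_reg = dest_reg & 0xFF
            self.data_ready = True


unit = MinimalUnit(lambda x: x + 1)
unit.clock(dispatch=True, data_input=41, dest_reg=3, data_acknowledge=False)
busy_without_ack = not unit.idle(False)
free_with_ack = unit.idle(True)
```

Feeding the acknowledge signal combinationally into the idle output lets the unit accept a new operation every cycle, at the cost of a longer combinational path through the unit.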
[Figure 6 image: state diagram. States: Idle, Execute, and several send states (Send Data 1/2, Send Data 2, Send Flags, and combinations of data and flags). Transitions are labelled Dispatch, Completion (with or without output), Data pending, and Acknowledge. If the reset signal is asserted, the FSM moves to state Idle regardless of its current state.]
Figure 6. Example of a finite state machine for a functional unit. Each
functional unit communicates with the RTM controller according to a fixed
protocol, which is implemented within the functional unit by a finite state
machine. The FSM coordinates the transmission of data, and may also
control the datapath within the functional unit.
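The buffering argument in the text reproduced with Figure 6 can be checked with a small cycle-level model. The following is only a sketch in Python rather than VHDL, and the names `PipelinedUnit`, `stages`, `capacity` and `arbiter_ready` are hypothetical and do not come from the framework. It assumes that the register-number FIFO is filled at dispatch time, that results arrive a fixed number of cycles later, and that the write arbiter drains at most one result per cycle.

```python
from collections import deque


class PipelinedUnit:
    """Toy model of a pipelined functional unit with output FIFO buffers."""

    def __init__(self, stages, capacity):
        self.pipeline = deque([None] * stages)  # one slot per pipeline stage
        self.reg_fifo = deque()    # destination register numbers (filled at dispatch)
        self.data_fifo = deque()   # results (filled when they leave the pipeline)
        self.capacity = capacity   # assumed equal for both FIFO buffers

    def idle(self):
        # The idle signal depends only on the register-number FIFO.
        return len(self.reg_fifo) < self.capacity

    def clock(self, instr=None, arbiter_ready=False):
        """Advance one clock cycle; instr is (dest, a, b) or None."""
        written = None
        if arbiter_ready and self.data_fifo:
            # The write arbiter takes the oldest register number and result.
            written = (self.reg_fifo.popleft(), self.data_fifo.popleft())
        done = self.pipeline.pop()  # value leaving the last stage, if any
        if done is not None:
            self.data_fifo.append(done)
        accepted = instr is not None and self.idle()
        if accepted:
            dest, a, b = instr
            self.reg_fifo.append(dest)        # enqueued during the dispatch cycle
            self.pipeline.appendleft(a + b)   # stand-in for the real datapath
        else:
            self.pipeline.appendleft(None)    # bubble
        # The data FIFO never holds more entries than the register-number FIFO,
        # so the pipeline never has to stall.
        assert len(self.data_fifo) <= len(self.reg_fifo) <= self.capacity
        return accepted, written


def run(n_instr=10, stages=3, capacity=4):
    """Dispatch every cycle if possible; the arbiter is ready every third cycle."""
    unit = PipelinedUnit(stages, capacity)
    pending = [(r, r, 100) for r in range(n_instr)]
    results, cycle = [], 0
    while len(results) < n_instr:
        instr = pending[0] if pending else None
        accepted, written = unit.clock(instr, arbiter_ready=(cycle % 3 == 0))
        if accepted:
            pending.pop(0)
        if written is not None:
            results.append(written)
        cycle += 1
    return results
```

Calling `run()` exercises the case in which the unit is repeatedly blocked by full buffers; the internal assertion documents why computing the idle signal from the register-number FIFO alone is sufficient in this model.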
A. Stateless functional units
A stateless unit computes a pure function of its operands.
Once it transmits its result to the controller, the unit contains
no memory that will affect future computations. Examples of
stateless functional units are arithmetic units, trigonometric
function calculators, etc.
As a simple example, consider a set of functional units
to perform a family of arithmetic operations on integers.
(The full details appear in [4].) This example is chosen
for simplicity, and to test and measure the system; in a
real application it would be worthwhile designing functional
units only for operations that are significantly more time
consuming.
The programmer needs to decide on the set of operations,
design the functional units, and specify a set of instructions
for the RTM controller to perform the operations. For this
example, the hardware design is straightforward; the circuits
are standard, and VHDL can synthesise them from standard
notation, much as a compiler can generate machine language
from similar notation. For more complex operations, it
may be challenging to design the functional units, just as
programming may be challenging for hard problems.
Figure 7 shows the instruction set architecture for a
stateless functional unit. The instructions follow the formats
allowed by the RTM controller, and are similar to arithmetic
instructions on a typical RISC processor. Each instruction
specifies the operation, the operand registers, and the result
registers.
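As an illustration of what such a stateless arithmetic unit computes, the following Python sketch models the ADD, ADC, SUB and SBB operations as pure functions of their operands and an input carry flag. It is a behavioural model under stated assumptions, not the VHDL of the framework: the 32-bit word width, the function names, and the convention that subtraction is carried out as a + not(b) + carry (so that a carry of 1 after SUB means "no borrow") are assumptions made for this example.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data words


def adder(a, b, carry_in, complement_b=False):
    """Single adder shared by all operations; returns (result, carry_out)."""
    if complement_b:
        b = ~b & MASK          # one's complement of the second operand
    total = a + b + carry_in
    return total & MASK, total >> 32


def execute(op, a, b, carry=0):
    """Pure function of the operands: no state survives between calls."""
    if op == "ADD":
        return adder(a, b, 0)
    if op == "ADC":
        return adder(a, b, carry)
    if op == "SUB":
        return adder(a, b, 1, complement_b=True)      # a + ~b + 1 = a - b
    if op == "SBB":
        return adder(a, b, carry, complement_b=True)  # continues a multi-word subtraction
    raise ValueError(f"unknown operation: {op}")


def add_multiword(xs, ys):
    """Add two numbers given as little-endian lists of words, chaining the carry."""
    out, op, carry = [], "ADD", 0
    for a, b in zip(xs, ys):
        word, carry = execute(op, a, b, carry)
        out.append(word)
        op = "ADC"             # all higher words consume the carry flag
    return out, carry
```

For example, `add_multiword([0xFFFFFFFF, 1], [1, 0])` propagates the carry from the low word into the high word, which is how multi-word operation through the carry flag is meant to be used.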
B. Stateful functional units
A stateful unit has a local persistent memory. Operations
performed by the unit may depend on data in the memory,
may modify it, and may return part of it to the controller.
Examples of stateful functional units are histogram calcu-
lators, pseudorandom number generators, and associative
memories.
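To make the contrast with a stateless unit concrete, the following Python sketch models a histogram calculator, one of the examples named above. The class name and the operation names `CLEAR`, `ADD_SAMPLE` and `READ_BIN` are hypothetical and chosen only for this illustration; mapping a sample to a bin by taking it modulo the number of bins is likewise an assumption.

```python
class HistogramUnit:
    """Behavioural model of a stateful functional unit with local memory."""

    def __init__(self, bins):
        self.counts = [0] * bins  # persistent memory local to the unit

    def execute(self, op, operand=0):
        if op == "CLEAR":
            self.counts = [0] * len(self.counts)  # modifies the memory
            return None
        if op == "ADD_SAMPLE":
            self.counts[operand % len(self.counts)] += 1  # depends on and modifies it
            return None
        if op == "READ_BIN":
            return self.counts[operand]  # returns part of the memory to the controller
        raise ValueError(f"unknown operation: {op}")
```

Unlike the stateless case, the result of `READ_BIN` depends on the whole history of earlier `ADD_SAMPLE` operations, so the controller has to issue operations to such a unit in a well-defined order.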
We have developed an application that uses a stateful
functional unit to implement an algorithm that performs
simple computations in parallel on every element of a data
structure. With conventional data structures, the processor
performs operations on one element at a time, leaving the
remainder of the data structure inert. The approach used
here is to use circuit parallelism to provide a richer set of
primitive operations.
The application is an implementation of the χ-sort suite
3.2 Stateless functional units
condition, e.g. a division by zero. If this flag is set, the contents of the destination registers (if
any) are undefined by specification.
3.2.2 Functional units
Both functional units incorporated into the case study perform their respective operations in a
single clock cycle. Due to their simple design they are able to accept an instruction every second
clock cycle. This could be improved to a theoretical maximum throughput of one instruction every
clock cycle by intelligent forwarding of the write arbiter acknowledgement signals. However, since
this case study is only intended to evaluate functionality aspects the functional units are designed
as simple as possible.
Arithmetic unit
The arithmetic unit is able to do binary as well as two’s complement additions, subtractions and
comparisons. Multi-word operation is supported through an externally provided carry bit
read from the input carry flag. All operations with the exception of the negation instruction are
applied to the first and second source operand in the case of two input operands and to the first
operand in the case of one input operand. The negation instruction is applied to the second operand
only, for reasons of logic compactness. Table 3.1 shows the encoding of the instructions supported
by the arithmetic unit. The VHDL specification of the arithmetic unit can be found in Appendix
B.3.1.
[Table residue: bit-field diagram of the 64-bit instruction words (bit positions 63 down to 0) for the instructions ADD, ADC, SUB, SBB, INC, DEC, NEG, CMP and CMPB. The fields shown are the function code (16), six control flags (use carry flag, fixed carry, output data, first input zero, second input zero, complement second input), the destination flag register, the source flag register, destination register #1, and source registers #1 and #2. The individual bit values are not recoverable from the extracted text.]
Table 3.1: Encoding of arithmetic instructions
Logic unit
The logic unit is able to do a variety of basic bitwise logic operations. All operations are applied
to the first and second source operand in the case of two input operands and to the first operand
in the case of one input operand. Table 3.2 shows the encoding of the instructions supported by the
logic unit. The VHDL specification of the logic unit can be found in Appendix B.3.2.
Figure 7. Instruction set architecture for stateless functional units.
The instructions shown here follow the instruction format for the RTM
controller. To execute the instructions, the controller obtains the operands
from the register file, dispatches the operations to suitable functional
units, receives the results, and places the results into the register file. The
operations (addition, subtraction, etc.) correspond to the functions computed
by the functional units.
[11], which performs selection and sorting using an
array represented with index intervals. With ordinary arrays,
an element is identified by an index. With the index-interval
representation, an approximate index can be specified. An
element with index interval 〈p, q〉 belongs in the array at
some index i such that p ≤ i ≤ q. An initial array represents
the complete lack of knowledge of where the elements
belong by assigning each element an index interval 〈0, n−1〉.
In sequential algorithms the data structures can be mod-
ified only one element at a time as the processor executes
load and store instructions. With circuit parallelism, data
structures can be active. Each element of the array is stored
in a small processor called a cell, which is implemented as
a small circuit in the FPGA. Cells contain combinational logic
as well as storage; thus cells are a form of “smart memory”.
This capability enables the χ-sort algorithm to recalculate
the index interval of every data item in parallel, at clock
speeds.
The χ-sort algorithm executes in the Register Transfer
Machine, which issues microinstructions to a stateful func-
tional unit, whose organisation is shown in Figure 8. The
functional unit is a tree network with leaf cells containing
persistent memory, and interior node circuits that provide
communications and support parallel folds and scans on
associative operators.
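The role of the interior nodes can be sketched in Python as a balanced pairwise reduction. This is a software model only: the function names are hypothetical, and the two operators shown (summing selection flags and picking the leftmost candidate) are taken from the description of the tree in Figure 8, not from the VHDL source. Because both operators are associative, the reduction can be arranged as a tree whose depth grows logarithmically with the number of cells.

```python
import operator


def tree_fold(combine, leaves):
    """Reduce a non-empty list pairwise, level by level.

    Returns (result, depth); depth is the number of node levels, which
    corresponds to the combinational delay of the tree.
    """
    level, depth = list(leaves), 0
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired element moves up unchanged
        level, depth = nxt, depth + 1
    return level[0], depth


def leftmost(a, b):
    """Associative operator: keep the left candidate if there is one."""
    return a if a is not None else b


def flag_count(flags):
    """Number of cells whose selection flag is set."""
    return tree_fold(operator.add, flags)


def select_pivot(candidates):
    """Leftmost cell value offered as a pivot (None marks cells that offer nothing)."""
    return tree_fold(leftmost, candidates)
```

With eight cells, both `flag_count` and `select_pivot` pass through three levels of nodes, whereas a sequential loop over the cells would take eight steps.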
The cell circuit contains a small amount of storage,
enough to hold one data element and its index interval.
The cell also contains a simple arithmetic circuit that can
perform comparisons and additions. The entire set of cells
form a smart memory that implements a microinstruction set
specifically targeted at the χ-sort algorithm. The RTM im-
plements operations (e.g. performing a selection operation)
by issuing a set of microinstructions to the cells. Figure 9
shows the implementation of the cell circuit.
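The interplay between the cells and the index intervals can be illustrated with a short Python model. It is a sketch of a quicksort-like refinement that is consistent with the primitives described here (compare every cell with a broadcast pivot, count the selected cells, update the bounds), and it should not be read as the exact microprogram of the χ-sort algorithm in [11]. The names `Cell`, `refine_step` and `sort_with_intervals` are hypothetical, and each `for` loop over the cells stands in for an operation that the hardware performs in all cells at once.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    data: int
    lower: int            # lower bound p of the index interval
    upper: int            # upper bound q of the index interval
    selected: bool = False


def make_cells(values):
    """Initial array: every element has the interval <0, n-1>."""
    n = len(values)
    return [Cell(v, 0, n - 1) for v in values]


def refine_step(cells):
    """One refinement step; returns False once every interval is precise."""
    # Pivot: leftmost cell whose interval is still imprecise.
    pivot = next((c for c in cells if c.lower != c.upper), None)
    if pivot is None:
        return False
    p, q, value = pivot.lower, pivot.upper, pivot.data
    # Broadcast the pivot; cells sharing its interval compare their data with it.
    for c in cells:
        c.selected = (c.lower, c.upper) == (p, q) and c.data < value
    k = sum(c.selected for c in cells)  # flag count, computed by the tree
    # Every cell in the pivot's interval narrows its own bounds.
    for c in cells:
        if c is pivot:
            c.lower = c.upper = p + k   # the pivot's final index is now known
        elif c.selected:
            c.upper = p + k - 1         # smaller elements go before the pivot
        elif (c.lower, c.upper) == (p, q):
            c.lower = p + k + 1         # the remaining elements go after it
    return True


def sort_with_intervals(values):
    """Refine until all intervals are precise, then read the elements out by index."""
    cells = make_cells(values)
    while refine_step(cells):
        pass
    return [c.data for c in sorted(cells, key=lambda c: c.lower)]
```

Each step fixes the final index of at least one element, so at most n steps are needed for n elements; in hardware each step costs a fixed number of clock cycles regardless of n, apart from the logarithmic depth of the tree.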
Circuit parallelism enables χ-sort to execute significantly
faster than can be achieved with software on a conventional
3 Case studies
[Diagram: a row of eight SIMD cells, each containing a data register, a lower bound, an upper bound and a selection flag, connected to a binary tree of seven nodes, each labelled with pivot and flag count. Labelled signals: data input, data output, broadcast.]
Figure 3.9: Conceptual diagram of SIMD processor unit
A logarithmic height tree is used to compute the count of SIMD cells whose selection flag register
is set and to select a pivot element having an imprecise interval. Both operations are associative
and can therefore be realised with logarithmic delay in hardware. In the current implementation
selecting a pivot element is simply done by selecting the leftmost element of the sequence whose
interval is imprecise. Besides this the tree is able to retrieve a single data value from the array of
SIMD cells assuming that only a single selection flag is set. A dedicated control circuit not shown
in the conceptual diagram controls the operation of the SIMD cells.
3.3.3 SIMD processor unit
The SIMD processor unit consists of a controller unit, a ROM storing microcode programs con-
trolling the SIMD cells and an array of the actual SIMD cells. The controller is implemented
as a simple finite state machine having only two states as shown in Figure 3.10. Its reference
implementation can be found in Appendix B.4.1.
[Diagram: two states, Idle and Run. Transition labels: Reset; Dispatch (I/O operation); Dispatch (run microcode program); Microcode program running; Completion.]
Figure 3.10: Finite state machine implemented in ξ-Sort core
The ξ-Sort controller is able to execute a few basic operations. It is able to load a single value
received from the functional unit adapter connecting it to the coprocessor architecture into the
first SIMD cell, at the same time shifting the data of all SIMD cells to the respective following
Figure 8. Organisation of stateful functional unit for the XI algorithm.
The functional unit is organised as a binary tree of interior node circuits
and leaf cell circuits. The persistent state is distributed across the cells,
while computations are performed in both the cells and the nodes. The
leaf “cell” processors provide permanent storage and perform comparisons
on indices; the interior nodes do not have persistent state, but they do
contain simple combinational logic functions that implement parallel scans
and folds required by the algorithm.
processor. Each operation takes a fixed number of clock
cycles with the FPGA; with a CPU each operation requires
an iteration that takes time proportional to the number of
data elements.
The algorithm has been implemented on an Altera FPGA,
a small scale system intended for prototyping and software
development with a clock speed of approximately 50 MHz.
V. CONCLUSION
Several factors make it challenging to use FPGAs in
ordinary programming. The solution requires circuit design
skills as well as programming skills, an overall structure
has to be found for the FPGA circuit, and an infrastructure is
required for holding data on the FPGA and delivering it to
the functional units.
We have presented a framework that addresses the inter-
facing issues in using FPGAs. It provides an efficient register
transfer machine for coordinating the data transfers and con-
trolling the functional units, relieving the programmer from
reinventing a significant amount of circuitry. The framework
is implemented in VHDL, with full documentation. To use it,
the programmer needs to configure the interface (by making
some VHDL definitions) and to define the functional units.
The most complex details of the interfacing are provided
by the framework; the programmer’s task is to design the
core logic of the functional unit (hardware design, using
VHDL) and to program the controller (which is software
design, and considerably simpler than it would be to design
a dedicated interface from the ground up).
3.3 Stateful functional units
[Schematic: a single SIMD cell. Inputs: clock, cmd_load, cmd_restore, cmd_save, cmd_select_all, cmd_select_imprecise, cmd_match_data_lt, cmd_match_data_eq, cmd_match_data_gt, cmd_match_lower_bound, cmd_match_upper_bound, cmd_match_lower_bound_i, cmd_match_upper_bound_i, cmd_set_lower_bound, cmd_set_upper_bound, cmd_set_bounds, load_selected, load_saved_state, load_data[data_bits-1..0], load_lower_bound[interval_bits-1..0], load_upper_bound[interval_bits-1..0], input_data[max(data_bits,interval_bits)-1..0]. Registers: reg_data, reg_lower_bound, reg_upper_bound, reg_selected, reg_saved_state. Outputs: data[data_bits-1..0], lower_bound[interval_bits-1..0], upper_bound[interval_bits-1..0], selected, saved_state. Other components: an unsigned comparator (outputs aeb, alb), multiplexers, and AND, OR and NOT gates.]
Figure 3.12: Schematic diagram of single SIMD cell
3.3.4 Functional unit
The functional unit connected to the coprocessor components is realised using a functional unit
adapter component. This adapter module connects the actual ξ-Sort core to the dispatcher and
the write arbiter as shown in Figure 3.13. The idea behind the design is to separate the ξ-Sort
controller logic from the interface logic required by the framework.
A finite state machine (FSM) implemented in the functional unit adapter controls the interaction
with the coprocessor as shown in Figure 3.14. The adapter component forwards data from the
dispatcher to the ξ-Sort core and indicates to the dispatcher whether it is able to process the next
command. Besides this, the adapter module buffers the output of the ξ-Sort core since it may be
required to wait for the write arbiter to acknowledge output data written to the register file.
This modular design allows easier integration and testing of the ξ-Sort core. It is also possible
to use the actual adapter core or an adapted version of it to interface other types of cores to
the coprocessor without changing the internal logic of the core being connected. The reference
implementation of the adapter component is shown in Appendix B.4.3. Currently, the adapter
uses 32-bit data records and transcodes data as needed. This aspect can be easily adapted to other
register file configurations. The current adapter module does not support input or output of flags
Figure 9. Cell circuit for XI algorithm. A cell corresponds to a word of memory, but it contains a small amount of computational hardware as well
as storage. There is an array of cells, providing a memory that can hold an array. The entire set of cells comprises an extremely fine grain data parallel
architecture, which is targeted specifically to the χ-sort algorithm. The programmer begins by defining the behaviour of the high level operations in the
algorithm; these perform the same operation simultaneously in every cell. Next, a circuit is designed that provides both the storage and computation required
for every data element. Finally, this circuit is specified using VHDL. The figure shows the low level layout defined in the VHDL design.
The largest remaining challenge is the expertise required
to define new functional units. Much progress has been made
in high level hardware description languages and hardware
synthesis, but for the foreseeable future it will be harder
and require more knowledge to put part of an algorithm into
an FPGA rather than treating it as pure software. However,
the efficiency and parallelism offered by digital circuits are
very large, so this effort is likely to be justified for many
demanding applications.
REFERENCES
[1] K. Compton and S. Hauck, “Reconfigurable computing: a
survey of systems and software,” ACM Computing Surveys,
vol. 34, no. 2, pp. 171–210, 2002.
[2] T. J. Todman, G. A. Constantinides, S. J. E. Wilton,
O. Mencer, W. Luk, and P. Y. K. Cheung, “Reconfigurable
computing: architectures and design methods,” IEEE Proc.
on Computer and Digital Technology, vol. 152, no. 2, pp.
193–207, March 2005.
[3] Cyclone Device Handbook, Vol. 1, Altera Corporation, 2008.
[4] A. Koltes, “A flexible architecture framework for FPGA based
coprocessors,” University of Passau, Faculty of Computing
Science and Mathematics, Innstr. 44, D-94032 Passau, Germany,
Tech. Rep., 2008, Diplom thesis.
[5] M. Eisenring and M. Platzner, “An implementation framework
for runtime reconfigurable systems,” in Proc. 2nd Int. Work-
shop on Engineering of Reconfigurable Hardware/Software
Objects (ENREGL00), June 2000, pp. 151–157.
[6] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, “CHIMAERA:
a high performance architecture with a tightly coupled
reconfigurable functional unit,” ACM SIGARCH Computer
Architecture News, vol. 28, no. 2, pp. 225–235, 2000.
[7] M. J. Wirthlin and B. L. Hutchings, “A dynamic instruction
set computer,” in IEEE Symposium on FPGAs for Custom
Computing Machines (FCCM’95), 1995, p. 0099.
[8] P. M. Athanas and H. F. Silverman, “Processor reconfigura-
tion through instruction set metamorphosis,” IEEE Computer,
vol. 26, no. 3, pp. 11–18, March 1993.
[9] A. DeHon, “DPGA-coupled microprocessors: commodity ICs
for the early 21st century,” in Proc. IEEE Workshop on
FPGAs for Custom Computing Machines, April 1994, pp. 31–
39.
[10] R. Razdan and M. D. Smith, “A high performance microar-
chitecture with hardware programmable functional units,” in
Proc. 27th Annual Symposium on Microarchitecture. ACM,
1994, pp. 172–180.
[11] J. O’Donnell, “Functional microprogramming for a data par-
allel architecture,” in Proc. 1988 Glasgow Workshop on Func-
tional Programming. Department of Computing Science,
University of Glasgow, 1988, pp. 124–145.
