Hardware profiling in a FPGA-based SoC by Parnassos, Ioannis
University of Thessaly







A thesis submitted in fulfilment of the requirements
for the degree of Diploma of Science in Computer and Communication
Engineering
in the
Department of Electrical and Computer Engineering
University of Thessaly
October 14, 2015
Institutional Repository - Library & Information Centre - University of Thessaly
09/12/2017 06:37:28 EET - 137.108.70.7
UNIVERSITY OF THESSALY
Department of Electrical and Computer Engineering
Hardware Profiling in a FPGA-based SoC






Diploma of Science in Computer and Communication Engineering
Declaration of Authorship
I, Ioannis Parnassos, confirm that this thesis is my own work. All direct or indirect
sources used are acknowledged as references. This thesis was not previously pre-
sented to another examination board and has not been published.
Copyright © 2015 by Parnassos Ioannis.
“The copyright of this thesis rests with the author. No quotations from it should be
published without the author’s prior written consent and information derived from
it should be acknowledged”.
Dedicated to my family and friends...
Abstract
Developing a complete FPGA-based system architecture requires a wide variety
of design approaches to be examined and evaluated. Several seemingly sufficient
attempts will not produce the expected outcome in terms of overall system
performance. Locating the system’s bottleneck cannot rely entirely on simulation.
The purpose of this Thesis is to fulfill the need for profiling analysis of FPGA-based
SoCs by presenting the development of a hardware design with capabilities similar
to software event-based profilers.
The RIFFA framework offers a user-friendly implementation for communicating
data from a host CPU to an FPGA via the PCI Express bus and was used as
infrastructure. The created design extends RIFFA with profiling mechanisms for
monitoring and logging user-created IP cores based on a predefined event set. RIFFA
Monitor was tested with already implemented designs, and the collected data were
used for timing analysis and event visualization.
During development several tasks were incorporated into RIFFA Monitor,
and even more are left as future extensions, with the ambition of creating a practical
and convenient tool for hardware design engineers.
Summary
During the development of an FPGA-based system, several different design
approaches will be examined and evaluated. Many of them may give the impression
of a correctly implemented system but will not produce the expected results in
terms of overall performance. Locating the system’s bottleneck cannot rely
exclusively on simulation.
The purpose of this thesis is to cover the need for performance analysis in
FPGA-based SoCs by presenting the design and development of hardware with
capabilities corresponding to those of event-based software profilers.
RIFFA offers a user-friendly framework for communicating data between software
and an FPGA over the PCI Express bus, and was used as infrastructure. This
framework was enriched with mechanisms for monitoring and logging snapshots of
system operation according to a set of predefined events. The resulting hardware,
named RIFFA Monitor, was used to monitor and evaluate implemented systems, and
the collected data were used for operational analysis and were visualized.
While RIFFA Monitor was under development it adopted a large number of
functions, and several more are being tried out as future extensions, with the
ambition of creating a practical and convenient tool for hardware design engineers.
Acknowledgements
For the fulfilment of this Thesis, I would like to thank my professor Dr.
Nikolaos Bellas for his advice and guidance and my colleague George Zindros for
his support, collaboration and ideas.
I would also like to thank my family for their support and patience.
Contents




List of Figures
List of Tables
Abbreviations
1 Introduction
1.1 Describing the Motives
1.2 Thesis Structure
2 Background
2.1 Field Programmable Gate Array - FPGA
2.1.1 FPGA Architecture
2.1.2 Virtex-7 VC707 Evaluation board
2.2 Reusable Integration Framework for FPGA Accelerators - RIFFA
2.2.1 RIFFA Architecture
2.2.2 RIFFA Hardware Interface
2.2.3 RIFFA Software API
2.3 Vivado Design Suite
3 RIFFA Monitor Core Design & Implementation
3.1 Purpose & Approach
3.2 Event-Based Profiler
3.3 High Level Design
3.4 Module Analysis
3.4.1 Monitor Top Module
3.4.1.1 Control Mechanism
3.4.1.2 Tail
3.4.1.3 Parameters
3.4.2 Global Timer
3.4.3 Monitor Submodule
3.4.4 Event Log
3.5 Driver
3.6 Architectural exploration
4 Conclusion
4.1 Project Report
4.2 In the Future
A Verilog Source Code
B Software interface - RIFFA Monitor API
Bibliography
List of Figures
2.1 Overview of Island-Style FPGA architecture
2.2 Simplified example illustration of a logic cell
2.3 Switch box Topology
2.4 FPGA Software Flow
2.5 VC707 Evaluation board
2.6 VC707 board block diagram
2.7 XC7VX485T FPGA Feature Summary
2.8 RIFFA high level architectural diagram
2.9 Sequence diagram for upstream transfer
2.10 Sequence diagram for downstream transfer
2.11 Timing diagram for receiving data
2.12 Timing diagram for transmitting data
2.13 Vivado High Level Synthesis
3.1 RIFFA with Monitor high level architecture
3.2 RIFFA Monitor block diagram
3.3 Monitor’s RTL schematic
3.4 Monitor’s abstract view and I/O
3.5 Monitor’s Finite State Machine
3.6 Global Timer RTL schematic
3.7 Event Monitor
3.8 Monitor Submodule RTL schematic
3.9 Monitor module ID generation
3.10 Event Log Module RTL schematic
3.11 Output of INFO function call
3.12 Output of LOG function call
3.13 Basic Visualization
List of Tables
2.1 RX - TX interface Signals
3.1 Monitor’s OPCODE bits
Abbreviations
API Application Programming Interface
ASIC Application Specific Integrated Circuit
BRAM Block Random Access Memory
CAD Computer Aided Design
CMT Clock Management Tile
CLB Configurable Logic Block
DMA Direct Memory Access
DFF D Flip Flop
DSP Digital Signal Processing
FPGA Field Programmable Gate Array
HDL Hardware Description Language
HLS High Level Synthesis
LUT Look Up Table
MMCM Mixed-Mode Clock Manager
MUX Multiplexer
PCI Peripheral Component Interconnect
PLL Phase Locked Loop
RIFFA Reusable Integration Framework for FPGA Accelerators
RTL Register Transfer Level
Chapter 1
Introduction
1.1 Describing the Motives
In software engineering, profiling is a form of dynamic program analysis. The
information it provides can point out which pieces of a program are slower than
expected and might be candidates for rewriting. It can also tell which functions are
called more or less often and can help spot bugs that would otherwise go unnoticed.
On the other hand, when hardware engineers design an FPGA-based SoC they
have to rely on simulation for the evaluation and optimization of their work due
to the lack of hardware profiling tools. Collecting useful information during SoC
run time is either partially supported, if a soft processor is implemented, or requires
the manual addition of profiling mechanisms.
The purpose of this Thesis is the development of a hardware design offering
capabilities similar to those of software profiling tools. The Hardware Profiler will
assist in monitoring, debugging and evaluating FPGA designs, performing timing
analysis and locating system bottlenecks.
1.2 Thesis Structure
The Thesis is divided into three main Chapters, each of which includes smaller
sections and possibly subsections.
Chapter 2 provides background information on the hardware and software used in
this project. It begins in section 2.1 with a brief overview of FPGA architecture
and operation, and then focuses on the technical characteristics of the Virtex-7 VC707
evaluation board on which the design was developed. Section 2.2 then presents and
describes the RIFFA framework, and section 2.3 gives a short overview of the
development suite.
Chapter 3 describes the RIFFA 2.0 Profiler Core in detail. The first sections
cover the purpose of the project, a high level view of the architecture and
an introduction to event-based profilers, followed in section 3.4 by a complete
analysis of each module. Afterwards the software bindings are provided and explained.
Finally, the last section focuses on the development milestones.
Chapter 4 summarizes the work done and the results generated, and provides ideas
for future development.
Chapter 2
Background
2.1 Field Programmable Gate Array - FPGA
A field-programmable gate array (FPGA) is an integrated circuit designed
to be configured by a customer or a designer after manufacturing, hence
"field-programmable". The FPGA configuration is generally specified using a hardware
description language (HDL). As opposed to Application Specific Integrated Circuits
(ASICs), where the device is custom built for the particular design, FPGAs can be
programmed to the desired application or functionality requirements.
FPGAs contain an array of programmable logic blocks, and a hierarchy of
reconfigurable interconnects that allow the blocks to be "wired together", like many
logic gates that can be inter-wired in different configurations. Logic blocks can be
configured to perform complex combinational functions, or merely simple logic gates
like AND and XOR. In most FPGAs, logic blocks also include memory elements,
which may be simple flip-flops or more complete blocks of memory.
An FPGA can be used to solve any problem which is computable. This is
trivially proven by the fact that an FPGA can be used to implement a soft
microprocessor. Their advantage lies in that they are sometimes significantly faster
for some applications due to their parallel nature and optimality in terms of the
number of gates used for a certain process.
2.1.1 FPGA Architecture
Logic blocks
The most common FPGA architecture among academic and commercial FPGAs
consists of an island-style array of logic blocks (called Configurable Logic Block,
CLB, or Logic Array Block, LAB, depending on vendor), I/O pads, and routing
channels.
Figure 2.1: Overview of Island-Style FPGA architecture
The CLB is the fundamental building block of an FPGA and can be configured by
the engineer to provide reconfigurable logic gates. A logic block consists of a few
logical cells (called ALM, LE, Slice etc.). A typical cell consists of a 4-input LUT,
a full adder (FA) and a D-type flip-flop, as shown in figure 2.2. In this figure the
LUTs are split into two 3-input LUTs. In normal mode those are combined into
a 4-input LUT through the left mux. In arithmetic mode, their outputs are fed to
the FA. The selection of mode is programmed into the middle multiplexer. The
output can be either synchronous or asynchronous, depending on the programming
of the mux to the right in the figure. In recent years, manufacturers have
started moving to 6-input LUTs in their high performance parts, claiming increased
performance.
Figure 2.2: Simplified example illustration of a logic cell
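The behaviour of such a cell can be illustrated in software. The following C fragment is a minimal sketch and not a description of any vendor's cell: it treats the 4-input LUT as a 16-bit truth table and models only the output mux that selects between the combinational and the registered value. The full adder and the arithmetic mode are left out. The type logic_cell and all function names are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one logic cell (no full adder, no arithmetic mode). */
typedef struct {
    uint16_t lut_init;    /* truth table of the 4-input LUT, one bit per input pattern */
    int      registered;  /* output mux: 1 = flip-flop output, 0 = LUT output */
    int      dff;         /* current state of the D flip-flop (0 or 1) */
} logic_cell;

/* The four input bits select one bit of the truth table. */
int lut4_eval(uint16_t lut_init, unsigned inputs)
{
    assert(inputs < 16u);
    return (lut_init >> inputs) & 1;
}

/* Value on the cell output for the given inputs, before the next clock edge. */
int logic_cell_output(const logic_cell *cell, unsigned inputs)
{
    return cell->registered ? cell->dff : lut4_eval(cell->lut_init, inputs);
}

/* Rising clock edge: the flip-flop captures the LUT output. */
void logic_cell_clock(logic_cell *cell, unsigned inputs)
{
    cell->dff = lut4_eval(cell->lut_init, inputs);
}
```

With lut_init set to 0x8000 the cell behaves as a 4-input AND gate, while 0x6996 gives the 4-input XOR (parity) function; changing the truth table is all that "reconfiguring" the LUT means in this model.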
Hard blocks
Modern FPGA families expand upon the above capabilities to include higher
level functionality fixed into the silicon. Having these common functions embedded
into the silicon reduces the area required and gives those functions increased speed
compared to building them from primitives. Examples of these include multipliers,
generic DSP blocks, embedded processors, high speed I/O logic and embedded
memories. Higher-end FPGAs can contain high speed multi-gigabit transceivers and
hard IP cores such as processor cores, Ethernet MACs, PCI/PCI Express controllers,
and external memory controllers. These cores exist alongside the programmable
fabric, but they are built out of transistors instead of LUTs so they have ASIC level
performance and power consumption while not consuming a significant amount of
fabric resources, leaving more of the fabric free for the application-specific logic.
The multi-gigabit transceivers also contain high performance analog input and
output circuitry along with high-speed serializers and deserializers, components which
cannot be built out of LUTs. Higher-level PHY layer functionality such as line
coding may or may not be implemented alongside the serializers and deserializers
in hard logic, depending on the FPGA.
Routing
An application circuit must be mapped into an FPGA with adequate resources.
While the number of CLBs/LABs and I/Os required is easily determined from the
design, the number of routing tracks needed may vary considerably even among
designs with the same amount of logic. Generally, the FPGA routing is unsegmented.
That is, each wiring segment spans only one logic block before it terminates in a
switch box. By turning on some of the programmable switches within a switch
box, longer paths can be constructed. For higher speed interconnect, some FPGA
architectures use longer routing lines that span multiple logic blocks.
Figure 2.3: Switch box Topology
Software Flow
FPGA architectures have been intensely investigated over the past two decades.
A major aspect of FPGA architecture research is the development of Computer
Aided Design (CAD) tools for mapping applications to FPGAs. It is well
established that the quality of an FPGA-based implementation is largely determined by
the effectiveness of the accompanying suite of CAD tools. Benefits of an otherwise well
designed, feature rich FPGA architecture might be impaired if the CAD tools
cannot take advantage of the features that the FPGA provides. Thus, CAD algorithm
research is essential to the architectural advancement needed to narrow the
performance gaps between FPGAs and other computational devices like ASICs.
The software flow (CAD flow) takes an application design description in a
Hardware Description Language (HDL) and converts it to a stream of bits that is
eventually programmed on the FPGA. The process of converting a circuit
description into a format that can be loaded into an FPGA can be roughly divided into
five distinct steps, namely: synthesis, technology mapping, mapping, placement and
routing. The final output of FPGA CAD tools is a bitstream that configures the
state of the memory bits in an FPGA. The state of these bits determines the logical
function that the FPGA implements.
Figure 2.4: FPGA Software Flow
2.1.2 Virtex-7 VC707 Evaluation board
The project was developed on a Virtex-7 VC707 Evaluation board using the
XC7VX485T-2FFG1761C FPGA. Virtex is the flagship family of FPGA products
developed by Xilinx optimized for highest system performance and capacity. The
VC707 board block diagram is shown in Figure 2.6.
Figure 2.5: VC707 Evaluation board
Figure 2.6: VC707 board block diagram
Figure 2.7: XC7VX485T FPGA Feature Summary
Each 7 series FPGA slice contains four LUTs and eight flip-flops; only some slices can
use their LUTs as distributed RAM or SRLs. Each DSP slice contains a pre-adder,
a 25 x 18 multiplier, an adder, and an accumulator. Block RAMs are fundamentally
36 Kb in size; each block can also be used as two independent 18 Kb blocks. Each
CMT contains one MMCM and one PLL.
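As a small worked example of these ratios, the C sketch below derives fabric totals from a slice count and a number of 36 Kb Block RAMs. The structure fabric_totals and the function fabric_totals_from are hypothetical names introduced for illustration. The counts are supplied by the caller, for instance from the feature summary in Figure 2.7; no device figures are hard-coded here.

```c
#include <assert.h>

/* Per-slice and per-block figures quoted in the text above. */
enum { LUTS_PER_SLICE = 4, FFS_PER_SLICE = 8, BRAM_KBITS_PER_BLOCK = 36 };

/* Hypothetical container for the derived totals. */
typedef struct {
    long luts;             /* total LUTs in the given slices */
    long flip_flops;       /* total flip-flops in the given slices */
    long bram_kbits;       /* total Block RAM capacity in Kb */
    long bram_18k_halves;  /* number of independent 18 Kb blocks available */
} fabric_totals;

fabric_totals fabric_totals_from(long slices, long bram_36k_blocks)
{
    fabric_totals t;
    assert(slices >= 0 && bram_36k_blocks >= 0);
    t.luts = slices * LUTS_PER_SLICE;
    t.flip_flops = slices * FFS_PER_SLICE;
    t.bram_kbits = bram_36k_blocks * BRAM_KBITS_PER_BLOCK;
    t.bram_18k_halves = bram_36k_blocks * 2;  /* each 36 Kb block splits in two */
    return t;
}
```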
2.2 Reusable Integration Framework for FPGA Accelerators - RIFFA
RIFFA (Reusable Integration Framework for FPGA Accelerators) is a simple
framework for communicating data from a host CPU to an FPGA via a PCI Express
bus. The framework requires a PCIe enabled workstation and an FPGA on a board
with a PCIe connector. RIFFA supports Windows and Linux, Altera and Xilinx,
with bindings for C/C++, Python, MATLAB and Java.
On the software side there are two main functions: data send and data receive.
These functions are exposed via user libraries in C/C++, Python, MATLAB, and
Java. The driver supports multiple FPGAs (up to 5) per system. The software
bindings work on Linux and Windows operating systems. Users can communicate
with FPGA IP cores by writing only a few lines of code.
On the hardware side, users access an interface with independent transmit and
receive signals. The signals provide transaction handshaking and a first word fall
through FIFO interface for reading/writing data. No knowledge of bus addresses,
buffer sizes, or PCIe packet formats is required. The user core simply sends data on
a FIFO interface and receives data on a FIFO interface. RIFFA does not rely on a PCIe Bridge
and therefore is not subject to the limitations of a bridge implementation. Instead,
RIFFA works directly with the PCIe Endpoint and can run fast enough to saturate
the PCIe link. It communicates data using direct memory access (DMA) transfers
and interrupt signaling, achieving high bandwidth over the PCIe link.
For the development of this Thesis RIFFA version 2.0.2 was used as infrastructure.
The description that follows reproduces the information presented on the official
RIFFA 2 site.
2.2.1 RIFFA Architecture
The interface has been simplified to expose data as a first word fall through FIFO
(valid-data-ready interface). The data is transferred by RIFFA’s RX and TX DMA
engines using scatter gather address information from the workstation. These
engines issue and service PCIe packets to and from the PCIe Endpoint. RIFFA relies
on a vendor PCIe Endpoint core to drive the transceivers. These are the lowest-level
interfaces that FPGA vendors provide. The RIFFA interface supports 32-bit, 64-bit
and 128-bit widths, depending on the PCIe link configuration. A high level
architectural diagram of the RIFFA framework is illustrated in figure 2.8.
Figure 2.8: RIFFA high level architectural diagram
Upstream transfers are initiated by the FPGA. However, they will not begin
until the user application calls the user library function fpga_recv. Upon doing so,
the thread enters the kernel driver and begins the pending upstream request. If
the upstream request has not yet been received, the thread waits for it to arrive
(bounded by the timeout parameter). On the diagram, the user library and device
driver are represented by the single node labeled "RIFFA Library".
Figure 2.9: Sequence diagram for upstream transfer
Servicing the request involves building a list of scatter gather elements which
identify which pages of physical memory correspond to the receptacle byte array.
The scatter gather elements are written to a shared buffer. This buffer location and
content length are provided to the FPGA. Each page enumerated by the scatter
gather list is pinned to memory to avoid costly paging. The FPGA reads the
scatter gather data then issues write requests to memory for the upstream data. If
more scatter gather elements are needed, the FPGA will request additional elements
via interrupt. Otherwise, the kernel driver waits until all the data is written. The
FPGA provides this notification, again via an interrupt.
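The splitting of a receive buffer into scatter gather elements can be sketched as follows. This is a simplified user-space illustration and not the RIFFA driver code: it assumes 4096-byte pages and plain integer addresses, whereas the real driver works with pinned kernel pages and bus addresses. The names sg_element, sg_build and PAGE_SIZE_BYTES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u  /* assumed page size */

/* One scatter gather element: a contiguous piece that lies within one page. */
typedef struct {
    uint64_t addr;  /* start address of the piece */
    uint32_t len;   /* length of the piece in bytes */
} sg_element;

/*
 * Split the range [addr, addr + len) at page boundaries.
 * Writes at most max_elems elements to out and returns the number of
 * elements the whole range needs, so the caller can detect truncation.
 */
size_t sg_build(uint64_t addr, uint64_t len, sg_element *out, size_t max_elems)
{
    size_t n = 0;
    assert(out != NULL || max_elems == 0);
    while (len > 0) {
        uint64_t room = PAGE_SIZE_BYTES - (addr % PAGE_SIZE_BYTES);
        uint64_t piece = len < room ? len : room;
        if (n < max_elems) {
            out[n].addr = addr;
            out[n].len = (uint32_t)piece;
        }
        n++;
        addr += piece;
        len -= piece;
    }
    return n;
}
```

A buffer that does not start on a page boundary needs one element more than its size alone suggests, which is why the list length depends on the buffer address as well as on its length.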
After the upstream transaction is complete, the driver reads the FPGA for a
final count of data words written. This is necessary as the scatter gather elements
only provide an upper bound on the amount of data that is to be written. This
completes the transfer and the function call returns to the application with the final
count.
A similar sequence exists for downstream transfers. In this direction, the
application initiates the transfer by calling the library function fpga_send. The thread
enters the kernel driver and writes to the FPGA to initiate the transfer. Again, a
scatter gather list is compiled, pages are pinned, and the FPGA reads the scatter
gather elements. Each of the elements results in one or more read requests by the
FPGA. The read requests are serviced and the kernel driver is notified only when
more scatter gather elements are needed or when the transfer has completed.
Upon completion, the driver reads the final count read by the FPGA. In
error-free operation, this value should always be the length of all the scatter gather
elements. The final count is returned to the application.
Figure 2.10: Sequence diagram for downstream transfer
2.2.2 RIFFA Hardware Interface
A single RIFFA channel has two sets of signals, one for receiving data (RX)
and one for sending data (TX). RIFFA has simplified the interface to use a minimal
handshake and receive/send data using a FIFO with first word fall through
semantics (valid+read interface). The clocks used for receiving and sending can be
asynchronous from each other and from the PCIe interface (RIFFA clock). The
table below describes the ports. The input/output designations are from the user
core’s perspective (i.e. the core(s) you write and connect to the RIFFA channel).
Table 2.1: RX - TX interface Signals
For a better understanding of the RX and TX procedures, an example of each
is provided below together with its timing diagram.
Figure 2.11: Timing diagram for receiving data
Figure 2.11 shows the RIFFA channel receiving a data transfer of 16 words
(64 bytes). When CHNL_RX is high, CHNL_RX_LAST, CHNL_RX_LEN, and
CHNL_RX_OFF will all be valid. In this example, CHNL_RX_LAST is high,
indicating to the user core that there are no other transactions following this one and
that the user core can start processing the received data as soon as the transaction
completes. CHNL_RX_LAST may be set low if multiple transactions will be
initiated before the user core should start processing received data. Of course, the user
core will always need to read the data as it arrives, even if CHNL_RX_LAST is low.
In the example CHNL_RX_OFF is 0. However, if the PC specified a value for
offset when it initiated the send, that value would be present on the CHNL_RX_OFF
signal. The 31 least significant bits of the 32 bit integer specified by the PC thread
are transmitted (due to packing constraints). The CHNL_RX_OFF signal is meant
to be used in situations where data is transferred in multiple sends and the user
core needs to know where to write the data (if, for example, it is writing to BRAM
or DRAM).
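The truncation of the offset can be illustrated with the following C sketch. The helper names and the OFFSET_MASK_31 constant are hypothetical and not part of RIFFA. Using the offset as a word index for equally sized chunks is an assumption made for the example; the meaning of the value is ultimately agreed between the software and the user core.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_MASK_31 0x7FFFFFFFu  /* only the 31 least significant bits travel */

/* Value the user core would observe for an offset given by the PC thread. */
uint32_t offset_seen_by_core(uint32_t pc_offset)
{
    return pc_offset & OFFSET_MASK_31;
}

/*
 * Offset for chunk number chunk_index when a large buffer is sent as
 * several equally sized sends (assumed layout: chunks stored back to back).
 * The assertion rejects offsets that would be truncated on the way.
 */
uint32_t chunk_offset(uint32_t chunk_index, uint32_t words_per_chunk)
{
    uint64_t off = (uint64_t)chunk_index * words_per_chunk;
    assert(off <= OFFSET_MASK_31);
    return (uint32_t)off;
}
```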
The user core must pulse the CHNL_RX_ACK signal high for at least one cycle
to acknowledge the receive transaction. The RIFFA channel will not recognize that
the transaction has been received until it receives a CHNL_RX_ACK pulse. Note
that data on CHNL_RX_DATA may arrive before CHNL_RX_ACK is pulsed, but
the FIFO will never overflow.
The combination of CHNL_RX_DATA_VALID high and CHNL_RX_DATA_REN
high consumes the data on CHNL_RX_DATA. New data will be provided until the
FIFO is drained. Note that the FIFO may drain completely before all the data has
been received. The CHNL_RX signal will remain high until all data for the
transaction has been received into the FIFO. Note that CHNL_RX may go low while
CHNL_RX_DATA_VALID is still high. That means there is still data in the FIFO
to be read by the user core. Attempting to read (asserting CHNL_RX_DATA_REN
high) while CHNL_RX_DATA_VALID is low will have no effect on the FIFO. The
user core may want to count the number of words received and compare against the
value provided by CHNL_RX_LEN to keep track of how much data is expected.
In the event of a transmission error, the amount of data received may be less
than the amount expected (advertised on CHNL_RX_LEN). It is the user core’s
responsibility to detect this discrepancy if it is important to the user core.
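The receive handshake described above can be summarized by a small cycle-based software model. The C sketch below is only an illustration under simplifying assumptions: a 32-bit data bus, a user core that is always ready to read, and an acknowledge issued in the same cycle in which CHNL_RX is first seen high. The struct and function names are hypothetical; only the signal names in the comments come from the RIFFA interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signals sampled by the user core in one clock cycle. */
typedef struct {
    int      chnl_rx;             /* CHNL_RX */
    uint32_t chnl_rx_len;         /* CHNL_RX_LEN, in 32-bit words */
    int      chnl_rx_data_valid;  /* CHNL_RX_DATA_VALID */
    uint32_t chnl_rx_data;        /* CHNL_RX_DATA (32-bit bus assumed) */
} rx_inputs;

/* Signals driven by the user core in the same cycle. */
typedef struct {
    int chnl_rx_ack;       /* CHNL_RX_ACK */
    int chnl_rx_data_ren;  /* CHNL_RX_DATA_REN */
} rx_outputs;

/* State kept by this toy user core. */
typedef struct {
    int      acked;     /* ACK already pulsed for the current transaction */
    uint32_t expected;  /* latched CHNL_RX_LEN */
    uint32_t received;  /* words consumed so far */
    uint32_t sum;       /* stand-in for real processing: sum of the words */
} rx_core;

rx_outputs rx_core_step(rx_core *core, const rx_inputs *in)
{
    rx_outputs out = { 0, 1 };  /* always ready to read in this model */
    assert(core != NULL && in != NULL);

    /* New transaction: latch the advertised length and pulse ACK once. */
    if (in->chnl_rx && !core->acked) {
        core->acked = 1;
        core->expected = in->chnl_rx_len;
        core->received = 0;
        core->sum = 0;
        out.chnl_rx_ack = 1;
    }
    /* A word is consumed only when VALID and REN are both high. */
    if (in->chnl_rx_data_valid && out.chnl_rx_data_ren) {
        core->sum += in->chnl_rx_data;
        core->received++;
    }
    /* CHNL_RX low and FIFO empty: the transaction is over. */
    if (!in->chnl_rx && !in->chnl_rx_data_valid)
        core->acked = 0;
    return out;
}

/* After a transaction: 1 if fewer words arrived than were advertised. */
int rx_core_short_transfer(const rx_core *core)
{
    return core->received < core->expected;
}
```

Comparing the received counter with the latched length once the transaction has ended is the discrepancy check that the text leaves to the user core.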
The RIFFA channel’s TX interface is nearly symmetric to the RX interface.
In figure 2.12 the RIFFA channel is sending a data transfer of 16 words (64 bytes).
Figure 2.12: Timing diagram for transmitting data
The user core sets CHNL_TX high and asserts values for CHNL_TX_LAST,
CHNL_TX_LEN, and CHNL_TX_OFF for the duration CHNL_TX is high. CHNL_TX
must remain high until all data has been consumed. RIFFA will expect to read
CHNL_TX_LEN words from the user core. Any more data provided may be
consumed, but will be discarded. The user core can provide less than CHNL_TX_LEN
words and drop CHNL_TX at any point. Dropping CHNL_TX indicates the end of
the transaction. Whatever data was consumed before CHNL_TX was dropped will
be sent and reported as received to the software thread.
As with the receive interface, setting CHNL_TX_LAST high will signal to
the PC thread to not wait for additional transactions (after this one). Setting
CHNL_TX_OFF will cause the transferred data to be written into the PC thread’s
buffer starting CHNL_TX_OFF 4-byte words from the beginning. This can be
useful when sending multiple transactions and needing to order them in the PC thread’s
receive buffer. CHNL_TX_LEN defines the length of the transaction in 4-byte words.
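Since both CHNL_TX_OFF and CHNL_TX_LEN are expressed in 4-byte words, the placement of a transaction in the PC thread's receive buffer reduces to simple arithmetic. The C sketch below illustrates it; the helper names are hypothetical and not part of RIFFA, and placing consecutive transactions back to back is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

enum { WORD_BYTES = 4 };  /* offsets and lengths count 4-byte words */

/* Byte position in the receive buffer where the transaction's data starts. */
uint64_t tx_byte_offset(uint32_t chnl_tx_off)
{
    return (uint64_t)chnl_tx_off * WORD_BYTES;
}

/* Size of the transaction in bytes. */
uint64_t tx_byte_length(uint32_t chnl_tx_len)
{
    return (uint64_t)chnl_tx_len * WORD_BYTES;
}

/* Offset (in words) for a following transaction placed right after this one. */
uint32_t tx_next_offset(uint32_t off_words, uint32_t len_words)
{
    assert(off_words <= UINT32_MAX - len_words);  /* no wrap-around */
    return off_words + len_words;
}

/* 1 if the transaction lies entirely inside a buffer of buffer_bytes bytes. */
int tx_fits_in_buffer(uint32_t off_words, uint32_t len_words, uint64_t buffer_bytes)
{
    return tx_byte_offset(off_words) + tx_byte_length(len_words) <= buffer_bytes;
}
```

For the 16-word transfer of the timing diagram, tx_byte_length(16) gives the 64 bytes quoted above.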
As the CHNL_TX_DATA bus can be 32 bits, 64 bits, or 128 bits wide, it may
be that the number of 32 bit words the user core wants to transfer is not an exact
multiple of the number of words that fit on the bus. In this case, CHNL_TX_DATA_VALID must be high
on the last cycle CHNL_TX_DATA has at least 1 word to send. The channel will
only send as many words as are specified by CHNL_TX_LEN, so any additional data
consumed past the last word will be discarded. Shortly after CHNL_TX goes
high, the RIFFA channel will pulse CHNL_TX_ACK high and begin to consume
the CHNL_TX_DATA bus. The combination of CHNL_TX_DATA_VALID high and
CHNL_TX_DATA_REN high will consume the data currently on CHNL_TX_DATA.
New data can be consumed every cycle. After all the data is consumed, CHNL_TX
can be dropped. Keeping CHNL_TX_DATA_VALID high while CHNL_TX_DATA_REN
is low will have no effect.
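The relation between the transaction length and the bus width can be made concrete with a short calculation. In the C sketch below a "beat" denotes one cycle in which the bus carries data; this term and the function names are introduced here for illustration and do not come from RIFFA.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit words carried by one beat of CHNL_TX_DATA. */
uint32_t words_per_beat(uint32_t bus_bits)
{
    assert(bus_bits == 32 || bus_bits == 64 || bus_bits == 128);
    return bus_bits / 32;
}

/* Beats needed to present len_words words, rounding up for a partial beat. */
uint32_t tx_beats(uint32_t len_words, uint32_t bus_bits)
{
    uint32_t wpb = words_per_beat(bus_bits);
    return len_words / wpb + (len_words % wpb != 0);
}

/* Words of the final beat that are actually sent; the rest is discarded. */
uint32_t tx_last_beat_words(uint32_t len_words, uint32_t bus_bits)
{
    uint32_t wpb = words_per_beat(bus_bits);
    uint32_t rem = len_words % wpb;
    if (len_words == 0)
        return 0;
    return rem ? rem : wpb;
}
```

For example, 5 words on a 128-bit bus take two beats, and only the first word of the second beat is sent.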
2.2.3 RIFFA Software API
The software interface is provided by bindings for C/C++. After installation
all bindings are available in their respective runtime environments. The API is
based on the notion of channels. RIFFA can be configured to support between 1
and 12 independent channels. Each channel connects to an IP core and can be
addressed by specifying the channel number from the user application. The channels
are independent and thread safe. At most one thread should be used to access a
single channel. The C/C++ bindings are used by including the riffa.h header file
and linking with the -lriffa library.
API
• int fpga_list(fpga_info_list * list);
Populates the fpga_info_list pointer with all FPGAs registered in the system.
See riffa_driver.h for the fpga_info_list definition. Returns 0 on success, a
negative value on error.
list - Pointer to a fpga_info_list struct to populate.
Returns: 0 on success, a negative value on error.
• fpga_t * fpga_open(int id);
Initializes the FPGA specified by id. On success, returns a pointer to a fpga_t
struct. On error, returns NULL. Each FPGA must be opened before any
channels can be accessed. Once opened, any number of threads can use the
fpga_t struct pointer.
id - Identifier for the FPGA (in single FPGA installations, this is always 0).
Returns: A fpga_t struct pointer or NULL.
• void fpga_close(fpga_t * fpga);
Cleans up memory/resources for the FPGA specified by the descriptor.
fpga - Pointer to fpga_t struct.
Returns: Nothing.
• int fpga_send(fpga_t * fpga, int chnl, void * data, int len, int destoff,
int last, long long timeout);
fpga - Pointer to fpga_t structure.
chnl - Channel number over which to communicate.
data - Pointer to array of data to send.
len - Length of data to send, in (32 bit) words. Thus a value of 4 means send
16 bytes.
destoff - Value sent to FPGA core to indicate where to start writing this data.
Only the least significant 31 bits are sent (not all 32).
last - If 1, this transfer is the last in a sequence of transfers. If 0, this transfer
is not the last in a sequence of transfers (more transfers to come).
timeout - Timeout value in ms. If 0, no timeout is specified. Otherwise, the
PC will wait up to timeout ms in between PC/FPGA communications.
Sends len words (4 byte words) from data to FPGA channel chnl using the
fpga_t struct. The FPGA channel will be sent len, destoff, and last. The value
of destoff is used to support sending data across multiple send transactions.
Note that only the low 31 bits of this unsigned int are sent. If last is 1, the
channel should interpret the end of this send as the end of a transaction. If
last is 0, the channel should wait for additional sends before the end of the
transaction. If timeout is non-zero, this call will send data and wait up to
timeout ms for the FPGA to respond (between packets) before timing out. If
timeout is zero, this call may block indefinitely. Multiple threads sending on
the same channel may result in corrupt data or error. This function is thread
safe across channels.
Returns: The number of words sent.
• int fpga_recv(fpga_t * fpga, int chnl, void * data, int len, long long
timeout);
fpga - Pointer to fpga_t structure.
chnl - Channel number over which to communicate.
data - Pointer to buffer array where received data will be written.
len - Length of buffer array, in (32 bit) words. Thus a value of 4 means a
16-byte buffer.
timeout - Timeout value in ms. If 0, no timeout is specified. Otherwise, the
PC will wait up to timeout ms in between PC/FPGA communications.
Receives data from the FPGA channel chnl to the data pointer, using the
fpga_t struct. The FPGA channel can send any amount of data, so the data
array should be large enough to accommodate it. The len parameter specifies
the actual size of the data buffer in words (4 byte words). The FPGA will
specify an offset value which determines where received data will start
being written. If the amount of data plus the offset exceeds the size of the data
array, then the additional data will be discarded. If timeout is non-zero, this
call will wait up to timeout ms for the FPGA to respond (between packets)
before timing out. If timeout is zero, this call may block indefinitely. Returns
the number of words received to the data array.
Returns: The number of words received to the data array.
• void fpga_reset(fpga_t * fpga);
Resets the state of the FPGA and all transfers across all channels. This
is meant to be used as an alternative to rebooting if an error occurs while
sending/receiving. Calling this function while other threads are sending or
receiving will result in unexpected behavior.
fpga - Pointer to fpga_t structure.
Returns: Nothing.
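The role of destoff and last when a buffer is sent across several transactions can be sketched in C. This is only a planning helper under stated assumptions: the struct and function names are hypothetical, destoff is assumed to count 32-bit words, and no RIFFA call is made here. Each planned step holds the arguments for one fpga_send call.

```c
#include <assert.h>
#include <stddef.h>

/* Arguments for one fpga_send call (hypothetical helper type, not part of RIFFA). */
struct send_step {
    size_t word_index; /* where this chunk starts in the data array, in words */
    int len;           /* value for the len argument */
    int destoff;       /* value for the destoff argument (assumed to be in words) */
    int last;          /* 1 only for the final chunk */
};

/* Plan the sends needed to move total_words in chunks of at most chunk_words.
 * Returns the number of steps written, or -1 if steps[] is too small. */
static int plan_sends(int total_words, int chunk_words,
                      struct send_step *steps, int max_steps)
{
    int n = 0, done = 0;
    assert(total_words > 0 && chunk_words > 0 && steps != NULL);
    while (done < total_words) {
        int len = total_words - done;
        if (len > chunk_words) len = chunk_words;
        if (n == max_steps) return -1;
        steps[n].word_index = (size_t)done;
        steps[n].len = len;
        steps[n].destoff = done;                     /* continue where we stopped */
        steps[n].last = (done + len == total_words); /* close the transaction */
        done += len;
        n++;
    }
    return n;
}
```

Each step would then map to a call of the form fpga_send(fpga, chnl, words + step.word_index, step.len, step.destoff, step.last, timeout), where words is the data array viewed as 32-bit words.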
2.3 Vivado Design Suite
This project was designed and implemented with Xilinx’s Vivado Design Suite.
Vivado is a software suite for synthesis and analysis of HDL designs, superseding
Xilinx ISE with additional features for system-on-chip development and high-level
synthesis. It includes a built-in logic simulator and a high-level synthesis
toolchain that converts C code into programmable logic. Vivado HLS was used to
create the accelerators that were attached to RIFFA channels and monitored.
Figure 2.13: Vivado High Level Synthesis
Chapter 3
RIFFA Monitor Core Design &
Implementation
This chapter presents the RIFFA 2.0 Monitor Core. The purpose, design and
implementation of every module are described in depth, followed by a brief
analysis of the C-based API. The chapter concludes with a summary of the
architectural approaches explored while searching for a solution that is
transparent to users and compatible with existing projects.
3.1 Purpose & Approach
The purpose of this project is to extend the functionality of the RIFFA
framework with metrics and logging, creating a profiling mechanism for FPGA-based
SoCs. The original concept was that the Monitor core would use hardware
performance counters just like those built into modern microprocessors. Although
this was a simple task, the variety of IP cores that users can attach to RIFFA channels
made it harder to come up with a universal solution. Once the project had
evolved to a viable state, the next dilemma emerged: how to perform the
extra operations while keeping them transparent at user level and, at the same
time, retaining compatibility with pre-existing projects.
Every RIFFA project consists of two main parts, the user IP core (accelerator
/ SoC) to be instantiated in the riffa_adapter module, and the corresponding software
that uses the RIFFA API to access the core. Accelerators should be instantiable as is,
without additional logic or interconnections, except when users explicitly request
that a specific event be monitored. At software level the objective was to leave the
original API intact. Furthermore, all existing RIFFA projects should
operate properly on the monitored framework.
Apart from the expected functionality, compatibility and transparency,
the next major factor determining the success of the implementation is the
percentage of resources occupied. The Monitor had to be designed as a lightweight module,
using only the minimum amount of combinational logic possible. It would be
meaningless if there were not enough resources left for the actual SoC or accelerator.
3.2 Event-Based Profiler
The Monitor started as a single hardware counter measuring the duration of
accelerator usage and gradually evolved into a tool capable of monitoring and recording
any activity in the user-generated modules. This activity is referred to as
events and consists of RIFFA RX/TX engine usage (channel transactions) and
user-specified triggers.
Similar to event-based profilers that trigger on certain events in the code, the RIFFA
2.0 Monitor exhaustively monitors and records every trigger associated with its
assigned set of events. User-specified triggers are optional and have to be
wired to the Monitor manually. With this simple step, the latest version can monitor
and log a maximum of 16 different events simultaneously. For each one the Monitor core
counts the number of occurrences, measures their duration and logs them.
Since the RIFFA Monitor is a hardware design, its definition of an event differs from
that of software profilers. In software engineering an event is a unique trigger such as an
exception, a function call or a specific mark in the code. A hardware event,
on the other hand, usually has a definite duration. For example, a hardware event could
be the signal on a single wire, a specific state of an FSM or even an active transaction
on the AXI bus interface.
3.3 High Level Design
To attach a custom IP core to RIFFA channels we have to instantiate
it in the user space provided inside the riffa_adapter module and connect it to the
proper RX and TX interface signals. Figure 2.3 of section 2.2 shows
a high-level diagram of the original RIFFA architecture.
After trying several architectural approaches we concluded that
the solution that is easiest to use and most compatible with pre-existing projects is to
instantiate a single Monitor core inside the riffa_adapter module and move the user space for
instantiations one level deeper in the module hierarchy. One Monitor core supervises
all channels and monitors all user IP cores. To achieve this, the profiler core sits
between the user cores and riffa_adapter and intercepts all TX and RX signals. Figure
3.1 shows the high-level architecture of RIFFA with the interposed Monitor
module acting as a wrapper for all user cores.
Figure 3.1: RIFFA with Monitor high level architecture
3.4 Module Analysis
In this section we break down the Monitor core, starting from the
top module profiler.v. We analyze in detail the logic behind the FSM
that controls all functionality, the combinational logic that intercepts the wiring
of the TX and RX interfaces, and the flexibility that the various
parameters provide. Monitoring and logging are handled at a lower hierarchy level,
so that each project can instantiate only what is essential and save
resources.
3.4.1 Monitor Top Module
The Monitor’s top module implements most of the required logic. It is
designed for the 128-bit version of the RIFFA 2.0 framework. The clock used is RIFFA’s
user clock at 250 MHz (4 ns period). Every register is given an initial value that
takes effect when the FPGA is programmed, which makes a global
reset network unnecessary.
Figure 3.2: RIFFA Monitor block diagram
Figure 3.3: Monitor’s RTL schematic
To the upper levels of the RIFFA hierarchy the Monitor looks just like any other IP core
instantiated and connected to RIFFA channels. The RIFFA framework offers a maximum
of 12 channels, through which accelerators communicate with software
over the PCI Express link. The Monitor control unit is able to receive commands without
reserving a channel for itself. It connects to all available channels and intercepts every
transaction. So when a user core is instantiated inside the redirected user space,
neither the RIFFA modules nor the IP core need to be altered to accommodate the
Monitor. Furthermore, if we do not explicitly communicate with the Monitor through
its driver, it silently lets the user cores run and records events in the Log.
Figure 3.4: Monitor’s abstract view and I/O
The interception of the RX interface gives the Monitor the ability to communicate
with software without using extra resources, as discussed in section
3.4.1.1. The interception of the TX interface, on the other hand, makes it possible to
keep the transmission active for one extra cycle, providing 128 extra bits of information
sent back to the software level. The operation of attaching additional data to an
accelerator’s transmission is called TAIL and is covered in section 3.4.1.2.
3.4.1.1 Control Mechanism
Operation Code
The Monitor must be able to communicate with software without reserving a RIFFA
channel. To achieve this without altering the original drivers or RIFFA’s higher-level
modules, the only way is to share a channel with a user IP/accelerator core. Each
time RIFFA has an incoming transaction on this channel, the incoming
data is aimed at either the Monitor or the accelerator. Adding a header to the
data frame provides an operation code that makes this distinction clear. The profiler
shares channel 0 and is solely responsible for starting every RX transaction on
this channel. The two LSBs act as Job Select and indicate the operation to perform.
• FORWARD - Give channel control to user IP core.
• INFO - Transmit the specified information frame.
• FLUSH - Transmit all valid entries in Log’s BRAM.
• SET - Reset specific counters and enable or disable tail.
Bits 2 to 6 are read only when the selected operation is SET (two LSBs == 11), and
each one is associated with a unique operation.
Table 3.1: Monitor’s OPCODE bits
Users do not have to worry about sending the correct OPCODE, since this is
handled by the Monitor’s API. On every transaction targeting the SoC or accelerator, a
header with OPCODE 00 (FORWARD) is attached automatically.
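The OPCODE layout described above can be sketched in C. The values FORWARD = 00, FLUSH = 10 and SET = 11 follow the text; INFO = 01 is inferred as the remaining combination. The positions of the individual SET flags within bits 2 to 6 are purely hypothetical, since the actual assignment is given in Table 3.1. All names below are illustrative and not taken from the Monitor’s driver.

```c
#include <assert.h>
#include <stdint.h>

/* Job Select: the two LSBs of the OPCODE. INFO = 01 is an assumption. */
enum job_select { JOB_FORWARD = 0x0, JOB_INFO = 0x1, JOB_FLUSH = 0x2, JOB_SET = 0x3 };

/* Hypothetical positions of the five SET flags in bits 2..6. */
enum set_flag {
    SET_TAIL          = 1u << 2,
    SET_RST_TRG_CNTS  = 1u << 3,
    SET_RST_CHNL_CNTS = 1u << 4,
    SET_RST_LOG       = 1u << 5,
    SET_RST_GLTIMER   = 1u << 6
};

/* Build a 7-bit OPCODE; the flag bits are only kept when the job is SET. */
static uint8_t make_opcode(enum job_select job, unsigned set_flags)
{
    unsigned flags;
    assert((set_flags & ~0x7Cu) == 0u);   /* flags must fit in bits 2..6 */
    flags = (job == JOB_SET) ? set_flags : 0u;
    return (uint8_t)(flags | (unsigned)job);
}

/* Extract the Job Select field from an OPCODE. */
static enum job_select opcode_job(uint8_t opcode)
{
    return (enum job_select)(opcode & 0x3u);
}
```

With these assumptions, a FLUSH request is encoded as 0000010 and a forwarded transaction as 0000000, matching the header of zeros attached to accelerator traffic.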
FSM
The Monitor’s top module has to perform several actions, including event monitoring
and logging, and also has to share a channel with a user core in order to receive
commands and send back information. Even though the profiler receives commands and sends
data only through channel 0 to minimize the combinational logic, a
considerable number of tasks still has to be organized.
To coordinate every operation we need a robust FSM that tolerates
poorly designed, misbehaving user cores and is at the same time simple,
lightweight and low in latency. The latest version of the profiler uses a simplified FSM with 4
states.
Figure 3.5: Monitor’s Finite State Machine
00 - IDLE
Idle is the Monitor’s initial state. It is also the state it returns to after a RIFFA reset, giving
the Monitor the ability to recover even if user IP cores behave unpredictably.
The Monitor remains in this state until the RX signal of channel 0 goes high. This indicates
either the start of a transmission from software to the user core connected to channel
0 or a request to the Monitor. Once this happens, control is handed
over to state 01.
01 - READ OPCODE
The Monitor’s driver is responsible for attaching the right header to the incoming data.
As soon as the data is valid, the Monitor reads the first 128 bits and freezes the RX procedure
by dropping the channel’s RX read enable signal. The header
is essentially an operation code (OPCODE) that selects the job to perform. Even
though not all 128 bits are needed, a full transfer is used as the header so that
data alignment is not disturbed. A total of 7 bits are used as the OPCODE, which
was a fair trade-off between expressiveness and resource economy. The next state is
determined by the two LSBs of the OPCODE, and at this point the Monitor has
to select one of four operations.
1. Do nothing, go to state 10. The transaction was actually targeting the accelerator on
channel 0.
2. Transmit through channel 0 useful information about channel usage, last
measurements, timestamp and parameter values, go to state 11.
3. Transmit through channel 0 all valid entries recorded in the Log, go to state
11.
4. Reset specific counters, activate or deactivate tail, return to idle state 00.
10 - FORWARD TO USER CORE
The incoming data was sent to the accelerator attached to channel 0. The 128-bit
header is removed, since it was already consumed by the Monitor in state 01, and the
remaining data is delivered to the accelerator. From this moment on the accelerator
has complete control over channel 0; when it is ready it raises the channel’s
RX read enable flag. A specific event must mark the moment when the Monitor can
regain control of the channel. Using a hard-coded trigger would require a manual
addition to every IP core connected to channel 0, with extra logic and an extra
interconnection. As a result, the efforts for transparency and compatibility would go
to waste. Instead, relying on the fact that an RX transaction to
the IP core is almost always followed immediately by a TX response, the Monitor
regains control of channel 0 after TX has finished and returns to the IDLE state. This was
achieved with minimal additional logic because all RX and TX signals are already
intercepted. If tail is enabled, one extra cycle of delay is added before the jump to
IDLE.
11 - TRANSMIT (LOG OR INFO)
In this state the Monitor uses RIFFA’s TX engine to transmit through channel
0. Two distinct operations can be performed: flushing the log or returning the values of
specific registers. The first one returns all entries recorded in the Log’s BRAM.
The mechanism that forwards entries from the BRAM to the TX data buffer is aware of
transmission delays and uses a secondary register array to avoid data loss. The
flush procedure and the structure of the entries are presented and explained in
section 3.4.4.
The second operation by default returns a data frame
containing parameter values, tail setup, number of user-specified events, size of the Log’s
BRAM, number of valid entries, the current timestamp and the recorded values of event
usage and duration. The structure of this frame can easily be reconfigured to
match the user’s needs.
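The four states and their transitions can be summarized as a small software model in C. This is a sketch of the behaviour described above, not the RTL of profiler.v: the input names are hypothetical, INFO = 01 is an inferred encoding, and a RIFFA reset is modelled as an unconditional return to IDLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum state { ST_IDLE = 0, ST_READ_OPCODE = 1, ST_FORWARD = 2, ST_TRANSMIT = 3 };
enum job   { JOB_FORWARD = 0, JOB_INFO = 1, JOB_FLUSH = 2, JOB_SET = 3 };

/* Inputs sampled in one clock cycle (hypothetical names). */
struct fsm_in {
    bool rx_active;    /* RX signal of channel 0 is high */
    bool header_valid; /* first 128-bit word of the transaction is available */
    unsigned opcode;   /* OPCODE read from the header */
    bool tx_done;      /* TX on channel 0 (plus the tail cycle, if enabled) ended */
    bool reset;        /* RIFFA reset */
};

static enum state next_state(enum state s, const struct fsm_in *in)
{
    assert(in != NULL);
    if (in->reset) return ST_IDLE;
    switch (s) {
    case ST_IDLE:
        return in->rx_active ? ST_READ_OPCODE : ST_IDLE;
    case ST_READ_OPCODE:
        if (!in->header_valid) return ST_READ_OPCODE;
        switch (in->opcode & 0x3u) {           /* Job Select = two LSBs */
        case JOB_FORWARD: return ST_FORWARD;
        case JOB_INFO:
        case JOB_FLUSH:   return ST_TRANSMIT;
        default:          return ST_IDLE;      /* SET: apply flags, back to idle */
        }
    case ST_FORWARD:
    case ST_TRANSMIT:
        return in->tx_done ? ST_IDLE : s;      /* hold until TX has finished */
    }
    return ST_IDLE;
}
```

Keeping FORWARD and TRANSMIT on the same exit condition reflects the design choice that the end of a TX transaction is the only event used to hand channel 0 back to the Monitor.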
3.4.1.2 Tail
Appending a number of bits with extra information to an accelerator’s transactions
was in fact the original functionality and the origin of the whole project. It started
as part of the accelerator’s logic and was later removed from it, since all operations
requiring major changes to user IP cores were ruled out. With every signal
intercepted and complete knowledge of the timing of RIFFA’s TX engine, it is now possible
to attach the extra bits without interfering with the accelerator’s TX FSM. Since
the Monitor is designed for the 128-bit version of the RIFFA endpoint, just one extra
cycle of active transmission offers 128 bits of information. This was not only straight
forward to implement but also sufficient for the amount of data to be attached.
The default return values are the current timestamp and the duration of
the corresponding accelerator usage. These values can be changed to match the user’s
needs.
At software level, a user who wants to receive the extra information
attached to the original data has to increase the number of received
words and adjust the buffer size accordingly. Otherwise nothing
breaks; the tail data is simply ignored by the software. One detail
worth noting is that if the word count of the outgoing data is not a multiple
of 4, the tail effectively spans more than one 128-bit frame: the missing words of the last
data frame are filled with zeros so that the tail itself is aligned in a single transfer
frame.
Once tail is enabled through the correct OPCODE, the TAIL register is
set to 1. To determine whether TX has finished we need one more register, named
TAIL_FLAG. This flag remains high for the whole duration of the TX plus one cycle.
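The buffer adjustment needed on the software side follows directly from the alignment rule above. The C sketch below uses hypothetical helper names and assumes that the zero-filled padding words count as received words. It computes the length to request from fpga_recv and the position of the tail in the receive buffer.

```c
#include <assert.h>

#define WORDS_PER_FRAME 4u   /* one 128-bit frame = four 32-bit words */

/* Words to request so that the padded data and the 128-bit tail both fit. */
static unsigned recv_len_with_tail(unsigned data_words)
{
    unsigned padded;
    assert(data_words > 0u);
    /* round the data up to a whole number of 128-bit frames */
    padded = (data_words + WORDS_PER_FRAME - 1u) / WORDS_PER_FRAME * WORDS_PER_FRAME;
    return padded + WORDS_PER_FRAME;          /* plus one frame for the tail */
}

/* Index of the first tail word inside the receive buffer. */
static unsigned tail_offset(unsigned data_words)
{
    return recv_len_with_tail(data_words) - WORDS_PER_FRAME;
}
```

For an accelerator that returns 10 words, the buffer would hold 16 words: 10 data words, 2 zero words of padding and the 4-word tail starting at index 12.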
3.4.1.3 Parameters
The Monitor core extends the list of parameters inherited from the riffa_adapter
module with seven new ones, presented below. RIFFA alone already consumes
considerable resources, and if we expect to use it with large designs the profiling
part should occupy as little space on the FPGA as possible. With enough parameters to
specify the required functionality and number of events, the Monitor core is generated
with only the necessary modules and the necessary part of the designed logic.
C_DATA_WIDTH
RX/TX interface data width. This parameter is inherited from the RIFFA top
module and passed directly to user IP cores. The current Monitor version is designed
for 128 bits, but with further modification 32-bit and 64-bit support can be achieved.
C_NUM_CHNL
Number of RIFFA channels (1-12). The second parameter coming directly from
RIFFA. Increasing the number of channels does not significantly raise the Monitor’s
resource usage, since the control mechanism is only connected to channel 0.
BRAM_SIZE
Log’s memory size in words. Entries are 64 bits wide, so a total of BRAM_SIZE/2
events can fit in the Log. Default value is 2048, which translates to 128 Kb (144 Kb
with parity); four 36 Kb BRAM primitives are used.
MONITOR_CHANNELS
If high, usage information is recorded for each channel, creating the pseudo-event
of a full RX, accelerator usage, TX cycle. This information consists of 64 bits: the
16 MSBs hold the number of occurrences and the 48 LSBs the duration.
LOG_CHANNEL
If high, the RX and TX of each channel are recorded in the log, providing a
timing analysis of the event mentioned above.
TRIGGERS_NUM
Number of user-defined events (triggers). Default value is 0, since this is an
optional function and users have to manually attach their events to the TRIGGER
wire.
LOG_TRIGGER
If high, triggers are saved as events in the log. Default value is 0, since
triggers are optional events that users choose to monitor and have to attach manually
to the TRIGGER wire.
SUM
If high, durations (triggers or channel usage) are accumulated. If low, only
the last duration is available.
MERGE_PULSES
If high, a total of 16 different events can be monitored (instead of 8). The
number of events depends on the size of their ID when saved in the Log. Since we use
64-bit entries with 48-bit timestamps, the ID consists of the remaining 16 bits. The default
option (0) uses 2 bits per event, one for the beginning and one for the termination, since this
simplifies decoding and visualizing the recorded data.
If both LOG_CHANNEL and LOG_TRIGGER are 0 there is no need to
instantiate the event log and its BRAM. Likewise, MONITOR_CHANNELS and
TRIGGERS_NUM define the number of trigger_monitor modules. Properly setting these
parameters can minimize the Monitor’s size on the FPGA.
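The 64-bit usage word recorded when MONITOR_CHANNELS is high (16 MSBs for the occurrence count, 48 LSBs for the duration) can be unpacked on the software side as in the following C sketch. The struct and function names are hypothetical and only illustrate the bit layout stated above.

```c
#include <assert.h>
#include <stdint.h>

#define DURATION_BITS 48
#define DURATION_MASK ((UINT64_C(1) << DURATION_BITS) - 1)

struct usage_info {
    uint16_t occurrences;   /* 16 MSBs of the usage word */
    uint64_t duration;      /* 48 LSBs, in clock cycles */
};

/* Split a 64-bit usage word into its two fields. */
static struct usage_info unpack_usage(uint64_t word)
{
    struct usage_info u;
    u.occurrences = (uint16_t)(word >> DURATION_BITS);
    u.duration = word & DURATION_MASK;
    return u;
}

/* Inverse operation, useful for building test vectors. */
static uint64_t pack_usage(uint16_t occurrences, uint64_t duration)
{
    assert(duration <= DURATION_MASK);
    return ((uint64_t)occurrences << DURATION_BITS) | duration;
}
```

With SUM high the duration field holds the accumulated cycles of all occurrences, so dividing it by the occurrence count gives the mean duration per event.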
3.4.2 Global Timer
This minor module has the sole job of counting every cycle since the FPGA was first
programmed and can only be reset manually with the PROFILER SET function. It
uses a 48-bit counter, which means 2^48 * 4 ns ≈ 13.03 days of nonstop operation.
The output of the global timer module is the 48-bit timestamp, which can be sent directly to
the Monitor’s driver through the TX engine with the TAIL and INFO operations or used
in the logging procedure.
Figure 3.6: Global Timer RTL schematic
To restart the global timer the Monitor must be provided with the proper
OPCODE; a RIFFA reset does not affect it. An important detail is that the timer cannot be
restarted unless the Log is reset as well. This way we avoid mixing new entries with
invalid outdated ones.
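Since the timestamp counts cycles of the 250 MHz user clock, converting it to time units on the software side is a single multiplication. The C sketch below uses hypothetical helper names; it also shows how a difference of two timestamps can be taken modulo 2^48 so that a single counter wrap-around does not corrupt the result.

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_PERIOD_NS 4u                        /* 250 MHz user clock */
#define TIMESTAMP_MASK ((UINT64_C(1) << 48) - 1)  /* 48-bit counter */

/* Convert a 48-bit timestamp (clock cycles) to nanoseconds. */
static uint64_t timestamp_to_ns(uint64_t cycles)
{
    assert(cycles <= TIMESTAMP_MASK);
    return cycles * CLOCK_PERIOD_NS;              /* at most 2^50, fits in 64 bits */
}

/* Cycles elapsed between two timestamps, tolerating one wrap-around. */
static uint64_t cycles_between(uint64_t start, uint64_t stop)
{
    return (stop - start) & TIMESTAMP_MASK;
}
```

A timestamp of 250,000,000 cycles therefore corresponds to exactly one second, and the largest timestamp corresponds to the roughly 13 days quoted above.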
3.4.3 Monitor Submodule
Real-time monitoring of events is achieved with the help of the trigger_monitor
module. It is responsible for counting event occurrences, measuring their
duration, and generating pulses at their beginning and end. These pulses notify
the Monitor, and new entries are recorded in the Log. For each event the module
uses a 16-bit counter to count the occurrences and a 48-bit counter to measure the last
duration or accumulate the total duration, depending on the parameter SUM. All counters
combined are propagated through the o_INFO output to the parent module (the Monitor’s
top module) and are used by the INFO and TAIL operations.
Figure 3.7: Event Monitor
The generated START and STOP pulses reach the top module through the output o_ID.
If the parameter MERGE_PULSES is high, instead of assigning an exclusive bit to every signal,
START[i] and STOP[i] are combined. This way the number of
monitored events increases from 8 to 16.
The RIFFA Monitor instantiates up to two instances of this module, one for
channel transactions and one for user-specified events. As with the global timer,
the counters are reset by providing the proper OPCODE.
Figure 3.8: Monitor Submodule RTL schematic
Figure 3.9: Monitor module ID generation
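The ID generation can be modelled in C as follows. The exact bit positions are an assumption: in the default mode event i is taken to use bit 2i for START and bit 2i+1 for STOP, and in merged mode bit i is taken to be the OR of START[i] and STOP[i]. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the 16-bit ID from per-event START/STOP pulses (bit i = event i).
 * Default mode: 2 bits per event, 8 events. Merged mode: 1 bit per event,
 * 16 events. Bit positions are assumed, not taken from the RTL. */
static uint16_t make_id(uint16_t start, uint16_t stop, bool merged)
{
    unsigned id = 0;
    if (merged)
        return (uint16_t)(start | stop);
    assert(start <= 0xFFu && stop <= 0xFFu);   /* only 8 events fit */
    for (unsigned i = 0; i < 8; i++) {
        if (start & (1u << i)) id |= 1u << (2 * i);
        if (stop  & (1u << i)) id |= 1u << (2 * i + 1);
    }
    return (uint16_t)id;
}

/* Decoding helpers for the default (unmerged) layout. */
static bool id_has_start(uint16_t id, unsigned event)
{
    return ((id >> (2 * event)) & 1u) != 0;
}

static bool id_has_stop(uint16_t id, unsigned event)
{
    return ((id >> (2 * event + 1)) & 1u) != 0;
}
```

In merged mode a single bit cannot tell a beginning from a termination, so the decoder has to track the previous state of each event, which is the decoding cost the default mode avoids.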
3.4.4 Event Log
The event_log module serves as the RIFFA Monitor’s memory, in which the moments
at which each event begins or terminates are recorded. Each entry consists of 64 bits.
16 bits classify the trigger and 48 bits hold the timestamp. The default number of
entries is 2048, chosen to be small relative to the total resources and at the same time
sufficient.
The event log instantiates and manages a 128 Kb dual-port BRAM primitive
with a 64-bit wide write port (A) and a 128-bit wide read port (B). Only one
record can be added per cycle, which is acceptable since the 16-bit ID encodes
all possible combinations of the monitored events. When the memory is flushed, two
entries are read per cycle to fill the 128 available bits of the DATA frame. A
log2(#entries)-bit register keeps the current address, and in case of memory
overflow the logic presented below prevents valid entries from being fetched out of order.
The module’s RTL schematic is presented in figure 3.10.
Figure 3.10: Event Log Module RTL schematic
BRAM FLUSH
This is the procedure of requesting and downloading all the recorded entries of
the Log. While the user cores (accelerators) are operating, all predefined and
manually specified events are recorded in the Log. To export these entries users
call the function PR_LOG, built into the software interface. OPCODE 0000010,
signaling the FLUSH operation, is then provided to the Monitor’s control mechanism.
The BRAM entries are streamed through the TX engine, taking into account all
possible delays from the RIFFA interface. Additional options for displaying and
visualizing the downloaded entries are provided and are presented in the software
API section.
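One plausible way to keep flushed entries in order after an overflow is to treat the Log as a circular buffer and start reading at the oldest entry. The C model below is a sketch of that idea under this assumption; it is not derived from the event_log RTL. Placing the ID in the 16 MSBs of an entry is likewise assumed, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_ENTRIES 2048u            /* default number of entries */

struct event_log {
    uint64_t mem[LOG_ENTRIES];
    unsigned wr_addr;                /* next address to write */
    unsigned valid;                  /* valid entries, saturates at LOG_ENTRIES */
};

/* Pack a 16-bit ID and a 48-bit timestamp (ID in the MSBs is an assumption). */
static uint64_t make_entry(uint16_t id, uint64_t timestamp)
{
    assert(timestamp < (UINT64_C(1) << 48));
    return ((uint64_t)id << 48) | timestamp;
}

/* Record one entry; once full, the oldest entry is overwritten. */
static void log_write(struct event_log *log, uint64_t entry)
{
    log->mem[log->wr_addr] = entry;
    log->wr_addr = (log->wr_addr + 1u) % LOG_ENTRIES;
    if (log->valid < LOG_ENTRIES) log->valid++;
}

/* Copy the valid entries to out[], oldest first; returns how many were copied. */
static unsigned log_flush(const struct event_log *log, uint64_t *out)
{
    /* After an overflow the oldest entry sits at wr_addr, otherwise at 0. */
    unsigned start = (log->valid == LOG_ENTRIES) ? log->wr_addr : 0u;
    for (unsigned i = 0; i < log->valid; i++)
        out[i] = log->mem[(start + i) % LOG_ENTRIES];
    return log->valid;
}
```

In this model resetting the Log corresponds to setting both wr_addr and valid to zero, which matches the description of the log reset as setting the number of valid entries to zero.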
3.5 Driver
The software API extends the already installed RIFFA driver with additional
functions. To use an accelerator attached to the monitored RIFFA framework, users
have to include the Monitor’s library in their source code. For fast reconfigurability, and
following RIFFA’s simple API structure, the driver is contained in a single file.
• int PR_fpga_send(fpga_t * fpga, int chnl, void * data, int len,
int destoff, int last, long long timeout)
fpga - Pointer to fpga_t structure.
chnl - Channel number over which to communicate.
data - Pointer to array of data to send.
len - Length of data to send in words.
destoff - Value sent to FPGA core to indicate where to start writing this data.
last - If 1, this transfer is the last in a sequence of transfers.
timeout - Timeout value in ms. If 0, no timeout is specified. Otherwise, the
PC will wait up to timeout ms in between PC/FPGA communications.
All user calls to RIFFA’s fpga_send are redirected to PR_fpga_send. If the
selected channel is 0, a header of 128 zero bits is prepended to the send data buffer.
Afterwards the original function fpga_send is called with the same set of
parameters. Only the value of len and the contents of data are modified if needed.
• unsigned long PR_INFO(fpga_t * fpga, int print)
fpga - Pointer to fpga_t structure.
print - Optional display of downloaded information in console.
The initial version of this function was supposed to return a single
timestamp. As the profiler evolved and was enriched with additional metrics and
the log, the core purpose of PR_INFO changed completely several times. Instead
of wasting resources on logic for data selection, all recorded metrics
are downloaded with a single function call. The additional delay of a couple of
cycles is insignificant compared with the overhead of one transaction. The current
timestamp is returned and all additional information is printed in the console. As future
development, a struct could be populated with the received data.
Figure 3.11: Output of INFO function call
• void PR_SET(fpga_t * fpga, int TAIL, int RST_TRG_CNTS,
int RST_CHNL_CNTS, int RST_LOG,
int RST_GLTIMER, int print)
fpga - Pointer to fpga_t structure.
TAIL - Enable or disable the TAIL operation.
RST_TRG_CNTS - Reset the duration and count of every user-specified event.
RST_CHNL_CNTS - Reset the duration and count of every channel transaction.
RST_LOG - Reset the log by setting the number of valid entries to zero. New entries
will overwrite outdated ones.
RST_GLTIMER - Restart the global timer.
print - Optional display of the settings in console.
This function provides the proper OPCODE for resetting specific counters and
enabling or disabling the TAIL operation. Since none of the metrics is connected to a
global reset network, they retain their values after a RIFFA reset call.
• void PR LOG(fpga t * fpga, unsigned short **triggers,
long **timestamps, int print, int file,
int timeline)
fpga - Pointer to fpga t structure.
triggers - Array of downloaded triggers.
timestamps - Array of downloaded timestamps.
print - Optional display of downloaded entries in console.
file - Optional printing of downloaded entries in file.
timeline - If not 0 a basic visualization of downloaded entries will be printed
in file.
The continuation of FLUSH operation on software side is implemented in
PR LOG function. The downloaded entries are displayed, printed in raw and
expanded form and visualized in a minimalistic timeline. A console output
example can be seen in figure 3.10.
Figure 3.12: Output of LOG function call
Each line is a recorded entry in the log. The Trigger column contains the 16-bit
ID, which encodes any combination of events. Reading the ID from right to left,
every 2 bits correspond to the beginning and the termination of one defined
event. If the timeline parameter is non-zero, a basic visualization is printed in
TIMELINE.txt. The overall period since FPGA programming or the last Global Timer
reset is split into a number of sections equal to the timeline value and is
visualized according to the logged entries.
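The entry layout described above can be sketched as follows. The sketch assumes that each 64-bit log entry holds the 16-bit ID in its upper bits and the 48-bit timestamp in its lower bits (consistent with the PR_LOG listing on a little-endian host), and that MERGE_PULSES is 0 so that each event owns a START bit followed by a STOP bit. The type pr_entry_t and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded log entry (illustrative type, not part of the Monitor API). */
typedef struct {
    uint16_t id;         /* 16-bit event ID      */
    uint64_t timestamp;  /* 48-bit timestamp     */
} pr_entry_t;

/* Splits a raw 64-bit entry into ID (upper 16 bits) and timestamp (lower 48). */
pr_entry_t pr_split_entry(uint64_t raw)
{
    pr_entry_t e;
    e.id = (uint16_t)(raw >> 48);
    e.timestamp = raw & 0xFFFFFFFFFFFFULL;
    return e;
}

/* Bit 2*slot of the ID marks the beginning of event `slot`. */
int pr_event_started(uint16_t id, unsigned slot)
{
    assert(slot < 8);
    return (id >> (2 * slot)) & 1;
}

/* Bit 2*slot+1 of the ID marks the termination of event `slot`. */
int pr_event_stopped(uint16_t id, unsigned slot)
{
    assert(slot < 8);
    return (id >> (2 * slot + 1)) & 1;
}
```

With this layout an ID of 0000000000000110 means that event 0 terminated and event 1 began in the same clock cycle.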
Figure 3.13: Basic visualization in the form of a timeline
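How the recorded period is split into sections can be illustrated with a simplified sketch. It is not the thesis implementation (parts of which are not reproduced in Appendix B) but a reduced version of the same idea: the total time is divided into equal sections and a section is marked with '|' if the event was running at any point inside it, otherwise with '.'. The function name pr_timeline_row and its parameters are hypothetical; entries are assumed to be sorted by timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Renders one timeline row for event `slot` into `row`
 * (which must hold sections + 1 bytes). */
void pr_timeline_row(const uint16_t *ids, const uint64_t *ts, size_t n,
                     uint64_t total, unsigned sections, unsigned slot,
                     char *row)
{
    assert(sections > 0 && slot < 8);
    uint64_t step = total / sections;
    if (step == 0) step = 1;              /* avoid zero-width sections     */
    size_t next = 0;                      /* next log entry to consume     */
    int active = 0;                       /* is the event running now?     */

    for (unsigned col = 0; col < sections; col++) {
        /* Upper time bound of this section; the last one takes the rest. */
        uint64_t end = (col + 1 == sections) ? UINT64_MAX
                                             : (uint64_t)(col + 1) * step;
        int seen = active;                /* running since an earlier section */
        while (next < n && ts[next] < end) {
            if ((ids[next] >> (2 * slot)) & 1u) { active = 1; seen = 1; }
            if ((ids[next] >> (2 * slot + 1)) & 1u) active = 0;
            next++;
        }
        row[col] = seen ? '|' : '.';
    }
    row[sections] = '\0';
}
```

Calling the function once per event slot and printing each row on its own line gives a text timeline comparable to the one in the figure.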
3.6 Architectural exploration
1. Before Monitor became a stand-alone module, it was implemented as additional
lines of code in the accelerator we wanted to monitor. A couple of hardware
counters measured execution time and the result was embedded in the returned
data stream. The attachment was achieved by altering the accelerator's TX
interface handler. Since user-created IP cores don't follow a specific design
pattern, different logic had to be implemented for each one.
2. The next generation was developed as a wrapper module. One accelerator was
instantiated in one Monitor wrapper. This was the first step towards a universal
design for all accelerators. Again, hardware counters were used to measure
either total accelerator usage or a manually specified event, and results were
sent to the software level with a mechanism similar to TAIL but with several
flaws. Because only this function was performed, communication with the software
level was not required.
3. The next attempt inserted a global timer as an extra module, accessed by every
Monitor wrapper, and apart from the duration a timestamp was attached to the
outgoing results. We tried to separate Monitor from the accelerator and
instantiate them on the same hierarchy level, but several compatibility issues
arose. Once again, operation selection was hard-coded in the design, so RX
signals were propagated as is.
4. The next implementation was a top profiler module containing the Global Timer,
memory blocks for logging and one profiler wrapper for each channel. Since
users should be able to download the recorded data, basic communication between
profiler and software was necessary. The fact that every channel had an
appointed profiler wrapper made it easier to intercept RX signals, but made the
design less user-friendly. It was also incompatible with accelerators that use
more than one channel.
5. In the next-to-last design, the Monitor top module monitored and logged every
event and communicated with software through all available channels. The TAIL
operation was fully functional but the design lacked optimizations. Limiting
controls to a single channel, parameterizing the module generation, minimizing
the FSM and merging similar operations led to the current version of RIFFA
Monitor.




A hardware design for SoC monitoring and logging was successfully developed.
After a series of testing and debugging rounds, RIFFA Monitor has reached a stable,
user-friendly and resource-efficient state. It has already been used for the
evaluation of mathematical accelerators and for a better understanding of the RIFFA
communication engine. Compatibility with legacy RIFFA designs adds extra value to
abandoned projects, and the Monitor will be a useful tool for any future work based
on the RIFFA infrastructure.
4.2 In the Future
Reaching a stable version was an important milestone, but several tasks and
ideas are still waiting to be implemented. Extra functionality for sampling would
give a more statistical approach, though the necessity of such information is still
under discussion. Monitoring of standard IPs and the AXI bus still requires manual
work by the user; it could be automated just like the monitoring of the RIFFA
communication engine. Support for single-trigger events without duration could be
added to avoid wasted entries (2 for 1). Timestamps were chosen to be absolute time
in ns for easier understanding of the log, but a later version could use time
differences and change the balance between the ID and TIMESTAMP bit counts. RIFFA
Monitor is currently implemented for a 128-bit data width; 32-bit and 64-bit support
is in progress. Apart from hardware improvements and additions, Monitor's software
can be greatly upgraded by adding a graphical user interface. Data interpretation and
visualization, combined with extensive automation, will turn RIFFA Monitor into a
handy and convenient tool for FPGA design engineers.
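The idea of storing time differences instead of absolute timestamps can be sketched in software. The example below is only an illustration of the trade-off, not a planned implementation: it converts a sorted list of absolute timestamps to differences and back, and reports how many bits a value needs, which indicates how far the TIMESTAMP field could shrink in favour of the ID field. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts absolute, non-decreasing timestamps into differences
 * between consecutive entries (the first one is relative to zero). */
void pr_to_deltas(const uint64_t *abs_ts, uint64_t *delta, size_t n)
{
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        assert(abs_ts[i] >= prev);       /* log entries are time-ordered */
        delta[i] = abs_ts[i] - prev;
        prev = abs_ts[i];
    }
}

/* Restores absolute timestamps by accumulating the differences. */
void pr_from_deltas(const uint64_t *delta, uint64_t *abs_ts, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += delta[i];
        abs_ts[i] = acc;
    }
}

/* Number of bits required to store `value` (0 needs no bits). */
unsigned pr_bits_needed(uint64_t value)
{
    unsigned bits = 0;
    while (value) {
        bits++;
        value >>= 1;
    }
    return bits;
}
```

A log whose largest gap between entries is 250 cycles would need only 8 timestamp bits per entry in this scheme, at the cost of making a single entry meaningless without its predecessors.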
Appendix A
Verilog Source Code
RIFFA Monitor top module
`timescale 1ns/1ns

module profiler (clk, o_RX_CLK, i_RX, o_RX_ACK,
    i_RX_LAST, i_RX_LEN, i_RX_OFF, i_RX_DATA,
    i_RX_DATA_VALID, o_RX_DATA_REN, o_TX_CLK, o_TX, i_TX_ACK,
    o_TX_LAST, o_TX_LEN, o_TX_OFF, o_TX_DATA,
    o_TX_DATA_VALID, i_TX_DATA_REN, rst);

parameter C_DATA_WIDTH = 9'd128;
parameter C_NUM_CHNL = 4'd1;
parameter BRAM_SIZE = 16'd4096;
parameter MONITOR_CHANNELS = 1'b1;
parameter LOG_CHANNELS = 1'b1;
parameter TRIGGERS_NUM = 4'd0;
parameter LOG_TRIGGERS = 1'b0;
parameter SUM = 1'b0;
parameter MERGE_PULSES = 1'b0;

input clk;
output [C_NUM_CHNL-1:0] o_RX_CLK;
input [C_NUM_CHNL-1:0] i_RX;
output [C_NUM_CHNL-1:0] o_RX_ACK;
input [C_NUM_CHNL-1:0] i_RX_LAST;
input [(C_NUM_CHNL*32)-1:0] i_RX_LEN;
input [(C_NUM_CHNL*31)-1:0] i_RX_OFF;
input [(C_NUM_CHNL*C_DATA_WIDTH)-1:0] i_RX_DATA;
input [C_NUM_CHNL-1:0] i_RX_DATA_VALID;
output [C_NUM_CHNL-1:0] o_RX_DATA_REN;

output [C_NUM_CHNL-1:0] o_TX_CLK;
output [C_NUM_CHNL-1:0] o_TX;
input [C_NUM_CHNL-1:0] i_TX_ACK;
output [C_NUM_CHNL-1:0] o_TX_LAST;
output [(C_NUM_CHNL*32)-1:0] o_TX_LEN;
output [(C_NUM_CHNL*31)-1:0] o_TX_OFF;
output [(C_NUM_CHNL*C_DATA_WIDTH)-1:0] o_TX_DATA;
output [C_NUM_CHNL-1:0] o_TX_DATA_VALID;
input [C_NUM_CHNL-1:0] i_TX_DATA_REN;
input rst;

reg DONE = 1'b0;
reg RX0 = 1'b0;
reg TAIL = 1'b0;
reg FLUSH = 1'b0;
reg chnl_rx_last_buff = 1'b0;
reg [31:0] chnl_rx_len_buff = {32{1'b0}};
reg [30:0] chnl_rx_off_buff = {31{1'b0}};
reg [5:0] OPCODE = {6{1'b0}};
reg [31:0] LEN = {32{1'b0}};
reg [29:0] WCOUNT = {30{1'b0}};
reg [1:0] PR_STATE = {2{1'b0}};
reg [9:0] ADDRESS = {10{1'b0}};
reg [C_NUM_CHNL-1:0] TAIL_FLAG = {C_NUM_CHNL{1'b0}};
reg [C_NUM_CHNL-1:0] TX_DELAY = {C_NUM_CHNL{1'b0}};
reg [C_DATA_WIDTH-1:0] TxDATA = {C_DATA_WIDTH{1'b0}};
reg [C_DATA_WIDTH-1:0] TxDATA_BUFF = {C_DATA_WIDTH{1'b0}};

wire [C_DATA_WIDTH-1:0] LOG;
wire [TRIGGERS_NUM-1:0] TRIGGER;
wire [2*TRIGGERS_NUM-1:0] TR_ID;
wire [2*C_NUM_CHNL-1:0] CH_ID;
wire [15:0] ID;
wire [TRIGGERS_NUM*64-1:0] TR_INFO;
wire [C_NUM_CHNL*64-1:0] CH_INFO;
wire [47:0] TIMESTAMP;




wire [C_NUM_CHNL-1:0] chnl_rx_clk;
wire [C_NUM_CHNL-1:0] chnl_rx;
wire [C_NUM_CHNL-1:0] chnl_rx_ack;
wire [C_NUM_CHNL-1:0] chnl_rx_last;
wire [(C_NUM_CHNL*32)-1:0] chnl_rx_len;
wire [(C_NUM_CHNL*31)-1:0] chnl_rx_off;
wire [(C_NUM_CHNL*C_DATA_WIDTH)-1:0] chnl_rx_data;
wire [C_NUM_CHNL-1:0] chnl_rx_data_valid;
wire [C_NUM_CHNL-1:0] chnl_rx_data_ren;
wire [C_NUM_CHNL-1:0] chnl_tx_clk;
wire [C_NUM_CHNL-1:0] chnl_tx;
wire [C_NUM_CHNL-1:0] chnl_tx_ack;
wire [C_NUM_CHNL-1:0] chnl_tx_last;
wire [(C_NUM_CHNL*32)-1:0] chnl_tx_len;
wire [(C_NUM_CHNL*31)-1:0] chnl_tx_off;
wire [(C_NUM_CHNL*C_DATA_WIDTH)-1:0] chnl_tx_data;
wire [C_NUM_CHNL-1:0] chnl_tx_data_valid;
wire [C_NUM_CHNL-1:0] chnl_tx_data_ren;

///////////////////////////////////////////////////////////
//// ASSIGN PROFILER OUTPUTS & ACCELERATOR INPUTS //////
///////////////////////////////////////////////////////////
assign o_RX_CLK = chnl_rx_clk;
assign o_RX_ACK[0] = (PR_STATE == 2'd1);
assign o_RX_DATA_REN[0] = (PR_STATE == 2'd1) | ((PR_STATE == 2'd2) &
    chnl_rx_data_ren[0]);
assign o_TX_CLK = chnl_tx_clk;
assign o_TX[0] = (PR_STATE == 2'd3) | TAIL_FLAG[0] | chnl_tx[0];
assign o_TX_LAST[0] = (PR_STATE == 2'd3) | chnl_tx_last[0];
assign o_TX_LEN[31:0] = (PR_STATE == 2'd3) ? LEN : (TAIL ? -1 :
    chnl_tx_len[31:0]);
assign o_TX_OFF[30:0] = (PR_STATE == 2'd3) ? 31'd0 : chnl_tx_off[30:0];
assign o_TX_DATA[C_DATA_WIDTH-1:0] = (PR_STATE == 2'd3) ? TxDATA :
    (chnl_tx_data_valid[0] ? chnl_tx_data[C_DATA_WIDTH-1:0] :
    {TIMESTAMP,16'b0,CH_INFO[47:0]});
assign o_TX_DATA_VALID[0] = (PR_STATE == 2'd3) | TAIL_FLAG[0] |
    chnl_tx_data_valid[0];
assign user_clk = clk;
assign riffa_reset = rst;
assign chnl_rx[0] = (PR_STATE == 2'd2) & RX0;
assign chnl_rx_last[0] = chnl_rx_last_buff;
assign chnl_rx_len[31:0] = chnl_rx_len_buff;
assign chnl_rx_off[30:0] = chnl_rx_off_buff;
assign chnl_rx_data_valid[0] = (PR_STATE == 2'd2) & i_RX_DATA_VALID[0];
assign chnl_rx_data = i_RX_DATA;
assign chnl_tx_ack = i_TX_ACK;
assign chnl_tx_data_ren = i_TX_DATA_REN;
genvar q;
generate
for (q = 1; q < C_NUM_CHNL; q = q + 1) begin : assign_chnls_1_12
    assign o_RX_ACK[q] = chnl_rx_ack[q];
    assign o_RX_DATA_REN[q] = chnl_rx_data_ren[q];
    assign o_TX[q] = TAIL_FLAG[q] | chnl_tx[q];
    assign o_TX_LAST[q] = chnl_tx_last[q];
    assign o_TX_LEN[32*(q+1)-1:32*q] = TAIL ? -1 :
        chnl_tx_len[32*(q+1)-1:32*q];
    assign o_TX_OFF[31*(q+1)-1:31*q] = chnl_tx_off[31*(q+1)-1:31*q];
    assign o_TX_DATA[C_DATA_WIDTH*(q+1)-1 : C_DATA_WIDTH*q] =
        chnl_tx_data_valid[q] ? chnl_tx_data[C_DATA_WIDTH*(q+1)-1 :
        C_DATA_WIDTH*q] : {TIMESTAMP,16'b0,CH_INFO[64*(q+1)-17:64*q]};
    assign o_TX_DATA_VALID[q] = TAIL_FLAG[q] | chnl_tx_data_valid[q];
    assign chnl_rx[q] = i_RX[q];
    assign chnl_rx_last[q] = i_RX_LAST[q];
    assign chnl_rx_len[32*(q+1)-1:32*q] = i_RX_LEN[32*(q+1)-1:32*q];
    assign chnl_rx_off[31*(q+1)-1:31*q] = i_RX_OFF[31*(q+1)-1:31*q];
    assign chnl_rx_data_valid[q] = i_RX_DATA_VALID[q];
end

case ({LOG_TRIGGERS,LOG_CHANNELS})
    2'b00: assign ID = 0;
    2'b01: assign ID = CH_ID;
    2'b10: assign ID = TR_ID;
    2'b11: assign ID = {TR_ID,CH_ID};
endcase

for (q = 0; q < C_NUM_CHNL; q = q + 1) begin : fix_tail
    always @(posedge clk)
        TAIL_FLAG[q] <= TAIL_FLAG[q] ? chnl_tx[q] | ~i_TX_DATA_REN[q]









always @(posedge clk)






{FLUSH,DONE,OPCODE,WCOUNT} <= 0;
if (i_RX[0] & ~TAIL_FLAG[0])
begin
    chnl_rx_len_buff <= i_RX_LEN[31:0] - (C_DATA_WIDTH/32);
    chnl_rx_last_buff <= i_RX_LAST[0];
    chnl_rx_off_buff <= i_RX_OFF[30:0];









OPCODE <= i_RX_DATA[5:0];
case (i_RX_DATA[1:0])
    2'b00: PR_STATE <= 2'b10; // USE ACCELERATOR
    2'b01: // RETURN DURATION OR TIMESTAMP
    begin









        LEN <= BRAM_SIZE;
        TxDATA <= LOG;
        FLUSH <= 1'b1;
        ADDRESS <= ADDRESS + 1;
    end

    2'b11: // RESET TIMER-COUNTERS, SET TAIL (128-BIT EXTRA INFO AFTER EACH TRANSMISSION)
    begin
        TAIL <= i_RX_DATA[6];




    else if (FLUSH) {ADDRESS,PR_STATE} <= {ADDRESS + 1, 2'b11};
end
// ASSUME ACCELERATOR WORK COMPLETED AFTER TRANSMISSION
2'b10:
    if (TX_DELAY[0] & ~chnl_tx[0]) PR_STATE <= 2'b00;
// DATA TRANSMISSION
2'b11:




WCOUNT <= WCOUNT + (C_DATA_WIDTH/32);
if (WCOUNT >= LEN-4) {PR_STATE,FLUSH,ADDRESS} <= 'b0;
else if (OPCODE == 2'b01)
begin
    TxDATA[63:0] <= CH_INFO>>(WCOUNT<<4);




        {TxDATA,FLUSH} <= FLUSH ? {LOG,1'b1} : {TxDATA_BUFF, 1'b1};
        ADDRESS <= ADDRESS + 1'b1;
    end
end



























reg [C_NUM_CHNL-1:0] RX_DELAY = {C_NUM_CHNL{1'b0}};
reg [C_NUM_CHNL-1:0] CHNL_TRIGGER = {C_NUM_CHNL{1'b0}};
always @(posedge clk) RX_DELAY <= chnl_rx;
for (q = 0; q < C_NUM_CHNL; q = q + 1) begin : trigg
    always @(posedge clk)
        if (chnl_rx[q] & !RX_DELAY[q]) CHNL_TRIGGER[q] <= 1'b1;
        else if ((!chnl_tx[q] & TX_DELAY[q]) | rst) CHNL_TRIGGER[q] <= 1'b0;
end




















// START USER CODE (do edit)
////////////////////////////////////

// CHANNEL TESTER EXAMPLE
genvar i;
generate
for (i = 0; i < C_NUM_CHNL; i = i + 1) begin : profile_channels
    chnl_tester #(C_DATA_WIDTH) module1 (
        .CLK(user_clk),
        .RST(riffa_reset), // riffa_reset includes riffa_endpoint resets




























`timescale 1ns/1ns

module global_timer(clk, o_TIMESTAMP, rst);

input clk;
output reg [47:0] o_TIMESTAMP = {48{1'b0}};
input rst;

always @(posedge clk)
    if (rst) o_TIMESTAMP <= {48{1'b0}};














parameter TRIGGERS_NUM = 4'd1;
parameter SUM = 1'b0;
parameter MERGE_PULSES = 1'b0;

input clk;
input [TRIGGERS_NUM-1:0] i_TRIGGER;
output [(2-MERGE_PULSES)*TRIGGERS_NUM-1:0] o_ID;
output [64*TRIGGERS_NUM-1:0] o_INFO;
input rst;

reg [16*TRIGGERS_NUM-1:0] TR_COUNT;
reg [48*TRIGGERS_NUM-1:0] TR_DURATION;
reg [TRIGGERS_NUM-1:0] PRV = {TRIGGERS_NUM{1'b0}};
wire [TRIGGERS_NUM-1:0] START;
wire [TRIGGERS_NUM-1:0] STOP;


always @(posedge clk) PRV <= i_TRIGGER;

assign START = i_TRIGGER & ~PRV;




for (i = 0; i < TRIGGERS_NUM; i = i + 1) begin : pg

    if (MERGE_PULSES) assign o_ID[i] = STOP[i] | START[i];
    else assign o_ID[2*i+1:2*i] = {STOP[i], START[i]};

    assign o_INFO[64*(i+1)-1:64*i] =
        {TR_COUNT[16*(i+1)-1:16*i],TR_DURATION[48*(i+1)-1:48*i]};

    always @(posedge clk)
    begin








        if (START[i] & !SUM) TR_DURATION[48*(i+1)-1:48*i] <= 48'd1;
Event Log













input [15:0] i_ID;
input [9:0] i_ADDRESS;
input [47:0] i_TIMESTAMP;
output [127:0] o_LOG;
output [10:0] o_ENTRIES;
input rst;

reg [10:0] LOG_ADDRESS = {11{1'b0}};
reg OVERFLOW = 1'b0;
wire [127:0] LOG;

assign o_LOG = (i_ADDRESS <= LOG_ADDRESS[10:1]) ? LOG : 0;
assign o_ENTRIES = OVERFLOW ? {11{1'b1}} : LOG_ADDRESS;

always @(posedge clk)
    if (rst) LOG_ADDRESS <= {11{1'b0}};
    else if (|i_ID) LOG_ADDRESS <= LOG_ADDRESS + 1'b1;

always @(posedge clk)
    if (rst) OVERFLOW <= 1'b0;
    else if ((LOG_ADDRESS == {11{1'b1}}) & (|i_ID)) OVERFLOW <= 1'b1;

// SIMPLE DUAL PORT RAM



































Appendix B
Software interface - RIFFA
Monitor API
#include <stdio.h>
#include <stdlib.h>
#include "riffa.h"  // RIFFA C library: fpga_t, fpga_send, fpga_recv

#define TIMEOUT 8000
#define BRAM_SIZE 2048


int PR_fpga_send(fpga_t * fpga, int chnl, void * data,
                 int len, int destoff, int last,
                 long long timeout){
    // If data are sent to channel 0, the profiler consumes the first 128 bits.
    // To avoid that, an extra 128'b0 is added as a header.
    // The profiler reads OPCODE == 0 and forwards the data without the header
    // to the accelerator.

    if (chnl) return fpga_send(fpga, chnl, data, len, destoff, last, timeout);
    int i, buffer[len+4];
    for(i=0; i<4; i++) buffer[i] = 0;  // zero the whole 128-bit header
    for(i=0; i<len; i++) buffer[i+4] = ((int *)data)[i];
    return fpga_send(fpga, chnl, buffer, len+4, destoff, last, timeout);
}

int *PR_PRINT_BINARY(size_t const size, void const * const ptr, int print)
// Helper function. Used for proper display of TRIGGERS
{
    unsigned char *b = (unsigned char*) ptr;
    unsigned char byte;
    int i, j;





            byte = b[i] & (1<<j);
            byte >>= j;
            if (print) printf("%u", byte);






unsigned long PR_INFO(fpga_t * fpga, int print){
    // Returns current Timestamp
    // Prints parameter and counter values. (Populate struct incoming)
    // Timestamp * clk_period = time since last profiler_reset (or bitstream
    // assignment)

    int i, *bits;
    unsigned long buff[34];
    unsigned short entries, bram_s, n_chnl, n_trigg,
                   tail, log_chnl, log_trigg, mon_chnl, merge, sum;
    unsigned short * sbuff = (unsigned short *)buff;
    buff[0]=1;

    fpga_send(fpga, 0, &buff, 1, 0, 1, TIMEOUT);
    fpga_recv(fpga, 0, &buff, 68, TIMEOUT);

    bits = PR_PRINT_BINARY(sizeof(short), sbuff+7, 0);
    entries = sbuff[3];
    bram_s = sbuff[4];
    n_chnl = sbuff[5];
    n_trigg = sbuff[6];
    sbuff[3] = 0;

    if(print){
        printf("SUM_DURATION : %d\n", bits[5]);
        printf("MERGE_PULSES : %d\n", bits[4]);
        printf("MONITOR_CHNL : %d\n", bits[3]);
        printf("LOG_TRIGGERS : %d\n", bits[2]);
        printf("LOG_CHANNELS : %d\n", bits[1]);
        printf("TAIL : %d\n", bits[0]);
        printf("TRIGGER_NUM : %d\n", n_trigg);
        printf("CHANNEL_NUM : %d\n", n_chnl);
        printf("BRAM_SIZE : %d words\n", bram_s);
        printf("ENTRIES : %d\n", entries);
        printf("TIMESTAMP : %ld\n", buff[0]);
        for (i=0; i<n_chnl; i++){
            printf("chnl %d calls : %d\n", i, sbuff[11+8*i]);
            sbuff[11+8*i]=0;
            printf("chnl %d dur : %ld\n", i, buff[2+2*i]);
        }
        for (i=0; i<n_trigg; i++){
            printf("chnl%d calls : %d\n", i, sbuff[15+8*i]);
            sbuff[15+8*i]=0;






void PR_LOG(fpga_t * fpga, unsigned short **triggers,
            long **timestamps, int print, int file,
            int timeline){
    // Returns an array with all valid entries in BRAM log.
    // Each entry is 64bit (16bit ID , 48bit Timestamp)

    int i, w, finished, active, start, stop;
    int buffer = 2;
    char ch;
    long max, step;
    unsigned short *ts_buff;

    *triggers = (unsigned short *)calloc(BRAM_SIZE, 2);
    *timestamps = (long *)calloc(BRAM_SIZE, 8);
    ts_buff = (unsigned short *)(*timestamps);

    fpga_send(fpga, 0, &buffer, 1, 0, 1, TIMEOUT);
    printf("rcvd %d\n", fpga_recv(fpga, 0, ts_buff, BRAM_SIZE*2, TIMEOUT));

    if (file){
        FILE *raw_log, *log;
        raw_log = fopen("RAW_LOG.txt", "w");
        log = fopen("LOG.txt", "w");

        for (i=0; i<BRAM_SIZE; i++){
            fprintf(raw_log, "%ld\n", (*timestamps)[i]);
            (*triggers)[i] = ts_buff[4*i+3];









        printf("\n TRIGGER\t\t| TIMESTAMP \t\t|\n---------------------------------------------------------\n");
        i=0;
        while((*triggers)[i] != 0) {
            PR_PRINT_BINARY(sizeof((*triggers)[i]), (*triggers)+i, 1);





    if (timeline){
        max = PR_INFO(fpga, 0);
        step = max / timeline;
        FILE *out = fopen("TIMELINE.txt", "w");
        start = 0;
        stop = 1;
        int *bits = NULL;
        for (w=0; w<8; w++){
            finished = 0;
            active = 0;
            ch = '.';
            fprintf(out, "%d: ", w);
            for (i=0; i<=timeline; i++){
                while ((*timestamps)[active] < i*step) {
                    if ((*timestamps)[active] == 0) break;
                    bits = PR_PRINT_BINARY(sizeof(short), (*triggers)+active, 0);
                    if (bits[start] == 1) {
                        ch = '|';
                        finished = 0;
                    }




                if (finished) ch = '.';
            }
            fprintf(out, "\n");
            start = start + 2;






void PR_SET(fpga_t * fpga, int TAIL, int RST_TRG_CNTS,
            int RST_CHNL_CNTS, int RST_LOG, int RST_GLTIMER,
            int print){
    // RESET TIMER-COUNTERS, SET TAIL (128-BIT EXTRA INFO AFTER EACH TRANSMISSION)

    int buffer = 3 + (TAIL!=0)*64 + (RST_TRG_CNTS!=0)*32 +
                 (RST_CHNL_CNTS!=0)*16 + (RST_LOG||RST_GLTIMER)*8 + (RST_GLTIMER!=0)*4;
    if (fpga_send(fpga, 0, &buffer, 1, 0, 1, TIMEOUT)>0) {
        if(print){
            if (RST_GLTIMER) printf("Global Timer Reset.\n");
            if (RST_LOG||RST_GLTIMER) printf("Log Reset\n");
            // Report each counter group only when its own flag was set
            if (RST_CHNL_CNTS) printf("\nChannel Counters Reset\n");
            if (RST_TRG_CNTS) printf("\nTrigger Counters Reset\n");
            if (TAIL) printf("\nTAIL ON\n\n");
            else printf("\nTAIL OFF\n\n");
        }
    }
    else printf("FPGA_SEND ERROR\n");
}
BIBLIOGRAPHY
(1) Field-programmable gate array - Wikipedia, the free encyclopedia
http://en.wikipedia.org/wiki/Field-programmable_gate_array
(2) FPGA Architectures: An Overview
http://www.springer.com/cda/content/document/cda_downloaddocument/9781461435938-c2.pdf?SGWID=0-0-45-1333135-p174308376
(3) Xilinx Virtex-7 FPGA VC707 Evaluation Kit
http://www.xilinx.com/support/documentation/boards_and_kits/vc707/ug848-VC707-getting-started-guide.pdf
(4) Virtex 7 Series FPGAs Overview
http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
(5) VC707 Evaluation Board User Guide
http://www.xilinx.com/support/documentation/boards_and_kits/vc707/ug885_VC707_Eval_Bd.pdf
(6) RIFFA: A Reusable Integration Framework For FPGA Accelerators
http://riffa.ucsd.edu/
(7) Jacobsen, M. and Kastner, R. "RIFFA 2.0: A reusable integration framework for FPGA accelerators."
https://sites.google.com/a/eng.ucsd.edu/matt-jacobsen/fccm_final.pdf?attredirects=0&d=1
(8) Vivado Design Suite
http://www.xilinx.com/products/design-tools/vivado.html




