Simulation-Based Application Profiling and Multi-Path Trace Detection by João Manuel Alves Pereira
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Simulation-based application profiling
and multi-path trace detection
João Manuel Alves Pereira
Mestrado Integrado em Engenharia Eletrotécnica e de Computadores
Supervisor: João Canas Ferreira
Co-Supervisor: Nuno Paulino
November 16, 2015
© João Pereira, 2015

Summary
Applications that demand intensive data processing are outgrowing the resources offered by conventional processors, driving the use of coprocessors to accelerate the execution of specific applications. These systems raise a new challenge: dividing an application between hardware and software so that the system makes the best possible use of the hardware. This process is called hardware/software partitioning. It decides what executes in hardware and what executes in software, and the benefit comes from migrating the computation-intensive parts, called kernels or critical sections, to hardware.
This work consists of the development of an application to detect patterns formed by multiple paths in the execution trace of a program. The topic is hardware/software partitioning at the binary level, which takes advantage of the portability of the binary and is also called dynamic partitioning. The work continues a system that implements transparent dynamic partitioning in four phases: Detection, Identification, Translation and Replacement. The main goal is to improve the detection phase by proposing a new structure to represent the critical sections of a program, called the MultiPath Execution Block.
This document examines the previously implemented system, which extracts single-path loops, called Megablocks, from the execution trace of a program. It presents the Megablock detection process and our changes to extend the representation from single paths to multiple paths.
The application developed relies on an instruction set simulator, called OVPsim, which models a Xilinx MicroBlaze processor. The C++ implementation of the application is discussed and its results are compared with those of the Megablock detection application.
Abstract
Application demands for intensive data processing are surpassing the resources of conventional processors, propelling the use of coprocessors to accelerate the execution of specific applications. These systems bring a new challenge: hardware/software partitioning, which divides the execution of an application in two in order to exploit the potential of the hardware. The biggest advantage comes from migrating the computationally intensive parts, called kernels or critical sections, to hardware.
This work consists of developing an application to detect multi-path patterns in a program execution. It focuses on hardware/software partitioning at the binary level, also called dynamic partitioning, which takes advantage of the portability of the binary. The work continues a system that implements transparent dynamic partitioning in four phases: Detection, Identification, Translation and Replacement. The main goal of this project is to improve the detection phase of the previous work by proposing a new structure to represent the critical sections of a program, called the Multipath Execution Block.
This work builds on the previously implemented system, which extracts single-path loops, called Megablocks, from a program trace. It presents the Megablock profiling process and the changes made to extend the representation from a single-path approach to one that can recognize multiple paths.
The application developed uses an Instruction Set Simulator, called OVPsim, which models a Xilinx MicroBlaze processor. The C++ implementation of the application is discussed, and the results obtained are compared with those of the Megablock detection application.
Acknowledgements
First of all, I want to thank two people who are very special to me and who are the main reason I have made it this far. To my parents, thank you for everything; you are fantastic.
To my grandparents Amadeu and Angelina, who taught me many of the values I live by every day and who always encouraged me to pursue my goals, another heartfelt thank you. And of course, although they are no longer here, I know they would be very proud of me, because they always showed me so: I leave a special dedication to my grandparents Álvaro and Alice.
I want to thank the person who motivated me the most during these months, who never let me stop and who always pushed me to give my best. Marta, without you none of this would have been possible.
I must also thank my friends Telmo and Bruno for being with me at every stage of my life.
During my time at FEUP I met fantastic people, and I want to leave a special hug to João Iria, who made everything more fun. To the 1040: Bernardo, Carlos and Francisco, a very warm hug; those were fantastic times. To Pedro Iria, Zulu, Empresário, Saraiva, Guilherme Pereira and Rodrigues, another big hug; we have shared great moments and will certainly share many more. Finally, to my boys from the other side of the metro line, Oliveira, Lima, Medeiros and Duarte, a huge thank you. You are incredible people and I learned a lot from you.
I want to thank Professor João Canas Ferreira for his guidance over these months, and Nuno Paulino and João Bispo for their indispensable contributions and for the opportunity to continue the work previously developed.
João Pereira
“To understand is to perceive patterns.”
Isaiah Berlin
Contents
1 Introduction
1.1 Contextualization
1.2 Current Work
1.3 Motivation
1.4 Thesis Contribution
1.5 Structure
2 Background
2.1 Conventional processors and execution flow
2.2 Why the need for a coprocessor
2.2.1 Reconfigurable Processing Unit
2.3 Hardware/Software partitioning
2.3.1 High Level Synthesis
2.3.2 Static and Dynamic Compilation
2.3.3 Dynamic Partition
2.4 Binary Translation
2.4.1 Instruction Set Simulators
2.5 Previous work
2.5.1 System Architecture
2.5.2 Tool flow
2.5.3 The Megablock
2.5.4 Results Overview
2.6 Other related work
2.6.1 The Warp Processor
2.6.2 The ADEXOR
2.7 Summary
3 The MultiPath Execution Block
3.1 Introduction
3.2 Motivation
3.2.1 A motivating example
3.3 What is the MEB?
3.4 The MEB Extractor
3.4.1 Top level view
3.4.2 The Pattern Detector
3.4.3 The MEB normalizer
3.4.4 The process of saving a MEB
3.5 Summary
4 Software Implementation and System Architecture
4.1 Overview
4.2 A general instruction
4.2.1 Smart pointers
4.3 Instruction Set Simulator, the OVPsim
4.4 The Microblaze Instruction Decoder
4.5 The MEB classes
4.5.1 Graph Representations
4.5.2 Class Diagram
4.5.3 The Edge class
4.5.4 Node class
4.5.5 MultiPath class
4.6 The PatternDetector class
4.6.1 Class Overview
4.6.2 Saved Information
4.6.3 An iteration
4.7 Summary
5 Verification and Validation of Results
5.1 Test Methodology
5.2 Result Comparison
5.2.1 Simple Programs
5.2.2 Medium Complexity Programs
5.2.3 High Complexity Programs
5.2.4 Running Time
5.3 Discussion of Results
5.4 Summary
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
List of Figures
1.1 Block diagram illustrating the typical system connecting a general purpose processor (GPP) and a reconfigurable processing unit (RPU)
2.1 A binary-level hardware/software partition approach [1][2]
2.2 Simplified system architecture, source [3]
2.3 Tool flow, source [3]
2.4 Example of a repeating pattern of instructions in the trace of an 8-bit count kernel, source [4]
2.5 On the left C code for a max function and on the right the MicroBlaze assembly code for a Megablock representing one of the possible execution paths, adapted from [5]
3.1 Example code of a for loop containing a two-way path
3.2 Relation between SpeedupOverall and Coverage for different values of SpeedupHw
3.3 A simplified example of a Multipath Execution Block representation (some sequential nodes were omitted for demonstration purposes)
3.4 Process of MEB construction
3.5 Pattern Detector Module
3.6 Algorithm for finding tandem repeats with a maximum size M
4.1 System architecture
4.2 Class diagram of the classes Instruction and MicroblazeInstruction
4.3 A typical smart pointer pointing to the object and its Control Block, source [6]
4.4 Class diagram of the class OVPmanager
4.5 Excerpt of the Microblaze Instruction Set [7]
4.6 A graph (on the left), an adjacency list representation (in the middle), and an adjacency matrix (on the right)
4.7 The MultiPath Execution Block class diagram
4.8 The PatternDetector class diagram
5.1 Comparison of the average IpRC for the programs checkbits and bcnt between the Megablock and MEB detectors
5.2 A MEB representation of the pattern starting at 0x2e78 in the engine application
5.3 Comparison of the average IpRC for the programs gridIterator, engine and g3fax between the Megablock and MEB detectors
5.4 Comparison of the average IpRC for the programs janne complex, cnt and edn between the Megablock and MEB detectors
5.5 Comparison of the Coverage for all the programs tested between the Megablock and MEB detectors
List of Tables
3.1 State Machine for Megablock Detection
5.1 Benchmark applications used for the result comparison
5.2 Results after testing checkbits benchmark
5.3 Results after testing bcnt benchmark
5.4 Results after testing g3fax benchmark
5.5 Results after testing gridIterator benchmark
5.6 Results after testing engine benchmark
5.7 Results after testing janne complex benchmark
5.8 Results after testing cnt benchmark
5.9 Results after testing edn benchmark
5.10 Comparison of the MEB and Megablock detectors in terms of execution time and instructions executed for each benchmark application
Abbreviations
ASIC Application Specific Integrated Circuit
BRAM Block Random Access Memory
CAD Computer-Aided Design
DBT Dynamic Binary Translation
DHSP Dynamic Hardware-Software Partitioning
DSP Digital Signal Processing
FIFO First In First Out queue
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
GPP General Purpose Processor
HDL Hardware Description Language
IPC Instructions per Cycle
IpRC Instructions per RPU call
ISA Instruction Set Architecture
ISS Instruction Set Simulator
JIT Just-in-Time
JVM Java Virtual Machine
LMB Local Memory Bus
OVP Open Virtual Platforms
OVPsim OVP simulator
RPU Reconfigurable Processing Unit
SBT Static Binary Translation
SoC System-on-Chip

Chapter 1
Introduction
Over the last years a shift has occurred towards multi-core and many-core processor architectures in order to increase speed when executing a large number of tasks that can be distributed over several processors [8]. This tendency to increase the number of cores in order to boost speed has been a flagship for most processor manufacturers around the world. Although the advantages of this approach can be seen everywhere today, from personal computers to mobile phones, in scenarios where intensive computation or complex algorithms are key to the system's performance, application-specific devices can do more using fewer resources and at a lower cost [9].
The use of coprocessors allied with a General Purpose Processor (GPP) is an excellent option to attack such problems. Moving the computationally intensive parts of an application to these devices can accelerate the overall application. These custom architectures offer a system which can be configured to run specific applications, increasing performance and reliability while lowering energy consumption [10].
Modern FPGAs can include one or more general purpose processors, which can be either embedded in the device or built with basic FPGA logic blocks, together with memory blocks, processor buses, and internal and external peripheral controllers. This is why modern FPGAs can be seen as authentic System-on-Chip (SoC) digital platforms [11]. These devices keep the flexibility of software by having a GPP, while intensive computations can be shifted to hardware in order to boost speed. One way to create a hardware accelerator is to use a Reconfigurable Processing Unit (RPU) as a coprocessor. The two components may communicate directly and both have access to the system memory, as can be seen in Figure 1.1. The design flow of such devices is divided into software and hardware design, which leads to the topic of Hardware/Software Codesign.
1.1 Contextualization
In the past, hardware and software were developed widely apart. One team would be in charge of developing the software and another in charge of developing the hardware. Normally they would be brought together only in the late phases of the process, when a hardware prototype was ready, and in most cases the product would not work. The software team would then go back to redesign the software while the hardware team reworked its prototype, and this cycle would repeat until a working product was obtained.
Figure 1.1: Block diagram illustrating the typical system connecting a general purpose processor (GPP) and a reconfigurable processing unit (RPU)
With the advent of Systems-on-Chip (SoCs), where designs are usually extremely large and complex, the cost of producing a prototype rose sharply and the window for making an impact on the market narrowed considerably. The complexity of these systems has made the division of tasks between the hardware and software teams even more difficult during product development. Today, the boundary between the tasks that should be implemented in hardware and those that should be done in software is becoming blurred. This makes the traditional methodology too risky for producing a working product, which led to the introduction of a new development methodology: hardware-software co-design.
Hardware-software co-design is a design technique which uses simulation to bring hardware together with software in the early phases of the design process, and often throughout it. This methodology usually relies on devices such as FPGAs to rapidly produce prototypes of digital designs. The result is a higher quality product which can be brought to market faster and with fewer prototypes than traditional design techniques.
Hardware/software partitioning is the process of dividing the execution of an application between hardware and software. A good partitioning can improve performance and may reduce power consumption [12].
In the early years hardware/software partitioning was done manually, but the introduction of high-level synthesis tools began the automation of this process [13]. These tools can translate source code to a hardware description language (HDL). One example is Xilinx Vivado HLS, which translates C, C++ or SystemC code to VHDL or Verilog [14]. This process has some limitations. First, the application must be profiled in order to map its critical parts to hardware, and the languages, compilers and environment used to build the application are restricted by the target system. Then the translator creates the RTL, but normally a designer with advanced knowledge of digital systems design is needed to make adjustments to the circuit. This process is far more complicated than simply writing the software and letting the GPP run it, so the more transparent hardware/software partitioning can be to the software developer, the better.
Thus, partitioning at the binary level was proposed [2]. An ideal partitioning would profile the executing binaries and detect computation-intensive parts, decode the instructions, configure the hardware to adapt to the problem at hand, and update the binary to communicate with the configured logic [15]. This approach is called dynamic hardware/software partitioning (DHSP) or simply dynamic partitioning.
1.2 Current Work
The current work builds on and extends previous work on the detection of single-path trace-based loops, called Megablocks [16]. In previous work, a transparent binary accelerator associated with a conventional processor was implemented. A MicroBlaze soft processor was used as the GPP, and a coarse-grained RPU functions as the hardware accelerator [4][17][18]. The goal of that work is to migrate critical sections of a program execution to hardware in order to speed up execution. The Megablock was used to define these critical sections. These loops capture repeating patterns in the execution that can be mapped onto the hardware part of the system.
1.3 Motivation
Megablocks can represent critical sections, also known as kernels or hot spots, of an application, but a Megablock only represents a single path taken in a loop, so it cannot capture paths that are alternatives to the detected one. This may cause a loss in performance, especially because of the communication overhead between the GPP and the RPU. An illustrative case is the detection of loops that contain conditional structures, which divide the execution path in two (or more, depending on the number of conditional constructs). Consider that one path is taken 90% of the time and the other 10%. The Megablock would represent the most frequent path and would not show what happens in the other. This would be fine if the two paths occurred one after the other, i.e. if one only started after all the iterations of the other were complete. However, there are scenarios where this does not hold and the choice of path is more irregular, for example executing one path 9 times and then the other once. This would make the execution move from the GPP to the RPU and back a considerable number of times, so a significant amount of time would be spent in communication.
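The effect of interleaved paths on the number of GPP/RPU hand-overs can be illustrated with a minimal C++ sketch. It assumes a hypothetical trace in which each loop iteration is labelled with the path it took ('A' for the frequent path, 'B' for the rare one). The names countRpuCalls and interleavedTrace, and the path labels, are illustrative assumptions and are not part of the tools described in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts how many separate RPU calls are needed to execute a loop whose
// iterations follow the paths listed in `paths` (one character per iteration).
// `accelerated` lists the path labels that the hardware block can execute.
// Every maximal run of consecutive accelerated iterations costs one call,
// i.e. one GPP -> RPU -> GPP round trip.
std::size_t countRpuCalls(const std::string& paths, const std::string& accelerated) {
    std::size_t calls = 0;
    bool onRpu = false;
    for (char p : paths) {
        const bool supported = accelerated.find(p) != std::string::npos;
        if (supported && !onRpu) {
            ++calls;  // control moves from the GPP to the RPU
        }
        onRpu = supported;
    }
    return calls;
}

// Builds a hypothetical trace: `rounds` repetitions of nine 'A' iterations
// followed by one 'B' iteration (the 90%/10% split, interleaved).
std::string interleavedTrace(std::size_t rounds) {
    std::string trace;
    for (std::size_t i = 0; i < rounds; ++i) {
        trace.append(9, 'A');
        trace.push_back('B');
    }
    return trace;
}
```

With ten rounds of this interleaved pattern, accelerating only path A requires ten separate RPU calls, whereas a block that covers both paths requires a single call. When the same 90 'A' and 10 'B' iterations occur one after the other, a single-path block also needs only one call, which matches the case described above as unproblematic.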
Several applications show improvements when using the Megablock approach compared with executing all instructions sequentially on the GPP. However, we believe that the speedup could be even higher if patterns were described not only as single-path loops but also with higher-level constructs, such as the different paths inside a loop or information about outer loops. This would reduce the communication overhead between the two main components, with the ultimate goal of improving the speedup of an application.
1.4 Thesis Contribution
Our contribution goes further than improving the Megablock loop. We create a new application which uses a market-tested simulator called OVPsim. Using this simulator allows not only the migration to a usually faster development environment (C/C++) but also a system that can be extended to test other processors, not only the one used in this work, the MicroBlaze processor.
The main focus of this application is to improve the detection of critical sections in the dynamic partitioning used in previous work, which relied on the Megablock as the central data structure for representing kernels in the execution.
We propose a new way of representing hot spots, the Multipath Execution Block (MEB). The Multipath Execution Block is based on a graph structure which can show not only single-path patterns but also all the paths taken in a critical section. This execution block is meant to decrease the communication overhead between the GPP and the RPU by providing longer execution periods in hardware, thus increasing the overall speedup.
1.5 Structure
In addition to the Introduction, this document contains five more chapters, organized as follows:
Chapter 2 presents the background of the hardware/software partitioning theme and discusses the previous work.
Chapter 3 proposes a multi-path trace representation of the critical sections of a program, the MultiPath Execution Block, presents the motivation for its development and analyzes the algorithms used for its detection.
Chapter 4 analyzes the software implementation in detail.
Chapter 5 presents the results obtained and compares our application with the application for detecting Megablocks.
Finally, Chapter 6 concludes this document, summarizing the results and contributions of this work and discussing possibilities for future work.
Chapter 2
Background
This chapter presents a simplified review of the concepts necessary for understanding the literature on dynamic partitioning. Furthermore, previous work on hardware/software partitioning and the respective implementations are analyzed and discussed.
2.1 Conventional processors and execution flow
Today processors can be found in almost every device. Embedded systems in general now include one or more processors. The reason for such broad adoption is the ease with which these systems can be programmed to do almost anything. Each processor architecture comes with its own Instruction Set Architecture (ISA), which is a well-defined interface between hardware and software [19].
Nowadays a software developer might never even look at the ISA, thanks to programs called compilers. A compiler takes the source code of an application and translates it into a sequence of instructions based on the processor's ISA. The translated version of the source code is called machine language, and its human-readable form is assembly language. [?]
A program is a set of instructions laid out sequentially, each uniquely identified by an address. By default, execution is sequential: after an instruction executes, the program counter advances to the next instruction in memory, which is executed next. However, the execution flow can be changed by instructions which indicate that the next instruction to execute lies some instructions before or ahead. These instructions are called jump or branch instructions, and they can be conditional or unconditional.
Branch instructions divide the execution path in two: the target path and the fall-through path. Jump instructions are unconditional branches, which means they always change the execution flow. A conditional branch is an instruction that is only taken if a particular condition is met, typically the result of some comparison (e.g. equal-to-zero, less-than, etc.). Conditional branches typically correspond to if, for, or while statements in the source language [20]. The term branch also denotes the set of instructions that are executed when the branch instruction is taken. This set of instructions then represents one of the two paths that can be followed after the branch instruction is executed. The record of all the choices made by the GPP when running a particular program is called a trace. Bear in mind that different executions of the same program may generate different traces, depending on the branches taken by the program at execution time.
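To make the notion of a trace concrete, the following minimal C++ sketch executes a toy program and records the address of each executed instruction. The instruction encoding (Op, ToyInstruction) and the functions runAndTrace and exampleProgram are assumptions made purely for illustration; they do not correspond to the MicroBlaze ISA or to the tools described in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction set, for illustration only.
enum class Op { Add, BranchIfZero, Jump, Halt };

struct ToyInstruction {
    Op op;
    int operand;  // Add: value added to the accumulator; branches: target address
};

// Executes `program` starting at address 0 with the given initial accumulator
// and returns the trace, i.e. the sequence of executed instruction addresses.
std::vector<std::size_t> runAndTrace(const std::vector<ToyInstruction>& program, int acc) {
    std::vector<std::size_t> trace;
    std::size_t pc = 0;  // program counter
    while (pc < program.size()) {
        const ToyInstruction& ins = program[pc];
        trace.push_back(pc);
        switch (ins.op) {
            case Op::Add:
                acc += ins.operand;
                ++pc;  // sequential flow: continue with the next instruction
                break;
            case Op::BranchIfZero:
                assert(ins.operand >= 0);
                // Conditional branch: taken only when the condition holds.
                pc = (acc == 0) ? static_cast<std::size_t>(ins.operand) : pc + 1;
                break;
            case Op::Jump:
                assert(ins.operand >= 0);
                pc = static_cast<std::size_t>(ins.operand);  // always taken
                break;
            case Op::Halt:
                return trace;
        }
    }
    return trace;
}

// Example program: the instruction at address 1 is skipped when acc == 0.
std::vector<ToyInstruction> exampleProgram() {
    return {
        {Op::BranchIfZero, 2},  // 0: branch to address 2 when acc == 0
        {Op::Add, -1},          // 1: fall-through path
        {Op::Halt, 0},          // 2: target path
    };
}
```

Running the example program with an initial accumulator of 0 takes the branch and yields the trace 0, 2, while an initial value of 5 follows the fall-through path and yields 0, 1, 2. The same program therefore produces different traces depending on its input.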
The control flow of an application is closely tied to these instructions. A basic block is a linear sequence of program instructions that has one entry point and one exit point. From basic blocks we can build Control Data-Flow Graphs, which take basic blocks as nodes and connect them with directed edges to denote the execution flow [21].
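As an illustration of how basic blocks and their connecting edges can be derived, the sketch below applies the classic "leader" rule to a simplified instruction list and then builds the control-flow edges between the resulting blocks. It covers only the control-flow part of such a graph, not data dependencies. The types Instr and BasicBlock and the functions findBasicBlocks, findEdges and loopWithIf are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction: only control-flow information is kept.
struct Instr {
    bool isBranch;       // true for jump/branch instructions
    bool isConditional;  // true if the branch may fall through
    std::size_t target;  // branch target address (index into the program)
};

struct BasicBlock {
    std::size_t first;  // address of the first instruction (single entry point)
    std::size_t last;   // address of the last instruction (single exit point)
};

// Splits `program` into basic blocks using the "leader" rule: the first
// instruction, every branch target, and every instruction that follows a
// branch each start a new block.
std::vector<BasicBlock> findBasicBlocks(const std::vector<Instr>& program) {
    std::vector<BasicBlock> blocks;
    if (program.empty()) return blocks;
    std::set<std::size_t> leaders{0};
    for (std::size_t i = 0; i < program.size(); ++i) {
        if (!program[i].isBranch) continue;
        assert(program[i].target < program.size());
        leaders.insert(program[i].target);
        if (i + 1 < program.size()) leaders.insert(i + 1);
    }
    for (auto it = leaders.begin(); it != leaders.end(); ++it) {
        auto next = std::next(it);
        const std::size_t end = (next == leaders.end()) ? program.size() : *next;
        blocks.push_back({*it, end - 1});
    }
    return blocks;
}

// Returns the directed control-flow edges as pairs of block indices.
std::vector<std::pair<std::size_t, std::size_t>> findEdges(
        const std::vector<Instr>& program, const std::vector<BasicBlock>& blocks) {
    // Maps the address of a block's first instruction to the block index.
    auto blockAt = [&blocks](std::size_t address) -> std::size_t {
        for (std::size_t b = 0; b < blocks.size(); ++b)
            if (blocks[b].first == address) return b;
        return blocks.size();  // not reached: every branch target is a leader
    };
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        const Instr& exit = program[blocks[b].last];
        if (exit.isBranch) edges.emplace_back(b, blockAt(exit.target));  // target path
        const bool fallsThrough = !exit.isBranch || exit.isConditional;
        if (fallsThrough && b + 1 < blocks.size()) edges.emplace_back(b, b + 1);
    }
    return edges;
}

// Example: a loop (addresses 1 to 4) whose body contains an if statement.
std::vector<Instr> loopWithIf() {
    return {
        {false, false, 0},  // 0: loop set-up
        {true, true, 3},    // 1: conditional branch that skips address 2
        {false, false, 0},  // 2: body of the if (fall-through path)
        {false, false, 0},  // 3: rest of the loop body
        {true, true, 1},    // 4: conditional branch back to address 1
        {false, false, 0},  // 5: code after the loop
    };
}
```

For the example program, the sketch produces five basic blocks (addresses 0, 1, 2, 3 to 4, and 5) and six edges, including the back edge from the block ending at address 4 to the block at address 1 that closes the loop.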
2.2 Why the need for a coprocessor
As application demands surpass the conventional processor's ability to deliver, a solution is hardware acceleration in the form of a coprocessor, which can improve performance with application-specific devices. The speed advantage comes from the fact that the hardware can be customized to execute a particular algorithm. A circuit can therefore contain only the exact set of operations needed to perform the algorithm, whereas a GPP must provide all the operations and data types that any algorithm might require.
Modern processors may be able to process multiple instructions per clock cycle, but they come nowhere near the level of parallelism offered by a coprocessor, because the coprocessor is designed around the demands of its target application and, being hardware built for a specific purpose, performs much faster than a software equivalent. Coprocessors can access many memory words at many memory addresses in each clock cycle, and access patterns can be optimized for the application. A General Purpose Processor (GPP) reads data through the cache and is limited by the availability of that data at a given instruction. A programmer can only indirectly control the cache-friendliness of an algorithm, as access to the data cache is hidden from the Instruction Set Architecture (ISA). In short, hardware computing is concerned with decomposing applications into spatially parallel, tiled, application-specific pipelines, whereas the traditional GPP interprets a linear sequence of instructions, with pipelining and other forms of spatial parallelism hidden within the microarchitecture of the processor [22].
Computation-intensive applications typically have certain critical sections which account for most of the execution time. These sections are the parts that, when moved to a coprocessor, give a considerable boost to the overall speed and performance of the system.
A coprocessor can be either dedicated silicon hardware, in other words an Application Specific Integrated Circuit (ASIC), or a reprogrammable circuit, such as a Field Programmable Gate Array (FPGA). The latter can be configured to run specific applications like the former, but can also be reprogrammed later to include new features or correct bugs.
2.2.1 Reconfigurable Processing Unit
A FPGA is an integrated circuit that is designed to be configured after manufacturing. FPGAs can
be used to implement any logic function that an Application-specific Integrated Circuits (ASICs)
can perform. For varying requirements an FPGA can be partially reconfigured while the rest of
the device is still running. Unlike silicon dedicated devices, any errors can be easily corrected by
simply reprogramming the FPGA [23].
Application Specific Integrated Circuits (ASICs) are another option to serve as coprocessors.
These devices provide the best performance in terms of area and power [24]. However, building
prototypes of the circuits has a high cost, and the use of these systems only becomes profitable
when they are manufactured in large quantities. Although they offer the best
performance/area/power ratios, these devices can only serve one purpose and cannot be
reconfigured after manufacturing. This means that, in an evolving market where new specifications
and new features are needed in very short times, having to replace these devices becomes an issue,
hence the tendency to use reconfigurable hardware.
Reconfigurable hardware has been gaining attention over the past decade [25]. The limitations
and cost of ASICs, together with the adaptability and shorter time to market of Field
Programmable Gate Arrays (FPGAs), are key reasons why these devices have been gaining
ground [25]. Even the leading semiconductor chipmaker, Intel, has bet 16.7 billion dollars
on this market with the recent acquisition of Altera, one of the biggest FPGA manufacturers [26].
Modern FPGAs make it possible to combine reconfigurable hardware with a conventional
instruction processor in a codesign system. A considerable amount of area can be saved by
implementing the control path portion of an application on a microprocessor, so that only the
compute-intensive datapath portion of the application is implemented as hardware. There are
different ways of connecting a microprocessor with an FPGA: (i) a soft processor is implemented
with FPGA logic (examples are the Altera NIOS, NIOS II, and the Xilinx Microblaze), (ii) a hard
processor is incorporated in the device with dedicated silicon (like the AVR processor integrated
in the Atmel FPSLIC or the PowerPC processors embedded in the Xilinx Virtex-4), or (iii) the
device is attached to the pipeline of a processor to execute customized hardware instructions [23].
These devices provide a flexible and powerful way of implementing computationally demanding
digital signal processing (DSP) applications [25].
A Reconfigurable Processing Unit (RPU) is the name given to reconfigurable hardware that
serves as a coprocessor of a GPP in the design of these devices. There is a well-known 90/10 rule,
which states that 90% of the execution time of a program is spent in 10% of the code. The ultimate
goal is therefore to translate that 10% of the code into hardware logic and implement it as a
hardware function to achieve large gains, while the remaining 90% of the code runs on a
microprocessor.
Despite all the advantages of using an RPU, designing and testing hardware is a great deal more
difficult than designing and testing software. The models used for hardware are commonly very
distinct from those used in software. These different views of hardware and software tasks may
cause friction in the codesign process. Such heterogeneous platforms may require a bigger effort
in Hardware/Software partitioning [18].
2.3 Hardware/Software partitioning
Hardware/Software partitioning is the process of deciding which parts of the application
are executed in software (on the GPP) and which parts are more suitable to be executed in
hardware (on the RPU). This is a crucial stage of the overall design of the device [27]. A good
partitioning may be the turning point between a system with great performance at a lower energy
consumption and a system with a low clock frequency that wastes hardware resources. This problem
is aggravated when the GPP is a "soft" processor, i.e. implemented with FPGA logic. These
processors have lower clock frequencies and consume more power than a "hard" processor, which
is implemented in dedicated silicon [28]. Previous research has shown that hardware/software
partitioning can result in a speedup of 200% to 1000% as well as reducing the system power
consumption by up to 99% [29][30][31][32][33][34][35][15].
Although most partitioning tools are now automated, this was not always the case. This section
analyzes one of the first automated approaches, High Level Synthesis, and then discusses the
advantages of dynamic compilation, in order to explain dynamic partitioning, which is the main
focus of the previous work.
2.3.1 High Level Synthesis
High Level Synthesis marked the advent of automated hardware/software partitioning. Traditional
hardware design methods required manual RTL development and debugging, which was too
time consuming and very error prone, especially with the increasing complexity of today's
designs [36]. High Level Synthesis tools can translate source code, normally written in C or C++,
to RTL. This is very convenient because it allows some of the hardware prototyping phases to be
skipped, produces fewer bugs and shortens the time to RTL [13]. These compiler-based approaches
provide an excellent technical solution for hardware/software partitioning, commonly achieving an
order of magnitude performance improvement [2].
2.3.2 Static and Dynamic Compilation
In order to execute a program on a processor, it must first be written in a programming
language and then translated into another language which can be understood by the processor.
The process of translating a source language into a target language is called compilation.
The main function of a compiler is translation, but it also applies transformations and
optimizations to the code, reducing execution time or power consumption.
Static compilation is compilation that takes place before the execution of the program. An
interpreter is another kind of language processor which, instead of producing a target program as
a translation, directly executes the operations specified in the source code [37].
A machine-language target program produced by a compiler is normally much faster than
an interpreter at mapping inputs to outputs. On the other hand, the interpreter gives better
error reports, because it has runtime information available.
The Java language combines the best of the two worlds: it compiles the source code to an
intermediate form, called bytecodes. The bytecodes are then interpreted by a virtual machine. An
advantage of this arrangement is that bytecodes compiled on one platform can be interpreted on
another, making this approach machine independent. To achieve faster processing of inputs to
outputs, Java uses Just-in-Time (JIT) compilers, which translate the bytecodes into machine code
at runtime, immediately before they are executed. This is also called dynamic compilation.
Dynamic compilation enhances portability among different machines. Java programs are
distributed as bytecodes, so every machine that has a Java Virtual Machine (JVM) is able to
execute the program. This approach also allows software developers not to worry about target
machines and to use the same toolchain and development environment for all target systems,
instead of one for each.
2.3.3 Dynamic Partition
Binary hardware/software partition, also called dynamic hardware/software partition or simply
dynamic partition, is the process of reading software binaries, detecting critical loops to be
implemented in hardware, and altering the binaries to be executed in software [1].
Hardware/software partitioning tools that operate at source-code level have the advantage of
having high-level information, such as high-level loop constructs (chained loops),
multidimensional array data and arithmetic expressions, among others, which are hard to see at
the binary level. Tools like Xilinx Vivado HLS can transform C, C++ or SystemC code into a
Register Transfer Level (RTL) description. Although these tools allow shorter prototyping
periods, software developers are restricted to the languages and compilers supported by such
tools. Even though they give a great start, a designer with advanced knowledge of digital systems
design would normally be required to make the adjustments necessary for the final product to
work. Software developers for embedded systems usually do not have advanced digital systems
design knowledge, and they already have a tool flow that is not compatible with the inclusion of
such a partitioner. Yet another concern is that, sometimes, some portions of the code are not
written in a high level language, but introduced at the linker level. This information would be
lost if the partition were made at source code level. Furthermore, the source code might even be
written in different languages, making partitioning at this level impractical [1].
Doing the hardware/software partitioning at the binary level, i.e. after compilation and linking,
offers some of the advantages of dynamic compilation, such as portability and runtime adaptation
[5]. From the point of view of software developers, if the partitioning tool that detects the
critical loops and maps them to hardware were transparent to them, they would be able to
develop applications in any language, using any compiler and any development environment.
Dynamic partition brings the transparency software developers need to have complete freedom
when developing an application. Figure 2.1 shows a simplified version of a system implemented
with dynamic partition.
Figure 2.1: A binary-level hardware/software partition approach [1][2]
A case study [2] demonstrates that binary-level partitioning can achieve results similar to
those of source-code-level partitioning. To summarize, even though source-level partitioning has
some technical benefits, partitioning at the binary level has various practical advantages, which
was the motivation to study and develop such a partitioner in previous work.
2.4 Binary Translation
Even after many years, modern processors remain faithful to legacy Instruction Set
Architectures (ISAs), some of which are over a decade old. The risk of developing a new ISA is
very high, since manufacturers could lose their product's existing software base. On the other
hand, software developers find porting code to a new architecture difficult and time consuming
[38].
Binary translation offers solutions for automatically converting executable code to run on new
architectures without recompiling the source code. A binary translator is a software component
of a computer system that converts binary code of one ISA into that of another ISA. Binary
translation is one of the most important methods for achieving binary-level compatibility [39].
It seeks to provide the illusion of transparency: code can run on platform A exactly as it would
on platform B [40].
There are three types of binary translation: emulation, dynamic binary translation (DBT) and
static binary translation (SBT). An emulator interprets program instructions at runtime. None of
the interpreted instructions are saved or cached. A dynamic translator, on the other hand,
translates between the legacy and the target ISA, caching the pieces of code for future use.
Subsection 2.3.2 discussed Java JIT compilers and the Java Virtual Machine, which executes
programs from Java bytecodes; this is a form of dynamic translation. A static translator
translates binary code before the program starts execution and can apply more rigorous code
optimizations than a dynamic translator can. Static translators can also use execution profiles
obtained during a program's previous run. Most current binary translators adopt the dynamic
approach because it can handle the code discovery and the code location problems more easily.
However, DBT systems cannot perform aggressive optimizations, which may incur severe run-time
overheads.
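The difference between an emulator and a dynamic translator can be illustrated with a minimal
Python sketch. The toy ISA (`addi`, `jnz`, `halt`), the function names and the use of Python
closures as "target code" are assumptions made purely for illustration; they do not correspond to
any real translator, and real DBT systems typically translate and cache whole blocks of
instructions rather than single ones.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Translated = Callable[[State], int]  # executes one instruction, returns the next pc


def translate(instr: Tuple, pc: int) -> Translated:
    """Translate one toy instruction into a Python closure (the 'target code')."""
    op = instr[0]
    if op == "addi":
        _, reg, imm = instr

        def run(state: State) -> int:
            state[reg] += imm
            return pc + 1
    elif op == "jnz":
        _, reg, target = instr

        def run(state: State) -> int:
            return target if state[reg] != 0 else pc + 1
    elif op == "halt":
        def run(state: State) -> int:
            return -1  # -1 signals the end of the program
    else:
        raise ValueError(f"unknown opcode: {op}")
    return run


def run_emulated(program: List[Tuple], state: State) -> int:
    """Emulation: every executed instruction is decoded again, nothing is cached."""
    pc, decodes = 0, 0
    while pc != -1:
        pc = translate(program[pc], pc)(state)
        decodes += 1
    return decodes


def run_dbt(program: List[Tuple], state: State) -> int:
    """Dynamic translation: each address is translated once and then reused."""
    cache: Dict[int, Translated] = {}
    pc, translations = 0, 0
    while pc != -1:
        if pc not in cache:
            cache[pc] = translate(program[pc], pc)
            translations += 1
        pc = cache[pc](state)
    return translations
```

For a four-instruction program whose loop body runs five times, `run_emulated` decodes 16
instructions, whereas `run_dbt` translates only the 4 distinct ones and reuses them from the
cache; both leave the machine state identical.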
Since all machines, legacy and new, are Turing machines, any computation done on one can
be emulated on another. Binary translation aspires to do more than merely emulate the legacy
architecture: it seeks to emulate it so efficiently that code runs at least as fast on the new
architecture as on the legacy machine.
2.4.1 Instruction Set Simulators
This subsection discusses one of the major differences between the work done in this thesis and
the work developed before.
In the scope of this thesis, an application was created to detect a new structure that represents
the sections of the code where program execution spends the most time. Although some
algorithms from previous work were reused, the new application takes a completely different
approach to the critical sections and was developed in C++, whereas in previous work the
profiling application was written in Java.
The main reason for the language change is the use of a new Instruction Set Simulator, called
OVPsim, which uses binary translation to simulate processor execution at runtime. This new
simulator permits greater flexibility, making it easier to adapt our profiling application to
target different processors and not only the Microblaze processor, which is the only possible
target of the profiling application from previous work.
2.5 Previous work
The previous work focuses on a transparent binary acceleration system based on a General Purpose
Processor (GPP) coupled with an automatically generated, coarse-grained Reconfigurable
Processing Unit (RPU) [41]. A software binary is analyzed offline with an instruction set
simulator that produces an instruction trace. This trace is profiled in order to detect critical
sections that can be mapped into hardware equivalents to accelerate program execution. In the
previous work, the critical sections considered for hardware implementation were called
Megablocks [16], which are trace-based loops that represent significant execution time.
Although the ultimate aim is to translate the critical sections of the program to hardware at
runtime, for the time being a mixed offline/online approach consisting of four steps is used:
• Detection: Megablocks are detected in the execution trace produced by an instruction
set simulator.
• Translation: The Megablocks selected in the previous phase are translated to a hardware
specification and mapped onto the RPU with all the corresponding configuration, resulting in
a hardware accelerator, without changing the software binaries.
• Identification: This phase is done at runtime. The GPP instructions are monitored in order to
detect the critical sections that were implemented as hardware.
• Replacement: Finally, execution is moved from the GPP to the RPU to execute the
migrated parts, and software execution resumes once the RPU exits.
2.5.1 System Architecture
Figure 2.2 shows the simplified system architecture, which consists of (i) a Microblaze soft
processor as GPP, which executes unmodified input binary code from local Block RAMs (BRAMs);
(ii) the RPU, which executes selected Megablocks and exchanges operands and results with the GPP;
and (iii) an Injector module, which interfaces with the instruction bus of the GPP to monitor the
instruction stream and control GPP/RPU execution. The figure does not show the two Local Memory
Bus (LMB) multiplexers, which enable shared access to the BRAM ports [18].
The Injector and the RPU are generated offline using a Hardware Description Language (HDL).
At runtime the Injector is in charge of migrating execution from the GPP to the RPU and vice
versa. It monitors the instruction bus and, if the start address of a critical section mapped in
the RPU is detected, the Injector inserts a branch instruction to the memory location of an
automatically generated Communication Routine (CR), which handles the communication between
the GPP and the RPU. The RPU then gains control of the LMBs in order to access the BRAMs. Once
execution in the RPU ends, control of the LMBs is handed back to the GPP, which executes the
remainder of the CR, moving results to the register file and resuming execution of the application
[18].
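The address-matching behaviour of the Injector can be sketched in software. The following Python
model is only an illustration: the function name `injector_output`, the `cr_table` dictionary and
the tuple used to stand for a branch instruction are hypothetical, and the real Injector is a
hardware module generated in HDL.

```python
from typing import Dict, Tuple, Union

# An instruction word, or an abstract ("branch", target) stand-in for the injected branch.
Instruction = Union[int, Tuple[str, int]]


def injector_output(address: int, fetched: int, cr_table: Dict[int, int]) -> Instruction:
    """Return what the GPP receives on its instruction bus for one fetch.

    cr_table maps the start address of each critical section mapped in the RPU
    to the address of its Communication Routine (CR); this table is an
    assumption of the sketch.
    """
    cr_address = cr_table.get(address)
    if cr_address is None:
        # Not the start of a mapped critical section: pass the fetched word through.
        return fetched
    # Start of a mapped critical section: replace the fetched instruction with a
    # branch to the CR (represented abstractly, not as a real Microblaze encoding).
    return ("branch", cr_address)
```

In this model every other fetch passes through unchanged, which mirrors the fact that the
application binary itself is never modified.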
2.5.2 Tool flow
A tool suite was developed to detect Megablocks and to generate an RPU and its configuration
bits. An executable file, i.e. the ELF file, is fed to the Megablock Extractor tool, which
detects the Megablocks. This tool uses a cycle-accurate Microblaze simulator to monitor execution
traces. This step is discussed in further detail in subsection 2.5.3, because it is the part that
this thesis aims to improve. In the translation phase, Megablocks are processed by two tools:
one generates the HDL (Verilog) descriptions for the RPU and the Injector, and the other
generates the CRs for the GPP. The HDL description generation tool parses Megablock information
and maps it onto the RPU. The tool also generates routing information to be used at
runtime, as well as the data required for Megablock identification.
Figure 2.2: Simplified system architecture, source [3]
Figure 2.3: Tool flow, source [3]
The main focus of this work was to develop a new representation for detecting
critical sections of a program. The next subsection analyzes the detection phase of the previous
work in more detail, so that it can be better understood.
Figure 2.4: Example of a repeating pattern of instructions in the trace of an 8-bit count kernel,
source [4]
2.5.3 The Megablock
The aim of the previous work, and the ultimate goal of the current work, is to efficiently move
the critical sections of an application to a hardware equivalent. These crucial segments must be
identified effectively in order to perform a partition that results in speed gains. In previous
work, the promising sections found in the execution trace as candidates for translation to
hardware were called Megablocks.
João Bispo formulated a Megablock definition in his work [5]:
Megablock Definition: Consider a statically defined program P, which is formed by the
sequence of machine instructions $[i_1\ i_2\ \ldots\ i_m]$. Each execution of the program
generates a sequence T, called a trace, formed with possibly repeated instructions from P.
Consider S a sequence of instructions with size $m \geq 1$ present in T (m being the number of
instructions). For instance, $[i_5\ i_6\ i_7]$ and $[i_8\ i_2\ i_2]$ are two specific
three-instruction sequences. A Megablock is a contiguous subsequence of T formed by a repeatable
sequence S, represented by $S^n$, with $n \geq 1$ being the number of times the sequence S
repeats. E.g., if $S = [i_5\ i_6\ i_7]$ and $S^3$ is a Megablock found in T, then
$[i_5\ i_6\ i_7\ i_5\ i_6\ i_7\ i_5\ i_6\ i_7]$ is a contiguous subsequence of T.
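The definition can be made concrete with a small brute-force search over a complete trace. This
Python sketch is an illustration only and is not the detection algorithm of the previous work
(which processes the trace one instruction at a time); the function name `find_megablock`, the
`min_reps` parameter and the rule for choosing among candidates are assumptions.

```python
from typing import Optional, Sequence, Tuple


def find_megablock(trace: Sequence[str], min_reps: int = 2) -> Optional[Tuple[int, int, int]]:
    """Return (start, size, reps) of the repetition S^n that covers the most trace entries.

    Only repetitions with at least `min_reps` copies of S are considered. Ties are
    resolved in favour of the smaller pattern size, then the earlier start.
    """
    if min_reps < 1:
        raise ValueError("min_reps must be at least 1")
    best: Optional[Tuple[int, int, int]] = None
    best_covered = 0
    n = len(trace)
    for size in range(1, n // min_reps + 1):
        for start in range(0, n - size + 1):
            pattern = list(trace[start:start + size])
            reps = 1
            # Count how many consecutive copies of the pattern follow the first one.
            while list(trace[start + reps * size:start + (reps + 1) * size]) == pattern:
                reps += 1
            covered = size * reps
            if reps >= min_reps and covered > best_covered:
                best, best_covered = (start, size, reps), covered
    return best
```

For the trace `[i1 i5 i6 i7 i5 i6 i7 i5 i6 i7 i9]` the function returns `(1, 3, 3)`, i.e. the
sequence S = [i5 i6 i7] starting at index 1 and repeated three times, matching the example in the
definition.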
A Megablock is a sequence of instructions that executes in a loop, representing a section of
the program that accounts for significant execution time. An example of a detected Megablock in
assembly code is shown in Figure 2.4.
Megablocks are single-path oriented. This means that when an executing loop contains a
conditional statement, the resulting Megablock is the path of sequential instructions in the loop
that accounts for the most execution time. An example is given in Figure 2.5, which shows a
function that determines the maximum value of a vector, implemented in C (left side of the
figure), and the Megablock representation, in assembly code, of one of the paths taken during
execution (right side of the figure). This function has an if construct inside a for loop. The
highlighted instructions in the assembly code represent the two conditions for exiting the
Megablock, one for the if statement and the other for testing the end of the for loop.
Figure 2.5: On the left, C code for a max function; on the right, the MicroBlaze assembly code
for a Megablock representing one of the possible execution paths, adapted from [5]
These representations have a simple control flow with a single entry point but multiple exit
points. Every time an instruction is executed, a test is performed to determine whether
execution should continue in the RPU or return to the GPP.
This profiling phase is done offline by reading, one at a time, each instruction produced by the
instruction set simulator. However, the aim is to move this phase online, so that Megablocks
would be detected and later translated to hardware at runtime. This means that the only
information available would be the current instruction and the ones saved in a special memory
dedicated to the detection phase. That is why the offline profiling does not analyze the trace as
a whole but as a step-by-step process, each step corresponding to a newly executed instruction.
A Java-based simulator implementing a Microblaze processor model was used as the Instruction Set
Simulator providing the binary trace.
The module for detecting Megablocks was divided into a three-step process, comprising an
algorithm for finding repetitions in strings, an arbiter to calculate the new pattern size, and a
state machine to determine the state of the discovered pattern. A detailed analysis of this
process is given in subsection 3.4.2 of the next chapter.
When a Megablock was detected, the pattern detector was able to determine the size of the
repeating sequence of instructions and the number of iterations.
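The step-by-step nature of the detection can be illustrated with a small streaming detector that
sees one instruction address at a time and keeps only a bounded history. This Python sketch is
only loosely modelled on the description above and is not the algorithm of the previous work: the
class name, the `max_size` limit and the rule of reporting the smallest repeating size are all
assumptions.

```python
from collections import deque
from typing import Deque, Hashable, List


class StreamingPatternDetector:
    """Detect repeating patterns using only the current address and a bounded history."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        # Bounded memory dedicated to detection: the last `max_size` addresses.
        self.history: Deque[Hashable] = deque(maxlen=max_size)
        # run[k] = number of consecutive steps whose address matched the one k steps back.
        self.run: List[int] = [0] * (max_size + 1)

    def step(self, address: Hashable) -> int:
        """Feed one executed address; return the current pattern size (0 if none)."""
        for k in range(1, self.max_size + 1):
            if k <= len(self.history) and self.history[-k] == address:
                self.run[k] += 1
            else:
                self.run[k] = 0
        self.history.append(address)
        # Simple arbiter: the smallest size whose pattern has repeated at least once in full.
        for k in range(1, self.max_size + 1):
            if self.run[k] >= k:
                return k
        return 0
```

Feeding the addresses `a b c a b c a b c d` with `max_size=4` yields the sizes
`0 0 0 0 0 3 3 3 3 0`: the pattern of size 3 is reported once its second copy is complete and is
dropped as soon as the trace leaves the loop. A caller could derive pattern states (started,
unchanged, stopped) from the transitions between consecutive return values.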
It is necessary to bear in mind that a Megablock can reappear during program execution, i.e. a
Megablock containing the same instructions can occur at different times of execution, so a
normalizer was implemented in order to avoid producing duplicates.
The normalizer's function is to avoid producing meaningless duplicated results, for example
counting identical Megablocks as different ones and thereby introducing faulty information. The
normalizer recognizes equal patterns, organizing them and counting their occurrences, which are
later used to determine each Megablock's impact on the execution time. This is a key factor in
deciding whether or not a Megablock should be translated to hardware. Furthermore, to determine
whether a Megablock is a good candidate for implementation, the total number of instructions that
would be moved to hardware is evaluated. This is not the number of instructions in the Megablock,
but that number multiplied by the number of times the Megablock appears. This total is compared
with an executed-instructions threshold, which specifies the minimum number of instructions that
should be migrated to the RPU for the candidate to be approved.
Finally, the approved candidates are implemented as hardware in the RPU.
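The candidate selection just described can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions: two detections are treated as equal only when they
contain exactly the same instruction sequence, and the function name `select_candidates` and its
parameters are hypothetical rather than taken from the previous work.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def select_candidates(
    detections: Iterable[Sequence[str]], threshold: int
) -> Dict[Tuple[str, ...], int]:
    """Group equal patterns and keep those that would move enough instructions.

    `detections` holds one instruction sequence per detected Megablock appearance.
    The returned dictionary maps each approved pattern to the total number of
    instructions moved to hardware (pattern size times number of appearances).
    """
    # Normalization step: identical patterns are counted as one Megablock.
    appearances = Counter(tuple(pattern) for pattern in detections)
    approved: Dict[Tuple[str, ...], int] = {}
    for pattern, count in appearances.items():
        moved = len(pattern) * count
        if moved >= threshold:
            approved[pattern] = moved
    return approved
```

With four appearances of `[i5 i6 i7]`, one appearance of `[i8 i2]` and a threshold of 10, only
the first pattern is approved, with 12 instructions moved to hardware.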
2.5.4 Results Overview
This system was tested with several benchmarks; an overview of the results obtained is given
here.
In terms of performance, the global average number of Instructions per Cycle ($IPC_{HW}$) for all
benchmarks was 2.42. The overall and kernel speedups for all benchmarks are 1.60x (minimum:
0.65x; maximum: 7.21x) and 1.92x (minimum: 0.78x; maximum: 7.75x), respectively.
The GPP/RPU communication overhead accounts for an average of 6.4% of the time required
for migration and RPU execution, and 3.6% of the total execution time. Each call of the RPU
takes 27 clock cycles on average. The benchmark with the longest execution time takes nearly
10 million clock cycles when using the RPU, of which 13.8% are communication overhead.
There are cases where the number of cycles of RPU execution per call is comparable to the number
of cycles required for executing the CR. The impact of the communication overhead diminishes
as the number of iterations performed per call of the RPU increases.
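The last observation can be made explicit with a simple cost model. The Python sketch below
assumes that each RPU call pays a fixed number of CR cycles plus a constant number of cycles per
loop iteration; this model, the function name and the numbers used in the example are
illustrative assumptions and are not measurements from the previous work.

```python
def overhead_fraction(cr_cycles: int, cycles_per_iteration: int, iterations: int) -> float:
    """Fraction of one RPU call that is spent on the Communication Routine (CR)."""
    rpu_cycles = cycles_per_iteration * iterations
    return cr_cycles / (cr_cycles + rpu_cycles)
```

With a hypothetical CR of 30 cycles and 3 cycles per iteration, a call that performs 10
iterations spends half of its time on communication (30 out of 60 cycles), and the fraction keeps
shrinking as the number of iterations per call grows.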
Regarding power, the RPU-based system saves 1.9 mW on average ($\sigma^2$ = 0.1 mW). This
corresponds to 0.86% of the total power consumption of a software-only system. For these cases,
the power required by the additional circuits is slightly more than the GPP power saved. Therefore,
the performance improvements do not require additional power.
2.6 Other related work
This section presents some different approaches related to the previous work and to the subject
of dynamic partitioning in general. These approaches also retrieve information about
the critical sections from a program's binary.
2.6.1 The Warp Processor
The Warp Processor [42] is a system that consists of a microprocessor and FPGA-based
reconfigurable hardware sharing an instruction and data cache (or memory), a profiler, and
dynamic Computer-Aided Design (CAD) tools for hardware generation. This is a fully online
approach to the dynamic partition problem.
The purpose of this system is to do all the partitioning work online, i.e. at runtime. The
software developer, completely agnostic of the system, downloads the program targeting the
microprocessor in the system. A profiler then reads the instructions and decides which parts are
executed in software and which in hardware, and dynamic CAD tools map the chosen hardware parts
onto the FPGA. This work uses decompilation, i.e. the process of retrieving high-level constructs
from binary code, in order to recover high-level information from the binary.
2.6.2 The ADEXOR
The fundamental idea behind the ADEXOR [43] is the possibility of using a Reconfigurable
Functional Unit (RFU), which executes custom instructions (CIs) based on critical basic blocks,
coupled with a MIPS-based processor. The RFU is in the same data path as the processor and has
access to its register file. To configure the system, an offline phase detects the hot basic
blocks, i.e. the ones that are executed the most times in a loop, and converts them into CIs.
Later work introduced an approach capable of mapping multi-exit custom instructions (MECIs) to
the RFU. These were constructed by detecting and merging hot basic blocks.
2.7 Summary
This chapter discussed the advantages of coupling a Reconfigurable Processing Unit with a
General Purpose Processor, one of them being that the level of parallelism achieved by an RPU is
an order of magnitude higher than that of a GPP, which makes it possible to accelerate an
application by migrating critical loops to hardware.
It described how important Hardware/Software Partitioning is for improving the speedup
of a program's execution, as well as its power consumption. High Level Synthesis was the first
approach to bring simplicity to the partitioning problem. However, from the point of view of
a software developer, the tools needed were outside the usual tool flow for embedded development,
and the need for digital systems knowledge in order to constantly re-adapt the software and
hardware to work together was (and still is) a difficulty for the typical designer.
Dynamic Partition was discussed as an alternative to the typical source-code-level partition.
This process looks at a program's executable binary to retrieve information about the critical
sections. Tools that look only at the binary do not interfere with the tool flow of the embedded
systems software developer, because partitioning happens after compilation and linking. This
approach has advantages inspired by dynamic compilation, such as multi-platform portability.
Instruction Set Simulators were discussed in order to explain the shift to a new ISS and,
consequently, to a completely new application developed in a different language, although
reusing some of the algorithms.
The previous work was then discussed. An overview was given of the architecture, which
uses a GPP coupled with an RPU to accelerate critical sections of a program. A mixed
offline/online approach is used, which divides hardware/software partitioning into four phases:
Detection, Translation, Identification and Replacement. The Detection phase was analyzed further,
since it is the main focus of this thesis. It was discussed what a Megablock is and how it can be
detected in a program execution trace.
Finally, some related work was presented. The Warp Processor takes a fully online approach to
the dynamic partitioning problem, using decompilation of the binary to extract high-level
information and online CAD tools to reconfigure the hardware. The ADEXOR approach introduced
the concept of Application Specific Instruction Set, and used an offline profiling phase to
configure the Custom Instructions in the ASIP.
Megablocks are a single-path oriented representation of critical sections. In the next
chapter a new representation of the critical sections of a program execution will be presented.
The main goal is to retrieve information about possible alternative paths of the core loop and
even to be able to represent high-level information, such as nested loops. The intent is to give
the hardware developer more information that can be useful to prolong the execution of each RPU
call.
Chapter 3
The MultiPath Execution Block
This chapter describes the MultiPath Execution Block, the problems that motivated it and our
approach to solving them. It also discusses the algorithms used in previous work for Megablock
detection and our adaptations to extract multi-path information.
3.1 Introduction
In previous work, the critical sections of the execution of a given program were represented by
Megablocks. Megablocks (see subsection 2.5.3) are trace-based loops which represent a
significant path of the application execution that was taken a substantial number of times.
Whenever a loop contains a conditional structure that may divide the execution flow in two, the
Megablock representation of such a critical section is either the most representative path taken,
or the two paths represented by two different Megablocks that start at the same instruction
address.
When a program reaches a section marked as a Megablock, it shifts execution to the hardware.
The instructions in the Megablock are executed, with a test at each one to determine whether
execution should stay in the Reconfigurable Processing Unit (RPU) or move back to the
General Purpose Processor (GPP). If a critical section such as the one described above is being
executed, both possible scenarios result in a loss of performance compared to the full potential
speedup. If only one path is translated, coverage is lost, even though the two paths are
probably very similar, which means they could both be executed in hardware without adding too
many components. If both paths are translated, performance is lost to communication overhead.
Either way, if the hardware developer had full information about the critical section, they
could implement the hardware more cleverly, choosing what to implement given its impact on the
final speedup. This is the theme of this chapter.
The MultiPath Execution Block (MEB) is a graph-based representation of a critical section
of an application. Although its core is a critical loop, as in the Megablock, the MEB can
also represent alternative paths that occur inside a loop. For example, the MEB can merge two
identical critical loops that are almost adjacent, together with the in-between computations,
thereby providing a longer pattern to be executed in the RPU.
Figure 3.1: Example code of a for loop containing a two-way path
When a loop contains a conditional structure that splits the execution path, one path is
normally more significant than the other. Instead of just outputting the information for the
most executed path, the MEB joins the two paths, merging the common instructions and
creating a structure which shows all the paths that occurred. Consider Figure 3.1, which shows a
for loop containing an if construct. The MEB would represent not only the most significant path
taken while in the loop, but all the alternatives, so that execution does not have to migrate
from hardware to software in order to continue.
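The idea of merging common instructions into a single structure can be illustrated with a small
graph built from a trace of instruction addresses. This Python sketch is not the MEB detection
algorithm developed in this thesis; the function names and the addresses in the example are
assumptions, and the sketch only shows how two paths through the same loop collapse into one
graph with a branching node.

```python
from collections import Counter
from typing import Dict, Sequence, Set


def build_path_graph(trace: Sequence[int]) -> Counter:
    """Count every (source, destination) pair of consecutively executed addresses."""
    return Counter(zip(trace, trace[1:]))


def branching_nodes(edges: Counter) -> Dict[int, Set[int]]:
    """Return the addresses that were followed by more than one distinct successor."""
    successors: Dict[int, Set[int]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    return {src: dsts for src, dsts in successors.items() if len(dsts) > 1}
```

For a loop whose iterations follow either the path `0x10 0x14 0x18 0x1C 0x20` (condition true)
or `0x10 0x14 0x1C 0x20` (condition false), the instructions shared by both paths appear only
once in the graph, and `branching_nodes` reports `0x14` as the point where the two paths diverge.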
3.2 Motivation
There were two distinct approaches to improving the previous work. One was to migrate the
Detection phase to runtime; the other was to improve the results of this phase.
In this thesis we followed the latter approach, motivated by two main factors:
• Coverage
• Communication overhead
When the ultimate goal of a system is to successfully partition an application between software
and hardware in order to speed up execution, these two factors play a key role in the
process.
The critical sections identified in a program execution are expected to be moved from the GPP
to the coprocessor. The ratio between the number of instructions that would be migrated and
the total number of program instructions is called coverage (see Equation 3.1).
\[ Coverage = \frac{Execution_{Moved}}{Execution_{Total}} \times 100 \qquad (3.1) \]
In 1967, Gene Amdahl formulated a limit on parallel computing: if 5% of the instructions of a
program are strictly sequential, i.e. cannot be executed concurrently, the maximum speedup that
can be achieved is 20x the speed achieved with a single processor. Applied to our problem, if
we have a hardware accelerator that improves the execution time by a factor of SpeedupHw when
compared to the GPP, we can use Amdahl's Law [44] (see Equation 3.2, with Coverage expressed as
a fraction rather than a percentage) to calculate the maximum speedup that can be expected
using a coprocessor. Figure 3.2 shows the relation between overall speedup and coverage for
different values of SpeedupHw, not accounting for the effects of communication overhead. The
chart shows that, even with a very high hardware speedup, coverage is a key factor that bounds
the maximum speedup that can be achieved. Even if we had a hardware accelerator that could
boost execution infinitely, the maximum overall speedup would be given by Equation 3.3.
Figure 3.2: Relation between SpeedupOverall and Coverage for different values of SpeedupHw
\[ Speedup_{Overall} = \frac{1}{(1 - Coverage) + \frac{Coverage}{Speedup_{Hw}}} \qquad (3.2) \]
\[ Speedup_{OverallMax} = \frac{1}{1 - Coverage} \qquad (3.3) \]
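As a minimal sketch of Equations 3.2 and 3.3, the two functions below compute the overall
speedup from a coverage value and a hardware speedup. The function names are illustrative and
are not part of the original tool; coverage is passed as a fraction between 0 and 1, not as the
percentage of Equation 3.1, and communication overhead is ignored.

```cpp
#include <cassert>

// Equation 3.2: overall speedup when a fraction `coverage` (0..1) of the
// execution runs `speedupHw` times faster and the rest is unchanged.
// Communication overhead is not modelled here.
double overallSpeedup(double coverage, double speedupHw) {
    assert(coverage >= 0.0 && coverage <= 1.0 && speedupHw > 0.0);
    return 1.0 / ((1.0 - coverage) + coverage / speedupHw);
}

// Equation 3.3: the limit of Equation 3.2 as speedupHw grows without bound.
double maxOverallSpeedup(double coverage) {
    assert(coverage >= 0.0 && coverage < 1.0);
    return 1.0 / (1.0 - coverage);
}
```

For instance, overallSpeedup(0.5, 7.0) evaluates to 1.75, while maxOverallSpeedup(0.5) is 2:
with half of the execution left on the GPP, no accelerator can do better than doubling the speed.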
However, high coverage does not necessarily mean a considerable improvement in overall
speedup if there is a large amount of communication overhead between the GPP and the RPU.
This amount is intrinsically linked to how coverage is distributed throughout the program
execution: if a high coverage is spread over many small parts instead of being concentrated in
one or two blocks, the volume of communication overhead increases, decreasing the overall
speedup of the application. Communication overhead can be decreased by migrating larger
sections to the RPU instead of many small ones.
In an analogous fashion to teamwork, communication overhead can be seen as the proportion
of time that team members spend communicating with each other, instead of getting productive
work done.
Previous work [5][18] calculated the penalties incurred by communication overhead. As
mentioned in subsection 2.5.4, communication overhead accounts for an average of 6.4% of the
time required for migration and RPU execution, and for 3.6% of the total execution time. A
single call to the RPU takes 27 clock cycles on average. Some of the tested instruction sets
showed worse overhead behavior; one of them exhibited a communication overhead of 13.8% of the
total execution time.
3.2.1 A motivating example
To conclude the motivation section, the following example demonstrates how the MultiPath
Execution Block representation could improve the results of a detector that uses the previous
representation.
A Megablock has a maximum size for the number of elements that a repeating sequence can
have. Let the maximum size be 300 and let a repeating sequence have 20 elements. If this
sequence represents a loop that runs 16 times, the output Megablock would look like this:
-------------------------------------
Megablock occurred 16 times
0x000001e4 lwi r3, r19, 12
0x000001e8 bslli r3, r3, 12
0x000001ec addk r4, r3, r0
0x000001f0 lwi r3, r19, 36
0x000001f4 addk r5, r4, r3
0x000001f8 lwi r3, r19, 8
0x000001fc lwi r4, r19, 4
0x00000200 bslli r3, r3, 4
0x00000204 addk r3, r3, r4
0x00000208 bslli r3, r3, 2
0x0000020c addk r3, r3, r5
0x00000210 addk r4, r0, r0
0x00000214 swi r4, r3, 0
0x00000218 lwi r3, r19, 4
0x0000021c addik r3, r3, 1
0x00000220 swi r3, r19, 4
0x00000224 lwi r3, r19, 4
0x00000228 addi r18, r0, 15
0x0000022c cmpu r18, r3, r18
0x00000230 bgei r18, -76
-------------------------------------
From trace analysis we discovered that this loop is actually inside another loop that iterates
32768 times. For the purposes of this example, consider a total of 2 million instructions and
SpeedupHw = 7x. This would give a coverage of:
\[ Coverage = \frac{32768 \times 20}{2000000} \times 100 = 32.768\% \qquad (3.4) \]
Giving a SpeedupOverall of:
\[ Speedup_{Overall} = \frac{1}{(1 - 0.32768) + \frac{0.32768}{7}} \approx 1.39 \qquad (3.5) \]
This is a good value for a single identified loop, but we have to consider the communication
overhead between the GPP and the RPU. We have seen that the average Instructions per Cycle
(IPC) is 2.42. This means that executing one iteration of this 20-instruction sequence takes an
average of 8.26 cycles, so a complete Megablock of 16 iterations takes about 132 cycles. If we
add the 27 cycles that a call to the RPU takes on average, we end up with 159 cycles, a 16.99%
increase in time. In total, our 32768 calls to the RPU add 884736 cycles, reducing
SpeedupOverall to 1.15x.
With the MEB we would need only one call to the RPU, because it recognizes the alternative
path taken in the execution of the outer loop. The following excerpt of a MEB extractor result
shows the base path and the alternative path.
**************************************************
Base path occurred 16 times
0x000001e4 lwi r3, r19, 12
0x000001e8 bslli r3, r3, 12
0x000001ec addk r4, r3, r0
0x000001f0 lwi r3, r19, 36
0x000001f4 addk r5, r4, r3
0x000001f8 lwi r3, r19, 8
0x000001fc lwi r4, r19, 4
0x00000200 bslli r3, r3, 4
0x00000204 addk r3, r3, r4
0x00000208 bslli r3, r3, 2
0x0000020c addk r3, r3, r5
0x00000210 addk r4, r0, r0
0x00000214 swi r4, r3, 0
0x00000218 lwi r3, r19, 4
0x0000021c addik r3, r3, 1
0x00000220 swi r3, r19, 4
0x00000224 lwi r3, r19, 4
0x00000228 addi r18, r0, 15
0x0000022c cmpu r18, r3, r18
0x00000230 bgei r18, -76
--------------------------------------------------
Alternative Path occurred 1 time
0x000001e0 bri, 68
0x000001dc swi r0, r19, 4
0x0000024c bgei r18, -112
0x00000248 cmpu r18, r3, r18
0x00000244 addi r18, r0, 63
0x00000240 lwi r3, r19, 8
0x0000023c swi r3, r19, 8
0x00000238 addik r3, r3, 1
0x00000234 lwi r3, r19, 8
0x00000230 bgei r18, -76
**************************************************
3.3 What is the MEB?
This section describes what a MultiPath Execution Block is and how it is composed.
As mentioned, the MEB is inspired by the graph data structure: a node in the graph represents
an instruction in the MEB, and an edge indicates the execution flow of the program.
Section 2.1 showed that some instructions can change the execution flow: the branch
instructions. A branch instruction can divide the path in two, and this is the reason for
building a graph-based structure, in which a node can have more than one outgoing path.
Figure 3.3 shows, on the left, a pattern occurring in the execution trace in which, at a
certain point, a slightly different path is taken before the execution resumes the previous
pattern. On the right is the MEB representation of all the paths and of how many times each
path was taken; in this case, the main path occurred 7 times and the alternative path only
once.
While the Megablock defines a critical loop, a MEB delivers that information together with
possible alternative paths to the main loop, in order to keep the execution in the RPU for
longer. This aims to increase overall coverage, as well as the amount of work done each time
the execution moves to the RPU, possibly by merging Megablocks and adding alternative paths.
In this way we can also decrease communication overhead, improving the application speedup.
3.4 The MEB Extractor
The creation of a MEB starts from a critical loop, which can then grow into a multi-path
structure if the profiling phase detects subsequent paths that are alternatives to the
detected one, or even paths that include information from an outer loop, making it possible to
represent high-level loop constructs.
The detection of the starting loop is the same as for the Megablock (see section 2.5.3 for
more detailed information), so we first analyze the process that was used to detect Megablocks
and then introduce our adaptations to show how MEBs are built. Before that, we give a
top-level view of the whole process; each component is then discussed thoroughly.
3.4.1 Top level view
When profiling an application, the goal is to detect all the critical sections that should be
moved to hardware. We used a map structure to save all the MEBs found and to group them
appropriately. The process has two main stages: a pattern detector and a MEB normalizer. As the
names indicate, we first detect a pattern in the execution trace and then build and normalize a
multi-path structure taking the execution flow into account. This means that we do not create
a new multi-path structure every time a pattern is detected; instead, we work out what the
pattern represents and add the new information to a MEB map accordingly. Figure 3.4 gives an
overview of the process of profiling an application, instruction by instruction, and creating a
map data structure with all the detected MEBs.
Figure 3.3: A simplified example of a Multipath Execution Block representation (some sequential
nodes were omitted for demonstration purposes)
3.4.2 The Pattern Detector
This subsection discusses the work done previously. It describes the module that detects the
current state of a pattern, which we call the Pattern Detector. Because the MEB is a structure
built on top of a critical loop, we took advantage of the work already done to detect
Megablocks and built a more complex structure with it as the base.
Figure 3.4: Process of MEB construction
When detecting Megablocks in an instruction trace, the Pattern Detector follows a three-step
process, which starts when a new instruction is read and finishes with the current pattern
state. This is based on the premises that Megablocks are repeating patterns in the execution
and that we need to know the current state of a pattern in order to determine whether it is a
continuation of the last pattern detected or a different one, and to calculate how many
iterations each pattern has. Figure 3.5 shows the architecture of the pattern detector.
Detection Algorithm
Finding repetitions in an instruction trace is related to the problem of finding repeating
elements in strings. A tandem repeat, or square, is a string of the form αα, where α is a
non-empty substring [45]. In the case of Megablocks we analogously try to find these tandem
repeats, where a sequence of instructions S is equivalent to a repeated element in a string and
represents a single iteration of a loop. The default elements of S are single instructions;
however, the algorithm can also be fed with coarser structures such as basic blocks. Although
the goal is to find loops with many iterations, it was assumed that once a tandem repeat is
detected many other iterations of the same pattern will follow. The algorithm therefore
considers two repetitions enough to initiate the pattern recognition process.
Figure 3.5: Pattern Detector Module
The detection algorithm is based on the assumption that ultimately the profiling phase would
be done online, so algorithms for building suffix-trees[46] were not considered since they would
require preprocessing the trace to produce results. Furthermore, each instruction that has to be
executed comes at a constant rate from the GPP, so the algorithm should have a constant processing
time for each input, so it can be synchronized with the processor.
The pseudo-code for the algorithm is presented in Figure 3.6. This algorithm needs to know a
priori the maximum size that a repeating sequence of instructions can have.
The algorithm uses a FIFO (First In First Out) queue to store the elements already processed.
A counter array is used to store information about tandem repeats and their size. When a new
element is received, it is compared with the previous elements contained in the matching FIFO.
If the new element matches the element at distance i in the FIFO, the counter at index i is
incremented by one, up to a maximum of i; otherwise that counter is reset to zero. After the
whole matching FIFO has been swept, another loop searches for tandem repeats: when the counter
at index i equals i, a tandem repeat has occurred and i is its size. Finally, the element is
inserted in the matching FIFO; if the FIFO is full, the oldest element is popped out.
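The description above can be sketched as follows. This is only one reading of the prose, not
the pseudo-code of Figure 3.6: the class name TandemRepeatDetector and its interface are
hypothetical, and the compare sweep and the search for tandem repeats are folded into a single
loop for brevity. The work per input is bounded by M, so it is constant for a fixed M.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch of an online tandem-repeat detector with maximum size M.
template <typename T>
class TandemRepeatDetector {
public:
    explicit TandemRepeatDetector(std::size_t maxSize)
        : maxSize_(maxSize), counters_(maxSize + 1, 0) {
        assert(maxSize > 0);
    }

    // Processes one element and returns, in increasing order, every size i
    // (1..M) for which a tandem repeat of size i ends at this element.
    std::vector<std::size_t> push(const T& element) {
        std::vector<std::size_t> sizes;
        for (std::size_t i = 1; i <= maxSize_; ++i) {
            // fifo_[i - 1] is the element received i positions ago.
            const bool match = i <= fifo_.size() && fifo_[i - 1] == element;
            if (!match) {
                counters_[i] = 0;          // run of matches at distance i broken
            } else if (counters_[i] < i) {
                ++counters_[i];            // saturates at i
            }
            if (counters_[i] == i) sizes.push_back(i);
        }
        fifo_.push_front(element);
        if (fifo_.size() > maxSize_) fifo_.pop_back();  // drop the oldest element
        return sizes;
    }

private:
    std::size_t maxSize_;
    std::vector<std::size_t> counters_;  // counters_[i]: consecutive matches at distance i
    std::deque<T> fifo_;                 // front = most recent element
};
```

Feeding the six elements of AAAAAA to a detector with M = 3 makes the last call report the
sizes 1, 2 and 3, while ABAB reports only size 2 on its last element.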
According to the algorithm, processing a single input can produce up to M matches. After an
input pattern of AAAAAA, three results emerge: (i) six iterations of a sequence of size 1: A;
(ii) three iterations of a sequence of size 2: AA; and (iii) two iterations of a sequence of
size 3: AAA. Thus, an arbiter is needed.
Figure 3.6: Algorithm for finding tandem repeats with a maximum size M.
Calculate Pattern Size
The arbiter is a function that calculates the new pattern size by selecting the match most
relevant to our problem. We can target inner loops by prioritizing the patterns with the
smallest size. To target outer loops, for example when the input is AABBAABB and the desired
result is two iterations of the sequence AABB, we give priority to the largest pattern size.
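A minimal sketch of such an arbiter is given below, assuming that the detector reports the
sizes of all tandem repeats found for the current input. The names LoopTarget and
selectPatternSize are hypothetical, and a result of 0 is used here to mean that no pattern is
active.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Which loops the arbiter should favour (hypothetical names).
enum class LoopTarget { Inner, Outer };

// Returns the selected pattern size, or 0 when no tandem repeat was reported.
std::size_t selectPatternSize(const std::vector<std::size_t>& matches,
                              LoopTarget target) {
    if (matches.empty()) return 0;
    // Inner loops: smallest size has priority. Outer loops: largest size.
    auto it = (target == LoopTarget::Inner)
                  ? std::min_element(matches.begin(), matches.end())
                  : std::max_element(matches.begin(), matches.end());
    assert(*it > 0);  // a reported tandem repeat always has a positive size
    return *it;
}
```

With the sizes 1, 2 and 3 reported for AAAAAA, targeting inner loops selects 1 and targeting
outer loops selects 3.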
Determine Pattern State
The last step of the process is a state machine that determines, each time a new instruction
arrives, what it represents in terms of pattern recognition. The state machine has five
possible states: PatternStarted, PatternStopped, PatternChangedSize, PatternUnchanged and
NoPattern. The inputs that determine each state are the current pattern size and the new
pattern size. The full behavior of the state machine is shown in Table 3.1.
Table 3.1: State Machine for Megablock Detection
Current Pattern Size | New Pattern Size | Equal Pattern Sizes | New State
0                    | 0                | true                | No Pattern
> 0                  | > 0              | true                | Pattern Unchanged
0                    | > 0              | false               | Pattern Started
> 0                  | > 0              | false               | Pattern Changed Size
> 0                  | 0                | false               | Pattern Stopped
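The rows of Table 3.1 translate directly into a small function. The sketch below assumes that
a pattern size of 0 stands for "no pattern"; the enumeration and function names are
illustrative and may differ from the ones used in the application.

```cpp
#include <cassert>
#include <cstddef>

enum class PatternState {
    NoPattern,
    PatternUnchanged,
    PatternStarted,
    PatternChangedSize,
    PatternStopped
};

// Maps (current pattern size, new pattern size) to the new state of Table 3.1.
PatternState nextState(std::size_t currentSize, std::size_t newSize) {
    if (currentSize == newSize) {
        return currentSize == 0 ? PatternState::NoPattern
                                : PatternState::PatternUnchanged;
    }
    if (currentSize == 0) return PatternState::PatternStarted;
    if (newSize == 0) return PatternState::PatternStopped;
    // Only remaining row: both sizes are positive and different.
    assert(currentSize > 0 && newSize > 0);
    return PatternState::PatternChangedSize;
}
```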
3.4.3 The MEB normalizer
We now describe the process of creating a single MEB and then how the process works when two
MEBs are merged.
In the detection algorithm we used a FIFO queue to store the last M (maximum pattern size)
elements that were processed. The following list describes what happens for each state of a
pattern:
• Pattern Stopped: When a pattern stops, this means that an instruction occurred that was not
present in the last pattern detected. We save the stopped pattern in the MEB map, as
described in subsection 3.4.4. But instead of resuming and processing a new instruction, we
start a process called Waiting and save the last element in a list called waitingBuffer.
• No Pattern: While this state is active, the Waiting process can be active as well. This
process consists of waiting for a new pattern to arrive; the maximum waiting time is equal to
M, the maximum pattern size. This means that we read M instructions and, if no pattern has
occurred in the meantime, we stop waiting and flush the waitingBuffer. While we are waiting,
every new instruction that arrives is saved in the waitingBuffer.
• Pattern Started: A new pattern has started. If we are still in the Waiting process, we
compare the newly detected pattern with the last one and search for common nodes. If there are
common nodes, we merge the current MEB with the last one detected and continue to build on
the merged MEB. As a last step before resuming, we add every instruction in the
waitingBuffer, i.e. those saved between the saving of the old MEB and the detection of the
new one. These instructions represent an alternative path, which can be a shorter path, a
longer one, or even instructions from an outer loop. Sometimes this information could be
wrong, as will be discussed ahead in section X. If there are no common nodes, or if we are
not in the Waiting process, we create a new MEB and continue the process.
• Pattern Unchanged: Although the pattern is still the same, the iterations need to be
counted, which is done in this state.
• Pattern Changed Size: In this state a pattern was active but changed size, which means that
the new pattern is an alternative, longer or shorter, to the last one. So we do not stop and
save the last pattern; instead, we add the new edges to the old pattern and continue
execution.
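The bookkeeping of the Waiting process described in the list above could look like the
following sketch. The class WaitingBuffer and its member functions are hypothetical,
instructions are reduced to their addresses, and the assumption is made that waiting expires
exactly when the M-th instruction arrives without a new pattern. The merging of MEBs itself is
not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper that tracks the Waiting process of the MEB normalizer.
class WaitingBuffer {
public:
    explicit WaitingBuffer(std::size_t maxPatternSize) : limit_(maxPatternSize) {
        assert(maxPatternSize > 0);
    }

    // Pattern Stopped: start waiting and keep the element that ended the pattern.
    void start(unsigned int lastAddress) {
        buffer_.assign(1, lastAddress);
        seen_ = 0;
        waiting_ = true;
    }

    // No Pattern: store the instruction; after M instructions without a new
    // pattern, stop waiting and flush. Returns true while still waiting.
    bool onNoPattern(unsigned int address) {
        if (!waiting_) return false;
        buffer_.push_back(address);
        if (++seen_ >= limit_) flush();
        return waiting_;
    }

    // Pattern Started while waiting: hand over the buffered instructions, which
    // form the candidate alternative path, and stop waiting.
    std::vector<unsigned int> take() {
        std::vector<unsigned int> path;
        path.swap(buffer_);
        waiting_ = false;
        seen_ = 0;
        return path;
    }

    bool waiting() const { return waiting_; }

private:
    void flush() {
        buffer_.clear();
        waiting_ = false;
        seen_ = 0;
    }

    std::size_t limit_;                 // M, the maximum pattern size
    std::size_t seen_ = 0;              // instructions read since waiting began
    bool waiting_ = false;
    std::vector<unsigned int> buffer_;  // the waitingBuffer of the text
};
```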
3.4.4 The process of saving a MEB
A map is a collection of key-value pairs, in which each value can be identified by a unique
key. In our case we use an instruction address as the key to a list of MEBs. As we have seen,
every instruction consists of a unique address and a corresponding operation. Because a MEB is
composed of instructions, we identify each MEB by the lowest address among its instructions.
Adding a new MEB to the map consists of checking whether there is a key equal to the
identifier address of the new MEB. If there is, we add the new element to the back of the list
pointed to by the key; if not, we create a new entry in the map with the key equal to the
lowest address and a list containing our new MEB.
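The saving step can be sketched with the standard containers as shown below. The Meb type is
a deliberately reduced, hypothetical stand-in that only keeps instruction addresses, and
MebMap, identifierOf and saveMeb are illustrative names rather than the ones used in the
application.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a MEB: only the instruction addresses it contains.
struct Meb {
    std::vector<unsigned int> addresses;
};

// Key: identifier address. Value: every MEB saved under that identifier.
using MebMap = std::map<unsigned int, std::vector<Meb>>;

// The identifier of a MEB is the lowest instruction address it contains.
unsigned int identifierOf(const Meb& meb) {
    assert(!meb.addresses.empty());
    return *std::min_element(meb.addresses.begin(), meb.addresses.end());
}

// Appends the MEB to the list stored under its identifier. operator[] first
// creates an empty list when the key is not yet present in the map.
void saveMeb(MebMap& mebMap, const Meb& meb) {
    mebMap[identifierOf(meb)].push_back(meb);
}
```

Two MEBs whose lowest address is 0x1e4 end up in the same list, which is what later allows
the coverage of that address to be computed over the whole group.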
Now that every MEB is grouped, we can describe the coverage that each address represents in
the program by merging all the MEBs that share the same identifier. This is very useful for
deciding which cases are the best candidates to translate to hardware. We can also see whether
an alternative path is representative enough to justify the extra hardware, or whether we can
reuse the hardware parts that execute the most significant paths.
3.5 Summary
This chapter presented the MultiPath Execution Block, as well as the methods for detecting
such structures. The MEB is a graph-based structure that allows multiple paths to be saved in
one representation, thus providing better overall profiling information.
Previous work revealed a problem: the Communication Overhead between the GPP and the RPU.
Decreasing this component would further increase the speedup because, even with high Coverage
of the execution moved to the RPU, the execution most often moved too frequently from one
processing unit to the other, losing valuable time and thus limiting the overall speedup. A
representation of alternative paths, nested loops or other high-level constructs gives
important information to the hardware developer. In this way the execution can stay longer in
the RPU, reducing the Communication Overhead.
The algorithms used in previous work to detect Megablocks were presented, together with the
new methods used to adapt them to the detection of more complex structures, the MEBs. The
detection phase was analyzed in full, as well as how the MEB information is saved.
The next chapter gives a detailed analysis of the implementation of the application for
detecting MEB structures. It presents an overview of the system architecture, as well as the
class diagrams for the main software components.
Chapter 4
Software Implementation and System
Architecture
This chapter describes the software implementation. It shows the implementation of the
pattern recognition algorithm and discusses the system architecture from top to bottom,
starting with the high-level architecture and followed by a characterization of each element
and its operation.
4.1 Overview
This section presents an overview of the system architecture, which can be seen in Figure
4.1. The architecture can be divided into four main elements, which make the functionality of
the whole program clearly visible. The Instruction Set Simulator (ISS) iterates over the
instructions one by one. An instruction decoder provides the information about the addresses,
operations and registers used. The pattern detector (see subsection 3.4.2) provides the state
of the current pattern to a MEB normalizer (see subsection 3.4.3), which updates a MEB map
structure (see subsection 3.4.4).
The data structures used throughout the application are also discussed, since they play a
large role in the success of the program. We start with the description of what an instruction
is, then move on to the instruction set simulator and the Open Virtual Platforms (OVP) tools,
and then to the instruction decoder, to describe how an instruction is translated from binary
to assembly code. Finally, the software engineering behind the MEB (see chapter 3) is
explained in detail: how the pattern detector and the MEB normalizer are implemented and what
the outputs of the application are.
4.2 A general instruction
The application was designed to scale and adapt easily to other target processors, as these
were its main requirements. Every processor has a different Instruction Set Architecture
(ISA), but after compilation every instruction is always composed of an address and an
operation. These are the two main components of an instruction.
Figure 4.1: System architecture
The target processor is the Microblaze, but instead of creating an application that could
only be used to test this processor, the instruction classes were designed so that the
Microblaze-specific functions are used only in the decoding process, while the rest of the
program uses a general instruction that can represent an instruction of any processor (see
Figure 4.2). As can be seen, all the information needed to detect patterns and produce output
can be accessed by working only with objects of the Instruction class. The functions of the
MicroblazeInstruction class are only useful when the instruction is translated from binary to
assembly code.
Figure 4.2: Class diagram of the classes Instruction and MicroblazeInstruction
4.2.1 Smart pointers
From this section onwards, Instruction class objects appear in code excerpts as if they were
stored on the stack, to keep the demonstrated code simpler. In reality, one of the new C++11
features, the shared pointer, was used to store all the instructions read on the heap.
Java developers do not need to worry about freeing memory. The language provides a tool
called the Garbage Collector, which sweeps the memory and reclaims everything that is no
longer in use. C and C++ developers, on the other hand, have always had to free the memory of
every object they allocate in order to avoid memory-related errors. Java developers may say
that freeing memory is a task that a machine should do; C/C++ developers argue that with
Garbage Collection one never knows when a resource becomes available again, since the tool
operates non-deterministically. Now, however, C++ developers have a deterministic way of
knowing resource availability without having to worry about freeing previously allocated
memory. A smart pointer is a pointer that deletes the object it owns when the pointer goes out
of scope (with shared ownership, when the last owner does).
C++11 introduced two kinds of owning smart pointers: unique_ptr and shared_ptr. The
unique_ptr, as the name indicates, implements exclusive ownership: while a pointer of this
kind points to an object it has exclusive control over it, and it cannot be copied or
copy-assigned to another pointer. The only way of passing the object around is to move it from
one pointer to another. A shared_ptr, on the other hand, is a pointer that shares ownership.
This is achieved by associating a reference counter with the object. Two shared pointers can
point to the same object: both hold the memory address where the object is allocated and the
address of the object's Control Block, which contains the reference counter, a weak count
(which refers to weak pointers, another kind of smart pointer) and other data [6]. An example
of a shared pointer can be seen in Figure 4.3. A shared pointer occupies more memory than a
unique pointer because it also has to store a pointer to the object's Control Block.
Figure 4.3: A typical smart pointer pointing to the object and its Control Block, source [6]
In this work, shared pointers were used to store every instruction, because references to the
same object were needed in more than one container. So, from now on, every time an Instruction
object is mentioned, a shared pointer is actually being used to manage the memory of each
object.
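The following sketch illustrates why shared ownership fits this use. The Instruction struct
is a reduced, hypothetical stand-in for the class of Figure 4.2, and the two vectors merely
play the role of two containers that reference the same instruction; they are not the
containers of the actual application.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, reduced stand-in for the Instruction class.
struct Instruction {
    unsigned int address;
    unsigned int operation;
};

using InstructionPtr = std::shared_ptr<Instruction>;

// Stores the same heap-allocated Instruction in two containers and returns the
// number of shared owners at that moment.
long shareBetweenContainers(std::vector<InstructionPtr>& fifo,
                            std::vector<InstructionPtr>& path) {
    InstructionPtr inst = std::make_shared<Instruction>(Instruction{0x1e4u, 0u});
    fifo.push_back(inst);  // copying the pointer increments the reference count
    path.push_back(inst);
    assert(fifo.back().get() == path.back().get());  // one object, several owners
    return inst.use_count();  // the local pointer plus the two containers
}
```

After the function returns, the local pointer is gone and two owners remain. Clearing one of
the containers leaves the object alive; it is deleted only when the last container releases it.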
4.3 Instruction Set Simulator, the OVPsim
The Open Virtual Platform Simulator (OVPsim) is a Just-in-Time Code Morphing (binary
translation) simulator engine that dynamically translates target instructions to x86 host
instructions [47]. OVPsim was tested with over 1000 processors, providing a trustworthy and
fast simulator. It provides the infrastructure for describing platforms with one or more
processors containing shared memory and buses in arbitrary topologies and peripheral models.
OVPsim can simulate arbitrary multiprocessor shared-memory configurations and heterogeneous
multiprocessor platforms [48]. The simulator was built in C/C++, and the platforms to be
tested also have to be written in these languages. Thus, C++ was chosen as the main language.
In this work, OVPsim was used to create the instruction trace. Instructions are read one at a
time, to simulate a future runtime implementation more accurately, and each one is passed to
the pattern detector.
The class implementing the OVP tools is responsible for creating the platform, i.e. for
creating the Microblaze processor model, and for loading the program into its memory blocks.
Then, as the program runs, it fetches one instruction at a time and constructs an Instruction
class object that is passed to the pattern detector class. A class diagram of this class can
be seen in Figure 4.4.
Figure 4.4: Class diagram of the class OVPmanager
4.4 The Microblaze Instruction Decoder
This is the most target-specific class in the system. It is Microblaze-specific and cannot
work with other processor architectures, which means that a new decoder must be written for
other processors.
To understand how the decoder works, Figure 4.5 shows some typical Microblaze instructions.
Each instruction is either of type A or of type B. It has 6 bits for the opcode, 5 bits for
the destination register, Rd, and another 5 bits for register A, Ra. The last 16 bits differ
between the two types: type A has a register B, Rb, which also takes 5 bits, while type B has
a 16-bit field that is not always the Imm value indicated in the figure.
Figure 4.5: Excerpt of the Microblaze Instruction Set[7]
The decoder relies on the opcode of each instruction to determine which instruction it is,
but sometimes it also needs to resort to the values of registers A, B or D or, in the type B
case, to the 16-bit field in order to identify it.
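Before an instruction can be identified, the fields described above have to be extracted
from the 32-bit instruction word. The sketch below assumes that the fields are packed in the
order given in the text, starting at the most significant bit (opcode, Rd, Ra, then either Rb
or the 16-bit field). The Fields struct and the splitFields function are hypothetical and are
not the getters of the MicroblazeInstruction class.

```cpp
#include <cassert>
#include <cstdint>

// Raw fields of a 32-bit instruction word (hypothetical helper type).
struct Fields {
    std::uint32_t opcode;  // 6 bits
    std::uint32_t rd;      // 5 bits, destination register
    std::uint32_t ra;      // 5 bits, register A
    std::uint32_t rb;      // 5 bits, register B (meaningful for type A)
    std::uint32_t imm;     // 16 bits (meaningful for type B)
};

// Splits an instruction word, assuming the opcode sits in the top 6 bits.
Fields splitFields(std::uint32_t word) {
    Fields f;
    f.opcode = (word >> 26) & 0x3Fu;
    f.rd = (word >> 21) & 0x1Fu;
    f.ra = (word >> 16) & 0x1Fu;
    f.rb = (word >> 11) & 0x1Fu;
    f.imm = word & 0xFFFFu;
    // Rb overlaps the top 5 bits of the 16-bit field; the instruction type
    // decides which of the two interpretations applies.
    assert(f.rb == (f.imm >> 11));
    return f;
}
```

Under this layout the word 0x00221800 splits into opcode 0x00, Rd = 1, Ra = 2 and Rb = 3,
which corresponds to the expression add r1, r2, r3 used as an example in this section.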
The following code example detects one instruction based on the opcode. Consider mbInst a
pointer to an object of the class MicroblazeInstruction.
switch (mbInst->getOpcode()) {
    case 0x00:  // opcode 0x00 is decoded as "add", a type A instruction
        mbInst->setId(count);
        mbInst->setName("add");
        mbInst->setType(MicroblazeInstruction::A);
        // Build the assembly expression, e.g. "add r1, r2, r3"
        ss << mbInst->getName()
           << " r" << mbInst->getRd()
           << ", r" << mbInst->getRa()
           << ", r" << mbInst->getRb();
        mbInst->setExpression(ss.str());
        return mbInst;
    ...
}
Starting from an instruction object populated only with the address and operation values, the
decoder adds the id, which is useful for counting purposes, the name of the operation, in this
case add, and the whole expression including register details, in the form of, for example,
add r1, r2, r3.
4.5 The MEB classes
The previous chapter discussed what the MultiPath Execution Block is and which problems it
solves. This section explains the implementation used in our system.
4.5.1 Graph Representations
A graph can be represented in various ways. Figure 4.6 shows two of the most common
representations: the adjacency list (in the middle) and the adjacency matrix (on the right).
Although an adjacency matrix allows a faster search for an edge (u, v), an adjacency list was
chosen for this implementation because it takes up less space in memory.
The adjacency-list representation of a graph G = (V, E) consists of an array Adj of |V| lists,
one for each vertex in V. For each u ∈ V, the adjacency list Adj[u] contains all the vertices v
such that there is an edge (u, v) ∈ E. That is, Adj[u] consists of all the vertices adjacent to
u in G [49].
In summary, an adjacency list is a list of |V| pairs, where the first element of a pair is a
node and the second element is the list of nodes connected to it. In a directed graph it also
represents the direction of the edges, which go from the first node to the nodes in its list.
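A generic directed adjacency list can be sketched in a few lines, as below. This is only an
illustration of the representation with nodes identified by plain integers; the class name
AdjacencyList is hypothetical, and the MEB classes described in the following subsections
store richer information in their nodes and edges.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Generic directed graph stored as an adjacency list (illustrative only).
class AdjacencyList {
public:
    // Adds the directed edge (from, to).
    void addEdge(unsigned int from, unsigned int to) {
        adj_[from].push_back(to);
        (void)adj_[to];  // make sure the destination also exists as a vertex
        assert(adj_.count(to) == 1);
    }

    // Searches Adj[from] for `to`; linear in the number of neighbours of `from`.
    bool hasEdge(unsigned int from, unsigned int to) const {
        auto it = adj_.find(from);
        if (it == adj_.end()) return false;
        for (unsigned int v : it->second) {
            if (v == to) return true;
        }
        return false;
    }

    std::size_t vertexCount() const { return adj_.size(); }

private:
    // First element of each pair: a node. Second element: the nodes it points to.
    std::map<unsigned int, std::vector<unsigned int>> adj_;
};
```

Only existing edges consume memory, at the cost of a search through the neighbours of a node
when testing for an edge, which is the trade-off mentioned above.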
Figure 4.6: A graph, an adjacency list representation (in the middle), and an adjacency matrix
(on the right)
4.5.2 Class Diagram
Three classes are used to implement the adjacency-list graph representation. The MultiPath
class has a map structure that stores all the nodes of a given graph. The Node class has its
own map structure, which saves all the nodes it is connected to and the respective edges. The
Edge class saves all the iterations between two nodes. The class diagram of this scheme is
shown in Figure 4.7.
The weight value in the MultiPath class is very useful. It corresponds to the number of
instructions that a single MEB representation executes before it stops. It can be used to
calculate the total Coverage of a MEB, or even of all MEBs starting at one specific address,
and the average Instructions per RPU Call (IpRC) of the same group (see subsection 3.4.4).
As mentioned, the MEB is based on a graph structure composed of nodes and edges.
4.5.3 The Edge class
An object of the Edge class saves the information about the connection between two nodes. In
this implementation it saves how many times the path between the origin node and the
destination node has been taken.
As described for the Node class (subsection 4.5.4), a node is identified by an address, but
one address can correspond to multiple id entries. So, in order to count how many times a path
between two nodes occurred, the source and destination ids are saved as a pair in a list. This
way, the existence of a pair can be checked before adding it, and the application never counts
the same path twice, which is a robust way of avoiding wrong output. The code of the function
that adds an edge between two Instructions is shown below (simplified for demonstration
purposes):
bool addEdge(T from, T to) {
    // Identify this traversal by the trace ids of its two endpoints
    auto newPair = make_pair(from->getId(), to->getId());
    auto it = find(edgeMap.begin(), edgeMap.end(), newPair);
    if (it == edgeMap.end()) {
        // First time this (source id, destination id) pair is seen: count it
        edgeMap.push_back(newPair);
        counter++;
        return true;
    }
    return false;  // pair already recorded, so it is not counted twice
}
Figure 4.7: The MultiPath Execution Block class diagram
The function receives two Instructions as parameters, takes the id of each one and builds a pair.
It then searches for that pair in the list. If the pair is found, the two objects have already been
linked, so the function does nothing and returns false. Otherwise it adds the pair to the list,
increments the counter and returns true. The return value is used in the Node class and later in the
MultiPath class to calculate the weight without counting duplicate instruction ids.
In the future the Edge class could also store more information, such as the values of the registers
for a given iteration, which would be useful when designing the hardware.
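The duplicate check of the Edge class can be sketched in isolation. The following is a minimal sketch and not the thesis code: it assumes that instruction ids are plain unsigned integers, the name EdgeSketch is hypothetical, and it swaps the linear std::find over a list for a std::set, so that the lookup cost is logarithmic in the number of stored pairs.

```cpp
#include <set>
#include <utility>

// Hypothetical stand-in for the Edge class: it only tracks id pairs.
class EdgeSketch {
public:
    // Returns true when the (fromId, toId) pair is new and was counted,
    // false when the same traversal had already been recorded.
    bool addEdge(unsigned int fromId, unsigned int toId) {
        bool inserted = seenPairs.insert(std::make_pair(fromId, toId)).second;
        if (inserted) {
            ++counter;
        }
        return inserted;
    }

    unsigned int getCounter() const { return counter; }

private:
    std::set<std::pair<unsigned int, unsigned int>> seenPairs;
    unsigned int counter = 0;
};
```

Submitting the pair (1, 2) twice increments the counter only once, which is the property the weight calculation relies on.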
4.5.4 Node class
The Node class is where the instruction information is saved. It has an Instruction object and the
corresponding Edge objects, which save the connection information. It is important to distinguish
between an Instruction's id and its address: a node represents a single address, but can
represent multiple ids. The id only marks the moment at which an instruction appears
in the trace and, since an instruction can appear multiple times, one address can have multiple ids.
The Node class stores the adjacency list in a map data structure. Each key-value pair holds
the address of the adjacent node and the matching Edge object. The function for adding an
adjacent node is:
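bool addAdjacentNode(T to){
    unsigned int address = to->getAddress();
    auto it = adjMap.find(address);
    if(it != adjMap.end()){
        // Edge already exists: try to record the new id pair
        if(!(it->second.addEdge(instruction, to)))
            return false;  // duplicate pair, nothing was counted
    } else {
        // First connection to this address: create the edge
        Edge<T> newEdge(instruction, to);
        adjMap.emplace(address, newEdge);
    }
    return true;
};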
The function has an Instruction object as parameter and looks up its address to find out whether
there is already an edge connecting the two nodes. If there is, the addEdge function of the Edge
class (see subsection 4.5.3) is called; if not, a new Edge object is created and the pair (address,
Edge object) is added to the adjacency list. As mentioned for the Edge class, the return
value of the addEdge function is important for computing the weight of a
MEB representation: if the two objects have already been accounted for, the
function returns false, so the weight in the main object is not incremented.
When it is necessary to merge two MultiPaths (see subsection 4.5.5 for more details), a
search is made for nodes common to both, and the mergeAdjacencyMaps function
of the origin node is called for each of them.
void mergeAdjacencyMaps(Node& other) {
    auto otherAdjMap = other.getAdjacencyMap();
    for( auto otherIt = otherAdjMap.begin(); otherIt != otherAdjMap.end();
         otherIt++ ) {
        auto thisIt = adjMap.find(otherIt->first);
        if(thisIt != adjMap.end()) {
            // Edge present in both nodes: add the counters together
            thisIt->second.incrementCounterBy(otherIt->second.getCounter());
        } else {
            // Edge only present in the other node: copy it
            adjMap.emplace(otherIt->first, otherIt->second);
        }
    }
};
The function sweeps the adjacency list of the other node and looks up each entry in the adjacency
list of this node. Recall that an adjacency list is actually a map in which each key-value pair holds
the destination node's address and the Edge object associated with the connection. When a match
is found, the counters of the two Edge objects are added together and the result is saved.
The merge function, in the current implementation, is only used at the end of the
program, to merge the MEBs that are identified by the same address. The edge counter of each
connection has to be taken into account so that more accurate information is provided when
merging two graphs. All of the edge's information (the list of id pairs and the counter) could be
merged, but for speed reasons, and because the list of id pairs is not used again, the
application only adds the counters' values.
Finally, the entries that do not yet exist in this node's adjacency list are added to it.
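The counter-only merge can be illustrated with a reduced model. This is a sketch under stated assumptions, not the thesis implementation: each adjacency entry is reduced to a destination address and a counter, and the names AdjacencySketch and mergeAdjacency are hypothetical.

```cpp
#include <map>

// Hypothetical simplification: each adjacency entry maps a destination
// address to the number of times that edge was taken.
using AdjacencySketch = std::map<unsigned int, unsigned int>;

// Merges `other` into `target`: counters of shared destinations are summed,
// destinations missing from `target` are copied over.
void mergeAdjacency(AdjacencySketch& target, const AdjacencySketch& other) {
    for (const auto& entry : other) {
        auto it = target.find(entry.first);
        if (it != target.end()) {
            it->second += entry.second;
        } else {
            target.emplace(entry.first, entry.second);
        }
    }
}
```

Dropping the id pair lists keeps the merge linear in the number of edges of the other node, at the cost of no longer being able to detect duplicate ids after the merge.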
4.5.5 MultiPath class
The MultiPath class, as shown in Figure 4.7, is the base class for the MultiPath Execution Block.
This class uses a map structure to save all the nodes of a given path. Each entry of the map
is a key-value pair of the form (address, Node object). Each MultiPath object has an associated
weight value, which is later used to compare MultiPath objects that have nodes in common and to
output the most significant one, i.e. the one representing the longest time spent executing on the RPU.
We now discuss some of the main functions of this class, starting with the function
used to add nodes:
typename std::map<unsigned int, Node<T>>::iterator addNode(Instruction in) {
    Node<T> newNode(in);
    typename std::map<unsigned int, Node<T>>::iterator it;
    // emplace_hint returns an iterator to the inserted (or already existing) node
    it = nodeMap.emplace_hint(nodeMap.end(), in->getAddress(), newNode);
    return it;
};
The add node function is really simple: it receives an Instruction object as parameter,
creates a new Node object containing it and saves the node in the map
structure. Notice that the emplace_hint function returns an iterator pointing to the
newly added node. This is very useful in the next function, which adds an edge:
void addEdge(Instruction from, Instruction to) {
    auto itFrom = nodeMap.find(from->getAddress());
    auto itTo = nodeMap.find(to->getAddress());
    // Create the nodes that do not exist yet
    if(itFrom == nodeMap.end()) {
        itFrom = this->addNode(from);
    }
    if(itTo == nodeMap.end()) {
        itTo = this->addNode(to);
    }
    // Only count links that were not seen before
    if(itFrom->second.addAdjacentNode(to)){
        weight++;
    }
};
Here we receive two Instruction objects, and the direction of the edge is implicit in the
declaration. We look up both addresses in the map to see whether nodes have to be created for the
instructions, and then we add the to node to the adjacency list of the from node, as we saw earlier.
The last step of the function is to increment the associated weight. As described for the Edge and
Node classes, the called functions return true when a new link is created or an existing one is
incremented, and false to signal a duplicate pair. So the weight is only incremented when the
called function returns true.
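How the weight interacts with duplicate ids can be seen in a compact, self-contained model. This is only a sketch: InstrSketch and MultiPathSketch are hypothetical names, instructions are reduced to an (id, address) pair, and the Node and Edge classes are folded into nested standard containers.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>

// Hypothetical trace entry: `id` is the position in the trace,
// `address` identifies the instruction (and therefore the node).
struct InstrSketch {
    unsigned int id;
    unsigned int address;
};

class MultiPathSketch {
public:
    // Adds the directed edge from -> to; the weight grows only when the
    // (from.id, to.id) pair has not been counted before.
    void addEdge(const InstrSketch& from, const InstrSketch& to) {
        nodes[to.address];  // make sure the destination node exists
        auto& idPairs = nodes[from.address][to.address];
        if (idPairs.insert(std::make_pair(from.id, to.id)).second) {
            ++weight;
        }
    }

    unsigned int getWeight() const { return weight; }
    std::size_t nodeCount() const { return nodes.size(); }

    // Number of distinct traversals recorded for one edge.
    std::size_t edgeCount(unsigned int from, unsigned int to) const {
        auto n = nodes.find(from);
        if (n == nodes.end()) return 0;
        auto e = n->second.find(to);
        return e == n->second.end() ? 0 : e->second.size();
    }

private:
    using IdPairs = std::set<std::pair<unsigned int, unsigned int>>;
    // address -> (destination address -> traversals seen)
    std::map<unsigned int, std::map<unsigned int, IdPairs>> nodes;
    unsigned int weight = 0;
};
```

Feeding two iterations of a two-instruction loop creates two nodes, and re-submitting an edge with ids that were already seen leaves the weight unchanged.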
There is an important function that tests whether two MultiPath objects have nodes in common.
Its code is:
bool haveCommonNodes(MultiPath& other) {
    auto otherMap = other.getNodeMap();
    typename std::map<unsigned int, Node<T>>::iterator it, jt;
    for(it = otherMap.begin(); it != otherMap.end(); it++) {
        jt = nodeMap.find(it->first);
        if(jt != nodeMap.end()) {
            return true;
        }
    }
    return false;
};
In this excerpt we sweep the nodes of the other map and look up each address in our own map. As
soon as a common node is found we return true. If there are no nodes in common we return false.
Finally, the last function we are going to discuss merges two MultiPath objects.
void mergeWith(MultiPath& other) {
    auto otherMap = other.getNodeMap();
    typename std::map<unsigned int, Node<T>>::iterator otherIt, thisIt;
    for(otherIt = otherMap.begin(); otherIt != otherMap.end(); otherIt++) {
        thisIt = nodeMap.find(otherIt->first);
        if(thisIt != nodeMap.end()) {
            thisIt->second.mergeAdjacencyMaps(otherIt->second);
        } else {
            nodeMap.emplace(otherIt->first, otherIt->second);
        }
    }
}
We have two graphs, this and other, and we want to merge them. First we find the nodes of
other that also exist in this, and for those nodes we merge the two adjacency lists. The nodes of
other that do not exist in this are added to the node map of this. This
function is only used when we know that the two graphs have nodes in common; otherwise
we could end up with two separate graphs inside one MultiPath object.
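The precondition on common nodes can be made explicit by combining the test and the merge. The sketch below is an illustration only: GraphSketch reduces a MultiPath to nested maps of edge counters, and mergeIfRelated is a hypothetical helper that does not exist in the application described here.

```cpp
#include <map>

// Hypothetical simplification of a MultiPath: node address ->
// (destination address -> edge counter).
using GraphSketch = std::map<unsigned int, std::map<unsigned int, unsigned int>>;

// True as soon as one node address is present in both graphs.
bool haveCommonNodes(const GraphSketch& a, const GraphSketch& b) {
    for (const auto& node : b) {
        if (a.count(node.first) != 0) {
            return true;
        }
    }
    return false;
}

// Merges `other` into `target`, summing the counters of shared edges.
// Returns false and leaves `target` untouched when the graphs share no
// node, so two unrelated graphs are never combined by accident.
bool mergeIfRelated(GraphSketch& target, const GraphSketch& other) {
    if (!haveCommonNodes(target, other)) {
        return false;
    }
    for (const auto& node : other) {
        auto& adjacency = target[node.first];  // creates the node if missing
        for (const auto& edge : node.second) {
            adjacency[edge.first] += edge.second;
        }
    }
    return true;
}
```

A loop and an alternative path that shares one address with it merge into a single three-node graph, whereas a graph with no shared address is rejected.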
Figure 4.8: The PatternDetector class diagram
4.6 The PatternDetector class
In this section we discuss the implementation of this class, starting with the class diagram and a
general overview. Next we explain what information has to be kept between iterations of the
program, i.e. between instructions. Finally, we show the implementation of
the process described in subsection 3.4.2 and what happens in an iteration.
4.6.1 Class Overview
In the system architecture presented before, we stated that two of the main components, the pattern
detector and the normalizer, were separate. Functionally they are, and we wanted to express
that each of these components plays a different role in the system. In the current
implementation, however, both are defined inside the PatternDetector class. Figure 4.8 shows
the diagram of the PatternDetector class. As we can see, it implements as functions the arbiter,
which calculates the size of the current pattern, and the state machine, which determines the state
of the pattern. The normalizer is implemented in the function normalize.
4.6.2 Saved Information
When we iterate through a program instruction by instruction, trying to detect patterns, there
are some parameters we must save from one step to the next in order to know what has happened
and what the new information means.
It would be impractical, or in some cases even impossible given the size of the programs, to
save every instruction from the moment it appears until the end of the profiling. Having all the
instructions in memory would help to detect patterns and to extract complex constructs, but
a program can easily have more than 1 million instructions, making this approach unusable. So we
decided to use a window approach: we keep a fixed number of instructions and, when that limit is
reached, we discard the oldest one and add the new element. For this purpose
we used a list structure, called fifo in our implementation: every time a new element is pushed and
the list is full, we pop the oldest element. This way we can detect patterns up to
a certain size, namely the size of the list. We saw in subsection 3.4.2 of Chapter 3
how we detect patterns much bigger than this, using another list called waitingBuffer.
These two lists preserve the information that new iterations need to detect patterns in a much bigger
list of instructions, the program itself.
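The window behaviour can be sketched as a small helper. This is an assumption-laden illustration rather than the customized push used by the application: the name pushBounded and its maxSize parameter are hypothetical, and the newest element is kept at the front of the list.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical bounded window: the newest element sits at the front and the
// oldest one is discarded once `maxSize` elements are stored.
template <typename T>
void pushBounded(std::list<T>& window, const T& value, std::size_t maxSize) {
    assert(maxSize > 0);
    while (window.size() >= maxSize) {
        window.pop_back();  // drop the oldest element
    }
    window.push_front(value);
}
```

With a window of three elements, pushing the values 1 to 5 leaves 5 at the front and 3 at the back, so memory use stays constant regardless of the length of the trace.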
We need to preserve the counting array of the algorithm from iteration to iteration because of
the way it works. Every time a repeated element is found, we increment the position of the
counting array that corresponds to the distance between the two equal elements, i.e. index =
second element's id - first element's id. We have to preserve the counters so that we can keep
incrementing them until they signal a tandem repeat.
When we are building a MEB we cannot dispose of it every time a new instruction appears, so
currentMp saves the current state of a pattern in the form of a MultiPath object (see section
4.5). We also keep a pointer to the last MultiPath object saved in the MEB map
data structure; these are lastSavedMp and multipathMap, respectively.
We save the starting element of a pattern so that the number of times a path was taken is not
updated on every instruction for which the pattern state is UNCHANGED. Otherwise, every new
instruction would trigger an attempt to increment all the edges between the new instruction and
the currentPatternSize-th element of the fifo list.
When a pattern stops, we save the id of the new element that changed the state to STOPPED.
This information is relevant when we are merging two MEBs and adding an alternative path, as
we will see below, and above all to know whether we should keep waiting for a new pattern
or consider the last pattern closed. This is intrinsically linked with the variable
isWaiting, which plays a big role in the whole system.
4.6.3 An iteration
An iteration, from the point of view of the pattern detector, has four main steps:
• Run the detection algorithm
• Arbitration of the pattern size
• Determination of the new pattern state
• Normalization of the results
In this subsection we explain how these steps were implemented, first by showing the
function process, which receives the new instruction coming from the instruction
decoder. This function shows the flow of an iteration of the PatternDetector class.
void PatternDetector::process(mbPtr mbInst) {
    unsigned int newPatternSize;
    std::vector<bool> bitArray(maxSize, false);
    unsigned int currentId = fifo.front()->getId();
    // in this section goes the pattern recognition algorithm
    newPatternSize = calculatePatternSize(bitArray, currentPatternSize, true);
    state = stateMachine(currentPatternSize, newPatternSize);
    normalize(state, newPatternSize);
    currentPatternSize = newPatternSize;
}
For each new step we declare a newPatternSize variable, since every iteration is
independent of the ones that ran before. A boolean vector is also declared to save the sizes of
the tandem repeats found. The last declaration is the id of the current instruction, which is useful
in the cases ahead. After all the functions, which we explain below, have run, we save the new
pattern size in the currentPatternSize variable for the next iteration of the process.
As a new instruction arrives, the algorithm tries to find a pattern in the fifo list. The next
code excerpt shows the current implementation of the detection algorithm.
customized_push(fifo, newInstruction);
std::list<Instruction>::iterator it;
it = fifo.begin();
it++;
int i;
for(i = 1; it != fifo.end(); it++) {
    if(newInstruction->getAddress() == (*it)->getAddress()){
        if(countingVector[i] < i){
            countingVector[i]++;
        }
    } else {
        countingVector[i] = 0;
    }
    i++;
}
for(i = 1, it = fifo.begin(); it != fifo.end(); i++, it++) {
    if(countingVector[i] == i){
        // detected repeating sequence of size i
        bitArray[i] = true;
    } else {
        // not detected
        bitArray[i] = false;
    }
}
The algorithm first inserts the new element at the beginning of the fifo
through a customized push, which pops the oldest element when the list is full. We
use an iterator to sweep the list from the beginning, advancing it one position so that the
comparison starts between the first and second elements. We use an integer vector to save the
counters, and then save the results in the boolean vector.
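The counting scheme can be tried out on bare addresses. The following is a self-contained sketch, not the application code: RepeatDetectorSketch is a hypothetical class, the window is a std::deque instead of a list so that elements can be indexed by their distance to the newest one, and maxSize is assumed to bound both the window length and the pattern size.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical detector working on bare addresses.
class RepeatDetectorSketch {
public:
    explicit RepeatDetectorSketch(std::size_t maxSize)
        : maxSize(maxSize), counters(maxSize, 0) {
        assert(maxSize > 1);
    }

    // Feeds one address and returns, for every size i, whether the last i
    // addresses repeat the i addresses before them (index 0 is unused).
    std::vector<bool> process(unsigned int address) {
        window.push_front(address);
        if (window.size() > maxSize) {
            window.pop_back();
        }
        std::vector<bool> detected(maxSize, false);
        for (std::size_t i = 1; i < maxSize; ++i) {
            // window[i] is the address seen i steps before the new one
            if (i < window.size() && window[i] == address) {
                if (counters[i] < i) {
                    ++counters[i];
                }
            } else {
                counters[i] = 0;
            }
            detected[i] = (counters[i] == i);
        }
        return detected;
    }

private:
    std::size_t maxSize;
    std::deque<unsigned int> window;
    std::vector<std::size_t> counters;
};
```

For the address sequence 0x10, 0x14, 0x18, 0x10, 0x14, 0x18 the counter at distance 3 reaches 3 on the sixth address, which signals a tandem repeat of size 3. A following address that breaks the sequence resets that counter.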
The next step in the process is to pass the boolean vector with the algorithm results to the
calculatePatternSize function. We talked earlier about the possibility of targeting inner loops or
outer loops. In our approach we always target outer loops, since they give the most complex
constructs from which to build our MultiPath graphs.
Even when targeting outer loops, we always have access to
information on inner loops because of the way the algorithm is structured: an inner loop has a
smaller size, so it appears first, and when a larger pattern happens next we build upon the inner
loop, just adding more information. We thus have all the information about both the inner and the
outer loop, assembling a more complex structure. That is why we always give priority to bigger
patterns. The implementation of the arbiter is shown next:
unsigned int PatternDetector::calculatePatternSize(
        vector<bool> bitArray,
        unsigned int previousPatternSize,
        bool priorityToBiggerPatterns) {
    unsigned int newPatternSize = 0;
    // Search for the first (smallest) tandem repeat detected
    for(unsigned int i = 1; i < maxSize; i++){
        if(bitArray[i] == true){
            newPatternSize = i;
            break;
        }
    }
    if (!priorityToBiggerPatterns) {
        return newPatternSize;
    }
    if (previousPatternSize > newPatternSize) {
        // Check if previous pattern size is still active
        if(bitArray[previousPatternSize]){
            return previousPatternSize;
        }
    }
    return newPatternSize;
}
The function receives as parameters the boolean vector with the results of the detection
algorithm, the previous pattern size and the boolean variable that indicates priority for bigger
patterns. We sweep the results vector to search for the first tandem repeat, and then we check
whether the previous pattern is still active by reading the position of the vector given by the last
pattern size. This is where we arbitrate whether a previous pattern should continue or not. It
continues if the last pattern size is bigger and, of course, still active; hence the priority for bigger
patterns. Finally we return the new pattern size or the previous one based on these conditions.
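The arbitration rule can be exercised as a free function. This sketch assumes that the sweep stops at the first, i.e. smallest, detected size, as the description above states; the name arbitratePatternSize is hypothetical and the bounds check on previousPatternSize is an addition that is not part of the original function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical free-function restatement of the arbiter.
unsigned int arbitratePatternSize(const std::vector<bool>& detected,
                                  unsigned int previousPatternSize,
                                  bool priorityToBiggerPatterns) {
    unsigned int newPatternSize = 0;
    for (std::size_t i = 1; i < detected.size(); ++i) {
        if (detected[i]) {
            newPatternSize = static_cast<unsigned int>(i);
            break;  // first (smallest) detected size
        }
    }
    // Keep the previous, bigger pattern while it is still being detected.
    if (priorityToBiggerPatterns &&
        previousPatternSize > newPatternSize &&
        previousPatternSize < detected.size() &&
        detected[previousPatternSize]) {
        return previousPatternSize;
    }
    return newPatternSize;
}
```

When sizes 3 and 6 are both detected and the previous size was 6, the function keeps 6. If size 6 is no longer detected, or if priority is not given to bigger patterns, it falls back to 3.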
The arbitrated new pattern size returned by the function above and the previous pattern size
are fed to the state machine as follows:
PatternDetector::State PatternDetector::stateMachine(
        unsigned int lastPatternSize,
        unsigned int currentPatternSize) {
    State newState(NO_PATTERN);
    if(lastPatternSize != currentPatternSize){
        if(lastPatternSize == 0){
            newState = STARTED;
        } else if(currentPatternSize == 0){
            newState = STOPPED;
        } else {
            newState = CHANGED_SIZE;
        }
    } else {
        if(currentPatternSize > 0){
            newState = UNCHANGED;
        } else {
            newState = NO_PATTERN;
        }
    }
    return newState;
}
This function is explained in detail in subsection 3.4.2 of Chapter 3. As we can see, the
function returns the state of the current pattern. Finally, this state and the new pattern size are fed
into the normalizer function, whose implementation we explain in more detail.
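Before moving on to the normalizer, the transitions of the state machine can be checked in isolation. The sketch below restates the same decision table as a free function; the names PatternState and nextState are hypothetical, while the five states are the ones used by the PatternDetector class.

```cpp
// Hypothetical stand-alone version of the state decision.
enum class PatternState { NO_PATTERN, STARTED, UNCHANGED, CHANGED_SIZE, STOPPED };

PatternState nextState(unsigned int lastPatternSize,
                       unsigned int currentPatternSize) {
    if (lastPatternSize == currentPatternSize) {
        // Same size as before: either a running pattern or no pattern at all
        return currentPatternSize > 0 ? PatternState::UNCHANGED
                                      : PatternState::NO_PATTERN;
    }
    if (lastPatternSize == 0) {
        return PatternState::STARTED;
    }
    if (currentPatternSize == 0) {
        return PatternState::STOPPED;
    }
    return PatternState::CHANGED_SIZE;
}
```

The size pairs (0, 3), (3, 3), (3, 6), (6, 0) and (0, 0) yield STARTED, UNCHANGED, CHANGED_SIZE, STOPPED and NO_PATTERN, respectively.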
First, the function header and the declaration of variables:
void PatternDetector::normalize(
        PatternDetector::State state,
        unsigned int newPatternSize ) {
    unsigned int i;
    std::list<Instruction>::iterator it;
    std::list<Instruction>::iterator lastIt;
    unsigned int currentId = fifo.front()->getId();
As we can see, it receives the two parameters mentioned above; we declare a couple of
iterators to go through the instruction lists and we fetch the id of the element at the front of the list.
Then we have the switch case construct:
    switch (state) {
From here we execute the operations according to the state, beginning with the STARTED state.
    case STARTED : // INIT MEGABLOCK
        startedElement = fifo.front()->getAddress();
        for(i = 0, it = fifo.begin(); i <= newPatternSize; i++, it++) {
            if(i == 0) {
                lastIt = it;
            } else {
                currentMp.addEdge(*it, *lastIt);
                lastIt = it;
            }
        }
The first thing we do is fill the currentMp structure, a MultiPath object that had been
cleared. Next we test whether we are still waiting for a new pattern to merge with the last saved
one. The following code only runs if that is true; otherwise execution moves on to the next iteration.
        if(isWaiting) {
            if( lastSavedMp->second.back().haveCommonNodes(currentMp) ) {
                for(i = 0, it = fifo.begin(); i <= newPatternSize; i++, it++) {
                    if(i == 0) {
                        lastIt = it;
                    } else {
                        lastSavedMp->second.back().addEdge(*it, *lastIt);
                        lastIt = it;
                    }
                }
                // alt path
                for(i = 0, it = waitingBuffer.begin(); i < currentId - stoppedId; i++, it++) {
                    if(i == 0) {
                        lastIt = it;
                    } else {
                        lastSavedMp->second.back().addEdge(*lastIt, *it);
                        lastIt = it;
                    }
                }
                currentMp = lastSavedMp->second.back();
                popLast();
            }
        }
        break;
So if the isWaiting variable is true, we check for common nodes between currentMp and the
last saved MultiPath. If there are common nodes, the two belong to the same pattern.
We start by adding the new nodes to the last saved MEB and then we add the nodes between the
stoppedId and the currentId. Sometimes this adds more than necessary, but that is harmless thanks
to the way the addEdge function of the Edge class is written, which does not count duplicates.
The last thing we do is assign to currentMp the last saved MEB, to which we have just added the
recent nodes, and remove it from the list.
The UNCHANGED state is really simple: we just add edges whenever the address of the current
instruction is equal to startedElement.
    case UNCHANGED :
        if(fifo.front()->getAddress() == startedElement) {
            for(i = 0, it = fifo.begin(); i <= itSize; i++, it++){
                if(i == 0){
                    lastIt = it;
                } else {
                    currentMp.addEdge(*it, *lastIt);
                    lastIt = it;
                }
            }
        }
        break;
The CHANGED_SIZE state is equally easy, but now we also assign the new value of the started element.
Check in the following code that we do not use the break command so when
    case CHANGED_SIZE :
        startedElement = fifo.front()->getAddress();
        for(i = 0, it = fifo.begin(); i <= itSize; i++, it++) {
            if(i == 0) {
                lastIt = it;
            } else {
                currentMp.addEdge(*it, *lastIt);
                lastIt = it;
            }
        }
        break;
Now the STOPPED state:
    case STOPPED :
        waitingBuffer.clear();
        waitingBuffer.push_back(fifo.front());
        stoppedId = fifo.front()->getId();
        isWaiting = true;
        lastSavedMp = saveMultiPath(currentMp);
        currentMp.clear();
        break;
In this state we start by clearing the waitingBuffer, since we have to start filling it again, starting
with the element that made the state change to STOPPED. We also save the id of that element and
set the variable isWaiting to true. Lastly, we save the current MEB held in currentMp and
clear the variable, so that when a pattern starts again we can build it from scratch.
Finally, the NO_PATTERN state is very important, because it is here that we fill the waitingBuffer
or decide it is time to stop waiting.
    case NO_PATTERN : // WAIT OR DO NOTHING
        if(isWaiting) {
            if((currentId - stoppedId) < maxSize && stoppedId > 0) {
                waitingBuffer.push_back(fifo.front());
            } else {
                waitingBuffer.clear();
                currentMp.clear();
                isWaiting = false;
            }
        }
        break;
    }
We decide whether it is time to stop waiting by evaluating the difference between the currentId and
the stoppedId. We also check that the stoppedId is greater than zero, because zero is its
initial value and we do not want to save anything until that value is set, which means a pattern has
occurred and stopped. In short, while we are waiting we add the new element to the waitingBuffer;
when we stop waiting we clear the waitingBuffer and the currentMp and set the isWaiting variable
to false.
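The waiting condition can be isolated in a small predicate. This is a sketch with hypothetical names (shouldKeepWaiting is not a function of the application), and it assumes that ids grow monotonically along the trace, so currentId is never smaller than stoppedId.

```cpp
#include <cassert>

// Hypothetical helper mirroring the NO_PATTERN test: keep buffering while a
// pattern has already stopped (stoppedId > 0) and fewer than maxSize
// instructions have passed since then.
bool shouldKeepWaiting(unsigned int currentId,
                       unsigned int stoppedId,
                       unsigned int maxSize) {
    assert(currentId >= stoppedId);  // ids are assumed to be monotonic
    return stoppedId > 0 && (currentId - stoppedId) < maxSize;
}
```

With maxSize equal to 64 and a pattern that stopped at id 100, the detector keeps waiting at id 120 but gives up at id 164. It never waits while stoppedId still holds its initial value of zero.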
And that ends the normalizer and the PatternDetector implementation.
4.7 Summary
This chapter discussed the implementation of the profiling application developed. The chapter
started with an overview of the system architecture, followed by an explanation of the Instruction
class, whose objects are used in the other classes.
The Instruction Set Simulator, OVPsim, was presented and its use discussed, as well as why it
motivated the creation of a completely new application in C++.
Next followed the process of reading an instruction with the simulator and then decoding it
with the MicroBlaze instruction decoder developed for the purpose.
The implemented architecture of the MultiPath Execution Block was shown and each component
discussed individually, starting with the fact that it is based on an adjacency list graph
representation composed of nodes and directed edges, and describing the major functions for
adding nodes and merging MEBs, as well as the overall functionality.
Furthermore, the implementation of the pattern detector was discussed. The presentation followed
the phases of an iteration step, from the arrival of a new instruction at the detection algorithm to
the action performed at the end by the normalizer, so that the full process of
discovering patterns and normalizing them can be understood.
The next chapter discusses the results achieved by the application developed and compares them
with those of the application developed in the previous work, which uses the Megablock
representation for the critical sections.
Chapter 5
Verification and Validation of Results
This chapter discusses the results obtained with the application developed. It was not possible to
test the results after hardware implementation, since the current system is not capable of using
a multi-path trace representation of the critical sections as input. The results of the MEB detector
are compared with those of the Megablock detector in order to validate the developed application
and check for improvements over the previous profiling application.
5.1 Test Methodology
The profiling applications, the MEB and Megablock detectors, were tested with the same
inputs, and the results are compared in terms of Coverage and Instructions per RPU Call (IpRC).
These two parameters are the basis for assessing improvements, since Coverage is a vital parameter
for the speedup of an application and the IpRC relates to the time spent in the RPU for each call
to the RPU, meaning that the more instructions processed per call, the lower the Communication
Overhead. These two parameters were explained in detail in section 3.2 of Chapter 3.
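As a rough illustration of how the two metrics relate to a trace, the sketch below computes them from per-call instruction counts. The definitions used here are assumptions made for the example (Coverage as the fraction of the trace executed inside a pattern, IpRC as the mean number of instructions per call); the authoritative definitions are the ones in section 3.2. The names ProfileSketch and summarize are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical summary of one detected pattern.
struct ProfileSketch {
    double coverage;  // assumed: instructions inside the pattern / trace length
    double ipRC;      // assumed: instructions inside the pattern / number of calls
};

// callLengths[k] is the number of instructions executed by the k-th
// occurrence of the pattern before it stopped.
ProfileSketch summarize(const std::vector<std::uint64_t>& callLengths,
                        std::uint64_t traceLength) {
    assert(traceLength > 0);
    std::uint64_t covered = std::accumulate(callLengths.begin(),
                                            callLengths.end(),
                                            std::uint64_t{0});
    ProfileSketch p;
    p.coverage = static_cast<double>(covered) / static_cast<double>(traceLength);
    p.ipRC = callLengths.empty()
                 ? 0.0
                 : static_cast<double>(covered) /
                       static_cast<double>(callLengths.size());
    return p;
}
```

Under these assumptions, a pattern that runs twice, for 300 and 200 instructions, in a trace of 1000 instructions has a coverage of 0.5 and an IpRC of 250.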
The test methodology took advantage of a series of benchmarks available from the Powerstone
benchmark suite [50], the WCET benchmarks [51] and other applications used in previous work.
The applications used are listed in Table 5.1 and are divided into three categories according to the
program complexity level:
• Simple programs: Programs containing critical sections formed only by single-path loops.
• Medium complexity programs: Programs that contain critical sections formed by single-path
loops, and may also have critical sections that involve loops with multi-paths or nested
loops.
• High complexity programs: Programs that are constructed almost entirely using nested
loops, multi-path loops or other complex loop structures. These programs may contain
complex data handling, such as array or matrix manipulations.
Table 5.1: Benchmark applications used for the result comparison
Complexity | Benchmark | Description
Simple | checkbits | Bit iteration program
Simple | powerstone bcnt | Bit shifting and anding through 1K array
Medium | powerstone g3fax | Group three fax decode (single level image decompression)
Medium | gridIterator | A grid iterator
Medium | powerstone engine | Engine control application
High | WCET janne_complex | Nested loop program
High | WCET cnt | Counts non-negative numbers in a matrix
High | WCET edn | Finite Impulse Response (FIR) filter calculations
5.2 Result Comparison
For each program tested, a corresponding table presents the MEB and Megablock results for Coverage
and average Instructions per RPU Call (IpRC), and a graphical comparison is provided to make
the differences between the two detectors easier to see. For each case the corresponding graphical
information about the IpRC is shown, and at the end a comparison between all the Coverage values
for each tested program is provided.
The two profiling applications use different Instruction Set Simulators, which may result in
differences between the traces produced by each one. An analysis was done for every case to
match all patterns based on their core loops, which means that the two applications may refer to the
same critical sections with different identifiers. This identifier is the starting address of
a pattern, and sometimes one address on one side may correspond to more than one on the other,
i.e. a MEB pattern may relate to two Megablock patterns, or vice versa. This can happen, for
example, when the MEB detector finds a critical section constituted by a loop and an alternative
path and saves the MEB with an address from the alternative path, and later in the execution only
the loop occurs and the MEB detector saves the loop with a different address. For these two MEBs
the Megablock detector only saved the loop in both cases, thus a Megablock may relate to
more than one MEB.
As said above, each program tested fits into a level of complexity. The next three subsections
are dedicated to each of the levels, starting with the simple programs and finishing with the
high complexity ones.
For every MEB or Megablock found by the detectors, the associated number of nodes or
instructions is provided, respectively. These two parameters are the number of graph nodes present
in a MEB and the number of instructions in a Megablock loop. A graph node represents a unique
instruction in the MEB application, whereas an instruction in the Megablock application is part of
the single-path loop found and may be repeated in some cases. Comparing the two shows
the complexity of the MEB representation: if the number of nodes is much higher than the
number of instructions of the loop detected by the Megablock
profiling application, a more complex hardware architecture may be required. In some
cases, however, the number of graph nodes may be lower than the number of instructions. This is
explained by the fact that the Megablock detected has repeated instructions in its core, which
happens when a Megablock is constituted by an outer loop containing repeating inner loops.
The corresponding representation in the MEB detector for patterns like this is a graph
with only unique nodes, in which the repetition of the inner loops is reflected in the edge count,
which shows the number of times an edge is taken.
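The difference between counting instructions and counting nodes can be shown with a small example. The sketch below is illustrative only: IterationGraph and buildIterationGraph are hypothetical names, and the input is assumed to be the address sequence of one Megablock iteration in execution order.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical summary of one unrolled loop iteration seen as a graph.
struct IterationGraph {
    std::size_t nodeCount = 0;  // unique addresses
    std::map<std::pair<unsigned int, unsigned int>, unsigned int> edgeCount;
};

// Builds the summary from the addresses of a single-path loop body, in
// execution order. Repeated addresses collapse into one node and show up
// as a higher count on the corresponding edge.
IterationGraph buildIterationGraph(const std::vector<unsigned int>& body) {
    IterationGraph graph;
    std::set<unsigned int> unique(body.begin(), body.end());
    graph.nodeCount = unique.size();
    for (std::size_t i = 1; i < body.size(); ++i) {
        ++graph.edgeCount[std::make_pair(body[i - 1], body[i])];
    }
    return graph;
}
```

A body of five instructions in which the address 0x14 executes three times in a row produces a graph with only three nodes, and the self edge on 0x14 carries a count of two.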
5.2.1 Simple Programs
In this subsection we show the results of the programs checkbits and bcnt. The critical sections of
these programs are only single-path loops, so the results of the two profiling
applications are very similar; see Tables 5.2 and 5.3.
Table 5.2: Results after testing checkbits benchmark
MEB Address | MEB Coverage | MEB Average IpRC | MEB Nodes | Megablock Coverage | Megablock Average IpRC | Megablock Instructions | Megablock Address
0x01b0 | 98,33% | 143213 | 287 | 97,60% | 142927 | 287 | 0x01b0
0x015c | 1,39% | 2024 | 4 | 1,90% | 2020 | 4 | 0x015c
total | 99,72% | 145237 | 291 | 99,50% | 144947 | 291 | total
Table 5.3: Results after testing bcnt benchmark
MEB Address | MEB Coverage | MEB Average IpRC | MEB Nodes | Megablock Coverage | Megablock Average IpRC | Megablock Instructions | Megablock Address
0x015c | 40,95% | 4120 | 4 | 48,40% | 4116 | 4 | 0x015c
0x118c | 2,84% | 268 | 22 | 2,40% | 220 | 22 | 0x11c8
0x18dc | 51,14% | 5145 | 343 | 41,70% | 4802 | 343 | 0x18dc
total | 94,94% | 9533 | 369 | 92,50% | 9138 | 369 | total
It can be observed that the results are almost identical in both cases. In checkbits the total
Coverage of both detectors surpasses 99%, and the average IpRC values are very much alike.
The same is observed for bcnt. The average IpRC comparison is shown in
Figure 5.1. The numbers of nodes and instructions are the same because there are only single-path
loops in these applications, meaning that the graph representation of a loop containing four
instructions is a graph with four nodes and four edges forming a cycle.
To conclude this subsection, the MEB profiling application has a slight
advantage for both tested programs, although the major critical sections were detected successfully
by both the MEB and the Megablock detectors.
Figure 5.1: Comparison of the average IpRC for the programs checkbits and bcnt between the
Megablock and MEB detectors
5.2.2 Medium Complexity Programs
In this subsection the detectors were tested with more complex programs that, as said before, may
contain nested loops or loops with multi-path structures, in addition to the single-path critical
sections seen in the simple programs. The three programs compared are: g3fax, a group three fax
decoder with single level image decompression, whose results are shown in Table 5.4; gridIterator,
which, as the name says, iterates through a grid structure (see Table 5.5 for the
results); and engine, an engine control application, whose results are presented in
Table 5.6.
Table 5.4: Results after testing g3fax benchmark
MEB Address | MEB Coverage | MEB Average IpRC | MEB Nodes | Megablock Coverage | Megablock Average IpRC | Megablock Instructions | Megablock Address
0x015c | 0,08% | 1772 | 4 | 0,00% | 1768 | 4 | 0x015c
0x118c | 0,01% | 308 | 22 | 0,60% | 220 | 22 | 0x11c8
0x18b0 | 43,41% | 4156 | 236 | | | |
 | | | | 0,20% | 150 | 30 | 0x1954-1
 | | | | 0,10% | 54 | 27 | 0x1954-2
 | | | | 0,00% | 57 | 57 | 0x1954-3
 | | | | 0,00% | 69 | 23 | 0x197c
 | | | | 0,80% | 104 | 45 | 0x1ad8
0x19c8 | 40,04% | 27632 | 16 | 88,40% | 27616 | 16 | 0x19c8
0x1c30 | 11,91% | 2767 | 12 | 5,20% | 96 | 12 | 0x1c30
total | 95,45% | 36635 | 290 | 95,30% | 30134 | 236 | total
5.2 Result Comparison 55
The results of the g3fax program are very similar across the two detectors. Both tools found a
significant pattern starting at address 0x19c8; however, the traces produced by the two Instruction
Set Simulators differ, which leads to a difference in coverage for that section: 88% in the
Megablock detector against 40% in the MEB detector. A detailed analysis showed that the Megablocks
0x1954-1, 0x1954-2, 0x1954-3, 0x1957 and 0x1ad8 correspond to the same pattern that the MEB
detector finds starting at address 0x18b0. This pattern accounts for 43% coverage in the MEB
detector, but only just above 1% in the Megablock detector. The difference comes from the ability
of the MEB algorithm to merge the common instructions of a critical section with multiple paths and
to represent the alternatives in the same structure. The IpRC for this pattern and for the one
starting at 0x1c30 is much higher in the MEB application, meaning that more time is spent on the
RPU per call. The numbers of nodes and instructions are equal in every case where the detected
critical section is a single-path loop. For the pattern starting at 0x18b0, the sum of the
instructions of the detected Megablocks is comparable to the number of graph nodes of the MEB
representation, meaning that the hardware implementation complexity of the MEB may not increase
much relative to the Megablock.
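The aggregation discussed above can be checked directly against Table 5.4. The following minimal Python sketch sums the coverage and the instruction counts of the five Megablock rows listed beside the MEB pattern at 0x18b0. The variable names are illustrative only and are not part of either detector.

```python
# Megablock rows of Table 5.4 listed beside the MEB pattern at 0x18b0:
# (address label as printed in the table, coverage in %, number of instructions)
megablocks = [
    ("0x1954-1", 0.20, 30),
    ("0x1954-2", 0.10, 27),
    ("0x1954-3", 0.00, 57),
    ("0x197c", 0.00, 23),
    ("0x1ad8", 0.80, 45),
]
meb_coverage = 43.41  # MEB row 0x18b0, coverage in %
meb_nodes = 236       # MEB row 0x18b0, graph nodes

# Rounding removes floating-point noise from adding the percentages
megablock_coverage = round(sum(cov for _, cov, _ in megablocks), 2)
megablock_instructions = sum(count for _, _, count in megablocks)

print(f"Megablock coverage: {megablock_coverage:.2f}% vs MEB {meb_coverage:.2f}%")
print(f"Megablock instructions: {megablock_instructions} vs MEB nodes: {meb_nodes}")
```

The summed Megablock coverage is the "just above 1%" figure quoted in the text.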
Table 5.5: Results after testing gridIterator benchmark
MEB                                      Megablock
Address  Coverage  Average IpRC  Nodes   Coverage  Average IpRC  Instructions  Address
0x01d4   7,17%     674166        38
                                         7,30%     20646         333           0x01dc
                                         0,20%     280           20            0x01e4
0x0304   0,04%     65            37
                                         0,10%     74            37            0x0304
0x02f8   0,17%     195           47
0x0ff0   1,21%     114674        14      1,20%     114660        14            0x0ff0
total    8,60%     789100        136     9%        135660        404           total
The results for gridIterator in Table 5.5 describe four distinct critical sections. The coverage of
each section is very similar in the two detectors; for example, the most significant one has around
7% in both. However, the average IpRC is significantly higher in the MEB detector: for that same
section it is approximately 32 times higher, which would ultimately result in a higher speedup. A
curious case arose in this application regarding the number of nodes and instructions of the
critical sections’ representations. In the most important section, starting at 0x01d4 in the MEB
detector, the number of nodes is actually lower than the number of instructions detected by the
Megablock detector. This is because the two Megablocks detected in that critical section differ
greatly in size: the smaller one is a simple single-path loop, while the other is a more complex
loop whose body repeats some inner loops, representing a specific sequence of paths through the
loop. The MEB detector recognizes the two Megablocks as different paths of the same pattern and
merges all the paths into a single graph representation that contains only unique instructions,
thus reducing the number of nodes compared with the number of Megablock instructions.
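As a worked check of the factor quoted above, the ratio can be computed from the two IpRC values that Table 5.5 lists for this critical section (674166 for the MEB pattern at 0x01d4 and 20646 for the larger Megablock at 0x01dc):

```latex
\[
\frac{\mathrm{IpRC}_{\mathrm{MEB}}}{\mathrm{IpRC}_{\mathrm{Megablock}}}
  = \frac{674166}{20646} \approx 32.7
\]
```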
Table 5.6: Results after testing engine benchmark
MEB                                      Megablock
Address  Coverage  Average IpRC  Nodes   Coverage  Average IpRC  Instructions  Address
0x0160   0,00%     84            4       0,00%     80            4             0x0160
0x1010   0,01%     330           22      0,00%     220           22            0x1054
0x1750   1,57%     29            9       4,10%     54            9             0x1750
0x189c   1,40%     25            8       5,10%     72            8             0x189c
0x1954   1,46%     26            8       5,40%     96            8             0x1954
0x1b24   9,15%     653           84      0,00%     53            26            0x2dd4
0x2e78   15,00%    132           10
                                         0,20%     50            10            0x2e78-1
                                         2,70%     88            8             0x2e78-2
0x2dc4   25,45%    146           10
                                         12,20%    112           8             0x2dc4-1
                                         0,30%     50            10            0x2dc4-2
0x2e6c   4,44%     39            3       0,80%     51            3             0x2e6c
0x2db8   6,38%     34            3       0,40%     69            3             0x2db8
total    58,49%    1498          161     31,20%    995           119           total
In the engine program, the results shown in Table 5.6 demonstrate almost twice the coverage in the
MEB detector. The MEB pattern at 0x2e78 has 15% coverage, whereas the corresponding Megablocks on
the other side of the table account for only about 3%. This is because the loop has two alternative
paths, as can be seen in Figure 5.2. This is a success case for our application, since more time
can be spent in the RPU without requiring much more hardware. The table also shows that the average
IpRC is again higher in the MEB detector, which comes as no surprise since the MEB holds the
information for both paths of the two-way loops. The numbers of nodes and instructions are equal
for every single-path loop detected. The pattern starting at address 0x1b24 shows the most
disparate numbers. It represents a more complex structure formed by the MEB detector that
corresponds to a simpler pattern in the Megablock detector, and therefore has more than three times
as many nodes as instructions. The patterns at addresses 0x2e78 and 0x2dc4 show a successful case,
in which the number of graph nodes equals the largest number of instructions among the
corresponding Megablocks. This does not imply more complex hardware, so the gains come without
associated disadvantages.
To conclude this subsection, Figure 5.3 compares the average IpRC of the two detectors for the three
programs presented.
Figure 5.2: A MEB representation of the pattern starting at 0x2e78 in the engine application
Figure 5.3: Comparison of the average IpRC for the programs gridIterator, engine and g3fax
between the Megablock and MEB detectors
5.2.3 High Complexity Programs
This last subsection presents the results of the most complex programs tested. These programs have
nested loops, multi-path loops and complex array and matrix manipulations. The janne complex
program is a nested-loop program in which the number of iterations of the inner loop depends on the
current iteration of the outer loop; its results are presented in Table 5.7. The next program is
cnt, which counts non-negative numbers in a matrix using nested loops; its results can be seen in
Table 5.8. The last program tested is edn, a Finite Impulse Response (FIR) filter that implements a
large number of vector multiplications and array handling; its results are presented in Table 5.9.
Table 5.7: Results after testing janne complex benchmark
MEB                                      Megablock
Address  Coverage  Average IpRC  Nodes   Coverage  Average IpRC  Instructions  Address
0x01c8   60,34%    321           321
                                         21,90%    108           18            0x01c8-1
                                         10,70%    53            53            0x01c8-2
total    60,34%    321           321     32,60%    161           71            total
The first program tested, janne complex, is a very small program with only about 500 instructions.
It nevertheless has a nested loop structure with multiple paths inside. The results are very clear:
on the Megablock side there are two patterns starting at the same address, one for each path, which
together yield less coverage and a lower average IpRC than the MEB detector. The MEB detector
treats both paths as belonging to the same pattern and presents a single result with roughly double
the coverage and double the average IpRC. The number of nodes in the MEB representation is larger
than the number of instructions of the corresponding patterns in the Megablock detector, because
the MEB detector finds a more complex pattern than the single-path ones found by the Megablock
detector.
Table 5.8: Results after testing cnt benchmark
MEB                                      Megablock
Address  Coverage  Average IpRC  Nodes   Coverage  Average IpRC  Instructions  Address
0x015c   1,06%     444           4       1,30%     440           4             0x015c
0x02c8   63,92%    26760         111     21,80%    88            8             0x0c6c-1
0x03f4   15,35%    6426          74
                                         13,70%    5232          654           0x03f4
                                         2,70%     512           64            0x0400
0x07bc   7,04%     2950          50      5,00%     2958          51            0x07bc
0x0a4c   5,48%     2295          45      4,30%     2300          45            0x0a4c
0x0c60   0,47%     49            3       0,30%     66            3             0x06c0
0x0c6c   0,67%     93            10      0,10%     50            10            0x06c-2
total    93,98%    39017         297     49,20%    11646         839           total
The cnt program has well-structured code with nested loops. The results show that the biggest
difference comes from the MEB detector finding the outer loop of the pattern at 0x0c6c, which is
the one that starts at 0x02c8. In the Megablock detector, the pattern is recognized as two entries,
one of which is almost negligible. The MEB detector gains a large advantage from having the outer
loop information, since it permits a higher coverage and a higher average IpRC. The results for
nodes and instructions are mixed: in some cases the MEB detector has the larger number, and in
others the Megablock detector does. In the first notable case, the pattern starting at 0x02c8, the
number is higher for the MEB because this pattern merges several paths, forming a complex structure
not detected by the Megablock profiling application. Although this implies a more complex hardware
implementation, the speedup gains may compensate for the increased complexity, since the coverage
is about three times higher for the same pattern and the average IpRC is much higher. The next
notable case is the one starting at 0x03f4, which, by contrast, has more instructions than nodes.
This pattern consists of an outer loop with more than one possible path inside. The Megablock
detected is a combination of paths forming a single-path loop that contains repeated instructions,
because the inner loops have the same starting point. For example, if a loop has two possible paths
inside, A and B, and the program execution follows the pattern AAAB, the Megablock detector
recognizes this as a single loop of AAAB, while the MEB detector represents only A with an
alternative path B. The table shows that the MEB representation has more nodes than the smaller
Megablock for the same pattern, but the difference is very small. In conclusion, the pattern found
by the MEB detector is a combination of the two found by the Megablock detector, but it does not
require a more complex hardware implementation. The remaining cases are single-path loops that have
the same number of nodes and instructions in the two detectors.
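The AAAB example can be made concrete with a small sketch. The code below is not the detector's implementation: the instruction addresses, the two paths and the helper names are all hypothetical, chosen only to show why a single-path loop that unrolls AAAB holds more instructions than a merged graph that stores each instruction once.

```python
# Hypothetical instruction addresses for two paths through the same loop body.
# Both paths share the loop entry and the closing branch; they differ in the middle.
PATH_A = [0x100, 0x104, 0x108, 0x120]
PATH_B = [0x100, 0x104, 0x110, 0x114, 0x120]
PATHS = {"A": PATH_A, "B": PATH_B}

def single_path_size(pattern, paths):
    """Instructions in a single-path loop that unrolls the whole pattern (e.g. AAAB)."""
    return sum(len(paths[label]) for label in pattern)

def merged_graph(pattern, paths):
    """Merge every path of the pattern into one graph with unique instruction nodes."""
    nodes, edges = set(), set()
    for label in pattern:
        trace = paths[label]
        nodes.update(trace)
        edges.update(zip(trace, trace[1:]))   # sequential flow inside the path
        edges.add((trace[-1], trace[0]))      # back edge closing the iteration
    return nodes, edges

unrolled = single_path_size("AAAB", PATHS)
nodes, edges = merged_graph("AAAB", PATHS)
print(f"single-path loop: {unrolled} instructions")
print(f"merged graph: {len(nodes)} nodes, {len(edges)} edges")
```

In this toy case the branch at 0x104 has two outgoing edges, which is how the alternative path B is represented without duplicating the instructions shared with A.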
Table 5.9: Results after testing edn benchmark
MEB                                      Megablock
Address  Coverage  Average IpRC  Nodes   Coverage  Average IpRC  Instructions  Address
0x015c   0,00%     24            4
                                         undetected
0x01dc   90,74%    559552        728
0xcc8    0,02%     105           7       0,00%     98            7             0x0cc8
0x0d4c   1,80%     695           190
                                         undetected
0x0fc8   0,16%     125           38
0x1034   0,11%     41            3
0x10e8   0,11%     41            3
0x1160   1,06%     46            119
0x11ac   0,06%     369           101
0x1354   0,11%     41            3
0x1420   0,11%     41            3
0x14e8   0,11%     41            3
0x1590   0,11%     41            3
0x169c   0,78%     4814          26      0,80%     2382          6             0x18e8
0x182c   0,45%     44            6       undetected
total    95,70%    566020        1237    0,80%     2480          13            total
The last program tested was the edn benchmark application, which has nested loops, vector
multiplications and array handling in its execution. Its results differ the most between the two
detectors: where the MEB detector has almost 100% coverage, the Megablock detector has almost 0%.
This is because the maximum pattern size in the Megablock detector limits the amount of data in
memory that the detection algorithm has available for detecting tandem repeats. The MEB detector
starts with the same maximum pattern size but continues to add bigger patterns and merge common
paths, so it ends up with a structure that represents almost all the computation done during
execution. Since the Megablock detector did not find the pattern that the MEB detector finds at
0x01dc and was not able to capture any relevant data about the program, the number of graph nodes
in the MEB application and the number of Megablock instructions cannot be meaningfully compared,
because there are not enough cases to analyze. This benchmark application therefore shows the true
power of having a more complex structure to represent the critical sections of a program.
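The effect of a maximum pattern size on tandem-repeat detection can be illustrated with a deliberately naive sketch. This is a brute-force search written for illustration, not the algorithm used by either detector, and the trace addresses and the function name are hypothetical. It only shows that a repeat whose period exceeds the size limit is never reported.

```python
def find_tandem_repeats(trace, max_pattern_size):
    """Return (start, period) pairs where trace[start:start+period] repeats immediately.

    Only the first occurrence of each period is reported, and periods larger than
    max_pattern_size are never examined.
    """
    found = []
    for period in range(1, max_pattern_size + 1):
        for start in range(len(trace) - 2 * period + 1):
            first = trace[start:start + period]
            second = trace[start + period:start + 2 * period]
            if first == second:
                found.append((start, period))
                break
    return found

# Hypothetical trace: an outer loop of 9 instructions that runs a
# 3-instruction inner loop twice per iteration.
outer_iteration = [0x00, 0x04] + [0x10, 0x14, 0x18] * 2 + [0x20]
trace = outer_iteration * 3

small_window = find_tandem_repeats(trace, max_pattern_size=4)
large_window = find_tandem_repeats(trace, max_pattern_size=9)
print(small_window)  # only the inner loop (period 3) is visible
print(large_window)  # the outer loop (period 9) appears as well
```

With the smaller limit only the inner loop is found, so the instructions that belong solely to the outer loop never enter a detected pattern.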
Figure 5.4 compares the two detectors for these three programs in terms of average IpRC.
Figure 5.4: Comparison of the average IpRC for the programs janne complex, cnt and edn between
the Megablock and MEB detectors
Finally, Figure 5.5 shows the overall coverage for all the applications tested.
5.2.4 Running Time
This subsection discusses the running time of the two detectors for all benchmark applications.
The execution times are presented in Table 5.10.
Table 5.10: Comparison of the MEB and Megablock detectors in terms of execution time and
instructions executed for each benchmark application
Benchmark      MEB time (mm:ss)  Megablock time (mm:ss)
checkbits      00:02             00:02
bcnt           00:01             00:01
grid           10:40             01:10
g3fax          03:50             00:11
engine         00:29             00:09
janne complex  00:00             00:01
cnt            00:02             00:02
edn            20:10             00:02
Average        04:24             00:12
The execution time grows with the total number of instructions and with program complexity.
However, while the execution time of the Megablock detector grows steadily with the number of
instructions in the program, the MEB detector shows a more complex relation. Its execution time is
directly linked to the number of instructions in the program, but it also depends on the complexity
of the program. As program complexity increases, so does the complexity of its critical sections,
and because the data structures are swept many times for comparison and search purposes, the
execution time of the MEB detector increases substantially.
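The averages in Table 5.10 can be reproduced with a short sketch. The times are copied from the table; the helper names are illustrative, and the per-benchmark slowdown is an extra figure derived here rather than one reported in the table.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def to_mmss(seconds):
    """Format a number of seconds as 'mm:ss', truncating fractions of a second."""
    minutes, rest = divmod(int(seconds), 60)
    return f"{minutes:02d}:{rest:02d}"

# (MEB time, Megablock time) per benchmark, copied from Table 5.10
TIMES = {
    "checkbits": ("00:02", "00:02"),
    "bcnt": ("00:01", "00:01"),
    "grid": ("10:40", "01:10"),
    "g3fax": ("03:50", "00:11"),
    "engine": ("00:29", "00:09"),
    "janne complex": ("00:00", "00:01"),
    "cnt": ("00:02", "00:02"),
    "edn": ("20:10", "00:02"),
}

meb_seconds = [to_seconds(meb) for meb, _ in TIMES.values()]
megablock_seconds = [to_seconds(mega) for _, mega in TIMES.values()]

meb_average = to_mmss(sum(meb_seconds) / len(meb_seconds))
megablock_average = to_mmss(sum(megablock_seconds) / len(megablock_seconds))

# Slowdown of the MEB detector relative to the Megablock detector.
# Every Megablock time in the table is non-zero, so the division is safe here.
slowdown = {name: to_seconds(meb) / to_seconds(mega) for name, (meb, mega) in TIMES.items()}
worst = max(slowdown, key=slowdown.get)

print(meb_average, megablock_average)
print(worst, slowdown[worst])
```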
5.3 Discussion of Results
In the previous section we analyzed a set of programs of differing complexity. Each program was
tested on the MEB detector and on the Megablock detector.
The comparison with the previous profiling application was necessary so that the results of the
newly developed application could be validated against a proven tool. The comparison also shows the
differences between the two profiling tools, in order to check for improvements over the previous
work. The tested parameters, Coverage and average IpRC, indicate whether our initial motivation for
improving the previous work, explained in detail in Section 3.2 of Chapter 3, was satisfied.
Figure 5.5: Comparison of the Coverage for all the programs tested between the Megablock and
MEB detectors
When testing programs with low complexity, the differences between the two application-profiling
tools were minimal. Both tools provided an accurate representation of the single-path loops, and in
both cases the coverage and the average IpRC were almost identical.
Testing programs with medium complexity revealed the first differences between the two profiling
tools. Although the results for the g3fax program were alike, the other programs, starting with
gridIterator, showed an increase in the average IpRC values, which would ultimately increase the
time spent on each RPU call. This decreases the communication overhead, which by itself increases
the overall speedup. The engine program results confirmed the increase in the average IpRC; in some
cases the MEB detector also recognized paths of the same pattern and merged them into one, which
resulted in a higher Coverage than the Megablock detector.
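The link between IpRC and communication overhead can be sketched with a simple model. Since IpRC is the average number of instructions executed per RPU call, the number of calls for a critical section is roughly the number of instructions it covers divided by its IpRC. The fixed per-call cost, the trace length and the function names below are assumptions made for illustration, not measured values or part of the detectors.

```python
def rpu_calls(total_instructions, coverage, iprc):
    """Estimate the number of RPU calls for one critical section.

    coverage is a fraction (0..1); iprc is the average Instructions per RPU Call.
    """
    return total_instructions * coverage / iprc

def overhead_cycles(total_instructions, coverage, iprc, cycles_per_call):
    """Total communication overhead, assuming a fixed cost per RPU call."""
    return rpu_calls(total_instructions, coverage, iprc) * cycles_per_call

TOTAL_INSTRUCTIONS = 100_000_000  # assumed trace length, not a measured value
CYCLES_PER_CALL = 100             # assumed GPP-to-RPU transfer cost, not a measured value

# gridIterator, most significant section: coverage and IpRC taken from Table 5.5
meb = overhead_cycles(TOTAL_INSTRUCTIONS, 0.0717, 674166, CYCLES_PER_CALL)
megablock = overhead_cycles(TOTAL_INSTRUCTIONS, 0.0730, 20646, CYCLES_PER_CALL)
print(f"MEB overhead: {meb:.0f} cycles, Megablock overhead: {megablock:.0f} cycles")
```

Under these assumptions, two sections with similar coverage differ in overhead by about the ratio of their IpRC values.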
Raising the complexity to the maximum level tested made the results even more enlightening. For the
program janne complex, the Coverage and average IpRC values roughly doubled, as a result of joining
the two Megablocks found into a single critical section representation with multiple paths. The
next program tested, cnt, showed the same tendency of almost doubling the Coverage value, again by
recognizing the same pattern in two different paths, but the highest gains come from the average
IpRC values, which almost quadrupled. Finally, the edn program showed the biggest difference of
all. This program has a number of loops nested inside one another, forming a really complex pattern
structure that is impossible to recognize with the Megablock tool. The MEB detector, however,
managed to find all the loops in the program, including the main critical section consisting of
four chained loops, resulting in a major difference in Coverage (around 96% against almost 0%) and,
of course, in IpRC.
A preliminary assessment can be made of the difference in hardware complexity between the two
cases. Some benchmark applications showed that representing a more complex pattern will probably
require more complex hardware than the Megablock approach. However, the results also showed that in
some cases the MEB can give a clearer representation of the critical section, for example by
merging two paths with common instructions, without adding hardware complexity. In summary, the MEB
representations of multi-path critical sections may increase the hardware complexity in some cases,
but will increase the speedup, leaving the hardware designer to choose whether to implement a given
critical section by evaluating the trade-off between the two parameters.
To conclude, when programs are very simple, containing only critical sections with single paths,
both tools behave similarly. As the complexity increases, the MEB detector starts to gain over the
Megablock approach, sometimes by joining paths of the same critical section, other times by
recognizing complex structures that cannot be detected with a single-path approach. The more
complex the application, the better the results of the MEB. The execution times of the two tools,
on the other hand, were very disparate. The MEB detector takes considerably more time to detect
critical sections as the program grows in size or complexity. This may be an issue for programs
with a very high complexity level and millions of instructions; however, if the detector is
executed offline the times are acceptable, since the results will be used for future hardware
design.
5.4 Summary
In this chapter we analyzed the results given by the MEB and Megablock detectors. We validated the
results of our application by comparing them against those of an already validated tool.
We saw that, as programs have more complex loop structures with conditions inside and nested loops,
the MEB detector consistently gives better coverage and average IpRC results. The average IpRC is
the most important factor, since it allows the execution to stay longer in the RPU, thereby
decreasing the communication overhead.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
The ultimate goal of this work was to create an application that profiles a program execution trace
and detects its critical sections. A second goal was to build easily adaptable software using an
Instruction Set Simulator called OVPsim.
The application continues previous work in which a trace-based loop, called a Megablock, was used
to represent the critical sections of a program execution. We discussed the four steps of the
dynamic partitioning implemented (Detection, Identification, Translation and Replacement) and
stated that our focus was on the detection phase.
The motivation for creating a new structure that extends the Megablock came from the fact that one
of the major factors in speeding up a program's execution is the size of the critical sections
relative to all the program's instructions, which is called Coverage. The previous work achieved
good overall coverage values, but in various cases the patterns were too dispersed, losing
execution time in communication between the GPP and the RPU to move the execution from one to the
other. This is called communication overhead.
To reduce the communication overhead of a program, we proposed a new representation for its
critical sections, one that holds information about the alternative paths a loop can contain and
can even give information about outer loops and high-level constructs. We proposed the MultiPath
Execution Block, a graph-based representation that allows more complex structures to be built and
holds the information needed so that coverage is not too dispersed; instead of losing time in
communication overhead, the execution spends more time on the RPU.
A top-down view of the structure of the application was explained in detail and the achieved
results were presented and discussed.
The results obtained are very promising, because they show a clear tendency to increase the time
spent at the RPU on each call. This was demonstrated by comparing both detectors, the one based on
the Megablock and the one based on our new MEB. The results unequivocally display a better overall
Coverage and, in all cases, a better average number of executed Instructions per RPU Call, which
was the main goal all along.
6.2 Future Work
Future work should focus on improving the execution time of the application. Since it is based on
graph structures and requires many graph traversals, it has a much longer execution time than the
Megablock detector.
The algorithms have to be optimized so that, in the future, they can be implemented in hardware in
a fully online system. This is the primary goal.
In more detail, future versions should record the register values each time a MEB is detected. If
the hardware designer has these values when designing the RPU, they can build data paths based on
the register values and know which registers are constant throughout an iteration, which is a major
advantage for saving hardware components.
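As a rough illustration of this proposal, the sketch below finds the registers that keep the same value across a set of recorded snapshots. It assumes one snapshot per loop iteration, stored as a dictionary from register name to value; this data layout, the register names and the function name are hypothetical and do not describe an existing feature of the application.

```python
def constant_registers(snapshots):
    """Return the registers whose value is identical in every snapshot."""
    if not snapshots:
        return {}
    first = snapshots[0]
    return {
        reg: value
        for reg, value in first.items()
        if all(reg in snap and snap[reg] == value for snap in snapshots[1:])
    }

# Hypothetical register values captured at the start of three iterations
snapshots = [
    {"r3": 0x2000, "r4": 0, "r5": 16},
    {"r3": 0x2000, "r4": 1, "r5": 16},
    {"r3": 0x2000, "r4": 2, "r5": 16},
]

constants = constant_registers(snapshots)
print(constants)  # r3 and r5 never change; r4 is the loop counter
```

Registers reported as constant could then be treated as fixed inputs when the data path is designed.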
References
[1] G. Stitt and F. Vahid. Hardware/software partitioning of software binaries. IEEE/ACM Inter-
national Conference on Computer Aided Design, 2002. ICCAD 2002., pages 164–170, 2002.
doi:10.1109/ICCAD.2002.1167529.
[2] G. Mcgregor, B. Einloth, F. Vahid, and G. Stitt. Hardware/software partitioning of software
binaries: a case study of H.264 decode. 2005 Third IEEE/ACM/IFIP International Con-
ference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’05), 2005.
doi:10.1145/1084834.1084905.
[3] João Bispo, Nuno Paulino, João M P Cardoso, and João Canas Ferreira. From instruction
traces to specialized reconfigurable arrays. Proceedings - 2011 International Conference
on Reconfigurable Computing and FPGAs, ReConFig 2011, pages 386–391, 2011. doi:
10.1109/ReConFig.2011.43.
[4] João Bispo, Nuno Paulino, João M P Cardoso, and João Canas Ferreira. Transparent runtime
migration of loop-based traces of processor instructions to reconfigurable processing units.
International Journal of Reconfigurable Computing, 2013, 2013. doi:10.1155/2013/
340316.
[5] João Carlos and Viegas Martins. Mapping Runtime-Detected Loops from Microprocessors to
Reconfigurable Processing Units. PhD thesis, INSTITUTO SUPERIOR TÉCNICO, 2012.
[6] Scott Meyers. Effective Modern C++.
[7] Embedded Development Kit. MicroBlaze Processor Reference Guide. Development,
081:1–210, 2006. URL: http://scholar.google.com/scholar?hl=en&btnG=
Search&q=intitle:MicroBlaze+Processor+Reference+Guide#0.
[8] Mark D Hill and Michael R Marty. Amdahl ’s Law in the Multicore Era. Computer,
41(July):33–38, 2008. doi:10.1109/MC.2008.209.
[9] Xilinx Inc. What is a FPGA?, 2015. consulted in September 11th, 2015. URL:
WhatisaFPGA?
[10] Bryan H Fletcher. FPGA Embedded Processors Revealing True System Performance. Lan-
guage, 2005.
[11] E Monmasson, L Idkhajine, M N Cirstea, I Bahri, a Tisan, and M W Naouar. FPGAs in
industrial control applications. Industrial Informatics, IEEE Transactions on, 7(2):224–243,
2011. doi:10.1109/TII.2011.2123908.
67
68 REFERENCES
[12] M. D. Galanis, G. Dimitroulakos, and C. E. Goutis. Speedups from partitioning critical
software parts to coarse-grain reconfigurable hardware. Proceedings of the International
Conference on Application-Specific Systems, Architectures and Processors, pages 50–55,
2005. doi:10.1109/ASAP.2005.60.
[13] Stuart Clubb. Catapult C R© Synthesis. Time, (April), 2009.
[14] Xilinx Inc. Vivado High-Level Synthesis. URL: http://www.xilinx.com/
products/design-tools/vivado/integration/esl-design.html.
[15] Greg Stitt, Roman Lysecky, and Frank Vahid. Dynamic Hardware / Software Partitioning :
A First Approach. Architecture, pages 2–7.
[16] João Bispo and João M P Cardoso. Using the MegaBlock to Partition and Optimize Programs
for Embedded Systems at Runtime. pages 699–710, 2010.
[17] Joao Bispo, Nuno Paulino, Cardoso, and Ferreira. Transparent trace-based binary accel-
eration for reconfigurable HW/SW systems. IEEE Transactions on Industrial Informatics,
9(3):1625–1634, 2013. doi:10.1109/TII.2012.2235844.
[18] Nuno Paulino and João Canas Ferreira. A Reconfigurable Architecture for Binary Accelera-
tion of Loops with Memory Accesses. 7(4), 2014.
[19] Roth Martin. CIS 501 - Introduction to Computer Architecture: Unit 2 - Instruction Set
Architecture. URL: https://www.cis.upenn.edu/~milom/cis501-Fall05/
lectures/02_isa.pdf.
[20] CellPerformance. Background on Branching. URL: http://cellperformance.
beyond3d.com/articles/2006/04/background-on-branching.html.
[21] Frances E. Allen. Control Flow Analysis. Proceedings of ACM Symposium on Compiler
Optimization, pages 1–19, 1970. URL: http://portal.acm.org/citation.cfm?
doid=800028.808479, doi:10.1145/800028.808479.
[22] Maya B. Gokhale and Paul S. Graham. Reconfigurable Computing: Accelerating Compution
with Field-Programmable Gate Arrays. Springer, 2005.
[23] Husain Parvez and Habib Mehrez. Application-specific mesh-based heterogeneous FPGA
architectures. Application-Specific Mesh-based Heterogeneous FPGA Architectures, pages
1–150, 2011. doi:10.1007/978-1-4419-7928-5.
[24] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215, 2007.
doi:10.1109/TCAD.2006.884574.
[25] Theerayod Wiangtong, Peter Y K Cheung, and Wayne Luk. Hardware / Software Codesign.
(May):14–22, 2005.
[26] Intel Coorporation. Intel to Acquire Altera. Technical report, 2015.
[27] Zhi Guo, Walid Najjar, Frank Vahid, and Kees Vissers. A quantitative analysis of the speedup
factors of FPGAs over processors. FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th
International Symposium on Field Programmable Gate Arrays, (February):162–170, 2004.
doi:10.1145/968280.968304.
REFERENCES 69
[28] Altera Corporation. White Paper Accelerating High-Performance Computing With FPGAs.
Cluster Computing, (October):1–8, 2007.
[29] a. Balboni, W. Fornaciari, and D. Sciuto. Partitioning and exploration strategies in the
TOSCA co-design flow. Proceedings of 4th International Workshop on Hardware/Software
Co-Design. Codes/CASHE ’96, (c):62–69, 1996. doi:10.1109/HCS.1996.492227.
[30] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli. System level hardware/-
software partitioning based on simulated annealing and tabu search. Design automation for
embedded . . . , 32:5–32, 1997. URL: http://link.springer.com/article/10.
1023/A:1008857008151, doi:10.1023/A:1008857008151.
[31] Daniel D. Gajski, Frank Vahid, Sanjiv Narayan, and Jie Gong. SpecSyn: An Environment
Supporting the Specify-Explore-Refine Paradigm for Hardware/Software System Design.
1996.
[32] R.K. Gupta and G. De Micheli. Hardware-software cosynthesis for digital systems. IEEE
Design & Test of Computers, 10(3), 1993. doi:10.1109/54.232470.
[33] J. Henkel and R. Ernst. A Hardware/software Partitioned Using A Dynamically Determined
Granularity. Proceedings of the 34th Design Automation Conference, 1997. doi:10.
1109/DAC.1997.597233.
[34] J. Henkel. A low power hardware/software partitioning approach for core-based embedded
systems. Proceedings 1999 Design Automation Conference (Cat. No. 99CH36361), 1999.
doi:10.1109/DAC.1999.781296.
[35] Roman Lysecky and Frank Vahid. A study of the speedups and competitiveness of FPGA
soft processor cores using dynamic hardware/software partitioning. Proceedings -Design,
Automation and Test in Europe, DATE ’05, I:18–23, 2005. doi:10.1109/DATE.2005.
38.
[36] Calypto. Catapult Product Family. 2014. URL: http://calypto.com/en/
products/catapult/overview/.
[37] Alfred V Aho, Monica S Lam, Ravi Sethi, and Jeffrey D Ullman. Compilers: Principles,
Techniques, and Tools (2nd Edition). Number 0. 2006. URL: http://www.amazon.
com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811.
[38] R Erik, David Kaeli, and Yaron Sheffer. Welcome to the Opportunities of Binary Translation.
Computer, 2000.
[39] Weiwu Hu, Qi Liu, Jian Wang, Songsong Cai, Menghao Su, and Xiaoyu Li. Efficient binary
translation system with low hardware cost. Proceedings - IEEE International Conference
on Computer Design: VLSI in Computers and Processors, pages 305–312, 2009. doi:
10.1109/ICCD.2009.5413138.
[40] Zheng Shan, Haoran Guo, and Jianmin Pang. BTMD: A Framework of Bi-
nary Translation Based Malcode Detector. 2012 International Conference on
Cyber-Enabled Distributed Computing and Knowledge Discovery, pages 39–43,
2012. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?
arnumber=6384942, doi:10.1109/CyberC.2012.16.
70 REFERENCES
[41] Nuno Paulino, João Canas Ferreira, João Bispo, and João M P Cardoso. Transparent Accel-
eration of Program Execution Using Reconfigurable Hardware *. pages 1066–1071, 2015.
[42] Frank Vahid, Greg Stitt, and Roman Lysecky. Warp processing: Dynamic translation of
binaries to FPGA circuits. Computer, 41(7):40–46, 2008. doi:10.1109/MC.2008.
240.
[43] Hamid Noori, Farhad Mehdipour, Kazuaki Murakami, Koji Inoue, and Morteza Saheb Za-
mani. An architecture framework for an adaptive extensible processor. Journal of Supercom-
puting, 45(3):313–340, 2008. doi:10.1007/s11227-008-0174-4.
[44] Gene M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale
Computing Capabilities. 1967.
[45] Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the
tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525–546, 2004.
doi:10.1016/j.jcss.2004.03.004.
[46] Jens Stoye and Dan Gusfield. Simple and Flexible Detection of Contiguous Repeats Using a
Suffix Tree.
[47] Open Virtual Platforms. OVP technology. URL: http://www.ovpworld.org/
ovptechnology.
[48] Open Virtual Platforms. Technology OVPsim. URL: http://www.ovpworld.org/
technology_ovpsim.
[49] Robert Sedgewick and Kevin Wayne. Algorithms. 2011.
[50] Jeff Scott, L.H. Lee, John Arends, and Bill Moyer. Designing the Low-Power M•
CORE TM Architecture. Power Driven Microarchitecture Workshop, (June):145–150, 1998.
URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.
39.2527&amp;rep=rep1&amp;type=pdf.
[51] Jan Gustafsson and Adam Betts. The mälardalen WCET benchmarks: Past,
present and future. . . . -OpenAccess Series in . . . , (Wcet):136–146, 2010.
URL: http://drops.dagstuhl.de/opus/volltexte/2010/2833/,
doi:10.4230/OASIcs.WCET.2010.136.
