Dynamic instruction set extension of microprocessors with embedded FPGAs by Bauer, Heiner
Faculty of Electrical and Computer Engineering
Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics
Diploma Thesis
DYNAMIC INSTRUCTION SET
EXTENSION OF MICROPROCESSORS
WITH EMBEDDED FPGAS
Heiner Bauer
Born on: November 20, 1991
to achieve the academic degree
Diplomingenieur (Dipl.-Ing.)
Supervisors
Dr.-Ing. Sebastian Höppner
Dr.-Ing. Johannes Partzsch
Supervising professor
Prof. Dr.-Ing. habil. Christian Mayr
Submitted on: March 23, 2017

STATEMENT OF AUTHORSHIP
I hereby certify that I have authored this diploma thesis entitled Dynamic instruction set
extension of microprocessors with embedded FPGAs independently and without undue as-
sistance from third parties. No other than the resources and references indicated in this
diploma thesis have been used. I have marked both literal and accordingly adopted quota-
tions as such. During the preparation of this thesis I was supported by:
Dr.-Ing. Sebastian Höppner as a supervisor
Dr.-Ing. Johannes Partzsch as a supervisor
Florian Schraut of Racyics GmbH with all layout tasks
Additional persons were not involved in the preparation of the present thesis.
Dresden, March 23, 2017
Heiner Bauer
ABSTRACT
Increasingly complex applications and recent shifts in technology scaling have created a
large demand for microprocessors which can perform tasks more quickly and more energy
efficiently. Conventional microarchitectures exploit multiple levels of parallelism to increase
instruction throughput and use application specific instruction sets or hardware accelerators
to increase energy efficiency. Reconfigurable microprocessors adopt the same principle of
providing application specific hardware, however, with the significant advantage of post-fab-
rication flexibility. Not only does this offer similar gains in performance but also the flexibility
to configure each device individually.
This thesis explored the benefit of a tightly coupled and fine-grained reconfigurable
microprocessor. In contrast to previous research, a detailed design space exploration of logical
architectures for island-style field programmable gate arrays (FPGAs) has been performed
in the context of a commercial 22 nm process technology. Other research projects either
reused general purpose architectures or spent little effort to design and characterize custom
fabrics, which are critical to system performance and the practicality of frequently proposed
high-level software techniques. Here, detailed circuit implementations and a custom area
model were used to estimate the performance of over 200 different logical FPGA
architectures with single-driver routing. Results of this exploration revealed tradeoffs and
trends similar to those described by previous studies. The number of lookup table (LUT) inputs and the
structure of the global routing network were shown to have a major impact on the area
delay product. However, results suggested a much larger region of efficient architectures
than before. Finally, an architecture with 5-LUTs and 8 logic elements per cluster was se-
lected. Modifications to the microprocessor, which was based on an industry proven instruc-
tion set architecture, and its software toolchain provided access to this embedded reconfig-
urable fabric via custom instructions. The baseline microprocessor was characterized with
estimates from signoff data for a 28 nm hardware implementation. A modified academic
FPGA tool flow was used to transform Verilog implementations of custom instructions into
a post-routing netlist with timing annotations. Simulation-based verification of the system
was performed with a cycle-accurate processor model and diverse application benchmarks,
ranging from signal processing and encryption to the computation of elementary functions.
For these benchmarks, a significant increase in performance with speedups from 3 to
15 relative to the baseline microprocessor was achieved with the extended instruction set.
Except for one case, application speedup clearly outweighed the area overhead for the ex-
tended system, even though the modeled fabric architecture was primitive and contained no
explicit arithmetic enhancements. Insights into fundamental tradeoffs of island-style FPGA
architectures, the developed exploration flow, and a concrete cost model are relevant for
the development of more advanced architectures. Hence, this work is a successful proof
of concept and has laid the basis for further investigations into architectural extensions and
physical implementations. Potential for further optimization was identified on multiple levels
and numerous directions for future research were described.
KURZFASSUNG (GERMAN ABSTRACT)
Increasingly complex applications and the particularities of modern semiconductor
technologies have led to a large demand for microprocessors that are powerful and at the
same time very energy efficient. Conventional architectures attempt to increase instruction
throughput through parallelization and provide application specific instruction sets or
hardware accelerators to increase energy efficiency. Reconfigurable processors enable
similar performance gains and at the same time have the enormous advantage that the
specialization for a particular application can take place after fabrication.
In this diploma thesis, a reconfigurable microprocessor with a tightly coupled FPGA was
investigated. In contrast to previous research approaches, an extensive design space
exploration of the FPGA architecture was carried out in the context of a commercial 22 nm
fabrication process. Until now, most research projects either used commercial architectures,
which are not necessarily tailored to this use case, or investigated and characterized the
proposed FPGA components only insufficiently. However, precisely this building block is
decisive for the performance of the entire system. Therefore, over 200 different logical
FPGA architectures were investigated in the course of this work. Concrete circuit topologies
and a model for estimating the layout area, tailored to the fabrication process, were used
for modeling. In general, the same trends were observed as in previous and similarly
extensive studies. Here, too, the results were largely determined by the size of the LUTs
(lookup tables) and the structure of the routing network. At the same time, a much wider
range of architectures with almost equal efficiency was identified. For further evaluation,
an FPGA architecture with 5-LUTs and 8 logic elements was selected. The performance of
the selected microprocessor, which builds on a proven instruction set architecture, was
estimated with results from a 28 nm test chip. A modified collection of academic software
tools was used to map custom instructions onto the modeled FPGA architecture and to
generate a netlist for the subsequent simulation and verification.
For a number of different application benchmarks, a relative performance increase between
3 and 15 over the original processor was determined. Although the proposed FPGA
architecture is comparatively primitive and has no arithmetic extensions whatsoever, no
disproportionate increase in chip area had to be accepted, with one exception. The insights
gained into the dependencies between the architecture parameters, the developed
exploration flow, and the concrete cost model are essential for further improvements of the
FPGA architecture. The present work has thus successfully demonstrated the advantage of
the investigated system architecture and paved the way for possible extensions and
hardware implementations. In addition, a number of architectural optimizations and further
potential research approaches were identified.
CONTENTS
1 Introduction
1.1 Purpose
1.2 Scope
2 Background
2.1 Reconfigurable Computing
2.1.1 Classification
2.1.2 Academic Systems
2.1.3 Commercial Systems
2.2 FPGA Architecture
2.2.1 Global Routing Network
2.2.2 Local Routing and Logic Element
3 FPGA Architecture Exploration
3.1 Preliminaries
3.1.1 CAD tools
3.1.2 Transistor Level Implementation
3.1.3 Area model
3.2 Exploration Setup and Results
3.2.1 Benchmarks and CAD settings
3.2.2 Evaluation Methodology and Parameter Selection
3.2.3 Results
3.2.4 Discussion and Candidate Architecture
4 Microprocessor Integration
4.1 Microprocessor architecture
4.2 Hardware Integration
4.2.1 Reconfigurable Functional Unit
4.2.2 Instruction Control
4.3 Software Integration
4.3.1 Instruction Format
4.3.2 High-Level Interface
4.4 Array Size Tradeoffs
5 Application Benchmarks and Results
5.1 Benchmark Selection and Evaluation Methodology
5.1.1 Fast Fourier Transformation
5.1.2 Data Encryption Standard
5.1.3 Exponential function
5.2 Functional verification
5.3 Results
6 Discussion and Outlook
6.1 Discussion of Results
6.2 Extensions and Future Research
Bibliography
Appendix
SYMBOLS AND NOTATION
F_c,in     Fraction of routing tracks per cluster input
F_c,local  Flexibility of cluster crossbar
F_c,out    Fraction of available routing tracks per cluster output
F_S        Flexibility of switch block
I          Number of cluster inputs
K          Number of inputs per LUT
L          Length of a routing segment
N          Number of logic elements per cluster
W          Number of routing tracks per channel
g          Logical effort of a stage
h          Electrical effort of a stage
p          Parasitic delay of a stage
f_opt      Stage effort that minimizes delay
D          Normalized path delay
S          Number of stages along the considered path
F          Path effort
ACRONYMS
ALU Arithmetic Logic Unit
ASIC Application Specific Integrated Circuit
ASIP Application Specific Instruction set Processor
BLIF Berkeley Logic Interchange Format
CAD Computer Aided Design
CB Connection Block
CGRA Coarse-Grained Reconfigurable Array
CMOS Complementary Metal Oxide Semiconductor
COFFE Circuit Optimization For FPGA Exploration
DES Data Encryption Standard
DFT Discrete Fourier Transformation
eFPGA embedded FPGA
FDSOI Fully Depleted Silicon-On-Insulator
FFT Fast Fourier Transformation
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
ISA Instruction Set Architecture
LE Logic Element
LSU Load Store Unit
LUT Lookup Table
MOPS Million Operations Per Second
MWTA Minimum-Width Transistor Area
PLB Processor Local Bus
RFU Reconfigurable Functional Unit
RTL Register Transfer Level
SB Switch Block
SOC System On Chip
SRAM Static Random Access Memory
VPR Versatile Place and Route
VTR Verilog To Routing
1 INTRODUCTION
Since the dawn of digital integrated circuits, microprocessors have been a popular choice to
solve complex tasks. Covering a large spectrum, from minimal 8-bit embedded processors
in household appliances to 64-bit multi-core machines in supercomputers, their ubiquitous
presence today shows the success of this design style. Advances in transistor process
technology and coordinated refinements in microprocessor architecture have allowed software
systems to become increasingly complex. Groundbreaking innovations like the Internet,
smartphones or modern advanced driver assistance systems all rely on powerful hardware
to effectively handle the numerous levels of abstraction. Nevertheless, the end of Dennard
scaling has led to a radical shift in the design of processor architectures [EBA+11].
Limited scaling of the power density per transistor necessitates parallel processing archi-
tectures, highly energy-efficient computations, and sophisticated voltage and frequency
scaling schemes [HSE+12]. Furthermore, a number of innovative and diverse approaches
are being pursued to extract performance primarily from scaling in terms of silicon area.
Even without these recent complications, designing and implementing general purpose
microprocessors and choosing the right combination of instructions has always been a chal-
lenge [Col06]. To justify the high cost and long turnaround time, not only of logical design
but also of verification, physical design, and fabrication, the final product has to perform well
on a wide range of possible applications to be economically viable. Still, the limited set of
available instructions provides only inefficient implementations for a number of attractive
applications. Modern systems thus try to enhance performance by increasing instruction
throughput and by exploiting thread-level parallelism with multi-core systems. These ad-
vanced architectures are not only challenging to design and to debug, but they also scale
poorly for applications with limited parallelism.
A different approach is to incorporate special purpose hardware accelerators for tasks
that occur frequently, e.g., graphics rendering [HMB+14], packet processing [Cav16], or
database analytics [KLS+16]. Modern technology scaling has allowed even embedded mi-
croprocessors, which are traditionally designed to be highly cost effective, to include appli-
cation specific accelerators for encryption and signal processing [Tex16] or error correction,
true random number generation, and cryptographic hashing [STM16].
Likewise, application specific instruction set processors (ASIPs) augment a baseline
microprocessor with custom instructions which are tailored to one specific application. Choosing
the right set of custom instructions is usually carried out with the help of a cycle-accurate
simulation model and a set of representative benchmark applications. Careful selection and
appropriate implementation of these custom instructions can leverage details of the target
application in early design stages and dramatically increase the computational performance
and energy efficiency [HAN+16]. However, this approach also faces the same tradeoffs between
flexibility and performance and is much more sensitive to requirement changes in
the target application.
Technology scaling has also enabled a quite different design style, namely reconfigurable
digital circuits. FPGAs, the most prominent representative of this group, allow the end user
to implement custom digital circuits without the need for expensive and time consuming
design of application specific integrated circuits (ASICs). Nevertheless, the high flexibility
of such fine-grained fabrics, which allow individual applications to be optimized at bit level,
incurs significant drawbacks regarding silicon area, power consumption and maximum op-
erating frequency compared to equivalent ASIC solutions [KR09].
Consequently, combining reconfigurable circuits with software programmable micropro-
cessors offers a compelling solution, with sufficient flexibility to target a wide range of ap-
plications. Applications can be partitioned to offload control flow operations, which are hard
to design at bit level, onto the microprocessor and use the reconfigurable fabric to imple-
ment datapath operations. Ideally, this would combine the advantages of general purpose
microprocessors and ASIPs, but also add the exceptional capability to program individual
devices with application specific instructions after fabrication. Within minutes, a designer
could explore multiple custom instructions during the development phase or improve exist-
ing implementations after they have been shipped and integrated into a system.
1.1 PURPOSE
Multiple implementation styles for such reconfigurable computing platforms exist and a
number of different architectures and tools have been studied. Despite a multitude of
projects and many attempts, not a single concept has had a long-lasting success, neither in
a commercial nor in an academic context. Important problems which need to be addressed
include: design and optimization of the reconfigurable fabric, development of suitable com-
puter aided design (CAD) software to produce high quality results, and system integration of
the fabric with existing hardware and software toolchains. Most previous research regarding
reconfigurable hardware focused on isolated problems and few researchers have shown the
advantage of their solution with a complete system. This thesis is primarily concerned with
exploring and selecting tradeoffs in the design of a custom FPGA architecture, tailored to
the requirements of a tightly coupled reconfigurable functional unit (RFU). Emphasis on this
step is critical, since many design decisions and the feasibility of advanced software tech-
niques to exploit runtime reconfiguration depend on the characteristics of the reconfigurable
hardware and its technology specific implementation.
Despite its importance in providing the necessary flexibility after fabrication and its im-
mediate impact on overall system performance, adequate design and analysis of the re-
configurable fabric regarding implementation cost, overhead due to reconfiguration, and
computational performance was often done with little care or was completely neglected.
Few research groups used silicon prototypes, many estimated performance from commer-
cial FPGA architectures (for which mature CAD tools exist) and some inferred performance
from ASIC implementations with standard cells [KBS+10]. Many publications lack plausible
specification of fabric performance or the evaluation methodology, but accurately quantify
the code size of developed CAD tools, boast of large energy reductions, and
present impressive speedups across large benchmark suites [VSL08, LVT04]. Without de-
tailed physical models of the fabric or validation on hardware prototypes, crucial aspects
and limitations were not captured. Thus, all claimed performance gains remain hypothetical
and developed design tools cannot be reused in subsequent research. Furthermore, few
academic projects assumed realistic performance of the microprocessor and instead used
high-level simulation models [CTMB13], microprocessors emulated on FPGAs [WC96], or
purely educational microarchitectures [BSKH07].
This thesis focuses on practicality and relevance of all components. Prevailing design
wisdom for logical architectures of general purpose FPGAs is reevaluated in the context of a
modern 22 nm fully depleted silicon-on-insulator (FDSOI) process technology from Globalfoundries
[CMP+16]. A custom benchmark set, representative of circuits that implement custom in-
structions, a much more robust area model, and detailed transistor level implementations
are used to provide meaningful and concrete results of the reconfigurable hardware. The
customized FPGA fabric is integrated into the execution stage of a 32-bit microprocessor,
which is based on a commercial instruction set architecture (ISA) [DJL+97]. An implementa-
tion of this processor has been taped out as part of a 28 nm test chip and accurate signoff
data are available to establish a baseline. Finally, complete application benchmarks are used
to evaluate the performance and area overhead of the instruction set extensions.
1.2 SCOPE
Figure 1.1 highlights the different levels of abstraction that are dealt with in this thesis.
Based on a detailed transistor level model, a bottom-up approach is adopted to first explore
the design space of logical architectures for fine-grained fabrics and to use the results after-
wards to implement a reconfigurable functional unit.
[Figure: stacked levels of abstraction, grouped into a software domain and a hardware
domain. Levels shown: application, software implementation, machine code, processor
microarchitecture, reconfigurable functional unit, FPGA logical architecture, transistor
level implementation.]
Figure 1.1: Simplified overview of the levels of abstraction
This stands in contrast to previous research which often put a stronger emphasis on issues
that arise at the depicted interface of the hardware and software domain. This includes
coupling mechanisms of the reconfigurable portion [vSKN+06], automatic extraction of can-
didate instructions with high-level synthesis [API03], or unique exploitation of runtime recon-
figuration [JH15]. Important parameters of the FPGA architecture were either cherry-picked
from datasheets of commercial architectures or inferred from insights of previous outdated
explorations which focused on general purpose applications.
Chapter 2 classifies existing approaches to reconfigurable computing and presents con-
ventional island-style FPGA architectures. The next chapter describes the CAD tools, circuit
implementations, and the area model which were used to conduct a design space explo-
ration across several hundred logical FPGA architectures. This detailed exploration was
necessary to be aware of the possible area delay tradeoffs for each individual logical ar-
chitecture under the influence of single-driver routing and the performance of the chosen
transistor technology. After determining concrete characteristics for one particular set of
logical parameters, chapter 4 describes how this FPGA fabric is integrated as a functional
unit into the microprocessor hardware and how a programmer can utilize its functionality.
To assess the benefit of instruction set extensions, chapter 5 gives an overview of the
implementation and results of application benchmarks with the complete system. The last
chapter concludes with a summary of the most important results and presents extensions
to the proposed architecture and directions for future research.
2 BACKGROUND
This chapter motivates different concepts of reconfigurable computing, presents previous
research on reconfigurable microprocessors, and reviews architectural details of island-style
FPGAs.
2.1 RECONFIGURABLE COMPUTING
Due to its simplicity, the idea of reconfigurable computing has a long history and a large
spectrum of implementations exists. In general, any system that allows customization of its
computing resources at runtime is classified as reconfigurable computing. A straightforward
way to implement such systems with off-the-shelf parts involves either board level or sys-
tem level integration of standalone FPGAs with conventional x86 workstations [FPM12] or
implementation of extensible soft processors on commodity FPGAs [CFHZ04]. While both
approaches might be useful for certain workloads, they rely on pre-built general purpose
platforms which sacrifice performance in favor of flexibility.
The focus of the following section and the rest of this thesis will be narrowed to sys-
tems that couple ASIC implementations of microprocessors with customized reconfigurable
fabrics on chip level. A more general overview of architectures for reconfigurable com-
puting, including the evolution of standalone FPGAs, can be found in multiple surveys
[CH02, Ama06, TPD15, WWC16]. By concurrently optimizing the hardware of the proces-
sor and the reconfigurable fabric, the interaction between both can be matched to allow
more powerful execution of individual applications. Highly irregular control flow operations
are mapped onto software running on the microprocessor, while demanding datapath oper-
ations are efficiently implemented with reconfigurable hardware. Additionally, this approach
enables the designed components to be reused as building blocks in much larger systems
on chip (SOCs), which would not be possible with prefabricated devices.
At the same time, and especially when the effort for the required CAD tools is considered,
more design challenges have to be met to show meaningful results with these systems.
Due to the strong interdependency between hardware architectures and CAD tools, many
research projects have started from scratch and developed their own custom solution. As
a consequence, the reported results are hard to validate and the developed tools cannot be
reused. One exception is the open-source Verilog to routing (VTR) flow [LGW+14], which
was used as a basis in this work and which is presented in more detail later.
2.1.1 CLASSIFICATION
At the heart of any reconfigurable system is a mechanism that allows the end users to
tailor some aspect of the computation to their needs. A distinctive property of this
mechanism is the granularity of configuration. Although most systems adopt one of the two
extremes, combinations of coarse- and fine-grained architectures exist.
Coarse-grained reconfigurable arrays (CGRAs) consist of hardwired processing elements
that perform abstract operations such as bit-shift, addition, multiplication, or permutation.
With limited to no flexibility, these elements have to be designed as generically as possible.
To still implement a large set of applications with these systems, a flexible routing network
provides connections between the processing elements. Operations which require data
formats other than the native word length, often shared by the processing elements and
the routing networks, are synthesized from a sequence of steps performed with the
processing elements. Limitations in the available set of functions usually allow only datapath
operations to be mapped onto CGRAs. Therefore, CGRAs are often supported by a general
purpose microprocessor to ease development of complete applications.
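How an operation wider than the native word length can be composed from a sequence of processing element steps may be illustrated with a minimal sketch. It assumes a hypothetical processing element with a 16-bit adder and a carry output; the word length, the carry interface, and the names `pe_add` and `wide_add` are illustrative assumptions and are not taken from any of the cited CGRAs.

```python
WORD_BITS = 16                      # assumed native word length of the array
WORD_MASK = (1 << WORD_BITS) - 1


def pe_add(a, b, carry_in=0):
    """Hypothetical processing element: one native-word addition with carry."""
    total = a + b + carry_in
    return total & WORD_MASK, total >> WORD_BITS


def wide_add(x, y, words):
    """Add two (words * WORD_BITS)-bit operands as a sequence of PE steps."""
    result, carry = 0, 0
    for i in range(words):          # one processing element step per word
        a = (x >> (i * WORD_BITS)) & WORD_MASK
        b = (y >> (i * WORD_BITS)) & WORD_MASK
        s, carry = pe_add(a, b, carry)
        result |= s << (i * WORD_BITS)
    return result, carry
```

In this sketch a 32-bit addition costs two dependent steps, which hints at why operations outside the native data format reduce the throughput advantage of a coarse-grained array.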
On one hand, the reduced flexibility and hardwired processing elements minimize over-
head, provide very high throughput for their fundamental functions, and allow complete syn-
thesis with conventional standard cell flows. On the other hand, selection of the processing
element capability is crucial and has to be done a priori. The challenges in finding the right
tradeoffs are similar to those of selecting the right instruction set for microprocessors.
Nevertheless, numerous applications exhibit inherent data level parallelism where CGRAs facilitate
implementations with high area efficiency and high performance. Due to the coarse na-
ture of the processing elements, technology mapping for CGRAs is usually performed via
high-level synthesis. That is, the operations and their scheduling are directly extracted from
a software implementation in a high-level programming language, for example, C. Yet, no
generic approach or tool flow for mapping applications to CGRAs exists, which is partly due
to the vast differences in the architecture of processing elements and routing networks.
Representative examples of CGRAs are described in [BEM+03] and [ABP11]. The global
routing network and the mix of processing elements of the latter are shown in figure 2.1.
Figure 2.1: Mesh routing network of the EGRA [ABP11] (© 2011 IEEE)
Fine-grained architectures, such as FPGAs, allow simultaneous configuration of the rout-
ing network and the computational elements. Both components can be customized at bit
level, providing maximum flexibility. The only limiting factor regarding application mapping
is usually the amount of configurable resources or capacity of the device. Refer to section
2.2 for a more detailed look at the architecture of island-style FPGAs. Essentially, simple
fine-grained fabrics consist of multiplexers, dense configuration memories, and sequencing
elements. Although all necessary components can in principle be synthesized from a
standard cell library, this leads to excessive overhead in chip area and performance
degradation [KBW03, Kim16]. Even commercial products, with full custom design of all leaf
cells, cannot completely mitigate the price that has to be paid for post-fabrication
configurability. With high flexibility, fine-grained architectures are much more suitable to implement
both control and data flow operations; nevertheless, some users prefer to develop most of
the application in software or to reuse existing software solutions.
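The statement that a fine-grained fabric is built from multiplexers and configuration memory can be made concrete with a small behavioral sketch of a K-input LUT. The class name `LUT`, its methods, and the full adder example are assumptions chosen for illustration; they model only the logical behavior and say nothing about the circuit implementations discussed later in this thesis.

```python
class LUT:
    """K-input lookup table: 2**K configuration bits plus a read multiplexer."""

    def __init__(self, k):
        self.k = k
        self.config = [0] * (1 << k)    # models the configuration memory cells

    def program(self, func):
        """Fill the configuration memory from any k-input Boolean function."""
        for index in range(1 << self.k):
            bits = [(index >> i) & 1 for i in range(self.k)]
            self.config[index] = int(bool(func(*bits)))

    def evaluate(self, *inputs):
        """The inputs act as select lines of a 2**k-to-1 multiplexer."""
        index = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.config[index]


# Example: configure a 3-LUT as the sum output of a full adder (3-input XOR).
sum_lut = LUT(3)
sum_lut.program(lambda a, b, cin: a ^ b ^ cin)
```

Reprogramming the same object with a different function changes only the stored bits, which mirrors how post-fabrication configurability is obtained at the cost of 2^K memory cells per LUT.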
Compared to CGRAs, the overall design flow for using field programmable logic is more
generic and similar to the conventional ASIC methodology. Hence, algorithms and tool
flows can be partly reused. Mapping of user circuits is performed with conventional logic
optimization and synthesis from register transfer level (RTL) descriptions. However, tech-
nology mapping for LUT-based FPGAs is unique, since these elements are fan-in limited in
contrast to standard cells which are depth limited. Nevertheless, this challenge is identical
for all LUT-based fabrics and independent of other logical parameters. Place and route, on
the other hand, is highly architecture specific and requires customized solutions. This is
also true for generating the bitstream which, after being programmed into the configuration
memory of the device, has to implement the desired functionality.
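The fan-in limitation can be illustrated with a short calculation. A K-LUT implements any function of up to K inputs at the same cost, so the cost of a wide function is governed by its number of inputs rather than by its gate-level complexity. The sketch below assumes a wide associative reduction (for example an n-input AND or XOR) built as a tree of K-LUTs; the function names are hypothetical, and the two figures are computed independently of each other. It is not the mapping algorithm of any particular CAD flow.

```python
def lut_count(n, k):
    """K-LUTs needed to reduce n inputs with an associative function."""
    if n <= 1:
        return 0
    # Each K-LUT merges up to k signals into one, removing at most k - 1.
    return -(-(n - 1) // (k - 1))   # ceil((n - 1) / (k - 1))


def lut_depth(n, k):
    """Minimum number of LUT levels on the longest path of such a tree."""
    depth, reach = 0, 1
    while reach < n:                # a tree of this depth covers k**depth inputs
        reach *= k
        depth += 1
    return depth
```

For a 32-input reduction, this gives 8 LUTs in 3 levels with 5-LUTs and 11 LUTs in 3 levels with 4-LUTs.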
Another important property used to categorize reconfigurable systems is the coupling
between the reconfigurable fabric and the host microprocessor. Tight coupling minimizes
latency and allows direct access to the processor's resources. From the perspective of
the processor, the fabric is treated like any other functional unit and is consequently called
a reconfigurable functional unit. Advantages offered by this approach are twofold. First,
resources that are already in place for execution of native instructions are reused by custom
instructions. This can include reuse of the register file, the memory interface, and the
instruction issue logic, especially for superscalar architectures that issue and execute multiple
instructions at once. Second, a system with a tightly coupled RFU limits neither the
conventional software execution nor the configurable logic by attaching it to a system-wide bus,
which would incur additional latency. Furthermore, simply issuing a custom instruction
offers the most natural interface for software applications to access the capabilities of RFUs.
Given that the compiler toolchain is aware of the semantics of the custom instructions, their
scheduling can be optimized in the same manner as with native instructions. This approach
also offers the possibility for superscalar architectures to incorporate multiple RFUs.
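The idea that the processor treats the fabric like any other functional unit can be sketched with a toy execution stage. Everything in this sketch is hypothetical: the opcode names, the `ExecuteStage` class, the 32-bit register model, and the example custom instruction are chosen only for illustration and do not describe the microarchitecture or instruction format used later in this thesis.

```python
class ExecuteStage:
    """Toy execution stage; opcode names and the interface are hypothetical."""

    def __init__(self):
        # Native functional units, selected by opcode.
        self.units = {
            "add": lambda a, b: (a + b) & 0xFFFFFFFF,
            "xor": lambda a, b: a ^ b,
        }

    def configure_rfu(self, opcode, func):
        """Model loading a configuration: the RFU now implements `func`."""
        self.units[opcode] = func

    def execute(self, opcode, regs, rd, rs1, rs2):
        """Read two source registers, run the selected unit, write back."""
        regs[rd] = self.units[opcode](regs[rs1], regs[rs2]) & 0xFFFFFFFF
        return regs


def swap_halves_xor(a, b):
    """Example custom instruction: XOR, then swap the 16-bit halves."""
    x = a ^ b
    return ((x << 16) | (x >> 16)) & 0xFFFFFFFF


stage = ExecuteStage()
stage.configure_rfu("cust0", swap_halves_xor)
regs = stage.execute("cust0", [0, 0x12345678, 0x0000FFFF, 0], rd=3, rs1=1, rs2=2)
```

The custom instruction uses the same register read and write-back path as the native `add` and `xor` operations, which is the resource reuse described above.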
Loose coupled systems either employ a dedicated bus system or reuse already present
interconnect structures. This goes hand in hand with a separate clock domain to decouple the
fabric clock from the clock domain of the microprocessor. Hence, more flexibility in clocking
the reconfigurable portions of the design is provided. Still, synchronization overhead in chip
area and especially in latency makes these systems less attractive for mapping small and
simple computational kernels. Indeed, 14 cycles were measured for the access of processor
registers from the reconfigurable fabric of a recent commercial system. This rendered the
loose coupled FPGA fabric less attractive for implementing automatically generated hard-
ware accelerators [VHCH15]. Consequently, loose coupled systems are better suited to
map large sections of target applications onto the reconfigurable fabric, which in turn must
provide high capacity and dedicated heterogeneous resources.
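The effect of access latency on small kernels can be illustrated with a back-of-the-envelope calculation. The sketch below is a simplification and not a model from the cited work: it assumes a kernel that takes `sw_cycles` in software and `fabric_cycles` on the fabric, and it charges a fixed `access_cycles` overhead once per invocation. The function name and the kernel sizes are hypothetical; only the 14-cycle figure is taken from the measurement quoted above.

```python
def effective_speedup(sw_cycles, fabric_cycles, access_cycles):
    """Speedup of one kernel invocation when each call pays a fixed access overhead."""
    return sw_cycles / (fabric_cycles + access_cycles)

# Hypothetical small kernel: 20 cycles in software, 2 cycles on the fabric.
small_tight = effective_speedup(20, 2, 0)    # assumes no transfer overhead
small_loose = effective_speedup(20, 2, 14)   # 14-cycle register access

# Hypothetical large kernel: 2000 cycles in software, 200 cycles on the fabric.
large_loose = effective_speedup(2000, 200, 14)

print(small_tight, small_loose, large_loose)
```

Under these assumptions the small kernel loses most of its benefit to the access overhead, while the large kernel is barely affected, which matches the argument that loose coupling favors large application sections.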
Despite the clear distinction with the previously described categories, hybrid approaches
have also been proposed [KBS+10, LCB+06, RCS+10]. A carefully designed combination of
coarse- and fine-grained resources allows such systems to benefit from their complemen-
tary strengths.
2.1.2 ACADEMIC SYSTEMS
A detailed study regarding embedded FPGAs (eFPGAs) can be found in [Neu10]. Based on
previous research [LN04], Neumann derived a heavily modified island-style architecture
tailored to arithmetic oriented applications. To reduce the overhead of global routing, a routing
architecture with signal groups of different weights was introduced, based on the observa-
tion that datapaths have quite regular and predictable structures. Consequently, the logic
element (LE) architecture has extensive support for arithmetic functions and local routing
is grouped into function and bit slices. Due to the high effort of the physical implementation in
a 180 nm and a 130 nm transistor technology, only three distinct architectures were developed
and compared to the commercial Cyclone I architecture. Although these implementations
only represented a small set of the template-based semi-automatic layout framework, no
detailed exploration was performed to motivate the choices of the underlying logical archi-
tecture. Benchmarks included add-compare-select circuits, digital filters, and a quad 4-bit
multiply-accumulate unit for a correlation receiver. All benchmarks were mapped and con-
figured manually, since the custom architecture was not fully supported by the academic
CAD tools of that time. The best architecture achieved 10 times higher area and energy
efficiency than the commercial architecture.
A related study explored how this eFPGA architecture could be coupled to an ASIP as
well as conventional ARM940T and MIPS-IV processor cores [vS10]. The evaluation of the
processor architectures was based on cycle-accurate simulation models, while the model
for the eFPGA relied on more accurate data from physical implementations. Among the
proposed coupling mechanisms, tight coupling with direct access of the processor regis-
ters was found to be highly effective. Memory access of the reconfigurable fabric proved
to be beneficial only for some specific applications. Availability of a native multiplication
instruction increased the efficiency and speedup of the reconfigurable system, since this
operation could be reused without spending fabric resources. Throughout the comparison
of pure software implementations with different reconfigurable system architectures, a fig-
ure of merit based on the product of required area, runtime, and energy consumption (ATE)
was used consistently. Complete applications such as finite impulse response and median
filters, encryption using the data encryption standard (DES), and a correlation receiver were
chosen as representative benchmarks. In the last case, correlation was extended with local
generation of the system specific spreading code by means of a linear-feedback shift regis-
ter. With application speedups in the range of 14 to 128, ATE costs were reduced by three
orders of magnitude. Although not explicitly supported by the eFPGA architecture, a prelim-
inary evaluation of multiple configuration contexts combined with dynamic reconfiguration
found this feature to be of no value for the chosen DES implementation, which required only
a few (~300) logic elements.
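The ATE figure of merit described above is simply the product of area, runtime, and energy. As a small sketch of how such a comparison works, the snippet below computes the factor by which a candidate system lowers the ATE product relative to a baseline. The function names and all numbers are purely illustrative assumptions and are not results from [vS10].

```python
def ate(area, time, energy):
    """Area-time-energy product; lower is better."""
    return area * time * energy

def ate_reduction(baseline, candidate):
    """Factor by which the candidate lowers the ATE product relative to the baseline."""
    return ate(*baseline) / ate(*candidate)

# Purely illustrative, normalized (area, runtime, energy) values.
software_only = (1.0, 1.0, 1.0)
with_fabric = (1.5, 1.0 / 20.0, 1.0 / 30.0)  # 50% more area, 20x faster, 30x less energy

reduction = ate_reduction(software_only, with_fabric)
print(reduction)
```

Because the three factors multiply, moderate gains in runtime and energy can outweigh a noticeable area overhead.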
The XiRisc system [BBM+04] features a row-based fabric [LMB+06] which represents
a refinement over the first generation [LMB+06]. Leveraging the multiported register file
of the microprocessor, any row inside the fabric has access to the four 32-bit operands
that are read each cycle. Fast computation is achieved with dual 4-LUTs, explicit support
of carry-select architectures and dedicated pipeline registers at the output. The fabric can
switch between 4 contexts within a single cycle and can preload configuration data over a
272-bit wide interface. Pipeline registers were designed to keep the results of computations
between context switches. Benchmarks were focused on signal processing tasks, DES, and
codecs for error correction. All benchmark results were based on measurements of a silicon
prototype manufactured with a 160 nm process technology. They included speedups up to
80 but also precise data regarding the area overhead of the fabric as well as analysis of the
energy efficiency.
Similar academic projects have proposed fine-grained architectures but lack convincing
performance evaluation. OneChip highlighted the importance of a single-chip integration
of the reconfigurable fabric and the microprocessor, but was implemented on a multi-chip
FPGA platform [WC96]. A followup study investigated the effect of a superscalar processor
architecture and reported speedups of up to 32; however, these results were solely based
on high-level simulation models and bold performance estimates [CC01]. Other projects
provided more details of their customized row-based architectures. The Garp fabric was
designed to naturally support carry-save arithmetic and hence chose a fundamental gran-
ularity of two bits [Hau00]. Extensive support for carry-save computations was built into
the carry chain and the logic elements comprised a single 4-LUT and shift logic. Since
the fabric was not viewed as a functional unit of the MIPS processor, four memory buses
were integrated into the routing structure to allow direct memory access during computa-
tions. Again, performance was only estimated with a software model and compared to an
UltraSPARC machine. Chimaera aggressively tried to reduce the area overhead and thus
omitted sequencing elements like flip-flops and provided minimal routing resources with
strong directional bias [HFHK04]. LEs consisted of one 3-LUT and one fracturable 4-LUT
with dedicated carry chain support. Although the performance of the fabric itself was mea-
sured on a 0.6 µm silicon prototype, no results are available for a complete system with the
microprocessor.
A rather rare approach used a tight coupled, coarse-grained architecture to extend two
versions of the SimpleScalar simulation model [CTMB13]. Two additional special functional
units were added, one for simple and one for more complex expressions. The first included
simple arithmetic logic units (ALUs) and shift functionality, while the latter additionally
included a multiply-accumulate unit. A set of instruction sequences was extracted from two
benchmark suites and subsequently analyzed to guide the selection of these operations.
Both units can read up to 4 operands from the processor register file and simultaneously
write 2 operands back. The control signals which configured the custom datapath were
automatically generated by a graph-based mapper. Based on synthesis results with an aca-
demic 45 nm transistor technology, the clock frequency of the baseline microprocessor was
estimated to be 633 MHz. Since the proposed special functional unit was not decoupled,
the extended processor model had to run all instructions at a reduced frequency of 606
MHz. Detailed results regarding area overhead of the special functional unit were not re-
ported. Despite the advantage of an advanced out-of-order microarchitecture, no speedup
greater than 1.8 was achieved.
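The cost of the reduced clock frequency can be made explicit with a short calculation. The sketch below uses the two frequencies quoted above and assumes, as a simplification, that wall-clock speedup is the cycle-count speedup scaled by the ratio of the clock rates. The function and constant names are hypothetical.

```python
F_BASE_MHZ = 633.0      # baseline clock estimate quoted above
F_EXTENDED_MHZ = 606.0  # clock of the extended processor quoted above

def net_speedup(cycle_speedup, f_base=F_BASE_MHZ, f_ext=F_EXTENDED_MHZ):
    """Wall-clock speedup for a given cycle-count speedup and the two clock rates."""
    return cycle_speedup * f_ext / f_base

# Cycle-count speedup needed just to match the baseline processor.
break_even = F_BASE_MHZ / F_EXTENDED_MHZ

print(break_even, net_speedup(1.0))
```

Under this assumption, code that does not use the special functional unit at all runs a little over 4 percent slower, and a custom instruction has to save at least that share of cycles before it pays off.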
2.1.3 COMMERCIAL SYSTEMS
Although a large spectrum of architectures has been studied in research projects, few of
them have been transformed into commercial products. One exception is the "configurable
system on chip" prototype [BV03] that combined a SPARC V8 microprocessor with a commercial
CGRA fabric [BEM+03]. The commercial derivative kept the overall system architecture,
including the multi-layer bus system for loose coupling, but replaced the SPARC microprocessor
with an ARM7EJ-S core [VB03]. The fabric was optimized for high throughput and
had dedicated access to a memory section alongside the system-wide memory, which con-
tained the program code. Communication between processing elements was packet-ori-
ented and based on a hierarchical interconnect structure. Two types of elements were
implemented, ALUs and memory. A complex configuration protocol was implemented in
hardware and allowed configuration caching and partial reconfiguration at runtime. Based on
synthesis results for a 130 nm process, the reconfigurable system performed typical signal
processing tasks like fast Fourier transforms (FFTs) and finite impulse response (FIR)
filters up to four times more energy-efficient than an off-the-shelf digital signal processor.
A far more popular option, especially among vendors of conventional standalone FPGAs,
are fine-grained, loose coupled systems. Even though microprocessors can be viewed as
a recent trend in the evolution of FPGA architectures, similar systems were already present
more than a decade ago. Modern incarnations have exploited technology scaling to com-
bine high-capacity fabrics with more powerful microprocessors. An overview of existing
entry-level systems and their predecessors is given in table 2.1.
Table 2.1: Entry-level commercial systems with loose coupled FPGA fabrics
Vendor      Device family   Node (nm)   Processor core   fCPU (GHz)   Interconnect   Capacity (k LEs)   Ref.
Atmel       FPSLIC          350         AVR              0.03         custom         2.3                [Atm03]
Altera      Excalibur       180         ARM922T          0.20         AHB (1)        38.4               [Alt02]
Xilinx      Virtex II Pro   130         PPC 405          0.30         PLB (2)        99.2               [Xil11]
Microsemi   SmartFusion2    65          ARM M3           0.17         AHB (1)        150.0              [Mic16]
QuickLogic  EOS S3          40          ARM M4F          –            –              2.8                [Qui15]
Altera      Cyclone V SoC   28          ARM A9           0.95         AXI (3)        100.0              [Alt15]
Xilinx      Zynq 7000       28          ARM A9           1.00         AXI (3)        444.0              [Xil16]
(1) AMBA High-performance Bus
(2) Processor Local Bus
(3) Advanced eXtensible Interface
Loose coupling of conventional FPGA fabrics through standardized interconnects and reuse
of IP-based microprocessor designs allows such systems to be constructed with minimal
effort. Additionally, vendors can rely on mature CAD tools from their main product lines
and exploit software ecosystems, including complete operating systems, of dominant
ISAs. Other commercial vendors with custom designed fabrics, CAD tools, and integration
schemes seem to have had less success [TAJ00, Arn05].
2.2 FPGA ARCHITECTURE
FPGA technology evolved from mask programmable gate arrays and has largely replaced
other families of programmable logic. In over 30 years, a variety of FPGA architectures have
been proposed and implemented. The following will provide a rough classification of FPGA
architectures based on their programmability and their global routing network.
Configuration bits which ultimately define the functionality of a circuit are stored in mem-
ory cells alongside routing resources and logic elements of an FPGA. The characteristics
of these cells have a direct impact on the overall architecture. The most mature and cur-
rently dominating architectures use volatile static random access memory (SRAM) cells.
To avoid reconfiguration at each power cycle from external non-volatile memory chips and
possible disturbance of the configuration state (soft errors), other approaches directly inte-
grate non-volatile memory cells on the same die [Alt16] or within the fabric itself [Act07].
These are niche products due to the substantial effort of integrating the non-volatile devices with
conventional complementary metal oxide semiconductor (CMOS) processes and significant
reduction in logic density. Since SRAM-based FPGAs directly benefit from technology scal-
ing, they have been and still are the focus of industrial and academic research. Nonetheless,
architectures based on emerging non-volatile memories, which promise high density, seam-
less integration with CMOS logic, and low power operation, could become more viable in
the future [XZC+11].
The global routing architecture plays an important role, since it is the major contributor
to overhead in silicon area, power consumption and delay. Row-based architectures, which
mimic mask programmable gate arrays, only provide routing resources in one direction.
Logic blocks are connected by routing tracks in adjacent rows and might contain dedicated
hard-wired connections between rows. Although this approach can be used to efficiently
implement datapaths and reduce mapping complexity, it is restricted to a fixed word length
and limited routability might render irregular logic infeasible to map.
Tree-based or hierarchical architectures also attempt to reduce the complexity of the routing
network, but impose a multi-stage structure onto the network. Based on the observation
that circuits tend to have only a limited connectivity between nodes, a fixed number of LEs
are grouped together and are directly connected. This structure recursively connects other
groups and forms a hierarchy. Given a custom place and route tool and compatible logic
elements, this approach can significantly reduce the overhead and even reuse modern com-
mercial front-end tools for synthesis and logic optimization [Wan13]. The major drawback is,
again, that the imposed structure restricts flexibility and the irregular structure of the rout-
ing network requires large custom layouts. Additionally, scaling the array capacity requires
redesign of the routing network to balance the hierarchy.
Mesh-based or island-style architectures are popular in commercial implementations and
have enjoyed the most attention from research. They extend row-based architectures to
form a two-dimensional network of routing channels without restricting routability in a par-
ticular direction. Each tile, containing logic elements and their local interconnect, is symmet-
ric and has the same number of connections to the global routing network. This reduces
custom design to one instance of a tile and allows scaling of array capacity by increasing the
number of tiles in a rectangular placement. Still, adequate flexibility in the routing network
requires a carefully chosen number of routing tracks. However, many methods have been
researched and shown to effectively reduce area overhead without substantial reduction of
routability or performance. Many academic tools, including the versatile place and route
(VPR) tool, primarily support island-style architectures. A discussion of island-style architec-
tures and implementation details of VPR can be found in [BRM99]. Combinations of mesh
and tree networks have been adopted in at least one commercial FPGA architecture.
Due to the advantages of simpler physical design, high routing flexibility, and the availabil-
ity of open design tools, SRAM-based FPGAs with an island-style architecture were chosen
for this thesis. The following discussion of logical parameters for such FPGAs is geared
towards structures that are directly supported by the CAD tools.
2.2.1 GLOBAL ROUTING NETWORK
A schematic of a global routing network for island-style architectures is shown in figure 2.2.
The highlighted section shows the components of a tile: two connection blocks (CBs), a
switch block (SB) and the logic cluster. This definition also includes the vertical and hori-
zontal routing tracks that form the respective routing channels. To allow external signals to
enter the routing network, IO blocks have to be provided at the edges of the array. For stan-
dalone FPGAs these blocks represent pad cells and for eFPGAs they represent the system
interface of signals routed into the macro. Based on external requirements, these blocks
allow a certain number of signals to enter the array (IO capacity) and are placed at specific
positions on the array perimeter (IO placement). The array is further characterized by the total
number of tiles, which limits the size of possible user circuits, and the aspect ratio of the tile
arrangement. The latter has to be considered during top-level integration of eFPGA macros.
Routing channels are uniformly occupied by a fixed number of routing tracks and accordingly
have a channel width W. This is a crucial parameter as the fixed supply of routing tracks
directly influences the quality of mapped circuits and has to be determined at design time
of the fabric. It also critically influences the design of the SBs, which are placed at inter-
sections of vertical and horizontal channels and allow routed signals to change direction.
Historically, routing tracks were driven by tri-state buffers to implement bidirectional routing
with signals traveling in both directions of a track. This so-called multi-driver routing allowed
identical treatment of cluster inputs and cluster outputs. In modern FPGAs, this design
style has been replaced with single-driver routing and unidirectional routing tracks. A direct
comparison has suggested that single-driver routing is superior in most aspects [LLTY04].
Due to the single-driver topology chosen in this work, SBs also act as entry points for the
logic cluster outputs. To feed the logic cluster with inputs from the global routing network,
CBs are placed at the vertical and horizontal edges. Restricting tiles to connections with one
SB and two CBs represents a special case. Better routability could be achieved by allowing
connections of the logic cluster to all adjacent channels; however, this increases design and
modeling effort, since tiles at the edges would violate the symmetry.
Figure 2.2: Island-style global routing network and fundamental parameters
Figure 2.3: Uniform segmented routing with W = 12, L = 3, and no depopulation; only
horizontal routing and taps for one direction are shown
Another important parameter, the segment length L, defines how many tiles a certain
segment can span. Segmented routing reduces the number of multiplexers per SB
and increases the number of direct connections between SBs. Symmetry of the tile is pre-
served by staggering track endpoints with respect to the segment length as shown in figure
2.3. The figure also shows the internals of the highlighted SB to visualize how additional
connections are tapped off from passing routing tracks and how logic cluster outputs are
integrated. To further reduce the size of multiplexers inside SBs, not all possible connec-
tions from passing tracks have to be implemented. This so-called depopulation is especially
advantageous for circuits that do not require full routing flexibility. A similar method can be
used to reduce the number of tracks that are connected per cluster output. Finally, elabo-
rate routing networks employ segmentation with multiple segment lengths at once. This
mirrors the idea of signal hierarchies but requires significant design effort and again breaks
symmetry for tiles near the perimeter. Looking at one particular track inside an SB, it is also
necessary to specify how flexible the SB is, i.e., in how many directions the signal can be
routed and which start points within the channel it can reach. The number of directions
is described by the switch block flexibility FS and the pattern by the switch block topology.
Previous research has identified Wilton switch blocks [Wil97] with flexibility of FS = 3 to be
appropriate, hence these values are used here. A detailed study of switch blocks and their
connection patterns, although targeted at multi-driver architectures, is described in a book
by Lemieux and Lewis [LL04].
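The staggering of segment endpoints described above can be sketched for the tracks of a single direction. The snippet is an illustration only: the function name, the track count, and in particular the indexing convention that decides which track starts at which tile are assumptions and are not taken from figure 2.3.

```python
def segment_starts(num_tracks, seg_len, num_tiles):
    """Map each tile position to the tracks whose segments start there.

    Assumed convention (not from the text): track t starts a new segment
    at every tile x with (x + t) % seg_len == 0.
    """
    return {
        x: [t for t in range(num_tracks) if (x + t) % seg_len == 0]
        for x in range(num_tiles)
    }

# Illustrative numbers: 12 tracks of one direction, segment length 3.
starts = segment_starts(num_tracks=12, seg_len=3, num_tiles=6)
starting_per_tile = {x: len(tracks) for x, tracks in starts.items()}
passing_per_tile = {x: 12 - n for x, n in starting_per_tile.items()}

print(starts[0], starts[1], starts[2])
```

With this convention every tile sees the same number of starting and passing tracks, which is what keeps the tile symmetric. The passing tracks are the candidates for additional taps, and depopulation would connect only a subset of them.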
2.2.2 LOCAL ROUTING AND LOGIC ELEMENT
Just as important as the global routing network are the structures and the parameters that
describe the clustered logic elements. The cluster structure assumed here, together with
the most important parameters, is shown in figure 2.4.
Figure 2.4: Routing inside the logic cluster and structure of logic elements
Clustering N LEs together allows local feedback between elements without requiring ac-
cess to the slow global network. Each LE has K inputs and two separate outputs. Any
combinational logic function with a fan-in less than or equal to K can be implemented with
the K-LUT and its output can be fed into the flip-flop to implement sequential logic. Nevertheless,
this structure with one single-output K-LUT and a flip-flop is very primitive compared
to state-of-the-art commercial FPGAs, which contain fracturable LUTs with multiple
outputs, sophisticated carry and shift logic, and multiple flip-flops. The input of each logic
element is driven by a crossbar that implements individual selection of K inputs for each
logic element from the available signals, which consist of global inputs and feedback from
local LE outputs. As before, routing flexibility may be unnecessary for specific applications
and hence reducing the flexibility of the crossbar Fc,local can provide another option for area
reduction [LL01]. To route signals into the cluster, I inputs are provided by the CBs. The
number of cluster inputs has to be chosen such that all logic elements can be fed with suf-
ficient global signals. Providing too few inputs reduces the effectiveness of clustering and
forces the routing algorithm to leave logic elements unused. Early research has empirically
derived the following rule of thumb to ensure high utilization (>98 percent) of logic elements
[Ahm01]:
\[ I = \frac{K}{2} (N + 1) \tag{2.1} \]
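As a quick numerical check of equation 2.1, the following sketch evaluates the rule for a few cluster configurations and compares the result with the K times N inputs that would be needed if no LUT inputs were shared or fed back locally. The function name and the chosen (K, N) pairs are illustrative assumptions and not values selected in this thesis.

```python
def cluster_inputs(K, N):
    """Rule-of-thumb number of cluster inputs, I = K/2 * (N + 1)."""
    return K * (N + 1) / 2

for K, N in [(4, 4), (4, 10), (6, 8)]:
    full = K * N  # inputs needed if every LUT input came from outside the cluster
    print(f"K={K} N={N}: I={cluster_inputs(K, N):g} instead of {full}")
```

For an odd K combined with an even N the rule yields a non-integer value; how to round in that case is not specified by the formula itself.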
The number of cluster inputs directly defines the number of all multiplexers within CBs of a
tile, where each multiplexer allows a fraction Fc,in of all routing tracks (including those with-
out endpoints at this tile) to connect to one of the cluster inputs. Likewise, Fc,out describes
the fraction of routing tracks that can be reached by a cluster output.
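The relation between I, W, and the connection block flexibilities can be turned into a rough sizing sketch. It assumes that fractional results are rounded up to whole tracks, which the text does not specify, and all parameter values below are hypothetical.

```python
import math

def connection_block_muxes(I, W, fc_in):
    """Return (number of input multiplexers per tile, fan-in of each multiplexer)."""
    # Assumption: round up to whole tracks; the epsilon guards against float noise.
    return I, math.ceil(fc_in * W - 1e-9)

def tracks_per_output(W, fc_out):
    """Number of routing tracks that one cluster output can reach."""
    return math.ceil(fc_out * W - 1e-9)

# Hypothetical architecture: 22 cluster inputs, channel width 60.
muxes, fan_in = connection_block_muxes(I=22, W=60, fc_in=0.25)
reach = tracks_per_output(W=60, fc_out=0.5)

print(muxes, fan_in, reach)
```

The sketch makes the area tradeoff visible: the number of multiplexers grows with I, while the size of each multiplexer grows with the product of Fc,in and W.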
The previously described logical architecture is restricted to homogeneous FPGAs, which
are constructed from a number of identical tiles with identical function. The primary motivation
for this restriction is that it not only enables a homogeneous global routing network, described
by only a few numerical parameters, but also simplifies the task of FPGA design and layout.
Recent architectures, studied in research or present in commercial devices, contain
more advanced fabrics with heterogeneous components such as memory blocks, dedicated
multipliers, and hardware support for IO protocols. Also, fabrics employed in the systems
presented in section 2.1.2 have extended logic element structures and local interconnects
to explicitly support arithmetic operations. However, integration of heterogeneous blocks
into the global routing network or arithmetic enhancements of the logic elements require a
thorough understanding of the tradeoffs and interdependencies within the baseline homogeneous
architecture. Only a limited number of design decisions can be guided by intuition
or designer experience and the vast design space spanned by the previously described log-
ical parameters necessitates a comprehensive exploration.
3 FPGA ARCHITECTURE EXPLORATION
Faced with a large set of possible logical architectures and no guidance to select them in
the context of a 22 nm process technology, a comprehensive design space exploration was
conducted. The goal was not only to find architectures suitable for a tight coupled instruc-
tion set extension, but also to characterize them regarding implementation cost and perfor-
mance. This chapter presents the methodology for this exploration including the employed
CAD tools, transistor level implementations, the area model, and the selection of logical
parameters. Also, results are presented and discussed by comparing them with previous
explorations.
3.1 PRELIMINARIES
While analytical models have been proposed for a similar task [DLW+11, HWY+09], their
approach was unsuited for the chosen process technology. Hence, SPICE simulations are
regarded here as the most accurate form of estimating performance. Nevertheless, the analytical
models provide a helpful reference for general insights and discussion of parameter
interdependencies. Only
a few comprehensive explorations have been performed to study the impact of the most
important logical parameters on FPGA performance and area efficiency. Results from these
studies are quoted frequently to justify architectural decisions without respecting the con-
text in which these results have been obtained. Table 3.1 lists important details of the most
commonly referred to design space explorations for logical FPGA architectures.
Table 3.1: Comparison of design space explorations for logical FPGA architectures
Technology node (nm)   Routing topology   Parameter variation       Benchmarks                            CAD flow                     Reference
350                    multi-driver       N, K                      20 largest MCNC                       SIS, FlowMap, VPR            [AR00]
180                    multi-driver       N, K                      20 largest MCNC + 8 computer vision   SIS, FlowMap, VPR            [AR04]
90 (1)                 single-driver      N, K, L                   20 largest MCNC                       SIS, FlowMap, VPR, HSPICE    [KR09]
22                     single-driver      N, K, L, Fc,in, Fc,out    16 custom                             Odin II, ABC, VPR            This work
(1) Multiple technologies studied using predictive models [CSO+00].
For the purpose of this study, modern CAD tools, an advanced process technology, and a
set of custom benchmarks, representative of typical datapath circuits, were used. Results
from previous explorations are not applicable in the context of this thesis, since they used
benchmarks and cost functions aimed at general purpose applications. All previous studies
listed in table 3.1 relied on SIS [SSL+92] and FlowMap [CD94] for logic optimization and
technology mapping, both of which have since been superseded. Furthermore, these stud-
ies only reported results in terms of area or delay, but did not include critical information
on configuration overhead or circuit topologies. Still, their results are useful to prioritize
the selection of parameters according to their interdependencies. Additionally, they have
established certain guidelines on how to define and measure a figure of merit for FPGA
architectures. Also, they provide reference results that can be used to compare the relative
impact of a parameter on that figure.
3.1.1 CAD TOOLS
Figure 3.1 gives an overview of the steps and CAD tools that have been used during the
exploration. Most of these tools were afterwards reused to map circuits of custom instruc-
tions onto the final FPGA architecture.
Figure 3.1: Extended Verilog to Routing CAD flow, original steps highlighted
The majority of this flow relies on the VTR project [LGW+14], which combines various
open-source tools to provide a complete flow that maps a user circuit, described with Ver-
ilog, onto a parameterizable FPGA architecture.
Odin II [JKGS10] is used to parse the RTL description of the user circuit, synthesize high-level
expressions, and produce a netlist in the Berkeley logic interchange format (BLIF) that
can be read by subsequent tools. Unfortunately, Odin II has proven to be a major impediment.
First, it only supports a very limited subset (1) of Verilog, which makes it hard to reuse
already designed and verified components or to develop new RTL circuits. As this thesis
deals with mostly small and manageable circuits, these restrictions are regarded merely
as annoyances. However, it might be impractical to describe circuits in a supported for-
mat once larger designs and fabrics with higher capacity are used. Second, since Odin II
also supports detecting arithmetic functions and replacing them with heterogeneous blocks, it
produces poor synthesis results if the target architecture does not contain such elements
(especially multipliers). Other researchers have identified similar problems with Odin II [Hun15]
or sidestepped them using commercial tools for front-end synthesis [MWL+15]. Third, and
more dramatically, it is easy to describe circuits that will lead to synthesis mismatch, that is,
Odin II produces a netlist that is functionally incorrect. Synthesis mismatch is also possible
with other tools, as Verilog purposely allows non-synthesizable constructs; however, the
example in listing A.1.4 is extremely simple and unambiguous. The discrepancy is directly
apparent from the synthesized BLIF netlist in listing A.1.5 and has also been verified with a
correctly synthesized netlist produced by a commercial tool (2). To nonetheless make use of
the VTR flow and ensure correct behavior in this work, user circuits were carefully designed
and all application benchmarks were verified with the approach described in section 5.2.
The next step in the CAD flow consists of logic optimization and technology mapping to
K-LUTs, which is accomplished using ABC [Ber16] and a corresponding command script.
ABC is a framework for analyzing and modifying sequential logic circuits, predominantly us-
ing and-inverter graphs, and can be regarded as the successor of SIS. It provides a large
variety of algorithms for logic optimization, technology mapping, and formal verification. Re-
grettably, complete end-to-end verification of the post-routing netlist with the RTL descrip-
tion using ABC was not possible, since the circuit first had to be synthesized by Odin II.
Using a hybrid academic and commercial flow, it has been shown that ABC on its own is
capable of producing high-quality results, on a par with a modern commercial FPGA CAD
suite [VKF15].
The mapped BLIF netlist produced by ABC and a description of the FPGA architecture is
afterwards fed into the versatile place and route tool [BR97]. Since its introduction, VPR
has become the de facto standard platform for research on island-style FPGAs. Due to its
permissive and open-source licensing, many extensions have been contributed to improve
its flexibility and to improve its support for modern heterogeneous FPGA architectures. A
flexible description language [LAR11], based on XML, allows description not only of the logical
FPGA architecture but also of configurable and physical characteristics of fundamental
building blocks. This description is used during packing to group LUTs from the netlist into
logic clusters and to determine the assignment of local interconnect resources. Based on
the amount of occupied clusters, a square array of tiles with sufficient capacity is automat-
ically determined by VPR. A user-defined placement of IO resources can be provided to
specify the location of design signals on the array perimeter. Otherwise, the assignment of
IO blocks is automatically determined by the placement algorithm. Afterwards, the packed
netlist is placed onto the array with a simulated annealing algorithm. During timing-driven
placement, VPR assigns each packed cluster a position within the array while trying to min-
imize the cost in terms of delay, estimated from distances between tiles [MBR00]. Once a
legal placement has been determined, VPR proceeds with timing-driven routing, using the
Pathfinder algorithm [ME95]. This procedure first routes the design without considering
congestion and then repeatedly rips up and reroutes critical connections. Convergence towards
a feasible routing is achieved by gradually increasing the cost weight for congestion.
Crucial for congestion and the success of routing is the channel width W , which is either
1Unsupported: module parameters, generate statements, variable bit shifts, multi-dimensional wires, arrays
of instances, combinational overriding, complex expressions within concatenations.
2Synopsys Design Compiler I-2013.12
3.1 Preliminaries
provided by the user or automatically determined by VPR with multiple routing trials.
Finally, timing analysis on the routed design is performed and a timing annotated post-rout-
ing netlist is generated together with a detailed report on cluster usage, channel occupation,
and critical path length. The post-routing netlist can subsequently be used to inspect the
behavior of the routed design with a conventional Verilog simulator. Fine-tuning of all steps
within VPR is possible by overriding a set of default options. Additionally, to allow experi-
ments with different implementations of the employed algorithms, it is possible to intercept
the output of all stages and to provide inputs from other tools. Access to VPR’s internal rep-
resentation of the routed netlist makes it possible to generate configuration bitstreams for
specific physical implementations of an FPGA architecture. Bitstream generation for se-
lected Xilinx architectures is possible with the extension presented in [HEW13], however
routing and bitstream generation still rely on tools from the vendor.
When VPR was first introduced, no detailed physical models of FPGAs were available so
that many crude sizing rules were incorporated to reflect the influence of different logical
architectures on performance or occupied area. The built-in models
and assumptions within VPR are described in detail by its original authors in a book [BRM99]. In
particular, they developed the minimum-width transistor area (MWTA) model to estimate the
required area for full custom layouts of different FPGA leaf cells. This model estimates total
area by counting scaled versions of a transistor with minimum width. For loads that require
stronger drivers, a linear relationship between driving capability and the width of a transistor
is established. Furthermore, it assumes that the layout area is dominated by the active area
of transistors and that additional area for routing can be neglected. MWTAs are frequently
used to express the size of FPGA tile components in a technology independent unit. In
the current version of VPR, the architecture description provides options to override most
of those built-in assumptions with absolute timings or concrete values of occupied area.
However, it still contains hidden and poorly documented assumptions that try to estimate
transistor sizes, especially for routing tracks, with unjustified load models and assumptions
about the size of memory cells.
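The counting principle behind the MWTA model can be illustrated with a short Python sketch. It assumes the commonly quoted linear form in which a transistor with x times the minimum drive strength is counted as 0.5 + 0.5x minimum-width transistor areas. The coefficients, the function names, and the example inverter are assumptions made for illustration and are not taken from this thesis.

```python
def mwta(drive_strength: float) -> float:
    """Area of one transistor in minimum-width transistor areas (MWTAs).

    Assumed linear form: a minimum-width device (drive_strength = 1.0)
    counts as exactly one MWTA, larger devices grow linearly.
    """
    if drive_strength < 1.0:
        raise ValueError("drive strength is normalized to the minimum-width device")
    return 0.5 + 0.5 * drive_strength


def circuit_area_mwta(drive_strengths) -> float:
    """Total area of a circuit as the sum over all of its transistors."""
    return sum(mwta(x) for x in drive_strengths)


# Hypothetical inverter: minimum-width NMOS and a PMOS of twice that width.
inverter_area = circuit_area_mwta([1.0, 2.0])
```

Because only active area is counted, such an estimate contains no term for contacts or metal routing.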
As this exploration was intended to provide reasonable estimates of delay and area for
FPGA components in a commercial 22 nm process technology, the original VTR flow was
extended prior to the exploration phase. To reflect the physical characteristics of different
logical architectures, a detailed transistor level model was adapted from a tool for fully-auto-
matic sizing of FPGA tiles [Chi13]. COFFE, which stands for "Circuit Optimization For FPGA
Exploration", was primarily developed as an alternative to a similar tool for automatic transis-
tor sizing previously developed and described by Kuon [KR09]. Most importantly, COFFE is
capable of generating SPICE testbenches of individual tile components together with proper
loads according to the parameters of a logical FPGA architecture. This allows the tool to es-
timate delay and layout area of all necessary building blocks. Once a tile is characterized,
the defining physical properties are used to annotate the architecture description which is
subsequently used by VPR.
COFFE estimates layout area based on the width of individual transistors and a modified
MWTA model, fitted to area data obtained from layouts in a 65 nm technology. The quality
of this new model cannot be assessed objectively since the basis for fitting as well as the
factor for subsequent scaling to a 22 nm technology are not documented. Detailed modeling
of the circuit topologies allows COFFE to calculate concrete values for area overhead due
to configuration memory, where a single-bit cell is assumed to occupy 4 MWTAs. Based on
the estimated areas, COFFE attempts to include interconnect parasitics with a lumped RC
model. Wire length is estimated from simple connectivity rules and the assumption that all
cell layouts are square. Process specific data in terms of transistor models together with
sheet resistance and interconnect capacitance for the metal layers are read into COFFE
from a file.
The logical FPGA architecture is specified by an additional file containing the set of logical
parameters that describe one tile of a homogeneous FPGA. Tiles are modeled as described
in section 2.2, i.e., each tile consists of one switch block, two connection blocks and one
logic cluster. All routing channels have the same channel width and employ single-driver
routing. COFFE supports depopulation in the connection blocks and in the cluster cross-
bar. Logic elements have the architecture shown in figure 2.4 with the addition of optional
feedbacks from the local output directly into one LUT input. However, this feature is dis-
abled and not further explored in this thesis, as it introduces asymmetry in the layout of
LUTs and requires additional overhead in the form of a multiplexer and configuration mem-
ory. Additionally, COFFE’s more flexible logic elements are restricted to one local and one
global output, while the flip-flop is always driven by the single output of the LUT. Next, the
underlying transistor level implementations of the tile components are described.
3.1.2 TRANSISTOR LEVEL IMPLEMENTATION
To allow characterization of delays with SPICE simulations, COFFE also includes a library of
circuits that implement all necessary tile components. Emphasizing modularity, the library
composes larger circuits from a set of basic components, including a pi-shaped segment for
wire loads, an inverter, a level-restorer, an NMOS pass gate, and a transmission gate. Here,
the transmission gate is unused and instead pass gates are used exclusively to implement
multiplexer structures.
Figure 3.2: Logical views and two-level pass gate implementation of an 8:2:1 routing multiplexer
In fact, due to the homogeneous and symmetric nature of the FPGA tiles, multiplexers
are sufficient to implement all necessary logical components, except for flip-flops. Routing
multiplexers and LUTs represent the two different classes of multiplexer structures and
each require different circuits to ensure area efficient implementations.
Routing multiplexers can be used to directly implement connection blocks, switch blocks
and the crossbar. Configurable routing is achieved by attaching the outputs of memory cells
to the select inputs of the multiplexer shown in figure 3.2. The output signal can thus be
selected from any of the input signals by programming an appropriate configuration pattern
into the memory cells. The figure also shows the pass gate network employed by COFFE
to implement routing multiplexers. Although two-level multiplexer structures require more
configuration bits, they provide fast operation by allowing only one pass gate per stage to be
enabled by using a one-hot configuration scheme. Other schemes with fewer configuration
bits would require additional overhead for decoding or would introduce more than two pass
gates in the path of the routed signal. A special case is represented by the output selection
multiplexer of the logic element. Since the described architecture of the logic element
is very simple, only two signals need to be routed. This can be achieved with a single
configuration bit and two pass gates in parallel, assuming the memory cell provides two
complementary outputs.
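The one-hot configuration scheme of the two-level routing multiplexer can be sketched in Python as follows. The grouping of inputs (consecutive inputs share one second-level pass gate) and the rounding used to derive the two stage sizes are assumptions chosen so that an 8-input multiplexer yields the 4 + 2 select lines of figure 3.2. The function names are hypothetical.

```python
import math


def two_level_select(num_inputs: int, chosen: int):
    """Return the one-hot select vectors (level 1, level 2) that route `chosen`.

    Assumed ordering: input k sits at position k % l1 inside group k // l1,
    and each second-level pass gate collects the output of one group.
    """
    if not 0 <= chosen < num_inputs:
        raise ValueError("chosen input out of range")
    l2 = math.isqrt(num_inputs)          # number of groups (second level)
    l1 = math.ceil(num_inputs / l2)      # pass gates per group (first level)
    level1 = [int(i == chosen % l1) for i in range(l1)]
    level2 = [int(g == chosen // l1) for g in range(l2)]
    return level1, level2


def mux_output(inputs, level1, level2):
    """Evaluate the pass gate network for a given configuration pattern."""
    l1 = len(level1)
    for g, enable2 in enumerate(level2):
        for i, enable1 in enumerate(level1):
            k = g * l1 + i
            # A signal only propagates if both series pass gates are enabled.
            if enable1 and enable2 and k < len(inputs):
                return inputs[k]
    return None  # no conducting path configured
```

For eight inputs this yields six configuration bits, and the routed signal always passes through exactly two pass gates.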
To implement combinational logic, the aforementioned routing structures connect multi-
ple LUTs, each generating a single output bit in response to a multi-bit input pattern. Using
this approach, LUTs represent a direct implementation of Boolean functions and their truth tables.
By storing all possible outputs within memory cells and arranging them at the input
of a multiplexer, K-LUTs can implement arbitrary logic functions, limited only by the number
of inputs of the logic function. Figure 3.3 shows how a pass gate LUT with four logical inputs is
implemented in COFFE.
Figure 3.3: Logical and truncated physical view of a 4-LUT
In contrast to a routing multiplexer, the LUT must provide minimal delay through the select
inputs of the multiplexer. The other inputs are driven by the quasi-static outputs of the
configuration memory. Thus, a binary tree structure is used in which each select signal
drives two complementary branches of the tree. To select only one lookup value at a time,
the logical select inputs must be available in inverted and non-inverted form (e.g., ā and a).
Again, this structure was chosen so that the critical path through the select inputs contains
no unnecessary decoders.
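The behavior of such a binary tree can be modeled with a few lines of Python. This is a purely logical sketch: the ordering of the select inputs (the first select value controls the tree level closest to the memory cells) and the function name are assumptions, and no electrical effects are modeled.

```python
def lut_read(config_bits, select):
    """Evaluate a K-LUT as a binary tree of 2:1 pass gate multiplexers.

    config_bits: 2**K stored truth-table values, indexed by the input pattern.
    select:      K select values; select[0] is assumed to control the tree
                 level closest to the memory cells (least significant bit).
    """
    level = list(config_bits)
    if len(level) != 2 ** len(select):
        raise ValueError("need 2**K configuration bits for K select inputs")
    for s in select:
        # The select signal and its complement each enable one branch of every
        # pair, so exactly half of the remaining values survive per level.
        level = [hi if s else lo for lo, hi in zip(level[0::2], level[1::2])]
    return level[0]


# Hypothetical configuration: a 4-input AND stores a single 1 at pattern 1111.
and4 = [0] * 15 + [1]
```

Each select value halves the number of candidate lookup values, so K levels reduce the 2^K stored bits to a single output.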
On one hand, pass gate implementations offer very compact implementations of the
previously described structures. On the other hand, they require special signal buffering
to restore the full signal swing. The signal level at the output of pass gates using NMOS
transistors is reduced by the required threshold voltage drop across the gate and source
terminals. Multiple series pass gates, as shown in figure 3.3 for the LUT, can reduce the
signal swing below the switching threshold of subsequent stages and hence require a lev-
el-restoring buffer. Figure 3.4 depicts such a circuit with an additional PMOS pull-up. The
feedback ensures that the signal level at the input is fully restored to the supply voltage
during a rising edge. Thus, unnecessary static power draw by the inverters is avoided. At
the same time, the feedback inhibits a falling edge and hence has to be sized relatively weak
[RCN03].
Figure 3.4: Physical view of a level-restoring buffer
Both multiplexer structures either contain or are directly followed by this special buffer. The
only exceptions are the multiplexers that implement the crossbar within the cluster. Since all of
their outputs directly feed into the input stage of LUTs, the crossbar multiplexers are only
followed by a single inverter stage with a PMOS pull-up. Within the LUT input stage, three
inverters form a fork to generate the complementary select inputs.
COFFE simulates the delay of all tile components using combinations of the described
multiplexer structures and buffers. Each component is composed of the basic library elements
according to the specification through the logical parameters. Additionally, appropriate
load circuits are generated to capture the context of the component within the tile. Memory
cells that hold configuration bits are not directly modeled within SPICE simulations, since
they are not part of the critical path. However, they are crucial for overall area efficiency and
thus included in the area calculations. The flip-flop is also not critical for timing, but it is still
modeled by a representative transistor network with static sizes to account for loading
of the LUT output buffer.
The previous paragraphs only described the structural model of FPGA tiles, but COFFE
also includes a method to size transistors in response to different logical architectures. As
each architecture represents unique loads to different components, a large set of appro-
priate transistor sizes needs to be found. The premise of COFFE is to use a purely sim-
ulation-based approach that automatically searches for optimal sizes according to a cost
function. As COFFE is currently the only open implementation of such an automatic transis-
tor sizing for FPGA tiles (the tool developed by Kuon was not published), it was evaluated
with the commercial 22 nm technology library used here. Due to external restrictions, using
this library was only possible on a platform shipping Cadence Spectre instead of HSPICE,
which was originally used by COFFE’s author. Using a 22 nm PTM library3, delay results of
both simulators on a different platform were verified to be virtually identical. Additionally, a
much more realistic method to estimate layout area, as explained in section 3.1.3, was intro-
duced. Results of the evaluation quickly showed that COFFE’s main functionality, namely
fully-automatic transistor sizing, was unusable in the setting of this exploration.
Characterization of a single architecture using COFFE took longer than the 4 hours origi-
nally reported by its author. As table A.1.2 shows, the fastest simulation took over 20 hours
using Spectre. As the total runtime of COFFE is dominated by time spent on SPICE sim-
ulations, table A.1.3 lists the results of a followup evaluation on a different platform using
HSPICE. The cause for the overhead in runtime is thus believed to be related to the per-
formance of the simulator, which is further substantiated by the observation that transient
3http://ptm.asu.edu/
multipoint analyses with Spectre are performed increasingly slower for a large number of
points. Even with a consistent runtime of 4 hours per architecture and a modest selection of
120 architectures, the total runtime of 20 days would be impractical. More importantly, the
results show that, except for one architecture, the employed sizing algorithm is not capable
of reducing the cost as measured by the built-in cost function. Regardless of the cause,
it renders COFFE’s automatic sizing unusable in the environment of this thesis, which re-
quires all three components: a different SPICE simulator, a more realistic layout area model,
and a foundry supplied process library.
COFFE’s internal optimization is driven by equalizing the rise and fall times of all inverters.
However, this does not necessarily minimize the delay nor is there any functional require-
ment for equally fast transitions. Inspection of sizing results revealed that this approach
chose inverters with excessively large sizes and heavily skewed size ratios between NMOS
and PMOS transistors. A much better goal would allow unequal rise and fall times and in-
stead try to minimize the average delay of the inverters. Additionally, while sizing individual
subcircuits, new results are not propagated until the sizing for all subcircuits is complete.
Therefore, multiple time-consuming iterations are required and there is no guarantee of avoiding
feedback loops due to interdependencies between tile components. Finally, the cost
function, which determines the termination of the algorithm and evaluates the quality of a
particular sizing solution, is based on a built-in representative critical path through the tile.
Each component in this path is given a weight that describes how critical this component is to
overall performance and how important it is to size it correctly. Determining these weights
is non-trivial and a fundamental problem of designing FPGA architectures. Due to the
configurable and flexible nature of the involved components, the critical path through the tile
is application dependent and thus unknown at design time. It is unclear how these weights
were determined for COFFE and how they should be determined for extensions of the ar-
chitecture description (e.g., LUT sizes or cluster architectures). Kuon and Rose estimated
these weights by using VPR to map 20 benchmark circuits onto a baseline architecture
and by extracting the relative influence of the components on the critical path [KR09]. This
approach still seems non-ideal, as VPR is mostly timing-driven and hence sensitive to both
the benchmarks and the baseline architecture.
A comprehensive design space exploration has more relaxed requirements, since the
main goal is to rank logical architectures and to understand what impact a particular logical
parameter has on performance and area efficiency. As an alternative to automatic transistor
sizing, this thesis relies on the method of logical effort to size transistors. While the sizing
method does not guarantee optimal sizes, it is still capable of finding reasonable sizes in
response to different loads. It is expected that the results found with this method are very
close or identical to what would have been found with optimally sized transistors. Although
the final sizing solution has potential for manual optimization, it is assumed that all other
considered cases have a similar potential.
The method of logical effort [SSH99] uses a first-order RC delay model4 to describe and
minimize delay along a path in a circuit, given the number of stages and for each stage
the logical effort, the electrical effort, and the parasitic delay. Electrical effort h is calculated
as the ratio of the effective capacitance at the output of a stage to its input capacitance,
while the logical effort g is an inherent parameter of each stage and is calculated
from the ratio of its driving capability and that of a unit inverter. For a stage that drives
multiple copies of itself, electrical effort is directly described by the more common notion
of fan-out. Parasitic delay p describes the delay of a stage due to its own parasitic output
capacitance, i.e., the delay without any external loads. All delay expressions are normalized
to the absolute delay τ of a unit inverter in the considered technology to express a circuit’s
characteristics with process independent values. An estimate for the normalized total delay
4Most importantly, it neglects the effect of different signal slopes between the input and the output.
D along a path of S stages is expressed as:
$$D = \sum_{i=1}^{S} g_i h_i + \sum_{i=1}^{S} p_i \tag{3.1}$$
Since the second term is independent of transistor sizes, minimizing D only requires finding
sizes that minimize the first term, which is achieved by balancing the path effort equally among
all stages. The optimum effort fopt borne by each stage is found with:
$$f_{\mathrm{opt}} = \sqrt[S_{\mathrm{opt}}]{F} \quad \text{with the path effort} \quad F = \prod_{i=1}^{S} g_i h_i \tag{3.2}$$
The optimum number of stages can be approximated with:
$$S_{\mathrm{opt}} = \log_4(F) \tag{3.3}$$
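Equations 3.1 to 3.3 can be evaluated with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, and the example path (three inverters with g = 1, p = 1 and a stage effort of 4) uses textbook-style placeholder values rather than parameters extracted for this thesis. The stage count passed to `optimum_stage_effort` is the number of stages actually present in the path.

```python
import math


def path_effort(g, h):
    """Path effort F as the product of the stage efforts g_i * h_i (eq. 3.2)."""
    return math.prod(gi * hi for gi, hi in zip(g, h))


def optimum_stage_effort(F, stages):
    """Effort borne by each stage when F is balanced over `stages` stages."""
    return F ** (1.0 / stages)


def optimum_stage_count(F):
    """Approximate optimum number of stages, S_opt = log4(F) (eq. 3.3)."""
    return math.log(F, 4)


def normalized_delay(g, h, p):
    """Normalized path delay D in units of tau (eq. 3.1)."""
    return sum(gi * hi for gi, hi in zip(g, h)) + sum(p)


# Placeholder path: three inverters, each with an electrical effort of 4.
g, h, p = [1, 1, 1], [4, 4, 4], [1, 1, 1]
F = path_effort(g, h)
```

For this placeholder path, the path effort of 64 is already balanced, so three stages coincide with the approximate optimum stage count.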
As COFFE is restricted by its fixed circuit structure, the number of stages cannot be chosen
freely and thus remains static for each architecture considered in this work. An approach
that would allow buffers with a flexible number of stages according to the path effort would
have to consider the requirement that the signals should not be inverted regardless of their
routing. Nevertheless, the authors of the book which describes the method of logical effort,
found that it is not necessarily required to choose exactly Sopt stages and that values in its
vicinity also achieve reasonable results [SSH99]. Equation 3.3 is still useful to evaluate the
discrepancy between the designed stages and an optimal design. Finally, individual sizes
are calculated starting at the last stage and by assuming that the input capacitance Cin is
proportional to the size of each stage:
$$C_{\mathrm{in}} = C_{\mathrm{out}} \frac{g_i}{f_{\mathrm{opt}}} \tag{3.4}$$
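The backward sizing pass described by equation 3.4 can be sketched in Python as shown below. The function name and the example values (a load of 64 unit capacitances driven by three inverters at a stage effort of 4) are hypothetical and serve only to illustrate the recurrence.

```python
def size_path(c_load, g, f_opt):
    """Return the input capacitance of every stage, first stage first.

    Works backwards from the load with C_in = C_out * g_i / f_opt (eq. 3.4);
    the input capacitance of a stage is the load of the stage before it.
    """
    c_in = []
    c_out = c_load
    for gi in reversed(g):
        c_out = c_out * gi / f_opt   # input capacitance of the current stage
        c_in.append(c_out)
    c_in.reverse()
    return c_in


# Placeholder example: three inverters (g = 1) driving 64 unit capacitances.
sizes = size_path(64.0, [1, 1, 1], 4.0)
```

Since the input capacitance is assumed to be proportional to the size of a stage, the resulting list directly gives the relative transistor sizes along the path.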
The method fundamentally relies on the assumption that the delay of each stage can be
described as a linear function of its electrical effort. Therefore, SPICE simulations were per-
formed with a set of calibration circuits to validate this behavior and to extract the necessary
parameters, i.e., gi , pi and τ . Once these parameters were determined, all components in
the FPGA tile could be split up into stages and sized according to the method of logical
effort. As discussed, pass gates have unequal delays for the different transitions and were
thus characterized in conjunction with the following inverter. Figure 3.5 shows the circuits
that were used to characterize this structure and a conventional inverter.
Figure 3.5: Circuits used to extract delay parameters for the pass gate structure (left)
and an inverter (right)
Considering only enabled pass gates, the shown pass gate structure appears both as part
of the routing multiplexers as well as within the binary tree of the LUT. A second inverter,
shown in figure 3.5, is also simulated to separately characterize its behavior. Additionally,
this circuit was used to determine the effect of loading inverters with disabled, partially
disabled, and fully enabled series pass gates to represent the different load conditions for
buffers that drive inputs of multiplexers. Characterizing these two basic circuits was thus
sufficient to describe the delay of all critical components within COFFE as a combination of
multiple stages. Since the selected circuits produced different delays for both transitions,
the parameters for the linear delay model were extracted from the average delays in figure
3.6. All delay results, for the characterization as well as for COFFE, refer to the propagation
delay or more specifically, the time span between the point when the input level crosses
50 percent and the point when the output level crosses 50 percent.
Figure 3.6: Simulated and modeled average delays for the inverter and the pass gate
structure of figure 3.5
As seen from figure 3.6, the linear delay model based on the extracted parameters (dashed
lines) closely resembles the values reported by the SPICE simulation. A minor discrepancy
is observed for the pass gate structure for a fan-out of one, introduced by the relative size
of the PMOS pull-up. As the second stage of the buffer is usually sized sufficiently large
and hence provides a larger electrical effort, this effect can be ignored. It is important
to emphasize that the extracted parameters and the linear delay model were exclusively
used to find appropriate sizes according to the electrical effort of each stage and its driving
capability. Final delay characterization for individual tile components was still performed
using a SPICE simulator.
Finally, to break the loop of interdependencies between components and to avoid the
introduction of a representative path for delay optimization, a further simplification was in-
troduced. For all described implementations of multiplexers, the area budget is dominated
by pass gates. Hence, it is favorable to restrict sizing to the buffers that drive them instead
of investing area into larger pass gates. Pass gates within multiplexers were thus restricted
to minimum width. This simplified sizing for the preceding stage as the load capacitance
at its output could be estimated directly from the number of connected multiplexer inputs
and an assumption of the most common configurations. This combination of sizing buffers
for delay and multiplexers for area is believed to provide fairly balanced results, without
sacrificing excessive area for a minor reduction in delay.
To perform the design space exploration without COFFE’s original sizing method, its core
was heavily modified to allow direct sizing based on logical effort. Each circuit was separated
into the characterized stages and annotated with the appropriate quantities to find sizes
that minimize delay under the given constraints. Additionally, the modeling of LUTs was
extended to allow 2 ≤ K ≤ 6. Furthermore, a number of corrections regarding the area
calculation as well as the calculation of fan-out and the number of elements per tile were
performed. The following paragraph briefly summarizes these calculations.
Starting at the cluster input, each of the I multiplexers within the connection blocks has
W·Fc,in inputs and its buffer is connected to KN different inputs of multiplexers within the
crossbar. The crossbar itself consists of KN multiplexers, each with I + N inputs. Each
crossbar multiplexer drives one input of the N logic elements; however, the inverter fork is
treated as a separate stage and hence the buffer of the crossbar multiplexer has to drive
the input of two inverters. Driving the complementary inputs of the LUT, each arm of the
fork has to drive the gate terminal of 2^(u−1) pass gates, where u depends on the position
of the select input. For example, u = 4 for the logical select input a of figure 3.3. Finally,
each logic element has separate output buffers to account for dissimilar fan-out of the local
and global outputs. The local output buffer drives KN different multiplexer inputs of the
crossbar. Determining the fan-out of the global output buffer is more involved as segmented
routing influences the number of start points. Hence, each SB contains W/(2L) multiplexers
per direction, totaling 2W/L switch block multiplexers per tile5. Uniform routing in all four
directions thus requires each global LE output to drive (W/(2L))·Fc,out SB inputs. Within a switch
block, three distinct sources determine the number of inputs per multiplexer. First, each
start point is fed by FS corresponding endpoints of a segment. Second, (FS − 1)(L − 1)
connections are tapped off from segment midpoints of tracks that pass this switch block.
Third, as this thesis interprets Fc,out as the fraction of available routing tracks (i.e., start
points), each SB multiplexer is fed additionally from N·Fc,out cluster outputs. Following the
popular choice of selecting Fc,out to be equal to 1/N leads to one additional input due to cluster
outputs. This is also the enforced minimum for smaller output flexibilities or smaller cluster
sizes. Only determining the number of inputs for routing multiplexers is not sufficient, since
the two-level pass gate implementation also requires a specification of the topology. Based
on the number of required inputs R, COFFE chooses the number of pass gates in each
stage, L1 and L2 respectively, according to:
$$L_2 = \sqrt{R} \tag{3.5}$$
$$L_1 = \left\lceil \frac{R}{L_2} \right\rceil \tag{3.6}$$
Rounding in both cases ensures L1·L2 ≥ R, but can lead to factorizations that leave some
inputs unused. Clearly, this approach does not minimize area in all cases6 and it is not
directly evident whether this choice minimizes delay.
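The input counts and the topology rule summarized above can be sketched in Python as follows. Several details are assumptions: the square root in equation 3.5 is rounded down (which reproduces the 4 + 2 select lines of the 8-input multiplexer in figure 3.2), fractional input counts are rounded to the nearest integer, and all function names and example parameters are hypothetical.

```python
import math


def mux_topology(num_inputs):
    """Two-level pass gate topology for a routing multiplexer with R inputs."""
    if num_inputs < 1:
        raise ValueError("a multiplexer needs at least one input")
    l2 = math.isqrt(num_inputs)              # eq. 3.5, assumed rounded down
    l1 = -(-num_inputs // l2)                # eq. 3.6, integer ceiling of R / L2
    return {"L1": l1, "L2": l2,
            "config_bits": l1 + l2,          # one-hot select lines per stage
            "unused_inputs": l1 * l2 - num_inputs}


def mux_input_counts(W, L, N, I, fc_in, fc_out, fs):
    """Number of inputs per connection block, crossbar and switch block mux."""
    cb_inputs = round(W * fc_in)
    crossbar_inputs = I + N
    # Segment endpoints, midpoint taps, and at least one cluster output.
    sb_inputs = fs + (fs - 1) * (L - 1) + max(1, round(N * fc_out))
    return {"cb": cb_inputs, "crossbar": crossbar_inputs, "sb": sb_inputs}


# Hypothetical architecture parameters, chosen only for illustration.
counts = mux_input_counts(W=64, L=4, N=8, I=20, fc_in=0.5, fc_out=0.125, fs=3)
topology = {name: mux_topology(r) for name, r in counts.items()}
```

Under the stated rounding assumption, the 32-input connection block multiplexer of this example is mapped to a 7 × 5 structure that leaves three inputs unused, which illustrates the inefficiency noted above.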
Equipped with the discussed sizing rules, calculations to determine fan-out, and flexible
circuit descriptions, the modified version of COFFE can directly size and characterize FPGA
tiles from a specification of logical parameters, without resorting to numerous time-consum-
ing SPICE simulations. Following the same goal of reducing simulation time, but using a
different strategy, an alternative tool was introduced recently [ZLO+16]. It restricts model-
ing to the logic cluster and uses COFFE’s pass gate implementations to generate a library
of tile components using a commercial standard cell characterization tool. Apparently, this
5Single-driver routing restricts W to integer multiples of 2L.
6A better approach would consider all possible factorizations and the size of memory cells for configuration.
step is not based on actual layouts and hence neglects parasitics. Once this component
library is generated, specification of the logical architecture is sufficient to synthesize and
simulate a tile model. Using a commercial synthesis tool allows automatic selection of cell
sizes and advanced buffer sizing and insertion. However, models are restricted to a 65 nm
process technology and access is only possible via a web interface. Although in principle
suited for a comprehensive exploration, many details and assumptions of the model that
critically influence the performance of tile components are thus hidden.
3.1.3 AREA MODEL
Clearly, the original MWTA model was born out of necessity and originally might have been
reasonable for process technologies of that time. And although this model was long sus-
pected to be inaccurate, which is further substantiated by recent evidence [KY16], it was
adapted and reused in many previous explorations. While comprehensive explorations cannot
afford to lay out multiple complete FPGA tiles and hence require models that are convenient
to use, quality and validation of the model must be the decisive factor. One major
problem of the MWTA model is that it explicitly ignores transistor contacting or overhead
due to metal routing. Furthermore, the model is applied uniformly to all components, with-
out regard to their unique functions or connectivity. However, layouts in modern processes
are heavily constrained by a vast number of strict design rules, especially regarding con-
tact spacing and arrangement of irregular structures. Design rules are primarily optimized
for standard cell designs, which represent the major design style for modern CMOS processes.
Indeed, full custom design of small cells in the context of this work is practically
limited to very few feasible layouts. As an alternative to the MWTA model, a custom area
model was developed to estimate the area of layouts compatible with a nine-track standard
cell grid in the targeted 22 nm process technology.
In this model, sizing of transistors was restricted to scaling the width of individual tran-
sistors, which all share the same minimum length. Restricted by the cell height, transistors
with a width above a certain maximum width are laid out by connecting multiple gate fingers
in parallel, effectively creating a transistor of the required width. This leads to discrete steps
in the occupied area as the width of a transistor is increased. Although the individual fingers
are not necessarily restricted to identical widths, the model makes the simplification that all
fingers are of maximum width. This was directly implemented in the sizing calculations by
restricting results to integer multiples of the maximum finger width. Based on fundamental
design rules that constrain the arrangement of NMOS and PMOS transistors, two cases
have to be considered. In the first case, the layout only consists of NMOS transistors, for
example, two-stage routing multiplexers. For circuits with only a few transistors, design
rules impose inefficient layouts with single-height cells in which half the area is reserved for
PMOS transistors. Above a certain threshold, it is more efficient to use double-height cells
that can utilize the complete cell area. The threshold exists due to dummy cells that are re-
quired in double-height cells for isolation to adjacent cells. Hence, ignoring the overhead for
dummy cells, double-height cells are twice as effective for layouts containing only NMOS
transistors. The second case is represented by layouts that contain combinations of NMOS
and PMOS transistors, e.g., buffers. For these circuits, PMOS transistors usually dominate
in terms of width (due to reduced carrier mobility) and are hence decisive for estimating
the required area. Also, double-height cells using both transistor types provide no benefit in
terms of area for these circuits. In both cases, the number of effective gate pitches, i.e., the
vertical dimensions of a cell, can be estimated from summing over the maximum width of
each transistor pair, corrected by one gate pitch required for spacing of dissimilar structures.
This correction ensures accurate results for small cells and for very irregular cells. Given the
absolute gate pitch wgate and the absolute height of one cell hcell , the area for single-height
layouts can be estimated in both cases with:
$$A_{\text{single-height}} = h_{\mathrm{cell}}\, w_{\mathrm{gate}} \sum_i \left( \max(w_{N_i}, w_{P_i}) + 1 \right) \tag{3.7}$$
Here, wNi and wPi represent the individual widths of NMOS and PMOS transistors, respectively,
normalized to the maximum finger width. For double-height layouts containing only
NMOS transistors, the area can be estimated using:
$$A_{\text{double-height}} = A_{\mathrm{dummy}} + w_{\mathrm{gate}} \frac{h_{\mathrm{cell}}}{2} \sum_i \left( w_{N_i} + 1 \right) \tag{3.8}$$
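Both estimates translate directly into code. The following Python sketch evaluates equations 3.7 and 3.8 for transistor widths given as whole numbers of fingers. The function names are hypothetical, and the cell dimensions and the dummy cell area in the example are normalized placeholders, not values of the targeted process.

```python
def area_single_height(pairs, h_cell, w_gate):
    """Eq. 3.7: area of a single-height layout.

    pairs: list of (w_n, w_p) widths per transistor pair, normalized to the
           maximum finger width (use 0 for a missing device).
    """
    return h_cell * w_gate * sum(max(wn, wp) + 1 for wn, wp in pairs)


def area_double_height(nmos, h_cell, w_gate, a_dummy):
    """Eq. 3.8: area of a double-height layout containing only NMOS devices."""
    return a_dummy + w_gate * h_cell / 2 * sum(wn + 1 for wn in nmos)


# Placeholder example in normalized units (h_cell = w_gate = 1).
buffer_area = area_single_height([(1, 2), (2, 4)], 1.0, 1.0)
mux_single = area_single_height([(1, 0)] * 10, 1.0, 1.0)
mux_double = area_double_height([1] * 10, 1.0, 1.0, a_dummy=2.0)
```

In this placeholder example, the NMOS-only circuit with ten transistors is smaller as a double-height cell, whereas for only two transistors the assumed dummy cell overhead would cancel the benefit.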
Figure 3.7 compares both area estimates (solid lines) to values measured from layouts
(points) of a representative selection of cells required for an FPGA tile. A 32-input CB multiplexer,
an 8-input SB multiplexer, a 4-LUT, and buffers of various sizes were laid out and
measured.
Figure 3.7: Measured and modeled layout areas for different FPGA tile components
Additionally, the figure plots the estimate of the MWTA model shipped with COFFE (dashed
line). As expected, it was unsuited to estimate the area for standard cell compatible tile de-
signs. In fact, it was observed during layout that most structures could be laid out very
efficiently and that full custom design would offer almost no advantage in terms of layout
density [Sch16]. Based on a few simple but concrete design rules, the custom area model
was able to consistently estimate layout area solely from transistor widths. More impor-
tantly, its consistent overestimates enable a designer to estimate an upper bound for the
total tile area, without actually having to perform a complete layout. This property is crucial
for ranking logical architectures with different tiles. Previous attempts that fitted the original
MWTA model to layout measurements traded simplicity of the model for higher uncertainty
and a higher risk of underestimating the total area.
3.2 Exploration Setup and Results
Beyond the area estimated from transistor sizes, this thesis further assumed a fixed area
of 10 MWTAs per bit of configuration memory, as well as 40 MWTAs per flip-flop. Both
estimates are based on the number of gate pitches from robust implementations of an ex-
isting 28 nm standard cell library with the same relative cell height. The estimated area
for configuration memory can be considered conservative compared to other explorations.
However, minimum sized cells become unreliable, as many of these cells are exposed to
process variations. The custom area model was integrated into COFFE, together with
extensive modifications that allow each subcircuit to select a context specific area model.
Hence, either fixed values, measured from existing layouts, or custom estimates, based on
transistor widths and the structure of each subcircuit, can be annotated.
3.2 EXPLORATION SETUP AND RESULTS
The main goal of this exploration was to rank logical architectures and to find a reasonable
candidate architecture in the context of the targeted 22 nm process technology. Capturing
the underlying technology constraints, with a more realistic and consistent area model as
well as with SPICE simulations of detailed transistor level circuits, is believed to provide
more confident results than previous analytical models or studies in older technology nodes.
Nevertheless, it is important to understand the tradeoff between simplicity and accuracy
of the model as well as its restrictions. The subsequently reported results are inherently
limited in their ability to predict the final performance, since no complete layout of an FPGA
tile has been performed. Delay simulations were based on pre-layout models and did not
incorporate any layout parasitics. Although COFFE’s original wire load model was kept in
place, even with more accurate area estimates the quality of this method remains unknown.
Additionally, the discussed constraints made it necessary to revert to a rough first-order
method to determine transistor sizes. Finally, area results for a complete tile are based
on the sum of the estimated cell areas and do not include overhead due to subsequent
automatic place and route steps.
3.2.1 BENCHMARKS AND CAD SETTINGS
Apart from the electrical characteristics of an FPGA architecture, the quality of the CAD tools
as well as the selection of benchmarks play an important role. The following describes the
benchmark circuits and their selection, the intricacies of the involved academic tools, and
tool settings that influenced the reported result.
Benchmarks are important as they should reflect the characteristics of the context in
which the FPGA architecture is used. As the eFPGA explored in this thesis was targeted
at a tight coupled instruction set extension, a custom set of benchmarks was developed.
Table 3.2 lists the various circuits that represent typical circuits for datapath operations but
also more irregular structures like bit shifts, sorting networks, and hash functions. These cir-
cuits implement frequently occurring functions for which software implementations on the
microprocessor are limited by the fixed ISA and achieve inadequate performance. Also, the
word length of the benchmarks was matched to the bit length of typical data types used by
software implementations. Generally, these circuits are relatively small as they only
implement computational kernels and not standalone applications. Custom benchmarks are less
common in explorations that focus on general purpose FPGA architectures. Nevertheless, no
widely accepted standard benchmark suite exists. Frequently, the 20 largest MCNC [Yan91]
benchmarks are used, although they are quite old and considered unrepresentative for de-
signs that are used in conjunction with modern heterogeneous FPGAs. Hence, a benchmark
set with larger circuits, called Titan23 [MWL+15], has been composed to capture modern
trends of high-capacity fabrics that contain heterogeneous resources. Still, results for the
20 largest MCNC benchmarks are frequently reported as a common reference point.
Table 3.2: Benchmark circuits used to explore logical FPGA architectures for tight
coupled instruction set extensions
Name          Description                      Data type (bits)   4-LUTs¹   Edges¹
add-sub       Conditional add or subtract      32                 203       573
add-compare   Add, compare, and select         16                 253       730
                                               32                 300       874
add-shift     Addition and conditional shift   32                 167       512
counter       Ripple-carry counter             16                 63        104
mac           Multiply accumulate              16                 955       2939
                                               32                 3673      11319
mul-fast      Carry-save multiplication        16                 701       2277
                                               32                 2957      9529
feistel       Feistel rounding function        16                 89        211
crc           Cyclic redundancy check          16                 42        101
int-hash      Hashing, masks precomputed       32                 367       1124
reversebit    Bit reversal                     16                 192       600
                                               32                 632       2176
shift         Variable shift                   32                 907       3070
sort          Sorting network                  16                 1124      3803
¹ Reported by ABC using the baseline script from listing A.1.2.
For technology mapping and logic optimization, a more recent version of ABC [Ber16] was
used together with a slightly more advanced command script from the branch new_ABC of
the VTR repository. This script is based on rewriting techniques which optimize the depth of
the logic network without incurring excessive area overhead [MCB06]. It was found that this
set of commands consistently provided better results than those from the main development
branch. Additionally, this version of ABC ships with more sophisticated high effort methods
which are later used in conjunction with full application benchmarks. During the design
space exploration, only the baseline command script from A.1.2 was used as it provided
fairly balanced results in terms of area and delay.
Addressed as a major problem [RD11], the routing algorithm used by VPR is sensitive
to the order in which nets appear in the placed netlist. Minor changes to this order, either
caused by a different placement seed or by a different logical architecture, can lead to vastly
different routing results, even for placement netlists that are functionally identical. As the
current version of VPR shipped with VTR still exhibits this behavior, a careful setup for the
exploration was needed. In order to minimize jitter of the final results, all VPR runs dur-
ing the exploration were performed with 50 randomly selected placement seeds for each
benchmark. As routing only has an impact on delay but not on the number of occupied tiles,
only the placement with the least delay was evaluated and results from other placements
were ignored. Additionally, the default options for the router and the placement engine
were tuned for more reliable results according to hints from previous reports on this issue
[LL04]. A list of the options passed to VPR during the exploration can be found in table
A.1.4. Finally, previous studies usually allowed VPR to choose the channel width W accord-
ing to the individual demand of each circuit or used one channel width that allowed routing
for all considered circuits. Both approaches are flawed. The first approach assumes that
the channel width can be chosen freely once a tile is designed or that it has only a minor im-
pact on overall performance. However, looking back at the fan-out calculations from section
3.1.2, changing W would require full resimulation of the architecture for each individual cir-
cuit. The second approach simply ensures that all circuits are routable, however, the supply
of routing tracks can have a significant impact on the final results. When using a common
channel width for circuits that have slightly different routing demands, results are skewed
in favor of the circuit with less demand.
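The seed sweep described above, in which each benchmark is placed and routed with 50 seeds and only the run with the least delay is kept, can be sketched as follows. This is a minimal illustration and does not invoke VPR: `route_delay` is a hypothetical callable that stands in for one complete place-and-route run and returns the critical path delay for a given seed, and the seed range is an arbitrary choice.

```python
import random


def best_of_seeds(route_delay, num_seeds=50, rng_seed=0):
    """Run one benchmark with several placement seeds and keep the fastest.

    route_delay -- hypothetical callable: placement seed -> critical path delay
    num_seeds   -- number of randomly selected placement seeds (50 in the text)
    rng_seed    -- seed of the seed generator, fixed for reproducibility
    """
    rng = random.Random(rng_seed)
    # Draw distinct placement seeds; the upper bound is arbitrary.
    seeds = rng.sample(range(1, 1_000_000), num_seeds)
    delays = {seed: route_delay(seed) for seed in seeds}
    # Only the placement with the least delay is evaluated further.
    best_seed = min(delays, key=delays.get)
    return best_seed, delays[best_seed]


# Stand-in for a real run: a made-up, seed-dependent delay in nanoseconds.
def fake_run(seed):
    return 5.0 + (seed % 97) * 0.01


print(best_of_seeds(fake_run))
```

Fixing the generator seed keeps the set of placement seeds identical across logical architectures, so that differences in the results are not caused by a different selection of seeds.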
Faced with this dilemma, a two-stage approach was performed in the present exploration.
Given a set of logical parameters, the first step simulated a tile with sufficient routing tracks
for all benchmarks using the modified version of COFFE. Afterwards, VPR was instructed
to map all benchmarks with this fixed W , again ensuring consistent results with 50 differ-
ent placement seeds. Based on the maximum number of occupied channels, the actual
demand was determined for each circuit and the circuits were assigned into one of at most
eight channel width bins according to their demand. Bins were determined such that the
number of circuits in each group is maximized while respecting the constraints for W due
to single-driver routing and segmentation. This channel width binning ensured low-stress
routing for each circuit by increasing the required channel width by 30 percent. In the sec-
ond step, tiles with the otherwise unchanged logical architecture and eight different channel
widths were simulated and each circuit was mapped by VPR using the assigned tile with
the correct channel width.
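The binning step can be sketched as below. The 30 percent margin is taken from the text and the step size of 2L from table 3.3; rounding up to the next multiple of that step is an assumption about how the constraints for W were respected. The limit of at most eight bins and the maximization of the group sizes are not modeled here, and the circuit names and track counts in the example are hypothetical.

```python
from collections import defaultdict


def binned_channel_width(max_occupied_tracks, segment_length, margin_percent=30):
    """Add a routing margin and round up to the next multiple of 2L."""
    step = 2 * segment_length
    # Integer arithmetic avoids floating point surprises at bin boundaries.
    scaled = max_occupied_tracks * (100 + margin_percent)
    return step * -(-scaled // (100 * step))  # ceiling division


def assign_bins(demand, segment_length):
    """Group circuits by their binned channel width.

    demand -- dict: circuit name -> maximum number of occupied tracks
    """
    bins = defaultdict(list)
    for circuit, tracks in demand.items():
        bins[binned_channel_width(tracks, segment_length)].append(circuit)
    return dict(bins)


# Hypothetical routing demands for L = 4 (channel width step of 8 tracks).
print(assign_bins({"circuit_a": 30, "circuit_b": 31, "circuit_c": 60}, 4))
# prints {40: ['circuit_a'], 48: ['circuit_b'], 80: ['circuit_c']}
```

Each resulting key corresponds to one tile variant that has to be simulated in the second step, which is why the number of distinct bins should stay small.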
The input files that specify individual logical architectures to COFFE were generated ac-
cording to a set of logical parameters, their value range, and a specific step size. This
procedure automatically chose the number of cluster inputs according to equation 2.1,
rounded up to the next even integer so that each connection block drives the same
number of inputs. Based on these input files, the modified version of COFFE sized the
transistors, determined the delay of all components using SPICE simulation, calculated the
total area, and generated annotated architecture descriptions for VPR. The format of these
descriptions was extended to allow separate annotation of the area required for global rout-
ing, that is, the area for switch blocks and connection blocks. As previously mentioned,
this value was inferred by VPR’s built-in models, however, the reported values are based
on connectivity of the routed benchmarks, whereas the presented physical model provided
more accurate and more consistent results.
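The derivation of the number of cluster inputs can be illustrated with a short sketch. Equation 2.1 is not repeated in this section; the code assumes that it is the commonly used relation I = K(N + 1)/2, which reproduces the values for I listed in tables 3.3 and 3.4. The function name is hypothetical.

```python
def cluster_inputs(n, k):
    """Number of cluster inputs I for N logic elements with K-input LUTs.

    Assumes equation 2.1 to be I = K * (N + 1) / 2, rounded up to the
    next even integer.
    """
    twice_i = k * (n + 1)          # 2 * I, kept as an integer
    i = (twice_i + 1) // 2         # round up to the next integer
    return i + (i % 2)             # round up to the next even integer


print(cluster_inputs(8, 5))   # prints 24
print(cluster_inputs(2, 2))   # prints 4
print(cluster_inputs(14, 6))  # prints 46
```

The three printed values correspond to the fixed value of I in the second exploration phase and to the lower and upper bound of the range explored in the first phase.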
3.2.2 EVALUATION METHODOLOGY AND PARAMETER SELECTION
Once the channel width binning and the final mapping had been performed, results were
gathered and interpreted. To rank different logical architectures and to assess the quality
of results, it is common to calculate the area delay product over all benchmark circuits.
That means, for each circuit the critical path, reported by VPR’s built-in timing analysis, is
multiplied by the area, determined from the tile size and the number of tiles required by each
circuit. The critical path is further split up into net delay, which refers to the delay due to
switch blocks in the global routing network, and into logic delay, which is the delay from
the input of the CB to the cluster outputs. Area usage for designs mapped onto FPGAs is
inherently quantized to the size of the tiles and the discrete routing structure. As discussed,
VPR automatically selects a square array with sufficient tiles to map the circuit. However,
not all tiles are used and hence using the raw size of the array skews the results unjustly. In
particular, the number of unused tiles or cutoff is a function of N and K as they directly define
the capacity of a logic cluster. As these parameters were changed during the exploration,
the area usage was instead inferred only from the number of occupied tiles. Furthermore,
the options for placement were chosen such that a relatively low number of logic elements
remained unused. A quick empirical evaluation showed that VPR consistently generates
placements with high cluster utilization. In over 90 percent of the observed mappings
(for architectures with N ∈ {2, 4, 6, 8, 10, 12, 14}), utilization was greater than
90 percent. As discussed earlier, area results reported by VPR
include both area for the logic cluster and area for the global routing structure, which in
this case includes area for the switch blocks and the connection blocks. After exploring
the design space and once a particular logical architecture was chosen, N and K remained
fixed and hence a more coarse area measurement that considered only the raw array size
was used. Finally, to reduce the results from all circuits into a single figure of merit, the
geometric mean over all area and delay results was calculated, as well as the area delay
product. This procedure is widely used in other investigations and allows simultaneous
assessment of the logical architecture on area and delay results of multiple benchmarks,
which are reported on different numerical scales. Also, this procedure allows comparison
of this exploration with previous studies.
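The reduction into a single figure of merit can be sketched as follows. The sketch assumes that the area of a circuit is the number of occupied tiles multiplied by the tile area; the function names are hypothetical, and the three example rows and the tile area are taken from tables 3.6 and 3.5 purely for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed in the log domain for numerical robustness."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def figures_of_merit(results, tile_area):
    """Reduce per-circuit results to geometric means.

    results   -- list of (critical path delay, number of occupied tiles)
    tile_area -- area of one tile
    Returns (mean area, mean delay, mean area delay product).
    """
    areas = [tiles * tile_area for _, tiles in results]
    delays = [delay for delay, _ in results]
    products = [a * d for a, d in zip(areas, delays)]
    return (geometric_mean(areas), geometric_mean(delays),
            geometric_mean(products))


# counter, feistel, and crc-16: (total delay in ns, required tiles)
rows = [(4.06, 6), (2.83, 6), (2.82, 2)]
area, delay, product = figures_of_merit(rows, tile_area=1612.62)
print(area, delay, product)
```

The geometric mean of the per-circuit area delay products equals the product of the geometric means of area and delay, so both ways of computing the figure of merit give the same ranking. In contrast to the arithmetic mean, a single large benchmark such as mac-32 does not dominate the result.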
Due to the large number of architectures and the interdependencies between logical pa-
rameters, the exploration was performed in two phases. This was done by ordering the
logical parameters according to their expected relative influence on overall performance. In
the first phase, a set of parameters which have been identified by others to have the great-
est impact on the area delay product was varied. This includes N, K and L, which directly
define the logic cluster capacity and the structure of the global routing network.
Table 3.3: Selection of logical parameters for the first exploration phase
Parameter   Value range      Step size   Note
N           [2, 14]          2
K           [2, 6]           1           Restricted by COFFE
L           {1, 2, 4, 6}     -
I           [4, 46]          2           Derived from equation 2.1
W           [22, 168]        2L          Channel width binning
FS          3                -           Fixed
Fc,in       0.2              -           Fixed
Fc,out      0.1              -           Fixed
Fc,local    1                -           Fixed
Table 3.4: Selection of logical parameters for the second exploration phase
Parameter   Value range      Step size   Note
N           8                -           Fixed
K           5                -           Fixed
L           4                -           Fixed
I           24               -           Fixed
W           [56, 112]        2L          Channel width binning
FS          3                -           Fixed
Fc,in       [0.075, 0.250]   0.025
Fc,out      [0.075, 0.250]   0.025
Fc,local    1                -           Fixed
All other parameters were either derived from these parameters or fixed to the values
listed in table 3.3. Additionally, cluster crossbars and routing segments were fully populated.
Selection of the fixed values was based on results from previous empirical studies [BRM99].
Ignoring the additional effort for channel width binning, 140 different logical architectures
were explored in this first phase.
In the second phase, the cluster input and output flexibilities Fc,in and Fc,out were varied,
as these parameters are believed to have only a secondary effect on the figure of merit.
Table 3.4 lists the parameter values used in the second phase together with their value
ranges. This final phase explored 64 additional architectures. Still, the influence of these
parameters on architectures that employ single-driver routing has only been studied briefly
before [Chi13]. A preliminary evaluation with values up to 0.75 found that larger flexibilities
provided no benefit in terms of delay and only increased the size of the routing structures.
Therefore, the value range was reduced to smaller values with a finer step size. To reduce
the number of architectures to be explored in this phase, the previously varied parameters
were fixed according to a subjective criterion which will be discussed in the next section.
Hence, the second phase is used to verify that the fixed values chosen in the first phase
were indeed a reasonable choice.
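The number of architectures in both phases follows directly from the value ranges in tables 3.3 and 3.4, as the following sketch shows. The variable names are hypothetical, and the flexibilities are generated from integer thousandths to avoid accumulating floating point errors.

```python
from itertools import product

# First phase: N in [2, 14] with step 2, K in [2, 6] with step 1, L in {1, 2, 4, 6}.
phase1 = list(product(range(2, 15, 2), range(2, 7), (1, 2, 4, 6)))

# Second phase: Fc,in and Fc,out in [0.075, 0.250] with a step of 0.025.
fc_values = [x / 1000 for x in range(75, 251, 25)]
phase2 = list(product(fc_values, fc_values))

print(len(phase1), len(phase2))  # prints 140 64
```

Each tuple in `phase1` is one combination (N, K, L) and each tuple in `phase2` one combination (Fc,in, Fc,out); the additional tile variants required by the channel width binning are not counted.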
3.2.3 RESULTS
The exploration in the first phase was conducted to determine whether it is more efficient to
use small clusters, which require many tiles to implement a circuit, or to use large clusters,
which can implement a given circuit with fewer tiles. Starting with the number of LUT inputs
K , it is also important to recall that the quality of the technology mapper plays a key role in
answering this question.
Figure 3.8: Occupied logic cluster area across all benchmarks for L = 4
Naively, one would expect that the size of LUTs roughly doubles as K is increased8, while
the number of required LUTs decreases as their logic capacity increases. However, there
is also a circuit specific cutoff as the technology mapper is not able to cover the logic
network such that all LUT inputs are used. Since all clusters in this exploration used non-fracturable LUTs
and a homogeneous LUT size across logic elements, some LUTs were inevitably used to
implement functions with a fan-in less than K .
8 Ignoring buffers, there are $2^{K+1} - 2$ pass gates and $2^{K}$ configuration memory cells.
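The pass gate count in this footnote can be checked with a short worked sum. It assumes that the LUT is a binary tree multiplexer in which level j, counted from the output, contains 2^j pass gates; this reading matches the binary tree topology listed for the LUT in table 3.5.

```latex
\[
  \underbrace{\sum_{j=1}^{K} 2^{j}}_{\text{pass gates}} = 2^{K+1} - 2,
  \qquad
  \frac{2^{(K+1)+1} - 2}{2^{K+1} - 2} = 2 + \frac{2}{2^{K+1} - 2} \approx 2 .
\]
```

For K = 5 this gives 62 pass gates and 2^5 = 32 configuration memory cells, in agreement with the 32 configuration bits of the 5-LUT in table 3.5. The ratio shows why the LUT size roughly doubles with every additional input.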
Figure 3.8 shows how the geometric mean of required logic cluster area increases as
the number of LUT inputs was increased. Two opposing effects influenced the depicted
trend. On one hand, fewer tiles were required as their logic capacity increased. On the
other hand, the size of a single cluster grew, not only due to the exponential increase in LUT
size, but also due to the increased size of the fully populated crossbar. For larger Ks, this
effect turned out to be overwhelming. Still, increasing the number of LUT inputs was most
effective in reducing the number of tiles up to 5 inputs.
Simultaneously, as fewer tiles were needed, the amount of area required by global routing
resources decreased, however their relative contribution was dominating for K ≤ 4.
Especially clusters with small LUTs were dominated by the area of the global routing network.
The effective total area due to the routing network and the logic cluster is shown in figure 3.9.
Figure 3.9: Total occupied area across all benchmarks for L = 4
As both trends, the decrease in routing area and the increase in logic area, canceled each
other out, a region with more or less constant total area for 3 ≤ K ≤ 5 was observed. Also,
the trends were found to be similar across all cluster sizes. Increasing the number of logic
elements had very little effect on the total area, except for clusters with 2 logic elements.
The observation of excessive total area for small and for large LUTs is in line with previous
work [AR04], however, Kuon and Rose observed a decrease in area as the number of LUT
inputs was decreased and attributed it to single-driver routing [KR09]. With the present
setup and results, this conclusion could not be confirmed. The results also differ in that
the relative increase in area from three to six LUT inputs was only around 35 percent and
occurred abruptly. Although the results shown so far have only included architectures with
a segment length of 4, the trends for other values of L were almost identical. More details
on the impact of changing L are given later.
Looking at N and its impact on delay gives a similar picture. Clusters with sufficient size
exhibited very similar performance, except for very small clusters.
Figure 3.10: Net delay across all benchmarks for L = 4
Figure 3.11: Total delay across all benchmarks for L = 4
Somewhat surprisingly, increasing the number of logic elements seemed to have almost
no influence on logic delay, which is not shown here. Instead, figure 3.10 plots the delay
due to global routing as a function of N. Clusters with very small LUTs and very few logic
elements benefited from clustering as their critical path was dominated by delays of the
global routing network. Increasing both K and N also increased the number of required
cluster inputs and the load on the switch block multiplexers. Hence, increasing the cluster
size beyond a certain capacity did not reduce delay. A similar but somewhat reduced effect
is also visible in the plot for total delay in figure 3.11. Again, it becomes clear that indeed K
had a consistent and very strong influence and N only had to be chosen sufficiently large.
Increasing the fan-in of LUTs allowed the technology mapper to reduce the number of logic
levels on the critical path. However, similar to the results of total area, delay-wise there was
little benefit in using LUTs with more than five inputs. This observation might be strongly
related to the selected benchmark circuits.
Figure 3.12: Area delay product across all benchmarks for L = 4
Combining both results leads to the area delay product, depicted in figure 3.12. Small
clusters with few logic elements or small LUTs were routing limited and hence dominated by
the significant overhead of the routing structures, both in area and delay. Still, a large range
of logical architectures achieved results in a region with almost identical area delay product,
indicating that in this region performance and area could be traded with minimal penalty.
Furthermore, the effect of clustering appeared to be very weak. For large clusters, the
reduced total delay was traded for a significant increase in total area, which was largely due
to the inefficiency of mapping circuits to larger LUTs. Previous explorations found coherent
results, some with a more pronounced minimum of the area delay curve at a certain LUT
size [AR04].
Up until now, only the characteristics for architectures with a segment length of 4 were
presented as this value provided a good choice, in agreement with previous studies. How-
ever, it is still instructive to look at the impact of L on area and delay, as segmentation
reduces the number of switch block multiplexers and increases the number of inputs driven
by routing buffers. At the same time, the connectivity between clusters is increased by
reducing the logical distance between switch blocks.
Figure 3.13: Area delay product for N = 8 across all benchmarks, relative to L = 1
Figure 3.14: Net delay for N = 4 across all benchmarks, relative to L = 1
Figure 3.13 shows the relative influence of L on the area delay product for clusters with
8 logic elements. The largest impact was observed for small logic clusters as they are
limited by the global routing network. Efficiency increases for all architectures, regardless
of the size, up to a segment length of four. For these cases it might be worthwhile to
allow depopulation of intermediate segment taps. Increased delay was also observed for
clusters with fewer inputs. However, the reduced number of switch block multiplexers and
their corresponding area more than compensated for this effect.
Looking at the delay for smaller clusters in more detail, figure 3.14 shows the net delay
for various LUT sizes and a cluster size of 4 logic elements. A sudden decrease in net delay
was observed for increasing L from 1 to 2. At that point, the switch block fan-out was
only increased slightly while the logical distance between clusters was halved. Effectively,
this reduced the number of switch blocks on the critical path and the slight increase in
delay per switch block was more than compensated. However, as the segment length was
increased further, each switch block multiplexer had more inputs and its buffer had to drive
a larger load. Kuon and Rose observed a similar relative decrease in delay for the transition
to a segment length of 2 but with no further increase in delay for larger segments. This
discrepancy is most likely due to their advanced transistor sizing method that has been
shown to perform well for sizing routing buffers [KR09]. Inspection of Sopt calculated by
the employed sizing method (see section 3.1.2) further revealed that the two-stage routing
buffer design was best suited for a segment length of 2.
In the second exploration phase, the choice of the cluster input and output flexibilities was
further investigated. Based on the results of the first phase, a wide range of combinations
for K , N, and L resulted in an almost equal area delay product. Since implementation of
custom instructions was expected to benefit from fabrics with higher performance, a delay
oriented architecture was further explored in the second phase. Hence, in this phase the
design space was restricted by setting K = 5, N = 8, and L = 4.
Figure 3.15: Contour plot of the area delay product as a function of cluster input and
output flexibilities with K = 5, N = 8, and L = 4
A previous study pointed out that the choice of the cluster output flexibility for single-driver
routing required a reevaluation of established results from older studies using multi-driver
routing. However, this particular evaluation only considered three different values and its
definition of Fc,out is slightly different [Chi13]. Here, both Fc,in and Fc,out were studied across
a wide range of values. Increasing both values has the potential to reduce the number of
nets on the critical path by increasing the connectivity between the cluster and the routing
channels. However, as the number of inputs per multiplexer grows, area and delay per
multiplexer also increase. Figure 3.15 shows the resulting impact on the area delay product
for the selected architecture. At least in the employed value range, Fc,out had only a minor
influence. This is due to discrete steps imposed by the choice of the other parameters and
by the assignment of outputs across the accessible routing tracks. Evidently, the choice of
Fc,in had a much greater impact and although there is a clear minimum at Fc,in = 0.2, other
values in the vicinity or even at Fc,in = 0.1 provided almost equal results. Still, the shown
results only applied to one specific architecture with a specific cluster size. For comparison,
figure A.1.3 shows the result for an architecture with a much smaller logic cluster. In that
case, Fc,out had a much larger influence, caused by a sudden increase in area towards larger
values.
3.2.4 DISCUSSION AND CANDIDATE ARCHITECTURE
Looking back at the number of opposing trends that influence the performance and area
efficiency, it becomes clear that the conducted design space exploration provides valuable
information to guide the design of logical FPGA architectures. Although the general trends
were consistent with previous explorations, this was certainly not evident beforehand, con-
sidering that this thesis used up-to-date CAD tools, a custom benchmark set, a 22 nm pro-
cess technology, and a consistent area model based on physical design rules. Still, many
small details were found to be different. Most importantly, the area delay product remained
nearly constant for a number of different cluster and LUT sizes. Clustering multiple logic
elements only had a very limited effect, mostly benefiting clusters composed of 2-LUTs.
However, any other number of LUT inputs consistently provided better results, especially
towards medium sized clusters. Additionally, single-driver routing was found to have no
major impact on the selection of K and N. Also, the prevailing choices for cluster input and
output flexibilities were found to be reasonable for the selected values of K , N, and L. Still,
other values provided equal or even better results and it was observed that a change in the
cluster capacity and in the global routing network can dramatically affect this choice.
In summary, the most influential parameter was confirmed to be the number of LUT
inputs K . Increasing K is easily justified for small clusters as it reduced delay without in-
curring significant cost in terms of area. However, beyond 5 inputs the efficiency of larger
LUTs decreased and hence delay improvements could not compensate the increased area
cost. Beyond a cluster size of 2, N can be chosen more or less freely. Finally, the results
suggested that a large segment length is primarily suited for small clusters, which are dom-
inated by the overhead of the global routing network.
What constitutes the best architecture clearly depends on the application domain and
the specific context in which the FPGA is used. For some applications it might be more
important to reduce the area overhead, while for others minimum delay is required to meet
performance goals. In addition, most applications might have a limited power budget and
require highly energy-efficient computations. This is especially relevant for implementations
in more recent technology nodes where the static power consumption is often dominated by
leakage. Although this work does not attempt to model power consumption, it is expected
that leakage has a strong correlation with the total area of a tile.
The selected candidate architecture is specified by Fc,in = 0.2, Fc,out = 0.1, the fixed val-
ues from table 3.4, and a channel width W of 96 tracks. The latter also has a significant im-
pact on the area delay product but was chosen such that all considered benchmark circuits
could be routed successfully. Looking at the individual components and their performance
in table 3.5, it also becomes clear that the area overhead due to configuration memory is
significant. Furthermore, area and delay values are especially large among routing multiplexers.
Table 3.5: Area and delay results for individual tile components of the candidate
architecture
Component              Multiplexer   Instances   Delay¹   Total area   Configuration area
                       topology      per tile    (ps)     (µm²)        (%)      (bits)
SB multiplexer         12:3:1        48          269      8.24         63.6     7
CB multiplexer         20:4:1        24          199      10.71        62.9     9
Crossbar multiplexer   35:5:1        40          67       13.48        66.7     12
Local LE output        2:1           8           172      1.95         38.5     1
Global LE output       2:1           8           83       1.65         45.4     1
5-LUT                  binary tree   8           199      42.08        56.9     32
LUT input              -             40          55       0.75         0        0
Flip-flop              -             8           -        3.00         0        0
Logic cluster          -             1           -        960.26       58.7     752
Tile                   -             -           -        1612.62      60.6     1304
¹ Simulated with: Cadence Spectre and a 22 nm technology library.
Design corner: typical process (pre-layout), 25 °C, 0.8 V.
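The configuration bit totals in table 3.5 can be reproduced from the per-component values with a short consistency check. The grouping into cluster components and global routing components follows the description in the text; the variable names are hypothetical.

```python
# (instances per tile, configuration bits per instance), from table 3.5
cluster_components = {
    "Crossbar multiplexer": (40, 12),
    "Local LE output": (8, 1),
    "Global LE output": (8, 1),
    "5-LUT": (8, 32),
}
routing_components = {
    "SB multiplexer": (48, 7),
    "CB multiplexer": (24, 9),
}


def total_bits(components):
    """Sum of instances times bits per instance over all components."""
    return sum(count * bits for count, bits in components.values())


cluster_bits = total_bits(cluster_components)
tile_bits = cluster_bits + total_bits(routing_components)
print(cluster_bits, tile_bits)  # prints 752 1304
```

Both sums match the rows for the logic cluster and the tile, which confirms that the listed bit counts are per instance. Of the 1304 bits per tile, 480 are spent on the fully populated local crossbar alone.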
Table 3.6: Area and delay results after mapping the benchmark circuits onto the
candidate architecture
Benchmark        Total        Logic        Net          Number of
                 delay (ns)   delay (ns)   delay (ns)   required tiles
add-sub          9.92         7.05         2.87         12
add-compare-16   9.33         4.45         4.88         16
add-compare-32   13.47        8.19         5.28         18
add-shift        8.99         5.91         3.08         11
counter          4.06         3.13         0.93         6
mac-16           29.41        11.30        19.10        111
mac-32           63.47        23.58        39.89        386
mul-fast-16      18.54        8.93         9.61         68
mul-fast-32      37.01        15.82        21.19        297
feistel          2.83         1.63         1.20         6
crc-16           2.82         1.62         1.20         2
int-hash         6.39         2.71         3.68         41
reverse-bit-16   4.47         2.33         2.14         17
reverse-bit-32   5.55         2.34         3.21         53
shift-32         6.31         2.63         3.67         92
sort-16          10.33        3.37         6.96         113
Still, a majority of the tile area is occupied by the logic cluster. Depopulation of the local
crossbar could provide a first step for increasing its area efficiency. Further optimization
would be possible with a more sophisticated design of the switch blocks. The presented
circuit implementations restricted the routing buffers to a simplistic two stage design.
However, this block is expected to have a significantly larger load than buffers within the cluster,
since it has to drive many cluster inputs and routing tracks that can span multiple tile lengths.
Hence, routing buffers have to be modeled separately and the influence of interconnect par-
asitics has to be captured with a more robust wire load model. Further extensions in this
direction could rely on a number of design methods and architecture styles for single-driver
routing that have been published so far [LLTY04, Lee06].
Results for individual benchmarks mapped to the candidate architecture using the base-
line ABC command script are listed in table 3.6. It is important to remember that these
results were achieved with a simplistic baseline architecture for the logic elements which
does not include any enhancements for arithmetic applications. Common extensions such
as carry chains, fracturable LUTs with shared inputs as well as dedicated logic, e.g., to ex-
plicitly support carry-save arithmetic, are prime candidates for further optimization. Finally,
it is important to understand that the logical parameters were chosen based on empirical
results obtained with a customized set of benchmarks. As a reference point, the 20 largest
MCNC benchmarks were mapped and their area delay product was determined with the
same methodology as before. From figure A.1.2 it is evident that the results are similar
but clearly influenced by the choice of the benchmark circuits. With all these restrictions in
mind, the next chapters evaluate what benefit a potential eFPGA fabric would provide as a
tight coupled instruction set extension, assuming it has the characteristics of the presented
candidate architecture.
4 MICROPROCESSOR INTEGRATION
This chapter briefly outlines the main features of the targeted microprocessor, explains what
hardware changes were necessary to couple the eFPGA to the processor, and how the in-
struction set extensions can be accessed from a software program. Furthermore, possible
tradeoffs between the size of the array and performance of custom instructions are demon-
strated.
4.1 MICROPROCESSOR ARCHITECTURE
As evident from the variety of different ISAs described in section 2.1.2, the generic se-
mantics of instruction execution make the approach of tight coupled reconfigurable fabrics
portable across many microprocessor architectures. Nevertheless, most of the employed
architectures are based on the RISC philosophy and are hence very similar in their internal
structure. For the purpose of this investigation, a fully functional microarchitecture in the
form of synthesizable RTL code was provided by Racyics GmbH¹. This microprocessor implements
a modern derivative of the ISA presented in [DJL+97]. It has extended support for
memory access but does not include any of the described instructions for signal processing.
Implementation and verification of this microarchitecture were not part of this thesis.
In essence, the microprocessor implements a conventional 32-bit RISC machine with a
load-store unit (LSU) that fetches both data and instructions from a single 32-bit memory bus.
Memory addresses utilize the full 32-bit range and allow addressing a large memory space
with dedicated support for memory-mapped IO and access to individual bytes. To efficiently
utilize this memory interface, the processor is equipped with an instruction cache capable
of holding up to 64 instructions. During prefetch, the cache logic is able to detect already
cached branch targets. This caching mechanism allows efficient execution of program loops
which require a large amount of memory bandwidth to load operands.
Combining 32 global registers and 64 local registers, the windowed register file provides
each program subroutine with local storage of up to 14 variables. The global section con-
tains the program counter, a status register, special purpose configuration registers, and 14
globally visible registers for general purpose usage. Dedicated hardware implements auto-
matic stack management of the local section to ensure fast and consistent context switches
between program subroutines. With minimal overhead, stack frames are transparently cre-
ated, restored, and swapped between the local registers and memory. A special addressing
mode provides subroutines with explicit access to any stack frame, regardless of whether it
resides in memory or in the local register file.
The processor supports instructions with a length of 16 bits, 32 bits, and 48 bits. Com-
bined with clever encoding of immediate operands, addressing modes, and special cases,
15 flexible instruction formats allow the compiler to generate very compact code. Further-
more, the program counter is exposed to the pipeline so that it can be referenced like any
other general purpose register. Branches and instructions accessing memory can utilize
this feature to efficiently generate memory addresses without encoding large offsets within
the instruction words. Memory accesses also benefit from a separate two-stage memory
¹ http://www.racyics.com
pipeline, which automatically handles pending memory requests without stalling the execu-
tion of independent instructions. Most of the powerful arithmetic and logical instructions
execute within one cycle and have extensive support for detection of signed overflow. An
overview of the most important instructions is given in table A.1.1. Memory instructions
reuse the ALU to generate addresses according to one of multiple addressing modes, which
are optimized to access many typical data structures from memory. Put simply, instructions
are executed in-order by a shallow two-stage pipeline. This allows fast reaction to excep-
tions such as external interrupts or internally generated traps. Exceptions are generated
by the processor to pause the ordinary program flow and to handle special events such as
signed overflow, privilege violations, trace conditions, or interrupts. All exceptions can be
uniquely identified to backtrack their origin. Interrupts are individually enabled or disabled
and handled according to a predefined order. The interrupt priority of the internal timer mod-
ule can be programmed by a special control register. Additionally, a watchdog module can
reset the processor to a known default state in the case of a timeout.
Performance and implementation cost in terms of area for this microarchitecture were
not available for the targeted 22 nm process technology. However, the microprocessor has
been implemented as part of a test chip in a 28 nm process technology. Table 4.1 lists both
the area and the maximum clock frequency for this test chip as well as the scaled values
that are assumed for a hypothetical 22 nm implementation.
Table 4.1: Assumed performance and area cost for a hypothetical microprocessor
implementation in a 22 nm process technology
Process technology  Origin        Area (µm²)  Max. clock frequency (MHz)
28 nm               Synthesis     37477       213
28 nm               Signoff       42149       213
22 nm               Scaled model  29900       300
Note that the area for the test chip is specified with two different values. After synthesis,
only the raw sum of cell area is known and this value does not yet account for overhead due
to place and route steps. However, as the values for the tile area are calculated in the same
manner, the assumed area values for the processor are derived from the synthesis report.
Signoff area is given as a rough guideline on what overhead to expect. Also, performance
results of the 28 nm implementation are influenced by the choice of a low power process
technology option.
4.2 HARDWARE INTEGRATION
With no concrete layout implementation of the eFPGA macro, the coupling mechanism
is modeled from a high-level perspective. This considers only the necessary functional
changes to access the reconfigurable fabric by means of custom instructions and to provide
a variable clock signal to drive the internal sequential logic. Once a complete tile imple-
mentation becomes available, a more detailed analysis would be necessary to derive timing
constraints for the interface of the eFPGA hard macro and the rest of the standard cell based
design. For example, to constrain the path for operands that are read from the microprocessor's
register file, the depth at which the receiving flip-flop is placed inside the fabric would
have to be considered. However, this value is not known during ASIC synthesis due to
the post-fabrication flexibility of the eFPGA. For now, a timing annotated netlist of the user
circuit is used to model and implement the functionality of different custom instructions.
The high-level interface that every custom logic circuit within the RFU has to implement
is described next.
4.2.1 RECONFIGURABLE FUNCTIONAL UNIT
For tight coupling, minimal overhead is mandatory to fully exploit the unique capabilities of
the reconfigurable fabric. In this thesis, the RFU has direct access to all registers of the
processor register file. Since the clock signal for the eFPGA is directly derived from the
main clock signal of the processor, no additional overhead for synchronization, e.g., using
exchange registers as done in other systems [vS10, HFHK04], is necessary. Furthermore,
since each custom instruction has different requirements in terms of logic delay, the num-
ber of clock cycles can be chosen at runtime. Also, no direct memory access is provided to
the fabric since the internal control unit already has to arbitrate memory access between
the cache prefetcher and the LSU. Due to the limited set of native instructions, most appli-
cations are limited by internal computations and not by the memory bandwidth. For now,
memory transactions have to be explicitly initiated by software, but a future design might
add additional interfaces which would allow the RFU to directly create requests to the LSU.
Figure 4.1 shows the interface between the microprocessor and the RFU as well as a
hypothetical configuration controller.
Figure 4.1: Processor integration and interfaces of the reconfigurable functional unit
The most important connections between the RFU and the processor are the write and
read ports of the register file. For most arithmetic instructions, two 32-bit register operands
are read and one 32-bit result is produced by the functional unit. As this might pose a restriction
for certain custom instructions, access to a third register operand was planned.
However, constraints on the instruction format did not allow a third register to be addressed
directly, and hence this operand would have to be placed in a predefined register, for example an
otherwise unused global register. For implementation of multi-cycle instructions, the result
is only sampled by the register file once the write-enable signal write_en is asserted. In
parallel to computing the result, the RFU is also able to change the status flags that pass
control flow information back to the program. For custom instructions that do not utilize
this feature, a feedthrough of the previous flag values has to be implemented, since no ded-
icated enable signal has been provided to conditionally sample flag values from the RFU.
Furthermore, it was planned to allow the RFU to modify the destination register address
regadr_a, so that pipelining across multiple invocations of the same instruction would be
possible. Using this approach, two operands and the corresponding destination register address
would enter the pipeline within the fabric at the first cycle of each custom instruction.
Once the result is produced by the last stage of the pipeline and ready to be written back
into the processor register file, the correct destination register address would be provided
to the address decoder of the register file. To control special cases of individual custom
instructions, the signal static_ctrl provides additional user defined control bits, which are
directly encoded in the instruction word. This is useful to differentiate multiple states of
the same instruction, to control which registers inside the fabric are written and read, or to
extend the RFU opcode for multiple custom instructions. Finally, the clock signal rfu_clk
sequences computational steps inside the fabric for custom instructions that require mul-
tiple processor clock cycles to finish. This clock signal is derived from the main processor
clock using a clock gate and a counter with an instruction specific reset value. Both are part
of the processor hardware and are not implemented by the fabric.
In addition to the interface between the RFU and the microprocessor, figure 4.1 also
shows how a configuration controller would be attached to facilitate reconfiguration of different
configuration states, also referred to as configuration contexts. The active configuration
context is the state that is programmed in the configuration memory cells within
the fabric. Here, the described FPGA fabric does not support storing and switching multi-
ple internal contexts. Hence, each custom instruction requires reconfiguration with inactive
configuration contexts from fabric external memory. This includes storage in on-chip SRAM,
possibly alongside the data and instruction sections, or in off-chip memories. To allow par-
tial reconfiguration that selectively programs only occupied tiles, the fabric would have to
provide a parallel interface with separate row and column addresses, similar to a conven-
tional SRAM array. To synchronize reconfiguration and the program flow, the microprocessor
would control the configuration controller via a memory-mapped interface. During config-
uration, the microprocessor would still be able to process programs that do not require a
custom instruction. Once finished, the configuration controller would assert an interrupt
and the microprocessor would be able to start using the RFU. With this setup, it would also
be possible to implement flexible decompression schemes of configuration images using
the microprocessor.
4.2.2 INSTRUCTION CONTROL
Ideally, each custom instruction would be associated with a unique opcode. However, re-
strictions of the existing instruction encoding and the compiler required this work to reuse
the opcode of an existing instruction. The selected instruction CRC16 represents a special
case, since it is already an application specific instruction. Hence, the compiler is not aware
of its functionality and will not use this instruction as part of an implementation for a con-
ventional high-level program. Instead, it has to be issued explicitly by a system programmer.
In this work, the instruction format and the corresponding hardware within the instruction
decoder for the CRC16 instruction were changed. All custom instructions use this opcode to
enable the logic within the RFU. Full compatibility with the native instruction set can eas-
ily be restored by implementing the functionality of CRC16 within the reconfigurable fabric.
Looking back at table 3.6, such a custom instruction would require minimal delay and only
two tiles of the candidate architecture. Although not as fast as the native ASIC implementation,
compared to a pure software implementation the reconfigurable fabric could perform
the same operation more than 40 times faster².
Conventional instructions are first decoded by the decoding stage and then executed in
a second pipeline stage. One important task of decoding is to determine the instruction
length so that the program counter can point to the beginning of the next instruction. To
ensure correct functionality, the decoder was changed to account for the new instruction
format associated with the CRC16 instruction. As some instructions require more than one
cycle for execution, an internal counter stalls the decoding stage while these instructions
are being executed. Prior to the first execution cycle of an instruction, this counter is reset
to a value inferred by the instruction decoder directly from the opcode. Instructions that
extend the native instruction set reuse this mechanism but instead of decoding the reset
value from the opcode, it is read from a separate field within the custom instruction format.
This format is described in more detail in the next section. Additionally, a separate counter
implements the previously discussed clock divider which provides the RFU with a variable
clock signal.
Figure 4.2: Simulated waveforms of the clock signals and counter values
Intricate details of the instruction pipeline require custom instructions to either implement
purely combinational circuits with a critical path length less than half the processor clock pe-
riod or implement circuits that require one additional cycle. In the latter case, the additional
cycle can be used to read the operands from the register file. The waveforms in figure 4.2
show how the clock signal of the RFU is aligned to the processor clock. The counter values
for instruction duration and the clock divider are represented by the signals r_instr_state
and r_clk_count respectively. All custom instructions are bound to the discrete time steps
imposed by the processor clock. Hence, in some cases it is more critical to optimize for de-
lay so that the instruction can fit into a certain number of core cycles, while in other cases
the delay constraints are more relaxed.
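The quantization of instruction latency to core cycles can be illustrated with a minimal C sketch. It assumes the rule stated above in its simplest form: a circuit with a critical path shorter than half the clock period finishes within one cycle, and every other circuit needs one additional cycle for reading the operands plus as many full cycles as its critical path spans. The function name and the picosecond units are hypothetical and not part of the thesis toolchain.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the thesis toolchain.
 * Assumed rule: a critical path shorter than half the processor clock
 * period fits into a single cycle; otherwise one extra cycle is spent on
 * reading the operands and the computation occupies
 * ceil(delay / period) further cycles.
 * clk_mhz must be nonzero. */
static uint32_t rfu_required_cycles(uint32_t delay_ps, uint32_t clk_mhz)
{
    const uint32_t period_ps = 1000000u / clk_mhz;   /* e.g. 3333 ps at 300 MHz */

    if (2u * delay_ps < period_ps) {
        return 1u;                                   /* purely combinational case */
    }
    /* integer ceiling division, plus the operand-read cycle */
    return 1u + (delay_ps + period_ps - 1u) / period_ps;
}
```

Under these assumptions, the crc-16 circuit from table 3.6 (2.82 ns) would occupy two cycles at the 300 MHz listed in table 4.1, while a circuit with a 4.06 ns critical path would occupy three.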
4.3 SOFTWARE INTEGRATION
Modern development suites for ASIPs not only ship with a cycle-accurate simulation model
to profile applications on the baseline microprocessor, but also include a retargetable soft-
ware toolchain. That means, not only the hardware design of the processor itself is flex-
ible but also the software toolchain required to generate executables. The set of unique
custom instructions and further configuration parameters are captured with a machine de-
scription that is used by the retargetable toolchain to generate machine specific instruction
² Measurements with 8192 randomized samples indicated that a reference software implementation with full
compiler optimizations requires 91 cycles on average.
sequences. Furthermore, advanced tool suites provide options to fully automate the identi-
fication and design of custom instructions from an initial software implementation [IL07].
Developing a toolchain with similar capabilities or even porting an existing compiler to
the customized instruction set was out of scope for this thesis. Therefore, an existing
software toolchain based on the GNU compiler collection [S+13] was used together with the
redefined functionality of the CRC16 instruction. The provided toolchain includes a modified
compiler, linker, and assembler, which are all compatible with the instruction set. With
this setup, minimal effort was required to develop benchmark applications in C which were
capable of using the reconfigurable fabric via custom instructions. However, the targeted
ISA has no native support for floating-point operations or even integer multiplication. Hence,
a library function was created to provide integer multiplication of 32-bit and 64-bit operands
to implement applications using fixed-point arithmetic. Listing A.1.1 shows a section of the
assembler implementation for the former. Restricted by the limited functionality of the native
instructions, it requires 139 cycles on average and is explicitly not optimized for size. For
64-bit operands, 398 cycles are required due to overhead of handling double word operands.
Additionally, a simulation-based profiler was developed to identify the most critical sections
of software applications.
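Why a software multiplication costs on the order of a hundred cycles on an ISA without a multiply instruction can be seen from a plain shift-and-add formulation. The following C sketch is only an illustration of the principle and is not the assembler routine from listing A.1.1; the function name is hypothetical, and it is assumed that only the lower 32 bits of the product are required.

```c
#include <stdint.h>

/* Illustrative shift-and-add multiplication (assumption: only the low
 * 32 bits of the product are needed). Each loop iteration costs a bit
 * test, a conditional add and two shifts, so up to 32 iterations are
 * required for 32-bit operands. */
static uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0u) {
        if (b & 1u) {
            result += a;    /* add the shifted multiplicand for a set bit */
        }
        a <<= 1;            /* next partial product weight */
        b >>= 1;            /* next multiplier bit */
    }
    return result;
}
```

With several native instructions per iteration, a loop of this kind lands in the same order of magnitude as the reported average of 139 cycles, although the actual routine may be structured differently.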
4.3.1 INSTRUCTION FORMAT
As the original format of the CRC16 instruction did not provide sufficient flexibility for the
additional control information required by the RFU, a new instruction format as shown in
figure 4.3 was introduced.
Halfword 1 (bits 15..0): opcode [15:10] | a [9] | b [8] | Ra [7:4] | Rb [3:0]
Halfword 2: E | cycles[5:0] | clk_div[4:0] | stat_ctrl[3:0]
Halfword 3 (only if E is set): S | immediate[14:0]
Figure 4.3: Custom instruction format to utilize the RFU
The first 16 bits in the upper row are used to specify the opcode and to address two
separate register operands in the register file. This format is used widely among the native
instructions that perform ALU operations. Corresponding to the first operand operand_a in
figure 4.1, the destination register of the result is implicitly specified by the relative register
address Ra. Separate control bits (a and b) specify whether the respective register address
selects a global or local register. The next 16 bits contain the total number of clock cycles
cycles and the counter value for the clock divider clk_div. Bit lengths for these fields
were selected in order to provide the most useful combinations of instruction duration and
intermediate clock cycles. Values for these fields are specific to the implementation of
user circuits and are derived from the timing analysis. The remaining bits are occupied by
four additional bits to provide the control signal stat_ctrl and a bit E which can extend the
instruction format to 48 bits. In that case, an additional signed immediate operand can be fed
to the RFU; however, this format remained unused throughout the application benchmarks.
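Packing the control fields of the second instruction halfword can be sketched in C as follows. The field widths are taken from the format description, but the bit positions are an assumption: the fields are placed from the most significant to the least significant bit in the order E, cycles, clk_div, stat_ctrl. The actual positions are fixed by the preprocessor macros of the thesis header file, and all names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical field positions (assumed MSB-first order:
 * E | cycles[5:0] | clk_div[4:0] | stat_ctrl[3:0]). */
enum {
    SKETCH_STAT_CTRL_SHIFT = 0,   /* 4 bits */
    SKETCH_CLK_DIV_SHIFT   = 4,   /* 5 bits */
    SKETCH_CYCLES_SHIFT    = 9,   /* 6 bits */
    SKETCH_E_SHIFT         = 15   /* 1 bit  */
};

/* Builds the 16-bit extension word; out-of-range values are masked to
 * the width of their field. */
static uint16_t rfu_sketch_pack(unsigned e, unsigned cycles,
                                unsigned clk_div, unsigned stat_ctrl)
{
    uint16_t word = 0;

    word |= (uint16_t)((stat_ctrl & 0xFu)  << SKETCH_STAT_CTRL_SHIFT);
    word |= (uint16_t)((clk_div   & 0x1Fu) << SKETCH_CLK_DIV_SHIFT);
    word |= (uint16_t)((cycles    & 0x3Fu) << SKETCH_CYCLES_SHIFT);
    word |= (uint16_t)((e         & 0x1u)  << SKETCH_E_SHIFT);
    return word;
}
```

Since all arguments are compile-time constants in typical use, a compiler can fold such a computation into a single immediate value.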
4.3.2 HIGH-LEVEL INTERFACE
The original software toolchain allowed convenient development of software applications
using high-level descriptions in C. Not only does this simplify the development of new
applications, it also allows partial reuse of existing code. Still, restricted by the instruction
set, no complete implementation of the standard library was available.
To make use of the RFU and to implement the new instruction format, a set of wrapper
functions was written. This allows a programmer to issue custom instructions simply by
calling a function. Details of the instruction format and reuse of the CRC16 instruction are
hidden behind this abstraction. Furthermore, this interface can be reused across applica-
tions, regardless of the functionality of the implemented circuit within the RFU.
10 static inline unsigned int
11 rfu_func_32(uint32_t op_a,
12             uint32_t op_b,
13             const uint16_t static_ctrl,
14             const uint8_t num_cycles)
15 {
16     const uint16_t op_imm_low = (num_cycles & RFU_MASK__NUM_CYCLES);
17     const uint16_t op_imm_high = (static_ctrl & RFU_MASK__STATIC_CTRL_32);
18     const uint16_t op_imm = ((op_imm_high << RFU_LEN__NUM_CYCLES) | op_imm_low);
19     asm volatile("CRC16 %[Ra], %[Rb]\n\t"
20                  ".hword %[imm] \n\t"
21                  : [Ra] "+r" (op_a)
22                  : [Rb] "r" (op_b), [imm] "i" (op_imm));
23     return op_a;
24 }
Listing 4.1: C function to wrap custom 32-bit instructions
Listing 4.1 shows the function definition that is used to interface custom instructions with
the original software toolchain. The inline assembler expression in line 19 to line 22 is a GNU-specific
extension of the C language and allows describing a sequence of special purpose
instructions and their side effects. A crucial feature of this description is the ability to directly
use C variables without having to know the corresponding processor register that has been
allocated by the compiler. Without this ability, additional overhead would be required to
move the result of the custom instruction between predefined registers and the context in
which the instruction is used. Also, due to compile time optimizations, there is no overhead
for the function call or the construction of the immediate operand in line 16 to line 18. This
immediate operand implements the extension of the native CRC16 instruction by the additional
fields shown in figure 4.3. The inline assembler expression also ensures that these
fields are placed as a contiguous bit string together with the original CRC16 machine code.
The complete header file further contains preprocessor macros to describe the instruction
format (e.g., RFU_LEN__NUM_CYCLES) and a second function definition for instructions with a
length of 48 bits.
4.4 ARRAY SIZE TRADEOFFS
With both components, a hardware model and a software interface, in place, the benefit
of a reconfigurable system with a tight coupled, fine-grained fabric could be explored using
complete application benchmarks. Yet, to actually implement the full system, a designer
would be faced with the decision to define the array capacity of the RFU in terms of a cer-
tain number of FPGA tiles. Similar to standalone FPGAs, the capacity of the device restricts
the number and type of applications that are accessible. Although the microprocessor alone
is sufficient to implement nearly all applications, the array size still plays a key role in most
cases. Many interesting applications, with a demand for high throughput, allow hardware
implementations that are easily scaled, either using parallel instances of the same circuit
or using a heavily pipelined implementation. Hence, more capacity is almost always beneficial.
Nevertheless, any implementation will be restricted by an absolute area constraint.
To compare different hardware implementations of the same application, it is thus com-
mon practice to determine the area delay product, as done in the design space exploration,
which describes the cost or the efficiency of a particular implementation. For multiple implementations
with identical area delay product, the designer is then able to choose one
implementation that satisfies most constraints.
Here, an evaluation using an FIR filter as a benchmark is performed to determine the
minimal useful capacity for the array. Looking back at table 3.6, it becomes clear that a large
range of array sizes is covered by different applications and that the results of this evaluation
are specific to the chosen benchmark. Nevertheless, multiplication is a universal operation
and central to a range of applications. Furthermore, the chosen hardware implementation
is easily scaled and a large variety of alternative implementations exists.
A software implementation for the chosen FIR filter is directly obtained from its time
domain difference equation:
$$ y(n) = \sum_{k=0}^{N-1} b_k \, x(n-k) \qquad (4.1) $$
For this evaluation an FIR system with 17 filter coefficients b_k was chosen to implement an
equiripple high-pass filter. The coefficients and reference results for the response to white
noise are generated using Octave [EBHW16] and LTFAT [PSH+14]. To process one sample, a
fixed-point software implementation using only the microprocessor hardware requires 7200
cycles, of which roughly 94 percent are required for the multiplication that implements the
convolution in equation 4.1.
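A direct-form software implementation of equation 4.1 can be sketched in C as shown below. The data format is an assumption: samples and coefficients are taken as 16-bit values in Q15 fixed-point representation, which is not necessarily the format used for the benchmark. All identifiers are hypothetical. The inner loop makes explicit that each output sample requires N multiplications, which is why the multiplication routine dominates the cycle count on a processor without a native multiply instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS      17   /* N in equation 4.1 */
#define FIR_FRAC_BITS 15   /* assumed Q15 fixed-point format */

/* Processes one input sample and returns one output sample.
 * history[k] holds x(n - k) after the shift at the top of the function. */
static int32_t fir_sample(const int16_t coeff[FIR_TAPS],
                          int16_t history[FIR_TAPS],
                          int16_t x_new)
{
    int64_t acc = 0;

    /* shift the delay line by one sample and insert x(n) */
    for (size_t k = FIR_TAPS - 1; k > 0; --k) {
        history[k] = history[k - 1];
    }
    history[0] = x_new;

    /* y(n) = sum over k of b_k * x(n - k) */
    for (size_t k = 0; k < FIR_TAPS; ++k) {
        acc += (int32_t)coeff[k] * (int32_t)history[k];
    }

    /* rescale from Q30 back to Q15 (division truncates toward zero) */
    return (int32_t)(acc / ((int64_t)1 << FIR_FRAC_BITS));
}
```

In a system with an instruction set extension, only the product inside the accumulation loop would be replaced by a call to a custom instruction wrapper, while the accumulation remains in software.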
To evaluate possible tradeoffs in array size, only the software routine for multiplication
is replaced by a custom instruction that uses a carry-save multiplier. This multiplier is con-
structed from a rectangular array of carry-save adders, followed by one full adder row that
implements a carry-propagate adder to convert the redundant results into a non-redundant
value. This value is further processed by the software to accumulate the final output of
the filter. The highly symmetric carry-save structure is usually pipelined, however, here it
is scaled down in one direction to spread the computation across multiple cycles. Regis-
ters at the output of the last carry-save row and corresponding control circuitry implement
time multiplexing to reduce the number of full adder rows and hence the total number of
required tiles. A number of time multiplexing schemes with steps between 1 and 8 as well
as multiplexing of the carry-propagate adder were designed.
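The principle of a time-multiplexed carry-save multiplier can be illustrated with a small behavioral model in C. This is a sketch under simplifying assumptions and not the circuit that was mapped to the fabric: it operates on 32-bit unsigned operands, keeps only the lower 32 bits of the product, and counts only the carry-save steps, not the final carry-propagate addition. The function names and the parameter rows_per_step are hypothetical.

```c
#include <stdint.h>

/* One row of carry-save adders (3:2 compressors) across all 32 columns.
 * Invariant: sum + carry stays equal to the accumulated value mod 2^32. */
static void csa_row(uint32_t *sum, uint32_t *carry, uint32_t addend)
{
    const uint32_t s = *sum ^ *carry ^ addend;
    const uint32_t c = (*sum & *carry) | (*sum & addend) | (*carry & addend);

    *sum = s;
    *carry = c << 1;   /* carries move one column to the left */
}

/* Behavioral model: rows_per_step carry-save rows are evaluated per step,
 * and the redundant (sum, carry) pair is held between steps, as registers
 * at the output of the last row would do in hardware. */
static uint32_t csa_multiply(uint32_t a, uint32_t b,
                             unsigned rows_per_step, unsigned *steps_out)
{
    uint32_t sum = 0, carry = 0;
    unsigned steps = 0;

    if (rows_per_step == 0u) {
        rows_per_step = 1u;   /* guard against a non-terminating loop */
    }
    for (unsigned row = 0; row < 32u; row += rows_per_step) {
        for (unsigned r = row; r < row + rows_per_step && r < 32u; ++r) {
            const uint32_t partial = ((b >> r) & 1u) ? (a << r) : 0u;
            csa_row(&sum, &carry, partial);
        }
        ++steps;
    }
    if (steps_out != 0) {
        *steps_out = steps;
    }
    return sum + carry;       /* final carry-propagate addition */
}
```

Fewer rows per step correspond to fewer adder rows in hardware and therefore fewer tiles, at the cost of more steps per multiplication.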
In addition to these design strategies, the methods for logic optimization and technology
mapping played a key role for the evaluation. ABC’s main technology mapping algorithm
uses a cut-based approach to transform the technology independent logic network into a
netlist of K-LUTs. To improve runtime and to reduce memory utilization, this algorithm does
not enumerate all feasible cuts, but instead prioritizes them according to a set of optimization
goals [MCCB07]. The default goal minimizes the depth of the logic network; however, ABC
is also capable of minimizing the total number of LUTs. Both goals were used with the
rest of the baseline script in listing A.1.2 and are referred to as "low effort" and "low effort
area". Additionally, a high effort script, found in listing A.1.3, was created. It is based on
multiple iterations of technology mapping and uses a balanced sum-of-product structure for
aggressive delay optimization [MBJK11].
Figure 4.4 plots the results for combinations of different design strategies and the three
command scripts. The solid lines represent regions of constant area delay cost, where de-
signs towards the origin are favorable due to lower cost. Interesting designs are annotated
with the number of time multiplexing steps, the number of carry-save full adder rows, and
the number of full adder columns. Although not all designs cover a region with similar cost,
they still span a large range in which area can be traded for delay and vice versa. Partly, this
spread across different cost regions is due to the discrete delay steps imposed by the clock
period of the microprocessor. Designs with more multiplexing steps are more sensitive to
this quantization since all multiplexing steps are affected by the length of the critical path.
Also, there is an offset in area due to a minimum number of tiles required for the control
circuitry, the carry-propagate adder, and the flip-flops.
Figure 4.4: Possible tradeoffs in instruction latency and array size through a
combination of technology mapping options and design strategies
For applications that require very efficient implementations, the design annotated with
"5x8x32" would provide the best choice, as the complete FIR filter application would run 15
times faster at the expense of only 79 tiles. Applications that require high throughput might
justify using the design "1x32x32", which provides a relative speedup of 20 but requires
over 300 tiles. Still, it was shown that a combination of design strategies and technology
mapping options allows tradeoffs between area and delay over a large range. Based on the
results of the first design, an array with 81 tiles could be considered reasonable. Inclusion of
fabric internal flip-flops was crucial to perform these tradeoffs. Most computations require
an internal data format with a custom bit length and hence a purely combinational fabric that
only reuses the processor registers is too restrictive. Fabric internal flip-flops are essential
to efficiently implement pipelining or time multiplexing and they add little to the overall area
of the logic element (cf. table 3.5).
5 APPLICATION BENCHMARKS AND
RESULTS
To evaluate the advantages of the described tight coupled, fine-grained reconfigurable in-
struction set extension, three real world application benchmarks were mapped to the sys-
tem and compared to software implementations using only the original instruction set. This
chapter describes the evaluation methodology, the selection of benchmark applications,
their algorithmic implementation, and the final results.
5.1 BENCHMARK SELECTION AND EVALUATION METHODOLOGY
Using benchmarks to compare the performance of different architectures is inherently prob-
lematic, since many system specific parameters, beyond the selection of benchmarks them-
selves, influence the final results. Also, benchmarks are only useful if results are indicative
of performance on designs or problems that occur in the field. The same dilemma is en-
countered not only during design of FPGA architectures or reconfigurable systems in gen-
eral, but more prominently for comparison of conventional microprocessor architectures.
For systems employing high performance processors, the Standard Performance Evaluation
Corporation¹ has been established to provide benchmark suites and publication of results.
Similar benchmark suites exist for embedded microprocessors, some of which have been
reused to compare reconfigurable microprocessors. Still, no generally accepted standard
benchmarks exist for reconfigurable systems as the large variety of architectures makes it
difficult to design representative and easily reproducible benchmarks.
Here, details of the microprocessor architecture and manual design of the logic inside
the RFU restricted the type and the number of application benchmarks. In the end, three
benchmarks from different application domains were chosen. The benchmark set includes:
complex FFT, encryption with DES, and computation of the exponential function. All of
them are easily implemented in software and most of them are found in the evaluation of
other reconfigurable systems (cf. section 2.1.2).
Starting from a software implementation, a profile was determined for each application
benchmark, based on cycle-accurate RTL simulations of the microprocessor running at a clock
frequency of 300 MHz. This established a performance baseline for systems without
an instruction set extension. The SRAM model, attached to the microprocessor memory
bus, assumes a delay of one cycle, i.e., an instruction referring to a register that is written
by a preceding load instruction is stalled for one cycle until the memory pipeline has com-
pleted the transfer. All software implementations, with and without custom instructions,
are compiled with suitable optimization options offered by the compiler.
After the simulation, the profiler generates annotated listings which provide information
on how often a certain instruction was executed during the benchmark and a summary of to-
tal cycles for each subroutine. Based on this information, a designer can identify hot spots,
which are ISA specific sequences of instructions that are critical for the overall performance
1 http://www.spec.org/
of the application. Afterwards, circuits were designed to implement custom instructions
that accelerate as many of these hot spots as possible. These circuits were mapped with
the modified VTR flow onto the candidate architecture for timing analysis and to generate
the annotated post-routing netlist. Each benchmark design was mapped for three different
netlists produced by ABC (see section 4.4) with VPR options identical to those used during
the design space exploration. A second RTL simulation, this time with custom instructions
and the RFU netlist, was performed to determine the performance for the system with a
corresponding instruction set extension. The relative speedup was afterwards calculated
from the ratio between the number of cycles required in the first case and the number of cy-
cles required in the second case. To measure only the speedup of the particular benchmark
itself, each benchmark wrapped the computation in a distinct program subroutine. That way,
overhead for processor startup and for comparing computed results with reference results
was not included in the measurement.
In the same manner, the cycles that would be required to configure the FPGA fabric are
not included. First, although the amount of configuration data is known relatively precisely,
the effective configuration bandwidth is determined by the implementation of the config-
uration controller and the configuration interface of the fabric. At this point, neither the
fabric nor the configuration controller has been implemented. Furthermore, compression
techniques can be applied to reduce the overhead of configuration. Second, all benchmark
implementations offered only potential for a single custom instruction. That is, for each
benchmark, configuration data for the fabric would have to be programmed only once dur-
ing initialization and would remain static during the complete runtime of the application.
In principle, it would be possible to use multiple independent custom instructions for one
application. These would have to be configured in succession and would share the fabric
capacity. Unfortunately, none of the application benchmarks offered potential for this kind of
scenario.
Finally, the number of required tiles for each custom instruction was determined from
VPR’s resource report. As this final evaluation was conducted to provide a comparison on
system level, no absolute constraint was set to restrict the number of tiles. Instead, the
area of the extended system was calculated from the scaled area required for the baseline
microprocessor of table 4.1 and the effective size of a square array, capable of holding the
required number of tiles. Again, overhead for memory to store the program binary or to store
configuration data was not included. Nevertheless, details on the size of code, as well as
the number of bits required to configure the square array for each custom instruction are
provided as a reference point in the results section. As the size of the array was not fixed
across all benchmarks, VPR was not restricted in choosing the placement of IO pins for the
eFPGA according to figure 4.1. However, for final integration into the ASIC synthesis flow,
a fixed placement has to be determined in conjunction with a fixed size of the array. Each
circuit and each array size require different placements to achieve optimal results. Still, an
evaluation of performance degradation due to a fixed IO placement using the circuits for
DES and the 32-bit multiplier from section 4.4 revealed no noticeable effect. Nevertheless,
other FPGA architectures with less flexible routing and extreme array ratios could be more
sensitive to IO placement.
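The sizing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the area constants are taken from the footnote of table 5.1, and the relative system size is assumed to be the plain ratio (core + array) / core without any memory. The exact scaling and rounding used for the tabulated values are not reproduced, so small deviations from table 5.1 are possible.

```python
import math

# Area constants as quoted in the footnote of table 5.1.
A_CORE_UM2 = 29_900.0  # baseline microprocessor core
A_TILE_UM2 = 1_602.0   # one FPGA tile


def square_array_tiles(required_tiles: int) -> int:
    """Number of tiles in the smallest square array that holds required_tiles."""
    if required_tiles <= 0:
        return 0
    side = math.isqrt(required_tiles - 1) + 1  # ceil(sqrt(required_tiles))
    return side * side


def relative_system_area(array_tiles: int) -> float:
    """Assumed model: (core area + array area) / core area, memories ignored."""
    return (A_CORE_UM2 + array_tiles * A_TILE_UM2) / A_CORE_UM2
```

For the tile counts reported by VPR (79, 81, 54 and 163), `square_array_tiles` yields 81, 81, 64 and 169, which agrees with the array sizes listed in table 5.1.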
5.1.1 FAST FOURIER TRANSFORMATION
The fast Fourier transformation is a computationally efficient version of the discrete Fourier
transformation (DFT), that is, a method to compute the frequency spectrum of a time dis-
crete signal. A large set of operations, which are challenging to carry out in the time domain,
can be performed with significantly less effort in the frequency domain. Hence, a multitude
of applications in the domain of digital signal processing require an efficient DFT implemen-
tation. To pick one concrete example, orthogonal frequency-division multiplexing, widely
used in contemporary communication systems, such as the fourth generation of mobile
telecommunications, heavily relies on efficient implementations of the DFT and its inverse
transformation [Nus14]. Due to its popularity and relatively simple hardware and software
implementations, the FFT is commonly used to benchmark digital signal processors and
other embedded microprocessors.
Often, communication systems employ complex valued signaling, so that a complex FFT
has to apply the transformation to the real and imaginary values of a time sample. Starting in
the time domain, N complex samples x are transformed by the DFT to their representation
X in the frequency domain using the relation:
\[ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn} \quad \text{with} \quad W_N = e^{-j 2\pi / N} \tag{5.1} \]
A radix-2 FFT algorithm restricts N to a power of 2 and exploits the resulting symmetry
and periodicity of the complex exponential function to dramatically reduce the number of
computations required to implement equation 5.1. A so-called butterfly network, shown in
figure 5.1, is used to recursively compute the necessary terms. Using this structure as the
fundamental computational step, log2(N) butterfly stages are required, where each stage is
implemented by N/2 parallel butterfly networks. Generally, this approach is known as the
Cooley-Tukey algorithm. In-depth treatments on how it is derived and what other algorithms
exist can be found in standard literature [PM96].
Figure 5.1: Signal-flow graph of a butterfly network for the FFT
The software implementation uses a decimation-in-time algorithm and relies on precomputed
values for $W_N$, the so-called twiddle factors. Together with the complex time samples,
twiddle factors are stored in memory. During computation of the FFT, time samples
are discarded and successively overwritten with intermediate results. This in-place algo-
rithm reduces the memory footprint, but requires additional effort to compute the memory
indices pa, pb, pW of figure 5.1 and leads to a bit-reversed order of the frequency samples.
Therefore, prior to performing the butterfly computations, the time samples are shuffled
so that the frequency samples appear in natural order after the computation is finished.
Samples are quantized to 16 bits and stored in a dense data format that packs both compo-
nents into a single memory word. All computations are performed in a scaled fixed-point
format which normalizes each component to absolute values less than or equal to 1.0. The
software implementation is not limited to one specific number of samples and allows free
choice of N, as long as it is a power of two.
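The structure of such an in-place decimation-in-time FFT can be illustrated with the following Python sketch. It is not the benchmark code: it uses floating-point complex numbers instead of packed 16-bit fixed-point samples, and all function names are hypothetical. The variables pa, pb and pw merely mirror the memory indices of figure 5.1.

```python
import cmath


def reverse_bits(value: int, bits: int) -> int:
    """Return value with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_dit_inplace(x):
    """Radix-2 decimation-in-time FFT; overwrites and returns the list x."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "N must be a power of two"
    bits = n.bit_length() - 1
    # Precomputed twiddle factors W_N^k for k = 0 .. N/2 - 1.
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    # Shuffle the time samples so the spectrum appears in natural order.
    for i in range(n):
        j = reverse_bits(i, bits)
        if i < j:
            x[i], x[j] = x[j], x[i]
    half = 1
    while half < n:  # log2(N) stages
        stride = n // (2 * half)
        for start in range(0, n, 2 * half):
            for k in range(half):  # N/2 butterflies per stage in total
                pa = start + k
                pb = pa + half
                pw = k * stride
                t = twiddle[pw] * x[pb]
                x[pa], x[pb] = x[pa] + t, x[pa] - t
        half *= 2
    return x
```

The single complex multiplication per butterfly is the operation that dominates the runtime on a processor without a native multiplication instruction.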
Initially, it was thought that the implementation of the FFT would provide opportunities
to showcase the use of multiple custom instructions in one application. However, profiling
the software implementation revealed that the majority of cycles was spent, again, on the
multiplication subroutine. To process 4096 complex samples, the pure software implemen-
tation spends less than one percent of the cycles for shuffling and up to 85 % of the cycles
on the multiplication subroutine. So, only the multiplication was accelerated with a custom
instruction that implements a 16-bit multiplication with automatic rescaling. The achieved
speedup was deemed sufficient and no further implementations of custom instructions
were investigated. Implementing a pipelined custom instruction is expected to require
significantly more effort, since a certain number of memory indices would have to be computed
prior to loading multiple operands from memory at once. Furthermore, even after reducing
the number of cycles spent on multiplication, shuffling of the samples only took up 4 %
of the total cycles and is hence still not an attractive target for custom instructions, which
could, for example, provide a fast and flexible implementation of bit-reverse addressing.
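To make the notion of a 16-bit multiplication with rescaling more concrete, the following Python sketch shows one plausible reading. It assumes a Q15 format (15 fractional bits), truncation instead of rounding, and a word layout with the real part in the upper half; none of these details are confirmed by the text, and all names are hypothetical.

```python
def to_signed16(value: int) -> int:
    """Interpret the lowest 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack_complex(re: int, im: int) -> int:
    """Pack two 16-bit components into one 32-bit word (assumed layout)."""
    return ((re & 0xFFFF) << 16) | (im & 0xFFFF)


def unpack_complex(word: int):
    """Inverse of pack_complex; returns the signed (re, im) components."""
    return to_signed16(word >> 16), to_signed16(word)


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values and rescale the product back to Q15.

    The 30 fractional bits of the product are reduced to 15 by an
    arithmetic right shift (truncation). Saturation is not modeled.
    """
    return (a * b) >> 15
```

With this convention, 0.5 is represented by 0x4000 and `q15_mul(0x4000, 0x4000)` returns 0x2000, that is, 0.25.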
5.1.2 DATA ENCRYPTION STANDARD
The data encryption standard is a symmetric block cipher which was developed in the 1970s
and standardized to allow interoperability between different systems. Standardization was
key to its widespread use and it has since been incorporated into many applications, even
though concerns about its security were raised early on. Today, it is considered out-dated
and not sufficiently robust due to its short key length [Sch15]. Curiously, the availability of
affordable FPGAs with high capacity allowed researchers to build systems that successfully
break the DES with brute-force attacks, given enough time and a known plaintext [GKN+08].
Operating on data blocks with a length of 64 bits, DES is broken up into a key schedul-
ing phase and an encryption phase with multiple rounds. Key scheduling uses the initial
key with a length of 56 bits and creates distinct subkeys for each encryption round. An
initial bit permutation is applied to the plaintext block and the result is split into two 32-bit
halves. During each encryption round, one half is permuted, combined with a subkey using
bitwise XOR, applied to a substitution box, and permuted again. Afterwards, the result is
combined with the other 32-bit half using a bitwise XOR operation. Finally, the halves are
swapped for the next round. Each round performs the same fundamental operations in the
same order. Substitution boxes are essentially lookup tables which are used to conceal the
relationship between the plaintext and the ciphertext. DES defines eight such boxes with a
static relationship between the input and the output. Furthermore, the standard specifies
that 16 encryption rounds are performed on each plaintext block. Afterwards, the halves
are merged and a final permutation is applied. Due to symmetry of the cipher, decryption
uses almost identical steps. A successor of DES, called advanced encryption standard,
was introduced to provide more robustness with key lengths up to 256 bits. Still, this new
standard shares most of the fundamental operations with DES, including bitwise XOR, bit
permutations, and substitution boxes.
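The round structure and the symmetry between encryption and decryption can be illustrated with a generic Feistel network in Python. The sketch deliberately does not implement DES: the round function is an arbitrary placeholder for the permutation and substitution steps, the initial and final permutations are omitted, and the subkeys are plain integers rather than the output of the DES key schedule.

```python
MASK32 = 0xFFFFFFFF


def toy_round_function(half: int, subkey: int) -> int:
    """Placeholder for permutation, subkey XOR and S-boxes (NOT real DES)."""
    mixed = (half ^ subkey) & MASK32
    return ((mixed * 0x9E3779B1) ^ (mixed >> 15)) & MASK32  # arbitrary mixing


def feistel(block: int, subkeys) -> int:
    """Apply one Feistel pass to a 64-bit block, one round per subkey."""
    left, right = (block >> 32) & MASK32, block & MASK32
    for subkey in subkeys:
        # One half passes through the round function, is XORed with the
        # other half, and the halves are swapped for the next round.
        left, right = right, left ^ toy_round_function(right, subkey)
    return (right << 32) | left  # merge the halves in swapped order


def encrypt(block: int, subkeys) -> int:
    return feistel(block, subkeys)


def decrypt(block: int, subkeys) -> int:
    # Same network, subkeys applied in reverse order.
    return feistel(block, list(reversed(subkeys)))
```

Because the round function never has to be inverted, decryption reuses the same data path, which is also why a single custom instruction can serve both directions in principle.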
As noted in [vS10], DES is a popular benchmark among reconfigurable systems but dif-
ferent software implementations are used to establish a baseline. Therefore, the relative
speedups for DES reported in various publications are difficult to compare. The reference
implementation [Sch15] is clearly not optimized and hence not representative for real world
applications. To allow a more realistic comparison, this work uses the openly available
implementation from libgcrypt (footnote 2), which uses pre-permuted entries in the substitution boxes
to minimize the number of computation steps during each round. This approach reduces
each round to combining one 32-bit half with the subkey, using the substitution box, and
combining the result with the other half. All permutation steps, except for the initial and final
permutations of the 64-bit block, are thus avoided. This greatly increases the performance
of software implementations which resort to inefficient instruction sequences to perform
those functions using bit shifts and bit selections. Especially for triple DES, which effec-
tively triples the key length and the number of rounds, this is a significant advantage. A
software implementation of triple DES was used here. For encryption of multiple plaintext
blocks, the application uses the electronic codebook mode, in which key scheduling is done
only once and all plaintext blocks are encrypted with the same subkeys. The implemented
DES algorithm requires roughly 2.5 times more cycles for key scheduling than for encryption
2 http://www.gnu.org/software/libgcrypt/
of one block. However, the encryption rounds dominate the total runtime after only a few
blocks of plaintext due to the chosen mode of operation.
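How quickly the one-time key scheduling is amortized in electronic codebook mode can be estimated with a short calculation. The sketch below assumes that key scheduling costs 2.5 times the cycles of encrypting one block, which is one reading of the ratio stated above; the function name is hypothetical.

```python
def key_schedule_share(num_blocks: int, schedule_cost: float = 2.5) -> float:
    """Fraction of all cycles spent on key scheduling.

    Costs are expressed in units of "cycles to encrypt one block";
    schedule_cost = 2.5 is an assumed reading of the ratio in the text.
    """
    return schedule_cost / (schedule_cost + num_blocks)
```

Under this assumption, key scheduling accounts for 20 % of the cycles after ten blocks and for less than 0.3 % after the 1024 blocks used for profiling.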
To profile the software implementation, 1024 blocks were encrypted using triple DES and
verified with precomputed reference ciphertexts. As expected, most cycles were spent dur-
ing the encryption rounds, with each round requiring 65 cycles. The implemented custom
instruction exploits the fact that encryption rounds operate on the same data blocks and
only the subkeys have to be fed into the RFU. Due to their length of 48 bits, they occupy
both input register operands. Intermediate results are stored in the fabric with flip-flops,
which have to be loaded prior to the 48 rounds. Instead of storing the substitution boxes in
memory, they are naturally implemented with LUTs inside the FPGA fabric. Due to the bit
length of the permuted entries in the substitution boxes, the lookups are a major contributor
to the critical path length.
5.1.3 EXPONENTIAL FUNCTION
As a final benchmark, evaluation of the exponential function $e^x$ was chosen. Hardware acceleration
to efficiently compute this and other elementary functions is found even in some
general purpose microprocessor architectures [ST99]. It is also central to application specific
systems, for example, platforms for large-scale neuromorphic computing using spiking neu-
ral networks [PHE+17]. A variety of algorithms exists to implement the function. Properties
of the exponential function are used to tailor these algorithms to system and application
specific requirements. For example, many applications restrict the interval in which the
function is evaluated and only require results with a certain precision. Additionally, efficient
implementations directly utilize system specific hardware resources. Table-based methods
and polynomial approximations are the most common approaches to compute the function
[Mul06]. Although they can be quite fast and are easily extended to provide results with high
precision, both rely on the availability of fast multiplication, often with operands beyond the
native word length.
As already shown with the FIR filter and the FFT, the targeted microprocessor is signif-
icantly restricted by the lack of a native multiplication instruction. Therefore, applications
for which multiplication is central benefit the most from a custom instruction that only
implements this operation. To demonstrate the performance advantage and versatility of
a dynamic instruction set extension, the last benchmark uses an algorithm which imple-
ments the exponential function in software without a single call to the multiplication rou-
tine. It is based on additive normalization using an iterative method to select a sequence
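$D_n = 1 + d_n r^{-n}$ such that $t - \sum_{n=0}^{N} \ln(D_n)$ converges to zero. With the radix r set to 2 and a
second recurrence for $E_n$, which converges linearly towards $E_0 e^t$, the following equations
can be found [Mul06]:
\[ t_{n+1} = 2 t_n - 2^{n+1} \ln(1 + d_n 2^{-n}) \quad \text{with} \quad t \in \left[ 0, \sum_{u=0}^{\infty} \ln(1 + 2^{-u}) \right] \tag{5.2} \]
\[ E_{n+1} = E_n + E_n d_n 2^{-n} \tag{5.3} \]
\[ d_n = \begin{cases} -1 & \text{if } t_n < -0.5 \\ 0 & \text{if } -0.5 \le t_n < 0.5 \\ 1 & \text{if } 0.5 \le t_n \end{cases} \tag{5.4} \]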
Extensions to higher radices and further optimizations are possible [EL04], but are not con-
sidered here. Furthermore, a very similar algorithm can be used to compute the logarithm
and other elementary functions. The main advantage of this approach is the fact that each
iteration step can be performed with simple addition and shift instructions. On the targeted
microprocessor, this algorithm is more than four times faster than a conventional algorithm
based on a polynomial approximation tailored to the same value range. Both use a fixed-
point number format, in which values are represented by 17 bits for the integral part and
15 bits for the fractional part [PHE+17]. To express the result of the computation in the
same format, input values are restricted to the interval $[\ln(2^{-15}), \ln(2^{16} - 2^{-15})]$. The value
$E_0$ for the recurrence of equation 5.3 is initialized to precomputed values of the exponential
function using only the integral part of the input. That way, the integral part of t addresses a
table of 23 values and the fractional part is used to initialize $t_0$. Similar approaches of range
reduction are used in other algorithms to reduce the degree of polynomials or to increase
precision. Here, range reduction is naturally implemented by proper initialization, without
having to perform a final multiplication step. Additionally, a lookup table with 71 entries is
used to store the expression involving the logarithm in equation 5.2. The error distribution
of the implemented exponential function is shown in figure 5.2. It was found that 35 iterations are
sufficient to achieve an error of less than one least significant bit in the fractional part over
the full range of valid inputs. Roughly 18 cycles are spent on each iteration by the software
implementation.
Figure 5.2: Error distribution of the fixed-point implementation for −10.4 < x < 11.1,
with a maximum normalized error of -0.761 for x ≈ 10.893
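The additive normalization scheme with range reduction by table lookup can be sketched in Python as follows. This is a floating-point illustration, not the fixed-point benchmark code: the residual is kept in scaled form as in the recurrence for t, the selection of d follows the thresholds of ±0.5 described above, and the layout of both tables as well as all names are assumptions made for this example.

```python
import math

ITERATIONS = 35  # number of iterations reported as sufficient in the text

# Precomputed terms 2^(n+1) * ln(1 + d * 2^-n); (n, d) = (0, -1) is undefined
# and never selected, because the initial residual is non-negative.
LOG_TABLE = {
    (n, d): math.ldexp(math.log1p(d * 2.0 ** -n), n + 1)
    for n in range(ITERATIONS)
    for d in (-1, 1)
    if (n, d) != (0, -1)
}
# Range reduction: e^i for every integral part i of a valid input (23 values).
EXP_TABLE = {i: math.exp(i) for i in range(-11, 12)}


def exp_additive(x: float) -> float:
    """Approximate e^x using only table lookups, additions and shifts."""
    integral = math.floor(x)
    if integral not in EXP_TABLE:
        raise ValueError("input outside the supported range")
    e = EXP_TABLE[integral]  # E_0
    t = x - integral         # t_0, fractional part in [0, 1)
    for n in range(ITERATIONS):
        if t < -0.5:
            d = -1
        elif t < 0.5:
            d = 0
        else:
            d = 1
        if d == 0:
            t = 2.0 * t
        else:
            t = 2.0 * t - LOG_TABLE[(n, d)]
            e += d * math.ldexp(e, -n)  # E + E * d * 2^-n, a shift and an add
    return e
```

Since `math.ldexp` only adjusts the exponent, the update of `e` corresponds to the shift-and-add step that makes the algorithm attractive for a processor without a multiplier.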
The implementation of the custom instruction placed the lookups for the integral and the
fractional part within the fabric. This was faster than reusing the lookups from the pure
software implementation but required a significant amount of FPGA tiles. Precision requirements
of the results led to lookups which exceed the native word length and hence did
internal result and set the carry flag according to the most significant bit that is lost. Using
this special mode in the last iteration, rounding towards the nearest value was efficiently
implemented.
5.2 FUNCTIONAL VERIFICATION
Even with a fully functional CAD flow, rigorous verification of all benchmarks was necessary
to ensure consistent and credible results. Additionally, a number of directed tests were per-
formed during the modifications of the microprocessor hardware as well as during develop-
ment of the software implementations. For example, a very simple custom instruction was
used to test the integration and the interface of the RFU. Another test was performed to
stimulate the clock divider and the counter that sequences multi-cycle instructions. Also,
the CRC16 instruction was used to test the integration of post-routing netlists into the RFU.
Experiments with inline assembler expressions were used to explore the instruction format
for custom instructions and their alignment with native machine code generated by the
compiler. Furthermore, numerous tests with the VTR flow and COFFE were necessary to
automate the FPGA design space exploration. All benchmarks were first developed on an
x86 system and compared to existing reference implementations. All of these simple tests
revealed existing flaws early on and avoided problems that would have been much harder
to debug with the complete application benchmarks running on the simulation model of the
microprocessor.
Figure 5.3 shows the different components that are used to verify the functionality of the
final benchmark applications.
[Block diagram. Components: C source code, cross-compiler, native x86 compiler, machine code, reference results, simulation host; testbench with test case control, mc32_core (RFU, regfile, PC, SR), instruction & data memory, and profiler; annotated listing; VTR flow with FPGA architecture description, RTL description of RFU, VPR netlist primitives, routing resource graph, and post-routing netlist; bitstream generator and bitstream.]
Figure 5.3: Overview of the simulation-based verification approach
The top section is occupied by the FPGA CAD flow and its results. The dashed upper
blocks show how a bitstream generator would be used in conjunction with VPR to generate
a configuration image that could be programmed into a hardware prototype. Since this
thesis is only concerned with a simulation model, the post-routing netlist, which implements
the custom logic inside the RFU, is used in conjunction with the RTL simulation of the
microprocessor. A generic test case resets and initializes the processor, which then fetches
and executes instructions from memory. At the same time, a profiler observes the state of
the microprocessor and records the value of the status register and the program counter
at each cycle. This information is afterwards used to annotate the machine code listing
and to determine the number of cycles required by each subroutine. The left section of
figure 5.3 shows how the software toolchain was used to create the machine code for the
microprocessor and the reference results. These are computed by the simulation host,
which also simulated the targeted microprocessor. Separate compilers had to be used,
since both machines implement vastly different ISAs.
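The attribution of cycles to subroutines from a recorded program counter trace can be sketched in Python as shown below. The trace format, the symbol table and the function name are hypothetical; the actual profiler additionally records the status register, which is not modeled here.

```python
from bisect import bisect_right
from collections import Counter


def cycles_per_subroutine(pc_trace, symbol_starts):
    """Count cycles per subroutine from one program counter value per cycle.

    pc_trace      -- iterable of program counter values, one entry per cycle
    symbol_starts -- dict mapping subroutine name to its start address
    """
    ordered = sorted((addr, name) for name, addr in symbol_starts.items())
    addresses = [addr for addr, _ in ordered]
    totals = Counter()
    for pc in pc_trace:
        # The subroutine with the largest start address not above pc owns it.
        index = bisect_right(addresses, pc) - 1
        name = ordered[index][1] if index >= 0 else "<unknown>"
        totals[name] += 1
    return totals
```

A stalled instruction keeps the program counter unchanged for several cycles and is therefore counted once per cycle, which matches a cycle-based rather than an instruction-based summary.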
The correctness of the reference results produced by the microprocessor specific imple-
mentation was further verified with external implementations. This is explicitly not shown
in the figure. Fixed-point implementations of the FFT and the FIR filter were verified with
results of high precision implementations used by Octave. In both cases, random noise
was used to stimulate the systems with inputs from a large value range. Results of the
exponential function were compared with the implementation of the C standard library. A
set of test vectors is shipped with libgcrypt to verify the functionality of DES. That way, all
application benchmarks were first developed and tested on the x86 host. This approach
also came with the advantages of better visibility to debug applications and faster program
execution. For example, verifying the fixed-point exponential implementation over the full
range with the simulated microprocessor required more than one hour. With this setup, the
microprocessor performs the comparison of computed results with the reference results
online. As both results stem from the same algorithmic implementation, comparison was
performed on bit level. The same reference results were also used to verify the correctness
of results computed by custom instructions. In that case, it was often helpful to replace the
netlist of the RFU with the original RTL description, which has an identical interface and
provides better observability of internal signals.
5.3 RESULTS
After implementing the application benchmarks with the baseline and with the extended
systems, profiling data and VPR’s resource reports were used to compare the performance
and size of both systems. Crucially, both systems use the same software toolchain and share
all native instructions. Hence, the performance of critical sections and their size relative to
the complete baseline application is identical on both systems. If a different ISA were used
as a performance baseline of software implementations, these sections could be vastly
different. Their relative size is critical, since it limits the maximum achievable speedup. For
example, if a custom instruction can replace one function which requires half the cycles
of an application, the extended system will be able to run the same application only twice
as fast, regardless of the raw speedup between the custom instruction and the original
function. It is important to consider this nonlinearity between raw and total speedup during
identification and design of custom instructions.
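This relation between raw and total speedup is commonly known as Amdahl's law and can be written down in a few lines of Python. The function names are chosen for this illustration only.

```python
def total_speedup(critical_fraction: float, raw_speedup: float) -> float:
    """Application speedup when a fraction of the cycles is accelerated.

    critical_fraction -- share of baseline cycles spent in the replaced code
    raw_speedup       -- speedup of the replaced code alone
    """
    return 1.0 / ((1.0 - critical_fraction) + critical_fraction / raw_speedup)


def max_speedup(critical_fraction: float) -> float:
    """Upper bound of the total speedup for an infinitely fast replacement."""
    return 1.0 / (1.0 - critical_fraction)
```

For a critical section that covers half of the cycles, `max_speedup(0.5)` returns 2.0, which is the limit stated in the example above. A raw speedup of 10 for the same section yields a total speedup of only about 1.8.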
Benchmark results of the systems with instruction set extension are given in table 5.1. In
addition to the presented application benchmarks, the results for the FIR filter from section
4.4 were also included in this comparison. The first three rows show the performance benefit
of a tightly coupled reconfigurable system. Absolute performance is given as a reference
for comparison with other conventional systems and measured in million operations per sec-
ond (MOPS). It is based on the average time required by the system to process one sample
and thus specific for each application. Furthermore, the estimated size of the critical section
in the original software implementation and the speedup for the total application are listed.
The latter is derived from the ratio of cycles required by the pure software implementation
and the cycles required by the extended system. To evaluate the area efficiency, the size
of each instruction set extension and the total system size is given. As already discussed,
the cost of the extended system has to consider the complete size of the array and not
only the required tiles. Calculation of the system size does not include memory to hold
instructions or data. To still get an impression for the required memory capacity, the size
of each code section and the fabric configurations are listed. Code size was determined
with the size utility of the supplied software toolchain. In all four cases, the compiler was
instructed to perform aggressive optimizations which ignore the size of the executable. The
required storage capacity of configuration data is derived directly from the array size and
the configuration bits per tile, without considering any kind of compression. This figure can
be used to estimate the time required for initial configuration, for example, assuming that
the configuration controller can saturate the same bandwidth as the memory interface of
the microprocessor.
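Such an estimate of the initial configuration time can be sketched as follows. The clock frequency of 300 MHz is taken from section 5.1, whereas a transfer of 32 bits per cycle is purely an assumption of this example and not a property confirmed for the memory interface; the function names are hypothetical.

```python
import math


def config_cycles(config_kib: float, bus_bits_per_cycle: int = 32) -> int:
    """Cycles needed to transfer the configuration data.

    bus_bits_per_cycle = 32 is an assumed interface width.
    """
    bits = config_kib * 1024 * 8
    return math.ceil(bits / bus_bits_per_cycle)


def config_time_us(config_kib: float, f_clk_hz: float = 300e6,
                   bus_bits_per_cycle: int = 32) -> float:
    """Configuration time in microseconds at the given clock frequency."""
    return config_cycles(config_kib, bus_bits_per_cycle) / f_clk_hz * 1e6
```

Under these assumptions, a configuration of 12.9 KiB would require 3303 cycles, or roughly 11 µs, before the first custom instruction can be issued.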
Table 5.1: Performance and size of systems with the proposed instruction set
extension for the application benchmarks

                                17-tap       Complex   Triple   Exponential
                                FIR filter   FFT       DES      function
Absolute performance (MOPS)     0.67         0.47      0.62     1.32
Size of critical section (%)    94           85        98       95
Relative application speedup    15.9         6.3       8.5      3.0
Number of required tiles        79           81        54       163
Total tiles of square array     81           81        64       169
Relative system size ¹          5.4          5.4       4.5      10.1
Number of samples               4096         4096      1024     687
Size of code (KiB)              1.3          1.9       3.93     1.4
Size of configuration (KiB)     12.9         12.9      10.2     26.9

¹ Assuming A_core = 29 900 µm² and A_tile = 1602 µm², see tables 4.1 and 3.5.
In all four scenarios, significant acceleration of the complete application was achieved. Ex-
cept for the last benchmark, the gain in performance was possible with only a very moderate
increase in area. For these cases, the area delay product of the system with instruction set
extension was considerably lower than that of the baseline microprocessor. Therefore, ap-
plications which require both a high throughput and low latency benefit from the proposed
architecture. These cases also show that such an instruction set extension is superior to sys-
tems that employ multiple instances of the microprocessor in parallel. For such alternative
systems, the increase in throughput is, at best, bought with an identical increase in system
area. Usually, additional area overhead is required for these systems to ensure consistent
memory access. Furthermore, multiprocessor systems are only suited for certain appli-
cations and do not necessarily decrease latency. Latency, however, can be performance
critical, e.g., to quickly perform calculations inside interrupt service routines or to verify
checksums. Unfortunately, the implementation using custom instructions to compute the
exponential function was less efficient. This was mostly due to the large number of tiles
required to store the precomputed lookup values. More efficient use of these resources
would be possible with an implementation that performs the computation of multiple inde-
pendent values in parallel. Reusing the lookup values for the logarithm is possible, since
their index is only a function of the iteration step. Furthermore, an alternative implementa-
tion could use a redundant internal representation to speed up the computation.
Looking at the code sizes, it is clear that significantly more memory was required
to store data samples than to store instructions. And although custom instructions
reduced the code size, the amount of absolute reduction is insignificant, even if the over-
head for configuration is ignored. This is partly due to the very efficient encoding of the
native instructions. For example, the full FFT application, including the fully unrolled multi-
plication routine, required less than 2 KiB of storage for instructions. Furthermore, the goal
of this comparison was to evaluate the efficiency of computational capabilities. That is,
both the pure software implementation and the enhanced version represent more or less
identical algorithmic implementations where only the inner kernel was computed on differ-
ent hardware. Hence, a complete system, either with or without the eFPGA extension, is
expected to require the same amount of memory for data samples. Also, the overhead
in memory capacity required for the configuration can be reduced dramatically using com-
pression schemes. In addition, multiple custom instructions could reuse the reconfigurable
fabric and could further increase the system efficiency.
6 DISCUSSION AND OUTLOOK
This last chapter summarizes the steps which were performed during this thesis and dis-
cusses the most important results and their implications. Furthermore, extensions of the
proposed architecture and directions for future research are outlined.
6.1 DISCUSSION OF RESULTS
The motivation of this thesis was that the architecture and design of the fine-grained fabric
is central to the performance of the system and the practicality of high-level software tech-
niques. In contrast to similar research projects, a design space exploration with detailed
physical models, including a custom area model, was performed. Although still relatively
simple, the area model was shown to provide consistent and more reliable results. It was
found that the context of a tightly coupled instruction set extension and a modern 22 nm
process technology favors a broader range of logical architectures than common wisdom
for conventional FPGAs suggested. Still, the results were generally in line with previous
studies, which also found K, N, and L to have the greatest impact on performance. This
study further showed that architectures at either extreme are inefficient, that is, very small
or very large logic clusters are unfavorable. Furthermore, the choice of input and output
flexibility for a medium sized cluster in combination with single-driver routing was explored.
Again, insights from previous studies were confirmed, however, further results suggested
that smaller clusters require a significantly different choice for optimal results. It was also
observed that the choice of benchmark circuits and the quality of CAD tools had a significant
influence on the results. Further investigations of architectures with depopulation of the lo-
cal crossbar and the global routing segments were not conducted, limited by the chosen
transistor sizing method and the fixed circuit implementations. Area and delay results for
one specific logical architecture, primarily chosen to minimize delay, were presented, along
with specific details regarding the number of configuration bits. Due to lack of more reliable
results, e.g., from a concrete layout implementation or a hardware prototype, these prelimi-
nary performance estimates were further used to evaluate the modeled fabric in conjunction
with a microprocessor.
Tight coupling was implemented in the simulation model to ensure minimal latency be-
tween the reconfigurable functional unit and the microprocessor. Direct access to all proces-
sor registers and a configurable clock divider ensured almost seamless integration without
compromising performance of native instructions. In principle, the proposed interface of
the RFU is also suited for pipelining or integration of multiple parallel RFUs. Direct memory
access was explicitly not provided to the RFU to avoid synchronization overhead. A custom
instruction format was designed and integrated into the existing software toolchain with
minimal effort. Overall, the modifications to the microarchitecture were minimal and most
of the existing infrastructure was reused by the new functional unit. Hence, the concept
of a reconfigurable functional unit and the developed interfaces are suitable for adoption
across ISAs.
Finally, the complete system was benchmarked using real-world application benchmarks.
These benchmarks were chosen from three different application domains, and yet each ap-
plication offered sufficient potential for acceleration with custom instructions. In all four
cases, the extended system more than doubled the performance compared to the baseline
system. In the case of an FIR filter, the original ISA suffered from the lack of fast multi-
plication and a corresponding custom instruction accelerated this application by a factor of
15. Furthermore, an array capacity of 81 tiles was sufficient for the majority of benchmarks,
which suggests that fabrics with very few reconfigurable resources are beneficial for a variety
of applications. This stands in contrast to commercially available loosely coupled systems,
where high latency between the microprocessor and the FPGA fabric constrains the target
application space. However, a direct comparison of capacity or density with other commercial
eFPGA architectures was not possible due to lack of information and due to a variety
of logic element architectures. The investigation also showed that tight coupling and the
performance of the fabric are very well suited to implement specific operations with low
latency. Beyond a replacement of the native CRC16 instruction, many application specific
functions were easily implemented.
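To illustrate why a function such as a cyclic redundancy check is a natural candidate for a fine-grained fabric, the following sketch computes a CRC-16 one bit at a time, which reduces to a chain of shifts and XORs. The polynomial (0x1021, CCITT), the zero initial value, and the function name are assumptions chosen for illustration only; the parameters of the native CRC16 instruction are not stated in this chapter.

```python
def crc16_bitwise(data: bytes, poly: int = 0x1021, init: int = 0x0000) -> int:
    """Bit-serial CRC-16, most significant bit first.

    Assumption: CCITT polynomial and zero initial value, not taken from the thesis.
    """
    crc = init & 0xFFFF
    for byte in data:
        # Align the next input byte with the upper half of the 16-bit register.
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                # Top bit set: shift out and subtract (XOR) the polynomial.
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    print(hex(crc16_bitwise(b"123456789")))
```

In software, every input bit costs a test, a shift, and a conditional XOR. A combinational implementation of the same update function consists only of XOR trees, which is the kind of logic a LUT-based fabric implements directly.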
In spite of architectural compromises and a minimalist logic element architecture, the
proposed system has shown promising results, both in performance and in area efficiency.
A noticeable increase in performance was achieved with simple and primitive implementations
of custom instructions. In terms of area-delay product, the proposed architecture, with
the assumed properties, is superior to conventional approaches of increasing throughput,
e.g., multi-core systems. Furthermore, the presented fabric architecture represents a bare
minimum, with no explicit support for arithmetic operations and without bias towards cer-
tain regular structures. Hence, a large optimization potential exists, not only for the logical
FPGA architecture but also for its physical implementation.
6.2 EXTENSIONS AND FUTURE RESEARCH
As a consequence of the large spectrum of involved topics and the focus on the FPGA
design space exploration, this work only scratched the surface of other important aspects
of reconfigurable systems. A number of interesting features of tightly coupled reconfigurable
systems, with the potential for significant performance improvements, were outside the
scope of this investigation. The following paragraphs group extensions of the proposed
architecture and directions for future research into four distinct categories.
With a focus on logical architecture, the present work used a very basic model for phys-
ical implementations. Further aspects, concerning the design and layout of memory cells
for configuration, possibly with multi-context capabilities, as well as multiplexer topologies
and the fundamental choice of switch building blocks were ignored. Transmission gates
are generally preferred for operation with very low supply voltages, since they are better
suited to switch both signal levels. Pass gates were primarily chosen to minimize the area
of multiplexer implementations; however, the discussed inefficiencies for small NMOS layouts
might reduce this advantage. Transmission gates would further simplify the circuit
design, since no level-restoring PMOS pull-up would be required. Also, many circuit design
styles have been proposed, specifically to reduce power consumption for FPGAs [LCG06].
In addition, the targeted FDSOI process technology offers potential to exploit body-biasing
for higher energy efficiency. Moreover, the usage of non-volatile memory cells might re-
quire special treatment with the potential to favor different circuit topologies. The employed
transistor sizing method was primarily chosen to rank logical architectures and showed opti-
mization potential for the model of routing buffers, which goes hand in hand with improved
wire load models. Finally, the present work has neglected the design and the influence of
the required clock network as well as flip-flop initialization and the power-on sequence of
the fabric.
Logical architectures were restricted to aspects that are easily parameterizable in the
physical model and fully supported by the FPGA CAD flow. Structural bias in the global
routing network and dedicated connections between logic elements are employed by other
architectures to exploit regularity in arithmetic circuits. Furthermore, area efficiency could
be increased by depopulating crossbars and routing segments. Carry chains, which require
redesign of the logic cluster and extensions to the logic elements, are expected to greatly
enhance performance. As carry chains fundamentally change the local routing structure, a
reevaluation of cluster size and logic element architecture would be necessary. Effective-
ness of carry chains also depends on the quality and support of the CAD tools [PZN+15].
Furthermore, dedicated hardware and fundamentally different logic elements have been
proposed to augment or replace conventional LUT-based architectures [PA12]. Still, in com-
bination with carry chains, most modern FPGAs contain much more versatile fracturable
LUTs. Also, ABC supports cascaded LUTs, combinations of LUTs and logic gates, or combi-
nations of LUTs with different sizes [MBFG15]. Additionally, most presented reconfigurable
microprocessors are equipped with multi-context fabrics that can rapidly switch between
multiple configuration states. This requires specifically designed configuration cells but
also support of the logical architecture [TCJW97]. Multi-context fabrics require overhead
to hold multiple states at once and restrict all contexts to a common size. Alternatively,
switching between multiple contexts could be accomplished with soft-multiplexers using
LUT resources, without requiring each context to have the same size. A qualitative sketch
comparing these two approaches is shown in figure A.1.1.
To fully exhaust the potential of the reconfigurable hardware, suitable CAD tools and algo-
rithms are mandatory. Replacement of Odin II is inevitable to map complex designs, ease
RTL design entry, and explore new high-level tools. Yosys [Wol] apparently has better Ver-
ilog support and a more robust implementation, based on regression tests with multiple
other synthesis tools. Currently, it is actively being developed, ships with detailed documentation,
and is available under a permissive license. Still, evaluation to qualify this tool as
a replacement should emphasize integration with VPR, support of FPGA specific features,
and quality of results. Next, ABC supports additional options for delay oriented logic op-
timization and technology mapping, which have not been explored in this thesis. Finally,
VPR has primitive support for handling pre-mapped macros. This could be used to compare
the quality of automatic packing and placement with manually created macros, especially
in the context of carry chains. Also, replacements for its placement and routing algorithms
with the possibility of more deterministic results and higher quality have been investigated
[Gor14, VBS13]. In addition to these user-centric tools, many opportunities exist to improve
state-of-the-art tools for modeling FPGA architectures. Automatic transistor sizing in particular
has been popular in the context of FPGAs; however, this thesis has shown that COFFE is
severely limited by its fixed circuit structures and inefficient sizing algorithm. Furthermore,
to explore architectural features that aim to reduce power consumption, robust physical
models for power modeling and estimation are required.
Lastly, this thesis has laid the basis to further explore system-level extensions of tightly
coupled reconfigurable systems. Based on the discussion of section 4.4 and the results of
the application benchmarks, it would be interesting to evaluate multiple RFUs either with
a superscalar architecture or with a multi-core system. Partial runtime reconfiguration is
particularly well suited for tightly coupled architectures. Apart from the already discussed
modifications on the logical and physical level, this would also require explicit support by
the instruction set and adoption of a reconfiguration-aware software toolchain [Rul09]. Automatic
identification of custom instructions and improved profiling could help the designer
to assess multiple options more quickly. Finally, different algorithms and architectures have
been studied in order to reduce the configuration overhead using compression techniques
[MD06]. A significant reduction in size is possible, especially when the inherent overhead
of the one-hot configuration pattern of two-level routing multiplexers is considered.
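The overhead of a one-hot configuration pattern can be made concrete with a small count of configuration bits. The sketch below is a simplified model and rests on two assumptions that are not taken from this chapter: both multiplexer levels are sized close to the square root of the number of inputs, and the first-level select bits are shared by all first-level groups. The function names are hypothetical.

```python
import math


def two_level_one_hot_bits(num_inputs: int) -> int:
    """Configuration bits of a two-level multiplexer with one-hot select lines.

    Assumed model: level 1 has ceil(sqrt(n)) shared one-hot bits,
    level 2 has one bit per first-level group.
    """
    if num_inputs < 2:
        raise ValueError("a multiplexer needs at least two inputs")
    level1 = math.isqrt(num_inputs - 1) + 1   # ceil(sqrt(n))
    level2 = -(-num_inputs // level1)         # ceil(n / level1)
    return level1 + level2


def binary_encoded_bits(num_inputs: int) -> int:
    """Information-theoretic minimum: ceil(log2(n)) bits select one of n inputs."""
    if num_inputs < 2:
        raise ValueError("a multiplexer needs at least two inputs")
    return (num_inputs - 1).bit_length()


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, two_level_one_hot_bits(n), binary_encoded_bits(n))
```

Under these assumptions a 16-input multiplexer stores 8 bits although 4 bits would suffice to identify the selected input. This redundancy is what a compression scheme for configuration data could exploit.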
BIBLIOGRAPHY
[ABP11] G. Ansaloni, P. Bonzini, and L. Pozzi, “EGRA: A coarse grained reconfigurable
architectural template,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 19,
no. 6, pp. 1062–1074, Jun. 2011.
[Act07] SX-A Family FPGAs, Actel Corporation, Mountain View, CA, USA, 2007.
[Ahm01] E. Ahmed, “The effect of logic block granularity on deep-submicron FPGA per-
formance and density,” Master’s thesis, University of Toronto, 2001.
[Alt02] Excalibur: Hardware Reference Manual, 3rd ed., Altera, San Jose, CA, USA,
Nov. 2002.
[Alt15] User Customizable ARM-Based SoC, Altera, San Jose, CA, USA, 2015.
[Alt16] MAX 10 FPGA Configuration User Guide, Altera, San Jose, CA, USA, 2016.
[Ama06] H. Amano, “A survey on dynamically reconfigurable processors,” IEICE Trans.
Commun., vol. E89-B, no. 12, pp. 3179–3187, 2006.
[API03] K. Atasu, L. Pozzi, and P. Ienne, “Automatic application-specific instruction-set
extensions under microarchitectural constraints,” in Design Automation Conf.
ACM, 2003, pp. 256–261.
[AR00] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-submicron
FPGA performance and density,” in Int. Symp. Field-Programmable Gate Arrays.
ACM, 2000, pp. 3–12.
[AR04] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-submicron
FPGA performance and density,” IEEE Trans. Very Large Scale Integr. VLSI Syst.,
vol. 12, no. 3, pp. 288–298, Mar. 2004.
[Arn05] J. M. Arnold, “S5: the architecture and development flow of a software config-
urable processor,” in Int. Conf. Field Programmable Technology, Dec. 2005, pp.
121–128.
[Atm03] AT94KAL Series Field Programmable System Level Integrated Circuit, Atmel,
San Jose, CA, USA, 2003.
[BBM+04] M. Bocchi, C. D. Bartolomeis, C. Mucci et al., “A XiRisc-based SoC for embed-
ded DSP applications,” in IEEE Custom Integrated Circuits Conf., Oct. 2004, pp.
595–598.
[BEM+03] V. Baumgarte, G. Ehlers, F. May et al., “PACT XPP—a self-reconfigurable data
processing architecture,” J. Supercomputing, vol. 26, no. 2, pp. 167–184, 2003.
[Ber16] Berkeley Logic Synthesis and Verification Group, “ABC: A system for sequential
synthesis and verification,” http://www.eecs.berkeley.edu/~alanmi/abc/, 2016,
Changeset ID: 8E08604F8AD3.
[BR97] V. Betz and J. Rose, “VPR: A new packing, placement and routing tool for
FPGA research,” in Int. Workshop Field-Programmable Logic and Applications.
Springer, 1997, pp. 213–222.
[BRM99] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for deep-submicron
FPGAs. Kluwer Academic Publishers, 1999.
[BSKH07] L. Bauer, M. Shafique, S. Kramer, and J. Henkel, “RISPP: Rotating instruction
set processing platform,” in Design Automation Conf., Jun. 2007, pp. 791–796.
[BV03] J. Becker and M. Vorbach, “Architecture, memory and interface technology integration
of an industrial/academic configurable system-on-chip (CSoC),” in IEEE
Comput. Soc. Annu. Symp. VLSI, Feb. 2003, pp. 107–112.
[Cav16] ThunderX2 Product brief, Cavium, San Jose, CA, USA, 2016.
[CC01] J. E. Carrillo and P. Chow, “The effect of reconfigurable units in superscalar
processors,” in Int. Symp. Field-Programmable Gate Arrays. ACM, 2001, pp.
141–150.
[CD94] J. Cong and Y. Ding, “FlowMap: an optimal technology mapping algorithm for
delay optimization in lookup-table based FPGA designs,” Trans. Computer-Aided
Design Integrated Circuits and Syst., vol. 13, no. 1, pp. 1–12, Feb. 1994.
[CFHZ04] J. Cong, Y. Fan, G. Han, and Z. Zhang, “Application-specific instruction genera-
tion for configurable processor architectures,” in Int. Symp. Field-Programmable
Gate Arrays. ACM, 2004, pp. 183–189.
[CH02] K. Compton and S. Hauck, “Reconfigurable computing: A survey of systems
and software,” ACM Comput. Surv., vol. 34, no. 2, pp. 171–210, Jun. 2002.
[Chi13] C. Chiasson, “Optimization and modeling of FPGA circuitry in advanced process
technology,” Master’s thesis, University of Toronto, 2013.
[CMP+16] R. Carter, J. Mazurier, L. Pirro et al., “22nm FDSOI technology for emerging
mobile, internet-of-things, and RF applications,” in IEEE Int. Electron Devices
Meeting, Dec. 2016, pp. 2.2.1–2.2.4.
[Col06] R. P. Colwell, The pentium chronicles: The people, passion, and politics behind
Intel’s landmark chips. Wiley-IEEE Computer Society Press, 2006.
[CSO+00] Y. Cao, T. Sato, M. Orshansky, D. Sylvester, and C. Hu, “New paradigm of pre-
dictive MOSFET and interconnect modeling for early circuit simulation,” in IEEE
Custom Integ. Circuits Conf., 2000, pp. 201–204.
[CTMB13] L. Chen, J. Tarango, T. Mitra, and P. Brisk, “A just-in-time customizable proces-
sor,” in Int. Conf. Computer-Aided Design, Nov. 2013, pp. 524–531.
[DJL+97] M. Dolle, S. Jhand, W. Lehner, O. Muller, and M. Schlett, “A 32-b RISC/DSP
microprocessor with reduced complexity,” IEEE J. Solid-State Circuits, vol. 32,
no. 7, pp. 1056–1066, Jul. 1997.
[DLW+11] J. Das, A. Lam, S. J. E. Wilton, P. H. W. Leong, and W. Luk, “An analytical model
relating FPGA architecture to logic density and depth,” IEEE Trans. Very Large
Scale Integr. VLSI Syst., vol. 19, no. 12, pp. 2229–2242, Dec. 2011.
[EBA+11] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark
silicon and the end of multicore scaling,” in Int. Symp. on Comput. Architecture,
Jun. 2011, pp. 365–376.
[EBHW16] J. W. Eaton, D. Bateman, S. Hauberg, and R. Wehbring, GNU Octave version
4.2.0 manual: a high-level interactive language for numerical computations,
2016.
[EL04] M. D. Ercegovac and T. Lang, Digital arithmetic. Morgan Kaufmann Publishers,
2004, ch. 10: Function Evaluation, pp. 548 – 607.
[FPM12] M. J. Flynn, O. Pell, and O. Mencer, “Dataflow supercomputing,” in Int. Conf.
Field Programmable Logic and Applications, Aug. 2012, pp. 1–3.
[GKN+08] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp, “Cryptanalysis with CO-
PACOBANA,” IEEE Trans. Comput., vol. 57, no. 11, pp. 1498–1513, Nov. 2008.
[Gor14] M. Gort, “Fast CAD for FPGAs,” Ph.D. dissertation, University of Toronto, 2014.
[HAN+16] S. Haas, O. Arnold, B. Nöthen et al., “An MPSoC for energy-efficient database
query processing,” in Design Automation Conf. ACM, 2016, pp. 112:1–112:6.
[Hau00] J. Hauser, “Augmenting a microprocessor with reconfigurable hardware,” Ph.D.
dissertation, University of California, Berkeley, 2000.
[HEW13] E. Hung, F. Eslami, and S. J. E. Wilton, “Escaping the academic sandbox: Real-
izing VPR circuits on xilinx devices,” in Int. Symp. Field-Programmable Custom
Computing Machines. IEEE, Apr. 2013, pp. 45–52.
[HFHK04] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, “The chimaera reconfigurable
functional unit,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 12, no. 2,
pp. 206–217, Feb. 2004.
[HMB+14] P. Hammarlund, A. J. Martinez, A. A. Bajwa et al., “Haswell: The fourth-genera-
tion Intel core processor,” IEEE Micro, vol. 34, no. 2, pp. 6–20, Mar. 2014.
[HSE+12] S. Höppner, C. Shao, H. Eisenreich et al., “A power management architecture
for fast per-core DVFS in heterogeneous MPSoCs,” in IEEE Int. Symp. Circuits
and Syst., May 2012, pp. 261–264.
[Hun15] E. Hung, “Mind the (synthesis) gap: Examining where academic FPGA tools lag
behind industry,” in Int. Conf. Field Programmable Logic and Applications, Sep.
2015, pp. 1–4.
[HWY+09] E. Hung, S. J. E. Wilton, H. Yu, T. C. P. Chau, and P. H. W. Leong, “A detailed delay
path model for FPGAs,” in Int. Conf. Field-Programmable Technology, Dec. 2009,
pp. 96–103.
[IL07] P. Ienne and R. Leupers, Eds., Customizable embedded processors: Design,
technologies and applications. Morgan Kaufmann Publishers, 2007.
[JH15] L. J. Jung and C. Hochberger, “Feasibility of high level compiler optimizations
in online synthesis,” in Int. Conf. Reconfigurable Computing and FPGAs, Dec.
2015, pp. 1–7.
[JKGS10] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon, “Odin II - an open-source
verilog HDL synthesis tool for CAD research,” in Int. Symp. Field-Programmable
Custom Computing Machines, May 2010, pp. 149–156.
[KBS+10] R. Koenig, L. Bauer, T. Stripf et al., “KAHRISMA: A novel hypermorphic recon-
figurable-instruction-set multi-grained-array architecture,” in Design, Automation
Test in Europe Conf., Mar. 2010, pp. 819–824.
[KBW03] N. Kafafi, K. Bozman, and S. J. E. Wilton, “Architectures and algorithms for
synthesizable embedded programmable logic cores,” in Int. Symp. Field Pro-
grammable Gate Arrays. ACM, 2003, pp. 3–11.
[Kim16] J. H. Kim, “Synthesizable FPGA fabrics,” Master’s thesis, University of Toronto,
2016.
[KLS+16] G. K. Konstadinidis, H. P. Li, F. Schumacher et al., “SPARC M7: A 20 nm 32-core
64 MB L3 cache processor,” IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 79–91,
Jan. 2016.
[KR09] I. Kuon and J. Rose, Quantifying and exploring the gap between FPGAs and
ASICs. Springer, 2009.
[KY16] F. F. Khan and A. Ye, “An evaluation on the accuracy of the minimum width tran-
sistor area models in ranking the layout area of FPGA architectures,” in Int. Conf.
Field Programmable Logic and Applications, Aug. 2016, pp. 1–11.
[LAR11] J. Luu, J. H. Anderson, and J. S. Rose, “Architecture description and packing
for logic blocks with hierarchy, modes and complex interconnect,” in Int. Symp.
Field Programmable Gate Arrays. ACM, 2011, pp. 227–236.
[LCB+06] A. Lodi, A. Cappelli, M. Bocchi et al., “Xisystem: a XiRisc-based SoC with recon-
figurable IO module,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 85–96, Jan.
2006.
[LCG06] A. Lodi, L. Ciccarelli, and R. Guerrieri, “Low leakage techniques for FPGAs,”
IEEE J. Solid-State Circuits, vol. 41, no. 7, pp. 1662–1672, Jul. 2006.
[Lee06] E. Lee, “Interconnect driver design for long wires in field programmable gate
arrays,” Master’s thesis, University of British Columbia, 2006.
[LGW+14] J. Luu, J. Goeders, M. Wainberg et al., “VTR 7.0: next generation architecture
and CAD system for FPGAs,” Trans. on Reconfigurable Technology and Syst.,
vol. 7, no. 2, pp. 6:1–6:30, Jun. 2014.
[LL01] G. Lemieux and D. Lewis, “Using sparse crossbars within LUT clusters,” in Int.
Symp. Field Programmable Gate Arrays. New York, NY, USA: ACM, 2001, pp.
59–68.
[LL04] G. Lemieux and D. Lewis, Design of Interconnection Networks for Pro-
grammable Logic. Kluwer Academic Publishers, 2004.
[LLTY04] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and single-driver wires in
FPGA interconnect,” in Int. Conf. Field-Programmable Technology, Dec. 2004,
pp. 41–48.
[LMB+06] A. Lodi, C. Mucci, M. Bocchi et al., “A multi-context pipelined array for embedded
systems,” in Int. Conf. Field Programmable Logic and Applications, Aug. 2006,
pp. 1–8.
[LN04] K. Leijten-Nowak, “Template based embedded reconfigurable computing,”
Ph.D. dissertation, Eindhoven University of Technology, 2004.
[LVT04] R. Lysecky, F. Vahid, and S. X.-D. Tan, “Dynamic FPGA routing for just-in-time
FPGA compilation,” in Design Automation Conf. ACM, 2004, pp. 954–959.
[MBFG15] A. Mishchenko, R. Brayton, W. Feng, and J. Greene, “Technology mapping into
general programmable cells,” in Int. Symp. Field-Programmable Gate Arrays.
ACM, 2015, pp. 70–73.
[MBJK11] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets, “Delay optimization using
SOP balancing,” in Int. Conf. Computer-Aided Design. IEEE Press, 2011, pp.
375–382.
[MBR00] A. Marquardt, V. Betz, and J. Rose, “Timing-driven placement for FPGAs,” in Int.
Symp. Field Programmable Gate Arrays. New York, NY, USA: ACM, 2000, pp.
203–213.
[MCB06] A. Mishchenko, S. Chatterjee, and R. Brayton, “DAG-aware AIG rewriting: A
fresh look at combinational logic synthesis,” in Design Automation Conf., Jul.
2006, pp. 532–535.
[MCCB07] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton, “Combinational and se-
quential mapping with priority cuts,” in Int. Conf. Computer-Aided Design, Nov.
2007, pp. 354–361.
[MD06] U. Malik and O. Diessel, “The entropy of FPGA reconfiguration,” in Int. Conf.
Field Programmable Logic and Applications, Aug. 2006, pp. 1–6.
[ME95] L. McMurchie and C. Ebeling, “Pathfinder: A negotiation-based performance-
driven router for FPGAs,” in Int. Symp. Field-programmable Gate Arrays. ACM,
1995, pp. 111–117.
[Mic16] SmartFusion2 System-on-Chip FPGAs Product Brief, Microsemi, Aliso Viejo, CA,
USA, 2016.
[Mul06] J.-M. Muller, Elementary functions : Algorithms and implementation, 2nd ed.
Birkhäuser, 2006.
[MWL+15] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz, “Timing-driven titan: En-
abling large benchmarks and exploring the gap between academic and com-
mercial CAD,” Trans. Reconfigurable Technology and Systems, vol. 8, no. 2, pp.
10:1–10:18, Mar. 2015.
[Neu10] B. Neumann, “Modellierung und Analyse arithmetikorientierter eFPGA-architek-
turen,” Ph.D. dissertation, RWTH Aachen, 2010.
[Nus14] H. Nuszkowski, Digitale Signalübertragung im Mobilfunk, 2nd ed. Vogt, 2014.
[PA12] H. Parandeh-Afshar, “Closing the gap between FPGA and ASIC: Balancing flexi-
bility and efficiency,” Ph.D. dissertation, EPFL, 2012.
[PHE+17] J. Partzsch, S. Höppner, M. Eberlein et al., “A fixed point exponential function
accelerator in 28nm CMOS for a digital neuromorphic system,” to appear in IEEE
Int. Symp. Circuits and Syst., 2017.
[PM96] J. G. Proakis and D. G. Manolakis, Digital signal processing : principles, algo-
rithms and applications, 3rd ed. Prentice Hall, 1996.
[PSH+14] Z. Průša, P. L. Søndergaard, N. Holighaus, C. Wiesmeyr, and P. Balazs, “The large
time-frequency analysis toolbox 2.0,” in Sound, Music, and Motion, ser. Lecture
Notes in Computer Science. Springer, 2014, pp. 419–442.
[PZN+15] A. Petkovska, G. Zgheib, D. Novo et al., “Improved carry chain mapping for the
VTR flow,” in Int. Conf. Field Programmable Technology, Dec. 2015, pp. 80–87.
[Qui15] EOS S3 Sensor Processing SoC Platform Brief, QuickLogic, Sunnyvale, CA, USA,
2015.
[RCN03] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolić, Digital integrated circuits : a
design perspective, 2nd ed. Pearson, 2003.
[RCS+10] D. Rossi, F. Campi, S. Spolzino, S. Pucillo, and R. Guerrieri, “A heterogeneous digital
signal processor for dynamically reconfigurable computing,” IEEE J. Solid-State
Circuits, vol. 45, no. 8, pp. 1615–1626, Aug. 2010.
[RD11] R. Y. Rubin and A. M. DeHon, “Timing-driven pathfinder pathology and reme-
diation: Quantifying and reducing delay noise in VPR-pathfinder,” in Int. Symp.
Field Programmable Gate Arrays. ACM, 2011, pp. 173–176.
[Rul09] M. Rullmann, “Models, design methods and tools for improved partial dynamic
reconfiguration,” Ph.D. dissertation, Technische Universität Dresden, 2009.
[S+13] R. M. Stallman et al., Using the GNU compiler collection: For GCC version 4.8.2,
Free Software Foundation, Boston, USA, Oct. 2013.
[Sch15] B. Schneier, Applied cryptography: protocols, algorithms, and source code in C,
2nd ed. John Wiley & Sons Inc, 2015.
[Sch16] F. Schraut, Personal communication, Sep. 2016.
[SSH99] I. Sutherland, B. Sproull, and D. M. Harris, Logical Effort : Designing fast CMOS
circuits. Morgan Kaufmann Publishers, 1999.
[SSL+92] E. Sentovich, K. Singh, L. Lavagno et al., “SIS: A system for sequential cir-
cuit synthesis,” EECS Department, University of California, Berkeley, Tech. Rep.
UCB/ERL M92/41, 1992.
[ST99] S. Story and P. T. P. Tang, “New algorithms for improved transcendental functions
on IA-64,” in Symp. Computer Arithmetic, 1999, pp. 4–11.
[STM16] STMF417 Reference manual, STMicroelectronics, Geneva, Switzerland, Sep.
2016.
[TAJ00] X. Tang, M. Aalsma, and R. Jou, “A compiler directed approach to hiding config-
uration latency in chameleon processors,” in Int. Workshop Field Programmable
Logic and Applications. Springer, 2000, pp. 29–38.
[TCJW97] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, “A time-multiplexed FPGA,”
in Int. Symp. Field-Programmable Custom Computing Machines, Apr. 1997, pp.
22–28.
[Tex16] MSP430FR59XX Family User’s Guide, Texas Instruments, Dallas, Texas, USA,
Oct. 2016.
[TPD15] R. Tessier, K. Pocek, and A. DeHon, “Reconfigurable computing architectures,”
Proc. IEEE, vol. 103, no. 3, pp. 332–354, Mar. 2015.
[VB03] M. Vorbach and R. Becker, “Reconfigurable processor architectures for mobile
phones,” in Proc. Int. Parallel and Distributed Processing Symp., Apr. 2003, pp.
6–12.
[VBS13] E. Vansteenkiste, K. Bruneel, and D. Stroobandt, “A connection-based router for
FPGAs,” in Int. Conf. Field Programmable Technology, Dec. 2013, pp. 326–329.
[VHCH15] M. Vogt, G. Hempel, J. Castrillon, and C. Hochberger, “GCC-plugin for auto-
mated accelerator generation and integration on hybrid FPGA-SoCs,” in Proc.
Int. Workshop FPGAs for Software Programmers, Sep. 2015.
[VKF15] E. Vansteenkiste, A. Kaviani, and H. Fraisse, “Analyzing the divide between
FPGA academic and commercial results,” in Int. Conf. Field Programmable Tech-
nology, Dec. 2015, pp. 96–103.
[vS10] T. von Sydow, “Modellbildung und Analyse heterogener ASIP-eFPGA-Architek-
turen,” Ph.D. dissertation, RWTH Aachen, 2010.
[vSKN+06] T. v. Sydow, M. Korb, B. Neumann, H. Blume, and T. G. Noll, “Modelling and quan-
titative analysis of coupling mechanisms of programmable processor cores and
arithmetic oriented eFPGA macros,” in IEEE Int. Conf. Reconfigurable Computing
and FPGA’s, Sep. 2006, pp. 1–10.
[VSL08] F. Vahid, G. Stitt, and R. Lysecky, “Warp processing: Dynamic translation of
binaries to FPGA circuits,” Computer, vol. 41, no. 7, pp. 40–46, Jul. 2008.
[Wan13] C. Wang, “Building efficient, reconfigurable hardware using hierarchical inter-
connects,” Ph.D. dissertation, University of California, Los Angeles, 2013.
[WC96] R. D. Wittig and P. Chow, “OneChip: an FPGA processor with reconfigurable
logic,” in Symp. FPGAs for Custom Computing Machines, Apr. 1996, pp.
126–135.
[Wil97] S. Wilton, “Architecture and algorithms for field programmable gate arrays with
embedded memory,” Ph.D. dissertation, University of Toronto, 1997.
[Wol] C. Wolf, “Yosys open synthesis suite,” http://www.clifford.at/yosys/.
[WWC16] M. Wijtvliet, L. Waeijen, and H. Corporaal, “Coarse grained reconfigurable archi-
tectures in the past 25 years: Overview and classification,” in Int. Conf. Embed-
ded Comput. Syst.: Architectures, Modeling, Simulation, 2016, pp. 235–244.
[Xil11] Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, 5th ed.,
Xilinx, San Jose, CA, USA, 2011.
[Xil16] Zynq-7000 All Programmable SoC: Technical Reference Manual, Xilinx, San Jose,
CA, USA, 2016.
[XZC+11] C. J. Xue, Y. Zhang, Y. Chen et al., “Emerging non-volatile memories: Oppor-
tunities and challenges,” in Int. Conf. Hardware/Software Codesign and Syst.
Synthesis. ACM, 2011, pp. 325–334.
[Yan91] S. Yang, “Logic synthesis and optimization benchmarks user guide, version 3.0,”
Microelectronic Centre of North Carolina, Research Triangle Park, NC, USA,
Tech. Rep., 1991.
[ZLO+16] G. Zgheib, M. Lortkipanidze, M. Owaida, D. Novo, and P. Ienne, “FPRESSO: en-
abling express transistor-level exploration of FPGA architectures,” in Int. Symp.
Field-Programmable Gate Arrays. ACM, 2016, pp. 80–89.
APPENDIX
A MICROPROCESSOR INSTRUCTION SET
Table A.1.1: Overview of important instructions from the targeted ISA
Type        Mnemonics               Description
Memory      LD.D/A/N/S/R/P/AX/DX    Read data from memory
            ST.D/A/N/S/R/P/AX/DX    Write data to memory
Logical     OR, ORI, ORIB           Bitwise OR
            XOR, XORI               Bitwise exclusive OR
            AND, ANDN, ANDNI        Bitwise AND
            NOT                     Invert bits
            SHL, SHLI, SHLD         Shift left by n
            SHR, SHRI, SHRD         Shift right by n
            ROL                     Rotate left by n
            SET                     Conditionally set bits
            CMP, CMPI, CMPBIB       Comparison to set flags
            MASK                    Bitwise AND with immediate
            CRC16                   Cyclic redundancy check
Arithmetic  MOV, MOVI, MOVD         Copy values
            XM, XX                  Copy values shifted left by n
            ADD, ADDS, ADDC         Addition
            SUM, SUMS               Addition with immediate
            SUB, SUBS, SUBC         Subtraction
            NEG, NEGS               Negation
            SAR, SARI, SARD         Arithmetic shift right by n
            TESTLZ                  Count leading zeros
            DIVU, DIVS              Integer division with remainder
Control     CHK, CHKZ               Check value range
            BR¹, DBR¹               Unconditional branch and delayed branch
            CALL, CALLX             Execute a subroutine
            TRAP                    Execute a trap handler
            SETADR                  Calculate stack frame address
            FRAME                   Create a new stack frame
            RET                     Return from subroutine
¹ Augmented by 12 conditional variations.
___mulsi3:
    FRAME L5, L2
    MOVI L4, 0
low_n_00:
    CMPBIB L1, 31
    BZ low_n_01
    ADD L4, L0
low_n_01:
    SHLI L4, 1
    SHLI L1, 1
    BNN low_n_02
    ADD L4, L0
low_n_02:
    SHLI L4, 1
    SHLI L1, 1
    BNN low_n_03
    ADD L4, L0
Listing A.1.1: Excerpt of the software implementation to multiply two 32-bit operands
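The unrolled pattern visible in the excerpt can be read as a shift-and-add multiplication that starts at the most significant bit. The following sketch models that reading in Python. It is an interpretation, not the original routine: the mapping of L0 to the multiplicand, L1 to the multiplier, and L4 to the accumulator, as well as the assumed semantics of CMPBIB (test bit 31) and BNN (branch if the shifted value is not negative), are inferred from the excerpt alone.

```python
MASK32 = 0xFFFFFFFF
MSB32 = 0x80000000


def mulsi3_shift_add(multiplicand: int, multiplier: int) -> int:
    """MSB-first shift-and-add multiply, truncated to 32 bits (assumed model)."""
    a = multiplicand & MASK32   # role of L0 in the excerpt (assumed)
    b = multiplier & MASK32     # role of L1 in the excerpt (assumed)
    acc = 0                     # role of L4 in the excerpt (assumed)

    # First step: test bit 31 of the multiplier directly.
    if b & MSB32:
        acc = (acc + a) & MASK32

    # Remaining 31 steps: shift accumulator and multiplier left, then
    # add the multiplicand when the new top bit of the multiplier is set.
    for _ in range(31):
        acc = (acc << 1) & MASK32
        b = (b << 1) & MASK32
        if b & MSB32:
            acc = (acc + a) & MASK32
    return acc


if __name__ == "__main__":
    print(mulsi3_shift_add(3, 5))
```

Each of the 32 steps costs two shifts, a conditional branch, and possibly an addition, which makes plain why the lack of a fast multiplication instruction dominates the run time of multiply-heavy kernels.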
B FPGA DESIGN SPACE EXPLORATION
Table A.1.2: Results of automatic transistor sizing with a custom area model and a
commercial 22 nm technology library
K  N   W   L  Initial cost¹  Final cost¹  CPU hours²
2  10  80  2  0.112          0.123        36.85
3  10  80  2  0.165          0.192        124.91
4  10  80  2  0.236          0.273        21.67
5  10  80  2  0.358          0.408        62.57
6  10  80  2  0.766          0.741        67.79
¹ Cost function: Area^1 * Delay^1,
  auto-terminated after 2 iterations
² SPICE simulator: Cadence Spectre 15.1
  Machine: 2.5 GHz Intel Xeon E5-2670 v2
  Rebalancing for one solution (M = 1)
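The tabulated values of Table A.1.2 can be compared directly. The short sketch below only restates the table and derives the relative change between initial and final cost per LUT size, together with the total simulation effort. The variable and function names are hypothetical, and no interpretation beyond the arithmetic is intended.

```python
# Rows of Table A.1.2: (K, initial cost, final cost, CPU hours).
TABLE_A_1_2 = [
    (2, 0.112, 0.123, 36.85),
    (3, 0.165, 0.192, 124.91),
    (4, 0.236, 0.273, 21.67),
    (5, 0.358, 0.408, 62.57),
    (6, 0.766, 0.741, 67.79),
]


def relative_change(initial: float, final: float) -> float:
    """Relative change of the cost; negative values mean the cost decreased."""
    return (final - initial) / initial


changes = {k: relative_change(initial, final)
           for k, initial, final, _ in TABLE_A_1_2}
total_cpu_hours = sum(hours for _, _, _, hours in TABLE_A_1_2)

if __name__ == "__main__":
    for k, change in changes.items():
        print(f"K = {k}: {change:+.1%}")
    print(f"Total CPU hours: {total_cpu_hours:.2f}")
```

With these numbers, K = 6 is the only row whose final cost lies below its initial cost.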
Table A.1.3: Results of automatic transistor sizing with COFFE’s MWTA model and a
22 nm PTM library
K  N   W   L  Initial cost¹  Final cost¹  CPU hours²
4  10  80  2  0.0507         0.0411       1.2
5  10  80  2  0.0786         0.0631       2.1
6  10  80  2  0.1312         0.0976       2.3
¹ Cost function: Area^1 * Delay^1
² SPICE simulator: HSPICE 2016.06
  Machine: 2.9 GHz Intel Xeon E5-2960
  Rebalancing for one solution (M = 1)
B.1 FPGA CAD FLOW
Table A.1.4: Non-default VPR options
VPR option             Value  Stage      Effect
timing_tradeoff        0.80   Placement  Increase cost weight for delay
bb_factor              5      Routing    Increase routing search radius
pres_fac_mult          1.05   Routing    Defer congestion-based routing
max_router_iterations  2000   Routing    Increase routing effort
resyn; resyn2; resyn2rs;
resyn2a; resyn3; strash;
scleanup; scleanup; scleanup;
scleanup; scleanup; scleanup;
scleanup; scleanup; scleanup;
dch;
if -K <LUT_SIZE>;
Listing A.1.2: Generic baseline ABC script
st;
if -gm -K 8 -F 10 -W 2.0 ; dch;
scleanup; scleanup; scleanup;
if -K 5; mfs2 -W 2.0; strash; dch;
if -gm -K 8 -F 10 -W 2.0; dch;
if -K 5 -F 10 -W 2.0; mfs2; strash; dch;
if -K 5 -F 10 -W 2.0; mfs2; strash; dch;
if -K 5 -F 10 -W 2.0; mfs2;
Listing A.1.3: High effort ABC script for K = 5
module rfu(data_i, clk_i, result_shift_o, result_concat_o);
  input [31:0] data_i;
  input clk_i;
  output [63:0] result_shift_o;
  output [63:0] result_concat_o;

  reg [63:0] result_shift;
  reg [63:0] r_result_shift;
  reg [63:0] result_concat;
  reg [63:0] r_result_concat;

  always @(*) begin
    result_shift = data_i << (29 + 1);
    result_concat = {data_i, 30'b0};
  end

  always @(posedge clk_i) begin
    r_result_shift <= result_shift;
    r_result_concat <= result_concat;
  end

  assign result_shift_o = r_result_shift;
  assign result_concat_o = r_result_concat;
endmodule
Listing A.1.4: Verilog module to produce Odin II synthesis mismatch
.latch gnd top^r_result_concat~28_FF_NODE re top^clk_i 3
.latch gnd top^r_result_concat~29_FF_NODE re top^clk_i 3
.latch gnd top^r_result_concat~62_FF_NODE re top^clk_i 3
.latch gnd top^r_result_concat~63_FF_NODE re top^clk_i 3
.latch top^data_i~0 top^r_result_shift~30_FF_NODE re top^clk_i 3
.latch top^data_i~0 top^r_result_concat~30_FF_NODE re top^clk_i 3
.latch top^data_i~1 top^r_result_shift~31_FF_NODE re top^clk_i 3
.latch top^data_i~1 top^r_result_concat~31_FF_NODE re top^clk_i 3
.latch top^data_i~2 top^r_result_concat~32_FF_NODE re top^clk_i 3
.latch top^data_i~3 top^r_result_concat~33_FF_NODE re top^clk_i 3
.latch top^data_i~4 top^r_result_concat~34_FF_NODE re top^clk_i 3
.latch top^data_i~5 top^r_result_concat~35_FF_NODE re top^clk_i 3
Listing A.1.5: Excerpt of the incorrectly synthesized BLIF netlist
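In Listing A.1.4 both expressions are assigned to 64-bit registers, so under standard Verilog width rules the shift and the concatenation should yield the same value. The excerpt in Listing A.1.5 connects data_i only to bits 30 and 31 of r_result_shift, which is consistent with the shift being evaluated at the 32-bit width of data_i. The following Python sketch models both behaviours; the truncation model is an assumption drawn from the excerpt, not a confirmed description of Odin II, and all function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def expected_shift(data_i: int) -> int:
    """data_i << 30 evaluated in the 64-bit context of the assignment."""
    return ((data_i & MASK32) << 30) & MASK64


def expected_concat(data_i: int) -> int:
    """{data_i, 30'b0}: a 62-bit value, zero-extended to 64 bits."""
    return ((data_i & MASK32) << 30) & MASK64


def truncated_shift(data_i: int) -> int:
    """Assumed faulty behaviour: the shift result is cut to 32 bits."""
    return ((data_i & MASK32) << 30) & MASK32


# Only the two lowest input bits survive in the truncated model.
for value in (0x1, 0x3, 0x4, 0xFFFFFFFF):
    same = truncated_shift(value) == expected_shift(value)
    print(f"data_i={value:#x}: truncated model matches expected: {same}")
```

Under this model, any input with a bit set above bit 1 exposes the difference between the two outputs.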
B.2 MISCELLANEOUS
Figure A.1.1: Logical view of implementation options for fabrics with M parallel
contexts, restricted to LUT resources
Figure A.1.2: Area-delay product across the 20 largest MCNC benchmarks with L = 4
Figure A.1.3: Contour plot of the area delay product as a function of Fc,in and Fc,out
with N = 6, K = 3, L = 2
