Software tools for embedded reconfigurable processors by Mucci, Claudio
Università degli Studi di Bologna
Faculty of Engineering
PhD Programme in Electronics, Computer Science and
Telecommunications Engineering
XIX Cycle
ING-INF/01
Software Tools for Embedded
Reconﬁgurable Processors
PhD thesis by: Claudio Mucci
Advisor: Prof. Roberto Guerrieri
Coordinator: Prof. Paolo Bassi
Academic Year 2005-2006
Keywords:
Reconﬁgurable architectures
Programming environment
Application Development
HW/SW Co-Design
Digital Signal Processing
Contents
1 Introduction
2 Reconfigurable computing overview
2.1 Instruction set metamorphosis
2.2 Coarse-grained reconfigurable computing
2.3 XiRisc reconfigurable processor
2.4 DREAM adaptive reconfigurable DSP
2.4.1 PiCoGA-III architecture
3 Programming tools for reconfigurable processors
3.1 Motivations
3.2 Algorithm development on reconfigurable processors (programming issues)
3.3 Instruction set extension implementation on a standard compilation tool-chain
3.4 Bridging the gap from hardware to software through C-described Data Flow Graphs
3.5 Overview of programming tools for reconfigurable processors
3.6 Griffy project overview
4 Mapping DFG on reconfigurable devices
4.1 ILP exploitation through pipelined DFG and Petri Nets
4.2 Instruction scheduling: optimized DFG for pipelined computation
4.2.1 Scheduling of direct acyclic graphs
4.2.2 Scheduling of data flow graphs
4.2.3 Execution-time pipeline management
4.2.4 Griffy Front-End architecture
4.3 Target-specific customizations and back-end flows
4.3.1 DFG mapping for PiCoGA
4.3.2 DFG mapping for eFPGA
5 Simulation of dynamically reconfigurable processors
5.1 Functional simulation
5.1.1 Functional emulation
5.1.2 Reconfigurable devices management via virtual target
5.2 Instruction set extension through dynamic libraries
5.2.1 Cycle-accurate simulation model
5.2.2 Simulation speed analysis
6 Application development on reconfigurable processors
6.1 Reconfigurable software development time: hardware and software approaches
6.2 Example of application mapping
6.2.1 MPEG-2 motion compensation on the XiRisc processor
6.2.2 AES/Rijndael implementation on the DREAM adaptive DSP
6.2.3 Low-complexity transform for H.264 video encoding
6.2.4 H.264 intra prediction with Hadamard transform for 4x4 blocks
7 Performance and development time trade-offs
8 Conclusions
A Griffy-C syntax
A.1 Overview
A.1.1 Standard Operators
A.1.2 Arithmetical Operators
A.1.3 Bitwise Logical Operators
A.1.4 Direct Assignment
A.1.5 Shift Operators
A.1.6 Comparison Operators
A.1.7 Conditional Assignment
A.1.8 Advanced Operators
A.1.9 Concatenate operator (#)
A.1.10 LUT operator (@)
A.1.11 Built-in function as hard-macros
List of Figures
1.1 Computational requirements vs. Moore's law and battery storage
1.2 Factors considered most important in choosing a microprocessor (source: J. Turley, “Survey says: software tools more important than chips”, Nov. 2005, www.embedded.com)
1.3 Performance vs. Development Time in a commercial DSP (source: “EFR (Enhanced Full-Rate) vocoder on Dual-MAC ST122 DSP”, STMicroelectronics online, www.stm.com)
2.1 PRISC Architecture overview
2.2 OneChip architecture
2.3 Garp architecture
2.4 XiRisc reconfigurable processor architecture
2.5 Molen architecture
2.6 FPGA integration density (source: R. Hartenstein, “Why we need reconfigurable computing education”)
2.7 MorphoSys architecture
2.8 PACT XPP architecture
2.9 CHESS architecture and its hexagonal topology
2.10 Detailed XiRisc reconfigurable processor architecture
2.11 Pipelined Configurable Gate Array (PiCoGA) ver. 1.0
2.12 PiCoGA Reconfigurable Logic Cell (RLC)
2.13 Simplified DREAM architecture
2.14 Programmable address generator schema
2.15 Simplified PiCoGA-III Reconfigurable Logic Cell (RLC)
3.1 Performance vs. Time-to-develop design space
3.2 Basic software tool-chain extension to support reconfigurability issues
3.3 Examples of control and data flow graphs
3.4 Griffy Algorithm Development Environment
3.5 DFG Description
3.6 Example of optimization of routing-only operators
3.7 Griffy-C Debugging and Validation Environment
4.1 Computation paradigm relaxation preserving the data dependencies
4.2 DFG and the corresponding Petri Net representation
4.3 Petri Net transition firing
4.4 DAG scheduling pseudo-code (with routing only optimization)
4.5 Example of ALAP correction for static variables
4.6 Simplified DFG scheduling algorithm
4.7 Candidates analysis algorithm
4.8 Pipeline stage controller simplified architecture
4.9 P-block and S-block simplified architecture
4.10 Simplified Griffy Front-End architecture
4.11 Simplified Griffy flow for PiCoGA-III
4.12 PiCoGA-III control unit programmable interconnect
4.13 XiSystem SoC architecture
4.14 Overall software tool-chain
4.15 XiSystem MPEG2 decoder performance
5.1 Griffy code viewer
5.2 Simplified XiRisc simulation structure
5.3 An example of pipeline evolution
6.1 Case study: saturating MAC for low bit-rate audio compression
6.2 Case study: Griffy-C code for saturating arithmetic
6.3 Case study: software pipelining across processor and PiCoGA
6.4 Variation of %speed-up wrt … and …
6.5 Variation of speed-up wrt #optimized kernel and local speed-up …
6.6 Motion estimation
6.7 Search path
6.8 Absolute Difference (AD) DFG
6.9 Concurrent 4-pixel Sum of Absolute Differences
6.10 Memory layout
6.11 Enhanced search path
6.12 Concurrent 4-blocks SAD
6.13 Unfolded SAD function based on sad4blk
6.14 sad4blk DFG
6.15 sad4blk Place & Route
6.16 Full-Search workload vs. search window side
6.17 Common AES-Round block diagram
6.18 Inverse multiplicative on composite fields schemes
6.19 AES/Rijndael selected kernel and implementation
6.20 Speed-ups wrt RISC processor
6.21 Throughput vs. interleaving factor
6.22 Fast implementation of the 1-D H.264 transform
6.23 Fully-unfolded bi-dimensional transform diagram
6.24 Partially unfolded 4x4 DCT schema
6.25 sub4x4dct rows occupation
6.26 Modified sub4x4dct for area optimization
6.27 Fully-unfolded inverse 4x4-IDCT basic diagram
6.28 Partially-unfolded inverse 4x4-IDCT basic diagram
6.29 Modified clipping function structure
6.30 Speed-up figure with respect to a RISC processor working at the same frequency
6.31 Throughput achieved with respect to interleaving factor
6.32 Energy efficiency with respect to interleaving factor
6.33 Intra prediction modes for 4x4 luma block
6.34 PiCoGA SAD structure
6.35 DCT and Hadamard transform
6.36 1-D Hadamard transform butterfly schema
6.37 Fully unfolded 4x4 SATD data flow graph
6.38 Partially folded 4x4 SATD block diagram
6.39 Shifter register structure used for the matrix transposition
6.40 Optimized SATD mapping
6.41 4x4 SAD and SATD speed-up figures with respect to the interleaving factor
6.42 4x4 SAD and SATD throughput with respect to the interleaving factor
6.43 4x4 SAD and SATD energy efficiency with respect to the interleaving factor
7.1 Application development trade-off
7.2 Development Time vs Speed-Up percentage
7.3 Distribution of speed-up with respect to development time
7.4 XiRisc vs DSP Development Time/Speed-Up analysis
7.5 DREAM speed-up
7.6 DREAM throughput
7.7 DREAM energy efficiency
A.1 Multiple entry-point Griffy flow
A.2 Concatenate operator
A.3 Multiplier chunk
List of Tables
4.1 PiCoGA vs. eFPGA computational efficiency comparison
4.2 Area occupation and working frequency of circuits mapped on the eFPGA
5.1 Simulation results (without PiCoGA)
5.2 Simulation results (with PiCoGA)
6.1 MPEG-2 computation-aware analysis
6.2 Test-sequence features
6.3 Performances
6.4 MPEG-2: final results
6.5 AES/Rijndael encoder performance
6.6 AES-128 encryption comparisons
6.7 sub4x4dct
6.8 sub4x4dct
6.9 F4x4idct and add4x4
6.10 4x4 Sum of Absolute Differences (SAD)
6.11 4x4 SATD static performance
7.1 Experimental results on application development
7.2 XiRisc vs. TI TMS320C6713 Performance Comparison
7.3 XiRisc vs. TI TMS320C6713 Performance Comparison
A.1 Griffy operators
A.2 Typologies of LUTs supported
Chapter 1
Introduction
Flexible computational platforms are one of the most important needs of
the modern electronics marketplace. The growth of non-recurring
engineering costs (NREs), coupled with the need for shorter time-to-market,
pushes designers toward flexible solutions. The ability to update a device
directly in the field, or to provide new functionality on the fly, makes
appealing those devices that can both reduce re-design costs and increase
product lifetime. For example, flexible platforms make it possible to
change the standards supported by telecommunication devices such as cell
phones or wireless routers, or to build new products while a standard is
still not well defined or in draft status, so as to hit the optimal
time-to-market. Furthermore, market convergence toward devices
integrating multiple heterogeneous applications is one of the most important
challenges in consumer electronics. For example, a smartphone today
includes office applications and video capabilities, and can work with
different wireless communication standards (GSM, UMTS, WiFi and possibly
WiMax).
Processor-based embedded systems are becoming widespread, and the
term flexibility has often been associated with the presence of a processor
and its software programming environment. However, the huge growth of the
portable-device market puts pressure on application designers, who need
to combine computational power, flexibility and limited energy
consumption. Modern embedded applications such as wireless communication
and portable multimedia require computational power to grow faster than
Moore's law, and much faster than the energy provided by batteries for
a given application [1], as shown in Fig. 1.1.
Figure 1.1: Computational requirements vs. Moore's law and battery storage
(operations per second over time, 1980-2012; curves labelled "Shannon Law",
"Moore's Law" and "Battery capacity")
In this context, and especially for portable low-power applications,
designers cannot leverage frequency scaling if they want to meet
the performance requirements imposed by quality of service and real-time
constraints. The exploitation of instruction-level parallelism in many
digital signal processors (DSPs) and/or VLIW (Very Long Instruction Word) or
superscalar processors for embedded applications is an attempt to close
the performance gap, but it usually fails to reduce energy consumption.
Many digital signal processing algorithms require sub-word (e.g. few-bit)
computations, which under-use the common 32-bit datapath of a standard
processor [2]. Hence many DSPs provide vectorized processing capabilities,
augmenting the instruction set with Single Instruction Multiple Data
(SIMD) instructions (like Intel MMX). On the other hand, microprocessors,
whether general-purpose processors or DSPs, remain the most
reusable blocks in modern systems-on-chip (SoCs), and the high-level
languages used for programming them are well-known skills among
embedded-application developers.
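The sub-word argument above can be made concrete with a small sketch in plain C. It emulates, on an ordinary 32-bit datapath, what a SIMD instruction does in one step: four independent 8-bit additions packed in a single word. The function name add4x8 and the masking constants are illustrative assumptions, not taken from Intel MMX or from any tool described in this thesis.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration only).
 * Adds four unsigned 8-bit lanes packed in two 32-bit words.
 * Each lane wraps modulo 256 and no carry crosses a lane boundary. */
static uint32_t add4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;  /* low 7 bits of every lane  */
    const uint32_t msb  = 0x80808080u;  /* top bit of every lane     */

    /* Adding only the low 7 bits cannot overflow out of a lane. */
    uint32_t sum = (a & low7) + (b & low7);

    /* Fold the top bits back in with XOR, which discards the carry. */
    return sum ^ ((a ^ b) & msb);
}
```

For instance, add4x8(0x01FF7F10, 0x01018020) yields 0x0200FF30: the lane holding 0xFF + 0x01 wraps to 0x00 without disturbing its neighbour. A scalar processor needs several masking operations to obtain this behaviour, which is the inefficiency that SIMD extensions and reconfigurable datapaths aim to remove.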
A new processor-based computation paradigm, namely “adaptive
computing”, appeared in the early 90s as a promising way to bridge the gap
between general-purpose microprocessors and application-specific
integrated circuits (ASICs), in order to support new applications which were
both computationally intensive and energy hungry. The most appealing
idea was to add application-specific hardware accelerators to a standard
processor architecture (typically a RISC processor) to improve performance
on application-critical hot-spots, while letting the processor handle the
control parts.
It should also be noted that, at a given technology node, the area
required for a new processor architecture increases by a factor greater
than the achieved performance improvement. This means that the
traditional computing paradigm offered by processors is itself losing
computational efficiency (operations per second per mm²), thus causing an
undeniable crisis of standard, well-known devices. State-of-the-art
systems-on-chip for mobile applications, like ST Nomadik, Philips Nexperia,
TI OMAP and Intel PXA, meet performance requirements and power
efficiency using the processor (usually an ARM9 core) as a supervisor (for
example, the operating system runs on the processor), while the
computationally intensive parts are commonly delegated to application-specific
hardware accelerators. From an engineering point of view, the effort of
accelerating an application thus focuses on a few computationally
intensive kernels, reducing the time-to-develop.
A wide range of adaptive computing approaches has been presented
in the literature. We can distinguish among three different approaches:
• Application-Specific Standard Processors (ASSP), which are processors
featuring a customized instruction set targeting a given application.
Application-specific instructions include, for example, the simple
multiply-and-accumulate operation in DSPs or the Sum of Absolute
Differences (SAD) used in video-encoding motion compensation
engines [3].
• Configurable processors, which enable SoC designers to rapidly extend
a base processor for specific application tasks (e.g. adding
custom-tailored execution units, registers, register files and communication
mechanisms at the register transfer level (RTL)), thus providing a
faster and easier way to build an ASSP [4, 5, 6].
• (Dynamically) Reconfigurable processors, which are able to customize
the instruction set at execution time by coupling a standard processor
core with a run-time programmable device, such as a Field
Programmable Gate Array (FPGA) [30, 10].
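As an illustration of the kind of kernel that the application-specific instructions mentioned above replace, the following is a minimal software sketch of a 4-pixel Sum of Absolute Differences. The function names and the packing of four 8-bit pixels per 32-bit word are assumptions chosen for the example; they do not reproduce any specific instruction set.

```c
#include <stdint.h>

/* Absolute difference of two unsigned 8-bit pixels. */
static unsigned abs_diff_u8(uint8_t p, uint8_t q)
{
    return (p > q) ? (unsigned)(p - q) : (unsigned)(q - p);
}

/* Hypothetical helper (an assumption for illustration only).
 * Sum of absolute differences over the four 8-bit pixels packed
 * in each 32-bit word: cur is the current block, ref the reference. */
static unsigned sad4(uint32_t cur, uint32_t ref)
{
    unsigned sad = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint8_t p = (uint8_t)(cur >> (8 * lane));  /* extract one pixel */
        uint8_t q = (uint8_t)(ref >> (8 * lane));
        sad += abs_diff_u8(p, q);
    }
    return sad;
}
```

In software, each call costs four extractions, four compare-and-subtract steps and three additions. A dedicated SAD instruction, or a reconfigurable datapath configured for this kernel, can evaluate the four lanes concurrently, which is why this operation is a recurring candidate for instruction set extension.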
While in both ASSPs and configurable processors the instruction set
extension is defined at the mask level, thus limiting the device in terms
of both flexibility and application field, dynamically reconfigurable
architectures allow the end-user to meet the requirements of a wide range of
applications. Consequently, dynamically reconfigurable architectures are
also suitable for low-volume products, since they do not
suffer from non-recurring design costs. Furthermore, run-time
reconfigurability (also known as on-line reconfigurability) allows one to update
the device frequently, thus increasing its lifespan.
In the field of run-time programmable machines, reconfigurable
processors form a natural extension to the widely used DSPs or
microcontrollers for embedded applications, providing a third trade-off point in
addition to general-purpose architectures and dedicated hardware
accelerators. However, reconfigurable processors alter the boundary between
traditional hardware and software programming, requiring inevitable changes
in the programmers' approach and the definition of new design patterns
[29]. Algorithm development on reconfigurable processors requires
expertise in both hardware and software programming flows, and this may
prove an obstacle for a community of developers long used to C-based
algorithm implementations.
Figure 1.2: Factors considered most important in choosing a microprocessor
(source: J. Turley, “Survey says: software tools more important than chips”,
Nov. 2005, www.embedded.com). Bar chart on a 0%-80% scale; factors listed:
software tools, performance, price, operating systems, hardware tools,
available software, peripherals, power consumption, supplier reputation,
future roadmap, familiarity, debug support, popularity, available as IP.
According to a survey on embedded development in the telecommuni-
cations, automotive, consumer, wireless, defence, industrial, and automa-
tion sectors, software tools are considered the most important factor in
choosing a microprocessor (see Fig. 1.2). Of course, if we analyze spe-
ciﬁc sectors, speciﬁc parameters such as power consumption for portable
devices increase in importance. Nevertheless, for programmers software
tools are “the things they touch”, the interface with the processor. In the
case of reconﬁgurable processors the importance of software tools grows
because of the hybrid nature of these architectures. This is the reason why
the lack of user-friendly tools and ﬂows for exploring and implementing
the hardware and software portions of an algorithm has caused so much
difﬁculty when developing an application in such architectures [8].
Although not well suited to capture all the parallelism of an applica-
tion [9], the fact that knowledge of the ANSI-C programming language is
widespread among embedded systems and DSPs programmers suggests
one should also use it as the application description language for
reconfigurable processors. This introduces the problem of translating behavioral
C into some form of HDL description, or directly into hardware (i.e.
configuration bits for a run-time programmable device). Unfortunately, these
abstraction layers hide many implementation choices from the designer,
often making it difﬁcult to obtain high-quality results even with a deep
understanding of the tools and the underlying architecture. Despite this,
for a wide spectrum of application ﬁelds, and thus for a large part of ap-
plication developers, the availability of a fast and easy way to improve
system performance has an impact on the time-to-market, increasing the
return on investments. Many reconﬁgurable processors described in the
literature, as well as many start-ups, propose C-based design frameworks
in order to cut long-time implementation cycles and/or to reduce the skill
gaps for reconﬁgurable architecture development.
This thesis describes a C-based algorithm development environment
for reconfigurable processors. It has been successfully applied
to the XiRisc reconfigurable processor, which couples a RISC core with a
custom-designed mid-grain reconfigurable datapath. A prototype has also been
built that provides the HDL code required for a RISC processor
enhanced with a standard embedded FPGA. A C-based configuration flow
enables even inexperienced users to develop algorithms efficiently on the
reconfigurable processor. Performance improvements of 2-3× can be
obtained in 1-2 days of work, without requiring hardware design expertise
or awareness of the underlying architecture. Of course, experienced users
can achieve far better results through manual optimizations, just as DSP
programmers may optimize at the assembly level to obtain optimal
performance. As an example, Fig. 1.3 shows a case study addressing
the relation between development time and performance for
a commercial DSP featuring a specific instruction set extension for audio
coding applications. Near-optimal performance can be achieved by
focusing the implementation effort (mainly spent working with built-in
functions, loop restructuring and assembly-level optimization) on less than
25% of the code lines.
Most existing reconfigurable architectures use automatic or
semi-automatic C-to-HDL conversion tools to plug into standard synthesis and
Place & Route techniques for configuring the hardware accelerator. The
approach proposed in this thesis targets application fields where trading
some of the performance speed-up for a higher level of programmability
is important. A key contribution of this thesis is that it shows
quantitatively how much performance one can gain by spending additional time
finely optimizing an implementation without ever leaving the purely
C-based design environment. It will be shown that knowledge of hardware
description languages and hardware design techniques is not required for
effective exploitation of a dynamically reconfigurable architecture,
especially if the latter has been designed from the beginning to accommodate
an efficient software-oriented design flow.
Figure 1.3: Performance vs. Development Time in a commercial DSP (source:
“EFR (Enhanced Full-Rate) vocoder on Dual-MAC ST122 DSP”, STMicroelectronics
online, www.stm.com). Plot labels: cycle count (Mcycle/s) 16.9, 12.5, 10.5,
9.5; workload 1 week, 1 month, 4 months, >1 year; implementations: ETSI C
model “out-of-the-box”, optimized C source, 75% C source with 25% ASM
source, 100% optimized ASM; annotation: “Close to optimum performance with
limited effort and easy maintainability”.
Chapter 2
Reconﬁgurable computing
overview
2.1 Instruction set metamorphosis
On 22 May 1999, The Economist (vol. 351, no. 8120, p. 89) reported the
following:
“In 1960 Gerald Estrin, a computer scientist at the University of Califor-
nia, Los Angeles, proposed the idea of a ﬁxed plus variable structure com-
puter. It would consist of a standard processor, augmented by an array
of reconﬁgurable hardware, the behavior of which could be controlled
by the main processor. The reconﬁgurable hardware could be set up to
perform a speciﬁc task, such as image processing or pattern matching,
as quickly as a dedicated piece of hardware. Once the task was done,
the hardware could be rejigged to do something else. The result ought
to be a hybrid computer combining the ﬂexibility of software with the
speed of hardware. Although Dr. Estrin built a demonstration machine,
his idea failed to catch on. Instead, microprocessors proved to be cheap
and powerful enough to do things on their own, without any need for
reconﬁgurable hardware. But recently Dr. Estrin’s idea has seen some-
thing of a renaissance. The ﬁrst-ever hybrid microprocessor, combining a
conventional processor with reconﬁgurable circuitry in a single chip, was
launched last month. Several firms are now competing to build reconfigurable
chips for use in devices as varied as telephone exchanges, televisions
and mobile telephones. And the market for them is expected to
grow rapidly. Jordan Selburn, an analyst at Gartner Group (an American
information-technology consultancy), believes that annual sales of recon-
ﬁgurable chips will increase to a value of around $50 billion in 10 years
time. (The Economist: Reconﬁgurable Systems Undergo Revival)”.
Thanks to the evolution of microelectronics and the improvement of
Field Programmable Gate Arrays, in 1993, more than 30 years after Estrin's
idea, Athanas and Silverman formalized the concept of instruction set
metamorphosis, or adaptive instruction set, proposing the PRISM
architecture [16]. By coupling a RISC processor with a Xilinx FPGA, the authors
realized the first relevant prototype of a reconfigurable processor. For the
embedded world, the first significant example of a processor including
run-time programmable hardware on the same chip is probably the
PRogrammable Instruction Set Computer (PRISC) [17], proposed by Razdan and
Smith one year later.
Figure 2.1: PRISC Architecture overview. (a) PRISC: functional units FU1
and FU2 and the Programmable Functional Unit (PFU), connected to the register
file and bypass logic through the source operand buses and the result
operand bus. (b) Programmable FU: a matrix of LUTs between the inputs from
the operand buses and the outputs to the result bus.
As shown in Fig. 2.1, the PRISC architecture defines a straightforward
and efficient way to exchange data with the programmable hardware,
adopting a scheme in which the programmable hardware is embedded
in the processor pipeline like the other functional units (Arithmetic
Logic Unit, multiplier, ...). The Programmable Functional Unit (PFU) was
designed as a combinatorial matrix of Look-Up Tables (LUTs)
interconnected via programmable wires, as in FPGA technology. Combinatorial
paths limited both the frequency and the size of the PFU. Following
an analogous scheme, Wittig and Chow proposed the OneChip architecture
[18], improving on the PRISC proposal with the capability of implementing
sequential logic and Finite State Machines (FSMs).
Figure 2.2: OneChip architecture (pipeline stages ID, EX and MEM; blocks
labelled BFU, PFU1 and PFU2 fed by operand multiplexers; a forwarding unit
driven by RD_EXMEM and RD_MEMWB, with the signals required for the
dependency check)
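To make the notion of a LUT-based programmable functional unit more tangible, the sketch below models a single look-up table in C. It assumes a 4-input LUT whose 16 configuration bits store the truth table of an arbitrary Boolean function; the input count, type names and function names are hypothetical and are not taken from the PRISC or OneChip papers.

```c
#include <stdint.h>

/* Hypothetical model (an assumption for illustration only).
 * A 4-input LUT: bit i of config is the output for input pattern i. */
typedef struct {
    uint16_t config;
} lut4_t;

/* Evaluate the LUT: the 4 input bits select one configuration bit. */
static unsigned lut4_eval(lut4_t lut, unsigned in)
{
    return (lut.config >> (in & 0xFu)) & 1u;
}

/* "Program" the LUT by tabulating any 4-input Boolean function. */
static lut4_t lut4_program(unsigned (*f)(unsigned))
{
    lut4_t lut = { 0 };
    for (unsigned in = 0; in < 16; ++in) {
        if (f(in) & 1u) {
            lut.config |= (uint16_t)(1u << in);
        }
    }
    return lut;
}

/* Example function to be mapped: parity (XOR) of the 4 input bits. */
static unsigned parity4(unsigned in)
{
    in ^= in >> 2;
    in ^= in >> 1;
    return in & 1u;
}
```

Calling lut4_program(parity4) produces the configuration word 0x6996, after which lut4_eval returns the parity of any 4-bit input with a single table access. Reconfiguring the unit amounts to loading a different configuration word, while a purely combinatorial chain of such tables is what bounds the achievable clock frequency, as noted above for the PFU.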
One of the most important milestones of reconfigurable computing is
the Garp processor [19], developed at the University of California,
Berkeley. Garp couples a MIPS processor with an FPGA-like reconfigurable
device organized as a datapath (see Fig. 2.3). As in the second release of
OneChip, the Garp architecture gives the reconfigurable device direct
access to memory, with an undeniable computational advantage.
In fact, as computation shifts from the processor to the programmable
hardware, access to data long remained a wall (or a bottleneck)
for the first generation of reconfigurable processors. In
Garp, the reconfigurable array is connected to the processor core as a
coprocessor accessed through explicit move operations (move-to, move-from),
like those required for floating-point units.
Figure 2.3: Garp architecture (main MIPS processor and reconfigurable
array sharing the instruction cache and data cache over an internal bus;
external bus to external memory)
The XiRisc reconfigurable processor [66] can be considered the first silicon
implementation of a custom-designed reconfigurable instruction set
processor. XiRisc couples a 2-way 32-bit Very Long Instruction Word (VLIW)
RISC processor with a custom-designed reconfigurable LUT-based
datapath (the Pipelined Configurable Gate Array, PiCoGA) integrated in the
processor pipeline like the other functional units. The VLIW
architecture allows up to 4 registers to be read and up to 2 to be written at
once, thus improving the bandwidth between the processor core and the
reconfigurable device, although direct memory access is not provided. As in
Garp, datapath control is performed by a dedicated programmable
pipeline manager, which enables the activation of each array row. Fig. 2.4
shows the overall architecture, while Section 2.3 provides a detailed
description of this architecture and its embedded reconfigurable device,
PiCoGA.
Figure 2.4: XiRisc reconfigurable processor architecture (PiCoGA array of
RCUs with its control unit, connected through the processor interface to the
register file via data channel 1, data channel 2 and a shared data channel,
with separate PiCoGA and processor writeback channels)
The Molen polymorphic processor [12] focuses on the architectural
formalization of the reconfigurable computation paradigm, with particular
attention to programming aspects. Molen has been implemented on a
Xilinx Virtex-II Pro FPGA, using the embedded PowerPC 405 core to
allocate, deallocate and execute instructions on the reconfigurable
hardware, as depicted in Fig. 2.5.
2.2 Coarse-grained reconﬁgurable computing
Standard FPGA technology has been the heart and soul of the reconfigurable
processing pioneers, who focused on formalizing the new computation
paradigm. Unfortunately, state-of-the-art FPGAs soon appeared too
big, slow and power hungry when compared with application requirements and
ASIC-based solutions. The full flexibility offered by bit-level
programmability introduced too much overhead, due to the programmable logic,
the programmable interconnect and the static RAM cells needed to configure
the whole device. When the number of transistors actually available for
computation is compared with the total required by a standard FPGA
technology, wiring and reconfigurability overheads must be subtracted,
leaving an effective density about three orders of magnitude below the
Moore curve. This gap widens further because the effective density is
reduced by routing congestion, which in big devices further decreases the
interconnect capabilities, as shown in Fig. 2.6. This is what Hartenstein
calls the “reconfigurable computing paradox”: reconfigurable computing was
born to be more efficient than processors in terms of operations per second
per mm², but it is intrinsically less efficient in terms of transistors per
mm² when compared with the Moore curve.
Figure 2.5: Molen architecture (memory, arbiter, data and instruction
buffer (I_BUFFER) paths, CP and GPR blocks, and a reconfigurable unit
comprising the ρµ-code unit, the CCU and the CR block)
The need for new programmable devices, envisioned in past years
by Nick Tredennick, has been met by the proposal of a surprisingly
wide range of reconfigurable devices that trade part of the flexibility
in order to improve hardware efficiency. In his visionary retrospective [30],
Hartenstein underlined that “in contrast to FPGA use (fine grain
reconfigurable) the area of Reconfigurable Computing mostly stresses the use of
coarse grain reconfigurable arrays (RAs) with path-widths greater than 1
bit, because fine-grained architectures are much less efficient because of
a huge routing area overhead and poor routability [2]. Since
computational datapaths have regular structure, full custom designs of
reconfigurable datapath units (rDPUs) can be drastically more area-efficient, than
by assembling the FPGA way from single-bit CLBs. Coarse-grained
architectures provide operator level CLBs, word level datapaths, and powerful
and very area-efficient datapath routing switches.”
Figure 2.6: FPGA integration density (source: R. Hartenstein, “Why we need
reconfigurable computing education”). Density in transistors per microchip
(axis ticks at 10^3, 10^6 and 10^9) versus year (1980-2010); curves: Moore
curve, microprocessor, FPGA physical (wiring overhead), FPGA logical
(reconfigurability overhead), FPGA routed (routing congestion).
Many mid- and coarse-grain devices (all termed coarse grain in
Hartenstein’s taxonomy) have been proposed by both academia and
industry in order to increase the ratio between the grain of the basic
logic cell and the programmable interconnects in which the computational
logic is embedded. The computational capability of the basic logic cell
ranges from a few LUTs to complete 32-bit arithmetic logic units (ALUs).
Furthermore, in many cases interconnect flexibility has been reduced, for
example by supporting only connections between adjacent rows or among
nearest-neighbor cells, which also reduces the associated overhead.
PipeRench [46] is one of the first and most important reconfigurable
devices featuring a datapath structure based on stripes. Each stripe is
composed of arithmetic logic units, LUTs and dedicated circuitry to speed
up carry chains. PipeRench introduces the concept of virtual hardware
computation by means of fast partial dynamic reconfiguration. The
configuration of each stripe can be rapidly changed by the pipeline
manager, thus allowing deep pipelines to be folded onto the device.
Figure 2.7: MorphoSys architecture (array of reconfigurable cells, labelled RC)
MorphoSys [55] couples a 32-bit RISC core with an 8x8 mesh of 16-bit
ALUs and a peculiar interconnect architecture based on nearest-neighbor
wires and a few regional connections (see Fig. 2.7). In order to reduce
the number of configuration bits, the mesh can be programmed by rows or
by columns. In other words, each row or column can implement a single
instruction multiple data (SIMD) computation. The architecture features a
multi-context configuration memory in order to minimize the
reconfiguration penalty.
The PACT XPP digital signal processor [53] is composed of a matrix
of 16-bit Processing Array Elements (PAEs) working as an event-driven
data-stream datapath. Internal signals synchronize the data flow, while
communication is performed by means of packet transmission. Concerning
the routing architecture, the array is organized in rows, and data
transfer between successive rows is performed synchronously through
dedicated registers. More recently, PACT has also introduced small
processor cores based on a simplified 16-bit VLIW structure, in order to
achieve better performance on control-intensive tasks. Figure 2.8
shows the overall architecture.
Figure 2.8: PACT XPP architecture (ALU PAEs, RAM blocks and streaming ports)
The CHESS reconfigurable arithmetic array [27], developed by HP
Labs and evolved into the Elixent Ltd. D-Fabrix [26], is a two-dimensional
mesh of 4-bit ALUs. The principal goals for CHESS were to increase both
arithmetic computational density and the bandwidth and capacity of
internal memories significantly beyond the capabilities of current FPGAs,
whilst enhancing flexibility. To this end, a chess-board layout
alternating switch-boxes and ALUs is used, as shown in Fig. 2.9. This
allows CHESS to support strong local connectivity and communication among
ALUs, and gives an effective routing network which uses only 50% of the
array area, much less than in traditional FPGA structures.
Figure 2.9: CHESS architecture and its hexagonal topology
The DREAM adaptive DSP [75] is one of the most recent reconfigurable
processors, coupling a standard RISC core with a pipelined reconfigurable
datapath (the third generation of PiCoGA). The reconfigurable device is
an important evolution (if not a revolution) of the original PiCoGA,
augmented with 4-bit ALUs (comprising extended operations such as a Galois
field multiplier over GF(2^4)) as basic computational blocks in addition
to the 64-bit LUTs. This allows DREAM to increase the computational
density of the device. Furthermore, the adopted co-processor scheme allows
direct access to the local memory sub-system. In particular, a
high-bandwidth buffer infrastructure has been implemented in order to
allow up to 12 32-bit inputs and 4 32-bit outputs per cycle (the maximum
bandwidth of the new PiCoGA device). Section 2.4 describes this
architecture in detail.
The coarsening process of reconfigurable architectures has led to
some interesting proposals in which the basic cell is a small
processor. The two main examples are probably the RAW machine [49]
from MIT and the PicoChip [25]. RAW, an acronym for Reconfigurable
Architecture Workstation, provides a RISC multiprocessor architecture
composed of nearest-neighbor connected tiles, each based on a modified
32-bit MIPS R2000 microprocessor. Each processor features a 6-stage
pipeline with ALU, floating-point unit and 32 Kbyte of SRAM. PicoChip is
a massively parallel array of 430 heterogeneous processors linked by a
deterministic high-speed switching matrix. About 230 processors include
multiply-and-accumulate functionality, while the characteristics and
instruction sets of the elements are meant to include support for
specialist operations such as spread, de-spread or compare-add-select.
Specialization of computational blocks is thus an undeniable trend in
reconfigurable computing, similar to the specialization that DSPs
provide through dedicated instruction set extensions. The main goal has
been, and will be, to reduce the impact of reconfiguration in terms of
area. Rather than application-specific, this approach can be termed
field-specific, since the flexibility offered by reconfigurable approaches
appears higher than that offered by processor-based systems augmented with
dedicated circuits. In any case, heterogeneity, additional interconnect
constraints, and special and complex functionalities have an undeniable
impact on the programmability of the device and on the efficiency with
which programmers can use it, as will be discussed in the next chapter.
2.3 XiRisc reconﬁgurable processor
Figure 2.10: Detailed XiRisc reconfigurable processor architecture
(instruction memory, register file, two data channels each with
instruction decode logic, ALU and shifter, shared functional units, and
the PiCoGA with its control unit and write-back channel)
The XiRisc reconfigurable processor [66, 68] (Figure 2.10) is a 2-issue
Very Long Instruction Word (VLIW) RISC architecture with two 32-bit
datapaths, featuring a fine-grain reconfigurable functional unit (a PiCoGA,
Pipelined Configurable Gate Array) that allows the user to dynamically
adapt the instruction set to the application workload. PiCoGA is a
multi-context array of 24 rows, each composed of 16 fine-grain
Reconfigurable Logic Cells (RLCs), including four-input 16-bit look-up
tables and dedicated logic to support the efficient implementation of both
arithmetic and logic operators. Programmable interconnects allow
point-to-point bit-level communication using the island-style topology
shown in Fig. 2.11. In order to reduce the area overhead due to
programmable interconnects, the routing topology features 2-bit
granularity, which is relaxed to 1-bit only at the level of the connect
blocks. Fig. 2.12 shows the detailed RLC architecture.
Figure 2.11: Pipelined Configurable Gate Array (PiCoGA) ver. 1.0 (array of
RLCs, each with two 16x2 LUTs, surrounded by switch blocks and horizontal
and vertical connection blocks; 4x32-bit input data bus from the register
file, 2x32-bit output data bus to the register file, 192-bit configuration
bus from the configuration cache)
The XiRisc architecture exploits an assembly-level mechanism to add
customized instructions, called PiCoGA operations or pgaops, each of which
can replace on average 10-40 assembly instructions, and up to roughly 400
when a deep hardware approach (involving, for example, a synthesis step)
is adopted. The PiCoGA implements pipelined instructions using a dataflow
paradigm [63]. As will be described in the next chapters, customized
instructions are extracted from ANSI-C code on the basis of user
annotations. The Griffy compiler translates them into data-flow graphs
(DFGs), which are then mapped onto the PiCoGA. From the programmer’s point
of view, pgaops are application-specific intrinsics (pseudo-function
calls) in C code, or assembly instructions, which trigger PiCoGA
computations. In this way, the user can still adopt a C-based, globally
imperative description style for the implemented code.
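As an illustration of the pseudo-function-call style, the following C sketch shows a small kernel wrapped behind an intrinsic-like function. The name pgaop_sad4 and the choice of kernel (a sum of absolute differences on packed bytes) are hypothetical and are not taken from the XiRisc tool-chain; here the intrinsic is simply emulated in software, whereas on the real processor the call would trigger a PiCoGA computation.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C kernel: sum of absolute differences over the four bytes
 * packed in two 32-bit words. Compiled for a RISC core, a loop like
 * this expands into a few tens of assembly instructions. */
static uint32_t sad4_reference(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}

/* HYPOTHETICAL intrinsic: the name and signature are assumptions made
 * for this sketch. It is emulated by the reference kernel so that the
 * example stays runnable on any host. */
static uint32_t pgaop_sad4(uint32_t a, uint32_t b)
{
    return sad4_reference(a, b);
}

/* The caller keeps an ordinary imperative C style: the customized
 * instruction appears as a function call inside a normal loop. */
uint32_t block_sad(const uint32_t *cur, const uint32_t *ref, int words)
{
    assert(words >= 0);
    uint32_t acc = 0;
    for (int i = 0; i < words; i++)
        acc += pgaop_sad4(cur[i], ref[i]);
    return acc;
}
```

The surrounding loop, the memory accesses and the accumulation remain standard C, which is the property the text attributes to the intrinsic-based approach.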
Figure 2.12: PiCoGA Reconfigurable Logic Cell (RLC) (two slices, each with
a 16x2 LUT, plus input multiplexers, carry-select logic, loop logic and
output registers)
The PiCoGA reconfigurable device is integrated as a Functional Unit
(FU) of the processor core, thus reducing communication overheads to
and from the other FUs. On the other hand, register-file-based
communication can become a bottleneck for applications in which a high
degree of data parallelism could be exploited by streaming or by
vectorized direct memory access. The PiCoGA can load up to 4 pgaops for
each of its 4 configuration contexts, and operations loaded in the same
context can be executed concurrently. Embedded hardwired control logic
handles conflicts on the write-back channels when several pgaops need to
write to the processor register file. Furthermore, the PiCoGA can operate
concurrently with the other functional units of the processor, since data
flow consistency is ensured by a register locking mechanism.
Starting from C source code, the compiling tool builds a pipelined
DFG by scheduling instructions. It then maps:
• DFG-node functionalities onto the PiCoGA RLCs;
• pipeline management onto a row-based dedicated control unit that
enables execution of the mapped pipeline stages [70].
Nodes in the DFG are functional operations mapped onto the device
resources (e.g., an addition/subtraction or a bitwise logic operation).
The pipeline is then built by scheduling the operations of the C-level DFG
representation. As described in [64] (and explained in the following), a
data-flow graph represents dependencies among computational nodes through
the data dependency graph. A pipelined data-flow computation, including
both data dependencies and resource constraints, can be modelled using
synchronous Petri nets [65, 77]. In this model both data dependencies
and resource constraints are represented by arcs and tokens, and each
computation transition fires when all its input arcs hold a token,
producing a token on each of its output arcs. A set of transitions
which fire simultaneously is called a step.
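The firing rule can be made concrete with a minimal C sketch of one synchronous step. The data layout and the names (transition_t, petri_step) are assumptions introduced for illustration only, and the sketch further assumes that each place feeds a single transition, so that transitions firing in the same step never compete for a token.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ARCS 4   /* sketch limit on input or output places per transition */

/* One DFG node modelled as a Petri-net transition. Places are indices
 * into the marking array; each place models one arc of the DFG. */
typedef struct {
    int n_in, n_out;
    int in[MAX_ARCS];
    int out[MAX_ARCS];
} transition_t;

/* A transition is enabled when every input place holds a token. */
static bool enabled(const transition_t *t, const int *marking)
{
    for (int i = 0; i < t->n_in; i++)
        if (marking[t->in[i]] == 0)
            return false;
    return true;
}

/* Execute one synchronous step: all transitions enabled by the marking
 * at the start of the step fire together. Returns how many fired and
 * records which ones in fired[]. */
int petri_step(const transition_t *ts, int n_ts, int *marking, bool *fired)
{
    assert(n_ts >= 0);
    int count = 0;
    /* Phase 1: evaluate enabling on the unchanged marking. */
    for (int k = 0; k < n_ts; k++) {
        fired[k] = enabled(&ts[k], marking);
        count += fired[k];
    }
    /* Phase 2: consume input tokens and produce output tokens. */
    for (int k = 0; k < n_ts; k++) {
        if (!fired[k])
            continue;
        for (int i = 0; i < ts[k].n_in; i++)
            marking[ts[k].in[i]]--;
        for (int i = 0; i < ts[k].n_out; i++)
            marking[ts[k].out[i]]++;
    }
    return count;
}
```

With two chained transitions and two tokens waiting on the first input place, the second call to petri_step fires both transitions in the same step, which is the overlap of successive computations that pipelining relies on. Resource constraints could be added to the same model as extra places.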
Following this elaboration pattern, a dedicated programmable control
unit is used to handle the pipeline activity, triggering the DFG nodes
when all necessary resources are ready and stalling when they are
unavailable. In order to minimize its area occupation, one row control
unit (RCU) is dedicated to each array row, so that the minimum granularity
for activation is 16 RLCs. More than one PiCoGA row can be used to build a
wider pipeline stage. On the other hand, cascading more than one RLC in a
single pipeline stage is often impossible because of the fixed high
working frequency constraint (about 166-200 MHz). When a pipeline stage
performs a computation, the control unit exploits a dedicated programmable
interconnection channel to send tokens to predecessor and successor nodes
[70].
2.4 DREAM adaptive reconﬁgurable DSP
The DREAM architecture [75] is a dynamically reconfigurable platform
coupling the PiCoGA-III reconfigurable device with a RISC processor in
a loosely-coupled, memory-mapped co-processor scheme. A high-bandwidth
memory sub-system provides data to and receives data from PiCoGA-III,
making it possible both to maximize the throughput and to interface the
DREAM architecture with, for example, external computational blocks.
Figure 2.13 shows the simplified DREAM block diagram.
Figure 2.13: Simplified DREAM architecture (RISC processor with
memory-mapped control interface, high-bandwidth memory banks with address
generators, interconnect cross-bar, simple registers, and the PiCoGA-III
array with its control unit; each Reconfigurable Logic Cell contains MUX,
REG, LUT and an ALU supporting add, sub, GF multiplication, ...)
The processor, a 32-bit RISC core with 4+4 Kbyte of data/instruction
memory, is responsible for DREAM management, although it can also be
used to implement portions of applications, such as the control part of
the code. The high-bandwidth memory sub-system is composed of 16
4-Kbyte 32-bit memory banks, each accessed independently of the
others through programmable address generators. A fully-populated
interconnect cross-bar allows the user to modify the connection with the
PiCoGA-III I/Os (12 32-bit inputs and 4 32-bit outputs). A 64-entry
configuration cache is provided for the interconnect, making it possible
to switch among different connection topologies without any additional
overhead, while no such cache is provided for the address generators.
Furthermore, an additional simple 32-bit register file is provided for
local data, synchronized with PiCoGA-III by a register locking mechanism.
The programmable address generators support power-of-2 modulo addressing
in addition to the standard step and stride addressing modes. Fig. 2.14
shows the block diagram of the address generators. When mask is zero, for
each read/write request the local address is initialized to the base
address and is incremented by step (which may be negative) count times.
Once count operations have been performed, the base address is incremented
by stride (which may be negative). If mask is non-zero (and features only
one transition from 0 to 1, as in 0b00001111), it selects the bits for a
bit-wise or between the local counter and the base address. Since the
update of the local counter is masked, the mask makes the count wrap
around on a power-of-2 sub-buffer.
Figure 2.14: Programmable address generator schema
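The addressing rules can be sketched in C as a small software model. This is one possible reading of the description, not the actual hardware: the structure and function names are hypothetical, the base address is assumed to be aligned to the sub-buffer in modulo mode, and the end-of-burst handling is assumed to be the same in both modes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of one address generator (names are assumptions). */
typedef struct {
    uint32_t base;    /* base address of the current burst */
    int32_t  step;    /* increment between accesses, may be negative */
    int32_t  stride;  /* increment of base after count accesses */
    uint32_t count;   /* accesses per burst, must be > 0 */
    uint32_t mask;    /* 0, or 2^k - 1 for modulo addressing */
    uint32_t local;   /* local offset inside the burst */
    uint32_t done;    /* accesses already issued in this burst */
} addr_gen_t;

void ag_init(addr_gen_t *g, uint32_t base, int32_t step, int32_t stride,
             uint32_t count, uint32_t mask)
{
    assert(count > 0);
    assert((mask & (mask + 1u)) == 0);   /* mask must be 0 or 2^k - 1 */
    g->base = base;
    g->step = step;
    g->stride = stride;
    g->count = count;
    g->mask = mask;
    g->local = 0;
    g->done = 0;
}

/* Return the address of the next access and advance the state. */
uint32_t ag_next(addr_gen_t *g)
{
    uint32_t addr;
    if (g->mask == 0) {
        addr = g->base + g->local;            /* step/stride addressing */
    } else {
        /* low bits from the masked local offset, high bits from base */
        addr = (g->base & ~g->mask) | (g->local & g->mask);
    }
    g->local += (uint32_t)g->step;            /* arithmetic modulo 2^32 */
    if (g->mask != 0)
        g->local &= g->mask;                  /* wrap inside the sub-buffer */
    if (++g->done == g->count) {              /* end of burst */
        g->done = 0;
        g->local = 0;
        g->base += (uint32_t)g->stride;
    }
    return addr;
}
```

For example, with base 100, step 4, stride 32, count 3 and mask 0, this model produces 100, 104, 108 and then restarts from 132. With base 0x40, step 1 and mask 0x7, it cycles through the eight addresses 0x40 to 0x47.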
2.4.1 PiCoGA-III architecture
The PiCoGA-III is a programmable gate array especially designed to
implement high-performance algorithms described in C language. The focus
of the PiCoGA-III is to exploit the Instruction Level Parallelism (ILP)
available in the innermost loops of a wide spectrum of applications,
including multimedia, telecommunication and data encryption. From a
structural point of view, the PiCoGA-III is composed of 24 rows, each
implementing a possible stage of a customized pipeline. Each row is
composed of 16 Reconfigurable Logic Cells (RLCs) and a configurable
horizontal interconnect channel. Each RLC includes a 4-bit ALU, which
efficiently implements 4-bit-wide arithmetic/logic operations, and a
64-bit look-up table, which handles small hash tables and irregular
operations that are hard to describe in C and that traditionally benefit
from bit-level synthesis. Each RLC can hold an internal state, such as the
result of an accumulation, and provides fast carry chain propagation
along a PiCoGA row. In order to improve the throughput, the PiCoGA
supports the direct implementation of Pipelined Data-Flow Graphs (PDFGs),
thus allowing the execution of successive instances of the same PGAOP to
overlap (where a PGAOP is a generic operation implemented on the PiCoGA).
Flexibility and performance requirements are met by handling the pipeline
evolution through a dynamic data-dependency check performed by a
dedicated Control Unit.
In summary, compared with traditional embedded FPGAs featuring a
homogeneous island-style architecture, the PiCoGA-III is composed
of three main sub-parts:
• A homogeneous array of 16x24 RLCs with 4-bit granularity (capable
of performing operations, for example, between two 4-bit-wide variables)
and connected through a switch-based 2-bit-wide interconnect matrix;
• A dedicated Control Unit which is responsible for enabling the execution
of RLCs under a dataflow paradigm;
• A PiCoGA Interface which handles the communication from and to
the system (data availability, stall generation, and so on).
In terms of I/O channels, the PiCoGA-III features 12 32-bit inputs and
4 32-bit outputs, thus allowing each PGAOP to read up to 384 bits and
to write up to 128 bits per cycle. The PiCoGA-III is a 4-context
reconfigurable functional unit capable of loading up to 4 PGAOPs for each
configuration layer. PGAOPs loaded in the same layer can be executed
concurrently, but a stall occurs when a context switch is performed. The
main features of the PiCoGA architecture are:
• A fine-grain configurable matrix of 16x24 RLCs.
• A reconfigurable Control Unit, based on 24 Row Control Units (RCUs),
that handles the matrix as a datapath.
• 12 primary 32-bit inputs and 4 primary 32-bit outputs.
• 4 configuration contexts, provided as a first-level configuration
cache:
– only 2 clock cycles are required to change the active context
(context switch);
– only 1 configuration context can be active at a time.
• Up to 4 independent PiCoGA operations can be loaded in each
context, featuring partial run-time reconfiguration.
Each RLC can compute algebraic and logic operations on 2 operands of
4 bits each, producing carry-out and overflow signals and a 4-bit result.
As a consequence, each row of 16 RLCs can provide one 64-bit operation or
two 32-bit operations (or four 16-bit operations, eight 8-bit operations,
and so on). The cells communicate through an interconnection architecture
with a granularity of 2 bits.
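The way 4-bit cells compose into wider operators can be illustrated with a functional C model of an adder built from chained 4-bit slices. The names add4 and row_add are assumptions made for this sketch, and the model only reproduces the arithmetic result: it says nothing about the timing of the fast carry chain implemented in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Result of one 4-bit cell: a 4-bit sum and a carry-out. */
typedef struct {
    uint8_t sum;    /* 4-bit result */
    uint8_t carry;  /* carry-out, 0 or 1 */
} slice_out_t;

/* Functional model of a single 4-bit addition with carry-in. */
static slice_out_t add4(uint8_t a, uint8_t b, uint8_t carry_in)
{
    unsigned t = (a & 0xFu) + (b & 0xFu) + (carry_in & 1u);
    slice_out_t r = { (uint8_t)(t & 0xFu), (uint8_t)(t >> 4) };
    return r;
}

/* Chain n_cells 4-bit cells (1..16) into a (4 * n_cells)-bit adder.
 * Sixteen cells give a 64-bit addition, eight cells a 32-bit one. */
uint64_t row_add(uint64_t a, uint64_t b, int n_cells, uint8_t *carry_out)
{
    assert(n_cells >= 1 && n_cells <= 16);
    uint64_t result = 0;
    uint8_t carry = 0;
    for (int i = 0; i < n_cells; i++) {
        slice_out_t s = add4((uint8_t)(a >> (4 * i)),
                             (uint8_t)(b >> (4 * i)), carry);
        result |= (uint64_t)s.sum << (4 * i);
        carry = s.carry;   /* propagate to the adjacent cell */
    }
    if (carry_out)
        *carry_out = carry;
    return result;
}
```

Using 8 of the 16 cells of a row for one 32-bit addition leaves the other 8 free for a second, independent 32-bit operation, which matches the partitioning options listed in the text.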
Figure 2.15: Simplified PiCoGA-III Reconfigurable Logic Cell (RLC)
(blocks: input preprocessing of the first and second 4-bit operands and
of a 2-bit control operand, 4-bit ALU, LUT, carry chain with carry in and
carry out, output register with internal feedback for accumulators;
annotated functions: multiplexer, conditional operations, sign inversion,
addition, multiplication block, saturating arithmetic, Galois field
multiplication over GF(2^4), bit-level operations, comparison, hash, LUT
configurable as 6:1, 5:2 or 4:4 tables; outputs: 4-bit result and 2-bit
control output carrying sign, overflow and carry out)
Each task mapped on the PiCoGA is called a PGAOP. The granularity of a
PGAOP is typically equivalent to some tens of assembly operations. Each
PGAOP is composed of a set of elementary operators (logic or arithmetic
operations) that are mapped onto the array cells.
Each PiCoGA cell also contains a storage element (flip-flop, FF) that
samples the output of each operation. This storage element cannot be
bypassed to cascade different cells, because PiCoGA works at a constant
frequency. PiCoGA can thus be considered a pipelined structure in which
each elementary operator constitutes a pipeline stage. Computation on the
array is controlled by an RCU, which triggers the elementary operations
mapped on the array. Each elementary operation occupies at most one clock
cycle. A set of concurrent (parallel) operations forms a pipeline stage.
The internal architecture of the Reconfigurable Logic Cell is depicted
in Fig. 2.15. Three different structures can be identified:
• The input pre-processing logic, which is responsible for internally
routing inputs to the ALU or the LUT and for masking them when a constant
input is needed.
• The elaboration block (ALU & LUT), which performs the actual
computation based on the operation selected by the RLCop block.
• The output manager, which can select outputs from the ALU, the
LUT, and possibly the carry chain, and synchronizes them
through flip-flops. The output block samples when it receives the
Execution Enable from the control unit. The control unit is therefore
responsible for the overall data consistency as well as for the pipeline
evolution.
Operations implemented in the ALU & LUT block are:
• 4-bit-wide arithmetic/logical operations, possibly propagating a carry
to the adjacent RLC (e.g. add, sub).
• 64-bit lookup tables organized as:
– 1-bit output, 4/5/6-bit inputs;
– 2-bit outputs, 4/5-bit inputs;
– 4-bit outputs, 4-bit inputs;
– a couple of independent lookup tables featuring respectively 1-bit
output with 4-bit inputs, and 2-bit outputs with 4-bit inputs.
• 4-bit multiplier module; in more detail, it is a multiplier module with
10 bits of input (in the case of A * B, 6 bits for operand A and 4 bits
for operand B) and a 5-bit output, including 12 carry-select adders and
specifically designed to implement small and medium multipliers
efficiently on PiCoGA resources.
• 4-bit Galois Field Multiplier over GF(2^4), with irreducible polynomial
[...].
Chapter 3
Programming tools for
reconﬁgurable processors
3.1 Motivations
In the past, the term flexibility has often been linked with the software
implementation offered by general-purpose microprocessors, whose
computational model aims at:
• modifying the task (the algorithm) by changing a set of instructions
(the program) in a read/write memory;
• implementing the algorithm using a small number of general computing
resources, roughly corresponding to the assembly instructions, which are
reused in the course of time.
This kind of computation is called temporal computation, and it rapidly
shows its limitations when the algorithm operations fail to match the
hardware computational resources [11]. In reconfigurable architectures the
application features are exploited through the configuration of
customizable hardware, thus improving the match between processor
capabilities and algorithm requirements. Unlike the temporal computation
of traditional processors, reconfigurable processors use a spatial model
of computation that (ideally) matches the algorithm perfectly, in terms of
both available parallelism and computation granularity. Significant
speed-up and energy reduction can be achieved on critical kernels compared
to standard architectures [7].
Unfortunately, programmable fabrics are less efficient and require far
more area than dedicated circuits, which exploit the spatial model
of computation as well but lack flexibility. The silicon cost is often a
strong limitation for reconfigurable architectures proposed to the
consumer market. One acceptable trade-off is to exploit spatial
computation only on the critical kernels of applications, thus reducing
area requirements. Non-critical computations or control-dominated tasks
(which often show a very small degree of instruction level parallelism)
are efficiently mapped onto the standard processor core, taking advantage
of its software programmability and shortening the overall development
time. According to Amdahl’s law and the 90-10 rule (90% of the time is
spent executing 10% of the lines of a code [28]), performance can be
enhanced by up to roughly one order of magnitude when implementation
efforts focus on the identification and improvement of a few critical
kernels.
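The one-order-of-magnitude bound follows directly from Amdahl's law, as the short C sketch below shows. The function name and its parameters (f, the fraction of execution time spent in the accelerated kernels, and s, the speed-up obtained on them) are introduced here for illustration.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a fraction f of the execution
 * time (0 <= f <= 1) is accelerated by a factor s (s > 0) while the
 * remaining part is left unchanged. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.9, as in the 90-10 rule, a kernel speed-up of 10 gives an overall speed-up of about 5.3, and a kernel speed-up of 1000 gives about 9.9. However large s becomes, the result stays below 1 / (1 - f) = 10, because the remaining 10% of the time is not accelerated.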
A typical ﬂow for the development of applications on a reconﬁgurable
processor can be based on a common processor-oriented tool-chain aug-
mented in order to handle the instruction set extension. Starting from a
high-level description language, the developer needs to partition the ap-
plication code between hardware and software portions, typically guided
by simulation and proﬁling back-annotations. The partitioning is an iter-
ative process, implying reﬁnements and modiﬁcations of a given imple-
mentation in order to exploit as much as possible the space-based com-
putation. When the partitioning is decided, the programmer (possibly
helped by tools and utilities) describes the hardware part in a proper lan-
guage, while the interaction between the processor and the reconﬁgurable
hardware can be accomplished by means of built-in functions and/or as-
sembly inlining.
When a portion of code is considered suitable for hardware implementation,
it is translated into the description language used as the entry point
for the reconfigurable device. A hardware-specific tool-chain then maps
the description onto the device, providing the configuration bit-stream.
The translation can be performed by re-writing the code from scratch (for
example, the algorithm is completely rewritten in VHDL/Verilog) or can be
assisted by tools. Automatic high-level language translation relieves the
user of the burden of learning an HDL, but introduces an abstraction
layer that hides many implementation details, making it hard to control
performance accurately. On the other hand, when the level of abstraction
is tightly linked to the underlying hardware, the time spent on optimizing
an application grows, as do the skills required.
Processor-based computation allows the designer to exploit instruc-
tion level parallelism (ILP), while a hardware-oriented approach allows
one to match application requirements perfectly. If we consider the de-
velopment time spent in obtaining a speciﬁc implementation, processor-
based approaches typically require a matter of minutes to compile a given
source code, or a few days to optimize critical parts at the assembly level
and some weeks to maximize performance using manually-programmed
high-end VLIW DSPs. Of course, use of application-speciﬁc IPs or manual
optimization of an assembly code demands a deep knowledge of the un-
derlying architecture. In the hardware approach, the implementation of a
given application in an ASIC takes a long time, it being necessary to de-
scribe the algorithm in an optimized RTL HDL taking into account critical
paths, for example, after both physical synthesis and place-and-route.
These preliminary considerations are summarized in Figure 3.1, where
the performance improvement of a hardware implementation in terms of
execution time can be more than 2 orders of magnitude higher than that of
a software one. While a processor-based, fully-software flow allows the
designer to match the ILP of a given application, a hardware-oriented
approach requires one order of magnitude more development time to achieve
the best performance. Reconfigurable platforms should find their place in
the middle of this design space. The definition of design patterns and
frameworks for reconfigurable processors that enable inexperienced users
to build applications defines an intermediate “optimization” curve along
which the design space moves from software to hardware, in terms of both
expected performance and the skills required to obtain it.
Figure 3.1: Performance vs. Time-to-develop design space (axes: time and
performance improvement; curves: software approach, hardware approach and
a reconfigurable computing learning curve marked with a question mark;
limits: ILP boundary and application boundary)
A further interpretation of these curves can be given in terms of cost
modelling. Few works have attempted to examine the impact on costs of
hardware/software trade-offs in embedded-system co-design [84, 85, 86,
87]. While the software cost is estimated using standard models, such as
COCOMO [82], in these works the hardware cost is mainly based on COTS
(Commercial Off-The-Shelf) components and libraries of functions. The
customization of a reconfigurable processor, despite involving hardware
concepts, is not well suited to being modelled as a standard hardware
development, because it requires an existing component (the
reconfigurable processor) to be programmed rather than a new hardware
component to be designed. On the other hand, development on a
reconfigurable processor needs to take into account many more details
than on a standard processor. For example, in the case of the DSP in
Fig. 1.3, better performance can be achieved by spending additional time
programming the processor at a lower level of abstraction. Hence, the
cost model is characterized by two calibration factors [83] that describe
the average number of code lines per function point, one for the part
of the application written in C and another for the part written in
assembly. The total development time depends on the respective
percentages of function points implemented in C and in assembly. The same
approach holds for reconfigurable processors, but in this case the
specific tools and languages for the given reconfigurable logic need to
be considered in addition to C and assembly.
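The role of the calibration factors can be sketched in C as follows. This is only an illustrative reading of the model: the type and function names are hypothetical, no calibration values are taken from [83], and the conversion from code size to development time (for example through a COCOMO-style effort equation) is deliberately left out.

```c
#include <assert.h>

/* One implementation language in the cost model (hypothetical layout). */
typedef struct {
    double fp_share;    /* fraction of function points in this language */
    double loc_per_fp;  /* calibration factor: code lines per function point */
} lang_share_t;

/* Estimated code size, in lines, of an application of total_fp function
 * points split across n languages. The shares must sum to 1. */
double estimated_loc(double total_fp, const lang_share_t *langs, int n)
{
    double loc = 0.0, share_sum = 0.0;
    assert(total_fp >= 0.0 && n > 0);
    for (int i = 0; i < n; i++) {
        assert(langs[i].fp_share >= 0.0 && langs[i].loc_per_fp > 0.0);
        loc += total_fp * langs[i].fp_share * langs[i].loc_per_fp;
        share_sum += langs[i].fp_share;
    }
    assert(share_sum > 0.999 && share_sum < 1.001);
    return loc;
}
```

For a standard processor the array would hold two entries, one for C and one for assembly. For a reconfigurable processor a third entry, with its own calibration factor, would describe the portion written in the language of the reconfigurable logic.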
3.2 Algorithm development on reconﬁgurable
processors (programming issues)
Processor-based systems-on-chip (SoCs) are becoming the most popular
way to perform computation in the electronic marketplace. Today, at least
one processor is present in every SoC in order to handle the overall
system synchronization in a simple fashion, whether this is provided by
operating system functionalities (e.g. multitasking management, real-time
issues) or required for I/O communications. Usually, the processor
(e.g. ARM9, PowerPC, MIPS, ...) is not the main party responsible for the
computation, which is delegated to high-performance co-processing engines.
Depending on the application constraints and the flexibility required,
computation-intensive parts are implemented on dedicated hardware
accelerators (when non-recurring costs allow it) or on
application-specific digital signal processors (DSPs). In this context,
high-end DSPs are proposed as a way to combine flexibility (since they
are software programmable) with high performance. Architectures like the
Texas Instruments OMAP or the STMicroelectronics STW51000 (also known as
GreenSIDE) are state-of-the-art examples of commercial ARM-based SoCs
powered by one or more DSPs plus one or more dedicated hardware
accelerators. One of the most interesting trends in the field of
high-performance SoCs is the introduction of dynamically (or run-time)
reconfigurable hardware (e.g. embedded FPGAs, reconfigurable datapaths
and reconfigurable processors) in place of the constellation of DSPs
and/or dedicated hardware accelerators necessary today to meet constraints
in terms of performance and energy consumption [1, 2, 7, 8, 11, 10]. In
general terms, the exploitation of this kind of architecture implies the
capability to tailor the SoC functionalities around the computational
requirements of a given application. This can be seen as an instruction
set extension of the main processor (e.g. the ARM in the previously cited
examples), the reconfigurable hardware being a run-time extension of the
baseline computation engine.
As for DSPs and dedicated accelerators, the exploitation of any degree
of parallelism at bit, word, instruction, iteration and task level is
the control lever for the effective utilization of the reconfigurable
hardware. This requires of the programmer a deep knowledge of both the
application and the system architecture, in order to understand how to
partition and map algorithms over the different computational resources
available. It also requires the ability to explore a hybrid design space
including both software and hardware concepts, a requirement that is
unusual for application developers long used to C/assembly design
environments. With respect to mask-time programmable hardware
accelerators, reconfigurable computing offers programmers the capability
to design their own extensions in order to satisfy application-specific
requirements. Therefore, the capability of providing soft-hardware (or
hardware programmable as software) is probably the most important point
in enabling the large market of application developers to use a
reconfigurable device effectively [29].
In the past, targeting a reconfigurable device borrowed tools and
methodologies from FPGA-based design (with hand-coded RTL HDL), although
the severe lack of user-level programmability of this approach was clear
from the beginning. The use of the C language has been seen as the most
promising way to bring customers to the reconfigurable proposal. On one
hand, C dialects have been presented that include entirely new object
classes dedicated to hardware design, as in SystemC or Handel-C. This
kind of approach moves C toward hardware design, making hardware
description friendlier through a C-based language that basically becomes
another HDL, and therefore requires hardware skills of the developers.
On the other hand, a more promising approach is to use standard ANSI C
code and translate it into some kind of RTL, which requires a sort of
C-to-RTL hardware compiler. Companies like Celoxica, Mentor Graphics,
Impulse, Altium and CriticalBlue offer stand-alone C-to-RTL and/or
C-dialect-to-RTL synthesizers that can be integrated in standard flows
for FPGAs and that were used in many works on reconfigurable systems
implemented using commercial FPGAs.
For embedded applications, the reconfigurable device is part of a
usually complex system with a rigid budget in terms of area and cost. As
underlined in the previous chapter, this precludes the use of standard
FPGAs, since they are too area-demanding for embedding and not appealing
for the final implementation of the whole system (in terms of
performance, power consumption and cost). Reconfigurability is then
provided through embedded FPGAs (small FPGAs suitable for embedding in a
SoC), reconfigurable data-paths and reconfigurable processors that offer
flexibility under typically hard area constraints. The area limitation
is met by reducing the number of programmable elements and equivalent
KGates available: while a stand-alone high-end FPGA requires some
hundreds of mm², reconfigurable devices show an area occupation ranging
from a few mm² to some tens. Even if much smaller than FPGAs,
reconfigurable devices are very often considered huge by SoC designers.
This means that the reconfigurable device needs to be as small as
possible, while the configuration efficiency must grow up to the peak
performance offered by the device.
On the architecture side, the area limitation can be met by an accurate
trade-off between logic and interconnect. For example, in island-style
programmable architectures it is possible to achieve better area figures
by increasing the grain of the basic logic element with respect to the
interconnect structure, or by decreasing the interconnect capabilities,
for example limiting connections to the level of rows and/or
neighboring logic elements [53, 55]. This implies an undeniable
reduction in flexibility, the price paid to guarantee a small area
budget. On the programming side, the increase in design constraints, and
hence the reduction of degrees of freedom in the mapping of algorithms,
implies that any inefficiency of the automated high-level synthesizer
leads to a dramatic loss of performance. To avoid this, many
reconfigurable devices provide structural languages in which operators
are directly mapped onto the device without synthesis, and the
application designer can tune, refine or rewrite the implementation from
scratch in order to maximize the performance benefit, in the same way
that a DSP programmer could use assembly language.
All these preliminary considerations can be summarized in a few points
that we can see as requirements for an application development
environment in the field of reconfigurable computing:
• to be appealing to the wide world of software and DSP programmers,
such an environment needs to be as similar as possible to traditional
software-only environments;
• to be effective and to justify the large investment in terms of area
and cost required by reconfigurable hardware, such an environment
needs to provide the capability to exploit architectural features as
much as possible.
3.3 Instruction set extension implementation on
a standard compilation tool-chain
The extension of a standard software tool-chain to support instruction
set metamorphosis requires analyzing the role played by each tool and
the efficiency required of each step, with the final goal of offering
application developers a tool-chain in which hardware and software can
be handled together. The introduction of instruction set extensions
implies modifications in each step of the compilation process, and the
addition of configuration tools dedicated to mapping instruction set
extensions onto the reconfigurable hardware. In this section, we focus
on the software support necessary to handle instruction set
reconfiguration from a C compiler, while aspects concerning the
definition of the extensions and their mapping on the hardware will be
dealt with in the next sections.
In general terms, modifications to the assembler and the linker are kept
as small as possible, since the assembler can be reduced to a simple
mnemonic translator and the linker only needs to include the bitstream,
if any, for the hardware customization. In contrast, the high-level
compiler needs to be aware of the reconfigurable parts in order to help
the user in the optimization process. We can ask many tasks of
programming tools for reconfigurable processors:
• to provide the user with the capability to define and instantiate an
extended instruction;
• to schedule the extended instruction accurately;
• to automatically recognize user-defined extended instructions in
general-purpose code;
• to detect critical kernels and automatically generate a set of
extended instructions.
The definition of extended instructions is usually accomplished by
dedicated tools, while the capability of instantiating the extended
instructions in software code can be obtained by using the same
functionality provided for assembly inlining. The last three points are
very specific to reconfigurable computing. Accurate scheduling and
identification of custom instructions can be handled by modifying the
machine description and the intermediate representation (a sort of
virtual machine-independent assembler) of the compiler. In the case of
traditional C tool-chains (e.g. GNU GCC [97]), this implies the complete
recompilation of the compiler front-end since the machine description is
static. It is thus possible to deal with extended instructions in the
same way that a compiler handles floating point extensions, describing
the required functional units in the intermediate representation [98].
Of course, this proves to be a hard obstacle for most application
developers, also in terms of the time required during design-space
exploration, when the instruction set extension is still under
definition.
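
Instantiating an extended instruction through assembly inlining can be sketched as follows. This is a minimal illustration, not the tool-chain described in this thesis: the mnemonic `ext.sad4`, the macro `TARGET_HAS_EXT_SAD4` and both function names are hypothetical, and the chosen operation (a sum of absolute differences over four bytes) is only an example. On a host without the extension, the wrapper falls back to a C emulation.

```c
#include <stdint.h>

/* Reference behaviour of the hypothetical extended instruction:
 * sum of absolute differences over the four bytes of two words. */
static uint32_t sad4_emulation(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int32_t x = (int32_t)((a >> (8 * i)) & 0xFFu);
        int32_t y = (int32_t)((b >> (8 * i)) & 0xFFu);
        sum += (uint32_t)(x > y ? x - y : y - x);
    }
    return sum;
}

/* Wrapper seen by the application code. */
static inline uint32_t sad4(uint32_t a, uint32_t b)
{
#if defined(TARGET_HAS_EXT_SAD4)   /* hypothetical macro of a retargeted compiler */
    uint32_t r;
    /* "ext.sad4" is a made-up mnemonic: the compiler only pastes the
     * template, so the assembler must know the new instruction. */
    __asm__ ("ext.sad4 %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
#else
    return sad4_emulation(a, b);   /* host fallback */
#endif
}
```

With this approach the compiler treats the extended instruction as an opaque template: it allocates registers for the operands, but knows nothing about the latency or the resources used, which is why scheduling support requires changes to the machine description.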
Alternative approaches have been proposed in research projects on
advanced high-level compilers like Impact [21] and SUIF [22]. In these
cases machine descriptions and intermediate representations can be
dynamically extended without rebuilding the tools, since the description
of the target architecture is read before each compilation. This implies
that all the internal automata required to implement, for example, the
pattern matching and scheduling steps are dynamically generated from the
architecture description. State-of-the-art compilers able to handle the
optimized scheduling of custom instructions (even those featuring long
latencies) can be found, for example, in the MOLEN project [12] and in
the DRESC framework [13], based on SUIF and Impact respectively.
Moreover, the Trimaran framework [14] proposes a scheduling mechanism
based on simulation and profiling back-annotations to reduce stalls in
an execution-aware environment, at the cost of an impact on compilation
time.
This point introduces the last issue that reconfigurable computing
imposes on programming tools, namely the reconfiguration of the
simulator. Like the compiler, the simulator needs to be adapted to
changes or extensions of the instruction set. The Language for
Instruction Set Architecture (LISA), commercially available from CoWare
and Axys, as well as open-source architecture description languages like
Arch-C, are examples of frameworks where cycle-accurate instruction set
simulators can be built with the support of native structures
implementing typical processor objects, like the pipeline or the
register file. This approach requires rebuilding the instruction set
simulator every time the instruction set is changed. In [15] an
alternative approach is proposed: a dynamically linked library is used
to model the instruction set extension, while the main processor is
modelled by standard simulator support. The mechanism, described later
in this thesis, is applied to both functional and cycle-accurate
simulation, integrating it in an environment based on LISA and SystemC,
and in a purely functional debugging environment based on the GNU GDB
simulator.
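
The idea of keeping the extension model outside the simulator can be illustrated with a small dispatch table. This is a simplified sketch, not the mechanism of [15]: the names `ext_fn`, `ext_register`, `ext_execute` and `emu_add_sat8`, the two-operand signature and the table size are all assumptions made for the example. In a real setup the function pointer would typically be obtained from the shared library (for example with `dlopen`/`dlsym`); here it is passed in directly so that the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_SLOTS 16   /* assumed number of extension opcodes */

/* Signature every emulation routine is assumed to export. */
typedef uint32_t (*ext_fn)(uint32_t rs1, uint32_t rs2);

static ext_fn ext_table[EXT_SLOTS];   /* NULL = slot not configured */

/* Bind an emulation routine to an extension slot without rebuilding
 * the simulator. Returns 0 on success, -1 on invalid arguments. */
static int ext_register(unsigned slot, ext_fn fn)
{
    if (slot >= EXT_SLOTS || fn == NULL)
        return -1;
    ext_table[slot] = fn;
    return 0;
}

/* Simulator hook for the extended opcode: dispatch to the plug-in.
 * Returns -1 where a real simulator would raise an illegal-instruction trap. */
static int ext_execute(unsigned slot, uint32_t rs1, uint32_t rs2, uint32_t *rd)
{
    if (slot >= EXT_SLOTS || ext_table[slot] == NULL)
        return -1;
    *rd = ext_table[slot](rs1, rs2);
    return 0;
}

/* Example emulation routine, standing in for a generated one:
 * saturating addition of the two low bytes. */
static uint32_t emu_add_sat8(uint32_t a, uint32_t b)
{
    uint32_t s = (a & 0xFFu) + (b & 0xFFu);
    return s > 0xFFu ? 0xFFu : s;
}
```

The base instruction set stays compiled into the simulator, while changing the extension only means registering a different routine.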
[Figure 3.2: block diagram; recoverable block labels: High-Level Language, Partitioning, Compiler, Mapping, Configuration Bit-Streams, Basic ISA, Extended ISA, Assembler, Linker, Executable File, Simulator, Profiler]
Figure 3.2: Basic software tool-chain extension to support
reconfigurability issues
Figure 3.2 shows a simplified and very general block diagram of a
programming environment supporting reconfigurable computing. It includes
the basic software support (compiler + assembler + linker + simulator)
previously described, and the partitioning and configuration parts.
Partitioning is the process, automatic or not, of design-space
exploration in which critical tasks or kernels are moved from a software
implementation to a hardware one (and vice versa) depending on the
required performance (speed, energy, ...). Today, this process is
usually under the full control of the programmer, although it can be
assisted by tools.
Much research is heading toward full automation of partitioning, since
it would be the most appealing enabling step toward true
soft-programmable hardware (e.g. [6]). Despite this, very few works are
leaving academic research to challenge the market, and these few are
focused on the field of mask-time programmable devices (e.g. [56]). Even
if constrained to provide good area figures in order to be appealing for
integration in systems-on-chip, reconfigurable devices remain very
precious (and area-demanding) resources whose cost must be repaid by
high performance. The programming efficiency required for run-time
dynamically reconfigurable devices can be achieved only by fully
exploiting their computational capabilities, with a very restricted
margin for the overhead that an automatic design flow can introduce.
This is probably the most important difference between configurable
solutions (like mask-programmable devices, application-specific standard
processors, and so on) and reconfigurable solutions, and it heavily
impacts programming models and languages. In fact, while for
configurable solutions both the literature and commercial proposals talk
of high-level description languages, like C, the common proposal for
dynamically reconfigurable devices is some kind of structural form, like
assembly, as will be described later in this chapter.
The last block in Figure 3.2 is the conﬁguration engine, a tool that
starting from some kind of description language is capable of providing
the conﬁguration bitstream for the reconﬁgurable device. This tool is (of
course) tightly coupled with the underlying hardware, and for C-based
conﬁguration ﬂow it represents the bridge from the software to the hard-
ware worlds.
3.4 Bridging the gap from hardware to software
through C-described Data Flow Graphs
Programming of reconfigurable devices can be performed in many different
ways, borrowing methods and tools from standard hardware design (VHDL or
Verilog) or from software compilation. As stated in [45], there is no
real difference between high-level behavioral synthesis and
non-optimizing compilation of programming languages, since both are
basically translations of the initial language into an intermediate
representation. In contrast, optimization is a very different step in
hardware synthesis than in software compilation, with different metrics
and cost functions. Another common point between compilation and
synthesis is that graphs are most often used for internal
representations. In software programs, we can distinguish between two
kinds of graphs: Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs).
The CFG is the representation of the paths that might be traversed
through a program during its execution. Each node of the CFG is known as
a basic block, and its graph representation is a DFG. The DFG describes
the dependencies among the set of operations required for the data
processing. As shown in the example in Figure 3.3, the branches of a
conditional statement (if ... then ... else ...) are represented as
nodes of the CFG, while the operations performed in each branch are
described by a DFG “attached” to a CFG node.
[Figure 3.3: the statement "if (cond) { Insn1_1; ...; Insn1_N; } else { Insn2_1; ...; Insn2_M; }" drawn as a Control Flow Graph (Header (test condition), Body1, Body2, End) next to the Data Flow Graph of a basic block (Insn1, Insn2, ..., InsnM)]
Figure 3.3: Examples of control and data flow graphs
In hardware description languages, sequential and concurrent definitions
of operations co-exist. As an example, the behavior of a process or the
expression assigned to a signal follows a sequential paradigm, although
this does not mean that the same semantics as software languages like C
are used. They can be viewed as nodes of the DFG, or as sub-graphs of a
DFG, depending on the granularity we assign to the node. In any case,
hardware description languages use an event-driven activation mechanism
in which more than one DFG, and more than one DFG node, can natively be
active at a time, and this point represents the most significant
difference with respect to the control parts of software languages. Of
course, during compilation for processors featuring some degree of
parallelism (e.g. VLIWs, superscalars, TTAs, ...) this constraint is
heavily relaxed, bringing the software implementation nearer to the
hardware, despite the different optimization metrics.
For the deﬁnition of a suitable bridge between hardware and software
in the ﬁeld of reconﬁgurable computing, the DFG probably represents the
most natural choice. In fact, for reconﬁgurable processors the control part
is typically managed by the processor core, while the hardware accelera-
tion is provided for the DFGs. Hence, the DFG suitable for the mapping
on the reconﬁgurable device can be described by a sequential language,
like C, but it can be viewed at the same time as an abstract representation
of a circuit.
The exploitation of parallelism is the key to the effectiveness of
reconfigurable computing, be it at word level or loop level. Standard
software compilation techniques like software pipelining [61], iterative
modulo scheduling [62] and vectorization [60] are examples of well-known
methods that increase the instruction-level parallelism by exploiting
loop-level data parallelism. Loop transformations are widely used in
compilation for VLIW processors to maximize performance, and they are
also applied to exploit SIMD (Single-Instruction Multiple-Data)
extensions (like Intel MMX or AMD 3DNow!). These methodologies can be
efficiently applied to transfer loop-level parallelism to the
instructions in the loop body, thus increasing the instruction-level
parallelism of the innermost DFG, while more hardware-oriented methods
can be applied for the efficient mapping of DFGs onto reconfigurable
devices. Starting from a software description of the DFG, where the
instructions (or DFG nodes) are executed in the same order in which they
are written in the code, we can relax the enabling rule of the DFG,
executing each node as soon as its inputs are available and its output
can be overwritten, as described in [63]. The run-time execution of a
DFG can thus be modelled by Petri Nets, as in [64, 65]. Furthermore, by
node scheduling and register insertion it is possible to build the DFG
in a pipelined form without affecting its functionality. In this case,
it is possible to overlap the execution of successive DFG activations
(if the data dependencies allow it), hence improving performance by
exploiting parallelism at the iteration level, as in [59, 60].
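
The effect of the relaxed enabling rule on overlapped activations can be illustrated with a toy simulation. The sketch below assumes the simplest possible DFG, a linear chain of nodes with one single-token register between consecutive nodes, each node taking one cycle; the function name `pipelined_cycles` and the bound `MAX_STAGES` are introduced here for illustration, and this is not the Petri Net model used later in the thesis.

```c
#include <stdbool.h>

#define MAX_STAGES 16   /* illustrative upper bound on the chain length */

/* Count the cycles needed to push `activations` inputs through a chain
 * of `stages` DFG nodes under the relaxed firing rule: a node fires when
 * its input register holds a token and its output register is free. */
static int pipelined_cycles(int stages, int activations)
{
    bool place[MAX_STAGES + 1] = { false };  /* place[i] feeds node i */
    int injected = 0, retired = 0, cycles = 0;

    if (stages < 1 || stages > MAX_STAGES || activations < 1)
        return -1;

    while (retired < activations) {
        /* The environment supplies a new input as soon as place[0] is free. */
        if (!place[0] && injected < activations) {
            place[0] = true;
            injected++;
        }
        /* Visit nodes from the output side, so that a consumer frees its
         * input register before the producer checks it. */
        for (int i = stages - 1; i >= 0; i--) {
            if (place[i] && !place[i + 1]) {   /* enabling rule */
                place[i] = false;
                place[i + 1] = true;
            }
        }
        cycles++;
        /* The result register is read by the environment every cycle. */
        if (place[stages]) {
            place[stages] = false;
            retired++;
        }
    }
    return cycles;
}
```

In this model a chain of S nodes completes K activations in S + K - 1 cycles, against S * K cycles when each activation must finish before the next one starts; for S = 3 and K = 4 this is 6 cycles instead of 12.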
Up to this point, we have discussed the role played by DFGs as a bridge
between software and hardware. One more point is of course the way in
which the DFG can be described in order to meet the requirements of
effectiveness and friendliness set as the basis of a programming
tool-chain for reconfigurable processors. We said that the entry
language must be appealing to software programmers and must be effective
in terms of hardware utilization. An interesting option is to use the C
language for that goal: it allows DFGs to be described, since DFGs are
representations of basic blocks, and it also allows the DFG topology to
be controlled under simple restrictions. For example, the use of a
single-assignment form, in which each variable is assigned exactly once,
can help the user in DFG modelling, providing a simple way of
efficiently handling all the data dependencies, as proposed in [40]. The
single-assignment form is today used in many compilation frameworks as
an important intermediate representation in order to both simplify and
optimize the internal compilation steps. Starting from version 4, GNU
GCC also makes extensive use of single-assignment representations,
although the conversion to single assignment is performed (to my
knowledge) only for scalar register values (everything except memory) at
the level of basic blocks. In a single-assignment DFG description,
conditional statements (if ... then ... else) are therefore converted
into a speculative form which executes each branch of the statement
concurrently and introduces merge nodes that select the correct outputs
among the branch replications, similarly to the functionality provided
by multiplexers in hardware design.
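
The speculative conversion of a conditional statement can be shown on a small example in plain C. The two functions below are illustrative (their names are not taken from any tool): the first is the usual control-flow version, the second is a single-assignment rewriting in which both branches are computed and a merge node selects the result. The rewriting is only valid because neither branch has side effects.

```c
#include <stdint.h>

/* Control-flow version: only one of the two branches is executed. */
static int32_t absdiff_cfg(int16_t a, int16_t b)
{
    int32_t d;
    if (a > b)
        d = a - b;
    else
        d = b - a;
    return d;
}

/* Single-assignment version: every name is written exactly once, both
 * branches are evaluated, and the conditional becomes a merge node. */
static int32_t absdiff_sa(int16_t a, int16_t b)
{
    const int32_t d_then = a - b;                  /* "then" branch, speculated */
    const int32_t d_else = b - a;                  /* "else" branch, speculated */
    const int32_t cond   = a > b;                  /* test condition            */
    const int32_t d      = cond ? d_then : d_else; /* merge node (multiplexer)  */
    return d;
}
```

Since each name has a single definition, the data dependencies can be read directly from the text: `d` depends on `cond`, `d_then` and `d_else`, which in turn depend only on the inputs and can be evaluated in parallel.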
Generally speaking, the translation of standard C and C dialects into
some kind of hardware description is a complex problem addressed by many
research programs [36, 37, 38, 5, 39], especially if memory access (i.e.
pointers) is included [43]. In the case of reconfigurable processors,
the processor core can handle (and usually handles) memory access,
possibly with the help of DMAs to speed it up, thus simplifying the
synthesis requirements. Summarizing, single-assignment forms are a
restriction of the C semantics, they are useful to accurately control
the DFG performance (parallelism and pipeline structure) and they can be
extracted from high-level C compilers. The application developer can
thus start the implementation on a reconfigurable processor from the
application description written in C, selecting the critical kernels
suitable for mapping onto the reconfigurable hardware. Depending on the
efficiency required, the application developer can choose to use an
automatic translation mechanism or to hand-code the kernel in a
low-level description language, thus introducing a third trade-off point
represented by the time spent on development.
3.5 Overview of programming tools for reconfigurable
processors
Programming frameworks for reconfigurable architectures are highly
dependent on the structure, the hardware granularity and the language
proposed as entry point. Although far from being an ideal hardware
description language, C has been selected as an appealing entry point
for the configuration of reconfigurable processors since the first
architectures (e.g. PRISM [16]). Milestones of research in the field of
reconfigurable processors, like the Garp processor [19], and commercial
state-of-the-art reconfigurable processors [23, 24, 25, 26] proposed
C-based design environments, envisioning the possibility of offering the
end user automatic partitioning, and hence of co-compiling the same
source code for both the processor core and the reconfigurable logic.
The Nimble compiler [80], targeting the Garp processor, is one of the
first tools that tried to automatically move critical kernels from the
processor core to the reconfigurable hardware accelerator, selecting
them from the basic blocks found in the innermost loops. PipeRench
[46, 47], one of the most popular coarse-grained reconfigurable
data-paths, is configured using a single-assignment language with C
operators (called DIL, Dataflow Intermediate Language), while RaPiD [48]
features a proprietary C-based language. RaPiD-C programs consist of
nested loops describing pipelines, and language extensions allow the
programmer to explicitly handle synchronization mechanisms and to
specify parallelism and data movement (which is stream-based). Another
example of a popular coarse-grained architecture is the RAW architecture
[49] developed at MIT: in this case a SUIF-based compiler partitions the
application over a mesh of RISC processors, instead of performing a
technology mapping. Another programming approach based on the C language
was provided for the NAPA architecture [50], which includes a
C-programmed reconfigurable device as an I/O coprocessor.
Many reconfigurable devices are programmable at assembly level or by
graphical tools (for manual mapping), in a way that seems to trade part
of the programmability offered by high-level languages for programming
efficiency (MOPS/mm²), as reported in Hartenstein's retrospective [30].
In general, the underlying architecture has a strong impact on the
technology mapping, on the placement and, to a lesser extent, on the
routing algorithm. Direct mapping is probably the most used method for
coarse-grained architectures, where operators are mapped to the
programmable elements that make up the device without a real logic
synthesis step. PACT XPP [53] and MorphoSys [55] are effective examples
of this approach, although they also attempt to virtualize the
underlying layer using C-based high-level compiler flows [60, 13]. For
the full exploitation of the architecture capabilities, PACT XPP is
programmed through the Native Mapping Language (NML), a structural
event-based netlist description language. The following code is an
example of NML code.
MODULE LDPC_VNODE_WC_2_SINGLENODE(DIN Q0,A0,B0, DOUT OUT)
{
  OBJ q0_plus_b : ADD @ FREG 0,0 {
    A = Q0
    B = B0
  }
  OBJ q0_plus_a : ADD @ 0,0 {
    A = Q0
    B = A0
  }
  OBJ clip_a : CLIP (8) @ 0,0 {
    A = q0_plus_b.X
  }
  OBJ clip_b : CLIP (8) @ 1,0 {
    A = q0_plus_a.X
  }
  OBJ pack : PACK @ FREG 1,0 {
    A = clip_a.X
    B = clip_b.X
  }
  OUT = pack.X
}
In the example, sum and clipping instructions are manually placed in
cells 0,0 and 1,0, whereas FREG registers are used to transfer data
between two successive rows. In fact, PACT XPP does not feature a
vertical routing channel, and vertical data transfers are performed only
through registers, in a pipelined form. Specific tools are proposed for
the place-&-route phase, as reported in [54], where the strongly
pipelined structure requires the interconnections across rows to be
pipelined too, by means of dedicated registers.
For the MorphoSys architecture, a SUIF-based compiler is provided for
the host processor, while the partitioning between hardware and software
is performed manually by the programmer. MorphoASM, a structural
assembly-like language, is used to configure each programmable element
with the required functionality. Usually, the programmer also needs to
take into account the interconnect capabilities of each programmable
element, in order to distribute the processing elements in the device
pipeline in a way that complies with the timing requirements. An example
of MorphoASM code is reported below (the stars stand in for the row and
column, which the programmer can specify manually).
CELL{*,*} R13 = MULSIH{FB{InputTwiddleCos, 0, OMEGA_BR2, COL_BUS,
WORD}, R5, R14} << 1;
CELL{*,*} R12 = MULSIH{FB{InputTwiddleCos, 0, OMEGA_BR2, COL_BUS,
WORD}, R1, R14} << 1;
CELL{*,*} R11 = MULSIH{FB{InputTwiddleSin, 0, OMEGA_BR2, COL_BUS,
WORD}, R1, R14} << 1;
CELL{*,*} R10 = MULSIH{FB{InputTwiddleSin, 0, OMEGA_BR2, COL_BUS,
WORD}, R5, R14} << 1;
CELL{*,*} NOP{}; CELL{*,*} R13 = ADD{R13, R11} >> 1; // scale down;
CELL{*,*} R12 = SUB{R12, R10} >> 1; // scale down;
CELL{*,*} R11 = MULL{R15,R4} >> 1;// scale down for adjustment;
CELL{*,*} R10 = MULL{R15,R0} >> 1;// scale down for adjustment;
CELL{*,*} NOP{}; CELL{*,*} R8 = ADD{R11, R13};
CELL{*,*} R9 = ADD{R10, R12};
CELL{*,*} R4 = MULSIL{R15,R4,R8}; // scale down
CELL{*,*} R0 = MULSIL{R15,R0,R9}; // scale down
Mapping on reconfigurable architectures, especially coarse-grained
architectures which are very different from the island style of FPGAs,
requires specific management constructs. As an example, for the Garp
processor, the GAMA tool [79] maps a DFG using a dedicated tree covering
algorithm that splits the original graph into sub-trees with
single-fanout nodes, introducing a significant overhead in resource
utilization. Furthermore, only acyclic graphs are supported. Modules
detected by the tree covering are placed in Garp array rows (only one
module per row) using bit-slice methods proposed for data-path synthesis
in regular architectures. The DRESC compiler [13] is an example of a
high-level compiler targeting a MorphoSys-like coarse-grained
architecture. It focuses on the exploitation of loop-level parallelism
and performs the place-&-route for the reconfigurable hardware using an
extended iterative modulo scheduling algorithm. A simulated annealing
strategy is used to decide whether a legal configuration can be accepted
or not, helping to escape from local minima. In some cases, where the
reconfigurable processor integrates an embedded FPGA or is implemented
on a stand-alone FPGA (as in MOLEN [12]), VHDL and Verilog are used for
the hardware customization: the optimization process is the one typical
of hardware design on FPGAs, where behavioral HDL descriptions are
replaced by FPGA-specific macros when the expected performance is not
achieved directly from synthesis. On the programmability side, since HDL
is the entry point, all the C-based languages generating VHDL can be
applied for a more software-oriented approach, but the optimization
process is performed under the hardware design paradigm, for example
analyzing timing constraints and critical paths.
3.6 Griffy project overview
This section introduces the Griffy project, a programming environment
for reconfigurable processors based on the C language for the processor
core and a simplified C syntax for the reconfigurable device.
Implementation details of the most significant steps will be described
in the following chapters, together with the results achieved in
developing applications. The approach was originally applied to the
XiRisc reconfigurable processor [66], developed at the Arces/STM joint
lab of the University of Bologna, and it is currently applied also to
the DREAM adaptive DSP [75] and to XiSystem [67], which integrates
XiRisc and a standard eFPGA in the same chip. All these systems are
based on processor cores driving one or more reconfigurable devices
(connected as functional units or co-processors), thus providing system
reconfigurability at the level of the assembly instruction. As an
example, in the case of XiRisc, the reconfigurable device (the PiCoGA)
is fitted into the processor pipeline as an additional functional unit,
triggered by a dedicated assembly instruction called pgaop, whereas in
the DREAM architecture the reconfigurable co-processing subsystem based
on PiCoGA-III is handled, for simplicity, through standard memory
accesses (for example, computation is triggered by a store operation).
The functionality associated with an extended instruction (in the
following commonly called pgaop) is modelled as a DFG, implemented in a
pipelined form in order to enhance computation performance.
[Figure 3.4: block diagram; recoverable labels: C Code, Profiling, Kernel extraction, Optimized C Code, picogaop, PiCoGA Mapping, Function Emulation Code, Configuration Bits, Executable, Software Simulation (Memory, Registers)]
Figure 3.4: Griffy Algorithm Development Environment
Figure 3.4 shows the overall programming environment offered to the
application developer. Through profiling analysis the programmer can
identify the critical kernels suitable for mapping on reconfigurable
hardware, and evaluate the performance of the new implementation in
order to find the best partitioning between reconfigurable hardware and
software. The compiler tool-chain is based on a retargeted version of
GNU GCC, and instruction set extensions are handled through assembler
inlining. No specific scheduling support is provided for the extended
instruction set, although recent versions of GCC (3.2 and later) count
the number of semicolons (“;”) inside the assembler inlining template in
order to roughly estimate the number of cycles required (the idea is to
consider 1 cycle per instruction, thus 1 cycle per semicolon). Software
simulation is provided both for purely functional debugging with GNU GDB
and for cycle-accurate instruction-set simulation with LISA and SystemC
support. Unlike the compiler, the simulation environment supports
instruction metamorphosis through a dynamically linked shared library (a
.so library under Linux). The emulation library of the instruction set
extension is automatically generated by DFG compilation, and has been
successfully plugged into both the GDB and LISA/SystemC environments
[15].
The functionality of each instruction set extension is described in a
single-assignment, manually-dismantled C syntax called Griffy-C [71].
Griffy-C is a structural description in which basic C operators, such as
addition, subtraction, bitwise logical operations and comparisons, are
mapped directly onto hardware resources, without logic synthesis.
Figure 3.5(a) shows an example of Griffy-C code implementing a simple sum
of absolute differences (SAD), as required in video encoding applications,
and Figure 3.5(b) shows the corresponding (non-optimized) Data Flow Graph.
Griffy-C does not support control flow statements, with the sole exception
of the conditional assignment (“? :”), which implements a multiplexer in
hardware terms, or a merge node under the dataflow paradigm. A detailed
description of the Griffy-C syntax is provided in Appendix A.

    #pragma pga sad4 1 2 out p1 p2
    {
      short int sub0a, sub1a, sub0b, sub1b;
      unsigned short int sub0, sub1;
      unsigned char cond0, cond1, p10, p11, p20, p21;
    #pragma attrib sub0a, sub1a, sub0b, sub1b SIZE=10
    #pragma attrib cond0, cond1 SIZE=1
      p10=p1; p11=p1>>8; p20=p2; p21=p2>>8;
      sub0a=p10-p20; sub0b=p20-p10;
      cond0=sub0a<0;
      sub0=cond0 ? sub0b : sub0a;
      sub1a=p11-p21; sub1b=p21-p11;
      cond1=sub1a<0;
      sub1=cond1 ? sub1b : sub1a;
      out=sub0+sub1;
    }
    #pragma end

(a) C-level description
(b) DFG - Graphical view [graph drawing not recoverable from the source]
Figure 3.5: DFG Description
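As a cross-check of the behaviour described by the Griffy-C code of
Figure 3.5(a), the same computation can be sketched as a plain C function.
This is only an illustrative software model and not part of the Griffy
tools: the function name sad4_model, the <stdint.h> types and the
assumption that p1 and p2 each pack two 8-bit pixels in bits 0-7 and 8-15
are choices made here for the example.

```c
#include <stdint.h>

/* Illustrative software model of the sad4 operation (hypothetical name).
 * Assumption: p1 and p2 each carry two 8-bit pixels in bits 0-7 and 8-15. */
uint16_t sad4_model(uint32_t p1, uint32_t p2)
{
    /* Unpack the two pixels held in each operand. */
    uint8_t p10 = (uint8_t)p1;
    uint8_t p11 = (uint8_t)(p1 >> 8);
    uint8_t p20 = (uint8_t)p2;
    uint8_t p21 = (uint8_t)(p2 >> 8);

    /* Both differences are computed; the sign of one selects the result,
     * mirroring the conditional assignment used in Griffy-C. */
    int16_t sub0a = (int16_t)(p10 - p20);
    int16_t sub0b = (int16_t)(p20 - p10);
    int cond0 = sub0a < 0;
    uint16_t sub0 = (uint16_t)(cond0 ? sub0b : sub0a);

    int16_t sub1a = (int16_t)(p11 - p21);
    int16_t sub1b = (int16_t)(p21 - p11);
    int cond1 = sub1a < 0;
    uint16_t sub1 = (uint16_t)(cond1 ? sub1b : sub1a);

    return (uint16_t)(sub0 + sub1);
}
```

Like the Griffy-C original, the model uses one operator per statement and
assigns each variable once, so every line corresponds to one DFG node.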
The C-oriented description implies that some operations with constant
operands can be resolved by constant folding and by collapsing them into
the following nodes. Operators of this kind do not need an explicit
instantiation of processing elements, and this optimization can be
regarded as a very basic synthesis step. An example is the use of routing
resources to implement shifts by a constant amount in a fine-grain routing
architecture. Figure 3.6(a) shows the collapsing of the shifts used in the
previous SAD example to unpack the input variables, and Figure 3.6(b)
shows the resulting optimized pipelined DFG. In the figure, nodes are
aligned per pipeline stage and dotted nodes represent the collapsed
operators.
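The reason why a shift by a constant amount needs no processing element
can be illustrated with a small C sketch: each output bit is simply wired
to a fixed input bit. The function name shift_by_routing and the 16-bit
width are assumptions made for this example only.

```c
#include <stdint.h>

/* A constant right shift expressed as pure rewiring (hypothetical helper):
 * output bit i is connected to input bit i + k, and the upper k output
 * bits are tied to 0. No arithmetic is performed on the data. */
uint16_t shift_by_routing(uint16_t in, unsigned k)
{
    uint16_t out = 0;
    for (unsigned i = 0; i + k < 16; i++) {
        uint16_t bit = (uint16_t)((in >> (i + k)) & 1u); /* read wire i+k */
        out |= (uint16_t)(bit << i);                     /* drive wire i  */
    }
    return out;
}
```

Since k is known at compile time, the loop describes a fixed set of
connections, which is what a programmable routing fabric provides.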
The single-assignment syntax used in Griffy-C allows the user to control
the pipeline structure accurately and, at the same time, can be generated
automatically by a high-level compiler tool-chain, as proposed in [88].
Specific extensions for defining the size of variables at bit level are
provided through #pragma directives, in order to reduce area occupation.

[Figure 3.6 panels: (a) Synthesis step, (b) Pipeline building; graph
drawings not recoverable from the source]
Figure 3.6: Example of optimization of routing-only operators
The mapping process can be divided into the following main steps:
• Instruction-Level Parallelism extraction. Starting from the data
  dependencies of the DFG, an optimized scheduling algorithm builds the
  pipeline structure. Griffy-C code is analyzed and scheduled in pipeline
  stages by applying an earliest firing rule, in which instructions are
  executed as soon as possible. The detection of routing-only instructions
  makes it possible to build optimized pipeline stages, although the
  presence of an internal state (e.g. described by a static variable in
  the C syntax) requires special management.
• Physical Mapping. The arithmetic and logic operations that require
  computational resources on the array are generated with a proper
  configuration. The result is a netlist, annotated with configuration
  bits, whose elements are hierarchically organized by pipeline stage and
  by macro-element (the set of basic computational blocks implementing a
  Griffy-C operation).
• Placement, routing and pipeline synchronization. In this phase the
  netlist is arranged on hardware-specific resources, and the
  synchronization mechanisms required for the pipeline evolution are
  programmed.
• Bit-stream generation. This is the last step in the configuration
  process, and provides the set of bits necessary for hardware
  configuration in the form of a C vector that can be included in all
  standard processor tool-chains.
Validation, or debugging, is a further key requirement of an algorithm
development environment. In reconfigurable computing handled by a
processor core, the overall simulation can be managed by a standard
software debugger, such as the one provided in the GNU environment by GDB
(and its graphical interface DDD), or by the cycle-accurate tools provided
for LISA/System-C environments. In both cases, in the Griffy-C approach
the validation of the reconfigurable part is handled by a separate viewer
that shows the Griffy-C code written by the user, annotated with
intermediate results. Breakpoints and control flow management are in
general handled by the processor debugger, and the application developer
can only inspect the status of the reconfigurable unit. As an example,
Figure 3.7 provides a screen-shot of the GDB-based debugging environment
augmented with the Griffy code viewer.
Figure 3.7: Griffy-C Debugging and Validation Environment

Application development under the Griffy-C approach is a process in which
the programmer can move the application from software to hardware in a
sort of continuous space. Starting from the original C code, the user
manually rewrites the code in Griffy-C, usually working with C-based
operators, and then starts the performance analysis. The partitioning
between C code on the processor and Griffy code on the reconfigurable
device is an iterative refinement process in which experience and
knowledge play a very important role. Unlike methodologies borrowed from
FPGA design, however, this approach is mainly software-oriented: the user
can change the partition and optimize the kernel mapped on the
reconfigurable hardware in the same way that DSP programmers use assembly
to speed up their applications. Optimization of Griffy-C code can be
performed at two main levels: pipeline restyling and intrinsic
optimization. In the first case, the pipeline structure is modified by
acting on data dependencies in order to retime the graph or to adjust the
write-back points in the pipeline, for example using software pipelining
methods [59, 60]. In the second case (intrinsic optimization), the
programmer can replace part of the code with optimized operations, such as
the direct instantiation of a look-up table. To software programmers this
resembles assembly-level optimization, in which high-level code is
replaced by built-in functions, linear assembler or assembler, since the
syntax remains strongly sequential and imperative, without any direct
exposition of parallelism. The tools alone are responsible for that.

Chapter 4
Mapping DFG on reconfigurable devices
This chapter describes the mapping of Data Flow Graphs (DFGs) described
in Griffy-C on a reconfigurable device. DFGs are implemented in pipelined
form, in order to improve the final performance. Instruction scheduling is
required to transform a software DFG into a pipelined DFG, although some
optimization steps can be applied only under certain hypotheses on the
underlying architecture. After a general part, the chapter describes the
target-specific back-end flows for the mapping on PiCoGA (in particular,
for the PiCoGA-III release) and on a commercially available eFPGA,
through the generation of a VHDL description.
4.1 ILP exploitation through pipelined DFG and
Petri Nets
Griffy-C code, as described in Appendix A, features a single-assignment,
manually-dismantled syntax in which each operation is described by:
[statement template not recoverable from the source: a destination
variable is assigned the result of one operator applied to its source
operands]
Single-assignment form means that each variable can be assigned only once,
while manual dismantling underlines the fact that each statement is
defined by a single operator. The Griffy-C syntax borrows most of its
operators from the C syntax (in fact, Griffy-C is a simplified C syntax),
but it also includes a set of built-in functions (or hard-macros) used to
instantiate optimized, and commonly target-specific, functionalities
(similarly to the built-in functions or intrinsics of DSPs). As an
example, the capability to specify LUTs directly on the PiCoGA is offered
by means of a dedicated hard-macro. The semantics of Griffy-C is strongly
sequential, as in ANSI C, and no parallel statements are defined. This
means that parallelism is extracted from the code by the tools.
Given a set of instructions $I$, a sequential semantic rule implies
executing the instructions in the same order in which they are defined.
This firing rule defines the data dependencies among the different
instructions. Under the Von Neumann paradigm, which is the underlying
paradigm of software programming, data are stored in memory and each
memory access defines a sort of synchronization point. Since variables are
implemented as memory locations, accesses to these locations can be used
to define the concept of data dependency. There are three types of data
dependencies, which are also the three types of data hazards:
• Read after Write (RAW or "True"): $I_1$ writes a value used later by
  $I_2$. $I_1$ must come first, or $I_2$ will read the old value instead
  of the new one.
• Write after Read (WAR or "Anti"): $I_1$ reads a location that is later
  overwritten by $I_2$. $I_1$ must come first, or it will read the new
  value instead of the old one.
• Write after Write (WAW or "Output"): two instructions both write the
  same location. They must occur in their original order.
Due to the single-assignment form of Griffy-C, the third kind of
dependency cannot occur, while RAW and WAR dependencies can. If only RAW
dependencies occur, the resulting graph is a directed acyclic graph.
Variables that are read before being written implement an internal state,
and are declared in C using the static attribute. Exploiting the available
instruction-level parallelism requires relaxing the sequential semantic
rule while preserving the behaviour of the block, and thus the data
dependencies.
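The three dependency types can be made concrete with a short C sketch that
classifies the dependencies between two instructions given in program
order. The Instr record, the integer variable identifiers and the function
names are hypothetical and introduced only for this illustration.

```c
#include <stdbool.h>

#define MAX_SRC 2
#define NO_VAR  (-1)   /* marks an unused source slot */

/* Hypothetical instruction record: one destination and up to two sources,
 * each identified by a small integer variable id. */
typedef struct {
    int dest;
    int src[MAX_SRC];
} Instr;

enum { DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* True if the instruction reads variable `var`. */
static bool reads(const Instr *ins, int var)
{
    for (int k = 0; k < MAX_SRC; k++)
        if (ins->src[k] == var)
            return true;
    return false;
}

/* Dependencies of `second` on `first`, where `first` precedes `second`
 * in program order. The result is a bitwise OR of the DEP_* flags. */
int classify(const Instr *first, const Instr *second)
{
    int deps = 0;
    if (reads(second, first->dest))
        deps |= DEP_RAW;              /* second reads what first writes */
    if (reads(first, second->dest))
        deps |= DEP_WAR;              /* second overwrites what first reads */
    if (first->dest == second->dest)
        deps |= DEP_WAW;              /* both write the same variable */
    return deps;
}
```

In single-assignment code the DEP_WAW flag is never raised, which is the
property exploited in the paragraph above.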
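Let us consider the set of instructions $I$ defined as:
[definition not recoverable from the source: an indexed set of
single-operator instructions]
where the index of each instruction defines the order in which the
instructions are declared, and thus executed. Moreover, let us suppose
that at a given time $t$ a given set of variables is available. Then the
instructions that can be executed at time $t+1$ are those for which both
of the following conditions are true:
[conditions not recoverable from the source: a check on the availability
of the source variables of the instruction, and a check that the old value
of its destination variable is no longer needed by preceding instructions]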
While the first check preserves the RAW dependencies, the second one is
required to preserve the WAR dependencies, and it is always verified for
directed acyclic graphs. By replacing the fully sequential semantic rule
with this set of relaxed rules, a set of instructions can be executed
concurrently, based only on the verification of the data dependencies. The
Data Flow Graph (DFG) representation in Fig. 4.1 shows the result of this
instruction reorganization (or scheduling), in which instructions
(represented as nodes) are aligned by execution time. Edges represent the
data dependencies among nodes: forward edges represent RAW dependencies,
while backward edges represent WAR dependencies. This kind of scheduling
is also known as the As-Soon-As-Possible (ASAP) scheduling policy, since
instructions are executed in the first safe temporal slot.
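The relaxed rule can be sketched in C as a step-by-step ASAP assignment of
time slots. This is a minimal sketch under stated assumptions, not the
Griffy scheduler: the Instr record, the bitmask of initially available
variables (primary inputs and old values of state variables) and the
function names are hypothetical, variable ids must be below 32, and the
WAR check uses the strict form in which all earlier readers of the old
value have already been executed.

```c
#include <stdbool.h>

#define MAX_INSTR 16
#define MAX_SRC   2
#define NO_VAR    (-1)

typedef struct {
    int dest;
    int src[MAX_SRC];
} Instr;

/* RAW check: every source is either initially available or has been
 * produced by an earlier instruction that was already executed. */
static bool raw_ok(const Instr p[], int j, const bool done[], unsigned inputs)
{
    for (int k = 0; k < MAX_SRC; k++) {
        int s = p[j].src[k];
        bool written_earlier = false;
        if (s == NO_VAR)
            continue;
        for (int i = 0; i < j; i++) {
            if (p[i].dest == s) {
                written_earlier = true;
                if (!done[i])
                    return false;
            }
        }
        if (!written_earlier && !(inputs & (1u << s)))
            return false;
    }
    return true;
}

/* WAR check: every earlier instruction reading the old value of the
 * destination of instruction j has already been executed. */
static bool war_ok(const Instr p[], int j, const bool done[])
{
    for (int i = 0; i < j; i++)
        for (int k = 0; k < MAX_SRC; k++)
            if (p[i].src[k] == p[j].dest && !done[i])
                return false;
    return true;
}

/* Fills slot[j] with the time step of instruction j and returns the number
 * of steps used, or -1 if n is too large or no instruction can fire. */
int asap_slots(const Instr p[], int n, unsigned inputs, int slot[])
{
    bool done[MAX_INSTR] = { false };
    int remaining = n, t = 0;
    if (n > MAX_INSTR)
        return -1;
    while (remaining > 0) {
        bool fire[MAX_INSTR] = { false };
        int fired = 0;
        for (int j = 0; j < n; j++)
            if (!done[j] && raw_ok(p, j, done, inputs) && war_ok(p, j, done)) {
                fire[j] = true;
                fired++;
            }
        if (fired == 0)
            return -1;
        for (int j = 0; j < n; j++)      /* commit the whole step at once */
            if (fire[j]) {
                done[j] = true;
                slot[j] = t;
            }
        remaining -= fired;
        t++;
    }
    return t;
}
```

For a purely acyclic program the WAR check never blocks, as noted in the
text; it only delays the instruction that overwrites a state variable.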
Under the Von Neumann paradigm, a central memory stores the variables and
a processing unit performs the computation. Storage and processing are
distinct units, and the communication between them is often a significant
bottleneck. In a hardware implementation, on the contrary, storage and
processing are joined in the same unit. Each processing element is
directly connected to the processing elements that provide its input data,
and therefore requires storage elements to provide temporal
disambiguation and to avoid Write-after-Read hazards.

[Figure 4.1 panels: SEQUENTIAL EXECUTION PARADIGM and DATA-DEPENDENCY
BASED RELAXED PARADIGM, both over the instructions I1 to I6; graph
drawings not recoverable from the source]
Figure 4.1: Computation paradigm relaxation preserving the data dependencies
In order to improve the computational efficiency, each DFG computation
can start before the completion of the previous ones, thus overlapping
(pipelining) the execution of several DFG instances. To achieve this, data
dependencies must be checked in order to preserve the behaviour, verifying
that a given variable can be updated. In general terms, the computation of
each node can start when its inputs are available and its outputs can be
updated. This second condition is somewhat more complex than pure WAR
safety, and includes the computation time required to read and process
each datum. For synchronous digital circuits, time is measured in clock
cycles, although the clock period depends on the combinatorial logic used.
In this context, the clock period is assumed to be defined by the most
complex computational node, so that each computational node (or Griffy
operation) requires at most one clock cycle. Under this assumption, at a
given time $t$ the execution of each node (or instruction) can be
triggered when:
• inputs are available;
• outputs can be updated, since
  – all the preceding nodes requiring the old value of the output have
    already been triggered, hence the old value is no longer required
    (WAR check);
  – given the DFG iteration on-going at time $t$, a node relative to the
    following iteration can be triggered if
    · RAW and WAR dependencies for that iteration are verified (as in the
      previous items);
    · all the successive nodes relative to the previous DFG iterations
      have already read the output corresponding to the on-going
      iteration (temporally dependent RAWs and WARs).
[The formal expression attached to each condition is not recoverable from
the source.]
This process can be modelled better, and more intuitively, by Petri Nets
[65]. A Petri Net is a three-tuple (P, T, A) where:
• P is a non-empty set of places, denoted by $\{p_1, p_2, \dots, p_n\}$;
• T is a non-empty set of transitions, denoted by
  $\{t_1, t_2, \dots, t_m\}$;
• A is a non-empty set of directed arcs;
such that $P \neq \emptyset$, $T \neq \emptyset$ and
$P \cap T = \emptyset$, $A \subseteq (P \times T) \cup (T \times P)$.
Pictorially, P, T and A are represented by circles, bars and directed
arcs, respectively. Each transition is enabled when all the input places
connected to the transition hold at least one token. In our case, we
consider a subset of Petri Nets in which at most one token can reside in
each place and the state is updated at discrete steps; hence each
transition is fired as soon as all its tokens are available (earliest
firing rule), at discrete time steps (also known as a timed Petri Net
[64]).

[Figure 4.2 shows a DFG with nodes I1 to I6 and the corresponding Petri
Net with transitions T1 to T6; graph drawings not recoverable from the
source]
Figure 4.2: DFG and the corresponding Petri Net representation
Computational nodes are associated with transitions, and firing a
transition means executing an operation. Starting from a Data Flow Graph,
each edge is replaced by two arcs, one forward and one backward (with
respect to the direction of the original DFG edge). Forward arcs signal
the availability of new input data, while backward arcs signal the request
for new data, under a producer-consumer mechanism. Fig. 4.2 shows an
example of a DFG and the corresponding Petri Net representation. Tokens,
depicted as black circles, identify the initial state, in which all the
operations require data to compute and (let us suppose) the primary inputs
are available. Three transitions hold a token on each of their input arcs
and can therefore be fired at time $t$. At time $t+1$ only one transition
can be triggered, while at $t+2$ three transitions can. At time $t+3$ two
transitions can be triggered, and so on, as described in Fig. 4.3 (the
transition names given in the text are not recoverable from the source).

[Figure 4.3 panels: T=t+1, T=t+2, T=t+3, T=t+4, each showing the Petri Net
with transitions T1 to T6; graph drawings not recoverable from the source]
Figure 4.3: Petri Net transition firing
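The producer-consumer mechanism can be sketched in C for a Petri Net
obtained from a DFG in the way described above: each DFG edge owns a
forward place (data available) and a backward place (new data requested),
each holding at most one token. The Net structure, the function names and
the three-node chain used in the usage note are assumptions made for this
example; they do not reproduce the net of Fig. 4.2. Primary inputs are
assumed to be always available and final outputs always consumable.

```c
#include <stdbool.h>

#define MAX_EDGES 8

typedef struct { int from, to; } Edge;   /* DFG edge: producer -> consumer */

typedef struct {
    int  n_trans, n_edges;
    Edge edge[MAX_EDGES];
    bool fwd[MAX_EDGES];   /* token: data is available on the edge      */
    bool bwd[MAX_EDGES];   /* token: the consumer is asking for new data */
} Net;

/* Initial state: every operation requires data, no data produced yet. */
void net_init(Net *net, int n_trans, const Edge *edges, int n_edges)
{
    net->n_trans = n_trans;
    net->n_edges = n_edges > MAX_EDGES ? MAX_EDGES : n_edges;
    for (int e = 0; e < net->n_edges; e++) {
        net->edge[e] = edges[e];
        net->fwd[e] = false;
        net->bwd[e] = true;
    }
}

/* One discrete time step under the earliest firing rule: every enabled
 * transition fires. Returns a bitmask of the fired transitions. */
unsigned net_step(Net *net)
{
    unsigned fired = 0;
    for (int t = 0; t < net->n_trans; t++) {
        bool enabled = true;
        for (int e = 0; e < net->n_edges; e++) {
            if (net->edge[e].to == t && !net->fwd[e])
                enabled = false;   /* an input datum is missing */
            if (net->edge[e].from == t && !net->bwd[e])
                enabled = false;   /* a consumer has not read the old value */
        }
        if (enabled)
            fired |= 1u << t;
    }
    for (int e = 0; e < net->n_edges; e++) {   /* move tokens all at once */
        if (fired & (1u << net->edge[e].to)) {
            net->fwd[e] = false;
            net->bwd[e] = true;
        }
        if (fired & (1u << net->edge[e].from)) {
            net->bwd[e] = false;
            net->fwd[e] = true;
        }
    }
    return fired;
}
```

On a chain of three transitions (edges 0 to 1 and 1 to 2), successive
calls to net_step fire transition 0, then 1, then 0 and 2 together, then
1 again: each stage accepts new data every second step, because the
request token must travel back before the producer can fire again.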
Pipelined execution of DFGs, under a Petri Net paradigm, implies that
intermediate results are stored in registers, since temporally different
instances of the same DFG are overlapped. Also in this case, two
considerations can drive further optimizations:
• in synchronous digital design, registers sample their input at the
  rising (or falling) edge of the clock, and their outputs do not change
  during the clock period. This means that the backward arcs can provide
  their tokens at the beginning of the clock cycle in which the
  transitions are fired, which allows more compact pipelines to be
  implemented;
• if the target architecture features programmable routing, some
  operations can be implemented by routing resources only, as in the case
  of shifts by a constant amount. Furthermore, under some hypotheses
  (discussed in the following), some operations can be collapsed into the
  successive operations. This is the case of bit-wise logic operations
  involving a constant, which can be reduced to selective connections
  (some bits connected, some bits tied to 1 or 0), or of the bit-wise
  not. If the architecture provides LUTs or input inversion logic, this
  kind of operation can also be considered routing-only, since its
  implementation does not require specific computational resources. This
  can be used to optimize the instruction scheduling in order to implement
  more compact pipelines.
The last two items improve the timed Petri Net evolution in order to
achieve better pipelines, and hence better throughput. This is the
computational scheme proposed in the Griffy project, which will be
described in the following and which can be summarized as follows:
• a Pipelined Data Flow Graph is built by exploiting the instruction-level
  parallelism of the sequential Griffy-C code;
• for a given DFG, RAW and WAR dependencies are preserved at compilation
  time by the Griffy scheduler, while hazards across different iterations
  of the same DFG are handled by dedicated hardware at execution time. To
  this end:
  – without loss of generality, each pipeline stage (composed of a set of
    concurrent computational nodes) is considered as a Petri Net
    transition, under an ASAP scheduling policy;
  – pipeline management is handled by a programmable control unit,
    generated by the Griffy tools. Each element of the control unit
    represents a programmable Petri Net transition which enables the
    execution of the respective (set of) computational node(s) depending
    on the availability of sources and resources.
It should be noted that spatial computation is preserved by the compiler
during the pipeline organization. Temporally dependent hazards, on the
contrary, are checked at execution time, since the pipelined and
overlapped execution of successive DFG iterations depends on conditions
that cannot be predicted at compile time, such as the availability of
inputs, external stall conditions and, of course, the frequency of DFG
triggering.
4.2 Instruction scheduling: optimized DFG for
pipelined computation
Instruction scheduling is the phase in which each instruction of the
sequential Griffy-C code is assigned to a specific pipeline stage, thus
translating a sequential description into a concurrent pipelined form.
This process borrows the instruction firing mechanism of Petri Nets, since
instructions can be executed in a specific pipeline stage under the same
hypotheses that enable the firing of a transition. This section describes
the selection algorithm used in the Griffy flow, starting from a
simplified version for directed acyclic graphs (DAGs) and then extending
the algorithm to support functionalities holding an internal state. In the
first case only RAW dependencies are considered, while in the second one
WAR dependencies are also taken into account.
After the scheduling phase, the flow becomes mainly target-specific and
includes, for example, the place & route phases. The next section will
provide an overview of two target-specific back-ends, the first one for
the PiCoGA-III (featuring a dedicated control unit) and the second one for
a commercially available eFPGA, programmed by generating a standard VHDL
description.
4.2.1 Scheduling of directed acyclic graphs
A directed acyclic graph is a graph with one-way edges containing no
cycles. This means that if there is a route from node A to node B, then
there is no way back. In “software” terms, this means that each variable
is written before being read, so that only RAW data dependencies can
occur and no static variables are involved.
Under the earliest firing rule, each instruction is executed as soon as
possible, that is, in the first pipeline stage in which its input data are
available. Routing-only operations can be considered as operations with
zero execution time, so they make their data available in the same stage
in which the instruction is executed. The other instructions, on the
contrary, make their variables available only in the next pipeline stage.
In this case, the corresponding ASAP scheduler is very simple, and can be
represented by the pseudo-code in Fig. 4.4.
[Pseudo-code listing not recoverable from the source]
Figure 4.4: DAG scheduling pseudo-code (with routing only optimization)
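The ASAP rule described above can be sketched in C as a single pass over
the instructions. This is a minimal sketch written from the prose
description, not a transcription of the pseudo-code of Fig. 4.4: the Op
record, the integer variable ids (below MAX_VARS) and the function name
schedule_dag are hypothetical, primary inputs are assumed to be available
in stage 0, and program order is assumed to be a valid topological order,
as guaranteed when every variable is written before being read.

```c
#include <stdbool.h>

#define MAX_SRC  3
#define MAX_VARS 32
#define NO_VAR   (-1)

/* Hypothetical operation record for a single-assignment DAG. */
typedef struct {
    int  dest;
    int  src[MAX_SRC];
    bool routing_only;   /* zero-time operation, e.g. a constant shift */
} Op;

/* Assigns a pipeline stage to each operation under the earliest firing
 * rule and returns the number of pipeline stages used. */
int schedule_dag(const Op ops[], int n, int stage[])
{
    int avail[MAX_VARS] = { 0 };  /* stage in which each variable is readable */
    int depth = 0;

    for (int j = 0; j < n; j++) {
        int s = 0;
        for (int k = 0; k < MAX_SRC; k++) {
            int v = ops[j].src[k];
            if (v != NO_VAR && avail[v] > s)
                s = avail[v];     /* wait for the latest input */
        }
        stage[j] = s;
        if (ops[j].routing_only) {
            avail[ops[j].dest] = s;       /* usable in the same stage */
        } else {
            avail[ops[j].dest] = s + 1;   /* usable from the next stage */
            if (s + 1 > depth)
                depth = s + 1;
        }
    }
    return depth;
}
```

Applied to the first half of the SAD example (two unpacking assignments
treated as routing-only, two subtractions, one comparison and one
conditional assignment), the sketch places the subtractions in stage 0,
the comparison in stage 1 and the conditional assignment in stage 2.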
It is important to observe that, while the scheduling algorithm preserves
the read-after-write data dependencies inside a specific Griffy operation,
in the case of pipelined (overlapped) execution of successive instances of
the same Griffy operation write-after-read hazards must also be checked.
In fact, each pipeline stage can accept a new computation only when its
outputs can be updated (a write-after-read check), under the computational
paradigm explained in the previous section. This second check is provided
by a Petri-Net-based hardware pipeline manager which, depending on the
data dependencies, allows or prevents the triggering of pipeline stages.
4.2.2 Scheduling of data flow graphs
Considering complete DFGs means taking into account DFGs featuring an
internal state. In particular, two additional effects have to be
considered. The first one is given by the presence of static variables
(used to implement a state), whose old value is available at the beginning
of the DFG computation, be it an initialization value or a real value
produced by a past computation, and which can be updated only when all
the WAR dependencies are preserved. The second effect is given by the
presence of routing-only operations, for which the optimization process
would remove the corresponding memory location. Let us consider the
following example:
static int status;
tmp1 = status << 1;
status = in + 5;
out = tmp1 + 1;
In this case, given the routing-only optimization of tmp1, the out
variable must be considered dependent on the old value of status, and the
scheduling algorithm must preserve this behaviour by adding a firing rule.
In this case WAR hazards are verified by:
[condition not recoverable from the source: the WAR check is extended to
the instructions that read the static variable through its aliases]
where the function alias [1] represents any possible direct or indirect
dependency on the static variable. A direct dependency represents the case
in which a routing-only instruction involves a static variable. Let us
define this case as a direct static alias, and let us define as an
indirect static alias (and consequently an indirect dependency on a static
variable) a routing-only instruction which involves a direct static alias
or another indirect static alias. Under this assumption, a static alias
(either direct or indirect) depends on one or more instructions holding a
state. The static variables must then be updated only when both direct and
indirect (by means of routing-only propagation) WAR dependencies are
preserved.

[1] The term alias is used since the operation can be considered as an
alias representation of the memory location of the original variable.

Let us consider now the following code:
static int status;
tmp1 = status << 1;
status = in + 5;
out = tmp1 + status;
In this case, out reads both the old and the new value of the static
variable status. This implies that tmp1 cannot be optimized, since only
distinct storage elements can preserve the original behaviour. Hence the
routing-only optimization depends on the overall DFG and on the adopted
scheduling policy, and not only on the specific instruction.
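The effect can be checked with a small C model of the second example. The
State structure and the two function names are hypothetical: the first
function keeps tmp1 in its own storage, as the sequential code requires,
while the second one treats tmp1 as pure wiring on the status register and
therefore observes the already updated value.

```c
/* Hypothetical container for the internal state of the operation. */
typedef struct { int status; } State;

/* Reference behaviour: tmp1 is a distinct storage element that captures
 * the OLD value of status before the update. */
int step_reference(State *s, int in)
{
    int tmp1 = s->status << 1;   /* reads the old status */
    s->status = in + 5;          /* updates the state    */
    return tmp1 + s->status;     /* old (shifted) + new  */
}

/* Incorrect optimization: the shift is collapsed into wiring attached to
 * the status register, so it sees the NEW value once status is updated. */
int step_aliased(State *s, int in)
{
    s->status = in + 5;
    return (s->status << 1) + s->status;
}
```

Starting from status = 3 with in = 1, the reference version returns
6 + 6 = 12, whereas the aliased version returns 12 + 6 = 18, which is why
the shift must keep its own register in this case.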
The scheduling algorithm proposed in the Griffy project supports both the
management of static variables and the routing-only optimization.
Routing-only operations are detected from the features of the single
instruction, by checking the operation type and the sources involved. The
PIPEREG attribute (see Appendix A) can be used by the programmer to
prevent a routing-only instruction from being optimized (for example, in
order to build a delay line or to retime a graph). The scheduling
algorithm tries to execute all the operations that satisfy the earliest
firing rule, considering all the routing-only instructions as executed in
zero time, similarly to the scheduling algorithm proposed for DAGs. In
addition to that basic algorithm (Fig. 4.4), a check condition enables the
firing of instructions involving static variables and aliases of static
variables.
In the first case, the computation of an instruction whose destination is
a static variable (thus an instruction holding an internal state) is
enabled only if all the instructions which need to read the old value of
that variable have already been executed, or are executed in the current
pipeline stage. In the second case, a static alias could be triggered in
the first pipeline stage in which the non-static variables are available
(at the latest, in the first pipeline stage), but this could create longer
paths in the case of shift registers. As shown in Fig. 4.5, firing static
aliases under a pure ASAP policy could imply that a chain of registers is
split over more than one pipeline stage, hence increasing the critical
path (and the issue delay), since past values are available from the
beginning. For this reason, the Griffy scheduling algorithm tries to delay
the execution of static aliases [2] As Late As Possible (ALAP), which in
many cases yields a more efficient pipeline structure, as in the right
part of Fig. 4.5.

[Figure 4.5 panels: Pure ASAP scheduling (left), ALAP correction for
static variables (right); drawings not recoverable from the source]
Figure 4.5: Example of ALAP correction for static variables

The check conditions enabling the execution of an instruction are both
coherence checks and optimization checks. If the enabling condition is not
verified, the instruction is not fired and is removed from the set of
instructions suitable for execution in the current pipeline stage. As a
consequence of this cleaning, other instructions may be removed as well,
since the removed instruction is no longer fired and some dependencies may
no longer be verified.
The resulting scheduling algorithm is reported in a simpliﬁed form in
Fig. 4.6, and is composed of a main loop executed until instructions are
available. The loop body can be partitioned in three main sections:
￿ candidate selection, in which instructions having available sources are
selected to be executed in the current pipeline stage;
² Since the past value of a static variable can be seen as a static alias.
Figure 4.6: Simpliﬁed DFG scheduling algorithm
• candidate analysis, in which both RAW and WAR dependencies are
verified and the ALAP correction for static management is applied
(see the pseudo-code in Fig. 4.7);
• commit stage, in which the instructions selected for execution in the
current pipeline stage (after passing the previous check) are committed
and removed from the list of pending instructions. Furthermore, the
corresponding destination variables become available for the next
pipeline stage. If no instruction is selected for execution, the
scheduling algorithm activates the disambiguation mode, in which
conflicting routing-only instructions are de-optimized by adding a
PIPEREG attribute.
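The three sections of the loop body can be illustrated by a minimal C sketch. It is not the Griffy scheduler: it only checks RAW dependencies, it ignores WAR checks, the ALAP correction and the disambiguation mode, and the names `instr_t`, `schedule_dfg`, `MAX_SRC` and `PRIMARY_INPUT` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SRC 2            /* hypothetical: at most two operands */
#define NOT_SCHEDULED (-1)
#define PRIMARY_INPUT (-1)   /* a source that is always available */

typedef struct {
    int  src[MAX_SRC];       /* index of the producer, or PRIMARY_INPUT */
    int  stage;              /* assigned pipeline stage */
    bool candidate;          /* scratch flag, valid inside one iteration */
} instr_t;

/* Assigns a pipeline stage to every instruction. Returns the number of
 * stages, or -1 when no instruction can be fired (the point where the
 * real flow would enter its disambiguation mode). */
int schedule_dfg(instr_t *ins, size_t n)
{
    size_t pending = n;
    int stage = 0;

    for (size_t i = 0; i < n; i++)
        ins[i].stage = NOT_SCHEDULED;

    while (pending > 0) {
        size_t fired = 0;

        /* Candidate selection and analysis: every source must have been
         * committed in a previous iteration (RAW check only). */
        for (size_t i = 0; i < n; i++) {
            ins[i].candidate = false;
            if (ins[i].stage != NOT_SCHEDULED)
                continue;
            bool ready = true;
            for (int k = 0; k < MAX_SRC; k++) {
                int s = ins[i].src[k];
                if (s != PRIMARY_INPUT && ins[s].stage == NOT_SCHEDULED)
                    ready = false;
            }
            ins[i].candidate = ready;
        }

        /* Commit: results become visible only to the next stage. */
        for (size_t i = 0; i < n; i++) {
            if (ins[i].candidate) {
                ins[i].stage = stage;
                pending--;
                fired++;
            }
        }

        if (fired == 0)
            return -1;
        stage++;
    }
    return stage;
}
```

Keeping selection and commit in separate passes ensures that two instructions linked by a dependency never end up in the same pipeline stage.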
It should be noted from the code in Fig. 4.7 that the ALAP correction
is applied only if static aliases are involved in the computation. Thus,
using temporary and routing-only variables, the programmer can choose
how a set of correlated static variables is scheduled. If the old value
is passed through temporary instructions (indirect dependency), an
alignment is attempted (critical path optimization), while if the old
value is passed by referring directly to the static variable, the ALAP
correction is not applied. This side effect can be avoided by applying
the ALAP correction to static variables as well, but in this case the
disambiguation mode must work in two phases. In the first phase, the
ALAP correction is relaxed (removed) without generating new registers,
while in the second phase (the second round without fired instructions)
the routing-only de-optimization is applied.
4.2.3 Execution-time pipeline management
Given a Data Flow Graph (DFG) organized in pipeline stages, the
pipelined execution of successive DFG instances is enabled by checking
the data dependencies at run time. To this end, each pipeline stage is
represented by a node of the Pipelined Data Flow Graph (PDFG), which
represents the data dependencies across pipeline stages. Data
dependencies are analyzed and a pipeline controller is generated for
each pipeline stage, implementing the handshake mechanism described by
the corresponding Petri-Net transition. The basic pipeline stage
controller, depicted in Fig. 4.8, features:
• a preceding input port, providing the pipeline stage controller with
information about the availability of the inputs;
• a successive input port, providing the pipeline stage controller with
information about the possibility of updating the outputs;
• an execution enable that triggers the computation of the pipeline stage.
Figure 4.7: Candidates analysis algorithm

P-blocks and S-blocks are the basic sub-blocks that verify, respectively,
the preceding and successive input ports. The internal structure of these
sub-blocks is reported in Fig. 4.9. Execution enables provided by the other
pipeline stages are used as preceding and successive signals (directly
connected or routed by a programmable interconnect). Since they can
disappear after a single execution, they are stored inside the specific
sub-block until the pipeline stage is fired. Thanks to a feedback path,
both the P-block and the S-block maintain the local execution enable until
the global execution enable is fired, and reset it after the stage has
been triggered.

Figure 4.8: Pipeline stage controller simplified architecture (RCU with
P-blocks 0-2 and S-blocks 0-1; signals EENp0, EENp1, EENp2, EENs0, EENs1,
RESET and EEN; configurable connection block)

Unlike the P-block, the S-block features a combinational path (dashed
in Fig. 4.9) that allows an early evaluation of the data request from
successive pipeline stages, thus improving the pipeline evolution and the
overall throughput (relaxing the timed Petri-Net model).
Figure 4.9: P-block and S-block simplified architecture (signals: EENp,
EENs, EEN, RESET, ENABLEp, ENABLEs)
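The handshake performed by P-blocks and S-blocks can be summarized by a small behavioural model in C. This is only a sketch under stated assumptions, not the circuit of Fig. 4.9: incoming enables are modelled as single-cycle pulses, the type `stage_ctrl_t`, the function `stage_ctrl_step` and the bound `MAX_PORTS` are hypothetical, and the choice to keep a preceding pulse that arrives in the very cycle in which the stage fires is an assumption of the model.

```c
#include <stdbool.h>

#define MAX_PORTS 4   /* hypothetical upper bound on ports per stage */

typedef struct {
    int  n_pre, n_suc;
    bool p_latch[MAX_PORTS];  /* P-blocks: stored preceding enables  */
    bool s_latch[MAX_PORTS];  /* S-blocks: stored successive enables */
} stage_ctrl_t;

/* Advances the controller by one clock cycle. een_p[i] and een_s[i] are
 * the execution enables pulsed in this cycle by the preceding and
 * successive stages. Returns true when the stage fires (global EEN). */
bool stage_ctrl_step(stage_ctrl_t *c, const bool *een_p, const bool *een_s)
{
    bool fire = true;

    /* P-blocks are purely registered: only stored enables count. */
    for (int i = 0; i < c->n_pre; i++)
        fire = fire && c->p_latch[i];

    /* S-blocks also see the incoming enable in the same cycle
     * (the combinational path dashed in Fig. 4.9). */
    for (int i = 0; i < c->n_suc; i++)
        fire = fire && (c->s_latch[i] || een_s[i]);

    for (int i = 0; i < c->n_pre; i++)
        c->p_latch[i] = fire ? een_p[i]                   /* assumption */
                             : (c->p_latch[i] || een_p[i]);
    for (int i = 0; i < c->n_suc; i++)
        c->s_latch[i] = fire ? false
                             : (c->s_latch[i] || een_s[i]);

    return fire;
}
```

In this model a request coming from a successive stage can trigger the stage in the same cycle in which it arrives, while an enable from a preceding stage takes effect one cycle later, which is the asymmetry described in the text.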
For each pipeline stage, the Griffy-C compiler generates a specific
pipeline stage controller and the corresponding data dependencies with the
other pipeline stages. In the following code, 4 pipeline stages (here termed
virtual rows, or V-Rows) are described, as well as the dependencies from
primary inputs and outputs.
.output out 4
V-Row: 1
Preceding: #gins
Successive: 2
V-RowEnd
V-Row: 2
Preceding: 1
Successive: 3
V-RowEnd
V-Row: 3
Preceding: 2
Successive: 4
V-RowEnd
V-Row: 4
Preceding: 3
Successive: #gouts0
V-RowEnd
4.2.4 Griffy Front-End architecture
The overall architecture of the Griffy Front-End compiler is shown in Fig.
4.10. Lexer and parser are implemented using standard tools (GNU Flex
for the lexer and GNU Bison for the parser), which verify the
grammar of the input Griffy-C code and translate it into an abstract
syntax tree (AST). The next step is the (semantic) verification of the DFG
description. It is performed at AST level in order to verify the correctness
of the code, including additional checks on the single-assignment form, on
the variable declarations, and on the read-after-write dependencies of
non-static variables. When a condition is not verified, the compiler stops
the execution and reports an error message to the user.
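The three semantic checks can be sketched in C on a flat list of statements. The sketch does not reproduce the AST-based implementation of the Griffy Front-End: the types `var_t` and `stmt_t`, the function `check_dfg` and the error codes are hypothetical, and each statement is assumed to have one destination and at most two sources.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    CHECK_OK,
    ERR_UNDECLARED,
    ERR_MULTIPLE_ASSIGNMENT,
    ERR_READ_BEFORE_WRITE
} check_result_t;

typedef struct {
    bool declared;
    bool is_static;   /* static variables may be read before being written */
    bool is_input;    /* primary inputs are available from the start */
    bool assigned;    /* scratch flag used by the check */
} var_t;

typedef struct {
    int dst;          /* index of the destination variable */
    int src[2];       /* indices of the sources, -1 if unused */
} stmt_t;

/* Scans the statements in program order and returns the first violation. */
check_result_t check_dfg(var_t *vars, size_t n_vars,
                         const stmt_t *code, size_t n_stmts)
{
    for (size_t v = 0; v < n_vars; v++)
        vars[v].assigned = false;

    for (size_t i = 0; i < n_stmts; i++) {
        for (int k = 0; k < 2; k++) {
            int s = code[i].src[k];
            if (s < 0)
                continue;
            /* Declaration check on the sources. */
            if ((size_t)s >= n_vars || !vars[s].declared)
                return ERR_UNDECLARED;
            /* RAW check for non-static variables. */
            if (!vars[s].is_static && !vars[s].is_input && !vars[s].assigned)
                return ERR_READ_BEFORE_WRITE;
        }
        int d = code[i].dst;
        /* Declaration check on the destination. */
        if (d < 0 || (size_t)d >= n_vars || !vars[d].declared)
            return ERR_UNDECLARED;
        /* Single-assignment check. */
        if (vars[d].assigned)
            return ERR_MULTIPLE_ASSIGNMENT;
        vars[d].assigned = true;
    }
    return CHECK_OK;
}
```

Static variables are exempt from the read-before-write check because reading them before the assignment refers to the value left by the previous execution.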
By analyzing the instructions and the kind of variables involved in the
computation, the compilation flow detects the instructions suitable for
routing-only optimization. Then, it starts the pipeline organization
phase, in which instruction level parallelism is exploited using the
scheduling algorithm explained before. As a result, a netlist is generated
with the corresponding pipeline structure and activation sequence. During
the netlist generation, aliased signals (derived from the utilization of
alias instructions) are substituted by the corresponding physical
implementation. For example, a shift by a constant amount is obtained by
appropriately rearranging the bits of the input variable, and a bitwise-and
with a constant is implemented by connecting the signals corresponding
to the 1s and by leaving an open connection for the 0s. Furthermore, taking
into account the pipeline structure, a simulation model is generated,
emulating the Griffy-C code in standard ANSI-C. As a user facility, a
graphical view of the pipelined DFG is dumped using the .dot format of the
GraphViz tools (free download from Graph Visualization Software,
www.graphviz.org).

Figure 4.10: Simplified Griffy Front-End architecture (Griffy-C input;
Lexer/Parser with grammar checks; DFG checks: declaration check,
single-assignment check, RAW check for non-static variables; Alias
Analysis; ILP Extraction; Pipeline Management; Netlist Generation,
Emulation Model and Graphical DFG dump, producing the Griffy netlist, the
ANSI-C emulation and the .dot file for GraphViz)
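As an illustration of the graphical dump, the following C sketch writes a pipelined DFG in the DOT language, grouping the nodes of each pipeline stage in a cluster. The data structures `dfg_node_t` and `dfg_edge_t` and the function `dfg_dump_dot` are hypothetical and do not reflect the internal representation used by the Griffy tools; node names are assumed not to contain quotes or backslashes.

```c
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;   /* label shown in the graph */
    int stage;          /* pipeline stage assigned by the scheduler */
} dfg_node_t;

typedef struct {
    int from, to;       /* indices into the node array */
} dfg_edge_t;

/* Writes the graph to 'out' in DOT syntax, one cluster per stage.
 * Returns 0 on success and -1 on an output error. */
int dfg_dump_dot(FILE *out,
                 const dfg_node_t *nodes, size_t n_nodes,
                 const dfg_edge_t *edges, size_t n_edges,
                 int n_stages)
{
    fprintf(out, "digraph pdfg {\n");
    for (int s = 0; s < n_stages; s++) {
        fprintf(out, "  subgraph cluster_stage%d {\n", s);
        fprintf(out, "    label=\"stage %d\";\n", s);
        for (size_t i = 0; i < n_nodes; i++)
            if (nodes[i].stage == s)
                fprintf(out, "    n%zu [label=\"%s\"];\n", i, nodes[i].name);
        fprintf(out, "  }\n");
    }
    for (size_t e = 0; e < n_edges; e++)
        fprintf(out, "  n%d -> n%d;\n", edges[e].from, edges[e].to);
    fprintf(out, "}\n");
    return ferror(out) ? -1 : 0;
}
```

The resulting file can be rendered with the GraphViz `dot` command, for example `dot -Tpng pdfg.dot -o pdfg.png`.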
Figure 4.11: Simplified Griffy flow for PiCoGA-III (the front-end of
Fig. 4.10 followed by the target-dependent netlist generation based on a
macro library, placement, routing, control unit configuration and
bit-stream generation, with area information reported back to the
emulation)
4.3 Target-speciﬁc customizations and back-end
ﬂows
The previous sections have described the Griffy flow for a generic target.
The description has focused on the exploitation of instruction level
parallelism as a means to organize the DFG in pipeline stages. This is a
common point for most reconfigurable architectures, although specific
customizations, including checks or optimizations, must be implemented
for each target. This section provides a brief overview of two specific
customizations. In the first case, Griffy-C is used to configure the
PiCoGA (in particular, the description focuses on the third release,
PiCoGA-III, included in the DREAM adaptive DSP), while the second target
is a commercially available embedded FPGA, for which VHDL code is
generated from a Griffy description in order to enter the eFPGA
proprietary tool-flow. Furthermore, the back-end tool-flow for the PiCoGA
is outlined, including the place & route and the bitstream kit.
4.3.1 DFG mapping for PiCoGA
PiCoGA-III is a reconfigurable device implementing pipelined data flow
graphs on a hybrid architecture in which the computational parts are
mapped on an island-style matrix of 16×24 4-bitwise tiles, while the
pipeline management is handled by a dedicated pipeline control unit. For
this reason, the front-end does not need any specific customization³,
whereas a complete target-specific back-end flow is required. In fact,
under some hypotheses on the connectivity, the netlist provided by the
front-end is not target specific, and the PiCoGA customization only
requires the addition of a physical mapping phase in which:
• each computational node is implemented using the resources available
on one or more reconfigurable logic cells (RLCs), thus providing
a target-specific mapping;
• each pipeline stage controller is implemented using the dedicated
row control units.
³ If we exclude a writeback alignment, which is implemented as an
additional check condition during the candidate analysis phase.

Physical mapping is implemented using a library-based approach in
which each Griffy computational node is split into one or more RLCs. No
physical synthesis is performed, if we exclude the routing-only
optimization implemented in the Griffy Front-End. In fact, Griffy code is
intended as a structural way to effectively handle a reconfigurable device
under a pipelined DFG paradigm, providing the programmer with a low-level
optimization step similar to the assembly-level optimization for DSPs. The
target-specific netlist is then placed on the PiCoGA. Since each row is
fired synchronously, a row can be used by at most one pipeline stage,
while more than one row can be triggered together in order to build larger
pipeline stages. Placement is then the phase in which pipeline stages are
split into one or more rows and the reconfigurable logic cells are fitted
into them. In order to contain the number of used rows, a pseudo-malloc
algorithm is used: for each pipeline stage, the computational nodes are
sorted by their relative weight (the number of RLCs required for the
implementation) and each of them is fitted in the first empty space found
in a free row or in a row already assigned to the same pipeline stage. The
reduction of connection costs is implemented by applying simulated
annealing inside each pipeline stage, using a Manhattan metric as cost
function.
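The row allocation policy described above can be sketched in C as a first-fit scheme over nodes sorted by decreasing weight. This is an illustration under simplifying assumptions, not the Griffy placer: the array geometry (`N_ROWS`, `ROW_CELLS`) is a hypothetical parameter, every node is assumed to fit within a single row, positions inside a row are not tracked, and the simulated annealing refinement is omitted.

```c
#include <stdlib.h>

#define N_ROWS    24     /* hypothetical number of rows */
#define ROW_CELLS 16     /* hypothetical number of RLCs per row */
#define ROW_FREE  (-1)

typedef struct {
    int stage;    /* pipeline stage of the node */
    int weight;   /* number of RLCs required */
    int row;      /* output: row assigned by the placer, -1 if none */
} node_t;

/* Orders nodes by stage, and inside a stage by decreasing weight. */
static int by_stage_then_weight(const void *a, const void *b)
{
    const node_t *x = a, *y = b;
    if (x->stage != y->stage)
        return (x->stage > y->stage) - (x->stage < y->stage);
    return (y->weight > x->weight) - (y->weight < x->weight);
}

/* Places all nodes (the array is reordered in place). Returns the number
 * of rows used, or -1 if some node cannot be placed. */
int place_nodes(node_t *nodes, size_t n)
{
    int owner[N_ROWS], used[N_ROWS], rows_used = 0;

    for (int r = 0; r < N_ROWS; r++) {
        owner[r] = ROW_FREE;
        used[r] = 0;
    }
    qsort(nodes, n, sizeof *nodes, by_stage_then_weight);

    for (size_t i = 0; i < n; i++) {
        nodes[i].row = -1;
        for (int r = 0; r < N_ROWS; r++) {
            /* A row is usable if it is free or owned by the same stage. */
            int usable = owner[r] == ROW_FREE || owner[r] == nodes[i].stage;
            if (usable && used[r] + nodes[i].weight <= ROW_CELLS) {
                if (owner[r] == ROW_FREE) {
                    owner[r] = nodes[i].stage;
                    rows_used++;
                }
                used[r] += nodes[i].weight;
                nodes[i].row = r;
                break;
            }
        }
        if (nodes[i].row < 0)
            return -1;
    }
    return rows_used;
}
```

Sorting by decreasing weight places the largest nodes while rows are still empty, so that the smaller ones can fill the remaining gaps, which is what keeps the number of used rows low.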
Figure 4.12: PiCoGA-III control unit programmable interconnect (control
unit with RCU 0 to RCU 4 driving rows 0 to 4 of the computational logic
through the execution enables EEN0 to EEN4; RCU-array interface and
interconnection matrix)
After the placement phase, the pipeline control unit is configured. In
particular, the configuration of every pipeline stage controller is applied
to each row used to implement a pipeline stage. As shown in Fig. 4.12, a
dedicated programmable bus provides the necessary handshakes, propagating
the execution enables between predecessor and successor nodes.
When more than one row is used to implement a single pipeline stage,
the Griffy tools connect the nearest row for both the preceding and the
successive pipeline stage, in order to reduce the routing utilization.
PiCoGA routing is programmed using a customized version of VPR
[81], a well-known open-source tool developed at the University of Toronto.
It is based on a state-of-the-art timing-driven, negotiation-based
pathfinder algorithm in which resource over-utilization is allowed in the
first iterations of the routing algorithm. The final solution is achieved
by minimizing the overall cost, which is basically driven by the Elmore
delay associated with the nets. To avoid over-utilization, an additional
cost parameter is introduced in order to increase the cost of overused
resources, which become less appealing and must be negotiated among the
"users". In this context, the PiCoGA-specific customizations are focused on
the architecture modelling, while the routing algorithm is not changed with
respect to the basic one proposed in VPR. In particular, the 2-bit
granularity of the interconnections and the particular switch-block [66]
have been modelled.
After place & route, the area required for the specific Griffy operation
is available and is reported to the simulation engine, in order to verify
the resource utilization on the PiCoGA. The routing step is necessary in
order to take into account the routing that exceeds the bounding box
defined by the placement, as can happen for designs with high routing
congestion.
The last step of the PiCoGA-specific tool-chain is the generation of the
bit-stream, which is performed in two phases. In the first phase, both the
RLC configuration (from the physical mapping) and the routing configuration
are translated into a textual bit-stream that specifies the logical value
of each programmed bit. Only in the second phase are the bits placed
according to the physical implementation of the device, and the
configuration bit-stream is generated in the form of a C vector.
4.3.2 DFG mapping for eFPGA
This section describes the mapping of Griffy-C code on devices that have
a proprietary back-end flow receiving, as entry point, standard hardware
description languages. In particular, this section describes a Griffy target
generating VHDL code, implemented as a prototype in order to provide
the XiSystem architecture with a homogeneous algorithm development
environment.
The XiSystem architecture [67] is the first architecture integrating
two different field-programmable devices to provide application-specific
computing blocks and IOs. A XiRisc reconfigurable processor is exploited
to achieve more than one order of magnitude of speed-up and energy
consumption reduction with respect to a DSP-like processor, while an
embedded FPGA (eFPGA) is integrated in the system in order to make it
flexible enough to support various IO ports and protocols. The
reconfigurable IO device is also utilized for pre/post data processing and
for the implementation of some standard computational blocks. Fig. 4.13
shows the overall system architecture.

Figure 4.13: XiSystem SoC architecture (XiRisc core with PiCoGA, PiCoGA
configuration cache, register file, instruction and data caches; AHB
master and slave interfaces; 256 KB on-chip SRAM; TIC; DMA; eFPGA;
AHB-APB bridge; timer; basic IO interfaces; I/O pads)
While the instruction set extensions for the XiRisc processor are generated
by configuring the PiCoGA starting from Griffy-C with the previously
described flow, the management of the eFPGA depends on the specific
utilization. On the one hand, hardware description languages provide the
designer with a straightforward way to describe I/O protocols, directly
exposing timing issues. On the other hand, if the eFPGA is used for
pre/post data processing or as a streaming computational block, hardware
description languages can be replaced by the same high-level description
language (Griffy-C) used for the PiCoGA configuration. The exploitation of
both instruction and data parallelism is a key element to achieve
significant performance improvements also on standard reconfigurable
devices, whereas the utilization of software-like languages can be
seen as a way to allow software programmers to benefit from this
technology. Also in this case, the detection of critical kernels is made
through an iterative profiling step, in which the programmer can evaluate
various possible implementations of the algorithm. Moreover, the programmer
can choose to implement a kernel on the PiCoGA or on the eFPGA (or on
both), as well as which approach (Griffy-C or VHDL) to use for the
programming of the eFPGA. Fig. 4.14 shows the overall software tool-chain
implemented for the XiSystem architecture.

Figure 4.14: Overall software tool-chain (C code, profiling and kernel
extraction; optimized C code with the pga-op library compiled by the GCC
compiler into executable code; PiCoGA mapping and eFPGA mapping, the
latter through HDL and the HDL synthesizer, producing the PiCoGA and
eFPGA configuration bits; customized software simulation of registers,
memory, PiCoGA and eFPGA)
In particular, the translation of the Griffy-C code into VHDL is based on
the same principles adopted for PiCoGA programming: a library-based
generation of appropriately sized computational blocks (e.g. adder,
subtractor, bit-wise logic) and a library-based generation of pipeline
stage controllers. A library of VHDL components implementing both the
computational and the control parts has been developed, and the back-end
flow instantiates the appropriate components, thus realizing the
target-specific netlist. With respect to the mapping on PiCoGA, only one
pipeline stage controller is generated, since eFPGAs are not organized by
rows. Moreover, the handshaking among pipeline stages is handled by
point-to-point direct connections. The final design mixes both
computational and control parts on the same fabric, differently from the
PiCoGA approach, in which computation and control are separated.

                  PiCoGA                      eFPGA
Kernel     Area   Contexts used   Eff.     Area    Freq. (MHz)   Eff.
Motion     11     4               15.09    10.76   52            4.81
Pred.      5.5    2               30.18    2.54    86            33.66
FDCT       7.3    2               22.64    3.18    68            21.26
Quant.     10.5   3               15.75    3.81    47            12.4
IDCT       7.3    4               22.64    9.14    46            5

Table 4.1: PiCoGA vs. eFPGA computational efficiency comparison
The realization of the prototype tool-flow generating VHDL code starting
from a Griffy-C description allowed a comparison between the PiCoGA
and the eFPGA. The comparison has been conducted on the MPEG2 encoder
application, evaluating the maximum performance achievable on the
critical kernels. Table 4.1 shows the area occupation, working frequency
and computational efficiency of the same 5 circuits mapped on both
devices starting from the same Griffy-C code. In the case of the PiCoGA
(ver 1.0, integrated in XiRisc) a fixed frequency of 166 MHz was
considered. Computational efficiency has been calculated as the ratio
between the working frequency and the area occupied on the device to
implement the circuit. The PiCoGA advantage is maximum when all the 4
contexts are used, as in the case of motion estimation and IDCT, which are
the most critical kernels of the MPEG2 encoder. This shows that PiCoGA can
be 2 to 3 times more efficient than the eFPGA when implementing DSP
algorithms, since the introduction of the eFPGA is not aimed at increasing
computational density but at adding system interfacing flexibility and at
providing additional parallel pre/post processing facilities.

Circuit                  eFPGA area occupation   Frequency
IEEE1284                 1.5%                    83 MHz (SCK/2)
RS232                    39%                     83 MHz (SCK/2)
I²C                      8%                      55 MHz (SCK/3)
LCD + YUV-RGB conv.      28%                     42 MHz (SCK/4)
VideoCam                 1.5%                    166 MHz (SCK)
CRC                      32%                     55 MHz (SCK/3)
Reed-Solomon             20%                     55 MHz (SCK/3)
IDCT                     60%                     42 MHz (SCK/4)
SCK: System CK frequency

Table 4.2: Area occupation and working frequency of circuits mapped on the
eFPGA
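As a worked example of the efficiency figure, the following C helper applies the definition given above (working frequency divided by occupied area). The function name `computational_efficiency` is hypothetical, and the area is expressed in the same unit used in Table 4.1.

```c
/* Computational efficiency as defined in the text: working frequency
 * divided by the area occupied on the device. */
double computational_efficiency(double freq_mhz, double area)
{
    return freq_mhz / area;
}
```

For the motion estimation kernel, `computational_efficiency(166.0, 11.0)` gives about 15.09 for the PiCoGA, as in Table 4.1, while `computational_efficiency(52.0, 10.76)` gives about 4.83 for the eFPGA, close to the 4.81 reported in the table (the small difference is presumably due to rounding of the tabulated area). The ratio between the two values is about 3.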
Common communication protocols, such as I²C, RS232 and IEEE1284,
have been mapped on the reconfigurable IO module, satisfying the
specific requirements of each protocol (see Table 4.2). Moreover, it can be
convenient to map some additional pre/post-processing operations, for
example data format manipulation or error correcting codes,
which are well suited to FPGA implementation because of their bit-level
granularity. This removes a portion of the computational load
from the central processor. Particularly interesting are error detection
and correction algorithms such as CRC and Reed-Solomon, which are almost
ubiquitous, being used for example in storage drives, wireless
communications, digital television, and broadband modems. Both have been
mapped on the eFPGA, which is capable of supporting a data rate of up to
100 MB/s for the Reed-Solomon encoder.

Figure 4.15: XiSystem MPEG2 decoder performance (axis: normalized
processing time; architectures: DSP-like, XiRisc, XiSystem; kernels: IDCT,
YUV-RGB, Get_Bits; legend: Processor, PiCoGA, eFPGA)
As a proof of the benefit provided by mixed reconfigurable devices,
an MPEG-2 application (encoder and decoder) has been developed on
XiSystem and applied to a standard QCIF stream with a frame resolution of
176x144 pixels and half-pel precision. The application-specific
instructions introduced with the PiCoGA achieve a 5x speed-up and a 66%
energy reduction on the encoder, and a 1.5x speed-up on the decoder.
Reconfigurable IOs are used to implement drivers for external peripherals
such as an LCD display or a videocam, without binding the chip to any
specific device. Since the chosen LCD display requires the RGB pixel format
while an MPEG decoder computes frames in YUV format, the necessary
conversion has been implemented in the eFPGA, removing this procedure from
the central core. This simple data post-processing achieved a 10% speed-up
and a 6% energy saving on the whole decoder application. Moreover, the
coprocessor configuration of the eFPGA was used to implement the row
processing part of the IDCT, achieving a further 6% speed-up, as shown in
Figure 4.15.

Chapter 5
Simulation of dynamically
reconﬁgurable processors
“Modern processors are incredibly complex marvels of engineering that are be-
coming increasingly hard to evaluate” [89].
In the case of reconfigurable architectures, simulation and performance
evaluation show additional complexity due to the coupling of dynamically
programmable logic with standard processor cores. On the one hand,
fast simulation is a strict requirement for a design environment in which
the iterative refinement of the partitioning between hardware and software
is critical in order to achieve the expected performance improvement.
On the other hand, fast simulation requires the instruction set
customization to be handled using high-level models (it is not required to
control each transition associated with each signal, but only the overall
behavior), and new strategies are thus necessary in order to preserve both
cycle-level accuracy and fast reconfiguration time.
The design of reconfigurable architecture simulators can be oriented to
fast source-level retargeting or to dynamic simulator extension. In the
first case, the Instruction Set Simulator (ISS) description allows the user
to rapidly add and remove instructions from the basic instruction set, but
it requires the simulator to be recompiled for each instruction set
extension. As an example, the Language for Instruction Set Architecture
description (LISA [90]) was introduced in order to reduce retargeting time
in the design and exploration of processor architectures, even if
instruction set extension involves a detailed description of new pipeline
irregularities, such as different pipeline depths and internal stalls. The
SimpleScalar Toolset [89] also allows the user to modify both the
instruction set and the pipeline architecture. This infrastructure has been
used in [91], where reconfiguration is performed via application-specific
re-targeting and simulator recompiling (Fig. 4 in [91]).
A different approach is followed in [92], where speed-up and performance
estimation involve the GNU gprof profiler and a prototype FPGA
used to synthesize application kernels. Due to space limitations, no file,
I/O, or operating system calls have been implemented on the prototype
FPGA. Kernels are implemented on the prototype FPGA, which makes it
possible to compare the execution time of the reconfigurable
implementation with that of the software implementation. The application
speed-up is then estimated using Amdahl's law.
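For reference, Amdahl's law can be written as a one-line C function. The name `amdahl_speedup` and the parameter names are hypothetical: `f` is the fraction of the original execution time spent in the accelerated kernels and `s` is the speed-up measured on those kernels.

```c
/* Overall speed-up according to Amdahl's law.
 * f: fraction of the execution time that is accelerated (0 <= f <= 1)
 * s: speed-up obtained on that fraction (s > 0) */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, if the kernels account for 90% of the execution time and run 10 times faster on the reconfigurable logic, the estimated application speed-up is 1 / (0.1 + 0.09), that is about 5.26.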
In the Griffy project, two levels of simulation are provided:
• functional simulation, in which the programmer can validate Griffy-C
code on the host machine (e.g. x86 Linux) using a compiled simulator,
which joins the Griffy emulation and the processor code in the
same executable;
• cycle-accurate simulation, in which the programmer can simulate the
Griffy code on the target architecture, thus allowing both
target-specific debugging and performance evaluation.
In both cases, the emulation of the Griffy code, be it used for PiCoGA
configuration or for different devices like an eFPGA, is automatically
generated by the Griffy toolchain. The rest of the chapter explains the
Griffy emulation principles, with an example of integration in an
open-source environment. The GNU GDB-based simulation environment of the
XiRisc processor is considered as a test case. GNU GDB is not a
cycle-accurate simulation environment, like those provided for example by
LISA, but it is a commonly available and well-known tool. For this
reason, it will be used as an example of integration, although the Griffy
emulation has also been successfully plugged into LISA-based environments
for both the XiRisc and the DREAM processors.
5.1 Functional simulation
The goal of the functional simulation is to provide the end user with a
tool for the verification and the debugging of the Griffy-C specification.
For this reason, one of the most important points is the speed of
simulation, since the verification could require intensive tests with a
high volume of data. For this purpose, a correct management of all the
timing aspects (for example, the pipeline evolution) is not essential,
whereas it is required in target-specific performance evaluations. Two
aspects are considered in the following:
• the functional emulation of Griffy-C code, which requires the
translation of the Griffy-C operations to standard ANSI C, including both
the operators and the #pragma attributes;
• the definition of a virtual platform (a simple reconfigurable
architecture), which allows the functionalities defined by Griffy-C code
to be used in standard C code.
5.1.1 Functional emulation
Functional emulation is the standard C translation of the Griffy-C
operations, including both operators and #pragma attributes. The Griffy
tools provide an emulation function for each Griffy operation defined in
the source code. Emulation is based on a re-ordered Griffy-C code,
scheduled by pipeline stages, in order to validate also the consistency of
the data dependencies.
The translation of Griffy-C code is handled in two steps:
• mapping of Griffy-C operators on standard C operators. For many
operators this phase is a pure cut-and-paste, since most Griffy-C
operators are a subset of the ANSI C operators, while additional
functionalities require the generation of specific code. As an example,
LUT emulation is obtained by declaring a local array that implements the
functionality, while the operator is substituted with the read of an
array element.
    #pragma attrib out SIZE=1;
    #pragma attrib in SIZE=4;
    out = in @ 0x05;
becomes
    {
        unsigned char emulation_vector[16] = { 1,0,1,0,
                                               0,0,0,0,
                                               0,0,0,0,
                                               0,0,0,0};
        out = emulation_vector[in];
    }
• adding attribute side effects. Attributes make it possible both to
define the bit-level variable size and to extract information about the
carryout and overflow conditions of arithmetic operations. They are
handled by adding a "normalization" process after the basic operation in
order to patch the result. The $n$-bit resizing is implemented by means of
masking and, depending on the variable type, is obtained by:
– bit-wise and with $2^n - 1$ in the case of unsigned variables;
– left and right shift by $N - n$ in the case of signed variables,
where $N$ is the original size (32 bits for integers, 16 bits for short
and 8 bits for char types).
Carryout and overflow information is extracted by adding specific
emulation code after the operation under check. As an example, for
an addition:
– carryout is obtained by:
    tmp1 = i1 + i2;
    __griffy_gen_carryout_tmp1 = (( ( (i1 >> 1) & 0x7fffffffULL ) +
                                    ( (i2 >> 1) & 0x7fffffffULL ) +
                                    ( (((i1 & 0x01) + (i2 & 0x01)) >> 1) & 0x1 )
                                  ) >> (32 - 1)) & 0x01;
– overflow is obtained by:
    tmp1 = i1 + i2;
    __griffy_gen_overflow_tmp1 = (
        ( ((i1 < 0) ? 1 : 0) == ((i2 < 0) ? 1 : 0) ) &&
        ( ((i1 < 0) ? 1 : 0) != ((tmp1 < 0) ? 1 : 0) ) ) & 0x01;

Figure 5.1: Griffy code viewer
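The normalization steps listed above can be collected in a few self-contained C helpers. They are a sketch with hypothetical names (`resize_unsigned`, `resize_signed`, `add_carryout`, `lut_to_vector`), written for 32-bit variables. The carryout helper uses a 64-bit intermediate instead of the expression generated by the Griffy tools, and the LUT helper assumes that bit i of the LUT constant is the output for input value i, which is consistent with the 0x05 example.

```c
#include <stdint.h>

/* Unsigned n-bit resize: bit-wise and with 2^n - 1 (1 <= n <= 32). */
uint32_t resize_unsigned(uint32_t v, unsigned n)
{
    return (n >= 32) ? v : (v & ((UINT32_C(1) << n) - 1));
}

/* Signed n-bit resize: left and right shift by N - n, with N = 32
 * (1 <= n <= 32). The left shift is done on the unsigned type; the
 * conversion back to int32_t and the right shift of a negative value are
 * implementation-defined in C and are assumed here to wrap and to be
 * arithmetic, as on common compilers. */
int32_t resize_signed(int32_t v, unsigned n)
{
    unsigned sh = 32 - n;
    return (int32_t)((uint32_t)v << sh) >> sh;
}

/* Carryout of a 32-bit addition, computed through a 64-bit sum. */
unsigned add_carryout(uint32_t a, uint32_t b)
{
    return (unsigned)(((uint64_t)a + b) >> 32);
}

/* Expands a 4-input LUT constant into a 16-entry emulation vector. */
void lut_to_vector(uint16_t lut, unsigned char vec[16])
{
    for (unsigned i = 0; i < 16; i++)
        vec[i] = (unsigned char)((lut >> i) & 1u);
}
```

With these helpers, `resize_signed(0xF, 4)` interprets the 4-bit pattern 1111 as -1, and `lut_to_vector(0x05, v)` sets only `v[0]` and `v[2]`, matching the emulation vector of the LUT example.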
After each Griffy-C emulation, a file dump is performed in order to
allow the inspection of the internal values. Since Griffy code describes a
hardware part (although reconfigurable), it could be misleading to allow
break-pointing on Griffy code just like on standard C code. For this
reason, intermediate results are only reported in a dedicated viewer
(Fig. 5.1), available for every simulation environment in which the Griffy
emulation is plugged.
5.1.2 Reconfigurable devices management via virtual target
The verification of Griffy-C code in a given application requires several
operations to be performed on the reconfigurable device programmed by
Griffy-C. Most of them are strongly connected to the specific system
including the device, to the way in which the device is connected to the
processor, and to the specific management protocol of the device. Usually,
from a hardware point of view, a reconfigurable device is explicitly
triggered to load configurations and to execute operations by writing
specific commands to a specific hardware port. While the physical
implementation of these operations is roughly system independent, the
mechanism that generates the commands strongly depends on the system around
the reconfigurable engine. In this context, we can suppose, without loss of
generality, that the reconfigurable device is managed through a standard
microprocessor. For these reasons, we provide a virtual Application Program
Interface (API) that allows the user to load, trigger, and deallocate
operations on the reconfigurable device in a sort of virtual platform. The
rest of this section describes the functional simulation model used to
handle Griffy operations through a standard processor, as in the following
example.
Initialization { ... }
PD = pga_allocate (my_first_pgaop);
{ ... }
Computation { ... }
for ( ; ; ) {
    ...
    pgaop_direct1 (PD, &outputs, ..., inputs, ...);
    ...
}
Conclusion { ... }
pga_deallocate (PD);
Low-level built-in functions are provided to manage the configuration.
The pga_load and pga_free built-in functions allow users to load a
configuration and to release the space used by a configuration. They
are low-level primitives to handle a reconfigurable device like the
PiCoGA, which gives the user the possibility of loading more than one
configuration in the same device. Although an emulation of these
built-ins is provided, their functionality needs to be considered, for
most devices, as atomic. Using pga_load and pga_free, the user is
responsible for allocating Griffy operations in the reconfigurable
device and for checking the space availability, since this parameter is
specific to the target. On the other hand, if the allocation is handled
by a processor, it is also possible to run a flexible allocation
mechanism, requiring the user to specify only the name of the Griffy
operation (or pga-op, programmable gate array operation) to be loaded
or freed. Such a function can automatically find an appropriate empty
space (layer and starting row) inside the reconfigurable device, thus
providing the user with an abstraction layer with respect to the direct
hardware-level management.
At the highest abstraction layer, the allocation mechanism is very
similar to the one used for dynamic memory allocation. The pga_allocate
function gives the user the capability of loading a new configuration
on the reconfigurable device. It searches for an empty available space
in the array, starting from the first row in the first configuration
context (or layer), but it does not perform any analysis of the
fragmentation of the reconfigurable device. If pga_allocate finds a
suitable space, it issues the pga_load command, which is the physical
operation that actually interacts with the reconfigurable device. Of
course, the user can force an allocation manually, using pga_load
directly, but in this case the allocation structure used by
pga_allocate is not updated. The user is responsible for ensuring the
consistency of the data structure when a mix of pga_load and
pga_allocate is used. pga_allocate returns a pga-op descriptor (PD in
the previous example) that includes information about the location
inside the reconfigurable device. The pgaop is triggered by specifying
the PD in pgaop_direct1.
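The first-fit search performed by pga_allocate, as described above, can be sketched as follows. This is only an illustration under stated assumptions: the type `pd_sketch_t`, the functions `sketch_allocate` and `sketch_deallocate`, and the row count are hypothetical, and the call to the physical load primitive is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_LAYERS 4    /* number of configuration contexts (layers) */
#define NUM_ROWS   24   /* assumed number of rows per layer, illustrative */

/* Hypothetical descriptor: where the operation sits inside the device. */
typedef struct {
    int layer;
    int start_row;
    int rows;
    int valid;
} pd_sketch_t;

/* Allocation structure: 1 marks a row already used by a configuration. */
static unsigned char row_busy[NUM_LAYERS][NUM_ROWS];

/* First-fit search starting from the first row of the first layer.
 * No fragmentation analysis is performed. */
static pd_sketch_t sketch_allocate(int rows_needed)
{
    pd_sketch_t pd = { -1, -1, 0, 0 };
    assert(rows_needed >= 1 && rows_needed <= NUM_ROWS);
    for (int layer = 0; layer < NUM_LAYERS; layer++) {
        int run = 0;                            /* length of current free run */
        for (int row = 0; row < NUM_ROWS; row++) {
            run = row_busy[layer][row] ? 0 : run + 1;
            if (run == rows_needed) {
                int start = row - rows_needed + 1;
                memset(&row_busy[layer][start], 1, (size_t)rows_needed);
                /* a real implementation would issue the load primitive here */
                pd.layer = layer;
                pd.start_row = start;
                pd.rows = rows_needed;
                pd.valid = 1;
                return pd;
            }
        }
    }
    return pd;                                  /* valid == 0: no space found */
}

/* Inverse operation: release the rows recorded in the descriptor. */
static void sketch_deallocate(pd_sketch_t *pd)
{
    assert(pd != NULL);
    if (!pd->valid)
        return;
    memset(&row_busy[pd->layer][pd->start_row], 0, (size_t)pd->rows);
    pd->valid = 0;
}
```

Because the search never compacts the array, a request can be pushed to a later layer even when the total number of free rows in an earlier layer would be sufficient, which is the fragmentation issue mentioned in the text.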
Of course, the configuration of a given operation must be requested
before its issue. Usually, though, the configuration of an operation
can be performed in the initialization phase of the application, and
this rarely causes any overhead on the performance. The pgaop_direct1
function is the link that allows computing the previously described
emulation function of a given PGAOP. It works like an ANSI C procedure,
thus results are provided through memory referencing (pointers). The
pgaop_direct1 function requires:
• PD: the PGAOP descriptor
• 4 outputs (even if not all are used)
• 12 inputs (even if not all are used)
Communication with the reconfigurable device is handled by a virtual
mechanism implemented in software by a function call. Real data
transfer shall be defined and refined in the target-specific simulation
environment. The pga_deallocate function removes the specified
operation (identified through the PD) when it is no longer needed, or
when the user needs space for a new set of operations. pga_deallocate
updates the same data structure used by pga_allocate (it is its inverse
function) and performs a call to pga_free. Before an allocation, a
given operation is identified by the name used in the Griffy-C
declaration. When one or more operations are compiled by the Griffy
tools, an enumerated list is provided, ensuring the consistency of data
structures (e.g. internally to pga_allocate). In the case of an
operation holding an internal state (through the use of static
variables), the (re)initialization can be forced by the pga_init
function. At the next execution, initialization values are reloaded in
each static variable.
5.2 Instruction set extension through dynamic libraries
The basic idea of the Griffy simulation environment is to work as a
plug-in for standard simulation environments, be they functional or
cycle-accurate. To maintain a high simulation speed, operations
implemented on a reconfigurable device are emulated (as explained
before), without performing bit-level simulation. Starting from
emulation functions, the reconfigurable device resources are modelled
in such a way that all the physical constraints are respected
throughout the program execution. In the case of cycle-accurate
simulation, timing issues, such as the pipeline evolution, are handled
by a third, additional wrapper.
Dynamic reconfiguration is thus handled by changing the emulation
function associated with a specific pgaop descriptor (this link is
provided by the execution of the pga_load/pga_free primitives). Since
for a reconfigurable processor most of the architecture is defined and
only a part changes depending on the application-specific
customization, reconfigurability is provided by means of dynamically
linked libraries. This avoids the need to recompile the instruction set
simulator (ISS), which could be excessively time-consuming for the
end-user.
Therefore, the emulation of a set of functionalities mapped on the
reconfigurable unit implies a description of:
• the functionalities to be implemented on the array;
• the resources available on the reconfigurable device;
• the way the operations are structured within pipeline stages.
Such items are partly described in the ISS definition, and partly come
with the dynamic library produced by the application compilation. The
resource description is specific to the reconfigurable device, as is
the way in which pipeline evolution is handled, although the pipeline
control structure can be considered common to all the operations.
Therefore, it is possible to integrate these structures in the ISS
core, by programming the pipeline manager for each new functionality
loaded in the reconfigurable device. The dynamic library only needs to
describe the pgaop functionality and the proper pipeline activation
events. According to this approach, as summarized in Figure 5.2 for the
XiRisc processor, when a new application is compiled, the Griffy
toolchain provides:
• an executable program, in ELF1 format, composed of the processor
  code/data and the configuration bitstream for the reconfigurable
  device;
• an application-specific emulation library for the instruction set
  extension that is dynamically linked to the ISS core.
Figure 5.2: Simplified XiRisc simulation structure [blocks: C source
code; Griffy-C and C compiler; executable; dynamical PiCoGA emulation;
XiRisc simulator ISS core]
The simulation library can be compiled in verbose mode. In this case,
it is possible to monitor the internal status of the reconﬁgurable device
during the ISS elaboration and to visualize it with an external viewer in
order to implement source-line debugging.
1 Executable and Linking Format
As an example, we can consider the dynamic instruction set extension
of a GNU GDB-based simulator. As in other instruction set simulators,
the core of GDB is the instruction set description, in which a set of
functions associates a specific functionality with each assembly
operation. In the case of GDB, the .igen format is used to specify the
instruction template (opcode, input and output registers), the mnemonic
(dumped if simulation tracking is enabled) and the functionality. The
following code shows the description of a simple ADD operation.
000000,5.RS,5.RT,5.RD,00000,000000:SPECIAL:32::ADD
"add r<RD>, r<RS>, r<RT>"
{
GPR[RD] = GPR[RS] + GPR[RT];
}
RD represents the destination register, whereas RS and RT are the input
registers. GPR is the general purpose register file. The first two
lines represent respectively the instruction template and the mnemonic.
The functionality, written in ANSI C, can include additional
operations, such as the update of the program counter in the case of
conditional branches, delay slot management and so on. In the case of
instruction set extension, in which the functionality is defined by the
end-user and often depends on the application, retargeting by
recompiling the whole simulation engine is not particularly appealing.
In the Griffy project, and in particular in the case of the XiRisc
reconfigurable processor, dynamic instruction set extension is handled
using dynamically linked libraries. The .igen description is therefore
modified to call an emulation function. During the compilation of GDB,
a curses-library is used, providing error messages for the invocation
of undefined functionalities. When the application and the related set
of Griffy operations are defined, the emulation library providing
correct emulation is available. GDB is then modified in order to
support a non-specific assembly instruction, that is, an instruction
skeleton whose functionality is specified at execution time, as in the
following PGA32 code:
111110,5.RD1,5.RD2,5.RS,5.RT,4.PD,00:SPECIAL:32::PGA32
"pga32 <PD>, r<RD1>, r<RD2>, r<RS>, r<RT>"
{
// Verification of PD availability
if ( PGA_ID_table[PD] != 1 ) {
fprintf (stderr, "Error!!! PD 0x%x not loaded\n", PD);
exit (1);
}
// Dynamically linked with libemu.so
__pga_emul[PD] (&latency_dest, &issue_delay,
PD,
&(GPR[RD1]), &(GPR[RD2]), // Output list
GPR[RS], GPR[RT] // Input list
);
}
__pga_emul is a vector of pointers to functions that is initialized by
pga_load. During pga_load, the emulation function is associated with
the PD, and at trigger time it is verified that this link exists. Input
and output values are passed depending on the model of computation
used: in this case, a functional unit model is chosen and data are
passed through the register file (2 inputs and 2 outputs). Latency and
issue delay are the static parameters of the function and they are used
in the timing model, as discussed in the following.
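The link between a descriptor and its emulation function can be illustrated with a plain table of function pointers. The sketch below is a simplified model, not the simulator code: the names `emul_table`, `sketch_pga_load`, `sketch_pga_free`, `sketch_trigger` and `emul_addsub` are hypothetical, the latency and issue-delay arguments are omitted, and in the real flow the function address would come from the dynamically linked emulation library rather than from the same translation unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PD 16   /* the PGA32 template reserves a 4-bit PD field */

/* Functional-unit model: 2 inputs and 2 outputs through the register file. */
typedef void (*pga_emul_fn)(int32_t *out1, int32_t *out2,
                            int32_t in1, int32_t in2);

/* One entry per descriptor; NULL means "no configuration loaded". */
static pga_emul_fn emul_table[MAX_PD];

/* Load: associate an emulation function with the descriptor. */
static int sketch_pga_load(unsigned pd, pga_emul_fn fn)
{
    if (pd >= MAX_PD || fn == NULL)
        return -1;
    emul_table[pd] = fn;
    return 0;
}

/* Free: remove the association. */
static void sketch_pga_free(unsigned pd)
{
    assert(pd < MAX_PD);
    emul_table[pd] = NULL;
}

/* Trigger: verify that the link exists, then call the emulation function.
 * Returns -1 instead of aborting the simulation when the PD is not loaded. */
static int sketch_trigger(unsigned pd, int32_t *out1, int32_t *out2,
                          int32_t in1, int32_t in2)
{
    if (pd >= MAX_PD || emul_table[pd] == NULL)
        return -1;
    emul_table[pd](out1, out2, in1, in2);
    return 0;
}

/* Example emulation function (hypothetical operation): sum and difference. */
static void emul_addsub(int32_t *out1, int32_t *out2, int32_t in1, int32_t in2)
{
    *out1 = in1 + in2;
    *out2 = in1 - in2;
}
```

Changing the table entry at run time is what makes the instruction "dynamic": the instruction template stays fixed while the functionality behind a given PD can be replaced without rebuilding the simulator.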
5.2.1 Cycle-accurate simulation model
The management of a set of custom operations mapped on user-defined
pipelines over a reconfigurable device is described in the following,
first proposing a model for a single configurable pipeline and then
extending the same model to handle the stalls involved, for example, in
context switches, writeback conflicts and so on. As introduced in the
previous chapter, the computation of custom pipelines generated by the
Griffy tools is controlled through timed Petri Nets, with operation
firing associated with each taken transition. In simulation, checking
the token availability for every node would require a large amount of
time. To overcome this problem, a different cycle-accurate model based
on resource allocation vectors is proposed.
We distinguish between two levels of description of the custom
pipeline:
• the “functional model”, in which we describe the functionality of
  the DFG, its area occupation and load penalty on the reconfigurable
  device;
• the “timing model”, which takes into account stall occurrences both
  inside and outside the reconfigurable device, due to data
  dependencies in the program flow and/or between successive pgaops.
Figure 5.3: An example of pipeline evolution
Although for debugging purposes it is possible to attach the functional
model to a functional-only simulation engine (e.g. the GNU Debugger,
GDB), a cycle-accurate performance evaluation requires interfacing the
processor core ISS with the timing model in order to represent all
stalls correctly. As an example, stalls in the XiRisc computations can
be due to two different factors: inter-operation data hazards, due to
dependencies in the processor program flow, and intra-operation
pipeline hazards inside the PiCoGA computation.
Program Flow Hazards
A cycle-accurate model can be seen as an abstract object with an
internal hierarchy:
• the functional model is an internal, compilation-dependent and
  dynamically linked core that describes the pgaop functionality;
• the timing model is a fixed wrapper handling communication with the
  ISS core.
The wrapper emulates pipeline activity and handles all hazard
occurrences, thus allowing cycle-accurate simulation. It is also the
wrapper that appropriately calls the functional model of the required
pgaop, writing computation results back to the register file at the
appropriate time and issuing stalls according to the data hazard
handling rules.
PiCoGA Internal Hazards
A relevant feature of the PiCoGA unit is the capability to compute
multiple issues of the same pgaop concurrently on a pipelined pattern,
dynamically resolving all potential hazards at computation time. Thus,
a fundamental issue of the timing model embedded in the wrapper is the
description of the pipeline management inside the PiCoGA in the case of
multiple issues of a given pgaop. Other architectures do not provide
this feature, and may also lack the capability to load more than one
operation at a time. For all these architectures, the simulation model
shall be simplified in terms of concurrency management, whereas, for
example, checks on resource availability shall be improved.
For each pipeline stage, preceding and successive data dependencies
define an issue delay, which describes the minimum number of cycles
between successive firings. These “pipeline effects” are described by
an Issue Delay Vector (IDV), produced by the DFG compilation. The IDV
describes, for each pipeline stage (i.e. set of concurrent DFG nodes),
the number of cycles, in the overall pipeline flow, during which the
stage must be idle, because:
• it is waiting for an input not yet produced by a previous stage;
• one output must be processed by a following stage whose computation
  has not yet been triggered.
The maximum value in the IDV describes the issue delay of the overall
pipeline, that is, the rate at which new issues of the same operation
can be fed to the pipeline. An optimal DFG has an IDV composed of ’1’
for each stage. This means it features an issue delay of 1, and a new
operation can be fed to the pipeline at each cycle (provided it does
not cause program flow inconsistencies at processor level, as described
in section 5.2.1). Of course, the IDV is specific to a given pgaop and
represents the key information for the timing model.
During simulation, each loaded configuration has a corresponding
Status-IDV (SIDV). The SIDV is used to verify resource availability:
when a pgaop computes the i-th pipeline stage, it sets the i-th entry
of the SIDV to the corresponding issue delay. At every cycle each
element of the SIDV is decremented: each specific pgaop issue will be
stalled in the stage until the corresponding SIDV entry returns to
zero. Figure 5.3 shows an example of pipeline evolution under this
model. Let us suppose that the processor core is capable of triggering
a new pgaop at every clock cycle. At a certain cycle we try to trigger
a new pgaop, but a stall occurs because pipeline stage 2 features an
issue delay of 2.
The distance between successive operations in the pipeline is not
fixed: a stall condition occurs when a pgaop tries to compute a
pipeline stage (i.e. DFG node) corresponding to a non-zero value in the
SIDV. As a consequence, successive issues will (i) stall if the
pipeline is at the maximum issue rate (backward avalanche effect) or
(ii) proceed until they “reach” the stall location (elastic effect).
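The IDV/SIDV bookkeeping described above can be sketched with a few lines of C. This is a simplified model written for illustration only: the function name `simulate_sidv` and the array bounds are hypothetical, a single pgaop is issued repeatedly with the processor trying to trigger one new issue per cycle, and the rule that an issue cannot overwrite the result of a still-stalled predecessor is an assumption added here to reproduce the backward avalanche effect.

```c
#include <assert.h>

#define MAX_STAGES 16   /* illustrative bound, not a device parameter */
#define MAX_ISSUES 64   /* illustrative bound */

/* Returns the number of cycles needed to complete n_issues issues of the
 * same pgaop on a pipeline with n_stages stages and the given IDV. */
static int simulate_sidv(const int *idv, int n_stages, int n_issues)
{
    int sidv[MAX_STAGES] = {0};   /* remaining busy cycles per stage        */
    int done[MAX_ISSUES] = {0};   /* stages already computed by each issue  */
    int cycles = 0;

    assert(n_stages >= 1 && n_stages <= MAX_STAGES);
    assert(n_issues >= 1 && n_issues <= MAX_ISSUES);
    for (int s = 0; s < n_stages; s++)
        assert(idv[s] >= 1);

    while (done[n_issues - 1] < n_stages) {
        /* Oldest issue first, so that younger issues see updated state. */
        for (int k = 0; k < n_issues; k++) {
            int stage = done[k];                /* next stage to compute */
            if (stage == n_stages)
                continue;                       /* issue already complete */
            /* The predecessor must have left the slot this issue writes. */
            int slot_free = (k == 0) || done[k - 1] == n_stages ||
                            done[k - 1] > stage + 1;
            if (sidv[stage] == 0 && slot_free) {
                sidv[stage] = idv[stage];       /* stage busy for idv cycles */
                done[k]++;
            }                                   /* otherwise: stall */
        }
        for (int s = 0; s < n_stages; s++)      /* end of cycle: decrement */
            if (sidv[s] > 0)
                sidv[s]--;
        cycles++;
    }
    return cycles;
}
```

With an IDV of all ones the model accepts one issue per cycle, while a single entry equal to 2 halves the sustained issue rate, in line with the statement that the maximum IDV value gives the issue delay of the whole pipeline.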
All effects discussed so far are due to multiple issues of the same
pgaop: this guarantees that at most one issue per clock is computing
the last stage of the pipeline and is thus performing writeback on the
register file. On the contrary, concurrent computations of different
pgaops, that is, pgaops implementing different pipelines on the PiCoGA
resources, are completely independent and feature different latencies.
Consequently, they may cause conflicts on the writeback channels. This
will cause a stall of one of the two pipelines.
Another cause of stall handled by the wrapper is the context switch.
For architectures like PiCoGA featuring multi-context capabilities, the
reconfigurable device is able to change the active configuration
context in a few cycles. In the case of PiCoGA, the active context
(among the 4 available) can be switched in a single clock cycle simply
by addressing a pgaop that resides in a context different from the
active one. Several stalls may then be necessary in order to complete
all current computations in the current context and flush all active
pipelines before the context switch.
Of course, all internal stalls described so far may affect the
processor core pipeline, as the reconfigurable device may refuse the
issue of an operation whose initial stage is already occupied by a
stall. Furthermore, an incorrect use of the reconfigurable device may
cause exceptions. The wrapper is able to back-annotate stall (and
exception) information to the processor model.
In Figure 5.3, a couple of stalls occur during the execution when the
reconfigurable device tries to write back values to the register file.
In the following clock cycle an elastic effect occurs, allowing other
pending operations to be computed until they reach the stalled pipeline
stage: successive issues will “crowd” in the stages following the
stall, and a backward avalanche stall will be caused. When the first
stalled stage resumes computation, all following pipelines will in turn
resume their normal computation and spread over the pipeline. In
general, both backward avalanche and elastic effects can occur when
more than one operation is under execution in reconfigurable devices
like the PiCoGA, featuring a flexible pipeline management based on the
Petri-Net paradigm. The resulting overall effect resembles the
alternation of compression and dilation phases in longitudinal wave
propagation.
5.2.2 Simulation speed analysis
Evaluation of a simulation engine needs to take into account several
parameters such as flexibility and accuracy. In addition, for a
reconfigurable architecture we need to know how the ISS can be
retargeted and how much time must be spent in retargeting. As an
additional constraint, design exploration needs a fast simulation
engine because of the huge amount of time spent by the end-user during
application development. In the case of the simulation engine described
before, reconfiguration costs are tightly coupled with the time spent
compiling the application. In fact, when the hardware part of an
application is compiled, the compilation flow automatically provides
the simulation library which represents the functional model of the
current application, and the amount of time required can be estimated
at a few minutes.

Algorithm          Metric       LISA-System   LISA-core    GDB
FDCT               #CK Cycles   11572872      8225712      7028208
                   Sim. Time    160 sec.      7 sec.       7 sec.
IDCT               #CK Cycles   18492536      15279930     14237048
                   Sim. Time    300 sec.      14 sec.      13 sec.
Quantization       #CK Cycles   33727336      25929123     21899910
                   Sim. Time    528 sec.      24 sec.      21 sec.
VLC                #CK Cycles   34663765      34132828     28611213
                   Sim. Time    532 sec.      33 sec.      27 sec.
Motion Estimation  #CK Cycles   815167602     805077673    695172321
                   Sim. Time    210 min.      754 sec.     650 sec.
MPEG-2 Encoder     #CK Cycles   978423450     920866413    795349022
                   Sim. Time    260 min.      845 sec.     747 sec.
IDEA               #CK Cycles   84682673      84682662     76556209
                   Sim. Time    22.4 min.     79 sec.      69 sec.
CRC                #CK Cycles   11779         11295        10245
                   Sim. Time    1.15 sec.     0.5 sec.     0.5 sec.
Table 5.1: Simulation results (without PiCoGA)
In order to trace results, the performance of three simulation engines
has been evaluated, the first one based on GNU GDB and two based on the
LISA ISS, all featuring dynamic instruction set extension through the
Griffy-generated emulation library. The GDB model takes into account
both the reconfigurable unit latency and the maximum issue delay of
each pgaop, but does not consider processor internal stalls. The
LISA-core adds an accurate evaluation of processor stalls due to the
internal pipeline and the reconfigurable unit timing model described in
the previous section. The LISA-System simulator is the most accurate
engine: it integrates LISA-core, modelling both contention on the bus
architecture and the latency of the memory hierarchy, described in
System-C.
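As a rough cross-check of the figures in Table 5.1, the simulation throughput of the two LISA-based engines can be derived from the MPEG-2 encoder row by dividing the simulated clock cycles by the simulation time. The calculation below is only a back-of-the-envelope estimate based on those two table entries.

```latex
\begin{align*}
\text{LISA-core:}   &\quad \frac{920\,866\,413\ \text{cycles}}{845\ \text{s}}
                     \approx 1.09 \times 10^{6}\ \text{cycles/s},\\
\text{LISA-System:} &\quad \frac{978\,423\,450\ \text{cycles}}{260 \times 60\ \text{s}}
                     \approx 6.3 \times 10^{4}\ \text{cycles/s},\\
\text{ratio:}       &\quad \frac{1.09 \times 10^{6}}{6.3 \times 10^{4}} \approx 17.
\end{align*}
```

For this benchmark, adding the system-level model therefore slows the simulation down by a factor of roughly 17.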
Algorithm          Metric       LISA-System   LISA-core    GDB
FDCT               #CK Cycles   6354872       4446330      3936240
                   Sim. Time    94 sec.       5 sec.       4 sec.
IDCT               #CK Cycles   13097695      13097695     10207483
                   Sim. Time    186 sec.      13 sec.      10 sec.
Quantization       #CK Cycles   21437286      12662704     10929703
                   Sim. Time    324 sec.      13 sec.      10 sec.
VLC                #CK Cycles   24303819      24194040     21686592
                   Sim. Time    340 sec.      25 sec.      22 sec.
Motion Estimation  #CK Cycles   127151165     120938635    100900354
                   Sim. Time    46 min.       123 sec.     104 sec.
MPEG-2 Encoder     #CK Cycles   239064880     192245109    163166038
                   Sim. Time    90 min.       270 sec.     168 sec.
IDEA               #CK Cycles   31947675      31947688     28998568
                   Sim. Time    490 sec.      34 sec.      26 sec.
CRC                #CK Cycles   8204          8198         5637
                   Sim. Time    1 sec.        0.4 sec.     0.3 sec.
Table 5.2: Simulation results (with PiCoGA)
Referring to the XiRisc model, Tables 5.2 and 5.1 show the performance
results achieved by the ISS running on a 900 MHz Sun Sparc UltraIII
workstation, with and without the reconfigurable unit (PiCoGA in the
case of XiRisc) emulation engine, compared to the functional-only
GDB-based simulation engine running on a Linux workstation with an
Athlon XP2000+. The LISA ISS environment shows a computational
capability of about 1 MIPS, while the integrated system model computes
about 50,000 clock cycles per second, reducing the overall simulation
engine performance by up to 20 times. The simulation environment has
been mainly benchmarked using the MPEG-2 encoding of 12 QCIF frames
(176x144), requiring about 1 billion clock cycles to be executed. As a
reference, an HDL simulation engine runs at about 1000 clock/sec
without system-level integration and about 400 clock/sec with the
memory hierarchy. The LISA-System cycle-accurate instruction set
simulator runs about three orders of magnitude faster than HDL
simulation, while the LISA-core gains up to five orders of magnitude
with respect to HDL simulation time.
Chapter 6
Application development on reconfigurable processors
I’m not a bit-level programmer,
but a right-level programmer!
(Mario Toma, STMicroelectronics)
Reconﬁgurable hardware accelerators are the strong point of reconﬁg-
urable processors if compared to general purpose processors (GPPs). Ef-
ﬁcient programming of reconﬁgurable architectures often implies ﬁnding
the best HW/SW partitioning of the C program execution between the
standard hardwired functional units (SW) and the hardware accelerators
(HW). Unlike C-to-FPGA synthesis which translates a whole C program
to hardware and therefore needs to fully support all C constructs (arith-
metic and logical operations, memory access, etc. [43, 38]), the compila-
tion ﬂow for a reconﬁgurable processor translates to programmable logic
only the program sections that may beneﬁt most and can be implemented
as hardware, while the rest of the algorithm is executed on the hardwired
functional units. Hence restrictions on C constructs that can be mapped
onto HW do not compromise the overall system capabilities. Finding the
optimal HW/SW partitioning for a given reconﬁgurable architecture is a
very complex task for a software tool, and hence the Griffy development
environment currently provides support only for manual partitioning. As
a general rule, control or data management is better suited to the
sequential elaboration of the CPU core, while computational kernels
with higher instruction level parallelism (ILP) or prevalent bit-level
operations benefit most from a hardware implementation.

Word32 L_mac (Word32 L_var3,
              Word16 var1, Word16 var2)
{
    Word32 L_var_out, L_produit;
    L_produit = L_mult(var1, var2);
    L_var_out = L_add(L_var3, L_produit);
    return(L_var_out);
}
Word32 L_mult(Word16 var1, Word16 var2)
{
    Word32 L_var_out;
    L_var_out = (Word32)var1 * (Word32)var2;
    if (L_var_out != (Word32)0x40000000L) {
        L_var_out *= 2L;
    } else {
        Overflow = 1;
        L_var_out = MAX_32;
    }
    return(L_var_out);
}
Word32 L_add(Word32 L_var1, Word32 L_var2)
{
    Word32 L_var_out;
    L_var_out = L_var1 + L_var2;
    if (((L_var1 ^ L_var2) & MIN_32) == 0L) {
        if ((L_var_out ^ L_var1) & MIN_32) {
            L_var_out = (L_var1 < 0L) ?
                        MIN_32 : MAX_32;
            Overflow = 1;
        }
    }
    return(L_var_out);
}
[Data-flow graph: in L_mult, var1 and var2 feed a multiplier (*)
followed by saturation (Sat); in L_add, the result and L_var3 feed an
adder (+) followed by saturation (Sat), also producing Overflow; the
selected cluster is highlighted]
Figure 6.1: Case study: saturating MAC for low bit-rate audio compression
Let us consider, as an example, the case of the saturating
multiply-and-accumulate (MAC) used in many low bit-rate audio
compression applications, implemented on the XiRisc reconfigurable
processor. Typical code is shown in Fig. 6.1. Since the PiCoGA is not
well suited to implementing large multipliers, the initial
multiplication is performed on the processor, while the other
operations (the saturation of the multiplication and the saturating
sum) are “clustered” into a single operation performed on the PiCoGA.
The selected code is highlighted in grey in Fig. 6.1 and a simplified
data-flow graph is shown.
Fig. 6.2 shows the Griffy-C code implementing the selected kernel and
the resulting mixed HW/SW saturating multiplication. The Griffy-C code
is enclosed between “pragma” directives, and additional pragma
directives are used to manage the size of each variable at the bit
level. In addition, the SAT flag enables a bit of overflow information
to be extracted directly from an arithmetic operation. Control
statements are dismantled into conditional assignments mapped
one-to-one to multiplexers with 2 input words. The corresponding
pipelined DFG is also represented: nodes are aligned per pipeline
stage, thus showing the 4 resulting stages. On the processor side, the
PiCoGA operation is triggered as a function-like call.

#pragma picoga kernel_L_add_mux 2 3
    L_var_out overflow           // Output list
    L_var1 L_var2 overflow1      // Input  list
{
    int L_sum;
#pragma attrib L_sum SAT
    int Lvar1, Lvar1_tmp, Lvar2, Lvar2_tmp;
    int Lvar2_out_tmp, L_var_out_tmp1;
    unsigned char cond, cond1, overflow_a;
#pragma attrib cond, cond1, overflow_a SIZE=1
    Lvar2 = L_var2;
    Lvar2_tmp = Lvar2 << 1;
    cond = Lvar2 != 0x40000000;
    overflow_a = cond ? overflow1 : 1;
    Lvar2_out_tmp = cond ? Lvar2_tmp : 0x7fffffff;
    Lvar1 = L_var1;
    L_sum = Lvar1 + Lvar2_out_tmp;
    cond1 = Lvar1 < 0;
    L_var_out_tmp1 = cond1 ? 0x80000000 : 0x7fffffff;
    L_var_out = L_sum(overflow) ? L_var_out_tmp1 : L_sum;
    overflow = L_sum(overflow) ? 1 : overflow_a;
}
#pragma end
Word32
L_mac(Word32 L_var3, Word16 var1, Word16 var2)
{
    Word32 L_var_out;
    L_var_out = (Word32)var1 * (Word32)var2;
    kernel_L_add_mux ( L_var_out, Overflow,
                       L_var3, L_var_out, Overflow );
    return(L_var_out);
}
[Pipelined data-flow graph of kernel_L_add_mux: one node per Griffy-C
statement, labelled with its operator (=, <<, !=, <, +, ? :), aligned
per pipeline stage]
Figure 6.2: Case study: Griffy-C code for saturating arithmetic
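To make the behaviour of the Griffy-C kernel of Fig. 6.2 easier to follow, the sketch below restates it as a plain C reference model. It is not the emulation function produced by the Griffy tools: the name `kernel_L_add_mux_model` and the pointer-based output convention are assumptions made for this example, and the overflow bit that Griffy-C obtains through the SAT attribute is computed here by an explicit sign comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model: L_var1 is the accumulator, L_var2 the raw 16x16 product,
 * overflow1 the incoming overflow flag. Conversions from uint32_t back to
 * int32_t assume the usual two's complement wrap-around. */
static void kernel_L_add_mux_model(int32_t *L_var_out, int *overflow,
                                   int32_t L_var1, int32_t L_var2,
                                   int overflow1)
{
    assert(L_var_out != NULL && overflow != NULL);

    /* Saturating doubling of the product (second half of L_mult). */
    int cond = (L_var2 != 0x40000000);
    int32_t doubled = (int32_t)((uint32_t)L_var2 << 1);
    int overflow_a = cond ? overflow1 : 1;
    int32_t addend = cond ? doubled : INT32_MAX;

    /* Saturating addition (L_add): wrap-around sum plus overflow bit. */
    int32_t sum = (int32_t)((uint32_t)L_var1 + (uint32_t)addend);
    int sum_ovf = ((L_var1 < 0) == (addend < 0)) &&
                  ((L_var1 < 0) != (sum < 0));
    int32_t saturated = (L_var1 < 0) ? INT32_MIN : INT32_MAX;

    /* Final multiplexers, as in the last two Griffy-C statements. */
    *L_var_out = sum_ovf ? saturated : sum;
    *overflow = sum_ovf ? 1 : overflow_a;
}
```

Each conditional expression corresponds to one two-input multiplexer of the DFG, which is why the control flow of the original L_mult and L_add functions disappears in the kernel.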
Fig. 6.3 summarizes the optimization process. While (a) is the starting
point and (b) represents the partitioning step, (c) corresponds to the
final optimization. Analysis of the basic implementation in (b) shows
that the memory accesses providing data, the multiplication and the
PiCoGA operation are cascaded in such a way that neither the pipelining
inside the array nor the concurrent elaboration between PiCoGA and
processor can be exploited. However, by applying a loop transformation
based on standard software pipelining methodology [61, 59], it is
possible to compose a loop body (shown in (c)) where the PiCoGA
operation and the processor code run concurrently, since they are
working on data referring to successive loop iterations. This is a
first optimization step, which exploits the concurrency between the
processor core and the PiCoGA, but further improvements can be achieved
by applying loop unrolling or similar methods capable of enhancing the
PiCoGA hardware pipelining as well [60].

a) Starting code
for (i = 0; i < lg; i++)
{
    s = L_mult (x[i], a[0]);
    for (j = 1; j <= m; j++)
    {
        s = L_mac (s, a[j], x[i - j]);
    }
    s = L_shl (s, 3);
    y[i] = round (s);
}
b) Code partitioning and kernel substitution
for (i = 0; i < lg; i++)
{
    s = L_mult (x[i], a[0]);
    for (j = 1; j <= m; j++)
    {
        tmp = a[j] * x[i-j];
        kernel_L_add_mux ( s, Overflow,
                           s, tmp, Overflow );
    }
    s = L_shl (s, 3);
    y[i] = round (s);
}
c) Code optimization via software pipelining
for (i = 0; i < lg; i++)
{
    s = L_mult (x[i], a[0]);
    tmp = a[1] * x[i-1];               /* prologue: first product */
    for (j = 1; j <= m-1; j++)
    {
        kernel_L_add_mux ( s, Overflow,
                           s, tmp, Overflow );
        tmp = a[j+1] * x[i-(j+1)];     /* product for the next kernel call */
    }
    kernel_L_add_mux ( s, Overflow,
                       s, tmp, Overflow );
    s = L_shl (s, 3);
    y[i] = round (s);
}
Figure 6.3: Case study: software pipelining across processor and PiCoGA
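The loop transformation from (b) to (c) in Fig. 6.3 can be checked with a small self-contained C sketch. To keep it short, the PiCoGA kernel is replaced by a plain, non-saturating accumulation (`acc_kernel`, a hypothetical stand-in), and the final shift and rounding are omitted. The two functions compute the same filter, but in the pipelined version each product is computed one step ahead of the kernel call that consumes it, so the two could overlap in time.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the PiCoGA kernel: plain accumulation, no saturation. */
static int32_t acc_kernel(int32_t s, int32_t tmp)
{
    return s + tmp;
}

/* Straightforward loop, as in the partitioned code: product, then kernel.
 * x must provide m valid samples before x[0] (indices -m .. lg-1). */
static void filter_plain(const int16_t *x, const int16_t *a,
                         int m, int lg, int32_t *y)
{
    for (int i = 0; i < lg; i++) {
        int32_t s = (int32_t)x[i] * a[0];
        for (int j = 1; j <= m; j++) {
            int32_t tmp = (int32_t)a[j] * x[i - j];
            s = acc_kernel(s, tmp);
        }
        y[i] = s;
    }
}

/* Software-pipelined loop: the kernel consumes product j while the
 * processor already computes product j + 1. */
static void filter_pipelined(const int16_t *x, const int16_t *a,
                             int m, int lg, int32_t *y)
{
    assert(m >= 1);
    for (int i = 0; i < lg; i++) {
        int32_t s = (int32_t)x[i] * a[0];
        int32_t tmp = (int32_t)a[1] * x[i - 1];          /* prologue */
        for (int j = 1; j <= m - 1; j++) {
            s = acc_kernel(s, tmp);                      /* uses product j */
            tmp = (int32_t)a[j + 1] * x[i - (j + 1)];    /* product j + 1  */
        }
        s = acc_kernel(s, tmp);                          /* epilogue */
        y[i] = s;
    }
}
```

In sequential C the two statements in the inner loop still execute one after the other; the benefit appears on the target, where the kernel call is issued to the reconfigurable unit and the multiplication proceeds on the processor in the meantime.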
For a wide spectrum of current embedded applications, the typical
kernels are located in the core of innermost loops [57], which can
usually be described with traditional data-flow graphs. Very
significant speed-ups can be achieved by pipelining successive loop
iterations in the case of acyclic graphs [59]. This software
compilation technique [61] has been shown to be an effective means of
increasing the instruction level parallelism without increasing the
code size for highly parallel processor architectures (e.g.,
super-scalar processors with 8 or more data channels). Reconfigurable
processor architectures can also directly exploit more traditional
hardware pipelining [60]. In fact, the innermost loop can be mapped (or
clustered), for example, onto a single Griffy operation and then
executed overlapping successive iterations depending on the issue
delay. Nevertheless, several applications do not allow the clustering
of the whole innermost loop because either the whole loop does not fit
in the reconfigurable device or it features instructions ill-suited to
the reconfigurable device (such as the multiplication in the previous
example). Furthermore, for many architectures, memory is accessed only
by the processor core (reducing cache coherency problems), which is
thus responsible for feeding data to the reconfigurable device and
storing data after the elaboration. Let us consider the following
example:
// C source code
for (i=0; i<64; i++) {
tmp1 = f1(v[i]);
tmp2 = tmp1 * w[i];
out[i] = f2 (tmp2,z[i]);
}
# Pseudo-ASM code
loop:
reg3 = load (v[i]);
reg4 = load (w[i]);
reg5 = load (z[i]);
tmp1 = f1 (reg3);
tmp2 = mult (tmp1, reg4);
reg6 = f2 (tmp2, reg5);
out[i] = store (reg6);
loop i;
where f1 and f2 are computations suitable for the reconﬁgurable ac-
celerator, while the multiplication is an example of an operation that is
usually better implemented in a hardwired functional unit. Load/store
operations are commonly implemented in the processor core, as in the
case of XiRisc. This is an example of an acyclic graph in which a mixture
of software pipelining and clustering can improve the performance of the
application with respect to software pipelining or clustering taken alone.
Superscalar processors apply software pipelining to increase instruction
level parallelism up to the limit determined by the available processor
resources. Traditional reconfigurable processors cluster assembly
instructions, move the whole computation to the hardware and wait for
the results, for two reasons: communication overheads often make it
impractical to spread the computation over a mix of processor and
configware resources, and configuration tools often discard kernels that
contain unusual operators. In the “reconfigurable” computation pattern,
instead, we can map f1 and f2 as concurrent sub-graphs of the same
Griffy operation and then apply a software pipelining schedule to the
resulting code:
# Pseudo-ASM code
# SW-pipelining prologue
loop:
reg3 = load (v[i]);
reg4 = load (w[i-1]); # pairs with the old tmp1, i.e. f1(v[i-1])
reg5 = load (z[i-1]);
tmp2 = mult (tmp1, reg4); # tmp1 is the old value
out[i-2] = store (reg6);
# PiCoGA-operation implements both f1 and f2
tmp1,reg6 = PiCoGA-operation (reg3,tmp2,reg5);
loop i;
# SW-pipelining epilogue
The execution of the Griffy-operation can be concurrent to the proces-
sor computation and it can overlap both loop instructions and memory
access. Furthermore, sub-graphs f1 and f2 work with data referred to
different loop iterations in the original code, thus the concurrent execu-
tion of f1, f2 and the multiplication can be achieved. As shown in this
example, applicable for example to the XiRisc architecture, it is possible
to design “around” the processor functional units an additional functional
unit implemented as a Griffy operation, which is highly speciﬁc for the
kernel.
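The schedule above can be sketched in plain C as follows. This is only
an illustrative model, not the code produced by the Griffy flow: f1 and
f2 are hypothetical placeholder functions standing in for the two
sub-graphs, the combined PiCoGA operation is modeled by two consecutive
statements, and the names kernel_ref and kernel_swp are ours. The sketch
makes the prologue and the epilogue explicit and assumes at least two
iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two sub-graphs mapped on the array. */
static uint32_t f1(uint32_t a) { return (a << 1) ^ 0x5Au; }
static uint32_t f2(uint32_t a, uint32_t b) { return a + (b >> 1); }

/* Reference loop: out[i] = f2(f1(v[i]) * w[i], z[i]). */
void kernel_ref(const uint32_t *v, const uint32_t *w, const uint32_t *z,
                uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = f2(f1(v[i]) * w[i], z[i]);
}

/* Software-pipelined version: in each pass the core multiplies and
 * stores data of older iterations, while one combined operation
 * computes f2 of iteration i-1 and f1 of iteration i. */
void kernel_swp(const uint32_t *v, const uint32_t *w, const uint32_t *z,
                uint32_t *out, size_t n)
{
    assert(n >= 2);
    uint32_t tmp1, tmp2, reg6;

    /* Prologue: fill the pipeline with iterations 0 and 1. */
    tmp1 = f1(v[0]);
    tmp2 = tmp1 * w[0];
    reg6 = f2(tmp2, z[0]);          /* f2 of iteration 0 ...          */
    tmp1 = f1(v[1]);                /* ... together with f1 of iter 1 */

    for (size_t i = 2; i < n; i++) {
        tmp2 = tmp1 * w[i - 1];     /* multiply of iteration i-1      */
        out[i - 2] = reg6;          /* store of iteration i-2         */
        reg6 = f2(tmp2, z[i - 1]);  /* combined op: f2 of iter i-1    */
        tmp1 = f1(v[i]);            /*              f1 of iter i      */
    }

    /* Epilogue: drain the last two iterations. */
    tmp2 = tmp1 * w[n - 1];
    out[n - 2] = reg6;
    out[n - 1] = f2(tmp2, z[n - 1]);
}
```

Both functions fill out[] with the same values. In the pipelined
version, the combined operation issued at the end of one pass works on
two different iterations of the original loop, so on the real device it
can overlap with the loads and the loop control of the next pass.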
For run-time reconfigurable devices, like the PiCoGA, it is possible to
provide dedicated functional units for as many kernels as can be found
in an application, thus truly realizing the dynamic reconfiguration
concept. The high degree of reusability, together with the close
interaction with the standard processor datapath, allows the
reconfigurable device to be smaller than the FPGAs used in other
approaches, where several entire kernels need to be mapped in the device
at the same time.
On the other hand, performance optimization on a reconfigurable
processor requires a tightly coupled co-design of the processor core and
the reconfigurable device. In order to avoid stalls and to improve the
instruction level parallelism, the algorithm developer needs to manually
tune the 'C' code of the accelerated kernel, up to the limits set by the
application data dependencies. In this context, the user often needs to
be aware of the DFG abstraction that is implemented at the intermediate
level by the Griffy-C description, and must handle the data flow across
the reconfigurable device accurately in order to obtain considerable
performance improvements. It is important to notice that the
optimization step is managed only at the DFG level, without any need for
deep knowledge of the underlying architecture or circuit implementation.
The Griffy-C language also provides the user with a means to describe
pipelines at a deep level of detail, without requiring skills in
hardware design. For these reasons we stated in the introduction that
reconfigurable processors represent the natural extension of DSPs, among
other things as far as algorithm development methodologies are
concerned.
6.1 Reconﬁgurable software development time:
hardware and software approaches
Depending on the constraints and on the type of algorithm, the imple-
mentation of an application on a reconﬁgurable processor can follow var-
ious different strategies, which lead to different trade-offs between per-
formance and the development effort required. Hardware/software co-
design for these architectures provides an additional dimension to the tra-
ditional design space for DSPs. Quality of service as well as real-time spec-
iﬁcations can be used to select the implementation strategy and evaluate
the development effort required.
Two main orthogonal factors can influence the time required to develop
an application on reconfigurable processors:
• Algorithm modifications are applied. This is the case when the
algorithm uses a description that is not well suited to the
implementation on the target device, or when a different description
provides significant performance improvements. The second case
includes, for example, the description of non-standard operators
(such as Galois Field arithmetic), which achieve an important
improvement when synthesized at the LUT level, while pure C code
proves particularly inefficient. This approach is often referred to
as the hardware approach, since it is oriented to optimization in a
way similar to the optimized design of application-specific circuits.
• Accurate scheduling is applied. In this case the algorithm
implementation is efficiently described in the C language by means of
high-level arithmetic/logic operations. This approach is also
referred to as the software approach, since it is similar to the
approaches used to optimize software programs. The design space
exploration performed at this point consists of:
– clustering basic operations to compose a Griffy operation,
including the possibility of grouping clusters of concurrent or
independent sub-graphs;
– scheduling the execution (e.g. through software pipelining)
between the processor core and the reconfigurable device in an
efficient way, thus avoiding stalls due to long-latency
instructions or memory accesses. This is, however, a common task
that DSP developers are used to undertaking in order to exploit
the dedicated functional units provided in the processor
data-path.
Proﬁciency in co-design is the ﬁrst key-point in order to obtain a re-
duction in the time-to-develop, given the expected performance. The time
spent on application partitioning is often the dominant task, because this
is an iterative task which includes analysis, description, validation and
performance estimation. Acquired experience on application analysis for
DSPs could help one early on to discard several solutions, thus focusing
the design space exploration on a few signiﬁcant options. Depending on
both the ability of the user and the degree of modiﬁcation that we want
to introduce, the implementation process could take a long time, which is
only justiﬁed if an appropriate performance improvement is achieved.
At first glance, programming a reconfigurable processor could appear as
time-consuming as ASIC design, requiring expertise in both the
application and the architecture as well as uncommon skills
(partitioning, pipelining, and so on). However, compared to a
traditional hardware design flow, the adopted development methodology
does not require one to handle timing, critical paths, clock
synchronization, or other hardware-specific features, which are
extremely time-consuming in an HDL-to-silicon flow. In the case of
Griffy, the framework in which the reconfigurable operations are
described is strongly sequential; the working frequency of a Griffy
operation is assumed constant, since a reasonable worst-case condition
is assumed (as in the case of PiCoGA); and performance improvements are
achieved by managing the structure of the DFG pipeline through the data
dependencies directly described in the Griffy-C intermediate format. We
can thus say that performance tailoring is achieved at the DFG level,
simplifying the user's approach to the design space.
Figure 6.4: Variation of %speed-up with respect to the local speed-up
and the kernel time fraction
In general, the development approach should be chosen according to the
expected performance, the criticality of the kernel and the cost in
terms of development time. For example, the long development time
required by a hardware approach on a marginally important task, which
accounts for only a small percentage of the execution time, is probably
not justified by the achievable performance improvement. The impact of
the optimization of the i-th kernel depends on the fraction of time
$t_i$ that the kernel takes with respect to the whole application and on
the local speed-up $S_i$, according to the formula:
$$ S_{tot} = \frac{1}{(1 - t_i) + \dfrac{t_i}{S_i}} $$
As the local speed-up increases, the overall performance improvement
tends to depend only on the time $t_i$, which gives an upper bound of
$1/(1 - t_i)$ to the performance. Fig. 6.4 shows the percentage of
overall speed-up gained with respect to the local speed-up, for a set of
different computational weights $t_i$. The impact on overall performance
saturates rapidly, especially for kernels with a small $t_i$.
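For real applications, the computational load is commonly distributed
over N kernels, and the speed-up is then determined by the formula:
$$ S_{tot} = \frac{1}{\left(1 - \sum_{i=1}^{N} t_i\right) + \sum_{i=1}^{N} \dfrac{t_i}{S_i}} $$
where $t_i$ is the fraction of time taken by the i-th kernel and $S_i$
is its local speed-up.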
When a designer starts the implementation, he/she decides how much
performance improvement is required to match the constraints, and needs
to estimate how much development effort is necessary. Is it necessary to
optimize all the critical kernels as much as possible? Or is it
possible, by mixing hardware and software approaches, to focus the
development time on only a few very critical kernels and still achieve
near-optimal performance? These are two important questions for the
engineering of reconfigurable systems. To better explain the point, let
us consider an example in which 90% of the computational load is spread
over 10 comparable kernels (9% per kernel).
Figure 6.5: Variation of speed-up with respect to the number of
optimized kernels and the local speed-up
Fig. 6.5 shows a two-dimensional design space exploration obtained by
considering a variable number of optimized kernels and different
(homogeneous) local speed-ups. It is possible to see that a reduced
local speed-up applied to all the kernels achieves better performance
than the strong optimization of a few parts. Moreover, small local
speed-ups can usually be obtained with a software approach, thus
reducing the design time. This consideration can be extended to real
cases, although application-specific parameters could bias the final
result and the choice of the best development strategy. For a
quantitative analysis, please refer to the next chapter, where
experimental results will be provided.
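The trade-off discussed above can be checked numerically with a few
lines of C. The sketch below evaluates the overall speed-up obtained
when each kernel i takes a fraction t[i] of the original execution time
and is accelerated by a local factor s[i], which is the usual
Amdahl-style accounting. The function names and the ten-kernel helper
are ours and only reproduce the 10 x 9% scenario of the example; the
numbers are not measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Overall speed-up when kernel i takes a fraction t[i] of the original
 * run time and is accelerated by a local factor s[i]. */
double overall_speedup(const double *t, const double *s, size_t n)
{
    double untouched = 1.0;    /* share of run time left as it is      */
    double accelerated = 0.0;  /* share spent in kernels after speed-up */

    for (size_t i = 0; i < n; i++) {
        assert(t[i] >= 0.0 && s[i] > 0.0);
        untouched -= t[i];
        accelerated += t[i] / s[i];
    }
    return 1.0 / (untouched + accelerated);
}

/* Scenario of the example: ten kernels of 9% each; the first k kernels
 * get the same local speed-up, the others are left unchanged. */
double ten_kernel_case(size_t k, double local)
{
    double t[10], s[10];

    assert(k <= 10);
    for (size_t i = 0; i < 10; i++) {
        t[i] = 0.09;
        s[i] = (i < k) ? local : 1.0;
    }
    return overall_speedup(t, s, 10);
}
```

For instance, ten_kernel_case(10, 2.0) gives an overall speed-up of
about 1.82, while ten_kernel_case(3, 1000.0) stays below 1.37: a modest
local speed-up on every kernel beats a very aggressive optimization of
three kernels only.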
[Figure 6.6: diagram labels: frame sequence, motion vector, search
window, reference block, MAD block]
Figure 6.6: Motion estimation
6.2 Example of application mapping
6.2.1 MPEG-2 motion compensation on the XiRisc processor
Compression techniques are successfully employed in order to reduce the
volume of transmitted data in audio/video communication devices. In
particular, in the case of real video sequences (e.g. video
conferencing), successive frames often have similar or identical content
that differs mainly because of the motion of subjects or objects. This
introduces a high degree of correlation and a temporal redundancy that
can be exploited using differential coding techniques [99]. MPEG video
coding standards improve the compression scheme with motion-compensated
prediction. Each frame is subdivided into blocks (or macro-blocks) of
typically 16x16 pixels, and motion vectors are estimated by searching
consecutive frames for the block at minimum distance. The MPEG Software
Simulation Group [100] provides a public release of its MPEG-2 encoder,
compliant with ISO/IEC 13818-2 [101], in which motion estimation is
computed using an exhaustive search pattern. The minimum absolute
difference (MAD, or L1 matching criterion) is determined among all
blocks in a search window around the reference block (see Fig. 6.6).
Table 6.1: MPEG-2 computation-aware analysis
Algorithm                     Clock Cycles    %
Motion Estimation             696144419       87%
Fast DCT                      10820304        1.4%
Inverse DCT                   14260740        1.8%
Prediction                    6675078         0.8%
Quantization                  12554590        1.5%
Inverse Quantization          9331982         1.3%
Variable-Length Coding        5260986         0.6%
Bitstream packing             30397513        3.8%
Communication among tasks     14445511        1.8%
Total                         799891123       100%
This search approach, known as full-search motion estimation, finds the
absolute minimum within the search window, but at a very high
computational cost. A computation-aware analysis of the MPEG-2 encoding
engine is reported in Tab. 6.1. It refers to a 12-frame sequence (1
group of pictures, GOP) with a resolution of 176x144 pixels (QCIF
standard), encoded on a VLIW RISC processor fetching 2 instructions per
clock. The analysis shows that a largely dominant share of the
computation time is spent in the motion estimation engine (about 90%),
and specifically that a significant contribution is due to the
measurement of the distance between pairs of macro-blocks. This L1
matching criterion is a Sum of Absolute pixel-to-pixel Differences (SAD),
as described by the
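following formula:
$$ \mathrm{SAD}(d_x, d_y) = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| R(i, j) - C(i + d_x,\, j + d_y) \right| \qquad (6.1) $$
where $R$ is the reference macro-block, $C$ is the search window and
$(d_x, d_y)$ is the displacement of the candidate macro-block.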
Macro-blocks are usually squares of 16x16 pixels, thus requiring 256 sums
of absolute differences repeated $W^2$ times, where W is the search
window width in pixels. The dimension of the search window is thus
defined by a trade-off between computational complexity and the
compression factor. Standard general-purpose embedded processors can
hardly meet quality standards because of these computational
requirements. For this reason, several categories of multimedia
processors have been proposed [102] in order to fill this efficiency
gap. In fact, general-purpose processors are inefficient when the
Instruction Set Architecture (ISA) is not well suited to the task, or
when the amount of algorithmic parallelism in the application is greater
than their capability to exploit it. On the other hand, Application
Specific Integrated Circuits (ASICs) are optimized for the required task
and offer excellent results in terms of speed and energy consumption,
but their use implies non-recurring engineering (NRE) costs that are
often hard to justify in the application environment.
Reconfigurable computing appears to be a cost-effective,
high-performance solution for data-intensive applications featuring deep
pipelining and high concurrency. The key issue for the algorithm
developer is the exploration of the design space in order to match
application requirements with computational capabilities, thus
determining the optimal partitioning of the task between hardware (the
reconfigurable logic) and software (the processor core) resources. For
example, multiplications and control-flow statements usually show no
particular advantage in space-based computation, while bit-wise logic
and concurrent multiple-data arithmetic may offer significant
improvements when implemented on an appropriately programmed FPGA-based
device.
Full-search motion estimation performs an exhaustive comparison between
all macro-blocks in the search window and a reference block. A possible
option to reduce the complexity of the pixel-to-pixel distance
measurement is to stop the distance computation as soon as the current
value exceeds the minimum distance found so far. It is also possible to
obtain a computational advantage by choosing a suitable search path. In
low-motion sequences, such as in a video conferencing environment, a
locality criterion can be introduced by following a spiral search path,
as shown in Fig. 6.7.
[Figure 6.7: diagram labels: 1-pixel steps, search window bound,
reference macroblock]
Figure 6.7: Search path
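The early-termination option mentioned above can be sketched as follows.
This is a minimal software-only illustration, not the code of the MPEG
Software Simulation Group encoder: the function name, the row-by-row
check and the assumption that both blocks share the same line stride are
ours.

```c
#include <stdint.h>

#define MB_SIZE 16  /* macro-block side in pixels */

/* SAD between a reference and a candidate 16x16 macro-block, both
 * stored with the same line stride. The partial sum is checked after
 * every row: once it reaches best_so_far the candidate cannot improve
 * on the current minimum and the partial value is returned at once. */
uint32_t sad16x16_bounded(const uint8_t *ref, const uint8_t *cand,
                          int stride, uint32_t best_so_far)
{
    uint32_t sad = 0;

    for (int row = 0; row < MB_SIZE; row++) {
        for (int col = 0; col < MB_SIZE; col++) {
            int d = (int)ref[col] - (int)cand[col];
            sad += (uint32_t)(d < 0 ? -d : d);
        }
        if (sad >= best_so_far)
            return sad;           /* early exit */
        ref += stride;
        cand += stride;
    }
    return sad;
}
```

A caller would keep the smallest value returned so far and pass it as
best_so_far for the next candidate along the spiral path, so that
unpromising candidates are abandoned after a few rows.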
This notwithstanding, the most significant algorithmic contribution in
terms of complexity and time cost remains the computation of the
distance between two given macro-blocks. For 16x16 pixel macro-blocks,
computing this distance requires 256 sums, 256 differences and 256
absolute value operations (ABSs). On standard processors without an
application-specific ABS hardware unit, the absolute value must be
computed with an “if-then-else” conditional structure and comparisons,
which severely increases the computational requirements.
The number of assembly instructions spent for each distance measurement
is around 1000, because the absolute difference computed through
conditional statements requires roughly four instructions per pixel, as
shown in the following assembly code (where r2 and r3 hold the currently
loaded pixels), repeated for all the 256 pixels.
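subu r4,r3,r2         # r4 = r3 - r2
bgez r4,$L1           # skip the negation if the difference is >= 0
subu r4,r0,r4         # r4 = 0 - r4 (r0 assumed to be the zero register)
$L1: addu r10,r10,r4  # accumulate |r3 - r2| into r10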
This computational kernel has been shown to be critical through a
profiling-based analysis of the MPEG-2 encoding of a 12-frame sequence.
In this test case, the motion estimation phase accounts for about 90% of
the overall computation time, and more than 70% of that is spent in the
distance measurement function. In the case of the XiRisc processor
architecture, a first optimization step may be to use the reconfigurable
logic to implement the absolute difference (AD) operation. The
space-based computational pattern typical of hardware-oriented
applications allows one to enhance the degree of parallelism in the
computation using the graph shown in Fig. 6.8.
[Figure 6.8: diagram labels: inputs A and B, A − B, B − A, sign, 0/1
multiplexer]
Figure 6.8: Absolute Difference (AD) DFG
Since pixels are described using 8-bit unsigned variables, the area
required for the implementation of this graph in an FPGA-like
architecture or in the PiCoGA unit is extremely small. In fact, the
Absolute Difference DFG mapped on the PiCoGA requires about 4% of the
cells of the gate array. Still, the computational density achieved using
this 8-bit pattern is small: the processor datapath is used at 25% and
the array at under 4% of its potential computing power. The
under-utilization of the PiCoGA suggests an investigation of
single-instruction-multiple-data (SIMD) computation patterns,
implementing four concurrent absolute difference computations at a time.
In this case, the data transferred through the register file are
integers packing four pixels and the array utilization goes up to 16%.
It is also possible to embed the sum of these four absolute differences
in the PiCoGA using a balanced logarithmic tree scheme, as illustrated
in Fig. 6.9, thus maximizing instruction level parallelism and reducing
both the latency and the issue delay of the graph. If the graph does not
show dependencies across pipeline stages, the issue delay is minimal,
and it is then possible to overlap successive computations of
long-latency PiCoGA instructions, significantly increasing the
throughput.
[Figure 6.9: diagram labels: four AD nodes feeding a two-level adder
tree]
Figure 6.9: Concurrent 4-pixel Sum of Absolute Differences
[Figure 6.10: diagram labels: aligned read, reference macroblock, first,
second, third and fourth macroblock compared]
Figure 6.10: Memory layout
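A bit-level software model can help to see what the 4-pixel SAD graph
computes. The sketch below is only a functional model in C, not Griffy-C
code: abs_diff8 mimics the AD graph (both differences are computed and
the sign of one of them selects the result) and sad4 adds the four
results with a balanced tree. The function names and the choice of
packing pixel 0 in the least significant byte are assumptions made for
illustration.

```c
#include <stdint.h>

/* Absolute difference of two 8-bit values as in the AD graph: both
 * A - B and B - A are computed, and the sign bit of A - B drives the
 * multiplexer that selects the non-negative one. */
static uint32_t abs_diff8(uint32_t a, uint32_t b)
{
    uint32_t a_minus_b = (a - b) & 0x1FFu;  /* 9 bits, bit 8 = sign */
    uint32_t b_minus_a = (b - a) & 0x1FFu;
    uint32_t sign = (a_minus_b >> 8) & 1u;

    return (sign ? b_minus_a : a_minus_b) & 0xFFu;
}

/* SAD of four pixel pairs packed in two 32-bit words (pixel 0 in the
 * least significant byte), summed with a two-level adder tree. */
uint32_t sad4(uint32_t a, uint32_t b)
{
    uint32_t ad0 = abs_diff8(a & 0xFFu, b & 0xFFu);
    uint32_t ad1 = abs_diff8((a >> 8) & 0xFFu, (b >> 8) & 0xFFu);
    uint32_t ad2 = abs_diff8((a >> 16) & 0xFFu, (b >> 16) & 0xFFu);
    uint32_t ad3 = abs_diff8((a >> 24) & 0xFFu, (b >> 24) & 0xFFu);

    return (ad0 + ad1) + (ad2 + ad3);  /* first level, then second level */
}
```

The four abs_diff8 calls are independent of each other, and the two
inner additions are independent as well, which is the parallelism that
the array exploits within a single operation.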
A relevant problem introduced by the exploitation of concurrent
computation on the gate array, as is the case with the SAD proposed in
Fig. 6.9, is the bottleneck caused by the number of accesses to system
memory.
The XiRisc instruction set does not support misaligned memory accesses,
so a misaligned read/write from/to data memory is handled by byte-level
load/store and packing/unpacking operations. As shown in Fig. 6.10, when
following a spiral search path only one macro-block out of four is
word-aligned; the others require four load-byte operations and byte
packing (based on constant-step shifts and bitwise ORs) in order to
build the correct inputs for the SAD operation. Operand packing
introduces a very significant overhead that strongly reduces the
achievable speed-up. In conclusion, memory access bandwidth is the
bottleneck that prevents high-throughput execution, and the processor
core thus sets an upper bound on the achievable pipelining.
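The cost of operand packing can be made concrete with a short C sketch
of what the core has to do for a misaligned group of four pixels. The
function name is ours, and placing the first pixel in the least
significant byte is an assumption made for illustration.

```c
#include <stdint.h>

/* Build the 32-bit word that starts at an arbitrary byte address by
 * means of four byte loads, three constant-step shifts and three
 * bitwise ORs. An aligned group needs a single word load instead. */
uint32_t load_packed4(const uint8_t *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Ten core operations are thus spent to prepare one input of a single SAD
operation, which explains why the packing overhead dominates when most
candidate blocks are misaligned.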
Increasing the PiCoGA input bandwidth, by performing on the gate array a
sum of absolute differences that involves more than four pairs of
pixels, would maximize the PiCoGA area utilization, but it would also
increase the overhead of the packing operations due to misaligned memory
accesses. In the next section we propose a modified implementation of
the full-search motion estimation that avoids the packing overhead and
obtains significant performance figures by following an alternative
search path.
Improved Full-Search Motion Estimation
In the previous section we have shown how to improve the computation of
the distance between two macro-blocks using PiCoGA. Unfortunately, the
proposed implementation suffers from misaligned memory accesses, which
limit the available performance gain. In this section, an alternative
approach is proposed in order to improve full-search estimation while
avoiding the impact of operand packing, using unfolding techniques that
are well known in digital signal processing.
A macro-block in the search window is aligned if each 4-byte word in the
16x16 macro-block is aligned. In this case, the SAD shown in Fig. 6.9
can be computed without any packing overhead, loading the input data
with only two memory accesses, since word-wise access is suitable for
both the reference macro-block and the macro-block under comparison.
Pixel-wise scan paths are unfortunately characterized by three
misaligned blocks out of four. In order to overcome this problem, we
have chosen a search path based on a 4-pixel-step spiral. We divided the
search window into Groups of 4x4 Macro-blocks (GMs) and we follow a
spiral path among all the GMs in the search window. Each GM is then
internally scanned by rows, as depicted in Fig. 6.11.
Using this search pattern, each Group of Macro-blocks is aligned. It is
then possible to perform concurrent computations on a row of GMs, thus
improving PiCoGA utilization and computational density and consequently
gaining performance. Four SAD operations on four bytes, as shown in Fig.
6.9, are used to implement a concurrent 4-block SAD operation that
requires only word-aligned accesses to memory. Fig. 6.12 shows the
result of this unfolding approach, applied both to the search path and
to the local scan path. The area required to implement this
computational kernel is about 100% of the PiCoGA resources, and the
latency required to execute the SAD is 7 clock cycles, the same as for
the concurrent 4-byte SAD shown previously. We call this implementation
sad4blk.
[Figure 6.11: diagram labels: 4-pixel steps, search window, reference
macroblock, scan path, local group of 4x4 macroblocks]
Figure 6.11: Enhanced search path
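A functional model of such an operation can be written in a few lines of
C. The sketch below is an assumption about its behavior, inferred from
the way sad4blk is called with two consecutive words of the current row
and one word of the reference block; it is not the Griffy-C description
of the operation. It assumes that pixel 0 sits in the least significant
byte and that each output word packs two 16-bit partial SADs.

```c
#include <stdint.h>

/* Extract pixel idx (0..7) from eight pixels packed in 64 bits. */
static uint32_t pixel(uint64_t row8, unsigned idx)
{
    return (uint32_t)(row8 >> (8u * idx)) & 0xFFu;
}

/* Hypothetical model of sad4blk: four reference pixels are compared
 * with the four windows that start at pixel offsets 0, 1, 2 and 3 of
 * the current row. o1 packs the SADs of windows 0 and 1, o2 those of
 * windows 2 and 3, each in a 16-bit field. */
void sad4blk_model(uint32_t *o1, uint32_t *o2,
                   uint32_t cur_lo, uint32_t cur_hi, uint32_t ref)
{
    uint64_t row8 = ((uint64_t)cur_hi << 32) | cur_lo;
    uint32_t sad[4];

    for (unsigned k = 0; k < 4; k++) {
        sad[k] = 0;
        for (unsigned p = 0; p < 4; p++) {
            uint32_t c = pixel(row8, k + p);
            uint32_t r = (ref >> (8u * p)) & 0xFFu;
            sad[k] += (c > r) ? (c - r) : (r - c);
        }
    }
    *o1 = sad[0] | (sad[1] << 16);
    *o2 = sad[2] | (sad[3] << 16);
}
```

With this packing, two partial distances can be accumulated with one
32-bit addition on the core. A complete 16x16 block contributes at most
256 x 255 = 65280 to each field, so the 16-bit fields cannot overflow
into each other.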
The PiCoGA architecture is oriented to the pipelined elaboration of
data-flow graphs in order to improve the throughput of data-intensive
computations. By overlapping successive executions of long-latency
PiCoGA instructions, it is often possible to improve the throughput up
to bounds that are set by the issue delay on one side and by the
processor-to-PiCoGA data bandwidth on the other. The main limitation
comes again from the memory access bandwidth (three loads for each
PiCoGA operation). Some further modifications of the inner loop,
however, allow a higher degree of data reuse. For example, the reference
block word can be reused for all 16 macro-blocks in the GM. Increasing
the unfolding factor may also increase the number of registers
statically allocated to store both temporary results and reusable data,
introducing a critical trade-off between unfolding factor and register
file occupancy. In order to avoid data dependencies that could lead to
pipeline stalls, the correlation among successive PiCoGA operations must
be very small. This goal can be achieved only by taking the register
file size into careful account. The optimal unfolding factor is the one
that gives the minimum number of stalls and the maximum usage of the
register file without resorting to main memory for the storage of
temporary variables.
[Figure 6.12: diagram labels: reference macroblock, group of
macroblocks, four SAD4 units, 64-bit output]
Figure 6.12: Concurrent 4-blocks SAD
We estimated that the optimal trade-off is the computation of 2 rows for
each macro-block and the concurrent processing of GMs composed of 8
macroblocks each, leading to a loop that uses 16 sad4blk operations. The
pseudo-code in Fig. 6.13 shows this unfolded metric function.
Intra-block unfolding decreases the number of registers used (for
storing temporary results), but increases data dependencies and thus the
pipeline stalls needed to wait for the PiCoGA write-back. Conversely,
inter-block unfolding increases the number of allocated registers, but
decreases the number of processor stalls because of the low degree of
correlation among macro-block distances. With this unfolding factor, the
instructions executed in the processor core to provide data to the
PiCoGA and to accumulate partial results balance the PiCoGA operation
latency, avoiding stalls in the processor core and thus allowing a good
degree of pipeline utilization.
Furthermore, the register file usage, albeit scheduled with manual
register allocation, shows a high degree of coverage without increasing
memory activity. The break mechanism previously introduced to stop the
computation of distances greater than the current minimum value has a
smaller impact, because of the unfolding factor applied in the inner
loop. This reduction is more than compensated by the degree of
utilization of the customized PiCoGA pipeline.
The unfolding technique applied to the spiral full-search path requires
the concurrent availability of 4x4 macroblocks for each spiral step.
This approach presents a drawback when the search path crosses the
boundaries of the search window or of the frame. In these cases, the
chosen solution has been to read the macroblocks that exceed the search
window space and to discard their distance values, which is possible at
negligible computational cost. Each group of macroblocks is processed by
reading 3 additional blocks with respect to the spiral path coordinates.
By appropriately choosing the search window side, it is then possible to
significantly reduce these boundary effects, or to avoid them
altogether. In fact, boundary effects involve the control statements of
the spiral path, introducing additional check points that can be used to
choose an appropriate measurement function, which may or may not involve
the PiCoGA.
for (j = macroblock_height; j > 0; j -= 2)
{
// lx: Frame width
// First row of the current group of macroblocks - First internal row
// RowPixels 0 - 3
sad4blk(o1, o2, ((uint *) current_GM_row)[0], ((uint *) current_GM_row)[1], ((uint *) ref_macroblock)[0]);
// RowPixels 4 - 7
sad4blk(o3, o4, ((uint *) current_GM_row)[1], ((uint *) current_GM_row)[2], ((uint *) ref_macroblock)[1]);
// RowPixels 8 - 11
sad4blk(o5, o6, ((uint *) current_GM_row)[2], ((uint *) current_GM_row)[3], ((uint *) ref_macroblock)[2]);
// RowPixels 12 - 15
sad4blk(o7, o8, ((uint *) current_GM_row)[3], ((uint *) current_GM_row)[4], ((uint *) ref_macroblock)[3]);
sad1_2 += o1 + o3 + o5 + o7; // Concurrent two 16-bit add
sad3_4 += o2 + o4 + o6 + o8; // Concurrent two 16-bit add
// First row of the current group of macroblocks - Second internal row
current_GM_row += lx;
sad4blk(o1, o2, ((uint *) current_GM_row)[0], ((uint *) current_GM_row)[1], ((uint *) ref_macroblock)[0]);
sad4blk(o3, o4, ((uint *) current_GM_row)[1], ((uint *) current_GM_row)[2], ((uint *) ref_macroblock)[1]);
sad4blk(o5, o6, ((uint *) current_GM_row)[2], ((uint *) current_GM_row)[3], ((uint *) ref_macroblock)[2]);
sad4blk(o7, o8, ((uint *) current_GM_row)[3], ((uint *) current_GM_row)[4], ((uint *) ref_macroblock)[3]);
sad5_6 += o1 + o3 + o5 + o7; // Concurrent two 16-bit add
sad7_8 += o2 + o4 + o6 + o8; // Concurrent two 16-bit add
// Second row of the current group of macroblocks - First internal row
current_GM_row -= lx; ref_macroblock += lx;
sad4blk(o1, o2, ((uint *) current_GM_row)[0], ((uint *) current_GM_row)[1], ((uint *) ref_macroblock)[0]);
sad4blk(o3, o4, ((uint *) current_GM_row)[1], ((uint *) current_GM_row)[2], ((uint *) ref_macroblock)[1]);
sad4blk(o5, o6, ((uint *) current_GM_row)[2], ((uint *) current_GM_row)[3], ((uint *) ref_macroblock)[2]);
sad4blk(o7, o8, ((uint *) current_GM_row)[3], ((uint *) current_GM_row)[4], ((uint *) ref_macroblock)[3]);
sad1_2 += o1 + o3 + o5 + o7; // Concurrent two 16-bit add
sad3_4 += o2 + o4 + o6 + o8; // Concurrent two 16-bit add
// Second row of the current group of macroblocks - Second internal row
current_GM_row += lx;
sad4blk(o1, o2, ((uint *) current_GM_row)[0], ((uint *) current_GM_row)[1], ((uint *) ref_macroblock)[0]);
sad4blk(o3, o4, ((uint *) current_GM_row)[1], ((uint *) current_GM_row)[2], ((uint *) ref_macroblock)[1]);
sad4blk(o5, o6, ((uint *) current_GM_row)[2], ((uint *) current_GM_row)[3], ((uint *) ref_macroblock)[2]);
sad4blk(o7, o8, ((uint *) current_GM_row)[3], ((uint *) current_GM_row)[4], ((uint *) ref_macroblock)[3]);
sad5_6 += o1 + o3 + o5 + o7; // Concurrent two 16-bit add
sad7_8 += o2 + o4 + o6 + o8; // Concurrent two 16-bit add
}
Figure 6.13: Unfolded SAD function based on sad4blk
Fig. 6.14 shows the corresponding data flow graph, automatically
generated through the C-based Place and Route flow described above. Each
computational node represents an assembly-level operation, such as an
addition or a subtraction, and is mapped over a set of logic cells.
Nodes drawn on the same line all belong to the same row and thus to the
same pipeline stage. Consequently, the alignment shows the layout of the
pipeline stages (for example, in Fig. 6.14 it is possible to identify 5
pipeline stages). Some operations, such as the constant-step shifts
employed for word-wise to byte-wise pixel unpacking, can be implemented
using only the programmable interconnections of the gate array, reducing
both the area and the latency of the graph. These instructions do not
occupy any stage in the hardware pipeline; they are therefore called
routing-only operations and are drawn in the DFG using dotted nodes.
Fig. 6.15 depicts the sad4blk instruction mapped over the PiCoGA.
[Figure 6.14: DFG node labels: inputs p1-p3, p10-p16 and p20-p23,
subtractions sub0a/sub0b to sub15a/sub15b, selections cond0-cond15 and
sub0-sub15, accumulators acc1-acc8, out1-out4, conca1-conca2, outputs o1
and o2]
Figure 6.14: sad4blk DFG
It should be observed that sad4blk cannot be effectively used for the
half-pel refinement of the motion vector. Half-pel precision is used in
MPEG compression in order to reduce the residual error in the
differential coding, but the interpolation among adjacent macroblocks
and the alignment of the minimum-distance macroblock, which cannot be
predicted at compile time, require memory accesses to be performed at
byte level. The packing step needed to feed the PiCoGA with an
appropriate workload would require so much processor activity that it
would nullify any advantage introduced by configurable computation. For
this reason, in our implementation, the half-pel refinement phase is
performed through processor-only computation.
Figure 6.15: sad4blk Place & Route
Performance Evaluation
An evaluation of the effectiveness of the implemented solution can be ob-
tained comparing results achieved using the XiRisc reconﬁgurable pro-
cessor with the performances obtained by a general purpose embedded
VLIW RISC processor not augmented by PiCoGA. Operating frequency
and function units availability are the same in the two cases in order to
have a fair proof of the performance enhancements bound to the instruc-
tion set architecture metamorphosis.
The synthesizable HDL model of the processor can be used in order to
verify the correctness of the implementation, but the huge simulation time
required by the MPEG compression algorithm would not allow an exhaustive
analysis over a significant benchmark. Faster simulations can be obtained
using an instruction set simulator (ISS) which describes the functionality
of the processor. Depending on the desired level of accuracy it is possible
to use a bit-accurate or a cycle-accurate simulation.
Instruction-accurate simulators, such as the internal ISS of the GNU
GDB debugger, evaluate the computation without considering pipeline
stalls or memory waits. On the other hand, cycle-accurate simulators,
such as the ISS generated from LISA (Language for Instruction Set
Architecture), accurately evaluate pipeline stalls and can be embedded
into a SystemC environment in order to evaluate the impact of the memory
hierarchy. For a qualitative performance evaluation we have used a
profiling tool based on the GNU GDB ISS, which provides a program trace
(a trace file that annotates all computed instructions) and estimates only
the processor stalls introduced by the PiCoGA register lock mechanism.
The results of the profiling are then back-annotated on the C and
assembler source code.
As described in [66], the energy consumption of the XiRisc processor
architecture can be roughly considered proportional to the number of
memory accesses, which in turn is mainly due to instruction fetches. By
collapsing a set of assembly instructions into a single instruction that
triggers PiCoGA elaboration, it is possible to reduce the number of
fetches and thus to decrease energy consumption. Of course, this decrease
is traded against the overhead in power consumption due to the
reconfigurable hardware unit (leakage power) and to the elaboration of the
unit itself (dynamic power):
• the first component is proportional to the gate array area;
• the second one is a dynamic component of energy consumption due
to PiCoGA elaboration and depends on both the input data and
the DFG implemented.
This simplified model was used in order to estimate energy consumption
on the traces provided by software simulation. The model was
empirically verified through measurements performed on silicon prototypes.
Table 6.2: Test-sequence features
Sequence title : Coast-guard
Number of frames: 12 (1 GOP)
Frame standard : QCIF (176x144)
YUV standard : 4:2:0
Interleaving : No
Macroblocks : 16x16
Search windows : 16x16 (-8,+7)
The effectiveness of the introduced motion estimation implementation
has been evaluated on a test sequence composed of a group of 12 frames
in QCIF standard. Encoding this sequence (Table 6.2), we have analyzed
an entire Group of Pictures (GOP) featuring both backward and forward
predictions. Extracting from the MPEG profiling analysis the results
referring to motion estimation, it is possible to observe a performance
improvement of up to one order of magnitude (in the case of full-pel
precision) when comparing XiRisc with a general purpose VLIW RISC
processor.
Results are shown in Table 6.3, where the “distance” speed-up (about
18x) refers to the kernel directly involved in the PiCoGA-driven
computation, while the motion estimation performance is reported for
both full-pel and half-pel precision. These performances are relevant
also in the case of the motion estimation algorithm with half-pel
analysis, which introduces a significant overhead due to control flow
statements and spiral path handling that explains the lower speed-up
figure.
The performance gain achieved using sad4blk depends on the required
search area. As explained in [104], it is necessary to find a trade-off
between search area, compression factor and computation time. Depending
on the amount of available computation, several algorithms reduce the
computational requirement by inspecting a subset of checkpoints
through hierarchical paths or through search windows with variable side.
In Fig. 6.16, considering the full-search engine (with full-pel precision),
we show a linear increase of the computational workload with the search
window side, comparing a VLIW processor with a XiRisc implementing
sad4blk.
Table 6.3: Performances
                 Speed-up   Energy Consumption Reduction
Distance         18x        -
M.E. Full-Pel    10x        80%
M.E. Half-Pel    7x         75%
Figure 6.16: Full-Search workload vs. search window side
Even if the computational workload is small, this XiRisc configuration
can be effectively used, obtaining a significant gain. In the border
case represented by fast motion estimation algorithms (e.g. the algorithms
overviewed in [105]), where the matching criterion is applied to a very
small number of distributed macroblocks, sad4blk can be used in order
to avoid computational overheads due to data misalignments, achieving a
speed-up that can be estimated at about 5x in full-pel distance
measurement. The impact on the overall MPEG-2 encoding is summarized in
Table 6.4.
Using the power estimation model described in the previous section,
an energy consumption reduction of about 75-80% has been achieved,
mainly due to the intensive usage of PiCoGA in the most significant
computational kernel, thus demonstrating the effectiveness of the
reconfigurable computing approach for energy-critical applications.
Table 6.4: MPEG-2: final results
Algorithm                     Clock Cycles   %
Motion Estimation             115671891      53%
Fast DCT                      10820304       4.9%
Inverse DCT                   14260740       6.4%
Prediction                    6675078        3%
Quantization                  12554590       5.7%
Inverse Quantization          9331982        4.3%
Variable-Length Coding        5260986        2.3%
Bitstream packing             30397513       13.9%
Communication among tasks     14445511       6.5%
Total                         219418595      100%
6.2.2 AES/Rijndael implementation on the DREAM adaptive DSP
Security of data is becoming an important challenge for a wide spectrum
of applications, including communication systems (with high privacy
requirements), secure storage supports, digital video recorders, smart
cards and cellular phones. Resistance against known attacks is one of the
main properties that an encryption algorithm needs to provide. When a new
attack is demonstrated to be effective (also in terms of computation
time), the update of the encryption system is a real necessity to
guarantee the security of data.
In November 2001, the National Institute of Standards and Technology
(NIST) announced the Advanced Encryption Standard (AES) [106] as a
replacement of the Data Encryption Standard (DES). The Rijndael algorithm
[107], selected among 15 candidates, is a symmetric key algorithm based on
a substitution-permutation network, where most of the calculations are
done using Galois Field (GF) arithmetic defined over the field GF(2^8)
with the irreducible polynomial x^8 + x^4 + x^3 + x + 1.
Applications requiring high performance and/or low power consump-
tion are today implemented using dedicated hardware accelerators with
the downside of higher development costs and lack of ﬂexibility (i.e. algo-
rithm update or parameter changes) with respect to software implementa-
tions. In this context, reconﬁgurable hardware such as Field Programma-
ble Gate Arrays (FPGAs) seems to bridge the gap between performance
and ﬂexibility required to guarantee the necessary updates. For complex
System-on-Chip, where the area budget dedicated to a single computa-
tional island is a constraint, reconﬁgurable architectures (RAs) for embed-
ded applications were proposed as hardware accelerators, including em-
bedded FPGAs, reconﬁgurable processors and reconﬁgurable data-paths.
In this section, an implementation of the AES/Rijndael algorithm on
the DREAM architecture will be presented. The DREAM architecture is
composed of a reconfigurable data-path (the 3rd generation Pipelined
Configurable Gate Array, or PiCoGA-III) controlled by a 32-bit RISC
processor. PiCoGA-III is directly interfaced to a high-bandwidth memory
sub-system through programmable address generators, featuring for example
vectorized and modulo addressing. An important key point is that the
PiCoGA-III features a native support for operations in GF(2^4), thus
allowing easy and effective implementations of composite fields, which
provide the mathematical background for many applications, including
Reed-Solomon codes.
Overview: the AES/Rijndael algorithm
The Rijndael algorithm [107] is a symmetric key cipher implementing a
substitution-permutation network. The size of both ciphered block and
key depends on the security level required, as does the number of
iterations (rounds) required to encrypt the plain-text. As an example, the
U.S. Government requires 128-bit keys for SECRET data, while the
TOP-SECRET level requires 192 or 256-bit keys. While Rijndael supports a
large range of block and key sizes, the NIST standardized a subset of them,
using only 128-bit blocks and 128, 192 and 256-bit keys [106]. For
ciphering a stream, AES/Rijndael can be applied in many schemes, including
ECB (Electronic Codebook) and CBC (Cipher Block Chaining) [108]. While the
ECB mode ciphers each block independently of the other ones, the CBC mode
XORs the plain-text with the previously ciphered block, preventing the
coding of equal plain-blocks into equal ciphered-blocks. On one hand,
the CBC mode introduces an additional level of security with respect to
ECB, but on the other hand the additional feedback limits the peak
performance, especially for hardware implementations.
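The difference between the two chaining schemes can be illustrated with a short C sketch. It is only a sketch under stated assumptions: ecb_encrypt, cbc_encrypt and the block_cipher_fn type are hypothetical names, and toy_cipher is a trivial stand-in used to exercise the chaining logic (it is neither AES nor secure).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 16

/* Hypothetical in-place block cipher interface (stands for one AES call). */
typedef void (*block_cipher_fn)(uint8_t block[BLOCK_BYTES], const uint8_t *key);

/* ECB: each block is ciphered independently, so blocks can be interleaved. */
static void ecb_encrypt(uint8_t *data, size_t nblocks, const uint8_t *key,
                        block_cipher_fn enc)
{
    for (size_t n = 0; n < nblocks; n++)
        enc(data + n * BLOCK_BYTES, key);
}

/* CBC: each plain block is XORed with the previous ciphered block (or the
   initialization vector), which creates a feedback between blocks. */
static void cbc_encrypt(uint8_t *data, size_t nblocks, const uint8_t *key,
                        const uint8_t iv[BLOCK_BYTES], block_cipher_fn enc)
{
    const uint8_t *prev = iv;
    for (size_t n = 0; n < nblocks; n++) {
        uint8_t *blk = data + n * BLOCK_BYTES;
        for (int i = 0; i < BLOCK_BYTES; i++)
            blk[i] ^= prev[i];
        enc(blk, key);
        prev = blk;
    }
}

/* Toy stand-in for the block cipher: NOT AES and NOT secure. */
static void toy_cipher(uint8_t block[BLOCK_BYTES], const uint8_t *key)
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        block[i] = (uint8_t)((block[i] ^ key[i]) + 1u);
}
```

In the CBC loop the input of block n depends on the output of block n-1, which is the feedback that prevents pipelining several blocks in hardware.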
The encryption process starts by arranging the block in a matrix form
termed State. Let us consider as reference the 128-bit (block and key)
Rijndael. In this case, the State (S) is a 4×4 array of bytes in which the
128-bit block is arranged by rows. The State is thus encrypted by the
iterative application of 4 operations, as described in the following
pseudo-code.
S = in; N_b = 128;
S = AddRoundKey(S, key[0, N_b-1]);
for (i=1; i<Nround; i++)
{
    S = SubBytes(S);
    S = ShiftRows(S);
    S = MixColumns(S);
    S = AddRoundKey(S, key[i*N_b, (i+1)*N_b-1]);
}
S = SubBytes(S);
S = ShiftRows(S);
S = AddRoundKey(S, key[i*N_b, (i+1)*N_b-1]);
out = S;
The number of iterations (Round) depends on the key size, and ranges
from 10 to 14. Four basic operations are applied to the State:
SubBytes: is a non-linear substitution step applied to each byte of the
State array, which is substituted with its multiplicative inverse over
GF(2^8). Then, an affine transformation ($b' = A \cdot b \oplus c$) is
applied, as described by the following equivalent equation:
$$b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i \qquad (6.2)$$
where $b'$ and $b$ are bytes of the State array, and $c$ is the vector
{01100011}. The non-linear substitution applied to each byte is also
known as S-Box.
ShiftRows: operates on the rows of the State, rotating them to the left by
a shift step equal to the row index.
$$S'_{r,c} = S_{r,\,(c+r) \bmod 4} \qquad (6.3)$$
where $c$ and $r$ are respectively the column and row indexes.
MixColumns: operates on the four bytes of each column of the State
array, which are treated as the coefficients of a four-term polynomial
over GF(2^8). The MixColumns step performs a multiplication (modulo
x^4 + 1) with the fixed polynomial 3x^3 + x^2 + x + 2.
AddRoundKey: represents the last operation of each Round and performs
an addition over GF(2^8) between the State and the Round Key, a 4×4
array generated from the original key by an expansion step in order
to provide different key-words to different rounds.
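The SubBytes and MixColumns definitions given above can be checked with a small software model. The following C sketch is a plain, unoptimized illustration (not the PiCoGA mapping): function names are hypothetical, the inverse is computed by brute-force exponentiation, and the affine step uses the rotation form that is equivalent to equation (6.2).

```c
#include <stdint.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1u)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0x00u));
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse computed as a^254; 0 is mapped to 0. */
static uint8_t gf256_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = gf256_mul(r, a);
    return r;
}

static uint8_t rotl8(uint8_t v, unsigned n)
{
    return (uint8_t)((v << n) | (v >> (8u - n)));
}

/* S-Box: inverse over GF(2^8) followed by the affine transformation
   b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i, c = 0x63. */
static uint8_t sub_byte(uint8_t s)
{
    uint8_t b = gf256_inv(s);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3)
                       ^ rotl8(b, 4) ^ 0x63u);
}

/* MixColumns on one column: multiplication by 3x^3 + x^2 + x + 2
   modulo x^4 + 1, with coefficients in GF(2^8). */
static void mix_column(uint8_t col[4])
{
    uint8_t s0 = col[0], s1 = col[1], s2 = col[2], s3 = col[3];
    col[0] = (uint8_t)(gf256_mul(2, s0) ^ gf256_mul(3, s1) ^ s2 ^ s3);
    col[1] = (uint8_t)(s0 ^ gf256_mul(2, s1) ^ gf256_mul(3, s2) ^ s3);
    col[2] = (uint8_t)(s0 ^ s1 ^ gf256_mul(2, s2) ^ gf256_mul(3, s3));
    col[3] = (uint8_t)(gf256_mul(3, s0) ^ s1 ^ s2 ^ gf256_mul(2, s3));
}
```

Software implementations normally replace sub_byte with a 256-entry table; the explicit form is kept here because it mirrors the inverse-plus-affine structure that is mapped on the gate array.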
The key expansion step, also known as Key Schedule, is performed
before the encryption, and is described with mathematical operations,
mainly based on the application of S-Box and word rotation [106, 107].
All the operations previously described are invertible in a very
straightforward manner, resulting in a decoding schema very similar to
the encoding one. In particular, the computational complexity is more or
less the same, since the kind of applied operations is the same.
The Advanced Encryption Standard implemented by the Rijndael
algorithm can be efficiently implemented in both software and hardware.
8-bit processors can directly implement most of the operations required
by AES since they natively work on 8-bit variables (e.g. ShiftRows,
AddRoundKey and MixColumns), while the S-Box is more efficiently
implemented using a 256-entry 8-bit hash table. 32-bit processors implement
fast Rijndael by combining the different steps of a round transformation
in a single set of hash-tables. As a result, 4 tables with 256 32-bit
values (termed T-Box) substitute most of the round operations, leaving
only XORs and rotations to the dynamic computation [107]. Comparing this
optimized version with the basic one, about one order of magnitude in
performance is gained on a RISC processor. Implementations on TI DSPs are
discussed in [112]: a 112.3 Mbit/sec throughput (@ 200MHz) is achieved on
the C62x architecture for the encoder, 1.6× faster than a Pentium-Pro
working at the same frequency. Moreover, instruction set extensions
dedicated to Rijndael are present in the literature, such as [109, 110].
Hardware implementations of AES are optimized by the exploitation
of the available parallelism. Hence, the design of hardware accelerators
for AES begins from the 1-to-1 unfolding of the Round definition, as shown
in Figure 6.17. For the ECB mode, the Rijndael algorithm can be
completely unrolled and pipelined, thus improving the available throughput
up to the technological limit. The undeniable drawback is the considerable
increase in area occupation. Examples of AES implementations for
stand-alone FPGAs are [115, 116, 113, 117, 114], providing 2-30 GBit/sec
throughput. Hybrid solutions, coupling a processor with FPGA technology,
are implemented in the Xilinx Virtex II Pro platform [114, 118], achieving
performance up to 1.2 GBit/sec. For embedded applications, where
the area budget is a constraint, devices with restricted size are proposed.
Embedded FPGAs (e.g. [111]) are the most direct “translation” of the
traditional field-programmable technology to the market of IPs suitable
for SoC integration. Alternatively, and depending on the application
field, reconfigurable data-paths (e.g. [53, 55]) are used as
hardware-programmable accelerators. As an example, in [120] a
reconfigurable datapath tackles a set of cryptographic applications.
Figure 6.17: Common AES-Round block diagram
Implementation of basic GF(2^8) operations
An important property of Galois Fields is that they are univocally defined
by the number of elements. What can be changed, depending on the
irreducible polynomial, is the representation. Therefore, the GFs are
isomorphic with respect to an irreducible polynomial change and a
transformation matrix can be defined in order to change the
representation. As described in Paar's PhD Thesis [121], this implies that
GF(2^8) can be seen as a composite field GF((2^4)^2) whose elements are
represented by first-order polynomials ax + b with a, b ∈ GF(2^4).
PiCoGA-III features a native support of GF(2^4) with the irreducible
polynomial x^4 + x + 1. This means that each RLC can be programmed to
perform both the sum (⊕) operation, implemented by LUT as a 4-bit XOR,
and the multiplication (⊗) operation, implemented by the dedicated GF
multiplier.
The AES/Rijndael algorithm requires the implementation of three
operations on GF(2^8): the sum, the multiplication by a constant, and the
multiplicative inverse. While the sum and the multiplication by a constant
can be described (in Griffy-C) and implemented (on the PiCoGA) with
standard C operators (XORs, ANDs and shifts), the implementation of the
multiplicative inverse over GF(2^8) benefits from the GF capabilities of
PiCoGA-III. By definition [122], the multiplicative inverse on the
composite field GF((2^4)^2) (using the irreducible polynomial
x^2 + x + w^14) is:
$$(ax + b)^{-1} = a \otimes \delta^{-1}\, x \oplus (a \oplus b)\, \delta^{-1} \qquad (6.4)$$
$$\delta = a^{2} \otimes w^{14} \oplus b\,(a \oplus b)$$
Figure 6.18(a) shows the straightforward implementation of the
multiplicative inverse obtained from equation (6.4). Basic blocks are
aligned per pipeline stage, and each basic block can be mapped on one RLC
(the inverse on GF(2^4) is a 4-in 4-out function implemented by LUT). The
full retiming, needed to maximize the throughput, requires 7 additional
registers (dashed-line blocks), for a total of 17 RLCs distributed over 5
rows. Figure 6.18(b) shows an optimized multiplicative inverse generated
by rewriting equation (6.4) in the following form:
$$(ax + b)^{-1} = (a^{-1} \otimes \delta)^{-1}\, x \oplus ((a \oplus b)^{-1} \otimes \delta)^{-1} \qquad (6.5)$$
$$\delta = a^{2} \otimes w^{14} \oplus b \otimes (a \oplus b)$$
Figure 6.18: Inverse multiplicative on composite fields schemes: (a) straightforward implementation, (b) optimized implementation
In this second case we have an issue-delay of 2 cycles, requiring only 4
additional registers (for a total of 15 RLCs) for the full retiming. The max-
width of this implementation schema is 4 RLCs, allowing a better packing
of multiple instances of the inverse multiplier in the PiCoGA rows (each
of them composed by 16 RLCs). To complete an S-Box, we need to add the
isomorphism matrix and the successive afﬁne transformation. Two rows
with respectively 4 and 2 RLCs are required for the input isomorphism,
while the output isomorphism and the afﬁne transformation can be col-
lapsed together, with the same resources occupation (4+2 RLCs).
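The composite-field inversion discussed above can be verified in software with a few lines of C. This is a minimal sketch under stated assumptions, not the Griffy-C description: an element ax + b of GF((2^4)^2) is stored with a in the high nibble and b in the low nibble, w is assumed to be the root x of x^4 + x + 1 (so that w^14 = 0x9), and all function names are hypothetical. The isomorphism matrices and the affine transformation needed for a complete S-Box are omitted.

```c
#include <stdint.h>

#define W14 0x9u /* w^14 with w = x, a root of x^4 + x + 1 (assumption) */

/* Multiplication in GF(2^4) modulo x^4 + x + 1. */
static uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1u)
            p ^= a;
        b >>= 1;
        a = (uint8_t)(a << 1);
        if (a & 0x10u)
            a ^= 0x13u; /* reduce by x^4 + x + 1 */
    }
    return (uint8_t)(p & 0x0Fu);
}

/* Inverse in GF(2^4) as a^14; 0 is mapped to 0, as a 4-in 4-out LUT would. */
static uint8_t gf16_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 14; i++)
        r = gf16_mul(r, a);
    return r;
}

static uint8_t cf_delta(uint8_t a, uint8_t b)
{
    return (uint8_t)(gf16_mul(gf16_mul(a, a), W14) ^ gf16_mul(b, a ^ b));
}

/* Straightforward form: a' = a * delta^-1, b' = (a + b) * delta^-1. */
static uint8_t cf_inv_direct(uint8_t v)
{
    uint8_t a = v >> 4, b = v & 0x0Fu;
    uint8_t dinv = gf16_inv(cf_delta(a, b));
    return (uint8_t)((gf16_mul(a, dinv) << 4) | gf16_mul(a ^ b, dinv));
}

/* Rewritten form: a' = (a^-1 * delta)^-1, b' = ((a + b)^-1 * delta)^-1. */
static uint8_t cf_inv_rewritten(uint8_t v)
{
    uint8_t a = v >> 4, b = v & 0x0Fu;
    uint8_t d = cf_delta(a, b);
    uint8_t a1 = gf16_inv(gf16_mul(gf16_inv(a), d));
    uint8_t b1 = gf16_inv(gf16_mul(gf16_inv(a ^ b), d));
    return (uint8_t)((a1 << 4) | b1);
}

/* Product of two composite-field elements modulo x^2 + x + w^14. */
static uint8_t cf_mul(uint8_t u, uint8_t v)
{
    uint8_t a = u >> 4, b = u & 0x0Fu, c = v >> 4, d = v & 0x0Fu;
    uint8_t ac = gf16_mul(a, c);
    uint8_t hi = (uint8_t)(ac ^ gf16_mul(a, d) ^ gf16_mul(b, c));
    uint8_t lo = (uint8_t)(gf16_mul(ac, W14) ^ gf16_mul(b, d));
    return (uint8_t)((hi << 4) | lo);
}

/* Returns 1 if both forms agree and give a true inverse for all inputs. */
static int cf_inverse_is_consistent(void)
{
    for (unsigned v = 1; v < 256; v++) {
        uint8_t inv = cf_inv_direct((uint8_t)v);
        if (cf_mul((uint8_t)v, inv) != 0x01u)
            return 0;
        if (inv != cf_inv_rewritten((uint8_t)v))
            return 0;
    }
    return 1;
}
```

The rewritten form uses more GF(2^4) inversions, but each of them is a single LUT, which is the kind of trade-off that favours the narrower scheme on the gate array.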
Implementation of AES/Rijndael
A goal of our AES/Rijndael implementation is to be flexible for both block
and key size. Hence, we have analyzed, in relation with DREAM
capabilities, the following properties of the Rijndael algorithm. First of
all, since the SubBytes operation does not depend on the position of each
byte, ShiftRows can be performed before SubBytes. In addition, ShiftRows
performs a rotation which can be implemented using modulo addressing.
Hence, using different memory banks for storing the different rows of the
State matrix, PiCoGA is able to load a new State column for each cycle.
The rotation applied by ShiftRows is handled by changing the starting
address of each bank, while the different number of columns (for the
generic Rijndael) is handled by setting the address generator
end-of-count. The organization by column allows the packing of the
MixColumns function in the same PiCoGA operations.
Figure 6.19: AES/Rijndael selected kernel and implementation
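The idea of obtaining ShiftRows through modulo addressing can be sketched in C as follows. The structure addr_gen_t and the functions below are hypothetical models written for illustration only; they do not describe the actual DREAM address generator interface.

```c
#include <stdint.h>

#define NROWS 4

/* Hypothetical model of one address generator, one per memory bank;
   each bank stores one row of the State. */
typedef struct {
    const uint8_t *bank;   /* row of the State stored in this bank */
    unsigned start;        /* starting address: the ShiftRows rotation */
    unsigned end_of_count; /* number of State columns (4 for 128-bit blocks) */
} addr_gen_t;

/* Byte delivered by the generator at a given cycle (modulo addressing). */
static uint8_t addr_gen_read(const addr_gen_t *g, unsigned cycle)
{
    return g->bank[(g->start + cycle) % g->end_of_count];
}

/* One already-rotated State column is loaded per cycle. */
static void load_shifted_column(const addr_gen_t gen[NROWS], unsigned cycle,
                                uint8_t col[NROWS])
{
    for (int r = 0; r < NROWS; r++)
        col[r] = addr_gen_read(&gen[r], cycle);
}
```

With start set to the row index, the column read at cycle c contains S[r][(c + r) mod 4], which is the ShiftRows output without any data movement.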
Figure 6.19 shows the corresponding implementation scheme. This
PGAOP performs AddRoundKey, SubBytes and MixColumns for the 4
bytes in a column concurrently, leaving the addressing engine to handle
the ShiftRows for both block and key access. A different set of buffers is
used to store PGAOP results, since it is not possible to read and write a
memory bank in the same cycle. This implementation requires 4 PGAOP
calls in order to accomplish one AES/Rijndael Round; after that, we need
to re-configure the interconnect cross-bar in order to swap the used I/O
buffers. Although this operation could be performed in parallel to the
PGAOP computation (destination ports are stored internally to the PiCoGA
during the PGAOP triggering), this reconfiguration breaks the best
pipeline evolution. For the ECB mode, there is no dependency among the
encryption of successive blocks, thus it is possible to interleave the
encryption of more than one block in order to mitigate the impact of the
interconnect reconfiguration. The stride factor allows the address
generator to jump to the next block when the Round is finished. The last
Round requires the implementation of a dedicated PGAOP, without
MixColumns and with an additional AddRoundKey before the SubBytes needed
by the loop transformation introduced before. Only 11 pipeline stages are
required for this goal, but the area occupation is increased to 17 rows
because of an unfavorable requirement of additional retiming registers
necessary to maintain the issue-delay equal to 1.
Table 6.5: AES/Rijndael encoder performance (clock cycles per 1 block)
block/key size   Scalable Version   Optimized Version   Key Expansion
128/128          408                285                 192
128/192          466                329                 216
128/256          524                373                 240
256/128          455                -                   319
256/192          521                -                   367
256/256          587                -                   415
For 128-bit blocks only, the PiCoGA-III is able to output a whole 4x4
block, so it is possible to implement an optimized PGAOP using only
the simple registers. When block interleaving is not applicable (e.g. in
CBC mode), we can achieve a further 1.4× speed-up by reducing the
configuration overhead, through the utilization of simple registers
instead of address generators to exchange data with the PiCoGA. Two
additional shift registers (and the corresponding control logic) shall be
mapped on the PGAOP because ShiftRows has to be implemented internally.
Data are loaded at the first PGAOP trigger, while three additional
triggers are required to provide the correct result.
Figure 6.20: Speed-ups wrt RISC processor (speed-up, on a logarithmic scale from 1 to 1000, vs. number of interleaved blocks; curves: AES-128, AES-192 and AES-256, each against SW and Fast SW)
Experimental results and comparisons
We have implemented the AES/Rijndael algorithm on the DREAM
cycle-accurate Instruction Set Simulator (ISS), based on CoWare
technology. The RISC processor is modelled using the LISA language, while
the memory sub-system and the PiCoGA are modelled using a mix of SystemC
and C/C++. Frequency and power consumption figures are estimated starting
from measurements on the silicon prototype in [75], featuring a comparable
design complexity. Both the scalable and the optimized implementations
presented in the previous section were considered in our analysis and the
cycle count obtained is reported in Table 6.5. Results are provided for
the encryption of a single block, considering various block and key sizes.
At the frequency of 200MHz, it is possible to achieve a throughput up to
90 Mbit/sec using a scheme applicable in both ECB and CBC modes.
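As a quick consistency check, the single-block throughput can be derived from the cycle counts of Table 6.5, assuming that one block is processed at a time and that the clock runs at the stated 200 MHz (the symbol $f_{clk}$ is introduced here only for this worked example):

```latex
\[
\text{throughput} = \frac{\text{block size} \cdot f_{clk}}{\text{cycles per block}}
\]
\[
\text{optimized, 128/128:}\quad \frac{128 \cdot 200 \times 10^{6}}{285} \approx 89.8\ \text{Mbit/sec}
\qquad
\text{scalable, 128/128:}\quad \frac{128 \cdot 200 \times 10^{6}}{408} \approx 62.7\ \text{Mbit/sec}
\]
```

The first value agrees with the figure of about 90 Mbit/sec; the ratio 408/285 is about 1.4, in line with the speed-up attributed to the optimized version.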
In ECB mode, the scalable solution can interleave the encryption of
more than one block, exploiting as much as possible the computational
efficiency of DREAM. Pipelining the computation on the PiCoGA-III, the
obtained speed-up figure rises from 100× to 930× wrt the ANSI-C
Reference Code (v. 2.2) running on a RISC processor at the same frequency,
while it rises from 3× to 24× wrt a fast software implementation (by
C. Devine, available on-line at the Rijndael Home Page [107]) running on
the same RISC processor. Figure 6.20 shows the achieved speed-ups versus
the level of interleaving applied, hence in relation to the number of
blocks concurrently elaborated.
Figure 6.21 shows an analysis of the throughput with respect to the
interleaving factor applied. When ciphering 64 or 128 blocks, the benefit
of pipelining the computation inside the PiCoGA-III mitigates the overhead
due to interconnect configuration changes, allowing one to obtain up to
546 Mbit/sec of throughput. Considering the case of AES-128, the
throughput increases from 63 to 546 Mbit/sec in a way that is
proportional to the average number of active rows inside the PiCoGA. In
fact, the average number of active rows grows from 1.5 rows/cycle to 12.8
rows/cycle, respectively corresponding to 10% and 85% of the PGAOP.
With 256-bit block size, the memory utilization grows faster, so the
128-block interleaving cannot be applied.
Comparisons with other AES-128 implementations are reported in
Table 6.6, including both fast software (with an assembly hand-coded
Pentium-III) and hardware approaches. Furthermore, a processor with
custom-designed ISA [109] is considered too. For the hardware approaches,
we have taken into account folded schemes implemented on both FPGA and
ASIC (0.18 µm) prototypes. The energy efficiency (Mbit/sec/mW) shows
the density advantage of DREAM with respect to the other “programmable”
solutions. For this purpose, the power consumption of DREAM is
estimated in a range from 80 mW (CBC) to 180 mW (ECB), depending on
the different PiCoGA-III utilization and correlated memory activity.
Figure 6.21: Throughput vs. interleaving factor (throughput in Mbit/sec, from 0 to 600, vs. number of interleaved blocks; one curve for each block/key size: 128/128, 128/192, 128/256, 256/128, 256/192 and 256/256 bit)
                      Frequency   Throughput   Energy eff.
                      MHz         Mbit/sec     Mbit/sec/mW
DREAM (ECB)           200         546          3.03
DREAM (CBC)           200         90           1.12
ARM9 (1) [123]        500         46.6         0.32 (1)
ARM9 (2) [123]        250         23.3         0.67 (2)
TI C62x [112]         200         112          n/a
Pentium-III [119]     1130        645          0.015
Ravi [109]            188         17.2         n/a
Lu [118]              196         1197         n/a
Chaves [114]          100         1258         n/a
Sch. [119] FPGA       77          640          0.39
Sch. [119] ASIC       154         1280         22.8
(1) ARM926EJ-S Speed-Opt. 90nm 0.29 mW/MHz (www.arm.com)
(2) ARM926EJ-S Area-Opt. 90nm 0.14 mW/MHz (www.arm.com)
Table 6.6: AES-128 encryption comparisons
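The energy-efficiency column of Table 6.6 can be cross-checked from the figures quoted in the text, assuming that efficiency is simply throughput divided by power and that the ARM9 power is obtained from the mW/MHz values reported in the table notes:

```latex
\[
\text{DREAM (ECB):}\ \frac{546\ \text{Mbit/sec}}{180\ \text{mW}} \approx 3.03
\qquad
\text{DREAM (CBC):}\ \frac{90\ \text{Mbit/sec}}{80\ \text{mW}} \approx 1.12
\]
\[
\text{ARM9 speed-opt.:}\ \frac{46.6}{500 \cdot 0.29} = \frac{46.6}{145} \approx 0.32
\qquad
\text{ARM9 area-opt.:}\ \frac{23.3}{250 \cdot 0.14} = \frac{23.3}{35} \approx 0.67
\]
```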
6.2.3 Low-complexity transform for H.264 video encoding
The H.264 video encoding architecture [124] has many innovations compared
to previous standards and provides a compression gain of 1.5-2.0× with
respect to them. Among other things, this standard introduces:
• a new low-complexity transform and quantization approach [125]
employing only integer arithmetic without multiplications. Its
coefficients and scaling factors can be elaborated using 16-bit
arithmetic, leading to a significant complexity reduction, and allowing a
more efficient hardware implementation, in particular for
reconfigurable devices.
• new cost functions for the definition of the macro-block distance
metric in motion compensation heuristics. In particular two metrics
are defined, the sum of absolute differences (SAD) and the sum of
absolute transformed differences (SATD) based on the Hadamard transform.
Motion compensation is also used for efficient intra-frame encoding,
by the introduction of nine motion modes. Intra-frame prediction
can also be used to encode static images very efficiently, with the
same signal to noise ratio of JPEG2000, but with a better
compression factor [126].
This section and the next one will introduce the implementation of
these critical kernels on the PiCoGA-III reconfigurable device in the
DREAM adaptive DSP. The optimization techniques adopted to obtain an
optimal implementation of these computational kernels will also be
illustrated.
H.264 Transform
The structure of H.264 imposes several requirements on the design of
residual coding. In traditional video encoding standards, residual
decoding contains the possibility of drift (mismatch between the decoded
data in the encoder and decoder). The drift arises from the fact that the
inverse transform is not fully specified in integer arithmetic, but using
floating-point operations (sine and cosine samples, for the implementation
of the Discrete Cosine Transform - DCT) that introduce approximation
errors due to the specific implementation. On one hand, the programmer can
adapt and optimize the implementation on a particular architecture, but on
the other hand the cost of this flexibility is the introduction of a
prediction drift. To avoid this, the H.264 standard introduces an integer
transform in which all the operations are natively defined by fixed point
arithmetic, thus without loss of information. Moreover, the H.264
transform is applied to 4×4-pixel blocks, whereas the previous video
coding standards used 8×8 blocks. This smaller block size leads to a
significant reduction in ringing artefacts (image border noise) and
computational requirements. In addition, compression gain is improved by
using inter-block pixel prediction for intra-coded frames. The transform
is applied to prediction residuals, reducing the spatial correlation and
the size of the transformed block without affecting the compression gain.
The length-4 transform proposed in H.264 is an integer orthogonal
approximation of the Discrete Cosine Transform (DCT), which allows
bit-exact implementation for both encoder and decoder. From a
computational point of view, the new transform has the additional benefit
of substituting multiplications with shifts, more suitable for the
implementation on reconfigurable hardware. For improved compression
efficiency, H.264 employs a hierarchical transform structure, in which the
DC coefficients of neighboring 4×4 transforms are grouped in 4×4 blocks
and transformed again by a second level transform.
Integer Transform design
DCT is commonly used as block transform coding of images and video
because of its close approximation to the statistically optimal
Karhunen-Loeve transform, for a wide class of signals. DCT maps an
N-length vector x into a new vector X, by a linear transformation
$$X = H\,x$$
where the elements of the matrix $H$ are defined by
$$H(k,n) = c_k \sqrt{\frac{2}{N}} \cos\left[\left(n + \frac{1}{2}\right)\frac{k\pi}{N}\right]$$
with $c_0 = \sqrt{1/2}$ and $c_k = 1$ for $k > 0$.
The DCT matrix is orthogonal, thus
$$x = H^{-1} X = H^{T} X$$
A disadvantage of DCT is that the coefficients $H(k,n)$ are irrational
numbers, which in a digital computer are approximated, thus introducing
some degree of error. In H.264, the transform is based on the DCT and
operates on 4×4 blocks of residual data, but differs from a DCT in that it
is natively defined using integer arithmetic (without loss of accuracy,
for both direct and inverse transform), avoiding mismatch between encoders
and decoders. Furthermore, the core part is multiply-free, and a scaling
multiplication is integrated in the quantizer, thus reducing the total
number of multiplications.
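A 4×4 DCT of an input array $X$ is given by:
$$Y = A X A^{T} = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} X \begin{bmatrix} a & b & a & c \\ a & c & -a & -b \\ a & -c & -a & b \\ a & -b & a & -c \end{bmatrix}$$
where:
$$a = \frac{1}{2}, \qquad b = \sqrt{\frac{1}{2}} \cos\frac{\pi}{8}, \qquad c = \sqrt{\frac{1}{2}} \cos\frac{3\pi}{8}$$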
The matrix multiplication can be factorized in the following form:
 
￿
￿
 
 
 
/
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
 
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
 
 
 
￿
 
 
 
 
 
￿
 
 
 
￿
 
￿
 
 
 
￿
 
 
 
 
 
￿
 
 
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
 
 
/ is the “core” 2D transform, while
  matrix represents the required
scaling factors (
￿ means scalar multiplication) for the corresponding ele-
ments of
 
 
 
/.
  and
  are the same deﬁned before, while
 
￿
￿
  is ap-
proximatively
￿
 
￿
￿
￿. To simplify the implementation of the transform
  is6.2 Example of application mapping 149
approximated to
￿
 
￿, and
  is consequently modiﬁed in order to maintain
the matrix orthogonal, so that:
 
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
 
￿
￿
￿
Then, the
￿
￿
￿ and
￿
￿
￿ rows of
  and the
￿
￿
￿ and
￿
￿
￿ columns of
 
/ are
up-scaled by a factor of 2, post-scaling consequently the matrix
 . The
ﬁnal forward transform becomes:
 
￿
￿
 
 
 
/
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
 
￿
￿
 
￿
 
￿
￿
 
￿
￿
 
￿
 
￿
￿
￿
 
￿
 
￿
￿
 
￿
￿
 
￿
 
￿
￿
 
￿
￿
 
￿
 
￿
￿
￿
 
￿
 
￿
￿
￿
￿
￿
￿
￿
￿
￿
￿
Matrix
  is collapsed in the quantizer, that is deﬁned by:
 
￿
￿
￿
￿
 
 
 
 
 
￿
 
￿
￿
￿
 
 
 
 
 
￿
￿
 
 
 
 
 
￿
 
￿
￿
￿
 
 
 
 
 
 
 
￿
where PF ispost-scaling factor that dependson the position
￿
 
 
 
￿ suchthat:
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
￿
 
￿
 
￿
 
 
 
 
 
 
 
 
￿
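The row norms that the post-scaling factors compensate for can be checked with a few lines of C. The sketch below is only illustrative: the array name `CF` and the helper `row_dot` are assumptions introduced here, not names from the original implementation. It computes the dot products between rows of the integer core matrix: distinct rows give 0 (orthogonality), while the squared norms are 4 and 10, which is consistent with the three PF values 1/4 = a^2, 1/10 = b^2/4 and 1/sqrt(40) = ab/2.

```c
/* Integer core matrix of the H.264 forward 4x4 transform.
   The name CF is illustrative, not taken from the original code. */
static const int CF[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* Dot product between rows r and s of CF:
   0 for r != s, squared row norm (4 or 10) for r == s. */
int row_dot(int r, int s)
{
    int acc = 0;
    for (int k = 0; k < 4; k++)
        acc += CF[r][k] * CF[s][k];
    return acc;
}
```

For a position (i, j), PF equals one over the square root of `row_dot(i, i) * row_dot(j, j)`.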
On the decoder side, we could use the transpose of the forward matrix, $C_f^{T}$, scaling the reconstructed transform coefficients in order to compensate for the different row norms. However, the dynamic range gain should be reduced in order to minimize the combined rounding errors from the inverse transform and reconstruction. The H.264 standard therefore scales the odd-symmetric basis functions by 1/2, replacing the rows $[2 \;\; 1 \;\; {-1} \;\; {-2}]$ and $[1 \;\; {-2} \;\; 2 \;\; {-1}]$ with $[1 \;\; 1/2 \;\; {-1/2} \;\; {-1}]$ and $[1/2 \;\; {-1} \;\; 1 \;\; {-1/2}]$, respectively. That way, the sum of absolute values of the odd functions is 3, which reduces the dynamic range gain for the 2-D inverse transform from 36 to 16. This reduces the dynamic range increase from 6 bits to 4 bits. The inverse transform matrix is then defined as
\[ C_{inv} = \begin{bmatrix} 1 & 1 & 1 & 1/2 \\ 1 & 1/2 & -1 & -1 \\ 1 & -1/2 & -1 & 1 \\ 1 & -1 & 1 & -1/2 \end{bmatrix} \]
A key point is that the small errors caused by the right shifts are compensated by the 2-bit gain in the dynamic range of the input to the inverse transform. The inverse transform can thus be performed by the following expression:
\[ X = C_{inv} \, Y \, C_{inv}^{T} \]
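The factors 1/2 in the inverse matrix map directly to right shifts. As a minimal sketch (the function name `idct4_1d` and the use of plain `int` are assumptions made for illustration), one 1-D pass of the inverse transform can be written with additions, subtractions and two shifts only:

```c
/* 1-D inverse transform of four coefficients, built from the matrix
   [1 1 1 1/2; 1 1/2 -1 -1; 1 -1/2 -1 1; 1 -1 1 -1/2].
   The 1/2 factors become right shifts. Note that right-shifting a
   negative int is implementation-defined in C; an arithmetic shift
   is assumed here, as provided by mainstream compilers. */
void idct4_1d(const int in[4], int out[4])
{
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);

    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}
```

The 2-D inverse transform is obtained by applying this pass to the rows and then to the columns, followed by the final shift-and-round stage.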
Direct transform mapping and optimization
Various methods have been proposed in the literature to decrease the computational complexity of the (I)DCT, most of them based on decimation algorithms and butterfly structures. For the H.264 direct and inverse transforms, the butterfly structures are shown in Figure 6.22.

Figure 6.22: Fast implementation of the 1-D H.264 transform: (a) direct transform, (b) inverse transform

The direct transform is performed on the residual frame obtained from the pixel-to-pixel difference between the current frame and the previous frame, reconstructed by decoding the previously encoded frame. Therefore, an additional stage performing the pixel-to-pixel difference between corresponding macro-blocks is required before the butterfly structure. The following code represents the software implementation of the H.264 2-D 4x4 DCT.
/* residual block: pixel-to-pixel difference */
for( y = 0; y < 4; y++ )
    for( x = 0; x < 4; x++ )
        d[y][x] = pix1[y][x] - pix2[y][x];

/* first pass: 1-D transform of each row of d, stored transposed in tmp */
for( i = 0; i < 4; i++ )
{
    int s03 = d[i][0] + d[i][3]; int s12 = d[i][1] + d[i][2];
    int d03 = d[i][0] - d[i][3]; int d12 = d[i][1] - d[i][2];
    tmp[0][i] = s03 + s12; tmp[1][i] = 2*d03 + d12;
    tmp[2][i] = s03 - s12; tmp[3][i] = d03 - 2*d12;
}

/* second pass: 1-D transform of each row of tmp, stored transposed in dct */
for( i = 0; i < 4; i++ )
{
    int s03 = tmp[i][0] + tmp[i][3]; int s12 = tmp[i][1] + tmp[i][2];
    int d03 = tmp[i][0] - tmp[i][3]; int d12 = tmp[i][1] - tmp[i][2];
    dct[0][i] = s03 + s12; dct[1][i] = 2*d03 + d12;
    dct[2][i] = s03 - s12; dct[3][i] = d03 - 2*d12;
}
From a computational point of view, the direct transform requires pixel-to-pixel subtractions (highly parallel), sums and shifts. Multiplications are not required, since the matrix coefficients can be strength-reduced to shifts and sums. The two-dimensional DCT can be performed using the common row-column algorithm; the whole computation schema is thus the one represented in Figure 6.23, where the 1-D Transform blocks are the previously described 4-point 1-D DCT.

Figure 6.23: Fully-unfolded bi-dimensional transform diagram
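The strength reduction mentioned above can be made explicit with a multiply-free 1-D pass. This is a sketch under assumptions: the function name `fdct4_1d` is hypothetical, and the doubling is written as an addition because left-shifting a negative value is undefined in C (in hardware it is simply a one-bit shift of the wires).

```c
/* Multiply-free 4-point 1-D forward transform, equivalent to multiplying
   by the matrix [1 1 1 1; 2 1 -1 -2; 1 -1 -1 1; 1 -2 2 -1].
   The factor 2 is realized as x + x instead of 2 * x. */
void fdct4_1d(const int x[4], int X[4])
{
    int s03 = x[0] + x[3];
    int s12 = x[1] + x[2];
    int d03 = x[0] - x[3];
    int d12 = x[1] - x[2];

    X[0] = s03 + s12;
    X[1] = (d03 + d03) + d12;
    X[2] = s03 - s12;
    X[3] = d03 - (d12 + d12);
}
```

Applying the pass to the four rows and then to the four columns gives the row-column algorithm used for the 2-D transform.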
Considering the data range, the elements of the transformed matrix can be represented with 15 bits, since 9 bits are required after the pixel-to-pixel difference (pixels are represented by 8-bit variables) and 6 additional bits are required after the transform. Therefore, the representation of the whole matrix requires more bandwidth than is available in PiCoGA-III, which provides up to four 32-bit outputs. The execution of the transform must then be split into two successive calls, each of them providing one half of the matrix. To save memory space and to improve the communication bandwidth, two elements can be packed in the same output word without loss of precision. Therefore, the implementation on PiCoGA-III requires the insertion of a multiplexing layer for the selection of the outputs. The most efficient solution is to insert the multiplexing layer at the same level as the row-column transposition, as shown in Figure 6.24:
• the row computation is performed by a fully-unfolded schema, since for each column transform a sample of every row is required.
• the column computation is performed by a 2-way unfold, since this level of unfolding is the maximum allowed by the output bandwidth. The multiplexing layer selects the pair of rows under elaboration.

Figure 6.24: Partially unfolded 4x4 DCT schema
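The data-range argument and the packing of two elements per output word can be illustrated as follows. The sketch is hypothetical: the names `max_abs_coeff`, `pack2`, `unpack_lo` and `unpack_hi`, as well as the choice of placing the first element in the low half-word, are assumptions and do not describe the actual PiCoGA output format.

```c
#include <stdint.h>

/* Worst-case coefficient magnitude: |residual| <= 255 and each 1-D pass
   has a gain of at most 6 (sum of absolute values of [2 1 -1 -2]).
   255 * 6 * 6 = 9180 < 2^14, hence 15 bits including the sign. */
int max_abs_coeff(void)
{
    return 255 * 6 * 6;
}

/* Pack two signed 16-bit elements into one 32-bit word (lo in bits 0..15). */
uint32_t pack2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Unpacking relies on two's-complement conversion from uint16_t to int16_t,
   which is what mainstream compilers implement. */
int16_t unpack_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
int16_t unpack_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
```

With this layout, four 32-bit outputs carry eight elements, that is one half of the 4x4 matrix per call.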
This function features 9 inputs (4+4 for the two input blocks, with 8-bit pixels packed in 32-bit words, and 1 for the multiplexing) and 4 outputs. Completing the elaboration of one 4x4 block requires calling this function twice. The static features are summarized in Table 6.7, while Figure 6.25 shows the mapping on the array. A detailed analysis of this implementation reveals resource under-utilization in the reconfigurable device. As can be seen in Figure 6.25, some rows are only partially used, as for example rows 3, 7 and 11.
Although not critical in terms of performance, the under-utilization of the device can be seen as an overhead in terms of area and can cause additional energy consumption, which is roughly proportional to the number of active rows. It is possible to use a software pipelining methodology in order to fold cascaded pipeline stages into the same pipeline stage, by introducing status registers and feedback paths that allow working with data belonging to different iterations. By software pipelining the first two stages, and the third and fourth ones, we can save two rows, as shown in Figure 6.26. As a consequence, the PiCoGA operation must be called additional times, both to fill the internal status registers with valid data and to output the results. Since this process is pipelined, it introduces only a small overhead due to these prologue/epilogue requirements. After the first three additional calls, this operation provides one half matrix as output every cycle. The static features of this implementation are summarized in Table 6.8.

Pgaop name        sub4x4dct
Rows              20
Pipeline Stages   7
Latency           8
Issue Delay       1
Table 6.7: sub4x4dct

Figure 6.25: sub4x4dct rows occupation

Figure 6.26: Modified sub4x4dct for area optimization

Pgaop name        sub4x4dct
Rows              18
Pipeline Stages   5
Latency           6
Issue Delay       1
Table 6.8: sub4x4dct
Inverse transform mapping and optimization
As for the direct transform, the inverse transform is split into two computational kernels: in the first part, the residual matrix is extracted from the transformed matrix, while in the second part the reconstructed block is added. Since the input data have a greater range than in the direct transform, area requirements are more demanding and the optimized implementation requires two PiCoGA operations. The basic butterfly schema is the one shown in Figure 6.22(b). Furthermore, an additional stage must be added just before the output, in order to perform the shifting and rounding operations, as represented in Figure 6.27.

Figure 6.27: Fully-unfolded inverse 4x4-IDCT basic diagram
For this function too, the output data range does not allow mapping the whole function on PiCoGA-III, since the matrix elements require 13 bits to be represented (9 for the difference representation and 4 due to the transform and inverse transform). The methodology adopted to solve this problem is the same applied to the previous PiCoGA operation, with the introduction of a multiplexing layer to select the required outputs after each PiCoGA operation, as in Figure 6.28.
The last stage of this function performs a shift-and-round operation. In the software implementation this operation is obtained by adding 32 (0b100000 in binary) and right-shifting by 6 bits. In the hardware implementation, in order to reduce the area occupation, it is possible to carry out part of the right-shifting before the sum, thus reducing the number of bits required. After the shift-and-round, a clipping operation is performed, setting to 0 every negative value, and to 255 every value greater than 255. Assuming 16-bit input data, the clipping operation can be performed by the logical structure represented in Figure 6.29.

Figure 6.28: Partially-unfolded inverse 4x4-IDCT basic diagram

Figure 6.29: Modified clipping function structure
In this structure, the multiplexer selector is determined by analyzing the 8 most significant bits of the input data. If these bits are all 0, no clipping is needed (the number is non-negative and not greater than 255), and the output is obtained by passing the input data through the clipping structure unchanged. Otherwise clipping must be performed, and the output data are obtained by a 16-bit shift followed by a NOT operation. The first operation generates a binary number composed of all 0s if the input data is positive, or all 1s otherwise. Therefore, the NOT operation produces all 1s (corresponding to 255 if the number is interpreted as an unsigned char) if the input data is positive and all 0s (corresponding to 0) otherwise. This type of implementation allows using only 4 rows of PiCoGA-III for the clipping of 16 values. The high issue delay requires the insertion of some retiming registers, which increases the number of used rows (from 4 to 8).
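The shift-and-round and clipping steps described above can be sketched in C as follows. The function names `shift_round6` and `clip_u8` are hypothetical, and the sign replication is written with a conditional instead of a 16-bit arithmetic shift so that the code does not depend on implementation-defined behaviour; the selection logic otherwise mirrors the multiplexer structure of the text.

```c
#include <stdint.h>

/* Shift-and-round: add 32 (0b100000) and shift right by 6 bits.
   For negative v the result relies on an arithmetic right shift,
   which is implementation-defined in C but common in practice. */
int32_t shift_round6(int32_t v)
{
    return (v + 32) >> 6;
}

/* Clip a 16-bit signed value to the range 0..255. */
uint8_t clip_u8(int16_t v)
{
    uint16_t bits = (uint16_t)v;

    /* Upper 8 bits all zero: the value is already in 0..255, pass it on. */
    if ((bits >> 8) == 0)
        return (uint8_t)bits;

    /* Otherwise replicate the sign (all 1s if negative, all 0s if positive)
       and invert it: negative -> 0x00, too large -> 0xFF. */
    uint8_t sign_fill = (v < 0) ? 0xFF : 0x00;
    return (uint8_t)~sign_fill;
}
```

For example, `clip_u8((int16_t)shift_round6(6400))` yields 100, since (6400 + 32) >> 6 = 100.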
Summarizing, two functions are used to implement the inverse 4x4 DCT. The first one has 8 inputs, containing the input matrix composed of 16 elements of 16 bits, and 4 outputs returning the two selected columns. The second one has 8 inputs, 4 for each input matrix (residual and reference), and 4 outputs containing the output block composed of 16 pixels of 8 bits. Elaborating a 4x4 block inverse DCT requires calling the first function twice and the second one only once. The static features of the two PiCoGA operations are summarized in Table 6.9.

Pgaop name        F4x4idct   add4x4
Rows              22         14
Pipeline Stages   7          4
Latency           8          5
Issue Delay       1          1
Table 6.9: F4x4idct and add4x4
Results
Both the direct and inverse transforms are implemented in a way that is suitable for pipelined computation. In the case of the direct transform, only one PiCoGA operation is required. Therefore, it is possible to feed the reconfigurable device with new data every cycle, by means of the high-bandwidth direct memory access performed via the programmable address generators available in the DREAM adaptive DSP. A set of blocks is stored in the local memory, then the computation starts. The performance improves with the number of locally stored and elaborated macro-blocks, also defined as the interleaving factor, since the pipelining is better exploited and stalls are reduced.
On the contrary, in the case of the inverse transform, two PiCoGA operations are required. To pipeline the computation as much as possible, an intermediate data repository is necessary. The local data buffers of DREAM can be used for this purpose, running the two PiCoGA operations alternately and storing/reading intermediate data to/from the exchange buffer.
Fig. 6.30 shows the performance improvement (speed-up figure) with respect to a RISC processor working at the same frequency. As expected, the speed-up increases with the interleaving factor, saturating when the PiCoGA pipeline is completely active and the prologue/epilogue overhead is negligible compared to the overall computation time.

Figure 6.30: Speed-up figure with respect to a RISC processor working at the same frequency (speed-up vs RISC, logarithmic scale from 1 to 1000, against the interleaving factor from 1 to 2048, for 4x4 sub DCT and 4x4 add IDCT)

Fig. 6.31 shows the throughput achieved with respect to the interleaving factor. Since the direct transform provides one half output matrix for every PiCoGA trigger, the maximum bandwidth achievable is:
\[ B_{max} = 4 \times 32 \ \text{bit} \times f_{clk} = 128 \ \text{bit} \times f_{clk} \]
where $f_{clk}$ is the clock frequency. In the case of the inverse transform, two PiCoGA operations are required, thus the maximum achievable bandwidth is 12.8 Gbit/sec. As shown in Fig. 6.31, by properly sizing the interleaving factor, near-optimal performance is achieved. Exploiting the pipelining causes an increase in dynamic power that is roughly proportional to the number of active rows per cycle and to the memory bandwidth utilized. On the contrary, the energy efficiency, which combines the energy consumption with the achieved performance, increases with the interleaving factor, since the performance gain is greater than the energy increase. The related figure of merit is reported in Fig. 6.32.

Figure 6.31: Throughput achieved with respect to interleaving factor (throughput in Gbit/sec, logarithmic scale from 0.1 to 100, against the interleaving factor from 1 to 2048, for 4x4 Sub DCT and 4x4 Add IDCT)

Figure 6.32: Energy efficiency with respect to interleaving factor (energy efficiency in Mbit/sec/mW, logarithmic scale from 1 to 1000, against the interleaving factor from 1 to 2048, for 4x4 Sub DCT and 4x4 Add IDCT)
6.2.4 H.264 intra prediction with Hadamard transform for
4x4 blocks
Intra-frame prediction is introduced in the H.264 advanced video coding standard in order to estimate a 4x4 pixel block starting from the neighboring pixels, as shown in Fig. 6.33. Compared to previous standards, such as JPEG2000, this enhancement achieves a better compression gain at the same signal-to-noise ratio, as shown in [126]. Although different block sizes are supported by the standard, for the baseline profile the commonly used macro-block is 4x4 pixels, and the intra prediction is applied to the luminance component (or luma), which represents the grey-scale.

Figure 6.33: Intra prediction modes for 4x4 luma block (modes 0 to 8, predicted from the neighboring pixels A-H, I-L and M; mode 2 uses the mean of A-D and I-L)
Although H.264 does not specify any mode decision, several algorithms have been proposed with different trade-offs between quality and computational complexity, considering both distortion and rate. While in the high-complexity mode a sum of squared differences is used, for the low-complexity mode decision the distortion is evaluated by the sum of absolute differences (SAD) or the sum of absolute transformed differences (SATD) between the predictors and the original pixels (where the applied transform is the Hadamard transform). Usually, the coding performance obtained by selecting SATD is 0.2-0.5 dB better. The rate is estimated by the number of bits required to code the mode information, and depends, among other things, on the level of quantization. Most of the computational complexity is associated with the SAD and SATD elaboration, and these two kernels are considered for the implementation on DREAM.
4x4 SAD mapping and optimization
Similarly to the MPEG-2 standard, the SAD function is defined by:
\[ SAD = \sum_{i=0}^{3} \sum_{j=0}^{3} \left| a_{ij} - b_{ij} \right| \]
where $a_{ij}$ and $b_{ij}$ are the $(i,j)$-th elements of two macro-blocks A and B. SAD features a very high level of intrinsic parallelism, both at instruction level and at loop level. Instruction-level parallelism resides in the arithmetic properties of the mathematical definition, whereas the loop-level parallelism is due to the fact that the SAD computation could be applied in parallel to all the macro-blocks in a frame. Each pixel is represented by an 8-bit variable, so a whole macro-block can be packed in four 32-bit words. Since PiCoGA-III features up to twelve 32-bit input words, all the data required for the 4x4 SAD computation can be transferred every cycle. The output is the SAD value, which can range between 0 and $16 \times 255 = 4080$. Figure 6.34 shows the simplified (fully-unfolded) block diagram of the 4x4 SAD, subdivided in the three phases required: differences, absolute value computation and adder tree.
This operation fits the PiCoGA-III computational capabilities, and the achieved static features are summarized in Table 6.10.
Pgaop name        sad4x4
Rows              10
Pipeline Stages   6
Latency           7
Issue Delay       1
Table 6.10: 4x4 Sum of Absolute Differences (SAD)
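As a software reference for the three phases (differences, absolute values, adder tree), the following sketch computes the 4x4 SAD on macro-blocks packed in four 32-bit words each, as described above. The function name `sad4x4_packed` and the byte order inside each word are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* 4x4 SAD on two macro-blocks, each packed as four 32-bit words
   (one word per row, four 8-bit pixels per word). */
int sad4x4_packed(const uint32_t a[4], const uint32_t b[4])
{
    int sum = 0;
    for (int row = 0; row < 4; row++) {
        for (int k = 0; k < 4; k++) {
            int pa = (int)((a[row] >> (8 * k)) & 0xFFu);  /* pixel of A */
            int pb = (int)((b[row] >> (8 * k)) & 0xFFu);  /* pixel of B */
            sum += abs(pa - pb);                          /* |a - b|, accumulated */
        }
    }
    return sum;
}
```

The largest possible result is 16 x 255 = 4080, obtained when one block is all 255 and the other all 0.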
Figure 6.34: PiCoGA SAD structure (inputs Matrix 1 and Matrix 2; Phase 1: differences, Phase 2: absolute values, Phase 3: adder tree)
SATD mapping and optimization
The Sum of Absolute Transformed Differences (SATD) is defined as:
\[ SATD = \sum_{i=0}^{3} \sum_{j=0}^{3} \left| c_{ij} \right| \]
where $c_{ij}$ denotes the $(i,j)$-th element of the C matrix, which is the Hadamard-transformed difference between the macro-blocks A and B. Considering the case of 4x4 pixel blocks, the Hadamard transform of the difference matrix D is calculated by:
\[ C = H_4 \, D \, H_4^{T} \]
where $H_4$ (Hadamard transform) is defined by the orthogonal matrix:
\[ H_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \]
Since $H_N$ has N orthogonal rows, $H_N H_N^{T} = I_N$ (where $I_N$ is the NxN identity matrix) and $H_N^{-1} = H_N^{T} = H_N$. Therefore, the inverse transform is defined by
\[ D = H_4 \, C \, H_4^{T} \]
Investigating the number of sign transitions among the values of each column of $H_4$, it is possible to see that the first one has 0 sign transitions, the second one 3, the third one 1, and the fourth one 2. The number of sign transitions is often termed "sequency", and it is a concept analogous to frequency in the Fourier transform (see Fig. 6.35). Zero sign transitions correspond to a DC component, whereas a large number of sign changes corresponds to high-frequency components.

Figure 6.35: DCT and Hadamard transform: (a) DCT base, (b) Hadamard base

If the columns of $H_4$ are arranged by increasing sequency, the resulting matrix is called the Walsh transform matrix and is defined as:
\[ W_4 = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \]
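The count of sign transitions can be verified mechanically. The sketch below is illustrative only: the table `H4_SIGN` holds the sign pattern of the natural-ordered 4x4 Hadamard matrix (any constant scale factor is left out), and the helper name `sequency4` is an assumption. Since the matrix is symmetric, its rows and columns have the same patterns.

```c
/* Sign pattern of the 4x4 Hadamard matrix in natural order.
   The name H4_SIGN is illustrative; scale factors are omitted. */
static const int H4_SIGN[4][4] = {
    { 1,  1,  1,  1 },
    { 1, -1,  1, -1 },
    { 1,  1, -1, -1 },
    { 1, -1, -1,  1 }
};

/* Number of sign transitions between consecutive entries of v. */
int sequency4(const int v[4])
{
    int transitions = 0;
    for (int k = 1; k < 4; k++)
        if ((v[k] < 0) != (v[k - 1] < 0))
            transitions++;
    return transitions;
}
```

Sorting the basis vectors by the value returned by `sequency4` gives the Walsh ordering 0, 1, 2, 3.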
Just like in the 4x4 DCT, the resulting basic computational engine has the usual butterfly structure, and the bi-dimensional transform is obtained by a row/column algorithm. The butterfly structure for the 1-D Hadamard transform is depicted in Fig. 6.36, where both the direct and inverse data flow graphs are represented.

Figure 6.36: 1-D Hadamard transform butterfly schema: (a) direct transform, (b) inverse transform
The 4x4 SATD operation is very similar to the previously described 4x4 SAD function, with the difference that the SATD performs a transform before the computation of the absolute values. Fig. 6.37 shows the corresponding data flow graph.

Figure 6.37: Fully unfolded 4x4 SATD data flow graph (macro-block difference, 1-D transforms, absolute value, adder tree)

Figure 6.38: Partially folded 4x4 SATD block diagram (row difference, 1-D transform, shift register, 1-D transforms, absolute value, adder tree)
Although no limitation is imposed by the I/O resources, the mapping of a whole 4x4 SATD does not fit the resources available on PiCoGA-III, so some kind of partitioning or folding is required. In particular, we chose to fold the butterfly structure, elaborating only one row per call, and to keep the absolute value computation and the adder tree unfolded. In the middle, a shift register stores the intermediate results, allowing a SATD result to be written back every 4 cycles. Fig. 6.38 shows the block diagram of this implementation, while the required shift register is depicted in Fig. 6.39.

Figure 6.39: Shift register structure used for the matrix transposition
Given this computational structure, four calls are necessary to obtain a valid SATD computation, since only after this time is the internal shift register, which implements the matrix transposition, filled with valid data. Nevertheless, a streaming work-plan is possible, although it requires output sub-sampling. Furthermore, software pipelining and retiming registers are used to improve the utilization of PiCoGA-III. Fig. 6.40 shows the mapped operation, while Table 6.11 summarizes the static performance.

Pgaop name        satd4x4
Rows              24
Pipeline Stages   6
Latency           7
Issue Delay       1
Table 6.11: 4x4 SATD static performance
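The folded organization (one row per call, a register bank holding the transformed rows, a valid result every fourth call) can be modeled in software as follows. This is a behavioral sketch under assumptions: the type `satd_state` and the function names are hypothetical, the butterflies are unnormalized (any constant scale factor of the Hadamard matrix is omitted), and the model does not reproduce the pipelining of the actual PiCoGA operation.

```c
#include <stdlib.h>

/* State kept between calls: four transformed rows and a call counter.
   Must be zero-initialized before the first call. */
typedef struct {
    int rows[4][4];
    int count;
} satd_state;

/* Unnormalized 4-point Hadamard butterfly (outputs in sequency order). */
static void hadamard4(const int x[4], int X[4])
{
    int s03 = x[0] + x[3], s12 = x[1] + x[2];
    int d03 = x[0] - x[3], d12 = x[1] - x[2];
    X[0] = s03 + s12;
    X[1] = d03 + d12;
    X[2] = s03 - s12;
    X[3] = d03 - d12;
}

/* Processes one row of A and B per call. Returns 1 and writes *satd on
   every fourth call, 0 otherwise. */
int satd4x4_push_row(satd_state *st, const unsigned char a[4],
                     const unsigned char b[4], int *satd)
{
    int d[4];
    for (int j = 0; j < 4; j++)
        d[j] = (int)a[j] - (int)b[j];          /* row difference */
    hadamard4(d, st->rows[st->count]);          /* row transform, stored */

    if (++st->count < 4)
        return 0;                               /* register not yet full */
    st->count = 0;

    int sum = 0;
    for (int c = 0; c < 4; c++) {               /* column transforms */
        int col[4], t[4];
        for (int r = 0; r < 4; r++)
            col[r] = st->rows[r][c];            /* transposition */
        hadamard4(col, t);
        for (int r = 0; r < 4; r++)
            sum += abs(t[r]);                   /* absolute values + adder tree */
    }
    *satd = sum;
    return 1;
}
```

A caller feeds the four rows of a block in sequence and reads the result only when the function returns 1, which corresponds to the output sub-sampling mentioned above.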
Results
The results achieved by implementing the 4x4 SAD and SATD functions on the DREAM architecture are reported in Fig. 6.41, Fig. 6.42 and Fig. 6.43. In particular, Fig. 6.41 shows the speed-up with respect to a RISC processor working at the same frequency as DREAM, while Fig. 6.42 reports the absolute throughput achieved. Fig. 6.43 shows the energy efficiency, measured in terms of Mbit/sec/mW. This last figure of merit is the inverse of nJ/bit, a value that gives an idea of the amount of energy spent for the elaboration of each output bit.

Figure 6.40: Optimized SATD mapping

Figure 6.41: 4x4 SAD and SATD speed-up figures with respect to the interleaving factor

Figure 6.42: 4x4 SAD and SATD throughput with respect to the interleaving factor

Figure 6.43: 4x4 SAD and SATD energy efficiency with respect to the interleaving factor

Chapter 7
Performance and development
time trade-offs
Application development on reconfigurable processors is performed by partitioning the computational workload between software and (reconfigurable) hardware. As usual, partitioning is an iterative process of design refinements or (in the worst case) re-designs in order to achieve the optimal result. But what is the optimal result? Of course, we can say that constraints, such as real-time requirements, must be satisfied, but it may be possible to further improve the performance in order to achieve a greater energy reduction. As an example, additional performance improvements can be used to over-boost the application by reducing the pure cycle count, and then to reduce the working frequency to achieve better energy consumption figures. When complex applications feature a set of computational kernels (and not only one critical hot spot), is it required to speed up all the kernels as much as possible, or is it better to focus the optimization on a subset of them?
Of course, performance, be it computation speed or energy consumption, is not the only cost function to be evaluated if performance improvements are generated by additional development, and thus by additional development costs. As for the well-known assembly-level optimization, the programmer must choose the most appropriate development methodology for each critical kernel. As shown in the previous chapter, the effective utilization of reconfigurable devices can be driven by a software or a hardware approach which, depending on the application field, can return different performance.
The different performance and development time trade-offs have been quantitatively evaluated on the XiRisc reconfigurable processor. The adopted functional-unit computational model reduces any communication overhead between the processor core and the reconfigurable device. This allows obtaining interesting performance improvements after a few hours or a few days of work, by means of local optimizations obtained by re-mapping part of the software code on the reconfigurable device. On the contrary, hardware approaches, which deeply investigate the mathematical aspects of the computation to match the device capabilities, promise impressive performance improvements at the cost of a long development time. Commonly, a hardware approach may require additional skills with respect to a software development background, and this can be seen as an additional cost.
Several applications were developed in order to evaluate the different trade-off points offered by the Griffy methodology in terms of performance, required skills and development time. Previous works, such as [73, 74], detailed the implementation of common applications on the XiRisc reconfigurable processor, and [68, 66, 67] summarize performance and energy reduction on typical benchmarks compared to traditional general-purpose embedded processors and DSPs. In this section, we discuss the relation between the performance achieved and the development time spent working on the XiRisc.
The applications were chosen in order to be representative of the whole "embedded scenario", using open-source codes that are part of well-known benchmark suites such as MediaBench [57] or that come from well-known applications in the fields of image processing, telecommunications, and cryptography. Each algorithm has been optimized for the XiRisc architecture, exploiting reconfigurability as much as possible. In this case, Griffy-C code is used as the entry point to configure the PiCoGA, refining the initial configuration at low level using, if necessary, LUTs or built-in functions.
Columns 2-6: speed-up after development time. Columns 7-8: lines of code.

Algorithm                      | 1/2 day | 1 day | 10 days | >1 month | >3 months | SW only | Griffy-C | Methods of development
IDEA                           | 1.9     | 2.1   | 2.6     | 2.7      | -         | 600     | 740      | Clustering, Mult
CRC                            | 2.3     | 2.6   | 2.6     | 4.0      | -         | 154     | 430      | Clustering, LUT
RSA                            | 1       | 1.1   | 1.3     | 1.6      | -         | 2500    | 130      | Clustering, Pipelining
AES (128-bit)                  | 1       | 1     | 1.5     | 2.5      | -         | 860     | 490      | Clustering, LUT
Kasumi                         | 1.1     | 1.5   | 1.6     | 2.4      | -         | 507     | 391      | Clustering, LUT
Reed-Solomon Encoder (255,239) | 3       | 4     | 7       | 10       | 80        | 156     | 1500     | Clustering, LUT, HW-approach
Viterbi Decoder                | 1.4     | 1.6   | 1.7     | 3        | -         | 300     | 250      | Clustering, LUT
FDCT                           | 1.5     | 1.6   | 1.8     | 2        | 2.5       | 250     | 1000     | Clustering, LUT
IDCT                           | 1.2     | 1.5   | 2       | 2.5      | -         | 260     | 250      | Clustering
Quantization                   | 2       | 2.5   | -       | -        | -         | 400     | 1000     | Clustering, Pipelining, Mult
VLC                            | 1.5     | 2.1   | -       | -        | -         | 400     | 24       | Clustering, Pipelining
Motion Estimation              | 1.5     | 3     | 7       | 14       | 16        | 350     | 300      | Clustering, Loop unrolling, Pipelining, Loop transformation
MPEG-2 Encoder                 | 1.2     | 1.5   | 2       | 2.5      | 5         | 10000   | 2800     | -
MPEG-2 Decoder                 | 1.2     | 1.3   | 1.4     | 1.5      | -         | 4000    | 120      | -
Find Best                      | 1.5     | 2.5   | 5.4     | -        | -         | 200     | 50       | Clustering, Pipelining
Find Acbk                      | 1.1     | 2.2   | 4.5     | -        | -         | 200     | 50       | Clustering, Pipelining
Estim Pitch                    | 1.3     | 2.3   | 4.8     | -        | -         | 200     | 50       | Clustering, Pipelining
Vocoder G.723.1                | 1.1     | 1.2   | 1.9     | 2        | -         | 16500   | 210      | -
Residu                         | 1.5     | 3.5   | 5.0     | -        | -         | 83      | 144      | Clustering, Loop unrolling
Vocoder ETSI GSM 06.60         | 1.1     | 1.5   | 1.8     | 2.6      | -         | 15000   | 169      | -
Template Matching Ncc          | 1.1     | 1.5   | 4       | 4.4      | -         | 1700    | 291      | Clustering, Pipelining
Template Matching Bpc          | 1.1     | 1.2   | 1.5     | 1.7      | -         | 1900    | 291      | Clustering, Pipelining
Average                        | 1.5     | 2     | 3       | 3.8      | 7.5       | 2690    | 500      | -
Table 7.1: Experimental results on application development
Reported results include both whole applications and many signiﬁcant
critical kernels. In all cases, most of the work was performed by under-
graduate students with about 1-2 weeks of training in reconﬁgurable com-
puting, and the development was stopped when an implementation pro-
viding near-optimal results was obtained. This was either because all the
computational resources were fully exploited, and hence no further per-
formance improvement could be achieved with the same architecture, or
because further improvements would have been too expensive in terms of
development time.
Table 7.1 shows the performance improvement in terms of speed-ups
compared to a VLIW RISC processor, featuring the same instruction set as
XiRisc but not augmented by PiCoGA. In this case, the speed-up takes into
account only the pure cycle count, considering core-only and PiCoGA-
augmented processors as working at the same clock frequency. The table
also reports the number of additional Griffy-C code lines required for the
ﬁnal implementation compared to the amount of initial code. This should
give an idea of the amount of work required to achieve the ﬁnal solution.
Table 7.1 also indicates the main methodology and implementation tech-
niques providing the most signiﬁcant performance gain for the speciﬁc
application.
The same results are summarized in Figure 7.1. Two zones in the graph can be clearly distinguished, one corresponding to pure software optimizations, such as loop unrolling and memory reorganization, and one corresponding to more hardware-oriented optimizations. In less than 2 days of work, one can easily obtain 2-3x speed-ups using the automated C-to-DFG translation and standard techniques (software pipelining and loop unrolling). Longer development times, however, lead to much larger performance improvements, up to one order of magnitude. The highest speed-ups are achieved through manual optimization, rescheduling of assembly instructions, careful algorithm analysis, and, in extreme cases, even hardware synthesis.
Figure 7.1: Application development trade-off
[Plot "Performance vs Development Time": speed-up (logarithmic axis, 1 to 100) against development time (0 day, 1/2 day, 1 day, 10 days, >1 month, >3 months). Series: Average, IDEA, CRC, RSA, AES, Kasumi, Reed-Solomon Encoder, Viterbi Decoder, FDCT, IDCT, Quantization, VLC, Motion Estimation, MPEG2 Encoder, MPEG2 Decoder, Find_Best, Find_Acbk, Estim_Pitch, Vocoder G.723.1, Residu, Vocoder ETSI GSM 06.60, Template Matching Ncc, Template Matching Bpc.]

Figure 7.2: Development Time vs Speed-Up percentage
[Plot "Utilization": speed-up percentage (0% to 120%) against development time, with the same time axis and the same series as Figure 7.1.]

Figure 7.3: Distribution of speed-up with respect to development time
[Six histograms, one per development time (1/2 day, 1 day, 10 days, 1 month, 3 months, more), each over speed-up bins 1 to 10 and "Other".]

For some applications, the maximum performance gain is obtained by
re-writing the algorithm from scratch, following the analysis of its
mathematical structure. This is, for example, the case of the Reed-Solomon
Encoder, where software implementations are performed using 256-element
hash-tables and arithmetical operations, while a hardware approach
requires one to directly implement operations in Galois Fields arithmetic
using LUTs. The development time reported includes both the time
required to develop the new algorithm and the time required to implement
and validate it.
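To make the contrast concrete, the sketch below multiplies two elements of GF(2^8) in both styles. It is an illustration under stated assumptions: the field polynomial 0x11D (x^8+x^4+x^3+x^2+1), the generator 2 and all function names are choices made here, not taken from the thesis implementation. `gf_mul_tables` is the table-driven software form built on 256-element tables, while `gf_mul_bitwise` uses only shifts, ANDs and XORs, the kind of logic that maps directly onto LUTs.

```c
#include <stdint.h>

#define GF_POLY 0x11Du  /* assumed field polynomial x^8+x^4+x^3+x^2+1 */

static uint8_t gf_exp[256];  /* antilog table: gf_exp[i] = 2^i in the field */
static uint8_t gf_log[256];  /* log table: gf_log[2^i] = i */

/* Build the 256-element tables used by the software-style multiplier. */
void gf_init_tables(void)
{
    unsigned x = 1;
    for (unsigned i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100u)
            x ^= GF_POLY;   /* reduce modulo the field polynomial */
    }
    gf_exp[255] = gf_exp[0];
    gf_log[0] = 0;          /* log(0) is undefined; this entry is never used */
}

/* Software style: two look-ups, an integer addition, one more look-up. */
uint8_t gf_mul_tables(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    unsigned s = (unsigned)gf_log[a] + gf_log[b];
    if (s >= 255)
        s -= 255;
    return gf_exp[s];
}

/* Hardware style: shift-and-XOR, no tables and no integer arithmetic. */
uint8_t gf_mul_bitwise(uint8_t a, uint8_t b)
{
    unsigned acc = 0;
    unsigned aa = a;        /* a multiplied by x^i, reduced */
    for (int i = 0; i < 8; i++) {
        if (b & (1u << i))
            acc ^= aa;
        aa <<= 1;
        if (aa & 0x100u)
            aa ^= GF_POLY;
    }
    return (uint8_t)acc;
}
```

After `gf_init_tables()` has been called once, the two multipliers agree on every pair of operands.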
Figure 7.2 shows the average speed-up obtained by spending additional
time, as a percentage of the speed-up of the final implementation. About
70% of the performance gain is obtained in less than 10 days. Figures 7.1
and 7.2 can guide a designer in choosing a cost-effective development
strategy and an appropriate trade-off between expected performance and
required development time and expertise, balancing for example the time
spent on each kernel in a whole application.
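One way to balance the time spent on each kernel is to weigh every kernel speed-up by the share of execution time it accounts for, as in Amdahl's law. The sketch below is illustrative only: the function name is hypothetical and the profile used in the note that follows is made up, not measured on the benchmarks of Table 7.1.

```c
#include <stddef.h>

/*
 * Overall speed-up of an application whose i-th kernel takes fraction[i]
 * of the original execution time and is accelerated by speedup[i].
 * The time not covered by any kernel is left unaccelerated (Amdahl's law).
 */
double overall_speedup(const double *fraction, const double *speedup, size_t n)
{
    double covered = 0.0;   /* share of the run time spent in the kernels */
    double new_time = 0.0;  /* normalised run time after acceleration */
    for (size_t i = 0; i < n; i++) {
        covered += fraction[i];
        new_time += fraction[i] / speedup[i];
    }
    new_time += 1.0 - covered;
    return 1.0 / new_time;
}
```

For instance, with two kernels taking 60% and 20% of the run time, a 10× speed-up on the first and a 2× speed-up on the second give 1/(0.06 + 0.10 + 0.20), about 2.8× overall, which suggests where further development time is best spent.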
Figure 7.3 shows the distribution of performance speed-ups with respect
to the development time spent. While the average speed-up can be biased
by the performance of a few applications, the distribution of speed-ups
indicates the number of algorithms that achieve a performance improvement
in a specific range. In the short term, the distribution is concentrated
around 2-3× speed-up figures, while in long-term development three main
regions can be identified, depending on the algorithm features. The first
peak in the long-term histogram is typical of algorithms having little
parallelism or not well-suited to PiCoGA, while applications with a high
degree of instruction- and data-level parallelism benefit from more than
1 month of development time (achieving an average performance peak of
5×). Algorithms that require one to re-design the application adopting a
hardware-oriented approach can achieve speed-ups greater than 10× (3rd
peak in the histogram), but the development usually takes more than
3 months.

Algorithm                        Speed-Up   Energy Saving   Development Time on DSP
MPEG-2 Encoder                   1.9×       63.3%           11h
MPEG-2 Decoder                   1.2×       58.4%           3h
IDCT                             1.9×       73.8%           1h
IDCT (a)                         0.8×       -160%           -
Motion Estimation                2.3×       64.8%           7h
Reed-Solomon Encoder (255,239)   80×        80.1%           20h
Residu (b)                       0.2×       -340%           15h
AES                              1.02×      20%             20h
IDEA                             0.85×      0.11%           1h
RSA (c)                          1.2×       69.8%           3h

(a) Using the 16-bit TI C5510 and the assembly code provided by the optimized DSP library
(b) Using the dedicated application-specific instruction set
(c) 1024 bit key and 2KByte message

Table 7.2: XiRisc vs. TI TMS320C6713 Performance Comparison
Table 7.2 shows a performance comparison between the XiRisc reconfigurable
processor and a TI TMS320C6713 32-bit general-purpose DSP capable of
executing up to 8 32-bit instructions per cycle [96]. It can be observed
that in many cases XiRisc achieves better performance in terms of both
speed and energy. On the other hand, as expected, the DSP performs better
in heavily MAC-intensive applications. For example, consider the Residu
function, which represents a typical filtering kernel of low bit-rate
audio coding using saturating arithmetic. The implementation on XiRisc
uses the PiCoGA to implement the saturating part, while the processor core
provides the multiplier and access to the memory sub-system. This
implementation requires ~1900 clock cycles. For this kind of kernel, the
C6713 represents an almost application-specific architecture, providing a
dedicated instruction set for saturating arithmetic and 2 concurrently
available multipliers. Table 7.3 reports the number of cycles required to
execute Residu with respect to each improvement step. As expected, the
most significant improvement is associated with the introduction of the
intrinsics (the dedicated instructions), which substitute both the
multiplication and the saturation. Nevertheless, this first step shows a
result that is not so distant (only a factor of 2) from the reconfigurable
solution proposed by the XiRisc processor. Further optimizations have been
obtained via accurate scheduling at the assembly level.

            TI C6713                                                              XiRisc
Algorithm   Standard C       C with intrinsics   Optimized C      Assembly-level
                                                                  Optimization
            (clock cycles)   (clock cycles)      (clock cycles)   (clock cycles)
Residu      27910            994 (~1h.)          386 (5h.)        360 (15h.)      1914
AES         1200             -                   316 (5h.)        293 (20h.)      288
IDEA        360              -                   373 (a) (1h.)                    220

(a) with an interleaving factor of 2, thus 2 blocks are elaborated concurrently

Table 7.3: XiRisc vs. TI TMS320C6713 Performance Comparison
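As a consistency check, the speed-ups of Table 7.2 can be recomputed from the cycle counts of Table 7.3 by dividing the best DSP figure by the XiRisc figure. The symbol S is introduced here for the XiRisc-over-DSP speed-up, and dividing the IDEA count by the interleaving factor of 2 to obtain a per-block figure is an assumption made for this check.

```latex
\begin{align*}
S_{\text{Residu}} &= \frac{360}{1914} \approx 0.19, \\
S_{\text{AES}}    &= \frac{293}{288} \approx 1.02, \\
S_{\text{IDEA}}   &= \frac{373/2}{220} \approx 0.85 .
\end{align*}
```

These values match the 0.2×, 1.02× and 0.85× entries reported for the three kernels in Table 7.2.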
In the case of AES and IDEA, neither XiRisc nor the DSP can be clas-
siﬁed as application-speciﬁc supports. However the multiplicative kernel
of IDEA can beneﬁt from the dual-multiplier architecture of the DSP and
in the XiRisc implementation we mapped an additional multiplier on the
PiCoGA.
Figure 7.4 shows a straightforward comparison of the development design
spaces of XiRisc and the DSP. Performance improvements are considered for
each architecture with respect to its baseline. In the case of XiRisc the
baseline is the execution time of the processor core without the PiCoGA,
while for the DSP it is the execution time obtained without optimizations.
Speed-ups are thus normalized with respect to the basic features of a
given architecture which are directly exploitable by a compiler, that is
by simply writing pure ANSI C code. In this way we only consider the
effort required to program the PiCoGA to accelerate a kernel, regardless
of the basic processor architecture in which it is integrated. Only
critical kernels were taken into account, because from an optimization
point of view a complete application can be considered as a collection of
kernels. In the comparison we considered XiRisc-efficient kernels
(Reed-Solomon Encoder, Motion Estimation), DSP-efficient kernels (IDCT,
Residu), and “neutral” applications (IDEA, RSA, AES). In particular, for
the IDCT we considered the implementation on a TI C5510, which uses the
application-specific DCT library and hence allows one to obtain better
performance, despite the additional programming complexity due to the
16-bit processor architecture. We also considered the implementation with
intrinsics as the baseline for the Residu implementation on the TI C6713,
since this optimization step is performed in a very straightforward
manner, merely using #defines to rename the application-specific
instructions. The average results approximately confirm the “learning”
curve of XiRisc in Figure 3.1. While the DSP curve saturates after a few
days, the XiRisc architecture allows the user to keep improving for a
longer time, and hence to achieve much higher performance.

Figure 7.4: XiRisc vs DSP Development Time/Speed-Up analysis
[Plot: speed-up (logarithmic axis, 1 to 100) against development time (0 day, 1/2 day, 1 day, 10 days, >1 month, >3 months). Series: XiRisc and DSP averages, plus XiRisc and DSP curves for IDEA, RSA, AES, Reed-Solomon Encoder, IDCT, Motion Estimation and Residu; the DSP legend marks IDCT with (2) and Residu with (1).]

Figure 7.5: DREAM speed-up
[Plot: speed-up vs RISC (logarithmic axis, 1 to 1000) against interleaving factor (logarithmic axis, 1 to 10000). Series: Add4x4idct, Sub4x4dct, Sad4x4, Satd4x4, Ofdm_Mapper-1, Ofdm_Mapper-4, Ofdm_Mapper-8, AES-128.]
A similar analysis can be carried out for the DREAM architecture.
Compared to the XiRisc processor, the DREAM architecture performs
instruction set extension using a co-processor model of computation. On
the application side, three main differences between the two processors
can be underlined, which could impact both performance and development
time:
• high-bandwidth direct access to the memory subsystem, by means of
programmable address generators;
• loosely coupled register file and dedicated memory subsystem, which
increases the communication overhead between processor core and
reconfigurable accelerator;
• PiCoGA-III instead of PiCoGA 1.0, with roughly double the computational
capabilities.
Figure 7.6: DREAM throughput
[Plot: throughput in GBit/sec (logarithmic axis, 0.01 to 100) against interleaving factor (logarithmic axis, 1 to 10000). Series: Add4x4idct, Sub4x4dct, Sad4x4, Satd4x4, Ofdm_Mapper-1, Ofdm_Mapper-4, Ofdm_Mapper-8, AES-128.]
On the one hand, the DREAM architecture may appear less programmable
than the XiRisc processor, because of the increased communication overhead
and the need to manually handle both the allocation of registers inside
the dedicated register file and the access to the memory. On the other
hand, these features allow DREAM to overcome most of the bottlenecks of
the XiRisc processor (for example, the access to the memory), thus
improving the achieved performance. It should be noticed that in the
XiRisc processor the memory bottleneck causes a low utilization of the
PiCoGA: 1 or 2 rows are active at a time (with a peak of 12 in the motion
estimation algorithm), whereas in DREAM this average value grows to 22.
Performance improvements in terms of speed-up, throughput and energy
efficiency for the DREAM architecture are reported in Fig. 7.5, Fig. 7.6
and Fig. 7.7, respectively.
With respect to XiRisc, DREAM encourages the use of hardware approaches
by providing additional degrees of freedom, although fast software
approaches could benefit from the same features to overcome the increased
communication overhead. Usually, most of the time is spent on the design
of the instruction set extension, deeply analyzing the algorithm and its
mathematical background. In fact, following hardware approaches, DREAM
achieves impressive performance in 2-4 weeks of work, whereas the XiRisc
processor requires 1-2 months. Moreover, further performance improvements
are easily achieved by exploiting data parallelism, using the address
generators to perform some kind of streaming or fully pipelined
computation.

Figure 7.7: DREAM energy efficiency
[Plot: energy efficiency in Mbit/sec/mW (logarithmic axis, 0.1 to 1000) against interleaving factor (logarithmic axis, 1 to 10000). Series: Add4x4idct, Sub4x4dct, Sad4x4, Satd4x4, Ofdm_Mapper-1, Ofdm_Mapper-4, Ofdm_Mapper-8, AES-128.]
Concerning the skills required to develop applications on the DREAM
architecture, most of the required knowledge is the same as for XiRisc.
In fact, the Griffy approach abstracts the reconfigurable device at the
level of a software DFG, also making it possible to handle the pipeline
structure efficiently through data dependencies. The data analysis
required to exploit pipelining as much as possible is a well-known concept
in DSP programming, since most DSPs provide scratch-pads with programmable
access patterns. In this respect, since it has been affirmed that embedded
reconfigurable computing represents the most natural evolution of the DSP,
it could be said that the direct memory access provided by DREAM is not an
obstacle to programmability but the solution to a bottleneck of XiRisc.
For both architectures, as well as for XiSystem, the Griffy approach has
proved to be an effective way of programming, thus showing the generality
of the approach. Although XiSystem, which includes an additional eFPGA, is
not involved in this analysis of the development time, we can reasonably
expect similar results also in this case. Lower speed-ups should be
expected, due to communication overheads and to the use of a
general-purpose device providing further reconfigurability at the level of
I/O protocols.

Chapter 8
Conclusions
During this thesis an application development environment for embedded
reconfigurable processors has been developed, focusing on enabling
technologies that deliver reconfigurable computing to software
programmers. In fact, one of the greatest obstacles to the deployment of
reconfigurable processors is the required co-design of both software and
hardware parts. Co-design involves knowledge and skills that are not
common in the field of embedded applications, long dominated by solutions
based on microcontrollers or DSPs augmented with application-specific
hardware accelerators. Although several approaches and languages have been
proposed to handle reconfigurable processors, a homogeneous and effective
solution has not been found yet, mainly because of the difficulty of
matching performance gain with user-friendly design control. While
hardware designers may obtain significant benefits from traditional HDL
flows, programmers cannot optimize the software implementation without
specific training.
To be appealing to the wide audience of DSP programmers, a simplified C
syntax, termed Griffy-C, has been proposed as the main entry point for
mapping onto reconfigurable devices. Griffy-C provides a DFG-based
abstraction of the underlying hardware, and implements the DFG in a
pipelined form to increase the throughput. In the Griffy paradigm, most of
the optimization steps can be carried out as graph transformations (e.g.
unfolding, software pipelining), and Griffy-C can be seen as the
counterpart of the assembly-level optimization performed by DSP
developers, not requiring any expertise in hardware design.
In the first phase of this thesis, a complete tool-chain for the XiRisc
reconfigurable processor has been implemented. Griffy code is used to
configure the reconfigurable functional unit of XiRisc (the PiCoGA), while
the processor core is programmed in ANSI C. The implemented tool-chain
provides simulation engines for both debugging and cycle-accurate
performance evaluation. Graphical interfaces are provided for source-level
debugging of both the processor core and the reconfigurable device. The
user-friendly framework also allowed inexperienced users to develop their
applications on the XiRisc processor, obtaining valuable results after a
few days of work.
In a second phase, the Griffy approach has been extended to support a
commercially available eFPGA, added to the XiRisc processor in the
XiSystem architecture to provide pre/post-processing capabilities and
configurability at the level of I/O protocols. A specific back-end has
been implemented in order to generate VHDL from Griffy-C. Furthermore, the
Griffy approach has been applied to the DREAM adaptive DSP, which includes
the 3rd release of PiCoGA, directly connected to the memory sub-system by
means of programmable address generators. In this last release, Griffy has
been augmented with the capability to handle built-in functions in order
to directly instantiate advanced operators, similarly to the case of DSP
intrinsics.
A key issue of this work is a transparent instruction set extension
mechanism, applied to both functional unit and co-processor models, that
allows the inexperienced user full control over the embedded
reconfigurable hardware, thus enabling significant performance
improvements in terms of both speed and energy consumption. The
applications developed, some of them by students, provide a quantitative
assessment of this aspect, by describing precisely how much performance
gain can be obtained with a longer design cycle. In particular, two main
zones can be identified depending on the programming approach and
consequently the development time spent. In the first one, corresponding
to pure software optimization, in less than 2 days of work one can easily
obtain 2-3× speed-ups simply by rewriting the original code in Griffy-C,
performing some memory reorganization and using standard development
techniques like software pipelining and loop unrolling.
Additional development time, however, can be spent to achieve more
dramatic performance improvements, up to one order of magnitude in the
case of XiRisc and two orders in the case of DREAM. The latter approach
often requires manual optimizations, involving assembly and instruction
rescheduling, deep analysis of the algorithms and their mathematical
backgrounds, as well as, in extreme cases, hardware synthesis or
hardware-aware mapping. In combination with Amdahl’s law, these
experimental curves, showing the different trade-offs between performance
and development time, can guide application developers in identifying
cost-effective mapping approaches, optimizing the overall development
cost.

Appendix A
Griffy-C syntax
66. Griffy the Cooper
THE COOPER should know about tubs. / But I learned about life as well,
And you who loiter around these graves / Think you know life.
You think your eye sweeps about a wide horizon, perhaps,
In truth you are only looking around the interior of your tub.
You cannot lift yourself to its rim / And see the outer world of things,
And at the same time see yourself. / You are submerged in the tub of yourself?
Taboos and rules and appearances, / Are the staves of your tub.
Break them and dispel the witchcraft / Of thinking your tub is life!
And that you know life!
Spoon River Anthology - E. L. Masters
A.1 Overview
Griffy-C is a restricted subset of ANSI C syntax enhanced with some
extensions (to handle, for example, variable resizing) that make it
possible to describe a software Data Flow Graph (DFG) suitable for
implementation on reconfigurable devices. Differences with other
approaches reside primarily in the fact that Griffy is aimed at the
extraction of a pipelined DFG from standard C and its mapping over a
gate-array that is also pipelined by explicit stage enable signals. The
fundamental feature of Griffy-based algorithm implementation is that Data
Flow Control is not synthesized on the array cells but is handled
separately by a hardwired control unit, thus allowing a much smaller
resource utilization and easing the mapping phase. This also greatly
enhances the regularity of the placement.
Griffy-C is used as a friendly format for programming reconfigurable
devices using hand-written behavioral descriptions of DFGs, but can also
be used as an intermediate representation (IR) automatically generated by
high-level compilers. As shown in Fig. A.1, it is thus possible to provide
different entry points for the compilation flow: high-level C
descriptions, preprocessed by a compiler front-end into Griffy-C;
behavioral descriptions (using hand-written Griffy-C); and gate-level
descriptions, obtained by logic synthesis and again described at LUT
level. The figure also shows the capability of programming different
devices by changing the back-end flow, indicating as an example the
possibility to generate configurations for PiCoGA or eFPGAs (by means of
VHDL).
Figure A.1: Multiple entry-point Griffy flow

Restrictions essentially refer to the supported operators (only operators
that are significant and can benefit from hardware implementation are
supported) and to semantic rules introduced to simplify the mapping onto
the gate-array. Three basic hypotheses are assumed:
• DFG-based description: no control flow statements (if, loops or function
calls) are supported, as data flow control is managed by the embedded
control unit. Conditional assignments (? :) are implemented on standard
multiplexers.
• single assignment: each variable is assigned only once, avoiding
hardware connection ambiguity.
• manual dismantling: only single-operator expressions are allowed
(similarly to intermediate representations or assembly code).
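Because Griffy-C is a subset of ANSI C, a body that obeys the three hypotheses is also ordinary C. The sketch below is a hypothetical kernel written for illustration (the function name and the computation are not from the thesis): every statement holds one operator, every variable is assigned once, and the only selection construct is the conditional assignment. The enclosing function and its return statement are plain-C scaffolding added so that the example can be compiled and tested.

```c
#include <stdint.h>

/* Clamp |a - b| to a limit, written under the three Griffy-C hypotheses. */
uint32_t clamped_abs_diff(uint32_t a, uint32_t b, uint32_t limit)
{
    uint32_t d_ab, d_ba, a_gt_b, diff, over, result;

    d_ab   = a - b;                 /* one operator per statement */
    d_ba   = b - a;                 /* independent of d_ab: can run in parallel */
    a_gt_b = a > b;                 /* comparison result drives a multiplexer */
    diff   = a_gt_b ? d_ab : d_ba;  /* conditional assignment, no if statement */
    over   = diff > limit;
    result = over ? limit : diff;   /* each variable is assigned exactly once */

    return result;
}
```

Read as a graph, the six assignments are the six DFG nodes, and the variable names label the edges between them.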
Griffy-C operators are summarized in Tab. A.1 and will be described
in the following sections.
Natively supported variable types are signed/unsigned int (32-bit), short
int (16-bit) and char (8-bit). The width of variables can be defined at
bit level using #pragma directives. Operator width is automatically
derived from the size of the operands. Variables defined as static are
used to allocate static registers inside the reconfigurable device, that
is, registers whose value is maintained across successive calls (e.g. to
implement accumulations). All other variables are considered “local” to
the operation and are not visible to successive issues.
Arithmetical operators
dest = src1 [+, -] src2;
Bitwise logical operators
dest = src1 [&, |, ^] src2; dest = ~ src1;
Shift operators
dest = src1 [<<, >>] constant;
Comparison operators
dest = src1 [<, <=, >, >=, ==, !=] src2;
Conditional Assignment (Multiplexer operator)
dest = src1 ? src2 : src3;
Extra-C operators
LUT operator: dest = src1
￿ 0x[LUT layout];
Concatenation operator: dest = src1 # src2;
Table A.1: Griffy operators
Griffy-C code can be considered as a special function mapped onto a
reconfigurable device, for which a bit-stream takes the place of the
object code. It needs a special declaration, obtained with the #pragma
directive and the “picoga” keyword. In the following, PiCoGA is considered
as the target example, although the same syntax could be re-used for a
different platform.
#pragma picoga name n_outs n_ins <outs> <ins>
{
[declaration of variables]
[declaration of attributes]
[PiCoGA-function body]
}
#pragma end
#pragma picoga syntax description:
• name is the name associated with the pga-op in the code, and it can be
used to call it in the ANSI C source code.
• n_outs is the number of outputs (no more than two) and n_ins is the
number of inputs (no more than four).
• <outs> and <ins> are, respectively, the names of the output and input
variables (separated by blanks). I/O transfers from/to the reconfigurable
device are done using 32-bit unsigned int variables.
• The pga-op declaration is closed by #pragma followed by the “end”
keyword.
• Pga-ops can be called in the ANSI C source code similarly to procedure
calls, as in the following example:
name (<outs>,<ins>);
As in ANSI C syntax, variables are separated using commas (“,”).
• Variable declarations and the function body must be inserted between
curly braces (“{” and “}”, as in the previous example): the function body
cannot have more than one basic block (DFG-based description), thus no
other curly braces can be used.
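The declaration rules above can be mimicked in plain C to prototype a pga-op before mapping it. The sketch below is only a software stand-in: `sum_diff`, its body and the macro used to pass the outputs are hypothetical, and the way the real tool-chain binds the outputs of a call is not shown here. It respects the interface limits stated above (at most two outputs and four inputs, all 32-bit unsigned).

```c
#include <stdint.h>

/*
 * Software stand-in for a hypothetical pga-op that would be declared as
 *   #pragma picoga sum_diff 2 2 s d a b
 * i.e. two outputs (s, d) and two inputs (a, b), all 32-bit unsigned.
 */
static void sum_diff_model(uint32_t *s, uint32_t *d, uint32_t a, uint32_t b)
{
    *s = a + b;  /* first output */
    *d = a - b;  /* second output */
}

/* Lets the caller write the call in the name(<outs>,<ins>) form. */
#define sum_diff(s, d, a, b) sum_diff_model(&(s), &(d), (a), (b))
```

With this model a call reads `sum_diff(s, d, x, y);`, with the outputs listed before the inputs as in the call syntax described above.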
Variables Declaration
Variable types supported by Griffy-C syntax are the standard:
• 8-bit signed/unsigned char
• 16-bit signed/unsigned short int
• 32-bit signed/unsigned int
No arrays or pointers are supported. The static type is also defined, in
order to assign a specific variable to PiCoGA registers. Similarly to ANSI
C, static variables are allocated at compile time and exist throughout
program execution (they are permanent), but their scope is the block in
which they are defined (restricted visibility).
Attributes Declaration
Using the #pragma directive with the attrib keyword it is possible to
associate some additional attributes with standard variables. For example,
it is possible to define the width of all variables at bit level. The
syntax used to define additional attributes of a variable is the
following:
#pragma attrib Var1,...,VarN attribs
Attributes defined in the Griffy-C environment are:
• SIZE=nbits: it sets the width of the variable, at bit level, to nbits.
The resized width must be less than or equal to the original size: in
other words, resizing cannot increase the original width of the variable.
• SAT: it allows overflow/carryout information to be extracted from
additions or subtractions whose destination variable has the SAT attribute
set. Carryout and overflow are defined as additional flags of the
destination variable, if and only if the operation is + or -. It is
possible to use these flags to realize “saturation” arithmetic. Only
variables that are the destination of additions or subtractions can be
used with these flags. Within the Griffy function body, the flags are
accessed through special variables (implicitly defined by the SAT
attribute):
var_name(carryout)
var_name(overflow)
where var_name is the name of the variable declared SAT.
• PIPEREG: this flag can be used to define an explicit pipeline stage
without the variable being seen as a static element from the software
environment.
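The way the two SAT flags lead to saturation arithmetic can be emulated in plain C. The sketch below is an assumption-laden software model, not Griffy-C: the function names and the 16-bit width are chosen here for illustration, and the flags are recomputed explicitly instead of being read from the destination variable. In both cases the flag simply drives a conditional assignment, i.e. a multiplexer.

```c
#include <stdint.h>

/* Signed 16-bit saturating addition driven by an emulated overflow flag. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a + (int32_t)b;  /* exact result */
    uint16_t raw = (uint16_t)wide;           /* what a 16-bit adder outputs */
    int sign_a = a < 0;
    int sign_b = b < 0;
    int sign_r = (raw & 0x8000u) != 0;
    /* Overflow: the operands share a sign that differs from the result's. */
    int overflow = (sign_a == sign_b) && (sign_r != sign_a);
    int16_t limit = sign_a ? INT16_MIN : INT16_MAX;
    return overflow ? limit : (int16_t)wide;
}

/* Unsigned 16-bit saturating addition driven by an emulated carryout flag. */
uint16_t sat_addu16(uint16_t a, uint16_t b)
{
    uint32_t wide = (uint32_t)a + b;
    uint32_t carryout = wide >> 16;          /* bit 16 of the exact sum */
    return carryout ? UINT16_MAX : (uint16_t)wide;
}
```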
Griffy-function Body
The function body is described using a restricted subset of ANSI C syntax
with some additional operators, used to express the grouping (or
concatenation) of signals using routing-only resources and to implement
truth tables using the internal RLC resources (mainly LUTs and
multiplexers).
DFG-based description
Only DFG (Data Flow Graph) descriptions are supported: no control
statements (e.g. if or loops) are defined in the Griffy syntax. The only
exception is the conditional assignment operator, which is used to
implement a hardwired multiplexer. Thus, each node of a DFG can be
described using a single assignment operation at Griffy-C level. When the
DFG is not a DAG (Directed Acyclic Graph) and thus has one or more
feedbacks, static variables must be used because, in the “software”
C-compliant description of the feedback, each variable is read before it
is written. When a feedback occurs, each trigger of the pga-op sets the
value for the following trigger.
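The read-before-write behaviour of a feedback can be reproduced in ordinary C with a static local variable. The sketch below is a hypothetical accumulate operation, not code from the thesis; the zero initial value of the register is an assumption, and the enclosing function with its return statement is plain-C scaffolding.

```c
#include <stdint.h>

/* One trigger of a hypothetical accumulate operation with feedback. */
uint32_t accumulate(uint32_t x)
{
    static uint32_t acc = 0;  /* static register: survives across calls */
    uint32_t next;

    next = acc + x;  /* acc is read first: value left by the previous trigger */
    acc  = next;     /* then written: value seen by the following trigger */

    return next;
}
```

Three successive calls with 3, 4 and 10 return 3, 7 and 17: each trigger consumes the state left by the previous one.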
Single Assignment Form
Each variable must be assigned once and only once in the function body.
This hypothesis is made in order to avoid ambiguity and to simplify the
hardware translation of the DFG. In fact, under the single assignment
assumption, each DFG node is unambiguously defined by its destination
variable. Dependences among nodes and instruction-level parallelism are
explicitly defined by the data-dependency graph. The single assignment
hypothesis amounts to assuming that each variable is used as an
instruction destination only once in the DFG description. Variables can be
seen as the labels of the DFG edges.
Manual Dismantling
Manual dismantling is the third assumption of Griffy-C. The DFG
description can be seen as an assembly language used to configure
reconfigurable devices. Dismantling of complex instructions is a
non-trivial task that determines the instruction-level parallelism (ILP)
exposed in the Griffy-C description and effectively usable in the
configuration flow (performed by Griffy). For example, in order to
dismantle the following complex expression:
y = x1 + x2 + x3 + x4;
it is possible to use two main policies. The first one is a traditional
sequential dismantling:
y1 = x1 + x2;
y2 = y1 + x3;
y = y2 + x4;
[DFG diagram: a chain of three adders, x1 + x2 = y1, y1 + x3 = y2, y2 + x4 = y]
which yields a latency of 3 clock cycles and does not exploit any
instruction-level parallelism. A more efficient approach is balanced-tree
dismantling, which exploits the highest degree of instruction-level
parallelism. As shown in the following example, using a balanced-tree
dismantling strategy only 2 cycles of latency are required:
y1 = x1 + x2;
y2 = x3 + x4;
y = y1 + y2;
[DFG diagram: two parallel adders, x1 + x2 = y1 and x3 + x4 = y2, feeding a third adder y1 + y2 = y]
Dismantling has an important impact on both the latency and the issue
delay of pga-ops.
Operator width is obtained involving the width of both input and output
operands and type of operator. For example, adder width is set by the
destination width, whereas the size of comparison is given by the greater
input operand. Inputs truncation or extension are done to correctly resize
the variables, padding with zeros or propagating the sign bit.
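The three resizing cases (truncation, zero padding and sign propagation) can be written out explicitly in C. This is a minimal sketch with hypothetical helper names; widths are assumed to lie between 1 and 32 bits, and the final conversion to a signed type relies on the usual two's-complement behaviour of common compilers.

```c
#include <stdint.h>

/* Truncation: keep only the low `bits` bits of x (1 <= bits <= 32). */
uint32_t truncate_bits(uint32_t x, unsigned bits)
{
    return (bits >= 32u) ? x : (x & ((1u << bits) - 1u));
}

/* Zero extension of an unsigned `bits`-wide value: upper bits padded with zeros. */
uint32_t zero_extend(uint32_t x, unsigned bits)
{
    return truncate_bits(x, bits);
}

/* Sign extension of a signed `bits`-wide value: the sign bit is propagated upwards. */
int32_t sign_extend(uint32_t x, unsigned bits)
{
    uint32_t v = truncate_bits(x, bits);
    uint32_t sign = 1u << (bits - 1u);    /* position of the sign bit */
    return (int32_t)((v ^ sign) - sign);  /* flips then subtracts the sign weight */
}
```

For a 4-bit variable holding 0xF, zero extension yields 15 while sign extension yields -1, which is why the signedness of the declared type matters when operands are widened.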
A.1.1 Standard Operators
In this section the standard operators defined in Griffy-C syntax are
summarized, focusing on the differences (if any) with ANSI C. Information
about the routing-only optimization is also provided. Descriptions refer
to a local destination variable: if destinations are static or are given
the PIPEREG attribute, the routing-only optimization is not applied.
Exceptions to this rule are explicitly reported in the descriptions.
A.1.2 Arithmetical Operators
Addition
Syntax:
• dest = src1 + src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib.
• dest = src1 + const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const + src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 + const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction type is
resolved by constant folding and propagation: the implementation does not
need RLCs.
Subtraction
Syntax:
• dest = src1 - src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib.
• dest = src1 - const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const - src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 - const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction type is
resolved by constant folding and propagation: the implementation does not
need RLCs.
A.1.3 Bitwise Logical Operators
Bitwise And
Syntax:
• dest = src1 & src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib.
• dest = src1 & const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction masks some bits of
the variable src1 and is implemented using routing-only resources: some
bits are propagated and the others are set to 0.
• dest = const & src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction masks some bits of
the variable src2 and is implemented using routing-only resources: some
bits are propagated and the others are set to 0.
• dest = const1 & const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction type is
resolved by constant folding and propagation: the implementation does not
need RLCs.
Bitwise Or
Syntax:
• dest = src1 | src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib.
• dest = src1 | const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction is implemented using
routing-only resources because it masks src1: some bits are propagated to
the destination and the others are set to 1 by constant propagation.
• dest = const | src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction is implemented using
routing-only resources because it masks src2: some bits are propagated to
the destination and the others are set to 1 by constant propagation.
• dest = const1 | const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction type is
resolved by constant folding and propagation: the implementation does not
need RLCs.
Bitwise Not
Syntax:
• dest = ~src2;
src2 is a variable; this instruction can be implemented by manipulating
the destination RLCs, without any area occupancy.
• dest = ~const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction type is realized
by constant folding and propagation: the implementation does not need
RLCs.
Bitwise Xor
Syntax:
• dest = src1 ^ src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib.
• dest = src1 ^ const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction is implemented
using routing-only resources because it masks src1: some bits
are propagated to the destination unchanged and the others are propagated
in active-low (inverted) form.
• dest = const ^ src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. This instruction is implemented
using routing-only resources because it masks src2: some bits
are propagated to the destination unchanged and the others are propagated
in active-low (inverted) form.
• dest = const1 ^ const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
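The routing-only forms above (variable combined with a constant) can be pictured bit by bit: each destination bit is either wired to the matching source bit, tied to a constant, or, for XOR, taken in inverted form. The C sketch below models that classification and checks it against ordinary C semantics. The names bit_route, route_bit and apply_routing are illustrative only and are not part of Griffy-C or its tool flow.

```c
#include <assert.h>
#include <stdint.h>

/* How one destination bit is produced when one operand is a constant. */
typedef enum { TIE_0, TIE_1, WIRE, WIRE_INVERTED } bit_route;

/* Classify destination bit `pos` of `src OP konst`; op is '&', '|' or '^'. */
bit_route route_bit(char op, uint32_t konst, unsigned pos)
{
    assert(pos < 32);
    unsigned k = (konst >> pos) & 1u;
    switch (op) {
    case '&': return k ? WIRE : TIE_0;          /* 1 keeps the bit, 0 clears it  */
    case '|': return k ? TIE_1 : WIRE;          /* 1 forces the bit, 0 keeps it  */
    case '^': return k ? WIRE_INVERTED : WIRE;  /* 1 inverts the bit, 0 keeps it */
    default:  assert(!"unsupported operator"); return WIRE;
    }
}

/* Rebuild the 32-bit result from the per-bit routing decisions. */
uint32_t apply_routing(char op, uint32_t src, uint32_t konst)
{
    uint32_t dest = 0;
    for (unsigned pos = 0; pos < 32; ++pos) {
        unsigned s = (src >> pos) & 1u;
        unsigned d = 0;
        switch (route_bit(op, konst, pos)) {
        case TIE_0:         d = 0;      break;
        case TIE_1:         d = 1;      break;
        case WIRE:          d = s;      break;
        case WIRE_INVERTED: d = s ^ 1u; break;
        }
        dest |= (uint32_t)d << pos;
    }
    return dest;
}
```

No logic function is evaluated per bit in this model, which is the sense in which these forms need only routing resources.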
A.1.4 Direct Assignment
Syntax:
• dest = src1;
This instruction can be used as an explicit cast to resize src1
to the width of dest: src1 is either extended, using zeros (if unsigned) or
the most significant bit (if signed), or reduced by taking a slice of src1
(the number of bits is defined by the size of dest).
• dest = const;
This instruction is implemented using constant folding and propagation.
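The resizing rule can be sketched as a small C model. The function name resize and its parameters are hypothetical, and the sketch assumes that a reduction keeps the least significant bits of src1, which matches the unpacking example given for the shift operators.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `dest = src1;` when the two widths differ.
 * `value` holds src1 in its low `src_width` bits; widths are 1..32. */
uint32_t resize(uint32_t value, unsigned src_width, unsigned dest_width, int src_signed)
{
    assert(src_width >= 1 && src_width <= 32);
    assert(dest_width >= 1 && dest_width <= 32);

    uint32_t src_mask  = (src_width  == 32) ? UINT32_MAX : ((1u << src_width)  - 1u);
    uint32_t dest_mask = (dest_width == 32) ? UINT32_MAX : ((1u << dest_width) - 1u);

    value &= src_mask;
    if (dest_width > src_width && src_signed && ((value >> (src_width - 1)) & 1u)) {
        value |= ~src_mask;        /* signed: replicate the most significant bit */
    }
    /* Unsigned extension needs no action: the upper bits are already zero. */
    return value & dest_mask;      /* ASSUMPTION: reduction keeps the low slice */
}
```

For instance, resizing the 8-bit value 0x80 to 16 bits gives 0xFF80 when src1 is signed and 0x0080 when it is unsigned.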
A.1.5 Shift Operators
Syntax:
• dest = src1 << const;
src1 is a variable; this instruction can be implemented using routing-only
resources and does not need RLCs. Zeros are inserted as least
significant bits.
• dest = const << const;
This instruction can be implemented by constant folding and propagation.
• dest = src1 >> const;
src1 is a variable; this instruction can be implemented using routing-only
resources. When src1 is unsigned, zeros are inserted as most
significant bits; when src1 is signed, the most significant bit is extended.
• dest = const >> const;
This instruction can be implemented by constant folding and propagation.
Note:
The destination variable can be used to select some bits of the source in order
to realize an unpacking. If the destination width is less than the source width,
a slice of the source is taken: its least significant bit is set by the shift step and
its most significant bit is bit "dest size + shift step - 1". For example:
unsigned int original_word;
unsigned char byte1, byte2, byte3, byte4;
.......
byte1 = original_word;
byte2 = original_word >> 8;
byte3 = original_word >> 16;
byte4 = original_word >> 24;
.......
can be used to unpack a word into four byte variables. This procedure must
be used to unpack variables transferred from the register file to the PiCoGA.
A.1.6 Comparison Operators
Comparison operators must have a destination variable whose SIZE attribute
is defined as 1 bit: the output value is 1 when the comparison returns TRUE and 0
when the comparison returns FALSE.
Equal
Syntax:
• dest = src1 == src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 == const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const == src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 == const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
Not Equal
Syntax:
• dest = src1 != src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 != const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const != src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 != const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
Less Than
Syntax:
• dest = src1 < src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 < const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. When src1 is signed and
const == 0, the instruction is implemented by extracting the sign bit of src1,
without using RLCs.
• dest = const < src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 < const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
Less Than or Equal
Syntax:
• dest = src1 <= src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 <= const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const <= src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. When src2 is signed and
const == 0, the instruction is implemented by extracting the sign bit of src2 and
propagating it in active-low form, without using dedicated RLCs.
• dest = const1 <= const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
Greater Than
Syntax:
• dest = src1 > src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 > const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const > src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. When src2 is signed and
const == 0, the instruction is implemented by extracting the sign bit of src2 and
propagating it, without using dedicated RLCs.
• dest = const1 > const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
Greater Than or Equal
Syntax:
• dest = src1 >= src2;
dest, src1 and src2 are variables; their width is defined by the integer
type or by #pragma attrib. SIZE of dest must be 1.
• dest = src1 >= const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted. When src1 is signed and
const == 0, the instruction is implemented by extracting the sign bit of src1 and
propagating it in active-low form, without using dedicated RLCs.
• dest = const >= src2;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = const1 >= const2;
const1 and const2 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted. This instruction
type is realized by constant folding and propagation: the implementation
does not need RLCs.
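The special cases against zero reduce to reading the sign bit of a two's-complement operand, either directly or in inverted (active-low) form. A minimal C sketch of that equivalence follows. The helper names are hypothetical, and the operand is assumed to sit in the low `width` bits of a 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Sign bit of a `width`-bit two's-complement value held in the low bits of `v`. */
unsigned sign_bit(uint32_t v, unsigned width)
{
    assert(width >= 1 && width <= 32);
    return (v >> (width - 1)) & 1u;
}

/* `src < 0` and `0 > src`: the sign bit itself. */
unsigned less_than_zero(uint32_t src, unsigned width)
{
    return sign_bit(src, width);
}

/* `src >= 0` and `0 <= src`: the sign bit in active-low (inverted) form. */
unsigned greater_equal_zero(uint32_t src, unsigned width)
{
    return sign_bit(src, width) ^ 1u;
}
```

Both results are a single wire taken from the operand, possibly inverted, so no comparator logic is involved.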
A.1.7 Conditional Assignment
Conditional assignment is implemented as a multiplexing DFG node and
cannot be used to explicitly perform control flow. This is the only
three-input operator defined in the Griffy-C syntax. The variable used as
conditional flag must have SIZE equal to 1.
Syntax:
• dest = src1 ? src2 : src3;
dest, src1, src2 and src3 are variables; their width is defined by the
integer type or by #pragma attrib. SIZE of src1 must be 1.
• dest = src1 ? src2 : const;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = src1 ? const : src3;
const is an unsigned integer; both decimal (e.g. 120) and hexadecimal
(e.g. 0xfA00) formats are accepted.
• dest = src1 ? const2 : const3;
const2 and const3 are unsigned integers; both decimal (e.g. 120) and
hexadecimal (e.g. 0xfA00) formats are accepted.
• dest = const1 ? src2 : src3;
const1 is an unsigned integer with SIZE = 1: the implementation requires
routing-only resources because the conditional assignment can be
resolved at compile time; the selected variable is propagated to the
destination without dedicated RLCs.
• dest = const1 ? const2 : src3;
const1 is an unsigned integer with SIZE = 1: the implementation requires
routing-only resources because the conditional assignment can be
resolved at compile time; src3 is propagated to the destination using
routing-only resources if const1 is equal to 0. If instead const1 is true,
const2 is folded and propagated wherever dest is used.
• dest = const1 ? src2 : const3;
const1 is an unsigned integer with SIZE = 1: the implementation requires
routing-only resources because the conditional assignment can be
resolved at compile time; src2 is propagated to the destination using
routing-only resources if const1 is equal to 1. If instead const1 is false,
const3 is folded and propagated wherever dest is used.
• dest = const1 ? const2 : const3;
In this case dest is a constant calculated at compile time: it is therefore
folded and propagated, like other constants, wherever dest is
used.
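To illustrate what a multiplexing node means as opposed to control flow, the following C sketch computes the conditional assignment as a pure data selection. Both inputs are always available and the 1-bit flag only chooses which one reaches the output. The function name mux is illustrative, and the mask-based formulation is just one way to express the selection in software.

```c
#include <assert.h>
#include <stdint.h>

/* dest = flag ? a : b, computed without branching, as a multiplexer would. */
uint32_t mux(unsigned flag, uint32_t a, uint32_t b)
{
    assert(flag <= 1u);                     /* the flag has SIZE 1 */
    uint32_t select = (uint32_t)0 - flag;   /* 0x00000000 or 0xFFFFFFFF */
    return (a & select) | (b & ~select);
}
```

When the flag is a constant, select is known in advance and one of the two terms disappears, which corresponds to the compile-time resolution described for the const1 forms.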
A.1.8 Advanced Operators
This section presents two additional operators. They extend the ANSI C
syntax in order to exploit packing (concatenation) of multiple variables or
to define truth tables implemented using RLC-only resources.
A.1.9 Concatenate operator (#)
The concatenate operator can be used to pack two variables into a single
one. If the destination width is less than the sum of the widths of the sources,
a truncated packing is performed.
Syntax:
• dest = src1 # src2;
dest, src1 and src2 are variables. The bit correspondence is shown in Fig. A.2.
Constants cannot be used directly with the concatenate operator. To
concatenate a variable with a constant, either use a shift and a bitwise or, or,
better, use a direct assignment to fix the size of the constant and
then concatenate the two variables.
[Figure: SRC1 and SRC2 packed side by side into DEST; a second panel shows the same packing into a DEST with truncation. Bit-index labels: N, 0, 2N.]
Figure A.2: Concatenate operator
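The packing can be modelled in C with the shift-and-or formulation that the text itself mentions. This is only a sketch under two assumptions that the extracted figure does not settle: src1 is placed in the upper bits and src2 in the lower bits, and truncation drops the most significant bits of the packed value. The function name concat and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `dest = src1 # src2;` for sources of w1 and w2 bits (1..32 each).
 * ASSUMPTION: src1 occupies the upper bits, src2 the lower bits. */
uint64_t concat(uint32_t src1, unsigned w1, uint32_t src2, unsigned w2, unsigned dest_width)
{
    assert(w1 >= 1 && w1 <= 32 && w2 >= 1 && w2 <= 32);
    assert(dest_width >= 1 && dest_width <= 64);

    uint64_t m1 = (w1 == 32) ? 0xFFFFFFFFu : ((1u << w1) - 1u);
    uint64_t m2 = (w2 == 32) ? 0xFFFFFFFFu : ((1u << w2) - 1u);

    uint64_t packed = ((src1 & m1) << w2) | (src2 & m2);

    uint64_t dest_mask = (dest_width == 64) ? UINT64_MAX
                                            : ((UINT64_C(1) << dest_width) - 1u);
    return packed & dest_mask;   /* ASSUMPTION: truncation drops the upper bits */
}
```

Under these assumptions, packing the bytes 0xAB and 0xCD into a 16-bit destination gives 0xABCD, while a 12-bit destination keeps only 0xBCD.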
A.1.10 LUT operator (@)
The LUT operator (@) is defined in order to implement certain types of truth
tables using the resources inside a single RLC. The type of truth table is defined
by the destination and source widths.
Syntax:
• dest = src1 @ 0x[LUT LAYOUT];
The destination and source widths define how the LUT layout
(hexadecimal) string is read.
The supported types of truth table are summarized in Tab. A.2: 2x4:1 and
2x4:2 LUTs are pairs of independent LUTs, respectively 4:1 or 4:2, inside
the same RLC.
SIZE attrib          LUT type
Src      Dest
4        1           4:1
5        1           5:1
6        1           6:1
4        2           4:2
5        2           5:2
4        4           4:4
8        2           2 x 4:1
8        4           2 x 4:2
Table A.2: Supported LUT types
LUT Layout Rules
The LUT layout is specified by encoding the output(s) of the truth table in a
hexadecimal string, using the layout rules defined for each LUT type. The
hexadecimal string is left-extended with zeros to reach the required width.
In the layouts below, a denotes the output for SRC = 0...0 and z the output for
SRC = 1...1; a numeric suffix selects the DEST bit.
• 4:1 - 5:1 - 6:1
SRC      DEST
0...0    a
...      ...
1...1    z
dest = src @ 0x[a...z];
Example: and6i = src @ 0x1;
• 4:2 - 5:2
SRC      DEST1    DEST0
0...0    a1       a0
...      ...      ...
1...1    z1       z0
dest = src @ 0x[a1...z1][a0...z0];
Example: and4_2 = src @ 0x00010001;
or: and4_2 = src @ 0x10001;
• 4:4
SRC      DEST3    DEST2    DEST1    DEST0
0...0    a3       a2       a1       a0
...      ...      ...      ...      ...
1...1    z3       z2       z1       z0
dest = src @ 0x[a3...z3][a2...z2][a1...z1][a0...z0];
• 2x4:1
SRC[7:4]    DEST1    SRC[3:0]    DEST0
0...0       a1       0...0       a0
...         ...      ...         ...
1...1       z1       1...1       z0
dest = src @ 0x[a1...z1][a0...z0];
• 2x4:2
SRC[7:4]    DEST3    DEST2    SRC[3:0]    DEST1    DEST0
0...0       a3       a2       0...0       a1       a0
...         ...      ...      ...         ...      ...
1...1       z3       z2       1...1       z1       z0
dest = src @ 0x[a3...z3][a2...z2][a1...z1][a0...z0];
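The way a layout string maps to outputs can be checked with a small software model. The sketch below rests on an assumption inferred from the examples `and6i = src @ 0x1` and `and4_2 = src @ 0x00010001`: the string is read from the SRC = 0...0 row (leftmost, most significant bit) down to the SRC = 1...1 row (rightmost, least significant bit), so that a layout of 0x1 yields 1 only when all inputs are 1. The function names are hypothetical and not part of the Griffy tools.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a single-output n:1 LUT (n = 4, 5 or 6) described by a layout word.
 * ASSUMPTION: the most significant layout bit is the SRC = 0...0 row and the
 * least significant bit is the SRC = 1...1 row. */
unsigned lut_eval(uint64_t layout, unsigned n_inputs, unsigned src)
{
    assert(n_inputs >= 4 && n_inputs <= 6);
    unsigned rows = 1u << n_inputs;          /* 16, 32 or 64 truth-table rows */
    assert(src < rows);
    return (unsigned)((layout >> (rows - 1u - src)) & 1u);
}

/* 4:2 LUT: two 16-bit fields, [DEST1 rows][DEST0 rows].
 * Returns the 2-bit result (DEST1 << 1) | DEST0. */
unsigned lut_eval_4_2(uint32_t layout, unsigned src)
{
    unsigned dest1 = lut_eval(layout >> 16, 4, src);
    unsigned dest0 = lut_eval(layout & 0xFFFFu, 4, src);
    return (dest1 << 1) | dest0;
}
```

Because the layout is held in an integer, left extension with zeros is implicit: 0x10001 and 0x00010001 describe the same 4:2 table.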
A.1.11 Built-in functions as hard macros
Reconfigurable devices whose basic logic block features advanced operators
can benefit from the direct instantiation of these functionalities. The case
is analogous to the built-in functions provided for DSP processors in order
to explicitly instantiate specific assembly instructions, such as a sum of absolute
differences or a saturating sum. The syntax adopted is the same as that of
standard function calls, although the number of inputs depends on the specific
functionality.
Syntax:
• dest = my_hard_macro(src1, ..., srcN);
Example: PiCoGA-III specific built-in functions
PiCoGA-III features hybrid reconfigurable logic cells that allow
straightforward implementations of non-standard operations by means of
dedicated logic. As an example, the simple adder is implemented using the
ALU block, but more complex operations can be provided by coupling ALU
features with the surrounding control logic. To give an idea of these
capabilities, the following items show a basic set of built-in functions already
implemented in the PiCoGA-III specific Griffy flow.
GFMult : implements the multiplication on a binary Galois Field with
a fixed irreducible polynomial
[field size, polynomial and expression not recoverable from the source]
multblock : implements a multiplier chunk as shown in Fig. A.3
[expression not recoverable from the source]
[Figure: multblock combining the partial products a0b3, a0b4, a0b5, a1b2, a1b3, a1b4, a2b1, a2b2, a2b3, a3b0, a3b1 and a3b2]
Figure A.3: Multiplier chunk
CondSum : implements a conditional sum. Depending on a condition
flag, it performs a sum or a subtraction between two operands
[expression not recoverable from the source]
Accumulator : implements a routing-efficient accumulator in which the
feedback path is internal to the reconfigurable logic cell
[expression not recoverable from the source]
CondAccumulator : implements a routing-efficient conditional accumulator
which accumulates or resets to a constant depending on the condition
[expression not recoverable from the source]
Xor10bit : implements a single-cell xor among 10 bits
SuperMux : implementsa4-input1-output multiplexer, basedon 2-bitwise
chunk implemented on a single cell.216 Griffy-C syntaxBibliography
[1] J. Rabaey Reconﬁgurable Computing: The solution to Low Power Program-
mable DSP, Proceedings of the 1997 IEEE International Conference on
Acoustics, Speech and Signal Processing, April 1997.
[2] W.H. Mangione-Smith, B. Hutchings, D. Andrews, A. DeHon, C. Ebel-
ing, R. Hartenstein, O. Mencer, J. Morris, K. Palem, V.K. Prasanna,
H.A.E. Spaanenburg Seeking solutions in conﬁgurable computing, IEEE
Computer, Dec 1997.
[3] J. Goodacre, A.N. Sloss Parallelism and the ARM Instruction Set Architec-
ture, IEEE Computer, May 2005.
[4] S. Leibson, J. Kim Conﬁgurable Processors: a new era in chip design, IEEE
Computer, May 2005.
[5] V.Kathail, S.Aditya, R.Schreiber, B.R.Rau, D.C.Cronquist,
M.Sivaraman PICO: Automatically Designing Custom Computers,
IEEE Computer, Feb 2002.
[6] N.T. Clark, H. Zhong, S.A. Mahlke Automated Custom Instruction Gen-
eration for Domain-Speciﬁc Processor Acceleration, IEEE Transactions on
Computers, Vol. 54, No. 10, October 2005.
[7] A. DeHon The density advantage of conﬁgurable computing, IEEE Com-
puter, April 2000.
[8] K. Bondalapati, V.K. Prasanna Reconﬁgurable Computing Systems,P r o -
ceedings of the IEEE, Vol. 90, No. 7, April 2002.
217218 BIBLIOGRAPHY
[9] L. B. Baumstark, L. M. Wills Retargeting Sequential Image-ProcessingPro-
grams for Data Parallel Execution, IEEE Transactions on Software Engi-
neering, vol. 31, no. 2, February 2005.
[10] F. Barat, R. Lauwereins, G. Deconinck Reconﬁgurable instruction set
processors from a hardware/software perspective, IEEE Transactions on
Software Engineering, Volume 28, Issue 9, Sept. 2002.
[11] A. DeHon, J. Wawrzynek Reconﬁgurable Computing: What, Why and
Implications for Design Automation, Proceeding on DAC, 1999.
[12] S. Vassiliadis, S. Wong, S. Gaydadjiev, K. Bertels, G. Kuzmanov,
E.M. Panainte The MOLEN PolymorphicProcessor, IEEE Transactions on
Computers, Nov. 2004.
[13] B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins DRESC: A
Retargetable Compiler for Coarse-Grained Reconﬁgurable Architecture, Int l
Conference on Field Programmable Technology, Dec. 2002.
[14] Trimaran consortium. The Trimaran Compiler Infrastructure,
http://www.trimaran.org.
[15] C. Mucci, F. Campi, A. Deledda, A. Fazzi, M. Ferri, M. Bocchi A cycle-
accurate ISS for a dynamically reconﬁgurable processor architecture,
IEEE Reconﬁgurable Architecture Workshop (RAW), Apr. 2005
[16] P.M. Athanas, H.F. Silverman Processor Reconﬁguration Through
Instruction-Set Metamorphosis, IEEE Computer, March 1993.
[17] R. Razdan; M.D. Smith A High-Performance Microarchitecture with
Hardware-Programmable Functional Units, Proceedings of IEEE MICRO,
Nov. 1994.
[18] R.D. Wittig; P. Chow OneChip: An FPGA Processor With Reconﬁgurable
Logic, IEEE Symposium on FPGA for Custom Computing Machine,
1996.BIBLIOGRAPHY 219
[19] T.J. Callahan, J.R. Hauser, J. Wawrzynek The Garp architecture and C
compiler, IEEE Computer, April 2000.
[20] J-Y. Mignolet, V. Nollet, P. Coene, D.Verkest, S. Vernalde, R. Lauw-
ereins Infrastructure for Design and Management of Relocatable Tasks in a
Heterogeneous Reconﬁgurable System-on-Chip, Proceedings of the DATE
2003.
[21] P.P. Chang, S.A. Mahlke, W.Y. Chen, N.J. Water, W.W. Hwu IMPACT:
An Architectural Framework for Multiple-Instruction-Issue Processors,P r o -
ceedings of the 18th Annual Int’l Symposium on Computer Architec-
ture, Toronto, Canada, May 28, 1991, pp. 266-275
[22] SUIF Compiler System [online] http://suif.standford.edu
[23] J.M. Arnold S5: the architecture and development ﬂow of a software con-
ﬁgurable processor, Proceedings on IEEE International Conference on
Field-Programmable Technology, 11-14 Dec. 2005, pp. 121-128
[24] Sato, T.; Watanabe, H.; Shiba, K.; Implementation of dynamically recon-
ﬁgurable processor DAPDNA-2, IEEE VLSI-TSA International Sympo-
sium VLSI Design, Automation and Test, 27-29 April 2005, pp. 323-324
[25] R. Baines and D. Pulley A Total Cost Approach to Evaluating Different
Reconﬁgurable Architectures for Baseband Processing in Wireless Receivers,
IEEE Communication Magazine, Jan. 2003.
[26] Elixent http://www.elixent.com
[27] A. Marshall, T. Stansﬁeld, I. Kostarnov, J. Vuillemin, B. Hutchings A
reconﬁgurable arithmetic array for multimedia applications Proceedings of
the ACM/SIGDA International Symposium on Field programmable
gate arrays (FPGA), 1999.
[28] Hennessy, Patterson Computing architecture: a quantitative approach,
Morgan Kaufmann220 BIBLIOGRAPHY
[29] A. DeHon, J. Adams, M. DeLorimier, N. Kapre, Y. Matsuda,
H. Naeimi, M. Vanier, M. Wrighton Design Patterns for Reconﬁgurable
Computing, IEEE Symposium on FCCM, April 2004.
[30] R. Hartenstein A Decade of Reconﬁgurable Computing: a Visionary Ret-
rospective, DATE 2001.
[31] F. Balarin, E. Sentovich, M. Chiodo, P. Giusto, H. Hsieh, B. Tabbara,
A. Jurecska, L. Lavagno, C. Passerone, K. Suzuki, and A. Sangiovanni-
Vincentelli. Hardware-Software Co-Design of Embedded Systems: The Polis
Approach, Kluwer Academic Publishers, 1997.
[32] L.M. Reynari, F. Cucinotta, A. Serra, L. Lavagno A hardware/software
co-design ﬂow and IP library based of Simulink
/
,, Proceedings on DAC,
Jun. 2001.
[33] A. Baghdadi, N.-E. Zergainoh, W. O. Cesario, A.A. Jerraya, Combining
a Performance Estimation Methodology with a Hardware/Software Codesign
Flow Supporting Multiprocessor Systems, IEEE Transactions on Software
Engineering, vol. 28, no. 9, September 2002.
[34] http://www.systemc.org/
[35] A. Gerstlauer, R. Dmer, P.Junyu, D.D. Gajski, System Design: A Practi-
cal Guide with SpecC, Kluwer Academic Publishers, 2001.
[36] Sullivan, C.; Wilson, A.; Chappell, S.; Using C based logic synthesis to
bridge the productivity gap, Proceedings of the Asia and South Paciﬁc
Design Automation Conference (ASP-DAC), 27-30 Jan. 2004, pp. 349-
354
[37] J. Frigo, M. Gokhale, D. Lavenier Evaluation of the Streams-C C-to-
FPGA Compiler: An Applications Perspective, Proceeding on FPGA 2001.
[38] S. Gupta, N. Dutt, R. Gupta, A. Nicolau SPARK : A High-Level Synthe-
sis Framework For Applying Parallelizing Compiler Transformations, Inter-
national Conference on VLSI Design, January 2003.BIBLIOGRAPHY 221
[39] G. De Micheli Hardware synthesis from C/C++ models, Proceedings of
Design, Automation, and Test in Europe (DATE), Munich, Germany,
March 1999.
[40] W.A. Najjar, W. Bohm, B.A. Draper, J. Hammes, R. Rinker, J.R. Bev-
eridge, M. Chawathe, C. Ross High-Level Language Abstraction for Re-
conﬁgurable Computing, IEEE Computer, August 2003.
[41] S. Edwards The Challenges of Hardware Synthesis from C-like Languages,
Proceedings on IWLS 2004.
[42] K. Keutzer, S. Malik, A.R. Newton, J.M. Rabaey, A. Sangiovanni-
Vincentelli System-Level Design: Orthogonalization of Concerns and
Platform-Based Design, IEEE Transactions on CAD, Dec. 2000
[43] L. Semeria, K. Sato, G. De Micheli Synthesis of hardware models in C
withpointers and complexdata structures, IEEETransactions on VLSISys-
tems, Dec. 2001
[44] R. Camposano, W. Rosenstiel Synthesizing Circuits From Behavioral De-
scriptions, IEEE Transactions on Computer-Aided Design, February
1989.
[45] R. Camposano From Behavior to Structure: High-Level Synthesis, IEEE
Design & Test of Computers, October 1990.
[46] S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, R. Reed
Taylor PipeRench: A Reconﬁgurable Architecture and Compiler, IEEE
Computer, April 2000.
[47] M. Budiu, S.C. Goldstein Fast Compilation for Pipelined Reconﬁgurable
Fabrics, FPGA 1999.
[48] D.C. Cronquist, P.Franklin, S.G. Berg, C. Ebeling Specifying and Com-
piling Applications for RaPiD, Proceedings of IEEE Workshop on FPGAs
for Custom Computing Machines (FCCM), April 1998.222 BIBLIOGRAPHY
[49] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff,
Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jae-
Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan
Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe and
Anant Agarwal The Raw Microprocessor: A Computational Fabric for Soft-
ware Circuits and General Purpose Programs, IEEE Micro, Mar/Apr 2002.
[50] M.B. Gokhale, J.M. Stone NAPA C: Compiling for a Hybrid RISC/FPGA
Architecture, Proceedings of IEEE Workshop on FPGAs for Custom
Computing Machines (FCCM), April 1998.
[51] S. Talla Adaptative Explicit Parallel Instruction Computing, PhD Thesis,
Department of Computer Science, New York University, May 2001.
[52] V.S. Gheorghita, W.-F. Wong, T. Mitra, S. Talla A Co-simulation Study of
Adaptative EPIC Computing, Proceedings on IEEE Field Programmable
Technologies (FPT), 2002.
[53] M. Vorbach, J. Becker, Reconﬁgurable processor architectures for mobile
phones Proceedings on the Int’l Parallel and Distributed Processing
Symposium, 22-26 April 2003.
[54] Florian Stock, Andreas Koch Architecture Exploration and tools for pipe-
lined coarse grained reconﬁgurable arrays, Proceedings on the IEEE Int l
Conference on Field Programmable Logic and Application (FPL), Aug.
2006.
[55] H. Singh, M.-H. Lee, G. Lu, F.J. Kurdahi, N. Bagherzadeh, E.M.
Chaves Filho MorphoSys: An Integrated Reconﬁgurable System for Data-
Parallel and Computation-Intensive Applications, IEEE Transactions on
Computers, May 2000.
[56] C. Rowen, S. Leibson Engineering the Complex SOC: Fast, Flexible De-
sign with Conﬁgurable Processors, Prentice-Hall, 2004.
[57] C. Lee, M. Potkonjak, W.H. Mangione-Smith MediaBench: A Tool for
Evaluating and Synthesizing Multimedia and Communications Systems,BIBLIOGRAPHY 223
30th Annual International Symposium on Microarchitecture (Micro
’97), December 1997.
[58] R.Lysecky, F.VahidAConﬁgurableLogicArchitectureforDynamicHard-
ware/Software Partitioning, Proceedings on the Design Automation and
Test in Europe Conference (DATE), February 2004.
[59] G. Snider Performance-Constrained Pipelining of Software Loops onto Re-
conﬁgurable Hardware, Proceeding on FPGA 2002.
[60] M. Weinhardt and W. Luk Pipeline Vectorization, IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, Feb.
2001, pp. 234-248.
[61] V. Allan, R. Jones, R. Lee, S. Allan Software Pipelining, ACM Comput-
ing Surveys, Vol. 27, No. 3 September 1995.
[62] B. R. Rau. Iterative Modulo Scheduling. Technical Report HPL-94-115,
Hewlett Packard Company, November 1995.
[63] A.H. Veen Dataﬂow Machine Architecture, ACM Computing Surveys,
Vol. 18, No. 4, December 1986.
[64] G. Gao, Y. Wong, Q. Ning A Timed Petri-Net Model for Fine-Grain Loop
Scheduling, Proceedings of the ACM SIGPLAN ’91 Conference on Pro-
gramming Language Design and Implementation, June 1991.
[65] M.Mukund Petri Nets and Step Transition Systems. International Jour-
nal of Foundations of Computer Science 3, 443-478, World Scientiﬁc,
1992.
[66] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo, R. Guerrieri
A VLIW processor with reconﬁgurable instruction set for embedded applica-
tions, IEEE Journal of Solid-State Circuit, Nov. 2003.
[67] A. Lodi, A. Cappelli, M. Bocchi, C. Mucci, M. Innocenti, C. De Bar-
tolomeis, L. Ciccarelli, R. Giansante, A. Deledda, F. Campi, M. Toma224 BIBLIOGRAPHY
and R. Guerrieri, XiSystem: a XiRisc-based SoC with a Reconﬁgurable IO
module, IEEE Journal of Solid-State Circuit (JSSC), Jan. 2006
[68] M. Bocchi, C. De Bartolomeis, C. Mucci, F. Campi, A. Lodi, M. Toma,
R. Canegallo, R. Guerrieri A XiRisc-based SoC for Embedded DSP Appli-
cations, IEEE Custom Integrated Circuits Conferences (CICC’04), Oct.
2004
[69] A. Lodi, M. Toma, F. Campi A Pipelined Conﬁgurable Gate Array for
Embedded Processors, Proceeding on FPGA 2003.
[70] A. Cappelli, A. Lodi, C. Mucci, M. Toma, F. Campi A Dataﬂow Control
Unit for C-to-Conﬁgurable Pipelines Compilation Flow, IEEE Symposium
on FCCM, Apr. 2004.
[71] C. Mucci, C. Chiesa, A. Lodi, M. Toma, F. Campi A C-based Algorithm
Development Flow for a Reconﬁgurable Processor Architecture, IEEE Inter-
national Symposium on System on Chip, November 2003.
[72] C. Mucci, F. Campi, A. Deledda, A. Fazzi, M. Ferri, M. Bocchi A cycle-
accurate ISS for a dynamically reconﬁgurable processor architecture, IEEE
Reconﬁgurable Architecture Workshop (RAW), Apr. 2005.
[73] A. La Rosa, L. Lavagno, C. Passerone, Software Development for High-
Performance, Reconﬁgurable, Embedded Multimedia Systems, IEEE Design
and Test of Computers, vol. 22, no. 1, pp. 28-38, January/February,
2005.
[74] C. Mucci, M. Bocchi, P. Gagliardi, L. Ciccarelli, A. Lodi, M. Toma,
F. Campi A Case-Study on Multimedia Applications for the XiRisc Recon-
ﬁgurable Processor, Proceedings on IEEE Int’l Symposium on Circuits
and Systems (ISCAS), May 2006.
[75] F. Campi, A. Deledda, M. Pizzotti, L. Ciccarelli, C. Mucci, A. Lodi,
A.Vitkovski, L.Vanzolini, P.RolandiAdynamicallyadaptiveDSPforhet-
erogeneous reconﬁgurable platforms, Proceedings on IEEE/ACM DATE
2007.BIBLIOGRAPHY 225
[76] E. Sentovich et al. SIS: A System for Sequential Circuit Synthesis,
UCB/ERL M92/41, May 1992.
[77] D. Xu, X. He, Y.Deng, Compositional Schedulability Analysis of Real-Time
SystemsUsing TimePetriNets, IEEETransactions on Software Engineer-
ing, vol. 28, no. 10, October 2002.
[78] A.Koch ModuleCompaction in FPGA-basedRegular Datapaths, Proceed-
ing on DAC 1996.
[79] T.J. Callahan , P. Chong, A. DeHon, and J. Wawrzynek Fast Module
Mapping and Placement for Datapaths in FPGAs, Proceeding on FPGA
1998.
[80] Y. Li, T. Callahan, E. Darnell, R. Harr, U. Kurkure, J. Stockwood
Hardware-Software Co-Design of Embedded Reconﬁgurable Architectures,
Proceedings on DAC, 2000.
[81] V.Betz, J.Rose, A.Marquardt ArchitectureandCADforDeep-Submicron
FPGAs, Kluwer Academic Publishers, 1999.
[82] COCOMO 2.0 Model Deﬁnition manual, ver 1.2, 1997.
[83] Capers Jones, Chairman, Software Productivity Research,
Inc. Programming Languages Table, Release 8.2, March 1996.
http://www.theadvisors.com/langcomparison.htm
[84] J. A. Debardelaben, V. K. Madisetti, A. J. Gadient Incorporating Cost
Modeling in Embedded-System Design, IEEE Design & Test of Computer,
Vol. 14, Issue 3, pag. 24-35, July-Sept. 1997
[85] D. Ragan, P. Sandborn, P. Stoaks A Detailed Cost Model for Concurrent
Use With Hardware/Software Co-Design, Proceedings on the Design Au-
tomation Conference (DAC), June 10-14, 2002, New Orleans, LA.
[86] W. Fornaciari, F. Salice, U. Bondi, E. Magini, Development Cost and Size
Estimation Starting from High-Level Speciﬁcations, Proceedings on the In-
ternational Symposium on Hardware/Software Codesign (CODES),
April 25-27, 2001, Copenhagen (Denmark).226 BIBLIOGRAPHY
[87] V. K. Madisetti, J. A. Debardelaben A RASSP Approach to HW/SW
Codesign, RASSP Digest, Vol. 2 4th Qtr. 1995.
[88] A. La Rosa, L. Lavagno, M. Lazarescu, C. Passerone, An optimizing C
front-end for hardware synthesis, Proceedings on the workshop on Wire-
less Reconﬁgurable Terminals and Platforms (WiRTeP), April, 10-12,
2006, Rome (Italy).
[89] D. Burger, T. Austin The SimpleScalar Tool Set, Version 2.0,
www.simplescalar.com
[90] S. Pees, A. Hoffmann, V. Zivojnovic, H. Meyr LISA - Machine Descrip-
tion Language for Cycle-Accurate Models of Programmable DSP Architec-
tures, Proceedings on DAC, Jun. 1999.
[91] J. Cardillo, P. Chow The Effect of Reconﬁgurable Units in Superscalar Pro-
cessors, Proceedings on FPGA, February 2001.
[92] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov,
E.M. Panainte The MOLEN PolymorphicProcessor, IEEE Transactions on
Computers, November 2004
[93] Suddep Pasricha Transaction level modeling of Soc with SystemC 2.0,
Synopsys User Group Conference (SNUG), 2002.
[94] M. Caldari, M. Conti, M. Coppola, S. Curaba, L. Pieralisi, C. Turchetti
Transaction-Level Model for AMBA Bus Architecture Using SystemC 2.0,
Proceedings of the Design,Automation and Test in Europe Conference
and Exhibition (DATE 2003)
[95] L. Di Stefano, S. Mattoccia, F. Tombari Speeding-up NCC-based template
matching using parallel multimedia instructions, Proceedings on the 7th
Int’l Workshop on Computer Architecture for Machine Perception, 4-6
July 2005.
[96] Texas Instruments Incorporated. TMS320C6713, TMS320C6713B
Floating-point Digital Signal Processors. [Online]. Available:
http://focus.ti.com/lit/ds/symlink/tms320c6713.pdfBIBLIOGRAPHY 227
[97] GNU Compiler Collection (GCC) [online available] http://gcc.gnu.org
[98] C. Brunelli, F. Garzia, F. Campi, C. Mucci, J. Nurmi A FPGA Imple-
mentation of an Open-Source Floating-Point Computation System, IEEE Int
l Symposium of System-on- Chip, Tampere (Finland), Nov. 2005.
[99] T. Sikora, MPEG Digital Video-Coding Standards, IEEE Signal Process-
ing Magazine, September 1997.
[100] MPEG Software Simulation Group http://www.mpeg.org
[101] ISO/IEC 13818 Draft International Standard: Generic Coding of
Moving Pictures and Associated Audio, Part-2: video.
[102] A. Dasu and S. Panchanathan A survey of media processing approaches,
IEEE Transactions on Circuits and Systems for Video Technology,
August 2002.
[103] F. Campi et al. A VLIW Processor with Reconfigurable Instruction Set for
Embedded Applications, ISSCC 2003.
[104] P.L. Tai, S.Y. Huang, C.T. Liu and J.S. Wang Computation-Aware
Scheme for Software-Based Block Motion Estimation, IEEE Transactions on
Circuits and Systems for Video Technology, September 2003
[105] C. De Vleeschouwer, T. Nilsson, K. Denolf, J. Bormans Algorithmic
and Architectural Co-Design of a Motion-Estimation Engine for Low-Power
Video Devices, IEEE Transactions on Circuits and Systems for Video
Technology, December 2002
[106] NIST Specification for the ADVANCED ENCRYPTION STANDARD
(AES), FIPS PUBS 197, November 26, 2001.
[107] J. Daemen and V. Rijmen AES Proposal: Rijndael, NIST AES Proposal,
www.esat.kuleuven.ac.be/~rijmen/rijndael/.
[108] B. Schneier, Applied Cryptography, 2nd ed. John Wiley and Sons, New
York, NY, 1996.
[109] S. Ravi, A. Raghunathan, N. Potlapally, M. Sankaradass System Design
Methodologies for a Wireless Security Processing Platform, Proceedings
on the DAC 2002.
[110] S. Tillich et al. An Instruction Set Extension for Fast and Memory-Efficient
AES Implementation, Communications and Multimedia Security,
Springer Verlag, 2005.
[111] M2000 Inc. www.m2000.fr
[112] T. Wollinger, M. Wang, J. Guajardo, C. Paar How Well Are High-End
DSPs Suited for the AES Algorithms? (AES Algorithms on the
TMS320C6x DSP), Presentation at the NIST AES-3 Conference, 2000.
http://csrc.nist.gov/CryptoToolkit/aes/round2/conf3/presentations/wollinger.pdf
[113] J. Zambreno et al. Exploring Area/Delay Tradeoffs in an AES FPGA
Implementation, FPL 2004.
[114] R. Chaves et al. Reconfigurable Memory Based AES Co-Processor,
Proceedings of the RAW, April 2006.
[115] HELION. High Performance AES (Rijndael) cores for Xilinx FPGA,
http://www.heliontech.com
[116] A. Wiesmaier, The State of the Art in Algorithmic Encryption (2006),
citeseer.ist.psu.edu/wiesmaier06state.html
[117] A. Hodjat and I. Verbauwhede A 21.54 Gbit/s fully pipelined AES
processor on FPGA, FCCM 2004.
[118] J. Lu, J. Lockwood IPSec Implementation on Xilinx Virtex-II Pro FPGA
and Its Application, RAW 2005
[119] P.R. Schaumont, H. Kuo, I. Verbauwhede Unlocking the Design Secrets
of a 2.29 Gb/s Rijndael Processor, DAC 2002.
[120] R.R. Taylor, S.C. Goldstein A High-Performance Flexible Architecture
for Cryptography, CHES 1999.
[121] C. Paar Efficient VLSI Architectures for Bit-Parallel Computation in
Galois Fields, Ph.D. Thesis, 1994.
[122] V. Rijmen Efficient Implementation of the Rijndael S-box,
[123] G. Bertoni et al. Efficient Software Implementation of AES on 32-Bit
Platforms, CHES 2002.
[124] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra Overview of the
H.264/AVC video coding standard, IEEE Transactions on Circuits and
Systems for Video Technology, July 2003.
[125] H. S. Malvar, A. Hallapuro, M. Karczewicz, L. Kerofsky
Low-Complexity Transform and Quantization in H.264/AVC, IEEE Transactions
on Circuits and Systems for Video Technology, July 2003.
[126] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, Liang-Gee Chen
Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra
frame coder, IEEE Transactions on Circuits and Systems for Video
Technology, March 2005.