SYSTEMATIC EXPLORATION OF TRADE-OFFS BETWEEN APPLICATION THROUGHPUT AND HARDWARE RESOURCE REQUIREMENTS IN DSP SYSTEMS by Kee, Hojin
ABSTRACT
Title of dissertation: SYSTEMATIC EXPLORATION OF TRADE-OFFS
BETWEEN APPLICATION THROUGHPUT AND
HARDWARE RESOURCE REQUIREMENTS IN
DSP SYSTEMS
Hojin Kee, Doctor of Philosophy, 2010
Dissertation directed by: Shuvra S. Bhattacharyya, Professor
Department of Electrical and Computer Engineering,
and Institute for Advanced Computer Studies
Dataflow has been used extensively as an efficient model-of-computation to ana-
lyze performance and resource requirements in implementing DSP algorithms on various
target architectures. Although various software synthesis techniques have been widely
studied in recent years, there is a distinct lack of efficient synthesis techniques in the
literature for systematically mapping dataflow models into efficient hardware implemen-
tations. In this thesis, we explore three different aspects that contribute to the development
of a powerful dataflow-based hardware synthesis framework:
1. Systematic generation of 1D/2D FFT implementation on field programmable gate
arrays (FPGAs). The fast Fourier transform (FFT) is one of the most widely-used
and important signal processing functions. However, FFT computation generally
becomes a major bottleneck for overall system performance due to its high compu-
tational requirements. We propose a systematic approach for synthesizing FPGA
implementations of one- and two-dimensional (1D and 2D) FFT computations, and
rigorously exploring trade-offs between cost (in terms of FPGA resource require-
ments) and performance (in terms of throughput). Our approach provides an ef-
ficient hardware synthesis framework that can be customized to specific design
constraints. In our FFT synthesis approach, we apply two orthogonal techniques
in FPGA implementation to realize data-parallelism and parallel processing in FFT
computation, respectively. These techniques can be applied to various 1D FFT al-
gorithms, including Radix-2 and Radix-4 algorithms, and extended naturally and
efficiently to 2D FFT implementation.
2. Buffer optimization under self-timed execution. Self-timed execution is known to
provide the maximum achievable throughput when mapping DSP dataflow graphs
into hardware under certain technical constraints. Throughput-constrained buffer
minimization under self-timed execution is a key question in efficient hardware
synthesis for practical design scenarios. Previous approaches to this problem have
suffered from high worst case complexity or loose buffer bounds, which lead to in-
efficient resource utilization. In this thesis, we integrate a novel constraint into
traditional self-timed execution to obtain a modified form of self-timed execu-
tion, which we call MSTE (Modified Self-Timed Execution). We show that MSTE
greatly improves the efficiency with which we can accurately analyze and optimize
hardware configurations of dataflow graphs, and furthermore, the additional execu-
tion constraints imposed in MSTE result in relatively minor performance overhead.
Based on MSTE, we explore novel methods for self-timed analysis and associated
techniques for buffer optimization subject to given throughput constraints.
3. Hardware synthesis technique for parameterized dataflow model. Parameterized
dataflow modeling approaches allow for dynamic capabilities without excessively
compromising the key properties of the existing static dataflow model: compile-time
predictability and potential for rigorous optimizations. We develop a novel
PSDF-based FPGA architecture framework using National Instruments’ LabVIEW
FPGA, a recently-introduced commercial platform for reconfigurable hardware
implementation. This framework develops novel connections among model-based
DSP system design, FPGA implementation, and next generation wireless
communication systems.




Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment














my wife & our son
Acknowledgments
First of all, I would like to thank Dr. Shuvra S. Bhattacharyya for giving me an
invaluable opportunity to work on challenging and interesting projects. While working
with him, I have learned how to define an engineering problem, how to approach solving
it, and how to describe my approach technically in presentations and papers.
Furthermore, his positive advice and encouragement on my work motivated me to do
my best, and became a great lesson when I worked with my colleagues.
I also would like to thank the people of National Instruments, including Dr. Jacob
Kornerup, Newton Petersen, Minhaz Khan, Ben Weidman, Dr. Ian Wong, Dr. Kaushik
Ravindran, and Yon Rao. It has been my pleasure to work and interact with good and
smart engineers from National Instruments. Also, it was always exciting to apply my
academic solutions to cutting-edge applications used in the real-engineering field.
It is my pleasure to thank the members of the committee, including Dr. Gang Qu,
Dr. Raj Shekhar, Dr. Ramani Duraiswami, and Dr. Harris, for guiding me to complete this
thesis with constructive feedback and helpful discussions.
My gratitude goes to the DSPCAD research group, including Will, Chung-ching,
Ruirui, Nimish, Hsing-Huang, George, Inkeun, Scott, and Soujanya. I also feel grateful
to all my friends here for their generous friendship, which became a great support in
completing my Ph.D. study.
I would like to give my deepest gratitude to my parents. “You can do it because you
are our proud son.” Their endless support and love motivated me in moving forward to
become their proud son when I was having hard times. Looking back, it was one of the
best motivations to achieve this milestone in my academic career.
Finally, I thank my beloved wife, who accompanied me from the beginning to the




List of Figures
List of Tables
1 Introduction
1.1 Contributions of this thesis
1.1.1 Systematic generation of FPGA-based 1D-FFT implementation
1.1.2 Resource-efficient acceleration of 2D-FFT on FPGAs
1.1.3 Efficient static buffering for throughput-optimal FPGA Implementation of synchronous dataflow graphs
1.1.4 Hardware synthesis techniques for parameterized dataflow
1.2 Outline of thesis
2 Systematic generation of FPGA-based 1D-FFT implementation
2.1 Introduction
2.2 BACKGROUND AND RELATED WORK
2.3 UNROLLING TECHNIQUES
2.3.1 Outer Loop Unrolling
2.3.2 Inner Loop Unrolling in radix–2 FFT
2.3.2.1 Address for the read
2.3.2.2 Address for the write
2.3.2.3 Conflict–free property in read/write
2.3.3 Inner Loop Unrolling in radix–4 FFT
2.3.3.1 Address for the read
2.3.3.2 Address for the write
2.3.3.3 Conflict–free property in read/write
2.4 COST/PERFORMANCE ANALYSIS
2.5 EXPERIMENTAL RESULTS
3 Resource-efficient Acceleration of 2D-FFT on FPGAs
3.1 Introduction
3.2 Background
3.3 2D-FFT Design
3.3.1 Inner Loop Unrolling Technique (ILUT)
3.3.2 2D-FFT Architecture
3.4 Analysis and Comparison ILUT-based and OLUT-based Implementation
3.4.1 Operation of ILUT-based 2D-FFT Implementation
3.4.2 Operation of OLUT-based 2D-FFT Implementation
3.5 Experimental Results and Discussions
3.6 Conclusion
4 Efficient Static Buffering to Guarantee Throughput-Optimal FPGA Implementation of Synchronous Dataflow Graphs
4.1 Introduction and related work
4.2 Background
4.2.1 Application representation
4.3 Target platform model
4.4 Design flow
4.5 Two-actor SDF graph model (TASM)
4.5.1 Two-actor SDF graph model (TASM)
4.5.2 Modified self-timed execution (MSTE) in TASM
4.5.3 Subperiods in TASM
4.6 Properties of subperiods in TASM
4.7 Throughput analysis in TASM
4.7.1 Firing pattern analysis
4.7.2 Saturated TASM systems
4.8 Analysis of saturated systems
4.9 Application to general tree-structured SDF graphs
4.10 Experimental results
5 Hardware synthesis technique for parameterized dataflow model
5.1 Introduction
5.2 Background
5.2.1 LTE downlink physical layer
5.2.2 Parameterized Synchronous Dataflow
5.3 Parameterized SDF Model of LTE
5.3.1 LTE specification
5.3.2 PSDF Modeling Details
5.3.3 PSDF Execution Model
5.4 LTE Prototype Implementation
6 Conclusion and Future Work
6.1 Conclusion





1.1 Relationships among different levels of dataflow-based design methods for DSP systems.
2.1 Signal flow graph of 8–point FFT with notational conventions illustrated. For each stage p < n, the data written through an output index for stage p corresponds to the data read through an input index for stage (p+1).
2.2 The pipelined radix–2 FFT implementation.
2.3 DM bank selection logic and parallel-in/serial-out shift register.
2.4 Resource utilization in Radix–2 FFT implementation with 4096 samples.
2.5 Resource utilization in Radix–4 FFT implementation with 4096 samples.
2.6 Resource utilization in the streaming Radix–2 FFT with 4096 samples.
2.7 Resource utilization in the streaming Radix–4 FFT with 4096 samples.
3.1 Functional block diagram of 2D-FFT computation.
3.2 Functional block diagram of ILUT-based, 1D-FFT implementation.
3.3 Functional block diagram of 1D-FFT with OLUT.
3.4 Functional block diagram of 2D-FFT with ILUT.
3.5 A timing diagram of ILUT-based FFT computation.
3.6 Computation time and FPGA resource utilization for 2D-FFT with an image size of 256x256.
3.7 Computation time and FPGA resource utilization for 2D-FFT with an image size of 2048x2048.
4.1 Overall design flow.
4.2 An example of an SDF edge and its TASM model.
4.3 Example of TASM-based modeling approach, and execution patterns under conventional self-timed execution and MSTE.
4.4 DIF-based Application specifications.
5.1 Example LTE subframe showing multiplexing of various channels on a 2D time-frequency grid (not to scale).
5.2 PSDF Model for LTE BS Modulator.
5.3 PSDF specification of RE Mapper.
List of Tables
2.1 Time When address is accessed for read/write
2.2 Comparing synthesis report between radix–2 and radix–4 under the same performance level
3.1 Relative resource requirements for an image size of 256x256.
3.2 Relative resource requirements for an image size of 2048x2048.
4.1 The number of firings of $v^{T}_{src}$ and $v^{T}_{snk}$ in subperiod α and β of TASM
4.2 Sum of result buffer distribution under the maximum throughput (samples/cycle) and its synthesis result




Dataflow-based digital signal processing (DSP) system design methods include
three levels of design. Each level of design is closely related to the other levels, as il-
lustrated in Fig. 1.1. In this thesis, we explore techniques at each of these design levels,
and corresponding advances that are applicable to DSP system design flows at each of
these levels. Furthermore, novel trade-offs of performance enhancement techniques that
are enabled by our techniques are considered jointly to realize optimized DSP imple-
mentations subject to given constraints on performance and resource requirements. We
specifically consider techniques for efficient implementation of DSP-based application
representations on field programmable gate array (FPGA) devices, which are attractive
targets for rapid prototyping and high performance signal processing in many application
contexts.
For actor-level design, we explore trade-offs between throughput and resource re-
quirements (hardware cost) in implementing computational modules for the fast Fourier
transform (FFT), which is a fundamental function in many signal processing applications.
Due to its computational complexity, O(N log N), where N is the number of inputs, and
the large amount of data that must be processed, FFT computation often becomes a major
bottleneck for overall system performance. In this thesis, we develop a systematic
approach for generating cost-efficient, FPGA-based FFT implementations based on a
designer-specified throughput requirement.

Figure 1.1: Relationships among different levels of dataflow-based design methods for
DSP systems.
In dataflow-based system design, functional blocks and communication channels
for transferring data between adjacent blocks are modeled as graph vertices (actors) and
edges, respectively. When mapping dataflow graph edges into storage locations, care
must be taken to make effective use of limited storage locations (e.g., on-chip memory
in programmable digital signal processors, and block RAM and distributed memory in
FPGAs). However, reducing the storage space for transferring data between actors may
result in decreased throughput due to idle time that is required to prevent buffer overflow:
as buffers become smaller, the frequency and duration of such overflow-avoiding idle
time generally increase, which leads to decreased throughput. The limited amounts of
storage available in DSP implementation targets, and the importance of meeting real-time
performance constraints motivate the goal of throughput-constrained buffer minimization
for SDF graphs. In this thesis, we study this problem in the context of FPGA-based
implementation.
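The interaction between buffer size and throughput described above can be illustrated with a small cycle-level simulation. The following is a minimal sketch, not the analysis developed in this thesis: the two-actor graph, the token rates, the execution times, and the buffer-reservation semantics are all assumptions chosen for illustration.

```python
def simulate_throughput(buffer_size, produce=3, consume=2,
                        t_src=2, t_snk=1, cycles=7000):
    """Sink firings per cycle for a two-actor graph src -> snk.

    Assumed semantics (illustrative only, not taken from the thesis):
    - src reserves `produce` free slots when it starts a firing, and its
      tokens become readable when the firing ends;
    - snk removes `consume` readable tokens when it starts a firing, and
      releases their slots when the firing ends;
    - each actor fires as soon as these conditions hold (self-timed).
    """
    occupied = 0      # slots reserved by src or still held by snk
    readable = 0      # tokens that snk may consume
    src_left = 0      # remaining cycles of the current src firing
    snk_left = 0      # remaining cycles of the current snk firing
    snk_firings = 0
    for _ in range(cycles):
        # Start new firings at the beginning of the cycle.
        if src_left == 0 and buffer_size - occupied >= produce:
            occupied += produce
            src_left = t_src
        if snk_left == 0 and readable >= consume:
            readable -= consume
            snk_left = t_snk
        # Advance time by one cycle and complete finished firings.
        if src_left > 0:
            src_left -= 1
            if src_left == 0:
                readable += produce
        if snk_left > 0:
            snk_left -= 1
            if snk_left == 0:
                occupied -= consume
                snk_firings += 1
    return snk_firings / cycles


if __name__ == "__main__":
    for size in (3, 4, 16):
        print(size, round(simulate_throughput(size), 3))
```

Under these assumed rates, a 3-slot buffer stalls after one sink firing, a 4-slot buffer sustains 3 sink firings every 7 cycles, and a 16-slot buffer lets the source run without idle time, so the sink reaches 3 firings every 4 cycles.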
When mapping DSP dataflow graphs into FPGA implementations, it is important to
consider real-time constraints as well as optimization of hardware resources. Synchronous
dataflow (SDF) [1] has been used widely as an efficient model of computation for ana-
lyzing performance and resource requirements of DSP applications that are implemented
on various target architectures (e.g., see [2, 3, 4, 5, 6]). In recent years, the parameter-
ized dataflow meta modeling approach has evolved as a useful framework for modeling
graphs in which arbitrary actor, edge, and graph parameters can be changed dynamically.
However, the potential to enable efficient hardware synthesis has been treated relatively
sparsely in the literature for traditional dataflow modeling techniques, and even more so
for the newer, more general parameterized dataflow model. In this thesis, we develop
efficient techniques to synthesize SDF-based and parameterized-dataflow-based dataflow
models onto FPGA platforms.
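As a brief illustration of why SDF is amenable to compile-time analysis, the sketch below solves the standard SDF balance equations (for each edge, firings of the source times tokens produced equals firings of the sink times tokens consumed) to obtain the minimal number of firings per actor in one graph iteration. The function name, the edge format, and the example graph are hypothetical and are not taken from this thesis.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(edges):
    """Minimal positive firing counts for a connected SDF graph.

    `edges` is a list of (src, dst, produced, consumed) tuples, a
    hypothetical format chosen for this sketch.
    """
    edges = list(edges)
    rate = {edges[0][0]: Fraction(1)}   # relative firing rates
    progress = True
    while progress:                      # propagate rates along edges
        progress = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                progress = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                progress = True
    for src, dst, produced, consumed in edges:
        if src not in rate or dst not in rate:
            raise ValueError("graph is not connected")
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError("inconsistent sample rates")
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


if __name__ == "__main__":
    # A produces 2 tokens, B consumes 3; B produces 1 token, C consumes 2.
    print(repetitions_vector([("A", "B", 2, 3), ("B", "C", 1, 2)]))
    # {'A': 3, 'B': 2, 'C': 1}
```

Because the rates are fixed at compile time, this vector, and hence the total data volume per iteration on each edge, is known before synthesis; parameterized dataflow relaxes this by allowing the rates to change between iterations.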
1.1 Contributions of this thesis
1.1.1 Systematic generation of FPGA-based 1D-FFT implementation
We propose a systematic approach for synthesizing FPGA implementations of FFT
computations. Our approach considers both cost (in terms of FPGA resource require-
ments), and performance (in terms of throughput), and optimizes for both of these di-
mensions based on user-specified requirements. Our approach involves two orthogonal
techniques — FFT inner loop unrolling and outer loop unrolling — to perform design
space exploration in terms of cost and performance. By appropriately combining these
two forms of unrolling, we can achieve cost-optimized FFT implementations in terms of
FPGA slices or block RAMs in an FPGA subject to constraints on the required through-
put.
We compared the results of our synthesis approach with a recently-introduced com-
mercial FPGA intellectual property (IP) core — the FFT IP module in the Xilinx Logi-
Core Library, which provides different FFT implementations that are optimized for a lim-
ited set of performance levels. Our results demonstrate efficiency levels that are in some
cases better than these commercial IP blocks. At the same time, our approach provides the
advantages of being able to optimize implementations based on arbitrary, user-specified
performance levels, and of being based on general formulations of FFT loop unrolling
trade-offs, which can be retargeted to different kinds of FPGA devices.
1.1.2 Resource-efficient acceleration of 2D-FFT on FPGAs
The 2-dimensional (2D) FFT is a fundamental, computationally intensive function
that is of broad relevance to multidimensional signal processing computations, such as
those found in smart camera systems, medical imaging tools, and other important appli-
cations. In this thesis, we develop a systematic method for improving the throughput of
2D-FFT implementations on FPGAs. Our method is based on a novel loop unrolling tech-
nique for FFT implementation, which is extended from our work on FPGA architectures
for 1D-FFT implementation described in Section 1.1.1.
Our unrolling technique deploys multiple processing units within a single 1D-FFT
core to achieve efficient configurations of data parallelism while minimizing memory
space requirements, and FPGA slice consumption. Furthermore, using our techniques
for parallel processing within individual 1D-FFT cores, the number of input/output (I/O)
ports within a given 1D-FFT core is limited to one input port and one output port. In
contrast, previous 2D-FFT design approaches require multiple I/O pairs with multiple
FFT cores. This streamlining of 1D-FFT interfaces makes it possible to avoid complex
interconnection networks and associated scheduling logic for connecting multiple I/O
ports from 1D-FFT cores to the I/O channels of external memory devices. Hence, our
proposed unrolling technique maximizes the ratio of the achieved throughput to the con-
sumed FPGA resources under pre-defined constraints on I/O channel bandwidth.
To provide generality, our framework for 2D-FFT implementation can be efficiently
parameterized in terms of key design parameters such as the transform size and I/O data
word length.
1.1.3 Efficient static buffering for throughput-optimal FPGA Implementation of synchronous dataflow graphs
When designing DSP applications for implementation on FPGAs, it is often impor-
tant to minimize consumption of limited FPGA resources while satisfying real-time per-
formance constraints. We develop efficient techniques to determine dataflow graph buffer
sizes that guarantee throughput-optimal execution when mapping synchronous dataflow
(SDF) representations of DSP applications onto FPGAs. Our techniques are based on a
novel modeling technique, which we call the two-actor SDF graph model (TASM). The
TASM technique efficiently captures important characteristics relating to the behavior
and costs associated with SDF graph edges. With our proposed techniques, designers
can automatically generate upper bounds on SDF graph buffer distributions that realize
maximum achievable throughput performance for the corresponding applications. Fur-
thermore, our proposed technique is characterized by low polynomial time complexity,
which is useful for rapid prototyping in DSP system design.
1.1.4 Hardware synthesis techniques for parameterized dataflow
Parameterized SDF (PSDF) has evolved as a useful framework for modeling SDF
graphs in which arbitrary parameters can be changed dynamically. However, the po-
tential to enable efficient hardware synthesis has been treated relatively sparsely in the
literature for SDF and even more so for the newer, more general PSDF model. In this
thesis, we investigate efficient FPGA-based design and implementation of the physical layer
for 3GPP-Long Term Evolution (LTE), a next generation cellular standard. To capture
the SDF behavior of the functional core of LTE along with higher level dynamics in the
standard, we use a novel PSDF-based FPGA architecture framework. We implement
our PSDF-based LTE design framework using National Instruments’ LabVIEW FPGA,
a recently-introduced commercial platform for reconfigurable hardware implementation.
We show that our framework can effectively model the dynamics of the LTE protocol,
while also providing a synthesis framework for efficient FPGA implementation.
1.2 Outline of thesis
The rest of this thesis is organized as follows. Chapter 2, Chapter 3, Chapter 4,
and Chapter 5 develop the contributions discussed in Section 1.1.1, Section 1.1.2,
Section 1.1.3, and Section 1.1.4, respectively. In Chapter 6, we present conclusions of the




Systematic generation of FPGA-based 1D-FFT implementation
2.1 Introduction
The fast Fourier transform (FFT) is one of the most widely-used and important signal
processing functions, for example, in applications related to digital communications
and image processing. Since the computational complexity of the FFT is O(N log N),
where N is the number of inputs, the FFT potentially requires multi-cycle processing, and
can become a major bottleneck for overall system performance. Thus, care must be taken
in FFT module development at the actor level of the design methodology illustrated in
Fig. 1.1.
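To make the multi-cycle nature of the FFT concrete, the following sketch counts the butterfly operations of a radix–2 FFT, which has log2(N) stages of N/2 butterflies each, and converts the count into cycles per input sample. The assumption that each butterfly unit completes one butterfly per clock cycle, with pipeline latency and memory stalls ignored, is ours and is made only for illustration.

```python
def radix2_butterfly_count(n):
    """Total butterflies in an n-point radix-2 FFT: log2(n) stages of n/2."""
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two and at least 2")
    return stages * (n // 2)


def cycles_per_sample(n, butterfly_units=1):
    """Average cycles per input sample, assuming (hypothetically) that each
    unit finishes one butterfly per cycle and that units never stall."""
    return radix2_butterfly_count(n) / (butterfly_units * n)


if __name__ == "__main__":
    n = 4096
    print(radix2_butterfly_count(n))   # 24576 butterflies in 12 stages
    print(cycles_per_sample(n, 1))     # 6.0 cycles per sample
    print(cycles_per_sample(n, 6))     # 1.0 cycle per sample
```

Under this simplified model, a single butterfly unit needs several cycles per sample for N = 4096, and the number of parallel units is the knob that moves a design between that point and single-cycle-per-sample throughput.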
To relieve this bottleneck, many commercial IP blocks provide a streaming form
of the FFT with single–cycle–per–sample throughput. This high–throughput form of
FFT comes at the expense of increased hardware cost, which in turn can lead to costly,
over–designed hardware in situations where single–cycle–per–sample throughput is not
required — that is, in situations where the FFT bottleneck is significant, but not so severe
as to require such a high degree of throughput optimization. This chapter develops a sys-
tematic approach for generating a cost–efficient, FPGA–based FFT implementation based
on a designer–specified throughput requirement. Our approach carefully integrates two
orthogonal methods for trading–off hardware cost and performance. The first method,
which can be viewed as outer loop unrolling of the targeted FFT, realizes parallelism by
instantiating multiple processing cores (dedicated hardware subsystems) across FFT
butterfly stages. The second method, which can be viewed as unrolling of the FFT inner
loop, allocates multiple cores within each stage. Each of these methods has advantages
and drawbacks compared to the other, and in general, an integrated application of both
methods can lead to a more cost–effective solution for a given throughput constraint than
a solution that applies only one of these methods, or one that is based on the high
performance/high cost streaming FFT implementation. Depending on the given throughput
constraint, one of these unrolling methods may be of more critical utility than the other.
Furthermore, the proposed integrated unrolling technique can be applied to radix–4 FFT
algorithms as well as radix–2 algorithms. This enables designers to choose, based on their
performance requirements, between processing units that exhibit different
performance/cost trade–offs.
Motivated by these observations, we develop a comprehensive approach to mixing
and matching outer and inner loop unrolling for cost–efficient, throughput–constrained
synthesis of FPGA hardware. In FPGA synthesis, slices (basic logic cells) and block
RAMs (BRAMs) are limited, and usage in terms of these two resources is important
in evaluating hardware cost [7]. Our synthesis approach is prototyped in National
Instruments LabVIEW FPGA 8.5. LabVIEW is a graphical, dataflow–based programming
environment for embedded systems design. LabVIEW features for HDL (hardware
description language) synthesis and fixed point data types, along with LabVIEW’s dataflow
orientation, make the tool well–suited to FPGA-based design of signal processing
applications. The output of our techniques for synthesis and optimization of FFT configurations
is a LabVIEW dataflow diagram that specifies the structure and functionality of an
optimized FFT configuration. This diagram is then synthesized to an FPGA device by first
invoking LabVIEW’s HDL synthesis tool, and then mapping the resulting HDL code us-
ing the platform-specific tools of the targeted FPGA. In our experiments, we have targeted
the Xilinx Virtex II Pro FPGA.
In our experiments, we have compared the targeted cost metric — the usage of
FPGA slices and BRAMs — between implementations generated by our novel synthe-
sis flow, and those obtained from the Xilinx LogiCore library [8] for identical levels of
throughput. The results demonstrate that our synthesis approach provides results that
are of similar cost to those from the commercial library. This is encouraging since our
approach provides the unique advantage of being synthesis-driven (as opposed to library-
based) so that it can be driven by arbitrary performance levels rather than being restricted
to a pre-determined subset of FFT configurations. Also, because it is based on an abstract
synthesis formulation, it can be retargeted to different FPGA devices, e.g., by weighting
or otherwise revising the cost function in terms of the resources that are most critical for
a particular target.
2.2 BACKGROUND AND RELATED WORK





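The N-point discrete Fourier transform (DFT) $X_k$ of an input sequence $x_i$ is defined as
\[ X_k = \sum_{i=0}^{N-1} x_i \cdot W_N^{ik} \tag{2.1} \]
where
\[ W_N^{ik} = \exp(-j 2\pi i k / N), \quad k = 0, 1, \cdots, N-1 \tag{2.2} \]
and $j = \sqrt{-1}$.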
The computational complexity of the DFT is O(N^2). The radix–2 fast Fourier transform
(FFT) algorithm proposed by Cooley and Tukey [9], illustrated in Figure 2.1, is widely used to
compute the DFT with a complexity of O(N log N). Since then, numerous FFT algorithms
have been proposed to further reduce the computational complexity, such as radix–2^m
algorithms, the Winograd Fourier transform algorithm (WFTA) [10], prime factor algorithms
(PFA) [11], and the fast Hartley transform (FHT) [12]. Because of its simple structure with
a constant butterfly geometry, the radix–2^m FFT algorithm is one of the most popular
algorithms for hardware implementation in practical applications.
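The relationship between the DFT definition in (2.1) and (2.2) and the radix–2 FFT can be checked numerically. The sketch below is a textbook illustration rather than the hardware formulation used in this chapter: `dft` evaluates the definition directly with N^2 complex multiplications, and `fft_radix2` is a recursive decimation-in-time radix–2 FFT. Both function names are ours.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition, O(N^2) operations."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n)
                for i in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive decimation-in-time radix-2 FFT, O(N log N) operations.

    The input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One butterfly: combine two half-size results with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, -1, -2, -3]
    err = max(abs(a - b) for a, b in zip(dft(x), fft_radix2(x)))
    print(err < 1e-9)   # True: both routines agree up to rounding
```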
Compared to general purpose processors [13] [14] and dedicated FFT processor ICs [15] [16],
FFT implementation on FPGAs has the advantage of being reconfigurable based on
user-specified design parameters, such as the FFT size, the data word length, the
performance level, and the hardware resource budget. Various research efforts have been
devoted to developing FPGA–based FFT libraries. Uzun [17] developed a framework covering
different types of 1–D FFT algorithms. The works in [18, 19, 20, 21] present efficient FFT
implementations on FPGA targets that each meet a single throughput requirement. Xilinx
LogiCore [22] provides a radix–2/4 FFT library with a variable FFT size and two performance
levels: burst mode and streaming mode. These works support only a limited set of
throughput levels, which restricts the design space.
Achieving various degrees of speed–up in radix–2/4 algorithms, which iteratively run the
butterfly operation (dragonfly in radix–4) shown in Figure 2.1, requires executing
multiple butterfly operations in parallel or in a pipelined manner. Since the
pair of inputs for the butterfly operation changes in each FFT stage in Figure 2.1, a



























Figure 2.1: Signal flow graph of 8–point FFT with notational conventions illustrated. For
each stage p < n, the data written through an output index for stage p corresponds to the
data read through an input index for stage (p+1).
an efficient method for the conflict–free memory management in FFT implementation.
In Ma’s approach a conflict–free strategy is employed to store butterfly outputs in the
same memory locations that are used by the inputs to the butterfly. Such a conflict–free
strategy is useful in reducing memory requirements, and enabling pipelining in terms of
memory reads, butterfly operations, and memory writes. However, Ma’s scheme is
developed for an FFT core that involves a single butterfly unit, so the overall approach is
limited in terms of throughput improvement. Our proposed address scheme extends Ma’s
work to realize multiple butterfly operations in parallel and in a pipelined manner.
Nordin [24] presented a parameterized soft core generator for the FFT based on the
Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [25].
By running multiple butterflies simultaneously with a scalable stride permutation, the
generated FFT achieves an effective balance between hardware costs and performance
features, and is also customizable based on given design constraints. A distinguishing
aspect of the approach that we develop in this chapter is the realization of data parallelism
with a carefully–configured address generator, and the integration of this address genera-
tion approach with an inner loop unrolling technique. This is in contrast, for example, to
introducing special permutation structures for butterfly operations. Our approach, which
is especially targeted to FPGA implementation, results in efficient utilization of FPGA
slices.
A preliminary version of this work has been presented in [26], and this chapter goes
beyond the developments of [26] by applying the proposed address scheme to radix–4
FFT algorithm as well as radix–2 FFT. For saving the storage space for a twiddle factor
table, Hassan’s method[27] has been applied in the proposed implementation.
2.3 UNROLLING TECHNIQUES
The radix–2/4 FFT algorithm involves running the butterfly/dragonfly operation
iteratively. Using a conflict–free memory management scheme, we roll the butterfly
operations within a given stage into a for–loop, which we refer to as the inner loop. Across
different stages, we then employ another for–loop, which we call the outer loop. A basic
FFT core (BFC) provides dedicated hardware for the butterfly/dragonfly operation inside
the for–loops, and we can execute a BFC iteratively with the aforementioned inner and outer
for–loops to achieve a complete FFT computation. However, rather than instantiating just
one BFC for computing all FFT stages, we can achieve a k-fold throughput improvement
by running k BFCs simultaneously across stages, or by incorporating parallelism inside
the BFC so that multiple butterfly/dragonfly operations can be executed in parallel within
a given stage. We propose two orthogonal unrolling techniques to allocate and utilize
BFCs in an efficient and scalable manner on FPGAs. The techniques have different cost
functions in terms of usage of FPGA slices or BRAMs, and we show that in general, the
two approaches should be considered jointly for cost–efficient, FPGA–based FFT
implementation.
2.3.1 Outer Loop Unrolling
The iteration count for the outer for–loop in the FFT is equal to the total number of
stages, $\log N$. Unrolling the outer loop by an unrolling factor k (k ≥ 1) instantiates k BFCs.
(k−1) of these BFCs have $\lceil \log N / k \rceil$ outer loop iterations each, while the remaining
one has the leftover $\log N - (k-1)\lceil \log N / k \rceil$ outer iterations, so that the iteration
counts sum to $\log N$. The designs of the BFCs are identical to each other, as described in
Sections 2.3.2 and 2.3.3, except for some initialization details and the iteration
count. In this approach, k BFCs run simultaneously, and up to a factor of k
improvement in throughput can be achieved. Because this approach introduces k identical
copies of the BFC, a factor of k increase in hardware cost is expected, in
terms of both BRAMs and FPGA slices. The trade-offs associated with outer loop unrolling are
complemented by inner loop unrolling, which we elaborate on in the following section.
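As a minimal sketch of the stage allocation described above, the following Python helper splits the outer-loop iterations across k BFCs. The function name `partition_stages` is hypothetical, and the rule that the last BFC receives whatever remains is an assumption chosen so that the counts sum to the total number of stages.

```python
import math


def partition_stages(num_stages, k):
    """Split the outer-loop iterations (FFT stages) across k BFCs.

    The first k-1 BFCs each take ceil(num_stages / k) stages and the last
    BFC takes the remainder, so the counts always sum to num_stages.
    """
    if k < 1:
        raise ValueError("unrolling factor must be at least 1")
    per_bfc = math.ceil(num_stages / k)
    last = num_stages - (k - 1) * per_bfc
    if last < 1:
        raise ValueError("unrolling factor leaves a BFC without any stage")
    return [per_bfc] * (k - 1) + [last]


# A radix-2 FFT with N = 4096 samples has log2(4096) = 12 stages.
print(partition_stages(12, 3))   # [4, 4, 4]
print(partition_stages(10, 3))   # [4, 4, 2]
```

Some factors (for example k = 5 with 12 stages) leave no stage for the last BFC under this rule, which the sketch reports as an error rather than rebalancing.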
2.3.2 Inner Loop Unrolling in radix–2 FFT
Figure 2.2: The pipelined radix–2 FFT implementation. [Diagram labels: $A_p = \mathrm{RL}(Counter, p) = a_{n-r-2} a_{n-r-3} \ldots a_0$; DM bank pair $B_p = b_r b_{r-1} \ldots b_1 0$ and $B_p = b_r b_{r-1} \ldots b_1 1$; Butterfly Unit; $B_{p+1} = (a_p = 0)\, b_r b_{r-1} \ldots b_1$.]
While unrolling the outer loop is realized by adding more copies of the BFC, the inner
loop unrolling could be realized by executing multiple butterfly units in parallel inside
a BFC. That is, we parameterize the BFC with the number of hardware butterfly units, and
we increase the value of the associated parameter to trade off increased area for improved
throughput. Ma [23] observed that the indices of the two inputs, u and l, of the butterfly unit in
the pth stage are identical, except for the pth bit in their binary patterns. Based on this
observation, we propose a conflict–free memory address assignment for the inputs/outputs
of multiple butterfly operations that execute in parallel. With an inner loop unrolling factor
of k, the k butterfly units within a BFC require 2k independent (parallel) data memory banks (DM
banks); however, the amount of storage required in each DM bank is reduced by a factor
of k, so that the total amount of DM bank storage required after inner loop unrolling is
unchanged compared with a BFC that has a single butterfly unit. Note that in an FPGA
device, each DM bank will normally be implemented by one or more BRAMs [7].
2.3.2.1 Address for the read
For efficient hardware utilization, we restrict the inner loop unrolling factor to
be a power of 2; that is, $k = 2^r$ for some non-negative integer r. Each DM bank contains
$2^{n-r-1}$ data locations that are accessed during FFT computation, where $n = \log_2 N$, and
N is the number of sample points involved in the overall FFT computation. Given an inner
loop unrolling factor $k = 2^r$, there are k hardware butterfly units in the parameterized
BFC, and 2k DM banks (two for each butterfly unit). If x denotes a binary bit pattern,
and y denotes a non-negative integer, let RL(x,y) denote the bit pattern that results from
left-rotation of x by y bit positions, and similarly, let RR(x,y) denote the bit pattern that
results from right-rotation of x by y bit positions. Also, for bit patterns x1 and x2, let
CONCAT(x1,x2) denote the concatenation of x1 and x2. For example, if x1 = 110, and
x2 = 01100, then RL(x1,2) = 011, RR(x2,3) = 10001, and CONCAT(x1,x2) = 11001100.
Let the DM banks have indices 0, 1, . . . , (2k−1). Suppose that p is the index of a given
FFT stage (i.e., 0 ≤ p ≤ n−1); let $B_p = b_r b_{r-1} \ldots b_0$ be the binary pattern of the DM
bank index in the pth stage; and let $A_p = a_{n-r-2} a_{n-r-3} \ldots a_0$ be the bit pattern for the data
address in the DM bank in this stage. For clarity, our conventions for input indices and
FFT stage indices, as well as N and n, are illustrated in Figure 2.1. In the proposed memory
addressing scheme, the input index in the pth stage that corresponds to address $A_p$ in DM
bank index $B_p$ can be derived as
$u = \mathrm{RL}(\mathrm{CONCAT}(\mathrm{RR}(A_p, p), B_p), p)$ (2.3)
$\;\;\, = a_{n-r-2} a_{n-r-3} \ldots a_p \, b_r b_{r-1} \ldots b_0 \, a_{p-1} a_{p-2} \ldots a_0$
With this notation, the least significant bit (LSB) of a given DM bank index, $b_0$,
represents the pth bit of the corresponding input index in the pth FFT stage. Since the two input
indices for a given butterfly operation in the pth stage are the same except for the pth bit,
the input index u and the index l of the other input in the same butterfly operation
refer to data stored in DM banks $b_r b_{r-1} \ldots b_1 0$ and $b_r b_{r-1} \ldots b_1 1$, respectively. These two
DM banks, whose indices are identical except for their LSBs, form a pair of DM banks
that is assigned to the same hardware butterfly unit in Figure 2.2. Thus, we entirely avoid
any selection logic among the DM banks for matching one accessed input
to the other input of the correct butterfly computation in each FFT stage. Moreover,
all pairs of inputs to the multiple butterfly operations can be accessed with the same address,
because the indices of corresponding input pairs are identical except for the pth bit, and
the pth bit is the one that selects the DM bank for a given butterfly unit. Using a single access
address for all input pairs in every iteration helps to implement the input address generator
in an efficient manner.
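The read mapping can be sketched in a few lines of Python. This is an illustrative model rather than the hardware description: bit patterns are held as integers with an explicit width, and the names `rl`, `rr`, `concat`, and `input_index` are hypothetical helpers that follow the definitions of RL, RR, CONCAT, and equation (2.3).

```python
def rl(x, y, width):
    """Rotate the width-bit pattern x left by y bit positions."""
    if width == 0:
        return 0
    y %= width
    mask = (1 << width) - 1
    return ((x << y) | (x >> (width - y))) & mask


def rr(x, y, width):
    """Rotate the width-bit pattern x right by y bit positions."""
    if width == 0:
        return 0
    return rl(x, width - (y % width), width)


def concat(x1, x2, width2):
    """Concatenate x1 (upper bits) with the width2-bit pattern x2 (lower bits)."""
    return (x1 << width2) | x2


def input_index(a, b, p, n, r):
    """Equation (2.3): input index stored at address a of DM bank b in stage p."""
    addr_width = n - r - 1   # number of bits in A_p
    bank_width = r + 1       # number of bits in B_p
    return rl(concat(rr(a, p, addr_width), b, bank_width), p, n)


# The worked example from the text.
print(format(rl(0b110, 2, 3), "03b"))            # 011
print(format(rr(0b01100, 3, 5), "05b"))          # 10001
print(format(concat(0b110, 0b01100, 5), "08b"))  # 11001100
```

For any stage p, the two banks of a pair (indices differing only in the LSB) hold, at the same address, indices that differ only in bit p, which is the property the butterfly unit relies on.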
2.3.2.2 Address for the write
The proposed FFT implementation does not introduce additional DM banks for
storing butterfly outputs in order to pipeline the read, butterfly computation,
and write operations. After a butterfly operation in the pth stage, the outputs are written back
to the DM banks from which the butterfly input pairs came. The butterfly output pairs stored in
the DM banks must be ready to be read in the (p+1)th stage in the manner described in
Section 2.3.2.1. In other words, the destination DM bank indices and addresses for writing back
an output pair indexed by u and l in the pth stage are equivalent, respectively, to the DM bank
indices and addresses for reading the inputs indexed by u and l in the next stage, stage (p+1).
Thus, the destination DM bank index and its associated addresses for writing a butterfly output
pair can be generated by an inverse mapping from (2.3) with output indices u and l
and stage index (p+1). This inverse mapping is given by
$B_{p+1} = a_p b_r b_{r-1} \ldots b_1$ (2.4)
$A_{p+1} = a_{n-r-2} a_{n-r-3} \ldots a_{p+1} b_0 a_{p-1} \ldots a_0$ (2.5)
Recall that each pair of inputs is read from the same address in a pair of
DM banks whose indices are the same except for $b_0$. Since (2.4) does not depend on $b_0$,
both outputs of a butterfly are destined for the same DM bank. In other words, each butterfly
unit produces an output pair that must be stored in one DM bank every cycle. Since 2 read
operations and 2 write operations per butterfly unit must be performed simultaneously
for maximally pipelined operation, and each BRAM in the FPGA has only two ports, the two
ports of each bank must be used as a read port and a write port, respectively. Because only a
single port is available for the write, one of the outputs is delayed by an inserted register so
that the output pair can be written through that one port, as in Figure 2.2. Also, by virtue of
(2.4), each butterfly unit is connected to its pair of destination DM banks by a simple 1–by–2
demux selected by $a_p$, rather than by the worst-case 1–by–2k demux, in order to store butterfly
output pairs in the proper DM bank. To alternate the destination DM bank every cycle, the
address $A_p$ can be generated efficiently by
$A_p = \mathrm{RL}(Counter, p)$ (2.6)
Here, the value of Counter is increased by one every clock cycle, so that bit $a_p$
in $A_p$ is flipped on each clock cycle. This provides a resource-efficient mechanism for
generating $a_p$ and, via (2.4), the required sequence of $B_{p+1}$ selections.
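The write-back mapping can be checked with a short Python sketch. It uses the closed forms of (2.3), (2.4), and (2.5) directly as bit operations and assumes a stage index p for which bit $a_p$ exists in the address (p ≤ n−r−2). The function names `read_index`, `write_target`, and `address` are hypothetical.

```python
def read_index(a, b, p, r):
    """Closed form of (2.3): a_{n-r-2}..a_p | b_r..b_0 | a_{p-1}..a_0."""
    low = a & ((1 << p) - 1)
    high = a >> p
    return (high << (p + r + 1)) | (b << p) | low


def write_target(a, b, p, r):
    """Destination (address, bank) for the output whose input sat at (a, b)."""
    a_p = (a >> p) & 1
    b_0 = b & 1
    next_bank = (a_p << r) | (b >> 1)           # (2.4): a_p b_r ... b_1
    next_addr = (a & ~(1 << p)) | (b_0 << p)    # (2.5): bit p replaced by b_0
    return next_addr, next_bank


def address(counter, p, addr_width):
    """(2.6): A_p = RL(Counter, p) on an addr_width-bit counter, 0 <= p < addr_width."""
    mask = (1 << addr_width) - 1
    return ((counter << p) | (counter >> (addr_width - p))) & mask


# Bit a_p of the generated address toggles every cycle, so the 1-by-2 demux
# select alternates between the two destination banks of a butterfly unit.
p, addr_width = 2, 4
print([(address(c, p, addr_width) >> p) & 1 for c in range(6)])  # [0, 1, 0, 1, 0, 1]
```

Writing an output to `write_target(a, b, p, r)` places it exactly where `read_index` expects to find the same index in stage p+1, which is what makes the scheme in-place.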
2.3.2.3 Conflict–free property in read/write
To verify that the proposed addressing scheme works correctly, we must ensure
that butterfly outputs are written back only to addresses whose data has already
been read, so that the read–after–write hazard is avoided. From all DM bank pairs, the pairs
of data at $A^p_0(a_p = 0)$ are read out when Counter is even, as in (2.6). In the next cycle
(odd Counter), the data from $A^p_0(a_p = 0)$ become the inputs of the butterfly units, while the
pairs of data at $A^p_0(a_p = 1)$ are read out in the pipelined FFT implementation of Figure 2.2.
This means that there is a pair of free slots, $A^p_0(a_p = 0)$ and $A^p_0(a_p = 1)$, in every DM
bank after the butterfly computation, and these coincide with the pair of output addresses
in (2.5), whatever the value of $b_0$ is. Therefore, our proposed
method makes it possible to read and write data simultaneously, in a conflict–free manner,
without additional DM banks for output pairs. In this way, inner loop unrolling allows a
BFC to achieve k times the throughput by running k butterfly units simultaneously.
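The argument can be exercised with a small timing model in Python. The timing is an assumption made for illustration, not a figure from the source: the pair read in cycle t has its first output written in cycle t+1 and its register-delayed second output in cycle t+2. The function name `radix2_stage_is_conflict_free` is hypothetical.

```python
def radix2_stage_is_conflict_free(p, addr_width, r):
    """Check one FFT stage for overwrites and write-port clashes.

    Returns True if every location is written only after it has been read
    and no DM bank receives two writes in the same cycle.
    """
    if not 0 <= p < addr_width:
        raise ValueError("stage index must select an existing address bit")
    mask = (1 << addr_width) - 1
    read_cycle = {}
    writes = []                                   # (cycle, bank, address)
    for t in range(1 << addr_width):
        addr = ((t << p) | (t >> (addr_width - p))) & mask    # (2.6)
        a_p = (addr >> p) & 1
        for bank in range(1 << (r + 1)):
            read_cycle[(bank, addr)] = t          # all banks read the same address
        for unit in range(1 << r):
            dest = (a_p << r) | unit                          # (2.4)
            writes.append((t + 1, dest, addr & ~(1 << p)))    # (2.5) with b0 = 0
            writes.append((t + 2, dest, addr | (1 << p)))     # (2.5) with b0 = 1
    busy = set()
    for cycle, bank, addr in writes:
        if (cycle, bank) in busy:                 # single write port per bank
            return False
        busy.add((cycle, bank))
        if read_cycle[(bank, addr)] >= cycle:     # old data not yet consumed
            return False
    return True


print(all(radix2_stage_is_conflict_free(p, 4, 1) for p in range(4)))  # True
```

Under these assumed latencies, each bank receives exactly one write per cycle and every write lands on a slot that was read at least one cycle earlier.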
2.3.3 Inner Loop Unrolling in radix–4 FFT
The dragonfly unit in the radix–4 FFT algorithm is equivalent to 4 butterfly units in
the radix–2 FFT algorithm, while it requires only 75% of the multiplications and the same
number of additions as 4 butterflies [28]. Hence, it has an advantage over the radix–2 FFT in
terms of hardware resources. However, the size of the FFT must be a power of 4 in the radix–4
FFT, so the set of supported FFT sizes is coarser than the powers of 2 supported in the
radix–2 case. For the inputs of the dragonfly unit in the pth stage of the radix–4 FFT, the four
input indices are identical in their binary patterns, except for the (2p)th and (2p+1)th
bits. Similar to the radix–2 FFT implementation, a group of four DM banks belongs to each
dragonfly unit, and the last two bits, $b_1$ and $b_0$, in the binary pattern of the DM bank index, $B_p$,
specify the four different inputs for a dragonfly computation. As the base of the logarithm
in the radix–4 FFT is changed to 4, the total number of FFT stages is $\log_4 N$, and the input
indices in the pth stage can be derived as
$x_i = \mathrm{RL}(\mathrm{CONCAT}(\mathrm{RR}(A_p, 2p), B_p), 2p)$ (2.7)
$\;\;\;\, = a_{n-r-3} a_{n-r-4} \ldots a_{2p} \, b_{r+1} b_r \ldots b_0 \, a_{2p-1} \ldots a_0$
where $A_p$, $B_p$, and r are the address, the DM bank index, and $\log_2 k$, respectively, and k is
the inner loop unrolling factor.
2.3.3.1 Address for the read
The group of four DM banks whose indices are the same except for the last two
bits belongs to one dragonfly unit in the radix–4 FFT implementation. From these DM
banks, a group of four inputs for each dragonfly unit is read out from the address $A_p$.
In the pth FFT stage, (2.7) shows that the indices of the input group are identical except for
the (2p+1)th and (2p)th bits, which come from the last two bits of the DM bank index,
so the four values read form the correct dragonfly input group.
2.3.3.2 Address for the write
A destination DM bank index and its associated four addresses, corresponding to the four
outputs of a dragonfly operation, can be derived by an inverse mapping from (2.7) with
a stage index of (p+1):
$B_{p+1} = a_{2p+1} a_{2p} b_{r+1} b_r \ldots b_2$ (2.8)
$A^{p+1}_0 = a_{n-r-3} \ldots a_{2(p+1)} (b_1 = 0)(b_0 = 0) a_{2p-1} \ldots a_0$
$A^{p+1}_1 = a_{n-r-3} \ldots a_{2(p+1)} (b_1 = 0)(b_0 = 1) a_{2p-1} \ldots a_0$
$A^{p+1}_2 = a_{n-r-3} \ldots a_{2(p+1)} (b_1 = 1)(b_0 = 0) a_{2p-1} \ldots a_0$
$A^{p+1}_3 = a_{n-r-3} \ldots a_{2(p+1)} (b_1 = 1)(b_0 = 1) a_{2p-1} \ldots a_0$
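A minimal Python sketch of the radix–4 mappings, written with the closed forms of (2.7) and (2.8) as bit operations. It assumes a stage index p for which bits $a_{2p+1}$ and $a_{2p}$ exist in the address; the names `read_index4` and `write_target4` are hypothetical.

```python
def read_index4(a, b, p, r):
    """Closed form of (2.7): a_{n-r-3}..a_{2p} | b_{r+1}..b_0 | a_{2p-1}..a_0."""
    shift = 2 * p
    low = a & ((1 << shift) - 1)
    high = a >> shift
    return (high << (shift + r + 2)) | (b << shift) | low


def write_target4(a, b, p, r):
    """Destination (address, bank) from (2.8) for the input held at (a, b)."""
    shift = 2 * p
    select = (a >> shift) & 0b11                  # a_{2p+1} a_{2p}
    next_bank = (select << r) | (b >> 2)          # a_{2p+1} a_{2p} b_{r+1} ... b_2
    next_addr = (a & ~(0b11 << shift)) | ((b & 0b11) << shift)
    return next_addr, next_bank


# N = 256 (n = 8), one dragonfly unit (r = 0): 4 DM banks with 6-bit addresses.
a, p, r = 0b101101, 1, 0
group = [read_index4(a, b, p, r) for b in range(4)]
# The four inputs of one dragonfly differ only in bits 2p and 2p+1.
print([format(x ^ group[0], "08b") for x in group])
# ['00000000', '00000100', '00001000', '00001100']
```

All four outputs of a dragonfly share the same destination bank, since `next_bank` depends only on the address bits and on the upper bits of the bank index.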
Four dragonfly outputs are written back to one DM bank, since the indices of the DM banks
in a group are the same except for the last two bits. Because only one port is available for
the write due to the BRAM port limit, a parallel-in/serial-out shift register is required to
write an output group through that port, as shown in Figure 2.3. From (2.8), the selector of the
1–by–4 demux, which routes the dragonfly output group to the destination DM bank, is
determined by $a_{2p+1}$ and $a_{2p}$, two bits of the address $A_p$. As it takes four cycles to pass
the loaded outputs to the DM bank port, the demux selector should repeat with a period of
four cycles. To meet this requirement, the address $A_p$ is generated as
$A_p = \mathrm{RL}(Counter, 2p)$ (2.9)
where Counter is increased by one every clock cycle. Similar to the radix–2 FFT
implementation, the group of DM banks belonging to the dragonfly unit is pre-determined
Figure 2.3: DM bank selection logic and parallel-in/serial-out shift register.
2.3.3.3 Conflict–free property in read/write
After the dragonfly operation, the output group is written back to the memory banks
from which it was read for the dragonfly operation. With the proposed address scheme,
the read-after-write (RAW) hazard can be avoided in the pipelined radix–4 FFT
implementation, as in the radix–2 case. In a period of four cycles, the demux delivers a group
of four dragonfly outputs to one destination DM bank of the group in each cycle. The
output addresses (2.8) for all four DM banks in the period are $A^p_0(a_{2p+1} = 0, a_{2p} = 0)$,
$A^p_0(a_{2p+1} = 0, a_{2p} = 1)$, $A^p_0(a_{2p+1} = 1, a_{2p} = 0)$, and $A^p_0(a_{2p+1} = 1, a_{2p} = 1)$. As the
demux passes outputs first to the shift register of DM bank0 in Figure 2.3 in every period,
and the outputs are written to the same four addresses in the identical order in all four DM
banks, the conflict–free property between the read and the write in DM bank0 guarantees that
there is no RAW hazard in the remaining DM banks. Input addresses are generated
by (2.9), while output addresses are generated from the input address and the last two bits of
the DM bank index (2.8). Because of the shift register after the dragonfly unit in Figure 2.3,
reading the input from the address $A_p$ in any DM bank always occurs before writing the
output to $A_p$ in DM bank0. Table 2.1 shows the times of the read and write operations for
each address. With this proposed address scheme, multiple dragonfly operations inside the
BFC can be executed simultaneously without introducing additional DM banks.
Table 2.1: Time when each address in DM bank0 is accessed for read/write
Address in DM bank0                    Time for READ    Time for WRITE
$A^p_0(a_{2p+1} = 0, a_{2p} = 0)$      $t_0$            $t_0 + 2$
$A^p_0(a_{2p+1} = 0, a_{2p} = 1)$      $t_0 + 1$        $t_0 + 3$
$A^p_0(a_{2p+1} = 1, a_{2p} = 0)$      $t_0 + 2$        $t_0 + 4$
$A^p_0(a_{2p+1} = 1, a_{2p} = 1)$      $t_0 + 3$        $t_0 + 5$
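The schedule in Table 2.1 can be reproduced with a small Python model of one radix–4 stage. The two-cycle delay between reading a group and the first serial write is taken from the table; applying the same delay to the other banks of the group is an assumption of this sketch, and the names `radix4_schedule` and `hazard_free` are hypothetical.

```python
def radix4_schedule(p, addr_width, r=0, latency=2):
    """Return ({(bank, addr): read_cycle}, [(write_cycle, bank, addr), ...])."""
    mask = (1 << addr_width) - 1
    shift = 2 * p
    reads, writes = {}, []
    for t in range(1 << addr_width):
        addr = ((t << shift) | (t >> (addr_width - shift))) & mask   # (2.9)
        select = (addr >> shift) & 0b11           # demux select a_{2p+1} a_{2p}
        for bank in range(1 << (r + 2)):
            reads[(bank, addr)] = t               # all banks read the same address
        base = addr & ~(0b11 << shift)
        for unit in range(1 << r):
            bank = (select << r) | unit           # destination bank from (2.8)
            for i in range(4):                    # serial write, one output per cycle
                writes.append((t + latency + i, bank, base | (i << shift)))
    return reads, writes


def hazard_free(reads, writes):
    """True if each write follows the read of its slot and ports never clash."""
    ports = set()
    for cycle, bank, addr in writes:
        if (cycle, bank) in ports or reads[(bank, addr)] >= cycle:
            return False
        ports.add((cycle, bank))
    return True


reads, writes = radix4_schedule(p=0, addr_width=4)
for addr in range(4):                             # first period of DM bank0, t0 = 0
    write = min(c for c, bank, a in writes if bank == 0 and a == addr)
    print(addr, reads[(0, addr)], write)          # rows: 0 0 2, 1 1 3, 2 2 4, 3 3 5
print(hazard_free(reads, writes))                 # True
```

With t0 = 0 the printed rows match the READ and WRITE columns of Table 2.1, and the same check passes for the other banks of the group under the assumed delay.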
2.4 COST/PERFORMANCE ANALYSIS
The two orthogonal unrolling techniques developed in the previous section exhibit
different profiles of FPGA resource consumption. While outer loop unrolling pipelines
multiple BFCs, inner loop unrolling executes multiple butterfly/dragonfly units in parallel
inside a BFC. Since the inner loop unrolling technique involves more localized control (i.e.,
control over a single BFC), it generally consumes fewer FPGA logic resources compared
with the more extensive control structures needed for outer loop unrolling. However,
inner loop unrolling is less flexible in terms of the set of possible unrolling factors: to
preserve the applicability of our streamlined approach for inner loop memory
management, the inner loop unrolling factor must be a power of two. This requirement
makes the range of achievable speedups for the inner loop unrolling technique to be lim-
Figure 2.5: Resource utilization in Radix–4 FFT implementation with 4096 samples.
integer factors. Thus, for example, if the degree of speedup required to achieve the given
throughput constraint is not a power of two, then a combination of inner–loop and outer–
loop unrolling may lead to the most cost–effective solution.
Figure 2.4 shows FPGA slice and BRAM utilization as functions of the unrolling
factor for both inner and outer loop unrolling. These results are obtained after synthesis,
and include the streamlining effects of our proposed schemes for address generation and
memory management. For both kinds of unrolling, BRAM and FPGA slice utilization
increase linearly with the degree of speedup achieved (unrolling factor). Also, from
Figures 2.4 and 2.5, we see that inner loop unrolling is more area-efficient than outer
loop unrolling for the same throughput increase. However, recall that inner loop unrolling
is restricted to factors that are powers of 2. As the FFT length increases, BRAMs are used
more fully across a wider range of inner loop unrolling factors. For use
in analytical design space exploration, the following cost functions can be derived from
these synthesis results:
$u_{\mathrm{inner}} = s_{\mathrm{inner}} \cdot u_{\mathrm{initial}} (k_{\mathrm{inner}} - 1) + u_{\mathrm{initial}}$ (2.10)
$u_{\mathrm{outer}} = s_{\mathrm{outer}} \cdot u_{\mathrm{initial}} (k_{\mathrm{outer}} - 1) + u_{\mathrm{initial}}$ (2.11)
Here, $u_{\mathrm{inner}}$ and $u_{\mathrm{outer}}$ are the amounts of utilization (FPGA slice or BRAM
utilization) after inner and outer loop unrolling, respectively; $u_{\mathrm{initial}}$ represents the amount of
resource utilization without any unrolling; $k_{\mathrm{inner}}$ and $k_{\mathrm{outer}}$ are the inner and outer loop
unrolling factors, respectively; and $s_{\mathrm{inner}}$ ($s_{\mathrm{outer}}$) is a constant factor that represents the slope
of the linear plots for inner (outer) loop configurations in Figure 2.4 and Figure 2.5.
The cost functions (2.10) and (2.11) are for inner and outer loop unrolling in
isolation. If both forms of unrolling are applied in combination, then the total hardware
resource requirements can be expressed as
$u_{\mathrm{combined}} = s_{\mathrm{outer}} \cdot u_{\mathrm{inner}} (k_{\mathrm{outer}} - 1) + u_{\mathrm{inner}}$ (2.12)
$k_{\mathrm{combined}} = k_{\mathrm{inner}} \cdot k_{\mathrm{outer}}$ (2.13)
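As a short worked step, substituting (2.10) into (2.12) gives a product form in which each unrolling factor contributes its own multiplier. The derivation below uses only the symbols defined for (2.10) through (2.13).

```latex
\begin{align*}
u_{\mathrm{inner}} &= u_{\mathrm{initial}}\,\bigl(1 + s_{\mathrm{inner}}(k_{\mathrm{inner}} - 1)\bigr), \\
u_{\mathrm{combined}} &= u_{\mathrm{inner}}\,\bigl(1 + s_{\mathrm{outer}}(k_{\mathrm{outer}} - 1)\bigr) \\
&= u_{\mathrm{initial}}\,\bigl(1 + s_{\mathrm{inner}}(k_{\mathrm{inner}} - 1)\bigr)\,\bigl(1 + s_{\mathrm{outer}}(k_{\mathrm{outer}} - 1)\bigr).
\end{align*}
```

With positive slopes, the combined cost therefore grows monotonically in both $k_{\mathrm{inner}}$ and $k_{\mathrm{outer}}$.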
Given a throughput constraint, (2.12) and (2.13) can be used to efficiently search
the space of feasible designs (i.e., designs with satisfactory throughput) for a cost-optimal
solution. In particular, candidate pairs $(k_{\mathrm{outer}}, k_{\mathrm{inner}})$ that satisfy the throughput constraint
(based on (2.13)) can be evaluated to select the one that minimizes cost (based on (2.12)).
This evaluation can be pruned by noting that whenever a particular pair $(k'_{\mathrm{outer}}, k'_{\mathrm{inner}})$
is found to satisfy the throughput constraint, we need not consider any additional pairs
$(k''_{\mathrm{outer}}, k''_{\mathrm{inner}})$ such that $k''_{\mathrm{inner}} \ge k'_{\mathrm{inner}}$ and $k''_{\mathrm{outer}} \ge k'_{\mathrm{outer}}$ are both satisfied.
This approach allows for very rapid, pre–synthesis determination of cost–effective
architectures for given throughput constraints.
Table 2.2: Comparison of synthesis reports for radix–2 and radix–4 at the same
performance level
Algorithm                    FPGA slices    BRAM    Multipliers
Radix–2 FFT with (1,4)       2770           16      16
Radix–4 FFT without (1,1)    2471           14      12
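The pruned search can be sketched in Python as follows. The slope values, the initial utilization, and the search bounds are hypothetical placeholders (they are not the synthesis results of this chapter), and the function names are illustrative.

```python
def combined_utilization(u_initial, s_inner, s_outer, k_inner, k_outer):
    """Resource estimate from (2.10) followed by (2.12)."""
    u_inner = s_inner * u_initial * (k_inner - 1) + u_initial
    return s_outer * u_inner * (k_outer - 1) + u_inner


def cheapest_pair(required_speedup, u_initial, s_inner, s_outer,
                  max_outer=8, max_inner=8):
    """Return (k_outer, k_inner, cost) minimizing (2.12) subject to (2.13)."""
    inner_values = []
    k = 1
    while k <= max_inner:                 # inner factors are powers of two
        inner_values.append(k)
        k *= 2
    feasible = []
    best = None
    for k_outer in range(1, max_outer + 1):
        for k_inner in inner_values:
            # Pruning rule: skip pairs dominated by a known feasible pair.
            if any(k_outer >= fo and k_inner >= fi for fo, fi in feasible):
                continue
            if k_outer * k_inner < required_speedup:      # (2.13)
                continue
            feasible.append((k_outer, k_inner))
            cost = combined_utilization(u_initial, s_inner, s_outer,
                                        k_inner, k_outer)
            if best is None or cost < best[2]:
                best = (k_outer, k_inner, cost)
    return best


# Hypothetical slopes and initial utilization, chosen only for illustration.
print(cheapest_pair(6, u_initial=1000, s_inner=0.5, s_outer=0.75))
# (3, 2, 3750.0)
```

The pruning step relies on the cost being non-decreasing in both factors, which holds whenever both slopes are positive.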
2.5 EXPERIMENTAL RESULTS
We have targeted the Xilinx Virtex II Pro P30 embedded in the National Instruments
PCI-5640R to synthesize implementations derived by our architecture generation
techniques for the FFT. For a fair comparison with the Xilinx library, the HDL code for the FFT
generated from LogiCore is encapsulated by a wrapper from LabVIEW FPGA before
synthesis. The specific form of FFT implemented in these results is a radix–2/4 FFT with
4096 samples, with each sample represented as a fixed–point data type with a 16-bit word
length.
Table 2.2 compares the synthesis reports of the radix–4 FFT and the radix–2
FFT under the same performance level. For all kinds of FPGA resource, radix–4 FFT
Figure 2.7: Resource utilization in the streaming Radix–4 FFT with 4096 samples. [Plot axis label: unrolling factor pair $(k_{\mathrm{outer}}, k_{\mathrm{inner}})$; legend: OLUT/ILUT combination, Xilinx IP streaming mode.]
preferable IP in a power of 4 FFT size.
Figure 2.6 and Figure 2.7 compare the synthesis reports of the proposed method and the
Xilinx IP core at the streaming performance level. In the radix–2 FFT case, we take a target
speedup of 6 because the throughput of a sequential implementation (no unrolling) on this
device is 6 cycles per sample, and 6 is the lowest integer speedup needed to achieve the
common “streaming FFT” target of 1 cycle per sample. Using the high level exploration
approach developed in Section 2.4, and the device–specific slopes and initial utilizations
from the curves in Figure 2.4, we can calculate analytically that when $(k_{\mathrm{outer}}, k_{\mathrm{inner}})$ is
equal to (3,2) and (2,4), respectively, the generated FFT core is optimized in terms of
FPGA slice usage and BRAM utilization.
These results agree with the optimal values observed from the two curves from actual
synthesis results in Figure 2.6, thereby demonstrating the accuracy of our high level
exploration method. To compare our approach with a relevant commercially–available FFT
core, we evaluated the FFT core that is available from the Xilinx LogiCore library under
the two different throughput levels that are available for it: streaming throughput and
sequential (resource–optimized) throughput. For streaming FFT performance (one cycle
per sample throughput), our approach required 25% fewer FPGA slices compared to the
Xilinx core, but 190% more BRAMs. In the radix–4 case, the target speedup for achieving
streaming performance is 2, because the throughput of the sequential FFT is 1.5 cycles
per sample. Here, inner loop unrolling is always more efficient than outer loop unrolling,
and the best unrolling factor pair is (1,2). This implementation required 10% fewer FPGA
slices but 150% more BRAMs, compared to the Xilinx IP.
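The target speedups quoted above follow from a ceiling computation, and the unrolling pairs worth evaluating follow from the power-of-two restriction on the inner factor. A minimal Python sketch, with hypothetical function names:

```python
import math


def required_speedup(cycles_per_sample, target_cycles_per_sample=1.0):
    """Lowest integer speedup that reaches the target cycles per sample."""
    return math.ceil(cycles_per_sample / target_cycles_per_sample)


def candidate_pairs(speedup):
    """Non-dominated (k_outer, k_inner) pairs with k_outer * k_inner >= speedup.

    k_inner is restricted to powers of two; for each k_inner only the
    smallest sufficient k_outer is kept.
    """
    pairs = []
    k_inner = 1
    while True:
        k_outer = math.ceil(speedup / k_inner)
        pairs.append((k_outer, k_inner))
        if k_outer == 1:
            break
        k_inner *= 2
    return pairs


print(required_speedup(6))     # 6  (radix-2: 6 cycles per sample)
print(required_speedup(1.5))   # 2  (radix-4: 1.5 cycles per sample)
print(candidate_pairs(6))      # [(6, 1), (3, 2), (2, 4), (1, 8)]
print(candidate_pairs(2))      # [(2, 1), (1, 2)]
```

The pairs (3,2), (2,4), and (1,2) reported in the text appear among these candidates; which one is cheapest depends on the device-specific slopes.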
For the sequential performance level (burst mode), our approach required 14% fewer
slices and 10% more BRAMs in the radix–2 FFT implementation, and 10% fewer slices and 22%
more BRAMs in the radix–4 FFT, compared to the Xilinx IP core, as shown in Figures 2.4 and 2.5.
Note that this comparison (“sequential performance”) does not include any unrolling and
is therefore essentially a comparison with Ma’s FFT configuration, which is the special
case of our approach that results when no unrolling is carried out. In addition to the burst FFT
implementation, our method provides FFT implementations covering a wider range
of performance by means of the two orthogonal unrolling techniques. As the consumed resources
in the FFT implementation increase more slowly than the achieved speed-up, the result can serve
as a ready-to-use and cost-efficient FFT core that meets a user-specified target throughput.
While the heavy streaming FFT implementation is the only option for a commercial IP when the
target throughput exceeds the sequential performance, the proposed method provides
a well-tailored FFT library fit to the performance requirement.
Chapter 3
Resource-efficient Acceleration of 2D-FFT on FPGAs
3.1 Introduction
Fourier image analysis plays a key role in many image processing applications by
making it possible to replace convolution operations in the spatial domain with simpler
multiplication operations in the frequency domain, and by enabling FFT convolution and various
deconvolution techniques [29]. In spite of its wide use, FFT computation often becomes
a major application bottleneck due to its high computational complexity. Thus, improv-
ing the throughput of 2D-FFT computation is useful to enhance overall system perfor-
mance of the target application. Field-programmable gate arrays (FPGAs) are attractive
for acceleration of FFT computations since FPGAs allow for configuration of customized
digital logic structures that exploit the parallelism and regularity of FFT computations.
However, achieving the full potential of 2D-FFT throughput acceleration under FPGA
resource constraints is challenging since parallelism, interconnection complexity, FPGA
logic gate utilization, and memory utilization must be carefully traded off at the actor
level of the design methodology illustrated in Fig. 1.1.
The 2D-FFT is typically implemented as repeated invocations of 1D-FFT computations.
Therefore, techniques for efficient FPGA-based 2D-FFT computations can be
derived by considering two key design issues: improving the throughput of 1D-FFT
computation with efficient FPGA resource consumption, and carefully utilizing the limited
bandwidth of data transfer between the targeted FPGA device and external memory.
Since 2D-FFT computation consists of 2N 1D-FFT computations, the throughput of 1D-FFT
computation directly influences that of the enclosing 2D-FFT. In [26], we introduced
an inner loop unrolling technique (ILUT) with an associated memory addressing scheme
to achieve resource-efficient throughput improvement of the 1D-FFT. This technique can
be parameterized by the required throughput to generate an FFT IP (intellectual
property) subsystem such that the resource consumption is streamlined based on the targeted
performance, which avoids over-designed hardware.
A 2D-FFT for an N-by-N image can be computed by performing N row-wise 1D-FFTs
followed by N column-wise 1D-FFTs. Such an approach requires us to store $N^2$
intermediate data values between the row-wise and column-wise phases of computation.
Due to the limited storage space within FPGA devices, external memory is often used to
store such high-volume sets of intermediate data. When external memory is employed in
this way, it is essential to carefully utilize the available bandwidth between the FPGA and
associated external memory.
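The row-column decomposition can be illustrated with a small pure-Python sketch. A direct 1D DFT stands in for the 1D-FFT core here (an assumption made to keep the example short), and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct 1D DFT, used here as a stand-in for a 1D-FFT core."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def fft2_row_column(image):
    """2D transform of an N-by-N image: N row-wise then N column-wise 1D passes."""
    n = len(image)
    # Phase 1: N row-wise transforms produce N*N intermediate values.
    rows = [dft(row) for row in image]
    # Phase 2: N column-wise transforms over the intermediate data.
    cols = [dft([rows[i][j] for i in range(n)]) for j in range(n)]
    # cols[j][i] holds output element (i, j); transpose back to row-major order.
    return [[cols[j][i] for j in range(n)] for i in range(n)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
spectrum = fft2_row_column(image)
print(round(spectrum[0][0].real))   # 136, the sum of all pixels
```

The `rows` list is the intermediate data set that, for large images, would be held in external memory between the two phases.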
This chapter presents the efficient application to 2D-FFT implementation of our
previously-developed ILUT technique [26], which is a systematic approach for generating
1D-FFT IP cores that are customized based on user-specified cost/performance trade-
offs, as described above. We show that by carefully building on our ILUT-based 1D-
FFT architecture to implement FPGA-based 2D-FFTs, we achieve significantly better
cost/performance efficiency compared to previous techniques for implementation of 2D-
FFTs on FPGAs. Here, by cost/performance efficiency we mean specifically the ratio of
consumed FPGA resources to the achieved throughput.
In our ILUT-based approach to 2D-FFT implementation, only a single pair of I/O
ports is needed, regardless of the inner loop unrolling factor, in the underlying 1D-FFT
IP core to transfer data with external memory. This provides significant improvements
in interconnect complexity and I/O scheduling overhead compared to related work on
2D-FFT implementation.
We prototyped our 2D-FFT implementation techniques in National Instruments
LabVIEW (LV) FPGA 8.6 — a graphical, dataflow-based programming environment
for embedded system design. LabVIEW includes a feature called Component-Level IP
(CLIP), which allows designers to create wrappers around existing FPGA IP cores so
that they can be used as components within LV FPGA. Designers can also write code
for custom-designed subsystems in a hardware description language (HDL) and integrate
this HDL code into LV FPGA using CLIP. In the experiments that we report on in this
chapter, we have used CLIP to interface platform-specific IP for sending and receiving
data between the targeted FPGA device and external memory.
For our experiments, we specified our optimized FFT architecture in the LV FPGA
design environment, and implemented the architecture on the targeted FPGA by first in-
voking the LV FPGA HDL synthesis tool, and then mapping the resulting HDL code
using the platform-specific tools of the targeted FPGA. The target FPGA that we used
was the Xilinx Virtex-5. More specifically, our experimental platform was the National
Instruments FlexRio board, which includes a Xilinx Virtex-5 device that is integrated with
128 MB of external memory (DRAM). Only the details in our implementation that
pertain to synthesis and memory interfacing are related to the FlexRio board; the core FFT
architecture that we present can be retargeted to other kinds of FPGA platforms.
The organization of the chapter is as follows: In Section 3.2, we review background
on the 1D and 2D-FFT algorithms, and describe challenges in implementing these com-
putations efficiently. Subsequently, we present details of our ILUT-based, 2D-FFT archi-
tecture in Section 3.3. In Section 3.4, we show how our proposed 2D-FFT architecture
provides significantly improved trade-offs between throughput and FPGA resource con-
sumption. Section 3.5 demonstrates experimental results from our proposed architecture
and comparisons with previous approaches. Section 3.6 provides a summary of the chap-
ter and concluding remarks.
3.2 Background





$X_k = \sum_{i=0}^{N-1} x_i \cdot W_N^{ik}$, (3.1)
where
$W_N^{ik} = \exp(-j 2\pi i k / N) \quad \forall k = 0, 1, \cdots, N-1$, and j denotes the imaginary unit.
As shown in Equation 3.1, a direct computation of the DFT suffers from $O(N^2)$
complexity. After Cooley and Tukey [9] proposed the FFT algorithm to decrease the
computational complexity of the DFT to $O(N \log N)$, a large body of research has
focused on realizing the proposed 1D-FFT algorithm on various kinds of hardware
platforms, including general purpose processors, programmable digital signal processors, and
FPGAs. Ma [23] proposed an effective memory addressing scheme for a single FFT core
to promote reuse of memory locations, and thereby reduce overall memory requirements.
Takala et al. [25] proposed a stride permutation for FFT computation, and Nordin et
al. [24] developed a parameterized FFT soft core generator with a scalable stride
permutation.
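To make the complexity contrast concrete, the sketch below places a direct DFT next to a textbook recursive radix–2 decimation-in-time FFT. This is a generic illustration of the Cooley-Tukey idea, not the hardware architecture of this chapter; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT: N output terms, each a sum over N inputs, so O(N^2) work."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 FFT for power-of-two lengths, O(N log N) work."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


samples = [1, 2, 3, 4, 0, 0, 0, 0]
error = max(abs(a - b) for a, b in zip(dft(samples), fft_radix2(samples)))
print(error < 1e-9)   # True
```

Each level of the recursion performs N/2 butterflies, and there are log2 N levels, which is where the O(N log N) count comes from.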
While many approaches have been developed to implement the 1D-FFT, research on
design and implementation for 2D-FFT computations has centered around the approach of
deploying multiple 1D-FFT cores, where each 1D-FFT core embeds a single processing
unit — the butterfly unit for the radix-2 FFT or the dragonfly unit for the radix-4 FFT.
Jung et al. [30] developed a design methodology for exploring area/performance
trade-offs in hardware implementation, and demonstrated this methodology using a 2D
discrete cosine transform (DCT) benchmark. In this approach to DCT implementation,
larger numbers of 1D-DCT blocks are deployed to achieve increasing levels of speed-up
with corresponding increases in hardware resource consumption. The instantiated 1D-
DCT blocks communicate with one another through a shared memory, which is imple-
mented by an array of registers.
For a small input image, implementing the memory space for the image with an
array of registers can be a reasonable design option. Such a design avoids limitations
due to limited numbers of I/O channels and limited bandwidth between the FPGA de-
vice and external memory. However, since using arrays of registers is costly in terms
of FPGA resources, the approach of Jung et al. can be expected to result in very large
FPGA resource requirements for large input images. To avoid such dramatic increases in
FPGA resource requirements, image storage is generally implemented in external mem-
ory, which has limited numbers of ports (typically dual ports) and limited bandwidth.
Figure 3.1: Functional block diagram of 2D-FFT computation.
tention to memory interfacing in the design of the FPGA architecture. The approach
developed in this chapter examines 2D-FFT acceleration from such a viewpoint of efficient
integration of FPGA-based acceleration and external-memory-based image storage.
Uzun et al. [17] proposed a high-level framework covering 1D- and 2D-FFT
implementations for real-time applications. In this framework, the parallelism in 2D-FFT
computation is realized by allocating multiple 1D-FFT processors with a shared external
memory. Since the input and output data vectors associated with each 1D-FFT processor
are transferred to and from the shared external memory, conflicts arise from multiple
requests to read and write the shared memory. Resolving these conflicts requires a
relatively complex interconnection network, and also a complex control unit for
scheduling data transfers between the 1D-FFT cores and the shared memory.
2D-FFT computation on an NxN image can be executed by a combination of N row-wise
and N column-wise 1D-FFTs, as shown in Figure 3.1. Typically, 2D-FFT computation is
performed on large images, which require external memory for their storage. Thus, the
performance of the 2D-FFT is limited by the bandwidth of external memory, and the
FFT computation must be designed carefully to achieve parallelism in conjunction with
efficient communication with memory.
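The row-column decomposition described above can be sketched in software as follows.
This is an illustrative model only, not the FPGA architecture: the function names
`dft_1d` and `fft_2d_row_column` are hypothetical, and a naive O(N^2) DFT stands in
for the 1D-FFT core.

```python
import cmath

def dft_1d(x):
    """Naive N-point DFT; a stand-in for a hardware 1D-FFT core."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def fft_2d_row_column(image):
    """Transform an N x N image with N row-wise, then N column-wise, 1D transforms."""
    n = len(image)
    # Pass 1: one 1D transform per row.
    rows = [dft_1d(row) for row in image]
    # Pass 2: one 1D transform per column of the row-transformed data.
    cols = [dft_1d([rows[r][c] for r in range(n)]) for c in range(n)]
    # cols[c][r] holds output element (r, c); return it in row-major order.
    return [[cols[c][r] for c in range(n)] for r in range(n)]
```

In the model, the column pass reads data with a stride of one row, which is the access
pattern that makes memory interfacing important once the image resides in external
memory.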
Previous work has emphasized accelerating 2D-FFT computation by employing
multiple 1D-FFT cores. In our approach, we build on this general multiple-core approach,
and to make the approach more efficient, we incorporate our recently-developed methods
to realize data parallelism within each of the instantiated 1D-FFT cores [26]. We do this
by allocating multiple processing units to an individual 1D-FFT core, and incorporating
a novel memory addressing scheme.
A distinguishing aspect of this approach is that our realization of data parallelism
inside a single 1D-FFT core requires only a single pair of vector reading and writing re-
quests to the external memory, regardless of the speed-up factor. Our architecture there-
fore prevents conflicts among requests from multiple cores in 2D-FFT implementation,
and enables better utilization of memory bandwidth. Furthermore, by regularizing the
access patterns to external memory, our approach reduces controller complexity and im-
proves predictability.
3.3 2D-FFT Design
As described above, our approach for accelerating the 1D-FFT based on the inner loop
unrolling technique (ILUT), along with a formal development of the associated
addressing scheme, is presented in [26]. In this section, we summarize important
features of the ILUT-based approach that are relevant when applying it to 2D-FFT
implementation, and we present details of the 2D-FFT architecture that we have
developed by building on our ILUT-based 1D-FFT accelerator.
Henceforth, for conciseness, we refer to our ILUT approach simply as ILUT; that is, by
ILUT, we mean our specific approach for FFT inner loop unrolling, as developed in [26].
























Figure 3.2: Functional block diagram of ILUT-based, 1D-FFT implementation.
3.3.1 Inner Loop Unrolling Technique (ILUT)
A radix-2, N-point 1D-FFT computation involves logN FFT stages (with the logarithm
taken to base 2), where each FFT stage consists of N/2 butterfly computations. In ILUT,
we refer to each FFT stage as an inner loop that “rolls” the butterflies. Also, we roll
iterations across FFT stages through a conceptual outer loop. Intuitively, ILUT involves
unrolling a given FFT stage by running multiple butterfly operations in parallel.
Figure 3.2 shows an architectural block diagram of an FFT core after applying ILUT. We
parameterize the core with a configurable number B of butterfly units, and increase the
value of B to trade off increased area for improved throughput. Addresses for
input/output and for the butterfly units are controlled by the address generation unit
(AGU). The AGU in our design allows conflict-free, simultaneous read and write accesses
to the same dual-ported data memory bank. With this carefully designed addressing
scheme, the size of an individual data memory bank can be reduced by a factor of k when
unrolling the inner loop by k (i.e., when B = k). Thus, since k data memory banks are
required for an unrolling factor of k, the application of ILUT results in no net change
in the overall data memory requirement, regardless of the unrolling factor.
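To make the inner/outer loop terminology concrete, the following sketch models a
radix-2 FFT in which the butterflies of each stage are issued in groups of B. It is a
software illustration under our own assumptions, not the addressing scheme of [26]:
the function names are hypothetical, and the group of B butterflies that hardware
would execute simultaneously is simply executed back-to-back and counted as one step.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_inner_unrolled(x, b):
    """Radix-2 decimation-in-time FFT; returns (spectrum, number of parallel steps)."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n >= 2 and n == 1 << stages, "length must be a power of two"
    data = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    steps = 0
    for s in range(stages):                      # outer loop: log2(N) FFT stages
        half = 1 << s
        # The N/2 butterflies of one stage touch disjoint index pairs,
        # so any b of them can run in parallel.
        butterflies = [(g + j, g + j + half, j)
                       for g in range(0, n, 2 * half) for j in range(half)]
        for start in range(0, len(butterflies), b):   # inner loop, unrolled by b
            for top, bot, j in butterflies[start:start + b]:
                t = cmath.exp(-2j * cmath.pi * j / (2 * half)) * data[bot]
                data[top], data[bot] = data[top] + t, data[top] - t
            steps += 1                           # one step per group of b butterflies
    return data, steps
```

For N = 8, the step count is 12 with B = 1 (N/2 · logN), 6 with B = 2, and 3 with
B = 4, which reflects the factor-of-B reduction in computation time that ILUT targets.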
In contrast to ILUT, the outer loop unrolling technique (OLUT) allocates multiple
FFT cores to achieve parallelism in FFT implementation. OLUT-based approaches have
been explored extensively in previous research efforts, such as [17, 30]. Figure 3.3
illustrates a functional block diagram of OLUT-based FFT implementation. For an
unrolling factor of k, OLUT generally increases the required memory space by a factor
of k compared to a single-core implementation with no outer loop unrolling applied.
Furthermore, OLUT introduces k identical copies of the underlying AGU, so it also
increases the number of FPGA slices required.
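The difference in data memory scaling between the two techniques can be summarized
with a small calculation. This is a sketch under simplifying assumptions that are
ours: a single core is taken to need N data words, k is assumed to divide N, and the
function names are hypothetical.

```python
def ilut_data_memory_words(n, k):
    """ILUT: k banks, each 1/k of the single-core size, so the total is unchanged."""
    bank_words = n // k          # assumes k divides n
    return k * bank_words

def olut_data_memory_words(n, k):
    """OLUT: k independent cores, each with a full-size data memory."""
    return k * n

for k in (1, 2, 4, 8):
    print(k, ilut_data_memory_words(1024, k), olut_data_memory_words(1024, k))
```

The ILUT column stays at 1024 words for every k, while the OLUT column grows linearly
with k.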
3.3.2 2D-FFT Architecture
Figure 3.4 shows a functional block diagram of our proposed 2D-FFT architecture,
which we refer to as the IBTF (ILUT-Based Two-dimensional FFT) architecture. The
IBTF deploys a single 1D-FFT core with ILUT applied within the 1D core to achieve the
desired level of parallelism. The 1D-FFT core employed has a single input port and a
single output port, regardless of the degree of inner loop unrolling applied to the 1D core.
Each of these ports is connected to a dual-port memory, which we call the local memory
(LM). The LM is used to buffer data between the external memory and the ILUT-based
1D-FFT core. More specifically, the LM is used for sending and receiving vectors of
FFT outputs and inputs, respectively, through an external memory interface that operates
concurrently with the transform computation within the 1D-FFT core. The LM is divided
into two separate regions — the LMR provides a buffer for reading from external memory,
and similarly, the LMW provides a buffer for writing to external memory. Both the LMR
and LMW have the same size Sbuffer (in bytes).
The LM is implemented on the targeted FPGA device. In general, it can be implemented
in FPGA block RAM (BRAM) or in FPGA slices (distributed memory). For small to
moderate LM sizes, BRAM implementation has the disadvantage that the BRAMs used for
the LM are largely underutilized. In our experiments, we have used distributed
memory to implement the LM. Such an approach frees up the BRAMs to support other
applications or subsystems that co-exist with the IBTF core on the same FPGA device.
The control unit (CU) handles the scheduling of all requests for transferring data be-
tween the LM buffers and the external memory. Since external memory is volatile, the CU
must also take steps to ensure that the data stored in the external memory remains valid
throughout its required lifetime. Furthermore, to increase the efficiency of data transfers,
the CU accesses external memory through groups of sequential addresses, which are fur-
ther clustered together in terms of common types of accesses (read or write). This kind of
clustered, sequential access pattern is more efficient than more irregular types of patterns
(e.g., see [17]). For every iteration of the underlying 1D-FFT transformation, the CU
issues Sbuffer read requests followed by Sbuffer write requests.
In contrast to ILUT, OLUT-based approaches require k pairs of I/O ports in the
external memory interface, along with k 1D-FFT cores. In addition, the external memory
interface in the OLUT approach requires a complex interconnection network, including
a crossbar switch, to connect the k pairs of I/O ports and provide the required
external memory access from the set of parallel FFT cores. Furthermore, since the CU must
control multiple memory requests from multiple pairs of I/O ports, it needs to incorpo-



















Figure 3.3: Functional block diagram of 1D-FFT with OLUT.
OLUT-based implementations of the 2D-FFT can be expected to consume more FPGA
slices compared to ILUT-based implementations under the same unrolling factor. We will
provide a more in-depth comparison on these points in Section 3.4.
3.4 Analysis and Comparison of ILUT-based and OLUT-based Implementation
As described previously, when external memory is involved, the achievable speed-up
for a 2D-FFT implementation depends heavily on the bandwidth available for external
memory accesses. The on-board external memory on the NI FlexRIO platform provides a
bandwidth of 320 MB/s under a 40MHz base clock. Since the default size of data in the
interface to the memory is 64 bits (8 bytes), this bandwidth corresponds to a single
sample per cycle.
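The one-sample-per-cycle figure follows directly from the numbers quoted above. A
minimal check, assuming that 1 MB denotes 10^6 bytes (the helper name is
hypothetical):

```python
def bytes_per_cycle(bandwidth_bytes_per_s, clock_hz):
    """Bytes that can cross the memory interface in one clock cycle."""
    return bandwidth_bytes_per_s / clock_hz

BANDWIDTH = 320e6        # 320 MB/s, assuming 1 MB = 10**6 bytes
CLOCK = 40e6             # 40 MHz base clock
SAMPLE_BYTES = 64 // 8   # 64-bit interface word

samples_per_cycle = bytes_per_cycle(BANDWIDTH, CLOCK) / SAMPLE_BYTES
print(samples_per_cycle)
```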












Figure 3.4: Functional block diagram of 2D-FFT with ILUT.
in LMR for the next 1D-FFT computation, and write out N samples from LMW to exter-
nal memory. Here, N represents the input image size — i.e., the input image contains
NxN pixels. During the k-th 1D-FFT computation, the CU must transfer the N inputs for
the (k + 1)-th 1D-FFT computation from the external memory into LMR. In the same
computation frame, the CU also needs to transfer the N outputs produced by the
(k − 1)-th 1D-FFT computation from LMW to the external memory.
In other words, 2N cycles of data communication are required between the local
memory (LM) and the external memory for each 1D-FFT computation, and this is a lim-
iting factor in the achievable throughput.
3.4.1 Operation of ILUT-based 2D-FFT Implementation
A timing diagram for an iteration of ILUT-based 2D-FFT computation is shown in
Figure 3.5. The data loading and unloading processes can be overlapped in the proposed
1D-FFT IP, and with such overlapping, N clock cycles are required to process N samples.
FFT computation follows the loading/unloading process, and for this computation, N/2 ·
logN cycles are required if no unrolling is applied in the underlying radix-2 1D-FFT. If
we apply ILUT with unrolling factor k, then k butterfly units are deployed inside the 1D-
FFT core so that the execution time for each 1D-FFT computation can be decreased by
a factor of k. Therefore, the total time, as a function of the unrolling factor, for 1D-FFT
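computation is
$$T(k) = N + \frac{N \log N}{2k}$$
cycles, where the first term accounts for the overlapped loading and unloading of N
samples, and the second term accounts for the butterfly computations.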
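Now recall that it requires 2N cycles of data communication to prepare the next
1D-FFT computation after the previous 1D-FFT computation has completed. The time per
1D-FFT computation therefore cannot drop below 2N cycles, and this limit is reached
when the butterfly computation time of (N/2 · logN)/k cycles shrinks to N cycles, that
is, at k = logN/2. Thus, an upper bound on the achieved speed-up can be expressed as
$S^{MAX}_{inner} = \log N / 2$. Up to this level of speed-up, ILUT exhibits speed-up
that is linear in the unrolling factor k. The achieved speed-up saturates, however, at
$S^{MAX}_{inner}$ due to bandwidth limitations in the target platform.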
Note that this analysis is based on our use of 2N cycles as a bound for the data
transfer required between the LM and external memory between FFT computations. This
transfer rate bound is applicable, for example, to the NI FlexRIO platform that we have
targeted in our experiments. This bound is also applicable to the OLUT- and
external-memory-based 2D-FFT implementations explored in [17]. Changes in this bound,
however, require corresponding changes to the speed-up analysis presented in this
section.
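The saturation behavior analyzed in this subsection can be explored numerically with
a small model. This is a sketch based on our reading of the timing described above
(N cycles of overlapped loading/unloading, N/2 · logN butterfly cycles divided by k,
and a floor of 2N cycles per transform); the function names are hypothetical and the
logarithm is taken to base 2.

```python
import math

def ilut_cycles(n, k):
    """Load/unload (N cycles) plus radix-2 computation with k butterfly units."""
    return n + (n / 2) * math.log2(n) / k

def ilut_effective_cycles(n, k):
    """Per-transform time, limited from below by the 2N-cycle transfer bound."""
    return max(ilut_cycles(n, k), 2 * n)

def ilut_saturation_factor(n):
    """Unrolling factor at which the computation phase shrinks to N cycles."""
    return math.log2(n) / 2

for n in (256, 2048):
    print(n, ilut_saturation_factor(n),
          [ilut_effective_cycles(n, k) for k in (1, 2, 4, 8)])
```

For N = 256 the model saturates exactly at k = 4; for N = 2048 the saturation point of
5.5 lies between the power-of-two factors 4 and 8, so k = 8 is the first such factor
that reaches the bound.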
ILUT-based implementation promotes efficient utilization of FPGA resources. To
see this, recall that LMR (LMW) connects the input (output) port of the underlying
1D-FFT core to the output (input) channel of the external memory. Because of the
regular access patterns to and from the LM, LMR and LMW can be implemented by FIFO
buffers that operate based on standard (push and pop) FIFO access operations, and
simple interfacing logic. More specifically, data transfers involving the LM can be
controlled by a simple rule: data is pushed or popped as needed whenever the FIFO
status is neither “full” nor “empty.” This simplicity is facilitated by the form of
data parallelism provided by the ILUT architecture, which is implemented entirely in
the 1D-FFT core, and does not require parallel or random-access interfaces to the LM.
Exploiting this feature allows for resource-efficient implementation of the LM and its
associated interfaces in distributed memory or BRAM, and also allows for simple,
resource-efficient implementation of the CU.
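As a software analogy for the FIFO-based rule above, the following sketch shows a
bounded FIFO whose push and pop operations are gated only by its full and empty
status. The class and method names are hypothetical, and the sketch reflects one
common reading of the rule (push when not full, pop when not empty); it is not a
description of the actual LMR/LMW hardware.

```python
from collections import deque

class BoundedFifo:
    """Fixed-depth FIFO; transfers are gated only by the full/empty status."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def full(self):
        return len(self.items) == self.depth

    def empty(self):
        return len(self.items) == 0

    def try_push(self, value):
        """Accept a value unless the FIFO is full; report whether it was accepted."""
        if self.full():
            return False
        self.items.append(value)
        return True

    def try_pop(self):
        """Return the oldest value, or None if the FIFO is empty."""
        if self.empty():
            return None
        return self.items.popleft()
```

No address bookkeeping appears anywhere in the interface, which is the property that
keeps the associated control logic small.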
3.4.2 Operation of OLUT-based 2D-FFT Implementation
In OLUT-based FFT implementation, k ≥ 1 FFT cores operate simultaneously, and
each of these cores contains a single butterfly unit. Therefore, OLUT enables a
reduction in 1D-FFT processing time by a factor of k. To run k 1D-FFT cores in
parallel, the associated memory access controller must periodically fill the input
local memory and drain the output local memory at a sufficient rate. Since 2N cycles
are needed for the data transfers associated with each 1D-FFT core, the controller can
set up a single 1D-FFT computation frame for each of the k 1D-FFT cores every k · 2N
cycles.
Thus, if Tbase represents the time for 1D-FFT computation without acceleration









As with the ILUT-based architecture, the throughput improvement with OLUT is
limited by the bandwidth between the FPGA device and external memory.
Note also that the minimum inner loop unrolling factor kinner (for ILUT) that is
required to reach a given level of throughput is generally larger than the minimum
outer loop unrolling factor kouter required to achieve the same level of performance.
This is because the total size of the required data memory space (the storage space
represented by the blocks labeled as “Data Memory” banks in Figure 3.3) for OLUT is
kouter times larger than the data memory space required by ILUT, and hence, the net
time per transform required for loading and unloading local memory is reduced by a
factor of kouter by the OLUT approach compared to ILUT. In contrast, ILUT requires
constant data memory size (independent of the inner loop unrolling factor), and
therefore, the net time required by ILUT for loading and unloading local memory is
also constant.
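The relationship between the minimum unrolling factors can be illustrated with a
simple cycle-count model. This is a sketch under assumptions that are ours rather
than formulas taken from this section: an unaccelerated transform is assumed to take
N + N/2 · logN cycles, ILUT divides only the computation term by k, OLUT divides the
whole per-transform time by k, and throughput saturates once the per-transform time
reaches the 2N-cycle transfer bound. All function names are hypothetical.

```python
import math

def base_cycles(n):
    """Assumed time for one unaccelerated 1D-FFT: N load/unload + N/2*log2(N) compute."""
    return n + (n / 2) * math.log2(n)

def ilut_cycles(n, k):
    """ILUT: only the computation phase shrinks with k."""
    return n + (n / 2) * math.log2(n) / k

def olut_cycles(n, k):
    """OLUT (assumption): k cores work concurrently, so the net time per transform is 1/k."""
    return base_cycles(n) / k

def min_saturating_factor(cycles_fn, n, candidates):
    """Smallest candidate factor whose per-transform time reaches the 2N bound."""
    for k in candidates:
        if cycles_fn(n, k) <= 2 * n:
            return k
    return None

for n in (256, 2048):
    k_inner = min_saturating_factor(ilut_cycles, n, [1, 2, 4, 8, 16])  # powers of two
    k_outer = min_saturating_factor(olut_cycles, n, range(1, 17))      # any natural number
    print(n, k_inner, k_outer)
```

Under these assumptions the model yields kinner = 4 and kouter = 3 for N = 256, and
kinner = 8 and kouter = 4 for N = 2048, so kinner exceeds kouter in both cases.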
Overall, even though the larger unrolling factors required by ILUT (for given levels
of performance) result in correspondingly larger increases in FPGA resource usage due
to parallel resource instantiation, this increase is more than compensated for by the
improvement in the storage requirements of the data memory banks (especially for
larger unrolling factors). Thus, when FPGA distributed memory is used to implement
local memory, ILUT exhibits a significantly better ratio of achieved throughput to
consumed resources (FPGA slices) compared to the OLUT approach. This is demonstrated
quantitatively in Section 3.5 through our experiments.
Furthermore, OLUT requires a relatively complex interconnection network to switch
paths from multiple I/O ports of the 1D-FFT cores to the local memory subsystem. To
maintain peak performance, this interconnection network must be capable of supplying
an input vector before each transform computation and receiving an output vector
after each computation. Due to the reduced regularity of the memory accesses across
the OLUT interconnection network, the local memory cannot be managed in the form of a
simple FIFO, as can be done with our ILUT-based architecture. In OLUT, the local
memory controller must keep track of the associated row/column vector set for each
1D-FFT transform computation, and must continuously perform book-keeping to switch the
interconnection paths. Also, the OLUT controller must perform inter-core
synchronization across the set of 1D-FFT cores. As we demonstrate in the next section,
the increased control complexity in OLUT results in a significant increase in FPGA
resource consumption compared to ILUT.
Figure 3.5: A timing diagram of ILUT-based FFT computation. (Labels in the figure:
N cycles; one iteration of transferring a pair of input/output vectors.)
3.5 Experimental Results and Discussions
In our experiments, 2D-FFT designs have been implemented and evaluated for two
different image sizes: 256x256 and 2048x2048. Since a 2048x2048 image requires
approximately 33MB of storage and our platform provides 128MB of external memory,
2048x2048 is the largest standard image size (i.e., one in which the numbers of rows
and columns are powers of two) that can be supported on our platform. The next largest
image size, 4096x4096, requires approximately 4×33MB, which slightly exceeds the
available 128MB.
Figure 3.7: Computation time and FPGA resource utilization for 2D-FFT with an image
size of 2048x2048.
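The storage figures quoted above can be checked with a short calculation. This is a
sketch under two assumptions of ours: each sample occupies 8 bytes (matching the
64-bit memory interface word described in Section 3.4), and 128MB is interpreted as
128 · 10^6 bytes. The names used are hypothetical.

```python
def image_storage_bytes(n, bytes_per_sample=8):
    """Storage for an n x n image, assuming one 64-bit (8-byte) word per sample."""
    return n * n * bytes_per_sample

EXTERNAL_MEMORY_BYTES = 128 * 10**6   # assumption: decimal megabytes

for n in (2048, 4096):
    size = image_storage_bytes(n)
    print(n, round(size / 1e6, 1), "MB, fits:", size <= EXTERNAL_MEMORY_BYTES)
```

Under these assumptions, a 2048x2048 image needs about 33.6MB and fits, whereas a
4096x4096 image needs four times as much (about 134.2MB) and does not.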
We have implemented both inner loop (ILUT) and outer loop (OLUT) unrolling
separately in alternative 2D-FFT implementations, and we have carefully compared the
results. Each unrolling technique has been applied with increasing unrolling factors until
the maximal throughput allowed by the external memory bandwidth was achieved. While
the given FPGA target allows us to transfer data between the external memory and FPGA
device at a clock frequency of 100MHz, our 2D-FFT implementation cannot operate on
such a fast clock. Thus, multiple clock domains are required to support the highest pos-
sible memory bandwidth. In this chapter, we focus on exploring 2D-FFT design trade-
offs for conventional, single-clock-domain implementation, and therefore, we slow the
memory interface down to the same speed as the 2D-FFT computation subsystem. More
specifically, we use a single clock domain that operates at 40MHz. Applying heteroge-
neous clock domains to explore further performance enhancement is a useful direction
for further investigation.
In our OLUT implementation, we employed the LabVIEW FPGA 1D-FFT library
module, which is a widely-used commercial 1D-FFT library module that has competitive
performance compared to related commercial FPGA cores [26]. In both the OLUT and
ILUT implementations, we employed distributed memory to implement the local memory
subsystems, as described earlier in Section 3.3. For this purpose, we used the distributed
memory library from Xilinx LogiCORE [22]. Another useful direction for follow-on
research is the integration of block RAM (BRAM) into the design space for optimized
ILUT-based 2D-FFT implementation.
For our experiments with ILUT, we have restricted the inner loop unrolling factor
to be a power of two for efficiency in hardware utilization. When ILUT is applied with
unrolling factors that are not powers of two, significant resource usage inefficiency
results. This is because the 1D-FFT data memory indices can then no longer be
generated simply by concatenating the bit patterns of the memory addresses with those
of the associated memory bank addresses, and thus significant overhead results in the
address generation logic. While multiple butterfly units in ILUT jointly process a
single input vector under the control of a novel memory addressing scheme, each
butterfly unit in OLUT handles its own individual input vectors separately. Hence, the
unrolling factor in OLUT can be any natural number, rather than only a power of two as
in the ILUT case. We compare the proposed ILUT to OLUT at the performance levels that
ILUT provides. This comparison may not be a comprehensive comparison between the two
techniques, but it clearly demonstrates the advantages of ILUT in terms of resource
utilization across all of its allowed performance levels.
Given a 2D-FFT implementation based on the assumptions described above (a sin-
gle clock domain and distributed-memory-based local memory), we define the relative
resource utilization as the quotient R/T , where R denotes the total number of FPGA
slices (including resources for computation and for distributed memory) required for
the implementation, and T denotes the throughput in 2D-FFT computations per second.
Thus, decreasing levels of relative resource utilization indicate increasing levels of cost-
efficiency relative to the achieved processing performance (or conversely, increasing lev-
els of performance-efficiency relative to the achieved cost).
Figure 3.6 shows the computation time and the number of occupied FPGA slices
for 2D-FFT implementation under both ILUT- and OLUT-based approaches with an image
size of 256x256. Corresponding values of relative resource utilization are given in
Table 3.1. From Figure 3.6, we see that OLUT exhibits a smaller computation time
compared to ILUT for an unrolling factor of 2 (k = 2). This is due to the reduced
time for local memory loading and unloading, which we discussed in Section 3.4. Even
though there is a difference in throughput for k = 2, the ILUT and OLUT techniques
exhibit similar levels of relative resource utilization in Table 3.1 (21.71 and 21.43,
respectively). OLUT achieves the maximal achievable throughput (as constrained by the
external memory bandwidth) at an unrolling factor of 3, while ILUT achieves the
maximal achievable throughput at an unrolling factor of 4. At this maximal performance
level, ILUT exhibits approximately 16% lower relative resource utilization than OLUT
(16.18 versus 19.36 in Table 3.1; equivalently, the OLUT value is about 20% higher).
This demonstrates the significant resource-efficiency advantage offered by our
ILUT-based approach compared to the more conventional approach of OLUT-based 2D-FFT
implementation.
Computation time and FPGA slice usage results for an image size of 2048x2048 are
shown in Figure 3.7. While OLUT has a smaller execution time than ILUT with a small
unrolling factor, ILUT consistently exhibits better relative resource utilization than
OLUT under similar levels of performance. For example, even though ILUT at k = 4 and
OLUT at k = 3 employ different unrolling factors, both of these configurations exhibit
similar levels of performance, as shown in Figure 3.7, and these configurations can be
compared in terms of the relative resource utilization metric. This is shown in
Table 3.2.
Furthermore, ILUT consumes a smaller number of FPGA slices at the highest
performance level. The lowest unrolling factor at which OLUT achieves maximal
throughput is kouter = 4. However, OLUT cannot be synthesized on our target platform
at this unrolling factor. This is because, as indicated by the results from our
synthesis attempts, the number of FPGA slices required at this unrolling factor exceeds
the number of available slices in the FPGA device. In Table 3.2, we enter
“Compile Error” as the OLUT value at k = 4 to indicate that we could not synthesize
this case due to limited FPGA resources on the target platform. Even though we cannot
synthesize this case, we can estimate its relative resource utilization. Since this
case reaches the maximal achievable throughput, it will have the same throughput as
the ILUT case with k = 8, as shown in Figure 3.7. Also, this OLUT configuration (k = 4)
is expected to consume more FPGA resources than the OLUT configuration with k = 3 due
to the additional butterfly unit and its associated control logic. In Figure 3.7, the
ILUT configuration with k = 8 shows much better resource utilization than OLUT with
k = 3. Hence, we can expect that the ILUT approach has smaller relative resource
utilization compared to OLUT when we compare their respective maximal-performance
configurations.
Another interesting result from our experiments is that the relative resource utiliza-
tion of ILUT at kinner = 4 is smaller than that at kinner = 8. This is because the potential
speed-up at kinner = 8 is not fully realized due to the limited external memory bandwidth.
This saturation of performance can be seen in Figure 3.7.
Table 3.1: Relative resource requirements for an image size of 256x256.
Unrolling Factor    ILUT     OLUT
k = 1               32.96    32.96
k = 2               21.71    21.43
k = 3               N/A      19.36
k = 4               16.18    N/A

Table 3.2: Relative resource requirements for an image size of 2048x2048.
Unrolling Factor    ILUT     OLUT
k = 1               3994     3994
k = 2               2391     2546
k = 3               N/A      2244
k = 4               1632     Compile Error
k = 8               1736     N/A
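The comparisons drawn from Tables 3.1 and 3.2 can be reproduced with a few lines of
arithmetic. The sketch below only restates the tabulated values of the relative
resource utilization R/T; the helper name `percent_reduction` and the dictionary
layout are ours.

```python
def percent_reduction(new, old):
    """Percentage by which `new` is lower than `old`."""
    return 100.0 * (old - new) / old

# Relative resource utilization (R/T) from Tables 3.1 and 3.2, keyed by unrolling factor.
table_3_1 = {"ILUT": {1: 32.96, 2: 21.71, 4: 16.18},
             "OLUT": {1: 32.96, 2: 21.43, 3: 19.36}}
table_3_2 = {"ILUT": {1: 3994, 2: 2391, 4: 1632, 8: 1736},
             "OLUT": {1: 3994, 2: 2546, 3: 2244}}

# 256x256: both techniques at their maximal-throughput configurations.
print(percent_reduction(table_3_1["ILUT"][4], table_3_1["OLUT"][3]))
# 2048x2048: configurations with similar performance (ILUT k = 4, OLUT k = 3).
print(percent_reduction(table_3_2["ILUT"][4], table_3_2["OLUT"][3]))
```

The first comparison gives a reduction of about 16.4%, and the second a reduction of
about 27.3%.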
3.6 Conclusion
In this chapter, we have developed a systematic approach for generating dedicated
2D-FFT subsystems for FPGA implementation. Our approach realizes data parallelism
within an individual 1D-FFT core, and minimizes the interface complexity between the
underlying 1D-FFT core and local memory. Our approach allows for scalable, parallel
2D-FFT implementation with a relatively simple interconnection network, and
correspondingly simple control logic. These features contribute to reduced FPGA
resource consumption at a given level of performance compared to previous 2D-FFT FPGA
architectures.
Our methods are demonstrated through extensive synthesis experiments using the
Xilinx Virtex-5 FPGA device. Our synthesis results quantify the cost-performance trade-
offs provided by our proposed class of FFT architectures. A distinguishing characteristic
of our approach, compared to previous techniques for 2D-FFT implementation, is that we
provide a systematic method to generate streamlined, FPGA-based, 2D-FFT architectures
while taking into account trade-offs between performance and cost.
Chapter 4
Efficient Static Buffering to Guarantee Throughput-Optimal FPGA
Implementation of Synchronous Dataflow Graphs
4.1 Introduction and related work
At the graph level of the design methodology illustrated in Fig. 1.1, it is important
to consider real-time constraints as well as optimization of hardware resources. When
describing a DSP application with a dataflow graph, functional blocks and storage space
for transferring data between adjacent blocks are modeled as graph vertices (actors)
and edges, respectively. When mapping graph edges into storage locations, care must be
taken to make effective use of limited storage locations (e.g., on-chip memory in
programmable digital signal processors, and block RAM and distributed memory in FPGAs).
However, reducing the storage space for transferring data between actors may result in
decreased throughput due to idle time that is required to prevent buffer overflow: as
buffers become smaller, the frequency and duration of such overflow-avoiding idle time
generally increase, which leads to decreased throughput. The limited amounts of storage
available in DSP implementation targets, and the importance of meeting real-time
performance constraints, motivate the goal of guaranteed, throughput-optimal buffer
configuration for SDF graphs. In this chapter, we study this throughput and buffering
analysis problem in the context of FPGA-based implementation.
Synchronous dataflow (SDF) [1] has been used widely as an efficient model of
computation for analyzing performance and resource requirements of DSP applications
that are implemented on various target architectures (e.g., see [2, 3, 4, 5, 6, 31]). Tradi-
tionally, throughput analysis for SDF graphs is performed by solving an instance of the
maximum mean cycle problem (e.g., see [32, 33]) after converting the input SDF graph
into an equivalent homogeneous SDF (HSDF) graph [1]. HSDF is a special case of SDF
in which the production and consumption rates are identically equal to unity for all input
and output ports of all actors. These rates are in terms of data values (tokens) per actor
execution (firing). Throughput analysis based on SDF-to-HSDF conversion suffers from
high worst case complexity because neither the time nor space required to perform this
conversion is polynomially bounded (e.g., see [34]).
This complexity arises from the nature of periodic schedules of SDF graphs, which
are used for static scheduling. A periodic schedule for an SDF graph is a schedule that
produces no net change in the buffer state — i.e., the numbers of tokens that are queued
on the buffers associated with the graph edges. The total number of actor firings in a
periodic schedule can scale exponentially even for simple classes of SDF graphs [34].
Since each actor firing corresponds to a separate vertex in the HSDF version of an SDF
graph, the SDF-to-HSDF transformation process can result in similar exponential growth.
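The growth in the number of firings can be seen on a simple chain-structured SDF
graph. The sketch below solves the SDF balance equations (for each edge, the number
of source firings times the production rate equals the number of sink firings times
the consumption rate) for a chain of actors. The function name and the example rates
are ours and serve only as an illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def chain_repetitions(edge_rates):
    """Minimal firing counts per period for a chain of actors.

    edge_rates lists (production rate, consumption rate) for each edge in order.
    """
    q = [Fraction(1)]
    for produce, consume in edge_rates:
        # Balance equation: q[src] * produce == q[snk] * consume.
        q.append(q[-1] * produce / consume)
    # Scale to the smallest all-integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in q), 1)
    counts = [int(f * scale) for f in q]
    common = reduce(gcd, counts)
    return [c // common for c in counts]

# A chain in which every edge produces 2 tokens and consumes 1 per firing.
for edges in (3, 10, 20):
    firings = chain_repetitions([(2, 1)] * edges)
    print(edges + 1, "actors ->", sum(firings), "firings per period")
```

With rates of (2, 1) on every edge, the firing counts double from one actor to the
next, so the total number of firings per period, and hence the number of vertices in
the equivalent HSDF graph, grows exponentially with the number of actors.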
Ghamarian et al. [35] have developed a method for SDF throughput analysis that
avoids conversion to an HSDF graph, and instead uses state space exploration
techniques in terms of the buffer state. In general, executions of actors change the
buffer state by removing (consuming) tokens from input edges of the actors that fire,
and inserting (producing) tokens onto output edges. Ghamarian et al. exploit the
property that when SDF graphs execute in a purely data-driven (“self-timed”) manner
under bounded memory space, the state space is also bounded, and execution eventually
settles into a periodic pattern (periodic steady state or PSS). In this method for
throughput analysis, only selected states need to be stored when detecting the PSS of
execution, and through a careful pruning technique for state storage, significant
improvements can be achieved in the efficiency of performance analysis. However, the
technique requires simulation of the overall schedule, and the worst case complexity
is linear in the length (number of firings) of the given periodic schedule, which, as
described above, is not polynomially bounded in the size of the input SDF graph.
Buffer minimization for SDF graphs has been studied mainly in the context of
single-processor targets (e.g., see [36, 37, 38, 39, 40]). FPGA implementation,
however, allows simultaneous actor firings by assigning each actor to its own
dedicated FPGA slices. Thus, the minimal-buffer solutions derived from deadlock-free
schedules in this previous work cannot be applied directly to the FPGA implementation
domain.
Horstmannshoff et al. [41, 42] developed a scheduling method for systems of complex
RT-level building blocks derived from SDF graphs. Based on the timing patterns with
which each block produces and consumes tokens, this method constructs a retiming
graph, and uses it to generate a schedule of stall signals for each SDF actor with
minimum buffer cost.
Stuijk [43] develops a systematic approach for exploring throughput and storage
trade-offs for SDF graphs. This approach applies methods developed in [44] for determin-
ing minimum storage requirements based on state-space analysis of buffer states. Stuijk’s
approach operates by first finding a minimal storage distribution, and then recursively in-
creasing the storage space for each edge that has a storage dependency. This results in a
family of buffer distribution-throughput pairs as a representation of Pareto solutions for
the graph. Although this approach prunes the search space to reduce complexity, schedule
simulation is still required in the search process, so again, worst case complexity is not
polynomially bounded.
Wiggers [45] presents an algorithm with linear computational complexity to deter-
mine close-to-minimum buffer capacities for a given throughput constraint. However, this
approach imposes a form of strictly periodic scheduling that requires a counter in every
functional block, which leads to resource overhead in FPGA and other hardware-oriented
implementations. Also, since Wiggers’s approach assumes that execution will enter the
required periodic steady-state only with the timely availability of sufficient starting to-
kens for every actor, it may not adequately handle irregular streaming inputs, where token
arrival times are less predictable.
In contrast to the related prior work, we propose a heuristic algorithm with low
polynomial time complexity that provides upper bounds on buffer requirements to guar-
antee throughput-optimal FPGA realizations of SDF graphs. Our approach focuses on
the restricted class of tree-structured SDF graphs — that is, the input application model
(application graph) must be in the form of an SDF tree. We emphasize that our algorithm
is a heuristic only in the sense of the buffer sizes that are computed; in terms of achieved
throughput performance, our approach guarantees optimality.
We first analyze relationships of firing patterns between actors and buffer require-
ments for the two-actor SDF graph model (TASM), which is a specialized form of SDF
graph that we propose for efficient analysis of data communication on individual edges
in a given SDF application graph. We then apply this two-actor firing pattern analysis
repeatedly when traversing an application graph to determine buffer configurations that
guarantee maximum achievable throughput.
In our buffer optimization scenario and our associated TASM analysis, we consider
self-timed dataflow graph execution (e.g., see [46, 47]), which means that an actor is fired
as soon as all of its input edges have enough tokens; that is, as soon as the number
of tokens on each input edge e is at least c(e), the consumption rate associated with
e. If each actor is mapped to a separate hardware resource, and the overhead of
communication and synchronization between actors is negligible, then self-timed
execution leads to the maximum achievable throughput (e.g., see [48, 47]). Moreover,
this form of execution does not require any global schedule, and therefore the
storage, performance, and interconnect overhead associated with implementing a global
schedule is avoided.
With predominantly coarse-grained dataflow actors (e.g., digital filters and transform
computations, as opposed to adders and multipliers) and streamlined implementation
of dataflow edges, one can reduce the relative overhead of inter-actor communication
and synchronization significantly, so that self-timed scheduling becomes an effective
approach. This context of coarse-grained actors and streamlined edge implementation is the
form in which we explore self-timed implementation and associated buffer configuration
strategies in this chapter.
We first present precise definitions and notation related to buffer analysis of SDF-based
implementations. Using these concepts, we analyze the data transfer behavior on
an SDF edge using the TASM described earlier. Based on this analysis, we develop
an algorithm for buffer analysis, and we show an overall design flow for applying
this algorithm for efficient synthesis of FPGA implementations.
The proposed algorithm is implemented in the dataflow interchange format (DIF) pack-
age [49], which provides a standard language and associated toolset that is founded in
dataflow semantics and tailored for DSP system design [5].
4.2 Background
4.2.1 Application representation
We represent a DSP application with a dataflow graph G = (V,E), where each com-
putational module is mapped to a vertex (actor) v∈V and each directed edge e∈ E corre-
sponds to a FIFO buffer for communicating data from the source actor src(e) to the sink
actor snk(e) of e. We assume that the given dataflow model adheres to the assumptions
of SDF, which require that the production and consumption rates of all actor output and
input ports, respectively, are constant [1]. The SDF model is used widely in tools for DSP
system design, and powerful analysis techniques have been developed for mapping SDF
representations into various kinds of platforms (e.g., see [47]).
Given an SDF edge e, we represent the associated production rate of src(e) by p(ei),
and we represent the associated consumption rate of snk(e) by c(ei). An SDF edge e also
has associated with it a non-negative delay, denoted del(e), which represents the number
of initial tokens that reside on the corresponding buffer at the start of execution.
A necessary condition for executing (firing) an SDF actor v is that the number of
tokens on every input edge ein of v is greater than or equal to c(ein). While v consumes
c(ein) tokens from each input edge ein during its execution — i.e., during the execution
of a single invocation or firing of the actor — it produces p(eout) tokens onto each output
edge eout.
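The firing rule above can be sketched in a few lines of Python. This is only an illustration: the dictionary-based graph representation and the names `can_fire` and `fire` are assumptions made for this sketch and are not part of the DIF package.

```python
def can_fire(actor, edges, tokens):
    """Return True if every input edge of `actor` holds at least c(e) tokens."""
    return all(tokens[name] >= e["cons"]
               for name, e in edges.items() if e["snk"] == actor)


def fire(actor, edges, tokens):
    """Fire `actor` once and return the updated token counts."""
    if not can_fire(actor, edges, tokens):
        raise ValueError(f"{actor} is not enabled")
    updated = dict(tokens)
    for name, e in edges.items():
        if e["snk"] == actor:
            updated[name] -= e["cons"]   # consume c(e) tokens from each input edge
        if e["src"] == actor:
            updated[name] += e["prod"]   # produce p(e) tokens onto each output edge
    return updated


# Hypothetical two-actor graph: A produces 3 tokens per firing, B consumes 5.
edges = {"e1": {"src": "A", "snk": "B", "prod": 3, "cons": 5}}
tokens = {"e1": 0}  # del(e1) = 0
```

With these assumed rates, B becomes enabled only after A has fired twice.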
If an SDF graph is properly constructed in a certain technical sense, then there
exists a periodic schedule for the graph — that is, a schedule that is free from deadlock,
fires each actor at least once, and produces no net change in the number of tokens on
every edge e in the graph. This concept of a properly constructed SDF graph is referred
to as consistency; efficient algorithms have been developed to determine whether or not
an SDF graph is consistent, and therefore has a periodic schedule [1]. For a consistent
SDF graph, the relative rates at which actors need to fire can be determined from the
balance equations [1]:
p(ei) × q[src(e)] = c(ei) × q[snk(e)] for all e ∈ E. (4.1)
For a consistent, connected SDF graph G, there is a unique minimum positive integer
solution to the balance equations of G. This solution is called the repetitions vector, and
is often denoted by q. For each actor v in G, we refer to q[v] as the repetition
count of v. A valid and minimal periodic schedule should fire each actor a number of
times equal to the repetition count of the actor. Such a periodic schedule can then be
iterated as many times as needed with guaranteed bounded memory for all of the edges in
the graph [1].
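As an illustration of how the balance equations (4.1) determine the repetitions vector, the following sketch propagates rational firing rates across a connected graph and then scales them to the smallest positive integers. The function name `repetitions_vector` and the tuple-based edge format are assumptions made for this example; the algorithms referenced in [1] may be organized differently.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(actors, edges):
    """Solve p(e) * q[src(e)] = c(e) * q[snk(e)] for all edges.

    `edges` is a list of (src, snk, p, c) tuples. Returns the minimal
    positive integer solution as a dict, or raises ValueError if the
    graph is inconsistent or not connected.
    """
    neighbors = {v: [] for v in actors}
    for src, snk, p, c in edges:
        neighbors[src].append((snk, Fraction(p, c)))  # q[snk] = q[src] * p / c
        neighbors[snk].append((src, Fraction(c, p)))  # q[src] = q[snk] * c / p

    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        v = stack.pop()
        for w, ratio in neighbors[v]:
            expected = rate[v] * ratio
            if w not in rate:
                rate[w] = expected
                stack.append(w)
            elif rate[w] != expected:
                raise ValueError("inconsistent SDF graph")
    if len(rate) != len(actors):
        raise ValueError("SDF graph is not connected")

    # Scale the rational rates to integers, then divide out any common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {v: int(r * scale) for v, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {v: n // common for v, n in counts.items()}
```

For a single hypothetical edge with p = 3 and c = 5, this yields q[src] = 5 and q[snk] = 3, so that 3 × 5 = 5 × 3 tokens are produced and consumed per schedule iteration.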
If an SDF graph G is not connected, then repetitions vectors and periodic sched-
ules can be computed separately for each connected component, and these “connected
component schedules” can be iterated at arbitrary rates relative to one another to achieve
bounded memory execution of G. The relative rates of execution for the connected com-
ponents can be managed — based on relevant characteristics of the associated signal pro-
cessing subsystems — using a vector called the blocking vector [50]. The blocking vector
has elements that are non-negative integers, and is indexed by the connected components
in G.
In the remainder of this chapter, we assume that we are working with connected
SDF graphs. However, through appropriate use of blocking vectors, the developments in
the chapter can be extended naturally to handle SDF graphs that are not connected.
4.3 Target platform model
Since resource sharing is often avoided in FPGA implementation due to the rela-
tively high cost of multiplexing and routing resources (e.g., see [51]), we assume that
each computational block (SDF actor) is assigned to a dedicated set of FPGA logic cells
without any sharing. Integrating resource sharing considerations into the developments of
this chapter is an interesting direction for future work, and may be useful in cases where
resources are limited compared to the amount of required computation.
FPGAs provide two ways of implementing memory space between functional blocks:
block RAMs, which provide dedicated memory hardware within an FPGA, and
distributed RAM, which is built from FPGA slices. The number
of ports for reading (writing) data from (into) both forms of RAM is limited, and these
limitations must be taken into account carefully for correct buffer management. In the
Xilinx Virtex-II Pro FPGA, which we target in this chapter, the number of ports is limited
to two, and therefore, only a single pair of simultaneous read/write operations to each
RAM subsystem is possible.
To support this limitation on RAM access, we incorporate a self-loop on each actor in our
dataflow-based architecture model, and we add a single unit of delay to each such
self-loop. By a self-loop, we mean an edge whose source and sink actors are identical.
By adding a self-loop with unit delay to each actor, we ensure that successive executions
of the same actor are always serialized, which guarantees that the memory requests of one
actor invocation do not conflict with those of another invocation of the same actor.
Our overall mapping approach therefore maps each actor in the SDF application
model to a single actor (dedicated hardware resource) in the architecture model, along
with a self-loop connection for that actor. Thus, we allow for concurrent execution of
distinct actors, while serializing successive invocations of the same actor since such suc-
cessive invocations must access the same memory ports for buffer access.
4.4 Design flow
Fig. 4.1 illustrates the overall design flow for our proposed buffer optimization tech-
nique under performance constraints. The proposed technique is implemented in the
dataflow interchange format (DIF) package [5]. The DIF package provides a flexible
dataflow design language, intermediate representations, and transformations for specify-
ing, analyzing, and optimizing implementations of DSP applications. While DIF supports
a variety of dataflow models of computation, including synchronous [1], cyclo-static [52],
and enable-invoke [53] dataflow, support for SDF in DIF is especially well developed and
mature. We leverage this support for SDF graph techniques in the DIF package to develop
and experiment with the novel buffer optimization techniques that we introduce in
this chapter.
Figure 4.1: Overall design flow.
The buffer optimization module that we have developed in the DIF package has
two inputs: a DIF language specification of the application to be implemented,
and the throughput constraint, that is, the required rate at which the implementation
must be able to process samples. As part of the DIF application specification, we include
timing characteristics that are derived by profiling the execution of each actor on the target
hardware. In the experiments that we report on in this chapter, the timing characteristics
are derived from the LabVIEW FPGA library, which is targeted to the Xilinx Virtex-II Pro
FPGA platform.
The DIF package front-end parses the DIF-based application specification and con-
structs an intermediate representation on which we apply our new algorithm for analysis
and optimization of throughput-constrained buffer configurations. This algorithm is the
main contribution of this chapter.
Our proposed algorithm for buffer optimization aims to minimize FPGA resource
requirements for implementing the buffers associated with the edges in the input SDF
graph. This buffer resource minimization is performed subject to the given throughput
constraint. Careful estimation of throughput is performed in conjunction with our buffer
optimization technique to ensure that the throughput constraint is satisfied without significant
performance over-design (i.e., with actual throughput that is significantly higher than
what is required based on the throughput constraint). Such over-design in general leads
to buffer allocations that are larger than what is required to achieve the given through-
put constraint, and is therefore counter-productive in terms of our throughput-constrained
buffer minimization problem.
In our experimental setup, the result of our buffer optimization technique is applied
by hand to the LabVIEW FPGA code of the target DSP system. Thus, we demonstrated
a semi-automated design flow, where the result of our fully-automated buffer optimiza-
tion algorithm is translated by hand to configure the buffers in the target implementation.
Although this process is generally more time consuming compared to an end-to-end au-
tomated flow, it is a highly flexible approach because it can easily be adapted to different
target platforms and back-end synthesis tools.
Through experiments with LabVIEW FPGA, we demonstrate significant improve-
ments in resource efficiency, and minimal levels of performance over-design that result
from the buffer configurations derived from our optimization technique.
4.5 Two-actor SDF graph model (TASM)
We assume a static buffering approach for SDF graphs, which means that for each
SDF edge we allocate a fixed amount of memory space at compile time. We refer to
the fixed amount of space that is allocated for an edge ei as the buffer size of ei, and we
denote this buffer size by the symbol D(ei). For real-time implementation of SDF graphs,
static buffering is often preferable due to its enhanced predictability and elimination of
overhead due to dynamic memory allocation.
In this section, we introduce a model called the Two-Actor SDF Graph Model
(TASM). For any edge ei ∈ E in an arbitrary SDF graph G = (V,E), the TASM for ei,






Figure 4.2: An example of an SDF edge and its TASM model.
as the enclosing SDF graph G executes under bounded memory. Also, TASM facilitates
the formalization of our proposed synthesis approach, and its feature of computing buffer
space requirements for throughput-optimal implementation.
4.5.1 Two-actor SDF graph model (TASM)
Suppose that edge ei, shown in Fig. 4.5, is part of some arbitrary enclosing SDF
graph G = (V,E) (i.e., ei ∈ E), and suppose that src(ei) = vsrc, snk(ei) = vsnk, del(ei) = di,
and the production and consumption rates of ei are denoted by p(ei) and c(ei), respec-
tively. Suppose also that ei is assigned a pre-specified buffer size D(ei). Then the TASM
graph associated with ei, which we denote by GTi , is defined as illustrated in Fig. 4.5(b).
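Here, $v^T_{src} = v_{src}$ and $v^T_{snk} = v_{snk}$. The edge $e^T_{(i,1)}$ is directed from $v^T_{src}$ to
$v^T_{snk}$ with delay $d_i$, and the edge $e^T_{(i,2)}$ is directed from $v^T_{snk}$ to $v^T_{src}$ with delay
$(D(e_i) - d_i)$. The production and consumption rates
for edges in the TASM graph are set as follows.
$$p(e^T_{(i,1)}) = c(e^T_{(i,2)}) = p(e_i), \tag{4.2}$$
$$c(e^T_{(i,1)}) = p(e^T_{(i,2)}) = c(e_i). \tag{4.3}$$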
At any given time, buffer slots (cells in the memory that are allocated for the buffer)
are categorized into two types based on whether they contain live data (filled) or whether
they are available for storing new data (empty or free). The filled space in the buffer for ei
is modeled by $e^T_{(i,1)}$ in TASM. Thus, as $G^T_i$ executes, each token on $e^T_{(i,1)}$ represents a live
token in the buffer associated with ei in a corresponding execution of G. Since the source
actor src(ei) can be fired only when ei has enough free space to store all of the tokens
produced by a firing of src(ei), each firing of src(ei) can be viewed as consuming p(ei)
free cells from the buffer space available on ei. Conversely, each execution of snk(ei)
expands the free space on ei by c(ei) cells. Hence, the free space on ei can be modeled by
the edge $e^T_{(i,2)}$ shown in Fig. 4.5(b), where each token on $e^T_{(i,2)}$ during an execution of $G^T_i$
represents an empty cell in the buffer associated with ei in a corresponding execution of
G.
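The construction described above, with one edge tracking filled cells and a reverse edge tracking free cells, can be sketched as follows. The `Edge` record and the function name `build_tasm` are hypothetical names introduced for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    snk: str
    prod: int    # tokens produced per firing of src
    cons: int    # tokens consumed per firing of snk
    delay: int   # initial tokens on the edge


def build_tasm(p, c, d, buffer_size):
    """Return (filled_edge, free_edge) of the TASM for an edge with rates
    p and c, delay d, and pre-specified buffer size D = buffer_size."""
    if buffer_size < d:
        raise ValueError("buffer size must be at least the edge delay")
    # Forward edge: each token is a live token in the buffer.
    filled = Edge("v_src", "v_snk", prod=p, cons=c, delay=d)
    # Reverse edge: each token is an empty cell; the source consumes p free
    # cells per firing, and the sink releases c cells per firing.
    free = Edge("v_snk", "v_src", prod=c, cons=p, delay=buffer_size - d)
    return filled, free
```

By construction, the initial tokens on the two edges sum to the buffer size, and every firing of either actor preserves that sum once the firing has completed.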
4.5.2 Modified self-timed execution (MSTE) in TASM
We use the self-timed execution model when mapping the input SDF graph into
an FPGA implementation. Self-timed execution of SDF graphs can in general lead to
execution periods (the patterns in which actors execute on the available resources) that
are of exponential length in terms of the size of the graph (e.g., see [47]). Such
exponential growth of execution periods can significantly complicate static analysis. To
help address this difficulty, we add an additional firing rule, which we call the MSTE firing
rule: actor $v^T_{src}$ in $G^T_i$ (see Fig. 4.5(b)) of the TASM cannot be fired if
$$\tau_1(t) \geq \max(p(e_i), c(e_i)), \tag{4.4}$$
where $\tau_j(t)$ represents the number of tokens on $e^T_{(i,j)}$ at time $t$ for $j = 1, 2$. By imposing
the MSTE firing rule, we obtain a modified form of self-timed execution, which we refer
to in the remainder of this chapter as modified self-timed execution (MSTE).
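A minimal sketch of the resulting enabling conditions is shown below. The function names are assumptions made for this example; `tau1` and `tau2` stand for the token counts on the filled-space and free-space edges of the TASM, respectively.

```python
def snk_enabled(tau1, c):
    """The sink is enabled when the filled-space edge holds at least c tokens."""
    return tau1 >= c


def src_enabled_mste(tau1, tau2, p, c):
    """The source needs p free cells and must not be blocked by rule (4.4)."""
    has_free_space = tau2 >= p
    blocked_by_mste = tau1 >= max(p, c)
    return has_free_space and not blocked_by_mste
```

For example, with assumed rates p = 3 and c = 5, a source that sees 5 filled cells is blocked by the MSTE rule even when 15 cells are free, whereas under conventional self-timed execution it would be enabled.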
We have empirically observed that this additional firing rule usually results in only
relatively minor deviations from self-timed execution. However, imposing the rule leads
to a periodic execution pattern SP that is defined by the repetition vector of G. More pre-
cisely, by SP in this context, we mean a finite-duration schedule onto the disjoint subsets,
rsrc and rsnk, of FPGA resources that are occupied by the actors vsrc and vsnk, respectively.
In other words, SP can be viewed as a mapping
SP : [0,1, . . . ,(ti−1)]×{rsrc,rsnk}→ {vTsrc,vTsnk,vidle}, (4.5)
where ti is the length of the schedule (the period of the periodic pattern), and vidle repre-
sents a void computation (idle resource). Note that even though SP is formulated as a fully
static schedule, it is implemented using our modified form of self-timed execution — i.e.,
the constraints imposed by our modified form of self-timed execution lead naturally to
this kind of periodic pattern in the steady state. If ts represents the time when this steady
state pattern first emerges (i.e., just after the point when the transient ends), then the
schedules for all time intervals of the form [(ts + kti),(ts +(k +1)ti−1)], for k = 0,1, . . .,
can be obtained by appropriately-shifted versions of the schedule defined by (4.5).
Thus, if q represents the repetitions vector of G, then one period of SP contains
q[vsrc] and q[vsnk] firings of $v^T_{src}$ and $v^T_{snk}$, respectively. This kind of periodic schedule
helps to significantly reduce the complexity of performance analysis since the iterative
dataflow execution is characterized by a relatively compact periodic structure.
4.5.3 Subperiods in TASM
The entire firing pattern in an iteration (i.e., a single execution in the periodic rep-
etition) of SP can be expressed as a sequence of subperiods, where by a subperiod (SP),
we mean a smaller firing pattern within SP. From the additional firing rule that we intro-
duce in our implementation model, we are able to constrain execution so that it becomes
more structured, which leads to potential for more efficient static analysis. Fortunately,
the constraints imposed by the MSTE firing rule do not impose significant performance
limitations, which we will demonstrate in the experiments that we present in Section 4.10.
An SP is defined as the time period between two consecutive breakpoints of actor
execution, where the breakpoints are derived from two key conditions. The first condition,
which we denote by c1(t), is
$$c_1(t) = c_{1,a}(t) \wedge c_{2,a}(t), \tag{4.6}$$
where
$$c_{1,a}(t) = (\tau_1(t) \geq \max(p(e_i), c(e_i))), \text{ and} \tag{4.7}$$
$$c_{2,a}(t) = (\tau_1(t) < p(e_i) + c(e_i)). \tag{4.8}$$
We say that the first condition, Condition 1, “holds” or “is true” at a given time
instant θ if (4.6) is satisfied for t = θ (i.e., if both c1,a(θ) and c2,a(θ) hold). As we see
from the MSTE firing rule, the source actor cannot be fired when Condition 1 is satisfied.
To introduce Condition 2, denoted by c2(t), it is useful to first define the following
notion of inter and intra firing times of an actor. The set of inTRA firing times of an
actor X , denoted by TRA(X), is defined as the set of time instants during which actor X is
executing. This set can be formulated as follows.
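$$\mathit{TRA}(X) = \bigcup_{j=1}^{\infty} \{\, t \mid \mathrm{start}(X, j) \leq t < \mathrm{end}(X, j) \,\},$$
where start(X, j) and end(X, j) denote the times at which the j-th firing of X starts and ends, respectively.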
Similarly, the set of inTER firing times of an actor X is defined as the set of time
instants during which actor X is not executing (“idle”). This set can be expressed as the
complement of the inTRA firing times of X with respect to the set Z+ of non-negative
integers:
TER(X) = Z+−TRA(X).
Condition 2 associated with the definition of breakpoints is defined as
$$c_2(t) = \begin{cases} \text{true}, & \text{if } t \in \mathit{TER}(A) \cap \mathit{TER}(B), \\ \text{false}, & \text{otherwise.} \end{cases} \tag{4.9}$$
This condition expresses that breakpoints do not occur during the execution time of either
actor in TASM; breakpoints occur only “between” executions of $v^T_{src}$ and $v^T_{snk}$. Based
on the two conditions, c1(t) and c2(t), the k-th breakpoint, denoted BP(k), is defined by
$$BP(k) = \min\{\, t \mid c_1(t) \wedge c_2(t) \wedge (t > BP(k-1)) \,\}. \tag{4.10}$$
Figure 4.3: Example of TASM-based modeling approach, and execution patterns under
conventional self-timed execution and MSTE. (b) Execution pattern under conventional
self-timed execution. (c) Execution pattern under MSTE.
Fig. 4.3 shows a concrete example to illustrate our TASM-based modeling approach;
Fig. 4.3(b) shows how the associated self-timed schedule evolves under conventional
self-timed execution; and Fig. 4.3(c) shows how execution of the TASM evolves
under MSTE, our modified form of self-timed execution. In Fig. 4.3(c), c1(t) in (4.6) is
true in the shaded areas of the timeline for $v^T_{src}$, and c2(t) is true at t = 0, 5, 10, 15, when
both actors are idle. Hence, the dotted lines represent the breakpoints, and these breakpoints
divide an iteration of SP into three subperiods.
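The two breakpoint conditions can also be checked mechanically on a sampled execution trace. The sketch below applies definition (4.10) literally to integer time instants; the function names and the hand-made trace in the usage lines are hypothetical and are not taken from Fig. 4.3.

```python
def condition1(tau1, p, c):
    """Condition 1: max(p, c) <= tau1 < p + c."""
    return max(p, c) <= tau1 < p + c


def breakpoints(tau1_trace, src_busy, snk_busy, p, c):
    """Return every time instant t at which both breakpoint conditions hold.

    tau1_trace[t] is the token count on the filled-space edge at time t;
    src_busy and snk_busy are the sets of intra-firing times of the two
    actors, so Condition 2 holds when t lies in neither set.
    """
    return [t for t, tau1 in enumerate(tau1_trace)
            if condition1(tau1, p, c)
            and t not in src_busy and t not in snk_busy]


# Hypothetical trace for assumed rates p = 3 and c = 5 over t = 0..5.
trace = [5, 0, 0, 3, 3, 6]
busy = {1, 2, 3, 4}                           # both actors execute during t = 1..4
found = breakpoints(trace, busy, busy, 3, 5)  # instants 0 and 5 qualify
```

In this assumed trace, the interval between the two reported instants forms one subperiod.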
Since there are enough tokens on edge e1 to keep actor $v^T_{snk}$ running after the second
and fourth firings of actor $v^T_{src}$ in Fig. 4.3(c), it is not essential (to achieve maximum
throughput) for $v^T_{src}$ to start its third and fifth firings immediately after its corresponding
previous firings (i.e., $v^T_{src}$ can remain idle for some time while $v^T_{snk}$ consumes tokens
from the edge $(v^T_{src}, v^T_{snk})$). Thus, in this example, MSTE does not reduce throughput
performance compared to standard self-timed execution. Indeed, on practical examples, we
have found that the constraints imposed in MSTE typically do not affect throughput. At
the same time, MSTE results in subperiods, which improve the efficiency with which we
can analyze execution in terms of metrics that include performance and buffering requirements.
4.6 Properties of subperiods in TASM
As described in Sections 4.5.2 and 4.5.3, MSTE leads to efficient static analysis
because an execution pattern under MSTE can be decomposed into a periodic pattern,
and such a pattern can be further decomposed into a sequence of subperiods (smaller
patterns). A subperiod can be more precisely defined as the time between successive
breakpoints. For convenience in this discussion, let the greatest common divisor (GCD)
of p(ei) and c(ei) be denoted by g(ei), and consider the following two mutually exclusive
scenarios:
$$g(e_i) \neq \min(p(e_i), c(e_i)), \tag{4.11}$$
and
$$g(e_i) = \min(p(e_i), c(e_i)). \tag{4.12}$$
Under Scenario (4.11), we distinguish between two different types of subperiods
that occur, and we refer to these types as SPα and SPβ. Each of these two types consists
of a fixed number of firings of $v^T_{src}$ and $v^T_{snk}$. Thus, an iteration of SP is a sequence of
subperiods, where each subperiod in the sequence takes on one of two statically-known
forms, $SP_\alpha$ and $SP_\beta$. The specific numbers of firings are summarized in Table 4.1.
Here, $f_{SP_\lambda}(X)$ represents the number of firings of actor X that occur in a subperiod of
type λ ∈ {α,β}. For example, in Fig. 4.3(c), the first two of these subperiods are of type
α, and the third is of type β.
Under Scenario (4.12), there exists only one type of subperiod in SP. In this case,
p(ei) divides c(ei) or c(ei) divides p(ei), and it follows that the numbers of firings in
Table 4.1 for the source and sink actors are the same between the rows corresponding to
type α and type β. In other words, under Scenario (4.12), SPα and SPβ are identical, and
thus, execution proceeds based on only one type of subperiod.
In summary, there are in general two types of subperiods to consider, $SP_\alpha$ and
$SP_\beta$, and these forms are identical under Scenario (4.12). Execution within SP can always
be broken down into a succession of subperiods, where each individual subperiod conforms
to one of these two forms. This is established by the following two lemmas. Basic
notation related to TASM, which is used in our formulation of these lemmas, is summarized
in Fig. 4.5(b). Proofs of theorems and lemmas are omitted throughout the chapter
due to lack of space.
Table 4.1: The number of firings of $v^T_{src}$ and $v^T_{snk}$ in subperiods α and β of TASM
Case            | Subperiod | Firings of v^T_src | Firings of v^T_snk
p(ei) ≥ c(ei)   | SPα       | 1                  | ⌊p(ei)/c(ei)⌋
p(ei) ≥ c(ei)   | SPβ       | 1                  | ⌈p(ei)/c(ei)⌉
p(ei) < c(ei)   | SPα       | ⌈c(ei)/p(ei)⌉      | 1
p(ei) < c(ei)   | SPβ       | ⌊c(ei)/p(ei)⌋      | 1
Lemma 1. Suppose that we are given a TASM $G^T_i$ under MSTE. If $p(e_i) \geq c(e_i)$, then in
each subperiod, $v^T_{src}$ has exactly one firing, and $v^T_{snk}$ has either $\lfloor p(e_i)/c(e_i) \rfloor$ or
$\lceil p(e_i)/c(e_i) \rceil$ firings.
Proof. Suppose that
$$p(e_i) \geq c(e_i) \text{ in } G^T_i. \tag{4.13}$$
Then from the definition of TASM, each firing of $v^T_{src}$ produces $p(e_i)$ tokens on $e^T_{(i,1)}$. Thus, after the
first firing of $v^T_{src}$ in each subperiod, $\tau_1(t) \geq p(e_i)$. From (4.4), $v^T_{src}$ cannot be fired again
until $\tau_1(t)$ becomes smaller than $p(e_i)$. Meanwhile, each firing of $v^T_{snk}$ reduces $\tau_1(t)$, and
the breakpoint condition (see (4.10)) is satisfied before the next firing of $v^T_{src}$. Since, by
definition, each subperiod ends at a breakpoint, there exists only one firing of $v^T_{src}$ in each
subperiod.
Now, we determine the number of firings of $v^T_{snk}$ in a subperiod. Let breakpoint
BP(k) denote the time at which the kth subperiod starts within the enclosing schedule
period SP. As shown above, $v^T_{src}$ is fired exactly once in each subperiod. Also, $v^T_{snk}$ is fired
continuously until $\tau_1(t)$ meets the next breakpoint condition. Thus, the number of $v^T_{snk}$
firings in the subperiod is
$$f_{SP_\lambda}(v^T_{snk}) = \min\{\, j \mid \tau_1(BP(k)) + p(e_i) - j \cdot c(e_i) < p(e_i) + c(e_i) \,\} = \lfloor \tau_1(BP(k))/c(e_i) \rfloor, \tag{4.14}$$
where λ ∈ {α,β}. The left-hand side of the inequality in (4.14) represents the change in
$\tau_1(t)$ due to $j$ firings of $v^T_{snk}$. The initial value of $\tau_1(t)$ is $\tau_1(BP(k))$, and $p(e_i)$ is added
by the single firing of $v^T_{src}$. As $v^T_{snk}$ fires, $\tau_1(t)$ is reduced until it meets the next
breakpoint condition. From (4.6), the minimum and maximum of $\tau_1(BP(k))$ are $p(e_i)$ and
$(p(e_i) + c(e_i) - g(e_i))$, respectively, because $\tau_1(t)$ is always a multiple of $g(e_i)$. We substitute
these minimum and maximum values of $\tau_1(BP(k))$ into (4.14) to compute $f_{SP_\lambda}(v^T_{snk})$. If
$g(e_i) \neq c(e_i)$, then $f_{SP_\lambda}(v^T_{snk})$ is either $\lfloor p(e_i)/c(e_i) \rfloor$ or $\lceil p(e_i)/c(e_i) \rceil$ $(= \lfloor p(e_i)/c(e_i) + 1 \rfloor)$.
On the other hand, if $g(e_i) = c(e_i)$, then $f_{SP_\lambda}(v^T_{snk}) = p(e_i)/c(e_i)$.
Lemma 2. Suppose that we are given a TASM $G^T_i$ under MSTE. If $p(e_i) < c(e_i)$, then in
each subperiod, $v^T_{snk}$ has exactly one firing, and $v^T_{src}$ has either $\lfloor c(e_i)/p(e_i) \rfloor$ or
$\lceil c(e_i)/p(e_i) \rceil$ firings.
Proof. Suppose that
$$p(e_i) < c(e_i) \text{ in } G^T_i. \tag{4.15}$$
First, we count the number of firings of $v^T_{snk}$ in each subperiod. Let $t_a$ be the time at
which the first firing of $v^T_{snk}$ completes. If $\tau_1(t_a) < c(e_i)$, then $v^T_{snk}$ cannot be fired at time
$t_a$. In this case, $\tau_1(t)$ will increase as time progresses due to
one or more firings of $v^T_{src}$, and eventually $\tau_1(t)$ will reach or exceed $c(e_i)$, which satisfies
the breakpoint condition (4.10) and terminates the subperiod.
On the other hand, if τ1(ta)≥ c(ei), then it follows from (4.15) that condition c1,a
in (4.7) holds. Furthermore, observe from the MSTE firing rule together with (4.15) that
vTsrc cannot be fired if τ1(ta) ≥ c(ei). Also, a single firing of vTsrc adds p(ei) tokens to
τ1(t). Therefore, τ1(ta) is always smaller than p(ei) + c(ei). Thus, c2,a in (4.8) holds,
and the overall breakpoint condition (see (4.10)) also holds. Thus, ta marks the end of a
subperiod, and by the definition of ta, vTsnk fires exactly once within this subperiod.
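Next, we determine the number of firings of $v^T_{src}$ in each subperiod. In a similar
manner to (4.14), the number of $v^T_{src}$ firings in the subperiod is
$$f_{SP_\lambda}(v^T_{src}) = \min\{\, j \mid \tau_1(BP(k)) - c(e_i) + j \cdot p(e_i) \geq c(e_i) \,\} = \lceil (2 c(e_i) - \tau_1(BP(k)))/p(e_i) \rceil, \tag{4.16}$$
where λ ∈ {α,β}. The minimum and maximum of $\tau_1(BP(k))$ are then applied to (4.16) to
compute $f_{SP_\lambda}(v^T_{src})$. If $g(e_i) = p(e_i)$, then $f_{SP_\lambda}(v^T_{src})$ is $c(e_i)/p(e_i)$. On the other hand, if
$g(e_i) \neq p(e_i)$, then $f_{SP_\lambda}(v^T_{src})$ is either $\lceil c(e_i)/p(e_i) \rceil$ or $\lfloor c(e_i)/p(e_i) \rfloor$ $(= \lceil c(e_i)/p(e_i) - 1 \rceil)$.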
From Lemmas 1 and 2, it follows that the numbers of firings of $v^T_{src}$ and $v^T_{snk}$ in a
subperiod can be determined as shown in Table 4.1.
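The closed forms in (4.14) and (4.16) can be exercised with a small token-level walk through successive subperiods. The sketch below is untimed: it tracks only the value of τ1 at each breakpoint. The function names and the example rates are assumptions chosen for illustration.

```python
def subperiod_firings(tau1_bp, p, c):
    """Return (src_firings, snk_firings, tau1_at_next_breakpoint)."""
    if not max(p, c) <= tau1_bp < p + c:
        raise ValueError("tau1_bp does not satisfy Condition 1")
    if p >= c:
        src, snk = 1, tau1_bp // c                 # floor form of (4.14)
    else:
        src, snk = -((tau1_bp - 2 * c) // p), 1    # ceiling form of (4.16)
    return src, snk, tau1_bp + src * p - snk * c


def walk_period(tau1_start, p, c):
    """List the subperiods visited until tau1 returns to its starting value."""
    steps, tau1 = [], tau1_start
    for _ in range(p + c):   # safety bound on the number of subperiods
        src, snk, tau1 = subperiod_firings(tau1, p, c)
        steps.append((src, snk, tau1))
        if tau1 == tau1_start:
            return steps
    raise RuntimeError("no periodic pattern found")
```

With assumed rates p = 3 and c = 5 and τ1 = 5 at the first breakpoint, the walk visits three subperiods with 2, 2, and 1 source firings and one sink firing each, so one period contains 5 source firings and 3 sink firings, in agreement with the repetition counts for these rates.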
4.7 Throughput analysis in TASM
In this section, we analyze the pattern of actor firings in TASM and analyze the
impact of allocated buffer sizes on the achieved throughput.
4.7.1 Firing pattern analysis
We begin with the following lemma, which relates tokens produced and consumed
by the source and sink actors, respectively, in TASM.
Lemma 3. Given a TASM $G^T_i$ under MSTE, tokens produced by $v^T_{src}$ in a subperiod are
never consumed by $v^T_{snk}$ in the same subperiod.
Proof. We prove this lemma by contradiction. Suppose that
$$C_k > \tau_1(BP(k)), \tag{4.17}$$
where $C_k$ is the number of tokens consumed by $v^T_{snk}$ in the kth subperiod. Inequality (4.17) states
that at least one token produced by $v^T_{src}$ in the kth subperiod is consumed by $v^T_{snk}$ within
the same kth subperiod. Let k′ denote the index of the subperiod that immediately follows
subperiod k; that is, k′ = 1 (the first subperiod in the next periodic schedule iteration) if
k represents the last subperiod within SP, and otherwise, k′ = (k + 1).
We examine two separate cases, and derive contradictions in both cases.
Case 1:
$$p(e_i) \geq c(e_i) \text{ in } G^T_i. \tag{4.18}$$
Let $\tau_1(BP(k)) = p(e_i) + \varepsilon$, where BP(k) is the beginning of the kth subperiod, and
$0 \leq \varepsilon < c(e_i)$. From (4.17), we have that the total number of tokens consumed by $v^T_{snk}$ in the kth
subperiod must be greater than $(p(e_i) + \varepsilon)$. Since there is a single firing of $v^T_{src}$ in each
subperiod (from Lemma 1), we have that at the end of the kth subperiod,
$$\tau_1(BP(k')) = (p(e_i) + \varepsilon + p(e_i) - C_k). \tag{4.19}$$
Since $C_k > (p(e_i) + \varepsilon)$, $\tau_1(BP(k')) < p(e_i)$. This contradicts $c_{1,a}(t)$ (see (4.7)), which
was assumed to hold by the definition of a breakpoint.
Case 2: Now suppose that
$$p(e_i) < c(e_i) \text{ in } G^T_i. \tag{4.20}$$
From Lemma 2, we have that
$$f_{SP_\lambda}(v^T_{snk}) = 1. \tag{4.21}$$
Let $\tau_1(BP(k)) = c(e_i) + \varepsilon$, where $0 \leq \varepsilon < p(e_i)$. From (4.17), we have that the total
number of tokens consumed by $v^T_{snk}$ in the kth subperiod must be greater than $(c(e_i) + \varepsilon)$.
Since, from the definition of TASM, $v^T_{snk}$ consumes $c(e_i)$ tokens on every firing, the
number of firings of $v^T_{snk}$ must be greater than one in the kth subperiod. This contradicts
(4.21).
Lemma 3 states that firing of $v^T_{snk}$ is never delayed in a subperiod (i.e., it is not
preceded by any idle time at the start of the subperiod). This is because $v^T_{snk}$ does not
need to wait for tokens produced by $v^T_{src}$ in the same subperiod. While $\tau_1(BP(k))$
tokens are always sufficient to avoid delaying $v^T_{snk}$ during each subperiod k, $v^T_{src}$ may be
delayed (due to a value of $\tau_2(BP(k))$ that is too small) if the allocated buffer size $D(e_i)$
is not sufficient. In other words, $v^T_{src}$ can be delayed to wait for tokens on $e^T_{(i,2)}$ that must
be produced by $v^T_{snk}$ or, equivalently, $v_{src}$ waits until one or more firings of $v_{snk}$ generate
sufficient empty space in the buffer shown in Fig. 4.5(a). Hence, the firing pattern of $v^T_{src}$
in a subperiod is in general a function of the allocated buffer size $D(e_i)$.
Before exploring the relationship between $D(e_i)$ and the firing pattern, we first show
that $\tau_1(BP(k))$ determines the type of the kth subperiod. From the second breakpoint condition
c2(t) in (4.9), both actors are idle at a breakpoint, so every buffer cell is either filled or
free, and the buffer size $D(e_i)$ allocated on ei in Fig. 4.5(a) can be represented by
$$D(e_i) = \tau_1(BP(k)) + \tau_2(BP(k)). \tag{4.22}$$
Thus, the condition on $\tau_1(BP(k))$ that decides the type of a subperiod is important in deriving
a buffer size equation subject to a certain firing pattern of $v^T_{src}$ and $v^T_{snk}$. The type of a
subperiod is determined as a function of $\tau_1(BP(k))$. Our analysis here is divided into two
cases: $p(e_i) \geq c(e_i)$, and $p(e_i) < c(e_i)$.
Case 1: first suppose that $p(e_i) \geq c(e_i)$. Then from Lemma 1, there exist $\lfloor p(e_i)/c(e_i) \rfloor$
and $\lceil p(e_i)/c(e_i) \rceil$ firings of $v^T_{snk}$ in $SP_\alpha$ and $SP_\beta$, respectively. Since $v^T_{snk}$ only consumes
$\tau_1(BP(k))$ in each kth subperiod, as established by Lemma 3, we have that $\tau_1(BP(k))$










Case 2: now suppose that p(ei)< c(ei). Then from reasoning that is analogous to
















We first derive a buffer size that is sufficient to guarantee that firings of $v^T_{src}$ are never
delayed. This derivation is general in the sense that it holds in the absence of information
about the execution times of $v^T_{src}$ and $v^T_{snk}$ (beyond the assumption that the execution times
are constant). Thus, this execution pattern analysis is useful for applications in which
actor execution times are known to be constant, but whose constant values are not known
exactly. Furthermore, this analysis provides a foundation for computing tighter buffer
size requirements in the presence of known (constant) execution times (as we show in
Section 4.8).
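The way in which $\tau_1(BP(k))$ selects the subperiod type in the two cases discussed above can be sketched directly from (4.14) and (4.16). The ranges below are a derivation under the assumption that the types are assigned as in the surrounding text (floor and ceiling sink-firing counts for $SP_\alpha$ and $SP_\beta$ when $p(e_i) \geq c(e_i)$; ceiling and floor source-firing counts when $p(e_i) < c(e_i)$). They should be read as a consistency check rather than as a restatement of the original equations.

```latex
% Case p(e_i) >= c(e_i): by (4.14), the sink fires floor(tau_1(BP(k)) / c(e_i)) times.
\begin{align*}
SP_\alpha &: \quad p(e_i) \le \tau_1(BP(k)) < \left(\lfloor p(e_i)/c(e_i) \rfloor + 1\right) c(e_i), \\
SP_\beta  &: \quad \left(\lfloor p(e_i)/c(e_i) \rfloor + 1\right) c(e_i) \le \tau_1(BP(k)) < p(e_i) + c(e_i).
\end{align*}
% Case p(e_i) < c(e_i), with p(e_i) not dividing c(e_i): by (4.16), the source
% fires ceil((2 c(e_i) - tau_1(BP(k))) / p(e_i)) times.
\begin{align*}
SP_\alpha &: \quad c(e_i) \le \tau_1(BP(k)) < 2\,c(e_i) - \lfloor c(e_i)/p(e_i) \rfloor\, p(e_i), \\
SP_\beta  &: \quad 2\,c(e_i) - \lfloor c(e_i)/p(e_i) \rfloor\, p(e_i) \le \tau_1(BP(k)) < p(e_i) + c(e_i).
\end{align*}
```

When $c(e_i)$ divides $p(e_i)$ in the first case, the $SP_\beta$ range is empty, which is consistent with the two types coinciding under Scenario (4.12). For assumed rates $p(e_i) = 3$ and $c(e_i) = 5$, the second case gives type α for $\tau_1(BP(k)) \in \{5, 6\}$ and type β for $\tau_1(BP(k)) = 7$.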
Theorem 1. Suppose that we are given a TASM GTi under MSTE, and suppose that the
buffer size is given by
D(ei) = max(p(ei), c(ei))+p(ei)+c(ei)−g(ei). (4.25)
Then firings of vTsrc are never delayed in any subperiod.
Proof. We examine two possible cases for $G^T_i$, and derive a common buffer size equation
in both cases.
Case 1:
$$p(e_i) \geq c(e_i) \text{ in } G^T_i. \tag{4.26}$$
From Lemma 1, there is a single firing of $v^T_{src}$ in a subperiod. To fire $v^T_{src}$ without any delay
at the beginning of the kth subperiod, we must have that $\tau_2(BP(k)) \geq p(e_i)$. From (4.22),
the corresponding buffer size requirement is given by
$$D(e_i) \geq \tau_1(BP(k)) + p(e_i).$$
From (4.6), we know that $(p(e_i) + c(e_i) - g(e_i))$ is an upper bound for $\tau_1(BP(k))$.
Thus, $v^T_{src}$ can be executed without delay if
$$D(e_i) = 2\,p(e_i) + c(e_i) - g(e_i), \tag{4.27}$$
which matches (4.25) because $\max(p(e_i), c(e_i)) = p(e_i)$ in this case.
Case 2:
$$p(e_i) < c(e_i) \text{ in } G^T_i. \tag{4.28}$$
We divide Case 2 into two sub-cases:
• Case 2a: $g(e_i) \neq p(e_i)$.
• Case 2b: $g(e_i) = p(e_i)$.
We begin with Case 2a. In this case, we have from Lemma 2 that the numbers
of firings of $v^T_{src}$ in $SP_\alpha$ and $SP_\beta$ are $\lceil c(e_i)/p(e_i) \rceil$ and $\lfloor c(e_i)/p(e_i) \rfloor$, respectively. Thus,
in a type α subperiod, $\lceil c(e_i)/p(e_i) \rceil \cdot p(e_i)$ is a lower bound for $\tau_2(BP(k))$ to achieve
$\lceil c(e_i)/p(e_i) \rceil$ firings of $v^T_{src}$. From (4.24), an upper bound on
$\tau_1(BP(k))$ for $SP_\alpha$ is given by
$$\left(2\,c(e_i) - \lfloor c(e_i)/p(e_i) \rfloor \cdot p(e_i) - g(e_i)\right). \tag{4.29}$$
Thus, $v^T_{src}$ can be executed without delay in a type α subperiod if
$$\begin{aligned} D(e_i) &\geq \tau_1(BP(k)) + \tau_2(BP(k)) \\ &= \left(2\,c(e_i) - \lfloor c(e_i)/p(e_i) \rfloor \cdot p(e_i) - g(e_i)\right) + \lceil c(e_i)/p(e_i) \rceil \cdot p(e_i) \\ &= 2\,c(e_i) + p(e_i) - g(e_i). \end{aligned} \tag{4.30}$$
Similarly, in a type β subperiod, bc(ei)/p(ei)c∗p(ei) is a lower bound on τ2(BP(k))
to achieve bc(ei)/p(ei)c firings of vTsrc. Also, from (4.6), we have that c(ei)+p(ei)−g(ei)
is an upper bound on τ1(BP(k)) in a type β subperiod.
Thus, vTsrc can be executed without delay in a type β subperiod if




Because (4.30) is a sufficient condition for (4.31), $v^T_{src}$ can be executed without delay in
both α- and β-type subperiods if
$$D(e_i) = 2\,c(e_i) + p(e_i) - g(e_i).$$
The right-hand side of this equation matches (4.25), since $\max(p(e_i), c(e_i)) = c(e_i)$ in Case 2.
Now, we examine Case 2b: $g(e_i) = p(e_i)$, which means that $c(e_i) = z \times p(e_i)$ for
some integer $z$, and from Table 4.1, the type α and type β subperiods are identical. In this
case, we have from (4.28) that the number of firings of $v^T_{src}$ in any subperiod is $z$. Thus, to
achieve $z$ firings of $v^T_{src}$, we must have that $\tau_2(BP(k)) \geq c(e_i)$. Also, from (4.6), we have
that $\tau_1(BP(k)) \leq c(e_i)$.
Thus, $v^T_{src}$ can be executed without delay in any subperiod if
$$\begin{aligned}
D(e_i) &\geq \tau_1(BP(k)) + \tau_2(BP(k)) \\
       &= 2\,c(e_i) \\
       &= \max(p(e_i), c(e_i)) + p(e_i) + c(e_i) - g(e_i).
\end{aligned} \tag{4.32}$$
Again, the right-hand side of the last equation (in (4.32)) matches (4.25).
In summary, from our analysis of Cases 1, 2a, and 2b, we have that $v^T_{src}$ is never
delayed in a subperiod if
$$D(e_i) \geq \max(p(e_i), c(e_i)) + p(e_i) + c(e_i) - g(e_i). \tag{4.33}$$
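As a quick numerical check of Theorem 1, the bound in (4.25) can be evaluated directly for each of the three cases in the proof. The following is a minimal sketch only: the function name is hypothetical, and it assumes that g(ei) denotes gcd(p(ei), c(ei)), which is consistent with Case 2b (where g(ei) = p(ei) implies that p(ei) divides c(ei)).

```python
from math import gcd


def theorem1_buffer_size(p: int, c: int) -> int:
    """Evaluate the buffer size of (4.25) for production rate p and consumption rate c.

    Assumption (not stated in this section): g(e_i) is taken to be gcd(p, c).
    """
    if p <= 0 or c <= 0:
        raise ValueError("production and consumption rates must be positive")
    g = gcd(p, c)
    return max(p, c) + p + c - g


# Case 1 (p >= c): the bound reduces to 2p + c - g, as in (4.27).
print(theorem1_buffer_size(3, 2))  # 2*3 + 2 - 1 = 7
# Case 2a (p < c, p does not divide c): the bound reduces to 2c + p - g, as in (4.30).
print(theorem1_buffer_size(2, 3))  # 2*3 + 2 - 1 = 7
# Case 2b (p < c, g = p): the bound reduces to 2c, as in (4.32).
print(theorem1_buffer_size(2, 4))  # 2*4 = 8
```

All three cases are covered by the single max(...) expression, which is why the proof can close with one common bound.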
In the next two theorems, we derive buffer size levels that are sufficient to guarantee
certain kinds of firing patterns for $v^T_{src}$ and $v^T_{snk}$. These firing patterns are useful in
throughput analysis.
Theorem 2. Suppose that $G^T_i$ is executed under MSTE; $p(e_i) \geq c(e_i)$; $\gamma$ is an integer
satisfying $0 \leq \gamma \leq f_{SP_\alpha}(v^T_{snk})$; and
$$D(e_i) = 2\,p(e_i) + (1 - \gamma)\,c(e_i) - g(e_i). \tag{4.34}$$
Then in any given subperiod, there is exactly one firing of $v^T_{src}$. Furthermore, if $n$ firings
of $v^T_{snk}$ precede $v^T_{src}$ in a given subperiod, then $n \leq \gamma$. In other words, the single firing of
$v^T_{src}$ in a subperiod occurs after at most $\gamma$ firings of $v^T_{snk}$.
Proof. From Lemma 1, there is exactly one firing of $v^T_{src}$ in any subperiod. Also, from the
definition of TASM, $v^T_{src}$ can be fired whenever $\tau_2(t) \geq p(e_i)$. Now recall that the number
of tokens on $e^T_{(i,2)}$ at the beginning of the $k$th subperiod is denoted by $\tau_2(BP(k))$. Clearly,
the first $\gamma$ firings of $v^T_{snk}$ in a subperiod produce $\gamma\,c(e_i)$ tokens on $e^T_{(i,2)}$. Thus, $v^T_{src}$ can
be fired after the $\gamma$-th firing of $v^T_{snk}$ within the $k$th subperiod if
$$\tau_2(BP(k)) + \gamma\,c(e_i) \geq p(e_i). \tag{4.35}$$
From (4.6),
$$\tau_1(BP(k)) \leq p(e_i) + c(e_i) - g(e_i) \text{ for all } k. \tag{4.36}$$
It follows from (4.22), (4.35), and (4.36) that the single firing of $v^T_{src}$ occurs after at most
$\gamma$ firings of $v^T_{snk}$ if
$$D(e_i) \geq 2\,p(e_i) + (1 - \gamma)\,c(e_i) - g(e_i).$$
Theorem 2 tells us how long a single firing of $v^T_{src}$ is delayed in each subperiod when
the buffer size for $e_i$ in Fig. 4.5(a) is bounded.
Theorem 3. Suppose that $G^T_i$ is executed under MSTE; $p(e_i) < c(e_i)$; $\delta$ is an integer
satisfying $0 \leq \delta \leq f_{SP_\beta}(v^T_{src})$; and
$$D(e_i) = (1 + \delta)\,p(e_i) + c(e_i) - g(e_i). \tag{4.37}$$
Then in any given subperiod, there is exactly one firing of $v^T_{snk}$. Furthermore, if $n$ firings
of $v^T_{src}$ occur before the end of the $v^T_{snk}$ firing in a given subperiod, then $n \leq \delta$. In other
words, $v^T_{src}$ is fired at most $\delta$ times in a subperiod before the single firing of $v^T_{snk}$ completes.
Proof. From Lemma 2, there is exactly one firing of $v^T_{snk}$. Also, from the definition of
TASM, $v^T_{snk}$ does not produce any tokens on $e^T_{(i,2)}$ within a given subperiod before $v^T_{snk}$
completes its firing in that subperiod. In other words, only $\tau_2(BP(k))$ “empty buffer
slots” are filled when firing $v^T_{src}$. Thus, in each $k$th subperiod, $v^T_{src}$ can be fired $\delta$ times
before the single firing of $v^T_{snk}$ completes if
$$\tau_2(BP(k)) \geq \delta\,p(e_i). \tag{4.38}$$
From (4.6),
$$\tau_1(BP(k)) \leq p(e_i) + c(e_i) - g(e_i) \text{ for all } k. \tag{4.39}$$
It follows from (4.22), (4.38), and (4.39) that $v^T_{src}$ is fired at most $\delta$ times in a subperiod
before the single firing of $v^T_{snk}$ completes if
$$D(e_i) \geq (1 + \delta)\,p(e_i) + c(e_i) - g(e_i).$$
Theorem 3 tells us, for a given bounded buffer size, how many times $v^T_{src}$ can be fired
independently of $v^T_{snk}$ within a given subperiod. After firing $v^T_{src}$ $\delta$ times in a subperiod,
any remaining firings of $v^T_{src}$ are delayed until $v^T_{snk}$ completes its execution.
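To illustrate how the parameters γ and δ trade buffer space against firing delay, the reduced buffer sizes of (4.34) and (4.37) can be tabulated for small rates. This is a sketch under stated assumptions: the function names are hypothetical, g(ei) is again assumed to be gcd(p(ei), c(ei)), and the upper limits on γ and δ required by Theorems 2 and 3 are left to the caller because the firing counts that define them are given elsewhere in the chapter.

```python
from math import gcd


def theorem2_buffer_size(p: int, c: int, gamma: int) -> int:
    """Buffer size of (4.34); applies when p >= c.

    The caller must ensure gamma does not exceed the number of sink firings
    in a type-alpha subperiod (that upper limit is not checked here).
    """
    if p < c:
        raise ValueError("Theorem 2 assumes p >= c")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return 2 * p + (1 - gamma) * c - gcd(p, c)  # assumption: g = gcd(p, c)


def theorem3_buffer_size(p: int, c: int, delta: int) -> int:
    """Buffer size of (4.37); applies when p < c.

    The caller must ensure delta does not exceed the number of source firings
    in a type-beta subperiod (that upper limit is not checked here).
    """
    if p >= c:
        raise ValueError("Theorem 3 assumes p < c")
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return (1 + delta) * p + c - gcd(p, c)  # assumption: g = gcd(p, c)


# With gamma = 0, (4.34) coincides with the Case 1 bound (4.27).
print(theorem2_buffer_size(4, 2, 0))  # 8
print(theorem2_buffer_size(4, 2, 1))  # 6
# Larger delta lets the source run further ahead of the sink.
print(theorem3_buffer_size(2, 5, 0))  # 6
print(theorem3_buffer_size(2, 5, 2))  # 10
```

Each unit increase in γ saves c(ei) buffer slots, while each unit increase in δ costs p(ei) additional slots.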
4.7.2 Saturated TASM systems
In this section, we assume that the execution times of actors are constant and known
a priori, and we develop methods for throughput analysis of MSTE under this assumption.
We begin by defining some notation.
Definition 1. Suppose that we are given a TASM $G^T_i$. Then the execution times of $v^T_{src}$ and





Intuitively, $T(v^T_{src})$ and $T(v^T_{snk})$ give the time required for each actor to complete a
single firing on $r_{src}$ and $r_{snk}$, respectively. Our development of throughput analysis for
MSTE also involves the following definition.
Definition 2. Suppose that we are given a TASM $G^T_i$ that executes under MSTE, and
suppose that in each subperiod, the resource $r_{src}$ operates without any idle time; that
is, $SP(t, r_{src}) = v^T_{src}$ for all $t \in \{0, 1, \ldots, (t_i - 1)\}$. Then we say that $G^T_i$ is source-saturated.
Similarly, if $r_{snk}$ executes without any idle time, then we say that $G^T_i$ is sink-saturated.
For example, the execution pattern shown in Fig. 4.3(c) illustrates a sink-saturated
scenario. Clearly, since the net production and consumption rates of $v^T_{src}$ and $v^T_{snk}$ are
balanced across SP, it follows that the original SDF graph (Fig. 4.5(a)) executes at its
maximum achievable throughput if $G^T_i$ is source- or sink-saturated. This is summarized
in the following property.
Property 1. A TASM that is source-saturated or sink-saturated executes at its maximum
achievable throughput when it executes under MSTE.
Due to the additional firing rule of MSTE (see (4.4)), the execution of TASM under
MSTE has the following property.





Then $G^T_i$ is either source- or sink-saturated if
$$T_{SP_\lambda}(r_{src}) \geq T_{SP_\lambda}(r_{snk}) \text{ for all } \lambda \in \{\alpha, \beta\} \tag{4.41}$$
or
$$T_{SP_\lambda}(r_{src}) < T_{SP_\lambda}(r_{snk}) \text{ for all } \lambda \in \{\alpha, \beta\}. \tag{4.42}$$
In particular, $G^T_i$ can be neither source- nor sink-saturated under two corner cases, which
we denote as corner case 1 (CC1) and corner case 2 (CC2). CC1 corresponds to the
condition that the following two inequalities both hold:
$$T_{SP_\alpha}(r_{src}) \geq T_{SP_\alpha}(r_{snk}) \tag{4.43}$$
and
$$T_{SP_\beta}(r_{src}) < T_{SP_\beta}(r_{snk}). \tag{4.44}$$
Similarly, CC2 corresponds to the condition that (4.45) and (4.46) both hold:
$$T_{SP_\alpha}(r_{src}) < T_{SP_\alpha}(r_{snk}) \tag{4.45}$$
and
$$T_{SP_\beta}(r_{src}) \geq T_{SP_\beta}(r_{snk}). \tag{4.46}$$
Equation (4.43) means that $r_{snk}$ has nonzero idle time in each α-type subperiod, while (4.44)
holds if $r_{src}$ has nonzero idle time in each β-type subperiod. Clearly, neither $r_{src}$ nor $r_{snk}$
is saturated in such a system.
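The four conditions (4.41) through (4.46) partition all possible comparisons of the two resources across the two subperiod types. The sketch below makes that partition explicit. It is illustrative only: the function name and return labels are hypothetical, and the four arguments are assumed to be the quantities TSPλ(rsrc) and TSPλ(rsnk) for λ ∈ {α, β}, computed by the caller.

```python
def classify_saturation(t_src_alpha: float, t_snk_alpha: float,
                        t_src_beta: float, t_snk_beta: float) -> str:
    """Return which of the conditions (4.41), (4.42), CC1, or CC2 holds.

    Arguments are the values T_SP_lambda(r_src) and T_SP_lambda(r_snk)
    for lambda = alpha and lambda = beta (supplied by the caller).
    """
    src_dominates_alpha = t_src_alpha >= t_snk_alpha
    src_dominates_beta = t_src_beta >= t_snk_beta

    if src_dominates_alpha and src_dominates_beta:
        return "saturated (4.41)"
    if not src_dominates_alpha and not src_dominates_beta:
        return "saturated (4.42)"
    if src_dominates_alpha:
        # (4.43) and (4.44) hold together.
        return "CC1"
    # (4.45) and (4.46) hold together.
    return "CC2"


print(classify_saturation(5, 3, 4, 2))  # saturated (4.41)
print(classify_saturation(5, 3, 1, 4))  # CC1
```

Because exactly one of the four labels applies to any input, a system with a clear bottleneck actor (one side dominating in both subperiod types) always falls into one of the two saturated categories.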
The corner cases CC1 and CC2 represent limitations in our MSTE approach since
our guarantee of maximal throughput, as given by Property 1, does not apply under these
cases. However, we observe that CC1 and CC2 do not apply to a broad class of practical
systems — in particular, systems that contain functional blocks that perform as bottle-
necks, where by a “bottleneck”, we mean a block whose computational complexity is
dominant over other functional blocks. For example, in the dataflow-based 3GPP-Long
Term Evolution (LTE) protocol application developed in [54], the FFT block can be ob-
served to be a bottleneck.
In Section 4.10 we present detailed experimental studies with three practical appli-
cations, all of which involve bottleneck actors and corresponding avoidance of the corner
cases (CC1 and CC2) that prevent source- and sink-saturated execution.
4.8 Analysis of saturated systems
Motivated by our discussion on bottleneck actors and the practical relevance of
source- and sink-saturated systems, we develop in this section a detailed analysis of
throughput-constrained buffer optimization for such systems. Throughout the remainder
of this section, we assume that we are working with a source- or sink-saturated TASM —
i.e., we assume that the corner cases CC1 and CC2 (defined in Section 4.7.2) do not hold.
Definition 3. Suppose that we are given an SDF graph $G = (V, E)$ that executes under
MSTE, and suppose that the time duration of SP (i.e., a single iteration of the periodic
schedule) is denoted by $t_i$. Then by the throughput of an actor $v \in V$, which we represent
by $\Phi(v)$, we mean the number of firings of $v$ that execute per unit time. Since $q[v]$ firings
of an actor $v$ execute in each iteration of SP, we have that
$$\Phi(v) = q[v]/t_i \text{ for all } v \in V. \tag{4.47}$$
Furthermore, by the throughput of $G^T_i$, which we refer to as the TASM throughput,
we mean the reciprocal of the time duration of SP, i.e., $(1/t_i)$.
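As a small worked instance of Definition 3, the following sketch computes the repetitions vector of a two-actor SDF graph and the resulting actor throughputs from (4.47). The function names are hypothetical. The repetitions vector is obtained from the standard SDF balance equation q[src]·p = q[snk]·c, taking the smallest positive integer solution.

```python
from fractions import Fraction
from math import gcd


def repetitions(p: int, c: int) -> dict:
    """Smallest positive integer solution of q[src] * p == q[snk] * c."""
    g = gcd(p, c)
    return {"src": c // g, "snk": p // g}


def actor_throughput(q: dict, t_i: int) -> dict:
    """Phi(v) = q[v] / t_i for every actor v, as in (4.47)."""
    return {v: Fraction(n, t_i) for v, n in q.items()}


q = repetitions(p=2, c=3)           # {'src': 3, 'snk': 2}
phi = actor_throughput(q, t_i=12)   # src: 1/4 firings per time unit, snk: 1/6
tasm_throughput = Fraction(1, 12)   # reciprocal of the iteration period
print(q, phi, tasm_throughput)
```

Exact fractions are used so that throughput values can be compared without rounding error.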
In this section, we show how to determine an upper bound on the buffer size required
to execute $G^T_i$ at its maximum achievable throughput. Henceforth, we refer to the
reciprocal of this maximum achievable throughput as $t_{min}$.
We remind the reader that although this analysis is developed for two-actor SDF
graphs, the methods can be applied to arbitrary tree-structured SDF graphs, as described
in Section 4.1, by using them on each edge (and the underlying two-actor subgraph)
separately and combining the results.
Property 3. Suppose that we are given a TASM $G^T_i$. From the definitions of SP, $t_i$, $t_{min}$,
and actor throughput, we have that
$$\left(\Phi(v^T_{src}) \geq q[v^T_{src}]/t_{min}\right) \text{ and } \left(\Phi(v^T_{snk}) \geq q[v^T_{snk}]/t_{min}\right) \Rightarrow t_i \leq t_{min}.$$
From Lemma 3, firings of $v^T_{snk}$ are never delayed in a subperiod. Also, from Theorem 1,
firings of $v^T_{src}$ are not delayed in a subperiod if $D(e_i)$ is set according to (4.25).
Thus, under such a setting for $D(e_i)$, each actor in $G^T_i$ is fired throughout a subperiod
without any dependency on the other TASM actor. Hence, we establish the following






for $\eta \in \{v^T_{src}, v^T_{snk}\}$. (4.48)
If the execution times of $v^T_{src}$ and $v^T_{snk}$ are known, then it may be possible to exploit
this knowledge to relax the buffering requirements, and thereby save resources on the
target FPGA device. In particular, we can reduce buffering requirements if after applying
the reduced buffer size given by Theorem 2 or Theorem 3 (based on whether $p(e_i) \geq c(e_i)$
or $p(e_i) < c(e_i)$, respectively), the resulting throughput given by (4.48) still meets the
given throughput constraint.
In a given enclosing SDF graph $G = (V, E)$, the minimum achievable iteration period
for SP is given by
$$t_{min} = T(v_{btlneck}) \, q[v_{btlneck}], \tag{4.49}$$
where $v_{btlneck} = \arg\max_{v \in V} \{ q[v] \, T(v) \}$. If an iteration of SP completes exactly every $t_{min}$
time units, then we can conclude that $v_{btlneck}$ is source- or sink-saturated, and the overall
TASM throughput cannot be increased further.
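The bottleneck actor and the minimum iteration period of (4.49) follow from a single pass over the actors. The sketch below is illustrative: the function name, the actor names, and the numeric values are hypothetical, and ties between actors are broken arbitrarily by Python's max.

```python
def bottleneck_and_tmin(q: dict, T: dict) -> tuple:
    """Return (v_btlneck, t_min), where v_btlneck maximizes q[v] * T[v].

    q : repetitions vector, actor -> firings per iteration of SP
    T : execution times, actor -> time units per firing
    """
    v_btlneck = max(q, key=lambda v: q[v] * T[v])
    return v_btlneck, q[v_btlneck] * T[v_btlneck]


# Hypothetical three-actor example: the filter dominates the iteration period.
q = {"source": 3, "filter": 3, "sink": 2}
T = {"source": 1, "filter": 5, "sink": 4}
print(bottleneck_and_tmin(q, T))  # ('filter', 15)
```

If the measured iteration period equals the returned t_min, the resource running the bottleneck actor has no idle time, which is the saturation condition used throughout this section.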
4.9 Application to general tree-structured SDF graphs
Our TASM analysis can be applied iteratively to determine buffer sizes for all edges
in an arbitrary tree-structured SDF graph. This assumption of a tree-structured graph is
needed to ensure that the “extra (feedback) edges” added by the TASM models for differ-
ent SDF graph edges do not “interact” (i.e., introduce new directed cycles in the overall
graph model). Many practical SDF graphs or subsystem models are tree-structured, in-
cluding models for multi-stage sample rate conversion, and various kinds of filterbanks,
as well as the JPEG and OFDM transmitter applications that we examine in Section 4.10.
Algorithm 1 provides a systematic procedure for determining buffer sizes for an ar-
bitrary, tree-structured SDF graph in a way that guarantees that the achieved performance
will satisfy a given throughput constraint. The output of this procedure is a buffer size
function D : E → Zpos, where Zpos denotes the set of positive integers. The complexity of
this algorithm is O(|E|), which renders the approach practical for DSP and FPGA design
tools.
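The per-edge loop of Algorithm 1 can be sketched in a few lines. This is a hedged illustration and not the implementation used in our experiments. The listing of Algorithm 1 as reproduced here ends after δ is computed, so the final assignment in the p < c branch (applying Theorem 3) is an assumption made by symmetry with the p ≥ c branch. Further assumptions: g(e) is taken as gcd(p(e), c(e)); the firing counts that bound γ and δ are supplied by the caller through the hypothetical argument f_bound; and γ and δ are clamped to the ranges required by Theorems 2 and 3.

```python
from math import gcd


def tasm_buffering(edges: dict, exec_time: dict, f_bound: dict) -> dict:
    """Sketch of the per-edge buffer sizing loop (one pass over E).

    edges     : edge name -> (src actor, snk actor, p, c)
    exec_time : actor -> positive integer execution time T(v)
    f_bound   : edge name -> firing count that bounds gamma (when p >= c)
                or delta (when p < c); supplied by the caller
    """
    D = {}
    for e, (src, snk, p, c) in edges.items():
        g = gcd(p, c)                        # assumption: g(e) = gcd(p, c)
        t_src, t_snk = exec_time[src], exec_time[snk]
        if p >= c:
            ceil_ratio = -(-t_src // t_snk)  # integer ceiling of t_src / t_snk
            gamma = max(0, f_bound[e] - ceil_ratio)   # clamp: assumption
            D[e] = 2 * p + (1 - gamma) * c - g        # (4.34)
        else:
            delta = min(t_snk // t_src, f_bound[e])   # clamp: assumption
            D[e] = (1 + delta) * p + c - g            # (4.37), assumed branch
    return D


# Hypothetical chain A -> B -> C.
edges = {"e1": ("A", "B", 4, 2), "e2": ("B", "C", 2, 5)}
exec_time = {"A": 4, "B": 2, "C": 6}
f_bound = {"e1": 2, "e2": 2}
print(tasm_buffering(edges, exec_time, f_bound))  # {'e1': 8, 'e2': 10}
```

Each edge is visited once and handled in constant time, which matches the linear complexity in the number of edges stated above.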
4.10 Experimental results
We have implemented Algorithm 1 in the DIF environment [5], and applied it to
three relevant signal processing applications — a CDtoDAT (compact disc to digital audio
Algorithm 1
INPUT : Tree-structured SDF graph G = (V, E)
      : Actor execution times T : V → Zpos
OUTPUT: Buffer sizes, D : E → Zpos
procedure TASM-BUFFERING(G)
    for each e ∈ E do
        p ← p(e); c ← c(e)
        tsrc ← T(src(e)); tsnk ← T(snk(e))
        if p ≥ c then
            γ = fα(snk(e)) − ⌈tsrc/tsnk⌉
            D(e) ← apply Theorem 2 with γ
        else
            δ = ⌊tsnk/tsrc⌋




tape) sample rate converter, JPEG encoder, and DVB-T OFDM transmitter for digital
video broadcasting as shown in Fig. 4.4. Using National Instruments LabVIEW FPGA
8.5, we have developed FPGA implementations of these three applications along with
corresponding buffer size computation results from Algorithm 1.
LabVIEW is a graphical, dataflow-based programming environment for embedded
system design. LabVIEW features for HDL (hardware description language) synthesis
along with LabVIEW’s dataflow orientation make the tool well-suited to FPGA-based
design of signal processing applications. We have targeted the Xilinx Virtex II Pro P30
embedded in the National Instruments PCI-5640R digital system prototyping board to
synthesize SDF-based application graphs with the buffer size functions computed by Al-
gorithm 1. The base clock rate for our experiments is 40 MHz.
In the CDtoDAT application, the FIR filter in the first conversion stage becomes the
bottleneck (vbtlneck in (4.49)) of the system due to the high number of taps. Similarly, a
Figure 4.4: DIF-based Application specifications
Table 4.2: Sum of result buffer distribution under the maximum throughput
(samples/cycle) and its synthesis result

                                  CDtoDAT      JPEG Encoder   DVB-T OFDM
Algorithm result
    Throughput (samples/cycle)    5.6×10^−3    0.159          0.191
    Buffer sum                    32           34112          9179
Synthesis result
    FPGA slices                   5438         8105           2810
    Block RAM                     5            41             36
    18x18 MULT                    5            41             36
discrete cosine transform (DCT) block in the JPEG encoder and the inverse FFT block in
the DVB-T OFDM transmitter are bottlenecks for their respective applications.
Table 4.2 shows the results of Algorithm 1 and the associated synthesis results on the
targeted FPGA. Based on the synthesis results for the three applications, we verified that
all of the solutions operate at the corresponding maximum achievable throughput levels,
which correspond to the absence of idle time in the execution profiles of the resources
that execute the associated bottleneck actors. Our results are therefore consistent with our




Hardware synthesis technique for parameterized dataflow model
5.1 Introduction
The ever increasing demand for richer applications and multimedia content in mo-
bile devices has fueled the continuous evolution of wireless standards towards bringing
higher data rates and lower latencies to the end user. The third-generation partnership
project (3GPP) has responded to this by recently finalizing the latest cellular standard
called long-term evolution (LTE) [55]. LTE promises data rates of up to 300 Mbps in the
downlink, 150 Mbps in the uplink, spectrum flexibility from 1.4 to 20 MHz, and mobil-
ity support from stationary users all the way to high-speed train speeds with a graceful
degradation of service. In order to meet these demanding requirements, both base station
and user equipment also require much higher complexity than ever before. In order to
meet the ever tightening time-to-market requirements and resource constraints, the ability
to quickly design, simulate, and prototype complex communication systems such as LTE
is becoming more and more valuable to equipment vendors and network operators alike.
The ability to input a design at an appropriate level of abstraction, and having the tools
to make necessary trade-offs early in the design process are becoming more and more
crucial in this rapidly evolving marketplace.
Synchronous dataflow (SDF) [1] has been used widely as an efficient model of com-
putation (MOC) to analyze performance and resource requirements when implementing
DSP algorithms on various kinds of target architectures (e.g., see [56, 57]). The SDF
model has been incorporated in many commercial tools for DSP system design, such as
ADS from Agilent, Signal Processing Designer from CoWare, and System Studio from
Synopsys. In SDF semantics, DSP applications are modeled by directed graphs in which
vertices (actors) correspond to computational blocks, and edges represent the passage of
data between blocks. SDF imposes the restriction that the number of data values (tokens)
that is produced on each output edge is constant per actor execution (firing), and similarly,
the number of tokens consumed per firing is constant for each actor/input-edge pair. Thus,
SDF does not accommodate actors that can have dynamically varying token production
and consumption rates. Such “dynamic dataflow” actors are employed in many modern
DSP applications, including the LTE physical layer, and therefore, when developing such
applications, we must explore models of computation that are more general than pure
SDF.
Parameterized dataflow (PSDF) is a generalization of SDF that allows dynamically-
changing production and consumption rates that are formulated in terms of changes to
parameters of parameterized SDF graphs (PSDF graphs) [58]. A PSDF graph can be
viewed as a parameterized family of graphs such that each instance in the family (i.e.,
each specific setting of the parameters) corresponds to an SDF graph. PSDF significantly
improves upon the expressive power of SDF while providing a framework in which many
SDF analysis techniques can be naturally adapted into parameterized versions. For ex-
ample, techniques for constructing efficient parameterized looped schedules have been
developed for PSDF graphs [58]. These scheduling techniques can provide for efficient
simulation or software synthesis from PSDF specifications.







Figure 5.1: Example LTE subframe showing multiplexing of various channels on a 2D
time-frequency grid (not to scale).
In this chapter, we apply PSDF to modeling the LTE physical layer protocol at the
dataflow model level of the design methodology illustrated in Fig. 1.1. A distinguishing
aspect of our approach is that we develop a PSDF-based hardware synthesis framework
for efficient utilization of parallel processing capabilities in FPGAs. In contrast, the pa-
rameterized looped schedules described above have been designed for single-processor,
software-based implementations. Also, our work develops novel connections among
model-based DSP system design, FPGA implementation, and next generation wireless
communication systems, which lead to systematic, formally supported design methods
for hardware implementation in this domain.
5.2 Background
5.2.1 LTE downlink physical layer
The LTE downlink physical layer is based on the modulation and multiple access
scheme called Orthogonal Frequency Division Multiple Access, or OFDMA. OFDMA
uses an IFFT to divide a wideband channel into multiple narrowband subchannels. This
creates a two-dimensional resource grid in frequency and time. In LTE, each element of
this grid is called a resource element. This 2D grid allows multiplexing various physical
channels, e.g., data and control channels, which could be intended for possibly multiple
users. An example 1ms LTE subframe comprising 14 OFDMA symbols in the normal
cyclic prefix mode is shown in Fig. 5.1. LTE can be configured for 6 different bandwidths,
namely 1.4, 3, 5, 10, 15, and 20 MHz, but still maintain a constant 15 kHz subcarrier
spacing. The LTE physical layer can also support multiple antenna transmission schemes,
including transmit diversity, beamforming, and spatial multiplexing, but we primarily
focus on implementation for the single-antenna transmission mode.
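The numerology above can be checked with simple arithmetic. The sketch below combines the constant 15 kHz subcarrier spacing with the used-subcarrier counts and the 2048-point IFFT size that are listed in Section 5.3.1. The function names are hypothetical, and pairing the bandwidths with the subcarrier counts in listed order is our reading of the "direct map" described there.

```python
SUBCARRIER_SPACING_HZ = 15_000

# Channel bandwidth (MHz) -> number of used subcarriers (values from Section 5.3.1).
USED_SUBCARRIERS = {1.4: 72, 3: 180, 5: 300, 10: 600, 15: 900, 20: 1200}


def occupied_bandwidth_hz(bw_mhz: float) -> int:
    """Spectrum actually occupied by the used subcarriers for a given configuration."""
    return USED_SUBCARRIERS[bw_mhz] * SUBCARRIER_SPACING_HZ


def ifft_sample_rate_hz(fft_size: int = 2048) -> int:
    """Sample rate at the IFFT output: one FFT bin per subcarrier spacing."""
    return fft_size * SUBCARRIER_SPACING_HZ


for bw in USED_SUBCARRIERS:
    print(bw, "MHz channel ->", occupied_bandwidth_hz(bw) / 1e6, "MHz occupied")

print(ifft_sample_rate_hz())  # 30720000, i.e., 30.72 Ms/s
```

In every configuration the occupied bandwidth is smaller than the channel bandwidth, which leaves guard bands at the channel edges.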
5.2.2 Parameterized Synchronous Dataflow
Parameterized Synchronous Dataflow (PSDF) [58] extends the expressive power of
SDF to manage DSP application dynamics in terms of run-time configuration of dataflow
actor, edge, and subsystem parameters. A PSDF subsystem that is enabled for run-time
configuration involves two separate “parameter configuration controllers,” which are re-
ferred to as the init and subinit graphs of the associated subsystem. These controllers
provide two different levels of granularity in the run-time configuration processing — the
init graph can form parameter configurations that are in general less restricted but also less
frequent compared to the kinds of configurations that are allowed by the subinit graph.
The modeling discipline imposed by the subinit and init graphs in PSDF is designed
to provide significant flexibility in how and when parameters are configured, while ensur-
ing that configurations that affect the structure of subsystem schedules are allowed to
occur only between iterations (in terms of SDF repetitions vectors) of the associated sub-
systems. This allows each subsystem to be viewed as a dynamically evolving sequence
of SDF graphs whose SDF properties can change only at well-defined points in time (be-
tween SDF graph iterations). Such a structured view of dynamic dataflow graph execution
is valuable for efficient quasi-static scheduling [58, 59, 60].
5.3 Parameterized SDF Model of LTE
5.3.1 LTE specification
Figure 5.2: PSDF Model for LTE BS Modulator.
Fig. 5.2 shows our PSDF model for a single-antenna LTE Base Station Modulator,
which is the basis of our FPGA implementation. Each of the solid blocks corresponds
to a PSDF actor whose production and consumption rates at its solid edges can
change given the values of the parameters indicated by the dashed blocks and communicated
by the dashed edges. The data, control, and reference symbol generation blocks provide
the QPSK, 16-, or 64-QAM symbols that are multiplexed via the Resource Element
(RE) mapper. The RE mapper takes in different numbers of symbols s1, s2,
and s3 from the available input ports as a function of the number of control symbols
(Nctrl ∈ {1, 2, 3, 4}), subframe index (Sfidx ∈ {0, ..., 9}), bandwidth configuration
(BW ∈ {1.4, 3, 5, 10, 15, 20}), cyclic prefix mode (CPmode ∈ {Normal, Extended}), and symbol
index (SymbIdx ∈ {0, ..., 13}). These symbols are multiplexed into Nu ∈ {72, 180, 300, 600, 900, 1200}
used subcarriers, which is a direct map from the bandwidth configuration BW. The Zero
Pad block then takes in Nu symbols and appends zeros at the DC and edge subcarriers,
forming 2048 frequency domain complex values. The following block then performs a
2048-pt IFFT, and appends a cyclic prefix of length that is a function of the CPmode and
SymbIdx parameters. The rate at the output of this block should be 30.72 Ms/s with a
worst case bandwidth of 20 MHz, and so in order to interface to the 25 MHz D/A con-










Figure 5.3: PSDF specification of RE Mapper.
5.3.2 PSDF Modeling Details
A PSDF specification for the RE mapper is shown in Fig. 5.3. Since there are differ-
ent bandwidth configurations allowed, and each symbol of the LTE subframe is composed
of different combinations of physical channel symbols (see Fig. 5.1), production and con-
sumption rates in the RE mapper subsystem can be changed across OFDMA symbols, i.e.,
across the invocations of the RE mapping subsystem. Meanwhile, in order to multiplex
the combination of physical channel types in each OFDMA symbol, the appropriate input
edge is connected to the output edge for each resource element in the OFDMA symbol
during each invocation of the RE mapper. We have likewise modeled the other process-
ing blocks in the downlink LTE physical layer protocol, and have verified that PSDF has
sufficient expressive power for describing the full functionality of our target LTE protocol.
PSDF specifications support hierarchical reconfigurable subsystem modeling struc-
tures in that a PSDF specification can be abstracted as a hierarchical PSDF actor, and
embedded in a parent (higher level) PSDF graph. For example, a PSDF abstraction of the
RE mapper in Fig. 5.2 is considered as a PSDF specification consisting of a body graph
(Φb), init graph (Φi), and subinit graph (Φs), as shown in Fig. 5.3.
Before the invocation of the RE Mapper PSDF specification, the init graph receives
the parameter set, determines the physical channel data combinations in the particular
OFDMA symbol, and counts the number of REs allocated for each physical channel to
determine production rates on input edges and consumption rates on output edges in the
parent graph. During the invocation of the specification, the subinit graph determines
production and consumption rates on internal edges in order to switch the input edge con-
nected to an output edge depending on the value of the received remapping matrix data at
run-time. Based on the distribution of active and inactive edges, the body graph, which
implements the computational core of the subsystem, can produce a sequence of data cor-
responding to the OFDMA symbol index. Hence, in the architecture of our parameterized
dataflow framework, the body graph models the main functional behavior of the RE map-
per, while the init and subinit graphs provide two different levels of control based on the
given, dynamically arriving parameter sets.
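The division of labor among the init, subinit, and body graphs can be mimicked in software. The sketch below is a loose behavioral analogy only, not our FPGA implementation: all names are hypothetical, the remapping data is modeled as a list that names the source port of each resource element, and the subinit step is assumed to run once per resource element.

```python
from collections import Counter, deque

PORTS = ("data", "control", "reference")


def init_graph(re_map: list) -> dict:
    """Before the invocation: count REs per channel to fix the parent-graph rates."""
    counts = Counter(re_map)
    return {port: counts.get(port, 0) for port in PORTS}


def subinit_graph(selected_port: str) -> dict:
    """During the invocation: rate 1 on the active internal edge, 0 on the others."""
    return {port: int(port == selected_port) for port in PORTS}


def body_graph(rates: dict, queues: dict) -> list:
    """Consume rates[port] tokens from each input queue and emit them on the output."""
    return [queues[port].popleft() for port in PORTS for _ in range(rates[port])]


def re_mapper(re_map: list, inputs: dict) -> list:
    """Process one OFDMA symbol worth of resource elements."""
    rates = init_graph(re_map)
    if any(len(inputs[port]) != rates[port] for port in PORTS):
        raise ValueError("input token counts do not match the configured rates")
    queues = {port: deque(inputs[port]) for port in PORTS}
    output = []
    for selected_port in re_map:
        output.extend(body_graph(subinit_graph(selected_port), queues))
    return output


symbol = re_mapper(
    ["reference", "data", "data", "control"],
    {"data": ["d0", "d1"], "control": ["c0"], "reference": ["r0"]},
)
print(symbol)  # ['r0', 'd0', 'd1', 'c0']
```

The rates returned by init_graph stay fixed for the whole invocation, while subinit_graph changes which internal edge is active from one resource element to the next, mirroring the two levels of control described above.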
5.3.3 PSDF Execution Model
Each LTE subframe is composed of multiple OFDMA symbols, and each OFDMA
symbol in our PSDF specification is processed after all actors in the graph are fired at the
rate determined by the repetitions vector of the enclosing graph. Because PSDF semantics
guarantees that any specific configuration of a PSDF graph is an SDF graph, and that such
configurations can only be changed between SDF graph iterations, there is always a well-
defined repetitions vector that governs the processing of a given OFDMA symbol. For
details on fundamental relationships between SDF graphs and repetitions vectors, we refer
the reader to [1].
When executing the LTE FPGA implementation, we apply a self-timed execution
model, which means that each actor should be fired as soon as all of its input edges have
sufficient data. When actors execute and communicate on dedicated resources (so that re-
source contention is not an issue), this type of execution generally enhances throughput by
facilitating the exploitation of parallel processing capabilities on the target hardware. This
type of distributed-control execution model also avoids hardware and run-time overhead
due to the stronger synchronization requirements that are associated with centralized-
control schedules.
FPGA targets allow dataflow actors to be assigned onto independent, dedicated
processing units that are implemented by FPGA slices. In such a computing environ-
ment, signal processing throughput can be significantly increased due to the possibility
for simultaneous firings of multiple actors. To ensure valid, distributed firing rule check-
ing in our PSDF-based implementation framework, we model empty memory spaces on
dataflow graph edges by adding feedback edges with appropriate numbers of initial to-
kens (based on the sizes of the corresponding buffers) in the execution model graph (an
intermediate dataflow graph representation used to map the application into hardware),
and enable actors for execution using principles of efficient self-timed execution [47].
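The feedback-edge construction can be illustrated with a small token-level simulation of one producer and one consumer. This is a simplified, untimed sketch with hypothetical names: the forward edge holds data tokens, the feedback edge holds "empty space" tokens and starts with as many tokens as the buffer size, and actors are fired one at a time rather than concurrently as they would be on the FPGA.

```python
from math import gcd


def simulate_iteration(p: int, c: int, buffer_size: int):
    """Fire src and snk under backpressure for one graph iteration.

    Returns the firing trace, or None if the iteration deadlocks.
    """
    data, space = 0, buffer_size           # forward-edge and feedback-edge tokens
    g = gcd(p, c)
    remaining = {"src": c // g, "snk": p // g}   # repetitions vector
    trace = []
    while remaining["src"] > 0 or remaining["snk"] > 0:
        if remaining["snk"] > 0 and data >= c:
            # snk consumes c data tokens and returns c space tokens.
            data, space = data - c, space + c
            remaining["snk"] -= 1
            trace.append("snk")
        elif remaining["src"] > 0 and space >= p:
            # src consumes p space tokens and produces p data tokens.
            data, space = data + p, space - p
            remaining["src"] -= 1
            trace.append("src")
        else:
            return None                    # neither firing rule is satisfied
    return trace


print(simulate_iteration(2, 3, 4))  # ['src', 'src', 'snk', 'src', 'snk']
print(simulate_iteration(2, 3, 3))  # None: the buffer is too small
```

Because each actor checks only the token counts on its own input edges, no central scheduler is needed, which is the property that the distributed-control execution model relies on.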
Wiggers et al. have employed a similar backpressure-driven, self-timed execution
model to implement cyclo-static dataflow (CSDF) graphs in multi-processor system-on-chip
devices [61]. Our approach in this chapter differs in its exploration of PSDF,
which is a significantly more dynamic form of dataflow compared to SDF or CSDF, and
its application to FPGA implementation.
Table 5.1: FPGA resource utilization for LTE implementation.
Resource                 Used     Available   Utilization
Occupied FPGA Slices     5,244    14,720      35%
Number of BlockRAM       96       244         39%
Number of DSP48Es        54       640         8%
5.4 LTE Prototype Implementation
As a proof-of-concept of our PSDF LTE model, we have designed and implemented
from the top down an LTE real-time base station emulator prototype [62]. The prototype
is based on a PXI-express system with an embedded real-time controller PC running a
real-time operating system, which handles the link control, higher-layer software, and
communication with an optional host PC via TCP-IP. The PSDF LTE model is designed
in LabVIEW FPGA [63], and implemented on the PXIe-5641R, Intermediate Frequency
(IF) Transceiver module, which includes a Xilinx Virtex-5 SX95T FPGA with integrated
2-input and 2-output IF ports. The IF signals are then modulated onto a radio frequency
carrier using the PXI-5610 2.7 GHz RF upconverter, and looped-back to a PXI-5600
2.7 GHz RF downconverter, where the downconverted IF signal is fed back to the IF
Transceiver for receiver processing. The base clock for our experiments with this system
is 160 MHz. Synthesis results from the experiments are shown in Table 5.1.
As an illustrative example, we detail the implementation of the 625/768 sample rate
conversion block of Fig. 5.2, which converts a 30.72MSPS LTE signal to the DAC at
25MSPS. In order to save hardware resources, we divide the filter into a cascade of two
rational resampling stages, namely a 25/24 and a 25/32 stage. Using the LabVIEW Digital
Filter Design Toolkit (DFDT), the individual floating point rational filters are designed
and the fixed point behavior of the overall filter is simulated. We then used Xilinx’s FIR
compiler to implement the filter using the IP integration node from NI-Labs. This node
takes an XCO or VHDL file as input to build a simulation and implementation model
compatible with LabVIEW FPGA.
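The factorization of the 625/768 conversion into two rational stages can be verified with exact arithmetic. In the sketch below, the function name is hypothetical and the order of the two stages (25/24 first, then 25/32) is an assumption, since only the two ratios are given above.

```python
from fractions import Fraction


def cascade_rates(input_rate: Fraction, stages: list) -> list:
    """Return the sample rate at the input and after each resampling stage."""
    rates = [input_rate]
    for ratio in stages:
        rates.append(rates[-1] * ratio)
    return rates


stages = [Fraction(25, 24), Fraction(25, 32)]   # assumed stage order
overall = stages[0] * stages[1]
print(overall)                                   # 625/768

rates_msps = cascade_rates(Fraction("30.72"), stages)
print([float(r) for r in rates_msps])            # [30.72, 32.0, 25.0]
```

With this ordering the intermediate rate is 32 MSPS, and the cascade delivers exactly the 25 MSPS rate required by the DAC.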
Chapter 6
Conclusion and Future Work
6.1 Conclusion
In this thesis, we have presented new design techniques and methodologies for
dataflow-based synthesis of field programmable gate array (FPGA) implementations for
digital signal processing (DSP) applications. We have focused mainly on formulating and
exploring design spaces for finding cost-efficient solutions subject to given constraints on
performance. Our experimental results have demonstrated that the proposed techniques
are highly effective in improving the efficiency of implementations on FPGA platforms.
In chapter 2, we developed a systematic approach for generating dedicated fast
Fourier transform (FFT) subsystems for FPGA implementation. Our approach incorpo-
rates efficient FFT address generation and memory management, and applies two or-
thogonal loop unrolling methods to provide a tunable trade-off between performance and
FPGA resource costs. We also developed an analytical approach for high level design
space exploration. This approach allows one to derive a resource-efficient FFT architec-
ture configuration for a given throughput constraint, and a given critical target resource
(e.g., FPGA BRAM or logic slices).
Our methods are demonstrated through extensive synthesis experiments using the
Xilinx Virtex II Pro FPGA device family. Our synthesis results quantify cost-performance
trade-offs provided by our proposed class of FFT architectures. A distinguishing charac-
teristic of our approach, compared to commercially available FFT IP cores and other
specialized FFT implementations, is that we provide a systematic method to generate an
FPGA-based FFT architecture while taking into account trade-offs between performance
and cost.
In chapter 3, we extended our 1D-FFT implementation technique to generate ded-
icated 2D-FFT subsystems for FPGA implementation. Our approach realizes data par-
allelism within an individual 1D-FFT core, and minimizes the interface complexity be-
tween the underlying 1D-FFT core and local memory. Our approach allows for scalable,
parallel 2D-FFT implementation with a relatively simple interconnection network, and
correspondingly simple control logic. These features contribute to improved FPGA re-
source consumption at a given level of performance compared to previous 2D-FFT FPGA
architectures. Our synthesis results quantify the cost-performance trade-offs provided by
our proposed class of FFT architectures.
In chapter 4, we presented a novel algorithm to provide upper bounds on FPGA
buffer distributions for throughput-optimal execution of synchronous dataflow graphs that
are in the form of tree-structured, directed acyclic graphs. The resulting bounds can be
employed directly as buffer sizes when mapping SDF graphs into digital hardware. A
distinguishing aspect of our proposed algorithm is that it has low polynomial time com-
plexity, which makes it especially useful for rapid prototyping and for implementation of
large scale or heavily multirate designs. Our work appears promising for integration into
high-level design processes for FPGA-based DSP system implementation, as our experi-
ments with LabVIEW FPGA demonstrate.
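As background for buffer analysis of SDF graphs, the following Python sketch computes the repetitions vector of a connected SDF graph by solving the balance equations (firings of the source times its production rate equal firings of the sink times its consumption rate on every edge). It does not reproduce the bounding algorithm of chapter 4; the function name repetitions and the edge-tuple format are assumptions made for this illustration.

```python
from collections import defaultdict, deque
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions(edges):
    """edges: list of (src, dst, prod, cons) tuples for a connected SDF graph."""
    adj = defaultdict(list)
    for s, d, p, c in edges:
        # Balance equation on each edge: q[s] * p == q[d] * c.
        adj[s].append((d, Fraction(p, c)))
        adj[d].append((s, Fraction(c, p)))

    start = edges[0][0]
    rate = {start: Fraction(1)}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b, ratio in adj[a]:
            r = rate[a] * ratio
            if b not in rate:
                rate[b] = r
                queue.append(b)
            elif rate[b] != r:
                raise ValueError("inconsistent sample rates")

    # Scale the rational firing rates to the smallest positive integers.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    q = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, q.values())
    return {a: v // g for a, v in q.items()}
```

For a small tree-structured graph in which A sends 2 tokens per firing to B (which consumes 3) and 1 token per firing to C (which consumes 2), the sketch yields 6, 4, and 3 firings per iteration for A, B, and C, respectively.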
In chapter 5, we presented a framework for the modeling and FPGA implementation
of LTE downlink physical layer processing using the parameterized synchronous dataflow
(PSDF) model of computation. The results of our study and our associated prototype pro-
vide a concrete demonstration of PSDF-based design and implementation techniques for
emerging wireless communication systems. Due to its formal properties, support for sys-
tematic scheduling and implementation techniques, and capabilities for efficient frame-
based dynamic dataflow modeling, PSDF is promising as a semantic foundation for future
design tools, and as an architectural foundation for digital system design methodologies
in the domain of fourth generation wireless communication systems.
6.2 Future work
In this section, we describe a number of useful directions for future work that build
on the results of this thesis.
The FFT actor architecture developed in chapter 2 has provided the framework and
core design for the burst mode built-in FFT IP block, which is a new feature introduced
in LabVIEW FPGA 8.6, released by National Instruments. This released version does
not employ the inner/outer loop unrolling features in our FFT actor architecture frame-
work. These features are presently integrated together with the other components of our
FFT actor implementation approach within in-house distributions that are being used ex-
perimentally at National Instruments. These unrolling-enabled versions provide higher
throughput FFT realizations, but are more complex and require more extensive experi-
mentation before integration into the commercially released product.
Overall, our FFT architecture techniques are suitable as the basis for FFT IP blocks
that can be configured across a wide range of trade-offs between resource cost and achiev-
able performance based on implementation requirements.
While radix-4 FFT designs generally consume fewer resources than radix-2 FFT
designs, radix-4 designs are more restricted: the FFT size must be a power of 4,
whereas radix-2 designs require only that the size be a power of 2. To implement FFT sizes
that are powers of 2 but not powers of 4, combined radix-2/4 architectures are attractive candidates. One
interesting direction for further study is to apply our proposed unrolling techniques to
combined radix-2/4 FFT architectures.
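The size restriction can be made concrete with a short Python sketch that decomposes a power-of-2 FFT size into as many radix-4 stages as possible, adding a single radix-2 stage when the exponent is odd. The function name radix_stages and the choice to place the radix-2 stage last are assumptions made for illustration and are not taken from the architectures in this thesis.

```python
def radix_stages(n):
    """Return the list of stage radices for an n-point combined radix-2/4 FFT."""
    # n must be a power of 2 (and at least 2) for this decomposition.
    if n < 2 or n & (n - 1):
        raise ValueError("FFT size must be a power of 2")
    k = n.bit_length() - 1          # n == 2**k
    stages = [4] * (k // 2)         # each radix-4 stage covers two bits of k
    if k % 2:
        stages.append(2)            # an odd exponent needs one radix-2 stage
    return stages
```

For example, a 2048-point FFT (2^11) decomposes into five radix-4 stages and one radix-2 stage, whereas a 1024-point FFT (4^5) needs only radix-4 stages.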
The proposed buffering algorithm in chapter 4 is restricted to SDF graphs, which
are dataflow graphs that have static dataflow (production and consumption rate) behavior.
While many DSP applications can be modeled using SDF graphs, an increasing range of
applications require more flexibility and cannot be fully represented by SDF semantics.
Furthermore, SDF-compatible behaviors can sometimes be synthesized more effectively
if they are converted to alternative representations that employ more flexible modeling
techniques such as cyclo-static dataflow (CSDF) [52]. Thus, extending our techniques for
buffer analysis and optimization to more expressive dataflow models is a useful direction
for further investigation.
Another interesting direction for future work is buffer optimization for parameter-
ized dataflow graphs under self-timed execution. In chapter 5, we developed a novel
PSDF-based FPGA architecture design approach, and demonstrated this approach using
National Instruments’ LabVIEW FPGA. In this work, we exploited the expressive power
of parameterized dataflow, and demonstrated the mapping from a complex PSDF applica-
tion specification into an FPGA implementation. Integrating buffer optimization into the
PSDF-to-FPGA mapping process will be a useful direction for further study to achieve
more efficient hardware utilization in derived implementations.
Bibliography
[1] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous data flow
programs for digital signal processing,” IEEE Trans. Comput., vol. 36, no. 1, pp.
24–35, 1987.
[2] J. L. Pino, S. Ha, E. A. Lee, and J. T. Buck, “Software synthesis for DSP using
Ptolemy,” Journal of VLSI Signal Processing, vol. 9, pp. 7–21, 1995.
[3] W. Sung, M. Oh, C. Im, and S. Ha, “Demonstration of codesign workflow in PEACE,”
in Proc. of International Conference of VLSI Circuit, Seoul, Korea, 1997.
[4] J. Buck and R. Vaidyanathan, “Heterogeneous modeling and simulation of embed-
ded systems in El Greco,” in CODES ’00: Proceedings of the eighth international
workshop on Hardware/software codesign. New York, NY, USA: ACM, 2000, pp.
142–146.
[5] C. Hsu, M. Ko, and S. S. Bhattacharyya, “Software synthesis from the dataflow
interchange format,” in Proceedings of the International Workshop on Software and
Compilers for Embedded Systems, Dallas, Texas, September 2005, pp. 37–49.
[6] C. Hsu, J. L. Pino, and S. S. Bhattacharyya, “Multithreaded simulation for syn-
chronous dataflow graphs,” in Proceedings of the Design Automation Conference,
Anaheim, California, June 2008, pp. 331–336.
[7] W. Wolf, FPGA-Based System Design. Upper Saddle River, NJ, USA: Prentice
Hall PTR, 2004.
[8] Xilinx, “Xilinx core generator 10.1,” 2008. [Online]. Available:
http://www.xilinx.com/ipcenter/coregen/updates_101.htm
[9] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex
Fourier series,” Mathematics of Computation, vol. 19, no. 90, pp. 297–301, April
1965.
[10] S. Winograd, “On computing the discrete Fourier transform,” Mathematics
of Computation, vol. 32, no. 141, pp. 175–199, 1978. [Online]. Available:
http://www.jstor.org/stable/2006266
[11] D. Kolba and T. Parks, “A prime factor FFT algorithm using high-speed convolution,”
Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 25, no. 4, pp.
281–294, Aug 1977.
[12] R. Bracewell, “The fast Hartley transform,” Proceedings of the IEEE, vol. 72, no. 8,
pp. 1010–1018, Aug. 1984.
[13] M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the
FFT,” in IEEE Intl. Conf. Acoustics Speech and Signal Processing, vol. 3, 1998, pp.
1381–1384.
[14] A. Ganapathiraju, J. Hamaker, and J. Picone, “Contemporary view of FFT algorithms,”
in Proceedings of the IASTED International Conference on Signal and Image Pro-
cessing (SIP ’98), 1998, pp. 130–133.
[15] B. Baas, “A low-power, high-performance, 1024-point FFT processor,” Solid-State
Circuits, IEEE Journal of, vol. 34, no. 3, pp. 380–387, Mar 1999.
[16] W. Li and L. Wanhammar, “A pipeline FFT processor,” in IEEE Workshop on Signal
Processing Systems, 1999, pp. 654–662.
[17] I. Uzun, A. Amira, and A. Bouridane, “FPGA implementations of fast Fourier trans-
forms for real-time signal and image processing,” Vision, Image and Signal Process-
ing, IEE Proceedings -, vol. 152, no. 3, pp. 283–296, June 2005.
[18] S. Sukhsawas and K. Benkrid, “A high-level implementation of a high performance
pipeline FFT on Virtex-E FPGAs,” VLSI, 2004. Proceedings. IEEE Computer society
Annual Symposium on, pp. 229–232, Feb. 2004.
[19] J. Vite-Frias, R. Romero-Troncoso, and A. Ordaz-Moreno, “VHDL core for 1024-
point radix-4 FFT computation,” Reconfigurable Computing and FPGAs, 2005. Re-
ConFig 2005. International Conference on, pp. 4 pp.–24, Sept. 2005.
[20] C. Chad, Z. Qin, X. Yingke, and H. Chengde, “Design of a high performance FFT
processor based on FPGA,” Design Automation Conference, 2005. Proceedings of the
ASP-DAC 2005. Asia and South Pacific, vol. 2, pp. 920–923 Vol. 2, Jan. 2005.
[21] C. Gonzalez-Concejero, V. Rodellar, A. Alvarez-Marquina, E. M. d. Icaya, and
P. Gomez-Vilda, “An FFT/IFFT design versus Altera and Xilinx cores,” Reconfigurable
Computing and FPGAs, 2008. ReConFig ’08. International Conference on, pp. 337–
342, Dec. 2008.
[22] Xilinx, “Fast Fourier transform v4.1,” 2007.
[23] Y. Ma, “An effective memory addressing scheme for FFT processors,” Signal Process-
ing, IEEE Transactions on, vol. 47, no. 3, pp. 907–911, Mar 1999.
[24] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic generation of cus-
tomized discrete Fourier transform IPs,” in DAC ’05: Proceedings of the 42nd annual
conference on Design automation. New York, NY, USA: ACM, 2005, pp. 471–474.
[25] J. H. Takala, T. S. Järvinen, P. V. Salmela, and D. A. Akopian, “Multi-port inter-
connection networks for radix-r algorithms,” in Proc. IEEE Intl. Conf. Acoustics,
Speech, Signal Processing, 2001, pp. 1177–1180.
[26] H. Kee, N. Petersen, J. Kornerup, and S. S. Bhattacharyya, “Systematic generation
of FPGA-based FFT implementations,” in Proceedings of the International Confer-
ence on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, March 2008,
pp. 1413–1416.
[27] M. Hasan and T. Arslan, “FFT coefficient memory reduction technique for OFDM ap-
plications,” Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP
’02). IEEE International Conference on, vol. 1, pp. I–1085–I–1088 vol.1, 2002.
[28] C. Burrus, “Unscrambling for fast DFT algorithms,” Acoustics, Speech and Signal
Processing, IEEE Transactions on, vol. 36, no. 7, pp. 1086–1087, Jul 1988.
[29] J. M. Blackledge, Digital Image Processing. Horwood Publishing, 2005.
[30] H. Jung and S. Ha, “Hardware synthesis from coarse-grained dataflow specification
for fast HW/SW cosynthesis,” in Proceedings of Int. Conf. on Hardware/Software
Codesign and System Synthesis (CODES/ISSS), 2004, pp. 24–29.
[31] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, “Synthesis of embedded software
from synchronous dataflow specifications,” Journal of VLSI Signal Processing Sys-
tems for Signal, Image, and Video Technology, vol. 21, no. 2, pp. 151–166, June
1999.
[32] R. Reiter, “Scheduling parallel computations,” Journal of the Association for Com-
puting Machinery, October 1968.
[33] A. Dasdan and R. K. Gupta, “Faster maximum and mini-
mum mean cycle algorithms for system-performance analysis,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 17, pp. 889–899,
1998.
[34] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee, “A hierarchical multiprocessor
scheduling system for DSP applications,” in Proceedings of the IEEE Asilomar Con-
ference on Signals, Systems, and Computers, Pacific Grove, California, November
1995, pp. 122–126 vol.1.
[35] A. H. Ghamarian, M. C. W. Geilen, S. Stuijk, T. Basten, B. D. Theelen, M. R.
Mousavi, A. J. M. Moonen, and M. J. G. Bekooij, “Throughput analysis of syn-
chronous data flow graphs,” in ACSD ’06: Proceedings of the Sixth International
Conference on Application of Concurrency to System Design. Washington, DC,
USA: IEEE Computer Society, 2006, pp. 25–36.
[36] J. T. Buck, “Scheduling dynamic dataflow graphs with bounded memory,” Berkeley,
CA, USA, Tech. Rep., 1993.
[37] R. Govindarajan, G. R. Gao, and P. Desai, “Minimizing buffer requirements under
rate-optimal schedule in regular dataflow networks,” Journal of VLSI Signal Pro-
cessing, vol. 31, p. 2002, 1994.
[38] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, “A formal approach to the scheduling prob-
lem in high level synthesis,” IEEE Trans. on CAD of Integrated Circuits and Sys-
tems, vol. 10, no. 4, pp. 464–475, 1991.
[39] Q. Ning and G. R. Gao, “A novel framework of register allocation for software
pipelining,” 1993.
[40] H. Oh and S. Ha, “Efficient code synthesis from extended dataflow graphs for mul-
timedia applications,” in DAC ’02: Proceedings of the 39th annual Design Automa-
tion Conference. New York, NY, USA: ACM, 2002, pp. 275–280.
[41] J. Horstmannshoff and H. Meyr, “Efficient building block based RTL code generation
from synchronous data flow graphs,” in DAC ’00: Proceedings of the 37th Annual
Design Automation Conference. New York, NY, USA: ACM, 2000, pp. 552–555.
[42] ——, “Optimized system synthesis of complex rt level building blocks from multi-
rate dataflow graphs,” in ISSS ’99: Proceedings of the 12th international symposium
on System synthesis. Washington, DC, USA: IEEE Computer Society, 1999, p. 38.
[43] S. Stuijk, M. Geilen, and T. Basten, “Exploring trade-offs in buffer requirements
and throughput constraints for synchronous dataflow graphs,” in DAC. ACM
Press, 2006, pp. 899–904.
[44] M. Geilen, T. Basten, and S. Stuijk, “Minimising buffer requirements of syn-
chronous dataflow graphs with model checking,” in Proceedings of the Design
Automation Conference. ACM, 2005, pp. 819–824.
[45] M. Wiggers, M. Bekooij, P. G. Jansen, and G. J. M. Smit, “Efficient computa-
tion of buffer capacities for multi-rate real-time systems with back-pressure,” in
CODES+ISSS, 2006, pp. 10–15.
[46] E. A. Lee and S. Ha, “Scheduling strategies for multiprocessor real time DSP,” in
Proceedings of the Global Telecommunications Conference, November 1989.
[47] S. Sriram and S. S. Bhattacharyya, Embedded Multiprocessors: Scheduling and
Synchronization, 2nd ed. CRC Press, 2009.
[48] S. Y. Kung, P. S. Lewis, and S. C. Lo, “Performance analysis and optimization of
VLSI dataflow arrays,” Journal of Parallel and Distributed Computing, pp. 592–
618, 1987.
[49] C. Hsu, F. Keceli, M. Ko, S. Shahparnia, and S. S. Bhattacharyya, “DIF: An inter-
change format for dataflow-based design tools,” in Proceedings of the International
Workshop on Systems, Architectures, Modeling, and Simulation, Samos, Greece,
July 2004, pp. 423–432.
[50] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, Software Synthesis from Dataflow
Graphs. Kluwer Academic Publishers, 1996.
[51] W. Sun, M. J. Wirthlin, and S. Neuendorffer, “FPGA pipeline synthesis design ex-
ploration using module selection and resource sharing,” IEEE Trans. on CAD of
Integrated Circuits and Systems, vol. 26, no. 2, pp. 254–265, 2007.
[52] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete, “Cyclo-static dataflow,”
IEEE Transactions on Signal Processing, vol. 44, no. 2, pp. 397–408, February
1996.
[53] W. Plishker, N. Sane, M. Kiemb, K. Anand, and S. S. Bhattacharyya, “Functional
DIF for rapid prototyping,” in Proceedings of the International Symposium on Rapid
System Prototyping, Monterey, California, June 2008, pp. 17–23.
[54] H. Kee, I. Wong, Y. Rao, and S. S. Bhattacharyya, “FPGA-based design and im-
plementation of the 3GPP-LTE physical layer using parameterized synchronous
dataflow techniques,” in Proceedings of the International Conference on Acoustics,
Speech, and Signal Processing, Dallas, Texas, March 2010.
[55] 3G Americas, The Mobile Broadband Evolution: 3GPP Release 8 and Beyond, Feb.
2009.
[56] C. Hsu, S. Ramasubbu, M. Ko, J. L. Pino, and S. S. Bhattacharyya, “Efficient sim-
ulation of critical synchronous dataflow graphs,” in Proceedings of the Design Au-
tomation Conference, San Francisco, California, July 2006, pp. 893–898.
[57] C. B. Robbins, Autocoding Toolset Software Tools for Automatic Generation of Par-
allel Application Software. Technical report, Management Communications and
Control, Inc., 2002.
[58] B. Bhattacharya and S. S. Bhattacharyya, “Parameterized dataflow modeling for
DSP systems,” IEEE Transactions on Signal Processing, vol. 49, no. 10, pp. 2408–
2421, October 2001.
[59] S. Saha, S. Puthenpurayil, and S. S. Bhattacharyya, “Dataflow transformations in
high-level DSP system design,” in Proceedings of the International Symposium on
System-on-Chip, Tampere, Finland, November 2006, pp. 131–136, invited paper.
[60] M. Ko, C. Zissulescu, S. Puthenpurayil, S. S. Bhattacharyya, B. Kienhuis, and
E. Deprettere, “Parameterized looped schedules for compact representation of ex-
ecution sequences in DSP hardware and software implementation,” IEEE Transac-
tions on Signal Processing, vol. 55, no. 6, pp. 3126–3138, June 2007.
[61] M. Wiggers, M. Bekooij, P. Jansen, and G. Smit, “Efficient computation of buffer
capacities for cyclo-static real-time systems with back-pressure,” in Proceedings of
the IEEE Real-Time and Embedded Technology and Applications Symposium, 2007.
[62] I. Wong, Y. Rao, and M. Santori. Video: Prototyping complex communications sys-
tems. http://zone.ni.com/wv/app/doc/p/id/wv-1696.
[63] National Instruments, LabVIEW FPGA User Manual, 2009.
