Automatic synthesis and optimization of floating point hardware. by Ho, Chun Hok. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Automatic Synthesis and Optimization of 
Floating Point Hardware 
Ho Chun Hok 
A Thesis Submitted in Partial Fulfillment 
of the Requirements for the Degree of
Master of Philosophy 
in 
Department of Computer Science & Engineering
©The Chinese University of Hong Kong 
July, 2003 
The Chinese University of Hong Kong holds the copyright of this thesis. Any 
person(s) intending to use a part or the whole of the materials in this thesis 
in a proposed publication must seek copyright release from the Dean of the 
Graduate School. 
Automatic Synthesis and Optimization 
of Floating Point Hardware 
submitted by 
Ho Chun Hok 
for the degree of Master of Philosophy 
at the Chinese University of Hong Kong 
Abstract 
This thesis presents a methodology for designing floating point and fixed point
systems on FPGA platforms by means of a programming language. A compiler,
fly, a floating point library, float, and an arbitrary function module generator
were developed for rapid system prototyping research. fly takes a Perl-like
program as input and produces a synthesizable VHDL description of a one-hot
state machine and the associated datapath elements as output. Furthermore,
it is tightly integrated with the hardware design environment and
implementation platform, and is able to hide issues associated with these
tools from the user. The float library consists of a floating point class for
the simulation of quantization effects associated with high precision floating
point operators, an optimizer which can automatically determine the minimal
number of exponent and fraction bits required for a specified degree of
accuracy, and a parameterized floating point library which can generate
floating point operators with arbitrary precision. The function generator can
generate any one-operand function and is compatible with the fly compiler.
The system was used to prototype an FPGA based greatest common divisor (GCD)
coprocessor, a digital sine-cosine generator, a dedicated circuit for solving
ordinary differential equations (ODEs), and a simulation model for the N-body
problem. By combining these design tools, the time and knowledge required for
a designer to























Many people have contributed to my education through their guidance and
support in my graduate school years. I especially wish to thank my final year
project and Master's degree supervisor, Dr. Philip Leong, for his suggestions
and ideas on research. He also reviewed my manuscript carefully. This
dissertation could not have been completed without his help and support.
I would like to acknowledge Dr. P. Zipf, Mr. R. Ludewig and Mr. A. G. Ortiz
of the Institute of Microelectronic Systems, Darmstadt University of
Technology, for the development of the fly compiler project. They provided the
embryo of the fly compiler so that I could extend their work.
Thanks must be given to Mr. K. H. Tsoi, who assisted me in debugging the
various host interfaces used in this thesis. He also reviewed my Chinese
abstract thoroughly.
I would like to thank my colleagues, in particular Mr. Y. H. Cheung, Mr.
C. W. Sham, Mr. Y. M. Lam, Mr. C. L. Yuen, Mr. K. Y. Tong and F. Wu,
for their assistance and support.






1 Introduction
1.1 Motivation
1.2 Aims
1.3 Contributions
1.4 Thesis Organization
2 Background and Literature Review
2.1 Introduction
2.2 Field Programmable Gate Arrays
2.3 Traditional design flow and VHDL
2.4 Single Description for Hardware-Software Systems
2.5 Parameterized Floating Point Arithmetic Implementation
2.6 Function Approximations by Table Lookup and Addition
2.7 Summary
3 Floating Point Arithmetic
3.1 Introduction
3.2 Floating Point Number Representation
3.3 Rounding Error
3.4 Floating Point Number Arithmetic
3.4.1 Addition and Subtraction
3.4.2 Multiplication
3.5 Summary
4 FLY - Hardware Compiler
4.1 Introduction
4.2 The Fly Programming Language
4.3 Implementation details
4.3.1 Compilation Technique
4.3.2 Statement
4.3.3 Assignment
4.3.4 Conditional Branch
4.3.5 While
4.3.6 Parallel Statement
4.4 Development Environment
4.4.1 From Fly to Bitstream
4.4.2 Host Interface
4.5 Summary
5 Float - Floating Point Design Environment
5.1 Introduction
5.2 Floating Point Tools
5.2.1 Float Class
5.2.2 Optimization
5.3 Digital Sine-Cosine Generator
5.4 VHDL Floating Point operator generator
5.4.1 Floating Point Multiplier Module
5.4.2 Floating Point Adder Module
5.5 Application to Solving Differential Equations
5.6 Summary
6 Function Approximation using Lookup Table
6.1 Table Lookup Approximations
6.1.1 Taylor Expansion
6.1.2 Symmetric Bipartite Table Method (SBTM)
6.1.3 Symmetric Table Addition Method (STAM)
6.1.4 Input Range Scaling
6.2 VHDL Extension
6.3 Floating Point Extension
6.4 The N-body Problem
6.5 Implementation
6.6 Summary
7 Results
7.1 Introduction
7.2 GCD coprocessor
7.3 Floating Point Module Library
7.4 Digital sine-cosine generator (DSCG)
7.5 Optimization
7.6 Ordinary Differential Equation (ODE)
7.7 N Body Problem Simulation (Nbody)
7.8 Summary
8 Conclusion
8.1 Future Work
A Fly Formal Grammar
B Original Fly Source Code
Bibliography
List of Tables 
4.1 Main elements of the fly language
7.1 Area and speed of the floating point library
7.2 Optimization result using different QERR values where (x,y) are the
(exponent size, fraction size) in bits
7.3 Results generated by the differential equation solver for different
values of h
7.4 The frequency and slices used reported by design tools for N-body problem
7.5 All Experiments Result
List of Figures 
4.1 Circuitry used to handle multiple assignments to the same variable. This
is the circuit which results from a program with two assignments $l=$a and
$l=$s
4.2 Circuitry for if-else statements. This is the circuit which results from
the statement if ($a > 0) ... else
4.3 Circuitry for while statements. This is the circuit which results from
the statement while ($s != $l)
4.4 Circuitry for parallel statements
4.5 Circuitry for the host to FPGA interface using register
4.6 Circuitry for the host to FPGA interface using dual-port BlockRAM
5.1 Floating point algorithm design flow
5.2 Parameterized Floating Point multiplier datapath
5.3 Parameterized Floating Point adder datapath
6.1 Input partition of SBTM
6.2 Input partition of STAM
6.3 Extended VHDL Preprocessor
6.4 Datapath of [...] using STAM for floating point arithmetic
7.1 Digital sine-cosine generator reference output
7.2 Quantization error of the sine-cosine generator for different fraction
sizes
7.3 Quantization error for different fraction sizes





Chapter 1
Introduction
1.1 Motivation
Traditional development methods for FPGAs are complex
In the standard field programmable gate array (FPGA) based prototyping
methodology, algorithms are first developed in a programming language such
as C on a personal computer or workstation using floating point arithmetic.
When the system is later implemented in hardware, a fixed point version of the
algorithm is derived from the floating point version and then translated into
a hardware design in a hardware description language (HDL) such as VHDL.
Finally, the design is synthesized for an FPGA based prototyping environment
where it can be tested.
However, an HDL based design methodology results in low productivity compared
with software development in a programming language because of the following
issues:
• Hardware designs are parallel in nature while most people think in
sequential patterns
• The standard technique of decomposing a hardware design into datapath
and control adds complexity to the task
• Designers must develop a hardware interface for the FPGA board as well
as a software/hardware interface between a host system and the FPGA
• Elementary functions are not supported, so the designer needs to build
operations like reciprocal, log and sin from primitive operations before the
design can actually begin
The above issues significantly increase the design complexity, with an
associated increase in design and debugging time, especially when developing
the interface between a host system and the FPGA. Furthermore, the time spent
in the above process restricts the amount of time which can be spent on
higher level issues such as evaluating different algorithms and architectures
for the system.
Floating point arithmetic can take advantage of FPGAs
To date, FPGA systems have almost solely used fixed point arithmetic.
Although several groups have implemented floating point adders and multipliers
using FPGA devices [SWA95, LMM+98, JL01], very few systems employing
floating point arithmetic have been reported. FPGA density has improved to a
point where area concerns are becoming less significant and, aided by Moore's
Law, silicon density will continue to improve at an exponential rate. It is
believed that hardware systems employing floating point computations will
become increasingly popular as the density of hardware improves, particularly
in applications where variables have a very large dynamic range, or where the
designer wishes to avoid the complexity of translating the implementation to
fixed point.
In this work, an efficient way to implement floating point arithmetic on
FPGAs using flexible architectures will be presented.
1.2 Aims 
The objective of this research was to provide a design environment such that
any algorithm designer, even if not an expert in hardware development, can
implement a floating point algorithm on an FPGA by using a Perl-like language
to describe the algorithm. The detailed research aims are:
• The designer need not be familiar with a hardware description language
yet can implement the algorithm on the FPGA.
• The interface between the host and the FPGA board is encapsulated
such that it hides the details of the host interface from the designer.
• The designer need not have expertise in the implementation of floating
point arithmetic.
• The designer can focus on the algorithm and the implementation is done
by the system.
• Any differentiable function can be automatically generated and used in
the language.
• Design time is greatly reduced since the simulation is done at a very high
level and the resulting hardware implementation is correct by construction.
1.3 Contributions 
To address the design time issue, a compiler called fly for the translation of
software descriptions into hardware was developed. The input of fly is a
Perl-like description and it generates synthesizable VHDL that can be adapted
to different FPGA and ASIC design tools. In addition, a VHDL floating point
library was designed, which includes an optimizer for determining the minimum
floating point precision for each variable to reach some user-specified
tradeoff between quantization error and circuit size. To enhance its
flexibility, arbitrary functions for fixed point arithmetic are supported
through a table lookup approach.
To the best of the author's knowledge, the integration of a hardware
compiler, floating point library, optimizer and table lookup generator,
resulting in a dedicated development environment, is novel.
Several applications, using fixed point and floating point arithmetic, have
been developed using the tools. These include the following:
• Greatest Common Divisor Processor
• Digital Sine Cosine Generator
• Ordinary Differential Equation Solver
• N-Body Problem Simulator
Compared with previous design systems, the design time required for these
applications is greatly reduced, while manual translation errors are
eliminated by automatic hardware construction.
1.4 Thesis Organization 
The rest of the thesis is organized as follows. Chapter 2 describes previous
work and implementations. Chapter 3 introduces floating point arithmetic.
In chapter 4, the fly compiler is described. Chapter 5 discusses the
optimization of floating point operations and presents the related library.
The table lookup approach and its implementation are described in chapter 6.
Results from experiments using the system are reported in chapter 7.
Conclusions are drawn and further work is suggested in chapter 8.
Chapter 2 
Background and Literature 
Review 
2.1 Introduction 
This chapter provides background information for the thesis. It includes an
introduction to Field Programmable Gate Array (FPGA) technology and one of
its development languages, VHDL. The chapter then reviews previous hardware
compilation techniques, the construction of floating point arithmetic and the
implementation of functions using the lookup table approach. Hardware
compilation refers to the translation of an algorithm specified in a source
file into a hardware design. The aim of program translation is to build a
working environment in which implementing an FPGA application is just like
software programming, avoiding traditional hardware level descriptions
completely [Pag96].
2.2 Field Programmable Gate Arrays 
Field Programmable Gate Arrays (FPGAs) are integrated circuits whose
functionality can be modified in the field after fabrication. Therefore, an
FPGA can be customized for different applications as long as the device itself
is complex enough to hold the logic.
A regular FPGA chip consists of an array of logic blocks and routing
channels. I/O pads are attached at the sides of the chip. Both logic blocks
and routing channels can be reconfigured to handle arbitrary functions and
connections respectively. Different FPGA chips have different internal
structures of logic blocks.
In this research, the Xilinx Virtex XCV1000E FPGA [Xil01] will be used
unless otherwise specified. The XCV1000E contains 6,144 configurable logic
blocks (CLBs). Each logic block contains 4 logic cells organized in two
similar slices. The slice can be regarded as the primitive component of the
XCV1000E. Each slice consists of two 4-input lookup tables (LUTs) and two
flip-flops. The XCV1000E also provides 96 blocks of on-chip dual-read/write
port synchronous RAM with 4096 memory cells in each block. These storage
elements can be used for transferring data between the host machine and the
FPGA board and can act as temporary storage inside the FPGA. The routing
channel is implemented using a routing matrix which can connect I/O pads,
clock signals and general purpose logic together.
2.3 Traditional design flow and VHDL 
Several steps are necessary for implementing customized functions on FPGA
chips. The designer must simulate the algorithm in software, construct the
datapath in hardware, design the control signals for the datapath, simulate
the datapath and control signals for verification, and implement a protocol
for interfacing between the host and the FPGA board.
Though simulating the algorithm in software is easy for a software designer,
the remaining stages require extra hardware knowledge to realize the design.
To construct the datapath, a schematic approach can be used for simple
designs, but it may not be practical for real life applications, which often
involve thousands of logic gates. Therefore, it is necessary to use a hardware
description language such as VHDL [IEE02] when implementing complex logic
in hardware or on an FPGA. Even though programming languages and hardware
description languages share some concepts, such as variables versus signals,
the nature of a hardware description language is totally different from that
of a programming language. A hardware description language, as the name
suggests, is used to describe hardware functionality. Unlike normal
programming languages, hardware description languages may run several
operations in parallel, and explicit specification of the timing is required
to make the design work.
In software designs, the code executes sequentially. To achieve the same
effect in hardware, control signals and state machines can be described using
VHDL. To complete the logic design, both the datapath and the state machines
must be implemented. In most cases this involves rewriting the algorithm in
VHDL.
In order to program the FPGA, a bitstream generated by the design tool
is required. The VHDL code is first synthesized into a netlist. The netlist
contains a representation of the hardware, such as the function of each basic
block and the connections between the blocks. The design tool extracts the
information in the netlist and maps the logical blocks and connections to
specific lookup tables and routing matrices respectively. It finally produces
a bitstream which customizes the functionality of the FPGA when this
information is written onto the chip.
2.4 Single Description for Hardware-Software 
Systems 
I. Page [Pag96] demonstrated the translation of basic programming constructs,
including assignment statements, parallel composition, sequential composition,
conditional composition and repetitive composition, into hardware. Page
used this architecture to implement a real-time video processing application.
It is reported that the fully operational, high-bandwidth hardware system was
constructed by an undergraduate programmer without knowledge of hardware
as a summer course project.
M. Ward et al. [WA02] proposed a hardware implementation of the Ada
language that allows accurate timing analysis. It supports standard
programming statements such as assignment, branching and loops, and includes
non-recursive sub-program calls. Two standard parameter passing techniques,
namely pass-by-value and pass-by-reference, can be used in this language
depending on the type of variable. The timing of the produced circuit
is analyzed accurately and the main application is real-time systems.
MATCH (MATlab Compiler for Heterogeneous computing systems) [BSC+99]
is a compiler project developed at Northwestern University. MATCH takes
MATLAB descriptions of various embedded systems applications and
automatically maps them onto a configurable computing environment consisting
of FPGAs, embedded processors and digital signal processors. Among the
supported functions are matrix addition, matrix multiplication, one
dimensional FFT, and FIR and IIR filters. Code generation for the FPGA is a
conversion to VHDL, so branching and assignment are straightforward. A finite
state machine was developed to control loop statements. An MPEG decoder was
developed using a heterogeneous set of resources as a MATCH example.
2.5 Parameterized Floating Point Arithmetic 
Implementation 
FPGA technology is well suited to parameterized floating point arithmetic
implementation. A. Jaenicke and W. Luk [JL01] have implemented a
parameterized floating point adder and multiplier on FPGAs. The design is
based on the Handel-C language and the data format is a variant of the IEEE
standard. It is reported that the floating point adder can perform 28 MFLOPS
for arbitrary sizes of fraction and exponent. A 2D Fast Hartley Transform
(FHT) processor has been developed using this FPU as a basic building block,
and it can perform a 1K-point transform in 10 μs.
P. Belanovic et al. [BL02] implemented a parameterized floating point
library for use with reconfigurable hardware. It is based on the IEEE 754
floating point format standard. The library includes addition, subtraction,
multiplication and conversion between fixed point and floating point numbers.
All of these modules are specified in VHDL and implemented on the Wildstar
reconfigurable computing engine. They are fully pipelined and cascadable
to form pipelines of floating point operations. This library was used to
develop a hybrid implementation of the K-means clustering algorithm applied to
multispectral images.
2.6 Function Approximations by Table Lookup 
and Addition 
Elementary function approximations are important in scientific computing.
The lookup table approach is the most common technique for implementing these
functions, since the storage capacity of FPGA devices has increased rapidly
in recent years. J. E. Stine and M. J. Schulte [SS99a] have developed a method
for computing elementary functions using parallel table lookups and a
multi-input adder. The method is suitable for any differentiable function and
the input range can be varied according to specific needs. The latency of the
design is low because the table lookups are performed in parallel.
2.7 Summary 
In this chapter, different aspects of FPGA design have been reviewed,
including the use of a single description for both hardware and software
systems, floating point arithmetic and the implementation of elementary
functions. This thesis applies these techniques to enable rapid prototyping
of floating point systems.
Chapter 3 
Floating Point Arithmetic 
3.1 Introduction 
This chapter is an introduction to floating point number arithmetic. Floating
point algorithms are used frequently in modern applications such as speech
recognition, image processing and financial engineering because of their
ability to represent a good approximation to the real numbers.
The IEEE 754 floating point standard [ANS85] has been widely accepted
for representing floating point numbers. With this standard, the result and
the error of each floating point operation remain the same even if the
platform of the computation is changed.
Floating point arithmetic, including addition, subtraction and
multiplication, is covered in this chapter. The rounding error imposed by
using floating point arithmetic will be discussed. The concept of
quantization error between the IEEE standard and the variant used in this
thesis will be introduced.
3.2 Floating Point Number Representation 
Every real number can be approximated by a floating point number in the
IEEE 754 standard as long as that number is within a specific range. The
floating point number format is based on scientific notation with a limited
size for each field. For a normalized floating point number in the IEEE 754
single precision standard, the sign bit is 1 bit in size, the fraction part
is 23 bits and the exponent is 8 bits. The integer part is omitted as it is
always equal to 1. The base is always equal to 2 and the total size of a
single precision floating point number is 32 bits. In general, an IEEE
floating point number F can be expressed as follows:
\[ F = (-1)^{s} \times 1.f \times 2^{e-b} \qquad (3.1) \]
\[ b = 2^{e_{size}-1} - 1 \qquad (3.2) \]
where $s$ stands for the sign bit, $f$ stands for the fraction and $e$ stands
for the biased exponent. In order to express a negative exponent, there is an
exponent bias $b$ associated with the exponent field. The actual exponent is
the value of the exponent field minus the bias. The value of the bias depends
on the size of the exponent field, $e_{size}$, as in equation 3.2. The term
significand represents $1.f$, in which the integer field and fraction field
are packed together.
For the single precision floating point system, the bias is 127 since
$e_{size}$ is 8. If the exponent field $e$ is 128, the actual exponent is
128 - 127 = 1. The integer field for most numbers is equal to 1 since they
are normalized. Denormalized numbers are indicated by the exponent field being
0. In this case, $F = (-1)^{s} \times 0.f \times 2^{1-b}$ is represented. The
above floating point format without denormalized numbers is used throughout
this thesis to represent floating point values with arbitrary exponent and
fraction sizes.
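As an illustration of equations 3.1 and 3.2, the following Python sketch decodes a bit pattern for an arbitrary exponent and fraction size. The function name `decode` and its parameters are hypothetical and are not part of the float library; in line with the format used in this thesis, it assumes there are no denormalized numbers or special values.

```python
def decode(bits, e_size=8, f_size=23):
    """Decode a normalized floating point bit pattern using equations 3.1 and 3.2.

    Illustrative helper only: every pattern is treated as normalized
    (no denormalized numbers, infinities or NaNs).
    """
    f = bits & ((1 << f_size) - 1)                # fraction field
    e = (bits >> f_size) & ((1 << e_size) - 1)    # biased exponent field
    s = (bits >> (f_size + e_size)) & 1           # sign bit
    bias = (1 << (e_size - 1)) - 1                # equation 3.2
    significand = 1 + f / (1 << f_size)           # 1.f with the hidden integer bit
    return (-1) ** s * significand * 2.0 ** (e - bias)   # equation 3.1


single = decode(0x3FC00000)                       # single precision pattern
custom = decode(0x3C00, e_size=5, f_size=10)      # 5-bit exponent, 10-bit fraction
```

With the default sizes the bias evaluates to 127, matching the single precision case discussed above.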
3.3 Rounding Error 
There are four rounding modes in the IEEE floating point standard, namely
round to nearest, round towards $+\infty$, round towards $-\infty$ and round
towards zero. The algorithms described below use the round to zero mode. Under
this mode, the result shall be the value closest to and no greater in
magnitude than the infinitely precise result. Assuming that the length of
precision, including the integer field, is $p$ bits, for each of the floating
point operations there will be an absolute error less than $2^{e-p}$, where
$e$ is the exponent after the normalization of the resulting value. For
example, let $p = 3$; the result of the following floating point addition
\[ 1.01 \times 2^{0} + 1.00 \times 2^{-3} = 1.011 \times 2^{0} \approx 1.01 \times 2^{0} \]
will contribute an absolute error of $2^{e-p} = 2^{0-3} = 2^{-3}$.
As the answer, after normalization, must be at least $2^{e}$, the relative
error corresponding to the answer will be smaller than
\[ \epsilon = \frac{2^{e-p}}{2^{e}} \qquad (3.3) \]
\[ \phantom{\epsilon} = 2^{-p} \qquad (3.4) \]
When analyzing the rounding error caused by various formulas, relative
error is a better measure than absolute error. In particular, when comparing
the error of an equation evaluated with different values, the relative error
can be estimated because it is independent of the given value itself. The
relative error is always bounded by $\epsilon$, which is referred to as
machine epsilon.
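The worked example above can be reproduced with exact rational arithmetic. The Python sketch below is an illustration only: the helper `truncate` is a hypothetical name, and it assumes a positive input and a precision $p$ that includes the integer bit.

```python
from fractions import Fraction


def truncate(x, p):
    """Round a positive Fraction x toward zero to p significant bits.

    Returns the rounded value and the exponent e of the normalized input,
    i.e. the e satisfying 2**e <= x < 2**(e + 1).
    """
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:          # the bit-length estimate can be one too high
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)  # weight of the last kept bit
    return (x // ulp) * ulp, e        # floor division discards the lower bits


x = Fraction(0b101, 4) + Fraction(1, 8)   # 1.01b * 2^0 + 1.00b * 2^-3 = 1.011b
rounded, e = truncate(x, 3)               # keep p = 3 bits
abs_err = x - rounded                     # absolute error of the truncation
rel_err = abs_err / x                     # relative error of the truncation
```

Here `rounded` is 1.01 in binary and `abs_err` is $2^{-3}$, as in the example. Because the relative error divides by the value itself, it can be compared across operands of very different magnitude.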
3.4 Floating Point Number Arithmetic 
In this section, the arithmetic of floating point numbers is outlined. It
focuses on the hardware aspect of the floating point operations using a
register transfer language (RTL). The descriptions further assume that the
IEEE round to zero mode is used when handling inexact results.
3.4.1 Addition and Subtraction 
Let $F_1$ and $F_2$ represent two single precision floating point numbers,
$F_{sum}$ the sum of these two numbers and $F_{minus}$ the difference
$F_1 - F_2$. As the floating point format uses a signed-magnitude
representation, the equation
\[ F_{minus} = F_1 - F_2 \qquad (3.5) \]
can be rewritten as
\[ F_{minus} = F_1 + (-F_2) \qquad (3.6) \]
So this section will deal with the addition algorithm only. Subtraction is a
variation of addition in which the sign bit of $F_2$ is inverted.
Let $F_i$ be denoted as $(-1)^{s_i} \cdot (1 + 0.f_i) \cdot 2^{e_i - b}$,
where $s_i$, $f_i$ and $e_i$ are the sign field, fraction field and exponent
field in the floating point representation respectively and $b$ is the
exponent bias.
The IEEE standard requires that arithmetic operations, including addition
and multiplication, be computed as if they first produced an intermediate
result correct to infinite precision with unbounded range, and then coerced
this result to fit the destination's format. However, this is very expensive
in terms of intermediate storage if the operands differ greatly in size.
Assuming that $p = 3$, $1.11 \times 2^{10} + 1.00 \times 2^{-2}$ would be
calculated as
x     = 1.110000000000 × 2^10
y     = 0.000000000001 × 2^10
x + y = 1.110000000001 × 2^10
which is then rounded to $1.11 \times 2^{10}$. It uses 13 bits to store the
result, which is about 4 times the number of bits in the operands. When the
difference between the exponents is larger, the intermediate result is larger
too.
Without using infinite precision for the intermediate result, lengthening the
intermediate result by 2 bits at the right is adequate for obtaining a
properly rounded to zero result. These 2 bits are called the guard bit and the
round bit. The guard bit guarantees that the relative rounding error in the
result is less than $2\epsilon$. The round bit guarantees that rounding in the
round to zero mode is always correct [Gol91]. In general, the sum of $F_1$ and
$F_2$ is evaluated as shown in algorithm 1, where the symbol ## denotes
concatenation of two registers, and $s_i$, $e_i$ and $f_i$ denote the sign
field, exponent field and fraction field of the floating point number $F_i$
respectively. The algorithm further assumes that the single precision format
is used for $F_1$ and $F_2$. However, with some minor modifications, it can be
used for arbitrary precision floating point formats. For simplicity, the
algorithm does not check any special cases such as negative zero, illegal
numbers and so on. These cases are handled in the hardware implementation of
floating point addition.
Algorithm 1 Calculate F_1 + F_2 with floating point arithmetic
Require: F_1 = (s_1, e_1, f_1), F_2 = (s_2, e_2, f_2)
Ensure: F_ans = (s_ans, e_ans, f_ans) = F_1 + F_2
Note: s_a, e_a and s_b, e_b are the sign and exponent fields of the operands
selected as f_a and f_b; m_a and m_b equal f_a and f_b when the corresponding
sign bit is 0.
1: e_diff ← e_1 - e_2
2: if e_diff > 0 then
3:   f_a ← f_1, f_b ← f_2, e_s ← e_diff
4: else
5:   f_a ← f_2, f_b ← f_1,
6:   e_s ← 2's complement of e_diff
7: end if
8: f_a ← "001" ## f_a, f_b ← "001" ## f_b
9: f_b ← shift f_b right by e_s bits
10: if s_a = 1 then
11:   m_a ← 2's complement of f_a
12: end if
13: if s_b = 1 then
14:   m_b ← 2's complement of f_b
15: end if
16: f_tmp ← m_a + m_b
17: if f_tmp is negative then
18:   f_tmp ← 2's complement of f_tmp, s_ans ← 1
19: else
20:   s_ans ← 0
21: end if
22: find the leading one of f_tmp, shift f_tmp left until f_tmp(msb) = 1,
23: e_ans ← e_a - number of bits shifted to left, where msb is the location
    of the most significant bit
24: omit the integer part, f_ans = f_tmp(msb - 1 ... 0)
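A software reference model is useful for checking a hardware adder of this kind. The Python sketch below is not a line-by-line transcription of Algorithm 1. Instead of the guard and round bits, it forms the exact sum in arbitrary precision integers and then truncates, which is the "as if infinitely precise" behaviour the IEEE standard describes. The function name `fp_add`, the tuple layout `(s, e, f)` and the all-zero encoding returned for a zero sum are assumptions made for illustration; special cases and exponent overflow are ignored.

```python
def fp_add(a, b, f_bits=23):
    """Add two normalized numbers given as (sign, biased exponent, fraction).

    Reference model only: exact integer sum, then round toward zero.
    """
    (s1, e1, f1), (s2, e2, f2) = a, b
    if e1 < e2:                                   # make operand 1 the larger exponent
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    e_diff = e1 - e2
    m1 = ((1 << f_bits) | f1) << e_diff           # restore hidden bit, align exactly
    m2 = (1 << f_bits) | f2
    if s1:
        m1 = -m1                                  # sign-magnitude to signed integer
    if s2:
        m2 = -m2
    total = m1 + m2                               # exact sum in units of 2^(e2 - bias - f_bits)
    if total == 0:
        return (0, 0, 0)                          # assumed encoding of zero
    s_ans = 1 if total < 0 else 0
    mag = abs(total)
    msb = mag.bit_length() - 1                    # position of the leading one
    e_ans = e2 + (msb - f_bits)                   # renormalized biased exponent
    if msb >= f_bits:
        mag >>= msb - f_bits                      # drop low bits: round toward zero
    else:
        mag <<= f_bits - msb                      # left shift after cancellation
    f_ans = mag & ((1 << f_bits) - 1)             # omit the integer part
    return (s_ans, e_ans, f_ans)


# The p = 3 example of section 3.3: 1.01b * 2^0 + 1.00b * 2^-3 (two fraction bits).
example = fp_add((0, 10, 0b01), (0, 7, 0b00), f_bits=2)
```

Aligning by shifting the larger operand left keeps the sum exact, at the cost of an intermediate width that grows with the exponent difference; this is the storage cost that the guard and round bits avoid in hardware.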
3.4.2 Multiplication 
Multiplication is simpler than addition, assuming that a fixed point
multiplier is provided. The product of $F_1$ and $F_2$, where both $F_1$ and
$F_2$ are normalized floating point numbers, is evaluated as in algorithm 2.
For simplicity, the algorithm does not check any special cases such as
negative zero, illegal numbers and so on. These cases are handled in the
hardware implementation of floating point multiplication.
Algorithm 2 Calculate F_1 × F_2 with floating point arithmetic
Require: F_1 = (s_1, e_1, f_1), F_2 = (s_2, e_2, f_2)
Ensure: F_ans = (s_ans, e_ans, f_ans) = F_1 × F_2
1: s_ans ← s_1 ⊕ s_2
2: append 1 bit "1" to f_1 and f_2 at left as the hidden integer field
3: v_1 ← "1" ## f_1
4: v_2 ← "1" ## f_2
5: do fixed point unsigned multiplication m_c ← v_1 × v_2
6: e_t ← e_1 + e_2 - b
7: shift m_c to left until msb of m_c is 1
8: e_s ← number of bits shifted to left
9: e_ans ← e_t - e_s
10: f_ans ← m_c(44…22)
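The multiplication steps can be modelled in software in the same spirit. The Python sketch below is an illustration under stated assumptions, not the thesis implementation: `fp_mul` and the tuple layout `(s, e, f)` are hypothetical, the product register is taken to be 2(f_bits + 1) bits wide with bits numbered from 0, and the exponent adjustment and fraction slice are derived from that numbering, so its bit positions need not match those printed in the listing. Special cases and exponent overflow are ignored.

```python
def fp_mul(a, b, e_bits=8, f_bits=23):
    """Multiply two normalized numbers given as (sign, biased exponent, fraction).

    Reference model only: unsigned integer product, then round toward zero.
    """
    (s1, e1, f1), (s2, e2, f2) = a, b
    bias = (1 << (e_bits - 1)) - 1
    s_ans = s1 ^ s2                               # sign of the product
    v1 = (1 << f_bits) | f1                       # prepend the hidden integer bit
    v2 = (1 << f_bits) | f2
    mc = v1 * v2                                  # fixed point unsigned product
    width = 2 * (f_bits + 1)                      # assumed product register width
    e_t = e1 + e2 - bias                          # remove the doubly counted bias
    e_s = 0
    while not (mc >> (width - 1)) & 1:            # normalize: at most one shift
        mc <<= 1
        e_s += 1
    e_ans = e_t + 1 - e_s                         # product lies in [1, 4) before shifting
    f_ans = (mc >> (width - 1 - f_bits)) & ((1 << f_bits) - 1)  # truncate low bits
    return (s_ans, e_ans, f_ans)


product = fp_mul((0, 127, 0x400000), (0, 127, 0x400000))   # 1.5 * 1.5
```

Since both significands lie in [1, 2), their product lies in [1, 4), so the normalization loop shifts by at most one position.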
3.5 Summary 
This chapter described the fundamental concepts of floating point numbers.
It introduced the number format and operations including addition,
subtraction and multiplication. It further discussed the effect of rounding
errors on floating point operations.
Chapter 4 
FLY - Hardware Compiler 
4.1 Introduction 
This chapter describes the implementation details of the fly compiler. The
fly compiler translates a Perl-like algorithm description into synthesizable
VHDL code. Fly supports most elementary constructs such as conditional
branching and looping. This chapter begins with the syntax of the fly
programming language. For each construct, the implementation will be described
using a greatest common divisor program as an example. A summary is given at
the end of the chapter.
4.2 The Fly Programming Language 
The syntax of the fly programming language is modeled on Perl, with exten-
sions for parallel statements and the host/FPGA interface. Table 4.1 shows the 
main elements of the fly language with simple examples. The formal grammar 
definition is in Appendix A. 
Using Perl-like description has its advantages. It facilitates the compatibil-
ity between software simulation and hardware implementation. Any algorithm 
that can be described in fly without using parallel constructs, would be able to 
simulate on Perl by executing the script without any modification. In addition, 
it is easier for designers to learn the fly other than HDL based languages. It 
18 -
Chapter 4 FLY - Hardware Compiler 19 �� 
also minimizes the error due to the translation of software simulation version 
to hardware datapath description.

Construct           Elements                     Example
------------------  ---------------------------  --------------------------------
assignment          var = expr;                  $var1 = $tempvar;
parallel statement  [ {...} {...} ... ]          [ {$a = $b;} {$b = $a * $c;} ]
expression          val op expr;                 $a = $b .* $c;
                    valid ops: *, /, +, -, .*, .-, .+
while loop          while (rel) {...}            while ($x < $y) {
                                                   $a = $a + $b; $y = $y + 1; }
if-else             if (cond) {...} else {...}   if ($i <= $j) {$a = $b;}
                                                 else {$a = $c;}
if                  if (cond) {...}              if ($i > $j) {$i = $i + 1;}
cond                expr rel expr                $a == $c
                    valid rels: >, <, <=, >=, ==, !=
built-in function   &read_host(..)               $a = &read_host(255)
comment             # comment                    #this line is comment

Table 4.1: Main elements of the fly language.
The fly program for a greatest common divisor (GCD) coprocessor is given in
listing 4.1. The program uses most elements of the fly language and system,
including the host interface, while loops, if-else branches, integer
arithmetic, parallel statements and register assignment. This example is used
in the rest of this chapter to illustrate the translation process.
4.3 Implementation details 
4.3.1 Compilation Technique 
Programs in the fly language are automatically mapped to hardware using the
technique described by Page [Pag96]. The compiler generates synthesizable
VHDL code instead of a netlist, simplifying code generation and making the
output portable to many different FPGA and ASIC design tools. Furthermore,
with VHDL as an intermediate language, the logic optimization of the
synthesis tool is included in the design flow.

Listing 4.1: Greatest Common Divisor
{
    $s = $din[1]; $l = $din[2];
    while ($s != $l) {
        $a = $l - $s;
        if ($a > 0) {
            $l = $a;
        }
        else {
            [ {$s = $l;} {$l = $s;} ]
        }
    }
    $dout[1] = $l;
}
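The behaviour of the GCD program in listing 4.1 can be checked in software
before it is compiled to hardware. The following is a minimal Python model,
assuming positive integer inputs; the function name gcd_fly is illustrative.
The parallel block is modelled with a tuple assignment, because both
right-hand sides must read the values held before the block starts.

```python
def gcd_fly(s, l):
    """Software model of the GCD program; s and l are positive integers."""
    while s != l:
        a = l - s
        if a > 0:
            l = a                # the larger value is reduced by the smaller one
        else:
            s, l = l, s          # parallel block: both sides read the old values
    return l
```

Calling gcd_fly(12, 18) returns 6. A sequential translation of the swap
(s = l followed by l = s) would lose the old value of s, which is why the
parallel statement is needed in the fly source.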
In order to facilitate the support of control structures, each statement has 
a start and end signal that specifies temporally when the execution of one 
statement begins and ends. By connecting the start and end signals of ad-
jacent statements together, a one-hot state machine is constructed that serves 
as the control flow of the fly program. 
Fly is written in the Perl programming language [WCO00]. Perl is a language
with very good portability, string handling facilities and libraries. The fly
system's source code in Appendix B is simpler and more concise as a result of
using Perl. Development of the fly compiler was also facilitated by a parser
generator called Parse::RecDescent [Con01], which generates a Perl-based
recursive descent parser from a description of the grammar of the target
language.
4.3.2 Statement 
A program is a sequence of statements, each statement being either an
assignment, a set of statement sequences to be executed in parallel, an
if-else, or a while loop. Each statement has an associated start and end
signal, and a sequence of statements is constructed by connecting the
individual statements' start and end signals together. A statement is said to
be enabled if its start signal is high during the rising edge of the (global)
clock.
The start signal of the entire program is generated by the host interface.
For example, the first statement of the GCD program that is enabled is the
assignment $s = $din[1];. The end signal of this statement is connected to
the start signal of the next statement, namely $l = $din[2];. In this case,
the end signal is generated from the start signal by delaying it one clock
cycle using a D-type flip-flop.
Eventually, the last statement of the program, $dout[1] = $l;, will be
enabled, and after it has been executed (i.e. its end signal is asserted), the
execution of the program is completed.
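The chaining of start and end signals can be illustrated with a small
cycle-by-cycle simulation. The Python sketch below assumes that every
statement takes exactly one clock cycle and that the host pulses the first
start signal for one cycle; run_chain is a hypothetical helper, not part of
fly. Each row of the returned trace is one-hot, which is the property the
generated state machine relies on.

```python
def run_chain(n_statements, cycles):
    """Return the start signals of each statement, one list per clock cycle."""
    ff = [0] * n_statements              # flip-flop outputs, i.e. the end signals
    trace = []
    for cycle in range(cycles):
        host_start = 1 if cycle == 0 else 0
        # The end signal of statement k drives the start signal of statement k+1.
        start = [host_start] + ff[:-1]
        trace.append(start)
        ff = start[:]                    # rising edge: each flip-flop samples its start
    return trace
```

With three statements, run_chain(3, 4) shows the single active start signal
moving from the first statement to the last, after which all signals are low.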
4.3.3 Assignment 
Assignments are implemented simply by asserting the destination register's
enable signal when its associated statement is enabled. If a variable is the
target of an assignment from more than one statement, a multiplexer and an
encoder are used to select the appropriate source value.
For example, if a program has two assignments to the same variable, i.e.
$l = $a and $l = $s, and the associated start and end signals are $start1,
$end1 and $start2, $end2 respectively, the circuit in Figure 4.1 is generated.
Figure 4.1: Circuitry used to handle multiple assignments to the same variable.
This is the circuit which results from a program with two assignments $l=$a
and $l=$s.
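The register update for a variable with two assignments can be written as a
small behavioural model. This Python sketch assumes the one-hot property of
the control path, so the two start signals are never high in the same cycle;
the function name register_l_next is illustrative.

```python
def register_l_next(l, a, s, start1, start2):
    """Value held by register $l after the next rising clock edge."""
    write_enable = start1 or start2      # WE is asserted when either statement is enabled
    if not write_enable:
        return l                         # no assignment is enabled: the register holds
    return a if start1 else s            # the multiplexer selects the matching source
```

For instance, register_l_next(5, 7, 9, 1, 0) yields 7 (the statement $l = $a is
enabled), while register_l_next(5, 7, 9, 0, 0) keeps the stored value 5.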
4.3.4 Conditional Branch 
An if-else statement has a condition and two statement blocks. The start
signal of the if-else statement is routed to the appropriate block of
statements depending on the condition. Figure 4.2 shows the resulting circuit
for the statement if ($a > 0) ... else .... The end signals of both blocks are
ORed together to produce the end signal of the if-else statement.
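The routing described for if-else statements is purely combinational and can
be summarised in a few lines. The following Python sketch uses hypothetical
signal names and only models the control signals, not the statement blocks
themselves.

```python
def if_else_control(start, cond, end_if_block, end_else_block):
    """Return (start_if, start_else, end) for one if-else statement."""
    start_if = bool(start and cond)            # start is routed to the if block
    start_else = bool(start and not cond)      # or to the else block
    end = bool(end_if_block or end_else_block) # either block finishing ends the statement
    return start_if, start_else, end
```

Only one of start_if and start_else can be true in a given cycle, so the
one-hot property of the control path is preserved across the branch.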
4.3.5 While 
The end signal of a while statement must be conditionally fed back to the start 
signal for the statement block. The circuit corresponding to the while loop in 
the GCD algorithm is shown in Figure 4.3. 
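The feedback path of a while statement can be sketched in the same
behavioural style. This Python fragment is an assumption-laden model with
hypothetical signal names: the condition is evaluated both when the loop is
first entered and each time the statement block finishes.

```python
def while_control(start_while, end_block, cond):
    """Return (start_block, end_while) for one while statement."""
    evaluate = start_while or end_block          # entry, or feedback from the block
    start_block = bool(evaluate and cond)        # condition holds: run the block again
    end_while = bool(evaluate and not cond)      # condition fails: leave the loop
    return start_block, end_while
```

For the GCD loop, cond corresponds to the comparison $s != $l, so end_while
is asserted in the first evaluation in which both registers hold the same
value.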
4.3.6 Parallel Statement 
In the GCD example, a parallel statement is used to swap the $s and $l
variables. As shown in Figure 4.4, each sequential block enclosed by the
parallel brackets [ ] starts execution at the same time. The parallel block
ends when all sequential blocks have given an end signal. A statement only has
an active end signal for a single cycle, so flip-flops (labelled "FF" in the
figure) are added to determine when all statements have finished. When all the
flip-flops are set, the parallel statement has ended, and they are cleared at
the next clock cycle.

Figure 4.2: Circuitry for if-else statements. This is the circuit which
results from the statement if ($a > 0) ... else ....

Figure 4.3: Circuitry for while statements. This is the circuit which results
from the statement while ($s != $l).

Figure 4.4: Circuitry for parallel statements.
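The role of the flip-flops in a parallel statement can be shown with a short
simulation. This Python sketch assumes each block produces a single-cycle end
pulse, as described above, and does not attempt to reproduce the exact cycle
timing of the generated VHDL; parallel_join is a hypothetical name.

```python
def parallel_join(end_pulses):
    """end_pulses[c][k] is 1 if block k asserts its end signal in cycle c.

    Returns the cycle in which all blocks have finished, or None.
    """
    ff = [0] * len(end_pulses[0])
    for cycle, pulses in enumerate(end_pulses):
        ff = [f | p for f, p in zip(ff, pulses)]   # latch each single-cycle end pulse
        if all(ff):
            return cycle       # all flip-flops set; hardware clears them next cycle
    return None
```

If the first block finishes in cycle 0 and the second in cycle 2, the call
parallel_join([[1, 0], [0, 0], [0, 1]]) returns 2. Without the flip-flops, the
two pulses would never be high together and the end of the parallel statement
would be missed.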
4.4 Development Environment 
4.4.1 From Fly to Bitstream 
Although the interface is easily adaptable to any reconfigurable computing
card, the fly system currently only supports the Pilchard reconfigurable
computing platform [LLC+01]. Pilchard uses a DIMM memory bus interface instead
of a conventional PCI bus. The advantage of the memory bus is that it achieves
much improved latency and bandwidth over the standard PCI bus.
The translated output of a fly program is interfaced with a generic Pilchard
core written in VHDL. A shell script, automatically invoked by the fly system,
includes the libraries and invokes the programs which are required to compile
the VHDL representation of the user's program to a bitstream. The bitstream
is then automatically downloaded to the FPGA and the host interface program
is automatically invoked. Thus the entire compilation and execution process is
hidden from the user.
4.4.2 Host Interface 
To enhance the flexibility of the host/FPGA interface, two interfaces were
developed, namely the register approach and the BlockRAM approach. Each
approach suits certain applications.
Registers can be used to transfer data between the FPGA and the host. The
architecture of the host interface is shown in Figure 4.5. In normal
operation, the host processor initializes values in $din[1] to $din[x], and
then starts execution of the FPGA-based coprocessor by performing a write
cycle to the $din[0] register. The write cycle causes the start signal of the
first statement in the FPGA to be asserted. The software then polls the least
significant bit of $din[0], which is connected to the end signal of the last
statement. When execution on the FPGA finishes, the least significant bit of
$din[0] is set and the program can read the values returned by the hardware by
reading the appropriate registers.
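The host side of this handshake can be sketched as follows. The Python code
below is only an illustration of the write, poll and read sequence: the class
FakeCoprocessor is a hypothetical software stand-in for the memory-mapped
registers, and neither it nor run_gcd corresponds to the actual Pilchard host
interface program.

```python
import math


class FakeCoprocessor:
    """Hypothetical stand-in for the register interface of a GCD core."""

    def __init__(self, latency=3):
        self.din = [0, 0, 0]              # din[0] is the control/status register
        self.dout = [0, 0]
        self._latency = latency
        self._remaining = None            # polls left until the end signal is set

    def write(self, index, value):
        if index == 0:
            self._remaining = self._latency   # write cycle to din[0] asserts start
        else:
            self.din[index] = value

    def read_status(self):
        """Return din[0]; its least significant bit mirrors the end signal."""
        if self._remaining is None:
            return 0
        if self._remaining > 0:
            self._remaining -= 1
            return 0
        self.dout[1] = math.gcd(self.din[1], self.din[2])   # result of the core
        return 1


def run_gcd(dev, s, l):
    dev.write(1, s)                       # initialize the input registers
    dev.write(2, l)
    dev.write(0, 1)                       # start the coprocessor
    while not dev.read_status() & 1:      # poll the least significant bit of din[0]
        pass
    return dev.dout[1]                    # read back the result register
```

With this stand-in, run_gcd(FakeCoprocessor(), 12, 18) returns 6 after a few
polling iterations.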
By using the register interface, the fly core can be adapted to different
FPGA and ASIC products. The data can be fetched immediately, without address
decoding cycles inside the FPGA. However, the register approach cannot support
streaming data, which is common in DSP designs. The number of arguments that
can be passed to the fly core is also limited, since the registers consume
FPGA logic cells.
Another approach to the host/FPGA interface is to use the BlockRAM [Xil01]
feature which is available on Xilinx Virtex devices. The BlockRAM is
configured as dual port; one port is connected to the host bus while the other
is connected to the fly core, as shown in Figure 4.6. Two built-in functions,
read_host() and write_host(), are introduced to access the data in the
BlockRAM. The handshaking is similar to the register approach. Address 0 in
the BlockRAM is used for handshaking: a write cycle issued on address 0
triggers the start of the FPGA coprocessor. When the FPGA finishes execution,
a read cycle performed by the host on address 0 returns 1.
Since the BlockRAM does not consume the logic resources in the FPGA, it has
advantages in area and performance over a large number of registers. In
addition, the interface clock and the core clock can be of different
frequencies, which adds flexibility in meeting specific design constraints.
For instance, the core clock can run faster than the interface clock when two
clocks are provided. The BlockRAM also supports data streaming: the processor
can provide data to the FPGA and the FPGA can return results at the same time,
since the BlockRAM is dual ported.

Figure 4.5: Circuitry for the host to FPGA interface using registers

Figure 4.6: Circuitry for the host to FPGA interface using dual-port BlockRAM
4.5 Summary 
In this chapter, the Perl programming language was used to develop a powerful
yet simple hardware compiler for FPGA design. Unlike previous compilers, fly
was designed to be easily modifiable to facilitate research in hardware
languages and code generation. Since fly is tightly integrated with the
hardware design tools and implementation platform, designers can operate at a
higher level of abstraction than they would be accustomed to if they used
VHDL. An example of a GCD coprocessor was given. Development time was
significantly reduced since debugging can be done through software simulation
of the program.
Chapter 5 
Float - Floating Point Design 
Environment 
5.1 Introduction 
With the increasing size of FPGA devices, implementing floating point
arithmetic on FPGAs is now possible. However, as the size of the FPGA is still
limited, a carefully designed floating point implementation is essential. In
custom hardware designs, there are always trade-offs to be addressed between
the conflicting requirements of performance, area and quantization error. For
example, area can usually be reduced if a larger quantization error is
allowed, as in a hand-held application. It would be desirable to allow a
program to automatically determine the minimum exponent and fraction sizes
required for each signal to reach some user-specified quantization error. A
floating point library called float is presented to enable users to optimize
the design. In addition, a library which can generate arbitrary sized floating
point adders and multipliers was developed to facilitate FPGA-based floating
point applications.
The first section discusses the software aspect of this system. An example
using the floating point tools to develop and optimize a digital sine-cosine
generator is presented. To generate floating point operators of arbitrary
size, a Perl program has been developed as a VHDL generation module; it is
introduced in Section 5.4.
5.2 Floating Point Tools 
Float consists of the following modules: 
• A Perl class called float for the representation of floating point num-
bers. Simulation of the effect of low precision floating point operations 
is performed using this class. 
• An optimizer which minimizes a cost function by adjusting the floating 
point format of the float variables in an algorithm function. 
• A VHDL generation module which produces synthesizable VHDL code. 
• float is compatible with the fly compiler described in the previous chapter.
Figure 5.1 illustrates the float design flow. A designer begins by writing
a Perl function, hereafter referred to as the algorithm function, to represent
the algorithm to be implemented. All variables used in the algorithm are float
objects, where float is a Perl class that is capable of representing a
floating point value with arbitrary precision. The function takes a number of
float variables as input and produces a number of float variables as output.
By varying the precision of the float objects, the optimizer minimizes a cost
function which is a weighted sum of the quantization error of the outputs of
the algorithm function and the circuit size of the resulting implementation.
In order to determine the outputs, a set of test input vectors is required.
The algorithm function is executed with the test vectors as inputs, with float
operators being used to perform the computation. The class computes the result
using both IEEE double precision and the user-specified precision. These two
results are then used to compute the quantization error, under the assumptions
that the IEEE double precision result is without quantization error and that
the float precision is less than double precision. Given the precision of a
floating point operator, the cost function also includes a term which is an
estimate of the circuit size.

Figure 5.1: Floating point algorithm design flow.
Once the optimizer has determined a suitable precision for each variable
in an algorithm function, the same function is passed to the fly compiler,
which outputs synthesizable VHDL code for implementing the algorithm on the
FPGA. The precision of each variable is provided by the optimizer; fly simply
instantiates components with the required precision from a floating point
operator module generator library.
5.2.1 Float Class 
To describe hardware that utilizes variable precision floating point
computations, a class called float, which facilitates the simulation of
arbitrary precision floating point arithmetic, was developed. Perl is a modern
high level programming language which offers improved productivity over
traditional languages such as C. The following features of Perl were important
to the design of the float system:
• Perl supports objects, which are used to abstract the details of variable
wordlength operators.
• Perl supports operator overloading so that if x and y are float objects,
one can write x + y instead of x.add(y).
• Perl has strong memory management and string manipulation facilities,
making it easy to construct VHDL module generators.
• Perl is very portable so the float design environment can run on many 
platforms including Unix, Linux and Windows. 
• There are many open source software libraries available for Perl. 
The float object provides several methods for interrogation of its parameters 
and computation. The main ones are: 
• add(), multiply():
The add() and multiply() methods add/multiply two float objects together at
their specified precision, creating a new float object. If the two floating
point numbers have a different number of exponent bits, the output has the
larger of the two exponent sizes. Similarly, if the two numbers have different
fraction sizes, the output has a fraction bit length equal to the larger of
the two input bit lengths. Overloading is used so that the + and * operators
invoke the add() and multiply() methods respectively.
Apart from the arbitrary precision result, an IEEE 754 double precision
floating point result is also computed. This value is used as a reference
value for computing quantization error. Furthermore, the maximum and minimum
of this reference value are stored in the object for computation of the
minimum exponent value which is required.
• setExponentSize(), setFractionSize():
The setExponentSize() and setFractionSize() methods set the precision of a
float object. For setFractionSize(), the value of the object is truncated if
the new fraction size is smaller than the original.
• setValue(), getValue():
These two methods are used to write and retrieve the value represented by the
float object. Two values are stored: the IEEE double precision reference
value and the arbitrary precision value.
• getQERR():
Both the arbitrary size floating point number and the reference double
precision floating point value are stored in the float object. getQERR()
returns their difference.
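The behaviour of the float class can be approximated in a few lines. The
following is a minimal Python sketch rather than the Perl class itself: it
assumes truncation toward zero, models only the fraction width (the exponent
size is carried along but its range is not enforced), and computes each
reduced precision result by truncating a double precision intermediate. The
method name get_qerr and the attribute names are illustrative.

```python
import math
import operator


class Float:
    """Sketch of a variable precision value with a double precision reference."""

    def __init__(self, ebits, fbits, value):
        self.ebits, self.fbits = ebits, fbits
        self.ref = float(value)                       # double precision reference
        self.value = self._truncate(float(value), fbits)

    @staticmethod
    def _truncate(x, fbits):
        """Keep fbits fraction bits of x, rounding toward zero."""
        if x == 0.0:
            return x
        m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
        scale = 1 << (fbits + 1)          # hidden bit plus fbits fraction bits
        return math.ldexp(math.trunc(m * scale) / scale, e)

    def _combine(self, other, op):
        # The result takes the larger exponent and fraction sizes of the operands.
        out = Float(max(self.ebits, other.ebits), max(self.fbits, other.fbits),
                    op(self.value, other.value))
        out.ref = op(self.ref, other.ref)             # reference computed separately
        return out

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def get_qerr(self):
        return self.value - self.ref
```

With a 4-bit fraction, Float(8, 4, 1.1) stores 1.0625, and multiplying it by
Float(8, 4, 2.0) gives a value of 2.125 against a reference of 2.2.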
5.2.2 Optimization 
Although any measure of accuracy could be used, average quantization error,
QERR, in decibels is used in this dissertation. QERR is computed as follows:
\[ \mathrm{QERR} = 20 \log \left( \frac{1}{n} \sum_{i=1}^{n} \left| \frac{out_i - ref_i}{ref_i} \right| \right) \tag{5.1} \]
where \(out_i\) are the outputs, \(ref_i\) are the corresponding
double-precision reference outputs and n is the number of outputs.
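A short numerical example may help to fix the scale of QERR. The Python
sketch below assumes that equation 5.1 is read as 20 times the base 10
logarithm of the mean relative error, that all reference values are non-zero
and that at least one output differs from its reference; the function name
qerr_db is illustrative.

```python
import math


def qerr_db(outs, refs):
    """Average quantization error in decibels, under the reading stated above."""
    rel = [abs((o - r) / r) for o, r in zip(outs, refs)]
    return 20.0 * math.log10(sum(rel) / len(rel))
```

Under this reading, outputs that are off by 1% on average give a QERR of
about -40 dB, and each additional fraction bit lowers QERR by roughly 6 dB.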
The total circuit area is determined by summing the area estimated for each
operator. Operator area is estimated from the precision of the float class,
assuming a Xilinx Virtex-E series FPGA [Xil01]. Although the area estimation
is based on a specific reconfigurable computing platform, optimization using
these measures should lead to reasonable area estimates on other platforms.
The area in Virtex slices [Xil01] occupied by a floating point adder is
estimated from the fraction size and exponent size. Nonlinear regression has
been applied to model the relation between area and precision, using the
adaptive nonlinear least-squares algorithm proposed by J.E. Dennis et al.
[JGW81]. In the architecture of the floating point adder, discussed in section
5.4, area grows linearly with exponent size and fraction size. The initial
relationship is modeled as follows:
\[ add\_area = a \times ebits + b \times fbits + c \tag{5.2} \]
where ebits is the number of exponent bits in the float representation and
fbits is the number of fraction bits.
To determine the parameters a, b and c, floating point adders of different
precisions were implemented on an FPGA and the number of slices used was
collected, as shown in chapter 7; these serve as sample data points for the
nonlinear regression algorithm. The result was further fine-tuned and the best
approximation was found to be a = 6, b = 12 and c = 0.
Similarly, the area occupied by a floating point multiplier is modeled by
equation 5.3. The fraction size contributes a large portion of the slices
because a larger fraction size means that a larger fixed point multiplier is
used.
\[ mul\_area = a \times ebits + b \times fbits^2 + c \tag{5.3} \]
After applying the nonlinear regression algorithm and fine-tuning, the best
approximation was a = 8, b = 0.47 and c = 230.
The cost function is computed from the QERR and the circuit area using
equation 5.4:
\[ f_{cost} = a \times \left( \sum_i add\_area_i + \sum_j mul\_area_j \right) + b \times \mathrm{QERR} \tag{5.4} \]
where a and b are non-negative weightings and i and j sum over all the add 
and multiply operators in the algorithm function respectively. 
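The area models and the cost function can be evaluated directly. The Python
sketch below uses the fitted coefficients quoted above and assumes that the
multiplier model is quadratic in the fraction size, which is a reading of
equation 5.3 rather than something the text states in words. The weights
w_area and w_qerr stand for a and b in equation 5.4 and are renamed here only
to avoid clashing with the regression coefficients.

```python
def add_area(ebits, fbits, a=6, b=12, c=0):
    """Estimated adder area in slices (equation 5.2)."""
    return a * ebits + b * fbits + c


def mul_area(ebits, fbits, a=8, b=0.47, c=230):
    """Estimated multiplier area in slices, assuming a quadratic fbits term."""
    return a * ebits + b * fbits ** 2 + c


def cost(adders, multipliers, qerr, w_area, w_qerr):
    """Weighted sum of total area and QERR; operators are (ebits, fbits) pairs."""
    area = sum(add_area(e, f) for e, f in adders)
    area += sum(mul_area(e, f) for e, f in multipliers)
    return w_area * area + w_qerr * qerr
```

For an 8-bit exponent and a 23-bit fraction these models give 324 slices for
an adder and about 543 slices for a multiplier.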
The optimizer uses the Nelder-Mead [NM65] method to minimize the cost 
function (without requiring the computation of derivatives) by adjusting the 
precisions of float variables in the algorithm function. The designer can adjust 
a and b in equation 5.4 to weigh the relative importance of area and QERR. 
For example, if the designer needs a very accurate result and circuit area is 
not critical, a large value of b can be used. 
The optimization procedure is outlined as follows: 
1. Change the precisions of float variables (using Nelder-Mead). 
2. Simulate the algorithm function at the specified precision using user-
supplied input data. 
3. Compare the result with the reference result and compute the cost func-
tion. 
4. Repeat until the optimization terminates. 
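The loop above can be illustrated with a deliberately simplified stand-in.
The Python sketch below does not use the Nelder-Mead method: it replaces it
with an exhaustive scan over a single shared fraction width, which is only
practical because there is one parameter. The toy algorithm function, the
input data and the weights are all assumptions chosen for illustration, and
the area terms assume one adder and one multiplier with a quadratic
multiplier model.

```python
import math


def truncate(x, fbits):
    """Keep fbits fraction bits of x, rounding toward zero."""
    if x == 0.0:
        return x
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (fbits + 1)              # hidden bit plus fbits fraction bits
    return math.ldexp(math.trunc(m * scale) / scale, e)


def algorithm(x, fbits):
    """Toy algorithm function y = x*x + x, truncated after every operation."""
    xq = truncate(x, fbits)
    return truncate(truncate(xq * xq, fbits) + xq, fbits)


def cost(fbits, inputs, ebits=8, w_area=1.0, w_qerr=5.0):
    # Steps 2 and 3: simulate at the given precision and compare with double.
    rel = [abs((algorithm(x, fbits) - (x * x + x)) / (x * x + x)) for x in inputs]
    qerr = 20.0 * math.log10(sum(rel) / len(rel))
    area = (6 * ebits + 12 * fbits) + (8 * ebits + 0.47 * fbits ** 2 + 230)
    return w_area * area + w_qerr * qerr


inputs = [0.37 + 0.113 * k for k in range(50)]
# Steps 1 and 4: try every candidate precision and keep the cheapest one.
best_fbits = min(range(4, 41), key=lambda f: cost(f, inputs))
```

With these weights the minimum lies strictly inside the scanned range: very
narrow fractions are penalised by QERR and very wide ones by area. Raising
w_qerr moves the optimum toward wider fractions.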
5.3 Digital Sine-Cosine Generator 
Digital sine-cosine generators [Mit98] have a number of applications, such as
the computation of the discrete Fourier transform and certain digital
communication systems, for example future Hiperlan systems [ETS96] for high
performance wireless indoor communication. Let \(s1_n\) and \(s2_n\) denote
the two outputs of a digital sine-cosine generator. The outputs at the next
sample can be computed using the following formula:
\[ \begin{bmatrix} s1_{n+1} \\ s2_{n+1} \end{bmatrix} = \begin{bmatrix} \cos(\theta) & \cos(\theta) + 1 \\ \cos(\theta) - 1 & \cos(\theta) \end{bmatrix} \begin{bmatrix} s1_n \\ s2_n \end{bmatrix} \tag{5.5} \]
Equation 5.5 will be used as one of the examples of float applications in this
chapter, with \(\cos\theta = 0.9\). Its algorithm function can be described by
the Perl code in listing 5.1:
Listing 5.1: Digital sine cosine generator
$cos_theta    = new Float(8, 23, 0.9);
$cos_theta_p1 = new Float(8, 23, 1.9);
$cos_theta_m1 = new Float(8, 23, -0.1);
$s1[0] = new Float(8, 23, 0);
$s2[0] = new Float(8, 23, 1);
for ($i = 0; $i < 50; $i++) {
    $s1[$i+1] = $s1[$i] * $cos_theta + $s2[$i] * $cos_theta_p1;
    $s2[$i+1] = $s1[$i] * $cos_theta_m1 + $s2[$i] * $cos_theta;
}
This algorithm function first declares the variables used via float object
instantiations, each object being specified to have an 8-bit exponent and a
23-bit fraction in this example. The initial value of each variable is also
defined in the float constructor, with s1 and s2 being initialized to 0 and 1
respectively. The updated values of s1 and s2 are derived using the floating
point operators provided by the float class via overloading.
This algorithm function can be passed to different components for processing.
Normally, a set of input vectors is specified for the algorithm function, but
since this particular function is an oscillator with no inputs, the time
domain response is computed via the loop in the algorithm function.
The simulator can be used to determine the result, and the optimizer can
determine a suitable precision format for each of the five float objects in
the algorithm function so as to minimize the cost function. The inner part of
the algorithm function can be given to the fly compiler to produce VHDL code.
Finally, the VHDL output can be used for simulation and/or implementation on a
reconfigurable computing platform.
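The recurrence of the sine-cosine generator can be checked in ordinary double
precision before any reduced precision is applied. The Python sketch below
mirrors the update used in the algorithm function; the function name
sine_cosine is illustrative. With the initial values 0 and 1, the second
output follows cos(n θ) and the first is a sine scaled by
(1 + cos θ) / sin θ, which follows from the update matrix having determinant 1
and trace 2 cos θ.

```python
import math


def sine_cosine(cos_theta=0.9, steps=50):
    """Iterate the two-output oscillator in double precision."""
    s1, s2 = [0.0], [1.0]
    for n in range(steps):
        s1.append(s1[n] * cos_theta + s2[n] * (cos_theta + 1.0))
        s2.append(s1[n] * (cos_theta - 1.0) + s2[n] * cos_theta)
    return s1, s2
```

These double precision sequences play the role of the reference values
against which a reduced precision run of the same loop is compared.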
5.4 VHDL Floating Point operator generator 
The module library was implemented in Perl and currently supports two op-
erators, namely multiplication and addition. Thus one can use the module 
library to generate operators with arbitrary precision. Operators are pipelined 
for high throughput. 
5.4.1 Floating Point Multiplier Module 
Algorithm 2 in chapter 3 was implemented as a VHDL module, and the
corresponding datapath of the parameterized floating point multiplier is shown
in Figure 5.2. It is pipelined in 4 stages with a total latency of 8 clock
cycles to evaluate the product of the given numbers.
In the first stage, steps 1 and 2 are implemented by prepending a one to the
fraction to produce the significand and calculating the sign bit using the XOR
of the sign bits. This stage uses 1 clock cycle.
In the second stage, steps 3 - 5 are implemented. The significands v1 and v2
are multiplied. The most significant bits of the product, ranging from
2 × fsize - 1 to fsize - 1, where fsize is the size of the fraction, are
stored in the register mc. Since both v1 and v2 have a leading 1 at the most
significant bit, the leading 1 of mc is in one of its two most significant
bits. This observation simplifies the normalization process as described
below.
The intermediate exponent is calculated by considering two cases. If the
leading 1 of mc is located at the most significant bit, mc is a normalized
number and the final exponent is e1 + e2 + 1 - bias. This exponent is stored
as ec1. If the leading 1 of mc is located at the next most significant bit,
mc should be normalized by shifting 1 bit to the left, and the exponent is
e1 + e2 - bias. This exponent is stored as ec0. Since mc is not yet determined
at this point, both ec0 and ec1 are stored to save time. Since a fixed point
multiplier is involved, the latency of this stage is 5 clock cycles.
The third stage performs steps 6 - 9. Once mc is evaluated, \(e_{ans}\) is
determined by the most significant bit of mc. mc is shifted left as
appropriate so that its most significant bit is 1. The result of normalization
is stored in mc0. This stage takes 1 clock cycle.
The fourth stage implements step 10. It omits the integer part of mc0 and
stores the remaining fraction as \(f_{ans}\), and the product is returned.
This stage uses 1 clock cycle. Extra logic is required to complete the
floating point multiplier, including zero checking and infinity handling. It
is omitted from Figure 5.2 for simplicity but is implemented in the module
generator.
5.4.2 Floating Point Adder Module 
The datapath of a parameterized floating point adder/subtractor, shown in
Figure 5.3, is the hardware implementation of algorithm 1. Similar to the
floating point multiplier, it has 4 stages to evaluate the sum of the given
numbers. Each stage uses 1 clock cycle. A subtractor is implemented by
flipping the sign bit of the second operand and is not shown in the figure.
The first stage implements steps 1 - 7. ediff, which is the difference of
\(e_1\) and \(e_2\), is calculated, and if ediff is negative the two operands
are swapped. After swapping, \(F_a\) is the number with the larger exponent
and the other one is called \(F_b\).
The second stage implements steps 8 - 15. The correct significands are
evaluated from the given fractions. fraction_b is aligned such that both
fractions share the same intermediate exponent, namely exponent_a. The
significands are not in 2's complement format, so conversion is necessary if
the corresponding sign bit is set. The intermediate exponent, exponent_a, is
propagated to ea2. The intermediate significands are stored in registers fma
and fmb.
Figure 5.2: Parameterized Floating Point multiplier datapath
The third stage performs steps 16 - 21. The significands are added, and the
sum of fma and fmb is stored in register ftmp. The value of ftmp should be an
unsigned number when it is returned, so conversion is necessary if ftmp is
negative. The sign bit is retrieved from the adder and stored in register sa1.
In addition, the intermediate exponent, exponent_a, is propagated to ea3.
The last stage, steps 22 - 24, performs normalization and rounding. A priority
encoder is used to determine the location of the leading 1 in register ftmp.
The final exponent, \(e_{ans}\), is calculated as
\(e_{ans}\) = ea3 - (number of bits shifted to the left) + ebias. \(f_{ans}\)
is obtained by shifting ftmp to the left such that the most significant bit of
ftmp is 1, and the leading one is then omitted. \(s_{ans}\) is propagated from
sa1. Rounding is a truncation in round-to-zero mode, so it is done implicitly
when the result is packed into the \(f_{ans}\) register.
Like the multiplier, extra logic is required to complete the floating point
adder, including zero checking and infinity handling. It is omitted from
Figure 5.3 for simplicity, but is implemented in the module generator.
5.5 Application to Solving Differential Equations
The floating point generation module and the fly compiler were used to solve
the ordinary differential equation
\[ \frac{dy}{dt} = \frac{t - y}{2} \quad \text{over } t \in [0, 3] \text{ with } y(0) = 1 \text{ [MF99].} \]
The Euler method was used, so the evolution of y is computed by
\(y_{k+1} = y_k + h (t_k - y_k)/2\) and \(t_{k+1} = t_k + h\), where h is the
step size.
The following fly program implements the scheme, where h is a parameter
sent by the host, as shown in listing 5.2.
In each iteration of the program, the evolution of y is written to the block 
一 
Chapter 5 Float - Floating Point Design Environment 39 -
esize-1 tsize -1 eslze-l  
s l exponent—1 fraction_1 s 2 exponent_2 fract ion_2 
飞 ^JJL J 
6size + fsiz6:0 L ^ J ~ , ~ I I ~ I ~ I ~ I esize + fsize:0 
J 丨 esize:0 I esize:0 
1 — ^ 
esize:0 | esize:0 \ _ / 
I 2's c o m p l e m e n t 1  
esizeiesize  
6size:0 I — 1 
\o / 
esize:0 esize + fsize:0 esize + fsize;0 
2 sa e x p o n e n t J i | fraction—a""“^ | sb | exponent_b | fract ion_b < 
I fsiz 二 “ I tsize-1:0 
001 ~ I 1 I 
的丨ze:0 卜2:0 I .size:0 
S \ Shift right~ 
1 1 I (size:& 
2 's complement 丨 00 丨 
esize-1:0 ‘ |~fsize+2:0 
I • fsize+2:o 2's complement  
\p h I—* q 
fsize+2:0 ^ 1 ^ 
• • I fsize+2:0 
^ " " “ I fma < I ” b < 
丨 fsize+2:0 • I fege+2:0 
esize -1:0 X + / 
\ Z fsize+2:0 
f 麵__ f 0 丨 2-s c o m p l e m e n t 
^ _ I • , • fsize+2:0 
\o 1 / • 
I fsize+1:0 , 
"saTl ea3 <| fa1 
I I I tsize+1;0 
esize - 1 ;0 I  
priority encoding + normal izat ion 
correct I 
exponent esize-1:0 
I ^ 1 丨 size-1:0 
esize • 1:0 
I _ _ _ r ^ 
Sans a^ns  
Figure 5.3: Parameterized Floating Point adder datapath 
y 
Chapter 5 Float - Floating Point Design Environment 40 ,� 
Listing 5.2: Ordinary Differentiable Equation Solver  
iR ‘ 
2 $h = &read_host ( 1 ) ; #fetch 
3 [ 
4 { $ t = 0 . 0 ; } { $ y = 1 . 0 ; } { $dy = 0 . 0 ; } 
5 {Soneha l f 二 0 . 5 ; } { $index 二 0 ; } 
6 ] # parallel assignment 
7 while ( $ t < 3.0) { 
8 [ { $ t l = $h $onehalf ; } { $t2 = $t . - $y ； } ] 
9 [ { $ d y 二 $t l $ t 2 ; } { $t = $t . + $h ; } ] 
10 [ 
11 {$y = $y . + Sdy ; } 
12 {S index = Sindex + 1 ; } 
13 1 • 
14 $void = & wr i te_host ($y , Sindex )； 
15 #write host 
16 } 
17|}  
RAM via a writeJiostO function call and a floating point format with 1 
sign bit, 8-bit exponent and 23-bit fraction was used throughout. The floating 
point format can, of course, be easily changed. Parallel statements in the main 
loop achieve a 1.43 speedup over a straightforward serial description. 
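As a software cross-check of Listing 5.2, the following Python sketch mirrors the same update sequence in double precision. It assumes that the scheme is forward Euler applied to y' = (t - y)/2 with y(0) = 1, which is what the arithmetic of the listing computes. The function name euler_ode and the use of Python floats in place of the 32-bit hardware format are illustrative assumptions and not part of fly.

```python
def euler_ode(h, t_end=3.0):
    """Mirror the loop of Listing 5.2 in double precision.

    Assumption: the scheme is forward Euler for y' = (t - y) / 2, y(0) = 1,
    as implied by the arithmetic of the listing.
    Returns a list of (index, y) pairs, one per write_host() call.
    """
    t, y, index = 0.0, 1.0, 0
    onehalf = 0.5
    out = []
    while t < t_end:
        t1 = h * onehalf      # first parallel block
        t2 = t - y
        dy = t1 * t2          # second parallel block (t2 used the old t)
        t = t + h
        y = y + dy            # third parallel block
        index = index + 1
        out.append((index, y))
    return out


if __name__ == "__main__":
    for index, y in euler_ode(1.0):
        print(index, y)
```

With h = 1 the sketch yields 0.5, 0.75 and 1.375 at t = 1, 2 and 3, which can be compared with the h = 1 column of the hardware results reported in Chapter 7.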
5.6 Summary 
The float environment for the rapid prototyping of floating point digital systems
was described. These tools enable designers to concentrate on higher level
algorithmic issues, thus increasing their productivity and allowing them to explore
more of the design space in a given time. A digital sine-cosine generator and a
differential equation solver were presented as examples of using float. The module
generator is packaged in Perl so as to allow easy interfacing with the current
development tools.
The float environment extends the capability of the fly compiler so that floating
point operators are now supported. With a single Perl description, the algorithm
can be optimized and implemented with ease using the provided design
environment.
Chapter 6 
Function Approximation using 
Lookup Table 
This chapter discusses an efficient table lookup generation system for supplementing
a hardware description language (HDL). In particular, an implementation
of the Symmetric Table Addition Method (STAM) which acts as a module
generator for any differentiable function is described. This module generator
was integrated with the fly compiler to produce a very flexible design environment
which allows the specification of arbitrary functions in a high level manner.
The environment is used to develop a coprocessor for the computation of the
N-body problem, and designer productivity is much higher than that of a typical
designer using VHDL.
6.1 Table Lookup Approximations 
6.1.1 Taylor Expansion 
The main idea behind the table lookup approximation algorithms is the Taylor
expansion. If a function $f(x)$ has continuous derivatives up to $(n+1)$th order,
then it can be expanded as
$$f(x) = \sum_{i=0}^{n} \frac{f^{(i)}(a)}{i!}(x-a)^i + R_n \tag{6.1}$$
where
$$R_n = \frac{f^{(n+1)}(\epsilon)}{(n+1)!}(x-a)^{n+1} \quad \text{for } a < \epsilon < x.$$
To reduce the required hardware resources and/or computation time, only
the first few terms in the Taylor series are used to approximate the function
in practice. The selection of $a$ will affect the error introduced, and a carefully
selected $a$ can be used to introduce symmetry in the lookup table as explained
later.
6.1.2 Symmetric Bipartite Table Method (SBTM) 
The SBTM uses the first two terms of the Taylor series to approximate a
function $f(x)$ [SS97]. In the SBTM, two lookup tables are constructed
and the precision of the output is maximized.
Assume that the n-bit input, $x$, of the function to be approximated ranges
in [0,1). It is first partitioned into 3 segments of $n_0$, $n_1$ and $n_2$ bits as
shown in Fig 6.1, where $x = x_0 + x_1 + x_2$.

Figure 6.1: Input partition of SBTM.

The ranges of $x_i$ are:
$$0 \le x_0 \le 1 - 2^{-n_0}$$
$$0 \le x_1 \le 2^{-n_0} - 2^{-n_0-n_1}$$
$$0 \le x_2 \le 2^{-n_0-n_1} - 2^{-n_0-n_1-n_2}$$
Two lookup tables which return the values $a_0$ and $a_1$ are then constructed.
The sum of these two values is the approximated result of the function:
$$f(x) \approx a_0(x_0, x_1) + a_1(x_0, x_2) \tag{6.2}$$
We first select the mid points of the ranges of $x_1$ and $x_2$:
$$\delta_1 = (2^{-n_0} - 2^{-n_0-n_1})/2 = 2^{-n_0-1} - 2^{-n_0-n_1-1}$$
$$\delta_2 = (2^{-n_0-n_1} - 2^{-n_0-n_1-n_2})/2 = 2^{-n_0-n_1-1} - 2^{-n_0-n_1-n_2-1} \tag{6.3}$$
Let $a = x_0 + x_1 + \delta_2$ and use the first two terms of the Taylor expansion:
$$f(x) = f(x_0 + x_1 + x_2)$$
$$\approx f(x_0 + x_1 + \delta_2) + f'(x_0 + \delta_1 + \delta_2)(x_2 - \delta_2)$$
$$= a_0(x_0, x_1) + a_1(x_0, x_2) \tag{6.4}$$
Not all bits of $a_1$ are required to be in the table, as the carefully selected
$\delta_2$ results in a large number of leading 0s or 1s in the $a_1$ table. Since $\delta_2$ is
located in the center of the range of $x_2$,
$$|x_2 - \delta_2| < 2^{-n_0-n_1-1} \tag{6.5}$$
The upper bound of $a_1$ is
$$|a_1(x_0, x_2)| < |f'(\xi)| \, 2^{-n_0-n_1-1} \tag{6.6}$$
where $\xi \in [0,1)$ is chosen such that $|f'(\xi)| \ge |f'(x)|$ for all $x \in [0,1)$.
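The SBTM construction described in this section can be sketched in software as follows. This is a minimal Python illustration under stated assumptions rather than the table generator used in this work: the function names sbtm_tables and sbtm_eval, the choice of sin(x) and the partition (6, 5, 5) are all hypothetical, table entries are kept as Python floats, and neither output rounding nor the symmetry of the a1 table is modelled.

```python
import math


def sbtm_tables(f, df, n0, n1, n2):
    """Build the two SBTM tables for f on [0, 1).

    a0 is indexed by the (x0, x1) bit fields, a1 by the (x0, x2) bit fields.
    Entries are plain floats; rounding of table outputs is not modelled.
    """
    d1 = 2.0 ** (-n0 - 1) - 2.0 ** (-n0 - n1 - 1)            # delta_1
    d2 = 2.0 ** (-n0 - n1 - 1) - 2.0 ** (-n0 - n1 - n2 - 1)  # delta_2
    a0, a1 = {}, {}
    for i0 in range(2 ** n0):
        x0 = i0 * 2.0 ** (-n0)
        for i1 in range(2 ** n1):
            x1 = i1 * 2.0 ** (-n0 - n1)
            a0[i0, i1] = f(x0 + x1 + d2)
        for i2 in range(2 ** n2):
            x2 = i2 * 2.0 ** (-n0 - n1 - n2)
            a1[i0, i2] = df(x0 + d1 + d2) * (x2 - d2)
    return a0, a1


def sbtm_eval(a0, a1, n0, n1, n2, k):
    """Approximate f at the n-bit input pattern k (value k / 2**n)."""
    i2 = k & (2 ** n2 - 1)
    i1 = (k >> n2) & (2 ** n1 - 1)
    i0 = k >> (n1 + n2)
    return a0[i0, i1] + a1[i0, i2]


if __name__ == "__main__":
    n0, n1, n2 = 6, 5, 5
    n = n0 + n1 + n2
    a0, a1 = sbtm_tables(math.sin, math.cos, n0, n1, n2)
    worst = max(abs(sbtm_eval(a0, a1, n0, n1, n2, k) - math.sin(k / 2 ** n))
                for k in range(2 ** n))
    print("entries:", len(a0) + len(a1), "versus", 2 ** n, "for one direct table")
    print("worst-case error:", worst)
```

For this partition the two tables hold 2^11 entries each, compared with 2^16 entries for a single directly indexed table, which illustrates where the memory saving of the bipartite approach comes from.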
6.1.3 Symmetric Table Addition Method (STAM) 
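The logic in SBTM is simple and two tables are required. The STAM algorithm
uses more tables of smaller size to significantly reduce the overall memory
required [SS99b].
As shown in Fig 6.2, the n-bit input is partitioned into m segments instead
of the 3 used in SBTM. The input is now $x = \sum_{i=0}^{m-1} x_i$.

Figure 6.2: Input partition of STAM.

The ranges of $x_i$ are shown here:
$$0 \le x_0 \le 1 - 2^{-n_0}$$
$$0 \le x_i \le 2^{-p_{i-1}} - 2^{-p_i}, \quad 1 \le i < m \tag{6.7}$$
where $p_i = \sum_{k=0}^{i} n_k$ and $\delta_i$ is defined as follows:
$$\delta_i = (2^{-p_{i-1}} - 2^{-p_i})/2 \tag{6.8}$$
To apply the Taylor approximation, let $a = x_0 + x_1 + \sum_{i=2}^{m-1} \delta_i$. The
approximation function is now:
$$f(x) \approx f\Big(x_0 + x_1 + \sum_{i=2}^{m-1} \delta_i\Big) + f'\Big(x_0 + \delta_1 + \sum_{i=2}^{m-1} \delta_i\Big) \sum_{i=2}^{m-1} (x_i - \delta_i)$$
$$= a_0(x_0, x_1) + \sum_{i=2}^{m-1} a_{i-1}(x_0, x_i) \tag{6.9}$$
where
$$a_0(x_0, x_1) = f\Big(x_0 + x_1 + \sum_{k=2}^{m-1} \delta_k\Big)$$
$$a_{i-1}(x_0, x_i) = f'\Big(x_0 + \delta_1 + \sum_{k=2}^{m-1} \delta_k\Big)(x_i - \delta_i), \quad 2 \le i < m$$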
The error analysis of STAM is very similar to that of the SBTM algorithm. The
constraints for the parameter configuration are:
$$2n_0 + n_1 \ge p_f + \log_2(|f''(\xi_2)|) \tag{6.10}$$
$$g \ge 2 + \log_2(m - 1) \tag{6.11}$$
6.1.4 Input Range Scaling 
The analysis above is based on the input range [0,1). Both SBTM and
STAM can be adapted to other input ranges, but this requires some
transformations when generating the table contents. The transformation is done by
dividing the input range evenly among all the possible input patterns.
For an n-bit input $\hat{x}$, let $x$ be the integer value of the bit pattern assuming
the decimal point is on the right of the LSB. If the input range is $[x_{min}, x_{max})$,
then
$$\frac{\hat{x} - x_{min}}{x_{max} - x_{min}} = \frac{x}{2^n}$$
$$\Rightarrow \hat{x} = \frac{x}{2^n}(x_{max} - x_{min}) + x_{min} \tag{6.12}$$
Let this be the transform function $t(x)$. The range of $x_i$ in (6.7) is modified
as in (6.13):
[Equation (6.13) is illegible in the source text.]
Let $M_i$ be the maximum value of $x_i$. The $x_i$ and $\delta_i$ are first transformed
as in (6.14) before being passed to $a_i$ to generate the table contents.
[Equation (6.14) is illegible in the source text.]
The transformation must also be applied when analyzing the errors in the 
approximations. 
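The transform function of equation (6.12) is small enough to state directly in code. The following Python sketch is an illustration only; the helper name make_transform is hypothetical, and the example range [1, 2) with a 16-bit input is chosen merely because it is the configuration used later in this chapter.

```python
def make_transform(n, x_min, x_max):
    """Return t(x), mapping an n-bit integer pattern x onto [x_min, x_max)."""
    scale = (x_max - x_min) / 2 ** n

    def t(x):
        if not 0 <= x < 2 ** n:
            raise ValueError("bit pattern out of range")
        return x * scale + x_min

    return t


if __name__ == "__main__":
    t = make_transform(16, 1.0, 2.0)
    # Pattern 0 maps to x_min; the largest pattern stays just below x_max.
    print(t(0), t(2 ** 15), t(2 ** 16 - 1))
```

Because the input range is divided evenly among the 2^n patterns, the all-zero pattern maps to the lower bound and no pattern ever reaches the upper bound.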
6.2 VHDL Extension 
To allow for the easy implementation of the STAM algorithm in VHDL designs,
a simple extension is introduced by making use of the comment section
inside VHDL code segments, as many synthesis tools do for synthesis
directives. A set of preprocessing tools was developed to generate VHDL code
using the STAM algorithm. The user includes the name and the body of the
target function as well as some configuration parameters. The preprocessing
tools then generate the corresponding VHDL code of the function using the STAM
algorithm, which can be used directly anywhere in the design. Listing 6.1
demonstrates the instantiation and usage of a sin function in the VHDL source.

Listing 6.1: STAM instantiation
architecture ...
  ...
  -- --STAM-BEGIN--
  -- my_function(x) = Sin(x)
  -- range_min = 0
  -- range_max = 1
  -- segments = 8 2 2 2 2
  -- decimal_point = 16
  -- --STAM-END--
  component my_function is port (
    clk : in  std_logic;
    x   : in  std_logic_vector(15 downto 0);
    fx  : out std_logic_vector(20 downto 0));
  end component;
  ...
begin
  ...
  f0 : my_function
    port map (clk => clk, x => x, fx => fx);
  ...

In this example, the sin function accepts inputs in the range [0,1) and
the input is partitioned as described in the segments statement. Four
tables will be generated for the 16-bit input. The decimal_point statement
indicates that the decimal point is located at the left side of the most significant
bit. The output will be ready after the next rising clock edge and will be valid
as long as the input x is valid. The clock signal is required since synchronous
RAMs are used to store the contents of the tables. Since the descriptions
are only inside the comment section, the VHDL code can be processed by
traditional synthesis tools without modification.
The VHDL code is first passed to a preprocessor before going to the
synthesis stage. A flow chart of the preprocessor is shown in Fig 6.3. First,
the function extractor extracts the function body from the extended VHDL block
and passes it to YACAS (YACAS is a freely available program which performs
symbolic arithmetic operations [Pin03]). YACAS takes the input function,
finds its symbolic first and second derivatives and passes the results to
the table generation program. The table generation program uses a stack
to transform the input strings into a sequence of arithmetic operations and
generates the contents of the lookup tables. These contents are used by the
VHDL generator to produce complete VHDL code using Xilinx BlockRAMs
as the lookup tables.
With this extension, an arbitrary function can be used in VHDL code
without any knowledge of the detailed implementation. The default evaluation
time is 1 clock cycle, but this can be easily modified in the generated VHDL
code. The only limitation is that the function must be twice differentiable, due
to the use of the Taylor expansion. Because the design is modular, this
preprocessing method can be easily adapted to other HDLs such as Verilog.
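The function extraction step of the preprocessor can be sketched in a few lines of Python. This is a hypothetical illustration, not the tool developed in this work: the function name extract_stam_block, the exact spelling of the STAM-BEGIN and STAM-END markers and of the directive keys, and the assumption that each directive occupies its own comment line in "key = value" form are all assumptions made for the example.

```python
import re

# Hypothetical input, modelled on the style of Listing 6.1.
SAMPLE = """
architecture rtl of top is
  -- --STAM-BEGIN--
  -- my_function(x) = Sin(x)
  -- range_min = 0
  -- range_max = 1
  -- segments = 8 2 2 2 2
  -- decimal_point = 16
  -- --STAM-END--
begin
end rtl;
"""


def extract_stam_block(vhdl_source):
    """Collect 'key = value' pairs from the first STAM comment block.

    Assumption: the block is delimited by comment lines containing
    STAM-BEGIN and STAM-END, with one directive per comment line.
    """
    directives = {}
    inside = False
    for line in vhdl_source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("--"):
            continue                       # ordinary VHDL is ignored here
        body = stripped.strip("- \t")      # drop comment dashes and padding
        if body == "STAM-BEGIN":
            inside = True
        elif body == "STAM-END":
            break
        elif inside:
            match = re.match(r"(.+?)\s*=\s*(.+)", body)
            if match:
                directives[match.group(1)] = match.group(2)
    return directives


if __name__ == "__main__":
    d = extract_stam_block(SAMPLE)
    segments = [int(s) for s in d["segments"].split()]
    print(d)
    print("input bits:", sum(segments), "tables:", len(segments) - 1)
```

In a complete flow, the extracted function body would then be handed to the symbolic differentiation step and the segment list to the table generator; both of those stages are outside the scope of this sketch.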
6.3 Floating Point Extension 
In the original STAM algorithm, the input value is considered a fixed point
number within a predefined range. It is possible to modify the logic such that
it can handle specific functions for floating point arithmetic. This section
describes this process as used for the development of a floating point coprocessor
for the N-body problem.

Figure 6.3: Extended VHDL Preprocessor.

The extended fly compiler can use basic floating point operations, such as
addition, subtraction and multiplication, with different precisions. Elementary
functions such as square root and exponentiation are frequently required
to evaluate the force or acceleration in the N-body problem. Such functions can
be implemented using the modified STAM approach. In this research, $v^{-3/2}$
was implemented using this approach.
The STAM is configured to use 4 lookup tables.
Range reduction and result correction are necessary in the floating point
implementation. Consider the IEEE 754 binary floating point number
representation:
$$v = 1.f \times 2^{e} \tag{6.15}$$
Then,
$$v^{-3/2} = (1.f \times 2^{0})^{-3/2} \times 2^{(-3/2)e} \tag{6.16}$$
When $e$ is even, let $e = 2N$; equation 6.16 becomes:
$$v^{-3/2} = (1.f \times 2^{0})^{-3/2} \times 2^{-3N} \tag{6.17}$$
Similarly, if $e$ is odd, let $e = 2N + 1$; equation 6.16 becomes:
$$v^{-3/2} = (1.f \times 2^{0})^{-3/2} \times 2^{-3N} \times 2^{-3/2} \tag{6.18}$$
In both cases, the fraction part can be calculated using STAM with the
input range [1,2), and the exponent part requires only shift and add operations.
The only difference is that if $e$ is odd, the final result is obtained by
multiplying by the constant $2^{-3/2}$.
IEEE 754 requires normalization of the result from STAM. Since the output
of STAM satisfies $0.354 < v^{-3/2} \le 1$ for $1 \le v < 2$, the leading one must
lie at either of the two most significant bits. The datapath of the calculation is
shown in Fig 6.4. Since it supports parameterized size floating point numbers,
it can generally fit in different FPGA devices.
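The range reduction of equations 6.15 to 6.18 can be checked in software with the following Python sketch. It is an illustration under assumptions, not the hardware datapath: the name power15 and the core parameter are hypothetical, and core simply evaluates the function directly where the hardware would use the STAM block.

```python
import math

INV_2_POW_1_5 = 2.0 ** -1.5   # constant multiplier used when e is odd


def power15(v, core=lambda m: m ** -1.5):
    """Compute v**(-3/2) for v > 0 by range reduction.

    `core` stands in for the STAM block and is only ever called with an
    argument in [1, 2). Direct evaluation is used here as a placeholder.
    """
    frac, exp = math.frexp(v)          # v = frac * 2**exp, 0.5 <= frac < 1
    m, e = frac * 2.0, exp - 1         # v = m * 2**e,      1 <= m < 2
    n, odd = divmod(e, 2)              # e = 2*n + odd, odd in {0, 1}
    result = core(m)
    if odd:
        result *= INV_2_POW_1_5        # extra factor 2**(-3/2) for odd e
    return math.ldexp(result, -3 * n)  # multiply by 2**(-3N)


if __name__ == "__main__":
    for v in (0.1, 1.0, 2.0, 3.7, 1024.0):
        print(v, power15(v), v ** -1.5)
```

Python's floor division makes the even/odd split valid for negative exponents as well, so the same two cases cover values of v below 1.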
To implement the circuit on an FPGA, the fly [HLT+02] compiler was used
to generate synthesizable VHDL code and the Pilchard board [LLC+01] was
used as the reconfigurable platform. Pilchard uses a DIMM memory bus
interface to provide higher I/O performance than the PCI bus. Fly is used
because it allows a floating point algorithm to be designed efficiently using
Perl-like descriptions, and because its library fully supports parameterized
floating point arithmetic. In addition, the mechanism of the fly compiler allows for easy
integration of a block such as STAM. The fly compiler was modified such that
it can handle the _power15() function using the built-in function mechanism.
Due to the limited memory available on the FPGA chip, the STAM
used 16-bit integers as input and the segment sizes are (8, 2, 2, 2, 2). The STAM is
used to process the function $f(x) = x^{-3/2}$ where $1 \le x < 2$. After scaling, 0 at
the input of the STAM stands for 1 according to equation 6.14. Additionally,
to enhance the efficiency of the STAM and minimize its critical path,
the symmetric property of the lookup table is not exploited. As the output
of a BlockRAM is 32 bits wide, the memory usage is $2^{(8+2)} \times 4 \times 32 = 131072$ bits, or
32 BlockRAMs.
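The configuration described above can be explored with a small software model. The Python sketch below is a hedged illustration and not the table generator used in this work: the names stam_tables and stam_eval are hypothetical, the input range is handled by composing the function with a linear map from [0, 1) (an assumption that may differ from the transformation in Section 6.1.4), table entries are plain floats without output rounding, and the figure of 4096 bits per BlockRAM is simply the ratio implied by the numbers quoted above.

```python
def stam_tables(f, df, segments, x_min=0.0, x_max=1.0):
    """Build STAM tables for f on [x_min, x_max) for a given bit partition.

    segments = (n0, n1, ..., n_{m-1}); m - 1 tables are produced, each
    addressed by n0 + n_i bits. Symmetry is not exploited.
    """
    n0 = segments[0]
    p = [sum(segments[:i + 1]) for i in range(len(segments))]      # p_i
    # delta_i for i >= 1 is the midpoint of the range of x_i.
    delta = [0.0] + [(2.0 ** -p[i - 1] - 2.0 ** -p[i]) / 2
                     for i in range(1, len(segments))]
    scale = x_max - x_min
    tail = sum(delta[2:])                  # sum of delta_i for i >= 2
    tables = []
    for i in range(1, len(segments)):
        table = []
        for i0 in range(2 ** n0):
            x0 = i0 * 2.0 ** -n0
            for ii in range(2 ** segments[i]):
                xi = ii * 2.0 ** -p[i]
                if i == 1:                 # a_0(x0, x1)
                    table.append(f(x_min + scale * (x0 + xi + tail)))
                else:                      # a_{i-1}(x0, x_i)
                    slope = df(x_min + scale * (x0 + delta[1] + tail))
                    table.append(slope * scale * (xi - delta[i]))
        tables.append(table)
    return tables


def stam_eval(tables, segments, k):
    """Sum the table outputs for the n-bit input pattern k."""
    n0 = segments[0]
    shift = sum(segments) - n0
    i0 = k >> shift
    total = 0.0
    for table, ni in zip(tables, segments[1:]):
        shift -= ni
        ii = (k >> shift) & (2 ** ni - 1)
        total += table[(i0 << ni) | ii]
    return total


if __name__ == "__main__":
    segments = (8, 2, 2, 2, 2)
    tables = stam_tables(lambda x: x ** -1.5, lambda x: -1.5 * x ** -2.5,
                         segments, 1.0, 2.0)
    n = sum(segments)
    worst = max(abs(stam_eval(tables, segments, k) - (1.0 + k / 2 ** n) ** -1.5)
                for k in range(2 ** n))
    bits = sum(len(t) for t in tables) * 32          # 32-bit table words
    print("tables:", len(tables), "bits:", bits, "BlockRAMs:", bits // 4096)
    print("worst-case error:", worst)
```

The model reproduces the memory figure quoted above (four tables of 1024 words, 131072 bits in total) and allows the approximation error of a candidate partition to be measured before any hardware is generated.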
6.4 The N-body Problem 
N-body simulation finds application in various fields of science, such as
astrophysics and molecular biology, where a wide range of physical systems can
be studied by modeling them as an N-body problem.
The basic idea of the N-body problem is simple. Particles
are modeled as points in space. The potential of the system can be expressed
as a function of the properties and positions of all particles in the system. The
force exerted on a particle is the first derivative of this potential with respect
to the position of the particle. N-body problems for different systems share
the same basic structure but differ in the physical law that governs the force
between particles. Therefore, the exact equation for calculating the potential
and force depends on the application. By integrating the force acting on a
particle, its position can be computed as a function of time.
There is no known general analytic solution for the N-body problem for N ≥ 3.
Therefore, N-body problems are solved numerically using simulation in
practice. The simulation is performed in discrete time steps. In each time step,
the forces exerted on each particle are computed. The positions of the particles
are updated at the end of the time step by integrating all forces acting on the
particle.
In N-body simulation, most computation time is spent on force calculation.
The number of interactions between particles grows as $O(n^2)$, where n is the
number of particles. For large n, the calculation becomes very expensive and
time consuming. Although there are algorithms that reduce the computation time of
the force calculation at the expense of accuracy, the force calculation remains an
expensive step and poses a limit on the size of system that can be realistically
studied.
Since the force calculation part consumes most of the CPU time, and at the
same time has a rather simple algorithm, it is a good candidate for hardware
acceleration. In fact, this has been done in many systems. Such systems usually
have a heterogeneous architecture consisting of a general purpose host
computer and special purpose hardware. The special purpose hardware handles
the force calculation while the host computer handles all other computations.
Most notable of those is the GRAPE (Gravitational Pipeline) computer for
the gravitational N-body problem [MT98].
The reason for using such an architecture is as follows. In a system powered
by general-purpose processors, only a small fraction of the transistors in
the processor are doing useful work at any moment. The key for GRAPE or
other such systems to achieve performance orders of magnitude higher than a
general-purpose system is to utilize almost all of the transistors on the chip at
any moment. With filled pipelines in the processors for the force calculation,
almost all of the transistors are performing useful computations at any given
moment.
6.5 Implementation 
In this work, an FPGA based co-processor for evaluating gravitational forces
in N-body simulations was built using the module generator approach. The
architecture of the co-processor is similar to the GRAPE-1 system. GRAPE-1
is the first in a series of specialized processors evaluating gravitational forces or
acceleration in a gravitational N-body simulation. Equations 6.19, 6.20 and 6.21
show the force evaluation in this system.
$$\mathbf{a}_i = \sum_{j=1}^{N} \mathbf{a}_{ij} \tag{6.19}$$
$$\mathbf{a}_{ij} = \frac{\mathbf{x}_j - \mathbf{x}_i}{(r_{ij}^2 + \epsilon^2)^{3/2}} \tag{6.20}$$
$$r_{ij}^2 = (x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2 \tag{6.21}$$
The equations are the same as those implemented in the GRAPE-1 system.
$\mathbf{a}_i$ is the gravitational acceleration at the position of particle i, $\mathbf{x}_i$ is the position
vector of particle i, $r_{ij}$ is the distance between particles i and j and $\epsilon$ is the
artificial potential softening used to suppress the divergence of the force at
$r_{ij} \to 0$.
A program written in the fly language is used to implement equation 6.19,
which is used intensively during the whole calculation, as shown in Listing 6.2.
The inputs of the program are $\mathbf{x}_i$, $\mathbf{x}_j$ and $\epsilon$, while the output is the acceleration
($\mathbf{a}_{ij}$) for a particular value of $\mathbf{x}_j$. Most of the constructs are parallel in nature
so that the vector manipulation can be processed simultaneously. For example,
$\mathbf{x}_j - \mathbf{x}_i$ can be done at the same time for each scalar in $\mathbf{x}_j$ and $\mathbf{x}_i$. The fly
code can be used for simulation and verification by directly executing it under
the Perl environment, which saves time and reduces errors when compared
with manually translating the algorithm description into VHDL.

Listing 6.2: Implementation of N-Body Simulation
{
  # initialization, fetch xi, yi, zi
  while ($j < $n) {
    # fetch xj, yj, zj from memory
    $xj = &read_host($index);
    $index = $index + 1;
    $yj = &read_host($index);
    $index = $index + 1;
    $zj = &read_host($index);
    $index = $index + 2;
    [ {$diffx = $xj .- $xi;}
      {$diffy = $yj .- $yi;}
      {$diffz = $zj .- $zi;} ]
    [
      {$x = $diffx .* $diffx;}
      {$y = $diffy .* $diffy;}
      {$z = $diffz .* $diffz;}
    ]
    [
      {$r1 = $x .+ $y;}
      {$r2 = $z .+ $epsilon;}
    ]
    # calculate rij
    $rij = $r1 .+ $r2;

    # call built-in function power^{-1.5}
    $tmp2 = &_power15($rij);

    [ {$tmpx = $tmp2 .* $diffx;}
      {$tmpy = $tmp2 .* $diffy;}
      {$tmpz = $tmp2 .* $diffz;} ]
    [ {$ax = $ax .+ $tmpx;}               # accumulate a
      {$ay = $ay .+ $tmpy;}
      {$az = $az .+ $tmpz;} ]
    $j = $j + 1;
  }
  $void = &write_host($ax, 60);
  $void = &write_host($ay, 61);
  $void = &write_host($az, 62);           # write back to host
}

The floating point module supplied by the fly compiler is readily parameterized,
so the tradeoff between accuracy and slice resources is adjustable. Different sizes
of fraction and exponent can be implemented and the best performance rating can
be achieved.
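For verification purposes, the force evaluation of this section can also be written as a plain software reference. The Python sketch below is an assumption-laden illustration rather than part of the fly tool flow: the function name accelerations is hypothetical, unit masses are assumed because no mass term appears in Listing 6.2, and the parameter eps2 is assumed to play the role of $epsilon in that listing, that is, a softening term added to the squared distance.

```python
def accelerations(positions, eps2):
    """Direct-summation accelerations in the style of Listing 6.2.

    positions: list of (x, y, z) tuples.
    eps2: softening term added to the squared distance (assumed to
          correspond to $epsilon in the listing). Unit masses assumed.
    """
    result = []
    for i, (xi, yi, zi) in enumerate(positions):
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue        # the self term is zero; skipping avoids 0**-1.5
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            w = r2 ** -1.5      # the step performed by _power15()
            ax += w * dx
            ay += w * dy
            az += w * dz
        result.append((ax, ay, az))
    return result


if __name__ == "__main__":
    # Two particles one unit apart attract each other along the x axis.
    print(accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], 0.0))
```

The doubly nested loop makes the quadratic growth in the number of pairwise interactions explicit, which is the cost that motivates moving the inner loop into hardware.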
6.6 Summary 
A flexible framework for implementing elementary functions using lookup tables on
an FPGA has been introduced in this chapter. Using the STAM algorithm, it can be
used to generate synthesizable VHDL modules from comments in the VHDL source
code. This function generator was integrated into the fly environment to extend
its flexibility and efficiency. An N-body problem simulation was implemented on the
FPGA to demonstrate the power of this framework. Without detailed knowledge
of the STAM implementation, the N-body core was generated from 45 lines of fly
source code. This example shows that the framework can be used to solve a real
world problem with minimum design effort.
Figure 6.4: Datapath of the $v^{-3/2}$ calculation.

Chapter 7
Results
In this chapter, results of all the experiments described in previous chapters are
presented. All experiments, unless otherwise specified, were tested on a Pilchard FPGA
board [LLC+01] and the FPGA chip used was a Xilinx XCV1000E-6, which contains
a total of 12,288 slices and 96 BlockRAMs. This chapter includes the following
experiments:
1. GCD coprocessor
2. Floating point module generator
3. Digital sine-cosine generator (DSCG)
4. Ordinary differential equation solver (ODE)
5. N-body problem simulation
7.2 GCD coprocessor 
The GCD coprocessor design was synthesized for a Xilinx XCV300E-8 and the design
tools reported a maximum frequency of 126 MHz. The design, including interfacing
circuitry, occupied 135 out of 3,072 slices. The design time for the GCD processor,
including the host interface, was approximately one hour.

Listing 7.1: GCD Testing program
for (my $i = 0; $i < $cnt; $i++) {
    $a = rand(0x7fff) & 0x7fff;
    $b = rand(0x7fff) & 0x7fff;

    &pilchard_write64(0, $a, 1);       # write a
    &pilchard_write64(0, $b, 2);       # write b
    &pilchard_write64(0, 0, 0);        # start coprocessor

    do {
        &pilchard_read64($data_hi, $data_lo, 0);
    } while ($data_lo == 0);           # poll for finish
    &pilchard_read64($data_hi, $data_lo, 1);

    print("gcd $a, $b = $data_lo\n");
}

The Perl program in Listing 7.1 tests the GCD coprocessor using randomly generated
15-bit inputs. The GCD coprocessor was successfully tested at 100 MHz by calling the
FPGA-based GCD implementation with random numbers and checking the result
against a software version. The resulting system could compute a GCD every 1.63 µs
(including all interfacing overheads).
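The software version used for checking is not reproduced in this chapter. A minimal stand-in could look like the Python sketch below, in which gcd_reference is Euclid's algorithm and first_mismatch compares any candidate implementation against it on random 15-bit inputs. Both names are hypothetical, and the hardware_gcd argument is a placeholder for whatever wrapper performs the Pilchard reads and writes.

```python
import math
import random


def gcd_reference(a, b):
    """Euclid's algorithm; gcd(0, 0) is returned as 0."""
    while b:
        a, b = b, a % b
    return a


def first_mismatch(hardware_gcd, count=1000, seed=0):
    """Return the first (a, b) on which hardware_gcd disagrees, else None.

    hardware_gcd is a hypothetical callable wrapping the coprocessor access.
    """
    rng = random.Random(seed)
    for _ in range(count):
        a = rng.randrange(0x8000)      # 15-bit operands
        b = rng.randrange(0x8000)
        if hardware_gcd(a, b) != gcd_reference(a, b):
            return (a, b)
    return None


if __name__ == "__main__":
    # Self-check against the standard library in place of real hardware.
    print(first_mismatch(math.gcd))
```

Using a fixed seed makes a failing pair reproducible, which is convenient when the mismatch has to be replayed in a VHDL simulator.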
7.3 Floating Point Module Library 
Different configurations of adders and multipliers were extracted from the module 
library, simulated and synthesized for the Virtex XCVlOOOE-6 FPGA. Table 7.1 is 
a summary of the resource requirements, maximum reported frequency and latency 
for a fixed exponent length of 8 bits and different fraction sizes. The adder is not yet 
fully optimized and the maximum frequency was 40 MHz with a 4 stage pipeline. 
Table 7.1: Area and speed of the floating point library.

Fraction Size (bits)   Circuit Size (slices)   Frequency (MHz)   Latency (cycles)
Multiplication
         7                     178                   103                 8
        15                     375                   102                 8
        23                     598                   100                 8
        31                     694                   100                 8
Addition
         7                     120                    58                 4
        15                     225                    46                 4
        23                     336                    41                 4
        31                     455                    40                 4
7.4 Digital sine-cosine generator (DSCG) 
The algorithm function of the sine-cosine generator was simulated by directly exe-
cuting it in Perl. Figure 7.1 shows the resulting double precision reference output. 
The output will be used for evaluating the quantization error for different precision 
configurations. 
Figure 7.2 shows the quantization error of the Float simulation for different
fraction sizes, as a function of time. In the simulation, the exponent field was set to
be large enough to avoid overflow. The maximum exponent value can be determined
during the simulation of the algorithm. As expected, the error is reduced as the
number of fraction bits (and hence the precision) is increased.
Figure 7.3 shows the QERR, as defined in Section 5.2.2, of the digital sine-cosine
generator with a varying number of fraction bits, assuming that the exponent field
is large enough to avoid overflow. For fraction sizes varying from 12 to 40 bits, the
QERR ranged from -50 to -210 dB. A linear relationship is observed between QERR
and fraction size.
A single precision digital sine-cosine generator was implemented. The reported
frequency is 52.4 MHz and the design consumed 3,470 slices.
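The linear trend between error and fraction size can be illustrated with a small experiment. The Python sketch below rounds samples to a given number of fraction bits and reports the error in decibels. It is only a rough model: the function names are hypothetical, the metric (RMS error relative to RMS signal) is an assumed stand-in for the QERR of Section 5.2.2, and only a single rounding step is modelled rather than the accumulated error of the full generator, so only the slope, and not the absolute level, is comparable with Figure 7.3.

```python
import math


def quantize(value, fraction_bits):
    """Round value to a significand with the given number of fraction bits."""
    if value == 0.0:
        return 0.0
    m, e = math.frexp(value)               # value = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** (fraction_bits + 1)     # hidden bit plus fraction bits
    return math.ldexp(round(m * scale) / scale, e)


def error_db(samples, fraction_bits):
    """RMS quantization error relative to RMS signal, in dB (assumed metric)."""
    err = sum((quantize(s, fraction_bits) - s) ** 2 for s in samples)
    sig = sum(s * s for s in samples)
    return 10.0 * math.log10(err / sig)


if __name__ == "__main__":
    samples = [math.sin(0.1 * k) for k in range(1, 500)]
    for bits in (12, 16, 20, 24, 28, 32, 36, 40):
        print(bits, round(error_db(samples, bits), 1))
```

Each additional fraction bit halves the rounding error, which corresponds to about 6 dB per bit; the quoted range of -50 to -210 dB over 28 bits works out to roughly 5.7 dB per bit, in line with that expectation.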
Figure 7.1: Digital sine-cosine generator reference output.

Figure 7.2: Quantization error of the sine-cosine generator for different fraction
sizes.

Figure 7.3: Quantization error for different fraction sizes.
Table 7.2: Optimization result using different QERR values, where (x,y) are
the (exponent size, fraction size) in bits.

QERR (dB)     s1       s2       cos(θ)   cos(θ)+1   cos(θ)-1
-52           (5,10)   (5,12)   (5,11)   (5,11)     (5,12)
-73           (5,15)   (5,14)   (5,15)   (5,15)     (5,16)
-98           (5,19)   (5,18)   (5,19)   (5,19)     (5,20)
-123          (5,23)   (5,23)   (5,24)   (5,24)     [one entry in this row is illegible in the source]
-148          (5,26)   (5,28)   (5,27)   (5,27)     (5,28)
-171          (5,31)   (5,30)   (5,31)   (5,31)     (5,32)
[illegible]   (5,35)   (5,33)   (5,35)   (5,36)     (5,36)
[illegible]   (5,38)   (5,39)   (5,38)   (5,39)     (5,40)
7.5 Optimization 
By varying the fraction size of the Float objects using the technique described in 
Section 5.2.2, the optimizer can minimize the cost function while maintaining a given 
maximum quantization error. This technique was used to determine the minimum 
area requirements for a given QERR. Table 7.2 shows the optimized number of 
fraction bits and exponent bits for different maximum QERR. As expected, the
wordlength of the variables tends to increase as the QERR requirement is
tightened.
Figure 7.4 compares the optimized circuit size (which allows variables to have 
different numbers of fractional bits) to a scheme where all variables have the same 
number of fraction bits (i.e. the fixed fraction case). The "Fraction Size" curve was 
made by computing the area of the sine-cosine generator for the case that all variables 
have the fraction size on the x-axis. The "Optimized Circuit Size" curve was made 
by using the fraction size of the x-axis as the starting point for an optimization, with 
the maximum QERR specified to be that of the fixed fraction case. Thus it can be 
seen from the figure that for the same quantization error, a 2% to 5% reduction in 
area is achieved by the optimization process. 
In the sine-cosine generator, all variables require similar precisions. In
applications where variables have widely different precisions, one would expect the scheme
allowing different fraction sizes to offer a much larger improvement in area
efficiency.

Figure 7.4: Area estimation of the fixed fraction and optimized circuits.
7.6 Ordinary Differential Equation (ODE) 
The differential equation solver described in Section 5.5 was synthesized for a Xilinx
XCV300E-8 device and the design tools reported a maximum frequency of 53.9 MHz.
The design, including interfacing circuitry, occupied 2,439 out of 3,072 slices. The
outputs shown in Table 7.3 were obtained from the hardware implementation at
50 MHz using different h values. The resulting system (h = [illegible]) took 28.7 µs for an
execution, including all interfacing overheads.

t_k      h = 1    h = 1/2     h = 1/4     h = 1/8     h = 1/16    y(t_k) Exact
0        1.0      1.0         1.0         1.0         1.0         1.0
0.125    -        -           -           0.9375      0.940430    0.943239
0.25     -        -           0.875       0.886719    0.892215    0.897491
0.375    -        -           -           0.846924    0.854657    0.862087
0.50     -        0.75        0.796875    0.817429    0.827100    0.836402
0.75     -        -           0.759766    0.786802    0.799566    0.811868
1.00     0.5      0.6875      0.758545    0.790158    0.805131    0.819592
1.5      -        0.765625    0.846386    0.882855    0.900240    0.917100
2.00     0.75     0.949219    1.030827    1.068222    1.086166    1.103638
2.50     -        1.211914    1.289227    1.325176    1.342538    1.359514
3.00     1.375    1.533936    1.604252    1.637429    1.653556    1.669390
Table 7.3: Results generated by the differential equation solver for different
values of h.

7.7 N Body Problem Simulation (Nbody)
The VHDL code generated by the fly compiler, which implements the N-body problem
simulation with N = 10 together with the STAM extensions, was processed by the
design tools to generate the bitstream. Table 7.4 shows the implementation results
for different floating point configurations. The number of BlockRAMs was
always 32 and is thus not included in the table. The data used was a NEMO N-body
snapshot data set [Teu03]. For experimental purposes, N = 10 was used during the
evaluation of quantization error.

Table 7.4: The frequency and slices used, as reported by the design tools, for the
N-body problem.
Floating Point Configuration      Area       Frequency   QERR
(exponent size, fraction size)    (slices)   (MHz)       (dB)
(5, 15)                           3,523      47.34       -82
(5, 23)                           5,267      44.07       -102
(8, 15)                           3,837      48.92       -82
(8, 23)                           5,475      44.79       -102
7.8 Summary 
This chapter presents the area and performance results for all the designs previously
described. The tradeoff between area and precision is discussed for each
experiment. The results are summarized in Table 7.5. The QERR of the GCD example is
omitted as it does not involve approximations to fractional numbers. The exponent
size of all the floating point numbers was fixed to 8 bits.

Problem Name             Fraction Size   Frequency   Area       QERR
                         (bits)          (MHz)       (Slices)   (dB)
GCD                      16                                     -
DSCG N = 50              15              78.16       2,300      -81
DSCG N = 50              23              52.38       3,470      -127
ODE (h = [illegible])    15              75.74       1,715      -84.8
ODE (h = [illegible])    23              64.50       2,495      -134
Nbody N = 10             15              48.92       3,837      -82
Nbody N = 10             23              44.79       5,475      -102
Table 7.5: All Experiments Result

To implement each design, the designer needs to perform 2 steps:
1. Use a Perl-like language (fly) to describe the algorithm.
2. Suggest the precision of the floating/fixed point numbers to be used.
The fly descriptions for all examples were short, and they are much easier to
write and understand than the corresponding VHDL descriptions. The design
environment can generate the
bitstream suitable for on-board testing. When compared with the traditional design
flow, a significant amount of time is saved and thus the productivity of the designer
is increased. To further customize the design, the precision of the floating point
numbers can be varied as specified by the designer. The optimizer can help the
designer to balance the tradeoff between accuracy and area of the hardware via the
given cost function.
Chapter 8 
Conclusion 
This research proposed a mixture of hardware compilation, module generators,
floating point arithmetic and automatic interface generation to improve the
efficiency, productivity and flexibility of implementing floating point designs on
FPGAs. The framework allows designers to use a programming language to
implement a design, automatically generating floating point circuits and elementary
arithmetic. For the same design, this framework allows the tradeoff between precision
and area to be explored from a single description. Several applications, such as a digital
sine-cosine generator, a greatest common divisor coprocessor, an ordinary differential
equation solver and an N-body problem simulator, have been developed using this approach.
The key issues of this research are highlighted below.
Integration of programming language and FPGA 
Using the same programming language to describe the algorithm and to implement 
the FPGA design benefits designers in several ways. The algorithm can 
be verified and simulated by executing the code in a software environment. 
Translation and optimization are performed by the tools, so the designer can 
concentrate on higher-level details. This methodology greatly reduces design 
time and enables rapid system prototyping. Design errors are also reduced compared 
with traditional design flows, since the translation is done automatically instead 
of by manually porting the algorithm into datapath and control components. 
The treatment of software programming and hardware design as distinct activities 
remains an obstacle to developing FPGA based systems. The design goal of fly 
and float is to bridge these two activities so that a software program 
can be translated into a hardware implementation. Using these tools, the designer 
can reuse the software code, optimize the hardware resources used and perform 
on-board testing without additional effort. The time required to implement floating 
point algorithms on FPGAs can be significantly reduced. With ever increasing device 
densities, this design methodology should become even more attractive in the future. 
Floating point/Elementary arithmetic on FPGAs 
This dissertation showed that a floating point algorithm description can be 
translated into hardware. When a floating point algorithm is implemented on a 
reconfigurable computing platform, operators of arbitrary precision can be used, so 
the tradeoff between circuit size and accuracy can be varied. The designer 
specifies the desired balance by providing a suitable cost function, 
and the optimizer returns the best configuration for each floating point 
operator. 
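The idea of trading accuracy against operator width can be illustrated with a small software model. The following is a minimal sketch only, not the float library itself: the function names, the rotation workload, the cost function and the value of `area_weight` are all assumptions made for illustration, and the exponent range is not modelled.

```python
import math


def quantize(x: float, frac_bits: int) -> float:
    """Round x to a binary floating point value with `frac_bits` fraction bits.

    Only the fraction width is modelled; the exponent range is assumed unlimited.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 1 << (frac_bits + 1)         # hidden bit plus frac_bits fraction bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def rotate(steps: int, frac_bits=None):
    """Hypothetical workload: rotate the point (1, 0) around the unit circle."""
    q = (lambda v: v) if frac_bits is None else (lambda v: quantize(v, frac_bits))
    delta = 2.0 * math.pi / steps
    cd, sd = q(math.cos(delta)), q(math.sin(delta))
    c, s = 1.0, 0.0
    samples = []
    for _ in range(steps):
        # every multiply and add is rounded, as a narrow hardware operator would do
        c, s = q(q(c * cd) - q(s * sd)), q(q(s * cd) + q(c * sd))
        samples.append(s)
    return samples


def max_error(frac_bits: int, steps: int = 50) -> float:
    """Largest deviation of the quantized run from the double precision run."""
    reference = rotate(steps)
    approx = rotate(steps, frac_bits)
    return max(abs(a - b) for a, b in zip(reference, approx))


def cost(frac_bits: int, steps: int = 50, area_weight: float = 1e-4) -> float:
    """Assumed cost: error plus a penalty that grows with the fraction width."""
    return max_error(frac_bits, steps) + area_weight * frac_bits


def best_fraction_bits(candidates=range(4, 33)) -> int:
    """Exhaustively pick the fraction width with the lowest cost."""
    return min(candidates, key=cost)
```

Raising `area_weight` in this sketch pushes the search towards narrower operators, while lowering it favours accuracy; a real cost function would use measured area figures rather than the bit count.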
Elementary functions can be generated automatically using lookup tables. These 
act like a flexible mathematical library in software. This enhances flexibility, since 
the designer does not need to implement every elementary function from scratch. 
The automatic function generator saves design time, and no extra hardware 
knowledge is required to build an elementary function. 
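As an illustration of table-based function generation, the sketch below builds a plain direct-lookup table for a one-operand function in fixed point. It is an assumption-laden example: the names `build_table` and `lookup`, the address width and the output format are hypothetical, and a direct table is used here for simplicity rather than any particular table method employed by the generator.

```python
import math
from typing import Callable, List


def build_table(func: Callable[[float], float], lo: float, hi: float,
                addr_bits: int, out_frac_bits: int) -> List[int]:
    """Sample func at 2**addr_bits evenly spaced points in [lo, hi).

    Each entry is stored as an integer with out_frac_bits fraction bits,
    which is what a ROM in hardware would hold.
    """
    size = 1 << addr_bits
    step = (hi - lo) / size
    scale = 1 << out_frac_bits
    return [round(func(lo + i * step) * scale) for i in range(size)]


def lookup(table: List[int], x: float, lo: float, hi: float,
           out_frac_bits: int) -> float:
    """Evaluate the tabulated function at x by truncating x to a table address."""
    size = len(table)
    index = int((x - lo) / (hi - lo) * size)
    index = min(max(index, 0), size - 1)   # clamp out-of-range inputs
    return table[index] / (1 << out_frac_bits)
```

For example, `build_table(math.sin, 0.0, math.pi / 2, 10, 16)` yields a 1024-entry table, and the lookup error is bounded by roughly one table step times the maximum slope of the function plus the output rounding error.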
By combining all of these module generators, the implementation of floating 
point designs on reconfigurable computing platforms is made simpler. A wide 
range of applications, such as scientific simulations, equation solvers and DSP designs, 
can be implemented as FPGA based coprocessors. The HDL design flow also benefits, 
because floating point arithmetic is available as synthesizable VHDL modules that any 
HDL design can easily interface with. 
Adapt to different architectures 
Fly generates VHDL code because VHDL is widely supported across reconfigurable 
computing platforms. Therefore, even though the design environment currently targets 
the Pilchard board, it can be ported to other reconfigurable computing 
platforms, such as other FPGA products or even ASICs, with only slight modification. 
In addition, the HDL output enables further optimization on different FPGA platforms 
using the corresponding design tools. 
Fly is a modifiable compiler which can be adapted to produce code for different 
HDLs, program proving tools and programming languages. An easily 
understandable and easily modifiable compiler allows the 
fly language to be integrated with many other tools. Examples of such integration 
were presented: a new host interface mechanism, floating point arithmetic and arbitrary 
function generation were added as extensions to the basic fly environment. 
8.1 Future Work 
There are several possibilities for improving the system. The compiler 
produces only one-hot state machines, which may be inefficient in certain cases; 
other state machine encodings could be supported. The 
resulting datapath is not fully utilized, and the operators are idle most of the time. It 
would be desirable for the code generation strategy to let the datapath share hardware 
resources among operations, which would save area when it is critical for an 
application. Parallelism must currently be specified by the user. It would be 
better if the compiler could detect dependencies itself and reorganize the datapath 
so that parallelism is achieved automatically. Nevertheless, it is believed that 
the benefits in productivity and flexibility gained from this approach 
outweigh these drawbacks. 
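One possible form of the dependency detection mentioned above is sketched below. This is a hypothetical illustration of future work, not a feature of fly: statements are modelled as a target variable plus the set of variables read, and the function name `schedule` and the conservative conflict rule are assumptions.

```python
from typing import List, Sequence, Set, Tuple

# A statement is modelled as (target variable, set of variables it reads).
Statement = Tuple[str, Set[str]]


def schedule(statements: Sequence[Statement]) -> List[List[int]]:
    """Group statement indices into levels that could run in parallel.

    A statement is placed one level after the latest earlier statement it
    conflicts with (read-after-write, write-after-write or write-after-read).
    """
    levels: List[int] = []
    for j, (target_j, reads_j) in enumerate(statements):
        level = 0
        for i in range(j):
            target_i, reads_i = statements[i]
            conflict = (target_i in reads_j      # j reads what i wrote
                        or target_i == target_j  # both write the same variable
                        or target_j in reads_i)  # j overwrites what i read
            if conflict:
                level = max(level, levels[i] + 1)
        levels.append(level)
    groups: List[List[int]] = [[] for _ in range(max(levels, default=-1) + 1)]
    for index, level in enumerate(levels):
        groups[level].append(index)
    return groups
```

Each returned group corresponds to statements that a compiler could, under these assumptions, place in one parallel statement block instead of requiring the user to do so by hand.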
The compiler in Appendix B generates a bit parallel implementation but, for 
example, if a digit serial operator library were available, it could be easily modified 
to use digit serial arithmetic. Similarly, both fixed point and floating point imple-
mentations of the same algorithm could be generated from the same fly description. 
In the future, we will experiment with different code generation strategies. Many 
designs could be developed from the same program, and different fly based code 
generators could serve to decouple the algorithmic descriptions from the back-end 
implementation. In the case of using a digit serial library, users could select the 
digit size, or produce a number of implementations and choose the one which best 
meets their area/time requirements. 
Finally, the elementary function generator produces fixed point functions only, and 
floating point functions were implemented manually by the designer. This process could 
conceivably be further automated to produce an automatic floating point elementary 
function generator. 
Appendix A 
Fly Formal Grammar 
program = statement_list 
statement_list = statement | "{" statement(s) "}" 
parallel_statement = "[" statement(s) "]" 
statement = comment | assignment | ifelse | if | while | parallel_statement | 
            function_call 
assignment = variable "=" expression ";" 
expression = value operator expression | value 
operator = "*" | "/" | "+" | "-" | ".+" | ".-" | ".*" 
value = INTEGER | variable 
variable = "$" LETTER | "$" LETTER DIGIT 
while = "while" "(" condition ")" statement_list 
ifelse = "if" "(" condition ")" statement_list "else" statement_list 
if = "if" "(" condition ")" statement_list 
condition = expression relation expression 
relation = ">" | "<" | "<=" | ">=" | "!=" | "==" 
function_call = variable "=" function_name "(" variable_list ")" ";" 
function_name = "read_host" | "write_host" | ".power15" 
variable_list = value "," variable_list | value 
comment = "#" ANYTHING 
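To make the assignment and expression rules concrete, the following sketch is a small recursive-descent recognizer for a single assignment statement. It is an illustration only and is independent of the actual compiler: the token pattern for variables (`$` followed by a lowercase letter and word characters) and the tuple-based tree are assumptions. As in the grammar, expressions are right-recursive and no operator precedence is applied.

```python
import re

# One alternative per token kind; exactly one capture group matches per token.
TOKEN = re.compile(r"\s*(?:(\d+)|(\$[a-z]\w*)|(\.[+\-*]|[*/+\-])|(=)|(;))")
KINDS = ("INTEGER", "VARIABLE", "OPERATOR", "ASSIGN", "SEMI")


def tokenize(text):
    """Split the text into (kind, lexeme) pairs or raise SyntaxError."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((KINDS[match.lastindex - 1], match.group(match.lastindex)))
        pos = match.end()
    return tokens


def parse_assignment(text):
    """Parse: assignment = variable "=" expression ";" and return (target, tree)."""
    tokens = tokenize(text)
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        lexeme = tokens[pos][1]
        pos += 1
        return lexeme

    def value():
        # value = INTEGER | variable
        if pos < len(tokens) and tokens[pos][0] in ("INTEGER", "VARIABLE"):
            return expect(tokens[pos][0])
        raise SyntaxError(f"expected a value at token {pos}")

    def expression():
        # expression = value operator expression | value
        left = value()
        if pos < len(tokens) and tokens[pos][0] == "OPERATOR":
            operator = expect("OPERATOR")
            return (operator, left, expression())
        return left

    target = expect("VARIABLE")
    expect("ASSIGN")
    tree = expression()
    expect("SEMI")
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after ';'")
    return (target, tree)
```

For instance, `parse_assignment("$a = $b + 2 * $c;")` groups the expression to the right, giving `("$a", ("+", "$b", ("*", "2", "$c")))`.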
Appendix B 
Original Fly Source Code 
package main;
use Parse::RecDescent;

my $grammar = q {
    { my ($seq, $comb, $aux, $paux, $s, %sigs) =
         ("", "", 0, 0, "signal"); }

    prog: stmtlist /^$/ {
        print <<EOF
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
package hc_pack is
subtype word is integer;
type words is array(integer
range <>) of word;
end hc_pack;

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use work.hc_pack.all;

entity arith_core is
port(
clk: in std_logic;
rst: in std_logic;
start: in std_logic;
din : in words( $sigs{din}
downto 1);
finish: out std_logic;
dout: out words( $sigs{dout}
downto 1));
end arith_core;
architecture rtl of arith_core is
EOF
        ;
        foreach my $k (keys %sigs) {
            if ($sigs{$k}) {
                print "$s $k :\t words($sigs{$k} ".
                      "downto 0);\n"
                    if !($k eq "din")
                    and !($k eq "dout");
            }
            else {
                print "$s $k :\t word; \n";
            }
        }
        for (my $i=1; $i<$aux; $i++) {
            print "$s s$i, f$i :\t boolean;";
            print "--std_logic;\n";
        };
        for (my $i=1; $i<=$paux; $i++) {
            print "$s p$i, q$i :\t boolean;";
            print "--std_logic;\n";
        };

        print "$s s$item[1], f$item[1] :\t boolean;";
        print "--std_logic;\nbegin --architecture\n";
        print " s$item[1] <= TRUE when start='1'";
        print "else FALSE ;--start;\n finish <= '1'";
        print "when f$item[1] else '0'; --f$item[1];\n";
        print "process(clk)\nbegin\n";
        print "if rising_edge(clk) then\n";
        print $seq;
        print "end if;\nend process;\n\n";
        print "--combinational part\n$comb";
        print "end rtl;\n";
    }

    stmtlist: stmt | '{' stmt(s) '}' {
        my $fst_in = shift(@{$item[2]});
        my $int_in = $fst_in;
        $aux += 1;
        $comb .= "s$int_in <= s$aux;\n";
        foreach $int_in (@{$item[2]}) {
            $comb .= "s$int_in <= f$fst_in;\n";
            $fst_in = $int_in;
        }
        $comb .= "f$aux <= f$fst_in;\n";
        $aux;
    }

    stmt: asgn | ifelse | if | while
        | pstmtlist | <error>

    pstmtlist: '[' stmtlist(s) ']' {
        $aux += 1;
        my $int_in;
        my @plist = ();
        foreach $int_in (@{$item[2]}) {
            $comb .= sprintf("s%d <= s%d;\n",
                             $int_in, $aux);
            $paux += 1;
            push (@plist, $paux);

            $seq .= "if f$aux then --pstmtlist\n\t";
            $seq .= "q$paux <= false;\n";
            $seq .= "else\n\t";
            $seq .= "q$paux <= p$paux;\n";
            $seq .= "end if;\n";

            $comb .= "p$paux <= f$int_in or q$paux;\n";
        }
        my $pend = "f$aux <= p" .
            join(" and p", @plist)
            . ";--pstmt end\n";
        $comb .= $pend;
        $aux;
    }

    asgn: var '=' expr ';' {
        $aux = $aux + 1;
        $seq .= "if s$aux then\n\t";
        $seq .= "$item[1] <= $item[3];\n";
        $seq .= "end if;\n";
        $seq .= "f$aux <= s$aux;\n\n";
        $aux;
    }

    expr: val op expr { "$item[1] $item[2] $item[3]" } | val

    op: '*' | '/' | '+' | '-'

    val: /\d+/ | var

    var: /\$[a-z][\w\[\]]*/ {
        $item[1] =~ s/^\$//;
        my $sig = $item[1];
        $sig =~ s/\[(\d+)\]//;
        $sigs{"$sig"} = ($sigs{"$sig"} && ($sigs{"$sig"} > $1))
            ? $sigs{"$sig"} : $1;
        $item[1] =~ tr/\[\]/\(\)/;
        $item[1];
    }

    while: 'while' '(' cond ')' stmtlist {
        $aux += 1;
        $comb .= "s$item[5] <= ($item[3]) and ".
                 "(s$aux or f$item[5]);\n";
        $comb .= "f$aux <= (not ($item[3])) and ".
                 "(s$aux or f$item[5]);\n";
        $aux;
    }

    ifelse: 'if' '(' cond ')' stmtlist 'else' stmtlist {
        $aux += 1;
        $comb .= "s$item[5] <= ($item[3]) and s$aux;\n";
        $comb .= "s$item[7] <= (not ($item[3])) and s$aux;\n";
        $comb .= "f$aux <= f$item[5] or f$item[7];\n";
        $aux;
    }

    if: 'if' '(' cond ')' stmtlist {
        $aux += 1;
        $comb .= "s$item[5] <= ($item[3]) and s$aux;\n";
        $comb .= "f$aux <= (not ($item[3]) and s$aux) or f$item[5];\n";
        $aux;
    }

    cond: expr rel expr { "$item[1] $item[2] $item[3]" }

    rel: '>' | '<' | '<=' | '>=' | '!=' { "/=" } | '==' { "=" }

    varlist: var ',' varlist { "$item[1] $item[3]" } | var
};

$::RD_HINT = 0;
$::RD_AUTOACTION = q { $item[1] };
my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar";
local $/;
my $script = <>;
my $tree = $parser->prog($script) or die "Bad script";
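The way the asgn and stmtlist rules chain start (s) and finish (f) signals can be mirrored in a few lines of ordinary code. The sketch below is a simplified re-implementation for illustration, assuming a non-empty block of plain assignments; the function name `compile_block` and its list-based input are hypothetical and are not part of fly.

```python
from typing import List, Tuple


def compile_block(assignments: List[Tuple[str, str]]) -> Tuple[str, str, int]:
    """Emit VHDL-like text for a '{ ... }' block of assignments.

    Mirrors the asgn rule (one state per assignment, finish follows start one
    clock later) and the stmtlist rule (the finish of each statement starts
    the next one). Returns (sequential text, combinational text, block state).
    """
    if not assignments:
        raise ValueError("a block needs at least one assignment")
    aux = 0
    seq: List[str] = []
    comb: List[str] = []
    states: List[int] = []
    for target, expr in assignments:
        aux += 1                       # asgn: allocate a new state number
        seq.append(f"if s{aux} then\n\t{target} <= {expr};\nend if;\n"
                   f"f{aux} <= s{aux};\n")
        states.append(aux)
    aux += 1                           # stmtlist: state number for the whole block
    previous = states[0]
    comb.append(f"s{previous} <= s{aux};")       # block start triggers first statement
    for state in states[1:]:
        comb.append(f"s{state} <= f{previous};")  # each finish starts the next statement
        previous = state
    comb.append(f"f{aux} <= f{previous};")       # block finishes with its last statement
    return "\n".join(seq), "\n".join(comb), aux
```

Calling `compile_block([("a", "b + 1"), ("c", "a * 2")])` produces the combinational lines `s1 <= s3;`, `s2 <= f1;` and `f3 <= f2;`, which shows the one-hot token being passed from the block to each statement in turn.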
Bibliography 
[ANS85] New York ANSI/IEEE. IEEE Standard for Binary Floating-Point Arith-
metic. Technical report, The Insittution of Electrical and Electronics 
Engineerings, Inc, 1985. IEEE Std 754-1985. 
BL02] Pavle Belanovic and Miriam Lesser. A Library of Parameterized 
Floating-point Modules and Their Use. In Field Programmable Logic 
and Application. Reconfigurable Computing Is Going Mainstream, pages 
657-666. Springer-Verlag Heidelberg, Sept 2002. 
[BSC+99] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, 
M. Chang, M. Haldar, P. Joisha, A. Jones, A. Kanhare, A. Nayak, 
S. Periyacheri, and M. Walkden. MATCH: a MATLAB compiler for 
configurable computing systems. Technical report, Center for Parallel 
and distributed Computing, Northwestern University, Aug 1999. Tech-
nical Report CPDCTR -9908-013. 
[ConOl] D. Conway. Parse: :RecDescent Perl module. In 
http://www.cpan.org/modules/hy-module/Parse/DCONWAY/Parse-
RecDescent-1.80. tar.gz, 2001. 
[ETS96] ETSI. Radio Equipment and Systems (RES); High PErformance Radio 
Local Area Network (HIPERLAN) Type 1; Functional specification. The 
European Telecommunications Standards Institute, 1st edition, 1996. 
[Gol91] David Goldberg. What every computer scientist should know about 
floating-point arithmetic. ACM Computing Surveys, 23(l):5-48, 1991. 
74 " 
[HLT+02] C.H. Ho, P.H.W. Leong, K.H. Tsoi, R. Ludewig, P.Zipf, A.G. Ortiz, 
and M. Glesner. Fly - a modifiable hardware compiler. In Proceedings 
of the twelfth International Workshop on Field-Programmable Logic & 
Applications, 2002. 
[IEE02] IEEE Computer Society. 1076 IEEE Standard VHDL Language Refer-
ence Manual. Technical report, The Insittution of Electrical and Elec-
tronics Engineerings, Inc, 2002. IEEE Std 1076-2002. 
[JGW81] John E. Dennis Jr, David M. Gay, and Roy E. Welsch. An adaptive 
nonlinear least-squares algorithm. ACM Transactions on Mathematical 
Software, 7(3):348-368, Sept 1981. 
[JLOl] A. Jaenicke and W. Luk. Parameterised floating-point arithmetic on FP-
GAs. In Proceedings of the IEEE International Conference on Acoustics, 
Speech and Signal Processing, pages 897-900, 2001. 
[LLC+01] P.H.W. Leong, M.P. Leong, O.Y.H. Cheung, T. Tung, C.M. Kwok, M.Y. 
Wong, and K.H. Lee. Pilchard - a reconfigurable computing platform 
with memory slot interface. In Proceedings of the IEEE Symposium on 
FCCM, 2001. 
:LMM+98] W. B. Ligon, S. McMillan, G. Moon, K. Schoonover, F. Stivers, and 
K. D. Underwood. A Re-evaluation of the Practicality of Floating-Point 
Operations on FPGAs. In Proc. of IEEE Symposium on FPGAs for 
Custom Computing Machines, pages 206-215. IEEE Computer Society 
Press, 1998. . 
[MF99] J. Mathews and K. Fink. Numerical Methods Using MATLAB, pages 
433-441. Prentice Hall, 3rd edition, 1999. 
[Mit98] San jit K. Mitra. Digital Signal Processing A Computer-Based Approach 
International Editions 1998�pages 339-416. McGraw-Hill, 1998. 
75 
[MT98] Junichiro Makino and Makoto Taiji. Scientific Simulation with Special-
Purpose Computers - the GRAPE systems, pages 41—48. John Wiley & 
Sons Ltd, 1998. 
[NM65] J. Nelder and R. Mead. A simplex method for function minimization. 
In Computer Journal, pages 308-313, 1965. 
[Pag96] I. Page. Constructing hardware-software systems from a single descrip-
tion. Journal of VLSI Signal Processing, 12(1):87-107, 1996. 
[Pin03] Ayal Pinkus. Yet another computer algebra system (YACAS), 2003. 
[SS97] Michael J. Schulte and James Stine. Symmetric bipartite tables for 
accurate function approximation. In Tom as Lang, Jean-Michel Muller, 
and Naofumi Takagi, editors, Proceedings of the 13th IEEE Symposium 
on Computer Arithmetic, pages 175-183, Los Alamitos, CA, 1997. IEEE 
Computer Society Press. 
[SS99a] James E. Stine and Michael J. Schulte. The symmetric table addition 
method for accurate function approximation. Journal of VLSI Signal 
Processing, 21:167-177, 1999. 
[SS99b] James E. Stine and Michael J. Schulte. The symmetric table addition 
method for accurate function approximation. Journal of VLSI Signal 
Processing, 21:167-177, 1999. 
[SWA95] N. Shirazi, A. Walters, and P. Athanas. Quantitative analysis of floating 
point arithmetic on FPGA based custom computing machines. In Proc. 
FCCM, pages 155-162, 1995. 
[Teu03] Peter Teuben. NEMO - A Stellar Dynamics Toolbox, 2003. 
[WA02] M. Ward and N.C. Audsley. Hardware Implementation of Programming 
Languages for Real-Time. In Proceedings of the Eigth IEEE Real-Time 
•d Embedded Technology and Applications Symposium, pages 276-285, 
Sept 2002. 
76 
[WCOOO] L. Wall, T. Christiansen, and J. Orwant. Programming Perl O'Reilly, 
3rd edition, 2000. 
[XilOl] Xilinx Inc. Architectural Description, pages 6-57. Xilinx Inc, 2001. 
77 " 
Publications 
Full Length Conference Papers 
• C.H. Ho, M.P. Leong, P.H.W. Leong, J. Becker, M. Glesner, "Rapid Prototyping 
of FPGA based Floating Point DSP Systems", in Proceedings of IEEE 
International Workshop on Rapid System Prototyping, July 2002. 
• C.H. Ho, P.H.W. Leong, K.H. Tsoi, R. Ludewig, P. Zipf, A.G. Ortiz, M. Glesner, 
"Fly - A Modifiable Hardware Compiler", in Proceedings of International 
Conference on Field Programmable Logic and Applications, September 2002. 
• C.H. Ho, K.H. Tsoi, H.C. Yeung, Y.M. Lam, K.H. Lee, P.H.W. Leong, R. 
Ludewig, P. Zipf, A.G. Ortiz, M. Glesner, "Arbitrary Function Approximation 
in HDLs", submitted to Proceedings of IEEE International Conference on 
Field-Programmable Technology, December 2003. 