Floating-Point FPGA: Architecture and Modeling by Chun Hok Ho et al.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 12, DECEMBER 2009 1709
Floating-Point FPGA: Architecture and Modeling
Chun Hok Ho, Student Member, IEEE, Chi Wai Yu, Philip Leong, Senior Member, IEEE, Wayne Luk, Fellow, IEEE,
and Steven J. E. Wilton, Senior Member, IEEE
Abstract—This paper presents an architecture for a recon-
ﬁgurable device that is speciﬁcally optimized for ﬂoating-point
applications. Fine-grained units are used for implementing con-
trol logic and bit-oriented operations, while parameterized and
reconﬁgurable word-based coarse-grained units incorporating
word-oriented lookup tables and ﬂoating-point operations are
used to implement datapaths. In order to facilitate comparison
with existing FPGA devices, the virtual embedded block scheme
is proposed to model embedded blocks using existing ﬁeld-pro-
grammable gate array (FPGA) tools. This methodology involves
adopting existing FPGA resources to model the size, position, and
delay of the embedded elements. The standard design ﬂow offered
by FPGA and computer-aided design vendors is then applied and
static timing analysis can be used to estimate the performance of
the FPGA with the embedded blocks. On selected ﬂoating-point
benchmark circuits, our results indicate that the proposed archi-
tecture can achieve four times improvement in speed and 25 times
reduction in area compared with a traditional FPGA device.
Index Terms—Architecture, embedded blocks, ﬁeld-pro-
grammable gate array (FPGA), ﬂoating point, modeling.
I. INTRODUCTION
Field-programmable gate array (FPGA) technology has
been widely adopted to speed up computationally intensive
applications. Most current FPGA devices employ an island-style
fine-grained architecture [1], with additional fixed-function
heterogeneous blocks such as multipliers and block RAMs;
these have been shown to have severe area penalties compared
with application-speciﬁc integrated circuits (ASICs) [2]. In this
paper, we propose an architecture for FPGAs that are optimized
for floating-point applications. Such devices could be used for
applications in DSP, control, high-performance computing, and
other domains that benefit from the large dynamic range,
convenience, and ease of verification of floating point compared
with traditional fixed-point designs on conventional FPGAs.
Manuscript received April 16, 2008; revised September 12, 2008. First published
April 14, 2009; current version published November 18, 2009. This work
was supported in part by the U.K. Engineering and Physical Sciences Research
Council (EPSRC) under Grant EP/C549481/1 and Grant EP/D060567/1 and by
the Research Grants Council of the Hong Kong Special Administrative Region,
China Earmarked Grant CUHK413707.
C. H. Ho, C. W. Yu, and W. Luk are with the Department of Computing,
Imperial College, London SW7 2AZ, U.K. (e-mail: cho@doc.ic.ac.uk).
P. Leong is with the Department of Computing Science and Engineering,
Chinese University of Hong Kong, Shatin N.T., Hong Kong.
S. J. E. Wilton is with the Department of Electrical and Computer Engineering,
University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Digital Object Identifier 10.1109/TVLSI.2008.2006616
The proposed floating-point FPGA (FPFPGA) architecture
has both fine- and coarse-grained blocks, such usage of multiple
granularity having advantages in speed, density, and power over
more conventional heterogeneous FPGAs. The coarse-grained
block is used to implement the datapath, while lookup table
(LUT) based ﬁne-grained resources are used for implementing
state machines and bit level operations. In our architecture,
the coarse-grained blocks have ﬂexible, parameterized ar-
chitectures that are synthesized from a hardware description
language. This allows tuning of the parameters in a quantitative
manner to achieve a good balance between area, performance,
and ﬂexibility.
One major issue when evaluating new architectures is deter-
mining how a fair comparison to existing commercial FPGA ar-
chitectures can be made. The Versatile Place and Route (VPR)
tool [1] is widely used in FPGA architecture research; however,
the computer-aided design (CAD) algorithms used within are
different from those of modern FPGAs, as is its underlying is-
land-style FPGA architecture. As examples, VPR does not sup-
port retiming, nor does it support carry chains that are present
in all major FPGA devices. To enable modeling of our FPFPGA
and comparison with a standard island-style FPGA, we pro-
pose a methodology to evaluate an architecture based on an ex-
isting FPGA device. The key element of our methodology is to
adopt virtual embedded blocks (VEBs), created from the recon-
ﬁgurable fabric of an existing FPGA, to model the area, place-
ment, and delay of the embedded blocks to be included in the
FPGA fabric. Using this method, the impact of incorporating
embedded elements on performance and area can be quickly
evaluated, even if an actual implementation of the element is
not available.
The key contributions of this paper are as follows.
1) A novel FPFPGA architecture combining fine-grained
resources with design-time parameterizable
coarse-grained units that are reconﬁgurable at runtime.
To the best of our knowledge, this is the ﬁrst time such a
scheme has been proposed.
2) The VEB methodology that allows modeling of FPGA ar-
chitectures with embedded blocks and comparisons with
commercial FPGAs.
3) Experimental results over various applications for the
FPFPGA device.
This paper is organized as follows. Section II describes re-
lated work and existing FPGA architectures. Section III de-
scribes the proposed FPFPGA architecture. An example map-
ping is presented in Section IV. Section V discusses the re-
quirements and the associated design challenges of an FPFPGA
compiler. The evaluation methodology, including a review of
the VEB ﬂow, is described in Section VI, and the evaluation is
given in Section VII. Section VIII summarizes our research and
discusses opportunities for future research.
1063-8210/$26.00 © 2009 IEEE
II. BACKGROUND
A. Related Work
FPGA architectures containing coarse-grained units have
been reported in the literature. Compton and Hauck propose
a domain-speciﬁc architecture that allows the generation of a
reconﬁgurable fabric according to the needs of the application
[3]. Ye and Rose suggest a coarse-grained architecture that em-
ploys bus-based connections, achieving a 14% area reduction
for datapath circuits [4].
The study of embedded heterogeneous blocks for the accel-
eration of ﬂoating-point computations has been reported by
Roesler and Nelson [5] as well as Beauchamp et al. [6]. Both
studies conclude that employing heterogeneous blocks in a
ﬂoating-point unit (FPU) can achieve area saving and increased
clock rate over a ﬁne-grained approach.
Leijten–Nowak and van Meerbergen [7] proposed mixed-
level granularity logic blocks and compared their beneﬁts with
a standard island-style FPGA using the VPR tool [1]. Ye et al.
[8] studied the effects of coarse-grained logic cells (LCs) and
routing resources for datapath circuits, also using VPR. Kuon
and Rose [2] reported the effectiveness of embedded elements
in current FPGA devices by comparing such designs with the
equivalent ASIC circuit in 90-nm process technology.
Beck modified VPR to explore the effects of introducing hard
macros [9], while Beauchamp et al. augmented VPR to assess
the impact of embedding FPUs in FPGAs [6]. We are not aware
of studies concerning the effect of adding arbitrary embedded
blocks to existing commercial FPGA devices, nor of method-
ologies to facilitate such studies.
In earlier work, we described the VEB technique for mod-
eling heterogeneous blocks using commercial tools [10], do-
main-speciﬁc hybrid FPGAs [11], and a word-based synthe-
sizable FPGA architecture [12]. This paper provides a uniﬁed
view of these studies, describes the proposed FPGA architec-
ture in greater detail, presents improved results through the use
of a higher performance commercial floating-point core, introduces
the mapping process for the FPFPGA, discusses the requirements
of a hardware compiler dedicated to such an FPFPGA
device, and includes two new synthetic benchmark circuits in
the study, one of which is twice the size of the largest circuit
studied previously.
B. FPGA Architectures
An FPGA is typically constructed as an array of ﬁne-grained
or coarse-grained units. A typical fine-grained unit is a K-input
LUT, where K typically ranges from 4 to 7, which can implement
any K-input Boolean equation. We call this an LUT-based
fabric. Several LUT-based cells can be joined in a hardwired
manner to make a cluster. This greatly reduces area and routing
resources within the fabric [13].
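As a minimal sketch of the LUT-based fabric described above, the following Python models a K-input LUT as a table of 2^K configuration bits indexed by the input pattern. The class name `Lut` and its methods are illustrative assumptions and do not correspond to any FPGA vendor tool.

```python
from itertools import product


class Lut:
    """A K-input lookup table: 2**K configuration bits, one per input pattern."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a K-input LUT needs exactly 2**K configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, k, fn):
        # Enumerate every input pattern; the first input is the most
        # significant bit of the table index.
        return cls(k, [int(bool(fn(*bits))) for bits in product((0, 1), repeat=k)])

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (bit & 1)
        return self.config_bits[index]


# Configure a 4-input LUT as (a AND b) XOR (c OR d).
lut4 = Lut.from_function(4, lambda a, b, c, d: (a & b) ^ (c | d))
```

Because the function is stored as a table, any K-input Boolean equation fits in the same structure; only the configuration bits change.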
Heterogeneous functional blocks are found in commercial
FPGA devices. For example, a Virtex II device has embedded
ﬁxed-function 18-bit multipliers, and a Xilinx Virtex 4 device
has embedded DSP units with 18-bit multipliers and 48-bit ac-
cumulators. The ﬂexibility of these blocks is limited and it is
less common to build a digital system solely using these blocks.
When the blocks are not used, they consume die area without
adding to functionality.
FPGA fabric can have different levels of granularity. In gen-
eral, a unit of smaller granularity has more ﬂexibility, but can
be less effective in speed, area, and power consumption. Fab-
rics with different granularity can coexist as evident in many
commercial FPGA devices. Most importantly, the aforemen-
tioned examples illustrate that FPGA architectures are evolving
to be more coarse-grained and application-speciﬁc. The pro-
posed architecture in this paper follows this trend, focusing on
ﬂoating-point computations.
III. FPFPGA ARCHITECTURE
A. Requirements
Before we introduce the FPFPGA architecture, common
characteristics of what we consider a reasonably large class
of ﬂoating-point applications that might be suitable for signal
processing, linear algebra and simulation are ﬁrst described.
Although the following analysis is qualitative, it is possible to
develop the architecture in a quantitative fashion by proﬁling
application circuits in a speciﬁc domain.
In general, FPGA-based ﬂoating-point application circuits
can be divided into control and datapath portions. The datapath
typically contains ﬂoating-point operators such as adders,
subtractors, and multipliers, and occasionally square root and
division operations. The datapath often occupies most of the
area in an implementation of the application. Existing FPGA
devices are not optimized for ﬂoating-point computations, and
for this reason, ﬂoating-point operators consume a signiﬁcant
amount of FPGA resources. For instance, if the embedded
DSP48 blocks are not used, a double-precision ﬂoating-point
adder requires 701 slices on a Xilinx Virtex 4 FPGA, while a
double-precision ﬂoating-point multiplier requires 1238 slices
on the same device [14].
The ﬂoating-point precision is usually a constant within an
application. The IEEE 754 single precision format (32 bit) or
double-precision format (64 bit) is commonly used.
The datapath can often be pipelined and connections within
the datapath may be unidirectional in nature. Occasionally, there
is feedback in the datapath for some operations such as accumulation.
The control circuit is usually much simpler than the
datapath, and therefore, its area consumption is typically lower.
Control is usually implemented as a ﬁnite-state machine and
most FPGA synthesis tools can produce an efﬁcient mapping
from the Boolean logic of the state machine into ﬁne-grained
FPGA resources.
Based on the aforementioned analysis, some basic require-
ments for FPFPGA architectures can be derived as follows.
1) A number of coarse-grained ﬂoating-point addition and
multiplication blocks are necessary since most computa-
tions are based on these primitive operations. Floating-
point division and square root operators can be optional,
depending on the domain-speciﬁc requirement.
2) Coarse-grained interconnection, fabric, and bus-based op-
erations are required to allow efﬁcient implementation and
interconnection between ﬁxed-function operators.
Fig. 1. Architecture of the FPFPGA.
3) Dedicated output registers for storing ﬂoating-point values
are required to support pipelining.
4) Fine-grained units and suitable interconnections are re-
quired to support implementation of state machines and
bit-oriented operations. These ﬁne-grained units should be
accessible by the coarse-grained units and vice versa.
B. Architecture
Fig. 1 shows a top-level block diagram of our FPFPGA architecture.
It employs an island-style fine-grained FPGA structure
with dedicated columns for coarse-grained units. Both fine-grained
and coarse-grained units are reconfigurable. The coarse-grained
part contains embedded fixed-function floating-point
adders and multipliers. The connection between coarse-grained
units and ﬁne-grained units is similar to the connection between
embedded blocks (embedded multiplier, DSP block or block
RAM) and ﬁne-grained units in existing FPGA devices.
The coarse-grained logic architecture is optimized to imple-
ment the datapath portion of ﬂoating-point applications. The ar-
chitecture of each block, inspired by previous work [4], [12],
is shown in Fig. 2. Each block consists of a set of ﬂoating-
point multipliers, adder/subtractors, and general-purpose bit-
blocks connected using a unidirectional bus-based interconnect
architecture. Each of these blocks will be discussed in this sec-
tion. To keep our discussion general, we have parameterized the
architecture, as shown in Table I. There are subblocks in each
coarse-grained block. of these subblocks are floating-point
multipliers, another of them are floating-point adders, and the
rest are general-purpose wordblocks. Speciﬁc values
of these parameters will be given in Section VI.
The core of each coarse-grained block contains multiplier
and adder/subtractor subblocks. Each of these blocks has a
reconﬁgurable registered output, and associated control input
and status output signals. The control signal is a write enable
signal that controls the output register. The status signals report
the subblock’s status ﬂags and include those deﬁned in IEEE
standard as well as a zero and sign ﬂag. The ﬁne-grained unit
can monitor these ﬂags via the routing paths between them.
Each coarse-grained block also contains general-purpose
wordblocks. Each wordblock contains identical bitblocks,
and is similar to our earlier published design [12]. A bitblock
contains two 4-input LUTs and a reconﬁgurable output register.
The value of depends on the bit-width of the coarse-grained
block. Bitblocks within a wordblock are all controlled by
the same set of conﬁguration bits, so all bitblocks within a
wordblock perform the same function. A wordblock, which
includes a register, can efﬁciently implement operations such
as ﬁxed-point addition and multiplexing. Like the multiplier
and adder/subtractor blocks, wordblocks generate status ﬂags
such as MSB, LSB, carry out, overﬂow, and zero; these signals
can be connected to the ﬁne-grained units.
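The wordblock organization described above can be sketched in Python as follows. Every bitblock evaluates the same two 4-input LUT configurations, here set up as a full adder so that the wordblock performs fixed-point addition. The way the carry is passed between neighboring bitblocks, the use of only three of the four LUT inputs, and all function names are assumptions made for illustration; the source does not specify them.

```python
def make_lut4(fn):
    """Return the 16 configuration bits of a 4-input LUT implementing fn."""
    return [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1 for i in range(16)]


def lut4_eval(bits, a, b, c, d):
    return bits[(a << 3) | (b << 2) | (c << 1) | d]


# One shared configuration for every bitblock (assumed full-adder mapping).
SUM_LUT = make_lut4(lambda a, b, cin, _unused: a ^ b ^ cin)
CARRY_LUT = make_lut4(lambda a, b, cin, _unused: (a & b) | (cin & (a ^ b)))


def wordblock_add(x, y, width=32):
    """Add two unsigned words bit by bit; return (result, status flags)."""
    result, carry, carry_into_msb = 0, 0, 0
    for i in range(width):  # bit 0 is the LSB
        a, b = (x >> i) & 1, (y >> i) & 1
        if i == width - 1:
            carry_into_msb = carry
        result |= lut4_eval(SUM_LUT, a, b, carry, 0) << i
        carry = lut4_eval(CARRY_LUT, a, b, carry, 0)
    flags = {
        "msb": (result >> (width - 1)) & 1,
        "lsb": result & 1,
        "carry_out": carry,
        "overflow": carry ^ carry_into_msb,  # two's-complement overflow
        "zero": int(result == 0),
    }
    return result, flags
```

Since the two LUT configurations are shared by the whole word, the configuration cost does not grow with the word width, which is the density argument made for wordblocks.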
Apart from the control and status signals, there are input
buses and output buses connected to the ﬁne-grained units.
Each subblock can only accept inputs from the left, simplifying
the routing. To allow more ﬂexibility, feedback registers have
been employed so that a block can accept the output from the
right block through the feedback registers. For example, the
ﬁrst block can only accept input from input buses and feedback
registers, while the second block can accept input from input
buses, the feedback registers, and the output of the ﬁrst block.
Each ﬂoating-point multiplier is logically located to the left of
a ﬂoating-point adder so that no feedback register is required to
support multiply-and-add operations. The coarse-grained units
can support multiply-accumulate functions by utilizing the feedback
registers. The bus width of the coarse-grained units is 32
bits for the single-precision FPFPGA and 64 bits for double
precision.
Switches in the coarse-grained unit are implemented using
multiplexers and are bus-oriented. A single set of conﬁguration
bits is required to control each multiplexer, improving density
compared to a ﬁne-grained fabric.
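The left-to-right input rule of the coarse-grained unit can be illustrated with a short Python sketch. The signal names and the two helper functions are hypothetical; only the rule itself (inputs come from the input buses, the feedback registers, and subblocks further to the left) is taken from the description above.

```python
def legal_sources(index, num_input_buses, num_feedback_regs):
    """List the signals that subblock `index` (0 = leftmost) may select as input."""
    sources = [f"in_bus{i}" for i in range(num_input_buses)]
    sources += [f"feedback{i}" for i in range(num_feedback_regs)]
    # Outputs of subblocks strictly to the left are directly reachable.
    sources += [f"subblock{i}_out" for i in range(index)]
    return sources


def needs_feedback(producer, consumer):
    """A value produced at or to the right of its consumer must pass through
    a feedback register."""
    return producer >= consumer
```

With a multiplier at position 1 and an adder at position 2, `needs_feedback(1, 2)` is false (multiply-and-add needs no feedback register), while `needs_feedback(2, 2)` is true (accumulation routes the adder output back through a feedback register).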
IV. EXAMPLE MAPPING
To illustrate how our architecture can be used to implement
a datapath, we use the example of a ﬂoating-point matrix mul-
tiply. Fig. 3 illustrates the example datapath and the implemen-
tation of this datapath on our architecture. In this example, we
assume an architecture in which the multiplication subblocks
are located in the second and sixth subblocks within the archi-
tecture, and ﬂoating-point adder/subtractor units are located in
the third and the seventh subblocks.
The datapath of this example application can be implemented
using two coarse-grained blocks. The datapath produces the re-
sult of the equation . The ﬁrst
coarse-grained unit performs two multiplications and one ad-
dition. The result is forwarded to the next coarse-grained
unit. The second coarse-grained unit performs one multiplication
and one addition. However, as all multiplications start in the
same clock cycle, the last addition cannot start until is ready.
In order to synchronize the arrival time of and , another
floating-point adder (FA2) in the second coarse-grained
block is instantiated as a first-in, first-out (FIFO) buffer with the
same latency as FA6 in CGU0. This demonstrates an alternate
use of a coarse-grained unit. Finally, and are added together
and the state machine sends the result to the block RAM.
All FPU subblocks have an enabled registered output to further
pipeline the datapath.
Fig. 2. Architecture of the coarse-grained unit.
Fig. 3. Example mapping for matrix multiplication.
TABLE I
PARAMETERS FOR THE COARSE-GRAINED UNIT
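The latency balancing in this mapping can be sketched in Python. The equation itself is not legible in this copy, so the sketch assumes a sum of three products, which is inferred from the description (two multiplications and one addition in the first unit, one multiplication and one addition in the second). The latency values and the function name are hypothetical.

```python
# Hypothetical pipeline latencies in clock cycles (not taken from the paper).
MUL_LATENCY = 5
ADD_LATENCY = 7


def balance_dot3(a, b):
    """Schedule a0*b0 + a1*b1 + a2*b2 across two coarse-grained units.

    Returns (result, extra_delay), where extra_delay is the number of cycles
    the third product must be held so that it meets the partial sum at the
    last adder.
    """
    # All three multiplications start in cycle 0 and finish together.
    products = [x * y for x, y in zip(a, b)]
    product_ready = MUL_LATENCY

    # First unit: add the first two products.
    partial = products[0] + products[1]
    partial_ready = product_ready + ADD_LATENCY

    # Second unit: the third product waits in a delay element (a spare adder
    # used as a FIFO in the text) until the partial sum arrives.
    extra_delay = partial_ready - product_ready
    result = partial + products[2]
    return result, extra_delay
```

Under these assumptions the required delay equals one adder latency, which is consistent with using a spare adder of the same latency as the delay element.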
V. FPFPGA COMPILATION
While a traditional HDL design flow can be used in translating
applications to our FPFPGA, the procedure is tedious
and the designers have to fully understand the usage of the
coarse-grained units in order to manually map the circuit effec-
tively. A domain-speciﬁc hardware compiler, which can map
a subset of a high-level language to the proposed architecture,
is useful in developing applications on such an FPFPGA.
In addition, the hardware compiler is beneﬁcial during the
development of the FPFPGA itself since the compiler can be
used to generate benchmark circuits. Although we have not
implemented such a compiler, this section proposes the basic
requirements of the compiler and discusses how some of the
design challenges can be addressed.
The basic requirements of the FPFPGA compiler are as
follows.
1) The compiler should contain a set of predeﬁned
built-in functions that represent the functionality in
the coarse-grained unit. For example, the compiler can
provide floating-point functions such as fadd() and fmul() (or
even better, overloaded operators such as “+” or “*”) that
are associated with the floating-point operators in the coarse-grained
unit. This feature allows application designers to infer the
coarse-grained units easily.
2) It should have the ability to differentiate the control logic
and the datapath. This feature would allow the technology
mapper to handle the control logic and the datapath sep-
arately. Since the control logic can be efﬁciently imple-
mented using the ﬁne-grained logic, a standard hardware
compilation technique such as [15] can be used. The data-
path, which is usually much more complicated, can be
mapped to coarse-grained units whenever it is possible.
3) The compiler should contain a parametrizable technology
mapper for the coarse-grained architecture. Since the architecture
is parametrized for design exploration, the technology mapper
should map to devices with differing amounts of coarse-grained
resources. For example, the technology mapper
should be aware of the number of floating-point operators
in a coarse-grained unit so it can fully utilize all the operators
in a unit. This feature would allow FPGA designers to
evaluate new architectures effectively by compiling benchmark
circuits with modified architectural parameters.
4) The compiler should contain an intelligent resource al-
location algorithm. It should be aware of the function-
ality of the coarse-grained unit and decide if the given
operation is best implemented by coarse-grained units or
ﬁne-grained units. For example, if the compiler receives a
“square root” instruction but there is no square root func-
tion in the coarse-grained units, the allocation algorithm
can infer a square root operator using fine-grained units
instead.
5) Support is required for bitstream generation for coarse-
grained units. Such a feature is necessary to determine the
delay of a mapped coarse-grained unit.
Requirements 1, 4, and 5 have been studied in other contexts
[16], and requirement 2 has been addressed in [17] in which the
authors propose a compiler that can produce separate circuits
for control logic and datapath for ﬂoating-point applications.
Requirement 3 is new, and is speciﬁc for our architecture. One
approach to creating this tool would be to develop a dedicated
technology mapper for the coarse-grained units within the Tri-
dent framework [17]. A bitstream generator for coarse-grained
units can be integrated into the framework as well. This is on-
going work.
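The resource allocation requirement (requirement 4) can be illustrated with a small Python sketch. Since no compiler has been implemented, the greedy policy, the function name `allocate`, and the operation names are purely hypothetical; the sketch only shows the fallback from coarse-grained to fine-grained resources.

```python
def allocate(operations, cgu_capacity):
    """Assign each operation to 'coarse' or 'fine' resources.

    operations   -- list of operation names, e.g. ["fmul", "fadd", "fsqrt"]
    cgu_capacity -- dict: operation name -> number of coarse-grained
                    operators of that type still available
    """
    remaining = dict(cgu_capacity)
    assignment = []
    for op in operations:
        if remaining.get(op, 0) > 0:
            remaining[op] -= 1
            assignment.append((op, "coarse"))
        else:
            # No matching coarse-grained operator (or all are in use):
            # fall back to fine-grained logic.
            assignment.append((op, "fine"))
    return assignment
```

A real allocator would also weigh timing and routing cost; this sketch deliberately ignores both.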
VI. MODELING METHODOLOGY
In this section, we describe the methodology we use to model
our architecture. We employ an experimental approach and use
the concept of VEBs to model the embedded coarse-grained
blocks. The following sections ﬁrst describe the benchmark
circuits we used, followed by a description of the VEB
methodology.
A. Benchmark Circuits
Eight benchmark circuits are used in this study, as shown in
Table II. Five are computational kernels, one is a Monte Carlo
simulation datapath, and two are synthetic circuits. All benchmark
circuits involve single-precision floating-point operations. We
choose these circuits since they are representative of the applications
we envision being used on an FPFPGA. We note that the
strong representation of simple ﬂoating-point kernels that map
directly to the CGU favorably inﬂuences the overall density and
performance metrics, so our results can be considered an upper
bound. Dependencies, mapping, control, and interfacing are is-
sues likely to degrade performance.
TABLE II
BENCHMARK CIRCUITS
Fig. 4. Modeling flow overview.
The bfly benchmark performs the computation
where the inputs and output are complex numbers; this is
commonly used within a fast Fourier transform computation.
The dscg circuit is the datapath of a digital sine-cosine generator.
The fir circuit is a four-tap finite-impulse response
filter. The mm3 circuit performs a 3 × 3 matrix multiplication.
The ode circuit solves an ordinary differential equation. The
bgm circuit computes Monte Carlo simulations of interest
rate model derivatives priced under the Brace, Gatarek, and
Musiela (BGM) framework [18]. All the word lengths of the
aforementioned circuits are 32 bit.
In addition, a synthetic benchmark circuit generator based on
[19] is used. The generator can produce ﬂoating-point circuits
from a characterization ﬁle describing circuit and cluster sta-
tistics. Two synthetic benchmark circuits are produced. Circuit
syn2 contains ﬁve ﬂoating-point adders and four ﬂoating-point
multipliers. Circuit syn7 contains 25 ﬂoating-point adders and
Fig. 5. Modeling coarse-grained unit in FPGAs using VEBs.
25 ﬂoating-point multipliers. The syn7 circuit is considerably
larger than the other benchmarks.
B. VEB Methodology
To model the mapping of our benchmark circuits on the architecture
described in Section III, we employ the VEB methodology.
This methodology allows us to quantify the impact of embedding
our block into a modern FPGA using commercial CAD
tool optimizations. This is in contrast to VPR-based methodologies
that assume a bare-bones island-style FPGA (without carry
chains and with a simpliﬁed routing architecture) and do not
employ modern optimizations such as physical synthesis and
retiming.
Fig. 4 illustrates the modeling ﬂow using the VEB method-
ology. The input is a high-level application description and the
output is an FPGA bitstream. The application is first broken into
control logic and datapath portions. Since we do not yet have
a complete implementation of a suitable compiler, we perform
this step manually.
The datapath portion is then mapped to the embedded
ﬂoating-point blocks (again, this is currently done manually).
An example of this mapping was given in Section IV. The result
of this step is a netlist containing black boxes representing those
parts of the circuit that will be mapped to embedded blocks,
and ﬁne-grained logic elements representing those parts of
the circuit that will be mapped to LUTs in the cases where no
suitable embedded block is found or all have been used.
Unfortunately, this netlist cannot be implemented directly
using commercial FPGA CAD tools, since the corresponding
commercial FPGAs do not contain our floating-point embedded
blocks. The basic strategy in our VEB ﬂow is to use selected
logic resources of a commercial FPGA (called the host FPGA)
to match the expected position, area, and delay of an ASIC
implementation of the coarse-grained units, as shown in Fig. 5.
To employ this methodology, area and delay models for the
coarse-grained units are required. To estimate the area, we syn-
thesize an ASIC description of each coarse-grained block using
a comparable technology. For instance, 0.13-μm technology is
used in synthesizing the ASIC block embedded in a Virtex II
device, which, in turn, uses a 0.15-μm/0.12-μm process. Normalization
to the feature size is then applied to obtain a more
accurate area estimation. We employ a parameterized synthesizable
IEEE 754 compliant floating-point library [20]. The library
supports four rounding modes and denormalized numbers. A
ﬂoating-point multiplier and ﬂoating-point adder are generated
and synthesized using a regular standard cell library ﬂow.
The area of the coarse-grained block is then translated into
equivalent LC resources in the virtual FPGA. In order to make
this translation, an estimate of the area of an LC in the FPGA
is required, where an LC refers to a four-input LUT and an as-
sociated output register. The area estimation includes the asso-
ciated routing resources and conﬁguration bits. All area mea-
sures are normalized by dividing the actual area by the square
of the feature size, making them independent of feature size.
VEB utilization can then be computed as the normalized area
of the coarse-grained unit divided by the normalized area of an
LC. This value is in units of equivalent LCs, and the mapping
enables modeling of coarse-grained units using existing FPGA
resources. In addition, special consideration is given to the interface
between the LCs and the VEB to ensure that the corresponding
VEBs have sufficient I/O pins to connect to the routing
resources. This can be verified by keeping track of the number
of inputs and outputs that connect to the global routing resources
in an LC. For example, if an LC only has two outputs, it is not
possible to have a VEB with an area of four LCs that requires
nine outputs. For such a case, the area is increased to ﬁve LCs.
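The area translation described above can be sketched as a small calculation in Python. The function names, the rounding up to whole LCs, and all numeric inputs are assumptions for illustration; only the normalization by the square of the feature size and the I/O pin check follow the text.

```python
import math


def normalized_area(area_um2, feature_size_um):
    """Area divided by the square of the feature size (dimensionless)."""
    return area_um2 / feature_size_um ** 2


def veb_size_in_lcs(cgu_area_um2, cgu_feature_um, lc_area_um2, lc_feature_um,
                    veb_outputs, outputs_per_lc):
    """Equivalent LC count of a VEB, enlarged if needed to expose enough outputs."""
    by_area = math.ceil(normalized_area(cgu_area_um2, cgu_feature_um)
                        / normalized_area(lc_area_um2, lc_feature_um))
    by_pins = math.ceil(veb_outputs / outputs_per_lc)
    return max(by_area, by_pins)
```

For instance, a block whose area corresponds to four LCs but which needs nine outputs, with two outputs per LC, yields `veb_size_in_lcs(400.0, 1.0, 100.0, 1.0, 9, 2) == 5`, matching the example in the text.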
In order to accurately model the delay, both the logic and the
wiring delay of the virtual FPGA must match that of the host
FPGA. The logic delay of the VEB can be matched by intro-
ducing delays in the FPGA resources. In the case of very small
VEBs, it may not be possible to accurately match the number of
I/O pins, area, or logic delay, which may result in inaccuracies. A
complex coarse-grained unit might have many paths, each with
different delays. In this case, we assume that all delays are equal
to the longest one (i.e., the critical path) as it is the most impor-
tant characteristic of a coarse-grained unit in terms of timing.
In our implementation, area matching is achieved by creating
a dedicated scan chain using shift registers. A longer scan chain
consumes more LCs, and therefore, the VEB is larger. There are
many options available to match the timing of a VEB. We utilize
the fast carry chains present in most FPGAs to generate delays
that emulate the critical path in a VEB. This choice has the added
advantage that relocation of LCs on the FPGA does not affect
the timing of this circuit.
TABLE III
NORMALIZATION ON THE AREA OF THE COARSE-GRAINED UNITS AGAINST A VIRTEX II LC
SP and DP stand for single precision and double precision, respectively. CGU stands for coarse-grained unit. For the values shown in the second
column (area), 15% overheads have already been applied on the coarse-grained units.
It should also be noted that the use of the carry and scan
chains allows delay and area to be varied independently. Mod-
eling wiring delays is more problematic, since the placement
of the virtual FPGA must be similar to that of an FPGA with
coarse-grained units to ensure that their routing is similar. This
requires that: 1) the absolute location of VEBs matches the in-
tended locations of real embedded blocks in the FPGA with
coarse-grained units and 2) the design tools are able to assign instantiations
of VEBs in the netlist to physical VEBs while minimizing
routing delays.
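The independence of area and delay matching can be shown with a simplified Python sketch: the scan chain length is chosen from the target area alone and the carry chain length from the target delay alone. The function name, the per-stage parameters, and the numbers used are hypothetical, and the sketch ignores the LCs occupied by the carry chain itself.

```python
import math


def veb_chains(target_lcs, target_delay_ns, lcs_per_scan_stage,
               carry_stage_delay_ns):
    """Pick chain lengths that emulate a VEB's area and critical-path delay."""
    # The scan chain length depends only on the target area.
    scan_stages = math.ceil(target_lcs / lcs_per_scan_stage)
    # The carry chain length depends only on the target delay.
    carry_stages = math.ceil(target_delay_ns / carry_stage_delay_ns)
    return scan_stages, carry_stages
```

Doubling `target_lcs` changes only the first returned value, and changing `target_delay_ns` changes only the second.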
The first requirement is addressed by locating VEBs at predefined
absolute locations that match the floorplan of the FPGA
with coarse-grained units. To address requirement 2), the assignment
of physical VEBs is currently made by a two-phase
placement strategy that consists of unconstrained placement followed
by manual placement. We first assume that the VEB can
be placed anywhere on the virtual FPGA so that the place and
route tools can determine the most suitable location for each
VEB. Once the optimal VEB locations are known, a manual
placement is applied to ensure that the placement of each VEB
is aligned on dedicated columns while minimizing its displacement
from the optimal location. We believe that this strategy
can provide a reasonable placement as the location of each VEB
is derived from the optimal placement.
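The second placement phase is performed manually in this work. As an illustration only, the following Python sketch automates the same idea: each VEB is moved from its unconstrained location to the nearest free site on a dedicated column. The greedy order, the Manhattan distance metric, and the site grid are assumptions.

```python
def snap_to_columns(optimal_positions, column_xs, slots_per_column):
    """Move each VEB from its unconstrained location to the nearest free column site.

    optimal_positions -- dict: VEB name -> (x, y) from unconstrained placement
    column_xs         -- x coordinates of the dedicated coarse-grained columns
    slots_per_column  -- number of VEB sites per column (sites at y = 0, 1, 2, ...)
    """
    free = {(cx, row) for cx in column_xs for row in range(slots_per_column)}
    placement = {}
    for name, (x, y) in sorted(optimal_positions.items()):
        if not free:
            raise ValueError("more VEBs than column sites")
        # Manhattan distance; ties are broken by the smaller (x, row) tuple.
        site = min(free, key=lambda s: (abs(s[0] - x) + abs(s[1] - y), s))
        free.remove(site)
        placement[name] = site
    return placement
```

Because each VEB is handled in turn, the result depends on processing order; a manual or globally optimized assignment may do better when many VEBs compete for the same column.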
There are inevitable differences between real implementa-
tions and the VEB emulated ones. In our previous work [10],
we compared an actual embedded multiplier with one modeled
using the VEB method. It was found that the timing difference can
be as large as 11% while the area is accurately determined. We
believe such errors are acceptable for the first-order estimations
desired. Once a suitable coarse-grained unit architecture is identified,
a more in-depth analysis using lower level methods such
as SPICE simulation can be performed to conﬁrm the results.
To instantiate all the VEBs and connect them together, we describe
the control logic, instantiate the VEBs explicitly, and
connect the signals between the fine-grained units and coarse-grained
units. The design is then synthesized on the target device
and a device-specific netlist is generated. The timing of the
VEBs is also speciﬁed in the FPGA synthesis tool.
After generating the netlist of the overall circuit, the
two-phase placement is used to find a near-optimal placement of
VEBs along dedicated columns. We then use the vendor’s place and
route tool to obtain the ﬁnal area and timing results. This
represents the characterization of a circuit implemented on the
FPFPGA with ﬁne-grained units and routing resources exactly
the same as the targeted FPGA.
It is important to note that correct timing information cannot
be determined before programming the configuration bits.
Without them, the tool reports the worst-case scenario where
the longest combinational path from the first wordblock to the
last wordblock is considered as the critical path, and this is
usually not the correct timing in most designs. To address this
issue, the tool has to recognize the configuration of the
coarse-grained units before the timing analysis. Therefore, a
set of configurations is generated during manual mapping, and
the associated bitstream can be used in timing analysis. This
bitstream can be imported to the timing analysis tool, so the
tool can identify false paths during timing analysis and
produce correct timing for that particular configuration.
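The effect of configuration on reported timing can be illustrated with a toy model. The sketch below is not the vendor timing tool or the authors' flow: it treats hypothetical subblocks `b0` to `b3` as nodes with made-up delays, computes the longest combinational path over every connection the fabric could make, and then repeats the calculation over only the connections that one configuration enables. The function name, the delays, and the edge lists are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def longest_path(delay: Dict[str, float], edges: List[Edge]) -> float:
    """Longest combinational delay through a DAG whose nodes carry delays."""
    succ: Dict[str, List[str]] = {n: [] for n in delay}
    for src, dst in edges:
        succ[src].append(dst)
    memo: Dict[str, float] = {}

    def longest_from(node: str) -> float:
        # Delay of the longest path that starts at `node`.
        if node not in memo:
            tail = max((longest_from(n) for n in succ[node]), default=0.0)
            memo[node] = delay[node] + tail
        return memo[node]

    return max(longest_from(n) for n in delay)


# Hypothetical per-subblock delays in nanoseconds.
delays = {"b0": 2.0, "b1": 3.0, "b2": 2.5, "b3": 4.0}

# Every connection the unconfigured fabric could form (first to last block).
all_edges = [("b0", "b1"), ("b1", "b2"), ("b2", "b3")]
# Connections actually enabled by one configuration; b1 -> b2 is a false path.
active_edges = [("b0", "b1"), ("b2", "b3")]

worst_case = longest_path(delays, all_edges)     # 11.5 ns
configured = longest_path(delays, active_edges)  # 6.5 ns
```

In this toy case the unconfigured analysis reports a path through all four subblocks, while the configured analysis reports only the longer of the two enabled segments.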
VII. RESULTS
In this section, we present an evaluation of our architecture.
The ﬂow described in the previous section is employed.
The best-ﬁt architecture can be determined by varying the
parameters to produce a design with maximum density over
the benchmark circuits. Additional wordblocks are included,
allowing more flexibility for implementing circuits outside of
the benchmark set. Manual mappings are performed for each
benchmark. A more in-depth analysis on how these parameters
affect the application performance is ongoing work.
For the single-precision FPFPGA device, a Xilinx
XC2V3000-6-FF1152 FPGA is used as the host, and we
assume 16 coarse-grained units. We emphasize that the
parameter settings chosen for the coarse-grained block are
fixed over the entire set of benchmarks, each coarse-grained
unit having nine subblocks, four input buses, three output
buses, three feedback registers, two floating-point adders,
and two floating-point multipliers. We assume that the two
floating-point multipliers in the coarse-grained unit are
located at the second and the sixth subblocks. The two
floating-point adders are located in the third and the seventh
subblocks.
The coarse-grained blocks constitute 7% of the total area of
an XC2V3000 device. All FPGA results are obtained using
Synplicity Synplify Premier 9.0 for synthesis and Xilinx ISE
9.2i design tools for place and route. All ASIC results are
obtained using Synopsys Design Compiler V-2006.06.
The physical die area and photomicrograph of a Virtex II
device have been reported [21], and the normalization of the
area of the coarse-grained unit is estimated in Table III. From
inspection of the die photograph, we estimate that 60% of the
total die area is used for LCs.
Fig. 6. Comparisons of FPFPGA and Xilinx Virtex II FPGA device.
Fig. 7. Floorplan of the single-precision bgm circuit on Virtex II FPGA and
FPFPGA. Area is significantly reduced by introducing coarse-grained units.
This means that the area of a Virtex II LC is 5456 μm².
This number is normalized against the feature size (0.15 μm).
A similar calculation is used for the coarse-grained units. The
ASIC synthesis tool reports that the area of a single-precision
coarse-grained unit is 433 780 μm². We further assume a 15%
overhead after placing and routing the design, based on our
experience [12]. The area values are normalized against the
feature size (0.13 μm). The number of equivalent LCs is
obtained by dividing the coarse-grained unit area by the slice
area. This shows that a single-precision coarse-grained unit is
equivalent to 122 LCs. Assuming each LC has two outputs, the
VEB allows a maximum of 244 output pins while the
coarse-grained unit consumes only 162 output pins. Therefore,
we do not need to further adjust the area.
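The equivalent-LC figure can be reproduced from the numbers quoted above. The short calculation below reads the areas as square micrometers and the feature sizes as micrometers, and assumes that normalizing against the feature size means dividing each area by the square of its feature size. That interpretation is an assumption, though it does reproduce the 122 LCs stated in the text. Variable names are illustrative.

```python
# Figures quoted in the text.
LC_AREA_UM2 = 5456.0       # area of one Virtex II LC
LC_FEATURE_UM = 0.15       # feature size used to normalize the LC area
CGU_AREA_UM2 = 433_780.0   # single-precision coarse-grained unit (ASIC synthesis)
CGU_FEATURE_UM = 0.13      # feature size used to normalize the unit area
PNR_OVERHEAD = 0.15        # assumed overhead after place and route
CGU_OUTPUT_PINS = 162      # output pins used by one coarse-grained unit

# Assumption: normalization divides each area by its feature size squared.
lc_norm = LC_AREA_UM2 / LC_FEATURE_UM ** 2
cgu_norm = CGU_AREA_UM2 * (1 + PNR_OVERHEAD) / CGU_FEATURE_UM ** 2

# Number of LCs whose combined area matches one coarse-grained unit.
equivalent_lcs = round(cgu_norm / lc_norm)

# Each LC is assumed to offer two outputs, as stated in the text.
veb_output_pins = 2 * equivalent_lcs
needs_area_adjustment = CGU_OUTPUT_PINS > veb_output_pins
```

The unrounded ratio is about 121.7, which rounds to 122; the 244 available output pins exceed the 162 required, so the VEB area does not need to grow to provide more pins.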
Single-precision FPFPGA results are shown in Table IV(a)
and Fig. 6(a) and (b). A comparison between the floorplan of
the Virtex II device and the floorplan of the FPFPGA for the
bgm circuit is illustrated in Fig. 7.
TABLE IV
FPFPGA IMPLEMENTATION RESULTS
Values in the brackets indicate the percentages of LC used in corresponding FPGA device. CGU stands for coarse-grained unit and FGU stands for
fine-grained unit.
The FPU implementation on FPGA is based on the work in
[22]. This implementation supports denormalized floating-point
numbers that are required in the comparison with the FPFPGA.
The FPU area for the XC2V3000 device [seventh column of
Table IV(a)] is estimated from the distribution of LUTs, which
is reported by the FPGA synthesis tool. The logic area (eighth
column) is obtained by subtracting the FPU area from the total
area reported by the place and route tool. As expected, FPU
logic occupies most of the area, typically more than 90% of the
user circuits. While the syn7 circuit cannot fit in an XC2V3000
device, it can be tightly packed into a few coarse-grained
blocks. The circuit syn7 has 50 FPUs that consume 214% of the
total FPGA area. They can fit into 16 coarse-grained units,
which constitute just 6.8% of the total FPGA area.
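The syn7 figures can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (50 FPUs, 214% and 6.8% of the device area, 16 coarse-grained units with two adders and two multipliers each); the variable names are illustrative. The slot count is a necessary condition only, since a real mapping also depends on the mix of additions and multiplications and on routing.

```python
# Figures quoted in the text for the syn7 benchmark on the XC2V3000.
fpu_area_pct = 214.0   # area of the 50 FPUs built from fine-grained logic
cgu_area_pct = 6.8     # area of the 16 coarse-grained units
num_fpus = 50
num_cgus = 16
fpus_per_cgu = 2 + 2   # two adders and two multipliers per unit

# Total floating-point operator slots offered by the coarse-grained units.
fpu_slots = num_cgus * fpus_per_cgu

# How many times smaller the floating-point logic becomes for this circuit.
area_ratio = fpu_area_pct / cgu_area_pct
```

Here `fpu_slots` is 64, which covers the 50 FPUs, and `area_ratio` is roughly 31, which covers the floating-point portion of this one circuit only.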
Similar experiments for double-precision ﬂoating-point ap-
plications have been conducted, and the results are reported in
Table IV(b) and Fig. 6(c) and (d). In double-precision ﬂoating-
point FPFPGA, we use the XC2V6000 FPGA as the host FPGA
and the comparison is done on the same device.
For both single- and double-precision benchmark circuits, the
proposed architecture reduces the area by a factor of 25 on
average, a significant reduction. The saving is achieved by:
1) embedded floating-point operators; 2) efficient directional
routing; and 3) sharing configuration bits. On larger circuits,
or on circuits with a smaller ratio of floating-point
operations to random logic, the improvement will be less
significant. However, the reported ratio gives an indication of
the improvement possible if the architecture is well matched to
the target applications. In essence, our architecture stands
between ASIC and FPGA implementation. The authors in [2]
suggest that the ratio of silicon area required to implement
circuits in FPGAs and ASICs is on average 35. Our proposed
architecture can reduce the area gap between FPGA and ASIC from
35 times to 1.4 times (35/25 = 1.4) when floating-point
applications are implemented on such FPGAs.
The delay reduction is also significant. In our benchmark
circuits, delay is reduced by 3.6 times on average for
single-precision applications and 4.3 times on average for
double-precision applications. We believe that
double-precision floating-point implementation on a commercial
FPGA platform is not as effective as the single-precision one.
Therefore, the double-precision FPFPGA offers better delay
reduction than the single-precision one. In our circuits, the
critical path is always within the embedded FPUs; thus, we
would expect a ratio similar to that between normal FPGA and
ASIC circuitry. Our results are consistent with [2], which
suggests the ratio is between 3 and 4. As the critical paths
are in the FPU, improving the timing of the FPU through
full-custom design would further increase the overall
performance.
VIII. CONCLUSION
We propose an FPFPGA architecture that involves a combination
of reconfigurable fine-grained and reconfigurable
coarse-grained units optimized for floating-point computations.
A parameterizable description is presented that allows us to
explore different configurations of this architecture. To
provide a more accurate evaluation, we adopt a methodology for
estimating the effects of introducing embedded blocks to
commercial FPGA devices. The approach is vendor independent and
offers a rapid evaluation of arbitrary embedded blocks in
existing FPGA devices. Using this approach, we show that the
proposed FPFPGA
enjoys improved speed and density over a conventional FPGA
for ﬂoating-point intensive applications. The area can be re-
duced by 25 times and the frequency is increased by four times
on average when comparing the proposed architecture with an
existing commercial FPGA device. Current and future work
includes developing automated design tools supporting facili-
ties such as partitioning for coarse-grained units, and exploring
further architectural customizations for a large number of do-
main-speciﬁc applications.
REFERENCES
[1] Architecture and CAD for Deep-Submicron FPGAs, V. Betz, J. Rose,
and A. Marquardt, Eds. Norwell, MA: Kluwer, 1999.
[2] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,”
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2,
pp. 203–215, Feb. 2007.
[3] K. Compton and S. Hauck, “Totem: Custom reconﬁgurable array gen-
eration,” in Proc. FCCM, 2001, pp. 111–119.
[4] A. Ye and J. Rose, “Using bus-based connections to improve ﬁeld-
programmable gate-array density for implementing datapath circuits,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 5, pp.
462–473, May 2006.
[5] E. Roesler and B. Nelson, “Novel optimizations for hardware ﬂoating-
point units in a modern FPGA architecture,” in Proc. FPL, 2002, pp.
637–646.
[6] M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert,
“Architectural modifications to enhance the floating-point performance
of FPGAs,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16,
no. 2, pp. 177–187, Feb. 2008.
[7] K. Leijten-Nowak and J. L. van Meerbergen, “An FPGA architecture
with enhanced datapath functionality,” in Proc. FPGA, 2003, pp.
195–204.
[8] A. Ye, J. Rose, and D. Lewis, “Architecture of datapath-oriented
coarse-grain logic and routing for FPGAs,” in Proc. IEEE Custom
Integr. Circuits Conf. (CICC), 2003, pp. 61–64.
[9] L. Beck, A Place-and-Route Tool for Heterogeneous FPGAs. Dis-
tributed Mentor Project Report. Ithaca, NY: Cornell Univ. Press,
2004.
[10] C. Ho, P. Leong, W. Luk, S. Wilton, and S. Lopez-Buedo, “Virtual
embedded blocks: A methodology for evaluating embedded elements
in FPGAs,” in Proc. FCCM, 2006, pp. 35–44.
[11] C. Ho, C. Yu, P. Leong, W. Luk, and S. Wilton, “Domain-speciﬁc
FPGA: Architecture and ﬂoating point applications,” in Proc. FPL,
2007, pp. 196–201.
[12] S. Wilton, C. Ho, P. Leong, W. Luk, and B. Quinton, “A synthesizable
datapath-oriented embedded FPGA fabric,” in Proc. FPGA, 2007, pp.
33–41.
[13] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-
submicron FPGA performance and density,” IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 12, no. 3, pp. 288–298, Mar. 2004.
[14] Xilinx, Inc., San Jose, CA, “Floating-point operator v3.0. Product
specification,” 2005.
[15] I. Page and W. Luk, Compiling Occam Into FPGAs. Oxford, U.K.:
Abingdon EE&CS Books, 1991, pp. 271–283.
[16] Agility Design Solution, Inc., Palo Alto, CA, “Software product de-
scription for DK design suite Version 5.0,” 2008.
[17] J. Tripp, M. Gokhale, and K. Peterson, “Trident: From high-level
language to hardware circuitry,” Computer, vol. 40, no. 3, pp. 28–37,
Mar. 2007.
[18] G. Zhang, P. Leong, C. H. Ho, K. H. Tsoi, C. Cheung, D.-U. Lee, R.
Cheung, and W. Luk, “Reconﬁgurable acceleration for Monte Carlo
based ﬁnancial simulation,” in Proc. ICFPT, 2005, pp. 215–222.
[19] P. D. Kundarewich and J. Rose, “Synthetic circuit generation using
clustering and iteration,” IEEE Trans. Comput.-Aided Des. Integr. Cir-
cuits Syst., vol. 23, no. 6, pp. 869–887, Jun. 2004.
[20] Synopsys, Inc., Mountain View, CA, “DesignWare building block IP,
Datapath—Floating point overview,” 2007.
[21] C. Yui, G. Swift, and C. Carmichael, “Single event upset susceptibility
testing of the Xilinx Virtex II FPGA,” presented at the Military Aerosp.
Appl. Program. Logic Conf. (MAPLD), Laurel, MD, 2002.
[22] R. Usselmann, “Open ﬂoating point unit,” 2005. [Online]. Available:
http://www.opencores.org/project.cgi/web/fpu/overview
Chun Hok Ho (S’06) received the B.Eng. degree
(Honors) in computer engineering and the M.Phil.
degree in computer science and engineering from
the Chinese University of Hong Kong, Hong Kong,
in 2001 and 2003, respectively. He is currently
pursuing the Ph.D. degree in the Custom Computing
Group, Department of Computing, Imperial College,
London, U.K.
His research interests include computer arithmetic,
computer architecture, design automation, and optimization.
Mr. Ho was a recipient of several awards, such as Oversea Research Students
Awards, and the Stamatis Vassiliadis Awards from International Conference on
Field Programmable Logic and Applications in 2007 and 2008.
Chi Wai Yu received the B.Eng. degree from the Chinese
University of Hong Kong, Hong Kong. He is currently pursuing
the Ph.D. degree in the Department of Computing, Imperial
College, London, U.K.
His research interest is in FPGA architecture.
Philip Leong (M’84–SM’02) received the B.Sc.,
B.E., and Ph.D. degrees from the University of
Sydney, Sydney, Australia, in 1986, 1988, and 1993,
respectively.
In 1993, he was a consultant to ST Microelec-
tronics, Milan, Italy, where he worked on advanced
flash memory-based integrated circuit design. From
1994 to 1997, he was a Lecturer with the University
of Sydney. Since 1997, he has been with the Depart-
ment of Computer Science and Engineering, Chinese
University of Hong Kong, Hong Kong, where he
is a Professor and the director of the Custom Computing Laboratory. He is
also Visiting Professor at Imperial College, London, and the Chief Technology
Consultant to Cluster Technology. His research interests include reconﬁgurable
computing, signal processing, computer architecture, computer arithmetic and
biologically inspired computing.
Dr. Leong was a recipient of the 2005 FPT Conference Best Paper as well
as the 2007 and 2008 FPL conference Stamatis Vassiliadis Outstanding Paper
Awards. He was program co-chair of the FPT and FPL conferences and is an
Associate Editor for the ACM Transactions on Reconﬁgurable Technology and
Systems. He is the author of more than 100 technical papers and 4 patents.
Wayne Luk (F’09) received the M.A., M.Sc.,
and D.Phil. degrees in engineering and computing
science from the University of Oxford, Oxford, U.K.
He is a Professor of computer engineering with the
Department of Computing, Imperial College London,
London, U.K., and a Visiting Professor with Stanford
University, Stanford, CA, and with Queen’s University
Belfast, Belfast, U.K. His research interests include
theory and practice of customizing hardware
and software for speciﬁc application domains, such
as multimedia, communications, and ﬁnance. Much
of his current work involves high-level compilation techniques and tools for
parallel computers and embedded systems, particularly those containing recon-
ﬁgurable devices such as ﬁeld-programmable gate arrays.
Steven J. E. Wilton (S’86–M’97–SM’03) re-
ceived the M.A.Sc. and Ph.D. degrees in electrical
and computer engineering from the University of
Toronto, Toronto, ON, Canada, in 1992 and 1997,
respectively.
In 1997, he joined the Department of Electrical
and Computer Engineering, University of British
Columbia, Vancouver, BC, Canada, where he is now
an Associate Professor. During 2003 and 2004, he
was a Visiting Professor with the Department of
Computing, Imperial College, London, U.K., and at
the Interuniversity MicroElectronics Center (IMEC), Leuven, Belgium. He has
also served as a consultant for Cypress Semiconductor and Altera Corporation.
His research focuses on the architecture of FPGAs and the CAD tools that
target these devices.
Dr. Wilton was a recipient of Best Paper Awards at the International Con-
ference on Field-Programmable Technology in 2003, 2005, and 2007, and at
the International Conference on Field-Programmable Logic and Applications
in 2001, 2004, 2007, and 2008. In 1998, he won the Douglas Colton Medal for
Research Excellence for his research into FPGA memory architectures. In 2005,
he was the Program Chair for the ACM International Symposium on Field-Pro-
grammable Gate Arrays and the program co-chair for the International Con-
ference on Field Programmable Logic and Applications. In 2007, he was the
Program Co-Chair for the IEEE International Conference on Application-Spe-
ciﬁc Systems, Architectures, and Processors.