Using FPGAs to prototype a self-timed floating point co-processor by Brunvand, Erik L. & Novak, Joe H.
Using FPGAs to Prototype a Self-Timed Floating Point Co-Processor 
Joe H. Novak Erik Brunvand* 
Dept. of Computer Science, University of Utah 
Salt Lake City, UT 84112 
Abstract 
Self-timed circuits offer advantages over their syn-
chronously clocked counterparts in a number of situations. 
However, self-timed design techniques are not widely used 
at present for a variety of reasons. One reason for the lack 
of experimentation with self-timed systems is the lack of 
commercially available parts to support this style of design. 
Field programmable gate arrays (FPGAs) offer an excel-
lent alternative for the rapid development of novel system 
designs provided suitable circuit structures can be imple-
mented. This paper describes a self-timed floating point 
co-processor built using a combination of Actel Field Programmable
Gate Arrays (FPGAs) and semi-custom CMOS
chips. This co-processor implements IEEE standard single-
precision floating point operations on 32-bit values. The 
control is completely self-timed. Data moves between parts 
of the circuit according to local constraints only: there is 
no global clock or global control circuit. 
1 Introduction 
Self-timed circuits are distinguished from clocked syn-
chronous circuits by the absence of a global synchronizing 
clock signal. Instead, circuit elements synchronize locally 
using a handshaking protocol. This protocol requires that 
a circuit begin operation upon receipt of a request signal 
and produce an acknowledge signal when its operation is 
complete. Self-timed circuit techniques are beginning to at-
tract attention as designers confront the problems associated 
with the speed and scale of modern VLSI technology [1].
Many of the problems associated with large VLSI systems 
are related to distributing the global clock to all parts of 
the system. In addition to avoiding these clock distribution 
problems, self-timed circuits can be faster, more robust, and 
easier to design than their globally clocked counterparts. 
*This work is supported in part by NSF award MIP-9111793.
As part of our ongoing investigation into the suitability
of self-timed design in a variety of application domains, we
have designed a completely self-timed IEEE single precision
floating point unit (FPU) [2] using a combination of FPGAs
and semi-custom CMOS. The FPGAs' quick turn around
time was essential to timely completion of the project, and
the self-timed circuit style allowed the FPGA and CMOS 
circuits to cooperate without concern for the relative speeds 
of the different technologies. The floating point unit con-
sists of an adder, fabricated in semi-custom CMOS, and 
a multiplier and divider, implemented with Actel FPGAs. 
Because we are interested in exploring the benefits of self-
timed control, simple algorithms are used for the floating 
point arithmetic operations. However, also due to the self-
timed nature of the circuits, more sophisticated algorithms 
could easily be used to upgrade the FPU after the prototype 
is evaluated. For testing and performance evaluation, the 
self-timed floating point unit was interfaced to a commer-
cial computer, the Atari ST. This particular computer was 
chosen not for its speed, but because its cartridge port allows 
convenient asynchronous communication between the CPU 
and an external device, the FPU in this case [3]. Because the 
FPU is self-timed, we are able to use even a slow (8 MHz)
personal computer as a test platform for the prototype. 
2 Self-Timed Circuits
A self-timed, or asynchronous, circuit does not have a 
globally distributed clock signal to synchronize events. In-
stead, events are initiated locally between parts of the circuit 
using handshaking signals. Synchronization of events is 
controlled by these local handshaking protocols. Self-timed 
protocols are often defined in terms of a pair of signals: 
one that requests or initiates an action, and another that ac-
knowledges that the requested action has been completed. 
One module, the sender, sends a request event (Req) to an-
other module, the receiver. Once the receiver has completed 
the requested action, it sends an acknowledge event (Ack) 
back to the sender to complete the transaction. 
IEEE 1994 CUSTOM INTEGRATED CIRCUITS CONFERENCE 0-7803-1886-2/94 $3.00 © 1994 IEEE
Although self-timed circuits can be designed to implement
their communication protocols in a variety of ways,
the circuits used to build the FPU use two-phase transition
signalling for control and a bundled protocol for data
paths [1]. Using two-phase signalling [4], also known as
transition signalling or event logic, every transition on a
control line, either from low to high or from high to low,
causes an action to be taken or an event to occur. Bundled
data paths are a compromise to complete self-timing in the
data path. Associated with each set of data wires is a pair of
transition control wires encoding a Req and an Ack signal. The
bundled protocol requires that the data be valid at the receiver
before the Req transition is seen by the receiver. This
single-sided local timing constraint is similar to, but weaker
than, the equipotential constraint described by Seitz [4].
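The two-phase bundled protocol described above can be illustrated with a small software model. This is only a behavioral sketch, not the FPU's circuitry: the class and function names (Wire, BundledChannel, transfer) are hypothetical, and the model simply assumes that the sender writes the data before toggling Req, which is the bundling constraint.

```python
class Wire:
    """One control wire. In two-phase signalling every level change is an event."""

    def __init__(self):
        self.level = 0
        self.events = 0

    def toggle(self):
        self.level ^= 1
        self.events += 1


class BundledChannel:
    """Data wires bundled with a Req/Ack pair of transition wires."""

    def __init__(self):
        self.req = Wire()
        self.ack = Wire()
        self.data = None

    def idle(self):
        # Req and Ack at the same level: every request has been acknowledged.
        return self.req.level == self.ack.level

    def send(self, value):
        assert self.idle(), "previous transfer not yet acknowledged"
        self.data = value   # bundling constraint: data is valid first...
        self.req.toggle()   # ...and only then does the Req transition occur

    def receive(self):
        assert not self.idle(), "no request pending"
        value = self.data   # safe to read: data was valid before Req
        self.ack.toggle()   # the Ack transition completes the transaction
        return value


def transfer(words):
    """Push each word through one channel and collect what the receiver saw."""
    channel = BundledChannel()
    seen = []
    for word in words:
        channel.send(word)
        seen.append(channel.receive())
    return seen, channel
```

Each transfer costs exactly one transition on Req and one on Ack, with no return-to-zero phase, so after an odd number of transfers both wires rest at the high level.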
A variety of control circuits have been designed to be 
used with the Actel FPGAs [5, 6]. They include:
Merge The "OR" function for transitions, implemented by 
an XOR gate. The merge's output makes a transition 
in response to a transition on either of its inputs. 
Join The "AND" function for transitions, implemented by a 
C-element. The C-element's output changes state only 
after both of its inputs change state. It is useful for 
synchronizing events. 
Call A module that acts as a hardware subroutine call,
allowing multiple access to a shared resource. The Call
module routes the Req signal from a client to the subroutine,
and after the subroutine acknowledges, routes
the Ack back to the appropriate client. The requests
must be mutually exclusive.
Select A module that steers an input transition to one of two
outputs based on the value of a Boolean select signal. 
The select signal is a bundled data signal with respect 
to the input transition. 
Q-select A module like a select, except the select signal is
not bundled and may be changing even when the Q-select
is sampling that signal. Thus, it requires some
way of sampling the select signal reliably. Because
sampling changing signals reliably requires analog circuits,
this module is approximated in the FPGA implementation
using metastability information found in the
Actel databook [7, 5]. A Q-select can be used in loops
to act as an arbiter for concurrent events.
Toggle A module that routes input transitions alternately
between two outputs. 
Latch A module that latches bundled data signals upon 
receipt of transition control signals. 
Carry Completion Adder A form of adder that reports 
when the addition is complete by sensing when the 
carry chain is complete. 
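The behavior of several of these modules can be sketched in software. The following is a minimal behavioral model of Merge, Join (C-element), Toggle, and Select operating on wire levels; it is not the Actel cell implementation, and the class and method names are assumptions made for illustration. The merge model relies on the stated requirement that its two inputs do not change at the same time.

```python
def merge(a, b):
    """Merge ("OR" for transitions): XOR of the two input levels."""
    return a ^ b


class CElement:
    """Join ("AND" for transitions): output changes only when both inputs agree."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a    # both inputs have made their transition
        return self.out     # otherwise hold the previous state


class Toggle:
    """Routes input transitions alternately to output 0 and output 1."""

    def __init__(self):
        self.outputs = [0, 0]
        self.next_output = 0

    def event(self):
        fired = self.next_output
        self.outputs[fired] ^= 1    # one transition on the chosen output
        self.next_output ^= 1       # the other output gets the next event
        return fired


class Select:
    """Steers an input transition to one of two outputs by a Boolean value."""

    def __init__(self):
        self.outputs = {True: 0, False: 0}

    def event(self, select):
        chosen = bool(select)       # bundled: assumed stable when the event arrives
        self.outputs[chosen] ^= 1
        return chosen
```

Because the C-element holds its output while its inputs disagree, it waits for the slower of two events, which is why it is useful for synchronization.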
Using these modules, algorithms can be implemented 
directly in hardware. This allows for an intuitive design
methodology somewhat different than standard state machine
design. Rather than designing a global control state
machine, the algorithm is mapped directly to the circuits 
that implement it. Control, in the form of signal transitions, 
is passed from one part of the circuit to the next in accor-
dance with the desired sequence of events. Each step in the 
algorithm starts with receipt of a Req signal, takes as long 
or as short a time as it needs to complete its task, and then
acknowledges with an Ack signal to start the next phase of 
the algorithm. 
Because of the self-timed organization, the functionality
of the individual circuits is separated from their performance.
Thus, circuit pieces in a prototype system may
be constructed from technologies that have different performance
characteristics. In our FPU, for example, although
the multiplier and divider are implemented using
Actel FPGAs, the floating point adder is implemented using
a similar cell set in a MOSIS CMOS technology. If a
faster overall system is desired, a new algorithm or a faster
technology can easily be substituted for the existing implementation
[8]. For example, to speed up the FPU, the
FPGAs could be replaced by CMOS chips without retiming
the system.
3 Adder 
The adder is a semi-custom CMOS chip. It was fabricated
in 2.0 micron CMOS through the MOSIS service and was
designed using the PPL integrated circuit design software
developed at the University of Utah [9]. There are six steps
in the addition algorithm used in the FPU, shown in Figure 1.
This is also the block diagram of the adder because the
algorithm is implemented directly in hardware. The direct 
implementation was made possible because of the self-timed 
design paradigm. 
The first step is to unpack the 32-bit input words to extract
the signs, exponents, and mantissas from the IEEE format
operands. Second, the mantissas are aligned for addition.
Third, the addition is performed on the mantissas and the
exponents are adjusted according to the result. Fourth, the
result is normalized according to the IEEE format. Fifth, the
result is rounded. All four rounding modes as defined in the
IEEE standard are implemented in this circuit [2]. Finally,
the result is packed into IEEE format and delivered to the
output of the adder. This organization is clearly suitable
for pipelining to improve the performance of the adder,
although for this first prototype pipelining was not used.
Because the adder circuits are self-timed, adding pipelining 
is as simple as adding pipeline latches between each of the 
stages shown in Figure 1. 
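The six steps can be followed in a short software sketch. This is not the chip's logic: it handles only normal, finite operands whose sum is also normal (or exactly zero), it implements only the round-to-nearest-even mode rather than all four IEEE modes, and the three guard bits and all function names are assumptions chosen for illustration.

```python
import struct

GUARD = 3   # extra low-order bits kept for rounding (an assumption of this sketch)


def float_to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits):
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def unpack(bits):
    """Step 1: extract sign, biased exponent, and 24-bit mantissa."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = (bits & 0x7FFFFF) | 0x800000   # restore the hidden leading 1
    return sign, exponent, mantissa


def pack(sign, exponent, mantissa):
    """Step 6: reassemble the IEEE word, dropping the hidden bit."""
    return (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFF)


def fp_add(a_bits, b_bits):
    sa, ea, ma = unpack(a_bits)
    sb, eb, mb = unpack(b_bits)
    if ea < eb:                               # let a hold the larger exponent
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma

    # Step 2: align, folding shifted-out bits into a sticky bit.
    ma <<= GUARD
    mb <<= GUARD
    shift = ea - eb
    sticky = 1 if mb & ((1 << shift) - 1) else 0
    mb = (mb >> shift) | sticky

    # Step 3: add or subtract the magnitudes.
    if sa == sb:
        sign, m = sa, ma + mb
    elif ma >= mb:
        sign, m = sa, ma - mb
    else:
        sign, m = sb, mb - ma
    if m == 0:
        return 0                              # exact cancellation gives +0
    e = ea

    # Step 4: normalize so the leading 1 sits at bit 23 + GUARD.
    while m >= 1 << (24 + GUARD):
        m = (m >> 1) | (m & 1)                # keep the sticky information
        e += 1
    while m < 1 << (23 + GUARD):
        m <<= 1
        e -= 1

    # Step 5: round to nearest, ties to even.
    low = m & 0b111
    m >>= GUARD
    if low > 0b100 or (low == 0b100 and m & 1):
        m += 1
        if m == 1 << 24:                      # rounding carried out of the mantissa
            m >>= 1
            e += 1
    return pack(sign, e, m)
```

For example, fp_add(float_to_bits(1.0), float_to_bits(2.0)) returns 0x40400000, the single-precision encoding of 3.0. Each numbered step in the sketch consumes only the results of the step before it, which is what makes the stage boundaries natural places for pipeline latches.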
The adder is packaged in a 65-pin PGA. It consists of 
15090 transistors in a die area of 400 mil². A total of 49
user I/O pins are used. The adder was tested for functionality 
by applying a large number of test patterns to the simulation,
and then using the same patterns on the fabricated chips. To
measure performance a test platform was constructed that 
sends data to the chip as quickly as the chip can respond.
A state machine can send data to the chip in response to the
acknowledge from the previous data. Using this platform,
this unpipelined version of the adder chip was measured at
256.4 KFLOPS.
Figure 1: Adder Block Diagram
4 Multiplier and Divider 
A serial-parallel or "pencil and paper" algorithm is used 
for multiplication and division. This simple algorithm was
used for two primary reasons. First, it is consistent with 
the FPU's principle of operation. That is, the project is a 
first attempt at fully self-timed floating point and its design 
should be kept simple to maximize the probability of correct 
operation. Second, it allows the multiplier and divider to
share hardware, saving space on the FPGAs. 
Figure 2 shows the block diagram of the multiply/divide 
unit's data path. After unpacking the mantissas and expo-
nents from the floating point arguments, the shared adder 
is used to perform all required operations. In the case of 
multiplication, the mantissas are multiplied by shifting and 
adding through the P and A registers shown in Figure 2. The 
exponents are then added using the same adder, and the result
is normalized, rounded, and packed back into IEEE format.
The division uses the same hardware to shift and subtract
the mantissas during division, subtract the exponents, and
then repack the result. Because of the sharing of hardware,
this implementation is less suitable for pipelining than the
adder. The multiply/divide unit works on a single operation
at a time. Future versions could use the same interface with
a different implementation to improve performance. In this
case, the self-timed nature of the circuit ensures that the FPU
would continue to function properly; only the performance
would be affected.
Figure 2: Multiplier/Divider Block Diagram
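The serial shift-and-add and shift-and-subtract loops on the mantissas can be sketched as follows. The roles given to the variables p and a follow a common textbook convention for a product register and a multiplier register; the paper does not spell out how its P and A registers are used, so this mapping is an assumption, as are the function names. The division shown is the simple restoring form.

```python
def shift_add_multiply(a, b, width=24):
    """Multiply two unsigned width-bit integers one multiplier bit at a time."""
    p = 0                                   # upper half of the running product
    for _ in range(width):
        if a & 1:                           # low multiplier bit set:
            p += b                          # add the multiplicand
        # Shift the (p, a) register pair right by one bit.
        a = (a >> 1) | ((p & 1) << (width - 1))
        p >>= 1
    return (p << width) | a                 # full 2*width-bit product


def shift_subtract_divide(n, d, width=24):
    """Restoring division of unsigned integers: returns (quotient, remainder)."""
    q = 0
    r = 0
    for i in range(width - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)       # bring down the next dividend bit
        q <<= 1
        if r >= d:                          # trial subtraction succeeds
            r -= d
            q |= 1
    return q, r
```

Both loops need one addition or subtraction per bit, so a 24-bit mantissa operation performs up to 24 passes through the shared adder. This is one reason the multiply and divide rates are well below the adder's.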
The use of FPGAs in the multiply and divide circuitry 
allowed greater experimentation in self-timed design. For 
example, a carry completion sensing adder is used for integer 
arithmetic. This type of adder computes its sum in response 
to a Req signal. By sensing that every bit of the adder 
has correctly computed its sum and carry, the adder also 
generates an Ack signal when the sum is correct. Because 
the carry chain of any particular addition is likely to be short, 
this adder exhibits extremely fast operation on average, only 
slowing down in the rare cases when the carry chain must 
propagate along many bits of the adder.
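The claim that carry chains are usually short can be checked with a small experiment. The sketch below measures the longest carry ripple in a single addition and averages it over random operands. It models only the arithmetic, not the completion-detection circuit, and the function names, the 32-bit width, the trial count, and the random seed are all assumptions.

```python
import random


def longest_carry_chain(a, b, width=32):
    """Longest run of bit positions that a single carry ripples through."""
    longest = 0
    run = 0
    carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:                   # generate: a new carry is born here
            carry = 1
            run = 1
        elif (x ^ y) and carry:     # propagate: the carry ripples one more bit
            run += 1
        else:                       # kill, or nothing to propagate
            carry = 0
            run = 0
        longest = max(longest, run)
    return longest


def average_chain(width=32, trials=2000, seed=1):
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(width),
                                     rng.getrandbits(width), width)
    return total / trials
```

The worst case, such as adding 1 to a word of all ones, ripples through every bit. For uniformly random operands the average longest chain is much shorter than the word width, and a completion-sensing adder finishes as soon as the longest chain in that particular addition has settled.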
The circuitry is contained in two Actel 1280 FPGAs. The
1280s are 1.2 micron CMOS devices in 160-pin PQFPs [7]. 
The first 1280 contains interface circuitry, control modules, 
and rounding logic. It utilizes 93% of the available logic 
modules (1148/1232) and 69% of the available I/O pins 
(97/140). The second 1280 contains the data path shown 
in Figure 2. It utilizes 100% of the available modules 
(1232/1232) and 42% of the I/O pins (59/140). As with 
the adder, a large number of test patterns were used during 
the simulation of the multiply/divide unit, and these same
patterns were then used on the completed FPGAs. The test
platform used for the adder to measure performance was
also used on the multiply/divide unit. This implementa-
tion using the Actel FPGAs was measured at 20.0 KFLOPS 
average for multiplication and 15.4 KFLOPS average for 
division. Note that these performance numbers are average 
case performance using our current data sets. The data-
dependent completion time of the adder used in the multi-
ply/divide unit means that individual operations may take 
slightly more or slightly less time depending on the length
of the carry chains.
5 System Controller
The system controller is the interface between the host 
computer and the FPU. It interprets the computer's instruc-
tions and directs the FPU to operate accordingly. All 
operands and results are transferred to and from the host 
computer through the controller in 16-bit words to match the 
cartridge interface port of the Atari ST. Also, the Atari car-
tridge port uses four-phase return-to-zero signalling. This 
is translated by the controller to the two-phase transition 
signalling used by the FPU. Data passed from the host com-
puter to the FPU includes an operation code (add, multiply, 
or divide) and 32-bit IEEE format data passed in 16-bit 
words. Results from the FPU are also passed back to the 
host computer in 16-bit words. 
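Two of the controller's jobs described above, moving 32-bit operands as 16-bit words and translating four-phase signalling to two-phase, can be sketched in software. The word order (high word first), the sampled-level model of the converter, and every name below are assumptions; the paper does not describe the controller's internal circuits.

```python
def to_words(value32):
    """Split a 32-bit operand into two 16-bit words (high word first, assumed)."""
    return [(value32 >> 16) & 0xFFFF, value32 & 0xFFFF]


def from_words(words):
    """Reassemble a 32-bit value from two 16-bit words in the same order."""
    high, low = words
    return (high << 16) | low


class FourToTwoPhase:
    """Turns each rising edge of a four-phase Req into one two-phase transition."""

    def __init__(self):
        self.previous = 0   # last sampled four-phase level
        self.req2 = 0       # two-phase output level

    def sample(self, req4):
        if req4 == 1 and self.previous == 0:
            self.req2 ^= 1              # one transition per four-phase cycle
        self.previous = req4            # the return-to-zero edge is ignored
        return self.req2
```

In the four-phase protocol the request line must return to zero before the next transfer, while the two-phase output simply stays at its new level until the next request arrives.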
The self-timed FPU's addition hardware is independent 
of its multiplication and division hardware. The system 
controller takes advantage of this fact by allowing parallel 
instruction scheduling. Out of order instruction completion 
is also allowed. An extra bit in the result lets the computer 
determine what type of operation completes first. The con-
troller circuitry is also built as a self-timed circuit using the 
same set of circuit modules used in the FPU, and is contained
in one Actel 1020A FPGA. It is a 1.2 micron CMOS
chip packaged in an 84-pin PLCC. The control chip utilizes
80% of the available logic modules (436/547) and 97% of
the I/O pins (67/69). 
6 System Performance 
Performance of the FPU as a system was measured by 
connecting the completed FPU to the Atari ST through its 
cartridge port. The Atari ST is based on an MC68000 running
at 8 MHz and is thus a rather slow, albeit convenient,
platform. When interfaced to an Atari ST, the adder is capa-
ble of 13.6 KFLOPS. This is considerably slower than the 
measured rate of 256.4 KFLOPS for the adder alone and is 
due to the overhead of the Atari cartridge port and the soft-
ware overhead of using the FPU. The multiplier and divider 
operate at 10.7 KFLOPS when interfaced to the Atari ST. 
This is compared to 20.0 KFLOPS and 15.4 KFLOPS for 
multiplication and division respectively without the system 
overhead. 
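These rates are easier to compare as time per operation. The short calculation below converts the KFLOPS figures quoted in the text to microseconds and to 8 MHz CPU clock periods. The dictionary and function names are illustrative, and the overhead figure is simply the difference of the two measured times, which assumes the overhead adds to the standalone operation time.

```python
CPU_MHZ = 8.0   # Atari ST clock rate quoted in the text

# Measured rates from the text, in KFLOPS.
STANDALONE = {"add": 256.4, "multiply": 20.0, "divide": 15.4}
ON_ATARI = {"add": 13.6, "multiply": 10.7, "divide": 10.7}


def microseconds_per_op(kflops):
    """1 KFLOPS is one operation per millisecond, i.e. 1000 microseconds."""
    return 1000.0 / kflops


def cpu_cycles_per_op(kflops, mhz=CPU_MHZ):
    """Number of host clock periods that elapse during one operation."""
    return microseconds_per_op(kflops) * mhz


def overhead_us(op):
    """Extra time per operation when driven through the Atari cartridge port."""
    return microseconds_per_op(ON_ATARI[op]) - microseconds_per_op(STANDALONE[op])
```

On these figures the adder needs about 3.9 microseconds per operation on its own but about 73.5 microseconds, or roughly 588 host clock periods, when driven from the Atari, so nearly all of the time per addition is interface and software overhead.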
The Linpack benchmark program running on the Atari
was used to determine the overall speed of the FPU running 
floating point code. It rated the FPU at 8 KFLOPS without 
using parallel instruction scheduling. Using parallel instruc-
tion scheduling and accounting for I/O overhead, the FPU 
could theoretically run at 60 KFLOPS. 
7 Conclusions 
Self-timed circuits constitute an interesting design do-
main. Many small scale examples of self-timed circuits can 
readily be found; however, realistic self-timed circuits are 
rare. We have built, using a blend of FPGAs and custom 
CMOS chips, a self-timed IEEE single precision floating 
point processor. The FPU has been interfaced to a commer-
cial computer system for testing and evaluation. FPGAs are 
well suited to experimenting with and developing novel systems
like self-timed systems. The quick turn around time of
the FPGA, coupled with the ease with which self-timed circuits
can be interchanged, allows the designer to experiment
with and fabricate different implementations of the same 
circuit in a timely manner. 
References 
[1] I. Sutherland, "Micropipelines," CACM, vol. 32, no. 6, 
1989. 
[2] "IEEE standard for binary floating point arithmetic," 
August 1985. ANSI/IEEE Std 754-1985.
[3] R. Constan, "A 16-bit cartridge port interface," ST Log,
January 1989. 
[4] C. L. Seitz, "System timing," in Mead and Conway, 
Introduction to VLSI Systems, ch. 7, Addison-Wesley, 
1980. 
[5] E. Brunvand, "Using FPGAs to implement self-timed 
systems," Journal of VLSI Signal Processing, vol. 6,
1993. Special issue on field programmable logic. 
[6] E. Brunvand, "A cell set for self-timed design using Actel
FPGAs," Technical Report UUCS-91-013, University
of Utah, 1991. 
[7] Actel Corporation, ACT Family Field Programmable 
Gate Array Databook, March 1991. 
[8] E. Brunvand, N. Michell, and K. Smith, "A comparison 
of self-timed design using FPGA, CMOS, and GaAs 
technologies," in International Conference on Com-
puter Design, (Cambridge, Mass.), October 1992. 
[9] J. Gu and K. F. Smith, "A structured approach for VLSI 
circuit design," Computer, November 1989. 
