Fred: an architecture for a self-timed decoupled computer by Brunvand, Erik L. & Richardson, William
Fred: An Architecture for a Self-Timed Decoupled Computer
William F. Richardson
Computer Science Department
University of Utah
Salt Lake City, UT 84112
willrich@cs.utah.edu
Erik Brunvand
Computer Science Department
University of Utah
Salt Lake City, UT 84112
elb@cs.utah.edu
Abstract
Decoupled computer architectures provide an effective means of exploiting instruction level parallelism. Self-timed micropipeline systems are inherently decoupled due to the elastic nature of the basic FIFO structure, and may be ideally suited for constructing decoupled computer architectures. Fred is a self-timed decoupled, pipelined computer architecture based on micropipelines. We present the architecture of Fred, with specific details on a micropipelined implementation that includes support for multiple functional units and out-of-order instruction completion due to the self-timed decoupling.
1. Introduction
As computer systems have grown in size and complexity, the difficulty in synchronizing the system components has also grown. For example, simply distributing the clock signal throughout a large synchronous system can be a major source of complication. Clock skew is a serious concern in a large system, and is becoming significant even within a single chip. At the chip level, more and more of the power budget is being used to distribute the clock signal, while designing the clock distribution network can take a significant portion of the design time.
These symptoms have led to an increased interest in asynchronous designs. General asynchronous circuits do not use a global clock for synchronization, but instead rely on the behavior and arrangement of the circuit elements to keep the signals proceeding in the correct sequence. However, these circuits can be very difficult to design and debug without some additional structure to help the designer deal with the complexity. While there are many different asynchronous methodologies, one of the simplest to design, test, and debug is the self-timed micropipeline approach described by Sutherland [19], which avoids clock-related timing problems by enforcing a simple communication protocol between circuit elements. This is quite different from traditional synchronous signaling conventions where signal events occur at specific times and must remain asserted for specific time intervals. In self-timed systems it is important only that the correct sequence of signals be maintained. The timing of these signals is an issue of performance that can be handled separately.
Experience has shown the difficulty of writing parallel programs, yet most sequential programs have an (arguably) significant amount of instruction-level parallelism [13,23] (see footnote 1). One way of exploiting this parallelism is by decoupling the memory access portion of an instruction stream from the execution portion [7,24,5]. By performing the two operations independently, peaks and valleys in each may be smoothed, resulting in an overall performance gain.
Although decoupled architectures have been proposed and built using a traditional synchronous design style, a self-timed approach seems to offer many advantages. Typically the independent components of the machine are decoupled through a FIFO queue of some sort. As long as the machine components are all subject to the same system clock, connecting the components through the FIFOs is subject to only the usual problems of clock skew and distribution. If, however, the components are running at different rates or on separate clocks, the FIFO must serve as a synchronizing element and thus presents even more serious problems.
The micropipeline approach is based on simple, self-timed, elastic FIFO queues, which suggests that decoupled computer architectures may be implemented much more easily in a self-timed micropipeline form than with a clocked design. Because the FIFOs are self-timed, synchronization of the decoupled elements is handled naturally as a part of the FIFO communication. The elastic nature of a micropipeline FIFO allows the decoupled units to run at data-dependent speeds, producing or consuming data as fast as possible for the given program and data. Because the data are passed around in self-timed FIFO queues, and the decoupled processing elements are running at their own rate, the degree of decoupling is increased in this type of system organization, without the overhead of a global controller keeping track of the state of the decoupled components. This should allow increased performance due to the increased decoupling and potentially faster local control of the components; however, it also means that exception handling must be considered carefully. Because each of the elements is running at its own rate, and data are possibly being transmitted through FIFO queues when the exception is signaled, care must be taken to make sure that the machine can process an exception in a functionally precise way without losing state that might be in the process of being modified by a different component.
Footnote 1. Nicolau claims there is lots of parallelism available. Wall claims there's some, but not much.
Copyright 1996 IEEE. Reproduction without permission is prohibited.
Fred (see footnote 2) is a self-timed decoupled, pipelined processor architecture based on micropipelines. We present the architecture of Fred, with specific details on a micropipelined implementation that includes support for out-of-order instruction completion due to the decoupling, and a model for functionally precise exception processing.
2. Asynchronous Processors
In spite of the possible advantages, there have been very few asynchronous processors reported in the literature. Early work in asynchronous computer architecture includes the Macromodule project during the early 70s at Washington University [3] and the self-timed dataflow machines built at the University of Utah in the late 70s [4].
Although these projects were successful in many ways, asynchronous processor design did not progress much, perhaps because the circuit concepts were a little too far ahead of the available technology. With the advent of easily available custom ASIC technology, either as VLSI or FPGAs, asynchronous processor design is beginning to attract renewed attention. Some recent processor projects include the following:
2.1 The CalTech Asynchronous Microprocessor
The first asynchronous VLSI processor was built by Alain Martin's group at CalTech [11]. It is completely asynchronous, using (mostly) delay-insensitive circuits and dual-rail data encoding. The processor as implemented has a small 16-bit instruction set, uses a simple two-stage fetch-execute pipeline, is not decoupled, and does not handle exceptions. It has been fabricated both in CMOS and GaAs [20].
Footnote 2. Fred is not an acronym, and it doesn't mean anything. It's just a name, like "SPARC" or "Alpha."
2.2 The NSR
The NSR (Non-Synchronous RISC) processor [2,15] is structured as a five-stage pipeline where each pipe stage operates concurrently and communicates over self-timed data channels in the style of micropipelines. Branches, jumps, and memory accesses are also decoupled through the use of additional FIFO queues which can hide the execution latency of these instructions. The NSR was built using FPGAs. It is pipelined and decoupled, but doesn't handle exceptions. It is a simple 16-bit processor with only sixteen instructions, since it was built partially as an exercise in using FPGAs for rapid prototyping of self-timed circuits [1].
2.3 The Amulet
A group at Manchester has built a self-timed micropipelined VLSI implementation of the ARM processor [6], which is an extremely power-efficient commercial microprocessor. The Amulet is a real processor in the sense that it mimics the behavior of an existing commercial processor and it handles simple exceptions. It is more deeply pipelined than the synchronous ARM, but it is not decoupled (although it does allow instruction prefetching), and its precise exception model is a simple one. The Amulet has been designed and fabricated. The performance of the first-generation design is within a factor of two of the commercial version [14]. Future versions of Amulet are expected to close this gap.
2.4 The Counterflow Pipeline Processor
This is an innovative architecture proposed by a group at Sun Microsystems Labs [18]. It derives its name from its fundamental feature, that instructions and results flow in opposite directions in a pipeline and interact as they pass. The nature of the Counterflow Pipeline is such that it supports in a very natural way a form of hardware register renaming, extensive data forwarding, and speculative execution across control flow changes. It should also be able to support exception processing.
A self-timed micropipeline-style implementation of the CFPP has been proposed. The CFPP is deeply pipelined and partially decoupled, with memory accesses launched and completed at different stages in the pipeline. It can handle exceptions, and a self-timed implementation which mimics a commercial RISC processor's instruction set has been simulated. The potential of this architecture is intriguing, but still unknown.
3. The Fred Architecture
The Fred architecture is based roughly on the NSR architecture developed at the University of Utah [2,15]. As such it consists of several decoupled independent processes connected by FIFO queues of various lengths, an approach which we believe offers a number of advantages over a clocked synchronous organization. The Fred architecture specifies the instruction set and the general layout and behavior of the processor. Other extensions to the Fred architecture may be made. New instructions may be added, and additional functional units may be incorporated. The existing functional units may be rearranged, combined, or replaced. The details of the exception handling mechanism are not specified by the architecture, but some means must be provided.
Figure 1. Fred block diagram. Black lines are primary data paths; gray lines are control paths. All data and control paths are pipelined queues.
A prototype of Fred has been implemented in a detailed VHDL model. Figure 1 shows the overall organization. Each box in the figure is a self-timed process communicating via dedicated data paths rather than buses. Each of these data paths, shown as wires in Figure 1, may be pipelined to any desired depth without affecting the results of the computation. Because Fred uses self-timed micropipelines [19] in which pipeline stages communicate locally only with neighboring stages in order to pass data, there is no extra control circuitry involved in adding additional pipeline stages. Because buses are not used, the corresponding resource contention is avoided.
Multiple independent functional units allow several instructions to be in progress at a given time. Because the machine organization is self-timed, the functional units may take as long or short a time as necessary to complete their function. One of the performance advantages of a self-timed organization is directly related to this ability to finish an instruction as soon as possible, without waiting for the next discrete clock cycle. It also allows the machine to be upgraded incrementally by replacing functional units with higher performance circuits after the machine is built, with no global consequences or retiming. The performance benefits of the improved circuits are realized by having the acknowledgment produced more quickly, and thus the instruction that uses that circuit finishes faster.
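As a rough illustration of the point above, the following Python sketch compares the time taken by a short serial chain of operations when each one finishes as soon as its result is ready with the time taken when each must occupy a whole number of clock cycles. The latencies, the clock period, and the function names are hypothetical values chosen for illustration, not measurements of Fred, and the model ignores handshake overhead and any overlap between functional units.

```python
def self_timed_time(latencies):
    """Total time when every operation completes as soon as its result is valid."""
    return sum(latencies)


def clocked_time(latencies, cycle):
    """Total time when every operation is rounded up to whole clock cycles."""
    # -(-t // cycle) is integer ceiling division.
    return sum(-(-t // cycle) * cycle for t in latencies)


# Hypothetical per-operation latencies in arbitrary integer time units.
latencies = [3, 8, 4, 11]
cycle = 5  # hypothetical clock period in the same units

print(self_timed_time(latencies))      # 26
print(clocked_time(latencies, cycle))  # 35

# Swapping in a faster circuit for the slowest operation (11 -> 6) shortens
# the self-timed total directly, with no other parameter changed.
upgraded = [3, 8, 4, 6]
print(self_timed_time(upgraded))       # 21
print(clocked_time(upgraded, cycle))   # 30
```

In this toy model the self-timed total tracks the actual latencies, while the clocked total changes only when an operation crosses a cycle boundary.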
The VHDL version chooses particular implementations for each of the main pieces of Fred. For example, the Dispatch unit is organized so as to dynamically reorder instructions for issue, allowing instructions to be issued out of order, and to complete in yet a different order. This is of particular interest in a self-timed processor where the multiple functional units might take varying amounts of time to compute a result. An individual functional unit might even take different amounts of time to compute a result based on the data, which will lead naturally to out-of-order instruction completion. The VHDL prototype is fully operational, and includes a functionally precise exception model [16]. The timing and configuration parameters can be adjusted for each component of the design.
4. Instruction Set
Choosing an instruction set for a RISC processor can be a complex task [9,8,10]. Rather than attempt to design a new instruction set from scratch, an instruction set from an existing commercial RISC processor was adapted. Much of the Fred instruction set is taken directly from the Motorola 88100 instruction set [12]. However, Fred does not implement all the 88100 instructions, and several of Fred's instructions do not correspond to any instructions of the 88100. The instructions, and the functional units that execute them, are shown in Figure 2.
Functional Unit  Instructions
Dispatch         doit, rte, sync, trap, illegal
Logic            and, clr, ext, extu, ff0, ff1, mak, mask, or, rot, set, xor
Arithmetic       add, addu, cmp, div, divu, mul, sub, subu
Memory           ld, lda, st, xmem
Branch           blt, ble, bne, beq, bge, bgt, bb0, bb1, br, ldbr
Control          getcr, mvbr, mvpc, putcr
Figure 2. Fred instruction set
5. Instruction Dispatch
Instruction Dispatch is, in some sense, the main control unit for the Fred processor. It is responsible for keeping track of the Program Counter, fetching new instructions, issuing instructions to the rest of the processor, and monitoring the instruction stream to watch for data hazards. Instructions are fetched and issued to the rest of the machine as quickly as possible. Instructions are issued as soon as all dependencies are satisfied, without further regard to program order. Because different functional units may take different amounts of time to complete, individual instructions may complete in a different order than that in which they were issued.
Deadlocking the processor is theoretically possible. Because both the R1 Queue and Branch Queue (see below) are filled and emptied via two separate instructions, it is possible to issue an incorrect number of these instructions so that the producer/consumer relationship of the queues is violated. Fred's dispatch logic will detect these cases, and take an exception before an instruction sequence is issued that would result in deadlock. Obviously, there is no way to handle such an exception except by aborting the current user program. Deadlock is only possible due to programmer error, and Fred can detect and abort the illegal instruction sequence before it takes effect.
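The producer/consumer violation described above can be pictured with a small software check. This is only a sketch under stated assumptions: the paper does not say how Fred's dispatch logic detects these cases, so the function name, the strictly in-order view of the instruction stream, and the fixed queue capacity are all hypothetical. Each "put" stands for an instruction that fills a queue (for example a write to r1 or a branch target computation) and each "get" for one that empties it (a read of r1 or a doit).

```python
def first_queue_violation(events, capacity):
    """Return the index of the first event that could never complete, or None.

    events   -- sequence of "put" / "get" strings in program order
    capacity -- hypothetical number of entries the queue can hold
    """
    depth = 0
    for index, event in enumerate(events):
        if event == "put":
            if depth == capacity:
                return index      # producer with no room and no consumer ahead of it
            depth += 1
        elif event == "get":
            if depth == 0:
                return index      # consumer with nothing ever produced for it
            depth -= 1
        else:
            raise ValueError(f"unknown event: {event!r}")
    return None


print(first_queue_violation(["put", "get", "get"], capacity=4))         # 2
print(first_queue_violation(["put", "put", "get", "get"], capacity=4))  # None
```

A sequence flagged by such a check corresponds to the programmer error the text describes: more consumers than producers, or the reverse.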
5.1 The Instruction Window
An Instruction Window (IW) is used to buffer incoming instructions and to track the status of issued instructions [22]. A register scoreboard is used to avoid all data hazards. The IW is a set of internal registers located in the Dispatch unit which tracks the state of all current instructions. Each slot in the IW contains information about one instruction, such as its opcode, address, current status, and various other parameters. As each instruction is fetched, it is placed into the IW. New instructions may continue to be added to the IW independently, as long as there is room for them. The scoreboard is also maintained in the Dispatch unit, and is cleared when results arrive at the Register File.
Instructions are issued from the IW when all their data dependencies are satisfied (including WAW dependencies). Issuing an instruction does not remove it from the IW. Instead, instructions are removed from the IW only after they have completed successfully. Each issued instruction is assigned a tag which uniquely distinguishes it from all other current instructions. When an instruction completes, it uses this tag to report its status back to the IW. The status is usually an indication that the instruction completed successfully, but is also used to report exceptions. Instructions signal completion as soon as the functional unit which processes them has generated a valid result, even though that result may not yet have reached its final destination. When an instruction is unsuccessful, it returns an exception status to the IW, which then begins exception processing. Instructions which can never cause exceptions do not have to report their status, and can be removed from the IW when they are dispatched. Because instructions may complete out of order, recoverable exceptions can cause unforeseen WAW hazards. The Instruction Window contains enough information to resolve these issues.
The Dispatch unit uses the Instruction Window and 
scoreboard to determine when to issue new instructions to 
the rest of the machine. When instruction issue occurs, the 
required operands are requested from the Register File 
(possibly through a FIFO), and the instruction is issued to 
the Execute unit (also possibly through a FIFO).
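The interplay of Instruction Window, tags, and scoreboard described in this section can be sketched in Python. This is a minimal model under stated assumptions, not the VHDL implementation: the class and method names are hypothetical, tags simply count up from 1, exception handling and the special behavior of r0 and r1 are omitted, and the check against older unissued entries is one assumed way of keeping out-of-order issue free of data hazards.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    text: str                   # assembly text, for display only
    dest: Optional[int]         # destination register number, or None
    srcs: Tuple[int, ...]       # source register numbers
    can_fault: bool = True      # must it report completion to the IW?


class InstructionWindow:
    def __init__(self, slots):
        self.slots = slots
        self.entries = {}       # tag -> [Instr, status]
        self.busy = set()       # scoreboard: registers awaiting a result
        self.next_tag = 1

    def fetch(self, instr):
        """Place a fetched instruction in a free slot; return its tag or None."""
        if len(self.entries) >= self.slots:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries[tag] = [instr, "waiting"]
        return tag

    def issue_ready(self):
        """Issue every waiting instruction whose dependencies are satisfied."""
        issued = []
        older_writes, older_reads = set(), set()  # from older unissued entries
        for tag in sorted(self.entries):
            instr, status = self.entries[tag]
            if status != "waiting":
                continue
            dest = set() if instr.dest is None else {instr.dest}
            srcs = set(instr.srcs)
            blocked = (
                (srcs | dest) & self.busy         # RAW/WAW on in-flight results
                or (srcs | dest) & older_writes   # RAW/WAW on unissued entries
                or dest & older_reads             # WAR on unissued entries
            )
            if blocked:
                older_writes |= dest
                older_reads |= srcs
                continue
            self.busy |= dest
            issued.append(tag)
            if instr.can_fault:
                self.entries[tag][1] = "issued"   # stays until it reports back
            else:
                del self.entries[tag]             # can never fault: drop now
        return issued

    def complete(self, tag):
        """A functional unit reports success for the instruction with this tag."""
        del self.entries[tag]

    def result_arrived(self, reg):
        """The Register File received a result: clear the scoreboard bit."""
        self.busy.discard(reg)


iw = InstructionWindow(slots=8)
for ins in [
    Instr("subu r8,r8,1", 8, (8,)),
    Instr("bgt r8,loop", None, (8,)),
    Instr("addu r3,r3,3", 3, (3,)),
    Instr("mul r9,r2,r3", 9, (2, 3)),
    Instr("addu.d r2,r9,2", 2, (9,)),
]:
    iw.fetch(ins)

print(iw.issue_ready())  # [1, 3]
```

With the five instructions loaded in program order, only tags 1 and 3 can issue at first, because the others depend on results that are still pending.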
5.2 Exceptions
The exception model seen by the programmer is not that of a single point where the exception occurred. Instead, the Instruction Window holds a set of instructions which were in progress when the exception occurred. The hardware guarantees that this set (unless empty) will consist only of instructions which either faulted or which had been fetched but not yet issued when the exception occurred. The instructions in this set are a subset of a sequential portion of the dynamic program instructions, where the missing elements are those instructions which completed successfully out of order, and which do not need to be re-issued. Because the total state of the processor is not available at one known time (such as on a clock tick), the details of the exception handling are somewhat complicated, but no more so than for a synchronous processor that is deeply pipelined and may issue or complete instructions out of order. This is described in more detail elsewhere [16].
6. R1 Queue
There are 32 general registers in the Fred architecture. Registers r2 through r31 are normal general-purpose registers, but r0 and r1 have special meaning. Register r0 may be used as the destination of an instruction, but will always contain zero. Register r1 is not really a register at all but provides read access to a data memory pipeline similar to that used in the WM machine [24]. Specifying r1 as the destination of an instruction inserts the result into the pipeline. Each use of r1 as a source for an instruction retrieves one word from the R1 Queue. For example, the instruction add r2,r1,r1 would fetch two words from the R1 Queue, add them together, and place the sum in register r2. Likewise, assuming that sequential access to register r1 would result in values A, B, and C, the instruction st r1,r1,r1 would write the value C into memory location A+B. Data from any of the functional units may be queued into the R1 Queue, and loads from memory can also be queued. It may be possible to subsume some of the memory latency by queuing loaded data in the R1 Queue in advance of its use. This is similar to having as many load delay slots as desired and allowed by the program structure. Note also that the program receives different information each time it performs a read access on register r1, thus achieving a form of register renaming directly in the R1 Queue. Instructions which write to the R1 Queue are forced to complete in order, to provide deterministic behavior.
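The two examples in this section can be traced with a small Python model of the register semantics. It is a sketch only: the class and method names are hypothetical, memory is modeled as a dictionary, and the order in which st reads its operands (the two address operands first, then the data) is inferred from the A, B, C example above rather than stated by the architecture.

```python
from collections import deque


class FredRegisters:
    def __init__(self):
        self.regs = [0] * 32        # r2..r31 are ordinary registers
        self.r1_queue = deque()     # the R1 Queue behind register r1
        self.memory = {}            # address -> word (illustrative only)

    def read(self, n):
        if n == 0:
            return 0                          # r0 always reads as zero
        if n == 1:
            return self.r1_queue.popleft()    # each read consumes one word
        return self.regs[n]

    def write(self, n, value):
        if n == 0:
            return                            # writes to r0 are discarded
        if n == 1:
            self.r1_queue.append(value)       # each write enqueues one word
        else:
            self.regs[n] = value

    def add(self, rd, rs1, rs2):
        a = self.read(rs1)
        b = self.read(rs2)
        self.write(rd, a + b)

    def st(self, rd, rs1, rs2):
        base = self.read(rs1)       # first value read  (A)
        offset = self.read(rs2)     # second value read (B)
        data = self.read(rd)        # third value read  (C)
        self.memory[base + offset] = data


m = FredRegisters()
m.write(1, 10)
m.write(1, 20)
m.add(2, 1, 1)            # add r2,r1,r1
print(m.regs[2])          # 30

for word in (0x100, 8, 42):
    m.write(1, word)
m.st(1, 1, 1)             # st r1,r1,r1 with A=0x100, B=8, C=42
print(m.memory[0x108])    # 42
```

A read from an empty queue raises an exception in this model; the hardware would instead wait for the producing instruction.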
7. Branch Decoupling
Flow control instructions are significantly affected by the degree of decoupling in Fred. By decoupling the branch instructions into an address generating part and a sequence change part, we gain the ability to prefetch instructions effectively. Fred does not require any special external memory system, but it can provide prefetching information which may be used by an intelligent cache or prefetch unit. This information is generated by the Branch unit when branch target addresses are computed, and is always correct.
The instructions for both absolute and relative branches compute a 32-bit value which will replace the program counter if the branch is taken, but the branch is not taken immediately. Instead, the branch target value is computed by the Branch unit and passed back to the Dispatch unit, along with a condition bit indicating whether the branch should be taken or not. These data are consumed by the Dispatch unit when a subsequent doit instruction is encountered, and the branch is either taken or not taken at that time. Although this action is similar to the synchronous concept of squashing instructions, Fred does not convert the doit instructions into NO-OPs, but instead removes them completely from the main processor pipeline.
Any number of instructions (including zero) may be placed between the branch target computation and the doit instruction. From the programmer's view, these instructions do not have to be common to both branches, nor must they be undone if the branch goes in an unexpected way. The only requirement for these instructions is that they not be needed to determine the direction of the branch. The branch instruction can be placed in the current block as soon as it is possible to compute the direction. The doit instruction should come only when the branch must be taken, allowing maximum time for instruction prefetching, as shown in Figure 3. Because the doit is consumed entirely within the Dispatch Unit, it will take effect as soon as the branch target data is available, allowing instructions past the branch point to be loaded into the IW before the prior instructions have completed (or even issued). This lets the IW act as an instruction prefetch buffer, but it is always correct, never speculative. The doit instruction does not have to be explicitly specified. To prevent extra instruction fetches, the doit instruction can be encoded implicitly by a single bit available in the opcode of other instructions. The doit is implicit in Figure 3B. Figure 4 shows an example, based on the code in Figure 3B. Note that instructions may continue to be issued out of order, even with respect to the delay slot instructions. Note also that the doit may be consumed independently of the instruction which encodes it.
loop:
    addu    r3,r3,3
    mul     r9,r2,r3
    addu    r2,r9,2
    subu    r8,r8,1
    bgt     r8,loop
    doit
A. Simple ordering
loop:
    subu    r8,r8,1
    bgt     r8,loop
    addu    r3,r3,3
    mul     r9,r2,r3
    addu.d  r2,r9,2
B. Reordered, with implicit doit
Figure 3. Two ways of ordering the same program segment
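To check that the two orderings in Figure 3 compute the same thing, the two-part branch can be modeled with a tiny sequential interpreter in Python. This is a functional sketch only and says nothing about timing or prefetching. The names are hypothetical, bgt is assumed to mean "branch if the register is greater than zero", and a ".d" suffix is treated as an implicit doit that takes effect after the instruction that carries it.

```python
from collections import deque

ARITH = {
    "addu": lambda a, b: a + b,
    "subu": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def run(program, labels, regs, max_steps=10_000):
    """Execute a list of instruction tuples; return the final register values."""
    regs = dict(regs)
    branch_queue = deque()          # (target index, taken?) pairs
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return regs
        op, *args = program[pc]
        implicit_doit = op.endswith(".d")
        if implicit_doit:
            op = op[:-2]
        next_pc = pc + 1
        if op == "bgt":
            reg, label = args
            # Address-generating part: queue the target and the condition bit.
            branch_queue.append((labels[label], regs[reg] > 0))
        elif op != "doit":
            dest, a, b = args
            value = lambda x: regs[x] if isinstance(x, str) else x
            regs[dest] = ARITH[op](value(a), value(b))
        if op == "doit" or implicit_doit:
            # Sequence-change part: consume one queued branch result.
            target, taken = branch_queue.popleft()
            if taken:
                next_pc = target
        pc = next_pc
    raise RuntimeError("step limit exceeded")


simple = [
    ("addu", "r3", "r3", 3),
    ("mul", "r9", "r2", "r3"),
    ("addu", "r2", "r9", 2),
    ("subu", "r8", "r8", 1),
    ("bgt", "r8", "loop"),
    ("doit",),
]
reordered = [
    ("subu", "r8", "r8", 1),
    ("bgt", "r8", "loop"),
    ("addu", "r3", "r3", 3),
    ("mul", "r9", "r2", "r3"),
    ("addu.d", "r2", "r9", 2),
]
start = {"r2": 1, "r3": 0, "r8": 3, "r9": 0}   # hypothetical initial values
labels = {"loop": 0}

print(run(simple, labels, start) == run(reordered, labels, start))  # True
```

In the reordered version the branch result sits in the queue while the three remaining instructions execute, which is the window the text describes as available for prefetching.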
Tag  Status  Instruction     Loop #
1    Issued  subu r8,r8,1    1
2    -       bgt r8,loop     1
3    Issued  addu r3,r3,3    1
4    -       mul r9,r2,r3    1
5    -       addu.d r2,r9,2  1
A. Branch target not yet available
Tag  Status  Instruction     Loop #
4    Issued  mul r9,r2,r3    1
5    -       addu r2,r9,2    1
6    Issued  subu r8,r8,1    2
7    -       bgt r8,loop     2
8    Issued  addu r3,r3,3    2
9    -       mul r9,r2,r3    2
10   -       addu.d r2,r9,2  2
B. Branch target consumed
Figure 4. Branch prefetching in the IW
This two-part branch model allows for a variable number of delay slots by allowing an arbitrary number of instructions to be executed between the computation of the branch target and its use. It also allows other interesting behaviors such as achieving the effect of loop unrolling without increasing code size. This can be accomplished by computing several branch targets at one time and putting them in the branch queue before executing the loop code (see footnote 3).
8. Independent Functional Units
The Distributor is responsible for routing instructions to their proper functional unit. It takes incoming instructions and operands, matches them up where needed, and routes instructions to appropriate functional units. There are five independent functional units in the prototype implementation of Fred: Logic, Arithmetic, Memory, Branch, and Control. Each functional unit is responsible for a particular type of instruction, as shown in Figure 2. The Distributor and its associated functional units collectively make up the Execute unit.
The Memory unit is treated as just another functional unit. The only difference is that the Memory unit sometimes produces data that is written to the data memory rather than the Register File.
Each of the functional units may produce results that are written back to the register file directly, or which are made available through the R1 Queue. In addition to reusing a result within a single functional unit, in many processors a result may be forwarded directly to another functional unit without passing through a register, so that pipeline delays involved in writing to the register file are avoided. Forwarding results between independent functional units requires either a common shared bus as in Tomasulo's algorithm [21], or dedicated data paths as used in the DEC Alpha [17] and other high-performance processors. Fred does not forward results directly between functional units, because of the complexity involved. However, reusing the last result of a computation within a single functional unit is certainly possible. Trace data suggests that such reuse may provide a measurable performance increase, but it is highly dependent on the compiler technology.
9. Register File
The Register File responds to requests from the Dispatch unit for operands, which it delivers through a FIFO to the Execute unit. These operands are paired with instructions and passed to the appropriate functional unit. Because the operands are requested in the same order as instructions are issued, there is no matching required to determine which operands should be paired with which instructions. They emerge from the FIFO queues in the correct sequence.
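The tag-free pairing described above follows directly from FIFO ordering, as the following Python sketch illustrates. The class and method names are hypothetical, the register file is modeled as a plain dictionary, and operands are assumed to be read at the moment they are requested; the sketch does not cover instructions that need no operands.

```python
from collections import deque


class IssuePath:
    """Two FIFOs between Dispatch and Execute: one for instructions, one for operands."""

    def __init__(self, regfile):
        self.regfile = regfile          # register number -> value
        self.instr_fifo = deque()
        self.operand_fifo = deque()

    def issue(self, opcode, srcs):
        # Dispatch issues the instruction and requests its operands together,
        # so both FIFOs receive entries in the same order.
        self.instr_fifo.append(opcode)
        self.operand_fifo.append(tuple(self.regfile[r] for r in srcs))

    def next_pair(self):
        # The head of each FIFO belongs to the same instruction,
        # so no tag comparison is needed to pair them.
        return self.instr_fifo.popleft(), self.operand_fifo.popleft()


path = IssuePath({2: 7, 3: 5, 8: 1})
path.issue("mul", (2, 3))
path.issue("subu", (8,))
print(path.next_pair())   # ('mul', (7, 5))
print(path.next_pair())   # ('subu', (1,))
```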
Footnote 3. This is not true loop unrolling since the registers are not recolored, but it could be useful.
On the incoming side, the Register File accepts results from each functional unit that produces data. These results are accepted independently from each functional unit and are not multiplexed onto a common bus. Data hazards are prevented by the scoreboard and the Dispatch unit, which will not issue an instruction until all its data dependencies are satisfied, so there will never be conflicts for a register destination. The Register File clears the associated scoreboard bit when results arrive at a particular register. Instruction results may also be written into the R1 Queue as described earlier, but there is no actual register associated with the R1 Queue. Instead, the Dispatch unit clears the scoreboard bit for register r1 when the producing instruction completes successfully.
10. Results
Several benchmarks have been run through the Fred simulator. Although the benchmarks are not particularly large, representative results may still be obtained because every signal transition is timed. The benchmarks used are listed in Figure 5.
cat        7109   copy stdin to stdout, for "cat.c"
cbubble    13300  bubble sort on 50 integers
cquick     5680   quicksort on 50 integers
ctowers    3095   towers of Hanoi, 4 rings
dhry       1710   dhrystone v. 2.1, 3 loops
fact       2858   10 factorial, computed 5 times
grep       13668  search for "printf" in cat.c source
heapsort   2465   heapsort on 16 integers
mergesort  1857   mergesort on 16 integers
mod        4582   test of 10 modulo operations
muldiv     1669   test of multiply and divide
pi         13883  compute 10 digits of π
queens     8181   solve 5 queens problem
Figure 5. Benchmark programs
All of the benchmarks are written in C. The code was compiled for the Motorola 88100 using either the GNU C compiler (v. 2.4.5) or the Green Hills compiler (v. 1.8.5m16), and then translated into Fred's assembly language using a custom post-processor. All possible optimization flags were used, to little effect. Both compilers produced very poor code, using only a few of the available registers, making many memory references, and leaving many obvious optimizations undone. This is entirely due to the fact that the compilers are not targeted specifically for Fred, and has nothing to do with any shortcomings of the Fred architecture.
Two major parameters of the Fred simulator were varied, and each of the 14 benchmarks was executed under each configuration. First, the number of IW slots was varied between 2 and 16. Second, the number of latch stages in each FIFO queue was varied from 0 to 8. With zero stages, there is no storage in the FIFO queue at all, and each request/acknowledge pair between functional units is directly connected. Although there are many FIFO queues in the Fred processor, they were not varied independently, since general performance trends were of more interest than tweaking the queues for maximum performance on a given benchmark.
Figure 6. Average performance vs. IW size (horizontal axis: FIFO length)
The average performance is more dependent on the length of the FIFO queues than on the size of the Instruction Window. There was no appreciable difference in performance for IW sizes greater than 3 slots. Figure 6 shows the relationship between performance and queue length for various IW sizes. Because the Dispatch Unit searches the IW for executable instructions in a parallel manner, the main factor affecting performance is the time it takes to complete an instruction. As long as the IW is large enough to issue instructions efficiently, it only affects performance in terms of saving state during exception handling.
10.1 Instruction Window Usage
Figure 7 shows how the average IW usage varied with queue length and IW size. With longer queue lengths the time needed for each issued instruction to complete is longer, giving more time for the IW to be loaded with instructions, so the usage increases. As the number of IW slots increased the average IW usage also went up, but this is to be expected since there are more slots available. Regardless of the configuration, the average IW usage is still no greater than 2.5 slots. The relatively high usage seen when the queue length is zero is due to the inability to dispatch more than one instruction at a time. Because there is no storage in the queues, there is essentially no pipelining except for those instructions which can be sent to separate functional units.
Figure 7. Average IW slot usage
10.2 Instruction Completion
Only those instructions which might possibly fail 
must report their completion to the Dispatch Unit. This 
enables a significant speedup in performance, since there 
is less communication with the Instruction Window. 
Instructions which will always complete successfully may 
be removed from the IW as soon as they have dispatched, 
providing a corresponding decrease in the average IW 
usage. Figure 8 and Figure 9 tabulate the differences for 
an optimal queue length of 1 and an IW size of 4 slots. On 
average, intelligent completion increases the performance 
by 13.5% and decreases the average IW usage by 24.9%.




Benchmark    Forced        Intelligent    Performance    [header
             completion    completion     increase       lost]
ackermann    151.82        173.45         14.2%          24.1%
cat          152.77        178.18         16.6%          13.3%
cbubble      146.80        164.34         11.9%          31.8%
cquick       152.22        173.49         14.0%          25.2%
ctowers      157.19        176.96         12.6%          34.9%
dhry         148.87        164.69         10.6%          42.1%
fact         143.65        167.12         16.3%           9.9%
grep         146.48        168.27         14.9%          20.1%
heapsort     152.54        172.82         13.3%          28.0%
mergesort    148.14        168.91         14.0%          24.2%
mod          148.87        165.66         11.3%          36.0%
muldiv       152.08        170.53         12.1%          33.9%
pi           145.58        163.43         12.3%          29.3%
queens       148.40        171.19         15.4%          16.6%
average      149.67        169.93         13.5%          26.4%
Figure 8. Completion signalling and performance
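The increase column of Figure 8 follows directly from the two performance columns. As a quick check, the short Python sketch below recomputes it for three rows copied from the table. The helper name `percent_increase` is our own choice for illustration and is not part of the Fred tools.

```python
def percent_increase(forced, intelligent):
    """Relative gain of intelligent over forced completion, in percent."""
    return 100.0 * (intelligent - forced) / forced


# (forced, intelligent) performance pairs copied from Figure 8.
figure8_rows = {
    "ackermann": (151.82, 173.45),
    "fact": (143.65, 167.12),
    "average": (149.67, 169.93),
}

for name, (forced, intelligent) in figure8_rows.items():
    print(f"{name:10s} {percent_increase(forced, intelligent):5.1f}%")
```

To one decimal place, the three rows give 14.2%, 16.3%, and 13.5%, matching the table.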
Benchmark    IW slot usage    IW slot usage       Decrease
             with forced      with intelligent
             completion       completion
ackermann    2.00             1.42                29.0%
cat          1.82             1.15                36.8%
cbubble      2.11             1.71                19.0%
cquick       2.17             1.65                24.0%
ctowers      2.20             1.70                22.7%
dhry         2.09             1.71                18.2%
fact         1.80             1.16                35.6%
grep         1.89             1.35                28.6%
heapsort     2.08             1.63                21.6%
mergesort    2.04             1.55                24.0%
mod          2.24             1.82                18.8%
muldiv       2.20             1.70                22.7%
pi           2.21             1.77                19.9%
queens       1.93             1.39                28.0%
average      2.06             1.56                24.9%
Figure 9. Completion signalling and IW slot usage
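The effect of completion signalling on IW occupancy can be illustrated with a small behavioural sketch. This is not the VHDL model: the class names, the `may_fault` flag, and the choice of `add` as a non-faulting and `load` as a possibly faulting instruction are assumptions made only for the example. With forced completion every dispatched instruction holds its slot until a completion report arrives. With intelligent completion an instruction that cannot fault releases its slot at dispatch.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # instructions are compared by identity
class Instruction:
    name: str
    may_fault: bool  # assumption: known from the opcode at dispatch time


@dataclass
class InstructionWindow:
    size: int
    intelligent: bool = True
    slots: list = field(default_factory=list)  # entries: [instruction, dispatched]

    def load(self, instr):
        """Place an instruction in a free slot; return False if the IW is full."""
        if len(self.slots) >= self.size:
            return False
        self.slots.append([instr, False])
        return True

    def dispatch(self, instr):
        """Dispatch instr; free its slot at once if it cannot fault."""
        for i, (held, dispatched) in enumerate(self.slots):
            if held is instr and not dispatched:
                if self.intelligent and not instr.may_fault:
                    del self.slots[i]
                else:
                    self.slots[i][1] = True
                return
        raise ValueError("instruction is not waiting in the IW")

    def complete(self, instr):
        """A completion report from a functional unit frees the slot."""
        for i, (held, dispatched) in enumerate(self.slots):
            if held is instr and dispatched:
                del self.slots[i]
                return
        raise ValueError("instruction was not dispatched from the IW")


def usage_after_dispatch(intelligent):
    """Slots still occupied after dispatching one add and one load."""
    add = Instruction("add", may_fault=False)
    load = Instruction("load", may_fault=True)
    iw = InstructionWindow(size=4, intelligent=intelligent)
    for instr in (add, load):
        iw.load(instr)
        iw.dispatch(instr)
    return len(iw.slots)


print(usage_after_dispatch(False), usage_after_dispatch(True))  # prints "2 1"
```

In this toy case the forced scheme leaves two slots occupied after dispatch and the intelligent scheme leaves one, which is the kind of reduction Figure 9 reports on real benchmarks.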
10.3 Branch Decoupling
As mentioned earlier, Fred’s decoupled branch 
mechanism allows for a variable number of delay slots, but the 
compiler used for the benchmarks generates code for the 
Motorola 88100 processor, a synchronous RISC processor 
which has only a single delay slot. This allows only one 
instruction to be placed between the branch instruction 
and the first instruction at the target address. The 
instructions generated by the 88100 compiler are translated into 
Fred’s instruction set, and a very simple peephole 
optimization is performed to separate the branch and doit 
instructions as far as possible within a basic block. Despite 
these handicaps, the average number of useful delay slot 
instructions is greater than one. With a compiler targeted 
specifically for Fred, the separation should be much 
greater. The time available for instruction prefetching is 
directly related to the separation between the branch target 
calculation and the doit, and would also benefit from such 
a compiler. The dynamic separation results are shown in 
Figure 10.
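One way to measure dynamic separation is to count the instructions executed between each branch and the doit that consumes it. The sketch below does this over a list of executed mnemonics. The trace format, the mnemonic sets, and the first-in first-out pairing of branches with doits are assumptions made for illustration and do not describe the actual measurement tool.

```python
from collections import deque


def doit_separations(trace, branch_ops=("branch",), doit_ops=("doit",)):
    """Return, for each branch/doit pair, the number of instructions between them.

    Branches are paired with doits in first-in first-out order (an assumption).
    """
    pending = deque()   # trace positions of branches still awaiting a doit
    separations = []
    for position, op in enumerate(trace):
        if op in branch_ops:
            pending.append(position)
        elif op in doit_ops:
            if not pending:
                raise ValueError(f"doit at position {position} has no branch")
            start = pending.popleft()
            separations.append(position - start - 1)
    return separations


# Hypothetical dynamic trace: two useful instructions follow the first branch
# before its doit, and none follow the second.
trace = ["add", "branch", "load", "sub", "doit", "branch", "doit"]
seps = doit_separations(trace)
print(seps, sum(seps) / len(seps))  # prints "[2, 0] 1.0"
```

A larger average separation leaves more time for the target instructions to be prefetched before the doit takes effect.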
Figure 10. Dynamic branch/doit separation
11. Conclusions
The current prototype of Fred is in the form of a 
detailed VHDL model. This model is completely 
functional, including the out-of-order instruction completion 
and functionally precise exceptions. Benchmark results 
seem to bear out the premise that a self-timed 
implementation is a natural match for decoupled computer 
architectures. The ability to allow different parts of the machine to 
proceed at their own rate and the natural use of self-timed 
FIFO queues enhance the decoupling due to the 
architecture. As general processor designs (both synchronous and 
asynchronous) grow more complex and the degree of 
concurrency and decoupling increases, the features and 
techniques found in the Fred architecture (functionally 
precise interrupts, decoupled branches, intelligent 
prefetching, decoupled memory access, etc.) may gain in 
importance.
12. References
[1] F.rik Brunvand. Using FPGAs to prototype a self-timed 
computer. In International Workshop on Field Programma­
ble Logic and Applications, Vienna University of Technol­
ogy, September 1992.
[2] F.rik Brunvand. Tbe NSR processor. In Proceedings o f  the 
26th Annual Hawaii International Conference on System  
Sciences, pages 428-435, Maui, I lawaii, January 1993.
[3] Wesley A. Clark and Charles A. Molnar. Macromodular 
system design. Technical Report 23, Computer Systems 
Laboratory, Washington University, April 1973.
[4] A.L. Davis. Tbe architecture and system method for DDM1: 
A recursively structured data-driven machine. In 5th Annual 
Symposium on Computer Architecture, April 1978.
[5] Matthew Farrens, Pius Ng, and Phil Nico. A comparison of 
superscalar and decoupled access/execute architectures. In 
Proceedings o f  ihe 26th Annual ACM /IEEE International 
Symposium on Microarchitecture, Austin, Texas, December 
1993. IF.F.F,ACM.
[6] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. 
Woods. A micropipelined ARM. In Proceedings o f  the VII 
B a n ff Workshop: Asynchronous Hardware D esign, Banff, 
Canada, August 1993.
[7] J. R. Goodman, J. Hsieb, K. Liou, A. R. Pleszkun, P. B. 
Schechter, and II. C. Young. PIPE: A VLSI decoupled 
architecture. In 12th Annual International Symposium on 
Computer Architecture, pages 20-27. IF.FF. Computer Soci­
ety, June 1985.
[8] Thomas R. Gross, John L. Hennessy, Stephen A. Przybyl- 
ski, and Christopher Rowen. Measurement and evaluation 
of the MIPS architecture and processor. A C M  Transactions 
on Computer Systems, 6(3):229-257, August 1988.
[9] John Hennessy, Norman Jouppi, Forest Baskett, Thomas
Gross, and John Gill. Hardware/software tradeoffs for 
increased performance. In Proceedings o f  the Symposium  
on Architectural Support fo r  Programming Languages and  
Operating Systems, pages 2-11. ACM, April 1982.
[10] Manolis G. II. Katevenis. R educed  Instruction Set Com­
puter Architectures fo r  VLSI. MIT Press, 1985.
[11] Alain Martin, Steven Burns, T.K. Lee, Drazen Borkovic, 
and Pieter Hazewindus. Tbe design of an asynchronous 
microprocessor. In Proc. CalTech C onference on VLSI, 
1989.
[12] Motorola. MC88100 RISC Microprocessor U sers Manual. 
Prentice I lall, F.nglewood Cliffs, New Jersey 07632, second 
edition, 1990.
[13] Alexandru Nicolau and Joseph A. Fisher. Measuring the 
parallelism available for very long instruction word archi­
tectures. IEEE Transactions on Computers, C-33(l I):I IO­
NS, November 1984.
[14] Nigel Charles Paver. The Design and Implementation o f  an 
Asynchronous Microprocessor. PhD thesis, University of 
Manchester, 1994. h t t p  : //w w w . c s  . m an . a c  . u k /  
a m u l e t / p u b l i c a t i o n s / t h e s i s /  
p a v e r9 4 _ p h d .h tm l.
[15] William F. Richardson and F.rik Brunvand. The NSR pro­
cessor prototype. Technical Report UUCS-92-029, Univer­
sity ofUtah, August 1992. f t p : / / f t p . c s . u t a h . e d u /  
te c h re p o r ts /1 9 9 2 /U U C S -9 2 -0 2 9 . p s . Z.
[16] William F. Richardson and F.rik Brunvand. Precise excep­
tion handling for a self-timed processor. In 1995 Interna­
tional Conference on Computer Design: VLSI in Computers 
& Processors, pages 32-37, Los Alamitos, CA, October 
1995. IF.FF. Computer Society Press.
[17] James F„ Smith and Shlomo Weiss. Powerpc 601 and alpha 
21064: A tale of two RISCs. IEEE Computer, 27(6):46-58, 
June 1994.
[18] Robert F. Sproull and Ivan F„ Sutherland. Counterflow pipe­
line processor architecture. Technical Report SMLI TR-94- 
25, Sun Microsystems Laboratories, Inc., M/S 29-01, 2550 
Garcia Avenue, Mountain View, CA 94043, April 1994. 
h t t p : / / w w w . s u n . c o m / s m l i / t e c h n i c a l - 
r e p o r t s /1 9 9 4 / s m l i_ t r - 9 4 - 2 5 .p s .
[19] Ivan Sutherland. Micropipelines. Communications o f  the 
ACM, 32(6):720-738, 1989.
[20] Jose A. Tierno, Alain J. Martin, Drazen Borkovic, and 
Tak Kwan Lee. A 100-MIPS GaAs asynchronous micropro­
cessor. IEEE D esign & Test o f  Computers, 11 (2 ):43—49, 
Summer 1994.
[21] R. M. Tomasulo. An efficient algorithm for exploiting mul­
tiple arithmetic units. IB M  Journal o f  Research and D evel­
opment, 11:25-33, January 1967.
[22] II. C. Tomg and Martin Day. Interrupt handling for out-of­
order execution processors. IEEE Transactions on Comput­
ers, 42(1): 122-127, January 1993.
[23] David W. Wall. Limits of instruction-level parallelism. 
WRL Technical Note TN-15, Digital Western Research 
Laboratory, 100 Hamilton Avenue, Palo Alto, CA 94301, 
December 1990. f t p : / / g a t e k e e p e r . d e c . c o m /  
p u b / D E C / W R L / r e s e a r c h - r e p o r t s / W R L - T N -
15 .p s .
[24] Wm. A. Wulf. Tbe WM computer architecture. Computer 
Architecture News, 16(1), March 1988.