An Asynchronous Microprocessor in Gallium Arsenide by Tierno, Jose A. et al.
An Asynchronous Microprocessor in Gallium 
Arsenide 
Jose A. Tierno 
Alain J. Martin 
Dralen Borkovic 
Tak Kwan Lee 
Computer Science Department 
California Institute of Technology 
Caltech-CS-TR-93-38 
An Asynchronous Microprocessor in Gallium 
Arsenide 
Jose A. Tierno Alain J. Martin Drazen Borkovic 
Tak Kwan Lee 
Department of Computer Science 
California Institute of Technology 
Pasadena, CA 91125 
November 9, 1993 
Abstract 
In this paper, several techniques for designing asynchronous circuits 
in Gallium Arsenide are presented. Several new circuits were designed, to 
implement specific functions necessary to the design of a full microproces-
sor. A sense-amplifier, a completion tree, and a general circuit structure 
for operators specified by production rules are introduced. These circuit 
were used and tested in a variety of designs, including two asynchronous 
microprocessors and two asynchronous static RAM's. One of the micro-
processor runs at over 100 MIPS with a power consumption of 2 Watts. 
1 Introduction 
With an electron mobility about six times that of silicon at room temperature, 
and with a lower parasitic capacitance due to semi-insulating substrate, GaAs is 
potentially faster than silicon. Up until recently, however, GaAs was not avail-
able to the VLSI community at large because of inherent fabrication difficulties. 
These difficulties have been overcome to a large extent. Several foundries are 
now offering GaAs fabrication lines under conditions similar to CMOS, with 
density limited to about 100,000 transistors. In particular, Vitesse Semiconduc-
tors is offering fabrication through MOSIS to the academic community in the 
United States. 
At the moment, the transistor of choice for GaAs digital VLSI is the MES-
FET. There is no oxide insulating the gate of a MESFET transistor from source 
and drain; therefore, the logic families available in GaAs are much less attractive 
than CMOS or even nMOS. DCFL has been adapted to GaAs, but has very re-
duced noise margins, and restricted fanin and fanout. With no complementary 
1 
transistor available, the logic is ratioed. As a result, a considerable fraction of 
the speed advantage is lost due to the complexity of the available logic families 
relative to CMOS. 
We have developed a design method for asynchronous VLSI that is, to a 
large extent, technology independent. Thus, it should be straightforward to 
port a design from one technology to another. Also, since the circuits designed 
are delay-insensitive, they are more robust with respect to variations in physical 
parameters. Hence, the method should make it easier to design in a demanding 
technology such as GaAs, where parameters-particularly threshold voltages-
are difficult to control. Finally, since the circuits we design do not use a clock, 
we avoid the complexities of high-speed clocking schemes. Adapting our method 
to GaAs design would be an excellent demonstration of the advantages of the 
method. 
From the onset, our intention was to port to GaAs the asynchronous micro-
processor we designed in CMOS in 1989 [MBL +89]. At the same time, we would 
demonstrate the portability of our approach across vastly different technologies, 
and the efficiency and robustness of the design method. 
To this end, several special purpose circuits had to be designed. Some of 
them are presented in this paper, together with their use in the microprocessor. 
2 GaAs and the MESFET 
2.1 MEtal Semiconductor Field Effect Transistor 
The MEtal Semiconductor Field Effect Transistor, or MESFET, is the transistor 
of choice for GaAs VLSI applications. It is the easiest to manufacture, provides 
the highest density (about 100,000 transistors on a chip), and has a very good 
power-delay product, that competes fairly well with CMOS. MESFET circuits 
can be better compared to ECL. At this point, GaAs is slightly faster, uses 
far less power, and has higher circuit density, for a similar cost. An important 
application of GaAs is to provide replacements for ECL parts, specially fast 
RAM's, and other LSI circuits. 
MESFET's are junction FET's. The gate is built with metal, and forms a 
Schottky junction with the transistor channel. There is no insulation between 
gate and channel; the Schottky junction creates a diode from gate to source 
and from gate to drain. This diode is probably the most serious constraint for 
MESFET circuits, since it limits the voltage differential between gate and source 
or drain to the diode's forward conduction voltage, about 0.7 volts. Above that 
difference, the diode will be forward biased and current will flow from the gate 
into the channel. 
Hole mobility is low in GaAs (10 times less than electron mobility [LB90]), 
which makes p-type FETs fairly slow. There is, therefore, no complementary 
logic available. As in nMOS, n-type transistors come in two flavors, depletion-
2 
a 
a b b l n ---jr=":"'::""---1 
(a) (b) c 
Figure 1: (a) DCFL nor gate, (b) super buffered nand gate, and (c) "squeeze" 
buffer 
mode and enhancement-mode. E-mode transistors have a positive threshold 
voltage, and d-mode have a negative threshold voltage, that is, they require a 
negative gate to source voltage to be cut off. 
2.2 Direct Coupled Fet Logic 
Analogous to its nMOS namesake, DCFL is the most widely used logic family 
in GaAs VLSI. It is simple, uses little power, and has the highest density of all. 
Figure lea) shows a DCFL nor gate. 
Signals have a restricted voltage swing, because of the input-to-ground diode 
at the input of DCFL gates. Logic-low is about O.lV, while logic-high is about 
O.6V; this drastically reduces the noise margins of DCFL gates. In DCFL nand-
gates noise margins become critical: the series transistors in the pull-down chain 
push the logic low to about O.2V, and the noise margin is so small that many 
designers avoid using nand gates completely. 
As in nMOS, signals often have to be buffered. A super-buffer configuration 
can be used (see Figure l(b) and (c». A super-buffer also increases the noise 
margins by lowering the logic-low voltage, since the output stage is not ratioed. 
3 Description of the Processor 
The microprocessor is a l6-bit, pipelined, RISC-style processor. It is a modified 
version of the CMOS design described in [MBL +89). 
Instructions are issued in order, but may complete out of order. The proces-
sor has 16 general purpose registers, with four buses, two for read and two for 
3 
write. Registers have individual locks to solve read-after-write and write-after-
write hazards. 
In addition to the AL U, there is one adder in the program counter (PC) 
unit, for relative branching and incrementing the program counter register. For 
simplicity, the memory address adder was omitted, thus reducing the number of 
required buses from five to four, and reducing the size of the datapath compared 
to the CMOS processor. 
Other modifications include a revised pipelining of the AL U unit, and better 
balanced control of some of the shared resources. 
The processor was initially specified as a set of concurrent processes. The 
text of these processes, shown in Figures 2 and 3 was later transformed into a 
signal transition language, or handshaking expansion, and then compiled into 
the gate netlist of the final circuit. 
The high-level specification of Figures 2 and 3 shows in detail how the differ-
ent units interact. The language used is similar to C.A.R. Hoare's CSP [Hoa78], 
and described in [MargO]. 
4 GaAs Technology Mapping 
Even though it is be tempting to map the processor design into DCFL gates, the 
specific requirements of asynchronous circuits makes that choice impractical. 
A robust DCFL circuit has to be made almost exclusively from nor gates, 
plus super-buffers and a few special purpose circuits. We could build the pro-
cessor with these elements, but at the expense of compromising the delay-
insensitivity of the circuit in almost every gate. It is usually safe to implement 
C-elements with nor gates; however, in such a large design it would be almost 
impossible to keep track of delays and make sure that no hazard threatens the 
functionality of the circuit. Parts of the layout are generated with standard-
cell-place-and-route tools, where we do not have as much control over delays as 
we need. 
The complexity and number of inputs that a DCFL gate can have are limited 
by noise margins, subthreshold currents, and variations in threshold voltages. 
The synthesis method sometimes generates large or complex operators, and 
these would have to be decomposed into smaller ones to fit in the available 
smaller DCFL gates. This decomposition creates a number of extra nodes that 
may introduce hazards or races. A proper decomposition of the bigger operators 
is, at best, a very hard task. We need circuits that allow a direct synthesis of 
those operators, up to a reasonable size. To solve the problem of large operators, 
we have investigated several alternative logic configurations. 
4 
fMEM == *UD!imem[pc]J 
FETCH == *[PCll; fD?i; PCf2; El!i; 
J 
[ offset(i.op) --+ PCll; ID?offset; PCf2; OF 
D ...,offset(i.po) --+ skip 
J; E2 
PCADD==( *[[ PCll --+ PCll; y:=pc+l; PCf2; pc:=y 
D PCAl --+ PCA1; y:= pc + offset; PCA2; pc:= y 
D Xpc 
D Ypc 
JJ 
--+ X!pc e Xpc 
--+ Y?pc e Ypc 
II * [[ Xof --+ X!offset e Xof JJ 
) 
EXEC == *[El?j; 
[ alu(j.op) --+ E2; Xse YseAC!j.opeZAseP 
D ld(j.op) --+ E2; ZMs e Ys e MCL 
D st(j.op) --+ E2; Xse YseMCS 
D adi(j.op) --+ OF; E2; Xofe YseAC!addeZAseP 
D stpc(j.op) --+ Xpce YseAC!addeZAseP; E2 
D jmp(j.op) --+ Ypce Ys; E2 
D brch(j.op) --+ OF; F?f; 
JJ 
[ cond(f,j.cc) --+ PCA1; PCA2 
D ""cond(f,j.cc) --+ skip 
J; E2 
Figure 2: The program, first part 
5 
ALU == ( *[[ AC ------> AC?op. X?x. Y?y; 
(z,f) := aluf(x, y, op,f). B 
U F ------> F!f 
]] 
II * [B; ZA!z. V] 
II *[P; V] 
) 
MU == *[[ MCL ------> Y?ma. MCL; MDL?w; ZM!w 
D MCS ------> X?w. Y?ma. MCS; MDS!w 
]] 
DMEM == * [[ MDL ------> MDL!dmem[ma] 
D MDS ------> MDS?dmem[ma] 
]] 
REG[k] == ( 
II 
II 
II 
) 
*[[ -.bkl\k=j.xI\Xs --+ X!r.Xs]] 
*[[ -.bkl\k=j.yl\ Ys ------> Y!r. Ys]] 
*[[ -.bkl\k =j.zI\ZAs ------> bki; ZAs; ZA?r; bkl ]] 
*[[ -.bkl\k=j.zI\ZMs ------> bkT; ZMs; ZM?r; bkl ]] 
Figure 3: The program, second part 
4.1 Datapath 
The datapath comprises three different units: the AL U, the PC unit for manip-
ulations of the program counter, and the memory unit for execution of load and 
store operations. They are mostly combinatorial circuits, replicated a number 
of times. Data-path delay is determined mostly by carry-chain, control signal, 
and bus delays. Carry-chain delay is data dependent, since the adder uses carry 
prediction [Mar92]. 
It is important that the datapath be optimized for size: in the datapath 
most signals are local, with the exception of control lines and buses, and thus 
delays and power depend directly on the physical dimensions of the datapath. 
The datapath must be optimized for power; in this design, for example 70% 
of all power is spent by the datapath, 15% by the register file, and 15% by the 
control logic (not counting pad-driver power). 
To satisfy these constraints, all datapath gates are DCFL, except nand gates 
and buffers, and completion-detection circuits. 
N and gates were implemented as shown in Figure l(b) [LB90l. The super-
6 
a 
b 
c 
d 
(a) (b) 
Figure 4: Multi-input C-element. (a) Transistor schematic, and (b) logical 
diagram 
buffer stage allows the output low voltage to be low enough, since the pull-down 
does not have to fight a passive pull-up. The lower output low increases the 
noise margins considerably, with a small penalty in area and power. 
Super-buffers were used to buffer bus and control signals. To improve per-
formance and noise margin characteristics, a feed-back transistor was added, 
creating a "squeeze" buffer (see Figure l(c) [Br092]). Squeeze buffers allow the 
use of a stronger pull-up transistor, the feed-back transistor limits the output 
high voltage. 
To generate completion signals from the datapath, we use C-elements with a 
large number of inputs. They can be built from smaller C-elements connected in 
a tree [Mar90], or, as in this case, as a single logic gate (see Figure 4( a) and (b)). 
For a discussion on how this circuit works, see [Tie92J. Though a completion 
tree could be implemented with DCFL nor gates, it would be significantly slower 
and bigger than that of Figure 4(a). Completion detection is in sequence with 
the calculations performed by the datapath, and affects performance directly. 
It is critical to have an easy and fast way of generating the completion signal. 
4.2 Control Logic 
Control logic takes care of the sequencing of actions in the processor. After 
compilation, each control signal is assigned a set of "production rules", of the 
form: 
G -+ zi 
H -+ zl 
where G and H are boolean expressions in terms of the other signals. G and 
H do not have to be complementary. In fact, most operators in the control are 
7 
a 
b 
d z 
a 
c 
b 
d z 
c 
(a) d 
z a 1\ -,b v d 1\ c -+ z r 
z e -,d 1\ -'e -+ z 1 
(b) (c) 
d 
e 
Figure 5: Dual rail implementation of control signals. (a) Transistor schematic, 
(b) logical diagram, and (c) production rules 
state holding, that is, G V H does not hold. There are different ways of giving 
a direct implementation of these production rules. One, Source Follower Fet 
Logic, is given in [Tie92]. In that paper, a systematic way of generating any 
operator described by production rules is presented. This method was applied 
in the design of the first GaAs microprocessor. However, it resulted in a circuit 
with a large power consumption (4 Watts) and modest performance (70 MIPS). 
For the second GaAs processor, a different approach was used. Each signal is 
implemented as a dual-rail encoded signal, always generating both positive and 
negative sense. Since most signals are state-holding, this is necessary anyway 
(we do not have dynamic logic available). On the up side, no inverters are 
necessary to generate the boolean expressions for the production rules. On 
the down side, combinatorial gates require extra circuitry to generate dual-rail 
outputs, as do signals coming in from the datapath. 
Figure 5 shows how a specific set of production rules are implemented in dual-
rail. Note the feed-back transistor on the outputs of the individual nor gates. 
These have the same function as the one on the "squeeze" buffer, allowing us 
to use much stronger pull-up transistors. 
4.3 PLA's 
The power consumption of these circuits is relatively high, and the chips run hot 
(around 100°C). At this temperature, subthreshold currents of the pull-down 
8 
Figure 6: Example of a nor-nor PLA implemented with source-followers 
transistors may be strong enough to overpower the pull-ups; nor gates with 
more than 6 inputs are impractical, because the off current of 6 transistors is 
of the same order of magnitude as the on current of one transistor. Therefore, 
static DCFL PLAs cannot be used. 
We use a different structure for the PLAs in the processors. The nor planes 
are implemented with source followers, which can be turned off more effectively 
than the corresponding DCFL structure (see Figure 6). A penalty is paid in 
speed and power, but min-terms with up to 10 inputs are realizable. The internal 
signals in the nor plane can switch rail to rail, giving much improved noise 
margins, and the level shifting diodes help reverse-bias the pull-up transistors 
of the source-followers. Also, the ratio between the pull-up and pull-down in 
the source-follower is close to 1, and the subthreshold current of the pull-down 
can better balance the pull-up. 
4.4 Register File 
The register file has 16 registers, 16 bits wide, and four ports, two for read and 
two for write. Register 0 is hard-wired to be read always as zero. Each register 
bit has a total of 12 transistors, four for the flip-flop and two extra for each 
port (see Figure 7). All ports are dual-rail, that is, data and inverted data 
is provided. Between reads and writes the buses are pre-charged to a neutral 
value, to prepare for the next operation, and reset the completion circuits. 
Read ports are implemented with a dual-ended sense amplifier (see Figure 8). 
This sense-amp detects a small difference between the true and false buses, and 
drives these buses strongly in opposite directions, using transistors T5, T4, and 
T1. To work properly, the register that is being read has to be selected some 
time prior to applying the sense signal. To this effect, the sense signal is derived 
from an "or" of all select signals for the given port. 
9 
Figure 7: Register cell 
Figure 8: Sense amplifier, and pre-charge circuit for the register file 
When sense is off, transistors T8,T3 and T7,T2 pre-charge buses bT and bF 
to a value determined by the ratio between T8 and T3 (T7 and T2). Transistor 
T6 further ensures the symmetry of the circuit. 
The buses are then buffered before going into the datapath, to isolate the 
sense amplifier, and to restore the logic levels of the data signals. 
5 Results 
Using these techniques, several circuits have been designed, fabricated and 
tested on the HGAAS II and HGAAS III processes offered by Vitesse Semi-
conductors. Among them, small RAM's, register files, and two microprocessors. 
10 
5.1 Asynchronous Static Memories 
We designed and fabricated two different types of static memories in GaAs. 
The first is a dual ported register file, 16 words of 4 bits/word. It was meant 
to provide a small amount of fast memory for the microprocessor to run test 
programs at full speed. Out of 30 bonded devices, 29 were functional. Access 
time is 5ns, and the chip dissipates 500m W at 2.2V. 
The second SRAM has 64 words of 4bits/word [Hom 1], and was designed 
as an intermediate step towards a larger memory to be used as a cache for the 
processor. All 30 bonded devices received were functional. The access time is 
3ns, including pad delays, and the chip dissipates 700m W at 2.3V 
This 64 x 4 memory was designed after the first processor, and incorporates 
several improvements derived from our experience with the earlier designs. Also, 
the circuits were carefully optimized for high speed and low power consumption. 
The performance obtained indicates that the improvements envisioned for the 
next processor generation should be attainable. 
5.2 Microprocessors 
The first microprocessor design uses the circuits described in [Tie92J. The main 
concern during the design was to get around parameter variation problems and 
noise margin considerations. This was achieved, but at the expense of power 
and performance. At 70 MIPS, the processor consumes 4 Watt. 
Several new ideas and circuits were incorporated in the new design, many of 
them presented in this paper. All of those circuits were fabricated and tested 
successfully. 
This design was extensively simulated with Hspice. The expected perfor-
mance of this design is about 200 MIPS with a dissipation of 2 Watts, fabri-
cated with the HGAAS III process. Power and speed predictions have been very 
accurate so far using the Hspice models. However, in this case the measured 
performance was only 100 MIPS. The causes are still under investigation. There 
is some evidence of fabrication problems, and underestimation of the parasite 
capacitances as extracted by the MAGIC layout program. 
6 Conclusions 
In the course of this project, we have designed a number of very different GaAs 
circuits. Most of them had very strict requirements, to overcome the limitations 
of GaAs. No general solution is given to synthesize all logic circuits in a design 
as big as a microprocessor. Instead, we found specific solutions to implement 
completion trees, control circuits, PLA's, registers, etc. 
The first GaAs microprocessor, though disappointing in terms of perfor-
mance, gave us invaluable experience in verifying and testing these circuits, 
as well as which were the inherent problems of each. Together with some of 
11 
the RAM designs, it allowed us to prepare for a second mIcroprocessor, with 
considerable improvements in performance. 
Another factor affecting performance is pad delay. So far we have used 
ECL levels on the outside, to be able to interface to standard parts and simplify 
prototyping. Pad delays are in the order of Ins, mostly spent in level conversion; 
also, the padframe uses a considerable amount of power -close to 1A worst case 
for the processor-. This delay and power can be greatly reduced by designing 
matched pad drivers and receivers in a system composed exclusively of GaAs 
parts. It would certainly be a requirement in the interface with cache memory. 
Porting the design of the original CMOS microprocessor was almost as easy 
as expected. A few changes were necessary because of the complexity of register 
cells in the first GaAs version. These changes were carried over to the second 
processor to speed-up the redesign. 
All considered, the expected performance of the new microprocessor is sat-
isfactory. At 50 MIPS/watt, it offers remarkable speed for the power consump-
tion. 
Acknowledgments 
We are indebted to Marcel van der Goot for his help in generating high quality 
software support. Acknowledgment is due to Steve Burns and Pieter Hazewin-
dus, for their participation in the original design, and many very useful com-
ments and conversations, and H. Peter Hofstee for his collaboration in the design 
of asynchronous memories. Many thanks to Ray Milano of Vitesse Semiconduc-
tors, for invaluable help with the HGAAS technology, and to Cindy Hibbert 
and Metasoftware, for their help with Hspice. 
The research described in this paper was sponsored by the Defense Advanced 
Research Projects Agency, DARPA Order number 6202, and monitored by the 
Office of Naval Research under contract number N00014-87-K-0745. 
References 
[Br092] 
[Hoa78] 
[Hof91] 
[LB90] 
Richard Brown. Private communication, 1992. 
C.A.R. Hoare. Communicating sequential processes. Comm. ACM, 
21(8) :666-677, 1978. 
H. P. Hofstee. Deriving some asynchronous memories. Unpublished, 
1991. 
S. I. Long and S. E. Butner. Gallium Arsenide Digital Integrated 
Circuit Design. McGraw-Hill, New York, 1990. 
12 
[Mar90] Alain J. Martin. Synthesis of asynchronous VLSI circuits. In 
J. Straunstrup, editor, Formal Methods for VLSI Design, pages 237-
283. North-Holland, 1990. 
[Mar92] A. J. Martin. Asynchronous datapaths and the design of an asyn-
chronous adder. Formal Methods in System Design, 1(1):117-137, 
1992. 
[MBL +89] Alain J. Martin, Steven M. Burns, T. K. Lee, Drazen Borkovic, and 
Pieter J. Hazewindus. The design of an asynchronous microproces-
sor. In Charles L. Seitz, editor, Advanced Research in VLSI: Proceed-
ings of the Decennial Caltech Conference on VLSI, pages 351-373. 
MIT Press, 1989. 
[Tie92] J. A. Tierno. Designing asynchronous circuits in Gallium Arsenide. 
CS-TR-92-19, California Institute of Technology, 1992. 
13 
