VLSI design of the tiny RISC microprocessor by Abnous, A. et al.
UC Irvine
ICS Technical Reports
Title
VLSI design of the tiny RISC microprocessor
Permalink
https://escholarship.org/uc/item/2gp3v5xw
Authors
Abnous, A.
Christensen, C.
Gray, J.
et al.
Publication Date
1991
 
Peer reviewed
University of California
VLSI DESIGN OF THE TINY RISC
MICROPROCESSOR
A. Abnous, C. Christensen, J. Gray,
J. Lenell, A. Naylor, N. Bagherzadeh
Department of Electrical and Computer Engineering
Department of Information and Computer Science
University of California, Irvine
Irvine, California 92717
Technical Report No. 91-74 
VLSI Design of the Tiny RISC 
Microprocessor 
Arthur Abnous, Christopher Christensen, Jeffrey Gray, 
John Lenell, Andrew Naylor, and Nader Bagherzadeh 
Department of Electrical and Computer Engineering 
University of California, Irvine 
Irvine, CA 92717 
Abstract 
This report describes the Tiny RISC microprocessor designed 
at UC Irvine. Tiny RISC is a 16-bit microprocessor and has a 
RISC-style architecture. The chip was fabricated by MOSIS [1] in 
a 2µm n-well CMOS technology. The processor has a cycle time 
of 70 ns. 
1 Introduction 
This report presents the VLSI design and implementation of the Tiny RISC 
microprocessor. Tiny RISC is a 16-bit microprocessor with a RISC-style in-
struction set. The chip was fabricated by MOSIS in a 2µm n-well, 2-level 
metal CMOS technology. The processor is pipelined, and can execute an in-
struction in each cycle. The instruction set is designed for efficient pipelining 
and decoding. Tiny RISC has a cycle time of 70 ns (14.3 MHz clock speed). 
This gives it a peak performance of 14 16-bit MIPS. The processor was laid out 
using the MAGIC VLSI layout system [2]. IRSIM [3] was used extensively to 
simulate and verify the operation of the processor. PLAs were created using
MPLA [4]. 
2 Instruction Set 
Table 1 outlines the instruction set of Tiny RISC. Instruction formats are 
shown in Figure 1. All instructions, memory addresses, and operands are 16 
bits wide. Most instructions are of the arithmetic/logic/shift type. src1 and
src2 are source registers and dest is the destination register. There are eight
general-purpose registers, R0 through R7. R0 is hardwired to contain zero at all
times. Writing to it is allowed but does not affect its content. Shift instructions
have two versions, constant and variable. For constant shift instructions, the
shift amount is specified by the lower four bits of a 5-bit immediate value (I).
For variable shift instructions, the shift amount is specified by the lower four
bits of the contents of src2. ADDI and SUBI are the immediate versions of
ADD and SUB with src2 replaced by a 5-bit immediate value (I), which is
always zero-extended. The LDI instruction loads an 8-bit immediate value
(LI) into dest. LI is also zero-extended. Set instructions (SEQ, SLT, SLTU,
SGE, SGEU) set or clear the least significant bit of dest by comparing src1 and
src2 and checking for the specified condition. The upper 15 bits are always
cleared to zero. Set instructions are usually used to compute condition codes
for a subsequent branch instruction.
Load/Store instructions use the register indirect addressing mode. There
are two instructions in this category: LDW and STW. For both instructions,
src1 contains a data memory address. The LDW instruction loads M[src1]
into dest (in this context, M[x] refers to the content of the memory location
addressed by register x). The STW instruction stores the content of src2 to
M[src1]. Tiny RISC has a Harvard architecture. There are two separate memory
spaces: one for instructions and one for data. Each memory is accessed
via a separate 16-bit multiplexed bus. The DATA bus provides access to the
data memory, and the INST bus provides access to the instruction memory.
The only difference between the operations of the two buses is that there are
no write operations to the instruction memory.
Branch instructions (BRF and BRT) check the least significant bit of src1
and transfer control to the target of the branch if the specified condition is
met. The target address of branch instructions is computed by adding an 8-bit
immediate offset (O) to the address of the instruction following the branch.
O is always sign-extended. The CALL instruction is used for procedure call
operations. The Program Counter (PC) is loaded with src1, and the return address
is saved into dest. The JUMP instruction loads the PC with the content of
src1. It is used to return from procedure calls.
Opcode  Parameters         Description
ADD     src1, src2, dest   add
SUB     src1, src2, dest   subtract
AND     src1, src2, dest   bitwise AND
OR      src1, src2, dest   bitwise OR
XOR     src1, src2, dest   bitwise exclusive-OR
XNOR    src1, src2, dest   bitwise exclusive-NOR
LSL     src1, src2, dest   logical shift left (variable)
LSR     src1, src2, dest   logical shift right (variable)
ASR     src1, src2, dest   arithmetic shift right (variable)
SEQ     src1, src2, dest   set if equal
SLT     src1, src2, dest   set if less than
SLTU    src1, src2, dest   set if less than unsigned
SGE     src1, src2, dest   set if greater than or equal
SGEU    src1, src2, dest   set if greater than or equal unsigned
ADDI    src1, #I, dest     add immediate
SUBI    src1, #I, dest     subtract immediate
LSLI    src1, #I, dest     logical shift left (constant)
LSRI    src1, #I, dest     logical shift right (constant)
ASRI    src1, #I, dest     arithmetic shift right (constant)
LDI     #LI, dest          load immediate
LDW     src1, dest         load word
STW     src1, src2         store word
BRF     src1, #O           branch if false
BRT     src1, #O           branch if true
JUMP    src1               jump
CALL    src1, dest         call
Table 1: Instruction Set
Format           Fields (width in bits)
Compute          opcode (5) | dest (3) | src1 (3) | src2 (3) | - (2)
Immediate        opcode (5) | dest (3) | src1 (3) | I<4..0> (5)
Long Immediate   opcode (5) | dest (3) | LI<7..0> (8)
Branch           opcode (5) | O<7..5> (3) | src1 (3) | O<4..0> (5)
Figure 1: Instruction Formats
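The field layout of Figure 1 can be illustrated with a short packing sketch. This is only an illustration under stated assumptions: the fields are assumed to be laid out from the most significant bit (opcode in bits 15-11) to the least significant bit, the function names are hypothetical, and the opcode numbers used in the usage note are placeholders, since the report does not list opcode encodings.

```python
# Sketch of the four instruction formats of Figure 1.
# Assumption: fields are packed MSB-first, so the opcode occupies bits 15-11.
# All function names here are hypothetical helpers, not part of the report.

def pack_compute(opcode, dest, src1, src2):
    # opcode(5) | dest(3) | src1(3) | src2(3) | 2 low bits left at zero
    return (opcode & 0x1F) << 11 | (dest & 7) << 8 | (src1 & 7) << 5 | (src2 & 7) << 2

def pack_immediate(opcode, dest, src1, imm5):
    # opcode(5) | dest(3) | src1(3) | I<4..0>(5)
    return (opcode & 0x1F) << 11 | (dest & 7) << 8 | (src1 & 7) << 5 | (imm5 & 0x1F)

def pack_long_immediate(opcode, dest, imm8):
    # opcode(5) | dest(3) | LI<7..0>(8)
    return (opcode & 0x1F) << 11 | (dest & 7) << 8 | (imm8 & 0xFF)

def pack_branch(opcode, src1, offset8):
    # opcode(5) | O<7..5>(3) | src1(3) | O<4..0>(5); the offset is split in two
    o = offset8 & 0xFF
    return (opcode & 0x1F) << 11 | (o >> 5) << 8 | (src1 & 7) << 5 | (o & 0x1F)

def unpack_branch_offset(word):
    # Reassemble the split 8-bit offset and sign-extend it.
    o = ((word >> 8) & 7) << 5 | (word & 0x1F)
    return o - 0x100 if o & 0x80 else o
```

For example, with a placeholder opcode value of 0b10101, `unpack_branch_offset(pack_branch(0b10101, 3, -3))` recovers the offset -3. Splitting the branch offset keeps src1 in the same bit positions as in the other formats, which is consistent with the stated goal of efficient decoding.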
        Time ->
I       IF  ID  EX  WB
I1          IF  ID  EX  WB
I2              IF  ID  EX  WB
I3                  IF  ID  EX  WB
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute
WB: Write Back
Figure 2: Pipeline Structure of Tiny RISC
3 Pipeline Structure 
Figure 2 shows the pipeline structure of Tiny RISC. There are four pipeline
stages: IF (Instruction Fetch), ID (Instruction Decode), EX (EXecute), and 
WB (Write Back). The instruction set was designed with this pipeline struc-
ture in mind. Each pipeline stage takes one cycle to complete its function. 
The timing of each cycle is derived from a non-overlapping four-phase clock. 
As shown in Figure 3, there are five different clock signals along with their 
complements: Ph1, Ph2, Ph3, Ph4, and Ph123.
During the IF stage, an instruction is fetched from the program memory. 
Figure 3: Four-Phase Clocking Scheme
The instruction is decoded, and the source operands are read from the register 
file during the ID stage. Most control signals are generated in this stage. 
The control signals that become active during later pipeline stages are delayed 
by the proper amount. Control transfer instructions update the PC at the
end of the ID stage. This means that all control transfer instructions have a
delay of one cycle. The instruction following a control transfer instruction is
always fetched and executed. It is the programmer's responsibility to schedule
a useful instruction in the delay slot. If no such instruction can be found,
a NOP should be scheduled in the delay slot. During the EX stage, the
actual operation specified by the instruction is performed. LDW and STW 
instructions access the data memory during this stage. In the WB stage, the 
result of the instruction is written to the register file. 
To avoid data hazards caused by the delay between the ID and WB stages,
Tiny RISC uses one level of bypassing [5]. The bypassing hardware compares 
the source register addresses of the instruction about to enter the EX stage 
to the destination register address of the previous instruction. If there is a 
match, the operand read from the register file is discarded, and instead, the 
result of the previous EX stage is used. 
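The bypassing rule just described can be sketched as a simple operand-selection function. This is a minimal behavioral sketch, not the circuit: the function and parameter names are hypothetical, and the `prev_writes` flag (for previous instructions such as STW that produce no register result) is an assumption. The handling of R0 in the bypass path is not described in the report and is not modeled here.

```python
# Behavioral sketch of one level of bypassing (names are hypothetical).

def select_operand(src_addr, regfile_value, prev_dest, prev_result, prev_writes=True):
    """Return the operand that enters the EX stage.

    src_addr      -- register number named by the instruction entering EX
    regfile_value -- value read from the register file in the ID stage
    prev_dest     -- destination register number of the previous instruction
    prev_result   -- result of the previous EX stage
    prev_writes   -- assumption: False if the previous instruction writes no register
    """
    if prev_writes and src_addr == prev_dest:
        # Match: the register file value is stale, so use the bypassed result.
        return prev_result
    return regfile_value
```

For instance, if the previous instruction writes R3 and the current one reads R3, `select_operand(3, 10, 3, 99)` returns the bypassed value 99 instead of the stale value 10.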
4 Data Path Design 
Figure 4 shows the structure of the data path of the processor. The dimensions 
of various hardware blocks are shown in microns. The data path of the processor
consists of the Register File, the Bypassing Unit, the Load/Store Unit,
the Shifter, the ALU, and the PC Unit. The control circuitry, the instruction 
register, and the clock generator are all located on the right hand side of the 
data path. 
Each bit-slice of the data path is 56 µm wide and can accommodate seven 
metal2 (second-level metal) tracks running in the vertical direction. One of
these tracks is used to distribute power or ground. This track is shared between
two adjacent bit-slices. This is done by flipping every other bit-slice (see
Figure 5). One track is allocated to the S1 (src1) bus, one to the S2 (src2)
bus, and one to the D (dest) bus. The other three tracks are used for local
routing within the data path. 
Static flow-thru latches were used throughout the data path. Figure 6 
shows the schematic of a latch cell. The layout of a latch cell is shown in 
Figure 7. 
5 Operand Unit 
The Operand Unit (OU) consists of the following hardware blocks: 
1. Register File 
2. Bypassing Unit 
3. Load/Store Unit 
The block diagram of the OU is shown in Figure 8. 
5.1 Register File 
The register file is dual-ported and contains eight registers, R0 through R7. Register
R0 is hardwired to contain zero at all times. Figure 9 shows the floorplan
of the register file. The dimensions of various hardware blocks are shown in
microns. The floorplan of the register file is centered around an 8 by 16 array
of register cells (eight 16-bit registers). The first row from the bottom corresponds
to register R0. Two sets of decoders (one for each port) are positioned
on the right hand side of the register cell array. The read/write circuitry is 
located at the bottom of the register cell array and facilitates access to the 
register file. 
Block             Dimension (microns)
Register File     500
Bypassing         470
Load/Store Unit   480
Shifter           450
ALU               600
PC Unit           1050
(Overall dimension marked on the figure: 900 microns)
Figure 4: Data Path Structure
Track order across two adjacent bit-slices: Vdd, D, S1, S2, GND, S2, S1, D, Vdd
Figure 5: Data Path Track Assignment
[Schematic: latch cell with signals IN, CLK, and OUT]
Figure 6: Data Path Latch Cell
[Layout plot: size 50 x 65 microns]
Figure 7: Layout of the Latch Cell
[Block diagram: Register File with Read/Write Circuitry, the S1, S2, and D buses, MADR with select signals S0 and S1, IMM with select_immediate, bypass_to_src1 and bypass_to_src2, connections to and from the DATA bus pads, and clocks Ph1, Ph2, Ph3, Ph4, and Ph123]
Figure 8: Block Diagram of the Operand Unit
[Floorplan: Register Cell Array, decoders, and Read/Write Circuitry; dimensions marked on the figure: 900, 160, 90, 90, 340, 160]
Figure 9: Floorplan of the Register File
The register cell design was based on the CMOS version of the 6-transistor
dual-port static RAM cell with split word lines [6], which is shown in Figure 10.
The layout of the register cell is shown in Figure 11. A group of
four register cells is shown in Figure 12. The register cell consists of a pair
of cross-coupled inverters (providing the bistability needed for static storage)
and two access transistors. Each access transistor is controlled by one of the
word lines (WORD1 or WORD2), which are driven by the decoders.
To perform a read operation, the bit lines (BIT1 and BIT2) are precharged
high first. If the register is selected on port 1, then WORD1 goes high, providing
an access path from the storage node DATA to BIT1. If DATA is high,
then BIT1 will stay precharged. However, if DATA is low, then the pull-down
transistor of the inverter driving DATA will start to discharge the bit line. The
capacitance of the bit line and the strength of the pull-down network determine
the speed with which the bit line is pulled low.
If the cell is not properly designed, a read operation can destroy the information
stored in the cell. The reason is that when the access transistor turns
on, the high level voltage of the precharged bit line will be divided between
the series combination of the access transistor and the pull-down transistor
of the inverter. If the voltage drop across the pull-down transistor is
sufficiently high to turn on the pull-down transistor of the other inverter
of the cell, then the cell is in danger of losing its data. This means that the
cell has to be designed so that the disturbance caused by a read operation is
not strong enough to destroy the cell information. This requires that the ratio
of the β of the pull-down transistor to that of the access transistor be large
enough to reduce the read disturbance voltage to the desired level. A safe design
will reduce the read disturbance voltage to less than the threshold voltage
of an n-transistor. This will ensure that the n-transistor of the other inverter
of the cell will not be turned on by the read disturbance. In this design, the
width of the inverter pull-down transistor is 6 microns, and the width of each
access transistor is 4 microns. The length of all transistors is minimum size,
which is 2 microns.
[Schematic: dual-port register cell with bit lines BIT1 and BIT2]
Figure 10: Dual-Port Register File Cell
[Layout plot: size 60 x 50 microns; word lines WORD1 and WORD2 labeled]
Figure 11: Layout of the Dual-Port Register File Cell
[Layout plot: size 116 x 92 microns]
Figure 12: A Group of Four Register Cells
Phase   Register file activity
Ph1     write
Ph2     precharge bit lines; decode read addresses
Ph3     read
Ph4     decode write address
Figure 13: Timing of the Register File
The write operation is performed by setting the desired data on the BIT1 line
and the complement of the data on BIT2 and turning on both word lines. The
cell is actually written to by the bit line carrying the low level. The bit line
with the high level cannot disturb the cell because the cell is designed not to
be disturbed by a high level on the bit lines (as mentioned earlier, this is done
to ensure that the cell information is not destroyed by a read operation, which
starts with precharged bit lines). The cell was designed such that the bit line
with the low level can flip the state of the cell. This requires that the ratio of
the β of the access transistor to that of the inverter pull-up transistor be large
enough to initiate a write operation. In this design, the width of the inverter
pull-up transistor is 3 microns, and the length of the transistor is 2 microns.
The timing of the register file, as well as the rest of the processor, is based
on a four-phase clock. The timing of the register file is shown in Figure 13.
During Ph2, the bit lines are precharged, and the source register addresses
are decoded. In Ph3, the word lines are driven by the decoders, and the read
operation takes place. In Ph4, the source operands are driven onto the S1 and
S2 buses. Also, the register address for the write operation to follow in the
next Ph1 is decoded. During Ph1, the desired data is driven onto the bit
lines, the outputs of the decoders are driven onto the word lines, and the write
operation takes place.
5.2 Load/Store Unit 
For load/store operations, the data memory is accessed via a 16-bit multi-
plexed bus (the DATA bus). The external bus protocol for a read operation is 
shown in Figure 14. The protocol for a write operation is shown in Figure 15. 
When Ph1 rises, the memory address (src1) is loaded into MADR (Memory
Address/Data Register) and driven onto the DATA bus. This address must
be latched externally. Ph1 serves as the strobe signal for the external address
latch. The address should be latched when Ph1 falls. The DATA bus holds
the address until Ph2 rises. The dead time between Ph1 and Ph2 provides
the proper hold time for the external address latch. For load operations, when
Ph2 rises, the DATA bus is tri-stated to accommodate the incoming data from
memory. The data is loaded into DIR (Data In Register) with the falling edge
of Ph3 and is written to the register file in the WB stage. For store instructions,
when Ph2 rises, the data to be stored to memory (src2) is driven onto
the DATA bus and is then written to the data memory. The DRD (Data
Read) and DWR (Data Write) signals are used to interface to the data memory.
The instruction memory is accessed via the INST bus, which is identical to
the DATA bus except that there are no write operations to the instruction
memory. The IRD (Instruction Read) signal is provided to interface to the
instruction memory.
MADR is a special latch that has two inputs (I0 and I1) and two clock
(or select) signals (S0 and S1). The select signals are assumed to be non-overlapping,
i.e., they can never both be high at the same time. When S0
goes high, I0 is driven to the output. When S1 goes high, I1 is driven to
the output. When S0 and S1 are both low, the output is latched through
bistable action. The schematic of a multiplexing latch is shown in Figure 16.
This circuit achieves both the logical operation and the timing required for
load/store instructions.
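The behavior of the multiplexing latch can be summarized with a small behavioral model. This is a sketch only; the class and method names are hypothetical, and the assignment of Ph1 to S0 (address) and Ph2 to S1 (store data) in the usage note is an assumption, since the report does not state which clock drives which select input.

```python
# Behavioral sketch of the two-input multiplexing latch (hypothetical names).

class MuxLatch:
    def __init__(self, initial=0):
        self.out = initial  # value held by the bistable storage

    def update(self, i0, i1, s0, s1):
        """Apply one set of input/select values and return the output."""
        if s0 and s1:
            # The select signals are assumed to be non-overlapping.
            raise ValueError("S0 and S1 must never be high at the same time")
        if s0:
            self.out = i0      # S0 high: I0 is driven to the output
        elif s1:
            self.out = i1      # S1 high: I1 is driven to the output
        return self.out        # both low: the previous value is held
```

As a usage illustration for a store, assuming S0 is driven by Ph1 and S1 by Ph2: calling `update(address, data, 1, 0)` outputs the address, `update(address, data, 0, 0)` holds it during the dead time between the phases, and `update(address, data, 0, 1)` then outputs the data.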
6 Arithmetic/Logic Unit 
The Arithmetic/Logic Unit (ALU) computes all arithmetic, logical, and com-
parison operations required by the instruction set. Table 2 lists the instructions 
for which the ALU is utilized to compute results. 
For arithmetic and logical instructions, the specified operation is performed
on the source operands (src1 and src2) and the result is written back into
the destination register (dest). For comparison operations, src1 and src2 are
compared, and depending on the outcome of the comparison, a one or a zero
is written back into dest. For example, suppose that R1 contains 0005H and
R2 contains 0007H. If the instruction SLT R1, R2, R3 were executed, then
0001H would be written to R3 to signify that the content of R1 is less than
the content of R2. If the instruction SGE R1, R2, R3 were executed, then 0000H
would be written to R3 to signify that the content of R1 is not greater than
or equal to the content of R2.
[Timing diagram: Ph1, Ph2, Ph3, Ph4, DATA (address, then data), and DRD]
Figure 14: DATA Bus Protocol for Read Operations
[Timing diagram: Ph1, Ph2, Ph3, Ph4, DATA, and DWR]
Figure 15: DATA Bus Protocol for Write Operations
[Schematic: inputs I0 and I1, select signals S0 and S1, output out]
Figure 16: A Multiplexing Latch
Type         Opcode  Operation
Arithmetic   ADD     add
             ADDI    add immediate
             SUB     subtract
             SUBI    subtract immediate
Logical      AND     bitwise AND
             OR      bitwise OR
             XOR     bitwise exclusive-OR
             XNOR    bitwise exclusive-NOR
Comparison   SEQ     set if equal
             SLT     set if less than
             SLTU    set if less than unsigned
             SGE     set if greater than or equal
             SGEU    set if greater than or equal unsigned
Table 2: ALU Operations
6.1 ALU Organization 
The organization of the ALU is shown in Figure 17. This figure also reflects
the actual floorplan of the ALU. Located at the top of the ALU is the PG 
block which generates the P (Propagate) and G (Generate) signals used by 
the carry chain. The carry chain itself is located below the PG block. The 
comparison logic can be found below the carry chain. Immediately below the 
comparison circuitry are the multiplexors which select either the output of the 
carry chain or the output of the comparison logic. 
6.2 PG Block 
The PG block is a Weinberger structure [7] that is used to generate the P
(Propagate) and G (Generate) signals that are needed by the carry chain.
The schematic of the PG block is shown in Figure 18. The PG block and
the carry chain use domino logic. The P output can be made to be any logic
function of S1 and S2, and the G output can be made to be one of four different
logic functions of S1 and S2. The FP<3-0> and FG<1-0> control signals are
used to select a specific function for P and G, respectively.
The P and G signals are generated in a similar manner. The P generator
operates as follows. During Ph4, the p-transistor connected to the P node
turns on, precharging P. During Ph123, the evaluate phase, some of the FP
signals are pulled low, creating possible discharge paths. Each of the four FP
signals (FP0 through FP3) enables the discharge path for one of the four
combinations of the values of S1 and S2: when that FP signal is pulled low, P
is discharged if S1 and S2 take the corresponding combination of values.
Tables 3 and 4 show how a specified function can be generated by the
PG block by controlling the FP and FG inputs.
To perform an addition, P should be S1⊕S2, G should be S1·S2, and C0
(carry in) should be zero. To subtract S2 from S1 (S1-S2), P should be
S1⊙S2 (exclusive-NOR), G should be S1·(NOT S2), and C0 should be one. The logic
functions are accomplished by putting the desired function on P and making G and C0
both zero. This causes P to be passed right through to the output of the carry
chain. All of the comparisons (except for SEQ) are done by using the carry
chain to subtract S2 from S1 and using the comparison logic to check the result.
The SEQ comparison is done by setting P to S1⊙S2, setting G to 0, setting
C0 to 1, and checking C16 (carry out) with the comparison logic. C16 will
be one only if all of the P's are one, meaning that S1 and S2 are bit-for-bit
equal. Because this is a domino circuit, S1 and S2 must be stable when the
precharge transistor turns off. A fluctuation of S1 or S2 could erroneously
discharge the P node. For this reason, the S1 and S2 buses are driven during
Ph4, the precharge phase.
[Block diagram: S1 and S2 enter the PG Block (control inputs FP<3-0> and FG<1-0>); P and G feed the Carry Chain (carry input C0); C16 and P15 feed the comparison logic (control inputs GE_EQ and SIGNED); COMPARE, Ph123, and Ph4 (qualified) control the output to the D bus]
Figure 17: Block Diagram of the ALU
Function   FP0  FP1  FP2  FP3
S1·S2      1    1    1    0
S1+S2      1    0    0    0
S1⊕S2      1    0    0    1
S1⊙S2      0    1    1    0
S1         1    0    1    0
S2         0    1    0    1
0          1    1    1    1
1          0    0    0    0
Table 3: P Generator Functions
[Circuit diagram: inputs S1 and S2, control inputs FP0 through FP3 and FG0 through FG1, clock Ph123, outputs P and G]
Figure 18: Circuit Diagram of the PG Block
Function   FG0  FG1
S1·S2      1    0
S1·S2      0    1
S1         0    0
0          1    1
Table 4: G Generator Functions
6.3 Carry Chain 
The ALU utilizes a dynamic Manchester carry chain [8]. The inputs to the 
carry chain are P<15-0>, G<15-0>, and C0 (carry input for bit 0). The chain
consists of a cascade of Manchester carry elements (see Figure 19), which compute
the carry output for each bit position (C1 through C16). Notice that in the
Manchester carry scheme the complement of the carry is actually propagated. 
During the precharge phase (Ph4), the carry output of each bit is precharged. 
During the subsequent evaluate phase (Ph123), the carry out nodes are con-
ditionally discharged. 
The worst case delay through the carry chain occurs when all P's are high, 
and C0 is high. In this situation, the carry signal has to propagate through
16 series pass transistors. To avoid the excessive delay through a long chain 
of pass transistors, the carry chain is broken into groups, and the carry out-
put of each group is buffered before being fed into the next carry group (see 
Figure 20). 
This strategy reduces the number of series pass transistors to six. The 
performance of the carry chain can be further improved by reducing the ca-
pacitance of the carry nodes in the Manchester chain. This will have the effect 
of decreasing the carry node discharge time. The input capacitance of the 
XOR gates of the sum stage increases the load of the internal carry nodes 
of the carry chain and slows the carry evaluation time. This delay penalty 
can be eliminated by having a second Manchester stage. This second stage 
outputs the carry nodes to the sum stage and is divided in the same manner 
as the first stage. The carry input into each group is the carry output of the 
previous group in the first Manchester stage. The complete carry chain is 
shown in Figure 21. Thus, the first Manchester stage evaluates the carry out 
of each Manchester group very quickly, and distributes these carry outputs 
to the carry inputs of the second stage Manchester groups. In this way, each 
Manchester group in the second Manchester stage is evaluating the carry bits 
to be passed to the sum stage nearly in parallel with the other Manchester 
units of the second stage. Thus, the Manchester adder can be optimized to 
achieve a desired operating speed within technology constraints. 
The second stage of the Manchester carry chain outputs the complements 
of the carry bits for all 16 bits. These carry bits are XORed with the P signals 
in the sum stage to generate the 16 sum bits (S<15-0>). The sum stage uses
static XOR gates. 
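The way the P, G, and C0 settings select an ALU operation can be illustrated with a small bit-level model of the carry recurrence C[i+1] = G[i] + P[i]·C[i] followed by the sum stage S[i] = P[i] ⊕ C[i]. This is a logical sketch only: it does not model precharging, the complemented carries, the buffered carry groups, or the second Manchester stage. All function names are hypothetical. The SEQ helper uses the exclusive-NOR of the operands for P, so that all P bits are one exactly when the operands are equal.

```python
# Logical model of the carry chain and sum stage (hypothetical names).
MASK = 0xFFFF  # 16-bit data path

def carry_chain(p, g, c0, width=16):
    """Return (sum_bits, carry_out) for propagate word p, generate word g, carry-in c0."""
    carry = c0 & 1
    result = 0
    for i in range(width):
        pi = (p >> i) & 1
        gi = (g >> i) & 1
        result |= (pi ^ carry) << i      # sum stage: S[i] = P[i] xor C[i]
        carry = gi | (pi & carry)        # C[i+1] = G[i] or (P[i] and C[i])
    return result, carry

def alu_add(s1, s2):
    # P = S1 xor S2, G = S1 and S2, C0 = 0
    return carry_chain(s1 ^ s2, s1 & s2, 0)

def alu_sub(s1, s2):
    # S1 - S2 = S1 + (not S2) + 1: P = S1 xnor S2, G = S1 and (not S2), C0 = 1
    return carry_chain(~(s1 ^ s2) & MASK, s1 & ~s2 & MASK, 1)

def alu_logic(p):
    # Logic functions: desired function on P, with G = 0 and C0 = 0
    return carry_chain(p & MASK, 0, 0)

def alu_seq(s1, s2):
    # SEQ: P = S1 xnor S2, G = 0, C0 = 1; C16 is 1 only if every P bit is 1
    _, c16 = carry_chain(~(s1 ^ s2) & MASK, 0, 1)
    return c16
```

For example, `alu_sub(0x0005, 0x0007)` yields the sum bits 0xFFFE with a carry out of 0, and `alu_logic` returns its P word unchanged because no carry is ever generated or injected.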
6.4 Comparison Logic 
The comparison logic checks C16 and P15 from the carry chain, and two control 
signals ( GE_EQ and SIGNED) to compute the result of a comparison instruc-
tion. Table 5 lists the logic expression that corresponds to each comparison 
condition (SEQ is treated as SGEU; see Section 6.2). 
In Table 5, V refers to overflow, which is C16⊕C15. Since S15 can be
expanded to P15⊕C15, the term S15⊕V expands to (P15⊕C15)⊕(C16⊕C15),
which in turn simplifies into P15⊕C16. These simplifications are shown in
Table 6. 
Since C16 becomes available before S15, the comparison logic begins eval-
uating before the sum stage is completely done. The comparison logic is quite 
simple (see Figure 22). The comparison logic is unusual in that it is not the 
same in each bit-slice. It was put into the data path mostly because the two 
critical inputs P15 and C16 are on the far side of the data path and would 
otherwise need to be routed about 900 microns to where the control logic is 
located (next to bit 0). The layout of the comparison circuit is very similar to 
the arrangement of the circuit diagram. 
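The simplified comparison expressions can be checked with a short model that performs the subtraction settings (P = S1 exclusive-NOR S2, G = S1·(NOT S2), C0 = 1) and then looks only at P15 and C16. This is a sketch under assumptions: the function name is hypothetical, and the exact encoding of the SIGNED and GE_EQ control signals is assumed here (SIGNED selects the signed test, GE_EQ inverts "less than" into "greater than or equal").

```python
# Sketch of the comparison logic using only P15 and C16 (hypothetical names).
MASK = 0xFFFF

def compare(s1, s2, signed, ge_eq):
    """Return 1 or 0 for SLT/SLTU (ge_eq False) or SGE/SGEU (ge_eq True)."""
    p = ~(s1 ^ s2) & MASK          # propagate word for S1 - S2
    g = s1 & ~s2 & MASK            # generate word for S1 - S2
    carry = 1                      # C0 = 1 for subtraction
    for i in range(16):
        carry = ((g >> i) & 1) | (((p >> i) & 1) & carry)
    c16 = carry
    p15 = (p >> 15) & 1
    if signed:
        less = p15 ^ c16           # SLT: P15 xor C16
    else:
        less = c16 ^ 1             # SLTU: NOT C16
    return less ^ 1 if ge_eq else less
```

With the values from the example in Section 6, `compare(0x0005, 0x0007, signed=True, ge_eq=False)` gives 1 (SLT) and `compare(0x0005, 0x0007, signed=True, ge_eq=True)` gives 0 (SGE).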
[Circuit diagram: Manchester carry element with carry input C_in]
Figure 19: Circuit Diagram of the Manchester Carry Element
[Circuit diagram: labeled signals include C0, P1, G1, C4, and CLK]
Figure 20: Circuit Diagram of a 4-Bit Manchester Carry Group
[Block diagram: P and G signals from the PG Block feed 4-bit carry groups; outputs are C16 and the carries to the Sum stage]
Figure 21: Block Diagram of the 16-bit Manchester Carry Chain
Comparison   Expression
SLT          S15 ⊕ V
SGE          NOT (S15 ⊕ V)
SLTU         NOT C16
SGEU         C16
Table 5: Logic Expressions for Comparison Operations
Comparison   Expression
SLT          P15 ⊕ C16
SGE          NOT (P15 ⊕ C16)
SLTU         NOT C16
SGEU         C16
Table 6: Simplified Logic Expressions for Comparison Operations
[Circuit diagram: control inputs SIGNED and GE_EQ]
Figure 22: Circuit Diagram of the Comparison Circuit
[Circuit diagram: literal lines L0 through L6, select lines S0 through S3, result lines R0 through R3]
Figure 23: Basic Configuration for a 4-Bit Shifter
7 Shifter Unit 
The Shifter Unit is responsible for performing all of the shift operations spec-
ified in the Tiny RISC instruction set. These include logical left shifts, logical 
right shifts, and arithmetic right shifts. The function of the Shifter Unit is to 
shift the operand on the S1 bus by the amount specified by the operand on
the S2 bus (only the least significant four bits of src2 are used). Since the 
Operand Unit drives immediate values onto the S2 bus in the same way as it 
does register operands, the Shifter Unit and its control logic handle constant 
shifts in the same way that they handle variable shifts. 
The core of the shifter consists of a 16 x 16 matrix of pass transistors, 
forming a crossbar switch [6]. The basic configuration of a 4 x 4 crossbar 
shifter is shown in Figure 23. For the sake of simplicity, the operation of the 
shifter will be explained using this 4 x 4 configuration. 
The literal lines (L0-L6) are inputs, and the result lines (R0-R3) are outputs.
The decoded shift amount is driven onto the select lines (S0-S3). Each
node in the shifter core represents a single n-transistor, with its gate, source,
and drain connected to a select, literal, and result line, respectively (see Figure 23).
Using this core, we can accomplish any of the required shift operations
through appropriate selection of how data is applied to the literal and select
lines.
For right shifts, the data is applied to L3-L0 (L3 gets the MSB), and
the shift-in value is applied to L6-L4. This shift-in value is equal to the MSB
of the input data (S1_3 for the 4 x 4 configuration) for arithmetic shifts, and
is always zero for logical shifts. The shift amount is decoded, and one of the
select lines is driven high (S0 for a zero-bit shift, S1 for a one-bit shift, S2 for
a two-bit shift, etc.).
For left shifts, the data is applied to lines L6-L3 (L6 gets the MSB), and the
shift-in value (always zero) is applied to lines L2-L0. The one difficulty with
this method is that it requires that the order of the select lines be reversed,
i.e., S3 now corresponds to a zero-bit shift, S2 to a one-bit shift, etc. This is
easily accomplished by decoding the one's complement of the shift amount and
driving the result on the select lines. This method works when the number of
select lines is a power of two. The complete block diagram of a 4 x 4 shifter is
shown in Figure 24. The Shifter Unit is controlled by two control signals: LS
(Left Shift) and AR (ARithmetic shift). These signals are explicitly provided
in the instruction opcode. The function of these control signals is defined in
Table 7.
LS  AR  Operation
0   0   logical right shift
0   1   arithmetic right shift
1   0   logical left shift
1   1   unused
Table 7: Definition of the Control Signals of the Shifter Unit
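The way the literal and select lines are used for each kind of shift can be illustrated with a behavioral model of the crossbar core. This is a sketch only: the function and parameter names are hypothetical, and the model captures the routing (result line Rj is connected to literal line L(j+k) when select line Sk is high), not the dynamic circuit. The size n must be a power of two, as noted above.

```python
# Behavioral sketch of the n x n crossbar shifter (hypothetical names).

def crossbar_shift(data, amount, left=False, arithmetic=False, n=16):
    """Shift an n-bit value using 2n-1 literal lines and n select lines."""
    data &= (1 << n) - 1
    k = amount & (n - 1)                  # only the low log2(n) bits are used
    lit = [0] * (2 * n - 1)               # literal lines L0 .. L(2n-2)
    if left:
        # Data on the upper literal lines, zeros shifted in on the lower lines.
        for i in range(n):
            lit[n - 1 + i] = (data >> i) & 1
        sel = (n - 1) - k                 # one's complement of the shift amount
    else:
        # Data on the lower literal lines, shift-in value on the upper lines.
        for i in range(n):
            lit[i] = (data >> i) & 1
        fill = (data >> (n - 1)) & 1 if arithmetic else 0
        for i in range(n, 2 * n - 1):
            lit[i] = fill
        sel = k
    result = 0
    for j in range(n):
        result |= lit[j + sel] << j       # Rj is connected to L(j + sel)
    return result
```

For the 4 x 4 configuration, `crossbar_shift(0b1011, 1, n=4)` gives 0b0101, and the same call with `left=True` selects S2 (the one's complement of 1 in two bits) and gives 0b0110.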
The shifter makes extensive use of dynamic logic, requiring precharging for
the literal lines, the result lines, and the decoder. The circuit diagram of a
precharge/discharge path is shown in Figure 25. A dynamic design for the
decoder was chosen because it could fit into the same vertical space that
was taken by the shifter core. The logic driving the literal line is a dynamic
circuit that takes the place of the multiplexor/AND gate combination that was
shown in the block diagram. This portion of the circuit selectively discharges
the literal line based on the value of src1, the MSB of src1, and the
control lines. The discharged literal line then discharges the result line if the
corresponding select line is selected.
The Shifter Unit has the same timing as the rest of the data path. The
shifter evaluates during Ph123, and the result is driven onto the D bus during
Ph4 if the instruction in the EX stage is a shift instruction. The shifter is
precharged during Ph4. Because of the dynamic operation of the shifter, all
input data and control signals to the shifter must reach steady state before
Ph123 starts. The shifter requires 3 ns to precharge and 14 ns to evaluate in
the worst case.
[Block diagram: inputs S1_3 through S1_0 and S2_1 through S2_0, control signals LS and AR, the Shifter Core, result lines R3 through R0, and the Output Latch and D-bus Drivers]
Figure 24: Block Diagram of a 4-Bit Barrel Shifter
[Circuit diagram: precharge/discharge path with labeled signals including L4, R3, S1_3, LS, and AR]
Figure 25: Circuit Diagram of the 4-Bit Shifter
8 Branch Unit 
The main function of the Branch Unit is to execute instructions that affect 
the Program Counter (PC), i.e., JUMP, CALL, BRF, and BRT. The block 
diagram of the Branch Unit is shown in Figure 26. The JUMP instruction 
changes the PC by loading it with src1. The CALL instruction is similar 
to JUMP except that it also saves a return address into dest. For example, 
CALL R4, R6 loads the PC with the content of R4 and saves the return address 
into R6. The return address is the address of the instruction following the 
delay slot of the CALL in the input program. Branch instructions specify a 
16-bit signed offset (O) that is added to the content of the PC to compute the 
address of the target of the branch. The branch condition is the LSB (least 
significant bit) of the src1 register that is specified by the instruction. For the 
BRT instruction, the PC is loaded with PC + O if the LSB of src1 is high; 
for BRF, the PC is loaded with the target address of the branch if the LSB of 
src1 is low. 
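The control-transfer semantics described above can be summarized in a short Python sketch. It is only an illustration and not the authors' simulator: the function name `control_transfer` and its return convention are hypothetical, the offset is assumed to be already sign-extended, and instruction addresses are assumed to advance by one per instruction, so that the instruction following the delay slot of a CALL sits two addresses past the CALL.

```python
MASK16 = 0xFFFF  # Tiny RISC is a 16-bit machine


def control_transfer(opcode, pc, src1=0, offset=0):
    """Model JUMP, CALL, BRT and BRF (hypothetical helper).

    Returns (new_pc, link). new_pc is None when a branch is not taken;
    link is the return address saved into dest (CALL only).
    """
    if opcode == "JUMP":
        return src1 & MASK16, None
    if opcode == "CALL":
        # Assumption: pc is the address of the CALL itself, so the
        # instruction after its delay slot is at pc + 2.
        return src1 & MASK16, (pc + 2) & MASK16
    if opcode in ("BRT", "BRF"):
        condition = src1 & 1                      # LSB of src1
        wanted = 1 if opcode == "BRT" else 0      # BRT: high, BRF: low
        if condition == wanted:
            return (pc + offset) & MASK16, None   # PC + O, modulo 2**16
        return None, None
    raise ValueError("not a control-transfer instruction: " + opcode)
```

For instance, `control_transfer("CALL", 0x0100, src1=0x0200)` models a CALL whose src1 register holds 0200H: the PC is redirected to 0200H and 0102H is saved as the return address under the stated assumption.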
The layout of the Branch Unit follows the structure shown in Figure 26. 
The branch offset enters the Branch Unit from the right and is distributed 
down to each bit-slice, with the MSB going to bits 7-15 for sign extension. 
The sign-extended branch offset is latched into BOR (Branch Offset Register). 
The content of BOR is added to that of the PC to compute the target address 
of a branch instruction. This target address is held in the Target PC (TPC) 
latch. Also, the PC is incremented, and the result is saved in the NPC (Next 
PC) latch. At the end of Ph4, depending on the instruction that is being 
executed, the PC latches one of the following values: 
1. 0000H when the processor is reset 
2. src1 for a CALL or JUMP instruction 
3. TPC for a taken branch instruction 
4. NPC otherwise 
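The PC selection listed above can be sketched in Python as follows. This is a behavioral illustration only: the names `sign_extend` and `select_pc` are hypothetical, the 8-bit width of the offset field is an assumption suggested by the fan-out of the offset MSB to bits 7-15, and the priority order (reset first, then jump or call, then a taken branch) is assumed from the order of the list.

```python
MASK16 = 0xFFFF


def sign_extend(field, width=8):
    """Copy the MSB of a `width`-bit field into the upper bits of a 16-bit word.

    The default width of 8 is an assumption, not a documented value.
    """
    field &= (1 << width) - 1
    if field & (1 << (width - 1)):
        field |= MASK16 & ~((1 << width) - 1)
    return field


def select_pc(reset, jump_or_call, take_branch, pc, bor, src1):
    """Return the value latched into the PC at the end of Ph4."""
    tpc = (pc + bor) & MASK16   # Target PC: PC plus Branch Offset Register
    npc = (pc + 1) & MASK16     # Next PC: incremented PC
    if reset:
        return 0x0000
    if jump_or_call:
        return src1 & MASK16
    if take_branch:
        return tpc
    return npc
```

With `bor = sign_extend(0xFE)` (an offset of -2) and `pc = 0x0010`, a taken branch selects 000EH, while an untaken one falls through to 0011H.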
Figure 26: Block Diagram of the Branch Unit 
Figure 27: Circuit Diagram of the Branch Evaluation Logic (BR is high for 
branch instructions; IR<11> is bit 11 of the Instruction Register) 
The value of the PC is driven onto the INST bus during the subsequent in-
struction fetch step. The RPC (Return PC) latch holds the return address for 
CALL instructions. The content of RPC is driven onto the D bus and saved 
into dest for CALL instructions. 
The branch evaluation logic is shown in Figure 27. This logic is not conducive 
to being designed in a bit-sliced style, and is thus placed to the right of the 
data path. Bit zero of the S1 bus is passed out of the data path to this logic. 
The two adders used in the Branch Unit have the same timing as the ALU; 
in fact, they utilize the same carry chain that is used in the ALU. In order to 
keep the number of branch delay slots to one, the branch target address and 
the branch condition must be evaluated by the end of the ID stage. Both of 
these tasks are performed in parallel during the ID stage, and the results are 
ready by the end of Ph123. The value to be loaded into the PC is selected 
during Ph4. 
9 Clock Generator 
The clock generator produces the internal clock phases from which the timing 
of the entire processor complex is derived. In total there are ten distinct clock 
signals that are generated. There are four non-overlapping clocks with equal 
pulse widths (Ph1, Ph2, Ph3, and Ph4) and a fifth clock which is continuously 
high during Ph1, Ph2, and Ph3, called Ph123; in addition, each clock has a 
global complement. The Ph123 clock is used specifically for the evaluation 
phase of the dynamic circuitry used for the computations performed in the 
ALU, the Shifter Unit, and the Branch Unit because they take the largest 
part of each machine cycle. It should be noted that the falling edge of Ph123 
must precede the rising edge of Ph4, hence the distinction between Ph123 and 
the complement of Ph4. Global clock complements were provided in order to 
allow the designers of the data path blocks to select the polarity of a clock 
signal without the additional logic and skew that would be introduced by 
locally inverting the clock. Because of the regularity of the data path design, 
the required additional clock routing was minimal. 
Figure 28: Circuit to Ensure Non-Overlapping Clock Phases 
In order to guarantee non-overlapping clock pulses, the activation of each 
clock signal is conditional upon the deactivation of the previous phase. This is 
accomplished with a NOR gate, the inputs to which are a phase enable signal, 
the actual buffered signal of the previous clock phase, and its complement (see 
Figure 28). Only when all three are zero is the clock signal itself allowed to 
rise. In this way, there is a guarantee of some "dead time" in between clock 
phases. This dead time is approximately equal to the delay time of the buffer 
chain driving the clock to the rest of the chip and adequately compensates for 
the skew that might be introduced by RC delay in the clock lines themselves. 
The clock enable signals are produced by a four-stage ring counter which is 
initialized upon reset to contain a '1' in the first flip-flop and '0' in the other 
three. The '1' circulates around the ring counter at the frequency of the 
external clock, producing a series of pulses that are similar to the four clock 
phases but are not guaranteed to be non-overlapping; the complements of 
these pulses are fed to the NOR gates as the clock enable signals. Thus the 
operating frequency (the inverse of the instruction cycle time) is one-fourth 
the frequency of the input clock. Figure 29 shows the complete circuit diagram 
of the clock generator. 
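The divide-by-four behavior of the ring counter can be illustrated with a short Python sketch. It models only the idealized phase sequence: the NOR gating, the buffer chains, and the resulting dead time are deliberately left out, and the names `ring_counter_phases` and `machine_cycle_ns` are hypothetical.

```python
def ring_counter_phases(n_input_cycles):
    """Yield (ph1, ph2, ph3, ph4, ph123) once per external clock cycle."""
    state = [1, 0, 0, 0]  # reset value: '1' in the first flip-flop
    for _ in range(n_input_cycles):
        ph1, ph2, ph3, ph4 = state
        # Ph123 is high throughout Ph1, Ph2 and Ph3 (idealized, no dead time).
        yield ph1, ph2, ph3, ph4, ph1 | ph2 | ph3
        # The '1' advances one stage per input clock and wraps around.
        state = [state[3]] + state[:3]


def machine_cycle_ns(input_clock_mhz):
    """One machine cycle spans four input-clock periods."""
    return 4 * 1000.0 / input_clock_mhz
```

Iterating `ring_counter_phases(4)` walks through one full machine cycle, and `machine_cycle_ns(20)` gives the 200 ns machine cycle that corresponds to a 20 MHz input clock.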
The five positive-polarity clock signals were brought out to the pins of the 
chip to verify proper operation of the clock generator. With an input clock of 
20 MHz, the measured dead time averaged 7 ns. 
Figure 29: Circuit Diagram of the Clock Generator 
10 Simulation and Testing 
In order to verify the functionality of our design, we developed a simulation 
environment that allowed us to easily compare the expected state of the pro-
cessor to the actual state produced by a circuit or switch-level simulator. 
The first component that was needed for this method was a means of 
producing the expected result. This was accomplished by writing an RTL 
(Register Transfer Level) simulator in C (TPSIM). The input to TPSIM is an 
assembly program. TPSIM simulates the execution of the input program and 
computes the expected state of the processor at the end of each machine cycle. 
The output of TPSIM is written to a file. A helpful feature of TPSIM is that 
it allows random data to be loaded into the processor; all load instructions 
that access memory address zero receive a random number. This allows long, 
non-repeating test programs to be written conveniently using small looping 
sequences of instructions. 
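The random-data feature can be pictured with the following Python sketch. TPSIM itself was written in C and its internals are not given here, so the function name `load_word`, the dictionary used as memory, and the use of a seeded generator are all assumptions made for illustration.

```python
import random


def load_word(memory, address, rng):
    """Return the 16-bit value seen by a load instruction (hypothetical helper).

    Loads from address zero return a fresh random word; all other
    addresses return the stored value (0 if never written).
    """
    if address == 0:
        return rng.randrange(0x10000)  # random 16-bit datum
    return memory.get(address, 0) & 0xFFFF
```

A short loop that repeatedly loads from address zero and operates on the result then exercises the data path with different operands on every iteration, for example with `rng = random.Random(1)` for a reproducible sequence.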
IRSIM was used to simulate the extracted layout of the entire processor 
(including the pad frame). IRSIM was used in the linear mode. The sim-
ulation process is driven by a command file that supplies stimuli to IRSIM. 
The command files are generated by TPSIM, which also produces test vector 
files that were used to test the fabricated chips. The command files also con-
tain commands that instruct IRSIM to produce an output file that contains 
the state of the processor at the end of each machine cycle. This output file, 
along with the expected state file generated by TPSIM, is then passed to the 
comparison stage of the testing system. 
The SIMCOMP program was written to compare the output state file 
produced by IRSIM to the expected state file generated by TPSIM. SIMCOMP 
detects and reports any mismatches that it encounters. SIMCOMP produces 
a file in which the expected and actual state of the processor are compared. 
All errors are visibly marked; this allows designers to easily track down the 
errors and solve the associated problems. 
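A comparison of this kind can be sketched in a few lines of Python. The actual file formats used by TPSIM, IRSIM, and SIMCOMP are not described here, so this sketch assumes one text line of processor state per machine cycle; the function name `compare_states` and the mismatch marker are hypothetical.

```python
from itertools import zip_longest


def compare_states(expected_lines, actual_lines, marker="  <-- MISMATCH"):
    """Compare per-cycle state lines; return (report_lines, mismatch_count)."""
    report = []
    mismatches = 0
    pairs = zip_longest(expected_lines, actual_lines, fillvalue="<missing>")
    for cycle, (expected, actual) in enumerate(pairs):
        line = f"{cycle:6d}  expected: {expected}  actual: {actual}"
        if expected != actual:
            mismatches += 1
            line += marker  # make the error easy to spot in the report
        report.append(line)
    return report, mismatches
```

The returned report lists every cycle side by side, so the first marked line points at the machine cycle in which the simulated layout diverged from the expected state.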
11 Conclusion 
We have presented the VLSI design of a RISC-style 16-bit microprocessor. 
The microprocessor was fabricated in a 2µm n-well CMOS technology. The 
chips were tested and found to be working on first silicon. The processor has 
a cycle time of 70 ns and achieves a peak performance of 14 16-bit MIPS. The 
design required about 12,000 transistors. 
12 Acknowledgments 
We are grateful to Western Digital for providing two generous fellowships to 
our students. Moreover, they supported us with the testing facility for the 
final phase of this project. Finally, we would like to thank NKK for their 
donations to our Advanced Computer Architecture laboratory. 
References 
[1] C. Tomovich, MOSIS User Manual, Release 3.1, USC/Information Sci-
ences Institute, Marina Del Rey, CA, 1988. 
[2] J. K. Ousterhout, G. T. Hamachi, R. N. Mayo, W. S. Scott, and G. S. Tay-
lor, "The Magic VLSI Layout System," IEEE Design & Test of Comput-
ers, February 1985. 
[3] A. Salz and M. Horowitz, "IRSIM: An Incremental MOS Switch-Level 
Simulator," Proceedings of the 26th Design Automation Conference, pages 
173-178, ACM/IEEE, Las Vegas, Nevada, June 1989. 
[4] W. S. Scott, R. N. Mayo, G. T. Hamachi, and J. K. Ousterhout, 1986 
VLSI Tools: Still More Works by the Original Artists, Report UCB/CSD 
86/272, University of California at Berkeley, Berkeley, CA, December 
1985. 
[5] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quanti-
tative Approach, Morgan Kaufmann Publishers, Palo Alto, 1990. 
[6] R. W. Sherburne, M. G. H. Katevenis, D. A. Patterson, and C. H. Sequin, 
"Datapath Design for RISC," Proceedings of the Conference on Advanced 
Research in VLSI, MIT, January 1982. 
[7] A. Weinberger, "Large Scale Integration of MOS Logic: A Layout 
Method," IEEE Journal of Solid State Circuits, April 1967. 
[8] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-
Wesley, Reading, MA, 1985. 
A RTL Description 
This appendix contains the RTL description of the Tiny RISC. 
I (Immediate) is a 5-bit immediate value. 
LI (Long Immediate) is an 8-bit immediate value. 
IMM is the actual immediate value constructed from I or LI by zero-extension. 
src1, src2, and dest are register addresses. 
R[0..7] are general purpose registers. 
R[0] is hardwired to contain zero at all times. Writing to it has no effect. 
S1_bus is the src1 bus. S2_bus is the src2 bus. D_bus is the dest bus. 
DEST is a latch that contains the data to be written to the register file. 
MADR is the Memory Address/Data Register. 
INST_bus is the external instruction bus. 
DATA_bus is the external data bus. 
IR[0..2] contain the instructions being processed in the pipeline. They form 
a shift register. 
IR[].op_type is the type of the instruction in the instruction register. 
Instruction types include "compute", "load", "store", "branch", "jump", "call". 
IR[].opcode is the opcode of the instruction in the instruction register. 
PC is the Program Counter. 
NPC is the Next PC (PC + 1). 
NPC_slave is the slave latch of NPC. 
TPC is the Target PC. It contains the target address for a branch instruction. 
RPC (Return PC) contains the return address for a CALL instruction. 
BO is the Branch Offset field of branch instructions. 
BOR is the Branch Offset Register. 
IF_4: IR[2] <= IR[1] <= IR[0] <= INST_bus 
      BOR <= BO directly from INST_bus 
ID_1: decode instruction 
      perform bypass comparison 
      IMM <= I or IMM <= LI 
      PC incrementer and branch offset adders start evaluating 
ID_2: precharge register file 
      decode register addresses 
ID_3: read R[src1] and R[src2] from register file 
      NPC <= output of PC incrementer 
      TPC <= output of branch offset adder 
ID_4: S1_bus <= R[src1] or S1_bus <= D_bus 
      S2_bus <= R[src2] or S2_bus <= IMM or S2_bus <= D_bus 
      NPC_slave <= NPC 
      if IR[1].op_type = "jump" or IR[1].op_type = "call" then 
          PC <= S1_bus 
      else if IR[0].opcode = "BRF" and S1_bus(0) = 0 then 
          PC <= TPC 
      else if IR[0].opcode = "BRT" and S1_bus(0) = 1 then 
          PC <= TPC 
      else 
          PC <= NPC 
EX_1: ALU and shifter start evaluating 
      MADR <= S1_bus 
      INST_bus <= PC 
      if IR[1].op_type = "load" or IR[1].op_type = "store" then 
          DATA_bus <= MADR 
EX_2: MADR <= S2_bus 
      INST_bus <= Hi-Z 
      if IR[1].op_type = "store" then 
          DATA_bus <= MADR 
      else 
          DATA_bus <= Hi-Z 
EX_3: latch output of ALU, output of shifter, and data from memory 
      RPC <= NPC_slave 
EX_4: if IR[1].op_type = "compute" then 
          D_bus <= output of ALU or shifter 
      else if IR[1].op_type = "load" then 
          D_bus <= data from memory 
      else if IR[1].op_type = "jump" or IR[1].op_type = "call" then 
          D_bus <= RPC 
      DEST <= D_bus 
      decode register address 
WB_1: R[dest] <= DEST 
WB_2: 
WB_3: 
WB_4: 
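The EX_4 and WB_1 steps of the RTL description, together with the hardwired R[0], can be mirrored in a small Python sketch. It is a behavioral illustration only and says nothing about phase timing or bypassing; the class `RegisterFile` and the function `d_bus_value` are hypothetical names, and returning None to represent an undriven D bus is a modeling choice made here.

```python
class RegisterFile:
    """Eight 16-bit registers; R[0] always reads as zero."""

    def __init__(self):
        self.regs = [0] * 8

    def read(self, addr):
        return self.regs[addr]

    def write(self, addr, value):
        # WB_1: R[dest] <= DEST. Writing to R[0] has no effect.
        if addr != 0:
            self.regs[addr] = value & 0xFFFF


def d_bus_value(op_type, alu_or_shifter, mem_data, rpc):
    """EX_4: choose what is driven onto the D bus for the instruction in EX."""
    if op_type == "compute":
        return alu_or_shifter
    if op_type == "load":
        return mem_data
    if op_type in ("jump", "call"):
        return rpc
    return None  # other instruction types do not drive the D bus in this model
```

A write-back then amounts to `rf.write(dest, d_bus_value(...))` for the instruction types that produce a result.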

