Simplified microprocessor design for VLSI control applications by Cameron, K.
3rd NASA Symposium on VLSI Design 1991
N94-18359
Simplified Microprocessor Design
for VLSI Control Applications
K. Cameron
NASA Space Engineering Research Center for VLSI System Design
University of Idaho, Moscow, Idaho 83843
Phone: 208-885-6500 Fax: 208-885-7579
Abstract - A design technique for microprocessors combining the simplicity
of RISCs with the richer instruction sets of CISCs is presented. The resulting
processors utilize the pipelined instruction decode and datapaths common to RISCs.
Instruction invariant data processing sequences which transparently support complex
addressing modes permit the formulation of simple control circuitry. Compact
implementations are possible since neither complicated controllers nor large
register sets are required.
1 Introduction
The design of microprocessors has evolved considerably since the introduction of the first
microprocessor in 1971 [3]. Traditional microprocessors are extremely complicated ma-
chines that support hundreds of instructions and a dozen or so addressing modes. The
dominance of such complex instruction set machines (CISC) has recently been challenged
by much simpler processors which support only the most commonly used instructions.
These processors are known as reduced instruction set computers (RISC) [8]. Tradition-
ally, RISC processors:
• Support a small (reduced) instruction set of simple instructions which represent the
most commonly used operations,
• Process instructions at the rate of one instruction per system clock cycle,
• Have a large on board register file for instruction or data cache.
The SPARC architecture is a scalable family of traditional RISC processors which was
developed commercially. The architecture is deeply pipelined and depends heavily upon
the compiler to efficiently map the register set and avoid forbidden instruction sequences
[1]. Recently, the meaning of the term RISC processor has become quite blurred [6]. The
IBM System/6000 purports to be a RISC implementation, but supports 184 instructions
[5], which is a considerably larger instruction set than that of the 68020 microprocessor
[7], which is generally considered to be a CISC machine.
The design methodology described here is targeted for applications which must be
implemented in a small amount of circuitry (i.e. as a cell on a larger integrated circuit),
while retaining medium to high levels of performance. The approach taken is to implement
a system which adheres to most, but not all, of the design concepts of a traditional RISC
machine. Key points of the design approach investigated are listed below. They will each
be described in more detail later.
• The processor supports a very small (reduced) instruction set. Only vital or fre-
quently used operations are supported directly.
• The instruction set is orthogonally partitioned. As nearly as possible, bit fields in
instructions mean the same things for all instructions. All addressing modes are
supported in the same manner for all instructions.
• All instructions are processed using invariant execution sequences. This means that
information flows through the datapath in precisely the same manner for all instructions.
• Both the datapath and the associated controller are deeply pipelined. The use of
invariant execution sequences permits the construction of very deep, yet simple processing
pipelines.
• Only a small internal register set is supported. The processor registers are memory
mapped, allowing them to be accessed and updated with general memory reference
instructions.
• The support of relatively complex addressing modes is important if the internal
register set is small. Implemented consistently across the entire instruction set, they
add little to the overall complexity of the machine.
Though a specific processor was implemented, the design methodology followed may
be used to implement a large number of different RISC-like processors, each with different
size-performance trade-offs.
2 Execution Cycle Pipeline
The data flow strategy of the microprocessor is the first item which must be designed. This
includes data flow to and from memory as well as through the datapath of the processor
itself. The performance required of the processor drives the choices made at this point.
Different cost-performance ratios can be achieved through the use of different data flow
strategies. A few of the possible tradeoffs are listed below:
• Processor word size?
• Separate address and data busses used to access memory?
• Separate instruction fetch and program data stores?
• Separate address generation and data processing units?
• Multiple data processing units?
• Pipeline depth?
• Data/instruction cache?
• Number of internal data/address registers?
• Maximum number of instructions?
[Timing diagram. RAM Access row: Fetch1, Addr0, Fetch2, Addr1, Fetch3, Addr2, Fetch4, Addr3.
Decode row: Inst1, Inst2, Inst3, Inst4. ALU and Adder rows: interleaved Addr and Op entries
for successive instructions.]
Figure 1: µP Execution Sequence
Since the design implemented was to have moderate performance yet be compact, it
was decided to build a processor with a 16 bit word, separate address and memory busses,
shared data and instruction stores, combined address generation and data processing units,
a deep pipeline, no cache, and a small set (4) of general registers. It was further decided
that processor registers would be memory mapped, so separate instructions would not have
to be provided to access either the processor registers or the registers associated with the
accompanying IO subsystem.
Though shallow pipelines such as that used in RISC II [11] are relatively simple to
design, it was decided from a performance stand-point that the machine should be deeply
pipelined. A deep pipeline permits the construction of a high-throughput processor, since
each stage of the pipe can operate independently on different portions of the problem
at the same time. Deep pipelines, however, have the undesirable characteristic that any
irregularity in the processing sequence for different instructions can lead to either the need
for extremely complicated locking circuitry [5,1] or else the definition of a large number of
forbidden instruction sequences to prevent data collisions. It was, therefore, decided that
all instructions should share the same (though perhaps, a truncated) processing sequence.
The memory execution sequence finally decided upon is shown in Figure 1. Several key
points of this processing sequence are:
• Each RAM access is pipelined two clock cycles deep. This greatly eases all timing
paths associated with RAM accesses.
• Data associated with an instruction "wraps" through the ALU/Adder twice. Once
to calculate the associated address and once to process data. This strategy keeps the
datapath completely utilized at all times.
• Data processed during one instruction is available for subsequent processing on the
very next instruction.
• One instruction is executed every two clock cycles.
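The overlap implied by these points can be sketched as a small schedule generator. This is only an illustration, assuming the two figures given in the paper (a processing sequence seven clock cycles deep, as stated in Section 4.3, and a new instruction started every second clock cycle). The names schedule, in_flight, PIPELINE_DEPTH and ISSUE_INTERVAL are hypothetical and do not come from the design itself.

```python
PIPELINE_DEPTH = 7   # clock cycles in the processing sequence (from the text)
ISSUE_INTERVAL = 2   # a new instruction starts every other clock cycle


def schedule(num_instructions):
    """Return (first_cycle, last_cycle) per instruction, using 1-based clock cycles."""
    spans = []
    for i in range(num_instructions):
        start = 1 + i * ISSUE_INTERVAL
        spans.append((start, start + PIPELINE_DEPTH - 1))
    return spans


def in_flight(spans, cycle):
    """Count the instructions occupying some pipeline stage during the given cycle."""
    return sum(1 for first, last in spans if first <= cycle <= last)


if __name__ == "__main__":
    spans = schedule(6)
    for n, (first, last) in enumerate(spans, start=1):
        print(f"Inst{n}: cycles {first}-{last}")
    print("instructions in flight at cycle 7:", in_flight(spans, 7))
```

Because every instruction follows the same sequence, the schedule is a pure function of the instruction index, which is the property that lets the real controller avoid interlock logic.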
Mode     Invoked             Description
Direct   s = 0               Effective Address is part of instruction.
Indexed  s = 1, Offset ≠ 0   Effective Address is contents of referenced
                             stack pointer plus signed offset.
Stack    s = 1, Offset = 0   If instruction implies a read, the referenced
                             pointer is pre-incremented. The Effective
                             Address is the new stack pointer contents.
                             If the instruction implies a write, the Effective
                             Address is the contents of the referenced stack
                             pointer. The stack pointer is post-decremented.
Figure 2: µP Addressing Modes
[Bit-field diagram of the instruction word, numbered from bit 15. Field labels: Op Code,
s, a/b, s/t, Address/Offset. Annotations: Register Select, Stack Relative, Stack Select,
10 Bit Address, 11 Bit Address.]
Figure 3: µP Instruction Format
3 Instruction Types
Once data flow strategy has been determined, the instruction set and addressing modes of
the processor must be selected. Here, a wide variety of possibilities presents itself. Since
the processor implemented is intended for an interrupt driven environment, it was decided
that the machine should be stack oriented and provide good support for stack based
operations. The addressing modes summarized in Figure 2 were finally decided upon. Direct
referencing of memory locations with a pointer contained in the instruction itself provides
simple access of memory mapped registers, global variables and targets for normal jump
and jump subroutine instructions. The indexed with offset mode provides support for jump
tables, arrays, stack oriented local variable access, and subroutine argument passage. The
auto-decrement and increment modes support implied push and pop operations as a part
of any instruction, ease the placing of arguments on the stack for passage to subroutines,
and allow the return from subroutine instruction to be implemented as a special case of
the jump instruction.
Figure 4 summarizes the instruction set which was selected. Each instruction can be
operated in any addressing mode. The actual instruction format is shown in Figure 3.
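The addressing modes of Figure 2 can be expressed as a short effective-address routine. This is a minimal sketch under stated assumptions, not the circuit itself: the function name effective_address, its argument convention (an already sign-extended offset), and the 16-bit wrap-around of addresses are all hypothetical choices made for illustration.

```python
WORD_MASK = 0xFFFF  # assumption: addresses wrap at the 16-bit word size


def effective_address(s, field, pointer, is_write):
    """Return (effective_address, updated_pointer) for one memory reference.

    s        -- mode bit from the instruction (0 = direct, 1 = stack relative)
    field    -- address (direct) or signed offset (stack relative), given as a
                sign-extended Python int (hypothetical convention)
    pointer  -- current contents of the referenced stack pointer
    is_write -- True if the instruction implies a write to memory
    """
    if s == 0:
        # Direct: the effective address is part of the instruction.
        return field & WORD_MASK, pointer
    if field != 0:
        # Indexed: pointer plus signed offset; the pointer itself is unchanged.
        return (pointer + field) & WORD_MASK, pointer
    if is_write:
        # Stack write (push): use the pointer, then post-decrement it.
        return pointer, (pointer - 1) & WORD_MASK
    # Stack read (pop): pre-increment the pointer, then use the new value.
    new_pointer = (pointer + 1) & WORD_MASK
    return new_pointer, new_pointer


if __name__ == "__main__":
    push_addr, sp = effective_address(1, 0, 0x0100, is_write=True)
    pop_addr, sp = effective_address(1, 0, sp, is_write=False)
    print(hex(push_addr), hex(pop_addr), hex(sp))
```

A write followed by a read through the same pointer addresses the same location and restores the pointer, which is what makes implied push and pop available to any instruction.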
Op Code  Mnemonic  Register  Description
0100     ld        a/b       Load Register
1111     st        a/b       Store Register
1000     jsr                 Jump to Subroutine
0000     jmp                 Absolute Jump
0010     and       a/b       Bitwise And
0011     or        a/b       Bitwise Or
1100     add       a/b       Addition
1110     sub       a/b       Subtraction
0110     not       a/b       Bitwise Complement
0101     xor       a/b       Bitwise Exclusive Or
1101     cmp       a/b       Skip next instr if not equal
1010     tst       a/b       Bitwise And, then skip next if 0
0111     shl       a/b       Shift Left
0001     shr       a/b       Shift Right
1001     lds       s/t       Load Stack Register
1011     ien       --        Enable/Disable Interrupts
Figure 4: µP Instruction Set
Since the arithmetic unit already provides an adder and a zero detect circuit for the im-
plementation of the base instruction set, virtually no additional hardware in the datapath
was required to implement the addressing modes. If a hardware multiplication instruction
had been included in the instruction set, it would have been possible to utilize it during
address generation to provide very sophisticated support for array accessing.
The requirement that all instructions be implemented with the same processing se-
quence places severe restrictions on the type of conditional statements that can be pro-
vided, however. A test and skip next instruction pattern was selected since it fits the
required schema and was possible to implement without disturbing the pipelined flow of
instructions. No retry of instructions is necessary, since the results of the test are always
known in time to abort the effects of any subsequent instruction.
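The test and skip pattern can be illustrated with a toy interpreter. This is a sketch only: it models a single register A with immediate operands, whereas the real processor takes operands through its addressing modes, and it assumes that tst leaves the register unchanged. The function name run and the tuple encoding of instructions are hypothetical.

```python
def run(program, a=0):
    """Execute a list of (mnemonic, operand) pairs on one 16-bit register, A.

    Every instruction is stepped through in the same way; a skipped
    instruction still occupies its slot, only its effect is suppressed.
    """
    skip = False
    for mnemonic, operand in program:
        if skip:
            skip = False                  # abort the effect of this instruction
            continue
        if mnemonic == "ld":
            a = operand & 0xFFFF
        elif mnemonic == "add":
            a = (a + operand) & 0xFFFF
        elif mnemonic == "cmp":
            skip = a != operand           # skip next instruction if not equal
        elif mnemonic == "tst":
            skip = (a & operand) == 0     # bitwise AND, then skip next if zero
        else:
            raise ValueError(f"unsupported mnemonic: {mnemonic}")
    return a


if __name__ == "__main__":
    print(run([("ld", 5), ("cmp", 5), ("add", 1)]))
    print(run([("ld", 5), ("cmp", 4), ("add", 1)]))
```

The skip flag is the only state carried between instructions, which mirrors the observation that no instruction ever has to be retried.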
4 Implementation
4.1 General
The processor was designed using structured logic design techniques in a custom environ-
ment. High operating speeds and compact layouts were achieved through the extensive
use of pass-logic.
The use of an orthogonally partitioned instruction set and an instruction invariant
processing sequence resulted in extremely small and simple control circuitry. Consequently,
the speed of a machine cycle is limited only by delays in the datapath, not by propagation
delays in the controller.
Cycle1:  PC → AR         AR → AR
Cycle2:  RAM → MO        PC++
Cycle3:  MO → Pipe2      SP(--) → Pipe1    0 → Pipe1
Cycle4:  Pipe1 → AR      Pipe2 → AR        Pipe1/Pipe2 → SP
         A/B/PC → MI     SP(--) → AR
Cycle5:  RAM → MO        MI → RAM
Cycle6:  MO → Pipe2      A/B → Pipe1       0 → Pipe1
Cycle7:  Pipe1 → A/B     Pipe2 → A/B       Pipe1/Pipe2 → (illegible)
Figure 5: µP Register Transfer Sequences
The processor itself is a simple design. Approximately three man months were required
to design the circuit and verify its logical correctness through extensive logic simulations.
During this time a software model of the processor was also written to aid logic verification
and a macro-assembler was written for software development. Four man months were
required to implement the layout and verify its correctness.
The processor was implemented in a 1.@m CMOS process and subsequently shrunk to a
1.0 µm process due to size considerations. It runs at a clock frequency of 28MHz under worst
case processing assumptions, 140deg C junction temperature, and 4.1V internal supplies.
The processor was completely functional on first silicon. Under typical conditions, it should
run at nearly 60MHz. These clock rates correspond to instruction rates of 14MIPS worst case
and 30MIPS typical. Currently, the limiting speed path is associated with the zero detect
circuit. A redesign of this circuit would likely result in yet higher system performance.
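The quoted instruction rates follow directly from the clock frequency and the two-cycle issue interval. A minimal worked check, assuming only those two quantities from the text (the function name mips is hypothetical):

```python
CYCLES_PER_INSTRUCTION = 2  # one instruction is executed every two clock cycles


def mips(clock_mhz):
    """Instruction rate in millions of instructions per second."""
    return clock_mhz / CYCLES_PER_INSTRUCTION


if __name__ == "__main__":
    print("worst case:", mips(28), "MIPS")   # 28 MHz worst-case clock
    print("typical:   ", mips(60), "MIPS")   # "nearly 60MHz" typical clock
```

The typical figure is an upper estimate, since the text gives the typical clock only as nearly 60MHz.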
4.2 Control
The upper bits of the data bus are fed into the control section where they are pipelined
parallel to the data passing through the datapath. The instruction decode and control of
the datapath is simple, since both control and data are pipelined in an equivalent manner.
The individual control lines to the datapath are decoded directly from the pipelined
instructions. The logic of the control section fits on one C sized sheet of logic. It consists
of four stages of pipeline registers, 61 NAND gates (most of which are 2 to 4 input gates),
10 NOR gates, and various inverter/buffers. It contains no state-machines except those
required for interrupts and memory cycle stealing by the IO subsystem, and these are
exclusively single bit state-machines.
4.3 Data Path
The datapath consists of a register stack and a pipelined Adder and ALU. Figure 6 is
a signal flow diagram of the datapath. The M Bus is used for all memory mapped data
transfers. The P Bus drives the Address bus of the RAM through a clocked register, AR,
located in the pads. The I bus is the data input bus from the memory, and the Q bus is
the data bus to the RAM. The I and Q bus are combined into a single bi-directional bus
at the chip pads through the MI and MO registers. (MI receives data from RAM during
a read and MO outputs data to RAM during a write.)
[Signal flow diagram; legible labels: M, Add, Pipe2, Pipe1, ALU, MI, Q, P.]
Figure 6: µP Register Stack
The datapath operates as follows.
The instruction fetch address is driven from either the program counter or the secondary
program register onto the P (address) bus. Two clock cycles later, the instruction arrives
on the I bus, where it is fed through the ALU. At this time the Op Code portion of
the instruction is stripped off and the remaining bits are used to form either an absolute
address or an offset for the stack relative mode of operation. The results are clocked
into the Pipe registers. On the next clock cycle, this address/offset is either passed unaffected
through the ADDER (absolute addressing mode) or added to the contents of the selected
address register (SP or TP) (indirect addressing mode), and the results are driven onto the
P (address) bus through a tri-state driver. If the instruction implies a write to memory,
the appropriate data is driven onto the Q bus from either A, B or PC. If the instruction
implies a read, the requested data enters the datapath via the I bus two clock cycles later
and is processed by the ALU and ADDER in succession, at which time the results are
loaded into the appropriate register. An RTL description of the data transfers comprising
the data processing sequences utilized to implement the entire instruction set is shown in
Figure 5. It should be noted again that though this processing sequence is seven clock
cycles deep, processing of a new instruction starts every other clock cycle.
The ALU and adder are both implemented using pass logic. The ALU consists of a
single cell replicated 16 times, each of which consists of only 23 n-channel pass gates and
9 inverter/buffers. The ALU performs all bitwise operations and provides a zero detect
function which is used in the conditional skip instructions, as well as the detection of the
auto increment/decrement addressing modes. The OpCode (Figure 4) bit patterns were
selected such that the upper bits of the instructions themselves become the control lines
for the ALU with minimal remapping.
The configuration selected to implement the ADDER is a modified transmission gate
conditional sum scheme [10]. The configuration is small, regular, and very fast.
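For readers unfamiliar with the scheme, the idea behind conditional-sum addition can be sketched in software. This is the generic textbook technique, not the modified transmission-gate circuit of [10] and not the adder as laid out in this processor; the function names conditional_sum and add16 are hypothetical. Each block computes its result for both possible carry-in values, and the carry out of the lower block then selects between the two precomputed results of the upper block.

```python
def conditional_sum(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first).

    Returns a pair: ((sum_bits, carry_out) assuming carry-in 0,
                     (sum_bits, carry_out) assuming carry-in 1).
    """
    n = len(a_bits)
    if n == 1:
        a, b = a_bits[0], b_bits[0]
        return ([a ^ b], a & b), ([a ^ b ^ 1], a | b)
    half = n // 2
    low0, low1 = conditional_sum(a_bits[:half], b_bits[:half])
    high0, high1 = conditional_sum(a_bits[half:], b_bits[half:])

    def merge(low):
        low_sum, low_carry = low
        # The low block's carry-out selects which precomputed high result to keep.
        high_sum, high_carry = high1 if low_carry else high0
        return low_sum + high_sum, high_carry

    return merge(low0), merge(low1)


def add16(x, y):
    """16-bit addition built on conditional_sum; returns (sum, carry_out)."""
    def to_bits(value):
        return [(value >> i) & 1 for i in range(16)]

    (sum_bits, carry), _ = conditional_sum(to_bits(x), to_bits(y))
    return sum(bit << i for i, bit in enumerate(sum_bits)), carry


if __name__ == "__main__":
    print(add16(0x1234, 0x0FF0))
    print(add16(0xFFFF, 0x0001))
```

Because both halves are evaluated independently and only a selection remains at each level, the number of selection levels grows with the logarithm of the word width, which is consistent with the speed and regularity claimed for the scheme.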
4.4 IO Subsystem
Though not a primary topic here, it should be mentioned that a complete IO subsystem
was implemented and integrated with the microprocessor described here. It consisted of a
DMA subsystem which was responsible for the bulk transfer of data around the chip, two
serial ports for low speed data transmission and acquisition, a parallel port for the transfer
of data to and from an external microprocessor, as well as a prioritized interrupt/event
passage system.
5 Conclusion
Present day integrated circuit fabrication processes support levels of integration adequate
for the construction of on-board microprocessor based controllers which occupy only a
small portion of the available circuit area. Such processors can be readily designed for
different cost-performance tradeoffs, as required for specific applications. The outlay of
engineering time need not be excessive and the use of high-level languages for code development
makes the underlying instruction set transparent to the firmware developer, and
eases code migration, development and support.
References
[1] A. Agrawal et al., "The Scalable Processor Architecture (SPARC)," COMPCON '88
Proceedings, 1988, pp. 278-283.
[2] H. Bakoglu, W. Whiteside, "RISC System/6000 Processor Architecture," IBM RISC
System/6000 Technology, SA23-2619, IBM, Austin TX, 1990, pp. 8-15.
[3] D. Curtin, L. Porter, Microcomputers: Practices and Procedures, Prentice-Hall, 1986.
[4] G. Grohoski, J. Kahle, L. Thatcher, C. Moore, "Branch and Fixed-Point Instruction
Execution Units," IBM RISC System/6000 Technology, SA23-2619, IBM, Austin TX,
1990, pp. 24-32.
[5] P. Hester, "RISC System/6000 Hardware Background and Philosophies," IBM RISC
System/6000 Technology, SA23-2619, IBM, Austin TX, 1990, pp. 2-7.
[6] J. McLeod, "Tough Choices Ahead in Microprocessors," Electronics, May 1989, pp.
70-78.
[7] MC68020 32-Bit Microprocessor User's Manual, 2nd Edition, ISBN 0-13-566860-3,
Prentice-Hall, Englewood Cliffs, NJ, 1985, p. 1-6.
[8] D. Patterson, C. Sequin, "A VLSI RISC," IEEE Computer, vol. 15, No. 9, Sep 1982,
pp. 8-12.
[9] C. Rowen et al., "RISC VLSI Design for System Level Performance," VLSI Systems
Design, March 1986, pp. 81-88.
[10] A. Rothermel, et al., "Realization of Transmission-Gate Conditional-Sum (TGCS)
Adders with Low Latency Time," IEEE JSSC, Vol. 24, June 1989, pp. 558-561.
[11] R. Sherburne, M. Katevenis, D. Patterson, C. Sequin, "A 32b Microprocessor with a
Large Register File," Digest of IEEE International Solid-State Circuits Conference,
Feb 1984, pp. 168-169.
