Reducing a complex instruction set computer. by Tse, Tin-wah. & Chinese University of Hong Kong Graduate School. Division of Electronics.
REDUCING A COMPLEX INSTRUCTION SET COMPUTER
Tse Tin-wah, B.Sc.
THESIS
Presented to the Graduate School of
The Chinese University of Hong Kong
in Partial Fulfillment
of the Requirements for the Degree of
MASTER OF PHILOSOPHY IN ELECTRONICS
THE CHINESE UNIVERSITY OF HONG KONG
May, 1988
ACKNOWLEDGEMENTS
I have received much help from my supervisor Prof. T. C. Chen
in preparing the thesis. For all the insight and material
Prof. Chen has provided me, I hereby express hearty thanks to
him. Help from my oral examiners, Dr. C. K. Chan in
particular, is also appreciated.
Tse Tin-wah
The Chinese University of Hong Kong 
May, 1988
Special Acknowledgement
I have got financial support from the Croucher 
Foundation throughout my MPhil programme. Here it is 
appropriate to thank the Foundation, and in particular, Mrs 
O'Hara, the Secretary to the Trustees, who has been giving 
me wise and kind advice.

ABSTRACT
Architectures with complex high-level machine instructions have
been thought of as better high-level language (HLL) targets.
However, many experiments have indicated that complex
instructions are not likely to be exploited when programs are
compiled down, hence the ever-increasing complexity is actually
wasting silicon estate and in many cases, slowing processors
down. To take advantage of the synergistic relations among
compiler, architecture, and implementation, the reduced
instruction set computer, or RISC, approach is inaugurated.
Encumbering complex cases are left out, and the resources saved
up are dedicated to the common cases for speed. A sophisticated
compiler is then used to bridge the gap between HLL and the
primitive machine interface very effectively. In our research,
the general ways of reducing an existing complex machine are
developed. Techniques are devised to streamline the architecture
for better performance. We see RISC from the angle of
implementation simplicity, then pick out instructions that do not
take long execution time. Register-to-register operations,
branch, load/store, for example, are candidates to be chosen. A
pipeline is designed on which hardware is distributed in such a
manner that each chosen instruction finishes its work in one
pipeline cycle. It is observed that the IBM S/370 instruction set
is highly irregular: each instruction carries a different function
and takes a very different execution time. Compiler-generated code
in the case of 360/370 is also heavily skewed. Because of good
subsettability, the 360/370 lends itself to reduction. Also, as
the 360/370 architecture is well documented, information is
easily available. Hence, we have chosen to retrofit the S/370.
Evaluations based on the groundwork are found encouraging. The
design is found to exhibit most if not all RISC features, and
from some statistics, the instruction set is found to have better
usage. The reduced machine is simulated, the details and the





1.3 So What is a RISC?
1.4 Reducing a CISC
Chapter 2 Conception and Methodology
2.1 Architectural Subsetting
2.2 Plumblining a RISC
2.3 RISC Tailoring
2.4 CISC Programs on RISC
2.5 Speedup in Other Cases
Chapter 3 Attacking S/370
3.1 Introduction to the 370 Architecture
3.2 Subsetting and Streamlining
3.3 An Evaluation Using Real Statistics
3.4 Summing Up
Chapter 4 The Processor Design
4.1 Predecoding
4.2 Coprocessor Support
4.3 A Simpler Branching Scheme
4.4 Cutting the Instruction Bandwidth
4.5 The Instruction Pipeline
4.6 Being in a Dilemma
4.7 Design Summary
Chapter 5 Evaluation and Conclusion
5.1 Static Evaluation
5.2 The Simulation
5.3 Running Low Level Synthetic Benchmarks




A1 Instruction Set Summary
A2 Design of the ALU
A3 The Microinstruction Format
A4 A Typical Basic Block




































One of the major goals of architecture design is to make the
instruction set an efficient target for high-level language
programs. Years ago, it was generally believed that architectures
with instructions resembling HLL constructs could improve
performance, because those instructions could simplify compilers
by narrowing the semantic gap between HLLs and the primitive
machine interface. It was also thought that with complex
instructions, code density was increased and programs could run
faster, because fewer bytes had to be fetched from the slow main
memory. Processor-memory traffic was then the de facto metric to
measure architectural quality.
Microprogramming had been the most powerful tool to realize a
complex instruction set. With a fast control store, it was
apparently justified to enroll HLL constructs that would
otherwise be resident in slow core memory. Moreover, as
microprogramming was really a good means to manage a complex
control section, why not consider an HLL-like instruction set,
however complex, and microprogram it?
IBM S/360 was the first design to decouple architecture from
implementation. It is successful in that with different
cost/performance ratios, compatibility can still be maintained
across family members. Because S/360 had a rich instruction set
and employed microprogramming, and more importantly, because it
was from IBM, some manufacturers followed suit to build such
complex instruction set computers (CISCs). Later on, when new
families were introduced, the instruction set was further
enhanced for maintaining upward compatibility.

[FIGURE 1.1  Is the subset RISC? Not yet. RISC is more:
individual instructions ...]
At a certain point in the longstanding trend of complicating
the instruction set, some researchers began to examine how
instructions were used when HLL programs were compiled down. The
most frequently used instructions were found to be simple
instructions like LOAD, STORE and BRANCH. These simple cases
matched HLL constructs well, and hence dominated most statistics
(FIGURE 1.1). Some of these results can be found in [Knuth71],
[Alex75], [Lunde77], [Tanen78] and [Weick84].
Opinions from experienced compiler writers also agree that
complex add-ons don't really help. Hoping to close
the semantic gap, CISC has inadvertently and inevitably
introduced a semantic mismatch: no single firmware routine can be
found to match the corresponding HLL construct across different
languages. For this reason, an optimizing compiler has to carry
out intensive case analysis to find the best combination
[Wulf81]; this is quite contrary to the CISC designers' dream.
In practice, compilers tend to use only a few instructions.
INTRODUCTION 2
A complex control section in a CISC has a pernicious influence
on the CPU speed. To interpret complex instructions, either a
massive control store or a complex hardwired decoder is called
for. Both cases will lengthen the critical path of the processor,
hence slowing it down. People then begin to ask: are complex
things really worth the design effort and silicon estate?
It is what they have learned from hindsight that leads some
researchers to apostatize and to inaugurate the reduced
instruction set computer (RISC) approach. 'The RISC effort is an
attempt at systematic rethinking of the instruction set problem
with the aim to produce what may be called lean and mean
machines' [ChenSS]. The philosophy of this new approach is to
throw out on the one hand the encumbering complex cases that are
not worth the design effort, and to streamline on the other hand
the kernel cases for speed, without sacrificing HLL considerations.
The resources thus saved are then dedicated to other useful
features; complex cases are simulated by primitive instructions.
In any design, one should use resources effectively. With this
in mind, RISC has its place. An often emphasized advantage of the
RISC approach is the reduction in design effort and errors. Hence,
RISC saves both silicon and human resources. Moreover, as some
experiments indicate that RISCs make better HLL computers
[Patt82a, Ditz80], RISC naturally stands out in an environment
where HLLs are widely practised. RISC has now been brought out
of college laboratories and put on the market. But what about the
existing CISCs? Why not retrofit them if the RISC concept
really works?
1.2 RISC PRIMER [Bern81, Patt85, Henn85, Stall86, Gim87]
The key to the success of RISC is the exploitation of the
synergistic relations among compiler, architecture and
implementation. RISC designers are aware that a simple and
regular instruction set will lead to efficient implementation,
and that a state-of-the-art optimizing compiler can make the most
of the primitive kernel in which they have invested. They begin
by getting compiler writers, architects and hardware engineers to
sit side by side, devise ways to streamline the architecture
[Henn84], talk over hardware details and plan compilers to
energize the bare machine. Hopefully, they will end up getting
things done.
An overriding design constraint is to make RISC instructions
as unencumbered as microinstructions, and to make them run as
fast as microinstructions [Patt85]. So, most RISC instructions
execute in one cycle, as microinstructions do. An optimizing
compiler translates HLL programs down to its preferred primitives,
which dominate usage statistics. Complex cases are usually
synthesized in a unique way due to the inherent minimality of the
RISC repertoire. A primitive sequence in many cases works faster
than a single complex instruction that is usually kept as a
microinstruction sequence in ROM, because an optimizing compiler
can remove some run-time inefficiencies that a hardware
interpreter cannot, and because the primitive sequence is now
kept in a cache built with the same technology as a control
store, rather than in slow primary storage.
To effect fast and smooth execution, RISC architecture is made
simple and regular. This requirement results in a set of design
constraints [Henn85]. RISC has relatively few instructions with
few addressing modes and instruction formats. Operations are
register oriented; only LOAD and STORE access memory. All these
contribute to simpler decoding and make the pipeline simple and
efficient, so that each instruction, possibly except LOAD, STORE
and BRANCH, executes in one pipeline cycle. Incidentally, in a
VLSI RISC chip, the control section occupies only a tiny fraction
of the floor plan [Patt85].
The LOAD/STORE architecture also spawns opportunities for
compiler optimization. Frequently used objects can be kept in
registers, and the CPU may then concentrate on computational
tasks. A RISC register-register instruction usually has a
three-address format, because that is advantageous for
optimization, as none of the operands involved will be destroyed
after the computation. An optimizing compiler tries to minimize
LOADs and STOREs; consequently, the instruction stream mainly
contains register-register instructions. Compared with the
memory-memory and memory-register instructions found in CISCs,
RISC object programs are sometimes smaller [Patt85], since a
register address requires fewer bits than a memory address.
However, RISC programs are on the average 50% larger as estimated
by Patterson, since code density is kept low to maintain
regularity. Most RISCs have fixed 32-bit instructions; only in
rare cases are RISC instructions Huffman encoded. In between the
two extremes, some contrive to pack instructions to increase
bandwidth and hardware usage [Henn82].

Typical RISC instruction formats

    ADD   R1  R2  (R3 or imd)      R1 <- R2 + (R3 or imd)
    LOAD  R1  Rb                   R1 <- [(Rb) + offset]

The first format is a typical three-address, register-register
instruction. R3, one of the source registers, may be an
immediate operand. Opcode and operands occur in the same places
so as to simplify decoding. The second format is a typical LOAD
instruction using base address plus offset to evaluate the
effective address.
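The benefit of keeping opcode and operands "in the same places" can be illustrated with a small decoder. This is only a sketch under assumed conventions: the 32-bit field layout (7-bit opcode, 5-bit register numbers, a 1-bit immediate flag and a 14-bit immediate) and the names `encode`, `decode` and `Decoded` are hypothetical and are not taken from the thesis or from any particular RISC.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    opcode: int
    rd: int        # destination register
    rs1: int       # first source register
    is_imm: bool   # True when the second source is an immediate
    src2: int      # register number, or sign-extended immediate


# Hypothetical layout (an assumption for illustration only):
# [31:25] opcode  [24:20] rd  [19:15] rs1  [14] imm flag  [13:0] rs2/immediate
def encode(opcode: int, rd: int, rs1: int, is_imm: bool, src2: int) -> int:
    return (((opcode & 0x7F) << 25) | ((rd & 0x1F) << 20)
            | ((rs1 & 0x1F) << 15) | (int(is_imm) << 14) | (src2 & 0x3FFF))


def decode(word: int) -> Decoded:
    # Every field sits at a fixed position, so all of them can be
    # extracted without first inspecting the opcode.
    opcode = (word >> 25) & 0x7F
    rd = (word >> 20) & 0x1F
    rs1 = (word >> 15) & 0x1F
    is_imm = bool((word >> 14) & 1)
    low = word & 0x3FFF
    if is_imm:
        # Sign-extend the 14-bit immediate.
        src2 = low - 0x4000 if low & 0x2000 else low
    else:
        src2 = low & 0x1F
    return Decoded(opcode, rd, rs1, is_imm, src2)


# ADD R1 <- R2 + R3, then ADD R1 <- R2 + (-5)
reg_form = decode(encode(1, 1, 2, False, 3))
imm_form = decode(encode(1, 1, 2, True, -5))
```

In hardware the same property means that register numbers can be sent to the register file while the opcode is still being decoded.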
Another feature common to many RISCs is the use of a large
number of registers, usually arranged in multiple overlapped
windows as in Berkeley's RISC [Patt85]. Patterson's group finds
that most of the memory-processor traffic is incurred in crossing
procedure boundaries. To make use of this finding, the processor
is made to switch automatically to a new set of registers when a
new procedure is invoked, so that old variables need not be saved
to make room for new ones. Communication between the current and
the invoking procedures is achieved by writing information
through a fraction of overlapped registers common to adjacent
windows. Global variables are kept in a set visible to all
frames. With this innovative scheme, the improvement observed is
significant.
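The overlapped-window idea can be modelled in a few lines. The following is a minimal sketch, not Berkeley's actual design: the class name `WindowedRegisterFile`, the window count and the register counts (8 globals, 8 shared, 8 locals) are illustrative assumptions. Logical registers are numbered globals first, then "in", "local" and "out" registers; the caller's "out" registers are physically the callee's "in" registers.

```python
class WindowedRegisterFile:
    """Overlapped register windows; all sizes are illustrative assumptions."""

    def __init__(self, n_windows=4, n_global=8, n_shared=8, n_local=8):
        self.n_windows = n_windows
        self.n_global = n_global
        self.stride = n_shared + n_local             # distance between windows
        self.window_size = n_shared + n_local + n_shared  # ins + locals + outs
        self.globals = [0] * n_global
        # One extra shared block so the last window also has "out" registers.
        self.physical = [0] * (n_windows * self.stride + n_shared)
        self.cwp = 0                                  # current window pointer

    def _slot(self, r):
        if r < self.n_global:
            return self.globals, r                    # visible to all frames
        offset = r - self.n_global
        if offset >= self.window_size:
            raise IndexError("no such register")
        return self.physical, self.cwp * self.stride + offset

    def read(self, r):
        bank, i = self._slot(r)
        return bank[i]

    def write(self, r, value):
        bank, i = self._slot(r)
        bank[i] = value

    def call(self):
        # Switch to a fresh window instead of saving registers to memory.
        if self.cwp + 1 == self.n_windows:
            raise OverflowError("window overflow; a real machine would spill")
        self.cwp += 1

    def ret(self):
        if self.cwp == 0:
            raise OverflowError("window underflow")
        self.cwp -= 1


regs = WindowedRegisterFile()
regs.write(16, 7)     # caller's first local register
regs.write(24, 42)    # caller's first "out" register: an argument
regs.call()
arg_seen_by_callee = regs.read(8)     # callee's first "in" register
callee_local = regs.read(16)          # a fresh local, untouched by the caller
regs.write(8, 99)                     # result passed back through the overlap
regs.ret()
result_seen_by_caller = regs.read(24)
caller_local_after = regs.read(16)
```

No register is copied on `call` or `ret`; only the window pointer moves, which is where the saving in memory-processor traffic comes from.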
Using a large number of registers is not the only way to go.
Some machines prefer to integrate fewer registers in order to
save silicon for other applications. Their compilers then perform
flow analysis and keep live variables in registers. Variables in
loops and in/out parameters are their main targets. As long as
the memory subsystem works well, the already-reduced
processor-memory traffic will create no serious problems, because
frequently used data are cached. To free the processor further
from the memory bandwidth problem, RISC usually adopts the
Harvard architecture: instructions and data are separately cached.
In RISC, pipeline interlocks are shifted to software. Many
machines make use of a 'delayed branch', which is no different
from a normal branch instruction except that the branching effect
is delayed for a few cycles. It is the task of the compiler to
rearrange program code so that useful instructions are placed in
the slots following a branch to fill up what would otherwise be
bubbles. In some machines, a delayed load is devised in a similar
fashion. The delayed branch alone usually improves speed by 10%
[Gross82].
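The compiler's job for a delayed branch can be sketched as follows. This is a simplified illustration, not the algorithm of any cited compiler: it assumes a single delay slot, a basic block that ends in the branch, and a register-only dependency model in which the condition code is treated as a register named "cc" (memory ordering would need a similar pseudo-register). The names `Instr`, `independent` and `fill_delay_slot` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dst: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read
    is_branch: bool = False


NOP = Instr("NOP")


def independent(cand: Instr, later: Instr) -> bool:
    """True if cand may be moved past `later` without changing results."""
    if cand.dst is not None and (cand.dst in later.srcs or cand.dst == later.dst):
        return False               # read-after-write or write-after-write
    if later.dst is not None and later.dst in cand.srcs:
        return False               # write-after-read
    return True


def fill_delay_slot(block: List[Instr]) -> List[Instr]:
    """Fill the single delay slot of the branch that ends `block`."""
    if not block or not block[-1].is_branch:
        return list(block)
    body, branch = block[:-1], block[-1]
    # Search backwards for an instruction that can legally sink past
    # everything after it, including the branch itself.
    for i in range(len(body) - 1, -1, -1):
        cand = body[i]
        if cand.is_branch:
            break
        if all(independent(cand, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, cand]
    return body + [branch, NOP]    # nothing useful found: the bubble remains


block = [
    Instr("ADD", "r1", ("r2", "r3")),
    Instr("CMP", "cc", ("r4", "r5")),
    Instr("BNE", None, ("cc",), is_branch=True),
]
scheduled = [i.op for i in fill_delay_slot(block)]

# Here the only candidate feeds the branch, so a NOP has to be inserted.
stuck = [i.op for i in fill_delay_slot(block[1:])]
```

In the first block the ADD does not affect the comparison or the branch, so it is moved into the slot after the branch and executes at no extra cost.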
1.3 SO WHAT IS A RISC?
Patterson: An alternative of this design style (CISC) suggests
that simplicity is a better match to VLSI and HLL
[Patt82b]. Thus it makes no difference to the user
of a high-level language computer system whether that
system is implemented with a CISC that maps one-to-one
with the tokens of the language, or if HLL support is
provided largely by software on a very fast but simple
machine [Patt82a].
Hennessy: RISC ... is a style of computer architecture that
advocates shifting complexity from hardware and
program run-time to software and program compile-time.
This leads to greater reliance on compiler
technology [Henn85].
Colwell et al.: We propose the following elements as a working ...




(4) Relatively few instructions
and addressing modes...
(5) Fixed instruction format...
(6) More compile-time effort...
The six RISC features enumerated above can be used to
weed out misleading claims and provide a springboard
for points of debate [Col85].
Korthauer et al.: ... an evaluation of the alternative
tendencies (RISC vs. CISC) in this paper there will be
made the attempt to prove a RISC as a subarchitecture
of CISCs [Kort84].
Patterson: Software and hardware experiments on subsets of the
VAX and IBM 360/370, then, seem to support the RISC
arguments [Patt85].
The demarcation between RISC and CISC has not been too clear. In
particular, one can hardly tell if a certain machine is a pure
RISC. Even an architecture as pure as Berkeley's RISC I is still
suspected by Colwell et al. [Col85].
In an attempt to settle the debate about whether a system is a
RISC, Tabak takes an approach similar to that of Colwell et al.
[Tabak86]. Eight RISC features are listed, and systems satisfying
five or more of them are regarded as RISCs. Computers from
Pyramid and Ridge, interestingly, are RISCs in Tabak's taxonomy,
but are not according to the definition of Colwell et al. above.
Some researchers think that subsetting a complex instruction
set is a matter of RISC: 'Although I doubt DEC is calling them
RISCs, I certainly found it interesting that DEC's single chip
VAXs do not implement the whole VAX instruction set' [Patt84].
But see Colwell et al.: 'The insinuation that the MicroVAX-32
follows a RISC tradition is unreasonable. It does not follow our
definition of a RISC; it violates all six RISC criteria' [Col85].
A thousand-mile march
IBM took the first step: the 801 project, launched in
1975 [Radin83]. The 801, Berkeley's RISC [Patt85] and
Stanford's MIPS [Henn82] in the early 80s were among the
most famous RISC attempts; they were regarded as pure
RISCs. Manufacturers such as Pyramid and Ridge [Ohr85] then
began to put out commercial RISCs. Others such as HP
[Birn86] and Fairchild [Neff86] also entered the fray later.
Commercial RISCs usually exhibit CISC features such as
floating-point and OS support [Mark84]. With other
seemingly irrelevant things such as a large register file,
controversies as to whether some features are RISCy and
whether some machines are RISCs are far from being settled.
Some commercial RISCs, including the 801's successor RT PC, are
not quite marketable at the present time [Bell86]; it is
still a long road. The programming language C on silicon
RISCs has been the focus, but research on RISC for other
languages such as the object-oriented language Smalltalk
[Unger84] and the AI language Prolog [Borr87], and on RISC in
new technologies such as GaAs [Nau87], is ongoing. For a
review, see [Gim87].

Controversy will never end. We will not try to settle any
debate; we just take a very simple but pragmatic view. In our
definition, RISC is a simple and fast machine. And if RISC is a
subarchitecture of more complex architectures, it must be a
streamlined one. The word 'simple' is used in a relative sense,
so a subset must be simple. If the subset is streamlined to run
fast, it is a RISC. But wait: does it mean that a simple and
fast machine is necessarily a RISC? Yes, if it is found to have
RISC features; and no, if it is not. To convince ourselves
that a streamlined subset is a RISC, we may inspect whether it
has RISC features. MicroVAX-32 is not a RISC because it is not
streamlined properly.
The key words in the definition are 'simple' and 'fast'. We
are referring to implementation simplicity, to be explained later,
and we want the machine fast: this is pragmatic, as most RISC
manufacturers just exploit RISC's speed [Man87a, Mok87, Wolfe87].
There are some assumptions behind this simple definition: we
believe that, as Patterson and others claim, simplicity is a
better match to VLSI and HLL, and compilers can bridge the gap
between HLL and the simple architecture very efficiently.
1.4 REDUCING A CISC
The current study is to explore the general ways of reducing a
CISC and to see whether such effort will bring improvements. The
IBM S/370, which has a large and complex enough instruction set
to be regarded as a CISC, is our target of study. In [Patt85],
Patterson describes two related points. The first concerns the
IBM S/360 M44. This model implements only a subset of the full
architecture and is found to have a significantly better
cost/performance ratio than its nearest neighbours.
The second point concerns the experiment performed by the IBM
801 group. Their optimizing PL.8 compiler is retargeted to the
370, which is treated as a register-to-register machine to effect
better register allocation. It is found that the 370 subset runs
programs 50% faster than code from the previous best optimizing
compiler, which knows (yet does not quite prefer) the full 370
architecture.
'Software and hardware experiments on subsets of the VAX and
IBM 360/370, then,' said Patterson, 'seem to support the RISC
arguments.' But there is something more to do. The sizes of the
instruction sets are reduced, but some instructions may still be
complex (FIGURE 1.1). This viewpoint coincides with that of
Colwell et al. [Col85]. The 801 and its later version RT PC may
be real RISCs, but they are just alternatives, or stepchildren,
of the 360/370 along a genealogy with a relation no thicker than
water. Although these experiments are encouraging, studies of the
feasibility of, methodology of, and performance change after,
reducing a CISC are still needed.
'Instruction sets for reduced architecture are chosen after
extensive study of compiler-generated code,' said Milutinovic et
al. [Milut86]. But we take another approach. We do not stick to
statistics; rather, we see RISC from the angle of implementation
simplicity: potentially single-cycle instructions like
register-to-register ALU operations, LOAD, STORE, and BRANCH are
candidate RISC instructions.
What about the cycle time?
Due to its simplicity, RISC does promise short cycle time.
But how short the cycle time is depends on the technology
and expertise the manufacturer masters. The number of logic
levels may be used to define cycle time to remove
technology-dependent factors, but this figure is hard to
determine. In many cases, the basic cycle time is limited by the
cache access time, because the cache is usually the slowest
element in the processor pipeline. In these designs, the
first pipeline stage is instruction fetch from cache, and
the remaining stages each take the same time as the first.
Hence, the cache access time may be used to define a cycle.
(In our research, 'cycle' refers to a pipeline cycle.) Many
RISC machines have longer cycle time than Motorola's CISC
machine MC68030, because of Motorola's expertise in dealing
with HCMOS. On the other hand, single-cycle execution is an
attribute of RISC that CISC can hardly have. The Edge 2000
computer is a CISC that makes use of architectural advances
to reimplement the 680X0 series so that most instructions
execute in one cycle. The cycle time is competitive with that
of many RISC machines, but the hardware requirement is
formidable [Man87a]. Therefore, if simplicity distinguishes a
RISC from CISCs, then single-cycle execution best tells
of RISC's speed advantage. Cycle time should be
optimized, but in our research, we are not able to tell how
good it is. Yet we will set out to design a single-cycle
simple machine, and the machine eventually executes
instructions in one cycle.
In the following chapters, the reader will see how we pick
out a subset from this point of view (chapters 2 and 3), and how
we devise techniques to streamline it (chapter 4). The reader
will also find that if we follow these simple procedures, the
reduced machine has most, if not all, RISC features. Instruction
usage, and hence silicon usage, is also better. The design is
simulated, and sample programs are run on the simulator to
demonstrate what speed advantage the reduced machine has over a







It has been pointed out that RISC is a subarchitecture of more
complex architectures (see for instance [Kort84]). Many of these
arguments have been based on comparisons of a reduced machine to
an architecturally very different complex machine, usually the
VAX 11/780. Suspicion then arises as to whether these comparisons
are fair [David87]: one can hardly tell dispassionately which one
of apple and orange is better, especially when both are equally
fresh.
In [McNel87], McNeley and Milutinovic try to evaluate the
performance of using a RISC to emulate an architecturally very
different CISC. Their conclusion is that direct emulation of the
CISC object code is inefficient, because complex features such as
complex addressing modes and condition code setting are not
emulated easily by the corresponding, if any, quite different
RISC features.
If a subset is carefully picked out from a CISC, it is more or
less RISC. But before the subset is compared to the CISC, it has
to be streamlined according to RISC principles. Only after
that is the reduced machine born in wedlock. Many of the
differences clouding fair comparisons and problems concerning
emulation efficiency will then be removed. To reduce a CISC in
this way sounds interesting and rewarding. Anyhow, problems
exist, and some are found too hard to fix.

[Figure: CISC broken down into a RISC kernel and optional
coprocessors working possibly in parallel.]
RISC principles altogether boil down to investment in the
right things in the right way, and the right things are just the
simple ones amidst all. To fire out complex things, one should
allow for flexibility to reemploy them as add-ons, so that
different workloads can get their right profiles. Hopefully, the
streamlined kernel gets along well with the add-ons and,
incidentally, complex add-ons group together in some fashion as
other 'RISC' chips. This point will be explored later.
2.2 PLUMBLINING A RISC
Due to its inherent complexity, a CISC instruction set imposes
stringent requirements on hardware to get fast execution. Just
imagine how many functional units are needed if all instructions,
grouped in many sets operating on different data types, having a
variety of addressing modes and instruction formats, are to be
made as fast as possible. To schedule all this hardware to work
in maximum parallelism is as formidable a task as to write
optimum microprograms for the CISC.
That's why CISC designers would like to take the AND approach:
trying to factor out common requirements, they repeatedly use the
hardware they have to emulate CISC instructions. The result is
that, in many cases, even simple instructions need many cycles.
'DEC has a terrible problem with the VAX, because
you just can't build it to run fast,' said one researcher who
requested anonymity. 'And in the same way, IBM is stuck with the
370' [Wall85].
Quite the contrary, it is economically feasible to get all
hardware answering to the global requirements of RISC
architectures. RISC has few instructions and its instructions are
vertical [Patt83] in the sense that relatively little parallel
hardware is needed for each of them to go speedily. The result is
that we can distribute hardware over a simple and efficient
pipeline, in such a way that each instruction will finish its
work in one pipeline cycle, and each cycle is no longer than the
time to execute one microinstruction.
To consolidate our belief, we quote three papers:
.... performance of the processor itself can be enhanced in
two ways: by cutting the number of processor cycles to
perform a given function or by cutting the time used per
microcycle.... To obtain this increased functionality,
however, a much more elaborate set of data paths is required
in addition to a highly developed control to exercise them
to maximum potential. Such a change is not an incremental
one and involves rethinking the entire implementation
[Snow81]. (This comment is made about CISCs. Author's note.)
Researchers observed that microcoded machines could not run
faster than 1 microcycle per instruction, typically
averaging 3 or 4 microcycles per instruction; yet the
simple operations in many programs could be found directly
in microinstructions. As long as machines were too
complicated to be implemented by ROMs, why not take
advantage of RAMs by loading different microprograms for
different applications? About this point, several
people, including those who had been working on
microprogramming tools, began to rethink the architectural
design principles of the 1970s. In trying to close the
semantic gap, these principles had actually introduced a
performance gap. The attempt to bridge this gap with WCSs
was unsuccessful, although the motivation for WCS -- that
instructions should be no faster than microinstructions and
that programmers should write simple operations that map
directly onto microinstructions -- was still valid [Patt85].
A RISC machine aims to keep the value of N (the average
number of clock cycles per instruction) near one, while
letting D (the dynamic instruction count for the
benchmark) increase only slightly compared to a microcoded
machine. Typical microcoded machines have average values of
N in the range of 5 to 10 (including penalties due to
pipeline breaks, but ignoring stalls due to cache misses
and TLB misses). By comparison, a RISC machine might aim to
have N in the range of 1.2 to 1.5 (also ignoring cache
penalties). The performance advantage of a RISC
microprocessor comes from the fact that N is 4 2 times
lower and D is only slightly larger. A RISC machine
will have a clock cycle time as low as, or lower than, a
microcoded engine; increased clock rates are possible
because the hardware complexity is lower [Henn85].
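The last quotation can be read as a simple performance model: execution time is the product of the cycles per instruction N, the dynamic instruction count D and the cycle time. The sketch below only restates that arithmetic; the function names and the sample figures (a 30% growth in D and equal cycle times) are assumptions chosen for illustration, while the N values are taken from the ranges quoted above.

```python
def execution_time(n_cycles_per_instr: float, d_instr_count: float,
                   cycle_time: float) -> float:
    """Time = N (cycles per instruction) x D (instructions) x cycle time."""
    return n_cycles_per_instr * d_instr_count * cycle_time


def speedup(n_cisc: float, d_cisc: float, t_cisc: float,
            n_risc: float, d_risc: float, t_risc: float) -> float:
    """Ratio of microcoded-machine time to RISC time for one benchmark."""
    return (execution_time(n_cisc, d_cisc, t_cisc)
            / execution_time(n_risc, d_risc, t_risc))


# N = 5 versus N = 1.5 comes from the quoted ranges; the 30% larger
# instruction count and the equal cycle times are assumed values.
example = speedup(n_cisc=5.0, d_cisc=1.0, t_cisc=1.0,
                  n_risc=1.5, d_risc=1.3, t_risc=1.0)
print(round(example, 2))   # about 2.56
```

Even with the least favourable N values of both quoted ranges and a noticeably longer instruction stream, the model still predicts a clear gain, which is the point the quotation makes.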
Another related example is IBM's experiment on its CAD tools.
The 370 is put on a gate array chip with about 5000 gates
(microstore not included). In the report [Davis80], it is said
that 'The rich instruction set of the IBM System/370 requires a
weighted average of 50 machine cycles for each instruction.'
Another approach is taken by a Stanford group. An 11,000
gate-count MSI UHM design (universal host machine with WCS to
adapt to different workloads) is simulated. 'Note that if the
designers of the 370 chip had 11,000 at their disposal,' said
Flynn, 'they would have certainly been able to realize a
well-mapped host with about 8-10 cycles per image instruction'
[Flynn83].
It is not difficult to imagine that with a modest gate count, a
reduced 370 can execute instructions at a single-cycle rate. As
most S/370 implementations are microprogrammed, the performance
improvement when using hardwired control is understandable,
although the possibility of arranging parallel hardware to
achieve single-cycle execution with a constrained gate count is
yet to be investigated.
What functional units are needed?
In addition to a conventional ALU, a barrel shifter and a
multi-port register file are usually needed to match the
granularity of RISC. A RISC pipeline, however simple, has at
least two stages: one for fetching instructions and the
other for executing them. A pipeline cycle has a number of
phases; the number depends on the parallelism of the
hardware arrangement. If the register file does not have
enough ports to support concurrent read/write requests,
these requests must occur in different phases, and the cycle
time is then lengthened. In some machines, an additional
adder for evaluating effective addresses is included. A
parallel multiplier or partial hardware to support
multiplication can be found in some machines [Magen87].
2.3 RISC TAILORING
It is thought that reducing a CISC is an iterative process: a
chosen subset will be modified dynamically according to further
evaluations. Therefore, beginning with a subset as small and as
simple as possible will make life easier. Fortunately, compilers
like 'primitives, not solutions' [Wulf81]. As simple instructions
lend themselves to single-cycle execution, this subsetting
process is called single-cycle filtering.
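Single-cycle filtering can be pictured as a predicate applied to a catalogue of instructions. The sketch below is an illustration only: the record fields, the rule inside `single_cycle_candidate` and the classification of the sample entries are assumptions, not the selection actually made in this thesis. The mnemonics are ordinary S/370 ones (AR, A, L, ST, BC, MVC, MR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrInfo:
    mnemonic: str
    storage_operands: int   # operands fetched from or written to storage
    is_load_store: bool
    is_branch: bool
    multi_step: bool        # inherently iterative (multiply, string move, ...)


def single_cycle_candidate(info: InstrInfo) -> bool:
    """Assumed rule: keep register-to-register operations, loads/stores
    and branches; drop anything iterative or with storage operands."""
    if info.multi_step:
        return False
    if info.is_load_store or info.is_branch:
        return True
    return info.storage_operands == 0


# A small hand-made catalogue; the attribute values are illustrative.
CATALOGUE = [
    InstrInfo("AR", 0, False, False, False),   # add register to register
    InstrInfo("A", 1, False, False, False),    # add storage operand to register
    InstrInfo("L", 1, True, False, False),     # load
    InstrInfo("ST", 1, True, False, False),    # store
    InstrInfo("BC", 0, False, True, False),    # branch on condition
    InstrInfo("MVC", 2, False, False, True),   # storage-to-storage move
    InstrInfo("MR", 0, False, False, True),    # multiply registers
]

kernel = [i.mnemonic for i in CATALOGUE if single_cycle_candidate(i)]
remnant = [i.mnemonic for i in CATALOGUE if not single_cycle_candidate(i)]
```

Because the process is iterative, the predicate rather than the resulting list is the thing to revise after each evaluation.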
After the candidates have been filtered out, they should be 
streamlined. The number of addressing modes and instruction 
formats should be kept small, and different-length instructions 
should be evened out so as to keep decoding overhead low. For 
short-format instructions, an immediate field or a third register 
designation field may be appended. For extra long ones, a few 
bytes should be cut if possible.
The kernel should be complete in the sense that it can work
efficiently in HLL environments. New instructions may be
introduced to enhance its performance or to make it more
complete. However, great changes should be avoided so that
comparisons can be simplified. Keeping architectures alike also
makes it possible for existing software to be reused with only
slight modification.

[Figure: direct emulation]
2.4 CISC PROGRAMS ON RISC
The conventional method to compare a CISC with a RISC is to run
HLL benchmarks on the two machines, then evaluate them based on
their speed. If the CISC and RISC have different architectures,
or if different compilers are used for the same benchmarks, part
of the true story behind CISC/RISC will be distorted: this is the
case for most, if not all, publications so far reported. The
CISC/RISC pair in our case has a very close link, and an attempt
is made to let the RISC emulate the CISC's object code, so that
the same or very similar compilers are used for both
architectures.
In the following analysis, we will try to find out what
speedup is attainable if a CISC is reduced. It is believed that
RISC is able to emulate primitive instructions very efficiently,
because each instruction on the average needs fewer cycles than
the corresponding CISC one. For complex cases, how well RISC
performs is still unknown, but we will highlight the interplay
between these two classes of emulation. There are good reasons to
believe that reduced complexity could lead to reduced cycle time,
yet we assume an equal quantum for the CISC/RISC pair, because
any assumption favoring the CISC will place us on the safe side.
Our group is too small to write a compiler for the RISC.
Without the right compiler tailored for the reduced architecture,
many of the merits of RISC will remain hidden. Looking from
another angle, if the reduced machine outperforms the CISC, it is
not difficult to conclude that the RISC with a compiler is even
better. Moreover, if direct emulation is possible, migration from
CISC to RISC is very easy, because existing software can be
reused. Incidentally, critics tell us that RISC requires new
investment in software, but RISC advocates respond that because
of the many RISC advantages, it is worth the effort.
A RISC subset S_R is picked out first from the CISC instruction
set S_C; all the other instructions, to be discarded, are kept in
the remnant set S_D (i.e. S_C = S_R + S_D). In the CISC, S_R
cooperates with S_D to get programs done, but it is believed that
S_R works harder. S_R is now streamlined, and the new RISC
instruction set S_R' contains only primitive instructions and is
made very powerful. S_R' may now contain some new instructions,
but for our purpose it doesn't matter because they are not used
directly. The work to be done by S_D is now emulated by S_R'. We
will see how bad an emulation can be tolerated if a
greater-than-unity speedup is not to be compromised.
In the following, a prime is added to all RISC parameters. A
subscripted c refers to a number of cycles, while a
subscripted n refers to a dynamic count in a benchmark. Subscript
r concerns parameters pertaining to S_R, and subscript d to
S_D. For each instruction I_r in S_R, the CISC needs c_r cycles
and the occurrence of I_r in the benchmark is n_r. Likewise, for
each I_r' in S_R', the RISC needs c_r' cycles. The RISC has no
instructions like I_d', but in effect the emulation takes c_d'
cycles. Overbars denote averages.
For direct emulation, dynamic frequencies are unchanged.
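As a sketch of how these parameters combine, assume that total execution time is the frequency-weighted sum of cycle counts, with n_r and n_d taken here as the total dynamic counts of the two classes. The symbols T, T' and S are introduced only for this illustration.

```latex
\begin{align*}
T  &= n_r \bar{c}_r + n_d \bar{c}_d ,
\qquad
T' = n_r \bar{c}_r' + n_d \bar{c}_d' , \\
S  &= \frac{T}{T'}
    = \frac{n_r \bar{c}_r + n_d \bar{c}_d}{n_r \bar{c}_r' + n_d \bar{c}_d'} , \\
S > 1 &\iff \bar{c}_d' < \bar{c}_d + \frac{n_r}{n_d}\,\bigl(\bar{c}_r - \bar{c}_r'\bigr) .
\end{align*}
```

Under this reading, the speedup tends to the ratio of average cycle counts for the primitive instructions when n_d is negligible, and the emulation of discarded instructions may be slower than the CISC by exactly the margin that the primitive instructions win back.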







From (2.2) we can see that if I_r dominates, the speedup is the
factor by which RISC instructions outspeed CISC ones. If we have
coprocessors which outpace the CISC instructions by the same
factor, the speedup is s0 no matter what f is. That is to say, a
coprocessor is like another RISC chip cooperating with the kernel
alternately. We can get better speedup and flexibility by getting
the right coprocessor to work in parallel with the RISC kernel.
As it is said that RISC is CRAY-like [Bell86] and CRAY can do
floating-point arithmetic, why can't a floating-point chip,
tailored specially for the RISC kernel, be a RISC? If we invest
solely in floating-point things in the floating-point chip and
make the chip very powerful, it is a floating-point RISC. It is
interesting to find that in many circumstances, a floating-point
chip with parallel hardware for simple floating-point arithmetic
operations like (+, -, ×, /) runs programs faster than a
coprocessor with an elaborate set of instructions (e.g. with
transcendental functions) [Thomp88].
Maximum speedup when using P heterogeneous processors
When a problem is split up into a serial part f and a
parallel part (1-f) to be processed in parallel by P similar
processors, the maximum speedup is S = [f + (1-f)/P]^{-1}   (1).
This is called Amdahl's law. From (1), we see that the
maximum speedup is limited by the serial part of the
problem, no matter how large P is. On the other hand, if we
use P heterogeneous processors, each for a distinct data
type, there can be no serial part because the 'serial part'
can be overlapped with operations of the other processors.
In this case, maximum speedup is determined by
synchronization and communication among the P processors.
If the problem is partitioned into P parts, the speedup is:
The maximum speedup is limited by the slowest processor due
to synchronization. Special routines may be tailored for
special applications, so that the partition is made
advantageous for parallelism.
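A minimal Python sketch of the two cases in this box follows. `amdahl_speedup` evaluates Amdahl's law as stated in (1). `partitioned_speedup` is an assumed model for the heterogeneous case, not taken from the text: the P parts run concurrently, so elapsed time is that of the slowest part. Both function names are hypothetical.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Amdahl's law: serial fraction f, p identical processors."""
    if not 0.0 <= f <= 1.0 or p < 1:
        raise ValueError("need 0 <= f <= 1 and p >= 1")
    return 1.0 / (f + (1.0 - f) / p)


def partitioned_speedup(part_times: list[float]) -> float:
    """Assumed model: all parts overlap, the slowest part sets the elapsed time."""
    if not part_times or min(part_times) <= 0:
        raise ValueError("need at least one positive part time")
    return sum(part_times) / max(part_times)


if __name__ == "__main__":
    # With a 10% serial part the speedup saturates below 10 however large P is.
    for p in (1, 4, 16, 1024):
        print(p, round(amdahl_speedup(0.1, p), 2))
    # A balanced partition reaches P; an unbalanced one is held back by the slowest part.
    print(partitioned_speedup([3.0, 3.0, 3.0]), partitioned_speedup([1.0, 1.0, 4.0]))
```

The second function makes the point of the box concrete: the gain comes from balancing the partition rather than from shrinking a serial fraction.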
Note that if we partition S_D into S_D1, S_D2, ..., S_Dp, for all of
the p different data types the CISC supports, we might get a
maximum of p different coprocessors to match the workload. If
each of these p coprocessors has the same speedup over the
corresponding CISC instructions, the grand speedup is still s0 no
matter what the instruction distribution is and what benchmark is
run. An extra speedup factor of p and extra flexibility are also
possible.
The RISC kernel alone may be used to emulate non-RISC
instructions, but speedup will be compromised if (2.1) is not
satisfied. From (2.1) we can see that if the emulation takes no
longer than the CISC time c_d, we may still have a gain, as
guaranteed by the RISC performance and the usually skewed
statistics. Real statistics and figures will be used in the next
chapter as an example to illustrate the performance of the
reduced machine.
2.5 SPEEDUP IN OTHER CASES
In the last section, the same compiler is assumed when a RISC
machine is compared to the CISC from which the reduced machine is
derived. HLL programs are compiled down to CISC code and are then
directly emulated by the RISC machine. The speedup in this case
depends on how good the emulation of complex
instructions is. In practice, different compilers are used for
r r G G pi 1! i i .1 Z I HQ G G m p I 1 0! n f h B 0•' D 0 R 1 (T 0 R L d 0 S C R 1 b 0 G 1 H 0 0 C 0]. O H 1 a 4
:: '0 c 0 v i d 0 r g 0 g n a t w n 0 n a n a. d v a. n c: 0 d 4' L 3 c o m p 1 1 0 r t n r t he IB M
~~ 3;, jTj 0. c G 1 H t I 0- K 0 R a. R G 0 T t 0 G I O G L•'- 0. p 0 0 (j LI j. 1 0 G- R 0 0 R V 0 d n
0-1-r~~ i R l r 0 a g n :r 4 j 0 r Gj li 0 0 i 'i a. t rj n 0 o t g h 0 li n t a i r t a. g t o r 0
G G n li 1 I cG 1 G G G R- z-—• G 0 R T G R fT» afiCt IB G R B Li G B G T D6IGB R G G iT) G 1 1 B R I R
RISC systems. Using assembler to write code separately for the
RISC and CISC machines, Heath finds that RISC improvement is only
modest.
To avoid unfair comparison due to dissimilar compilers, we
assume similar compiler technology in this way: when an HLL
program P_HLL is compiled down to a CISC program P_C using a
compiler targeted to the CISC architecture, and when P_HLL is
compiled down to a RISC program P_R using another compiler
targeted to the RISC architecture, the execution time will be
the same if P_C and P_R are separately run on M_C, the CISC machine.
Further, the same execution time is expected when P_C and P_R are
separately run on M_R, the RISC machine. The following symbols are
introduced for use:
N(P_X, M_Y): execution time in cycles of P_X on machine M_Y   (2.3)
where X and Y could be C (for CISC) or R (for RISC).
The assumption above says
N(P_C, M_C) = N(P_R, M_C),   N(P_C, M_R) = N(P_R, M_R)   (2.4)
From the analysis of the last section, when rephrased in the

















N' l c' Mr)











From (2.5) and (2.6), we also get s So (2.9)




3- o 3 c
S o 3- o
(2.10)
If s0 is established, the whole matrix is known, so that even
if different compilers are used for the CISC and RISC machines,
the speedup in any circumstance is readily known under the
assumption of the same compiler technology. In the following
chapter, some real statistics are used to estimate practical
speedup values.
CHAPTER THREE
ATTACKING S/370
3.1 INTRODUCTION TO THE 370 ARCHITECTURE
370 is a computer family announced by IBM in the early 70s. With
technological advances and the experience gained from 360, IBM
added new functions to enhance the performance of 370. Under the
same architecture, different implementations are available which
span a wide range of cost and performance.
370 has about 200 instructions which fall in five classes:
general, decimal, floating-point, system-control and I/O
instructions. Each instruction may take a different time for
completion, depending on the class it belongs to. With this
extensive and irregular repertoire, 370 is complex enough to be
called a CISC architecture [Patt85]. There are 16 general
registers, 16 control registers and 4 double-word floating-point
registers. As 370 is a 32-bit machine, a word is 32 bits wide, a
double-word has 64 bits while a halfword has 16. Three memory
data types are supported: byte, halfword, and word.
The general registers can be used as index registers in
address arithmetic and as accumulators in ALU operations. Control
registers are used to control system facilities. They are only
referred to by privileged instructions to which normal users
cannot gain access. Floating-point registers are used to hold
floating-point operands. The program-status-word, or PSW for
short, is a register in the CPU for instruction sequencing and
program-state recording.
There are six basic instruction formats: RR, RX, RS, SI, S,
and SS. Different operand classes are involved in different
formats. An RR instruction is a halfword long and a
register-to-register operation is involved; RX, one word,
register-and-indexed-storage; RS, one word, register-and-storage;
SI, one word, storage-and-immediate; S, one word,
implied-operand-and-storage; and SS, three halfwords,
storage-to-storage.
The address for main storage is calculated by the base-address
plus offset rule using double indexing. In addition to
an index register and a 12-bit displacement field, a base
register is used to allow for relocatability, which facilitates
multiprogramming. If general register 0 is designated in the
index or base register field, no indexing is done, as if a zero
were hardwired to register 0. In view of this, general registers
are actually not so general. The addressing capability of 370 is
quite weak since the address space is limited to 24 bits (31 bits
for the later version), and the displacement field to 12 bits.
Some improvements are suggested in the box on the next page.
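The address rule just described can be sketched in a few lines of Python. This is an illustration only: the function name, the register-file representation (a list of 16 integers) and the constant name are assumptions, and the 24-bit wrap-around models the address space limit mentioned above.

```python
ADDRESS_BITS = 24  # width of the address space described in the text


def effective_address(regs, base, index, displacement):
    """Base + index + 12-bit displacement; register number 0 contributes zero."""
    if not 0 <= displacement < 4096:
        raise ValueError("displacement must fit in 12 bits")
    b = regs[base] if base != 0 else 0      # register 0 in the base field means "no base"
    x = regs[index] if index != 0 else 0    # register 0 in the index field means "no index"
    return (b + x + displacement) & ((1 << ADDRESS_BITS) - 1)


if __name__ == "__main__":
    regs = [999] + [0] * 15          # whatever register 0 holds is ignored for addressing
    regs[1], regs[2] = 0x1000, 0x20
    print(hex(effective_address(regs, 1, 2, 0x8)))
    print(effective_address(regs, 0, 0, 5))
```

The sketch shows why register 0 is "not so general": its contents can never take part in address arithmetic.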
Almost all S/370 models are microprogrammed, with the
exception of Model 195. Low-end models were announced and shipped
in the early 70s. Those models, such as 370/125, have no cache
and employ little internal parallelism. The main memory speed,
the control storage speed, and the processor speed are equal for
low-end models, while the processor speed for high-end models is
several times higher than that of the main memory, but is matched
to that of cache.
We have summarized the architectural features of 370 that
pertain to our study. Interested readers may refer to [Case78]
and [IBM78] for more detailed discussions.
Is the 370 addressing capability good enough?
The 370 address space is 24 bits wide, which is clearly not
enough for modern multi-user environments. Here we suggest
that the address space be extended to 32 bits. Double
indexing plus 12-bit offset is the only addressing mode. A
12-bit displacement may be quite enough for addressing
small arrays, but it is certainly not enough for larger
arrays and a long jump beyond 4096 bytes. Alexander and
Wortman in [Alex75] record that a branch instruction is
usually preceded by a load-register instruction because
that register is used as an index to augment the short
offset for farther branches. To ameliorate the problem,
PC-relative branch is recommended in the paper. But to heal
the addressing problem, a longer offset is suggested here.
In an RX instruction, the base register field is situated
just to the left of the displacement field. If we sacrifice
one degree of freedom by eliminating base indexing, we can
coalesce the base and the displacement field to get 16-bit
offsets. As double indexing is found unusual and is not
used in 801 [Radin83], coalescing is thought to be quite
acceptable. In [Chow87], Chow et al. say that a single
addressing mode, a 16-bit signed offset from a 32-bit register,
is adequate for a general purpose processor. 32-bit offsets
are generated by preceding a 16-bit offset instruction by a
special load upper immediate (lui) instruction. In
effect, the upper half of the index register is loaded with
an immediate value and is added to the 16-bit offset to
obtain the 32-bit absolute address. Together with a
register hardwired with a zero, a wide variety of
addressing modes can be synthesized. With this single
addressing mode in the R2000 processor of MIPS Inc., an
accompanying optimizing compiler is found to be able to
take advantage of this limited addressing capability. By
this token, if we add a new instruction like lui to 370 and
expand its offset field to 16 bits, and further, if we
append a halfword immediate field to the halfword RR
instructions and allow this new field to designate a third
register, we have all the addressing capability of R2000.
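The lui scheme can be illustrated with a short Python sketch. It assumes a signed 16-bit offset, as in the Chow et al. description, so the upper half has to be adjusted whenever the low half of the target address is 0x8000 or above. All function names are hypothetical.

```python
def split_address(addr):
    """Split a 32-bit address into (upper, offset) with a signed 16-bit offset."""
    addr &= 0xFFFFFFFF
    offset = addr & 0xFFFF
    if offset >= 0x8000:          # low half would be read as negative: borrow from the upper half
        offset -= 0x10000
    upper = ((addr - offset) >> 16) & 0xFFFF
    return upper, offset


def lui(upper):
    """Load-upper-immediate: place a 16-bit immediate in the upper half of a register."""
    return (upper & 0xFFFF) << 16


def base_plus_offset(reg, offset):
    """The single addressing mode: 32-bit register plus signed 16-bit offset."""
    return (reg + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    for target in (0x00001234, 0x12348000, 0xFFFFFFFF):
        upper, offset = split_address(target)
        assert base_plus_offset(lui(upper), offset) == target
        print(hex(target), hex(upper), offset)
```

Two instructions therefore reach any 32-bit absolute address, which is the property the box relies on when it proposes adding a lui-like instruction to 370.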







TABLE 3.1 Revealing the multi-cycle nature of 370
3.2 SUBSETTING AND STREAMLINING
We will see in this section how the simple instructions we want
are filtered out and how they are streamlined. By simple
instructions we mean those instructions that are directly
supported by the necessary hardware and hence can be executed
smoothly. RR-ALU instructions, LOAD/STORE, and BRANCH are examples.
The size of the instruction set is not a real problem. Fuller
et al. in [Full77] test several architectures, and S/370 shows
good performance in subsettability and extensibility. The general
instruction subset is a good starting point. What cause problems
are the multicycle instructions (TABLE 3.1).
From TABLE 3.1 we see that only some RR instructions are
single-cycle in nature. Some seemingly simple instructions such
as the immediate OR (OI) instruction require four cycles. We
count the number of cycles as the number of critical operations
required by the instruction, leaving out the time for instruction

















FIGURE 3.1 The three instruction formats
of the reduced machine.
instruction involves two operations (address evaluation and the
OR operation) and two operand accesses, hence consumes four
cycles.
The first step is to separate simple instructions from complex
ones by using a single-cycle filter. From the consideration of
hardware complexity, we let through instructions that consume one
or two cycles, and forget about the residue for the moment. The
next step is to find an economical way to make the two-cycle
instructions run in a single cycle, and to devise some methods to
keep the pipeline smooth. Here we only discuss architectural
issues; implementation-related techniques are deferred to the
next chapter.
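The filtering step can be sketched as follows. The cycle count is taken, as in the text, to be the number of critical operations plus operand accesses. The OI entry follows the description above (two operations, two operand accesses); the AR and L entries are assumed values chosen only for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    operations: int        # critical operations, e.g. address evaluation, ALU operation
    operand_accesses: int  # storage operand accesses

    @property
    def cycles(self) -> int:
        return self.operations + self.operand_accesses


def single_cycle_filter(instrs, max_cycles=2):
    """Let through instructions needing at most max_cycles; the rest form the residue."""
    kept = [i for i in instrs if i.cycles <= max_cycles]
    residue = [i for i in instrs if i.cycles > max_cycles]
    return kept, residue


CANDIDATES = [
    Instr("AR", operations=1, operand_accesses=0),  # assumed: register-to-register add
    Instr("L", operations=1, operand_accesses=1),   # assumed: address evaluation + one access
    Instr("OI", operations=2, operand_accesses=2),  # from the text: four cycles
]

if __name__ == "__main__":
    kept, residue = single_cycle_filter(CANDIDATES)
    print([i.name for i in kept], [i.name for i in residue])
```

Tightening `max_cycles` to 1 gives the finer filter discussed in the next paragraph, which would leave only the one-cycle RR instructions.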
The resulting instruction set comprises arithmetic, logical,
shifting, data movement and sequence control instructions that
bear some resemblance to Berkeley's RISC II [Patt85]. Had we used
a finer filter to let through only the one-cycle instructions, the
instruction set would have been too restrictive. 'Why don't you,'
suppose someone asks, 'let through also the three-cycle
instructions and get some hardware to make them run in a single
cycle?' Upon the first trial, we would like to make life easier
by radically leaving out as many things as possible. We may add
instructions after further investigations if they really enhance
performance.
What about the complex instructions?
Complex cases are best synthesized by the compiler using
primitive instructions. Optimizing compilers in some cases
can remove much run-time overhead by solving problems at
compile-time rather than tackling them by complex
instructions at run-time. In the case of direct emulation
of CISC object programs, we cannot take advantage of this
sleight of hand. Complex cases are trapped to subroutines
for emulation. To get efficiency, the subroutine should be
kept in fast memory (cache, for example), and possible
bubbles during subroutine call/return should be avoided. An
alternative path from the fast memory holding the
subroutines is suggested; this path is chosen automatically
once a complex instruction is detected. The processor
switches back to the normal instruction stream right after
the emulation is completed. A possible implementation of
this 'perfect trapping' is illustrated in the next chapter.
Still another possibility is to do some flow analysis on
the object code prior to execution. Complex cases are
treated as macros that are, on detection, expanded in-line.
As the textual content of the program is altered,
non-symbolic references must be readjusted. Some registers must
be freed to hold intermediate values produced by the
macros. Saving/restoring registers on a per-macro basis may
be very time-consuming; a better solution is to break the
program into single-entry, single-exit basic blocks using
standard techniques [Aho77], [Lowry69]. If in a basic block
some registers are not used before the macros are expanded,
they can be used if they are left intact on exit. We simply
save these registers on entry and restore them on exit. For
very large blocks such as program loops, the saving is very
significant. In the trapping scheme, we may use hidden
registers in the alternative path so that no pre-run-time
optimization is needed, and transparency is obtained.
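The per-block register saving described in this box can be sketched in Python. The instruction representation (a mnemonic plus a tuple of register numbers), the pseudo-operations SAVE and RESTORE, the complex operation CPLX and its expansion are all hypothetical. The sketch also assumes the block does not end in a branch, so the restores can simply be appended.

```python
def free_registers(block, all_regs=range(16)):
    """Registers that no instruction in the basic block touches."""
    used = set()
    for _, regs in block:
        used.update(regs)
    return sorted(set(all_regs) - used)


def expand_block(block, macros, scratch_count=1):
    """Expand complex instructions in-line, saving scratch registers once per block."""
    if not any(op in macros for op, _ in block):
        return list(block)                      # nothing to expand, nothing to save
    scratch = free_registers(block)[:scratch_count]
    if len(scratch) < scratch_count:
        raise ValueError("not enough unused registers in this block")
    out = [("SAVE", (r,)) for r in scratch]     # one save on block entry
    for op, regs in block:
        if op in macros:
            out.extend(macros[op](regs, scratch))
        else:
            out.append((op, regs))
    out.extend(("RESTORE", (r,)) for r in scratch)  # one restore on block exit
    return out


# Hypothetical complex instruction: expands to a load into scratch plus an RR add.
MACROS = {"CPLX": lambda regs, s: [("L", (s[0],)), ("AR", (regs[0], s[0]))]}

if __name__ == "__main__":
    block = [("AR", (1, 2)), ("CPLX", (1,)), ("CPLX", (2,))]
    for instr in expand_block(block, MACROS):
        print(instr)
```

However many macros the block contains, the scratch register is saved and restored only once, which is where the saving for large blocks such as loops comes from.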
As a by-product of the single-cycle filtering, the number of
instruction formats is reduced to three: we have only RR, RX and
RS in the 31 instructions chosen (see APPENDIX A). To even out
the raggedness in these three formats, a halfword is appended to
the RR instructions as a multiplexed register/immediate operand
field found in most RISCs. All the 31 instructions are a whole
word long. Decoding is simple because of the uniformity in the





R1 ← R1 op R2
R1 ← R2 op (Rd or immediate)
R1 ← R2 op (Rd = R1)
As we can see, compatibility is easily enforced by duplicating R1
to Rd, the new field.
We don't have to reduce the number of addressing modes
intentionally, either. There is only one addressing mode, and it
has been very restrictive. We still support three data types:
word, halfword, and byte. The 31 instructions on the three data
types have already made a quite versatile general purpose
processor.
3.3 AN EVALUATION USING REAL STATISTICS
In [Alex75], nineteen XPL system programs on IBM S/360 are
investigated. The most frequent HLL constructs are found to be
ASSIGNMENT, IF/THEN, and CALL. These constructs map simply to
primitive sequences, and not surprisingly, simple instructions
dominate the statistics. Suppose we run the object programs on
our reduced machine. The frequency distribution is unchanged,
hence we can use the statistics to evaluate the new design. We
will compare the new machine with 370/125, because we only have
3125's data. Without taking pipeline breaks and cache misses into
account, instructions in the RISC repertoire are assumed to run
in a single cycle. We haven't really emulated complex cases, but
we assume the emulation takes about the same number of cycles. We
also estimate the possible increase in program size.
It should be pointed out that the definition of cycle used
here may be different from that used in 3125. In our case, we
partition an instruction into critical parts and arrange hardware
over a pipeline with the number of stages matched to the number
of partitions. Each stage is then assumed to take one (pipeline-)
cycle. In the case of 3125, the microprogramming cycle time is
used instead. The main memory used in the design of this model



























































































































































































































TOTAL: 100% 2203 505
TABLE 3.2
Performance of the reduced machine relative to 3125-2. Data
are taken from [Alex75], [IBM76], and [Case78]. Assumptions
are taken to favor the CISC as far as possible. Highlighted
complex instructions are those with unknown emulation
efficiency. Equal-to-CISC cycles are assumed in these cases.
has an access time matched to the processor time; one cycle then
refers to either one (micro-)processor or one memory cycle. The
reader will see in chapter 4 that the first stage in our pipeline
design is for cache accessing; this similarity in the two
definitions should be noted.
Although there are ambiguities in the definition of a cycle,
it is true that a cycle time as high as 3125's (320ns
in 3125-1, and 480ns in 3125-2) is easily attainable with today's
technology. Therefore, if the number of cycles needed by 3125 is
larger, the execution time can easily be regarded as larger. If
the number of logic levels is used to define cycle time, the
ambiguities could be removed. Yet this definition is hard to
obtain. The comparison is therefore a crude one.
From TABLE 3.2, it is seen that 35 of the 200 instructions
account for 99.6% of the dynamic statistics. Over 80% are
single-cycle RISC instructions, and about 16% are simple RX
instructions that can be emulated by the RISC in 2 cycles. It is
interesting to note that below 20% of the 370 repertoire shows up
in the statistics, but over 70% of the RISC repertoire makes an
appearance. Hence, the reduced machine is used more efficiently
in this case.
Our reduced machine can emulate 97.7% of the trace in 1.17
cyc/instr on the average, and the remaining 2.3% in 170 cyc/instr.
3125-2 needs 18.5 cyc/instr for the same 97.7%; the speedup is,
therefore, 18.5/1.17 ≈ 16 if we leave out the 2.3% of complex
cases. But if this mere 2.3% is counted, the speedup drops to
less than 5. We thus see how important emulation efficiency is.
By using the equations derived in the last chapter, we can see
that an emulation speed as low as 900 cyc/instr is possible if we
still want speedup. We have now assumed 170 cyc/instr, the same
figure as in 3125-2. It is believed that we have overestimated the
RISC time for the complex cases: the 2.3% runs even longer than
the 97.7%. However, we'd rather favor the CISC.
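The arithmetic behind these figures can be checked with a short Python sketch. It assumes the simple model in which execution time is the frequency-weighted sum of cycles per instruction; the function and parameter names are hypothetical, and the inputs are the figures quoted above.

```python
def speedup(f_simple, cisc_simple, risc_simple, cisc_complex, risc_complex):
    """CISC time over RISC time, both as frequency-weighted cycles per instruction."""
    f_complex = 1.0 - f_simple
    cisc = f_simple * cisc_simple + f_complex * cisc_complex
    risc = f_simple * risc_simple + f_complex * risc_complex
    return cisc / risc


def break_even_emulation(f_simple, cisc_simple, risc_simple, cisc_complex):
    """Largest emulation cost (cyc/instr) for the complex cases that still gives speedup 1."""
    f_complex = 1.0 - f_simple
    cisc = f_simple * cisc_simple + f_complex * cisc_complex
    return (cisc - f_simple * risc_simple) / f_complex


if __name__ == "__main__":
    # 97.7% simple cases: 18.5 cyc/instr on the CISC, 1.17 on the RISC;
    # 2.3% complex cases: 170 cyc/instr assumed on both machines.
    print(round(18.5 / 1.17, 1))                                   # simple cases only
    print(round(speedup(0.977, 18.5, 1.17, 170.0, 170.0), 2))      # complex cases counted
    print(round(break_even_emulation(0.977, 18.5, 1.17, 170.0)))   # tolerable emulation cost
```

Under this model the three printed values come out at roughly 16, a little under 4.4, and a little over 900, in line with the figures in the text.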
As some simple complex instructions need two RISC
instructions, and RR instructions are extended, the increase in
program size, if the other complex cases are not taken into
account, is around 20%. The decrease in instruction density is,
in this case, not quite serious, but if the 2.3% is counted, the
density may drop significantly, so that an instruction cache is
needed.
3.4 SUMMING UP
We begin with a RISC definition drawn from the angle of
implementation simplicity, and pick out instructions from the
full architecture of the S/370 that lend themselves to
single-cycle execution. By subsetting a complex architecture in
this way, we believe that easier comparison and better
compatibility result. A 'cycle' has been interpreted as the time
needed by one of the few critical operations that have been
divided to be distributed over the pipeline to be discussed in
the coming chapter. The reduced instruction set is regular in the
sense that all the instructions have approximately the same
number of critical operations, so that the instructions going
into the pipeline come out one by one on every cycle. A crude
evaluation of the performance of the reduced machine has been
made using statistics from [Alex75]. Better instruction usage and
speedup over 3125 are observed. Design issues are discussed in
the next chapter. In the final chapter, more discussion about the
reduced instruction set will be found.
CHAPTER FOUR
THE PROCESSOR DESIGN
4.1 PREDECODING
In an overlapped machine design, there are two kinds of
unavoidable causality that may degrade performance. Firstly,
causality exists in the processing sequences of a machine
instruction. Execution of the instruction, for instance, cannot
overlap with decoding of the same instruction for obvious reasons.
Secondly, causality exists among various instructions that are
in overlapped processing. A register-to-register addition, for
instance, cannot get the operands it needs before the slower,
preceding load instruction has filled the right register from
the storage.
The two examples above relate to the problems of data
dependency. Some machines employ internal forwarding, and others
delayed load, to ameliorate the problems [Patt85]. Here we
suggest an alternative technique, called predecoding, that is
found very useful to a pipeline design. The problem of branching
dependency involving the second kind of causality is perhaps more
serious. It is found that predecoding contributes to solving this
problem, because a shorter and simpler pipeline can be
constructed with the help of this technique. Together with the




















FIGURE 4.1 Data dependency occurs in t3. If address
calculation can be done in parallel with
instruction decoding, the LOAD needs only
three cycles, and dependency is resolved.
It is found that some decoding tasks, however trivial, cause
dependency (FIGURE 4.1). If part of the instruction decoding is
done prior to entering the execution path of the processor, the
raggedness of the instruction set is in effect evened out. The
predecoding logic can be placed anywhere before the initial stage
of the pipeline, but we suggest it be interposed between the main
memory and the cache.
If we let the compiler do the predecoding, the compiler has to
know many hardware details. These machine idiosyncrasies are
not supposed to be taken care of by the compiler, because compiler
portability is desirable. Moreover, the predecoded bits of
information that are to be appended to every word in the main
memory may be inconvenient, because powers of 2, such as --A, are
magic numbers in memory hierarchical layers above the cache.
There is no problem to place the predecoding stage in the
critical path of the processor (e.g. before instruction fetch).
But the critical path has the same length just because the
decoding logic is still there. The remaining choice is to
(partly) predecode the instructions en route when they fly from
the main memory to the cache using simple hardware (FIGURE 4.2).
The processor becomes faster because decoding on the fly is
a matter of once and for all, but as cache can capture program
localities, the decoding time may vanish asymptotically. The cost





FIGURE 4.2 Processor before








here is not for the decoding logic, as it is still needed anyway
in the control loop. It is for the increase in the cache
capacity. The increase is only a linear function of the
horizontality of the instructions. If the predecoded information
is in the form of a small attached tag, the increase is small but
the advantages may be numerous. The speed promised by the
one-shot predecoding is similar to that offered by one-shot
compilation as compared with interpretation.
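The once-and-for-all nature of predecoding on the way into the cache can be sketched in Python. The class, the dictionary-based memory model and the 2-bit tag function are assumptions made for illustration; the tag here is simply the two high-order bits of a 32-bit instruction word and stands in for whatever information the real predecoding logic would attach.

```python
class PredecodingCache:
    """Cache whose fill path attaches a predecoded tag to each instruction word."""

    def __init__(self, memory, predecode):
        self.memory = memory            # address -> raw instruction word
        self.predecode = predecode      # function applied only on the fill path
        self.lines = {}                 # address -> (word, tag)
        self.predecode_calls = 0

    def fetch(self, address):
        if address not in self.lines:   # miss: predecode en route from memory to cache
            word = self.memory[address]
            self.lines[address] = (word, self.predecode(word))
            self.predecode_calls += 1
        return self.lines[address]      # hit: the tag is already there, no decoding cost


def format_tag(word):
    """Hypothetical tag: the two high-order bits of a 32-bit instruction word."""
    return (word >> 30) & 0b11


if __name__ == "__main__":
    memory = {0: 0x1A120000, 4: 0x5810F000}
    cache = PredecodingCache(memory, format_tag)
    for _ in range(100):                # a loop body executed 100 times
        cache.fetch(0)
        cache.fetch(4)
    print(cache.predecode_calls)        # each word was predecoded only once
```

The cost shows up as the extra tag stored with every cached word, which matches the remark that the price is cache capacity rather than decoding logic.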
For the example in FIGURE 4.1, if the addressing mode of the
ADD instruction is predecoded, we may overlap address evaluation
with instruction decoding, so that the LOAD instruction finishes
earlier and dependency with the following ADD is resolved. In a
tagged architecture, the tag identifying the data object usually
compromises data precision due to the word length constraint. If
the tag is fully encoded (compressed) in the main memory and then
fully decoded (expanded) by the predecoding logic above the data
cache, both data precision and ease of data processing are
obtained. Another example will be used in section 4.o to
illustrate how predecoding can eliminate the call/return time
incurred in exception traps.
In RISC, the predecoding technique offers flexibility in
pipeline design. The saving in decoding time is not crucial since
decoding in RISC machines is relatively simple. Yet, decoding
problems still exist in some machines [Henn82] and this technique
may be found helpful.
4.2 COPROCESSOR SUPPORT
Patterson says in [Patt85] that 'the RISC II speed will be
difficult to improve, for most of the CPU time is spent sending
the operands between RISC II and the coprocessor'. Our
interpretation is that RISCs alone are not an effective vehicle
for floating-point applications, and that the CPU-coprocessor
interface is an area we want to improve in future RISCs. In
IeoEc:f it is criticized that 'Now, Mr. Application Wizard, try
to build a system using this wonderful new chip. You have to
rebuild on the card the parts you just cut off.'
Floating-point instructions are clearly parts to be cut off
from the RISC kernel processor. But if the parts cut off are
organized in function groups (such as a floating-point group and
a decimal group), and if each group is implemented in a dedicated
coprocessor optimized to perform that group of functions, and
further, if we try to keep all these processors working in
parallel, we can enjoy flexibility in configuring the system
according to the workload on the one hand, and efficiency promised
by RISC principles and parallel processing on the other hand.
The theme is to find out in what circumstances the
coprocessors would need to communicate with the RISC kernel, and
in what ways they can communicate effectively. From the following
analysis, it can be concluded that the key to eliminating the
CPU-coprocessor bottleneck is to let the coprocessor have enough
registers and operations for the data type it supports. The
coprocessor operation is modelled in the following way:
destination D ← (source S1) op (source S2)
If one processor gets data from other processors, there is an
off-chip communication delay of at least one cycle. To access
register operands (R) in other processors is no better than
accessing slow storage (S). The inter-chip communication penalty
estimated in TABLE 4.1 is calculated by assuming that the
average time for the coprocessor operations is n cycles, and that
off-chip access takes one cycle.




TABLE 4.1 Off-chip communication penalty. All
register accesses are assumed off-chip.
TABLE 4.1 shows clearly how bad it would be if one chip gets
its operands from the registers of other chips. The smaller n is
(as in a high-performance floating-point chip), the worse is the
situation. The problem is even more serious in a GaAs environment.
The table suggests that each chip should use its own registers
and communication, if any, should be made through the shared
memory. If each processor has an abundant set of operations for
its data type, the chance to communicate is small. Inter-chip
data references are then limited to data type conversions.
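One plausible reading of how such a penalty is computed can be sketched in Python. The formula is an assumption, not taken from the table: each off-chip register access adds one cycle to an operation that otherwise takes n cycles, so the relative penalty is the number of off-chip accesses divided by n. The function name is hypothetical.

```python
def offchip_penalty_percent(n, offchip_accesses, access_cycles=1):
    """Extra time, as a percentage of an n-cycle operation, spent on off-chip accesses."""
    if n <= 0 or offchip_accesses < 0:
        raise ValueError("need n > 0 and a non-negative access count")
    return 100.0 * offchip_accesses * access_cycles / n


if __name__ == "__main__":
    # D <- S1 op S2 has at most three register operands that could live on another chip.
    for n in (1, 2, 5, 10):
        row = [offchip_penalty_percent(n, k) for k in range(4)]
        print(n, row)
```

Under this assumption the penalty grows as n shrinks, which is the trend the paragraph above describes for fast floating-point chips.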
In S/370, there is no instruction for explicit communication
between general registers and floating-point registers.
Communication, if any, is made through the shared memory.
Therefore, it can be concluded that a floating-point chip with
four local registers (as required by 370), attached to the reduced
370, will not have the affliction suffered by Berkeley's RISC II.
Moreover, as most decimal instructions are of SS type, we can see
from TABLE 4.1 that no extra delay is incurred. Hence, we also
conclude that there is no serious CPU/BCD-coprocessor problem.
The remaining issue is to find an effective way for the
coprocessors to work in parallel. Suppose the RISC kernel focuses
on integer arithmetic and sequence control, and the coprocessors
only on the operations for their own data type; fine-grain
parallelism is then called for naturally: the integer address
arithmetic of any data type is calculated by the RISC, while the
remaining non-integer operations are dispatched to suitable
coprocessors. To maximize parallelism, each coprocessor is
suggested to hold an elastic buffer to allow for the speed
differences of the various processors.
An example is used to illustrate how parallelism is exploited
to get better throughput. Suppose we want to calculate the dot
product of two floating-point vectors A and B (Σ a_r b_r). In the
following, we use indexed addressing in the RISC to access the
a_r and b_r, use a floating-point chip to calculate a_r b_r and the
accumulation, then use the RISC to increment the index pointers,
check for completion, and possibly, loop back. Things in bold
face refer to RISC, while others to the coprocessor. We use RISC
registers 1 and 2 as pointers, floating-point registers 1, 2 and 3



































It is seen that while the floating-point unit (FPU) is busy
executing instructions in its instruction queue, the RISC kernel
goes on to process further instructions. In the example above, the
RISC CPU keeps on advancing the pointers, checking for completion,
and then looping back while the coprocessor is still in progress.
At least four cycles are saved here due to CPU/FPU parallelism. A
simple WAIT/WAKEUP protocol is assumed for synchronization: when
the FPU queue is full, the CPU stops and waits for a wakeup.
Special applications may be hand-optimized and kept in a program
library for use.
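The queueing behaviour described above can be sketched with a small cycle-counting model. This is only an illustration under stated assumptions: the instruction mix, the latencies, the one-cycle dispatch cost, and the rule that a queue slot is held until the operation completes are all hypothetical, and the function names are not from the thesis.

```python
from collections import deque


def run_decoupled(program, queue_size):
    """Cycles needed when FPU work is queued and the CPU runs ahead.

    program: list of (unit, latency) with unit "cpu" or "fpu".
    Assumption: a queue slot stays occupied until its operation completes.
    """
    t = 0                # CPU clock, in cycles
    in_flight = deque()  # completion times of FPU operations still queued
    last_finish = 0      # time at which the FPU becomes free
    for unit, latency in program:
        if unit == "cpu":
            t += latency
            continue
        # Retire FPU operations that have already completed.
        while in_flight and in_flight[0] <= t:
            in_flight.popleft()
        if len(in_flight) >= queue_size:
            # WAIT: the queue is full; WAKEUP when the oldest entry completes.
            t = in_flight.popleft()
        t += 1  # assumed one CPU cycle to dispatch to the FPU queue
        start = max(t, last_finish)
        last_finish = start + latency
        in_flight.append(last_finish)
    return max(t, last_finish)


def run_serial(program):
    """Cycles needed if the CPU waits for every FPU operation to complete."""
    return sum(lat + (1 if unit == "fpu" else 0) for unit, lat in program)


# Hypothetical loop body: multiply (3 cycles), accumulate (2 cycles),
# two pointer increments, a completion check, and the branch back.
LOOP_BODY = [("fpu", 3), ("fpu", 2), ("cpu", 1), ("cpu", 1), ("cpu", 1), ("cpu", 1)]
PROGRAM = LOOP_BODY * 4

if __name__ == "__main__":
    print("serial   :", run_serial(PROGRAM))
    print("decoupled:", run_decoupled(PROGRAM, queue_size=2))
```

With a two-entry queue the CPU bookkeeping overlaps completely with the floating-point work in this toy program; with a one-entry queue the CPU is stopped more often and part of the gain is lost.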
On-chip cache or on-chip floating-point unit?
In section 4.2 it is seen that there is a bottleneck
between the RISC processor and the floating-point unit
(FPU) if these two units reside in different chips, and if
one chip frequently uses register data from the other chip.
It is natural to think of putting the FPU on the same chip
as the RISC kernel, so that the off-chip penalty no longer
exists. The bandwidth requirement between the RISC
instruction unit and the cache is also very large (see
section 4.4), so integration of some on-chip cache is also
very inviting. The current (in the year 1988) complexity of
the MOS process is around 1M transistors; with this limitation,
it is not possible to have the RISC CPU, the FPU, and the
cache subsystem all on one chip. Some manufacturers such as
MIPS Inc. and AMD prefer on-chip cache, while others like
Fairchild and Motorola prefer an on-chip FPU. It is true that
the bandwidth requirement between the RISC CPU and the FPU
is not as high as that between the CPU and the cache, and
that in terms of regularity, cache is easier to integrate.
However, integration of the whole cache subsystem is not an
easy task: the memory management unit together with the
physical cache often requires too much silicon area to be
included on-chip. Worse, RISC systems often employ the
Harvard-style architecture with separate instruction and
data caches, which almost doubles the cache requirement. Take
Motorola's new chip-set for example [Motor88]. Motorola's
88000 family is a three-chip set RISC microprocessor system
with one CPU chip and two cache chips, one for instructions
and one for data. The single-chip CPU includes a RISC
processor and a floating-point unit. 'By integrating these
functions on a single chip, the 88100 sustains 7-12 million
floating point instructions per second (MFLOPS). This
rating is the highest available of any RISC processor on
the market today.' On the floor plan, the FPU occupies more
area than the RISC kernel itself. The whole chip has 165K
transistors. The cache chip includes 16K bytes of physical cache
and a memory management unit, and the transistor count is
750K. It is, in this case, impossible to include the two
cache chips on the RISC kernel. Even if it were possible, it
is not desirable because Motorola allows the cache
subsystem to be expanded to eight such chips. In the
future, when the level of integration reaches several
times that of now, the inclusion of on-chip cache is
expected. As an alternative, one may include a small amount
of on-chip cache and a larger off-chip cache to act as a
multi-level system. Together with an on-chip FPU, good
performance may be achieved.
4.3 A SIMPLER BRANCHING SCHEME
It is said in [McFar86] that 'In a RISC machine, branches are the
most significant barrier to achieving single-cycle execution.' It
is also suggested that in order to reduce the cost of the
branches, the pipeline itself should be short, so that the
condition causing the interlock could come out quickly, and that
the delay slots should be filled by as many useful instructions
as possible.
Most RISC machines employ delayed branches to reduce the
branch cost. Bubbles due to unconditional branches can be filled
easily because those branches do not depend on the condition code
setting of the previous instructions. In the simplest case, an
unconditional branch is just permuted with its preceding
instruction. For a conditional branch, whether the branch is
taken cannot be determined prior to run time. There are several
ways to fill the bubbles, depending on whether the branch is
guessed taken or untaken. In case of a wrong guess, one must be
sure there will be no catastrophic damage to the program state.
Intensive data flow analysis is required in that case [Gross82].
To evaluate the quality of an instruction set, Flynn points
out that compilation time must also be taken into account,
because as much as one half of the computer time is spent on
compilation [Flynn83]. If we already have a very short and
efficient pipeline, it is worth thinking of a simpler alternative
branching scheme that does not require intensive flow analysis.
Programs may run faster after the text is reorganised so that
most bubbles are filled, but transparency and modifiability are
compromised.
In our scheme, unconditional branches are delayed, because it
is very easy to find useful instructions to fill the slots. To
avoid intensive flow analysis, we may either use normal














1.04 CPI 1.03 CPI
FIGURE 4.3 Estimation of cycles per instruction (CPI)
of two branching schemes. N in a node
means No, while Y means Yes.
conditional branches, or delay them only in circumstances with
high predictability. A program loop is an obvious example. If we
introduce a delayed-branch-on-condition (DBC) instruction, all
the compiler needs to do is to use it only in loops. No flow
analysis is needed. See the following example for explanation.
[Example figure lost in extraction. Surviving labels: LOOP,
instr 1, instr N+1.]
In the first loop, some instructions beyond the branch-back
are squashed using hardware or NOPs if the condition is
satisfied. In the second loop with a DBC instruction, the next
instructions rushing to the delay slots are not squashed if the
condition is satisfied, because they are useful ones transferred
from the loop head. That is to say, to implement a DBC is very
simple: only an extra bit is needed to control the squashing
direction. If program loops are frequent, our scheme is even
better than the one using flow analysis. In FIGURE 4.3 we explain
why this is so. Some typical figures are used [McFar86]: branches
have 20% frequency; 50% of them are conditional, and within this
50%, 60% are taken; and program loops account for half of the
taken branches. In the first case, where data flow analysis is
required, we assume that all conditional branches are guessed
taken (as is common). A wrong guess takes 2 cycles, while all
other cases need one cycle. In the second case, we only take care
of program loops. It is found that our scheme outperforms the more
difficult scheme if loops are frequent (50% of all taken branches
are assumed to be spent in loops). In the worst case that there is
no program loop, the branch cost increases by 13%, but the CPI
figure only by 3%. The last case is equivalent to not delaying
conditional branches. The degradation in performance if flow
analysis is not used and loops are not frequent is surprisingly
low, since we have assumed an efficient pipeline; with this
pipeline all instructions need one cycle and exceptionally bad
ones need only two cycles. In section 4.5, the reader will see
that this simple and efficient pipeline is realizable.
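The two CPI figures of FIGURE 4.3 can be reproduced with a short calculation. The sketch below is a reading of the text, not the author's own computation: the 60% taken ratio is inferred from the quoted 1.04 CPI, the cost of loop exits under DBC is ignored, and the function names are hypothetical.

```python
def cpi_flow_analysis(branch_freq=0.20, cond_frac=0.50, taken_frac=0.60):
    """Scheme 1: every conditional branch is guessed taken.
    A wrong guess (branch not taken) costs one extra cycle."""
    conditional = branch_freq * cond_frac
    wrong_guess = conditional * (1.0 - taken_frac)
    return 1.0 + wrong_guess


def cpi_dbc_loops_only(branch_freq=0.20, cond_frac=0.50, taken_frac=0.60,
                       loop_frac=0.50):
    """Scheme 2: only loop branches are delayed (DBC).
    A taken conditional branch outside a loop costs one extra cycle."""
    conditional = branch_freq * cond_frac
    taken = conditional * taken_frac
    taken_outside_loops = taken * (1.0 - loop_frac)
    return 1.0 + taken_outside_loops


if __name__ == "__main__":
    print("flow analysis :", round(cpi_flow_analysis(), 2))
    print("DBC in loops  :", round(cpi_dbc_loops_only(), 2))
    print("DBC, no loops :", round(cpi_dbc_loops_only(loop_frac=0.0), 2))
```

Setting `loop_frac=0.0` gives the worst case discussed in the text, where no conditional branch is delayed and the CPI rises by about 3%.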
4.4 CUTTING THE INSTRUCTION BANDWIDTH
Due to its speed and low code density, a RISC processor normally
requires a high performance cache system. When a single cache
organization cannot keep up, separate caches for instructions
and data are called for. It is shown below that even when a
split-cache organization is employed, cache performance may still
be the deciding factor of the CPI (cycles per instruction) figure
of RISC machines.
Delay is incurred on cache misses. Its effect on speedup is
determined mainly by how frequently misses occur and, when
there is a miss, how long it will take to load the missed block.
In the following we define the block miss ratio m as the number
of misses divided by the number of block references. We also
assume a block size of B words, a dynamic program size of D
words, and a word transfer time from the main memory to the
cache, including overhead in cache management, of C cycles.
block miss ratio = m
number of misses = m.D/B
delay = (number of misses) x (block transfer time)
      = (m.D/B) x (B.C)
      = m.C.D for the whole program
      = m.C per word
If the CISC CPI is n_C including all kinds of delays, and if we
decouple the effect of cache delay from the RISC CPI and use n_R
to denote it, then the true RISC CPI is
      n_T = n_R + mC                                   (4.1)
If m_I = instruction cache miss ratio
   m_D = data cache miss ratio
then  m = m_I + 0.3 m_D                                (4.2)
for a LOAD/STORE architecture and a 30% LOAD/STORE frequency.
Note that even though the data cache miss ratio is scaled down by
a factor of 0.3 (even lower if there is a large register stack),
its contribution cannot be neglected because m_I is typically
smaller than m_D. (4.1) and (4.2) are found quite accurate.
From (4.1), writing the speedup over the CISC as s = n_C / n_T,
we see that
      ds/dm = -n_C.C / (n_R + mC)^2                    (4.3)
i.e. the smaller the miss ratio m is, the more sensitive the
speedup s will be to m; remember that m is typically small.




m    0.0   0.1   0.3   0.5   0.7   1.0
s   10.0   7.1   4.5   3.3   2.6   2.0
It is seen that no matter how hard we have tried to cut n_R,
the true CPI n_T may be limited by the cache performance.
We also see that the variation in speedup s is 20 times that
in m. Such a large amplification may be a problem in environments
with many kinds of workloads, each with a different miss ratio. A
successful RISC machine design hence requires a high performance
cache subsystem.
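The tabulated speedups can be checked numerically from (4.1), taking the speedup as s = n_C / n_T. The parameter values used below (n_C = 10, n_R = 1, C = 4) are not stated in the surviving text; they are assumptions chosen because they reproduce the tabulated row, and the function names are hypothetical.

```python
def true_cpi(n_r, m, c):
    """Equation (4.1): RISC CPI including cache miss delay."""
    return n_r + m * c


def speedup(n_c, n_r, m, c):
    """Speedup of the RISC over a CISC with CPI n_c (assumed definition)."""
    return n_c / true_cpi(n_r, m, c)


def sensitivity(n_c, n_r, m, c):
    """Derivative of the speedup with respect to the miss ratio m."""
    return -n_c * c / (n_r + m * c) ** 2


if __name__ == "__main__":
    # Assumed parameters; they reproduce the table in the text.
    N_C, N_R, C = 10.0, 1.0, 4.0
    for m in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
        print(m, round(speedup(N_C, N_R, m, C), 1))
    print("ds/dm at m = 0.1:", round(sensitivity(N_C, N_R, 0.1, C), 1))
```

With these assumed values the slope at m = 0.1 is close to -20, which is consistent with the remark that a variation in m is amplified about 20 times in s.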
It is said in [David87] that 'simple instruction sets can
result in programs that require two and a half times more memory
than the same programs on a machine with a complex instruction
set (the superset of the simple machines they are using). Our
evaluation of cache performance showed that for small cache
sizes, instruction set complexity severely affected the miss
ratio. Fortunately, this aspect of performance can be corrected
through the use of large caches. Finally, we examined the
amount of bus traffic on the three machines. Even with large
caches (64K), a machine with a simple instruction set can expect
to generate twice as much bus traffic as a machine with a complex
instruction set. Overcoming this potential performance bottleneck
caused by the increased bus traffic will require innovative
high-performance memory systems.'
In addition to increasing the cache size, one could try
the following to reduce the miss ratio [Smith87]:
(a) Optimize cache parameters such as line size.
(b) Use a multilevel cache organization.
(c) Use anticipatory fetching such as sequential prefetching.
In [Flynn87] Flynn et al. say 'instruction set designers
cannot afford to ignore issues of code density in favor of
instruction simplicity or decoding ease. Instruction bandwidth
can be a significant component of memory traffic and, ultimately,
processor performance. Using larger register sets to reduce data
traffic from memory makes sense only when efficient instruction
encoding is used to make a corresponding reduction in instruction
traffic. Balanced optimization is the key to overall instruction
set efficiency.'
Flynn et al. extend the RISC-type machine to include RX and
halfword instructions, and show that the improvement in memory
traffic is very significant. Instead of Huffman-encoding
instructions according to their frequency of usage, we seek
another approach to reduce instruction bandwidth but to retain
decoding ease. A RISC instruction set has low code density
because the redundancy in the simple but regular instructions is
high. If we allow RISC opcodes to represent complex cases, the
programs will become more compact.
If complex cases occur with a frequency of f and a complex
instruction on the average needs L cache blocks of RISC code for
emulation, straight-line expansion of the complex cases will add a
worst case cache miss delay of fLBC cycles to the CPI figure:
      n_T = n_R + mC + fLBC                            (4.6)
If f = 10%, L = 2, B = C = 4, then each instruction requires 3.2
more cycles for completion. Since complex cases are infrequent,
if they are distributed evenly throughout the program, the worst
case delay is not impossible, because subroutines for them are
likely to be replaced on cache misses.
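The arithmetic of the worst-case figure is easy to verify. The following sketch simply evaluates the fLBC term and the extended CPI expression with the values quoted in the text; the function names, and the values of n_R and m in the demonstration, are hypothetical.

```python
def emulation_miss_delay(f, blocks, block_size, word_time):
    """Worst-case extra cycles per instruction, f * L * B * C, when every
    emulated complex instruction misses on all L of its cache blocks."""
    return f * blocks * block_size * word_time


def cpi_with_emulation(n_r, m, word_time, f, blocks, block_size):
    """CPI including ordinary miss delay (m * C) and emulation miss delay."""
    return n_r + m * word_time + emulation_miss_delay(f, blocks, block_size, word_time)


if __name__ == "__main__":
    # Values quoted in the text: f = 10%, L = 2, B = C = 4.
    print(round(emulation_miss_delay(0.10, 2, 4, 4), 2))
    # n_r = 1 and m = 0.1 are illustrative values, not taken from the text.
    print(round(cpi_with_emulation(1.0, 0.1, 4, 0.10, 2, 4), 2))
```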
Our solution is to put the subroutines in a separate memory as
fast as the cache, and multiplex this memory with the instruction
cache. On detection of a complex instruction, the alternative
memory is chosen. A suitable subroutine is chosen to emulate that
instruction. This is like microprogramming, with the RISC itself
as the microarchitecture.
Each trap incurs a delay of two branches (of at least two
cycles), one to and the other from the emulation routine. The
overhead is 100% if the routine has only two instructions like
those for RX instructions (e.g. a memory to register ADD). The
macro instruction actually functions only as a trap routine































FIGURE 4.4 Emulation of an RX-Add. The emulation is a load
followed by an RR-Add. When the RX-Add (I1) is in the instruction
buffer, the next instruction I2 is fetched from the instruction
cache, and I1 is used to look up the microstore for the RR-Add
concurrently (a). At the next cycle, I2 and the RR-Add are
already fetched, but the RR-Add is chosen (b). The RX-Add used to
look up the microstore is treated as a load by the processor.
When the RR-Add is completed, the processor is switched back
to I2 in the normal stream immediately, without any delay.
The complex instructions are partitioned into classes in such
a way that the subroutine for each class begins with the same
instruction. All RX ALU operations, for example, begin with a
load. The decoder is then tricked into believing that the macro
instruction is the first instruction of the subroutine. While the
processor is busy treating this first instruction, the next
instructions from the normal stream and from the alternate stream
are separately fetched and buffered. With the switch set to the
alternate memory holding the trap routines, the processor will
honor the trap first. When the emulation is completed, the
processor is switched back to the normal stream: this time at
least one instruction is waiting in the instruction buffer, so
the return is smooth. See FIGURE 4.4 for an example.
We may use the predecoding technique to predecode the complex
instructions into classes, and use the predecoded tag to fool the
instruction decoder.
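The class-based emulation described above can be sketched functionally. The sketch below models only the resulting instruction sequence, not the timing or the buffering. The scratch register name, the tuple encoding of instructions, and the function names are hypothetical; the mnemonics follow the S/370 convention (A and AR for add, L for load).

```python
# Hypothetical mapping from an RX ALU opcode to its RR counterpart.
RX_TO_RR = {"A": "AR", "S": "SR", "N": "NR"}

SCRATCH = "TMP"  # assumed scratch register holding the fetched operand


def alternate_memory(opcode):
    """Remainder of the trap routine for the RX ALU class: one RR operation
    applied to the operand that the leading load has fetched."""
    return [(RX_TO_RR[opcode], SCRATCH)]


def execute_stream(stream):
    """Return the kernel instructions actually executed.
    Each instruction is a tuple (opcode, register, operand)."""
    executed = []
    for opcode, reg, operand in stream:
        if opcode in RX_TO_RR:
            # The macro instruction itself is decoded as the first
            # instruction of its class routine, which is a load.
            executed.append(("L", SCRATCH, operand))
            # The rest of the routine comes from the alternate memory.
            for rr_opcode, src in alternate_memory(opcode):
                executed.append((rr_opcode, reg, src))
        else:
            # Kernel instructions pass through unchanged.
            executed.append((opcode, reg, operand))
    return executed


if __name__ == "__main__":
    for instr in execute_stream([("A", "R1", "X"), ("LR", "R2", "R1")]):
        print(instr)
```

Because the macro instruction doubles as the leading load, the alternate memory only has to hold the tail of each routine, which is the point of grouping complex instructions by their first step.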
An example to illustrate the predecoding technique.
Suppose we have a RISC architecture and want to enhance its
performance by introducing the following:
1. Parallelism in instruction sequencing, e.g. decoding and
   address evaluation: if the addressing mode information
   is, due to a word-length problem, embedded in the
   instruction opcode, the architecture is not orthogonal
   with respect to instruction operation and addressing
   mode. The effective address in this case cannot be
   evaluated before the instruction is decoded.
2. Virtual machine extension: if the definition of the
   RISC architecture does not contain issues such as
   floating-point and complex instructions to be trapped,
   inclusion of them may cause troubles. Sporadic
   occurrence of these instructions in the instruction
   stream requires that the RISC kernel be able to
   recognize them and then dispatch a suitable handling
   device. That would unreasonably lengthen the RISC
   decoding path.
If there are two addressing modes to be predecoded, and a
floating-point instruction subset and a complex instruction
subset to be added to the RISC kernel to form a virtual
machine, all we need is three bits, one for each of the three
kinds of information to be predecoded: an A bit to
differentiate the two addressing modes, a C bit to
differentiate main processor and coprocessor instructions,
and a T bit to differentiate normal and trap
instructions. The predecoding logic is a very simple
combinational circuit which uses the opcode field of an
instruction to determine its A, C, and T bits:

            | opcode | rest of instruction |
                 |
          predecoding logic
                 |
    | A C T | opcode | rest of instruction (same as before) |

The predecoding logic is a simple 8-input, 3-output circuit
such as a ROM or PLA device interposed between the main
memory and the cache.
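The 8-input, 3-output predecoding logic can be modelled as a 256-entry lookup table. The sketch below is a software analogy under stated assumptions: the opcode is taken to be the leading byte of a 32-bit instruction word, the tag is placed above the word, and the particular opcode sets passed in are purely illustrative. None of the function names come from the thesis.

```python
def build_predecode_rom(indexed_opcodes, coprocessor_opcodes, trap_opcodes):
    """Build a 256-entry table: 8-bit opcode in, 3-bit (A, C, T) tag out."""
    rom = []
    for opcode in range(256):
        a = int(opcode in indexed_opcodes)      # A: addressing mode
        c = int(opcode in coprocessor_opcodes)  # C: coprocessor instruction
        t = int(opcode in trap_opcodes)         # T: trap (emulated) instruction
        rom.append((a << 2) | (c << 1) | t)
    return rom


def predecode(rom, word):
    """Prefix a 32-bit instruction word with its tag, giving 35 bits."""
    opcode = (word >> 24) & 0xFF  # assumption: opcode is the leading byte
    return (rom[opcode] << 32) | (word & 0xFFFFFFFF)


def tag_bits(tagged_word):
    """Split the tag of a predecoded word into its (A, C, T) bits."""
    tag = tagged_word >> 32
    return (tag >> 2) & 1, (tag >> 1) & 1, tag & 1


if __name__ == "__main__":
    # Illustrative opcode assignments only.
    rom = build_predecode_rom({0x58}, {0x6A}, {0x5A})
    for word in (0x58102000, 0x6A102000, 0x5A102000, 0x18120000):
        print(hex(word), tag_bits(predecode(rom, word)))
```

Since the lookup happens when a word is moved from main memory into the cache, the decoder sees the three bits immediately and needs no extra decoding stage, which is the motivation given in the text.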
4.5 THE INSTRUCTION PIPELINE
Our aim is to design a simple and efficient pipelined processor,
so that all the instructions we have chosen execute in one
(pipeline) cycle. Techniques derived earlier are used to make
the execution smooth. When assigning functions along the
pipeline, we keep in mind that the distribution should be even so
that each stage takes approximately equal time, and that
conflicts among stages are to be kept small.
We have designed a 3-stage pipeline. Instructions are
sequenced by a 4-phase clock, hence each stage carries out its
function in a four-phase cycle to be defined later. Briefly
speaking, the first stage is instruction fetching (IF); the
second stage is instruction decoding (ID) in parallel with
effective address evaluation (EA); and the third stage is
instruction execution (EX) or data access (DA). Only LOAD and
STORE access memory in the last stage; other instructions need
some kind of ALU function; hence DA and EX do not take place
together. The following symbols are used in FIGURE 4.5:
IF: instruction fetch
ID: instruction decode
EA: effective address evaluation
EX: instruction execution
CC: condition code setting
DA: data access:
    - LD: load
    - ST: store
[one entry illegible]
LPC: load program counter
There are several points to note. Firstly, internal forwarding
is used in the design to resolve conflicts. Before the third
stage can execute the designated operation, appropriate registers
should be loaded. The RR function of the third stage is done one
stage earlier, at the fourth phase of the previous stage. This
allows more time for the ALU and the condition code setting, but
leads to a dependency problem: slots (2a,4) and (3a,4) have



































FIGURE 4.5 The distribution of functions
along the 3-stage pipeline.
Each stage has 4 phases; the
functions are identified by
the (STAGE, PHASE) coordinate.
Note that rows 2a and 2b occur
in parallel, while 3a, 3b, and
3c are mutually exclusive.
conflicts, since one involves reading a register that may not be
available until the other has deposited correct data in it. In
this case the updated data from (3a,2) is forwarded to (2a,4).
A similar dependency problem occurs in rows 2b and 3a. The
register involved in the RR operation in (2b,1) may be waiting
for the updated data in (3a,4). Data forwarding is more complex
in this case. Note that EA involves double indexing and needs two
phases; the updated data from (3a,2) is forwarded to the second


















There are some special hardware requirements in the pipeline.
A multiport register file is needed to support parallelism. At
least one read port and one write port should be usable
simultaneously, as seen from (2a,4) and (3a,4). To enable the ALU
to complete its operations in a fixed time of two phases, we
should have a barrel shifter so that the shifting time does not
depend on the shift amount. Both multiport register files and
barrel shifters are common in today's market of 32-bit chips. We
have tried to make arrangements to reduce the number of
simultaneous ports required, and the number is now affordable
(see FIGURE 4.5).
An address adder is used in the design to shorten the
pipeline. With this extra adder, LOAD and STORE are made
single-cycled. Branch addresses are also calculated one stage
earlier, so that a wrong guess in the branch target only incurs a
one-cycle bubble. As LOAD, STORE, and BRANCH dominate in most
statistics (over 40% typically) and all these three instructions
require address evaluation, the inclusion of the extra adder
seems justified. Note that CC in (3a,3) is used to determine
whether the branch will take place: LPC in (2b,4) uses CC to
check if the PC is to be loaded by the calculated address.
The ALU and the shifter are connected in parallel. They share
the same buffer registers for holding operands and the result.
The ALU and shifting functions are those required by the
instruction set listed in APPENDIX A1. The reader may refer to
APPENDIX A2 for the design of the ALU and the condition code
setting logic. From the design we note that the ALU and the CC
setting logic are rather complex; almost all the ALU
functionality of the full S/370 architecture is required. Three
data types are supported: word, halfword, and byte.
We split the cache system for instruction and data. Both
caches are interfaced to the processor through two buffers: MBI,
or memory-buffer-in, for the instruction cache; and MBO, or
memory-buffer-out, for the data cache. Both buffers include a
multiplexor to handle the different data type requirements such
as sign extension and byte insertion. An alternate memory is
multiplexed with the instruction cache. Trap routines are placed
in this alternate memory for emulation of complex cases.
In most designs, the basic cycle time is limited by the
cache access time, because the cache is usually the slowest
element in the processor pipeline. To have a cycle time
lower than that limited by the cache access time, the cache
can be made to deliver multiple words per cycle. To double
the bandwidth between the processor and the instruction
cache, for example, we may double the bus width and let the
cache deliver two consecutive words on each cycle. But in
some cases, a RISC processor is implemented in single-chip
VLSI and is decoupled from the cache chips. In such cases,
increasing the cache bandwidth by widening the bus is not
feasible due to the pin constraint. Yet, for systems having
dual cache buses like Motorola's 88000 and Fairchild's
Clipper, there is a way to increase the throughput.
[Diagram lost in extraction. From top to bottom it showed the
I-cache and D-cache (cache delay d_c), multiplexor S1 (with
delay d_1), the VLSI boundary, multiplexor S2 (with delay d_2),
and the buffers e, o and DB.]
In the figure, S1 and S2 are multiplexors, e and o are
respectively the even and odd instruction buffers for the
instruction cache, and DB is the data buffer for the data
cache. Note the VLSI boundary between the caches and the
processor.
If a bus transfers n words/sec across the boundary, and the
data traffic is δ times that of the instruction traffic, then
for a dual-bus system the throughput is TP1 = (1+δ)n. Quite
often, the data bus is underexploited; therefore we may multiplex
the data and instructions on the data bus (see the diagram
above). When the data bus is idle, odd-address instructions
are transmitted on it. When the data bus is busy, we may
stop both even and odd instruction transmission or use the
even bus to carry both types of instructions alternately.
In the latter case, an extra multiplexor S2 (hence extra
delay) is used to handle the even and odd instruction
separation. The throughputs for the two cases are:
TP2 = [(1-δ)(2n) + δn] / (1+r_1) = n(2-δ) / (1+r_1)   (r_1 = d_1/d_c)
TP3 = 2n / (1+r_2)   (r_2 = [d_1+d_2]/d_c)
If δ = 0.3, r_2 = 2r_1, and r_1 ∈ [0.2, 0.3], then TP2 > TP3.
Therefore, for simplicity, the former scheme is suggested, and
the extra instruction bandwidth when r_1 is very small is 40%
when δ = 0.3. (Note: for r_1 < 0.3, TP2 > TP1.)
Getting better throughput out of the cache buses
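The three throughput expressions can be evaluated directly. The sketch below uses the formulas as quoted, with n normalized to 1; the function names are hypothetical. A quick numerical check under these formulas suggests that, with δ = 0.3 and r_2 = 2r_1, TP2 exceeds TP3 only for r_1 above roughly 0.214, so the stated inequality appears to be tight at the low end of the quoted range.

```python
def tp1(n, delta):
    """Dual-bus throughput with no multiplexing."""
    return (1 + delta) * n


def tp2(n, delta, r1):
    """Odd-address instructions share the data bus when it is idle;
    instruction transfer stops while the data bus is busy."""
    return ((1 - delta) * 2 * n + delta * n) / (1 + r1)


def tp3(n, r2):
    """The even bus carries both instruction streams while the data
    bus is busy, at the cost of the extra multiplexor S2."""
    return 2 * n / (1 + r2)


if __name__ == "__main__":
    delta = 0.3
    for r1 in (0.0, 0.2, 0.25, 0.3):
        print(r1, round(tp1(1.0, delta), 3),
              round(tp2(1.0, delta, r1), 3),
              round(tp3(1.0, 2 * r1), 3))
```

At r_1 = 0 the instruction share of TP2 is 1.7n - 0.3n = 1.4n, which is the 40% extra instruction bandwidth mentioned in the text.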
4.6 BEING IN A DILEMMA
Many RISC designers emphasize the use of hardwired control to get
the fastest possible cycle time RISC promises. Some, however, use
microprogramming as most CISC machines do (for a review, see
[Gim87]). Some systems such as the Pyramid 90X and Ridge 32
employ microprogramming, but they still fall in the RISC category
in Tabak's taxonomy [Tabak86]. A similar problem concerns the use
of a large register stack. Some researchers argue that the
register stack has nothing to do with RISC principles, and that
it can well be applied to CISC architectures [Hitch85]. Yet some
RISC machines prefer a modest size register file, and they
continue to be RISCs.
It should be pointed out that microprogramming a RISC is not
as cumbersome as microprogramming a CISC. The mapping between
macro and micro instructions may be one-to-one as in MIRIS
[DuB86], because RISC instructions are already as simple as
microinstructions, but a little more vertical [Patt83]. All we
need is a small and fast table look-up device like a ROM. In
some circumstances, a PLA is used as an intermediate solution. For
designs using the building block approach such as those from
Advanced Micro Devices, microprogramming and hardwired control
are both possible.
The second thing to point out is that microprogramming is
really very flexible; one may use it as an expedient during
system development or migration. Due to its simplicity, the RISC
microstore can be replaced by logic quite easily. In our
simulation study, microprogramming is used because it is easier
to simulate and it allows us to add or modify instructions
conveniently. We also put routines emulating complex instructions
in a separate microstore to reduce the instruction bandwidth.
The IBM S/370 has 16 general registers. If more registers or a
large register stack similar to that of Berkeley's RISC II
[Patt85] are added to it, compatibility is difficult to maintain.
Moreover, as we don't want the RISC/CISC issues to be blurred by
the performance from the register stack, the register file is not
enhanced. Flynn et al. show that unless there are many registers
organized in multiple register sets, the improvement gained from
extending 16 registers to 32 is not significant [Flynn87]. If
standard techniques are used to keep frequent data in registers
[Chait82, Hug86], a register file of 16 is believed not to be so
restrictive.
4.7 DESIGN SUMMARY
An explicit scheme of the design has not been shown up to now.
Stating every architectural decision quantitatively is difficult,
because the description is very cumbersome. Actually, the
functional modules supporting the architecture and the
interconnection structure between them have been implicitly
presented in the foregoing sections, which describe some
techniques used in the design. All these techniques are
incorporated to make the machine run smoothly and fast. Instead of
putting every fact quantitatively, we will try in this section to
summarize the details of the design so as to make it less
ambiguous. Incidentally, as the design is simulated using an
unambiguous programming language, the program listing (which is
available on request) may be a better formal description, though
it may not be readable.
The reduced instruction set is a subset of 31 instructions
from the full S/370 architecture. Instructions are chosen in such
a way that (1) they are potentially single-cycle in nature, so
that when they pass through a pipeline, the flow is very smooth
due to regularity, and the execution time for them is short due
to simplicity; (2) the reduced instruction set suits a general
purpose computing environment. LOAD and STORE are, as compared to
other RR instructions, two-cycle instructions, but the design
will not work without them. A dedicated address adder is used to
generate the address in parallel with instruction decoding so that
the two-cycle LOAD and STORE become single-cycle instructions. A
BRANCH on the average needs more than one cycle because of
pipeline breaks, but various types of delayed actions are
introduced to improve the branch performance.
A very simple but efficient pipeline is designed for the
reduced instruction set. The pipeline has three stages, each with
its own hardware for partial processing of instructions. Each
instruction actually needs three pipeline cycles for completion,
but the three pipeline stages overlap so that three instructions
are in progress on each cycle. Each instruction has a different
hardware requirement: an ADD, for example, needs an adder, while
a shift needs a barrel shifter. The design has all the hardware
requirements fulfilled, and the hardware parts are appropriately
connected so that the machine works fast. Due to architectural
simplicity, the hardware cost is affordable. A technique called
predecoding is used to resolve various kinds of dependency
problems inherent in the pipeline to make it smooth.
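The overlap of the three stages can be made concrete with a small occupancy model. This is a sketch under the assumption that one instruction enters the pipeline per cycle and no stalls or branch bubbles occur; the stage labels follow the text, while the function names are hypothetical.

```python
STAGES = ("IF", "ID/EA", "EX/DA")


def stage_of(instr_index, cycle):
    """Stage occupied by instruction `instr_index` at `cycle`, or None.
    Instruction i enters IF at cycle i; stalls are not modelled."""
    k = cycle - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def in_flight(n_instr, cycle):
    """Number of instructions in progress at `cycle`."""
    return sum(stage_of(i, cycle) is not None for i in range(n_instr))


def total_cycles(n_instr):
    """Cycles needed to complete n_instr instructions without stalls."""
    return n_instr + len(STAGES) - 1 if n_instr else 0


if __name__ == "__main__":
    n = 5
    for cycle in range(total_cycles(n)):
        row = [stage_of(i, cycle) or "-" for i in range(n)]
        print(cycle, row)
```

Once the pipeline is full, three instructions are in progress in every cycle and one completes per cycle, even though each individual instruction still takes three cycles from fetch to completion.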
The reduced machine can be extended to a virtual machine
having the instruction set complexity of a CISC machine, yet the
speed performance of a RISC machine. The enhanced functions are
encoded by the instructions provided by the reduced machine and
are kept in a place separate from the cache. In this way, the
cache performance will not be affected by these functions. The
predecoding technique is used to make the switching between the
cache and the alternate memory smooth. As cache performance may
be the limiting factor to speed, it is thought that encoding
complex functions in this way helps to improve cache performance,
hence contributes to speed.
Due to the relatively low code density and high speed typical
in a RISC machine, the memory system must be commensurately fast
to provide instructions to the processor. The design is supposed
to have a cache split separately for instructions and data.
The various cache parameters have not yet been determined,
because that would require analysis of large scale experiments
and knowledge of implementation constraints. The reader may
refer to the split-cache design in the simulator described in the
following chapter.
The kernel machine is designed to couple to coprocessors very
effectively. Each processor works on its own data type and all
work in the maximum parallelism allowed. Fine grain parallelism
is employed: processing a single instruction may keep several
processors busy. Communication problems are noted, and some
special techniques are suggested to improve performance.
Other data path elements have been discussed in section 4.5.
The reader will find more discussion in the next chapter when the
simulator is described.





In this section we will first summarize the work we have so far
done; then, some comments are made on the architecture of the
reduced machine. The reduced machine is found to possess most
RISC features; and its instruction set is more regular,
orthogonal, and composable [Wulf81].
At the outset we want to reduce a complex machine to make it
simpler and faster. A subset of the IBM S/370 is picked out and
streamlined. Instructions that are potentially single-cycle in
nature are chosen. Techniques are devised to make the machine
execute one instruction in one cycle. The instruction set with 31
kernel instructions, as summarized in APPENDIX A1, contains
arithmetic, logical, shifting, data movement, and sequence
control instructions.
ALU operations come in three-address RR format, each address
designating one of the sixteen general registers. One of the two
source operands is allowed to denote an immediate constant. Only
LOAD and STORE access memory, using double indexing in the RX
format, the only addressing mode remaining. Each load and store
can operate on the three data types the machine supports:
word, halfword, and byte.
EVALUATION AND CONCLUSION 55
All the instructions are 32 bits long, with a relatively simple
and fixed format favoring instruction decoding. A very simple but
efficient pipeline is designed for the instruction set. One
instruction in the RISC repertoire, possibly except conditional
branches, goes out on every pipeline cycle. The simplicity and
efficiency of the pipeline are results of the simplicity of the
reduced machine. To summarize, we have invested hardware in the
kernel instructions that are worth the investment. Some methods are
derived to arrange the hardware so that the machine goes fast.
A subarchitecture of the IBM S/370 is streamlined. We now want
to convince ourselves that the design is a RISC. This is achieved
by using common RISC features to examine the machine. From
APPENDIX A1 we can see that the machine is much simpler than the
original architecture. Mindful that complexity is to be placed on
the shoulders of the compiler, we leave the machine simple.
Some RISC features are listed below to test our design. The
listed result also passes Tabak's and Colwell's definitions




3. Few and fixed formats     Yes (3 simple formats)
4. Few addressing modes      Yes (only 1)
5. Few instructions          Yes (only 31)
6. Hardwired control         ? (but easy to satisfy)
Yes
Regularity, orthogonality, and composability are the three
general principles that would improve the impedance match between
compilers and a target computer [Wulf81]. According to Wulf,
compilers like primitives, not solutions; and there should either
be precisely one way to do something, or all possible ways should
be provided. Fortunately, RISC provides primitives, and the
minimality inherent in its instruction set enables compilers to
compose things in a unique way. Our reduced machine also exhibits
these characteristics, which its complex counterpart does not.
Operations in the S/370 instruction set are not regularly
defined. There are, for example, word and halfword addition and
subtraction, but no such instructions for the byte data type.
Compilers therefore cannot use a byte to represent a short
integer; only characters can be represented by bytes. As another
example, let's consider the multiply and divide instructions. The
first operand in a multiply must be an even-numbered register.
Dividends must be in an even-odd register pair. There are word and
halfword multiply and word divide, but no halfword divide. All these
contribute to making the S/370 irregular.
In the RISC design, we have cut the instructions causing
irregularities, either intentionally or inevitably. Halfword
operations are discarded. Data types are taken care of uniformly
by the LOAD and STORE instructions, and the operators left are
generic. Multiply and Divide are discarded because they are not
single-cycle in nature. All ALU instructions are now
register-to-register; only LOAD and STORE access memory, using only one
addressing mode. These encourage optimization by encouraging
frequently used storage data to be kept in registers.
The reduced architecture is also more orthogonal: All ALU
operations can be applied to all data types, using the only (i.e.
all possible) register-to-register addressing mode. If RX
instructions are included, there are two sets of operations, one
for register data, and the other for memory data. The complex
S/370 is less orthogonal than the reduced one in this regard. We
have better regularity, better orthogonality, and hence better
composability. All operations, data types, and addressing modes
are easily composed.
A technique has been devised to extend the RISC architecture
to a CISC virtual machine (see section 4.4). The virtual machine
has the compactness of CISC program representation and the speed
of RISC. Care must be exercised if the RISC architecture is
extended in this way, because regularity, orthogonality and
composability may be destroyed. If the extension is aimed at
direct emulation of existing CISC object codes, there is no
problem. But if program compactness is the only aim, a trade-off
should be made to take care also of the compiler's needs.
5.2 THE SIMULATION
The reduced design is simulated using Pascal. The hardware
details presented earlier are all simulated. These include the
3-stage pipeline, the split cache, and all the sophisticated data
path elements. A floating-point unit with five instructions, and
11 RX emulated instructions are included (see APPENDIX A1).
Microprogramming is used because it is easy to simulate. The
reader may refer to the microinstruction format in APPENDIX A3.
Sample programs are run on the simulator. The execution time
is compared to that of the 370/125, because we have only got the 3125's data. The
timing formulae of the 3125 instructions taken from [IBM76] are
stored in the simulator for automatic speed measurement. The
simulator can assume a practical or an ideal unity cache hit ratio, so
that the effect of the cache can be appreciated.
Programs run in either the load-and-go or the single-step
mode. The interpretation speed is about 100 instructions/sec in
the load-and-go mode. Critical path elements such as the general
registers are arranged as global resources. They can be peeked at
any time the program is traced in the single-step mode, so that
all the processor mechanics are observable.
Distinct data path elements such as the ALU and some elementary
operations like register file read/write and data multiplexing
are functionally encapsulated in procedures. Simultaneous effects
like the various operations in the three stages of the pipeline
are hidden in a single procedure. Possible interlocks can then be
seen by permuting the execution order of the pseudo-simultaneous
operations. The data forwarding requirements discussed in section
4.5 are actually fulfilled by first executing the forwarding
stage. The array type is used to represent index-addressable
storage units such as registers and caches; while the record type
is used to represent the program status word.
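The effect of execution order on pseudo-simultaneous stages can be illustrated with a small sketch. It is written in Python rather than the Pascal of the simulator, and everything in it is an assumption made for illustration: the division into fetch, execute, and write stages, the single `regs[rd] = regs[rs] + imm` instruction form, and the function name `run`. Executing the write stage before the execute stage within a cycle lets a dependent instruction see the fresh result, which mimics forwarding; the opposite order exposes the interlock.

```python
def run(program, write_first=True):
    """Run (rd, rs, imm) instructions meaning regs[rd] = regs[rs] + imm."""
    regs = [0] * 4
    fetched = None      # latch between the fetch and execute stages
    pending = None      # latch between execute and write: (rd, value)
    pc = 0
    while pc < len(program) or fetched is not None or pending is not None:
        def write_stage():
            # Commit the result produced in the previous cycle.
            if pending is not None:
                rd, value = pending
                regs[rd] = value

        def execute_stage():
            # Read the source register and compute the new result.
            if fetched is None:
                return None
            rd, rs, imm = fetched
            return (rd, regs[rs] + imm)

        # The two stages belong to the same cycle; only their order differs.
        if write_first:
            write_stage()
            produced = execute_stage()
        else:
            produced = execute_stage()
            write_stage()

        # Advance the latches for the next cycle.
        pending = produced
        if pc < len(program):
            fetched = program[pc]
            pc += 1
        else:
            fetched = None
    return regs


# r1 = r0 + 5 followed by the dependent r2 = r1 + 1.
dependent_pair = [(1, 0, 5), (2, 1, 1)]
forwarded = run(dependent_pair, write_first=True)
stale = run(dependent_pair, write_first=False)
```

With the write stage executed first, the second instruction reads the updated register; with the order permuted it reads the stale value, so the two runs end with different contents in register 2.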
One instruction is fetched from the instruction cache to the
instruction register on every cycle. A missed block, either in
instruction or data reference, is loaded to the cache on demand,
and the least recently used block is replaced. Data cache and
main memory coherence is maintained by using the copy-back
policy. The data consistency problem associated with a
multiprocessor system could be solved by introducing a bus-watch
circuit similar to that used in the Clipper module [Hunt87]. The
bus-watch circuit ensures that all the possible bus masters refer
to correct contents by redirecting references from memory to
cache if the image has been modified.
Some low-end 370 models have 8 kbytes of cache, some high-end
models have 32 kbytes, and models in between have 16 kbytes
[Case78]. 16 kbytes are assumed in the simulation, with 8
kbytes for the instruction cache and the remaining 8 kbytes for the
data cache. Both caches are set associative, with an








On a cache miss, a block of four words (16 bytes) is fetched
from the main memory. The transfer is assumed to take 16 cycles,
while each reference to a 4-byte word in the cache takes one
cycle. Hence the speed of the cache is four times that of the main
memory. (If free of pin constraints, we can make the main memory
deliver many words each cycle.) For programs exhibiting moderate
locality, the instruction cache has a hit ratio as high as 0.9.
Better figures are observed for long loops. The data cache has a
significantly smaller hit ratio. Equation (4.4) accurately
predicts the speedup under the influence of cache performance.
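The cache behaviour described above can be sketched as follows. This is a minimal Python model, not the Pascal simulator itself. The block size, the one-cycle hit, the 16-cycle block transfer, the 8-kbyte size, the LRU replacement, and the copy-back policy come from the text; the class name, the default associativity of 2, and the decision not to charge cycles for copying a modified block back are assumptions made only for illustration.

```python
from collections import OrderedDict

BLOCK_BYTES = 16      # four 4-byte words per block (from the text)
HIT_CYCLES = 1        # one cycle per cached word reference (from the text)
MISS_CYCLES = 16      # block transfer time on a miss (from the text)


class SetAssociativeCache:
    """LRU, copy-back cache model; `ways` is an assumed parameter."""

    def __init__(self, size_bytes=8 * 1024, ways=2):
        self.ways = ways
        self.num_sets = size_bytes // (BLOCK_BYTES * ways)
        # One OrderedDict per set: tag -> dirty flag, oldest entry first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = self.writebacks = 0

    def access(self, address, write=False):
        """Reference one word; return the cycles charged for it."""
        block = address // BLOCK_BYTES
        index, tag = block % self.num_sets, block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)            # mark as most recently used
            lines[tag] = lines[tag] or write  # copy-back: only mark dirty
            return HIT_CYCLES
        self.misses += 1
        if len(lines) >= self.ways:
            _, dirty = lines.popitem(last=False)  # evict least recently used
            if dirty:
                self.writebacks += 1   # modified block copied back to memory
        lines[tag] = write
        return MISS_CYCLES

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# A loop sweeping 64 consecutive words twice: only the first touch of
# each 16-byte block misses.
cache = SetAssociativeCache()
cycles = sum(cache.access(4 * i) for _ in range(2) for i in range(64))
```

In this toy run the 64 words occupy 16 blocks, so 16 of the 128 references miss and the remaining 112 hit, which illustrates why long loops show better hit ratios.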
Programs are sequenced through the program counter PC. The PC
may be modified on branches in the second stage. A branch bubble,
if any, is therefore at most one cycle long. At the interface
between the data cache and the processor, there are two registers
MBI and MBQ to hold in and out data. A multiplexor is included in
each of these registers to handle the three data types. Inputs
and outputs of the ALU, the barrel shifter, and the address adder
are buffered. The output of the address adder is the memory
address register MAR. The MAR is used as a pointer to memory
variables or as a branch target. A hidden register is used by
the emulation routines to hold temporary values.
Some squashing logic is implemented at the last two stages of
the pipeline to suppress the effect of instructions to be
flushed. A squash bit is used to activate the squashing logic
during pipeline flush. A similar trap bit is used to activate
trap routines. Perfect trap is implemented for emulation
efficiency. The trap routines are kept in a small microstore.
The five floating-point instructions simulated are load,
store, add, subtract, and multiply, all conforming to the S/370
architecture definition. The load and store are actually carried
out by the RISC CPU: the CPU first calculates the effective
address and then initiates a virtual load or store. The
floating-point unit can intercept its own data and instructions from the
shared cache bus. A wait bit in the RISC pipeline is used for the
CPU/coprocessor synchronization.
The reader may refer to the printout example from the
simulation listed in APPENDIX A5. In this example, a
floating-point number held in floating register zero is reciprocated. The
result is then placed in floating register two. An iterative
algorithm is used. The sequence control used in the iterations is
carried out by the RISC kernel, while the floating-point
operations add/subtract and multiply are dispatched to the
floating-point unit.
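The text does not name the iterative algorithm. One natural candidate that needs only subtract and multiply is the Newton-Raphson iteration x' = x(2 - ax), and the sketch below assumes it. The function name, the exponent-based starting value, and the fixed count of six iterations are likewise illustrative assumptions and are not taken from APPENDIX A5.

```python
import math


def reciprocal(a, iterations=6):
    """Approximate 1/a using only multiply and subtract inside the loop."""
    if a == 0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if a < 0 else 1.0
    a = abs(a)
    # Seed from the exponent alone: a = m * 2**e with 0.5 <= m < 1,
    # so x = 2**-e gives a*x = m and the iteration converges.
    _, e = math.frexp(a)
    x = math.ldexp(1.0, -e)
    for _ in range(iterations):
        # Loop control would run on the kernel; the subtract and the
        # two multiplies would be dispatched to the floating-point unit.
        x = x * (2.0 - a * x)
    return sign * x
```

Because the relative error is squared on every step and starts at no more than one half, six iterations are enough to reach double precision in this sketch.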
5.3 RUNNING LOW-LEVEL SYNTHETIC BENCHMARKS
Synthetic benchmark programs are commonly used to measure the
performance of a computer system. In these programs, a typical
and balanced mixture of HLL constructs is combined to simulate a
production environment. Some benchmarks are used for testing
numerical aspects and some for system related issues. Without
exception, the benchmarks are written in HLL so as to reflect
also the interaction of compiler and computer architecture.
Although HLL synthetic benchmarks have many drawbacks rnhrS4,
F'ul 831, they are widely used as gauging-rods because, as
standards, they are easily available, and many people
have long been using them. Some drawbacks are the inability
of these benchmarks to reflect all practical aspects, and their
sizes, which are small enough to fit easily into a cache.
We cannot use HLL benchmarks because we do not have any HLL
compiler. But two points are observed: (1) HLL synthetic
benchmarks do not compute anything meaningful, they are just
syntactically correct; (2) when benchmarks are compiled down and
run, typical statistics are obtained, and these statistics can
reflect the interaction between compiler and architecture. From
these observations, we think it worthwhile to make use of
low-level statistics to construct meaningless, but syntactically
correct, machine-level benchmarks to evaluate an architecture.
Hand-written machine-level programs that compute something
meaningful may be used, but compiler aspects are then
neglected. Moreover, writing long machine-level programs is a
very tedious task. If there is an easy way to construct large
machine-level programs that upon execution feature
compiler-generated statistics, we have got a virtual compiler. If the
statistics can easily be altered, we have in effect many virtual
compilers for many kinds of workloads. We may, for instance,
change the frequency of LOAD/STORE and BRANCH to account for
different register file structures and call/return patterns.
In [Fair82], Fairclough divides instructions into eight
function groups and examines statistics of seven machines. The
similar results found are fitted into an analytical expression.
Fairclough also illustrates how to use this expression to design
an architecture according to instruction usage. We start out using a
similar grouping method, and then refine each group using other
published results. We reduce the number of groups to three for
clarity. Note that we only focus on computational aspects.
Instruction group
a. Data movement
b. Program modification









Data movement group (a) includes LOAD/STORE and all other
register-to-register movement instructions. Program modification
group (b) includes various BRANCHES and CALL/RETURN. The final
group is self-explanatory. We will refine each group later.
Now we are going to find an easy way to generate programs.
First of all, a small single-entry, single-exit program
called a basic block is constructed, which has the above
instruction mix. Then, ways are found to combine basic blocks
into arbitrarily large programs that preserve the statistics. If
we allow basic blocks to interact only in the following ways, it
can be shown that the basic block statistics are preserved.








2. Looping using Lblocks:
Loop: RFRTN
EC Loop
Lblock
3. Call/Return using Cblocks and Rblocks:








If the instruction mix is in the proportion a:b:c, the number




(n is a scaling factor)
(with a K times repetition)
$N_{C,M} = N + P_M N_{C,M-1}$ (calling to M levels, with $P_M$ calls)
It can easily be shown that a program synthesized by the above
rules, however long and in whatever order, executes $N_P$ instructions:
$N_P = XN + YN + ZN = (X+Y+Z)N$, where X, Y, Z are integers.
Therefore the program has the same instruction mix, in the
proportion a:b:c = 10:5:5 in our case. The scaling factor n
should be chosen in such a way that a basic block is not too
small for refinement and structural changes. We have decided to
include 20 a-type, 10 b-type, and 10 c-type instructions in a
basic block.
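The preservation of the mix can be checked numerically with a short sketch. It assumes that every basic block executes exactly the (20, 10, 10) counts given above, and that the call rule follows a recurrence of the form N(M) = N + P * N(M-1) with N(0) equal to one basic block. That recurrence and its base case are a reading of the text, and all function names are hypothetical.

```python
from fractions import Fraction

BASIC = (20, 10, 10)   # a-, b-, c-type instructions executed by one basic block


def scale(counts, k):
    """Multiply every per-group count by the integer k."""
    return tuple(k * x for x in counts)


def add(*parts):
    """Add several per-group count vectors column by column."""
    return tuple(sum(col) for col in zip(*parts))


def sequence(*blocks):
    """Blocks executed one after another."""
    return add(*blocks)


def loop(body, k):
    """A body executed k times."""
    return scale(body, k)


def call(depth, calls_per_level):
    """Executed counts of a call chain: N(M) = N + P * N(M-1), N(0) = basic block."""
    counts = BASIC
    for _ in range(depth):
        counts = add(BASIC, scale(counts, calls_per_level))
    return counts


def mix(counts):
    """Exact proportion of each instruction group."""
    total = sum(counts)
    return tuple(Fraction(x, total) for x in counts)


# One basic block, a two-block loop run 7 times, and a 3-level call chain
# with 2 calls per level.
program = sequence(BASIC, loop(sequence(BASIC, BASIC), 7), call(3, 2))
```

Every rule only multiplies the basic block counts by an integer or adds such multiples together, so the executed total is an integer multiple of the basic block and the proportions stay at 1/2, 1/4, 1/4.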
A program is written to construct benchmarks by combining
basic blocks in the above three ways. A random binary number is
generated and this number represents a benchmark program. A "1"
in the number stands for a block begin, while a "0" stands for an
end. The number should therefore start with a "1" and end with a
"0". The substring "10" stands for a basic block, while a
repetition of "10", or (10), means a sequence of basic blocks.
(10) is sometimes combined to form an Lblock. There may be
adjacent "1"s and "0"s, but they must be matched according to the
LIFO rule of block nesting. Obviously, there must be the same number
of "1"s and "0"s. After a valid string has been generated,
appropriate blocks are combined and relocated by the program. The
generating program is like a compiler, and randomness is a means
to simulate real world situations.
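The string rules above can be sketched in a few lines. This is an illustrative Python sketch, not the generating program of the thesis: the function names, the fair-coin choice between opening and closing a block, and the nested-list representation are all assumptions.

```python
import random


def random_block_string(pairs, rng):
    """Return a random balanced string of `pairs` begin ("1") / end ("0") marks."""
    out = []
    open_blocks = 0
    begins_left = pairs
    while begins_left or open_blocks:
        # A block may be opened while begins remain; it must be opened
        # when nothing is open, so the string always starts with "1".
        if begins_left and (open_blocks == 0 or rng.random() < 0.5):
            out.append("1")
            open_blocks += 1
            begins_left -= 1
        else:
            out.append("0")
            open_blocks -= 1
    return "".join(out)


def is_valid(s):
    """Check the LIFO rule: no end before its begin, and all blocks closed."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "1" else -1
        if depth < 0:
            return False
    return depth == 0 and s != ""


def nesting(s):
    """Parse a valid string into nested lists; [] is a basic block ("10")."""
    root = []
    stack = [root]
    for ch in s:
        if ch == "1":
            block = []
            stack[-1].append(block)
            stack.append(block)
        else:
            stack.pop()
    return root


sample = random_block_string(10, random.Random(1988))
```

For example, `nesting("110100")` yields one outer block holding two basic blocks in sequence, the kind of structure the generator would then map onto Sblocks, Lblocks, Cblocks, and Rblocks.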
Conditional execution of basic blocks using IF/THEN is not
implemented, because that would complicate program generation. But
within a block, conditional execution of instructions is allowed.
Actually, the randomness of each generated program together with
CALL/RETURN is a simulation of the uncertainty and branching
action in IF/THEN. For the same reason, GOTO within a block to
another block is not allowed.
We are going to refine the distribution of each instruction
group. Statistics from [Alex75] and the Gibson Mix [Ser86] are
referenced. Statistics in [Alex75] are obtained from system
programs, and those from [Ser86] are obtained from scientific
programs. Floating-point instructions are found in the Gibson
Mix, but they are under the same instruction group as integer
arithmetic. Floating-point instructions will not be used in
the synthetic benchmarks although a floating-point unit is
simulated, because the model (370/125) we use for comparison does
not have a high performance floating-point chip. Floating-point
and integer ADD/SUBTRACT are hence joined together in our mix
(See TABLE 5.1).
a. The data movement group:















Note that 12 LOAD/STORE instructions are included in the
Sblock and Lblock, while 16 are included in the Cblock and
Rblock to account for greater memory traffic in crossing
subroutines. On the average, there are 35% LOAD/STORE.
b. The program modification group:
[Alex75] [Ser86]

















In the Gibson Mix, BRANCHES and CALL/RETURN are treated in
the same class. The mixture of the various kinds of
branches is taken from the typical figures listed in
section 4.3. At most one CALL is included in a basic block.
There may be no CALL in an Sblock or Lblock.

























The actual proportion of ADD, SUBTRACT, shifting, and
compare is not very important, because they are all
single-cycle in our machine. However, practical situations will be
taken into account as far as possible.
TABLE 5.1 Refinement of each instruction group.
See APPENDIX A4 for a typical Sblock.





















































TABLE 5.2 Results from the execution of seven benchmarks.
All kinds of basic blocks are structurally alike. The reader
may refer to APPENDIX A4 for the listing of an Sblock. Different
programs are generated with different combinations of the basic
blocks. These programs are run on the simulator, and the results
are used to evaluate some aspects of the reduced architecture.
Cache and pipeline performance can be seen because each program
has a different execution pattern. The RISC machine execution time
is also compared to that of the CISC to see what speed advantage
the reduced architecture has over the complex one.
The results of the execution of seven programs are listed in
TABLE 5.2. Each program is about 4 kbytes long. Without using the
low-level synthetic benchmark technique, machine programs of
that length are difficult to write. The first three programs are
synthesized with human intervention, while the last four are
randomly generated. The first program BM_C has a very high
proportion of nested CALL/RETURN; the second one BM_H has high
locality; while the third BM_L has a very low hit ratio.
Results of two cases are shown. The first refers to the case
with matched processor/memory bandwidth, so that no cache is
needed. Without delay due to cache misses, pipeline breaks are
reflected in the CPI figure. The second case has split caches
with the configuration mentioned in the last section, which
describes the simulator. For simplicity, the data cache miss
ratio and the instruction cache miss ratio are combined using the
formula derived in section 4.4. The accuracy of those formulae
is within 10%. The speedup figure is, however, directly measured
and calculated by the simulator.
All CPI figures in case 1 (without cache) are close to unity.
The worst one is found in the first program BM_C. Since the
branch and link instruction (BAL) used in subroutine calls is
not delayed in the simulator, more pipeline breaks are expected
if a program has many BAL instructions. However, locality is very
good in this program. We can see by comparing the second and the
third programs (with respectively high and low locality) that
although they have the same CPI figure in case 1, the speedup
differs significantly in case 2 where cache performance is also
considered. Speedup is close to 11 in case 1, while great
variation is found in case 2.
The four randomly generated programs BM_R1, BM_R2, BM_R3, and
BM_R4 have striking similarity in performance. The average cache
miss ratio is 0.29, and the average speedup is 5.2, about half of the
ideal case. We have assumed that the cache (and the processor) is
four times faster than the main memory, but the performance gain
is only half of that expected. It is hence worthwhile to consider
using modest processor technology to match the speed of main
memory, so that expense in the complex cache system can be saved.
5.4 SOME NOTES TO THE INSTRUCTION SET
The reader may find out that so far we have only concentrated on
the computational aspect of the design. Actually, if
memory-mapped I/O is used, the reduced architecture can meet all
general computing requirements. However, when some system related
primitives are added, system performance may be greatly enhanced.
We are going to discuss two points that are neglected by most
RISC designs: interruption and multiprocessor support. Problems
concerning complex/reduced machine compatibility are also
discussed.
The interrupt handling sequence is, briefly speaking,
interrupt identification followed by interrupt servicing. The
contents of some control registers and the program status word
(PSW) are examined in an interrupt. Instructions for examining
these registers are privileged. The only problem-state
(non-privileged) instruction involved is the set program mask (SPM)
instruction that sets the condition code and the interrupt masks
in the PSW. In terms of functionality, the SPM instruction is
simple; it can be added to the instruction set without altering
the main structure of the pipeline.
Our basic reduced architecture only includes instructions in
the problem state. With the SPM instruction, some other
system-related instructions such as load/store PSW and
load/store control registers may be added to simplify interrupt
handling. It is important to keep in mind that new instructions
should be simple ones so as not to violate the single-cycle
nature of the processor. The instructions just mentioned are
simple enough and are worth considering.
As an interrupt is allowed to occur at an instruction boundary,
our reduced machine has a faster response time since each
instruction only needs one cycle. Some logic is assumed to flush
and restart the pipeline when an interrupt occurs. According to
the S/370 definition, the number of interrupts to be honored is
large, hence the interrupt identification logic may be quite
complex. However, this part of the logic can be considered as a
separate design that does not affect our RISC kernel.
In the S/370 architecture, there are some instructions for
multiprocessor synchronization that are multi-cycle in nature. We
cannot synthesize these instructions by simply writing routines
for them, because execution of these instructions is accompanied
by execution of some serialization functions that do not have
RISC counterparts. When these functions are in execution,
critical resources such as the shared memory are not allowed to
be accessed by other channels and processors.
To tackle this problem, we propose two primitive instructions:
atomic begin (ATB) and atomic end (ATE). The function of the ATB
instruction is to lock the data bus for exclusive use, so that no
other processor can access the critical resources, while the ATE
instruction releases the bus. A subroutine for a
synchronization instruction must then be prefixed by an ATB and
ended by an ATE. With these two primitives, synchronization
instructions such as "test and set" and "compare and swap" are
easy to synthesize.
As we have concentrated on some 30 hardcore instructions, we
have to confront compatibility problems at the instruction set
architecture level. It is found that if strict compatibility is
to be maintained, the reduced machine must be cast out of the
S/370 family: it is, for instance, impossible to emulate the
branch on count (BCT) instruction (subtract one from a register, then
branch if its content is non-zero) according to the original
definition, because any straightforward combination such as
subtract-then-branch changes the condition code, but BCT alone
does not. The problem is that the reduced machine alone may be able
to emulate all complex instructions, but some side-effects are
difficult to undo and, worse, difficult to predict. A possible
solution is to enroll all instructions that cannot be emulated by
the kernel, but constraints in another dimension will be
introduced.
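The BCT side-effect can be made concrete with a toy model. The sketch below is an assumption-laden illustration in Python: the `Machine` class and its method names are hypothetical, the registers are unbounded Python integers, and the condition code follows the usual subtract convention (0 for zero, 1 for negative, 2 for positive) with overflow ignored.

```python
class Machine:
    """Toy register machine with a condition code, for illustration only."""

    def __init__(self):
        self.regs = [0] * 16
        self.cc = 0          # condition code

    def sub_imm(self, r, value):
        """Kernel-style subtract: updates the register and the condition code."""
        result = self.regs[r] - value
        self.regs[r] = result
        self.cc = 0 if result == 0 else (1 if result < 0 else 2)

    def bct_reference(self, r):
        """BCT as defined: decrement, branch if non-zero, condition code untouched."""
        self.regs[r] -= 1
        return self.regs[r] != 0

    def bct_emulated(self, r):
        """Straightforward emulation: subtract, then branch on the condition code."""
        self.sub_imm(r, 1)
        return self.cc != 0
```

Starting both versions with the same register value and a condition code of 3, the branch decision and the register agree, but only the reference version leaves the condition code at 3. A later conditional branch that relies on the earlier code would therefore behave differently under the emulation.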
Instructions in the SS format induce another problem. Some
professors like Patterson maintain that RISC instructions should
have simple formats that do not cross word boundary; but some
manufacturers such as Fairchild like to Huffman-encode their
instructions. Each scheme has its pros and cons, but letting
instructions stay within instruction boundaries, as in the case
of the three-halfword SS instructions, leaves annoying problems
to the pipeline. Directly emulating SS instructions is quite
inefficient: even if we have some field extracting instructions
to partition the SS instructions for examination, the overhead
involved may be very large. For simple circumstances, direct
emulation is good enough; but on the flip side, direct emulation
does not work.
Perhaps the simplest solution to the compatibility problem is
to push the architectural compatibility up one level. In this
approach we employ a standard OS interface such as the IBM SAA
(for system applications architecture), and make programs
compatible at the application level. Due to the usually hardwired
instruction set that is difficult to redefine, RISC has long been
criticized for its inability to maintain low level compatibility.
Manufacturers such as Fairchild and MIPS Inc. have followed
the alternative just mentioned, with the exception that the de
facto standard UNIX System V is used instead of the SAA.
Low level compatibility is sometimes desirable because some
existing programs are written in assembler, or even if they are
written in HLL, their source code may not be available. For these
cases a trade-off solution is suggested. A macro assembler is
written in such a way that it accepts old model assembler code
and produces new model object code. Complex cases with various
cumbersome formats may be expanded as macros using coprocessor
instructions (if there is a coprocessor for them) and the
primitive instructions the RISC provides. Some post-code
optimizations such as delay slot filling may also be done here.
For programs only available in object code, a simple disassembler
may be written to obtain their assembler equivalents.
5.5 CONCLUSION
An S/370 instruction subset has been chosen and made very
streamlined. Every instruction in the subset can be executed in
one cycle, because we have arranged hardware in a simple and
efficient pipeline to support that kind of parallelism. Because
of the regularity and simplicity of the reduced architecture,
attaining a high degree of parallelism is relatively simple and
economically feasible.
We look at RISC from the angle of implementation simplicity, and
start out by choosing instructions that are potentially
single-cycle. It is also believed that a compiler can take advantage of
a simple instruction set, and that simple instructions are worth
the investment. By leaving out complex instructions, we concentrate
on a kernel and devise methods to make it run fast. Complex cases
are supposed to be synthesized by using primitive instructions.
We thus arrive at a simple and fast kernel machine that outspeeds
a CISC model in some circumstances. We have also found some
evidence which shows that the kernel instructions have better
usage, because most compiler usage statistics are skewed in a
way that favors primitive instructions.
The simple and fast kernel machine is examined to see whether
it is a RISC. Because we choose the instruction set from the
consideration of single-cycle execution, our reduced machine is
found to possess most RISC features, since many RISC features are
actually derived from the requirement of single-cycle execution.
The reduced instruction set is also found to have better
regularity, orthogonality, and composability, which are generally
welcomed by a compiler.
As the instruction repertoire has instructions satisfying
most, if not all, general computing requirements, our design can
be a stand-alone system. But effort has been made to maintain
architectural compatibility, so that migration ease and fair
comparison of the reduced design with the complex architecture
are both facilitated. However, it is found that compatibility at
the instruction set architecture level is difficult to obtain,
because the reduced architecture is only a modest subarchitecture
of the complex counterpart. A possible way is to rely on
microprogramming, with the RISC itself as the microarchitecture.
Some inherent complexity of the S/370 is found to be inherited by
the reduced design, due mainly to the problem of compatibility. The
condition code setting logic of the reduced machine is, for
example, as complex as that of the original architecture. Because of the
single load address instruction that loads an evaluated address to
a general register, to quote another example, a path from the
output of the address adder to the input of the register file is
required. The register file is already heavily loaded; the
added burden due to this single instruction may result in more
serious lithographic or routing problems.
Another difficulty we face is that we don't have a compiler
tailored for the RISC, so that large scale experiments cannot be
done. Yet, some programs run significantly faster on the reduced
machine because we have partly exploited RISC advantages. If such
a compiler is available, there are good reasons to believe that
the new system is even better than the bare RISC machine.
We are not able to draw such a strong conclusion as to discard
the CISC, because the evidence observed so far is not strong
enough. From the research, however, there are some clues which
indicate that there are ways to reduce a CISC, and that reducing it
is a worthwhile attempt.















A.V.Aho and J.D.Ullman, Principles of Compiler Design,
Addison-Wesley, 1977, Reading, Mass.
A.G.Alexander and D.B.Wortman, Static and Dynamic
Characteristics of XPL Programs, Computer, Nov 75,
pp.41-46.
M.R.Barbacci, Instruction Set Processor Specifications
(ISPS): The Notation and Its Applications, IEEE Trans.
Comput., C-30(1), Jan 81, pp.24-40.
C.Bell, RISC: Back to the Future? Datamation, June 1,
1986, pp.96-108.
R.Bernhard, More Hardware Means Less Software, IEEE
Spectrum, Dec 81, pp.30-37.
J.S.Birnbaum and W.S.Worley,Jr., Beyond RISC:
High-Precision Architecture, Proc. 1986 COMPCON, IEEE, Mar
3-6, pp.40-47.
G.Borriello et al., RISC vs CISCs for Prolog: A Case
Study, ASPLOS II, IEEE, Oct 87, pp.136-145.
R.P.Case and A.Padegs, Architecture of the IBM
System/370, CACM, 21(1), Jan 78, pp.73-96.
G.J.Chaitin et al., Register Allocation via Coloring,
Proc. SIGPLAN Symp. on Compiler Construction, Jun 82,
pp.98-105.
T.C.Chen and W.King, Computer Architecture, PWS-Kent
Publishing Company, 1989.
F.Chow et al., How Many Addressing Modes are Enough?
ASPLOS II, IEEE, Oct 87, pp.117-121.
R.P.Colwell et al., Instruction Set and Beyond:
Computers, Complexity and Controversy, Computer, Sept
85, pp.8-19.
J.W.Davidson, The Effect of Instruction Set Complexity
on Program Size and Memory Performance, ASPLOS II,
IEEE, Oct 87, pp.60-64.
C.Davis et al., Gate Array Embodies System/370
Processor, Electronics, Oct 9, 1980, pp.140-143.
REFERENCES R!
[Ditz80] D.R.Ditzel and D.A.Patterson, Retrospective on
High-Level Language Computer Architecture, Proc. 7th Int'l
Conf. on Computer Architecture, 1980, pp.97-104.
[DuB86] D.K.DuBose et al., A Microcoded RISC, Computer
Architecture News, 1986, pp.5-16.
[Fair82] D.A.Fairclough, A Unique Microprocessor Instruction
Set, IEEE Micro, May 82, pp.8-18.
[Flynn83] M.J.Flynn, Towards Better Instruction Sets, Proc.
Micro 14, IEEE, 1983, pp.3-8.
[Flynn87] M.Flynn et al., And Now a Case for More Complex
Instruction Sets, Computer, Sept 87, pp.71-83.
[Full77] S.H.Fuller and W.E.Burr, Measurement and Evaluation of
Alternative Computer Architectures, Computer, Oct 77,
pp.24-35.
[Gim87] C.E.Gimarc and V.M.Milutinovic, A Survey of RISC
Processors and Computers of the Mid-1980s, Computer,
Sept 87, pp.59-69.
[Gross82] T.Gross and J.Hennessy, Optimizing Delayed Branches,
Proc. Micro 15, IEEE, 1982, pp.114-120.
[Heath84] J.L.Heath, Re-evaluation of the RISC I, Computer
Architecture News, Mar 84, pp.3-10.
[Henn82] J.Hennessy et al., MIPS: A Microprocessor
Architecture, Proc. Micro 15, IEEE, 1982, pp.17-22.
[Henn84] J.Hennessy, VLSI Processor Architecture, IEEE Trans.
Comput., C-33(12), Dec 84, pp.1221-1246.
[Henn85] J.Hennessy, VLSI RISC Processors, VLSI System Design,
Oct 85, pp.22-32.
[Hitch85] C.Y.Hitchcock III and H.M.Brinkley Sprunt, Analysing
Multiple Register Sets, Proc. 12th Int'l Sym. Comput.
Archit., IEEE, Jul 85, pp.55-63.
[Hug86] M.Huguet and T.Lang, A Reduced Register File for RISC
Architectures, Computer Architecture News, 1986,
pp.22-31.
[Hunt87] C.B.Hunter, Introduction to the Clipper Architecture,
IEEE Micro, Aug 87, pp.6-26.
[IBM76] IBM System/370 Model 125 Functional Characteristics,
IBM Corp., GA-33-1506-3, 4ED, Nov 76.
[IBM78] IBM System/370 Principles of Operation, IBM Corp.,
GA22-7000, 1978.
[Knuth71] D.E.Knuth, An Empirical Study of FORTRAN Programs,
Softw. Pract. Exper. 1 (1971), pp.105-133.
[Lowry69] E.S.Lowry and C.W.Medlock, Object Code Optimization,
CACM, 12(1), Jan 69, pp.13-22.
[Lunde77] A.Lunde, Empirical Evaluation of Some Features of
Instruction Set Processor Architectures, CACM, 20(3),
Mar 77, pp.143-153.
[Magen87] D.J.Magenheimer et al., Integer Multiplication and
Division on the HP Precision Architecture, Computer
Architecture News, ACM, 1987, pp.90-95.
[Man87a] T.Manuel, The Frantic Search for More Speed,
Electronics, Sept 3, 1987, pp.59-65.
[Man87b] T.Manuel, Getting Mainframe Power Out of a CISC
Supermicro, Electronics, Sept 3, 1987, pp.66-74.
[Mark84] M.Markoff, New RISC Machines Appear as Hybrids with
both RISC and CISC Features, Computer Design, Apr 1,
1986, pp.22-25.
[McFar86] S.McFarling and J.Hennessy, Reducing the Cost of
Branches, Proc. Micro 17, IEEE, 1986, pp.396-403.
[McNel87] K.J.McNeley and V.M.Milutinovic, Emulating a Complex
Instruction Set Computer with a Reduced Instruction Set
Computer, IEEE Micro, Feb 87, pp.60-72.
[Milut86] V.Milutinovic and V.Mendoza-Grado, A Survey of
Advanced Microprocessors and HLL Computer
Architectures, Computer, Aug 86, pp.72-85.
[Mok87] N.Mokhoff, Scalable RISC-based architecture yields 10-
MIPS workstation, Electronics, Jul 9, 1987, pp.45-48.
[Motor88] Technical Summary of the MC88000 Family, BR588D and
BR589D, Motorola Inc., 1988.
[Nau87] B.A.Naused and B.K.Gilbert, A 32-Bit, 200 MHz GaAs
RISC for High-Throughput Signal Processing
Environments, IEEE Micro, Dec 87, pp.8-20.
[Neff86] L.Neff, CLIPPER™ Microprocessor Architecture
Overview, Proc. 1986 COMPCON, IEEE, Mar 3-6, pp. _-8-4U.
[Ohr84] S.Ohr, Conventional Benchmarks Fail to Measure Up for
Multi-user Micros, Electronic Design, Jun 84, pp.61-
62.
[Ohr85] S.Ohr, RISC Machines, Electronic Design, Jan 10, 85,
pp.175-190.
[Patt82a] D.A.Patterson and R.S.Piepho, Assessing RISCs in High-
Level Language Support, IEEE Micro, Nov 82, pp.9-19.
[Patt82b] D.A.Patterson, A RISCy Approach to Computer Design,
Proc. Micro 13, IEEE, 1982, pp.8-14.
[Patt83] D.A.Patterson, Microprogramming, Sci. Am., 248(3),
Mar 83, pp.36-43.
[Patt84] D.A.Patterson, RISC Watch, Computer Architecture
News, Mar 84, pp.11-19.
[Patt85] D.A.Patterson, Reduced Instruction Set Computers,
CACM, 28(1), Jan 85, pp.9-21.
[Pul83] J.Pulcini, Constructing Benchmarks That Measure Up,
Computer Design, Oct 83, pp.161-168.
[Radin83] G.Radin, The 801 Minicomputer, IBM J. Res. Develop.,
27(3), May 83, pp.237-246.
[Ser86] O.Serlin, MIPS, Dhrystones, and Other Tales,
Datamation, Jun 1, 1986, pp.112-118.
[Smith87] A.J.Smith, Cache Memory Design: an Evolving Art, IEEE
Spectrum, Dec 87, pp.40-44.
[Snow81] E.A.Snow and D.P.Siewiorek, Implementation and
Performance Evaluation of Computer Families, IEEE Trans.
Comput., C-30(6), Jun 81, pp.443-447.
[Stall86] W.Stallings, Reduced Instruction Set Computers,
Computer Organization and Architecture, 1986, pp.431-
455.
[Tabak86] D.Tabak, Which System is a RISC? Computer, Oct 86,
pp.85-86.
[Tanen78] A.Tanenbaum, Implications of Structured Programming
for Machine Architecture, CACM, 21(3), Mar 78, pp.237-
246.
[Thomp88] T.Thompson, The Intel 80387 vs. The Weitek 1167,
Byte, Mar 88, 13(3), pp.205-216.
[Tred86] N.Tredennick, Compcon Panel: The RISC vs. CISC Debates,
Proc. 1986 COMPCON, IEEE, Mar 3-6, pp.312.
[Unger84] D.Ungar et al., Architecture of SOAR: Smalltalk on a
RISC, Proc. 11th Int'l Sym. on Comput. Architect.,
1984, pp.188-197.
[Wall85] P.Wallich, Towards Simpler, Faster Computers, IEEE
Spectrum, Aug 85.
[Weick84] R.P.Weicker, Dhrystone: A Synthetic Systems Programming
Benchmark, CACM, 27(10), Oct 84, pp.1013-1030.
[Wulf81] W.A.Wulf, Compilers and Computer Architecture,
Computer, Jul 81, pp.41-47.
[Wolfe87] A.Wolfe and B.C.Cole, World's Fastest Microprocessor,








































































































R1,R2, (R3 or imd)
R1, (R2 or imd)
R1, (R2 or imd)
R1, (R2 or imd)




















































































































































Instructions marked by a (c) set the condition code.
Only the first 31 instructions are in the RISC kernel.
Imd denotes an immediate operand.
Condition code setting
Instructions    0    1    2    3



















































opcode R1 R2 R3 or immediate
opcode R1 X2 B2 D2
opcode R1 00 B2 D2
APPENDIX
DESIGN OF THE ALU
The 2-input (X,Y) and 1-output (Z) ALU should be able to:
1. gate Y to the output;
2. complement Y;
3. add and subtract 2's complement numbers;
4. perform logical AND, OR, and XOR.
There are seven functions in total. Five pieces of information are
necessary for the determination of the condition code (CC):
1. the sign of the first operand Sx;
2. the sign of the second operand Sy;
3. the sign of the result Sz;
4. the zero flag Z;
5. the carry out Co.
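
The seven functions and the five status bits listed above can be
illustrated with a minimal Python sketch. It is not the thesis
implementation: the 32-bit word width, the function names, and the
reading of "complement Y" as 2's complement negation (as LCR
requires) are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word


def sign(word):
    """Return the sign bit (leftmost bit) of a 32-bit word."""
    return (word >> 31) & 1


def alu(func, x, y):
    """Apply one of seven ALU functions to the 32-bit inputs x and y.

    Returns (z, flags); flags holds Sx, Sy, Sz, Z and Co.
    The function names are hypothetical labels, not the thesis encoding.
    """
    x &= MASK32
    y &= MASK32
    if func == "pass_y":
        total = y
    elif func == "neg_y":
        # assumed meaning of "complement Y": 2's complement negation
        total = (~y & MASK32) + 1
    elif func == "add":
        total = x + y
    elif func == "sub":
        # x - y formed as x + (1's complement of y) + 1
        total = x + (~y & MASK32) + 1
    elif func == "and":
        total = x & y
    elif func == "or":
        total = x | y
    elif func == "xor":
        total = x ^ y
    else:
        raise ValueError(f"unknown ALU function: {func}")
    z = total & MASK32
    flags = {
        "Sx": sign(x),
        "Sy": sign(y),
        "Sz": sign(z),
        "Z": int(z == 0),
        "Co": (total >> 32) & 1,  # carry out of the leftmost bit
    }
    return z, flags
```

For instance, alu("sub", 5, 5) gives a zero result with Z = 1 and
Co = 1, and alu("add", 0x7FFFFFFF, 1) gives Sx = Sy = 0 with Sz = 1,
the sign pattern of an overflow.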
The definition of the 370 instruction set requires the CC to be set
in a quite inconsistent manner; the operation code is therefore
also required for a unique setting. Let this code be denoted by
F. F is assumed to be a 4-bit quantity. We have the following:
CC  = (CC0, CC1)
CC0 = f(Sx, Sy, Sz, Z, Co, F)
CC1 = g(Sx, Sy, Sz, Z, Co, F)
where f and g are some logic functions of the enclosed arguments.
The CC determination logic is a 9-input, 2-output combinational
circuit which is quite complex. Table look-up is recommended for
design simplicity. There are several points to note:
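
The table look-up can be sketched as follows. The sketch only shows
the mechanism: the nine inputs are packed into an index into a
512-entry table of 2-bit CC values. The two function codes, the bit
order of the index, and the choice of CC0 as the high-order bit are
hypothetical; only the CC meanings for a signed add and a logical
compare follow the System/370 definition.

```python
F_ADD = 0b0001          # hypothetical 4-bit function codes
F_CMP_LOGICAL = 0b0010


def pack(sx, sy, sz, z, co, f):
    """Pack the five status bits and the 4-bit code F into a 9-bit index."""
    return (sx << 8) | (sy << 7) | (sz << 6) | (z << 5) | (co << 4) | (f & 0xF)


def cc_rule(sx, sy, sz, z, co, f):
    """Reference rule used only to fill the table (two codes modelled)."""
    if f == F_ADD:
        if sx == sy and sz != sx:
            return 3              # overflow takes priority
        if z:
            return 0              # result is zero
        return 1 if sz else 2     # negative / positive
    if f == F_CMP_LOGICAL:
        if z:
            return 0              # operands equal
        return 2 if co else 1     # first operand high / low
    return 0                      # placeholder for codes not modelled


def build_cc_table():
    """Enumerate all 512 input combinations once, as a ROM would hold them."""
    table = [0] * 512
    for index in range(512):
        bits = [(index >> shift) & 1 for shift in (8, 7, 6, 5, 4)]
        table[index] = cc_rule(*bits, index & 0xF)
    return table


CC_TABLE = build_cc_table()


def condition_code(sx, sy, sz, z, co, f):
    """Look up the CC and return it as the pair (CC0, CC1)."""
    cc = CC_TABLE[pack(sx, sy, sz, z, co, f)]
    return (cc >> 1) & 1, cc & 1
```

Some index values can never occur (for example Z = 1 together with
Sz = 1); a table simply carries don't-care entries for them, which is
part of why look-up is simpler than random logic here.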
1. LPR and LNR require examination of the operand sign prior to
possible complementation. If the ALU can only support LCR, some
logic is then required to modify the function code before it
enters the ALU. If the function code designated by the
instruction is F and that to be fed to the ALU is G, then G
should be a function of both F and Sy. Note that G is only a
3-bit quantity as there are only 8 ALU functions, while F has
four bits.
2. CLR is similar to CR in the operation required: both require
a subtraction to be done. The CC setting is, however, not the
same. The CC after executing a logical (magnitude) comparison
is determined by first assuming a fictitious sign bit, then
performing a 2's complement subtraction. The resulting sign is
determined by Co, the carry out. If there is a carry, the
result is positive, hence the CC should indicate a first
operand high.
3. CC should be set according to state priority: overflow, for
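
Points 1 and 2 above can be illustrated by a short Python sketch.
It assumes a 32-bit word; the ALU function labels "pass" and
"complement" are hypothetical, and the CC values for the logical
compare (0 equal, 1 first operand low, 2 first operand high) follow
the System/370 definition.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit word


def alu_function(f, sy):
    """Derive the ALU function G from the instruction F and the sign Sy.

    Only LCR-style complementation is assumed to exist in the ALU, so
    LPR and LNR are reduced to "pass" or "complement" by the sign of Y.
    """
    if f == "LPR":      # load positive: negate only a negative operand
        return "complement" if sy else "pass"
    if f == "LNR":      # load negative: negate only a non-negative operand
        return "pass" if sy else "complement"
    if f == "LCR":      # load complement: always negate
        return "complement"
    return f            # other instructions pass through unchanged


def compare_logical(x, y):
    """Return the CC of an unsigned comparison, judged by the carry out."""
    # x - y formed as x + (1's complement of y) + 1, as in a signed subtract
    total = (x & MASK32) + (~y & MASK32) + 1
    co = (total >> 32) & 1
    if total & MASK32 == 0:
        return 0            # operands equal
    return 2 if co else 1   # carry: first operand high; no carry: low
```

As a check, compare_logical(0xFFFFFFFF, 1) returns 2 (first operand
high), whereas a signed comparison of the same bit patterns would
find the first operand low.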




THE MICROINSTRUCTION FORMAT
The opcode of an instruction is used to access the
microinstruction directly. The macroinstruction is then concatenated
with the microinstruction for further processing. The microword,
not counting the accompanying macroinstruction, is 18 bits long.
The microstore is therefore at most 256x18 bits for an 8-bit
opcode. Decoding is simple, microprogramming the processor is
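
The direct access described above can be sketched in Python. The
sketch assumes a 32-bit macroinstruction with the opcode in its
leftmost byte, places the macroinstruction in bits 0..31 of a 50-bit
word followed by the 18-bit microword, and numbers bit 0 as the
leftmost bit. The helper names and the two sample fields are
illustrative assumptions only.

```python
MACRO_BITS = 32
MICRO_BITS = 18
WORD_BITS = MACRO_BITS + MICRO_BITS  # 50-bit combined word


def fetch_microword(microstore, macroinstruction):
    """Index the microstore by opcode and append the microword."""
    opcode = (macroinstruction >> 24) & 0xFF      # leftmost byte (assumed)
    micro = microstore[opcode] & ((1 << MICRO_BITS) - 1)
    return ((macroinstruction & 0xFFFFFFFF) << MICRO_BITS) | micro


def field(word, first, last):
    """Extract bits first..last of the 50-bit word (bit 0 is leftmost)."""
    length = last - first + 1
    shift = WORD_BITS - 1 - last
    return (word >> shift) & ((1 << length) - 1)


# A 256-entry microstore, one 18-bit word per possible 8-bit opcode.
microstore = [0] * 256
microstore[0x1A] = 0b010110 << 12   # hypothetical contents for opcode 1A

mw = fetch_microword(microstore, 0x1A508000)
macro = field(mw, 0, 31)            # the macroinstruction itself
alu_function = field(mw, 32, 35)    # first four microword bits
shifter_function = field(mw, 36, 37)
```

No sequencing logic is needed in this arrangement: one table access
per macroinstruction yields every control field.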




:= MW 0..31,
ALU.function 0..3 := MW 32..35,
SHIFTER.function 0..1 := MW 36..37,
ALU.or.SHIFTER := MW 38,
Load 0..2 := MW 39..40,
Store := MW 41,
Memory.Lo.St 0..2





:= MW 47..48,
:= MW 49,




I ar 1,2! to be programmed






! buffer X,Y with
! 0: no action











! 2: full word
! 0: no action
! 1: PC -> MAR
! 2: use CC mask
! 0: RX, RS























































6, 0(B,14)
7, 4(B,14)
8, 8(B,14)






1 1 f, 1 ui
7, 0(0,0)
15, 15














A, 0(B,14)
0, 0(0,0)
0, 0(0,0)
0, 0(0,0)
5, 0
5, 3
0, 0(0,0)
00 00 oc
00 00 OA
0 O 00 0£
00 00 oc




































41 50 00 00
41 60 00 06
41 70 00 07
41 80 00 08
41 90 00 09
41 D0 0E 00
41 E0 0F 00
50 6E B0 00
50 7E B0 04
50 8E B0 08
50 9E B0 0C
1B FF 00 00
47 80 00 58
47 00 00 00
58 FE 00 08
89 F0 00 01
17 FF 00 00
47 70 00 00
12 FF 00 00
47 70 00 00
1B FF 00 00
47 80 00 74
50 6D 00 00
16 60 80 0F
89 60 00 01
53 6D 00 00
1B FF 00 00
19 F0 80 00
47 80 00 38
58 6E B0 00
58 7E B0 04
58 8E B0 08
58 9E B0 0C
53 AE B0 00
47 00 00 00
47 00 00 00
47 00 00 00
1A 50 80 00
19 50 80 03
47 00 00 00
APPENDIX A5
A SIMULATION EXAMPLE
The following single-key commands are interpreted:
: wsa display
i: initialization
e: enter a code segment
s: save a code segment
l: load a code segment
t: advance a program a step
r: run a code segment
v: view the program state
c: clear the screen
» i
input file name: brrcpl
4096 bytes transferred
run program in single step (y/n)? n
instruction cache hit ratio is 0.727
data cache hit ratio is 0.816
» v
enter r to view registers, m to view memory:
*** Register dump ***
contents of general registers:
R0= 00 00 00 00
R4= 00 00 00 00
R8= 00 00 00 00
Bin_ AA AA AA AA
R1= 00 00 00 00
R5= 00 00 00 00
R9= 00 00 00 00
Di7- nn nn nn nn
R2= 00 00 00 00
R6= 00 00 00 00
R10= 00 00 00 00
R14= 00 00 00 00
R3= 00 00 00 00
R7= 00 00 00 00
R11= 00 00 00 00
R15= 00 00 00 00
contents of floating-point registers:
R0= 43 23 00 00
IR: 41 20 00 00
MAR: 00 00 00 0
X: 00 00 00 00
ra• nn nn oo 0(
R2= 3E 75 05 00
TIR: 41 20 00 C
HBI: 00 00 00(
Y: 00 00 00 3E
FAY• 00 00 00(
R4= 00 00 00 00 R6= 00 00 00 0'
HR: 00 00 00 81
MB0: 00 00 00 0
Z: 00 00 00 3E
EAZ: 00 00 00 0
PSH.nr.• 2 PSW.instr addr: 00 00 CO cycle: 325 H125: 117


