The RISC architecture and computer performance evaluation by Barros, Manuel Filipe Pedrosa de
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
1986-03
The RISC architecture and computer performance evaluation













THE RISC ARCHITECTURE AND
COMPUTER PERFORMANCE EVALUATION
by
Manuel Filipe Pedrosa de Barros
March 1986
Thesis Advisor: Harriett B. Ri2;as
Approved for public release; distribution is unlimited
T226026







DECLASSIFICATION / DOWNGRADING SCHEDULE
3 OISTRI8UTION/AVAILA8ILITY OF REPORT Approved tOr
public release; distribution is
unlimited
ERFORMING ORGANIZATION REPORT NUM8ER(S) 5 MONITORING ORGANIZATION REPORT NUM8ER(S)





7a. NAME OF MONITORING ORGANIZATION
Naval Postgraduate School
ADDRESS (Ofy, State, and ZIP Code)
interey, California 93943-5000
7b. ADDRESS (Ofy, State, and ZIP Code)
Monterey, California 93943-5000




9 PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER









TITLE (Include Security Classification)
E RISC ARCHITECTURE AND COMPUTER PERFORMANCE EVALUATION
PERSONAL AUTHOR(S)












18 SUBJECT TERMS [Continue on reverie if necessary and identify by block number)
RISC Architecture; CISC Architecture;
Computer Performance Evaluation
ABSTRAC" [Continue on reverse if necessary and identify by block number)
A definition of Reduced Instruction Set Computers is developed.
A computer performance model which allows the evaluation of
'chitectural alternatives is presented.
An example on the use of the model to compute the performance
-ternatives for a given application is presented to study the effect
: the addition of an instruction to a processor instruction set.
•3UT-CN/ AVAILABILITY OF ABSTRACT til ABSTRACT SECURITY C'^SSif iCATlON
^.NCLASSlFlED/UNLiMlTED G SAME iS RPT Q DT!C USERS i UNCLASSIFIED
•.^'t Of PtSPONSlSLE !f, DIVIDUAL
'Of H. Rigas
.22b TELEPHONE f/nc/ude Area Code) 22c OFFICE SyMBO'.
I (408)646-2082 62Rr
HORM 1473, 34 MAR 33 APR eaition -nay be used until exhausted
All other editions are obsolete
SECURITY CLASSlFiCAf'ON OF 'HiS =>;





Manuel Filipe Pedrosa de Barros
Lieutenant, Portuguese Kavy
B. S. , Escola Naval, 1978
Submitted in partial fulfillment of the
requirements for the degree of




A definition of Reduced Instruction Set Computers is
developed.
A computer performance model which allows the evaluation
of architectural alternatives is presented.
An example on the use of the model to compute the
performance alternatives for a given application is
presented to study the effect of the addition of an instruc-




II. WHAT IS A RISC ? 11
A. INTRODUCTION 11
B. THE RISC I AND II 12
C. THE 801 MINICOMPUTER 15
D. THE MIPS 16
E. TOWARD A DEFINITION OF A RISC MACHINE .... 17
III. MY APPROACH TO COMPUTER PERFORMANCE EVALUATION . . 18
A. INTRODUCTION 18
B. EVALUATION AND MEASUREMENTS 18
C. THE RISC/CISC CONTROVERSY 20
D. AN EXAMPLE 22
E. SUGGESTED APPROACH ....'. 25
IV. TIMING ANALYSIS 28
A. INTRODUCTION ' 28
B. THE COMPUTER SYSTEM 29
1. Memory and I/O Interface 30
2. The Busses 31
3. The Processor 32
C. THE APPLICATION 32
D. THE PERFORMANCE 35
E. A SPECIAL CASE AND THE RISC 37
F. THE SYSTEM ARCHITECTURE AND TIMING 3 7
V. CONTROL ANALYSIS 43
A. INTRODUCTION 43
B. THE CONTROL UNIT AS A FINITE STATE MACHINE . . 44




D. THE APPLICATION AND THE CONTROL UNIT 47
E. THE MODEL 49
VI. CASE ANALYSIS . . '. 54
A. INTRODUCTION 54
B. THE ADDITION OF AN INSTRUCTION 54
C. THE COST/GAIN TRADEOFF 56
1. Timing Criterion 57
2. Control Unit Complexity Criterion .... 62
D. AN ILLUSTRATIVE EXAMPLE 64
1. The Processor '. . 64
2. The Application 65
3. The Floating Point Representation .... 66
4. The Hardware Involved 67
5. The Model 70
VII. CONCLUSIONS 76
APPENDIX A: FAST FOURIER TRANSFORM 79
APPENDIX B: IBITR FUNCTION 86
APPENDIX C: SINE FUNCTION 87
APPENDIX D: COSINE FUNCTION . 91
LIST OF REFERENCES 95
INITIAL DISTRIBUTION- LIST 96
LIST OF TABLES
I EXECUTION TIME OF EACH SUBROUTINE IN FAST
FOURIER TRANSFORM PROGRAM 72
II FAST FOURIER TRANSFORM APPLICATION PROGRAM
EXECUTION TIME 72
III EFT PROGRAM EXECUTION TIME BEFORE THE
ADDITION OF THE FLOATING POINT MULTIPLY
INSTRUCTION 73
IV EFT PROGRAM EXECUTION TIME AFTER THE
ADDITION OF THE FLOATING POINT MULTIPLY '
INSTRUCTION 74
V PERFORMANCE EFFECTS OF THE ADDITION OF
THE FLOATING POINT MULTIPLY INSTRUCTION ..... 75
LIST OF FIGURES
2. 1 RISC Register Window 14
3. 1 Conceptual View 26
5.
1
Simple Control Unit State Diagram 44
5.2 More Detailed Control Unit State Diagram 45
6. Floating Point Representation 56
6.
2
General Hardware Structure for the Floating
Point Multiply Instruction 68
ACKNOWLEDGEMENT S
I would like to express my gratitude to Prof. Rigas for
her guidance in completing this project.
To my parents for all they taught me and finally and
most important to my wife Carmo and my son Andre for their
constant support and understanding, without which I would
not have got here, I dedicate this work.
8
I. INTRODUCTION
The first Reduced Instruction Set Computer (RISC)
appeared at the end of the 1970' s and since then long and
heated discussions have taken place in the computer archi-
tecture community. These discussions centered around the
validity of the claims made by the RISC proponents regarding
the performance achieved by the proposed machines when
compared to traditional computers that are referred to as
Complex Instruction Set Computers (CISC).
Due to a lack of an appropriate method to evaluate the
performance effects of various architectural features, it is
difficult to resolve the RISC/CISC controversy.
The interest in the ideas proposed by this philosophy
has been growing, and presently many of the major computer
companies are investing a great deal in this new type of
computer architecture.
This thesis tries, first, to define the basic character-
istics of a Reduced Instruction Set Computer, so that it is
possible to focus on the specific architectural features
peculiar to RISC machines.
The approach that in the author's opinion has to be
followed, in order to evaluate computer performance,
together with the author's disagreement on the approach
taken on several published comparisons between RISC and CISC
machines, are presented.
A model for computer performance evaluation is
suggested. This model is composed of two parts. The first
part deals with the timing analysis of the computer perform-
ance. The second part sets a criterion to determine the
efficiency of a given computer control unit when used for a
given application. Finally in order to evaluate the model,
an example is given demonstrating the quantification of the
performance effects of an architectural enhancement to a
system architecture.
The model suggested for computer performance evaluation
constitutes a departure from the current computer perform-
ance evaluation methods, because the attention is centered
on the computer architecture rather than on the measurements
of throughput, response time and mean job turnaround time
where the main emphasis of the evaluation process is put on
the software.
The model is intended to provide a tool for computer
architects to use, so that discussions regarding the
performance achievements of certain architectural features
might be quantified and rational conclusions may be reached.
10
II. WHAT IS A RISC ?
A. INTRODUCTION
In recent years a new type of computer architecture has
received a great deal of attention.
This new architecture is mainly the result of an effort
conducted in an academic environment. Profiting from the new
possibilities that custom VLSI offers, the professors and
students at the University of California at Berkeley,
collaborating in several courses in this area, began
projects on building single chip computers.
Due to limitations of the chip area, available tools and
the available time for the completion of the project,
several simplifications to contemporary architectures were
made. For example, the instruction set was simplified by
eliminating all instructions that might be called composite
instructions. This type of instruction is equivalent, in the
operation performed, to a sequence of other more elementary
( atomized) instructions.
A claim has been made, that the obtainable performance
of these machines was unexpectablly remarkable and this
triggered a major discussion on the subject of the merits of
Rise's.
Feeding the controversy is undoubtly the lack of an
appropriate method or tool to measure computer architecture
performance and the effects of a particular architecture
modification on the computer performance.
From the very beginning the RISC machines were related
to implementation issues in the use of VISI technology.
Proponents called the approach "RISC", for Reduced
Instruction Set Computers, as opposed to the traditional
computers which they referred to as "CISC' S", for Complex
Instruction Set Computers.
11
The "new architecture" proponents didn't present it as a
proposal to enhance, in some way, the prevailing architec-
ture, but as a complete departure from the previous work.
No precise definition has ever been given for the
complete characteristics of a RISC machine, and because of
that, there are now in existence several different machines
all claiming to be RISC' s. Although there are some common
features there is no clear cut agreement on what comprises a
reduced instruction set computer.
No doubt some very valid ideas were brought to the
computer architecture environment by the "RISC philosophy
proponents", but, nevertheless, it constitutes a sure risk
to accept a new idea without an open, substantiative debate
where the benefits are separated from the jargon.
The first step in understanding and identifying the RISC
trade-off is a more precise definition of RISC.
As stated above, several implementations of RISC' s are
already in existence, and, of these, four have undoubtly
enough importance to be mentioned.
They are:
1) The RISC I and II, developed at the University of
California at Berkeley
2) The 801 Minicomputer, developed at the IBM Thomas S.
Watson Research Center
3) The MIPS, developed at Stanford University.
In order to develop a definition of the "RISC" the
existing "RISCs" should be studied.
B. THE RISC I AND II
The RISC I and II were both developed at the University
of California at Berkeley where the acronym RISC originated.
Since both were developed at U. C. Berkeley, they are very
similar in their composition. In fact, RISC II is no more
than an enhanced version of RISC I.
Both are single chip VLSI processors having the
following characteristics:
12
1) They are 32-bit machines. That is, all registers and
busses are 32 bit wide.
2) Instruction Set:
2a) RISC I has 31 instructions
RISC II has 39 instructions
2b) Both have a load/store architecture. This means
that all instructions except load and store are
register-to-register. Load and store are the only
memory-reference instructions.
2c) All instructions except LOAD and STORE are single-
cycle where a cycle is the time it takes to read
and add two registers, and then store the result
back into a register.
2d) All instructions are the same size (32 bits).
There are two different formats but the fields are
at fixed locations.
2e). Addressing Modes:
There are two addressing modes; one for register-
to-register instructions--Register Direct and the





3a) Total number of on-chip registers
RISC I 138
RISC II 198 ^
3b) The processor is organized in multiple overlapping
windows in order to facilitate parameter passing
between procedures.
The windows are organized in a circular buffer
fashion. In the case that the nested procedure
depth is greater than the number of windows minus
one, the values in the window corresponding to the
oldest procedure are stored in memory and this
window is then free to be allocated to the current
procedure. At any time 32 registers are visible
constituting what is called the "current window".
All windows have a fixed size and the composition
shown in Figure 2. 1.
The global registers are common to all procedures,
and therefore they are used to store global vari-
ables. Register KG holds a fixed value of zero.
The low registers are common to the current proce-
dure and to the called procedure, although, m the
called procedure, they will nave a different
number since there they constitute the high regis-
ters of the corresponding window. The high regis-
ters are common to the current procedure and to
the calling procedure. The high and low registers
along with the global registers constitute the
overlapped part of each window and are used for
parameter passing between procedures. The local
registers are only visible m the current window.
4) The control unit is hardwired with most of its logic
implemented using PLA's.
5) Pipeline Stages
The RISC I has two pipeline stages, i.e., depending
on the program sequence it can prefetch the next
















Figure 2. 1 RISC Register Window.
instruction. The RISC II has three pipeline stages,
i.e., depending on the program sequence it can
prefetch the next instruction and store the final
results of the previous instruction in a register,
while it executes the present instruction.
5) Use of Delayed Branch
In order to increase speed and not to discard the
prefetch instruction, when a branch instruction is
executed, the branch takes place only after the
execution of the next sequential instruction.
Typically the compiler arranges for the instruction
following the branch to be part of the loop, see
[Ref. 1].
8) Imolementation
RIBC I is implemented with 4 micron NMOS VLSI
technology with a clock of 8 MHZ and a cycle of 500
NSEC. RISC II is implemented with 3 micron NMOS VLSI
technology with a clock of 12 MHZ and a cycle of 330
NSEC.
14
9) Both RISC I and II have no floating-point support.
C. THE 801 MINICOMPUTER
Developed by IBM at the Yorktown Heigths Research Center
from 1975 until 1979, it was the first machine to follow
what later would be called "The RISC Approach to Computer
Architecture"
.
Due to its proprietary nature, not much is known about
it, but some of the ideas present in its design are known
and have been, in a certain way, the basis for the develop-
ment of RISC I and I I at Berkeley and MIPS at Stanford.
As opposed to the RISCs and the MIPS, the 801 is not a
single chip processor but a minicomputer.
The general approach is the basis for the design of an
IBM NMOS VLSI single chip processor known as ROMP or 802.
The 801 machine is basically a 32 bit architecture with
single-cycle four byte instructions and 32 registers. It has
separate data and instruction cache memories. As in RISC I
and II, the 801 also has a delayed branch scheme, that is
the branch only takes place after the execution of the next
instruction.
The 801 system is said to be compiler-based meaning that
a greater demand is made on the compiler.
The 801 architecture was defined by George Radin in his
article 'The 801 Minicomputer' [ Ref . 2] as the set of run
time operations which:
1) Could not be moved to compile time
2) Could not be more efficiently executed by object code)roduced by a compiler which ' ' ' ' '^ ^ ''*
.evel intent of the program, or
produ ^ understood the high-
3) Was to be implemented in random logic more effec-
tively than the equivalent sequence of software
instructions.
Both data and address busses are 32 bit wide. The





Also a highly-effective optimizing compiler was devel-
oped for the system.
D. THE MIPS
The MIPS computer was developed at Stanford University
by John Hennessy and his students. Its acronym stands for
Microprocessor without Interlocked Pipe Stages.
There are strong similarities with the RISC project at
Berkeley. It has, however, some conceptual differences that
have already been identified by its proponents in Ref. 3 as:
i) more complex user level instruction set.
ii) the main design goal is high performance of the
hardware employed and not simplicity of the
instruction set.
iii) much more complex compiler.
Specifically its characteristics are the following:




2c) All instructions except LOAD and STORE are single-
cycle
2d) Instructions may be 16 or 32 bit long. An opti-
mizing compiler reorders the instructions so that
all lb bit instructions always come in pairs.
2e) Addressing Modes
- immediace




There are sixteen 32-bit general purpose registers.
Hardwired control with most of its logic implemented
using PLA'
s




Separate data and instructions memory
10) No support for floating-point operations
16
11) Implemented with 4 micron NMOS VLSI technology with
a clock rate of 8 MHZ.
E. TOWARD A DEFINITION OF A RISC MACHINE
Four machines have been described as examples of a new
type of computer architecture defined as the RISC architec-
ture, as opposed to the traditional architecture now
referred to as CISC architecture.
Any definition of this architecture will have to encom-
pass the characteristics common to the four previous
examples.
To summarize, a RISC Machine will have the following
characteristics:
1) Simple instruction set where the great majority of
the instructions are single-cycle,
2) Load/store architecture, that is all instructions are
register-to-register with the LOAD and STORE being
the only memory-reference instructions,
3) Very .few addressing modes,
4) Hardwired Control i. e. no microcode,
5) Instructions with one or two sizes and with fields at
fixed locations,
6) Some degree of pipelining,
7) Demand on the compiler to increase performance.
17
III. MY APPROACH TO COMPUTER PERFORMANCE EVALUATION
A. INTRODUCTION
This thesis has been motivated by the rise of the new,
RISC computer architecture trend, described in the previous
chapter, and by the claims made by RISC proponents regarding
the inherent superior performance of RISC when compared to
traditional architectures.
Unfortunately, the claims made for these structures were
not supported by any quantitative arguments. No specific
attention was given to the effects of various factors intro-
duced in the RISC architecture and to the influence that
each factor had on the system performance.
Computer performance evaluation is different depending
on the aspects of performance being evaluated. From the
view point of a potential computer system buyer, there is a
need to identify features in the system which will enhance
the performance for a particular application. From the
viewpoint of a computer architect, performance analysis is a
way to evaluate specific enhancements from which trends in
computer architecture design may follow.
B. EVALUATION AND MEASUREMENTS
In order to perform an evaluation of any kind, one must
take measurements of the system under different conditions.
One wants to take the measurements properly, or else the
evaluation will be unvalid.
In order to guarantee that the evaluation will be based
upon correct data, one has to know:
1) V7hat the measurements are for
The buyer is not worried about any of the architectural
details of the machine, but rather about the throughput of a
system programmed in a high-level language.
18
In contrast, the computer architect must be concerned
with the internal characteristics and the behavior of the
system, even when he is testing a system using programs
written in high-level languages.
Considering the RISC family of machines the correct
point of view is undoubtly the latter one.
2) What is measured
Typically one wants to test how each enhancement to the
computer architecture affects the system performance. In
order to get a realistic comparison of features, only one
feature at a time may differ. If more than one feature is
different, it is difficult to measure the individual effect
of each architectural feature on the system performance.
3) How is the evaluation performed
Because it is not feasible to build a new system each
time one of the architectural features is altered, a model
is required.
Because it is through the use of a model that the
performance effects of any architectural feature will be
determined, this model has to be able to quantify, in a
precise manner, the effects of any change in the
architecture.
4) For which application are the measurements valid
The application for which the system is being used has
an effect on the system performance. No system will show the
same performance in two different environments. For example,
in one application the user might be doing only word-
processing, and, in the second, the system might be
floating-point intensive.
There are, nevertheless, systems that present a balanced
performance throughout a diversified number of applications.
They are the so called "General Purpose Computers". But even
for these, the performance fluctuates, indicating that
general purpose computers have a better performance for some
applications than for others.
19
Due to these reasons, the system performance evaluation
must pay attention to the rigorous definition of the appli-
cation for which the system performance is being evaluated.
This requirement for a precise definition of the appli-
cation, will clarify the validity of the conclusions.
5) Which factors interact with the measurements
In the second question, the need to make just one change
at a time when making the evaluation is emphasized, other-
wise it would be impossible to determine the individual
effect of an enhancement on the system performance.
Specifically if the evaluator has already made measure-
ments for several changes in the architecture and has also
quantified the effect of each of those changes on the system
performance, it is possible to compare two systems, that
differ by all those changes plus an extra one, not yet
considered. As a result of the analysis, the effect of this
last change on the system performance can be quantified.
C. THE RISC/CISC CONTROVERSY
Because the problem being discussed is related to
computer architecture, there is a need for a concise state-
ment defining Computer Architecture as it is commonly
understood.
The adopted definition is the IEEE standard 729-1983
stating Computer Architecture as:
" The process of defining a collection of hardware and
software components and their interfaces to establish a
framework for the development of a computer system. "
In the published papers on RISC, several comparisons of
CISC and RISC examples were made.
The way these comparisons were done did not give any
insight, to the answers to the questions presented in the
previous section, or other similar questions.
The result is that now, no one knows for example, if the
performance of the RISC II is due primarily to its register
20
organization scheme, as some claim, or to the simplicity of
its instruction set, as others do.
Specifically,
1) If one wants to evaluate the effects of reducing the
instruction set, one might pick a CISC machine e. g.
the VAX-11 and consider the improvements due to ail
the instructions whose execution is equivalent to a
sequence of simpler instructions. For each of these
more complex instructions one could determine if the
execution is faster than the equivalent sequence. If
that is not the case, the instruction should be
discarded. If an improvement is seen, then consider
the cost of adding the instruction to the instruction
set.
2) If one wants to evaluate the effects of reducing the
number of addressing modes, one should consider:
- Why are they needed ?
- With which data types are they used ?
- What its the benefit brought by its addition.
3) If one wants to evaluate the effects of overlapped
register windows, one should test implementation of
overlapped windows on several systems and measure, as
a cost/benefit ratio, the effect of overlapped
windows on the system performance.
4) One cannot change more than one feature at a time and
hope to get an idea of what the effect of each
feature is on the system performance.
5) If one wants to do an evaluation using programs
written in a high-level language, one should state
that as a limiting factor. Since different compilers
generate different code, some compilers are better
rhan others and therefore make different contribu-
tions to the system performance. Furthermore, in the
case of compiler generated code, the frequency of
execution of each instruction in the system instruc-
tion set will be different for different high-level
languages. Besides, two different systems v/ith
distinct instruction sets do not necessarily have the
same best compiler.
6) If one wants to make some conclusive statement about
the advantages and disadvantages of the RISC archi-
tecture, one must separate the effects of features
that are orthogonal to the RISC philosophy.
The fact is that in the papers published on RISC' s,
almost all the comparisons made, involved systems with
different instruction sets, different addressing modes and a
different number of registers and registers organization
schemes. Furthermore compiler generated code was used
without considering the performance effects. These are the
reasons why no one can say whether the RISC architecture is
or is not better by itself.
21
In this situation, while the RISC proponents are
bringing some jargon to the architectural environment, those
against RISC are losing track of the possible benefits
present in the RISC philosophy.
D. AN EXAMPLE
As an example, let us pick a common CISC processor, the
MC68000 and consider its addressing modes.
The MC58000 has six basic types of addressing modes,
namely:
1) REGISTER DIRECT - The effective address is the
register designation field in the instruction.
EA = Rn
2) ABSOLUTE - The effective address is that given in the
instruction field itself and it is used directly
without modification
EA = INSTRUCTION FIELD
3) REGISTER INDIRECT - The effective address is the
contents of the designated register
EA = ( Rn )
4) IMMEDIATE - The operand is part of the instruction
itself and no further addressing is needed
5) PROGRAM COUNTER RELATIVE - The effective address is
computed by taking the value in the program counter
register and adding or subtracting an offset value
EA = PC + OFFSET
or
EA = PC - OFFSET
5) IMPLIED - The operand is in a register designated by
the mnemonic of the instruction.
The uses of each addressing mode depends on the
programmer.
Until now, the philosophy present in the design process
was to give the maximum versatility possible to the
programmer, so that he or she could choose the address mode
better suited to his or her needs. The rise of the RISC
architecture brings some questions regarding the correctness
of this philosophy.
22
In order to answer these questions, there is a need to
have a correct method for the evaluation of a system
performance. Together with the evaluation method there are
some points that have to be considered when deciding how
many addressing modes to include in the system instruction
set and how long each addressing mode should be.
The considerations are to:
1) reduce the storage requirements per program
2) reduce the number of bits that must be moved between
processor and memory to execute a program, i. e.
,
reduce the bandwidth requirements on the bus
3) reduce the average length of an instruction, i.e.
,
reduce the required width of the instruction bus.
There is a trade-off between the number of instructions
needed for the system to execute a program and the average
instruction size.
The decision regarding the number of addressing modes to
include is also very much dependent on the application, on
the data types, on the operations involved, on the use of
nested procedures, and how the parameter passing operation
is accomplished between procedures.
Although not considered here, the addressing problem is
also very much related to schemes of memory protection where
one wants to forbid the regular user program from accessing
some part of memory.
Besides how each one of the addressing modes is used, it
is also important to consider the frequency with which each
addressing mode is used.
Not much material is available regarding the usage of
addressing modes. As an example, consider again the
addressing modes of the MC68000.
1) REGISTER DIRECT
Since the operand is, in this case, in a register, no
memory accesses are involved. This provides some speed
advantages when used for operating on frequently-accessed
variables. For infrequently-accessed variables it would not
be used because the number of registers available on-chip is
usually very small.
2) ABSOLUTE
A memory access cycle is involved in absolute
addressing, because the operand is in memory. For this
reason it is not as fast as the previous mode.
Absolute addressing does not have much versatility
because the instruction address field is constant and the
operand must reference a fixed location in memory.
Nevertheless, it is simple. Because no alteration on the
address field of the instruction is performed, absolute
addressing is an efficient mode to use when the operand is
within the range of the instruction.
3) REGISTER INDIRECT
In the register indirect mode, one register access plus
one memory access cycle are involved because the register
holds the operand address and not the operand itself.
The register indirect approach is used when the address
of the operand has just been calculated. It provides
address-range extension, and in fact this extension
increases with the difference between the size of the
instruction address field and the size of the specified
register.
4) IMMEDIATE
Immediate addressing is the fastest way of addressing,
although it is limited by the instruction size. No addi-
tional memory accesses are needed since the operand is
within the instruction itself. Since programs are not self-
modifying it is used only for predefined values constants.
5) PROGRAM COUNTER RELATIVE
The major advantage of relative addressing is that it
allows the generation of position independent code because
the location referenced is always fixed relative to the
24
program counter. The importance of this fact is very much
dependent on the memory management scheme adopted in the
system.
In addition to the regular memory access, an addition or
subtraction must also be executed. It is used in relative
jump instructions e. g. , to set up loops or to set up parame-
ters to be passed to a subroutine.
5) IMPLIED
Implied addressing is equivalent to the register direct
addressing. However, implied addressing restricts the
opcode to the predetermined register specified by the design
of the opcode and the design of the processor.
E. SUGGESTED APPROACH
It is not feasible to build a new system each time a
single architectural feature is changed, in order to eval-
uate its effects on system performance.
As a result, there is then need for a model.
This model should be clear, complete, and able to
reflect the interrelations that exist between the different
components. The model should also be applicable to any
computer system, i. e. , the model should be general.
The model should reflect the performance effects of any






In the method suggested for computer performance evalua-
tion, a comparison is made between a reference system and
the same system with some change. The reference system is
the computer system for which it is desired to determine the
impact of each architectural enhancement. The result of
this comparison will then constitute a measure of the
25
performance effects of the particular change. The concep-
tual view of the system used in the model is illustrated in
Figure 3. 1.
Figure 3. 1 Conceptual View.
Four entities are considered:
1) The Application, any evaluation will only be valid
for a certain application, nor for any application
2) The System being considered
3) The System Instruction Set
4) The Performance, as the object of the evaluation
process.
The instruction set constitutes the central point of the
conceptual view. The application uses it. The system
supports it. The best match will necessarily give the best
performance.
The application is characterized by a set of tasks that
must be performed. Each task is performed with a different
frequency. For each task a program must be written, so that
26
one task is mapped into one program. Each one of these
programs executes in a different time.
The weight of each task or its representation in the
application is then the product of the frequency of its
execution and the corresponding program execution time.
The effects of the application on the system performance
are the frequency of execution of each instruction in the
system instruction set. This together with the average
execution time of the programs of interest will ultimately
lead to a " typical " program of the application.
The system supports an instruction set in two ways: one
by the execution time of each instruction and the other by
the complexity of the control unit necessary to implement
the instruction set.
An instruction set is desired that allows for the
writing of programs with a minimum execution time, but also





In this chapter a detailed analysis of the model for
computer performance evaluation is introduced. As described
in the previous chapter the model is divided into two parts.
In the first part, the model considers a timing analysis. In
this analysis the application determines the dynamic
frequency of execution of each instruction present in the
system instruction set and finally the system architectural
characteristics determine the execution time of each
instruction.
In the second part of the model, which follows in the
next chapter, the model considers the relation between the
application and the control unit necessary to implement the
system instruction set. From this relation a performance
figure is obtained.
Any .architectural feature will have consequences both in
the execution time of each instruction and in the complexity
of the control.
As has already been mentioned the first part of the
model is a timing measure. It will consider the execution
time of the specified application's " typical " program.
Several factors contribute to the execution time of a
program and not all of them are part of the computer archi-
tecture. Some have depend on the implementation of the
system.
The implementation is very much related to the tech-
nology chosen. The technology will determine, for example,
the maximum clock rate obtainable and the number of computer
components to be placed on chip.
Two factors have a great impact on the system perform-
ance, they are the clock rate and the average memory access
28
time. Also the number of components on chip is an important
factor, since one of the most time consuming operations is
to transmit data from one place to another. For example by
being able to have more registers on chip, one might be able
to reduce the average operand access time and therefore
speed up the computer operation. If one considers the
storage registers as part of the system memory then one can
see that the average memory access time is reduced.
In the suggested approach to computer performance evalu-
ation, the main concern is architectural features and not
implementation restrictions due to technology limitations.
The reason for this is that a method to evaluate computer
performance should be general and therefore be able to
survive constant technological change.
B. THE COMPUTER SYSTEM
Any computer system architecture is made of hardware and
software tools. In the area of software, an important factor
is the operating system.
For the sake of simplicity, and since in fact the oper-
ating system is also a program that has to be run on the
system, it can be considered as part of the application in
the computer performance evaluation process.
If the operating system is not considered as part of the
application software there would be a need to track all
calls to the operating system, measure the time the system
takes to execute the correspondent subroutines and subtract
this from the program execution time.




iv) the I/O interfaces
v) glue circuits
29
The processor consists of the portions of the computer
made up of the control unit, the arithmetic logic unit, the
general purpose registers and the busses that connect all of
these.
The memory consists of all the parts of a computer used
for either temporary or permanent storage, for instructions
or for data. The busses are a collection of signal lines
with multiple sources and multiple sinks. They provide for
the intercommunication capability among the other computer
components. The I/O interfaces are the parts of the
computer through which the system communicates with the
outside world.
In order for the overall system to have a good perform-
ance, it is desired to balance the average work done by each
component per unit of time. Since each computer component
has a different function, the work done by each is different
from the others. It is this work that has to be character-
ized, so that an understanding of how to maximize it, is
possible.
One requirement is that the idle time for each component
should be as low as possible. For example the processor
should be in an idle state for a data element stored in
memory as little as possible.
1. Memory and I/O Interface
Both memory and I/O interface can be considered
together, since both are communication media. Memory
performs a communication between two instants in time. I/O
interfaces perform a communication between the computer
system and the outside world.
For both memory and I/O the work is characterized by
how long it takes to correctly receive a unit of information
from the bus and how long it takes to correctly place the
same unit of information on the bus. This unit of informa-
tion will be the same in the case of instructions and data.
This unit of information is then one bit.
30
For both memory and I/O, the measure of their
performance is the number of bits that are received or
transmited per unit of time. This is in fact no more than a
bandwidth in units of bits per second.
For example, a memory unit with a word size of
sixteen bits and an access time of two microseconds performs
the same work as another memory with a word size of thirty
two bits and an access time of four microseconds.
2. The Busses
The function of a bus is to pass information from a
computer component acting as a source to other components
acting as sinks. The memory and I/O interfaces are also
communication media that treat data and instructions in the
same way.
The nature of these signals has no influence on the
characterization of the bus work or the efficiency with
which the bus preforms its work.
The bus work is characterized by:
i) the number of active sources at a time, here
assumed to be one
ii) the number of active sinks
iii) the number of signal lines, i.e. , the bus width
iv) the bus cycle time
As its function is to be a communication medium, the
bus work is measured by a bandwidth in units of bits per
second.
The particular bus bandwidth will be given by:
SCT (^-2)
where
SI - is the number of active sinks
31
WI - is the bus width
BCT - is the bus cycle time
3. The Processor
After receiving data and/or instructions from the
bus, the processor alters this data according to the
sequence of instructions and then delivers the final results
back to the bus.
While the previous computer components treat data
and instructions in the same manner, this is not true for
the processor case. In this case, instructions specify the
operations that have to be performed, and the data consti-
tutes the object on which the operations are performed.
The structure of the processor, i. e. , the specific
configuration of each element is dependent on the instruc-
tion set and on the data types involved. The instruction
set configuration makes requirements on the processor,
because the instruction set is intimately related to the
processor control unit and the datapath.
The data types involved in an application should be
supported by the processor. If, for example, a lot of array
manipulation is done, then it is to be expected that the
system considers some parallel operation capability.
In addition to the data types, the instruction set
is also dependent on the application. Therefore the
processor structure is also dependent on the application.
C. THE APPLICATION
An application is characterized in the same way indepen-
dent of the computer system being evaluated. It is charac-
terized by a certain number of tasks that have to be done.
Each task is executed with a certain frequency. For each
task and for each system there will correspond a program
written with that system instruction set.
The frequency of execution of each task is given by the
number of times (n), that this task is executed in a sample
32
of N tasks. So the frequency of execution of each task is
nothing more than the probability of this task being in
execution at any given time.
where
^L - is the frequency of execution of task i
W - number of times the task i was executed in a
big sample
M - total number of tasks that were executed in
that sample
For each task there is a corresponding computer program.
This program will take some time to execute.
The weight of each task or its representation in the
application will be given by the product of its execution
frequency and its program execution time in the system under
study. .
^i = ^e - i; (A.A)
where
W," - weight of the task i in the particular appli-
cation and for the system in study
It - execution time of the correspondent program
By this it is seen that the weight of the task is both
dependent on the application choice and on the system
choice.
A program is a sequence of instructions. Its execution
time can be divided into smaller pieces where only one
instruction is executed. In this way the program execution
time is given by a sum of products. Each element of the sum
will be referred to a single instruction, and consists of
33
the product of the instruction execution time and the number
of times each instruction is executed.
Therefore each element of the sum will be given by:
where
K/^- - is the number of times that the instruction j
is executed for the particular program
XXT. - execution time of instruction j




- the weight of instruction j in the system
instruction set and for the particular task
J - the total number of instructions in the
system instruction set
Finally, the weight of the application for the system
under study will be given by the weighted sum of its tasks.
So,
X
H-. L V/; (£,.7)
t. = t
but since







W^= 1_ ^,-kT- i^.e)
!.= >
J-
T( = Zl S • (_u)
Sj . KJj . IXT^ (.s)
D. THE PERFORMANCE
A comparison is made between the weights that an appli-
cation has in two different systems. In this chapter, where
a timing analysis is done, the weight of an application
involves the execution time of each instruction and the
dynamic frequency of execution of the same instructions.




VJ(J - is the weight of the particular application
for the reference system
W^ - is the weight of the same application for the
system being considered
Note that the two systems either have two different
instruction sets or the time of execution of each instruc-










Z ^' Z. ^^ ^^\
where
I - is the total number of tasks in the particular
application. It is the same as the number of
programs.
J - is the total number of instructions in the
reference system instruction set
K - is the total number of instructions in the
system in study instruction set
36
Considered in this way the measure of the performance
for a system is better the larger the ratio.
E. A SPECIAL CASE AND THE RISC
If the application involves only one task and therefore
only one program, the performance would be given by.
L M . XxT-
u ^'^-^
4&X
Let us now consider the RISC philosophy. For this case
the value of J is fixed.
The RISC proponents advocate that by reducing the total
number of instructions in the instruction set i. e. , by
reducing the value of K, the performance of the system
inceases. They also advocate that the instruction execution
time for each instruction is reduced by having a simpler,
more straightforward machine with better performance.
Their argument is that the value of the denominator is
reduced because the two previous factors compensate for the
necessary increase in the number of times each instruction
is executed. By reducing the denominator the system will
have a better performance.
F. THE SYSTEM ARCHITECTURE AMD TIMING
As has just been seen, the particular choice of applica-
tion determines the dynamic frequency of execution of each
instruction in the instruction set. To continue the study,
there is now a need to analyze how the system architectural
characteristics influence the system performance.
37
The system structure and its instruction set are neces-
sarily related. For every instruction, the system has to
have the necessary support in terms of the control unit and
the datapath. Also, any new enhancement to the system
architecture will affect the execution time of one or more
instructions. Therefore it will always affect the average
instruction execution time.
The model under discussion considers that each instruc-
tion has a certain associated weight, this weight being
dependent on the application and on the system architecture.
The application determines the number of times each instruc-
tion is executed, i.e., the dynamic frequency of execution
of the instruction. The system architecture determines the
execution time of each instruction. It is this execution
time that will now be studied.
We define the Life Cycle of an instruction (LC) as the
time period beginning at the instant the instruction is
first fetched from memory and ending at the instant the
final results produced by the operation are stored back in
memory.
The instruction execution time will then be some portion
of its time life cycle. This portion will be dependent on
the system architectural characteristics such as pipelining,
parallel processing, instruction prefetching, instruction
queue, etc.
The main phases through which an instruction has to pass
in its life cycle are:
i ) Fetching
ii) Execution
The time the system takes to fetch an instruction is
dependent on the instruction bus width, the instruction
length and the bus cycle time in the following way:
± TroS.TRo CTvoM LEmC^TH
, ^ // ,,\
38
This value for the fetch time will be an average, more
or less rigorous, depending on:
i) instruction size ( fixed or variable )
ii) the availability of the instruction queue
Not all the instructions have the same structure, but
nevertheless, all of the instructions accomplish some trans-
formation on some data. The data might be one or more oper-
ands and the final result in the case of an arithmetic
instruction, or the data might be the contents of the
program counter in the case of a branch.
In order for the system to be able to accomplish the
transformation required by the instruction, it has to:
1) decode the instruction
2) locate the data ( e. g. , addressing modes )
3) place the data in a convenient location to be
transformed, if it is not there already
4) perform the transformation asked for by the
instruction
5) relocate the data in a convenient location.
Whether these phases are performed in a sequential
fashion or in parallel depends on the system architecture.
For example, suppose that the instructions followed a fixed
format with separate and predefined fields for OPCODE and
ADDRESSSING. Then it would be possible to decode the
instruction and the address field simultaneously.
In order for the system to process the addressing mode
and depending on the particular address mode, it may have to
do one or more of the following:
preform data transfers either register-to-
register or memory-to-register;
preform some addition e. g. , in the case of base
addressing, index addressing or branch
addressing;
preform some multiplication e. g. , in the case of
the VAX- 11 index mode.
For the sake of simplicity one could consider all the
data transfers that have to be done while the system
39
executes a program and determine an average time for data
transfer.
Typically if the system has on-chip registers, cache
memory and main memory, the value for the average data
transfer time will be:
\t - — C^AT) ^ ^ (CAT) + -^ CV\A1) ik.\S)
where
R - number of register accesses
C - number of cache accesses
M - number of main memory accesses
T - total number of data transfers
RAT - register access time
CAT - cache access time
MAT - memory access time
and
T = R + C + M ik-\^)
In summary, in the instruction life cycle one has:
TF - fetching time
TDEC - decoding time
TLOC - locating data ( address mode )
TDATA - access data
TOP - perform the operation
TW - write the final results
If the system performs all of these time phases in a
sequential fashion so that there is no overlap, then the
instruction time life cycle will just be the summation of
all the time phases:
40
LCno = TF+TDEC+TLOC+TDATA+TOP+TW (no overlap) C^.H)
If some overlap among the phases is present, then the
instruction time life cycle will be some portion of the
previous value (no overlap case).
LCo = y * LCno (overlap case) C^.IJ)
where
y - is a coefficient that measures the efficiency
of the architectural scheme that accounts for
the overlap possibility. Its value will be
always between zero and one.
Some of the architectural characteristics that might
influence the value of " y "are:





- dual port memories
The architectural characteristics will also determine
the amount of overlap execution among different instruc-
tions. The efficiency of this overlap will then determine
what portion of the instruction time life cycle value will
be the instruction execution time (IXT).
IXT = w * LCo l^.l^)
where
IXT - instruction execution time
41
w - efficiency of the overlap among the time life
cycles of different instructions. Values
ranging from zero to one.
The value of w, that is the amount of overlap will be











In the previous chapters a timing analysis of the system
operation was presented. In it a study was made first of the
application effects on performance through the dynamic
frequency of execution of each instruction, and second of
the system architecture effects on performance through the
execution time of each instruction.
Finally to complete the model being suggested, one has
to consider the requirements that the instruction set poses
on the system in terms of the required control complexity.
These requirements will also be dependent on the
application.
This is also important since no matter what technology
is used in the system implementation, the number of
resources available on-chip will always be limited.
Typically the control unit is implemented using either
microcode or is hardwired e. g. , using programmable logic
arrays. Some of the factors that impact the choice are:
• instruction set complexity
• required control unit size
• possibility of future changes in the instruction set
• speed
The size of the control unit (i.e. , the number of gates
needed to implement the control unit) will determine the
space available on-chip for other components. In the case of
the RISC I and II the smaller control unit and therefore the
smaller power consumption, allowed the designers to add more
registers to the processor chip. With the choice of addi-
tional hardware for the processor , the designers in fact
reduce the average memory access time if one considers the
registers as also part of the system memory.
43
B. THE CONTROL UNIT AS A FINITE STATE MACHINE
The control unit of a computer system can be viewed as a
finite state machine, and therefore can be analyzed as such.
If analyzed in that way, the control unit operation can be
described by a state diagram. In its most simple and most
general case, the state diagram will typically have only two







Figure 5. 1 Simple Control Unit State Diagram.
In a more detailed analysis, the control unit state
diagram will have a tree like format where any vertical path
will correspond to the execution of an instruction, see
Figure 5. 2.
In this case, each and every instruction is identified
and each state although, still belonging to one of the two
major phases fetch and execute, will now correspond to a
microstep in the control unit output sequence while the
system is executing a program.
44
Figure 5. 2 More Detailed Control Unit State Diagram.
Of course this is complicated if the system is able to
deal with more than one instruction at a time. Nevertheless
the complexity of the controller can always be associated
with the number of states.
C. THE CONTROL UNIT COMPLEXITY
Not all the states will count in the same fashion since
there are states that will be common to more than one
instruction or vertical path.
45
The number of these shared states will depend both on
the processor instruction set itself and on the implementa-
tion choices made by the processor designer. For example, in
this last case the processor designer could make use of
microcode subroutines to be shared or called by more than
one instruction.
If states are shared among instructions, then there will
always be some trade-off between the total number of states
of the control unit and its speed. This tradeoff is due to
the fact that when states are shared among different
instructions, the control unit has to have some feedback
capability. The specific value of the feedback will force
the next state of the control unit, when the vertical paths
corresponding to the instructions will ultimately separate
themselves.
No matter what this feedback will be, it will always
have some cost related to it. The cost is the extra time it
takes for the values of the feedback signals to be valid.
Since the cost is time, it will be reflected in the average
instruction execution time, and so affect the performance of
the system in the portion the model described in the
previous chapter.
In this part of the model we focus on the comparisons of
two control units.
The complexity of a particular instruction will then be
dependent both on the number of states it has and on the
number of states which are shared by more than one
instruction.
The cost of adding a new instruction to a certain
processor instruction set is the number of new states that
have to be added to the control unit state diagram. The
addition of this instruction will have a cost on the system
performance that can be minimized by maximizing the number
of states necessary to its execution that are already in
existence in the control unit state diagram.
46
Returning to the control unit the number of states is
then dependent on:
i) the number of instructions
ii) the number of states that are common to more
than one vertical path (or instruction)
iii) the average height of each instruction
Where the height of one instruction is defined as the
number of states in its vertical path.
I>. THE APPLICATION AND THE CONTROL UNIT
In the previous chapter the instruction set and the
dynamic frequency of execution of each instruction together
with the instruction execution time were considered. Now
one wants to know how effective, the control unit is for the
application where the processor is being used.
It has already been seen that the complexity of the
control unit is related to the number of states. One knows
that a smaller and simpler control unit has. an effect on the
processor performance, because more space would be available
on-chip for other resources. One choice might be to add new
registers to the processor chip and thus try to decrease the
average memory access time.
One also wants to minimize the number of instructions
that are needed in order to perform a certain task, so one
has to go back to the application. An application is char-
acterized by a certain number of tasks that have to be done.
Each task is performed with a certain frequency. For each
task a program will have to be written using the instruction
set available. Each program corresponds to a sequence of
instructions used to perform the corresponding task.
Directly from the program it should be possible to
compute the static frequency of each instruction. But that
is not the only frequency that is of interest to the
performance evaluation process. The dynamic frequency of
execution is more important.
47
The two frequencies will be different for each instruc-
tion depending on:
i) program sequence
ii) conditional branches and the most frequent values of
• the variables on condition.
The execution of a program is then a sequence of several
instructions execution.
Since a single instruction corresponds to a vertical
path in the processor control unit state diagram, the execu-
tion of a program will then be an up and down walk on the
state diagram.
When comparing two control units, the one that would
have to execute fewer instructions, supposing that the
average height of an instruction would be the same for both
control units, will be the best. The height of an instruc-
tion is in fact a measure of what the RISC proponents call
the instruction complexity. Because it would be natural that
two different processors have instruction sets with
different values for the average height of an instruction,
the bottom line is that the comparison of two control units'
complexity cannot be done through the counting of instruc-
tions executed, but through the counting of the number of
states through which each control unit has to pass when the
system executes a typical application program.
It is to be expected that if one wants to add an
instruction to a processor instruction set, the control unit
will suffer by an expansion. For a hardwired implementation
e.g. , using PLA's these will have to grow; for a microcode
implementation typically there will be a need to increase
the size of the microcode memory. The amount of the control
unit expansion, will be dependent on the implementation, on
the instruction itself, and on the designer's choice
regarding the number of states that will be shared with
existing instructions. There is a relation between the
number of gates used in order to implement a controller and
48
the number of states present on the controller state
diagram.
Because there is a direct and individual relation
between the control unit states and the gates that compose
the control unit, and because one wishes to use each and
every one of these gates a similar number of times in order
to increase the overall efficiency, then for better effi-
ciency it is desirable that all states are used in a
balanced way. With some similarity one might say that the
efficiency of the use of an instruction set increases when
all the instructions in that instruction set tend to be used
an equal number of times.
An application has an indirect relation to the number of
states through which the control unit has to pass in order
for the system to execute the corresponding programs.
In the optimum case the control unit will have the
following characteristics:
i) minimum number of gates
ii) for the specific application all states will be used
in a balanced number of times
iii) no state exists that will never be used.
E. THE MODEL
Assume that a control unit has a total number of states
T. Associated with each state there will be a certain
number of gates. This number will be dependent on the imple-
mentation choice, either microcode or hardwired logic. Of
these T states, an application uses S states, and of these S
states some states will be used more than others.
The weight of the application is related to the number
of states through which the control unit has to pass in
order to execute the corresponding programs.
Each state has some weight associated with it. This
weight will be dependent on:
i) the number of times the state is used
ii) the num.ber of instructions that share the state
49
iii) the number of gates needed for implementing each
state.
The complexity of an instruction will be related to its
height, that is the number of states in the corresponding




Cj ^ L. VJj^ iSA)
where
C\
- complexity of the instruction j
Wj^ - weight of state h




(^ - number of gates per state ( implementation )
U^ - number of instructions to which the state is
common
The weight of an instruction will be the product of the
number of times the instruction is executed for a given
program times the instruction complexity.
That is
VJ^ . M^ « C^ C^.i)
where
^\ - number of times the instruction j is executed
As in the previous chapter, the weights of the task and
the application will be:
50
where
^i - weight of task i
V,' - frequency of task i for a certain application
X - number of instructions in the instruction set





1 r: L nJ; L A (s.o
'"
-^ '
,-. ^ J^.. Uj,
where
Wq .- weight of the application
T - number of tasks in the application of
interest
T - number of instructions in the processor
instruction set
H - height of each instruction
Similar to the timing analysis in the previous chapter,
the performance of the system under study will be given by:
Tlrf . Mi. C^-"')
where
'^a.
~ weight of the application for the reference
system




ta I9 r '"" ^i=' -t., UJL ,_,^.
K
r f. i: M. E ^
c- 1 M.~\
where
^ - number of tasks (programs) in the application
«J - number of instructions in the reference
system instruction set
K - number of instructions in the system under
study instruction set
U - height of instruction j in the reference
system instruction set
L - height of instruction k in the system under
study instruction set
N- - number of times instruction j is executed
while the reference system executes the
typical application program
1^^ - number of times the instruction k is executed
while the system under study executes the
same program
<^o - number of gates per state in the reference
system control unit
^1 - number of gates per state in the system under
study control unit
L/_^ - number of instructions that share state h in
the reference system control unit state
diagram
52
U^ - number of instructions that share state 1 in





As an example we will analyze the change in performance
of a particular application program when some floating point
capability is added to a processor which currently performs
fixed point arithmetic.
In this case study, the performance effects of the
program code sequence will not be considered. These effects




Specifically, the case consists in the possible addition
of a floating point multiply instruction to a processor
instruction set. The processor that was chosen was the
Motorola MC580Q0. The application for this evaluation is
the computation of a Fast Fourier Transform.
B. THE ADDITION OF AN INSTRUCTION
The addition of an instruction to the original instruc-
tion set has several consequences.
First of all if a hardwired controller is used the
processor's control unit must be expanded so that the
instruction is incorporated. The amount of the control unit
expansion is dependent on the number of new states that the
instruction under consideration will add to the control unit
state diagram and also on the control unit implementation.
In fact, one of the reasons to use microcode in the
implementation of an instruction set is due to the flexi-
bility it gives in any future changes of the instruction
set.
54
Second and depending on the operation performed by the
instruction, some hardware will have to be added to the
processor. The amount of hardware that will have to be added
to the processor is dependent both on the hardware that
already exists on-chip, that the instruction might use and
is dependent also on how fast one wants the instruction to
operate.
The addition of more hardware to the processor will
cause a rise in the power consumed by the processor. Due to
a limited power dissipation capability, the net effect of
the increase in the number of gates that constitute the
control unit and the datapath will be a reduction in the
size of existing processor components or a migration of some
off-chip, so that the power consumed by the processor stays
constant.
One choice might be to replace some of the registers
available on-chip by the hardware necessary for the new
instruction. By reducing the number of registers on-chip,
there will be a decrease in the ratio of register accesses
to the number of main memory accesses.
In the case of a Load/Store architecture such as the
RISC architecture, a reduction in the number of registers
will cause an increase in the dynamic frequency of execution
of LOAD and STORE instructions relative to the other
instructions.
In a traditional architecture, where the LOAD and STORE
instructions are not the only memory reference instructions,
the effect of reducing the number of on-chip registers is an
increase in the average instruction execution time because
the proportion of memory accesses to register accesses will
increase.
This increase in average instruction execution time will
cause an increase in the typical application's program
execution time. It is this increase in execution time, that
55
will have to be overcome by the addition of the new instruc-
tion to the processor instruction set, so that in fact the
program execution time might suffer a reduction rather than
an increase.
C. THE COST/GAIN TRADEOFF
The floating point multiply instruction after being
added to the processor instruction set, will replace the
sequence of instructions that the processor had to execute
every time a multiplication of two floating point numbers
was called for.
In order for the addition of the floating point multiply
instruction to be considered, the instruction has to pass
several tests. The first test requires the instruction
execution time to be smaller than the correspondent instruc-
tion sequence execution time.
If that is not the case, then there is no point in
adding the instruction to the processor instruction set.
So, consider:
Ini - execution time of the new instruction
Iseq - execution time of the corresponding sequence of
instructions
For the addition of the new instruction to be consid-
ered:
Ini < Iseq (6.1)
Assume then that in fact the above condition is true,
then
Iseq = Ini + Igain C^-2.)
or
56
Ini / Iseq = c L^-^)
where c < 1
For the sake of simplicity, consider that the applica-
tion of interest is composed of only one task. That is to
say that the effects on the processor performance will be
considered only within the context of a program.
The model suggested for computer performance evaluation
has two parts, a timing analysis and a control unit
complexity analysis. These two parts of the model will give
rise to two distinct criteria to which the addition of the
instruction will have to comply. So that the gain in the
processor performance that is obtained, will surpass the
reduction or cost in the processor performance due to the
requirements brought by the same instruction to the
processor architecture.
1. Timing Criterion
The timing model says that the effects of the addi-
tion of one instruction to the system instruction set, on
the system performance will be measured by:
T
L Mo4
?<f9 = il! __^ (4.i<)
J
7 KJa- La. X, NJiftew Lv\tL*/
vJ-»
where
J" - is the number of instructions on the original
system instruction set
57
N-^ - number of times that the instruction j is
executed before the addition of the new
instruction to the processor instruction set
Lj - execution time of the same instruction j on
the original system
^*j
- number of times that the instruction j is
executed after the addition of the new
instruction
La; - execution time of the instruction after the
addition of the new instruction
I^Ku*/ - number of times the new instruction is
executed
Lk«»j - execution time of the new instruction
The numerator is a measure of the execution time of
the application program before the addition of the instruc-
tion under consideration. The denominator is a measure of
the execution time of the application program after the
addition of the new instruction.
The sequence of instructions in the original
instruction set that implements the operation performed by
the new instruction is executed a number of times. This
number will be equal to Nnew.
The sequence execution time will consist of the





^uy - number of times that the instruction j of the
original instruction set is executed during











k I k\ vl
where
Mo. - number of times the instruction j of the
original instruction set is executed outside
the sequence.
For improvement in performance:
Perf > 1 C^.J)
This indicates that it is worthwhile to add the new









L- N^o- L- + Nwa^ Lioc, y Z— Kl<x. La; + Mao-j LwiW ("^.lo)
J
The right term of the inequality corresponds to the
increase in the application program execution time, that was
caused by the suppression of some hardware components of the
processor e. g. , some registers.
This increase, caused by an increase in the number
of instructions that have to be performed--case of the LOAD
and STORE instructions in a Load/Store architecture, or
caused by an increase on the average instruction execution
time--case of a traditional architecture.
Therefore
60
On the left term of equation 5. 1
,
Lseq - Lnew
represents the gain in execution time that was obtained by
substituting the sequence of original instructions by the
new instruction, each time the operation was performed.
So,
Lseq - Lnew = Timing Gains = Tgain (^•I'i)
Then,
Nnew Tgain > Tcost C^''^)
or
Nnew > Tcost / Tgain U.\^*)
Based on an timing analysis, it is only advantageous
to add the new instruction if:
1) Lseq > Lnew
and
(t.u)
2) Nnew > Tcost / Tgain U-H)
To put it in another way, the addition of an
instruction to a processor instruction set will only
increase performance if that instruction is executed a
61
sufficient number of times during the application programs
execution. The exact number of times the instruction must be
executed is given by the above criterion.
2. Control Unit Complexity Criterion
Concerning the analysis of the control unit
complexity one has:
J W
z <., r (^o
j*l ^ i... UJU
J H Wf44^
) Ma- ^ ^
-V
NJkcju) ? Uo
' X... LlA. JL.-^
^JL
?..f
= ir ""' "-^ (tA«)
Since the implementation of the control unit will be
the same and the implementation determines the value of GO,




As in the timing analysis one wants:
Perf > 1 i^-^^)
62
That is
As before, the execution of the sequence will
consist on the execution of several instructions, then
L Ni; 2- -L - H Mo. 2- -1-4 W^tw Z- M«v ^ ~ ^^-^^3
Then
U
Z_ N)o. Z_ -L + M^ew Z- NliM. ^ >




J ^'. U^ i... u^ J 4
" .II-





Ls - represents the gain in the number of states,
obtained each time the operation performed by
the instruction and/or the sequence is
executed.
Es - represents the cost in the number of states
due to the addition of the new instruction
Then
Nnew * Ls > Es (6.2S)
or
Nnew > Es / Ls (6.2 6 )
D. AN ILLUSTRATIVE EXAMPLE
An example is now presented to clarify the use of the
model suggested through the present and previous chapters/
The example quantizes the effects of adding a floating
point multiply instruction to an existing processor instruc-
tion set.
As has been previously stated, the values determined for
the increase or decrease on the system performance will only
be valid for a given application.
1. The Processor
The Motorola MC58000 is selected for this example.
The MC58000 is a widely known microprocessor that has a
simple instruction set offering no floating point support.
The MC68000 has a 15-bit data bus and a 32-bit
address bus. In addition to the Program Counter and Status
Registers, the MC58000 has seventeen 32-bit registers. These
registers are divided into two groups. -The first group.
64
composed of eight registers are general purpose data regis-
ters. The second group, composed of the remaining nine
registers is used mostly for handling addresses.
In total, there are fourteen addressing modes on the
MC68000, although they can be studied in six basic types.
These addressing modes are already described in chapter two
of this thesis.
The instruction set of the MC68000 consists of 56
basic instructions, having from zero to two addresses. Each
instruction can use several addressing modes. This fact
determines that the MC68000 does not follow a Load/Store
architecture.
The instruction set of the MC68000 supports five
basic types of data:
bits
bytes ( 8 bits)
words ( 16 bits)
longwords (32 bits)
Packed binary-coded decimal (BCD) with two digits per
byte
The input/output on the MC68000 is memory-mapped,
i. e. , all I/O interfaces share the address space with
memory.
Considering the implementation of the MC68000, it is
a single-chip VLSI HMOS processor with a typical clock rate
between 4 and 12 MHZ and with a typical memory access of 4
clock cycles.
2. The Application
For the application we choose a program that
computes a Fast Fourier Transform. This program was
obtained from ' The Fast Fourier Transform' by E. Gran
Brigham [ Ref . 4]. The program is written in Fortran. The
flowchart of the computation done by this program is on page
161 of the above reference. The program itself appears on
page 164 of the same book.
65
From the reading of the program, one can immediately
verify that some of the operations that are called for could
not be directly implemented with the MC68000 instruction
set.
For these operations it was necessary to use either
subroutines present in ' Microprocessor Systems, a 15-Bit
Approach' by William J. Eccles [ Ref . 5] or newly written
subroutines. The subroutines to handle floating point
numbers in the MC68000 came from Ref. 5.
The subroutines that were written are shown on
appendixes C and D, these subroutines compute the sine and
the cosine of an angle, according to an algorithm presented
in the ' Software Manual of the Elementary Functions' by
William J. Cody, J. R. and William Waite [Ref. 6: pp.
125-143]
.
The translated program for the Fast Fourier
Transform computation is shown on Appendixes A and B.
3. The Floating Point Representation
The floating point representation that was chosen is
the IEEE proposed standard for single precision. This stan-
dard determines a 32-bit long representation of a floating
point number, shown in Figure 5. 1.
s ExPouEf^r MAW TiiiA
=> 1 8 *? 31
Figure 6. 1 Floating Point Representation.
This standard has the following characteristics:
i) 32 bits are used
66
ii) radix of two
iii) the radix point before the first digit with assumed
one to the left
iv) mantissa
iv. a) sign position -
iv. b) value position - 9-31
iv. c) representation - normalized, sign/magnitude
v) exponent
V. a) sign position - no sign
V. b) value position - 1-8
V. c) representation - biased exponent, bias =
127(dec)
V. d) range of exponent - -126 to 127
vi ) range of floating point number - +- 5. 9*10**-39 to
+-1. 7*10**38
All the subroutines that handle the floating point
data and that were used obey to this standard, so does the
hardware necessary to implement the floating point multiply.
4. The Hardware Involved
The general structure of the hardware required for
the implementation of an additional floating point multiply
r
instruction in the MC68000 instruction set was obtained from
the 'Introduction to Computer Architecture' [ Ref . 7: p. 80]
and is shown on Figure 6.2.
The hardware consists of:
i) three 32-bit registers, these can be some of the
already existing data registers on the. MC68000,
ii) an 8-bit adder used for the exponent addition, that
could just be the adder already existing on -the
MC68000,
iii) a multiplier used for the mantissa multiplication,
iv) an exclusive-or gate for the product sign calcula-
tion,
v) a normalizer and converter
With the hardware structure that was chosen it is
possible to perform in parallel the determination of the
sign of the result, the addition of the two exponents, and














Figure 6. 2 General Hardware Structure for the
Floating Point Multiply Instruction.
The execution time of the floating point multiplica-
tion instruction will then be determined by the slowest of
these three distinct and parallel operations.
The sign computation involves just one exclusive-or
gate gate and therefore takes a maximum of one clock cycle.
The addition of the two exponents involves in fact
the addition of the two exponents, followed by the subtrac-
tion of the bias since this has also to be performed concur-
rently with the determination of exponent overflow or
underflow.
From [ Ref. 7] the addition of the contents of two
registers using the MC68000, takes 4 clock cycles to
complete. After this addition an extra clock cycle will be
taken for the determination of exponent overflow and under-
flow together with the subtraction of the extra bias.
68
Therefore it is concluded that the addition of the two expo-
nents will take a maximum of 5 clock cycles.
For the mantissas multiplication, a multiplier will
have to be added to the processor hardware. According to
"Digital Systems: Hardware Organization and Design by
Frederick J. Hill and Gerald R. Peterson ' [ Ref . 8] the
multiplier structure that gives the best cost/performance
tradeoff in terms of the hardware involved and the time it
takes to perform a multiplication is a multiplier that uses
a carry-save adder. There a carry save adder type multi-
plier was chosen.
Also, according to [Ref. 8: p. 361] the time that a
carry-save adder takes to perform an N-bit multiplication
using a adder for which each addition/shift cycle takes two
clock cycles is given by:
Tmult = (N+l)Tc U.27)
where
Tc - is the clock cycle time
In the case being discussed the multiplication
involves two operands - the mantissas. Each mantissa is
24-bits long. Therefore according to the formula shown
above, the multiplication of the two mantissas will take 25
clock cycles. This makes the the multiplication the longest
operation involved.
Note that, the detection of a zero product can be
done concurrently with the multiplication, since a zero
product will happen only in the case where one of the oper-
ands is zero.
The normalization must still be done sequentially.
The normalization involves at most one left shift of the
mantissa product and a decrement of the product exponent.
69
There is only at most one shift, since the mantissas of both
operands are in normalized form and therefore their values
are between 0. 5 and 1. In the worst case, the two mantissas
are both 0.1 (binary) and so their product will be 0.01
(binary). In this case only one left shift is necessary in
order to normalize the mantissa of the product.
The normalization requirement that the standard
makes on the mantissa, also dictates that any overflow or
underflow of the exponent product does not have a possible
recovery.
In conclusion, the floating point multiply instruc-
tion with this hardware will take approximately 25 clocks to
complete.
The hardware that would have to be added to the
iyiC58000 would only consist cf the 24 bit carry-save adder,
the exclusive-or gate and some logic to determine overflow
or underflow of the exponent and a zero product.
All this hardware will be more or less equivalent to
two of the 32-bit registers existing on the MC68000. Say
then, that due to power dissipation limitation^ on the
MC68000 two of the 32-bit data registers would then be
removed from the MC68000, in order to add the additional
hardware necessary to implement the floating point multiply
instruction.
5. The Model
As stated previously, the addition of the instruc-
tion will have some costs. One of these costs has been
referred in the previous subsection, it is the removal of
two of the data registers.
As one might expect the removal of some of the
registers from the MC68000 will have an effect on the system
performance by reducing the number of registers accesses and
increasing the number of main memory accesses.
70
In the specific case of the application that is
being considered, this is not true because, at most, six of
the eight data registers are used at one time. Therefore,
for this specific case, the timing costs involved due to the
addition of the floating point multiply instruction will be
zero.
For each and every subroutine involved in this
application, the execution time of the subroutine was
computed following a worst case and a best case criteria.
The difference between the two execution time values for
each subroutine arises due to data dependencies on the
number of times each instruction is executed.
The execution times of each subroutine were then
combined, best with best and worst with worst, in order to
define two boundary lines for the final execution time of
the whole program.
For the specific case of the floating point multiply
subroutine, the smallest execution time corresponds to a
multiplication of two floating point numbers where one of
them is zero. The longest execution time for the same
subroutine corresponds to the multiplication of two numbers
where an exponent underflow occurred after the normalization
step. Here, for the same reason as before, the normaliza-
tion requires at most one left shift.
Specifically, the values obtained for the execution
times of each subroutine are shown in Table I in terms of
clock cycles.
For the whole program the execution time will be
dependent on the values of the data and on the number of
entry points (N) to the Fast Fourier Transform computation.
The values obtained in terms of clock cycles and number of
required floating point multiplies are shown in Table II.
The best case and the worst case execution of a
floating point multiply subroutine takes respectively 203
71
TABLE I
EXECUTION TIME OF EACH SUBROUTINE
IN FAST FOURIER TRANSFORM PROGRAM










APPLICATION PROGRAM EXECUTION TIME
N BEST CASE WORST CASE
16 572482+352MULTFP 1899074+736MULTFP
32 1418194+880MULTFP 4734674+ 1840MULTFP
64 3484658+2 112MULTFP 114442 10+4416MULTFP
128 8198594+4928MULTFP 26770882+ 10304MULTFP
256 18901458+ 11264MULTFP 61352402+23552MULTFP
512 42902562+25344MULTFP 138417186+52992MULTFP
1024 9618622 6+56320MULTFP 308440946+ 117760MULTFP
2048 2 13497794+ 123904MULTFP 680458178+259072MULTFP
4096 469394450+2 7033 6MULTFP 14882 17 106+565248MULTFP
and 640 clock cycles to execute. For a clock rate of 10 MHZ,
the program execution time before the addition of the new
instruction will be is in Table III.
72
TABLE III
FFT PROGRAM EXECUTION TIME BEFORE THE ADDITION







15 0.054 0. 234
32 0. 150 0. 584
54 0. 391 1. 411
128 0. 920 3. 299
255 2. 119 7. 558
512 4. 805 17. 042
1024 10. 752 37. 957
2048 23. 855 83.594
4095 52. 427 182. 953
For the same clock rate, the program execution time
after the addition of the floating point multiply instruc-
tion is shown in Table IV.
The best case is the one where the implementation of
the floating point multiply offers less gain.
For the best case
Tgain = 203 - 25 = 177 clock cycles
For the worst case
Tgain = 504 - 25 = 578 clock cycles
As already explained, for both cases Tcost is zero.
This is due to the fact that in the particular application
program two of the general purpose data registers are never
used. In the case that all general purpose data registers
were used in the application program this would not be true.
If this happened then there would be an increase in the
ratio of the number of register accesses to the number of
73
TABLE IV
FFT PROGRAM EXECUTION TIME AFTER THE ADDITION







16 0. 058 0. 192
32 0. 144 • 0.478
64 0. 354 1. 156
128 0. 833 2. 704
256 1.919 6. 196
512 4. 356 13. 979
1024 9. 765 31. 150
2048 21.672 68. 719
4096 47. 642 150. 291
main memory accesses, causing an increase on the average
operand access time and an increase on the average instruc-
tion execution time.
Using the formula for the model regarding the timing
analysis the performance effects of the addition of the
floating point multiply instruction come as shown in Table
V.
From these results one can see that the improvement
on the MC68000 performance due to the addition of the
floating point multiply instruction for this specific appli-
cation varies between ten and twenty percent and is




PERFORMANCE EFFECTS OF THE ADDITION OF THE
FLOATING POINT MULTIPLY INSTRUCTION
N BEST CASE WORST CASE
Perf Perf
16 1. 11 1. 22
32 1. 11 1. 22
64 1. 11 1.22-
128 1. 11 1. 22
256 1. 10 1. 22
512 1. 10 1. 22
1024 1. 10 1. 22
2048 1. 10 1. 22
4096
.
1. 10 1. 22
75
.VII. CONCLUSIONS
This thesis began by making an identification and char-
acterization of a new and controversial type of computer
architecture called RISC for Reduced Instruction Set
Computers. The rise of this new computer architecture and
the discussions that followed regarding its performance,
when RISC machines are compared with CISC machines, has
shown the need for an appropriate tool to evaluate computer
performance from an architectural point of view.
This thesis suggests a model to be used by computer
architects to determine the performance effects of an
enhancement to a computer architecture. The computer evalu-
ation process is important, since it generates have a quan-
tified perception of the influences that each enhancement to
the system architecture will have on the system performance.
The availability of a model to do computer performance eval-
uation is therefore essential in the decision-making process
for determining which architectural features a system should
have to optimize its performance for a certain application.
To develop this model for the evaluation of computer
performance, a conceptual view of what determines the system
performance was formed. It is the author's opinion that the
performance of a system results from the quality of the
match between a particular application requirement and the
architectural characteristics of the system. This match is
done through the customization of the system instruction
set.
The model that is suggested is divided into two parts.
The first part makes a quantification of the effects that an
architectural enhancement to the system has in the execution
time of a "typical" application program. The second part of
the model compares the efficiency of the design of two
75
systems control units. In both parts the model considers
that the application determines the number of times each
instruction of the system instruction set is executed.
For the first part, the system architecture determines
the execution time of each instruction. For the second part,
the system architecture determines the number of states
through which the system control unit will have to pass
during the execution of the application program(s).
Finally, an example on how to use the model, in order to
determine what are the costs and benefits of adding an
instruction to a processor instruction set for a particular
application, is given.
The program that was used to apply the model is a bit
misleading in the quantification of the cost/benefit ratio
of the enhancement. This is due to the fact that in opposi-
tion to what should be expected, the program does not use
all the system architectural resources and so, even before
the addition of the new, instruction does not optimize the
system performance. If that were not the case and the
program was an optimal one for the application of interest
and for the processor chosen, then, surely, the enhancement
to the system architecture would have some costs.
In any event and even considering that the example is a
bit misleading, the author arrived at two criteria, each one
derived from one of the parts of the model, for which the
addition of an instruction to a system instruction set has
to obey so that the performance of the system for the
particular application is increased.
These two criteria will be applied if the new instruc-
tion execution time is smaller than the execution time of
the sequence of instructions that implemented the function
before the addition of the new instruction to the system.
For the first part of the model the criterion for the
addition of the new instruction, states that:
77
Nnew > Tcost / Tgain
where
Nnew - is the number of times the new instruction
is executed for the particular application
Tgain - is the difference in the execution times
of the sequence of instructions that had
to be executed by the system every time
the operation was performed before the
addition of the new instruction and the
execution time of the new instruction.
Tcost - is the increase in the application program
execution time that was caused by the
suppression of some hardware components of
the processor
For the second part of the model, the criterion for the
addition of the new instruction, states that:
Nnew > Es / Ls
where
Ls - represents the gain in the number of control
unit states, obtained each time the operation
performed by the the instruction and/or the
sequence is executed.
Es - represents the cost in the number of states
due to the addition of the new instruction to
the system instruction set.
The two parts of the model need to be thoroughly checked













LOOPl BEQ. S 100
102 MOVE. N2,D1
L00P2 BEQ. S 101
MOVE. NUl , D2
MOVE. K,D3




























DO 101 1=1, N2
P=IBITR( K/2**NU1,NU)
ARC = 6. 283185*P/FL0AT(N)
convert N to float, point













































































; convert P to float, point
clear D4 except exponent
D4 <-- FLOAT(P)
store FPP
A2 points to Floating Point
Working Register
















































































( A2 ) , ( Al
2( A2) ,2( Al)
S,AG
GETFP













FPWR <-- partial TREAL
FPACC <-- TREAL
store TREAL

















































































FPWR <-- partial TIMAG
FPACC <-- TIMAG
store TIMAG
XREAL( K1N2 ) =XREAL( Kl ) -TREAL





FPACC <-- XREAL(Kl) -TREAL
store
XIMAG( K1N2 ) =XIMAG( Kl )
-
-TIMAG

















































XREAL( Kl ) =XREAL( Kl ) +
+TREAL




FPACC <-- final XREAL(Kl)
store
XIMAG( Kl ) =XIMAG( Kl ) +
+TIMAG
change sign of -TIMAG
FPWR <-- TIMAG
FPACC <-- FPWR
FPWR <-- partial XIMAG( Kl
)












CMP. W N,D1 ;IF (K. LT.N) GO TO
BMI 102
CLR. W K ;K=0
SUBI. W #1,NU1 ;NU1=NU1-1




MOVE. #1,D1 ;D0 103 K=1,N
BEQ. S 103
























D1,A3 A3 --> XREAL(K)
ADDA. D1,A5 A5 --> XIMAG( K)
ADDA. D2,A4 A4 --> XREAL( I)
ADDA. D2,A6 A6 --> XIMAG( I)




MOVE. L (A5),TIMAG TIMAG=XIMAG(K)
MOVE. L (A4),(A3) XREAL(K)=XREAL( I)
MOVE. L (A6),(A5) XIMAG(K)=XIMAG( I)
MOVE. L TREAL,(A4) XREAL( I)=TREAL













D0-D3,-(A7) ; save registers
MOVE.
W
J, Jl ; J1=J
CLR. W IBIT ; IBITR=0
MOVE. NU,DO ;D0 200 1=1, NU
BEQ. S 2000
MOVE. J1,D1 ; J2=Jl/2
ASR. W #1,D1
.




SUB. W D2,D3 ;D2 <-- (J1-2*J2)
ASL IBIT
ADD.W D3, IBIT

























#-l,SGN SGN <-- -1
BCHG #bit,DO DO <-- -DO
BRA 200
MOVE. #1,SGN •SGN <-- 1
MOVE. DO,Y •Y <-- DO
CMP. L YMAX^DO lYMAX - DO
BPL 300
error message
MOVEA. L Y,AO ;A0 --> Y
JSR GETFP . FPWR <-- Y
MOVE. 1/PI,(A1) ; FPACC <-- inverse of pi
MOVE. 1/PI,2(A1)
JSR MULTFP iFPACC <-- Y/P
I
MOVEA. L Y/P I , AO ;A0 --> Y/P
I
JSR STEP • store Y/PI
MOVE. Y/PI,D1 •Dl <-- Y/PI
MOVE. D1,D2
ANDI. mask,Dl ;D1 <-- mantissa
BSET #bit,Dl ; insert hidden bit
LSR #7,D2 hi D2 has exponent
SWAP D2 • lo D2 has exponent
SUBI. #127, D2 ; extract bias




































MOVEA. 1 X 1 , AO
JSR GETFP
JSR ADDFP













shift left mantissa by
exponent value, max = 8
leave only integer part
mantissa in lo Dl
N <-- integer of mantissa
XN <-- FLOAT(N)
N even ?
if even do nothing
otherwise
















































































, ( A2 )




























FPACC <-- ( r4*G+r3)*G+r2
FPWR <-- G
FPACC <-- ( ( )*G+r2)*G
FPWR <-- rl




































test value of SGN
if positive do nothing
otherwise
change sign of result
restore registers

















AND I mask, 1 X| clear sign bit
MOVEA. |X|,AO ;FPWR <-- |X|
JSR GETFP
MOVE. PI/2,(A1) •FPACC <-- PI/2
MOVE. PI/2,2(A1)
JSR ADDFP •FPACC <-- IXl+PI/2
MOVEA. Y,AO store Y
JSR STEP
MOVE. Y,DO DO <-- Y
CMP. L YMAX,DO YMAX - DO
BPL 100
error message
100 MOVEA. Y,AO •AO --> Y
JSR GETFP FPWR <-- Y
MOVE. 1/PI,(A1) FPACC <-- inverse of pi
MOVE. 1/PI,2(A1)
JSR MULTFP FPACC <-- Y/P
I
MOVEA. L Y/P I , AG AO --> Y/PI
JSR STEP store Y/PI
MOVE. Y/PI,D1 Dl <-- Y/PI
MOVE. D1,D2
ANDI. mask,Dl , Dl <-- mantissa
BSET #bit,Dl insert hidden bit
LSR #7,D2 hi D2 has exponent
SWAP D2 lo D2 has exponent
SUBI. #127, D2 extract bias







































MOVEA. 1 X 1 , AO
JSR GETFP
JSR ADDFP
MOVEA. TEMP , AO
JSR STEP
MOVEA. L XN , AO
JSR GETFP
clear N
if zero go to 400
N <-- 1
shift left mantissa by
exponent value, max = 8
leave only integer part
mantissa in lo Dl
N <-- integer of mantissa
XN <-- FLOAT(N)
N even ?
if even do nothing
otherwise



































MOVEA. F,AO ;FPWR <-- F
JSR GETFP
MOVE. ( A2 ) , ( Al
)
; FPACC <-- F
MOVE. 2(A2),2(A1)
JSR MULTFP ; FPACC <-- F*F
;G = F*F
MOVE. (A1),(A2) ;FPWR <-- G
MOVE. 2(A1),2(A2)
MOVE. R4,(A1) •FPACC <-- r4
MOVE. R4,2(A1)
JSR MULTFP •FPACC <-- r4*G
MOVEA. G,AO • store G
JSR STEP
MOVE. R3
, ( A2 ) •FPWR <-- r3
MOVE. R3,2( A2)
JSR ADDFP FPACC <-- r4*G+r3
MOVEA. G,AQ FPWR <-- G
JSR GETFP
JSR MULTFP FPACC <-- ( r4*G+r
MOVE. R2






















































FPACC <-- ( r4*G+r3)*G+r2
FPWR <-- G
FPACC <-- ( ( )*G+r2)*G
FPWR <-- rl








test value of SGN
if positive do nothing
otherwise
change sign of result
restore registers
return to main program
94
LIST OF REFERENCES
1. Katevenis, Manolis H.
,
Reduced Instruction Set
Computer Architectures for VLSI , Ph. D. THesis,
University of California at Berkeley, 1983.
2. Radin, George, "The 801 Minicomputer", IBM Journal of
Research and Development , Volume 27 Number 3"^ May
1983.
3. Stanford University Computer Systems Laboratory,
Technical Report 223, MIPS : A VLSI Processor
Architecture , by Hennessy, and others, November, 1981.
4.' Brigham, E. Oran, The Fast Fourier Transform ,
Prentice-Hall, 1974.
5. Eccles, William J. , Microprocessor Systems . a 16-bit
Approach , Addison-Wesley, 1985.
6. Cody Jr , William J. and Waite, William, Software
Manual for the Elementary Functions , Prentice-Hall,
1980. '
7. Stone, H. , and others. Introduction to Computer
Architecture , Science Research Associates, ~T98D^
8. Hill, Frederick J. and Peterson, Gerald R. , Digital





1. Defense Technical Information Center 2
Cameron Station
Alexandria, Virginia 22304-6145
2. Library, Code 0142 2
Naval Postgraduate School
Monterey, California 93943-5002








5. Dir. Serv. Instrucao e Treino 1




6. Manuel Pedrosa de Barros 4
Celula 5 Bloco 5 Lote D, 3 Direito
2795 Linda-a-Velha
Portugal
96





I
Thesis
B2403
2.1
Barros
The RISC architecture^^^
and computer perform-
ance evaluation.
7 FEB 87
4 NOV 87
3 2 069
32511
Thesis
B2403
c.l
21121?
Barros
The RISC architecture
and computer perform-
ance evaluation.

