Investigations in CPU design: a triple-instruction computer. by Chung, Wai-tung. & Chinese University of Hong Kong Graduate School. Division of Computer Science.
Investigations in CPU design:
A triple-instruction computer
in partial fulfillment of the requirement 
for the Degree of Master of Philosophy 
Wai-Tung Chung 
Supervisors Prof. T.C. Chen and Mr. KH. Lee 
Department of Computer Science 




Investigations in CPU design: 
A triple-instruction computer 
Master Thesis 
Wai-Tung Chung 
Supervisors Prof. T.C. Chen and Mr. KH. Lee 
Department of Computer Science 
The Chinese University of Hong Kong 
Abstract 
Although the vector architecture, the Long Instruction Word architecture and the superscalar 
architecture have each succeeded in different application areas, they have certain limitations. 
From the viewpoint of designing scientific machines, a new look at the conventional 
Long Instruction Word architecture is introduced: the Triple-Instruction Computer, which 
employs an unconventional instruction structure. A kind of smart superinstruction, crystallized 
from new ideas such as the self branch instruction, coupled operations and a new look at the 
dumb long instruction word, is introduced. Moreover, the idea of using a more powerful instruction 
as an atomic programming unit is also incorporated. 
A Triple-Instruction Computer simulator is built, and the results of the simulation study are 
analyzed and summarized. The performance of the new architecture, in terms of reducing the 
complexity of the assembly code, increasing the run-time efficiency, and minimizing the 
programming effort, is encouraging. 
Besides, the use of APL as the architectural simulation language is also a worthwhile 
experience for computer architecture investigation. Some implementation techniques, advantages 
and disadvantages of using APL are also presented. Comments on the CPU design process are 
provided as a summary. 
Acknowledgement 
I gratefully acknowledge the help of all the people who gave me support, either in 
academic life or in other areas, during my stay at the Chinese University of Hong Kong. 
Special thanks are given to those who helped me to finish this project: Prof. Tien-Chi Chen and 
Mr. Kin-Hong Lee, who are my supervisors and have given me information and warm 
support to finish this work; Mr. Alb Lam, Mr. Fan Law, Mr. Edward Ho and other members 
of M-Club, who helped in solving a lot of technical problems; Ms May Tang and Ms Carol 
Pang, who helped to proofread the thesis; and Mr. Lodau Chung, Mrs Mada Leung, Mr. Rex 
Lau, Mr. David Wong, Mr. Mike Mak, Ms Nerissa Ho, Mr. Kun Wu and Ms Ting Teng, who 
helped in preparing the typesetting of this thesis. 
Table of Contents 
1. Introduction 
1.1 Central Processing Unit innovation 
1.2 Long Instruction Word computer 
1.3 Prior attempts 
2. The new architecture 
2.1 The triple-instruction word 
2.2 Functional view of the architecture 
2.3 Inter-functional units synchronization 
2.4 Instruction set design 
2.5 Special features 
3. Simulation of the architecture 
3.1 Computer architecture simulation 
3.2 The simulation language used: APL 
3.3 Simulation environment 
3.4 Simulation design 
3.5 The micro-architecture 
3.6 Implementation details 
4. The supporting environment 
4.1 The environment 
4.2 The Pseudo-machine configuration 
4.3 Assembly language description 
4.4 Details of the utilities 
5. Evaluation 
5.1 Case Study 
5.2 Results and comparison 
5.3 Summary 
6. Discussion and conclusion 
6.1 The triple-instruction computer 
6.2 Use of APL for architectural simulation 
6.3 Further considerations 
7. References 
8. Appendix I: Program listing for the TIC simulator 
9. Appendix II: Screen dump of the simulation runs 
Chapter 1 Introduction 
Computer technology has made incredible progress in the past five decades. 
Innovation in computer design principles and breakthroughs in electronic components have 
created a rapid rate of improvement. This rapid improvement enables computer science to be one of 
the most fruitful and interesting areas in the coming decades. Since most recent 
designs are evolutionary, rethinking the design principles and working directions 
may bring encouraging successes. During the past five decades, the ideas 
of the stored program, pipelining, vector processing, multiprocessing and reduced instruction set 
computing have been good examples of architectural innovation in constructing high performance 
machines. 
Moreover, there have been dramatic changes in the cost structure of computers because 
of the steady development of VLSI. As Grosch's law states, the cost 
of computational power grows as the square root of the computational power. 
In the 1960s, hardware was so expensive that a user could not afford to 
purchase a 1 M-byte machine, and computers were therefore not affordable for many 
application areas. Today, a personal computer is obtainable for one thousand 
(US) dollars, and it has more storage capacity, higher reliability and faster speed than a 
mainframe computer bought in the 1960s for millions of dollars. Grosch's law not only predicts 
that future machines will have a better cost-performance ratio; it also raises a fundamental 
and critical question: if the electronic engineers continuously squeeze more and more circuit 
capacity into one silicon chip, can the computer architects fill this "extra room" with better 
designs and keep the overall performance growing in pace with the advances in technology? 
The answer is quite clear. As the software engineering experts revealed decades ago, 
software renovation is not, and will not be, able to keep pace with hardware 
renovation. It is straightforward to expect that, if we do not put extra effort into strengthening 
architectural renovation, another "hardware" crisis may occur several years later and 
the "expected" periodic cost-performance gain will no longer materialize. 
From the viewpoint of designing scientific machines, the LIW (Long Instruction 
Word) architecture is a remarkable alternative that makes good use of the available 
additional circuit capacity. In our research, a new look at the conventional LIW 
architecture is introduced - the Triple-Instruction Computer (TIC), an unconventional 
instruction structure aimed at maximizing the use of available hardware, increasing 
parallelism for extra performance, simplifying the synchronization between functional units and 
making the machine code more compact and understandable than conventional LIW 
code. 
1.1 Central Processing Unit innovation 
1.1.1 Classical von Neumann model 
In the 1940s, J. Presper Eckert, John Mauchly and John von Neumann developed the idea 
of stored-program devices and automatic electronic general-purpose calculating machines. 
This work crystallized in the world's first electronic general-purpose computer, 
called ENIAC (Electronic Numerical Integrator and Computer), and in a memo 
proposing a stored-program computer called EDVAC (Electronic Discrete Variable 
Automatic Computer). This memo has served as the basis for the commonly used term "von 
Neumann architecture". Since the 1960s, practically all machines have been designed using the von 
Neumann blueprint as the starting point, with changes made to circuit technology, altered 
storage, and added features. 
The register-level diagram of a classical von Neumann computer, namely a typical 
accumulator architecture machine, is shown in Figure 1.1. The CPU consists of two functional 
units: the instruction decoder (I-UNIT) and the arithmetic unit (A-UNIT). In short, machine 
instructions are fetched from main memory into the CPU one by one according to the content of 
the program counter and stored in the instruction register. The loaded machine instruction 
is then decoded and executed. 
When arithmetic or input/output instructions are being executed, the AC 
(accumulator), the MQ (an accumulator extension called the multiplier-quotient register) or the 
pair [AC, MQ], together with the memory data register (MDR), serve as temporary place holders. 
Since the purposes of AC, MQ, [AC, MQ] and MDR in this sort of architecture are different, there 
is no need to specify which of them is being referred to in the machine instruction, and hence 
we have the simplest form of instructions (Figure 1.2) [VON NEUMANN]. 
[Diagram: main memory and bus; instruction register (OP, ADDRESS), MAR and MDR; AC and MQ; I-UNIT and A-UNIT; I/O] 
Figure 1.1 A classical von Neumann computing model 
Mem1 = Mem2 + Mem3 
The corresponding instructions are: 
LOAD Mem2 (* load content of Mem2 to AC *) 
ADD Mem3 (* add MDR, content of Mem3, to AC *) 
STORE Mem1 (* store AC to Mem1 *) 
Figure 1.2 Machine instructions of accumulator architecture machine 
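The behaviour of these three instructions can be illustrated with a minimal sketch in Python. It is not part of the original design; the dictionary-based memory, the function name `run` and the sample values are assumptions made purely for illustration.

```python
# Minimal sketch of an accumulator machine. The memory layout, the
# function name and the sample values are illustrative assumptions.

def run(program, memory):
    """Execute (opcode, address) pairs against a dict-based memory."""
    ac = 0  # accumulator: the implicit operand of every instruction
    for opcode, address in program:
        if opcode == "LOAD":
            ac = memory[address]        # AC <- Mem[address]
        elif opcode == "ADD":
            mdr = memory[address]       # operand arrives through the MDR
            ac = ac + mdr               # AC <- AC + MDR
        elif opcode == "STORE":
            memory[address] = ac        # Mem[address] <- AC
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory

memory = {"Mem1": 0, "Mem2": 5, "Mem3": 7}
program = [("LOAD", "Mem2"), ("ADD", "Mem3"), ("STORE", "Mem1")]
run(program, memory)
print(memory["Mem1"])  # 12
```

Each instruction names only one memory address; the accumulator never appears in the instruction itself, which is why the instruction format can stay so short.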
Due to the use of implicit operands, this sort of architecture has two great 
advantages: short instruction length and minimal internal state in the decoding process. 
In the era of vacuum tubes and relays, only such a simple 
architecture could be realized. 
Soon after, computer scientists realized that machine performance could be improved simply 
by adding resources to the old design. Several remarkable changes to the von Neumann 
architecture are discussed in the following sections. 
1.1.2 General-purpose registers architecture 
The bottleneck of the accumulator-based architecture is the restricted temporary 
storage in the CPU, which makes data latency unnecessarily long. Since data transfer 
between CPU registers is far faster than that between CPU and memory, an obvious 
improvement on the original architecture, the general-purpose register architecture, 
which adds more fast registers to the CPU, was introduced a decade later. 
Unlike its predecessor, it has no implicit operands. The machine 
instructions, along with the CPU itself, become more complicated. Moreover, the 
programming effort for the new architecture increases rapidly. 
When high level languages became widely used, the situation became worse. 
In a typical high level programming language, the application programmer is unaware of 
the internal registers within the CPU. If the instructions require the use of 
registers for all arithmetic operations, the compiler is usually left with the task of 
generating code to manage the registers and optimize their usage. As a result, compilers 
become more complicated. In short, the semantic gap between the concepts in high-level 
languages and those in the computer architecture contributes to software unreliability, 
unreasonable program size and compiler complexity [MYERS]. 
Though the semantic gap problem has prompted debate about whether internal registers within 
the CPU are indispensable, the significant performance gain has led almost every 
successful architecture since to employ this kind of fast storage. In the late 90s, some 
academics and engineers began to accept that, for reduced 
instruction set computer design, the more registers, the better the performance. Indeed, 
some systems nowadays even make use of hundreds of registers grouped into several 
register windows [PATTERSON 85][STALLINGS]. Since the reaffirmation of RISC design, the simple 
load-and-store architecture with extensive use of general purpose registers has become the 
major trend of CPU design in both industry and academia. 
1.1.3 Vector machines 
Most scientific and engineering applications, such as solving large systems of linear 
equations or multiplying large matrices, normally consist of a large number of vector-based 
floating point operations. The scalar architecture is not adequate for them. 
In the 1970s, researchers invented a kind of super-machine, namely the vector machine, 
to solve large numerical problems. The first successful vector machine, the CRAY-1, was 
introduced in 1976 [RUSSELL]. A vector machine, which includes a vector unit, provides high 
level operations that work on floating point vectors (linear arrays of floating point numbers) 
instead of on a single floating point number. A vector instruction is similar to a high level 
loop on a simple architecture, with each iteration computing one element of the floating point 
vector, updating the indices, and branching back to the start of the loop. For example, Figure 1.3 
shows the steps for a vector machine to add 64 floating point numbers to another 64 floating 
point numbers, alongside the steps for a scalar machine. 
Vector Machine                          Scalar Machine 
Prepare Vector Register 1               Get the starting address of vector A 
Prepare Vector Register 2               Get the starting address of vector B 
Vector-ADD the two Vector Registers     For i = 1 to 64 
Store back the result                       Add the two vector elements 
                                            Store back the result 
                                            Calculate next address of vector A 
                                            Calculate next address of vector B 
                                        End-for 
Figure 1.3 Vector instructions versus scalar instructions 
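The difference in the number of steps can be made concrete with a small Python sketch. It simply counts the steps listed in Figure 1.3; the function names and the decision to ignore loop-control overhead are assumptions of this illustration, not measurements of any real machine.

```python
# Sketch: count the steps of Figure 1.3 for both styles of machine.
# Function names and the step accounting are illustrative assumptions;
# loop-control (branch) overhead of the scalar loop is not counted.

def scalar_add(a, b):
    steps = 2                       # get starting addresses of A and B
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])  # add the two vector elements
        steps += 4                  # add, store, next address of A, next address of B
    return result, steps

def vector_add(a, b):
    steps = 4                       # prepare VR1, prepare VR2, vector-ADD, store
    result = [x + y for x, y in zip(a, b)]
    return result, steps

a = [float(i) for i in range(64)]
b = [2.0 * i for i in range(64)]
scalar_result, scalar_steps = scalar_add(a, b)
vector_result, vector_steps = vector_add(a, b)
print(scalar_steps, vector_steps)   # 258 4
```

Both routines produce the same 64 sums; only the number of steps that must be fetched and decoded differs.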
A single vector instruction specifies a great deal of work - an entire loop. The 
number of instruction fetches drops, and the control hazards that would normally arise from the 
loop-back branch are eliminated accordingly. Moreover, fetches of the floating point vector 
from main memory can be highly interleaved, resulting in a shorter data latency. This 
single instruction stream-multiple data stream (SIMD) architecture [FLYNN 66] is most suitable 
for scientific calculations, for example solving linear equations, which consist of 
loops of massive, repetitive and identical operations. 
Unfortunately, the vector machine has its own drawbacks. First of all, if short vectors, for 
example of 30 elements, are being manipulated, the start-up overheads become more 
significant and there is no real gain from having such an expensive architecture. Moreover, 
many vector machines have comparatively slow scalar units. A machine with higher peak 
vector performance can be outperformed by a fast scalar machine. In the late 1980s, the 
rapid performance increases in superscalar machines blurred the difference between 
expensive vector supercomputers and economical superscalar machines. Nevertheless, 
the idea of using compact and powerful instructions in vector architecture is still useful in 
many types of applications. 
1.1.4 Superscalar machines 
Apart from vector machines, another method of increasing the execution speed is to 
issue more than one instruction per clock cycle. By incorporating duplicated execution and 
control units, as if linking up several scalar machines, the superscalar machine produces 
a higher instruction-execution rate. For example, if a superscalar machine has 4 instruction 
decode-and-execute streams, 4 independent instructions are issued per clock cycle. Assuming 
that the pipelined execution of such a group can be completed in 3 cycles, the machine appears 
to execute 4/3, or about 1.33, instructions per cycle. When compared with the vector architecture, 
the superscalar architecture provides more parallelism with less overhead. The performance 
imbalance between vector unit and scalar unit is solved, and many conventional 
applications can exploit the parallelism of superscalar machines, while the performance 
of the equivalent programs running on vector machines is quite poor. 
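One possible reading of the 1.33 figure is that it describes a single group of four instructions passing through a 3-cycle pipeline. The Python sketch below works this out under that assumption, and additionally assumes that a new group can enter the pipeline every cycle with no hazards; the function name and default parameters are hypothetical.

```python
# Sketch of the issue-rate arithmetic under stated assumptions:
# one group of `issue_width` instructions enters the pipeline per cycle,
# there are no hazards, and a group needs `pipeline_depth` cycles to finish.

def instructions_per_cycle(groups, issue_width=4, pipeline_depth=3):
    # The first group completes after pipeline_depth cycles; every
    # further group completes one cycle after the previous one.
    cycles = pipeline_depth + (groups - 1)
    return issue_width * groups / cycles

print(round(instructions_per_cycle(1), 2))    # 1.33
print(round(instructions_per_cycle(100), 2))  # 3.92
```

Under these assumptions the rate for a single group is 4/3, and it approaches the issue width of 4 only when many independent groups follow one another without stalls.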
Unfortunately, a superscalar machine can issue only a small number of independent 
instructions in a single clock cycle. If the instructions in the instruction stream are dependent, 
for instance because they refer to the same memory location, only the first instruction in the 
sequence can be issued. The lookahead mechanism of the dynamic instruction scheduler may 
not be adequate to generate an optimal execution order, and performance is hence lost. 
Besides, since several instruction pipelines run simultaneously in a 
superscalar machine, the efficiency loss is more significant when a pipeline hazard¹ occurs. 
To utilize the parallelism available in a superscalar machine effectively, more complex 
instruction decoding, synchronization between functional units, and a more complicated 
dynamic instruction scheduling algorithm are required. Hence, the overall performance of 
the architecture may not reach the theoretical maximum. 
Besides the superscalar architecture, there is a very similar architecture that depends on an 
intelligent compiler to create a pack of independent instructions that can be 
issued simultaneously to different functional units, so that the hardware does not make any 
decisions dynamically. It is known as the Long Instruction Word (LIW) computer. Details of 
this architecture are given in section 1.2. 
1.2 Long Instruction Word computer 
The parallelism of the superscalar architecture cannot be extended as far as that of the vector 
architecture, because it is difficult to determine whether the instructions in the 
instruction stream can all be issued simultaneously without knowing the order of the 
instructions and the corresponding dependencies. An alternative is the LIW or the Very Long 
Instruction Word (VLIW) architecture. 
It is difficult to draw a clear line between the LIW and VLIW architectures; generally 
speaking, a VLIW architecture has a longer word length² than an LIW architecture. Both 
the LIW and VLIW architectures employ multiple functional units. For example, the ELI-512 
has a horizontal instruction word of over 500 bits to initiate eight 32-bit integer operations, 
eight 64-bit pipelined floating-point calculations, 8 pipelined memory accesses and 32 register 
accesses, and will do 10 to 30 RISC-level operations per cycle [FISHER 83]. Rather than 
having a dynamic scheduler issue multiple independent instructions to functional units at 
run-time, an intelligent compiler packs the multiple operations into one very long 
instruction, the Very Long Instruction Word. To maximize resource utilization, a 
technique called trace scheduling, which unrolls loops and schedules code across basic blocks, 
has been developed [FISHER 81][ELLIS]. 
¹ Pipeline hazards prevent the next scheduled instruction from executing in the pipeline. This 
reduces the performance from the ideal speedup gained by pipelining. 
² Normally consisting of hundreds of bits. 
The VLIW architecture usually consists of one control unit issuing a single long 
instruction per cycle. Each long instruction consists of many tightly coupled independent 
operations, and each operation requires a small, statically predictable number of cycles to 
execute. These properties simplify the synchronization between functional units and minimize 
the efficiency loss when a pipeline hazard occurs. 
Although the VLIW architecture provides more parallelism than the superscalar 
architecture, the code arrangements are unintuitive and nearly impossible to follow. The 
VLIW compiler is difficult to design, and producing optimized code makes the compilation 
process time consuming. Moreover, the object code generated by the VLIW compiler may not be 
optimal because of limited parallelism and code size explosion. Only a limited 
amount of parallelism may be available in instruction sequences, so there may not be enough 
operations to fill the very long instruction words. Besides, whenever instructions are not 
filled up, the unused functional units translate into wasted bits in the instruction encoding. 
The major amendments to these machines try to exploit large amounts of parallelism 
without paying unbearable overheads. 
To make use of the LIW concept while eliminating these undesirable effects, a new 
architecture, the Triple-Instruction Computer (TIC), is proposed. 
1.3 Prior attempts 
Since 1982, Chen [CHEN 82] has been working on an extraordinary architecture for 
a large scientific computer with a very unconventional instruction structure. He found that 
concise instructions could not only enhance hardware coordination but also make the data 
and control flows more orderly. 
Based on observation of the RISC-based general-purpose register architecture, 
the vector architecture and the VLIW architecture, in 1992 we started to investigate a new 
LIW-based architecture with an unusual delineation between different functional units. In our 
research, a new look at the conventional LIW architecture is introduced - the 
Triple-Instruction Computer (TIC), an unconventional instruction structure aimed at maximizing 
the use of available hardware, increasing parallelism for extra performance, simplifying the 
synchronization between functional units and, as the most important goal, making the machine 
code more compact and understandable than conventional LIW code. 
The TIC architecture consists of three standalone but interrelated functional units: 
the fixed point arithmetic and address preparation unit, the floating point 
arithmetic unit and a branching unit. The triple-instruction computer is basically a 
special kind of LIW computer with only one set of functional units. The focus of the architecture 
is not the degree of parallelism obtained from a large number of computation units but the 
triple-instruction characteristics. Three kinds of instructions, namely X-type (fixed point), 
F-type (floating point) and B-type (branch), are combined to form a 64-bit triple-instruction word. 
All three functional units work in parallel, with certain interlocks to resolve 
synchronization conflicts. 
Compared with the superscalar machine, the TIC also applies the RISC design 
philosophy, but no run-time instruction scheduling is needed. All instruction 
rearrangement is done by the compiler and, partially, by the definite interlock restrictions 
between different functional units, and thus the processor is relatively simple. 
Moreover, a standalone branch unit is added, instead of sharing the classical integer unit, 
to control the execution sequence explicitly, and hence some new control techniques are 
introduced. Compared with LIW and VLIW, TIC uses three interrelated functional 
units instead of tens of independent functional units. As a result, it is easy to write TIC 
instructions. Each TIC instruction has an evident one-to-one correspondence with the normal 
triple-instruction form: the Fixed part prepares addresses, the Floating point part calculates 
the result and the Branch part determines whether a branch is necessary. For most applications, 
there is no need for the TIC compiler to perform clumsy optimization. The TIC architecture 
assimilates the idea of vector machines by crystallizing the idea of coupled operations. 
In summary, the ultimate goal of this architectural investigation project is to make 
a super-powerful compact instruction that contains an entire loop and hence sharply reduces 
the number of instruction fetches. Moreover, the regular and standardized instruction format 
makes developing and debugging TIC machine code easier. Besides, the 
explicitly parallel operations and clean assignment of instructions to co-operating units will 
improve the overall machine efficiency. In addition, two new architectural concepts, the coupled 
operation and the self branch technique, are introduced. 
The initial evaluation result of the TIC architecture, drawn from two test cases, 
the Gaussian elimination inner loop and the matrix multiplication loop, is quite encouraging. 
The machine instructions become markedly more compact and the run-time efficiency rises. 
Although the TIC architecture is just a preliminary idea for architectures employing 
highly parallel "intelligent" code, several of its concepts are worth discussing in computer 
architecture design. 
Chapter 2 The new architecture 
As stated in previous sections, the LIW architecture gives more parallelism through the 
inclusion of more circuits. The problems left to be considered are the excessive complexity of 
the specific "intelligent" compiler and the cost-performance ratio of the architecture. 
What is a suitable word length? Obviously, a 512-bit word gives more 
parallelism than a 32-bit one, at the price of a difficult compiling process for such a 
dumb architecture. To reduce the complexity of the compiler, a special kind of LIW 
computer, named the Triple-Instruction Computer (TIC), is investigated. As opposed to 
conventional LIW or VLIW, the newly introduced architecture consists of only one set of 
functional units, and its logical word length is not really "long", although the blueprint 
of the architecture came from the conventional Long Instruction Word architecture. 
Therefore, the focus of the architecture is not the degree of massive parallelism across a large 
number of computation units but the characteristics of the triple-instruction. 
2.1 The triple-instruction word 
The word length of the virtual machine is arbitrarily set to 32 bits, and each instruction 
is 64 bits long and known as a Triple-Instruction Word (TIW). We are going to demonstrate 
how to exploit massive parallelism with a relatively low-cost intelligent compiler. A small 
model Triple-Instruction Computer, with only a 64-bit instruction word, is introduced in this 
thesis. 
From the viewpoint of RISC or even CISC machines, a 64-bit instruction is far 
more than enough for operation specification if compact addressing modes are properly 
included. The reason for using such a lengthy instruction is to provide three sets of 
independent control codes for three inter-dependent functional units: the fixed point unit, 
the floating point unit and the branch unit. Hence, each Triple-Instruction Word is made 
up of 3 corresponding parts (Figure 2.1): the Fixed Point part, the Floating Point part, 
and the Branching part. Each part controls the execution sequence of the 
corresponding functional unit separately, so that the functional units execute 
concurrently with each other. 
TIW format:   Fixed part    Floating part    Branch part 
Length:       26-bit        17-bit           21-bit 
Bit:          0 - 25        26 - 42          43 - 63 
Figure 2.1 Triple-Instruction Word format 
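The field layout of Figure 2.1 can be sketched as a pair of pack and unpack routines in Python. This is only an illustration: the function names are hypothetical, the contents of each part are treated as opaque integers, and bit 0 is assumed to be the most significant bit of the 64-bit word, which the figure does not state explicitly.

```python
# Sketch of packing/unpacking a 64-bit Triple-Instruction Word.
# Assumptions: bit 0 is the most significant bit, so the fixed part
# occupies the top 26 bits; the field contents are opaque integers.

FIXED_BITS, FLOAT_BITS, BRANCH_BITS = 26, 17, 21   # 26 + 17 + 21 = 64

def pack_tiw(fixed, floating, branch):
    fields = (("fixed", fixed, FIXED_BITS),
              ("floating", floating, FLOAT_BITS),
              ("branch", branch, BRANCH_BITS))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} part does not fit in {width} bits")
    return (fixed << (FLOAT_BITS + BRANCH_BITS)) | (floating << BRANCH_BITS) | branch

def unpack_tiw(word):
    branch = word & ((1 << BRANCH_BITS) - 1)
    floating = (word >> BRANCH_BITS) & ((1 << FLOAT_BITS) - 1)
    fixed = (word >> (FLOAT_BITS + BRANCH_BITS)) & ((1 << FIXED_BITS) - 1)
    return fixed, floating, branch

word = pack_tiw(0x155, 0x0AA, 0x333)
print(unpack_tiw(word) == (0x155, 0x0AA, 0x333))  # True
```

The three widths add up to exactly 64 bits, so a word with every field at its maximum value has all 64 bits set.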
A fixed point instruction could be a load, a store, an integer arithmetic/logic operation, 
or a system operation, while a floating point instruction could be any floating point 
operation. A branch instruction could be a procedure call or return, or a branch on certain 
conditions. 
In contrast with common superscalar or LIW machines, a dedicated branch unit 
is included in the Triple-Instruction Computer. This branch unit, instead of the classical 
integer unit, is responsible for sequence control, that is, for checking whether a branch is 
valid and actually executing the branch. As a result, the fixed point unit is responsible for 
memory accesses and integer arithmetic/logic operations only; the logical comparisons and actual 
branching operations are shifted to the branch unit. 
Because of this improvement, unlike other LIW or VLIW computers, for which an intelligent 
compiler is required to generate executable machine code, only a relatively simple compiler is 
required to produce TIC programs. Each TIC instruction has an evident one-to-one 
correspondence with the normal triple-instruction form: the Fixed point part prepares addresses 
or indexes, the Floating point part calculates the result and the Branch part determines 
whether a branch is valid. For most calculations, there is no need for the TIC compiler to 
perform clumsy optimization. A detailed discussion is given in chapter six. 
As discussed in the last paragraph, to make the TIC instructions more readable and a 
pipelined implementation possible and straightforward, the execution of the branch part 
of an instruction is held up until the corresponding fixed point and floating point 
parts are completed. Moreover, because of the inclusion of the self branch¹ instruction, the 
branch unit must be deferred so that it sees a valid logical branching state. 
¹ Refer to section 2.5.1. 
[Figure: execution timelines of the Fixed, Floating and Branch units for instruction parts X1:F1, B1, X2:F2, B2, X3:F3 and B3. Left panel: Execution Pattern of simplest TIC. Right panel: Possible Floating/Branch overlapping.] 
Obviously, more execution overlapping leads to shorter execution time. The ultimate 
pipelined version of the Triple-Instruction Computer would fill all three pipelines 
without any bubbles. For simplicity, the simplest version is investigated first. 
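The effect of overlapping can be estimated with a rough timing sketch in Python. The model is an assumption made for illustration only: each instruction has a combined fixed/floating phase of `xf_cycles` followed by a branch phase of `branch_cycles`, and in the overlapped case the branch phase of one instruction is taken to run alongside the fixed/floating phase of the next. The cycle counts are hypothetical and are not taken from the simulator.

```python
# Rough timing sketch under stated assumptions (not simulator results):
# every TIW has a fixed/floating phase (X:F) followed by a branch phase (B).

def simplest_cycles(n, xf_cycles, branch_cycles):
    # No overlap: X1:F1, B1, X2:F2, B2, ... run strictly one after another.
    return n * (xf_cycles + branch_cycles)

def overlapped_cycles(n, xf_cycles, branch_cycles):
    # Assumed overlap: B(i) runs alongside X(i+1):F(i+1).
    if n == 0:
        return 0
    return xf_cycles + (n - 1) * max(xf_cycles, branch_cycles) + branch_cycles

print(simplest_cycles(3, 2, 1))    # 9
print(overlapped_cycles(3, 2, 1))  # 7
```

With these hypothetical latencies, three instructions take 9 cycles without overlap and 7 with it; the saving grows with the number of instructions.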
2.2 Functional view of the architecture 
The hardware component block diagram of the TIC architecture is shown in Figure 
2.2. The basic components of the proposed machine consist of eight parts: 
• a bus control unit/memory management unit with two cache memories, 
• an instruction refresh unit and an instruction schedule unit, 
• a fixed point arithmetic and logic unit (Fixed point ALU), 
• a floating point computational unit (Floating point unit), 
• a branching unit, 
• a set of special purpose registers (32-bit), 
• a set of general purpose registers (32-bit) and 
• a set of floating point registers (64-bit). 
2.2.1 Fixed Point ALU 
The fixed point arithmetic and logic unit is responsible for all fixed point arithmetic and 
logic operations. It may be used for address computation, indexing etc. All fixed point 
registers, for example the general purpose registers and the special purpose registers 
including the MAR, XDR, DBASE and so on, can be treated as source or destination 
locations of the ALU directly. The program status word (PSW) will be set according to the 
computation result of each operation. 
[Diagram: register file, multiplexer, fixed ALU, CTemp, control gate and data bus] 
Figure 2.3 A closer look at the ALU 
Moreover, since the atomic instructions 
add-and-add (ADD2) and multiply-and-add (MULADD) are included in the instruction set, 
special circuit and control design is needed, and one extra non-programmable temporary 
register is employed in each unit (CTemp for the fixed point ALU and Ftemp for the floating 
point ALU). As shown in Figure 2.3, the fixed point ALU takes in two operands. One comes 
directly from the fixed point register file, while the other comes from either the register 
file, the fixed point instruction (immediate data), the branch instruction (branch offset) or 
the hidden temporary register (CTemp). 
Besides normal fixed point instructions, the two additional instructions, add-and-add and 
multiply-and-add, make address calculations and matrix/vector 
manipulation simpler. Moreover, the inclusion of these two atomic instructions also 
facilitates the one-instruction looping mechanism, in which a loop can be specified in one 
Triple-Instruction Word and hence no instruction fetch is necessary during the loop. A detailed 
discussion is given in chapter six. 
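How such a coupled operation might simplify address calculation can be sketched in Python. The operand semantics used here are assumptions, since this section does not define them: MULADD is taken to compute a*b + c and ADD2 to compute a + b + c, with the intermediate result held in the hidden CTemp register. The class name, the 32-bit wrap-around and the row-major matrix layout are likewise hypothetical.

```python
# Sketch of coupled fixed point operations. Assumed semantics (not
# defined in this section): MULADD = a*b + c, ADD2 = a + b + c, with
# the intermediate value kept in the hidden temporary CTemp.

MASK32 = 0xFFFFFFFF  # assumed 32-bit wrap-around arithmetic

class FixedALU:
    def __init__(self):
        self.ctemp = 0  # hidden, non-programmable temporary register

    def muladd(self, a, b, c):
        self.ctemp = (a * b) & MASK32       # first operation
        return (self.ctemp + c) & MASK32    # second operation

    def add2(self, a, b, c):
        self.ctemp = (a + b) & MASK32
        return (self.ctemp + c) & MASK32

def element_address(alu, base, row, col, ncols, elem_size=8):
    """Address of element [row][col] of a row-major matrix (assumed layout)."""
    index = alu.muladd(row, ncols, col)         # row * ncols + col
    return alu.muladd(index, elem_size, base)   # index * elem_size + base

alu = FixedALU()
print(element_address(alu, base=1000, row=2, col=3, ncols=4))  # 1088
```

Under these assumptions, two coupled instructions replace the four separate multiplications and additions that a conventional instruction set would need for the same address.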
2.2.2 Floating Point ALU 
All floating point operations are assumed to be executed by the floating point unit. 
The IEEE floating point double precision format [IEEE] with 11-bit exponent and 52-bit 
fraction is employed. The program status word (PSW) will also be set according to the 
computation result of each operation. 
Generally, the fixed point ALU supplies the memory addresses of the operands
that the floating point unit may require. The floating point unit and the
fixed point ALU each work on their own register file.
Like the fixed point ALU, the floating point ALU takes in two operands. One
comes directly from the floating point register file, while the other comes
from either the register file or the hidden temporary register (Ftemp). The
two additional atomic instructions, add-and-add and multiply-and-add, are
included here as well.
2.2.3 Register files 
There are two sets of register file: the fixed point register file with thirty-two 32-bit 
registers, and the floating point register file with sixteen 64-bit registers (Figure 2.4). 
According to the functional characteristics, they can be divided into two types: special 
purpose registers and general purpose registers. 
The lower 16 fixed point registers (R0 to R15) are general purpose registers
and are responsible for all fixed point, index, and addressing computation.
The upper 16 fixed point registers are reserved for special purposes and are
named correspondingly. Though the user can use them like other general purpose
registers, they must be used more carefully because of their implicit
purposes. The special purpose fixed point registers can be further classified
into four types: memory address registers (R17 and R25 are Memory Address
Registers, and R19 and R27 are Memory Store Address Registers), memory data
registers (R18 and R26 are fiXed point Data Registers, and R20 and R28 are
fiXed point Store Data Registers), base address registers (R21 is the Stack
BASE, R22 is the Code BASE, and R30 is the Data BASE register), and
miscellaneous registers (R16 is the Program Status Word, R29 is the Stack
Pointer, and R31 is the Program Counter).
Layout of the fixed point register
Reg     Mnemonic      Description
0-15    R0 .. R15     General purpose register
16      PSW           Program Status Word
17      MAR1          Memory Address Register #1
18      XDR1          fiXed point Data Register #1
19      MSAR1         Memory Store Address Register #1
20      XSDR1         fiXed point Store Data Register #1
21      SBASE         Stack BASE register
22      CBASE         Code BASE register
23                    Reserved
24                    Reserved
25      MAR2          Memory Address Register #2
26      XDR2          fiXed point Data Register #2
27      MSAR2         Memory Store Address Register #2
28      XSDR2         fiXed point Store Data Register #2
29      SP            Stack Pointer
30      DBASE         Data BASE register
31      PC            Program Counter
Layout of the floating point register
Reg     Mnemonic      Description
0..5    FR0 .. FR5    Floating point registers
6       FDR1          Floating point Data Register #1
7       FSDR1         Floating point Store Data Register #1
8..13   FR8 .. FR13   Floating point registers
14      FDR2          Floating point Data Register #2
15      FSDR2         Floating point Store Data Register #2
Figure 2.4 Register File Map
There are 16 floating point registers. Four of them are reserved for special
usage (FR6 and FR14 are Floating point Data Registers, and FR7 and FR15 are
Floating point Store Data Registers).
The Program Status Word records the current status of the machine. Each bit
in this register is either set or cleared to indicate whether a certain
condition has occurred. As mentioned in the above two sections, the PSW is set
after each ALU operation. Several additional register bits are included to
hold the status flags internally, and the PSW is updated only when it is
accessed by the programmer. A detailed bit description of the PSW is shown in
Figure 2.5.
Layout of the Program Status Word
Bit     0 .. 20    21  22  23  24  25  26  27  28  29  30  31
Field   reserved   FZ  FU  FO  -   SF  XZ  XO  -   SX  CF  IF
Description of individual bit
Name                         Description
IF (Interrupt Enable Flag)   If it is set, a certain type of interrupt (a maskable
                             interrupt) can be recognized by the CPU; otherwise,
                             these interrupts are ignored.
CF (Carry Flag)              Set when either a carry is produced in addition or a
                             borrow is needed in subtraction.
SX (Fixed Sign)              It is equal to the most significant bit of the last
                             fixed point arithmetic operation result.
XO (Fixed Overflow)          It is set if an overflow has occurred in the fixed
                             point ALU.
XZ (Fixed Zero)              It is used to indicate that the result of the ALU
                             operation is zero.
SF (Floating Sign)           It is equal to the sign bit of the last floating point
                             operation result.
FO (Floating Overflow)       Similar to XO.
FU (Floating Underflow)      It is set if the result of the last floating point
                             operation is too small to be represented by a floating
                             point number.
FZ (Floating Zero)           Similar to XZ.
Figure 2.5 Layout of the Program Status Word
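The fixed point flags described in Figure 2.5 can be illustrated with a minimal Python sketch of a 32-bit addition. The function name `add_with_flags` is hypothetical, and treating XO as two's complement signed overflow is an assumption; the thesis only states that XO is set when an overflow occurs.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a: int, b: int) -> tuple[int, dict[str, bool]]:
    """Add two 32-bit words and derive the fixed point PSW flags."""
    a &= MASK32
    b &= MASK32
    full = a + b                      # unbounded sum, used to detect the carry
    result = full & MASK32            # value that fits in a 32-bit register
    sign_a, sign_b, sign_r = a >> 31, b >> 31, result >> 31
    flags = {
        "CF": full > MASK32,          # carry out of the most significant bit
        "SX": sign_r == 1,            # most significant bit of the result
        # Assumption: overflow means signed (two's complement) overflow.
        "XO": sign_a == sign_b and sign_r != sign_a,
        "XZ": result == 0,            # result is zero
    }
    return result, flags
```

For instance, adding 1 to 0xFFFFFFFF wraps to zero and sets CF and XZ, while adding 1 to 0x7FFFFFFF sets SX and XO.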
The Code BASE and Data BASE registers are used to ease relocation. The
effective address of an instruction is the sum of the Program Counter and the
Code Base register. The effective address of data is the sum of the Memory
Address Register/Memory Send Address Register and the Data Base register. The
Stack Base register points to the bottom of the system stack and the Stack
Pointer points to the top element in the stack.
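The address relocation rule above amounts to two additions. The following Python sketch shows them; the function names are hypothetical, and wrapping the sum to 32 bits is an assumption based on the 32-bit register width.

```python
MASK32 = 0xFFFFFFFF


def instruction_address(pc: int, cbase: int) -> int:
    """Effective instruction address: Program Counter plus Code BASE."""
    return (pc + cbase) & MASK32      # assumption: wrap around at 32 bits


def data_address(mar_or_msar: int, dbase: int) -> int:
    """Effective data address: MAR (load) or MSAR (store) plus Data BASE."""
    return (mar_or_msar + dbase) & MASK32
```

A program can therefore be relocated by changing only CBASE and DBASE, leaving the addresses computed by the program itself untouched.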
2.2.4 Bus Control Unit/Memory Management Unit 
As mentioned in the introduction section, this research concentrates on the
internal activities of the CPU. The bus control strategy and the memory
management strategy are not under our consideration. Therefore, a high
performance bus control unit and memory management unit are assumed, together
with a 2^32 programmer addressable space (see footnote 2). It is also assumed
that the access time of the memory is fast enough to match the machine
operations.
2.2.5 Cache Memory and Memory Access 
For simplicity, we do not define the size of the two caches at this stage; we
simply assume that they work efficiently with the aid of appropriate control
units and control mechanisms. In order to provide good service, two separate
caches are included, one for loading instructions and one for data.
Data are loaded from and stored to the cache through the memory ports. A
memory in port is made up of a Memory Address Register (MAR) and a Memory Data
Register, while a memory out port is made up of a Memory Sending Address
Register (MSAR) and a Memory Sending Data Register.
For loading, there are four Load-ports in total: two for fixed point, using
the fiXed Data Register (XDR), and two for floating point, using the Floating
Data Register (FDR). The address is put into the memory address register of
the port and then a memory read operation is initiated. The desired data will
be put into the corresponding memory data register after a certain delay.
Data_rdy hardware tags are used to signal the availability of data in each
port. Moreover, an overall Data_rdy tag is employed for the whole engine. It
is used to keep the engine suspended when the requested data is not available.
For storing, there are also four memory Store-ports: two for fixed point,
using the fiXed Send Data Register (XSDR), and two for floating point, using
the Floating Send Data Register (FSDR). Data placed in the memory store data
register will be stored in the memory location specified in the memory send
address register of the corresponding memory port. The load and store requests
are generated according to the control bits R, W, and T. The R bit generates a
read request while the W bit generates a write request. The T bit specifies
the type of memory operation, either fixed point or floating point.
2 It is a projection of the theoretical infinite memory. The size is bounded by the length of the memory
address register or address bus.
2.3 Inter-functional units synchronization 
2.3.1 Instruction Pipeline and Decoding Mechanism 
The instruction pipeline can be viewed as a multi-level instruction register, which 
stores the most recent pending instruction for execution. The size of the pipeline is four 64-
bit words. 
There are three individual decoders for the partial instructions: one each
for the fixed point engine, the floating point engine and the branching
engine. Each partial instruction controls the execution sequence of its
corresponding functional unit separately, although certain interlocks occur
between them. The functional units therefore execute concurrently with each
other. Figure 2.6 shows the three decoding engines, which have to keep pace
with one another to maintain data consistency. Three partial instruction
completion signals, namely XEND, FEND and BEND, are used for synchronization.
2.3.2 Instruction Schedule Unit and Instruction Refresh Controller 
The instruction schedule unit determines the pace of the three individual
functional units and of the instruction pipeline refreshing mechanism
according to the partial instruction completion signals and the self-branch
(S-BR) signal. At each cycle, both the fixed unit and the floating unit are
invoked at the beginning. After both of them are complete, the branch unit is
then invoked.
When the load enable (L-ENC) signal is set, all of the 64-bit instructions
and their corresponding valid tags are shifted down one register, and a read
request for the new instruction is generated.
If execution overlapping is introduced, the normal pipeline sequence will be broken 
by branching, interrupt or procedure call/return. An instruction pipeline REFRESH request 
will be initiated in case of abnormal break of pipeline. Normally, the instruction refresh 
controller decides when to pre-fetch and shift the instructions in the pipeline. 
[Block diagram: four-stage instruction pipeline (X1-X4, F1-F4, B1-B4) fed from the cache, instruction
refresh controller, instruction schedule unit, and the fixed, floating and branch micro engines with
their XEND, FEND and BEND completion signals]
Figure 2.6 Micro engines synchronization block diagram
As stated in section 2.1, there is a diversity of designs for inter-functional
unit synchronization. Figure 2.7a and Figure 2.7b show the dependence
relationships between some control signals. For example, BNEW (initiation of
execution of the branch unit) is set if both FEND and XEND are set and the
instruction is not the starting TIC instruction. The figures should be
modified if a different execution overlapping is assumed.
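The BNEW condition stated above can be written as a single boolean expression. The Python sketch below is only an illustration of that one rule; the function name `branch_start` is hypothetical, and the INIT input is assumed to be the signal that marks the starting TIC instruction.

```python
def branch_start(xend: bool, fend: bool, init: bool) -> bool:
    """BNEW: start the branch unit once the fixed and floating units are done.

    xend, fend: completion signals of the fixed and floating point units.
    init: assumed to be true only for the starting TIC instruction.
    """
    return xend and fend and not init
```

The branch unit therefore waits for the slower of the two other units, which matches the invocation order described for the instruction schedule unit.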
Besides the inter-functional unit delays, the Triple-Instruction Computer has
to delay its execution when two memory accesses are required within one TIC
instruction. In the case of a read-modify-store self branch loop (see footnote
3), where both a data fetch and a data store are required within one TIC
instruction, the data store operation is executed first, when FSDR is ready.
After the store is initiated, the address/index is modified in the fixed point
unit and the next memory read is then generated according to the newly updated
addresses.
[Logic diagram: XNEW, FNEW and BNEW derived from INS-V, INIT, XEND, FEND and BEND]
Figure 2.7a Instruction Schedule Unit
[Logic diagram: L-ENC derived from END, S-BR, REFRESH and INIT; instruction address formed from PC and CBASE]
Figure 2.7b Instruction Refresh Controller
3 Refer to section 2.4.4.
2.4 Instruction set design 
The instruction set of the Triple-Instruction Computer is quite different from classic 
architecture. Usually, all machine instructions of such a classic architecture are classified 
into several categories, according to either addressing mode, nature of instruction or nature 
of data being manipulated. For the Triple-Instruction Computer, the instruction set actually 
consists of three logical instruction sets: 
-Instruction set for fixed point partial instruction 
-Instruction set for floating point partial instruction 
-Instruction set for branch partial instruction 
A TIC instruction is formed by concatenating three partial instructions, one
from each of the three logical instruction sets. There is no restriction on
how the three partial instructions are combined to form a TIC instruction.
Obviously, useless TIC instructions may be formed if improper partial
instructions are combined. In sections 2.4.1 to 2.4.3, the partial instruction
sets for all three functional parts are discussed in detail. The overall
characteristics of the TIC instructions and the programming tips are discussed
in section 2.4.4.
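The concatenation of three partial instructions into one 64-bit word can be sketched in Python as follows. The field widths (26, 17 and 21 bits) are taken from the format diagrams in the following subsections; the function names are hypothetical, and treating bit 0 as the most significant bit of the word is an assumption carried over from the remark in the Shift/Rotate table.

```python
FIXED_BITS, FLOAT_BITS, BRANCH_BITS = 26, 17, 21   # 26 + 17 + 21 = 64


def join_tic_word(fixed: int, floating: int, branch: int) -> int:
    """Concatenate the three partial instructions into one 64-bit word."""
    if fixed >> FIXED_BITS or floating >> FLOAT_BITS or branch >> BRANCH_BITS:
        raise ValueError("partial instruction does not fit its field")
    return (fixed << (FLOAT_BITS + BRANCH_BITS)) | (floating << BRANCH_BITS) | branch


def split_tic_word(word: int) -> tuple[int, int, int]:
    """Split a 64-bit word into (fixed, floating, branch) partial instructions."""
    if not 0 <= word < 1 << 64:
        raise ValueError("a TIC instruction is a 64-bit word")
    branch = word & ((1 << BRANCH_BITS) - 1)                      # bits 43..63
    floating = (word >> BRANCH_BITS) & ((1 << FLOAT_BITS) - 1)    # bits 26..42
    fixed = word >> (FLOAT_BITS + BRANCH_BITS)                    # bits 0..25
    return fixed, floating, branch
```

Because each field is sliced independently, the three decoders can work on their own part of the word without inspecting the others.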
The fixed point instructions can be further divided into two sub-categories:
Register-Register (RR) format (described in section 2.4.1) and
Register-Immediate (RI) format (described in section 2.4.2).
2.4.1 Fixed point instructions (RR) format: 
The fixed point instruction (Register-Register) format contains all fixed
point operations for preparing the operand address for the floating point
unit. The address can be manipulated as a 32-bit pattern in any possible way
and is stored in a fixed point register, and/or the Memory Address Register
(MAR), and/or the Memory Store Address Register (MSAR), according to the R and
W bits. All possible memory accesses are summarized below.
Addressing mode      R W T  Memory access          Address   Data      Remark
                                                   register  register
Register             0 0 *  No                     Nil       Nil       No memory access has been made
Memory-Register (F)  1 0 1  Floating point read    MAR       FDR       Read a floating point number from
                                                                       memory into FDR according to MAR
Memory-Register (X)  1 0 0  Fixed point read       MAR       XDR       Read a fixed point number from
                                                                       memory into XDR according to MAR
Register-Memory (F)  0 1 1  Floating point store   MSAR      FSDR      Store a floating point number from
                                                                       FSDR to memory according to MSAR
Register-Memory (X)  0 1 0  Fixed point store      MSAR      XSDR      Store a fixed point number from
                                                                       XSDR to memory according to MSAR
Memory-Memory (F)    1 1 1  Floating point store   MSAR      FSDR      Store a floating point number from
                            and then               MAR       FDR       FSDR according to MSAR and then read
                            floating point read                        a floating point number from memory
                                                                       into FDR according to MAR (see
                                                                       footnote 1)
Memory-Memory (X)    1 1 0  Fixed point store      MSAR      XSDR      Store a fixed point number from
                            and then               MAR       XDR       XSDR according to MSAR and then read
                            fixed point read                           a fixed point number from memory
                                                                       into XDR according to MAR
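The memory access modes tabulated above can be sketched as a small decoder in Python. The function name `memory_accesses` and the tuple layout are hypothetical; the ordering of store before read in memory-memory mode follows the table.

```python
def memory_accesses(r: int, w: int, t: int) -> list[tuple[str, str, str]]:
    """Return (operation, address register, data register) in issue order.

    r, w: read and write request bits; t: 1 = floating point, 0 = fixed point.
    """
    kind = "F" if t else "X"
    accesses = []
    if w:                                   # the store is issued first
        accesses.append(("store", "MSAR", f"{kind}SDR"))
    if r:                                   # then the (delayed) read
        accesses.append(("read", "MAR", f"{kind}DR"))
    return accesses
```

With R=0 and W=0 the list is empty, which corresponds to the register mode where T is a don't-care.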
All of the available fixed point instructions are annotated in the following tables 
according to the nature of the instruction. 
1 To facilitate the one-instruction loop, the memory-memory mode instruction is not executed in the
common sequence (read data from memory, perform calculation, store data back to memory). A delayed
read mechanism is introduced so that the data fetch and data store operations remain consistent in a TIC
instruction that combines a self branch, a memory read and a memory write.
RR format (26 bits, bit 0 to bit 25)
Field order: 0 | opr | C | T | R | W | Rx | Ry | Rz
  C   Coupled instruction (1=coupled; 0=normal)
  T   Type of memory access (1=floating point; 0=fixed point)
  R   Generate a memory read (1=read)
  W   Generate a memory write (1=write)
Arithmetic: Operation: (Rx) <- (Rx) opr1 ((Ry) opr2 (Rz)) (see footnote 2) and;
                       (MAR) <- (Rx) opr1 ((Ry) opr2 (Rz)) (if R=1) and;
                       (MSAR) <- (Rx) opr1 ((Ry) opr2 (Rz)) (if W=1)
                       where opr1 and opr2 are defined by opr
Mnemonic    Bit pattern  opr1  opr2  Description
ADD         00001        NOP   ADD   Add (Ry) to (Rz) and then store result into (Rx)
SUB         00010        NOP   SUB   Subtract (Rz) from (Ry) and then store result into (Rx)
MUL         00011        NOP   MUL   Multiply (Ry) by (Rz) and then store result into (Rx)
DIV         00100        NOP   DIV   Divide (Ry) by (Rz) and then store result into (Rx)
ADD2        00101        ADD   ADD   Add (Rx), (Ry) and (Rz) and then store result into (Rx)
MULADD      00110        ADD   MUL   Multiply (Ry) by (Rz) first, and then add (Rx) and
                                     store result back to (Rx)
MULADD_Rz   00111        ADD   MUL   Multiply (Ry) by (Rz) first, and then add (Rx) and
                                     store result back to (Rz)
2 For the MULADD_Rz instruction, the result is stored back to Rz instead of Rx.
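The opr1/opr2 decomposition in the arithmetic table can be made concrete with a short Python sketch. The table name `RR_ARITHMETIC` and the function `execute_rr` are hypothetical; DIV is left out because the thesis does not specify its rounding behaviour, and masking results to 32 bits is an assumption.

```python
import operator

MASK32 = 0xFFFFFFFF

# mnemonic -> (opr1, opr2, destination); None stands for NOP
RR_ARITHMETIC = {
    "ADD": (None, operator.add, "x"),
    "SUB": (None, operator.sub, "x"),
    "MUL": (None, operator.mul, "x"),
    "ADD2": (operator.add, operator.add, "x"),
    "MULADD": (operator.add, operator.mul, "x"),
    "MULADD_Rz": (operator.add, operator.mul, "z"),
}


def execute_rr(mnemonic: str, regs: list[int], rx: int, ry: int, rz: int) -> int:
    """Compute (Rx) opr1 ((Ry) opr2 (Rz)) and write it to its destination."""
    opr1, opr2, dest = RR_ARITHMETIC[mnemonic]
    inner = opr2(regs[ry], regs[rz])                  # (Ry) opr2 (Rz)
    value = inner if opr1 is None else opr1(regs[rx], inner)
    value &= MASK32                                   # assumption: 32-bit wrap
    regs[rx if dest == "x" else rz] = value
    return value    # the same value would go to MAR/MSAR when R/W are set
```

ADD2 and MULADD thus need only one extra operand path for (Rx), which is the role the hidden temporary register plays in the ALU description.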
Shift/Rotate: Operation: Shift/Rotate the register (Ry) by (Rz) bits and the result
                         is stored back to Rx; T, R, W are ignored
              Remarks: Bit0 is the most significant bit
Mnemonic  Bit pattern  Description
SL        01000        Shift bit(s) of the operand to the left by feeding zero into bit31 and
                       putting bit0 into the carry bit
SR        01001        Shift bit(s) of the operand to the right by feeding zero into bit0 and
                       putting bit31 into the carry bit
RL        01010        Shift bit(s) of the operand to the left by feeding bit0 into bit31 and
                       putting bit0 into the carry bit
RR        01011        Shift bit(s) of the operand to the right by feeding bit31 into bit0 and
                       putting bit31 into the carry bit
RCL       01100        Shift bit(s) of the operand to the left by feeding the carry bit into
                       bit31 and putting bit0 into the carry bit
RCR       01101        Shift bit(s) of the operand to the right by feeding the carry bit into
                       bit0 and putting bit31 into the carry bit
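The six shift and rotate operations differ only in the bit that is fed in. A minimal Python sketch, using the table's convention that bit0 is the most significant bit, is given below; the function names are hypothetical, and applying a multi-bit count as repeated single-bit steps is an assumption.

```python
MASK32 = 0xFFFFFFFF


def _step(op: str, value: int, carry: int) -> tuple[int, int]:
    """Shift or rotate by one bit; returns (new value, new carry)."""
    bit0 = (value >> 31) & 1      # most significant bit in the thesis numbering
    bit31 = value & 1             # least significant bit
    if op in ("SL", "RL", "RCL"):
        fill = {"SL": 0, "RL": bit0, "RCL": carry}[op]
        return ((value << 1) & MASK32) | fill, bit0
    if op in ("SR", "RR", "RCR"):
        fill = {"SR": 0, "RR": bit31, "RCR": carry}[op]
        return (value >> 1) | (fill << 31), bit31
    raise ValueError(f"unknown operation {op}")


def shift_rotate(op: str, value: int, count: int, carry: int = 0) -> tuple[int, int]:
    """Apply op to a 32-bit value count times (assumed bit-by-bit semantics)."""
    value &= MASK32
    for _ in range(count):
        value, carry = _step(op, value, carry)
    return value, carry
```

The rotate-through-carry forms (RCL, RCR) behave like a 33-bit rotation in which the carry flag acts as the extra bit.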
Move & Exchange:
Mnemonic  Bit pattern  Description
MOV       01110        Move (Ry) to (Rx); Rz is ignored; move (Ry) to MAR if R=1, to
                       MSAR if W=1, and T declares the memory access type
EXCH      01111        Exchange (Ry) with (Rx); T, R, W, Rz are ignored
Logical: Operation: (Rx) <- (Ry) opr (Rz) for AND, OR, XOR operations; T, R, W
                    are ignored
                    (Ri) <- NOT (Ri) for NOT operation; T, R, W are ignored
Mnemonic  Bit pattern  Description
AND       10000        Perform (Ry) AND (Rz) and store result into (Rx)
OR        10001        Perform (Ry) OR (Rz) and store result into (Rx)
XOR       10010        Perform (Ry) Exclusive-OR (Rz) and store result into (Rx)
NOT       10011        Perform NOT (Rx), NOT (Ry) and NOT (Rz) and then store results
                       back to (Rx), (Ry) and (Rz) correspondingly
Clear/Set Flag:
Mnemonic  Bit pattern  Description
CLR       10100        Clear the (Rz)th bit of the register (Ry) to zero and put the result back
                       to register (Rx); other fields are ignored
SET       10101        Set the (Rz)th bit of the register (Ry) to one and put the result back
                       to register (Rx); other fields are ignored
Push/Pop:
Mnemonic  Bit pattern  Description
PUSHX     11000        Push fixed point register (Rx) onto the stack; the T and W bits must
                       be set to 1; Ry=10011 and Rz=10100
POPX      11001        Pop the stack and store to fixed point register (Rx); the T and R bits
                       must be set to 1; Ry=10001 and Rz=10010
Increment/Decrement:
Mnemonic  Bit pattern  Description
INC       11100        Perform (Rx)+1, (Ry)+1 and (Rz)+1 and then store results back to
                       (Rx), (Ry) and (Rz) correspondingly; T, R, W are ignored
DEC       11101        Perform (Rx)-1, (Ry)-1 and (Rz)-1 and then store results back to
                       (Rx), (Ry) and (Rz) correspondingly; T, R, W are ignored
Other:
Mnemonic  Bit pattern  Description
NOP       00000        No operation
CLEAR     10110        Transfer a word of 0s to (Rx); other fields are ignored
SET       10111        Transfer a word of 1s to (Rx); other fields are ignored
FtoX      11110        Convert FR0 to R0 and R1, where bit0 of R0 is the most significant bit
HALT      11111        Stop the machine
2.4.2 Fixed point instructions (RI) format: 
RI format (26 bits, bit 0 to bit 25)
Field order: 1 | opr | C | T | R | W | Rx | Ry | opd
  C   Coupled instruction (1=coupled; 0=normal)
  T   Type of memory access (1=floating point; 0=fixed point)
  R   Generate a memory read (1=read)
  W   Generate a memory write (1=write)
Note: 1) Rx, Ry can only be general purpose registers (R0 to R15).
      2) Since opd consists of 8 bits, a non-negative value (see footnote 3) ranging
         from 0 to 255 can be represented.
      3) The fixed point instruction (RI) format is just the same as the RR format,
         except that the third operand is in immediate mode. Moreover, to increase the
         length of the immediate operand, there are only 4 bits for storing the
         operation code. Some functions are thus not included.
3 The non-negative number representation is selected since a variable reference can be calculated by
adding a non-negative offset to the memory data base register.
Operation:
(Rx) <- (Rx) opr1 ( (Ry) opr2 opd ) (see footnote 4) and;
(MAR) <- (Rx) opr1 ( (Ry) opr2 opd ) (if R=1) and;
(MSAR) <- (Rx) opr1 ( (Ry) opr2 opd ) (if W=1)
Mnemonic  Bit pattern  opr1  opr2  Description
MOV       0000         MOV   MOV   Move opd to (Rx), (Ry); if R=1, move it to MAR; if
                                   W=1, move it to MSAR
ADD       0001         NOP   ADD   Add (Ry) to opd and then store result into (Rx)
SUB       0010         NOP   SUB   Subtract opd from (Ry) and then store result into (Rx)
MUL       0011         NOP   MUL   Multiply (Ry) by opd and then store result into (Rx)
DIV       0100         NOP   DIV   Divide (Ry) by opd and then store result into (Rx)
ADD2      0101         ADD   ADD   Add (Rx), (Ry) and opd and then store result into (Rx)
MULADD    0110         ADD   MUL   Multiply (Ry) by opd first, and then add (Rx) and
                                   store result back to (Rx)
AND       1000         NOP   AND   Perform (Ry) AND opd and store result into (Rx)
OR        1001         NOP   OR    Perform (Ry) OR opd and store result into (Rx)
XOR       1010         NOP   XOR   Perform (Ry) Exclusive-OR opd and store result into
                                   (Rx)
4 Same as the RR format instructions.
2.4.3 Floating point instructions format: 
Since there is no memory access in the floating point unit, the instruction
set is quite simple and resembles that of most conventional machines.
Floating point format (17 bits, bit 26 to bit 42)
Field order: opr | C | Rx | Ry | Rz
  C   Coupled instruction (1=coupled; 0=normal)
Operation: (Rx) <- (Rx) opr1 ( (Ry) opr2 (Rz) ) (see footnote 5)
           (FSDR) <- (Rx) opr1 ( (Ry) opr2 (Rz) ) (if W=1 and T=1)
Mnemonic    Bit pattern  opr1  opr2  Description
NOP         0000         NOP   NOP   No operation
ADD         0001         NOP   ADD   Add (Ry) to (Rz) and then store result into (Rx)
SUB         0010         NOP   SUB   Subtract (Rz) from (Ry) and then store result into (Rx)
MUL         0011         NOP   MUL   Multiply (Ry) by (Rz) and then store result into (Rx)
DIV         0100         NOP   DIV   Divide (Ry) by (Rz) and then store result into (Rx)
ADD2        0101         ADD   ADD   Add (Rx), (Ry) and (Rz) and then store result into (Rx)
MULADD      0110         ADD   MUL   Multiply (Ry) by (Rz) first, and then add (Rx) and
                                     store result back to (Rx)
MULADD_Rz   0111         ADD   MUL   Multiply (Ry) by (Rz) first, and then add (Rx) and
                                     store result back to (Rz)
MOV         1000         MOV   MOV   Move (Rz) to (Ry), (Rx)
XtoF        1001         N/A   N/A   Convert R0 and R1 to FR0, where bit0 of R0 is the
                                     most significant bit
5 Same as the RR format instructions.
2.4.4 Branch instructions format: 
One of the main advantages of the Triple-Instruction Computer is the
inclusion of the self branch type instruction. A detailed discussion is given
in section 2.5.
Branch format (21 bits, bit 43 to bit 63)
Field order: B | opr | M | Rx | Ry | opd
  B   Self-branch bit (1=self-branch; 0=normal)
  M   Branch mode (1=immediate mode; 0=register mode)
Note: 1) Rx, Ry can only be general purpose registers (R0 to R15) if the fixed point
         register file is being used.
      2) Since opd consists of 7 bits, a 2's complement value ranging from -64 to 63
         can be represented if the immediate mode is employed.
      3) For register mode (M=0), only bits 57-60 will be considered.
      4) If it is an unconditional branch, the opd consists of 15 bits (bits 49-63),
         ranging from -16384 to 16383.
      5) All branches are relative branches.
      6) Delayed branch is employed to improve pipeline efficiency.
Operation: If Relations(Rx,Ry) (see footnote 6) holds then
           (PC) <- (PC) + opd        (if M=1, opd is an immediate two's
                                      complement value)
           (PC) <- (PC) + Reg(opd)   (if M=0, opd is a fixed point register
                                      number ranging from 0 to 15)
6 Relations means the mnemonic relations EQ, NEQ, LTE etc. For example, LT(Rx,Ry) indicates the
condition (Rx) is less than (Ry).
Mnemonic  Bit pattern   Branch              Description
          (bit 44-47)   condition
NOBR      0000          unconditional       No Branch
RET       0001          unconditional       The called procedure will resume execution to the
                                            calling procedure by the RETURN instruction, which
                                            pops the DBASE, CBASE, PSW and return address
                                            from the internal stack
EQ        0010          equal to            Branch to destination if (Rx) = (Ry)
NEQ       0011          not equal           Branch to destination if (Rx) <> (Ry)
LT        0100          less than           Branch to destination if (Rx) < (Ry)
LTE       0101          less than or        Branch to destination if (Rx) <= (Ry)
                        equal to
-         0111          -                   reserved
CALL      1000          unconditional       When procedure CALL is invoked, the information of
                                            the calling procedure (return address, PSW, CBASE
                                            and DBASE) will be pushed onto the system stack;
                                            then the system control will be passed to the called
                                            procedure by simply changing the PC value
BCF       1001          carry               Branch on Carry Flag
BXO       1010          fixed zero          Branch on Fixed Point Zero Flag
BSF       1011          floating sign       Branch on Floating Point Sign Flag
BFO       1100          floating overflow   Branch on Floating Point Overflow Flag
BFU       1101          floating underflow  Branch on Floating Point Underflow Flag
BFZ       1110          floating zero       Branch on Floating Point Zero Flag
BR        1111          unconditional       Branch to destination immediately
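The relative branch displacement described in the notes of the branch format can be sketched in Python as follows. The function names are hypothetical; taking bits 57-60 as the upper four bits of the 7-bit opd field (bit 57 being its most significant bit) is an assumption consistent with the bit numbering used elsewhere in this chapter.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def branch_displacement(m: int, opd: int, regs: list[int],
                        unconditional: bool = False) -> int:
    """Value added to the PC when the branch is taken.

    m: branch mode (1 = immediate, 0 = register).
    opd: 7-bit field, or 15-bit field for an unconditional branch.
    """
    if m == 1:
        return sign_extend(opd, 15 if unconditional else 7)
    # Register mode: only bits 57-60 are considered (assumed to be the
    # upper four bits of the 7-bit field); they select R0..R15.
    return regs[(opd >> 3) & 0xF]
```

The immediate ranges produced by `sign_extend` match the limits stated in notes 2 and 4: -64 to 63 for 7 bits and -16384 to 16383 for 15 bits.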
2.4.4 Put it together: the TIC instruction 
As stated before, useless TIC instructions may be formed if improper partial
instructions are combined. Firstly, consider the following TIC instructions:
[1] RR:ADD R0,R1,R2 | F:NOP | B:NOBR
[2] RR:NOP | F:ADD FR0,FR0,FDR | B:NOBR
[3] RR:NOP | F:NOP | B:LT_Tx R1,R2,r
Since each of the above TIC instructions serves only one operation (fixed
point, floating point or branching), the Triple-Instruction Computer behaves
just like a conventional machine, but with higher cost and lower performance
due to clumsy decoding logic and a complex synchronization mechanism. As a
result, we have to fill up the three functional units to increase the overall
performance. Several useful TIC instruction formats are shown below:
(i) Summation of N numbers
[1] RI:ADD_Tf_R R2,R2,@0 | F:NOP | B:NOBR
[2] RI:ADD_Tf_R R2,R2,@2 | F:ADD FR0,FR0,FDR | B:LT_Tx_B R2,R1,-
-where R2 holds the address of the first memory location to be added, R1
holds the address of the last memory location to be added, and the sum will be
put in floating point register FR0.
This pair of TIC instructions illustrates looping in the Triple-Instruction
Computer. The first TIC instruction initiates a floating point read. The
second, self branch (see footnote 8) instruction provides a modified address
for the next floating point read and adds up the corresponding floating point
registers. This TIC instruction stays in the instruction pipeline until the
branching condition is violated, that is, until R2 is no longer less than R1.
7 @0 denotes immediate mode constant zero.
8 A detailed description is given in section 2.5.1.
(ii) Read in, modify and write back a number
[1] RI:ADD_Tf_R R0,R0,@0 | F:NOP | B:NOBR
[2] RR:ADD_Tf_R_W R0,R0,R1 | F:MULTI FR7,FR0,FR6 | B:LT_Tx_B R0,R2,-
-where R0 holds the address of the first memory location to be modified, R1
holds the offset between adjacent cells, R2 holds the last memory location to
be modified and FR0 holds the multiplying constant.
This pair of TIC instructions illustrates a read-modify-store loop. As in
case (i), the first TIC instruction initiates a floating point read. The
second instruction modifies the floating point number stored in FDR, writes it
to FSDR and then stores it back to the corresponding memory location. After
that, another floating point read is initiated to load the FDR for the next
execution of the TIC instruction.
2.5 Special features 
2.5.1 Self branch instruction 
The main advantage of the TIC is that it compresses three different but
related operations into one instruction. As stated before, the fixed point
unit prepares the operand address for the floating point unit. Depending on
the result of the operations, the branch unit determines whether the branch is
to be taken. A self-branch instruction is executed repeatedly while its
condition is valid. This feature fully utilizes the function of each part and
allows simple operations to be repeated. For example, we can add 100 floating
point numbers by specifying the self-branch condition "100 > the fixed point
index". The B-bit (self-branch bit) is specified in the partial branch
instruction to inform each part to retain the self-branch instruction in the
pipeline after its execution has finished. If the condition of the self-branch
is violated, the instruction in the pool is flushed and the next TIW is
decoded and executed as normal. Details are shown in Figure 2.7b.
Here are some TIC instruction sequences to demonstrate the purpose of the
self-branch instruction:
Summation of N floating numbers
Registers are initialized to:
R1 = address(NUM[N]); R2 = address(NUM[0])
R3 = address(SUM)     FR0 = 0.0
TIC machine instruction sequence:
[1] RI:ADD_Tf_R R2,R2,@0 | F:NOP | B:NOBR
[2] RI:ADD_Tf_R R2,R2,@2 | F:ADD FR0,FR0,FDR | B:LT_Tx_B R2,R1,-
[3] RI:ADD_Tf_W R3,R3,@0 | F:MOV FSDR,FR0,FR0 | B:NOBR
(ADD_Tf_R and LT_Tx_B are explained in footnotes 9 and 10 respectively.)
The inclusion of the self-branch instruction makes the TIC program shorter.
Moreover, loop structures can easily be mapped into one TIC instruction, which
makes the TIC compiler simpler.
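The three-instruction summation sequence can be traced with a small behavioural model in Python. This is only a sketch under stated assumptions, not the thesis simulator: memory is modelled as a dictionary keyed by word address, each floating point number occupies two words (hence the @2 step), the floating unit consumes the FDR value fetched by the previous execution, and the function name `sum_with_self_branch` is hypothetical.

```python
def sum_with_self_branch(memory: dict[int, float], first: int, end: int,
                         sum_addr: int) -> tuple[float, int]:
    """Trace instructions [1]-[3]; returns (sum, executions of instruction [2])."""
    r1, r2, r3 = end, first, sum_addr     # R1 = address(NUM[N]), R2 = address(NUM[0])
    fr0 = 0.0
    # [1] RI:ADD_Tf_R R2,R2,@0 : R2 is unchanged and a floating read is started.
    fdr = memory.get(r2, 0.0)
    executions = 0
    while True:
        # [2] Floating unit: FR0 <- FR0 + FDR (datum fetched earlier).
        fr0 += fdr
        # [2] Fixed unit: R2 <- R2 + 2 and the next floating read is started.
        r2 += 2
        fdr = memory.get(r2, 0.0)
        executions += 1
        # [2] Branch unit: stay on this instruction while R2 < R1.
        if not r2 < r1:
            break
    # [3] RI:ADD_Tf_W R3,R3,@0 | F:MOV FSDR,FR0,FR0 : store the sum at SUM.
    memory[r3] = fr0
    return fr0, executions
```

Under these assumptions, instruction [2] executes exactly N times without any further instruction fetch, and the last read it issues (at address(NUM[N])) is never consumed.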
9 An RI format ADD instruction, where the type of opd is immediate value and the type of memory access
is floating point READ.
10 A (fixed) self-branch instruction depending on the result of the inequality R2<R1; if the
inequality holds, the system Program Counter remains unchanged.
2.5.2 Coupled operation
There are altogether 32 fixed point registers and 16 floating point registers,
and they are grouped into fixed couples and floating couples. In general, for
both register files, the general purpose registers (R0,R8), (R1,R9), ...,
(R7,R15) are coupled register pairs. Additionally, a few coupled register
pairs are formed according to their purpose. For example, the two MARs and the
two XDRs form couples. As a matter of fact, the register files are mapped
carefully so that a general formula for calculating the coupled register can
be established:
For fixed point $R_i$, the coupled register of $R_i$ is $R_j$ where $j$ is equal to:
$(i + 8) \bmod 16$ (see footnote 11) (for $i \in [0..15]$)
$16 + ((i + 8) \bmod 16)$ (see footnote 12) (for $i \in [16..31]$)
For floating point $FR_i$, the coupled register of $FR_i$ is $FR_j$ where $j$ is equal to:
$(i + 8) \bmod 16$ (for $i \in [0..15]$)
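The coupling formulas can be checked directly against the register file map of Figure 2.4. The Python sketch below uses hypothetical function names and implements the formulas exactly as stated.

```python
def coupled_fixed(i: int) -> int:
    """Index of the fixed point register coupled with R_i."""
    if not 0 <= i <= 31:
        raise ValueError("fixed point registers are R0..R31")
    if i < 16:
        return (i + 8) % 16            # general purpose half
    return 16 + (i + 8) % 16           # special purpose half


def coupled_floating(i: int) -> int:
    """Index of the floating point register coupled with FR_i."""
    if not 0 <= i <= 15:
        raise ValueError("floating point registers are FR0..FR15")
    return (i + 8) % 16
```

With this mapping MAR1 (R17) is coupled with MAR2 (R25), XDR1 (R18) with XDR2 (R26), and FDR1 (FR6) with FDR2 (FR14), and applying the mapping twice returns the original register.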
In order to use fewer bits to specify more information, it is wise to
introduce the concepts of coupled register, coupled ALU, and coupled
operation. For example, the following are coupled operations, since the same
operation is applied to corresponding coupled register pairs, namely (R1,R9)
and (R2,R10):
R1 <- R1 Add R2
R9 <- R9 Add R10
They can be specified in a single instruction:
R1 <- R1 Add_C R2
Specifying the two operations in one instruction reduces the decoding time.
If a shadow ALU and data bus are employed, the coupled operation can be done
in parallel. Moreover, the inclusion of the coupled operation is not tied to
any implementation details (see footnote 13); that is, the designer is only
required to define the coupled register pairs (see footnote 14) and all
details are left out. It provides a greater efficiency gain when dealing with
vector or matrix manipulations, the basic computation component of scientific
calculation.
11 It is equivalent to
12 It is equivalent to (i_16).
13 For example, if only one set of ALU and bus is available, sequential execution of the coupled
operation is a must. On the other hand, if 4 sets of identical ALUs and buses are employed, concurrent
execution of the coupled operation is possible by simply duplicating the control signals for the shadow ALUs
and buses while altering the operand selection circuits according to the coupled register groups definition.
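The expansion of one coupled instruction into two register operations can be sketched in Python. The function names `couple` and `add` are hypothetical, and the sketch models the sequential, single-ALU case mentioned in footnote 13; a parallel implementation could differ if a source register of one half is the destination of the other.

```python
MASK32 = 0xFFFFFFFF


def couple(i: int) -> int:
    """Coupled fixed point register index, using the formula of section 2.5.2."""
    return (i + 8) % 16 if i < 16 else 16 + (i + 8) % 16


def add(regs: list[int], rx: int, ry: int, coupled: bool = False) -> None:
    """(Rx) <- (Rx) + (Ry); with the C bit set, repeat on the coupled pair."""
    pairs = [(rx, ry)]
    if coupled:
        pairs.append((couple(rx), couple(ry)))   # e.g. (R9, R10) for (R1, R2)
    for dst, src in pairs:                       # sequential execution assumed
        regs[dst] = (regs[dst] + regs[src]) & MASK32
```

Calling `add(regs, 1, 2, coupled=True)` thus performs both R1 <- R1 Add R2 and R9 <- R9 Add R10 from a single decoded instruction.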
2.5.3 Coupled-Read, Modify and Store instruction (CRMS) 
For most common applications, a 64-bit Triple-Instruction word is capable of expressing
an entire loop by means of the self branch instruction. The processing capacity is doubled
when the coupled operation mode is on. These two features can be further combined to form
a powerful primitive: the Coupled-Read, Modify and Store atomic instruction.
During the execution of one TIC instruction, two floating point operands are retrieved
by a coupled-read operation, and the next operand addresses are coupled-updated by the
fixed unit while the floating point operation is performed by the floating point unit.
After the floating point modification, the result is stored back to the corresponding address.
To achieve the single-instruction loop, delayed read is assumed. Detailed illustrations are
provided in the case study sections. The CRMS instructions are used in the Gaussian
elimination inner loop case and the matrix multiplication case, and are considered one
of the core constructs of TIC programs.
Up to now, we have outlined the detailed design and special features of the Triple-Instruction
Computer. The simulation design is discussed in chapter 3.
14 Coupled register groups are used when a logical couple is formed by more than two registers.
Chapter 3 Simulation of the architecture 
The architectural view of the Triple-Instruction Computer was illustrated in the previous
sections. Apart from instruction set design, functional organization design and logic design, all
other implementation details, encompassing integrated circuit design, packaging, power and
cooling, are left out. Since the proposed architecture is new and at an early stage of the
design process, iterative refinement and modification are a necessity.
To understand the behavior of the machine and to evaluate various strategies for
pipeline control, software based simulation is conducted. The results of the simulation help
us to understand the characteristics of the proposed TIC architecture and provide hints for
modifications before the full implementation commences.
3.1 Computer architecture simulation 
3.1.1 Previous approach 
The design of computer hardware should be conceived as an iterative top-down
process and should be strongly supported by appropriate software (CAD) tools [GILOI,
1980]. As a result, many tools have been developed in the past few decades. A specification and
simulation system is one such tool: it provides users with a complete description of the
functional behavior of computer hardware at the register transfer level (RT level), so that
the simulation system need not be built from scratch. This kind of "hardware description
language" (HDL) efficiently supports various kinds of VLSI design, hardware verification,
silicon compilation, etc. [HART, 1987]. Different implementors have their own independent
notations, levels of abstraction, basic object types and operations for their proposed HDLs, and
portability suffers as a result. For this reason, some standardization
efforts have been made [HART, 1987][PILOTY, 1985][STAN, 1985].
So far, HDLs and simulator packages are not as widespread as their
supporters claim. Many architects still use general purpose or simulation
languages such as C, Pascal, APL or SIMSCRIPT to write their architecture simulations.
Owing to the high cost and noticeable complexity of the specification and simulation
packages, many designers, especially those in academia, prefer to use easily accessible
general purpose languages to construct their hypothetical designs. Furthermore, the well-known
and successful IBM System/360 and its successor System/370 were simulated
and formally specified in APL, a high level language which has been used for
architecture specification since the 1960s.
As CPUs become more and more complicated, the processor datapath
cannot be easily worked out, and simulation based on hardwired control seems impractical.
Wilkes [WILKES, 1953] was ahead of his time in recognizing that the design of control
signals and datapaths is extremely complex and that many amendments are necessary before the
final version can be produced. The IBM System/360, a family of microprogrammed machines,
enabled the instruction set to be changed by altering the contents of the micro control store
without touching the hardware [TUCKER, 1967].
3.1.2 Our approach 
Even though APL was not originally designed for architecture simulation, it has been
chosen to simulate the proposed Triple-Instruction Computer for two main reasons. First
of all, the language processing power, communication protocol and interpretative
working environment make the implementation straightforward and self-documenting.
Moreover, as an architecture investigation research project, it is a worthwhile objective to explore
an architecture specification language that has been widely used since the 1960s. The characteristics of using
APL in simulating the Triple-Instruction Word Computer are also discussed in later sections.
During the preliminary stage of the research, the datapath, instruction set, control
signals and even functional units of the TIC are frequently changed. A hardwired
simulation is inefficient because of the difficulty of tracing the datapath. Consequently, the
Triple-Instruction Computer is simulated using microprogramming techniques. A
general micro control unit is built first, and three micro engines are subsequently
constructed for interpreting the Triple-Instruction Word, with the appropriate micro instructions and
start addresses filled into the control stores and the start address generation ROM.
A brief description of APL is given in section 3.2, and the simulation environment and
detailed simulation design are described in sections 3.3 and 3.4 respectively. The details of
the general micro control unit, Fixed Micro Engine, Floating Micro Engine and Branch
Micro Engine are described in section 3.5.
3.2 The APL language 
APL stands for A Programming Language, which was initially devised as an
alternative mathematical notation by Kenneth E. Iverson in the early 1960s. It is currently available on
most mainframes, minis, workstations and personal computers. The strengths of the
language fall into three categories: language processing power, communication protocol
and the interpretative working environment.
a. language processing power 
• Ease of expressing mathematical concepts and computational constructs with
extremely short programs and even one-line expressions. For instance, a one-line
expression may perform several assignment sub-expressions including dozens of function
calls:
num2 ← BtoTV bitvec ← TVtoB ⌊ IEEE64toV ieee ← VtoIEEE64¹ num1
where num1 and num2 are decimal values, ieee is a 64-bit vector in IEEE double
precision format and bitvec is a 32-bit vector.
• A rich set of functions for manipulating binary bit patterns (arrays of zeros and ones only),
such as take ↑, drop ↓, decode ⊥, encode ⊤, etc. For example, the
conversion of a 2's complement bit pattern to a decimal value can be written in one APL
statement (see figure 3.2).
• Built-in powerful vector and matrix manipulating functions such as expand \, matrix
inverse ⌹, reshape ⍴, transpose ⍉, etc. These are extremely helpful when dealing with
program loading, program disassembly, etc. For instance, the following APL
statement can be used to load a program into main memory starting at memory
location 1000:
MEM[1000+⍳8]←program {where the length of program is 8 words}
• A powerful set of operators on boolean operands, reflecting the operation of
computer hardware.
• Comprehensive syntax that allows for the introduction of special declarations of
hardware resources, signals, and data representations.
1 Refer to Figure 2.3 for an explanation of the functions.
b. communication protocol
Unlike most other popular computer languages, APL is a highly interactive language.
APL helps by giving an environment in which ideas can be put forth, rather than merely
providing a translation. With such an interpretative system, benefits come from thinking the
problem through, working interactively, seeing results at once and being able to
verify the data or alter the algorithms. This kind of interaction encourages two-way
communication between users and APL.
This property is certainly helpful for brainstorming, especially at the early stages of
the design cycle. For example, it is easy to make minor changes to the experimental
architecture, such as exchanging the execution sequence of two functional units. For iterative
design and architecture investigation, an interactive communication protocol is the
best fit.
c. interpretative integrated environment 
The APL system is a full-featured application development environment built
on the APL programming language. Several objects and functions form a workspace. To cope
with an application, users may invoke any function to manipulate the desired subset of
objects. The programmer is nearly free from concerns about the machine environment. As
a result, debugging in FORTRAN or C, in contrast to APL, is more difficult because of the
need to be concerned with memory locations, data declaration statements, the computer
environment, etc.
In short, APL allows users to see a big picture more easily by subordinating details. 
No wonder an APL programmer said,"…COBOL counts hydrogen and oxygen atoms in the 
vicinity; APL tells you if it's raining." [EISEN, 1990] 
In our case, many global objects are declared and many functions are defined
to operate on the objects. Objects consist of components of the Triple-Instruction Computer,
synchronization signals, and even test data. Functions simulate different levels of operation.
For example, the function GAUSSIAN simulates an operating system level operation for
initializing the memory and registers, loading the program into main memory and
starting the execution of the Triple-Instruction Computer, while the function μMPX2
simulates a micro-machine level operation for selecting the address of the next micro
instruction².
Only the related objects and functions are considered when observing the
architecture, and most of the implementation burdens are ignored. For example, if the
memory access mechanism is of concern, only the functions STOREmemINIT, FETCHmemINIT,
STOREmemCOMPLETED and FETCHmemCOMPLETED are required, and direct examination or
even modification of the objects MEM (memory), XREG[i] (fixed point register) and FREG[i]
(floating point register) in the intermediate stages is possible. The resulting working
environment is thus very suitable for iterative architecture design and investigation.
3.3 Simulation environment 
The simulation is implemented on STSC's APL*PLUS PC system (version 9.0)
running on an IBM-compatible personal computer under DOS version 3.3 or above; an
80386 machine is preferred but not necessary.
The STSC APL*PLUS PC system combines the APL interpreter with an application
development interface. The language interpreter provides a rich set of primitive functions,
operators and system functions. Several system functions are extremely valuable for
architecture simulation³.
2 Refer to section 3.5: Micro Multiplexer #2
Firstly, the Attention Latent Expression function ⎕ALX contains an APL
expression to execute in the event of an attention exception (usually generated by
pressing Ctrl-Break). If four monitors are required to display the system states of the
three functional units and the instruction scheduler, a round robin scheme can be
implemented using the ⎕ALX function as shown in figure 3.1.
LOOP:   {for every cycle}
        Calculate the Combinational Logic
        ... Displays
        ...
        ⎕ALX ← '→SWAP'
RETURN: branch back to LOOP
SWAP:   identify current MODE
        switch to next monitor and update current MODE
        branch back to RETURN
Figure 3.1 A round robin monitoring algorithm
Secondly, the inherent easy-to-use graphics functions, such as ⎕GWRITE,
⎕GLINE, ⎕GSHADE, etc., make the graphical monitor implementation feasible.
Furthermore, the delay execution function, ⎕DL, delays execution for the time
requested. It is used to adjust the speed of the graphical animation when the simulator runs
on different types of machines.
Besides the system functions, APL*PLUS also provides an excellent working
environment for undertaking computer architectural design simulation. Under the DOS
environment, it is almost impossible to carry out a complex architecture simulation without
running into capacity problems. For example, if an abstract machine requires 64K memory words, a
total of 256K bytes of real memory is needed if the word length of the target machine is 32 bits.
The use of Virtual Workspace mode may be a practical solution. It increases the apparent
size of the workspace by employing a supplementary Virtual Workspace file which is used
to store defined functions and objects that are not required for immediate use. Objects are
automatically swapped in and out of the real workspace as needed.
3 A simple von Neumann machine is simulated in APL using these system functions as a
demonstration.
3.4 Simulation design 
A top-down implementation method is applied based on the Register Transfer level
block diagram. Firstly, a general microprogramming engine is written and verified.
Subsequently, several operational routines for managing the micro control store, such as one to
LOAD a micro instruction file from a DOS native file, are established.
Afterwards, the instruction schedule unit is constructed by simulating the
synchronization signals generated by all the functional units, such as XNEW, XEND,
FNEW, FEND, etc. In this way, the synchronization signals of each functional unit are
defined.
In the next step, the individual functional units are established. Each consists of a micro
control unit, the corresponding synchronization signals and a data unit which includes data
registers, an Arithmetic & Logic Unit and data paths. Moreover, the micro control programs
for all micro engines (fixed unit, floating unit, and branch unit) are specified in order to
control the corresponding functional unit.
Finally, after all functional units become operational, we attempt to refine and
improve the workable draft architecture by removing bottlenecks and incorporating new ideas.
For example, since each functional unit executes at its own pace and a logical state must be
created in order to have a valid branching point, several logical state patterns may be
investigated to reach the best solution.
For clarity, all APL functions and objects at the machine level start with an
uppercase letter, and the functions and objects at the microprogramming level start with a
lowercase letter.
The mapping of architectural features to APL can be summarized as the following 
seven main categories: 
a. logic gate level operations
In mapping gate-level operations into APL programs, concise and readable notations
are available. There are five standard logical operators, namely NAND ⍲, NOR ⍱, NOT ~,
OR ∨, and AND ∧. For instance, referring to figure 2.7b of chapter 2, the instruction load
enable signal (L-ENC) is determined by the self-branch signal (S-BR), the valid condition (VC),
and the micro-unit instruction END signals, namely XEND, FEND, and BEND. The
program segment implementing this case is shown below.
L-ENC ← END∧~S-BR∧VC
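Because APL evaluates expressions from right to left, the load enable expression groups as END AND NOT (S-BR AND VC), which agrees with the entry for L-ENC in figure 3.4. A minimal Python sketch of the same gate-level mapping follows; the function name `load_enable` is a hypothetical stand-in, and signals are modelled as the integers 0 and 1.

```python
def load_enable(end, s_br, vc):
    """L-ENC = END AND NOT (S-BR AND VC); every signal is 0 or 1."""
    return end & (1 - (s_br & vc))

# Print the full truth table of the three input signals.
for end in (0, 1):
    for s_br in (0, 1):
        for vc in (0, 1):
            print(end, s_br, vc, load_enable(end, s_br, vc))
```

The pipeline is therefore shifted whenever the preceding long instruction has ended, except when a valid self branch keeps the same instruction in place.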
b. data storage and dataflow 
APL variables are used to simulate the data storage devices, registers and memory
cells, since an assigned value remains available until it is changed by another assignment
statement. We store the data in decimal form instead of binary, though both can be
represented by an APL variable. In the example architecture, a 32-bit word, 32-bit fixed
point registers and 64-bit floating point registers are employed. It is expensive and almost
impossible to employ binary format in memory because of the huge size. Inside the CPU,
the data stored in registers may be in either format. For simplicity and homogeneity,
all the internal registers employ decimal format, except the instruction pipes
(registers), which are in binary format.
As stated in figure 2.2, data flow from registers to registers. A register's input is the result
of an n-input function as shown:
REG_i ← f(REG_j, REG_k, REG_l)   where j, k, l are any registers, including i,
                                 and f may consist of any composite operation
The example program segments for the partial instructions (1) RX:MOV_R⁴ and
(2) F:MULADD are shown:
MAR ← REG[2⊥4↑10↓INST]⁵ ← REG[2⊥4↑14↓INST] ← 2⊥8↑18↓INST   (1)
REG[2⊥4↑31↓INST] ← REG[2⊥4↑31↓INST] + REG[2⊥4↑35↓INST] × REG[2⊥4↑39↓INST]   (2)
4 An RX format MOV instruction with a READ memory operation.
5 Drop 10 bits from the vector INST and then take the first 4 elements of the result vector.
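The field extraction idiom used in these program segments (drop some leading bits, take a fixed number of bits, then decode them as a base-2 number) can be sketched in Python as follows. The helper names `decode` and `field` and the sample instruction contents are assumptions made for illustration only.

```python
def decode(bits):
    """Value of a most-significant-first bit vector (base-2 decode)."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value

def field(inst, drop, take):
    """Skip `drop` leading bits of inst, take the next `take` bits, and decode them."""
    return decode(inst[drop:drop + take])

# Hypothetical 32-bit instruction pipe entry holding 0100 in bit positions 11 to 14.
inst = [0] * 32
inst[10:14] = [0, 1, 0, 0]
print(field(inst, 10, 4))   # register number selected by this field
```

Keeping the instruction pipe as a bit vector makes such field selection a one-line slice, which is the reason the instruction pipes are the only registers held in binary format.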
c. ALU design 
Since all the arithmetic operations have been implemented efficiently using APL in
the IBM System/360 project, an abstract data manipulation approach is employed; that is,
the ALU is realized by arithmetic operators, not logical operators. This kind of abstraction
helps to simplify the problem and to focus on the nature of the issues: the interrelationship
between the multiple functional units and the most efficient and satisfactory design of the
proposed Long Instruction Word architecture.
d. data encoding and decoding
For simplicity, data are stored as decimal numbers instead of binary
patterns. Several functions are used for converting data from decimal to binary, and
vice versa. These functions are implemented with the primitive functions
encode and decode. An example function for converting a bit vector in two's complement
format to decimal is shown in figure 3.2, while the full set of data encoding and decoding
functions is listed in figure 3.3.
    ∇ VALUE←BtoV BITVEC
      →L1×⍳BITVEC[1]=0
      VALUE←-(1+2⊥~BITVEC) ⋄ →0
  L1: VALUE←2⊥BITVEC
    ∇
Figure 3.2 An example program listing for converting a bit pattern in two's complement
format to decimal format
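As an illustration of the conversion just described, the following Python sketch mirrors the logic of the two's complement function in figure 3.2 and adds the reverse direction. The names `b_to_v` and `v_to_b` are hypothetical and are not the simulator's own functions.

```python
def b_to_v(bitvec):
    """Two's complement bit vector (most significant bit first) to a signed integer."""
    def decode(bits):
        value = 0
        for bit in bits:
            value = 2 * value + bit
        return value
    if bitvec[0] == 0:                     # non-negative: plain base-2 decode
        return decode(bitvec)
    inverted = [1 - bit for bit in bitvec]
    return -(1 + decode(inverted))         # negative: invert, add one, negate

def v_to_b(value, width=32):
    """Signed integer to a two's complement bit vector of the given width."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]

print(b_to_v(v_to_b(-5, 8)))
```

The round trip through `v_to_b` and `b_to_v` returns the original value for any integer that fits in the chosen width.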
Function Name  Source format            Target format            Description
TVtoB          decimal value            32-bit binary (2's C)    used for displaying the bit pattern of
                                                                 fixed point registers and memory locations
BtoTV          32-bit binary (2's C)    decimal value            reverse function of TVtoB
VtoIEEE64      decimal value            IEEE floating (double)   used for converting a decimal value to
                                                                 IEEE double precision format (64-bit)
                                                                 for the floating point unit
IEEE64toV      IEEE floating (double)   decimal value            reverse function of VtoIEEE64
VtoHEXS        decimal value            Hexadecimal (8 digits)   used for converting a decimal value to
                                                                 hexadecimal format instead of a binary
                                                                 pattern for compact display mode
Figure 3.3 List of conversion programs
e. multiplexing 
As shown in figure 2.3, the fixed point ALU takes two operands: one comes directly
from the fixed point register file, while the other comes from either the register file, the fixed
point instruction, the branch instruction or the hidden temporary register (CTemp). A few
bits of the micro instruction control the multiplexer's selection. For example, the micro
instructions of the fixed point unit and the branching unit employ two bits and one bit
respectively for this purpose.
                  BIT   Selection   Description
(1) Fixed unit    00    CTemp       hidden register
                  01    B           from register file
                  10    Fx          immediate operand
(2) Branch unit   0     B           from register file
                  1     Br          absolute branching address
The operand leaving the multiplexer is selected by the micro instruction as follows:
(CTemp,B,Fx)[1 + 2⊥μFIX[15 16]]   {assume ⎕IO is 1} (for fixed unit)
(B,Br)[1 + μBR[15]]               {assume ⎕IO is 1} (for branch unit)
where μFIX and μBR stand for the micro instructions of the fixed unit and the branching unit
respectively.
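The indexing trick above (decode the selection bits and use the result as an index into the list of candidate operands) can be sketched in Python. The function names are hypothetical, and treating the unused code 11 as an error is an assumption of this sketch rather than documented simulator behavior.

```python
def select_fixed_operand(ctemp, b, fx, mux_bits):
    """Choose the second ALU operand of the fixed unit from two selection bits."""
    index = 2 * mux_bits[0] + mux_bits[1]      # base-2 decode of the two bits
    choices = (ctemp, b, fx)                   # 00 -> CTemp, 01 -> B, 10 -> Fx
    if index >= len(choices):
        raise ValueError("selection code 11 is not listed for the fixed unit")
    return choices[index]

def select_branch_operand(b, br, mux_bit):
    """Choose the branch unit operand from one selection bit: 0 -> B, 1 -> Br."""
    return (b, br)[mux_bit]

print(select_fixed_operand(7, 8, 9, [1, 0]))   # selects the immediate operand Fx
```

Python indexes from 0, so the "1 +" offset needed under an index origin of 1 disappears.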
f. memory access (with cache)
There are eight memory ports: two for fixed point store, two for fixed point fetch, two
for floating point store and two for floating point fetch. Each port contains a Memory
Address Register (MAR) and a Memory Data Register (MDR). A memory fetch or store
request is generated according to the control bit T (type of memory operation) when
the MAR and/or the MDR is loaded. A data ready tag is attached to each MDR to
indicate whether the requested data is available (i.e. 1=ready and 0=not_ready). All tags
are set to 1 initially. When a fetch request arises, the corresponding tag is reset to 0 until
the data come back.
The Memory Access Time (MAT) varies from case to case since cache memory is included.
Assume the MAT ranges from MATmin to MATmax. The access time may then be simulated as:
(MATmin + ?1 + MATmax - MATmin) - 1   {assume ⎕IO is 1}
The synchronization problem can be resolved by keeping a Data_NOT_ready flag:
DNOTrdy←((data_rdy[i]=0)∧(isMDR REG_A))∨((data_rdy[j]=0)∧(isMDR REG_B))
where 1) data_rdy[i] is the data ready tag of the ith MDR
2) i is the identification number of the MDR being selected by channel A (and j likewise for channel B)
3) isMDR is a function for checking whether the selected register is an MDR
The micro engine is suspended by simply keeping the μPC unchanged while its
corresponding Data_NOT_ready flag is set.
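A minimal Python sketch of the two expressions above follows. It assumes that the access time is drawn uniformly from MATmin to MATmax inclusive, which is what the roll expression yields under an index origin of 1; the function and parameter names are hypothetical.

```python
import random

def memory_access_time(mat_min, mat_max, rng=random):
    """Random access time in cycles, uniform over mat_min..mat_max inclusive."""
    # rng.randint(1, n) plays the role of the APL roll ?n with index origin 1.
    return mat_min + rng.randint(1, 1 + mat_max - mat_min) - 1

def data_not_ready(data_rdy, i, j, a_is_mdr, b_is_mdr):
    """1 if channel A or channel B selects an MDR whose data ready tag is still 0."""
    return int((data_rdy[i] == 0 and a_is_mdr) or (data_rdy[j] == 0 and b_is_mdr))

tags = [1, 0]        # the MDR with identification number 1 is waiting for its data
print(data_not_ready(tags, 0, 1, True, True))
```

While the flag is 1 the simulated micro engine simply leaves its micro program counter unchanged, so no separate stall logic is needed.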
g. control signals synchronization 
As stated in (a), all control signals can be simulated by elementary logical operators.
Considering the instruction fetch control and the interlocks of the three micro-control units
in figure 2.6 and figure 2.7, the control signals, such as L-ENC, S-BR, REFRESH, etc., are
defined as logical variables which hold a value of either "0" or "1". Details of the
control signals are listed in figure 3.4.
Every control signal, which is examined at the end of each micro-instruction cycle,
is a function of the other control signals and of the signals generated by the micro-control
units, ALUs, or other special events of the system. The signals affect the machine state so
as to alter the execution sequence. Because the signals depend on one another, the evaluation order
of these variables matters; the corresponding dependency graph is shown in figure
3.5.
Signal    Name                  Description                                        Logical expression
INIT      program initiation    Generated by an external event. Indicates that     External signal
                                the first instruction comes into the CPU and
                                initializes the micro-engines to work.
INS-V     instruction valid     A ready tag bit of the first entry of the          1 = valid instruction
                                instruction pipeline; indicates whether it is      0 = invalid instruction
                                a valid instruction.
XEND      fixed end             End signal of the fixed unit micro-engine,         a field of the micro-instruction
                                which indicates that the preceding partial
                                fixed instruction is completed.
FEND      floating end          End signal of the floating unit micro-engine.      a field of the micro-instruction
BEND      branch end            End signal of the branch unit micro-engine.        a field of the micro-instruction
END       end of the preceding  A long instruction word, namely the three          BEND
          instruction           partial instructions, is completed.
VC        valid condition       Valid branch condition. T is a tag bit of the      ((T = Fx)∧(XB = 1))∨((T = Ft)∧(FB = 1))
                                branch instruction indicating the unit according
                                to which the branch operation is taken.
S-BR      self branch           As mentioned in section 2.3.2.                     a field of the branch partial instruction
REFRESH                         Re-load the whole instruction pipeline.            INIT ∨ (VC∧~S-BR)
L-ENC     load enable           Shift the 64-bit instruction pipelines down one    END∧~(VC∧S-BR)
                                register and read a new instruction from the
                                memory location with address
                                PC+CBASE+length of pipeline.
XNEW⁶     new fixed             Start decoding the next fixed partial              INS-V ∧ (INIT∨BEND)
                                instruction by the micro-engine.
FNEW      new floating          Start decoding the next floating partial           INS-V ∧ (INIT∨BEND)
                                instruction.
BNEW      new branch            Start decoding the next branch partial             XEND∧FEND
                                instruction.
Figure 3.4 Control Signals
According to figure 3.5, the evaluation order of the control signals can be easily
obtained. If a modification to the synchronization signals is required, only a refinement of the
dependency diagram is needed, and the new evaluation order and the equivalent APL
statements can then be obtained accordingly. This systematic approach simplifies the design
process.
6 The XNEW, FNEW and BNEW signals are reset automatically after triggering the micro engines to start
a new instruction.
Figure 3.5 Dependency graph of control signals (no cyclic dependency is found)
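The logical expressions listed in figure 3.4 can be evaluated in one pass once the signals are ordered so that each one is computed after the signals it depends on. The Python sketch below is one such ordering under stated assumptions: VC is passed in as an already computed input, signals are 0 or 1, and the function name `evaluate_signals` is hypothetical.

```python
def evaluate_signals(init, ins_v, xend, fend, bend, vc, s_br):
    """Derive the dependent control signals from the primary ones (all 0 or 1)."""
    signals = {}
    signals["END"] = bend                              # END follows BEND
    signals["REFRESH"] = init | (vc & (1 - s_br))      # INIT or a valid non-self branch
    signals["L-ENC"] = signals["END"] & (1 - (vc & s_br))
    signals["XNEW"] = ins_v & (init | bend)
    signals["FNEW"] = ins_v & (init | bend)
    signals["BNEW"] = xend & fend                      # branch unit starts after both ends
    return signals

# Example: all three partial instructions ended, valid next instruction, no branch taken.
print(evaluate_signals(init=0, ins_v=1, xend=1, fend=1, bend=1, vc=0, s_br=0))
```

Since no dependent signal feeds back into a primary one, any order that computes END before L-ENC is valid, which is consistent with the absence of cyclic dependencies.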
3.5 The micro-architecture 
3.5.1 The General Micro Control Unit 
The general structure of the Micro Control Unit for each of the three decoders is
shown in figure 3.6. It mainly consists of a Start Address Filter circuit, Multiplexers for
selecting the address of the next micro instruction, a micro Program Counter, a Control Store,
and a micro Instruction Register.
Figure 3.6 The Micro Engine Block Diagram (blocks shown: IR.opcode, look-up table,
Multiplexer with NEW input, micro_PC, CONTROL STORE (32-bit x 1024), and micro_IR with
fields end, Cond, br_addr and control signals; the end signal goes to the ISU)
a. Start Address Generating circuit
The start address generating circuit employs a simple table look-up method. A tiny
erasable Read Only Memory is used to store the starting addresses of all valid operation
codes (OP). If the length of the operation code field is N, the size
of the erasable ROM can be calculated as:
2^N * (length of a micro address)
The full map of the start address generation ROM is discussed in detail
in later sections.
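As a worked instance of the size formula, the sketch below assumes a hypothetical 4-bit operation code field and a 10-bit micro address; these widths are chosen only for illustration, and the helper names are not taken from the simulator.

```python
def start_rom_bits(opcode_bits, micro_addr_bits):
    """Size in bits of the start address ROM: 2^N entries of one micro address each."""
    return (2 ** opcode_bits) * micro_addr_bits

def start_address(rom, opcode):
    """Table look-up: the opcode value indexes the ROM directly."""
    return rom[opcode]

# Hypothetical contents: opcode 0011 starts at micro address 40.
rom = [0] * (2 ** 4)
rom[0b0011] = 40
print(start_rom_bits(4, 10), start_address(rom, 0b0011))
```

The ROM grows exponentially with the opcode width but only linearly with the micro address width, which is why a short opcode field keeps it tiny.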
b. Micro Multiplexer # 2 
This multiplexer (μMPX2) is used to select the address of the next micro instruction.
The incoming choices are: 1) the old micro program counter (μPC) value; 2) the old μPC
value plus one; 3) a new start address from the Start Address generation circuit; and 4) the
branch address from the micro instruction register (μIR).
By incorporating a few additional ideas, the multiplexer also helps to solve the local
synchronization problem. Two more signals, named DATArdy and ALUrdy, are introduced.
Owing to the implicit memory read protocol and the execution pipeline, it cannot be assumed that
every instruction involves a memory read; thus the DATArdy (DATA ready) flag is used
to signal the availability of the requested memory operand (placed in the Memory Data
Register). Whenever a read request has been generated but the desired data has not yet come back,
DATArdy is reset to 0, which suspends all operations by simply keeping the μPC
unchanged. The ALUrdy (ALU ready) flag works on the same principle, but it is
used to indicate the availability of the Arithmetic and Logic Unit.
Besides the above two signals, the output of the multiplexer also depends on other
conventional control signals, namely LDBr and NEW. A summary of these signals is shown
below:
LDBr      LoaD Branch       If LDBr is set to 1, μPC will be set to the branch address field
                            of the μIR, which means a micro branch occurs.
NEW       NEW instruction   If NEW is set to 1, μPC will be set to the new start address
                            generated by the Start Address Generating Circuit, which
                            means a new machine level instruction is loaded and a
                            branch to the corresponding micro instructions occurs.
DATArdy   DATA ready flag   The DATA ready flag indicates whether all desired Memory
                            Data Registers (MDRs) are ready. It is reset to 0 if any data
                            has not yet come back from memory.
ALUrdy    ALU ready flag    The ALU ready flag indicates whether the ALU operation is
                            completed. It is reset to 0 if the ALU is not ready for the next
                            micro instruction.
The function table of the multiplexer is shown below:
LDBr  NEW  DATArdy  ALUrdy  OUTPUT             Remarks
0     0    0        X       μPC                μPC remains unchanged
0     0    X        0       μPC                μPC remains unchanged
0     0    1        1       μPC + 1            Normal case: add one to the old μPC
X     1    X        X       new start address  A new machine level instruction is loaded; read from
                                               the start address generating circuit
1     0    X        X       μIR.BrAddr         A valid micro branch occurs; the branch address of
                                               the μIR is loaded
(where X means Don't Care)
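The function table can be read as a priority scheme: NEW overrides everything, LDBr comes next, and otherwise the counter advances only when both ready flags are set. A minimal Python sketch of that reading follows; the function and parameter names are hypothetical.

```python
def next_upc(upc, ld_br, new, data_rdy, alu_rdy, start_addr, br_addr):
    """Select the next micro program counter value according to the function table."""
    if new:                       # X 1 X X : a new machine level instruction is loaded
        return start_addr
    if ld_br:                     # 1 0 X X : a valid micro branch
        return br_addr
    if data_rdy and alu_rdy:      # 0 0 1 1 : normal sequential case
        return upc + 1
    return upc                    # 0 0 0 X or 0 0 X 0 : stall, counter unchanged

print(next_upc(10, 0, 0, 1, 1, start_addr=64, br_addr=200))
```

Holding the counter in the last case is what suspends a micro engine while it waits for memory data or for the ALU.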
c. Micro Program Counter (μPC)
The micro program counter stores the address of the next micro instruction. For
example, if the μPC contains 10, the tenth element of the Control Store will be
copied to the μIR at the next cycle.
d. Control Store
The Control Store is another erasable read only memory. It stores the micro
instructions for all of the machine level instructions. The size of the Control Store can be
calculated by the formula n * w bits, where n is the number of micro instructions required
and w is the width of each micro instruction. Conventionally, n is a power of 2, to
fully utilize the addressing space, and w varies from case to case depending on
the number of control signals needed.
e. Micro Instruction Register (μIR)
The micro instruction register holds the current micro instruction. Each micro
instruction can be divided into three parts: the micro sequencing controls, the execution
controls, and the inter-micro-engine synchronization controls.
The micro sequencing control bits are used to indicate the micro branching conditions
and branching address. The execution control bits hold all control signals for the
data latches, multiplexer selection, ALU operations, etc. The inter-micro-engine control
bits are used to synchronize with the other micro engines. The detailed micro instruction
format of each micro engine is shown in later sections.
3.5.2 Micro Instruction Execution 
The execution cycle of a micro instruction can be further divided into 3 sub-cycles.
Cycle 1, cycle 2 and cycle 3 form a three-stage pipeline. The three cycles are:
Stage           Cycle #   Description
Fetch Stage     Cycle 1   latch the new micro instruction into the μIR, and update the current
                          μPC for the next cycle
Operand Stage   Cycle 2   gate the input registers (operands) to the ALU of the functional unit
Execute Stage   Cycle 3   perform the function denoted by the micro instruction and gate the
                          result back to the output register
Since a three-stage pipeline is employed, the μPC and μIR both change at
each cycle. Consequently, a valid micro branch would otherwise abort the pipeline, and thus delayed
branch is assumed; that is, micro-instruction i+1 and micro-instruction i+2 will be executed
even though micro-instruction i is a valid branch micro instruction.
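The effect of the delayed branch on the order of executed micro instructions can be sketched as follows. This is a simplified model, not the simulator's mechanism: it assumes exactly two delay slots, ignores any branch that sits inside a delay slot, and uses hypothetical names throughout.

```python
DELAY_SLOTS = 2   # micro instructions i+1 and i+2 still execute after a branch at i

def executed_addresses(branch_targets, start, count):
    """Return the first `count` control store addresses executed from `start`.

    branch_targets maps the address of a taken micro branch to its target.
    """
    trace = []
    pc = start
    pending = None                       # (remaining delay slots, target) of a taken branch
    for _ in range(count):
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:
            slots_left, target = pending
            if slots_left == 1:          # last delay slot: redirect after it
                next_pc, pending = target, None
            else:
                pending = (slots_left - 1, target)
        elif pc in branch_targets:       # branch taken: start counting delay slots
            pending = (DELAY_SLOTS, branch_targets[pc])
        pc = next_pc
    return trace

print(executed_addresses({2: 10}, start=0, count=7))
```

In this model a branch at address 2 to address 10 is followed by addresses 3 and 4 before control reaches 10, matching the statement that micro-instruction i+1 and i+2 are executed.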
3.5.3 Fixed Micro Engine 
As revealed in figure 2.6 in chapter 2 and figure 3.6, the fixed micro engine has 4
external inputs, which are the operation code of the current partial instruction
(IR.Fixed.opcode), the new partial instruction signal (XNEW) from the instruction
schedule unit, the availability flag of the ALU (ALUrdy) and the availability flag of the memory
data register (DATArdy), and it generates control signals to coordinate the TIC. Details of the
micro instruction format are shown in section 3.5.3 (i).
(i) Micro Instruction Format 
Field:  Cond | BrAddr | MPX   | ALU   | BUS_A | BUS_B | BUS_C | ENC | FtoX | EXIT | XEND
Bits:   1-4  | 5-14   | 15-16 | 17-21 | 22-23 | 24-25 | 26-27 | 28  | 29   | 30   | 31
Note: 1) Bit 1 to bit 14 are used to control the micro instruction sequencing. In the
simplest TIC model, these two fields are reserved and set to 1111 0000000000
when no micro branch is necessary.
2) Bit 15 to bit 31 are used to coordinate the execution of the fixed unit.
Moreover, the XEND signal is transmitted to the instruction schedule unit for
the purpose of synchronization between functional units.
3) The meaning of each field is summarized in the following table.
Field Name Description 
Cond Branching condition A valid branch signal LDBr is generated by matching the
specified branch condition against the program status word (PSW).
For the TIC instructions implemented, the micro
instructions are coded sequentially and no branch is necessary.
The Cond field of every micro instruction is set to 1111, meaning that
no micro branch is required.
BrAddr Branch Address The 10-bit branch address.
MPX Multiplexer The MPX field controls the fixed point multiplexer, which
selects the operand placed on BUS B. Four choices are
available:
0 0 BUS B selects Ctemp (BUS C's temporary register) 
0 1 BUS B selects register file or fixed immediate value 
1 0 BUS B selects TABLE Clear (a preset table with 32 
entries which is used as a mask to clear bits) 
1 1 BUS B selects TABLE Set (another mask for set bits) 
ALU ALU operation
00000 No Operation                    10000 A AND B
00001 A + B                           10001 A OR B
00010 A - B                           10010 A EXOR B
00011 A * B                           10011 NOT B
00100 A ÷ B                           10100 Clear A's Bth bit
00101 reserved                        10101 Set A's Bth bit
00110 reserved                        10110 Clear ALL bits
00111 reserved                        10111 Set ALL bits
01000 shift A to left by B bits       11000 reserved
01001 shift A to right by B bits      11001 reserved
01010 rotate A to left by B bits      11010 reserved
01011 rotate A to right by B bits     11011 reserved
01100 rotate to left with carry       11100 increment B
01101 rotate to right with carry      11101 decrement B
01110 transfer B directly             11110 reserved
01111 reserved                        11111 reserved
BUS_A Bus A selection Operand of BUS A:
00 IR.Rx the register specified in the Rx field
01 IR.Ry the register specified in the Ry field
10 IR.Rz or IR.opd the register specified in the Rz field (RR format)
or the opd field (RI format)
11 SP the fixed register #29, Stack Pointer
BUS_B Bus B selection Same as BUS_A.
BUS_C Bus C selection Same as BUS_A.
ENC Enable write C If ENC is set to 1, the result of the ALU operation will be stored 
back to the register specified by BUS C. 
FtoX Floating to fixed If FtoX is set to 1, the floating point register #0 (FR0) will be
latched to fixed point registers #0 (R0) and #1 (R1), where bit 0 of
R0 is the most significant bit.
EXIT System Halt If EXIT is set to 1, the program will halt and control will be
returned to the Operating System.
XEND fixed END The partial instruction completion signal. XEND is set (=1) to
indicate that the fixed part of the current machine level
instruction is completed.
(ii) The Control Store Map
The control store of the fixed micro engine is carefully designed and fully debugged.
The following table lists all micro instructions in the control store; the Remark column
names the machine level TIC instruction that each one interprets.
μaddress μinstruction Remark
1  1111 0000000000 00 00000 00 00 00 0 0 0 0  dummy micro instruction
2  1111 0000000000 01 00001 01 10 00 1 0 0 1  [RR-ADD.1]¹ or [RI-ADD.1]
3  1111 0000000000 01 00010 01 10 00 1 0 0 1  [RR-SUB.1] or [RI-SUB.1]
4  1111 0000000000 01 00011 01 10 00 1 0 0 1  [RR-MUL.1] or [RI-MUL.1]
5  1111 0000000000 01 00100 01 10 00 1 0 0 1  [RR-DIV.1] or [RI-DIV.1]
6  1111 0000000000 01 00001 01 10 00 0 0 0 0  [RR-ADD2.1] or [RI-ADD2.1]
7  1111 0000000000 00 00001 00 00 00 1 0 0 1  [RR-ADD2.2] or [RI-ADD2.2]
8  1111 0000000000 01 00011 01 10 00 0 0 0 0  [RR-MULADD.1] or [RI-MULADD.1]
9  1111 0000000000 00 00001 00 00 00 1 0 0 1  [RR-MULADD.2] or [RI-MULADD.2]
10 1111 0000000000 01 00011 01 10 00 0 0 0 0  [RR-MULADD_Rz.1]
11 1111 0000000000 00 00001 00 00 10 1 0 0 1  [RR-MULADD_Rz.2]
12 1111 0000000000 01 01000 01 10 00 1 0 0 1  [SL.1]
13 1111 0000000000 01 01001 01 10 00 1 0 0 1  [SR.1]
14 1111 0000000000 01 01010 01 10 00 1 0 0 1  [RL.1]
15 1111 0000000000 01 01011 01 10 00 1 0 0 1  [RR.1]
16 1111 0000000000 01 01100 01 10 00 1 0 0 1  [RCL.1]
17 1111 0000000000 01 01101 01 10 00 1 0 0 1  [RCR.1]
18 1111 0000000000 01 01110 00 01 00 1 0 0 1  [MOV.1]
19 1111 0000000000 01 00001 00 01 00 1 0 0 0  [EXCH.1]
20 1111 0000000000 01 00010 00 01 01 1 0 0 0  [EXCH.2]
21 1111 0000000000 01 00010 00 01 00 1 0 0 1  [EXCH.3]
22 1111 0000000000 01 10000 01 10 00 1 0 0 1  [AND.1]
23 1111 0000000000 01 10001 01 10 00 1 0 0 1  [OR.1]
24 1111 0000000000 01 10010 01 10 00 1 0 0 1  [EXOR.1]
25 1111 0000000000 01 10011 00 00 00 1 0 0 0  [NOT.1]
26 1111 0000000000 01 10011 00 01 01 1 0 0 0  [NOT.2]
27 1111 0000000000 01 10011 00 10 10 1 0 0 1  [NOT.3]
29 1111 0000000000 11 10101 01 10 00 1 0 0 1  [SET.1]
30 1111 0000000000 01 10110 00 00 00 1 0 0 1  [CLRA.1]
31 1111 0000000000 01 10111 00 00 00 1 0 0 1  [SETA.1]
32 1111 0000000000 01 11100 00 11 11 1 0 0 0  [PUSHX.1]
33 1111 0000000000 01 01110 00 11 01 1 0 0 0  [PUSHX.2]
34 1111 0000000000 01 01110 00 00 10 1 0 0 1  [PUSHX.3]
35 1111 0000000000 01 01110 00 11 01 1 0 0 0  [POPX.1]
36 1111 0000000000 01 11101 00 11 11 1 0 0 0  [POPX.2]
37 1111 0000000000 01 01110 00 10 00 1 0 0 1  [POPX.3]
38 1111 0000000000 00 00000 00 00 00 0 0 0 1  reserved
39 1111 0000000000 00 00000 00 00 00 0 0 0 1  reserved
40 1111 0000000000 01 11100 00 00 00 1 0 0 0  [INC.1]
41 1111 0000000000 01 11100 00 01 01 1 0 0 0  [INC.2]
42 1111 0000000000 01 11100 00 10 10 1 0 0 1  [INC.3]
43 1111 0000000000 01 11101 00 00 00 1 0 0 0  [DEC.1]
44 1111 0000000000 01 11101 00 01 01 1 0 0 0  [DEC.2]
45 1111 0000000000 01 11101 00 10 10 1 0 0 1  [DEC.3]
46 1111 0000000000 00 00000 00 00 00 0 1 0 1  [FtoX.1]
47 1111 0000000000 00 00000 00 00 00 0 0 1 1  [HALT.1]
48 1111 0000000000 01 01110 00 10 00 1 0 0 0  [RI-MOV.1]
49 1111 0000000000 01 01110 00 10 01 1 0 0 1  [RI-MOV.2]
50 1111 0000000000 00 00000 00 00 00 0 0 0 1  [NOP.1]
¹ Read as 1st micro instruction of the machine level instruction RR format ADD.
For example, the μinstruction stored at μaddress 2 is 1111 0000000000 01 00001 01
10 00 1 0 0 1. When the fixed part of the TIC instruction is ADD, whether of RR type or
RI type, the fixed micro engine loads this micro instruction first. The interpretation of
this μinstruction is:
1) The register used on BUS_A (REG_A in figure 2.2) is specified in
IR.Fixed.Ry (bit 15 to bit 19 of the current TIW if it is RR type).
2) The register used on BUS_B (REG_B in figure 2.2) is specified in
IR.Fixed.Rz (bit 20 to bit 24 of the current TIW if it is RR type).
3) The register used on BUS_C (REG_C in figure 2.2) is specified in
IR.Fixed.Rx (bit 10 to bit 14 of the current TIW if it is RR type).
4) The operation of the fixed ALU is ADD.
5) BUS_C should be latched back to the register file.
6) There is no Floating_to_Fixed transfer.
7) It is not a HALT instruction.
8) This is the last μinstruction of the corresponding TIW.
Some TIC instructions, such as ADD_and_ADD and Multiply_and_ADD, consist of
more than one μinstruction. The μinstruction count is shown in the corresponding
remark field.
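The interpretation steps above can be reproduced mechanically by splitting the 31-bit pattern into its fields. The following is a minimal Python sketch (the thesis simulator is written in APL); the field widths follow the micro instruction format of section 3.5.3 (i), while the names `FIXED_FIELDS` and `decode_fixed` are hypothetical and introduced only for this illustration.

```python
# (field name, width in bits), most significant field first; 31 bits in total.
FIXED_FIELDS = [
    ("Cond", 4), ("BrAddr", 10), ("MPX", 2), ("ALU", 5),
    ("BUS_A", 2), ("BUS_B", 2), ("BUS_C", 2),
    ("ENC", 1), ("FtoX", 1), ("EXIT", 1), ("XEND", 1),
]


def decode_fixed(bit_string):
    """Split a fixed micro instruction, written as a bit string, into fields."""
    bits = bit_string.replace(" ", "")
    if len(bits) != sum(width for _, width in FIXED_FIELDS):
        raise ValueError("a fixed micro instruction is 31 bits long")
    fields = {}
    position = 0
    for name, width in FIXED_FIELDS:
        fields[name] = int(bits[position:position + width], 2)
        position += width
    return fields


# Micro instruction at micro-address 2: [RR-ADD.1] or [RI-ADD.1].
add_1 = decode_fixed("1111 0000000000 01 00001 01 10 00 1 0 0 1")
```

Here `add_1["BUS_A"]` is 1 (IR.Ry), `add_1["BUS_B"]` is 2 (IR.Rz), `add_1["BUS_C"]` is 0 (IR.Rx), `add_1["ALU"]` is 1 (A + B), and `ENC` and `XEND` are both 1, matching points 1) to 8) above.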
(iii) The start address generation ROM
The start address generation ROM is generated according to the operation code field
of the fixed part of the Triple-Instruction Word and the fixed control store described in
section 3.5.3 (ii). For example, if IR.Fixed.opcode is equal to 00000, it is a NOP instruction
and the starting μaddress is 50, while if IR.Fixed.opcode is equal to 00011, it is a MUL
instruction and the starting μaddress is 04.
The content maps of the start address generation ROM for RR-format instructions and
RI-format instructions are shown below:
RR-format: (32 TIC instructions only)
50 02 03 04 05 06 08 10 12 13
14 15 16 17 18 19 22 23 24 25
28 29 30 31 32 35 38 39 40 43
46 47
RI-format: (16 TIC instructions only)
48 02 03 04 05 06 08 01 01 01
01 01 01 01 01 01
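The ROM lookup can be sketched as a simple table index. This Python fragment is illustrative only: the contents are copied from the two maps above and the names RRSTART and RISTART are those used for these ROMs in section 4.2.4, but the function `start_address` and the assumption that the ROM is indexed directly by the opcode value (starting from 0) are inferred from the NOP and MUL examples, not stated as such in the thesis.

```python
# Start address generation ROMs for the fixed micro engine (values from the maps above).
RRSTART = [50, 2, 3, 4, 5, 6, 8, 10, 12, 13,
           14, 15, 16, 17, 18, 19, 22, 23, 24, 25,
           28, 29, 30, 31, 32, 35, 38, 39, 40, 43,
           46, 47]
RISTART = [48, 2, 3, 4, 5, 6, 8, 1, 1, 1,
           1, 1, 1, 1, 1, 1]


def start_address(opcode, ri_format=False):
    """Return the starting micro-address for a fixed-part opcode.

    Assumption: the ROM entry for opcode k is the k-th value of the map.
    """
    rom = RISTART if ri_format else RRSTART
    if not 0 <= opcode < len(rom):
        raise ValueError("opcode outside the start address ROM")
    return rom[opcode]
```

Under that assumption, `start_address(0b00000)` gives 50 (NOP) and `start_address(0b00011)` gives 4 (MUL), as in the examples in the text.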
The operating and design principles of the floating and branch micro engines are
the same as those of the fixed micro engine. Details of their micro instruction formats, control
store designs and start address generation ROMs are described in section 3.5.4 and section 3.5.5.
3.5.4 Floating Micro Engine 
(i) Micro Instruction Format 
Field:  Cond  BrAddr  MPX  ALU    BUS_D  BUS_E  BUS_F  ENF  XtoF  FEND
Bits:   1-4   5-14    15   16-18  19-20  21-22  23-24  25   26    27
where the meaning of each field can be summarized as follows:
Field Name Description
MPX Multiplexer The MPX field controls the floating point multiplexer, which
selects the operand placed on BUS E. Two choices are
available:
0 BUS E selects Ftemp (BUS F's temporary register)
1 BUS E selects register file
ALU ALU operation
000 No Operation      100 D ÷ E
001 D + E             101 Transfer E directly
010 D - E             110 reserved
011 D * E             111 reserved
BUS_D Bus D selection Operand of BUS D:
00 IR.Rx the register specified in the Rx field
01 IR.Ry the register specified in the Ry field
10 IR.Rz the register specified in the Rz field
11 - reserved
BUS_E Bus E selection Same as BUS_D.
BUS_F Bus F selection Same as BUS_D.
ENF Enable write F If ENF is set to 1, the result of the ALU operation will be stored 
back to the register specified by BUS F. 
XtoF Fixed to floating If XtoF is set to 1, the fixed point registers #0 (R0) and #1
(R1) will be concatenated and latched to the floating point
register #0 (FR0), where bit 0 of R0 is the most significant bit.
FEND floating END The partial instruction completion signal. FEND is set (=1) to
indicate that the floating part of the current machine
level instruction is completed.
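The XtoF transfer (and the FtoX transfer of section 3.5.3) amounts to treating two 32-bit words as the upper and lower halves of one 64-bit pattern. The Python sketch below illustrates this under the assumption that the 64-bit floating point registers hold IEEE 754 double precision values, as the IEEE 64-bit conversion utilities of section 4.4.2 suggest; the function names `xtof` and `ftox` are hypothetical.

```python
import struct


def xtof(r0, r1):
    """Concatenate R0 (most significant half) and R1 into one 64-bit float."""
    pattern = ((r0 & 0xFFFFFFFF) << 32) | (r1 & 0xFFFFFFFF)
    return struct.unpack(">d", struct.pack(">Q", pattern))[0]


def ftox(value):
    """Split a 64-bit float into the two 32-bit words latched to R0 and R1."""
    (pattern,) = struct.unpack(">Q", struct.pack(">d", value))
    return pattern >> 32, pattern & 0xFFFFFFFF
```

For example, the value 5.0 has the bit pattern 0x4014000000000000, so `ftox(5.0)` yields the pair (0x40140000, 0) and `xtof(0x40140000, 0)` recovers 5.0.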
(ii) The Control Store Map
μaddress μinstruction Remark
1  1111 0000000000 0 000 00 00 00 0 0 0  dummy micro instruction
2  1111 0000000000 0 000 00 00 00 0 0 1  [NOP.1]
3  1111 0000000000 1 001 01 10 00 1 0 1  [F-ADD.1]
4  1111 0000000000 1 010 01 10 00 1 0 1  [F-SUB.1]
5  1111 0000000000 1 011 01 10 00 1 0 1  [F-MUL.1]
6  1111 0000000000 1 100 01 10 00 1 0 1  [F-DIV.1]
7  1111 0000000000 1 001 01 10 00 0 0 0  [F-ADD2.1]
8  1111 0000000000 0 001 00 00 00 1 0 1  [F-ADD2.2]
9  1111 0000000000 1 011 01 10 00 0 0 0  [F-MULADD.1]
10 1111 0000000000 0 001 00 00 00 1 0 1  [F-MULADD.2]
11 1111 0000000000 1 011 01 10 00 0 0 0  [F-MULADD_Rz.1]
12 1111 0000000000 0 001 00 00 10 1 0 1  [F-MULADD_Rz.2]
13 1111 0000000000 1 101 00 01 00 1 0 1  [F-MOV.1]
14 1111 0000000000 0 000 00 00 00 0 1 1  [XtoF.1]
(iii) The start address generation ROM 
The content map of the start address generation ROM for floating point instructions 
is shown below: 
02 03 04 05 07 06 09 11 13 14
01 01 01 01 01 01
3.5.5 Branch Micro Engine 
(i) Micro Instruction Format
Field:  Cond  BrAddr  SET_VC  PUSH  PUSH REG  POP  POP REG  I   D   BEND
Bits:   1-4   5-14    15-18   19    20-21     22   23-24    25  26  27
where the meaning of each field can be summarized as follows:
Field Name Meaning Description
SET_VC Valid Branch Condition Specifies the condition used to set the Valid Condition signal.
There are 14 variations in total:
0000 No Branch                           1000 reserved
0001 reserved                            1001 Branch if Carry Flag is set
0010 BUS G Equal To BUS H                1010 Branch if fixed Zero is set
0011 BUS G Not Equal To BUS H            1011 Branch if Floating point Sign Flag is set
0100 BUS G Less Than BUS H               1100 Branch if Floating point Overflow flag is set
0101 BUS G Less Than or Equal To BUS H   1101 Branch if Floating point Underflow flag is set
0110 reserved                            1110 Branch if Floating point Zero Flag is set
0111 reserved                            1111 unconditional branch
PUSH Push to internal stack If PUSH is set to 1, the register specified in the PUSH REG
field will be pushed onto the internal stack.
PUSH REG push register Four registers may be pushed onto the internal stack:
00 program counter 10 code base
01 program status word 11 data base
POP Pop from internal stack If POP is set to 1, a word will be popped from the internal
stack and put into the register specified in the POP REG field.
POP REG pop register Same as PUSH REG.
I increment internal stack pointer Increment the internal stack pointer (ISP).
D decrement internal stack pointer Decrement the internal stack pointer (ISP).
BEND Branch END The partial instruction completion signal. BEND is set (=1)
to indicate that the branch part of the current machine
level instruction is completed.
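The SET_VC encodings can be read as a lookup from a 4-bit code to a predicate on BUS G, BUS H and the status flags. The Python sketch below only illustrates that table; the function name `valid_condition` and the flag names used as dictionary keys are hypothetical, and reserved codes are simply rejected.

```python
def valid_condition(set_vc, g=0, h=0, flags=None):
    """Evaluate the Valid Condition signal for a 4-bit SET_VC code.

    `g` and `h` are the values on BUS G and BUS H; `flags` is a dict of
    status flags (key names are assumptions made for this sketch).
    """
    flags = flags or {}
    conditions = {
        0b0000: lambda: False,                       # No Branch
        0b0010: lambda: g == h,                      # BUS G Equal To BUS H
        0b0011: lambda: g != h,                      # BUS G Not Equal To BUS H
        0b0100: lambda: g < h,                       # BUS G Less Than BUS H
        0b0101: lambda: g <= h,                      # Less Than or Equal To
        0b1001: lambda: flags.get("carry", False),
        0b1010: lambda: flags.get("fixed_zero", False),
        0b1011: lambda: flags.get("float_sign", False),
        0b1100: lambda: flags.get("float_overflow", False),
        0b1101: lambda: flags.get("float_underflow", False),
        0b1110: lambda: flags.get("float_zero", False),
        0b1111: lambda: True,                        # unconditional branch
    }
    if set_vc not in conditions:
        raise ValueError("reserved SET_VC code")
    return bool(conditions[set_vc]())
```

For instance, `valid_condition(0b0101, g=3, h=3)` is true (less than or equal), while `valid_condition(0b0000)` is always false.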
(ii) The Control Store Map 
μaddress μinstruction Remarks
1  1111 0000000000 0000 0 00 0 00 0 0 0  dummy micro instruction
2  1111 0000000000 0000 0 00 0 00 0 0 1  [NOBR.1]
3  1111 0000000000 0000 0 00 1 00 0 0 0  [RET.1]
4  1111 0000000000 0000 0 00 0 00 0 1 0  [RET.2]
5  1111 0000000000 0000 0 00 1 01 0 0 0  [RET.3]
6  1111 0000000000 0000 0 00 0 00 0 1 0  [RET.4]
7  1111 0000000000 0000 0 00 1 10 0 0 0  [RET.5]
8  1111 0000000000 0000 0 00 0 00 0 1 0  [RET.6]
9  1111 0000000000 0000 0 00 1 11 0 0 0  [RET.7]
10 1111 0000000000 0000 0 00 0 00 0 1 1  [RET.8]
11 1111 0000000000 0010 0 00 0 00 0 0 1  [EQ.1]
12 1111 0000000000 0011 0 00 0 00 0 0 1  [NEQ.1]
13 1111 0000000000 0100 0 00 0 00 0 0 1  [LT.1]
14 1111 0000000000 0101 0 00 0 00 0 0 1  [LTE.1]
15 1111 0000000000 0000 0 00 0 00 0 0 1  reserved
16 1111 0000000000 0000 0 00 0 00 0 0 1  reserved
17 1111 0000000000 0000 0 00 0 00 1 0 0  [CALL.1]
18 1111 0000000000 0000 1 11 0 00 0 0 0  [CALL.2]
19 1111 0000000000 0000 0 00 0 00 1 0 0  [CALL.3]
20 1111 0000000000 0000 1 10 0 00 0 0 0  [CALL.4]
21 1111 0000000000 0000 0 00 0 00 1 0 0  [CALL.5]
22 1111 0000000000 0000 1 01 0 00 0 0 0  [CALL.6]
23 1111 0000000000 0000 0 00 0 00 1 0 0  [CALL.7]
24 1111 0000000000 0000 1 00 0 00 0 0 0  [CALL.8]
25 1111 0000000000 1111 0 00 0 00 0 0 1  [CALL.9]
26 1111 0000000000 1001 0 00 0 00 0 0 1  [BCF.1]
27 1111 0000000000 1010 0 00 0 00 0 0 1  [BX0.1]
28 1111 0000000000 1011 0 00 0 00 0 0 1  [BSF.1]
29 1111 0000000000 1100 0 00 0 00 0 0 1  [BF0.1]
30 1111 0000000000 1101 0 00 0 00 0 0 1  [BFU.1]
31 1111 0000000000 1110 0 00 0 00 0 0 1  [BFZ.1]
32 1111 0000000000 1111 0 00 0 00 0 0 1  [BR.1]
(iii) The start address generation ROM
The content map of the start address generation ROM for Branch instructions is 
shown below: 
02 03 11 12 13 14 15 16 17 26 
27 28 29 30 31 32 
3.6 Implementation details 
3.6.1 Summary of all simulation functions
Function Name                          Description
LIWC                                   main program for simulating the Triple-Instruction Computer
MICROB                                 micro branch engine
MICROF                                 micro floating point engine
MICROX                                 micro fixed point engine
IRC                                    instruction refresh controller
ISU                                    instruction schedule unit
REG_A REG_B REG_D REG_E REG_G REG_H    prepare the REG?out signal for latching to REG? in the next cycle
UPDATEfreg UPDATExreg UPDATEpcvc       latch the data back to the floating point register file or the fixed point register file, or update the PC
BREADY FREADY XREADY                   test the data availability for the branch unit, floating point unit and fixed unit
XALU FALU                              simulate the real ALU operations
XMPX1 FMPX1                            simulate the fixed and floating point multiplexers
XTEST_DBZ FTEST_DBZ                    test for run time error: divide by zero
BRANCHER                               simulate the branch operations
Micro engine functions:
BFILTER XFILTER FFILTER                micro address filters for the μbranch engine, μfixed engine and μfloating engine
μBIRrdy μXIRrdy μFIRrdy                prepare the μBIRin signal (and its counterparts) for latching to μBIR, μXIR and μFIR
μBMPX1 μXMPX1 μFMPX1                   simulate multiplexer #1 of the micro engines
μBMPX2 μXMPX2 μFMPX2                   simulate multiplexer #2 of the micro engines
3.6.2 Program listing for the important functions 
The APL implementations of the fixed micro engine, the instruction refresh controller
and the instruction schedule unit are listed in figure 3.7 and figure 3.8. All other program
listings for simulating the architecture are given in appendix I.
[0] UWC pc;time;MODE： 
[T] ：： EXIT-OO UXPGHJFPCHJBPO-1 0 IR-64pO 0 MODE-1 
I! 
f4] MSAR1 next-MSAR2next-0 一 -
; | 5 ] XSDFlt rdy—XSDR2rdy—FSDR1 rdy—FSDR2rd^0 (* SET ALL MSDR NOT AVAIlABf 
I 二 , ： " 二 E H : 
I S ； ^ . S E T 
}??j ' f P H > 0 : f S ^ C K ^ 4 p O (* I N t T l N T E R N A L S T A C K , 
[11J (• 一__—— JNIT ALL SYNCHRONIZATION VARIABLES *) 
【12】NOX—NOF—NOB—INITH 0 BEND—XEND—FEND^ O 0 cvcie-count-1 
脚 LOOP: 
f t 4 l 口-’[UWC] cycle:，0 O^cycle 0 • - ’ PC =’ • OXREG[1 +31] • 5 
m 冗 冗 (* FIXED ENGINE*) 
]；! Z ^ r L (* FLAOTfNG ENGINE *) 
P 9 | 二 。 （* BRANCHING ENGINE •》 
I 2 1 i , S U (* INSTRUCTION SCHEDULE UNIT *) f23J , R C (• INSTRUCTION REFRESH CONTROLLER 
[25J :: EXITHJX旧[30】 ( - SET THE HALT FUG A 
[26} cycle^cycle+1 
127] —L1xi(delay<0) 
[28] time^-DDL delay 0 -L2 
[29], L1;EPUT '[LIWC] Press any key to CONTINUE ,..； 0 o OpOlNKEY 
[30J L2:-LOOPxi(EXIT=0) 
[31] END:EPUT '[LIWC]?1 END ' o -o 
【0】ISU;oldNOB;oldNOX;oldNOF 
|1] BNEW-(HNIT^(FEND A XEND) 
L2NL1XI~(FENDA XEND} 
[3] NOB—(BNEWA NOB) 
: [4 ] L1:IEND-BEND ， 
[5] FNEW-XNEW-!NS_V 八(INITVBEND) 
[6] "4_3xi(XNEWM) 一 
[7] XREAD_STARTED<-FREAD_STABTED^XWR!TE_STARTED-FWRlTE_STARTED-0 
[8] L 3 : - E N [ > i , ~ ( I N I 1 V B E N D ) - 一 -
[9】NOR•〜(FNEWA NOF) 
[10} NOX—(XNEWA NOX) 
[11] END:-0 
Figure 3.7 The Triple-Instruction architecture simulation program (LIWC) and the 
Instruction Schedule Unit (ISU) 
[0] MICROX 
[ ” . (* (A)micro engine logic - - *) 
[2] uXMPXI • uXMPX2 • uXIRrdy 
[3] (* (B) Update micro engine registers 一 *) 
[4] uXPC-uXMPX2out 
[5] uXIR^uXIRin 
[6] (* — ~ — (C) Execution of DATA . 
[7] FETCHmemCOMPLETED " 』 





[16J XBrAddr^-BtoV uX!R[4 + \ 10] 
[17J LAST XENO-XEND 
[18J -L11x""l(XENDA~FEND) f* 丽 F Q R T H F F E N n q | r N ] . . *、 
[19} XENCVuXIR[31]A(NOX=0)A XDATArdy A XALUrdy ( F E N ° S , G N A L ) 
[20] L11:^ L10x\(XEND*1) 
[21] NOX—1 
[22] (D) UPDATE SYNCHRONIZATION SIGNAL *) 
[23} L10:-Exi~XNEW 、 
[ 2 4 1 O- ' [MICR0X] X’ • t>count • CK start at cycle 1 0 Ocyc le 
[251 XNEWKJ 0 BENOO O NOB-1 
[0】 !RC;lPL 
。 二 三 『 … , (* INSTRUCTION PIPELINE LENGTH 
[2] O'flRCJ lENDr, 0 DHEND 0 O-' $ INtT: ’ 0 O-INIT 
[3] O - ' $ S_BR: ' 0 O S _ B R 0 [ > ' $ VC: ' 0 O V C 
[4] REFRES»-H-INITV(VC A ~ S_BR) 
[5】 - L l x—REFRESH} 一 
[6] VCHNnvo 
[8] IR-(VtoB32 MEM[XREG[1 +22]+XREG[1 +31]}) l(VtoB32 MEMp(REG[1 +22]+XREG[1 +31] + 1J) 
[9] IR2—(VtoB32 MEM[XREG[1 + 221+XREG[1 +31]+2]),(VtoB32 MEM[XREGf1+22}+XREG[1 +31]+3]) 
[10J IR3-(VtoB32 MEM[XREG[1 +22}+XREG[1 + 3 t ] + 4])t(VtoB32 MEM[XREGf1 +221+XREG[1 -f 31]+5J) 
[11] IR4—(VtoB32 MEMfXREG[1 +22]+XREG[1 +31] + 6]),(VtoB32 MEM[XREG[1 +22]+XREG[1 +31] + 7]) 
；[12r EPUT '[IRC】 REFRESH THE INSTRUCTION PIPELINE ….……’ 0 -END 
[14] L1: 
[151 L一ENCHEND A - ( S B9A VC) 
[16] — 、 卜 L ENC) 
[17} IRHR2 — 
[18} IR2HR3 
[201 IR4^-(VtoB32 MEM[XREGf1+22]+XREG[1+31] + IPL]),(VtoB32 MEM[XREG[1+22]+XREG[1+31] + IPL+1j) • 
XREG[1 +31J-XREGf1 + 31]+2 
[21J EPUT ’[IRC] FETCH A NEW ISNTRUCTION IN .…...…….1 
[22J END:VOS_BRH) • 崎 
Figure 3.8 The micro fixed engine and the Instruction Refresh Controller (IRC) 
Chapter 4 The supporting environment 
4.1 The environment 
As stated in the previous chapter, APL is a highly interactive language and the APL
system is a full-featured application development environment built on the APL language.
Many global objects are declared and many functions are defined to operate on these
objects. Besides the functions used for simulating the TIC architecture, some functions are
written to make the simulation simpler. These functions can be further divided into three
categories:
• pseudo operating system utilities;
• simulation utilities; and
• hardware monitor.
All these environment functions are stored in one workspace together with the
architecture simulating functions. It is convenient to start another investigation by building
new architecture simulating functions on top of the supporting functions.
The machine configuration can easily be examined and edited by using the
environment functions. Most implementation details are hidden from the computer
architect. Details of the environment functions are described in section 4.4.
4.2 The Pseudo-machine configuration 
As stated in section 2.2, apart from the CPU characteristics, all other machine level
configuration details, such as memory size, memory access methods, bus arbitration circuits
and microprogramming details, are abstracted from the simulation. Besides, within the
pseudo-machine, all numbers are stored as integers instead of binary patterns. The detailed
considerations are given in the following sections.
4.2.1 Memory management unit and bus control unit 
The pseudo-TIC is assumed to have a programmer addressable space of 2^32, matching the
length of the fixed point registers, so as to simplify the physical address generating
process. To make the simulation run faster, the pseudo-machine in the elementary
simulation consists of a 16K-word memory only. If advanced simulation jobs require a
larger memory capacity, the pseudo-machine can easily be tuned under the APL
interpretive integrated environment.
Including the following statement changes the memory
size of the pseudo-TIC to memory_size and initializes each cell to zero.
MEM ← memory_size ρ 0
Besides the memory configuration, the memory access cycle is also simplified by the
supporting environment. There are eight memory ports: two for fixed point store, two for
fixed point fetch, two for floating point store and two for floating point fetch. When an
address is placed in the memory address register and/or the memory data register, the
corresponding memory fetch and/or store request is generated according to the control
bit T. This simple memory access results from the use of the environment functions
FETCHmemCOMPLETED, FETCHmemINIT, STOREmemCOMPLETED and
STOREmemINIT.
4.2.2 Cache memory and memory access 
The memory access time varies from case to case since a cache memory unit is
included. In our simulation, the memory access delay is also simulated by the environment
functions FETCHmemCOMPLETED, FETCHmemINIT, STOREmemCOMPLETED and
STOREmemINIT. We can modify the memory access time (maximum and minimum) and
the delay function in these four environment functions to alter the cache configuration.
Since the focus of the research is the inter-relationship of the three units of the Triple-
Instruction Computer, the cache configuration is held constant.
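The idea of an access delay bounded by a minimum and a maximum can be sketched as follows. This Python fragment is only a loose analogue of the FETCHmemINIT / FETCHmemCOMPLETED pair: the class `MemoryPort`, its method names and signatures, and the uniformly random delay function are all assumptions of this sketch and do not reproduce the APL functions listed in the appendix.

```python
import random


class MemoryPort:
    """One fetch port with a variable access delay (illustrative model)."""

    def __init__(self, memory, min_delay=1, max_delay=4, rng=None):
        if not 1 <= min_delay <= max_delay:
            raise ValueError("need 1 <= min_delay <= max_delay")
        self.memory = memory
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rng = rng or random.Random()
        self.address = None
        self.ready_at = None

    def fetch_init(self, address, now):
        """Start a fetch at cycle `now`; the delay models a cache hit or miss."""
        self.address = address
        self.ready_at = now + self.rng.randint(self.min_delay, self.max_delay)

    def fetch_completed(self, now):
        """Return the fetched word once the delay has elapsed, else None."""
        if self.ready_at is None or now < self.ready_at:
            return None
        word = self.memory[self.address]
        self.address = self.ready_at = None
        return word
```

Changing `min_delay`, `max_delay` or the random draw corresponds to altering the cache configuration; setting both delays to the same value gives the constant configuration used in the study.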
4.2.3 Control store of the micro engines 
Other than the main memory, three erasable read only memories, namely the control
stores for the fixed micro engine (μXRAM), the floating micro engine (μFRAM) and the branch
micro engine (μBRAM), are included in the pseudo-machine. The sizes of the three control
stores are set to 50, 14 and 32 words respectively. The contents of the control stores are
saved according to the tables in section 3.5 by a micro control store generating function.
4.2.4 Start address generation ROM of the micro engines
Similar to the control stores, four start address generation ROMs, namely
RRSTART, RISTART, FSTART and BSTART, are created to give the correct starting
address for each Triple-Instruction Word. There are two distinct start address generation
ROMs for the fixed part of the Triple-Instruction Word: one for RR-format instructions
and one for RI-format instructions. The sizes of the four ROMs are set to 32, 16, 16 and
16, which correspond to the number of available instructions.
4.2.5 Storing all binary patterns as integers 
Within the pseudo-machine, all binary patterns are stored as integers. For example,
64-bit Triple-Instructions, 32-bit fixed point integers, 64-bit floating point numbers, and even
31-bit fixed micro instructions are all stored as integers. There are two advantages to
storing all these different binary patterns as integers. Firstly, compared with using
exact binary patterns, the run time memory requirement of our pseudo-machine is
kept lower. Secondly, in our architecture simulation system, not only the length of these
binary patterns but also the meaning of each bit changes frequently.
It is more flexible to store all these patterns in integer form since direct
arithmetic operations can be performed. When the data is retrieved, a corresponding
conversion utility is used to convert the data back to its original binary pattern.
For example, considering the fixed micro instruction for fixed point ADD, 1111
0000000000 01 00001 01 10 00 1 0 0 1, the data stored in the micro control store is simply the
integer 2013300105. When this μinstruction is fetched from the control store, it is converted
back to a 31-bit binary pattern and stored in the corresponding register for further
processing. Figure 4.1 shows the preset contents of the micro control stores.
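The integer/bit-pattern correspondence can be checked with a short Python sketch. The helper names `v_to_b` and `b_to_v` are modelled on the VtoB and BtoV utilities that appear in the APL listings, but their Python signatures are assumptions made for this illustration.

```python
def v_to_b(value, width):
    """Convert a non-negative integer to a list of `width` bits, MSB first."""
    if value < 0 or value >= 1 << width:
        raise ValueError("value does not fit in the requested width")
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def b_to_v(bits):
    """Convert a list of bits (MSB first) back to its integer value."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# The fixed point ADD micro instruction, written as a 31-bit pattern.
ADD_PATTERN = "1111 0000000000 01 00001 01 10 00 1 0 0 1".replace(" ", "")
add_bits = [int(ch) for ch in ADD_PATTERN]
```

Here `b_to_v(add_bits)` evaluates to 2013300105, the second entry of the fixed control store in figure 4.1, and `v_to_b(2013300105, 31)` recovers the original 31-bit pattern.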
μXRAM (50 31-bit words):
2013265920 2013300105 2013301129 2013302153 2013303177 2013300096 2013266953 
2013302144 2013266953 2013302144 2013266985 2013307273 2013308297 2013309321 
2013310345 2013311369 2013312393 2013313097 2013299784 2013300824 2013300809 
2013315465 2013316489 2013317513 2013318152 2013318232 2013318313 2013352329 
2013386121 2013321225 2013322249 2013327608 2013313240 2013313065 2013313240 
2013328632 2013313161 2013265920 2013265920 2013327368 2013327448 2013327529 
2013328392 2013328472 2013328553 2013265925 2013265923 2013313160 2013313177 
2013265921
μFRAM (14 27-bit words):
125829120 125829121 125833925 125834437 125834949 125835461 125833920 
125829637 125834944 125829637 125834944 125829653 125835813 125829123 
μBRAM (32 27-bit words):
125829120 125829121 125829152 125829122 125829160 125829122 125829168 
125829122 125829176 125829123 125830145 125830657 125831169 125831681 
125829120 125829120 125829124 125829568 125829124 125829504 125829124 
125829440 125829124 125829376 125836801 125833729 125834241 125834753 
125835265 125835777 125836289 125836801 
Figure 4.1 The content of the THREE micro control stores 
Likewise, the TIC programs are also stored as integers. Some accessory utilities are
used for decoding and encoding the programs. Figure 4.2 shows some TIC programs stored
in the pseudo-machine.
The trade-off of this design is that the run time efficiency of the simulation
is decreased by the large number of pattern conversions. When the architecture is almost
fixed and modifications become rare, an alternative is to store all the patterns
in binary form so that no conversion is needed. Obviously, the run time memory
requirement of the simulation will then increase.
1. Program for doing simple addition and multiplication:
201330956 37748736 2374811652 1723858944 2080374784 0 
2. Program for finding the sum of 100 floating point numbers: 
2374533120 0 2374533252 13897983 2370617376 3758096384 2080374784 0 
3. Program for calculating the Gaussian Elimination Inner Loop:
2407530496 0 2411724952 3251896959 2080374784 0 0 0
4. Program for performing Matrix Multiplication: 
2350235648 0 3021242496 0 2407530528 33554432 125829272 231998207 
2371453088 3758096384 1879631616 299526 2384707584 0 3022061696 1048562 
1879500544 299400 2215706688 0 3021766784 0 2351153280 1048554 2080374784
Figure 4.2 TIC programs stored in the pseudo-machine 
4.3 Assembly language description 
Although it is straightforward to write a Triple-Instruction program directly,
manipulating and understanding such 64-bit binary patterns is not effortless. To increase
the readability of TIC programs, a pseudo-assembly description is defined. The pseudo-
assembly language is not a real assembly language but a higher level language description
for the Triple-Instruction Computer. It will be widely used in chapter 5 for the discussion
of the two practical cases.
The pseudo-assembly language corresponds one-to-one to the Triple-Instruction
Computer word. It provides a clear and unambiguous way to describe a Triple-Instruction
program. Each pseudo-assembly statement is preceded by a word counter and consists of
three parts: the fixed, floating and branch parts. For example, figure 4.3 shows a TIC
program for finding the sum of 100 floating point numbers.
For each TIC word, the first field shows the TIC word count, which is used for
calculating branching displacements, symbolic address correspondence, etc. The second field
consists of one of the reserved words Fixed, Floating or Branch followed by a colon, which is used
[13 F i x e d : R2:=R2+00 
FreadfR9v ； ； prepare the OPERAND address 
: : :: N O p : •
 j
: ； f etch OPERAND 
B r a n c h : NOBR 
[2] F i x e d : R2J=R2+@2 
F read[R2] p r e p a r e n e x t OPERAND a d d r e s s 
: = 二 ， ： S e ? f = R R ° + F K R 1 , o ‘ a c c u m u l a t e t h e numbers 
[3] F l x e d ^ ' R I ^ R S ^ S ^ ( R 2 = < R 1 ) ‘ c heck w h 細 垂 g o t o : T I C # 3 
Fwr i t t r p-5 i ; p r e P a r e t h e RESULT'S a d d r e s s 
F l o a t i n g , FR7r=FR0 3 9 t ° r a ^ f ^ 1 b a c k t o 啦 m 。 r y 
Branch： NOBR '• p r e p a r e t h e RESULT t o FSDR 
[4] F i x e d : HALT 
F l o a t i n g : NOP 
B r a n c h : NOBR 
Figure 4.3 A simple TIC program represented in pseudo-assembly language description. 
for showing the partial instruction type. The third field shows the real operations. 
In the first TIC word of the program in figure 4.3, the floating point part and the
branch part are idle and hence a NOP (No OPeration) and a NOBR (NO BRanch) are filled in.
The fixed part is filled with two operation lines: the first line describes the arithmetic
operation while the second line gives the required memory operation. More than
two lines in one partial TIC word are possible when coupled operations or memory stores with
delayed read operations are included.
The use of the pseudo-assembly language description simplifies the test data preparation
process and increases the maintainability and reusability of the testing programs.
4.4 Details of the utilities 
The environment functions are used to facilitate the architecture simulation. Some of
them serve as pseudo operating system utilities, such as the loader, unassembler and debugger,
while others serve as simulation accelerators, such as the number conversion utilities and the
system monitor. Details are given in the following sections.
4.4.1 Pseudo operating system utilities
(i) TIC batch program
To simplify the test programs and filter out all unnecessary details, only the core
part of the algorithm is coded using Triple-Instruction words, while all other trivial
details are described by the TIC program batch utility. For instance, the simple TIC
program listed as program 1 in figure 4.2 is used to test the multiplication of two
registers and the floating point addition of a memory word to a register. The TIC program
consists of only three words because all initialization and result verification are done by the
batch utility.
[0] AUTORUN1
[1] XREG←32ρ0 ⋄ FREG←16ρ0 ⋄ MEM←128ρ0
[2] XREG[1 2 3 4]←0 10 20 10
[3] FREG[1 2 3 4]←0 1 2 10
[4] MEM[11]←BtoV 32↑VtoIEEE64 5 ⋄ MEM[12]←BtoV 32↓VtoIEEE64 5
[5] MEM[ι6]←PROGRAM1
[6] STATUS ⋄ ⎕INKEY ⋄ MEM ⋄ ⎕INKEY
[7] LIWC 1
Figure 4.4 A TIC batch program for program 1 of figure 4.2. 
The function AUTORUN1 serves as a batch program as well as the loader. Line 1
defines the sizes of the fixed point register file, the floating point register file and the main
memory. As stated in the previous section, changes to this line will modify the pseudo-
machine configuration. Lines 2 to 4 initialize the memory and register files. Line 5 loads the
TIC program into the main memory and line 7 starts the execution from the appropriate
address. Line 6 is used to verify the machine condition before executing the TIC program.
The TIC batch program is easy to write since all statements used are in standard
APL syntax. The use of TIC batch programs significantly reduces programming
effort by hiding all unrelated details and focusing the simulation tests on the right area.
4.4.2 Simulation utilities 
Several functions are used to convert decimal values to IEEE 64-bit floating point
format, hexadecimal format, two's complement format and unsigned binary format. As stated
in section 4.2.5, it is straightforward to store all binary patterns in decimal form and carry
out conversion when necessary. Examples are shown in figure 4.5.
IEEE 64-bit floating point number to decimal value: 
[0] RES←IEEE64toV BITVEC;SIGN;EXP;MAN 
[1] RES←1 ⋄ →L1×⍳(0≠+/BITVEC) ⋄ RES←0 ⋄ →END 
[2] L1:SIGN←1↑BITVEC ⋄ BITVEC←1↓BITVEC 
[3] EXP←11↑BITVEC ⋄ MAN←11↓BITVEC 
[4] EXP←2*((2⊥EXP)-1023) 
[5] L2:MAN←0.5×MAN ⋄ RES←RES+MAN[1] ⋄ MAN←1↓MAN 
[6] →L2×⍳(0<⍴MAN) 
[7] RES←RES×EXP 
[8] →END×⍳(SIGN=0) ⋄ RES←0-RES 
[9] END:→0 
Decimal value to IEEE 64-bit floating point number: 
[0】RES—VtoEEE64 A;BD;AD;SIGN;EXP 
[1] -LOxiA^O 0 RES 6^4pO O-END 
[2] LQ:SIGN-0 0 -L1xiA>0 
[3] SlGN-1 0 A -^A 
[4] L1:BD{A 0 AD-A七A 
[5] BD-BDtoB BD 0 AD-ADtoB AD 
[6] _L2xi(BtoVBD)=0 
[7] EXP-(pBD)-1 0-L4 
[8] L2:EXP-1 • BD-OpO 
[9] L3:EXP-EXP^(AD[EXP]=0) 
【10J -L7xi (AD [EXP] -1) A(EXP=1) 




[15] RES-SIGN,(11r16pVtoB11 EXP+1023),(n53TBD,AD) 0-END 
[16J L8:RES-SIGN,(11 r16pVtoB11 EXP +1023),((-EXP)I(52-EXP)TAD) 0 - 0 
[17] L5>L6xi(EXP>1024) 
[18] RES—SIGN,(11 p0),(1022i(1022+52)rAD) • -END 
[19] L6:RES-SIGN,(11p1),(52pO) 
[20] END:-0 
Figure 4.5 Simulation functions for floating point numbers conversion. 
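The bit-level decoding performed by the conversion functions of figure 4.5 can be sketched in Python as a cross-check. This is an illustration only, not the thesis's APL code: the names `value_to_ieee64` and `ieee64_to_value` are assumptions, the encoder relies on Python's standard `struct` module, and the decoder handles only zero and normalized numbers.

```python
import struct


def value_to_ieee64(value):
    """Return the 64 bits of an IEEE 754 double as a list of 0/1, sign bit first."""
    (word,) = struct.unpack(">Q", struct.pack(">d", value))
    return [(word >> (63 - i)) & 1 for i in range(64)]


def ieee64_to_value(bits):
    """Decode a 64-element 0/1 list: 1 sign bit, 11 exponent bits, 52 mantissa bits.

    Sketch only: zero and normalized numbers are handled; denormals,
    infinities and NaNs are ignored.
    """
    if sum(bits) == 0:
        return 0.0
    sign = bits[0]
    exponent = int("".join(str(b) for b in bits[1:12]), 2) - 1023
    # Start from the hidden leading 1 and add the weight of each mantissa bit.
    result = 1.0
    weight = 0.5
    for bit in bits[12:]:
        result += bit * weight
        weight *= 0.5
    result *= 2.0 ** exponent
    return -result if sign else result
```

Round-tripping a value such as 5 through both functions returns the original value, which is the property a batch program relies on when it stores the two 32-bit halves of a floating point operand in consecutive memory words.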
Likewise, the functions FETCHmemCOMPLETED, FETCHmemINIT, 
STOREmemCOMPLETED, STOREmemINIT, READ and WRITE are also used to 
simplify the simulation. Details of these functions are given in the appendix. 
4.4.3 Hardware monitor 
Since all of the Triple-Instruction Computer components are stored as APL objects, several 
utilities are written to reveal the TIC machine status, the micro engine status and the contents of the 
register files and memory. The functions SHOW, SHOW_FREG, SHOW_XREG, STATUS, 
MXSTATUS, MFSTATUS and MBSTATUS report the current machine status. 
A graphical implementation is possible using the ⎕ALX, ⎕GINIT, ⎕GWRITE, ⎕GLINE and 
⎕GSHADE functions. Two of the monitor functions are listed in figure 4.6, while the others are placed in the 
appendix. 
Show all status: 
[0] STATUS 
m mf m：，o 26TIR 
[2] ’ FIR: ' 0 17r26ilR 
[3] B旧：，21T43UR 
[4] ' PSW:，• VtoHEX8 XREG[1+16] 0 EPUT ” 
[5] ’ uXIR:，0 VtoHEX8 BtoV uXIR • ’ ’ 
[6] ’ uFIR: ‘ • VtoHEX8 BtoV uFIR • CV r 
[7] u已丨R: 1 0 VtoHEX8 BtoV uBIR • EPUT “ 
[8] uXPC: ' 0 VtoHEX4 uXPC 
[9] •-，uFPC: ' 0 VtoHEX4 uXPC 
[10] uBPC: ’ 0 VtoHEX4 uBPC 0 EPUT " 
[11] SHOW一XREG 
[12|; SHOW~FREG 
Show the fixed point register file: 
[0】SHOW一XREG;I 
[ 1 ] 1 - 1 一 
[2] D-'tSHOV^XREG]' 






Figure 4.6 Some hardware monitor functions. 
Chapter 5 Evaluation 
After the TIC simulator was completed, several short programs were used to 
test the partial correctness of the simulator. Afterwards, two common scientific kernels, 
the Gaussian elimination inner loop and the matrix multiplication loop, were selected as benchmark 
programs to evaluate the efficiency of the Triple-Instruction architecture. Comparisons are 
made against DLX, a generic RISC-based load-and-store architecture. The number of 
execution cycles, the number of instruction fetches, the number of branches and the number of possible 
interruption points of the two architectures for the given problems are collected for 
a quantitative comparison of the two architectures. Encouraging results are obtained. 
5.1.1 Case One: Gaussian Elimination Inner Loop 
Assume we have to solve the following system of linear equations: 
 -x1 +  x2 + 2x3 = 2      [ -1  1  2  2 ]   (K-th row) 
 3x1 -  x2 +  x3 = 6      [  3 -1  1  6 ]   (P-th row) 
 -x1 + 3x2 + 4x3 = 4      [ -1  3  4  4 ] 
The Gaussian method eliminates the lower triangular entries of the input 
matrix, so that an upper triangular matrix is passed to the next stage, backward 
substitution. The new row 2 is calculated by adding 3 times row 1 to the original row 2. 
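The row operation described above can be checked with a few lines of Python. This is a minimal sketch at the arithmetic level only; the function name `eliminate_row` is an assumption and nothing here models the TIC hardware.

```python
def eliminate_row(pivot_row, target_row, multiplier):
    """Return target_row + multiplier * pivot_row, element by element."""
    return [t + multiplier * p for p, t in zip(pivot_row, target_row)]


row_1 = [-1, 1, 2, 2]   # first row of the example system
row_2 = [3, -1, 1, 6]   # second row, to be modified
new_row_2 = eliminate_row(row_1, row_2, 3)
print(new_row_2)  # [0, 2, 7, 12]
```

The leading entry becomes zero, and the remaining entries 2, 7 and 12 are the values the inner loop program is expected to store.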
To simplify the TIC program, the address and index registers are initialized 
to fixed values. The pseudo-machine configuration is set to: 
R0 := 60          { address of A_JP }            FR0 := 3 { the modifier, derived from A_JK and A_KK } 
R8 := 52          { address of A_KP } 
R4 := 60 + (2x3)  { terminating address, A_JP+3 } 
MEM[51..74] := values of the original matrix 
The corresponding pseudo-assembly description of the Gaussian elimination 
inner loop program is shown as follows: 
[TIC#1] Fixed: R0:=R0+@0 




[TIC#2] Fixed: Fwrite[R0] 
R0:=R0+@2 
R8:=R8 + @2 
Coupled_Fread[R0,R8] 
Floating: FDR1:=FDR1 + FR0*FDR2 
FSDR1:=FDR1 
Branch: Self Branch if (R0 =< R4) 
[TIC#3] Fixed: HALT 
Floating: NOP 
Branch: NOBR 
The first TIC word fills the pipeline of the single-instruction loop by a 
coupled read of two floating point operands, A_JP and A_KP. The second TIC word 
is a coupled-read-modify-and-store atomic instruction. First, the modified A_JP is 
written back to memory after the FSDR has been loaded with the appropriate operand. Meanwhile, the 
fixed unit prepares the addresses of A_JP+1 and A_KP+1 and reads the two floating point operands 
simultaneously with another coupled operation. Finally, the branch unit uses its comparison circuit 
to determine whether a valid self-branch occurs. After the one-instruction loop has finished, 
the third TIC word simply stops the execution. 
The Triple-Instruction program segment for the above pseudo-assembly description 
is listed as follows: 
[1] 8F80000000000000 = 1 0001 1 1110 0000 0000 00000000 (X1) 
                       0000 0 0000 0000 0000 (F1) 
                       0 0000 0 0000 0000 0000000 (B1) 
[2] 8FA00098A1D4027F = 1 0001 1 1111 0000 0000 00000010 (X2) 
                       0110 0 0110 0000 1110 (F2) 
                       1 0100 0 0000 0100 1111111 (B2) 
[3] 7C00000000000000 = 0 11111 0000 00000 00000 00000 0 (X3) 
                       0000 0 0000 0000 0000 (F3) 
                       0 0000 0 0000 0000 0000000 (B3) 
The expected result of the program segment should be: 
MEM[60 61] =2 
MEM[62 63] = 7 
MEM[64 65] = 12 
And the resulted matrix should be: 
[ - 1 1 2 2 ] 
[111 -2 12 ] 
[ -1 3 4 4 ] 
The result produced by the Triple-Instruction Computer simulator is the same as the expected 
result, so the experimental architecture behaves correctly for this case. Although run-time 
efficiency cannot easily be compared between different architectures, some meaningful 
figures are collected for further discussion. The statistics are listed in figure 5.1. 
(a) Total cycles needed in TIC (non-pipelined version): 31 
(b) Total Fixed Addition | Subtraction | Move needed: 9 
    Fixed Multiplication needed: 0 
    Floating Move needed: 0 
    Floating Addition | Subtraction needed: 3 
    Floating Multiplication needed: 3 
    Floating Division needed: 0 
    Memory read (data) needed: 6 
    Memory store (data) needed: 3 
    Memory read (instruction) needed: 3 x N (implementation dependent) 
        (the loop-back branch breaks the execution pipeline 3 times) 
    Comparison needed: 3 
    9 + (3 x 4) + (3 x 4) + (6 x 1.5) + 3 + (3 x N) + (3) + 1 = 49 + 3N 
Figure 5.1 Statistics for the Gaussian elimination inner loop program 
5.1.2 Case Two: Matrix Multiplication 
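Assume we have to calculate the product of two N-by-N matrices, A x B, and then 
put the result into matrix C. 
\[
\begin{pmatrix}
A_{11} & A_{12} & A_{13} & \cdots & A_{1n} \\
A_{21} & A_{22} & A_{23} & \cdots & A_{2n} \\
A_{31} & A_{32} & A_{33} & \cdots & A_{3n} \\
\vdots & \vdots & \vdots &        & \vdots \\
A_{n1} & A_{n2} & A_{n3} & \cdots & A_{nn}
\end{pmatrix}
\times
\begin{pmatrix}
B_{11} & B_{12} & B_{13} & \cdots & B_{1n} \\
B_{21} & B_{22} & B_{23} & \cdots & B_{2n} \\
B_{31} & B_{32} & B_{33} & \cdots & B_{3n} \\
\vdots & \vdots & \vdots &        & \vdots \\
B_{n1} & B_{n2} & B_{n3} & \cdots & B_{nn}
\end{pmatrix}
=
\begin{pmatrix}
C_{11} & C_{12} & C_{13} & \cdots & C_{1n} \\
C_{21} & C_{22} & C_{23} & \cdots & C_{2n} \\
C_{31} & C_{32} & C_{33} & \cdots & C_{3n} \\
\vdots & \vdots & \vdots &        & \vdots \\
C_{n1} & C_{n2} & C_{n3} & \cdots & C_{nn}
\end{pmatrix}
\]
\[
\text{where } C[i,j] = \sum_{k=1}^{n} A[i,k] \times B[k,j]
\]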
To simplify the TIC program, the address and index registers are initialized 
to fixed values. The pseudo-machine configuration is set to: 
R0 := 100      { address of A_IK }                     FR0 := 0 { temp sum } 
R8 := 300      { address of B_KJ }                     FR1 := 0 { reset temp sum } 
R1 := 2        { step to find next A_IK, the size of a floating point number } 
R9 := R1 x R2  { step to find next B_KJ, size of a floating point number x N } 
R2 := 10       { dimension of the input matrix } 
R3 := 1        { I } 
R4 := 1        { J } 
R5 := 0        { termination address of each inner loop, calculated at run time } 
R6 := 498      { address of C_11 - 2 } 
R7 := 100      { address of A_I1, for restoration } 
R15 := 298     { address of B_11 - 2, for restoration } 
MEM[100..299] := the values of the original matrix A 
  1   2   3   4   5   6   7   8   9  10 
 11  12  13  14  15  16  17  18  19  20 
 21  22  23  24  25  26  27  28  29  30 
 31  32  33  34  35  36  37  38  39  40 
 41  42  43  44  45  46  47  48  49  50 
 51  52  53  54  55  56  57  58  59  60 
 61  62  63  64  65  66  67  68  69  70 
 71  72  73  74  75  76  77  78  79  80 
 81  82  83  84  85  86  87  88  89  90 
 91  92  93  94  95  96  97  98  99 100 
MEM[300..499] := the values of the original matrix B 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
1 2 3 4 5 6 7 8 9 0 
The corresponding pseudo-assembly description of the matrix multiplication 
loop program is as follows: 
[TIC#1] Fixed: R5:=R7+@0 
Floating: NOP 
Branch: NOBR 
[TIC#2] Fixed: R5:=R5+R2*@2 
Floating: NOP 
Branch: NOBR 
[TIC#3] Fixed: R0:=R0+@0 




[TIC#4] Fixed: R0:=R0+R1 
R8:=R8+R9 
Coupled_Fread[R0,R8] 
Floating: FR0:=FR0 + FDR1*FDR2 
Branch: Self branch if (R0 =< R5) 
[TIC#5] Fixed: R6:=R6+@2 
Fwrite[R6] 
Floating: FSDR1:=FR0 
Branch: NOBR 
[TIC#6] Fixed: R4:=R4+1 
Floating: NOP 
Branch: Relative_Branch (+3) if (R2 < R4) 
[TIC#7] Fixed: R8:=R15+@0 
R0:=R7+@0 
Floating: NOP 
Branch: NOBR 
[TIC#8] Fixed: R8:=R8+R4*@2 
Floating: NOP 
Branch: Relative_Branch (-7) 
[TIC#9] Fixed: R3:=R3+1 
Floating: NOP 
Branch: Relative_Branch (+4) if (R2 < R3) 
[TIC#10] Fixed: R4:=@1 
Floating: NOP 
Branch: NOBR 
[TIC#11] Fixed: R7:=R7+R2*@2 
Floating: NOP 
Branch: NOBR 
[TIC#12] Fixed: R8:=R15+@2 
Floating: NOP 
Branch: Relative_Branch (-11) 
[TIC#13] Fixed: HALT 
Floating: NOP 
Branch: NOBR 
The Triple-Instruction program segment for the above matrix multiplication program 
is listed as follows: 
[1]  8C15C00000000000 = 1 0001 1 0000 0101 0111 00000000 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
[2]  B414808000000000 = 1 0110 1 0000 0101 0010 00000010 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
[3]  8F80002002000000 = 1 0001 1 1110 0000 0000 00000000 
                        1000 0 0000 0001 0000 
                        0 0000 0 0000 0000 0000000 
[4]  078000980DD402FF = 0 0000 1 1110 0000 0000 00000010 
                        0110 0 0000 0110 1110 
                        1 0100 0 0000 0101 1111111 
[5]  8D5980A0E0000000 = 1 0001 1 0101 0110 0110 00000010 
                        1000 0 0111 0000 0000 
                        0 0000 0 0000 0000 0000000 
[6]  7008E70000049206 = 01110000000010001110011100 
                        0000 0 0000 0000 0000 
                        0 0100 1 0010 0100 0000110 
[7]  8E23C00000000000 = 1 0001 1 1000 1000 1111 00000000 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
[8]  B4210080000FFFF2 = 1 0110 1 0000 1000 0100 00000010 
                        0000 0 0000 0000 0000 
                        0 1111 1 1111 1111 1110010 
[9]  7006E70000049188 = 01110000000001101110011100 
                        0000 0 0000 0000 0000 
                        0 0100 1 0010 0011 0001000 
[10] 8411004000000000 = 1 0000 1 0000 0100 0100 00000001 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
[11] B41C808000000000 = 1 0110 1 0000 0111 0010 00000010 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
[12] 8C23C080000FFFEA = 1 0001 1 0000 1000 1111 00000010 
                        0000 0 0000 0000 0000 
                        0 1111 1 1111 1111 1101010 
[13] 7C00000000000000 = 0 11111 0000 00000 00000 00000 0 
                        0000 0 0000 0000 0000 
                        0 0000 0 0000 0000 0000000 
The expected result of the program segment should be: 
MEM[500..699] = the resulting matrix C 
  55  110  165  220  275  330  385  440  495  0 
 155  310  465  620  775  930 1085 1240 1395  0 
 255  510  765 1020 1275 1530 1785 2040 2295  0 
 355  710 1065 1420 1775 2130 2485 2840 3195  0 
 455  910 1365 1820 2275 2730 3185 3640 4095  0 
 555 1110 1665 2220 2775 3330 3885 4440 4995  0 
 655 1310 1965 2620 3275 3930 4585 5240 5895  0 
 755 1510 2265 3020 3775 4530 5285 6040 6795  0 
 855 1710 2565 3420 4275 5130 5985 6840 7695  0 
 955 1910 2865 3820 4775 5730 6685 7640 8595  0 
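The expected product can be reproduced independently of the simulator. The sketch below assumes, as the listings of the input data suggest, that matrix A holds the integers 1 to 100 row by row and that every row of B is 1 2 ... 9 0. The helper name `matmul` is an assumption, and the code illustrates the definition of C[i,j] only, not the TIC program.

```python
def matmul(a, b):
    """Plain triple-loop product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


N = 10
# Assumed test data: A = 1..100 row by row, each row of B = 1 2 ... 9 0.
A = [[i * N + j + 1 for j in range(N)] for i in range(N)]
B = [[(j + 1) % N for j in range(N)] for _ in range(N)]
C = matmul(A, B)

print(C[0])  # [55, 110, 165, 220, 275, 330, 385, 440, 495, 0]
```

Because every column of B is constant, each entry of C is simply the row sum of A (55, 155, ..., 955) multiplied by the column value of B, which makes the expected matrix easy to verify by eye.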
Also, some statistics are collected and listed in figure 5.2. 
[1] Total cycles needed in TIC: 11063 
[2] Total Fixed Addition | Subtraction | Move needed: 3727 
    Fixed Multiplication needed: 209 
    Fixed Division needed: 0 
    Floating Addition | Subtraction needed: 1000 
    Floating Multiplication needed: 1000 
    Memory read (data) needed: 2000 
    Memory store (data) needed: 1000 
    Memory read (instruction) needed: 1100 x N 
    Comparison needed: 1110 
    Total cycles needed in normal serial machines (approximation): 
    3727 + (209 x 2) + (1000 x 4) + (1000 x 4) + (2000 x 1.5) + 1000 + (1100 x N) + 1110 + 1 = 17256 + 1100N 
Figure 5.2 Statistics for the Matrix Multiplication program 
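The serial-machine estimates in figures 5.1 and 5.2 follow the same weighted sum, which can be sketched as below. The per-operation weights are read off the two formulas; the dictionary keys and the function name `serial_cycle_estimate` are assumptions, and the trailing "+ 1" of both formulas is kept as an unexplained constant because its meaning is not stated.

```python
# Cycle weights as they appear in the formulas of figures 5.1 and 5.2.
WEIGHTS = {
    "fixed_add": 1,     # fixed addition | subtraction | move
    "fixed_mul": 2,
    "float_add": 4,
    "float_mul": 4,
    "mem_read": 1.5,    # data reads
    "mem_store": 1,     # data stores
    "compare": 1,
}


def serial_cycle_estimate(counts, fetches, extra=1):
    """Return (constant, fetches) so that total cycles = constant + fetches * N."""
    constant = sum(WEIGHTS[name] * counts.get(name, 0) for name in WEIGHTS) + extra
    return constant, fetches


gaussian = {"fixed_add": 9, "float_add": 3, "float_mul": 3,
            "mem_read": 6, "mem_store": 3, "compare": 3}
matrix = {"fixed_add": 3727, "fixed_mul": 209, "float_add": 1000,
          "float_mul": 1000, "mem_read": 2000, "mem_store": 1000,
          "compare": 1110}

print(serial_cycle_estimate(gaussian, 3))    # (49.0, 3)
print(serial_cycle_estimate(matrix, 1100))   # (17256.0, 1100)
```

Both results agree with the totals 49 + 3N and 17256 + 1100N given in the figures.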
5.2 Results and comparison 
After verifying the correctness of the Triple-Instruction Computer simulator, 
performance analysis is carried out. Three dimensions of performance are surveyed, 
that is, the complexity of the assembly code, the run-time efficiency and the compiling effort. 
To make the argument more concrete, the above cases are also coded for DLX, a generic 
load-and-store architecture proposed by John Hennessy and David Patterson. 
DLX is a polyunsaturated computer, and the design philosophy of DLX is very similar to the 
average of a number of recent experimental and commercial machines, such as the AMD 
29K, DECstation 3100, HP 850, IBM 801, Intel i860, MIPS M/120A, MIPS M/1000, 
Motorola 88K, RISC I, SGI 4D/60, SPARCstation-1, Sun-4/110 and Sun-4/260 [PAT]. 
The architecture of DLX, like that of the TIC, employs 32-bit general-purpose fixed point 
registers and 64-bit floating point numbers, and supports all primitive operations of the TIC. It 
is therefore easy to compare the two architectures, and DLX is chosen as the control. 
5.2.1 Complexity of the TIC code 
Before the complexity of the Triple-Instruction Computer code is discussed, 
the Gaussian elimination inner loop and the matrix multiplication loop are coded 
in DLX assembly language, as shown in figure 5.3 and figure 5.4. 
R1 := 60        L1: LD    F4, (R2) 
R2 := 52            MULTD F4, F4, F0 
                    LD    F2, (R1) 
R3 := 66            ADDD  F2, F2, F4 
R4 := 0             SD    (R1), F2 
                    ADDI  R1, R1, #2 
F0 := 3             ADDI  R2, R2, #2 
F2 := 0             SLE   FB 
F4 := 0             BEQZ  R4, L1 
                    HALT 
Figure 5.3 The Gaussian elimination inner loop in DLX assembly language. 
The TIC version of the Gaussian elimination inner loop consists of only 3 TIC words, 
6x32 bits, while the DLX version consists of 10 DLX instructions, 10x32 bits. Likewise, the 
TIC version of the matrix multiplication loop consists of 13 TIC words, 26x32 bits, while the 
E S m 。 L 1 : A D D 丨 R5, R7, # 0 
3 0 0 MULT R1，R2，#2 
R 2 : = 1 0 L 2 . ^ F 2 ，（ R 1 0 ) 
RQ on L D F4, (R8) 
MULTD F4，F4，F2 
二 f：： ADDD F0，F_F4 
r ^ ：： n ADDI R _ R _ #2 
R6 •'： 498 A D D 丨 R8，R8，#20 
R7 ：： J L E R1, R10, R5 
^ - - J 0 0 BEQZ R1 L2 
R 1 5 ： = 2 9 8 ADDI R6:R6，#2 
TO 〜 n S D ( R 6 ) ， F 0 
„ ；： ^ ADDI m，R 4 , # 1 
^ X SLT R1, R2, R4 
F 4 : = 0 BEQZ R1, L3 
MULT R1, R4, # 2 
‘ ADD R8, R15, R1 
ADDI R10，R7, # 0 
J .. L1 
L3: ADDI R3, R3，#1 
SLT R1, R2, R3 
BEQZ R1, L4 
ADDI R4，RO, # 1 
MUCT R _ R _ # 2 
ADDM 




Figure 5.4 The matrix multiplication loop in DLX assembly language. 
DLX version consists of 29x32 bits. In both cases the TIC program is 
shorter than the DLX program because of the more descriptive instruction set. 
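The size comparison can be made explicit with a short calculation. This is only a worked restatement of the word counts quoted above; the constant and function names are assumptions.

```python
TIC_WORD_BITS = 64   # one Triple-Instruction word
DLX_WORD_BITS = 32   # one DLX instruction


def size_ratio(tic_words, dlx_instructions):
    """Return (TIC bits, DLX bits, TIC/DLX size ratio) for one program."""
    tic_bits = tic_words * TIC_WORD_BITS
    dlx_bits = dlx_instructions * DLX_WORD_BITS
    return tic_bits, dlx_bits, tic_bits / dlx_bits


print(size_ratio(3, 10))    # Gaussian elimination inner loop: (192, 320, 0.6)
print(size_ratio(13, 29))   # matrix multiplication loop: 832 bits against 928 bits
```

The Gaussian program needs 60% of the DLX code space, while the matrix multiplication program, which spends many words on index calculation, saves only about 10%.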
The superinstruction contained in the 64-bit Triple-Instruction word has the 
power of 4 floating point operations, 5 fixed point operations, 4 fetch or store operations 
and 1 branch instruction, for a total of 14 nontrivial atomic operations. This number may not 
be realized in practice, but an entire Gaussian elimination inner loop or matrix 
multiplication inner loop, requiring 9 or 8 DLX instructions respectively, can be specified in one 
superinstruction. Comparing the two implementations of the Gaussian elimination inner loop, 
the TIC version is little more than half the size of the DLX version (6 versus 10 32-bit words) 
because of the high utilization rate of the three functional units. 
It can be noticed that whenever [illegible] unit and branch unit decrease, the 
overall compactness of the TIC program [illegible] is small. The 
proportion of TIC instructions for indices calculation is much higher than in 
the Gaussian inner loop, and therefore 
much of the TIC instruction power is wasted and a longer TIC program results. 
[Bar chart; categories: inner loop, matrix multiplication] 
Figure 5.5 Compactness of machine instruction: TIC versus DLX implementation 
While the TIC code is compact, it is not complex. The TIC architecture is a pure 
RISC design without any complex instruction. The extended descriptive power and 
compactness are not obtained by adding complicated instructions but by taking a new 
viewpoint on the existing simple instructions. The TIC assembly language is very 
simple. Because of the inherent relationship among the three functional units, that is, the fixed point unit for 
preparing addresses, the floating point unit for the actual calculation and the branch unit for sequence 
control, it is even easier to write TIC code than DLX code, even though a TIC word 
consists of 64 bits holding three different partial instructions. 
5.2.2 Run-time efficiency 
The run-time efficiency of a computer architecture is hard to measure precisely. 
Our discussion starts from the following quantities: the 
number of execution cycles, the number of instruction fetches, the number of branches and the number 
of possible interruption points. 
The statistics for the Gaussian elimination inner loop programs and the matrix 
multiplication programs are listed below and summarized in figure 5.6: 
                                 # of execution    # of instruction   # of       # of possible 
                                 cycles            fetches            branches   interruption points 
Gaussian elimination    TIC      31                3                  Nil        4 
inner loop              DLX      49 + 3N¹          28                 3          27 
Matrix multiplication   TIC      11063             ?                  ?          1717 
                        DLX      17256 + 1100N     9236               1110       9235 
(? = entry not legible in the source) 
[Bar chart: # of execution cycles, # of instruction fetches, # of branches and # of possible interruption points] 
Figure 5.6a Comparison of run-time efficiency for Gaussian elimination inner loop 
programs of TIC and DLX implementation 
According to the table, the potential power of the TIC architecture is revealed. The 
integration of the self-branch instruction, coupled operations, the read-modify-store operation and the long 
¹ N is the number of instruction fetches; normally, one cycle per fetch is assumed. 
[Bar chart for the TIC and DLX architectures: # of execution cycles, # of instruction fetches, # of branches and # of possible interruption points] 
Figure 5.6b Comparison of run-time efficiency for matrix multiplication programs of TIC 
and DLX implementation 
instruction word with three essential functional units causes the number of instruction fetches, 
the number of branches, the number of possible interruption points and the number of execution cycles 
to drop significantly. 
Due to the use of the one-instruction loop, the number of instruction fetches and 
branches drops. Also, requests for interruption are only serviced at the end of the loop, not in the 
middle. This further reduces the time spent on interruption handling. The gain is therefore 
expected to be more significant when the TIC architecture handles real-life jobs in which 
exceptions occur more frequently. 
To compare the run-time efficiency, tests with different data sizes have been carried 
out (figure 5.7). For the matrix multiplication programs, the larger the matrix, the 
longer the inner loop. The run-time data are collected and summarized in the following 
table. A performance gain ratio (TIC/DLX) is defined by dividing the TIC figure by the 
corresponding DLX figure. For example, for the number of instruction fetches of the two 
architectures, the smaller the ratio, the better the performance of the TIC. 
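The performance gain ratio is a plain quotient, as the following sketch shows. The function name `gain_ratio` is an assumption, and the sample figures are taken from the table that follows.

```python
def gain_ratio(tic_count, dlx_count):
    """TIC/DLX ratio for one measured quantity; smaller means TIC does less work."""
    return tic_count / dlx_count


# Sample figures from the table: (TIC, DLX)
fetches_4x4 = (133, 716)
branches_3x3 = (12, 36)
interruptions_10x10 = (1717, 9235)

for tic, dlx in (fetches_4x4, branches_3x3, interruptions_10x10):
    print(round(gain_ratio(tic, dlx), 3))
```

Rounded to three decimals these give 0.186, 0.333 and 0.186, matching the ratios printed in the table.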
Matrix multiplication    # of instruction fetches    # of branches          # of possible interruption points 
(matrix size)            TIC    DLX     Ratio        TIC   DLX    Ratio     TIC    DLX     Ratio 
 3 x 3                   75     332     0.225        12    36     0.333     93     331     0.281 
 4 x 4                   133    716     0.186        ?     80     0.250     ?      715     ? 
 5 x 5                   207    ?       0.157        ?     150    ?         ?      1315    0.233 
 6 x 6                   297    2180    0.136        ?     252    0.167     477    2179    0.219 
 7 x 7                   403    3356    0.120        ?     392    0.143     ?      3355    0.208 
 8 x 8                   ?      4892    0.107        ?     576    0.125     973    4891    0.199 
 9 x 9                   ?      6836    0.097        ?     810    0.111     ?      6835    0.192 
10 x 10                  ?      9236    0.088        ?     1110   0.099     1717   9235    0.186 
(? = entry not legible in the source) 
As shown in the table, the relative performance of the TIC improves as the length of the loop 
increases. This is because the TIC architecture adopts the idea of a compact instruction format 
from the vector architecture. Besides, the superinstruction-based looping structure greatly 
reduces the number of instruction fetches, the number of branches and so on, and hence reduces the 
execution cycles of the TIC. 
5.2.3 Programming effort 
When compared with most existing LIW and VLIW code, TIC code is much 
easier to generate. Unlike the traditional VLIW computer, the TIC makes use of a fixed group of 
functional units: a fixed point part, a floating point part and a branch part. The inherent order of 
the three sub-operations, namely the preparation of the addresses of the floating point operands, 
the calculation on the floating point operands and the determination of the execution sequence, guides 
users to work with the TIC in a smoother way. For most applications, no trace compiler 
[Line chart: performance gain ratio (TIC/DLX) against dimension of matrix, 3 to 10; legend entries include # of branch and # of interruption. Remark: the less the ratio, the better the performance.] 
Figure 5.7 Comparison of run-time efficiency ratio for matrix multiplication programs of 
TIC over DLX implementation for different loop sizes. 
like optimising tools are necessary. The compilation time of TIC programs is therefore kept within 
a reasonable range. 
Likewise, when compared with vector machine code, TIC code is also much 
easier to manage. The flexible coupled operation helps to reduce the wastage of floating 
point processing units. Also, the user can simply select either coupled mode or single 
mode at any point in a program by setting the couple switch. In summary, the 
programming effort for writing TIC codes is less than for those of VLIW, 
5.3 Summary of the architecture 
In conclusion, a new architecture that extracts key ideas from VLIW-based, 
vector-based and superscalar architectures, the Triple-Instruction Computer architecture, is 
introduced. The smart superinstruction, crystallized from the ideas of the self-branch operation, 
coupled operations, the long instruction word computer and vector machines, produces short code 
and increases the run-time efficiency significantly. Moreover, the idea of using a more 
powerful instruction as an atomic programming unit to reduce the number of instruction 
fetches, the number of branches and so on is shown to be beneficial. 
The simulation results are encouraging, and they suggest that the new architecture has 
value for the future development of computer architecture. 
Chapter 6 Discussion and Conclusion 
We are currently investigating a triple-instruction computer architecture. A simulator 
was built and significant scientific applications have been implemented for comparison. 
The discussion covers the Triple-Instruction Computer architecture and the use 
of APL as an architectural simulation language. 
6.1 The triple-instruction computer 
According to the encouraging simulation results, the new parallelism enhancement 
concepts of the Triple-Instruction Computer are supported as a working hypothesis. This 
design, crystallized out of the LIW architecture, the vector architecture and the load-and-store 
RISC architecture, shows considerable potential for further investigation and examination. 
The design of the three different inter-related functional units and the implicit 
execution sequence of the three micro engines greatly simplify the process of generating 
TIC code. An innovative self-branch concept is also included. As a result, both the length 
and the complexity of TIC code are low. 
Moreover, parallel execution of the three partial instructions demonstrates another 
method of exploiting additional parallelism, besides the superscalar, vector and LIW 
architectures. 
Finally, a compact and powerful Triple-Instruction word is proposed. The 
resulting superinstruction is obtained by joining several ideas: 
a. three-operand address code with ADD2 and MULADD atomic instructions 
The three-operand instructions for the fixed point unit and the floating point unit provide 
'compute and originate' type operations for calculating a complex index in the fixed point unit 
in one cycle. Moreover, the multiply-and-add (MULADD) and add-and-add (ADD2) 
instructions of the floating point unit help to solve the "store-fetch" forwarding problem 
and provide up to four floating point operations in one cycle, and thus strengthen the 
scientific processing power, for example in vector and matrix manipulation. 
b. coupled operations 
A coupled operation performs the specified operation on all elements of the 
coupled register pairs. Consequently, twice the number of operations can be specified in one 
instruction, which reduces the decoding time. If a shadow ALU and data bus are employed, 
the coupled operation can be done in parallel. Moreover, the inclusion of coupled 
operations is almost independent of the architecture. This also produces a significant efficiency 
gain when dealing with vector or matrix manipulations. 
c. self-branching technique 
A self-branch instruction continues to be executed while its condition is valid. 
The inclusion of the self-branch instruction makes TIC programs shorter. Moreover, loop 
structures can easily be mapped into one TIC instruction. 
d. coupled-read, modify and store instruction (CRMS) 
A powerful primitive, the coupled-read-modify-store atomic instruction, helps to provide 
sufficient operands for the floating point unit when the TIC is in its fully loaded mode. Two 
floating point operands are retrieved by a coupled-read operation, and the next operand 
addresses are coupled-updated by the fixed unit while the floating point operation is 
performed by the floating point unit. After the floating point modification, the result is 
stored back to the corresponding address. This feature provides the one-instruction loop with 
enough fetch-and-store power. 
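The way ideas b to d combine into a one-instruction loop can be illustrated with a small functional model of a multiply-accumulate loop. It is a sketch under stated assumptions and not the simulator's code: the function name, its parameters and the dictionary used as memory are hypothetical, the model is not cycle-accurate, and it skips the final coupled read once the exit condition is met, which the hardware may not do.

```python
def one_instruction_loop(mem, a_addr, b_addr, a_step, b_step, last_a_addr):
    """Functional model of a self-branching multiply-accumulate word.

    Returns the accumulated sum and how many times the word was executed.
    """
    acc = 0.0
    fdr1, fdr2 = mem[a_addr], mem[b_addr]        # priming coupled read
    executions = 0
    while True:
        executions += 1
        acc += fdr1 * fdr2                        # floating unit: multiply-and-add
        a_addr += a_step                          # fixed unit: coupled address update
        b_addr += b_step
        if a_addr > last_a_addr:                  # branch unit: self-branch not taken
            break
        fdr1, fdr2 = mem[a_addr], mem[b_addr]     # coupled read of the next operands
    return acc, executions


# Hypothetical memory image: a row of A at 100, 102, 104 and a column of B at 300, 320, 340.
memory = {100: 1.0, 102: 2.0, 104: 3.0, 300: 4.0, 320: 5.0, 340: 6.0}
print(one_instruction_loop(memory, 100, 300, 2, 20, 104))  # (32.0, 3)
```

The point of the model is that the whole loop body, including address updates, operand fetches and the loop test, belongs to a single word that is fetched once and re-executed, so no further instruction fetches or separate branch instructions are needed inside the loop.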
The resulting TIC word therefore has the power of 4 floating point operations, 5 
fixed point operations, 4 fetch and store operations and 1 branch instruction, a total of 14 
nontrivial operations. Although this number may not usually be realized in practice, the idea 
of clustering enough processing power into one superinstruction is shown to be worthwhile. 
In the case studies, the one-instruction loop is easy to generate using this kind of 
superinstruction, and the simulation results reveal that the number of instruction fetches, the number 
of branches and the number of possible interruption points drop rapidly when compared with 
superscalar machines. Reduction of the number of instruction fetches and the number of branches 
obviously leads to faster execution speed. The idea of the superinstruction also reduces the 
number of times interruption is checked for and provides better performance accordingly. In 
conclusion, the TIC architecture initiates some alternative ideas for increasing parallelism 
and is shown to be effective. 
6.2 The use of APL for architectural simulation 
Besides the proposed TIC architecture, as the thesis title states, we are also 
interested in the architecture investigation process itself. As mentioned in previous chapters, 
the computer architecture design process is an iterative trial-and-error process, and amendments 
and alterations to the draft design cannot be avoided. These 
continual modifications slow down the implementation process and increase the error rate. 
APL demonstrates its potential as an architectural simulation tool compared with other 
alternatives. 
Since IBM employed APL as a hardware description language as well as a simulation 
language for System/360, it has been regarded as an outstanding programming 
language for computer architecture design simulation. From the simulation 
carried out, we conclude that for prototyping a new unconventional architecture, such 
as the TIC, the use of APL helps to minimize the programming effort, shorten the project 
completion time and achieve a commonly understood standard for describing 
the simulation. Some worthwhile experiences are summarized below: 
a. prototype test for core design 
For our design, the most critical and urgent parts are the instruction 
schedule unit and the interlocks between micro engines. In APL, we can outline the design 
without defining the internal details; that is, the three micro engines initially do nothing except 
accept and generate synchronization signals. Each micro engine can be filled in 
afterwards, with no need to modify the core program any more. The process is 
like constructing a building from the skeleton to the details. Unlike most of the other programming 
languages, APL provides a better understanding of the overall structure of the system at the 
very beginning and makes the design process more straightforward. 
b. frequent amendments to the draft architecture 
The draft architecture changes frequently as problems are fixed. One troublesome consequence 
is the modification of instruction bit patterns. For example, version n of the RR 
format is as follows: 
| 0 | opr | C | X | F | T |  Rx   |  Ry   |  Rz   | - | 
  0   1-5   6   7   8   9  10-14   15-19   20-24   25 
where T indicates either a read or a write, and the last bit is a don't-care bit. 
After the author found that both a read and a write may occur in the same instruction (one 
at the beginning of the cycle and the other at the end), the RR instruction was amended 
as follows: 
| 0 | opr | C | X | F | R | W |  Rx   |  Ry   |  Rz   | 
  0   1-5   6   7   8   9  10  11-15   16-20   21-25 
This is an awkward situation, since all indexes related to the RR fixed instruction need 
to be changed. For example, Rx becomes INST[11..15] rather than INST[10..14]. In most 
programming languages, it is necessary to alter every reference of the form INST[xxx]. Luckily, in 
APL, we can use the functions drop and take¹ to localize the amendment. 
Therefore, careless errors can be minimized. 
Moreover, the APL code is organized into groups of functions, which implies a structured 
approach. It remains manageable even when corrections are frequent. 
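The take-and-drop idiom keeps every field position in one place, so a format change touches a single definition rather than every indexed reference. A minimal Python sketch of the same idea follows. The names `field`, `RR_LAYOUT_OLD`, `RR_LAYOUT_NEW` and `decode` are assumptions; the Rx offsets come from the text, while the Ry and Rz offsets are inferred from 5-bit register fields.

```python
def field(inst, offset, width):
    """Take `width` bits after dropping `offset` bits (APL: width take offset drop inst)."""
    return inst[offset:offset + width]


# Hypothetical layout tables: field name -> (bits to drop, bits to take)
RR_LAYOUT_OLD = {"Rx": (10, 5), "Ry": (15, 5), "Rz": (20, 5)}
RR_LAYOUT_NEW = {"Rx": (11, 5), "Ry": (16, 5), "Rz": (21, 5)}


def decode(inst, layout):
    """Extract every named field of a 26-bit partial instruction."""
    return {name: field(inst, offset, width)
            for name, (offset, width) in layout.items()}


# Use the bit positions themselves as contents to show which bits are selected.
positions = list(range(26))
print(decode(positions, RR_LAYOUT_NEW)["Rx"])  # [11, 12, 13, 14, 15]
```

Moving from the old format to the amended one only requires switching the layout table; the decoding code itself is unchanged.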
¹ Rx←5↑11↓INST 
c. capacity problem of microcomputers 
Since computer architecture design simulation usually requires a huge memory space, 
almost all realistic systems, such as the 80384 and 68040 microprocessors, are simulated on 
mainframes or minicomputers. The STSC APL*PLUS system, like other APL implementations, 
supports virtual workspaces, which provides a way to tackle the capacity problem. 
Additionally, the standard processing power of the APL language facilitates the 
simulation. In summary, the advantages and disadvantages of using APL for architectural 
simulation are as follows: 
Advantages 
i. Powerful vector processing capabilities simplify the programming process 
ii. Fast modelling and easy changing: suitable for the iterative design process 
iii. More flexible for modelling non-classical architectures than HDLs and
simulator packages
iv. Solution to capacity problem 
v. Compact codes and structural approach 
Disadvantages 
i. Difficult for those without APL programming experience to handle. 
ii. The export utility of the APL environment is not well equipped, and the
character set of APL is not compatible with ASCII. Conversion
problems occur when program listings and screen dumps are needed.
APL has been chosen as the simulation language. Its multiple levels of abstraction and
special communication protocol make the language suitable for computer architecture design
simulation. Moreover, the project shows that microcomputers can also be used for
computer architecture simulation. Owing to the existence of microcomputer-based APL
systems, simulation can be carried out with simple and inexpensive computing equipment
while the benefits of large-scale machines, such as larger space capacity, are retained.
6.3 Further considerations 
The ideas of the TIC architecture have proved worthwhile, although the implemented
simulator still has room for improvement. Firstly, the pipeline of the Triple-Instruction
Computer should be modified. If each floating point operation completes in each cycle, the
execution speed increases right away, and several partial instructions can be initiated at a
time so as to achieve a high execution speed. Consequently, the design of the maximum time-lag
among different functional units² must be carefully reconsidered. This design keeps the
machine in a synchronous mode for branches or interruptions and also improves run-time
efficiency. Moreover, the idea of coupled register groups should be explored and
investigated in depth. A dynamic reconfiguration mechanism for coupled register groups can
be considered as a new category of super-superscalar machines.
Besides adjusting the architecture, it would be interesting to find a clear presentation
method to show and describe the execution behavior of the TIC. Such a presentation would help to
standardize and simplify the description of the execution behavior of VLIW-based and
superinstruction-based machines.
Moreover, the characteristics, design principles and a thorough analysis of the TIC
compiler should be studied. A TIC compiler should be built to run benchmark programs so
as to make a solid comparison of the compilation effort of the TIC machine and a VLIW
machine. As stated in this thesis, for a given scientific problem, the compilation effort of the
Triple-Instruction Computer is much less than that of Long Instruction Word computers; to
provide concrete proof, a quantitative comparison is strongly desired.
In conclusion, the TIC architecture has proved to be effective. Several architectural
innovations have been introduced. The idea of the superinstruction produces encouraging results. In
addition, the APL language has been found to be easy to manipulate, clear to understand and
efficient for developing architectural simulations.
² That is, when any functional unit is idle, whether the next partial instruction can be executed ahead. We
are interested in the maximum number of lookahead executions. That is, if the fixed point instruction X5 is




[ANDERSON] ANDERSON, D.W., SPARACIO, F.J., AND TOMASULO, R.M., "The IBM 360
Model 91: Machine philosophy and instruction handling," IBM Journal of Research and
Development, Vol. 11, No. 2, pp. 8-24, January 1967.
[BLA] BLAAUW, G.A., Digital System Implementation, Prentice-Hall, Inc., 1976.
[CHEN] CHEN, T.C., "Coordinated Machinery for Performance in Automatic Computing,"
Summer work Report at Dept. 513, IBM Watson Research Centre, 1982.
[CHEN] CHEN, T.C., AND KING, W.K., Computer Architecture, lecture notes of course held
in the Chinese University of Hong Kong, 1990-1992.
[COLWELL] COLWELL, R.P., NIX, R.P., O'DONNELL, J.J., PAPWORTH, D.B., AND RODMAN,
P.K., "A VLIW Architecture for a trace scheduling compiler," IEEE Trans. on Computers,
Vol. 37, No. 8, August 1988.
[EISEN] EISENBERG, M., AND PEELLE, H.A., "A survey of 'APL Thinking'," APL Quote
Quad, ACM Press, Vol. 21, No. 2, pp. 5-8, December 1990.
[ELLIS] ELLIS, J.R., Bulldog: A Compiler for VLIW Architectures, The MIT Press, 1986.
[FISHER] FISHER, J.A., "Trace scheduling: A technique for global microcode compaction,"
IEEE Trans. on Computers, Vol. 30, No. 7, pp. 478-490, July 1981.
[FISHER] FISHER, J.A., "Very Long Instruction Word Architectures and the ELI-512," Proc.
of the 10th Annual International Symposium on Computer Architecture Conf., IEEE
Computer Society and ACM, pp. 140-150, June 1983.
[FISHER] FISHER, J.A., ELLIS, J.R., RUTTENBERG, J.C., AND NICOLAU, A., "Parallel
processing: A smart compiler and a dumb machine," Proc. of SIGPLAN Conf. on Compiler
Construction, Palo Alto, CA, pp. 11-16, June 1984.
[FLYNN] FLYNN, M.J., "Very high-speed computing system," Proc. IEEE, Vol. 54, No. 12, pp.
1901-1909, December 1966.
[FLYNN] FLYNN, M.J., AND HUCK, J.C., Analyzing Computer Architecture, IEEE Computer
Society Press, 1989.
[GILOI] GILOI, W.K., AND BEHR, P.M., "APL*DS - An APL-based Hardware Specification
Simulation System," APL 80, North-Holland Publishing Company, pp. 53-61, 1980.
[HART] HARTENSTEIN, R.W., Hardware Description Languages, North-Holland Publishing
Company, 1987.
[IBM] "The IBM RISC System/6000 processor," collection of papers, IBM Journal of
Research and Development, Vol. 34, No. 1, January 1990.
[IEEE] ANSI/IEEE Std 754-1985, An American National Standard: IEEE Standard for
Binary Floating-Point Arithmetic, 1985.
[MCFARLING] MCFARLING, S., AND HENNESSY, J., "Reducing the cost of branches," Proc.
of the 13th Symposium on Computer Architecture, Tokyo, pp. 396-403, June 1986.
[MYERS] MYERS, G.J., Advances In Computer Architecture, 2nd ed., John Wiley & Sons,
Inc., 1990, 9-33.
[NICO] NICOLAU, A., AND FISHER, J.A., "Measuring the parallelism available for very long
instruction word architecture," IEEE Trans. on Computers, Vol. 33, No. 11, pp. 968-976,
November 1984.
[PATTERSON] PATTERSON, D.A., "Reduced Instruction Set Computers," Communications of
the ACM, Vol. 28, No. 1, pp. 8-21, January 1985.
[PATTERSON] PATTERSON, D.A., AND HENNESSY, J.L., Computer Architecture: A
Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990.
[PILOTY] PILOTY, R., AND BORRIONE, D., "The CONLAN Project: Concepts,
Implementations, and Applications," IEEE Computer, Vol. 18, No. 2, pp. 81-92, February
1985.
[RUSSELL] RUSSELL, R.M., "The Cray-1 computer system," Communications of the ACM,
Vol. 21, No. 1, pp. 63-72, January 1978.
[SMITH] SMITH, J.E., "A study of branch prediction strategies," Proc. of the 8th Symposium
on Computer Architecture, Minneapolis, pp. 135-148, May 1981.
[SMITH] SMITH, J.E., "Decoupled access/execute computer architectures," ACM Trans. on
Computer Systems, Vol. 2, No. 4, pp. 289-308, November 1984.
[SMITH] SMITH, J.E., AND PLESZKUN, A.R., "Implementing precise interrupts in pipelined
processors," IEEE Trans. on Computers, Vol. 37, No. 5, pp. 562-573, May 1988.
[STALLINGS] STALLINGS, W., "Reduced Instruction Set Computer Architecture," Proc. of the
IEEE, No. 1, pp. 38-55, 1976.
[VON NEUMANN] VON NEUMANN, J., "First draft of a report on the EDVAC." Reprinted
in W. Aspray and A. Burks, eds., Papers of John von Neumann on Computing and
Computer Theory, The MIT Press, 1987, 17-82.
Appendix I: Program listing for the TIC simulator 
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Directory
1 L I S T - 1 3 U P D A T E x r e g - 3 1 
口 孤 ― 1 L I W C - 1 4 V t o B - 3 3 
裒0RUN1—1 M A T MTTT - , 
MAT_MUL-14 V t o B l l - 3 3 
IDRUN2-1 MTPPHR on 
MICROB-20 V t o B 1 7 - 3 3 
10RUN3-2 MTPPOTT OI “ 
M丄CROF-21 V t o B 2 4 - 3 3 
C0RUN4-2 MTPPnv oo 
M工CROX-22 V t o B 2 7 _ 3 3 
E : ) B " 2 MICROX一MICROF-22 V t o B 3 1 - 3 3 
HLTER-2 PUTFtoMEM-2 3 VtoB32—33 
駢 C H D R - 3 READ-24 V t o B 3 3 - 3 3 
E A D Y ~ 3 REG—A-24 V t o B 8 - 3 3 
t T V " 4 REG—B-25 V t o H E X 1 6 - 3 4 
REG—D-25 V t o H E X 4 - 3 4 
K : 3 L A Y ~ 4 REG—E-25 V t o H E X 8 - 3 4 
REG—G-25 V t o I E E E 6 4 - 3 4 
REG—H-2 6 WRITE-35 
紅卜5 RW-26 XALU-3 5 
一 ST0REmemC0MPLETED-5 R W a b l e - 2 6 X F I L T E R - 3 6 
隨 — S T O R E l t i e m l N I T - ? SETC-2 6 XMPX1_3 6 
E2:HmemCOMPLETED-8 SETF-27 XREADY-37 
© : H i a e m I N I T - 9 SETX-27 XTEST一 DBZ-38 
FILTER-9 SHOW-27 X t o F - 3 8 
SHOW—FREG-27 u B I R r d y - 3 8 
RCIDY-10 SHOW一 XREG-27 UBMPX1-38 
r i :T__DBZ-10 STATUS-28 UBMPX2-38 
ST0REmemC0MPLETED-2 8 u F I R r d y - 3 9 
阶 S I A N - l l STOREmemINIT，29 u F M P X l - 3 9 
贬 一 1 1 T B L c - 2 9 UFMPX2-3 9 
S ! : f r o m M E M - l l T B L s - 2 9 uSTATUS-3 9 
• 6 4 t o V - 1 2 T I T L E - 2 9 u X I R r d y - 3 9 
奴 1 2 T V t o B - 2 9 UXMPX1-40 
3 ^ - 1 3 U P D A T E f r e g - 3 0 一 UXMPX2-4 0 
现 1 3 U P D A T E p c v c - 3 1 
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 1
Objects: ADtoB ALTER AUTORUN1 AUTORUN2
•RES—ADtoB A ； TEMP ； S IGNIF ICANT ； ONE ； S 
二 TEMP—A O RES—pO • SIGNIFICANT—2 3 O S—0 
' L 1 : 0 N E " ^ T E M P X 2 • TEMP-(TEMPX2)-ONE • RES^RES,ONE O S-SvQNE • SIGNIFICANT 
— S I G N I F I C A N T — 
二 —L Ix l (S IGNIF ICANT之0)a(TEMP关0) 
々 E N D : ^ 0 
V 
•B—ALTER A ; T 
1 B—pO 
2 LOOP:—Ex L ( 0 = p A ) 
35 T<~1 个A O A ^ - l l A 





L XREG<-3 2pO O FREG«-16p0 O MEM<-128pO 
2: XREG[ 1 2 3 4]— 0 10 20 10 
3: FREG[ 1 2 3 4]— 0 1 2 10 
4. M E M [ l l ] < - B t o V 32个Vto IEEE64 5 . • MEM[12]^-BtoV 32山VtoIEEE64 5 
5 i MEM [ 1 6 ] —PROGRAM1 




XREG<-3 2pO O FREG—16p0 O MEM<-256pO 
XREG[2 3 4]— 208 10 210 
I—10 O C—l 
4 LOOP:—LEAVEx i (C>100) 
M E M [ I + 1 ] — B t o V SSTVtoIEEEGA C O MEM[工+2]—BtoV 3 2 i V t o I E E E 6 4 C 
•—I—1+2 O O C + 1 O —LOOP 
2 LEAVE ： MEM [ L 8 ] <-PROGRAM2 • C 
B ； STATUS O LIINKEY • MEM O •工NKEY 
^.i LIWC 1 
1H 
V；! V 
•AUT0RUN3 
XREG—3 2 p 0 O FREG—16p0 O MEM—128p0 
XREG[1 2 3 4]— 0 10 20 10 O XREG[9 10 1 1 12]— 0 100 200 12 
F R E G [ 1 2 3 4]— 0 1 2 10 • FREG[9 10 1 1 12]— 0 10 20 100 
厶 M E M [ 1 1 ] — B t o V 3 2 T V t o I E E E 6 4 5 O 巧EM[12]—BtoV 3 2 i V t o I E E E 6 4 5 
M E M [ 1 3 ] — B t o V 32个Vto IEEE64 50 • MEM[14]—BtoV 3 2 i V t o I E E E 6 4 50 
S MEM[16]—PROGRAM3 
又 STATUS • •工N K E Y O MEM O 0INKEY 
S LIWC 1 
V 
VAUTORUN4；工,• C 
U XREG—32pO • FREG<-16pO O MEM<-256pO 
2S XREG[2 3 4]— 30 10 210 
3£ 工—10 • C<-1 
45 LOOP:—LEAVExi (C>10) 
5c M E M [ I + l ] < - B t o V 32个Vto IEEE64 C O MEM[I+2 ] —BtoV 3 2 i V t o I E E E 6 4 C 
6c 工—工+2 • C—C+l • —LOOP 
” LEAVE ： MEM [ L 8 ] <-PROGRAM2 O 0<-C 
88 STATUS O •工N K E Y O MEM • OINKEY 
卵 LIWC 1 
V 
•RES—BDtoB A；TEMP；ONE 
丄 TEMP—A O RES<-p0 





1 L I ： UADDR^-BSTART [ 1 + B t o V OP] O ->E N D 
END:-»0 
V 
∇BRANCHER;OP
 →END×⍳((NOB=1)∨(BDATArdy≠1)) ⍝ -- BRANCHER keep idle --
 OP←1+2⊥uBIR[14+⍳4]
 ⍝ '[BRANCHER] BRANCHER-OP=' ⋄ ⎕←OP
 →ERR×⍳(OP<1)∨(OP>16)
 →(NOBR,R2,EQ,NEQ,LT,LTE,R3,R4,R1,BCF,BXO,BSF,BFO,BFU,BFZ,BR)[OP]
 BALUrdy←1
EQ:VC←REG_Gout=REG_Hout ⋄ →END
NEQ:VC←REG_Gout≠REG_Hout ⋄ →END
LT:VC←REG_Gout<REG_Hout ⋄ →END
LTE:VC←REG_Gout≤REG_Hout ⋄ →END
BCF:VC←(VtoB32 XREG[1+16])[1+30] ⋄ →END
BXO:VC←(VtoB32 XREG[1+16])[1+27] ⋄ →END
BSF:VC←(VtoB32 XREG[1+16])[1+25] ⋄ →END
BFO:VC←(VtoB32 XREG[1+16])[1+23] ⋄ →END
BFU:VC←(VtoB32 XREG[1+16])[1+22] ⋄ →END
BFZ:VC←(VtoB32 XREG[1+16])[1+21] ⋄ →END
R1:→END
R2:→END
R3:→END
R4:→END
NOBR:VC←0 ⋄ →END
BR:VC←1 ⋄ →END
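The computed branch in BRANCHER selects one of sixteen labels from a 4-bit opcode and sets the condition flag VC. A minimal Python sketch of that dispatch is given below, assuming the label order and PSW bit positions legible in the listing. The function name `branch_condition`, the tables and the 0-based opcode are illustrative choices and not part of the simulator.

```python
import operator

# Label order as it appears in the computed branch of the listing.
LABELS = ["NOBR", "R2", "EQ", "NEQ", "LT", "LTE", "R3", "R4",
          "R1", "BCF", "BXO", "BSF", "BFO", "BFU", "BFZ", "BR"]

# Register comparisons between REG_Gout and REG_Hout.
COMPARE = {"EQ": operator.eq, "NEQ": operator.ne,
           "LT": operator.lt, "LTE": operator.le}

# Flag tests read one bit of the PSW (0-based bit numbers, as in [1+30]).
PSW_BIT = {"BCF": 30, "BXO": 27, "BSF": 25, "BFO": 23, "BFU": 22, "BFZ": 21}


def branch_condition(op, g, h, psw_bits, vc=0):
    """Return the new VC for a 0-based opcode; reserved opcodes keep vc."""
    if not 0 <= op < len(LABELS):
        raise ValueError("branch op out of range")
    label = LABELS[op]
    if label in COMPARE:
        return int(COMPARE[label](g, h))
    if label in PSW_BIT:
        return psw_bits[PSW_BIT[label]]
    if label == "BR":        # unconditional branch
        return 1
    if label == "NOBR":      # never branch
        return 0
    return vc                # R1..R4 are reserved and leave VC untouched
```

A table lookup replaces the APL computed goto; the range check uses a plain "outside 0..15" test, which corresponds to an OR of the two bounds checks.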




A — - TEST I F REG—G REQUIRES XDR1 
R — ( l = X D R l r d y ) v ( ( i + i 8 ) 关 （ 1 + ( 2 丄 工 r [ 4 9 + l 4 ] ) ) ) • 
kl 
A TEST I F REG—G REQUIRES XDR2 
i Ll.-R^ RA ( ( l = X D R 2 r d y ) v ( (1+26) ^ (1+(2±IR[49+l4] ) ) ) ) O —L2 
4 k\ 
fl TEST I F REG一H REQURIES XDR1 
^ L2 :R^ -Ra ( ( l = X D R l r d y ) v((1+18)^(1+(2HR[53+l4])))) O ->L3 
Sj| …… 
ill A TEST I F REG_H REQURIES XDR2 
R A ( ( l = X D R 2 r d y ) v ( ( i + 2 6 ) ^ ( l + ( 2 l I R [ 5 3 + l « 4 ] ) ) ) ) O —END 
i|| 
中 END： fi \n<r-' [BREADY] DATArdy : • • 口—R 
I ->0 
V 





_ L I :TVALUE—BtoV BITVEC 
• • 
1WALUE—BtoV BITVEC I VALUE—2丄BITVEC 
V 
I 
•DISPLAY V ; I ; M ; E 
I — V [ l ] O M—侈 O E — V [ 1 ] + 2 X V [ 2 ] X V [ 3 ] 
[DISPLAY], 






TEMP—A • RES—pO O COUNT—0 O SIZE—4 0 








• F A L U ; O P 
！ R W [ F A L U ] F A L U r d y i n FALU=' O D^-FALUrdy 
1 A ^ ' [ F A L U ] R E G _ D o u t = ' O ^ R E G _ D o u t O , REG_Eou t= ' O R E G E o u t 
E^NDX-L (NOF=l) V ( F D A T A r d y ^ l ) 一 
-^ READYXL ( F A L U r d y = l ) 
F A L U r d y — F A L U r d y + 1 
—END 
READY：OP—1+2 丄UFIR [15+L 3 ] 
— E R R x l ( 0 P < 1 ) a ( O P > 8 ) 
— ( N O P , A D D , S U B , M U L , D I V , E , R l , R 2 ) [OP] 
I ] ADD: REG_Fout^Ftemp<-REG Dout+REG E o u t 
丨]REG—Fout一一Ftemp一一REG_Dout一+REG—Eout一 • SETF O FALUrdy—_2 O —END 
I ] SUB ： REG_Fout—Ftemp—REG一Dout-REG—Eout 
I ] REG_Fou t_^ -F temp_^ -REG_Dou t_ -REG_Eou t_ O SETF O FALUrdy—-2 O —END 
I I MUL: R E G一 Fout—Ftemp—REG 一 DoutxREG一 E o u t 
j ] REG—Fout一—Ftemp一—REG_Dout一XREG一Eout— • SETF O F A L U r d y — O —END 
I ] DIV:REG—Fout—Ftemp—REG Dout- fFTEST DBZ REG E o u t 
一 ‘ — ~ ： 圓 ： 
]REG—Fout—釦Ftemp——REG—Dout_+FTEST—DBZ REG_Eout_ O SETF O FALUrdy— -2 O ->EN 
D 
- ]E:REG—Fout—F七emp—REG—Eout 
: ] REG_Fout_«-F temp_^-REG_Eout_ O —END 
: ] R l : — E N D 
:j] R2 : ->END 
:j] NOP:—END 
:l] ERR： EPUT 7 [ FALU] F-工NTERNAL ERROR: FLOATING OP OUT OF RANGE' 





A _ - - RESTORE A L L ？DRrdy —— 
— L l O x i ( N E X T _ X D R l r d y = 1 ) 
XDRl rdy—NEXT一XDRlrdy • NEXT一XDRlrdy—1 
•！ L 1 0 : — L l l x t (NEXT_XDR2rdy= 1) 
.XDR2rdy—NEXT—XDR2rdy • NEXT—XDR2rdy—1 
3 L 1 1 : - » L 1 2 X L ( N E X T _ F D R l r d y = l ) 
FDRl rdy—NEXT一FDRl rdy • NEXT一FDRlrdy—1 
S L 1 1 : - > L 1 2 X L (NEXT_FDR2Rdy= 1) 
关 FDR2rdy—NEXT一FDR2rdy • NEXT一FDR2rdy—1 
U “ . 
I.] A c h e c k w h e t h e r t h e MEMory FETCH i s c o m p l e t e d — -
I 丨 ] X D R l r d y ^ X D R l r d y + ( X D R l r d y < 0 ) o - > L l x , (XDRl rdy^O) 
. ] [ F E T C H — S T O R E m e m C O M P L E T E D ] PORT1 M D R K ' 
3 口 " 双 肪 [ 1 8 + 1 ] — m E M [ 1 + X R E G [ 3 0 + 1 ] + x r e g [ 1 7 + 1 ] ] O x D R l r d y ^ l 
R ----PORT 1 一 —— 
] L I : X D R 2 r d y ^ X D R 2 r d y + (XDR2rdy<0) O (XDR2rdy^0) 
] ' [ F E T C H一S T O R E m e m C O M P T E L E D ] PORT2 MDR2<^ 
j ] D-XREG [ 2 6 + l ] .MEM [ 1+XREG [ 3 0 + l ] -f-XREG [ 2 5 + l ] ] O X D R 2 r d y ^ l 
A - ~ P O R T 2 
] L 2 : F D R l r d y — F D R l r d y + ( F D R l r d y < 0) • —L3 x i ( F D R l r d y ^ 0) 
] ' [ FETCH一STOREmemCOMPLETED ] PORT3 F D R K ‘ 
j ] •< -FREG[6+ l ]< - IEEE64 toV(D^(32p2 )TMEM [ l+XREG[30+ l]+XREG[17+ l ] ] ) , ((32
P
2) tMEM 
[1+XREG [3 0 + 1 ] +XREG [ 17+1 ] + 1 ] ) O FDRlrdy—1 fl ——PORT 3 —— 
U L3 :FDR2rdy—FDR2rdy+ (FDR2rdy<0 ) O -^L4x l (FDR2rdy^0) 
i ] [FETCH—STOREmemCOMPLETED] PORT4 FDR2<' 
^ •^-FREG[14 + l]^IEEE64toV((32p2)TMEM[ l+XREG[30+ l]+XREG[25+ l]]) , ((32p2)TMEM[ 
1+XREG [3 0 + 1 ] +XREG [ 25+1 ] + 1 ] ) O FDR2rdy—1 fl — P O R T 4 —— 
) 
i ) L4： 
A c h e c k w h e t h e r t h e MEMo 
r y WRITE c a n be c a r r i e d o u t 
I I — L 5 X L ( X S D R l r d y ^ l ) 
11 '[FETCH_STOREmemCOMPLETED] WRITE • 
|l Q^-MEM[ 1+XREG[3 0+1] +XREG[ 19+1 ] ] ^-XREG[2 0+1] • • “ 〉 P O R T 1 ' <0 XSDR2rdy—0 
•丨 L 5 : — L 6 x i ( x S D R 2 r d y 关 1 ) 
[FETCH一STOREmemCOMPLETED] WRITE • 
MEM[1 十XREG[30+1]+XREG[27+1] ]—XREG[28士 1 ] • PORT2 A O XSDR2rdy—0 
i L 6 : x L ( F S D R l r d y ^ 1) 
[FETCH—STOREmemCOMPLETED] WRITE ' 
MEM[1+XREG[30+1]+XREG[19+1] ]—2丄32TVtoIEEE64 FREG[7+1] O ' : ' 
MEM[1+XREG[30+1]+XREG[19+1]+1]—2丄32 iVto IEEE64 FREG[7+1] O FSDRlrdy<-0 
U<r-'> PORT3 ' 
M L7:->ENDXL (FSDR2rdy关 1) 
1 [FETCH—STOREmemCOMPLETED] WRITE 1 
• —MEM[1+XREG[30+1]+XREG[27+1] ]—2丄32TVtoIEEE64 FREG[15+1] O , : ' 
MEM[1+XREG[30+1]+XREG[27+1]+1]—2丄32iVto IEEE64 FREG[15+1] O FSDR2rdy—0 
PORT4' O —END 
f L l O l E P U T ' INTERNAL ERROR： KEMORY PORT NUMBER OUT OF RANGE' 
3] END:—0 
V 
•FETCH STOREmemlNIT 
— 
A BY PASS I N V A L I D READ/WRITE INSTRUCTION 
-^ENDxL-RWab le I R 
fl i n i t i a t e NEW FETCH a n d NEW STORE 
A I R [ l + 6 ] = C o u p l e i n s t r u c t i o n i n d i c a t i o n (1一coup led ) 
A I R [ l + 8 ] = P u t r e s u l t i n t o . MAR ( f i x e d / f l o a t i n g READ) ( 1 - e n a b l e ) 
fi I R [ l + 9 ] = P u t r e s u l t i n t o MSAR ( f i x e d / f l o a t i n g WRITE) ( l ~ e n a b l e ) 
fi I ! R [ l + 7 ] =5 T y p e o f o p e r a t i o n ( 1 一 F l o a t i n g ) 
n 
1] fi [FETCH—STOREmemlNIT] u X I R [ 3 1 ] = ' O u X I R [ 3 1 ] 
I ] A CO" [FETCH一STOREmemlNIT] u F I R [ 2 7 ] = ' O u F I R [ 2 7 ] 
1] — L l x i (XREAD_STARTED=1) v ( u X I R [ 3 1 ] ^ l ) 
] R E A D ( I R [ 9 ] A ( 〜 工 R [ 8 ] ) ) , 1 fl n o r m a l f i x e d r e a d 
] R E A D ( I R [ 9 ] A 卜 I R [ 8 ] ) A I R [ 7 ] ) , 2 fl ： c o u p l e f i x e d r e a d 
: ]XREAD__STARTED—1 
- ] L I ： —L2xL (XWRITE—STARTED=1) v ( u X I R [ 3 1 ]关 1) 
；;Z?Zl1 R [ 1 0 ] A^1 R [ 8 ] ) ) f l A … … … … N 。 R M A L F I X E D W R I T E 
= T E ( U [ 1 0 ] A ( ~ I R [ 8 ] ) a ] [ R [ 7 ] ) , 2 fl „ „ „ c o u p l e f i x e d w r i t e 
)]XWRITE—STARTED—l 
• ] L2:->L3XL (FREAD—STARTED=1) V (UFIR[27]关 1) 
J ’ ” L 1 1 ' n o r m a l f l o a t i n g r e a d 
1] ^ D ( I R [ 9 ] A I R [ 8 ] A I R [ 7 ] ) / 4 A c o u p l e f l o a t i n g r e a d 
I ] FREAD一STARTED—l 
| 丨 ] L 3 : - > E N D X L (FWRITE一STARTED=1) V (uFIR[27]关 1) 
] 逝 ( 巩 1 0 ] A ! R ⑷ n o r x n a l f l o a t i n g w r i t e 
] W R I T E ( I R [ 1 0 ] a I R [ 8 ] a I R [ 7 ] ) , 4 fl „ c o u p l e f l o a t i n g w r i t e 





•FETCHmemCOMPLETED 
A——RESTORE A L L ？DRrdy —— 
—LIOXL ( N E X T一X D R l r d y = l ) 
X D R l r d y < - N E X T _ X D R l r d y O NEXT一XDRlrdy—1 
哗 L 1 0 : — L l l x i ( N E X T 一 X D R 2 r d y = l ) 
3 XDR2rdy—NEXT—XDR2rdy • NEXT一XDR2rdy—1 
63 L l l : — L 1 2 x i (NEXT一FDRlrdy=l )_ 
1 FDRlrdy—NEXT一FDRlrdy • NEXT一FDRlrdy—1 
I L 1 2 : - » L 1 3 x l (NEXT_FDR2rdy= l ) 
丨 FDR2 rdy—NEXT—FDR2 r d y O NEXT—FDR2 rdy—1 
) ] 一 
: L ] n — - c h e c k w h e t h e r t h e MEMory FETCH i s c o m p l e t e d 
] ] L 1 3 : X D R l r d y 4 - X D R l r d y + ( X D R l r d y < 0 ) • — L l w (XDR l rdy 的 ） 
] ] n < _ / [FETCHmemCOMPLETED] P0RT1 MDR1<' 
t] D^-XREG [ 18+1 ] .-MEM [ 1+XREG [ 3 0+1 ] +XREG [ 17+1 ] ] O XDRlrdy^l 
A PORT 1 
… L I : XDR2rdy«-XDR2rdy+ (XDR2rdy<0) • —L2xi (XDR2rdy关0) 
I ; ] [FETCHmemCOMPLETED] PORT2 MDR2<' 
� ] D^-XREG [2 6+1] ^ MEM[ 1+XREG [30+1] +XREG [25+1]] O XDR2rdyf-1 
A PORT 2 — 
l] L2 : F D R l r d y « - F D R l r d y + ( F D R l r d y < 0 ) O ->L3xl (FDR l rdy^O) 
h [FETCHmemCOMPLETED] PORT3 F D R K ' 
| i ] • ^ F R E G [ 6 + l ] ^ I E E E 6 4 t o V ( ( 3 2 p 2 ) r M E M [ l + X R E G [ 3 0 + l ] + X R E G [ 1 7 + l ] ] ) / ( ( 3 2 p 2 ) r M E M [ l 
+XREG [ 3 0 + 1 ] +XREG [ 17+1 ] + 1 ] ) O FDRlrdy—1 R ——PORT 3 — 
j •] L3 : FDR2rdy<-FDR2rdy+ (FDR2rdy<0) O ->ENDxl (FDR2 rdy^0) 
j .] [FETCHmemCOMPLETED] PORT4 F D R 2 < " 
j ] • ^ FREG [ 1 4 + l ] ^ I E E E 6 4 t o V ( ( 3 2 p 2 ) rMEM [ l +XREG [ 3 0 + l ]+XREG [ 2 5 + l ] ] ) , ((32P2)TMEM[ 
1+XREG [3 0 + 1 ] +XREG [ 2 5+1 ] + 1 ] ) • FDR2rdy<-l A — P O R T 4 —— 
j ] END:—0 
V 
•FETCHmemlNIT 
I A — BY PASS I N V A L I D READ/WRITE INSTRUCTION -
丨 — E N D x i 〜 R W a b l e I R 
A i n i t i a t e NEW FETCH and NEW STORE —— 
fi 
A I R [ l + 6 ] = C o u p l e i n s t r u c t i o n i n d i c a t i o n ( 1 - c o u p l e d ) 
A I R [ l + 8 ] = P u t r e s u l t i n t o MAR ( f i x e d / f l o a t i n g READ) ( 1 - e n a b l e ) 
A I R [ l + 9 ] = P u t r e s u l t i n t o MSAR ( f i x e d / f l o a t i n g WRITE) ( 1 - e n a b l e ) 
A I R [ l + 7 ] = T y p e o f o p e r a t i o n ( 1 - F l o a t i n g ) 
I 1 ] —LI (XREAD一STARTED=1) v ( uX IR [31 ] 尹 1 ) 
].]READ(lR[9]A^1R[8]))fl A n o r m a l f i x e d r e a d 
: ] R E A D ( I R [ 9 ] A ( ^ I R [ 8 ] ) A I R [ 7 ] ) , 2 fi — c o u p l e f i x e d r e a d 
]XREAD一STARTED—1 
j ] L1:->ENDXL (FREAD_STARTED=1) v ( u F I R [ 2 7 ] 关 1) 
] R E A D ( I R [ 9 ] A I R [ 8 ] ) , 3 fi n o r m a l f l o a t i n g r e a d 
U R E A D ( I R [ 9 ] A I R [ 8 ] A I R [ 7 ] ) , 4 PI c o u p l e f l o a t i n g r e a d 
丨j ] FREAD—STARTED—1 
丨 ] E N D : — 0 
V 
VuADDR分FFILTER OP 




—(L1,END) [ 1+2 丄 UF IR [15 ] ] fl 1 : REG—E 




•R—FREADY；xyz；S 
) A TEST I F REG一D REQUIRES FDR1 
x y z — ( 2 P 2 ) 丄 U F I R [ 1 9 20 ] fl — REG—D: 0=Rx, l = R y , 2=Rz 3=ERR —— 
R — ( l = F D R l r d y ) v ( s — ( 1 + 6 ) 尹 （ l + ( 2 丄 （ 2 6 i I R ) [ ( 5 + x y z x 4 ) + i 4 ] ) ) ) 
R—RA ( ( I R [ 3 1 ] « 1 ) v ( l = F D R 2 r d y ) vs ) 
I I I 
A TEST I F REG一D REQUIRES FDR2 
*1 L l : x y z — ( 2 p 2 ) 丄 U F I R [ 1 9 2 0 ] fi — REG_D: 0=Rx, l = R y , 2=Rz 3=ERR —— 
R<-RA( ( l = F D R 2 r d y ) v (s—(1+14)关 ( l+ (2丄（264IR) [ ( 5 + x y z x 4 ) + 1 4 ] ) ) ) ) 
: ] R — R a ( ( I R [ 3 1 ] « 1 ) v ( l = F D R l r d y ) v s) 
fl TEST I F REG—E REQURIES FDR1 
： 丨 ]L2 :xyz—(2p2)丄uFIR[21 2 2 ] fl — REG—E: 0=Rx, l = R y , 2=Rz 3=ERR 
- ] R — R a ( ( l = F D R l r d y ) v ( S — ( l+6)关（ l + ( 2丄（26丄工R ) [ ( 5 + x y z x 4 ) + 1 4 ] ) ) ) ) 
- ] R < - R A ( ( I R [ 3 1 ] ^ 3 1 ) v ( i = F R 2 r d y ) vs ) 
:;] A — TEST I F REG_E REQURIES FDR2 
-i] L3 : x y z — ( 2 p 2 ) 丄 U F I R [ 2 1 22 ] R - - REG—E: 0=Rx, l = R y , 2=Rz 3=ERR —— 
1 ) R—RA((l=FDR2rdy)v(S—(1+14)关（ l+(2丄（26丄工R) [ ( 5 + x y z x 4 ) + 1 4 ] ) ) ) ) 
- ) R — R a ( ( I R [ 3 1 ] ^ 3 1 ) v ( l = F D R l r d y ) v s ) 
6 1 
；丨 END： fi [FREADY] D A T A r d y ： 書 • R 
V 
•OUT—FTEST—DBZ I N 
J OUT—IN 
-- —O X L I N乒 0 
1 EPUT 'F-ARITHMETIC ERROR： DIVIDED BY ZERO' o OUT-1 
V 
V F t o X 
I ">ENDXL ( 0 = u X I R [ 2 9 ] ) 
I 丨 XREG[1 ]—BtoV 32TV to IEEE64 F R E G f l ] 
I XREG[2 ]—BtoV 3 2 i V t o I E E E 6 4 F R E G f l ] 
I END:—0 
V 
∇GAUSSIAN
 ⍝ Initialization, register files
 XREG←32⍴0 ⋄ FREG←16⍴0 ⋄ MEM←128⍴0
 ⍝ The Program
 MEM[⍳8]←GAUSSIAN_PROGRAM
 ⍝ ---- The Original 3x4 Matrix ----
 PUTFtoMEM 51,¯1 ⍝ ¯1 1 2 2
 PUTFtoMEM 53,1 ⍝ 3 ¯1 1 6
 PUTFtoMEM 55,2 ⍝ ¯1 3 4 4
 PUTFtoMEM 57,2
 PUTFtoMEM 59,3
 PUTFtoMEM 61,¯1
 PUTFtoMEM 63,1
 PUTFtoMEM 65,6
 PUTFtoMEM 67,¯1
 PUTFtoMEM 69,3
 PUTFtoMEM 71,4
 PUTFtoMEM 73,4
 ⍝ set the registers to appropriate values
 XREG[1+16]←BtoV(0 0 1 1 1 1,26⍴0) ⍝ -- SET THE PSW TO APPROPRIATE MODE --
 XREG[1+5]←3 ⍝ (N)
 XREG[1+6]←1 ⍝ (K)
 XREG[1+7]←1+1 ⍝ (J)
 XREG[1+0]←60 ⍝ (A[J,P])
 XREG[1+8]←52 ⍝ (A[K,P])
 XREG[1+4]←50+(2×8) ⍝ TERMINATION CONDITION
 FREG[1+0]←-(3÷¯1) ⍝ (-M[J,K]←A[J,K]÷A[K,K])
 ⍝ execute the program
 STATUS ⋄ DISPLAY 51 3 4 ⋄ ⎕INKEY
 LIWC 1
 DISPLAY 51 3 4
∇
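The value loaded into FREG[1+0] above is the negated multiplier -M[J,K] = -(A[J,K]÷A[K,K]) for the first elimination step. The following Python sketch carries out that single step on the 3x4 matrix given in the listing comments. It is only an illustration under stated assumptions: the function name `eliminate_row` and the use of exact fractions are not taken from the simulator, and indices are 0-based whereas the listing uses K=1, J=2.

```python
from fractions import Fraction

# The 3x4 augmented matrix from the listing comments.
A = [[-1, 1, 2, 2],
     [3, -1, 1, 6],
     [-1, 3, 4, 4]]


def eliminate_row(a, k, j):
    """Zero a[j][k] using pivot row k; return the negated multiplier."""
    neg_m = -Fraction(a[j][k], a[k][k])      # the listing preloads -(3 / -1)
    a[j] = [a[j][p] + neg_m * a[k][p] for p in range(len(a[j]))]
    return neg_m


if __name__ == "__main__":
    rows = [row[:] for row in A]
    print(eliminate_row(rows, 0, 1), rows[1])
```

Storing the negated multiplier lets the inner loop use a multiply followed by an add, which matches the form in which the listing initializes the register.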
∇A←GETC
 ⍝ '[GETC] READ FROM C: '
 T←VtoB32 XREG[1+16]
 A←T[1+30]
∇
∇V←GETFfromMEM I
 V←IEEE64toV(VtoB32 MEM[I]),(VtoB32 MEM[I+1])
∇
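GETFfromMEM rebuilds a 64-bit floating point value from two consecutive 32-bit memory words, and the AUTORUN functions store values the same way (first 32 bits, then the remaining 32 bits). A rough Python equivalent is sketched below using the standard `struct` module. The names `put_f_to_mem` and `get_f_from_mem` are hypothetical stand-ins, and a plain 0-based list is assumed in place of the 1-origin APL vector MEM.

```python
import struct


def put_f_to_mem(mem, addr, value):
    """Store a double as two 32-bit words: high word first, then low word."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    mem[addr] = raw >> 32              # corresponds to 32 take of the bits
    mem[addr + 1] = raw & 0xFFFFFFFF   # corresponds to 32 drop of the bits


def get_f_from_mem(mem, addr):
    """Join two consecutive 32-bit words and decode them as a double."""
    raw = (mem[addr] << 32) | mem[addr + 1]
    (value,) = struct.unpack(">d", struct.pack(">Q", raw))
    return value


if __name__ == "__main__":
    mem = [0] * 128
    put_f_to_mem(mem, 51, -1.0)
    print(hex(mem[51]), hex(mem[52]), get_f_from_mem(mem, 51))
```

Because each value occupies two words, consecutive matrix elements in the listings are placed at addresses that differ by 2.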
∇RES←IEEE64toV BITVEC;SIGN;EXP;MAN
 RES←1 ⋄ →L1×⍳(0≠+/BITVEC) ⋄ RES←0 ⋄ →END
L1:SIGN←1↑BITVEC ⋄ BITVEC←1↓BITVEC
 EXP←11↑BITVEC ⋄ MAN←11↓BITVEC
 EXP←2*((2⊥EXP)-1023)
L2:MAN←0.5×MAN ⋄ RES←RES+MAN[1] ⋄ MAN←1↓MAN
 →L2×⍳(0<⍴MAN)
 RES←RES×EXP
 →END×⍳(SIGN=0) ⋄ RES←0-RES
END:
∇
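The decoding performed by IEEE64toV (sign bit, 11-bit exponent biased by 1023, and a 52-bit fraction added to an implicit leading 1) can be restated as the Python sketch below. The names `ieee64_to_value` and `value_to_bits` are illustrative. Like the listing, the sketch treats an all-zero bit vector as 0 and does not handle denormals, infinities or NaNs.

```python
import struct


def ieee64_to_value(bits):
    """Decode 64 bits (most significant first) of a normalized double."""
    if sum(bits) == 0:
        return 0.0
    sign, exp_bits, man_bits = bits[0], bits[1:12], bits[12:]
    exponent = int("".join(str(b) for b in exp_bits), 2) - 1023
    result = 1.0                  # implicit leading 1 of the significand
    weight = 0.5
    for bit in man_bits:          # fraction bits weigh 1/2, 1/4, 1/8, ...
        result += bit * weight
        weight *= 0.5
    result *= 2.0 ** exponent
    return -result if sign else result


def value_to_bits(x):
    """Helper for checking: the 64 bits of a Python float, MSB first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(raw >> (63 - i)) & 1 for i in range(64)]


if __name__ == "__main__":
    print(ieee64_to_value(value_to_bits(-0.15625)))
```

The loop mirrors the listing's repeated halving of the mantissa vector; a library conversion would normally be used instead, and `value_to_bits` relies on one only to produce test input.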
I 
• I R C ; 工 P L 
] I P L — 4 X 2 fi I N S T R U C T I O N P I P E L I N E LENGTH 
• 口 [ I R C ]工E N D . : , O Q - I E N D O , $ I N I T ： , • I N I T • , $ S B R : , • s 
一 B R • $ VC: ‘ O 口—VC 一 
R E F R E S H — I N I T V ( V C A ~ S _ B R ) 
— L I X I 卜 R E F R E S H ) 
V O - I N I T — O 
A ' [ I R C ] REFRESH： ， O REFRESH 
I 工 ( V t o B 3 2 MEM[XREG[1+22 ] + X R E G [ 1 + 3 1 ] ] ) , ( V t o B 3 2 MEM[XREG[1+22] +XREG[1+31] 
+1]) 
！I I R 2 — ( V t o B 3 2 MEM[XREG [ 1+22 ] +XREG[ 1 + 3 1 ] +2 ] ) , ( V t o B 3 2 MEM[XREG[ 1+22 ]+XREG[1+ 
3 1 ] + 3 ] ) “ 
I R 3 — ( V t o B 3 2 MEM[XREG[1+22 ] +XREG[ 1 + 3 1 ] + 4 ] ) , ( V t o B 3 2 MEM[XREG[1+22] +XREG[ 1+ 
3 1 ] + 5 ] ) — 
干 I R 4 — ( V t o B 3 2 MEM[XREG[1+22 ] +XREG[ 1 + 3 1 ] + 6 ] ) , ( V t o B 3 2 MEM[XREG[1+22] +XREG[ 1 4 -
3 1 ] + 7 ] ) 
EPUT " I R C ] REFRESH THE INSTRUCTION P I P E L I N E • —END 
L I : A S—BR—IR[44] fi SET THE SELF+BRANCH B I T 
I ] L—ENC—IENDA〜（S—BRAVC) 
I I —ENDxi卜L—ENC) 
II： I I R I R 2 
li3| IR2—IR3 
II i IR3—IR4 
I R 4 — ( V t o B 3 2 M E M [ X R E G [ l + 2 2 ] + X R E G [ l + 3 1 ] + I P L ] ) , ( V t o B 3 2 MEM[XREG[1+22] +XREG[ 
1 + 3 1 ] + I P L + 1 ] ) O XREG[1+31 ]—XREG[1+31 ]+2 
K EPUT ' [ I R C ] FETCH A NEW ISNTRUCTION I N … • , 
END:VC<-S_BR—0 • ->0 
V 
∇A←ISOF B;MAXISP
 MAXISP←64
 →L1×⍳(B<0)
 →L2×⍳(B>MAXISP)
 →END
L1:EPUT '[ISOF] B:INTERNAL STACK UNDERFLOW' ⋄ ISP←1 ⋄ →END
L2:EPUT '[ISOF] B:INTERNAL STACK OVERFLOW' ⋄ ISP←MAXISP
END:A←ISP
∇
• I S U ; o l d N O B ； o l d N O X ; o l d N O F 





•丨 —L3x i (XNEW#1) 
* 
XREAD—STARTED—FREAD 一 STARTED—XWRITE 一 STARTED—FWRITE一 STARTED—0 
L3 ： —ENDxL ~ ( INITVBEND) 
N O F — ( F N E W A N O F ) 
N O X 一 (XNEWA N O X ) 
l l] END:^0 
V 
V L I S T G;V;L3:W1;C 
IE 0 0 
2 • — ' [ L I S T ] , 
PC L1:V—2 个G • G—2iG 
继 L I W l — ( V t o B 3 2 V [ l ] ) , V t o B 3 2 V [ 2 ] 
\!\<r-r I ' O C O W 1 O C + l • - > ‘ • 26TLIW1 
纪 F - > r O 1 7 T 2 6 i L I W l 
九 B - > ' • D<-43 iL IWl 
C—C+2 
9e -^LIXL (pG>0) 
V 
！ 
• L IWC p c ; t i m e ; M O D E 
EXIT—0 • u X P O u F P O u B P C — l •工R—64p0 • MODE—1 
I XDRlrdy-XDR2rdy-FDRlrdy-FDR2rdy-1 ft SET ALL MDR AVAILABLE 一 
NEXT—XDRlrdy—NEXT 一 XDR2rdy—NEXT—FDRlrdy—NEXT 一 FDR2rdy—1 
MSARlnext<-MSAR2next<-0 
1 XSDRlrdy—XSDR2rdy—FSDRlrdy—FSDR2rdy—0 fl SET ALL MSDR NOT • AVAILABLE -
XDATArdy—FDATArdy—BDATArdy—1 A SET ALL DATArdy FLAG ~ 
XALUrdy—FALUrdy—BALUrdy—1 R - - - - SET ALL ALUrdy FLAG — 
XREG[ l+3 0 ] — p c - 1 O XREG[ l+2 2 ] — p c - l fl ——SET DBASE AND CBASE TO 0 
XREG[1+31]—pc O VC—O 
: ] 工 S P — 0 O ISTACK—64p0 fi I N I T INTERNAL STACK 
二 ] fi INIT ALL SYNCHRONIZATION VARIABLES 
二 ] NOX—NOF—NOB—INIT—1 • BEND—XEND—FEND—0 O cyc le—coun t—1 
I] LOOP： 
I ] ' [ L IWC] c y c l e : ' O Q—cycle O ' PC =' O D<-XREG[ 1+31] O ' 
一 — 一 一 一 一 一 “ 一 一 I 
I] MICROX R FIXED ENGINE ：一 
I j fl 'PC [ a f t e r X] = , • XREG[32] 
MICROF fl FLAOTING ENGINE 
fl 'PC [ a f t e r F ] • XREG[32] 
江 I MICROB fl BRANCHING ENGINE —— 
巧 A 'PC [ a f t e r B] • XREG[32] 
^ I S U fl ——INSTRUCTION SCHEDULE UNIT 
巧 fi 'PC [ a f t e r I S U ] = ' o D<-XREG[32] 
l i IRC fl ——INSTRUCTION REFRESH CONTROLLER -
>」 R 'PC [ a f t e r IRC ] XREG[32] 
) ] E X I T — U X I R [ 3 0 ] 。 
L J A SET THE HALT FLAG • 一 一 一 一 … 5] cycle—cycle+1 
:7] -^ LIXL (delayco) 
5] time—[]DL de lay o -»L2 
) ] L i : EPUT , [ L 工 W C ] P r e s s any k e y t o CONTINUE . . . . , • 0 0 p D I N K E Y 
： ) ]L2:—LOOPXL ( E X I T = 0 ) 
L] END ： EPUT ' [ L I W C ] 一 „— E N D — 
£i i> lu 一 ' O 
VMAT_MUL 
A I n i t i a l i z a t i o n , r e g i s t e r f i l e s 
XREG—32p0 • FREG—16p0 • MEM^-1024p0 
.. . I . • • 
A The P r o g r a m 
MEM[ VpMAT_MUL_PROGRAM] —MAT一MUL一PROGRAM 
A The O r i g i n a l TWO 10x10 M a t r i x 
PUTFtoMEM 1 0 1 , 1 fl 1 2 3 4 5 6 7 8 9 10 
PUTFtoMEM 1 0 3 , 2 fl 1 1 12 13 14 15 16 17 18 19 20 
PUTFtoMEM 1 0 5 , 3 fi 
：丨 ]PUTFtoMEM 1 0 7 , 4 
i S t e p a c e : 2 MATRIX A p r 11 , 1994 6 : 4 4 PM Page 15 
》丨 iects: MAT一MUL ( C o n t ' d ) 
； ] P U T F t o M E M 1 0 9 , 5 
: ] P U T F t o M E M 1 1 1 , 6 
: : ] P U T F t o M E M 1 1 3 , 7 
: : ] P U T F t o M E M 115 , 8 
PUTFtoMEM 1 1 7 , 9 
: ] P U T F t o M E M 1 1 9 , 1 0 
- ] P U T F t o M E M 1 2 1 , 1 1 
: ] P U T F t o M E M 1 2 3 , 1 2 
: : ] P U T F t o M E M 1 2 5 , 1 3 
A] PUTFtoMEM 127 ,14 
PUTFtoMEM 1 2 9 , 1 5 
⑷ PUTFtoMEM 1 3 1 , 1 6 
PUTFtoMEM 1 3 3 , 1 7 
PUTFtoMEM 13 5 , 1 8 
PUTFtoMEM 1 3 7 , 1 9 
PUTFtoMEM 13 9 , 2 0 
PUTFtoMEM 1 4 1 , 2 1 
PUTFtoMEM 14 3 , 2 2 
h ] PUTFtoMEM 1 4 5 , 2 3 圓 
I〕] PUTFtoMEM 1 4 7 , 2 4 
| L ] PUTFtoMEM 1 4 9 , 2 5 
I 2] PUTFtoMEM 1 5 1 , 2 6 
j 3] PUTFtoMEM 1 5 3 , 2 7 
“ ] P U T F t o M E M 1 5 5 , 2 8 
I 5] PUTFtoMEM 1 5 7 , 2 9 
! 5] PUTFtoMEM 159,3 0 
m PUTFtoMEM 161,31 
“]PUTFtoMEM 163,32 
I PUTFtoMEM 1 6 5 , 3 3 
I ) ] PUTFtoMEM 1 6 7 , 3 4 
I L] PUTFtoMEM 1 6 9 , 3 5 
PUTFtoMEM 171,3 6 
||l] PUTFtoMEM 173,37 
⑴ PUTFtoMEM 1 7 5 , 3 8 
0l>] PUTFtoMEM 1 7 7 , 3 9 
Ji] PUTFtoMEM 179,4 0 
[ ) ' ] P U T F t o M E M 1 8 1 , 4 1 
o)丨 ]PUTFtoMEM 1 8 3 , 4 2 
{j1] PUTFtoMEM 1 8 5 , 4 3 
[1 丨] PUTFtoMEM 187,44 
{].] PUTFtoMEM 1 8 9 , 4 5 
： [ ] P U T F t o M E M 1 9 1 , 4 6 
t | ] PUTFtoMEM 1 9 3 , 4 7 
[ ] P U T F t o M E M 1 9 5 , 4 8 
[ ] P U T F t o M E M 1 9 7 , 4 9 
[ j ] PUTFtoMEM 1 9 9 , 5 0 
[ f ] PUTFtoMEM 2 0 1 , 5 1 
tl；] PUTFtoMEM 2 0 3 , 5 2 
[ | ] PUTFtoMEM 2 0 5 , 5 3 
孙 k s p a c e : 2 MATRIX A p r 11 , 1994 6 : 4 4 PM Page 16 
H e c t s : M A T — M U L ( C o n t ' d ) ( C o n t ' d ) 
[ , ] P U T F t o M E M 2 0 7 , 5 4 
[• :] PUTFtoMEM 2 0 9 , 5 5 
[•1] PUTFtoMEM 2 1 1 , 5 6 
[•:] PUTFtoMEM 2 1 3 , 5 7 
[«] PUTFtoMEM 2 1 5 , 5 8 
[ e ] PUTFtoMEM 2 1 7 , 5 9 
[e ! ] PUTFtoMEM 2 1 9 , 6 0 
h ) PUTFtoMEM 2 2 1 , 6 1 
U ] PUTFtoMEM 2 2 3 , 6 2 
) ] P U T F t o M E M 22 5 , 6 3 
：)] PUTFtoMEM 2 2 7 , 64 
)L] PUTFtoMEM 22 9 , 65 
H ] PUTFtoMEM 2 3 1 , 6 6 
I 
( ) ] P U T F t o M E M 2 3 3 , 67 
11] PUTFtoMEM 2 3 5 , 6 8 
|> ] PUTFtoMEM 2 3 7 , 69 
I ) ] PUTFtoMEM 2 3 9 , 7 0 
r ] PUTFtoMEM 2 4 1 , 7 1 
h ] PUTFtoMEM 2 4 3 , 7 2 
• )] PUTFtoMEM 2 4 5 , 7 3 
I ) ] PUTFtoMEM 2 4 7 , 7 4 
L] PUTFtoMEM 2 4 9 , 7 5 
J ! ] PUTFtoMEM 2 5 1 , 7 6 
I i ] PUTFtoMEM 2 5 3 , 7 7 
i \ ] PUTFtoMEM 2 5 5 , 7 8 
I ' 
|> ] PUTFtoMEM 2 5 7 , 7 9 
[h] PUTFtoMEM 2 5 9 , 8 0 
「 ] PUTFtoMEM 2 6 1 , 8 1 
I ； ] PUTFtoMEM 2 6 3 , 8 2 
1 >] PUTFtoMEM 2 6 5 , 8 3 
“ ] P U T F t o M E M 2 6 7 , 8 4 
I . ] PUTFtoMEM 2 6 9 , 8 5 
I ! ] PUTFtoMEM 2 7 1 , 8 6 
I t ] PUTFtoMEM 2 7 3 , 8 7 
1 ：] PUTFtoMEM 2 7 5 , 8 8 
B;] PUTFtoMEM 2 7 7 , 8 9 
I ； ] PUTFtoMEM 2 7 9 , 9 0 
0 r ] PUTFtoMEM 2 8 1 , 9 1 
I ！ ] PUTFtoMEM 2 8 3 , 9 2 
B 丨 ] P U T F t o M E M 2 8 5 , 9 3 
S )0] PUTFtoMEM 287,94 
0 H ] PUTFtoMEM 2 8 9 , 9 5 
0 )2] PUTFtoMEM 2 9 1 , 9 6 
1 )3] PUTFtoMEM 2 9 3 , 9 7 
Eli4] PUTFtoMEM 2 9 5 , 9 8 
:D )5] PUTFtoMEM 2 9 7 , 9 9 
丨 “ 6 ] PUTFtoMEM 2 9 9 , 1 0 0 
；[17] PUTFtoMEM 3 0 1 , 1 fl 1 2 3 4 5 6 7 8 9 0 
| ! 8 ] PUTFtoMEM 3 0 3 , 2 fll234567890 
i r k S P a C e : 2 ^ ^ A p r 1 1 , 1994 6 : 4 4 PM P a g e 17 
|jects: MAT__MUL(Cont'd) (Cont'd) (Cont'd) 
    PUTFtoMEM 305,3 ⍝
    PUTFtoMEM 307,4
    PUTFtoMEM 309,5
    PUTFtoMEM 311,6
    PUTFtoMEM 313,7
    PUTFtoMEM 315,8
    PUTFtoMEM 317,9
    PUTFtoMEM 319,0
    PUTFtoMEM 321,1
    PUTFtoMEM 323,2
    PUTFtoMEM 325,3
    PUTFtoMEM 327,4
    PUTFtoMEM 329,5
    PUTFtoMEM 331,6
    PUTFtoMEM 333,7
    PUTFtoMEM 335,8
    PUTFtoMEM 337,9
    PUTFtoMEM 339,0
    PUTFtoMEM 341,1 ⍝ 0 1 1
    PUTFtoMEM 343,2 ⍝ 0 1 1
    PUTFtoMEM 345,3 ⍝ 0 0 1
    PUTFtoMEM 347,4
    PUTFtoMEM 349,5
    PUTFtoMEM 351,6
    PUTFtoMEM 353,7
    PUTFtoMEM 355,8
    PUTFtoMEM 357,9
    PUTFtoMEM 359,0
    PUTFtoMEM 361,1 ⍝ 0 1 1
    PUTFtoMEM 363,2 ⍝ 0 1 1
    PUTFtoMEM 365,3 ⍝ 0 0 1
    PUTFtoMEM 367,4
    PUTFtoMEM 369,5
    PUTFtoMEM 371,6
    PUTFtoMEM 373,7
    PUTFtoMEM 375,8
    PUTFtoMEM 377,9
    PUTFtoMEM 379,0
    PUTFtoMEM 381,1 ⍝ 0 1 1
    PUTFtoMEM 383,2 ⍝ 0 1 1
    PUTFtoMEM 385,3 ⍝ 0 0 1
    PUTFtoMEM 387,4
    PUTFtoMEM 389,5
    PUTFtoMEM 391,6
    PUTFtoMEM 393,7
    PUTFtoMEM 395,8
    PUTFtoMEM 397,9
    PUTFtoMEM 399,0
    PUTFtoMEM 401,1 ⍝ 0 1 1
    PUTFtoMEM 403,2 ⍝ 0 1 1
    PUTFtoMEM 405,3 ⍝ 0 0 1
    PUTFtoMEM 407,4
    PUTFtoMEM 409,5
    PUTFtoMEM 411,6
    PUTFtoMEM 413,7
    PUTFtoMEM 415,8
    PUTFtoMEM 417,9
    PUTFtoMEM 419,0
    PUTFtoMEM 421,1 ⍝ 0 1 1
    PUTFtoMEM 423,2 ⍝ 0 1 1
    PUTFtoMEM 425,3 ⍝ 0 0 1
    PUTFtoMEM 427,4
    PUTFtoMEM 429,5
    PUTFtoMEM 431,6
    PUTFtoMEM 433,7
    PUTFtoMEM 435,8
    PUTFtoMEM 437,9
    PUTFtoMEM 439,0
    PUTFtoMEM 441,1 ⍝ 0 1 1
    PUTFtoMEM 443,2 ⍝ 0 1 1
    PUTFtoMEM 445,3 ⍝ 0 0 1
    PUTFtoMEM 447,4
    PUTFtoMEM 449,5
    PUTFtoMEM 451,6
    PUTFtoMEM 453,7
    PUTFtoMEM 455,8
    PUTFtoMEM 457,9
    PUTFtoMEM 459,0
    PUTFtoMEM 461,1 ⍝ 0 1 1
    PUTFtoMEM 463,2 ⍝ 0 1 1
    PUTFtoMEM 465,3 ⍝ 0 0 1
    PUTFtoMEM 467,4
    PUTFtoMEM 469,5
    PUTFtoMEM 471,6
    PUTFtoMEM 473,7
    PUTFtoMEM 475,8
    PUTFtoMEM 477,9
    PUTFtoMEM 479,0
    PUTFtoMEM 481,1 ⍝ 0 1 1
    PUTFtoMEM 483,2 ⍝ 0 1 1
    PUTFtoMEM 485,3 ⍝ 0 0 1
    PUTFtoMEM 487,4
    PUTFtoMEM 489,5
    PUTFtoMEM 491,6
    PUTFtoMEM 493,7
    PUTFtoMEM 495,8
    PUTFtoMEM 497,9
    PUTFtoMEM 499,0

    ⍝ set the registers to appropriate values
    XREG[1+16]←BtoV(0 0 0 0 0 0,26⍴0) ⍝ -- SET THE PSW TO APPROPRIATE MODE --

    XREG[1+0]←100 ⍝ (STARTaddr of A)
    XREG[1+8]←300 ⍝ (STARTaddr of B)
    XREG[1+1]←2 ⍝ (next A, size of an IEEE64 real)
    XREG[1+9]←2×10 ⍝ (next B, size of an IEEE64 real × DIM(A))
    XREG[1+2]←10 ⍝ (N)

    XREG[1+3]←1 ⍝ (row number, I=1)
    XREG[1+4]←1 ⍝ (column number, J=1)
    XREG[1+5]←0 ⍝ (TERMINATION CONDITION)

    XREG[1+6]←498 ⍝ (STARTaddr of C)

    XREG[1+7]←100 ⍝ (backup of the STARTaddr of A)
    XREG[1+15]←298 ⍝ (backup of the STARTaddr of B)

    FREG[1+0]←0 ⍝ (TEMP SUM OF EACH ELEMENT)

    ⍝ execute the program
    STATUS ⋄ DISPLAY 101 10 10 ⋄ DISPLAY 301 10 10 ⋄ ⎕INKEY
    LIWC 1
    DISPLAY 501 10 10





Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 20
Objects: MICROB
∇MICROB
    ⍝ (A) microengine logic
    uBMPX1 ⋄ uBMPX2 ⋄ uBIRrdy
    ⍝ (B) Update microengine registers
    uBPC←uBMPX2out
    uBIR←uBIRin
    ⍝ (C) Execution of DATA




    BBrAddr←BtoV uBIR[4+⍳10]
    ⍝ BDATArdy←BREADY
    BEND←uBIR[27]∧(NOB=0)∧BDATArdy∧BALUrdy
    →L10×⍳(BEND≠1)
    NOB←1
    ⍝ (D) UPDATE SYNCHRONIZATION SIGNAL
    L10:→E×⍳~BNEW
    ⎕←'[MICROB] B' ⋄ count ⋄ 'start at cycle ' ⋄ cycle ⋄ count←count+1
    BNEW←0 ⋄ XEND←FEND←0 ⋄ NOX←NOF←1
    E:→0
    ⍝ END OF MICROB
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 21
Objects: MICROF
∇MICROF
    ⍝ (A) microengine logic
    uFMPX1 ⋄ uFMPX2 ⋄ uFIRrdy
    ⍝ (B) Update microengine registers
    uFPC←uFMPX2out
    uFIR←uFIRin
    ⍝ (C) Execution of DATA
    FETCH_STOREmemINIT
    REG_D ⋄ REG_E ⋄ FMPX1
    FDATArdy←FREADY
    FALU
    XtoF
    UPDATEfreg
    ⍝ '[MICROF] FDR1=' ⋄ FREG[1+6]
    STOREmemCOMPLETED
    FBrAddr←BtoV uFIR[4+⍳10]
    →L11×⍳(FEND∧~LAST_XEND) ⍝ WAIT FOR THE XEND SIGNAL
    ⍝ THE LAST_XEND SIGNAL IS USED TO SIMULATE THE X UNIT AND F UNIT EXECUTE
    ⍝ IN PARALLEL.
    FEND←uFIR[27]∧(NOF=0)∧(FDATArdy=1)∧(FALUrdy=1)
    L11:→L10×⍳(FEND≠1)
    NOF←1
    ⍝ (D) UPDATE SYNCHRONIZATION SIGNALS
    L10:→E×⍳~FNEW
    ⎕←'[MICROF] F' ⋄ count ⋄ 'start at cycle ' ⋄ cycle
    FNEW←0 ⋄ BEND←0 ⋄ NOB←1
    E:→0
    ⍝ END OF MICROF
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 22
Objects: MICROX MICROX_MICROF
∇MICROX
    ⍝ (A) microengine logic
    uXMPX1 ⋄ uXMPX2 ⋄ uXIRrdy
    ⍝ (B) Update microengine registers
    uXPC←uXMPX2out
    uXIR←uXIRin
    ⍝ --- (C) Execution of DATA
    FETCHmemCOMPLETED ⍝ --- Update memory reference ---
    REG_A ⋄ REG_B ⋄ XMPX1
    XDATArdy←XREADY
    XALU
    FtoX
    UPDATExreg
    ⍝ →SKIP×⍳(uXIR[31]≠1)
    ⍝ FETCH_STOREmemINIT
    ⍝ SKIP:
    XBrAddr←BtoV uXIR[4+⍳10]
    LAST_XEND←XEND
    →L11×⍳(XEND∧~FEND) ⍝ -- WAIT FOR THE FEND SIGNAL
    XEND←uXIR[31]∧(NOX=0)∧XDATArdy∧XALUrdy
    L11:→L10×⍳(XEND≠1)
    NOX←1
    ⍝ (D) UPDATE SYNCHRONIZATION SIGNAL
    L10:→E×⍳~XNEW
    ⎕←'[MICROX] X' ⋄ count ⋄ 'start at cycle ' ⋄ cycle




    ⍝ (A) microengine logic
    uXMPX1 ⋄ uXMPX2 ⋄ uXIRrdy ⍝ (uX)
    uFMPX1 ⋄ uFMPX2 ⋄ uFIRrdy ⍝ (uF)
    ⍝ (B) Update microengine registers
    uXPC←uXMPX2out ⍝ (uX)
    uXIR←uXIRin
    uFPC←uFMPX2out ⍝ (uF)
    uFIR←uFIRin
    ⍝ (C) Execution of DATA
    →SKIP×⍳(uXIR[31]≠1)
    STOREmemINIT
    SKIP:FETCHmemCOMPLETED ⍝ -- Update memory reference --
    REG_A ⋄ REG_B ⋄ XMPX1




    FALU
    FtoX ⋄ XtoF
    UPDATExreg
    UPDATEfreg
    →SKIP1×⍳(uXIR[31]≠1)
    FETCHmemINIT
    SKIP1:STOREmemCOMPLETED
    XBrAddr←BtoV uXIR[4+⍳10]
    LAST_XEND←XEND
    →L11×⍳(XEND∧~FEND) ⍝ -- WAIT FOR THE FEND SIGNAL
    XEND←uXIR[31]∧(NOX=0)∧XDATArdy∧XALUrdy
    L11:→L10×⍳(XEND≠1)
    NOX←1
    ⍝ (D) UPDATE SYNCHRONIZATION SIGNAL
    L10:→L19×⍳~XNEW
    count ⋄ 'start at cycle ' ⋄ ⎕←cycle
    XNEW←0 ⋄ BEND←0 ⋄ NOB←1
    L19:
    FBrAddr←BtoV uFIR[4+⍳10]
    →L21×⍳(FEND∧~LAST_XEND) ⍝ -- WAIT FOR THE XEND SIGNAL
    ⍝ THE LAST_XEND SIGNAL IS USED TO SIMULATE THE X UNIT AND F UNIT EXECUTE
    ⍝ IN PARALLEL.
    FEND←uFIR[27]∧(NOF=0)∧(FDATArdy=1)∧(FALUrdy=1)
    L21:→L20×⍳(FEND≠1)
    NOF←1
    ⍝ (D) UPDATE SYNCHRONIZATION SIGNAL
    L20:→E×⍳~FNEW
    count ⋄ 'start at cycle ' ⋄ cycle
    FNEW←0 ⋄ BEND←0 ⋄ NOB←1
    E:→0
    ⍝ END OF MICROX_MICROF
∇
∇PUTFtoMEM V;I;D
    I←V[1] ⋄ D←V[2]
    MEM[I]←BtoV 32↑VtoIEEE64 D ⋄ MEM[I+1]←BtoV 32↓VtoIEEE64 D
∇
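The storage scheme of PUTFtoMEM (one IEEE 64-bit real split over two consecutive 32-bit memory words, high half first) can be sketched in Python as follows. This is only an illustration and not part of the workspace: the names `put_f_to_mem` and `get_f_from_mem` are hypothetical, memory is modelled as a dictionary, and the words are assumed to be stored as unsigned integers.

```python
import struct

def put_f_to_mem(mem, addr, value):
    """Store `value` as an IEEE 754 double in two 32-bit words.

    The high 32 bits go to mem[addr], the low 32 bits to mem[addr + 1],
    mirroring the 32-take / 32-drop split in the listing (assumed unsigned).
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    mem[addr] = bits >> 32
    mem[addr + 1] = bits & 0xFFFFFFFF

def get_f_from_mem(mem, addr):
    """Rebuild the double from the two words written by put_f_to_mem."""
    bits = (mem[addr] << 32) | mem[addr + 1]
    (value,) = struct.unpack(">d", struct.pack(">Q", bits))
    return value
```

Each matrix element therefore occupies two addresses, which is why the MAT_MUL listing advances the address by 2 per element.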
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 24
Objects: READ REG_A
∇READ C
    ⍝ --- C[1] is DIS/ENC, and C[2] is PORT number
    →END×⍳(C[1]=0) ⍝ --- PORT DISABLED
    →L10×⍳(C[2]<1)∨(C[2]>4)
    ⍝ Emulate the memory fetch delay, 1-2 cycles randomly
    '[READ] '
    →(L6,L7,L8,L9)[C[2]]
    L6:⎕←NEXT←XDR1rdy←¯1×?2 ⋄ 'READ FROM PORT1' ⋄ →END ⍝ --- PORT1 ---
    L7:⎕←NEXT←XDR2rdy←¯1×?2 ⋄ 'READ FROM PORT2' ⋄ →END ⍝ --- PORT2 ---
    L8:⎕←NEXT←FDR1rdy←¯1×?2 ⋄ 'READ FROM PORT3' ⋄ →END ⍝ --- PORT3 ---
    L9:⎕←NEXT←FDR2rdy←¯1×?2 ⋄ 'READ FROM PORT4' ⋄ →END ⍝ --- PORT4 ---
    L10:EPUT '[READ] INTERNAL ERROR: MEMORY PORT NUMBER OUT OF RANGE'
    END:→0
∇
∇REG_A;xyz;vector
    xyz←(2⍴2)⊥uXIR[22 23] ⍝ REG_A: 0=Rx, 1=Ry, 2=Rz 3=SP --
    →ERR×⍳(xyz<0)∨(xyz>3)
    →L1×⍳(xyz≠3)
    REG_Aout←XREG[1+29] ⋄ →END
    L1:→L2×⍳(IR[1]=1)
    ⍝ RR format
    REG_Aout←XREG[1+(2⊥vector←IR[(10+xyz×5)+⍳5])]
    ⍝ set the couple register
    REG_Aout_←XREG[1+(2⊥(1↑vector),(~1↑1↓vector),(2↓vector))] ⋄ →END
    ⍝ RX format: Rz is not assumed to be selected here
    L2:REG_Aout←XREG[1+(2⊥vector←IR[(10+xyz×4)+⍳4])]
    ⍝ set the couple register
    REG_Aout_←XREG[1+(2⊥(~1↑vector),(1↓vector))] ⋄ →END
    ERR:EPUT '[REG_A] INTERNAL ERROR: REGISTER SET OUT OF RANGE'
    END:→0
∇
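As read from the listing, the "couple register" index is obtained by complementing one bit of the register field: the second bit of a 5-bit RR field or the first bit of a 4-bit RX field. In both cases that is the bit of weight 8, so the partner of register n is n±8. A minimal Python sketch of this reading follows; `couple_index` is a hypothetical name, not a function of the workspace.

```python
def couple_index(reg, width):
    """Return the index of the coupled register.

    `reg` is the register number and `width` the field width in bits
    (5 for RR format, 4 for RX format, as in the listing). The bit of
    weight 8 is complemented; all other bits are kept.
    """
    bits = [(reg >> (width - 1 - i)) & 1 for i in range(width)]
    flip = width - 4          # position of the weight-8 bit, counted from the left
    bits[flip] ^= 1
    return int("".join(str(b) for b in bits), 2)
```

Under this reading, registers 18 and 26, 20 and 28, and (in the 4-bit floating-point field) 7 and 15 form pairs, which matches the register numbers that appear together elsewhere in the listing.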
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 25
Objects: REG_B REG_D REG_E REG_G
∇REG_B;xyz;vector
    xyz←(2⍴2)⊥uXIR[24 25] ⍝ -- REG_A: 0=Rx, 1=Ry, 2=Rz 3=SP
    →ERR×⍳(xyz<0)∨(xyz>3)
    →L1×⍳(xyz≠3)
    REG_Bout←XREG[1+29] ⋄ →END ⍝ SP ---
    L1:→L2×⍳(IR[1]=1)
    ⍝ RR format
    REG_Bout←XREG[1+(2⊥vector←IR[(10+xyz×5)+⍳5])]
    ⍝ set the couple register ---
    REG_Bout_←XREG[1+(2⊥(1↑vector),(~1↑1↓vector),(2↓vector))] ⋄ →END
    ⍝ RX format
    L2:→(L3,L3,L4)[xyz+1]
    L3:REG_Bout←XREG[1+(2⊥vector←IR[(10+xyz×4)+⍳4])] ⍝ Rx or Ry
    ⍝ set the couple register
    REG_Bout_←XREG[1+(2⊥(~1↑vector),(1↓vector))] ⋄ →END
    L4:REG_Bout←REG_Bout_←2⊥IR[(10+xyz×4)+⍳8] ⋄ →END ⍝ opd ---




    xyz←(2⍴2)⊥uFIR[19 20] ⍝ -- REG_A: 0=Rx, 1=Ry, 2=Rz 3=ERR
    →ERR×⍳(xyz<0)∨(xyz>3)
    REG_Dout←FREG[1+(2⊥vector←(26↓IR)[(5+xyz×4)+⍳4])]
    ⍝ set the couple register ---
    REG_Dout_←FREG[1+(2⊥(~1↑vector),(1↓vector))] ⋄ →END




    xyz←(2⍴2)⊥uFIR[21 22] ⍝ -- REG_A: 0=Rx, 1=Ry, 2=Rz 3=ERR --
    →ERR×⍳(xyz<0)∨(xyz>3)
    REG_Eout←FREG[1+(2⊥vector←(26↓IR)[(5+xyz×4)+⍳4])]
    ⍝ set the couple register
    REG_Eout_←FREG[1+(2⊥(~1↑vector),(1↓vector))] ⋄ →END





    REG_Gout←XREG[1+2⊥IR[49+⍳4]] ⋄ →END
    END:→0
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 26
Objects: REG_H RW RWable SETC
∇REG_H
    REG_Hout←XREG[1+2⊥IR[53+⍳4]] ⋄ →END
    END:→0
∇
∇RW C
    ⍝ -- C[1] is DIS/ENC, C[2] is R/W and C[3] is PORT number --
    →END×⍳(C[1]=0) ⍝ -- PORT DISABLED --
    →L10×⍳(C[3]<1)∨(C[3]>4)
    →L1×⍳(C[2]=0) ⍝ (1=WRITE)
    ⍝ -- WRITE TO MEMORY --
    →(L2,L3,L4,L5)[C[3]]
    L2:MEM[1+XREG[30+1]+XREG[19+1]]←XREG[20+1] ⋄ 'WRITE PORT1' ⋄ →END
    L3:MEM[1+XREG[30+1]+XREG[27+1]]←XREG[28+1] ⋄ 'WRITE PORT2' ⋄ →END
    L4:MEM[1+XREG[30+1]+XREG[19+1]]←2⊥32↑VtoIEEE64 FREG[7+1] ⋄ 'WRITE PORT3'
    MEM[1+XREG[30+1]+XREG[19+1]+1]←2⊥32↓VtoIEEE64 FREG[7+1] ⋄ →END
    L5:MEM[1+XREG[30+1]+XREG[27+1]]←2⊥32↑VtoIEEE64 FREG[15+1] ⋄ 'WRITE PORT4'
    MEM[1+XREG[30+1]+XREG[27+1]+1]←2⊥32↓VtoIEEE64 FREG[15+1] ⋄ →END
    L1: ⍝ -- READ FROM MEMORY --
    ⍝ Emulate the memory fetch delay, 1-2 cycles randomly
    '[RW] '
    →(L6,L7,L8,L9)[C[3]]
    L6:⎕←XDR1rdy←¯1×?2 ⋄ ' READ FROM PORT1' ⋄ →END ⍝ PORT1 --
    L7:⎕←XDR2rdy←¯1×?2 ⋄ ' READ FROM PORT2' ⋄ →END ⍝ PORT2 --
    L8:⎕←FDR1rdy←¯1×?2 ⋄ ' READ FROM PORT3' ⋄ →END ⍝ PORT3 --
    L9:⎕←FDR2rdy←¯1×?2 ⋄ ' READ FROM PORT4' ⋄ →END ⍝ PORT4 --
    L10:EPUT '[RW] INTERNAL ERROR: MEMORY PORT NUMBER OUT OF RANGE'
    END:→0
∇
∇A←RWable IR
    A←(1=IR[1])∨((∧/0 0 0=3↑IR)∧(~∧/0=6↑IR))∨(∧/0 0 1 1 1 0=6↑IR)
∇
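The RWable predicate decides from the leading instruction bits whether an instruction may reference memory. A Python sketch of the same three-way test, as read from the listing, is given below; `rw_able` is a hypothetical name and `ir[0]` is assumed to correspond to `IR[1]` of the 1-origin APL vector.

```python
def rw_able(ir):
    """Return True if the instruction word passes the RWable test.

    The three accepted cases follow the listing:
      * first bit set (the RX-format flag),
      * first three bits zero but the first six not all zero,
      * first six bits equal to 0 0 1 1 1 0.
    """
    head = list(ir[:6])
    rx_format = head[0] == 1
    low_group = all(b == 0 for b in head[:3]) and any(b != 0 for b in head)
    special = head == [0, 0, 1, 1, 1, 0]
    return rx_format or low_group or special
```

For example, an all-zero word is rejected, while a word starting 0 0 0 0 0 1 is accepted.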
∇SETC A;T
    ⍝ ⎕←'[SETC] SET C TO ' ⋄ ⎕←A
    T←VtoB32 XREG[1+16]
    T[1+30]←A
    XREG[1+16]←BtoV T
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 27
Objects: SETF SETX SHOW SHOW_FREG SHOW_XREG
∇SETF;PSW
    PSW←VtoB32 XREG[1+16]
    PSW[1+25]←Ftemp<0 ⍝ Float Sign: 1=Negative
    PSW[1+23]←0 ⍝ Float Overflow: 1=overflow; NOT YET IMPLEMENTED
    PSW[1+22]←0 ⍝ Float Underflow: 1=underflow; NOT YET IMPLEMENTED
    PSW[1+21]←Ftemp=0 ⍝ Float Zero: 1=is zero
    XREG[1+16]←BtoV PSW
∇
∇SETX;PSW
    PSW←VtoB32 XREG[1+16]
    PSW[1+30]←(Ctemp<0-2*31)∨(Ctemp>¯1+2*31) ⍝ Carry Flag: 1=on
    PSW[1+29]←Ctemp<0 ⍝ Fixed Sign: 1=Negative
    PSW[1+27]←(Ctemp<0-2*31)∨(Ctemp>¯1+2*31) ⍝ Fixed Overflow: 1=overflow
    PSW[1+26]←Ctemp=0 ⍝ Fixed Zero: 1=is zero
    XREG[1+16]←BtoV PSW
∇
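The fixed-point condition codes set by SETX depend only on the untruncated ALU result `Ctemp`. The following Python sketch restates the four tests as read from the listing; `fixed_flags` and the dictionary keys are hypothetical names, and the PSW bit positions are deliberately left out.

```python
def fixed_flags(ctemp):
    """Compute the fixed-point condition codes for an ALU result.

    In the listing, carry and overflow use the same test: the result
    lies outside the signed 32-bit range [-2**31, 2**31 - 1].
    """
    out_of_range = ctemp < -(2 ** 31) or ctemp > (2 ** 31) - 1
    return {
        "carry": int(out_of_range),
        "sign": int(ctemp < 0),
        "overflow": int(out_of_range),
        "zero": int(ctemp == 0),
    }
```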
∇SHOW





    '[SHOW_FREG] '
    LOOP:⎕←FREG[I] ⋄ ' '
    I←I+1
    →L1×⍳(1≠4|I)
    EPUT ' '
    L1:→LOOP×⍳(I≤⍴FREG)
    →0
∇
∇SHOW_XREG;I
    I←1
    '[SHOW_XREG] '
    LOOP:VtoHEX8 XREG[I] ⋄ ' '
    I←I+1
    →L1×⍳(1≠4|I)
    EPUT ' '
    L1:→LOOP×⍳(I≤⍴XREG)
    →0






Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 28
Objects: STATUS STOREmemCOMPLETED
∇STATUS
    'XIR: ' ⋄ ⎕←26↑IR
    'FIR: ' ⋄ 17↑26↓IR
    'BIR: ' ⋄ 21↑43↓IR
    'PSW: ' ⋄ VtoHEX8 XREG[1+16] ⋄ EPUT ' '
    'uXIR: ' ⋄ VtoHEX8 BtoV uXIR ⋄ ' '
    'uFIR: ' ⋄ VtoHEX8 BtoV uFIR ⋄ ' '
    'uBIR: ' ⋄ VtoHEX8 BtoV uBIR ⋄ EPUT ''
    'uXPC: ' ⋄ VtoHEX4 uXPC ⋄ ' uFPC: ' ⋄ VtoHEX4 uXPC ⋄ ' '
    'uBPC: ' ⋄ VtoHEX4 uBPC ⋄ EPUT ''




    ⍝ '[STOREmemCOMPLETED] FSDR1rdy=' ⋄ FSDR1rdy ⋄ ' FSDR2rdy=' ⋄ ⎕←FSDR2rdy
    ⍝ '[STOREmemCOMPLETED] MSAR1=' ⋄ XREG[1+19] ⋄ ' MSAR2=' ⋄ XREG[1+27]

    ⍝ check whether the MEMory WRITE can be carried out --
    →L5×⍳(XSDR1rdy≠1)
    ⎕←'[STOREmemCOMPLETED] WRITE '
    ⎕←MEM[1+XREG[30+1]+XREG[19+1]]←XREG[20+1] ⋄ ' > PORT1'
    XSDR1rdy←0 ⋄ XREG[1+19]←MSAR1next
    L5:→L6×⍳(XSDR2rdy≠1)
    ⎕←'[STOREmemCOMPLETED] WRITE '
    MEM[1+XREG[30+1]+XREG[27+1]]←XREG[28+1] ⋄ ' > PORT2'
    XSDR2rdy←0 ⋄ XREG[1+27]←MSAR2next
    L6:→L7×⍳(FSDR1rdy≠1)
    '[STOREmemCOMPLETED] WRITE '
    MEM[1+XREG[30+1]+XREG[19+1]]←2⊥32↑VtoIEEE64 FREG[7+1] ⋄ ':'
    ⎕←MEM[1+XREG[30+1]+XREG[19+1]+1]←2⊥32↓VtoIEEE64 FREG[7+1] ⋄ FSDR1rdy←0
    ' > PORT3' ⋄ XREG[1+19]←MSAR1next
    L7:→END×⍳(FSDR2rdy≠1)
    ⎕←'[STOREmemCOMPLETED] WRITE '
    MEM[1+XREG[30+1]+XREG[27+1]]←2⊥32↑VtoIEEE64 FREG[15+1] ⋄ ':'
    MEM[1+XREG[30+1]+XREG[27+1]+1]←2⊥32↓VtoIEEE64 FREG[15+1] ⋄ FSDR2rdy←0
    ⎕←'> PORT4' ⋄ XREG[1+27]←MSAR2next ⋄ →END
    END:→0
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 29
Objects: STOREmemINIT TBLc TBLs TITLE TVtoB
∇STOREmemINIT
    ⍝ BYPASS INVALID READ/WRITE INSTRUCTION
    →END×⍳~RWable IR
    ⍝ initiate NEW FETCH and NEW STORE --

    ⍝ IR[1+6]=Couple instruction indication (1-coupled)
    ⍝ IR[1+8]=Put result into MAR (fixed/floating READ) (1-enable)
    ⍝ IR[1+9]=Put result into MSAR (fixed/floating WRITE) (1-enable)
    ⍝ IR[1+7]=Type of operation (1-Floating)
    ⍝
    →L1×⍳(XWRITE_STARTED=1)∨(uXIR[31]≠1)
    WRITE(IR[10]∧(~IR[8])),1 ⍝ normal fixed write
    WRITE(IR[10]∧(~IR[8])∧IR[7]),2 ⍝ couple fixed write
    XWRITE_STARTED←1
    L1:→END×⍳(FWRITE_STARTED=1)∨(uFIR[27]≠1)
    WRITE(IR[10]∧IR[8]),3 ⍝ normal floating write




∇CONST←TBLc IDX
    CONST←~VtoB32 2*IDX
∇
∇CONST←TBLs IDX
    CONST←VtoB32 2*IDX
∇
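TBLs and TBLc act as constant tables of 32-bit masks: TBLs yields a vector with a single bit set (value 2 to the power IDX), and TBLc yields its complement, used for clearing that bit. A short Python sketch under that reading follows; `v_to_b32`, `tbl_s` and `tbl_c` are hypothetical names, and bit vectors are written most significant bit first, as the APL encode primitive produces them.

```python
def v_to_b32(value):
    """Encode a non-negative integer as a 32-element bit list, MSB first."""
    return [(value >> (31 - i)) & 1 for i in range(32)]

def tbl_s(idx):
    """Mask with only bit `idx` set (bit 0 is the least significant)."""
    return v_to_b32(2 ** idx)

def tbl_c(idx):
    """Mask with every bit set except bit `idx`."""
    return [1 - b for b in tbl_s(idx)]
```

An OR with `tbl_s(k)` sets bit k and an AND with `tbl_c(k)` clears it, which is consistent with how the SET and CLR operations of XALU combine a bit vector with `REG_Bout`.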
∇TITLE
    ⎕←' XEND FEND BEND INIT INSv XNEW FNEW BNEW END NOX NOF NOB Xrdy Frdy Brdy '
∇
∇BITVEC←TVtoB TVALUE
    →NEGATIVE×⍳TVALUE<0
    BITVEC←VtoB TVALUE ⋄ →0
    NEGATIVE:BITVEC←~VtoB|TVALUE+1
∇
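TVtoB appears to produce a two's-complement bit vector: a non-negative value is encoded directly, and a negative value x is encoded as the bitwise complement of |x+1|. The Python sketch below illustrates that identity; `tv_to_b` is a hypothetical name, and the 32-bit default width is an assumption taken from VtoB.

```python
def tv_to_b(value, width=32):
    """Two's-complement encoding of `value` as a bit list, MSB first.

    For negative x, the complement of the encoding of |x + 1| equals
    the two's-complement pattern of x.
    """
    def encode(n):
        return [(n >> (width - 1 - i)) & 1 for i in range(width)]

    if value >= 0:
        return encode(value)
    return [1 - b for b in encode(abs(value + 1))]
```

For instance, -1 gives |−1+1| = 0, whose complement is the all-ones pattern.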
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 30
Objects: UPDATEfreg
∇UPDATEfreg;T;S;vector;PSW
    T←S←0 ⋄ PSW←VtoB32 XREG[1+16]
    →E×⍳~(FDATArdy∧(FALUrdy=1)∧(NOF=0))
    →E×⍳(3=(2⍴2)⊥uFIR[23 24]) ⍝ '1 1' IS NOT ALLOWED ----
    →E×⍳(uFIR[25]=0) ⍝ ENF=0, store-back disable; ENC=1, enable
    T←(2⊥vector←(26↓IR)[(5+((2⍴2)⊥uFIR[23 24])×4)+⍳4])
    ⍝ '[UPDATEfreg] bit_C:bit_R=' ⋄ IR[7],IR[9]
    →SKIP0×⍳((T=6)∧(IR[9]=1)∧(IR[8]=1))∨((T=14)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=1))
    FREG[1+T]←REG_Fout
    ⎕←'[UPDATEfreg] FREG[' ⋄ T ⋄ '] is updated to: ' ⋄ REG_Fout
    SKIP0:→SKIP1×⍳(PSW[3]≠1)
    FREG[1+7]←REG_Fout ⋄ T←7
    '[UPDATEfreg] FSDR1 is also updated to: ' ⋄ REG_Fout
    SKIP1:→L2×⍳(IR[31]≠1) ⍝ -- COUPLED OPERATION
    S←(2⊥(~1↑vector),(1↓vector))
    →SKIP5×⍳((S=6)∧(IR[9]=1)∧(IR[8]=1))∨((S=14)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=1))
    FREG[1+S]←REG_Fout_
    '[UPDATEfreg] FREG[' ⋄ S ⋄ '] is updated to: ' ⋄ REG_Fout_
    SKIP5:→L2×⍳(PSW[4]≠1)
    FREG[1+15]←REG_Fout_ ⋄ S←15
    '[UPDATEfreg] FSDR2 is also updated to: ' ⋄ REG_Fout_
    L2:→L1×⍳~(((S=7)∨(T=7))∧(FSDR1rdy=2))
    FSDR1rdy←1
    L1:→E×⍳~(((S=15)∨(T=15))∧(FSDR2rdy=2))




Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 31
Objects: UPDATEpcvc UPDATExreg
∇UPDATEpcvc
    →END×⍳~(BDATArdy∧(BALUrdy=1)∧(NOB=0))
    ⍝ UPDATE PC
    S_BR←IR[44] ⍝ SET THE SELF+BRANCH BIT
    →L1×⍳~((VC=1)∧(S_BR=0)) ⍝ NOT VALID BRANCH, NO PC UPDATE ---
    ⍝ ⎕←'[UPDATEpcvc] BrOffset: '
    →(REG,OPD,REG,L_OPD)[1+2⊥(IR[45],IR[49])]
    REG:XREG[1+31]←XREG[1+31]+XREG[1+2⊥IR[57+⍳4]] ⋄ →L1
    OPD:XREG[1+31]←XREG[1+31]+BtoTV IR[57+⍳7] ⋄ →L1
    L_OPD:XREG[1+31]←XREG[1+31]+BtoTV IR[49+⍳15] ⋄ →L1
    ⍝ PUSH TO INTERNAL STACK
    L1:→L2×⍳(uBIR[19]=0) ⍝ NO PUSH
    ISTACK[ISOF ISP]←XREG[(32 17 23 31)[1+2⊥uBIR[20 21]]] ⋄ →L2
    ⍝ POP FROM INTERNAL STACK
    L2:→L3×⍳(uBIR[22]=0) ⍝ NO POP
    XREG[(32 17 23 31)[1+2⊥uBIR[23 24]]]←ISTACK[ISOF ISP] ⋄ →L3
    ⍝ INC ISP
    L3:→L4×⍳(uBIR[25]=0)
    ISP←ISP+1 ⋄ →L4
    ⍝ DEC ISP
    L4:→END×⍳(uBIR[26]=0)
    ISP←ISP-1 ⋄ →END
    END:→0
∇
∇UPDATExreg;T;vector;PSW
    PSW←VtoB32 XREG[1+16]
    →E×⍳~(XDATArdy∧(XALUrdy=1)∧(NOX=0))
    →L0×⍳(3≠(2⍴2)⊥uXIR[26 27])
    XREG[29+1]←REG_Cout
    ⎕←'[UPDATExreg] SP is updated to: ' ⋄ REG_Cout ⋄ →MAR
    L0:→(L1,L2)[1+IR[1]]
    ⍝ RR format
    L1:→MAR×⍳(uXIR[28]=0) ⍝ ENC=0, store-back disable; ENC=1, enable
    T←(2⊥vector←IR[(10+((2⍴2)⊥uXIR[26 27])×5)+⍳5])
    →SKIP0×⍳((T=18)∧(IR[9]=1)∧(IR[8]=0))∨((T=26)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=0))
    XREG[1+T]←REG_Cout
    '[UPDATExreg] XREG[' ⋄ T ⋄ '] is updated to: ' ⋄ REG_Cout
    SKIP0:→SKIP2×⍳(PSW[1]≠1)
    XREG[1+20]←REG_Cout
    '[UPDATExreg] XSDR1 is also updated to: ' ⋄ REG_Cout
    SKIP2:→MAR×⍳(IR[7]≠1) ⍝ -- COUPLED OPERATION --
    T←(2⊥(1↑vector),(~1↑1↓vector),(2↓vector))
    →SKIP5×⍳((T=18)∧(IR[9]=1)∧(IR[8]=0))∨((T=26)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=0))
    XREG[1+T]←REG_Cout_
    '[UPDATExreg] XREG[' ⋄ T ⋄ '] is updated to: ' ⋄ REG_Cout_
    SKIP5:→SKIP3×⍳(PSW[2]≠1)
    XREG[1+28]←REG_Cout_
    '[UPDATExreg] XSDR2 is also updated to: ' ⋄ REG_Cout_
    SKIP3:→MAR
    ⍝ RX format
    L2:→MAR×⍳(uXIR[28]=0) ⍝ ENC=0, store-back disable; ENC=1, enable
    T←(2⊥vector←IR[(10+((2⍴2)⊥uXIR[26 27])×4)+⍳4])
    →SKIP10×⍳((T=18)∧(IR[9]=1)∧(IR[8]=0))∨((T=26)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=0))
    XREG[1+T]←REG_Cout
    '[UPDATExreg] XREG[' ⋄ T ⋄ '] is updated to: ' ⋄ REG_Cout
    SKIP10:→SKIP4×⍳(PSW[1]≠1)
    XREG[1+20]←REG_Cout
    '[UPDATExreg] XSDR1 is also updated to: ' ⋄ REG_Cout
    SKIP4:→MAR×⍳(IR[7]≠1) ⍝ --- COUPLED OPERATION ---
    T←(2⊥(~1↑vector),(1↓vector))
    →SKIP11×⍳((T=18)∧(IR[9]=1)∧(IR[8]=0))∨((T=26)∧(IR[9]=1)∧(IR[7]=1)∧(IR[8]=0))
    XREG[1+T]←REG_Cout_
    '[UPDATExreg] XREG[' ⋄ T ⋄ '] is updated to: ' ⋄ REG_Cout_
    SKIP11:→MAR×⍳(PSW[2]≠1)
    XREG[1+28]←REG_Cout_
    '[UPDATExreg] XSDR2 is also updated to: ' ⋄ REG_Cout_
    ⍝ BOTH formats: UPDATE Memory Address Register and MSAR
    MAR: ⍝ BYPASS THE NON-VALID MEMORY OPERATION
    →E×⍳~RWable IR
    →MSAR×⍳(IR[9]≠1) ⍝ R=0, NOT put to MAR; R=1, put
    XREG[1+17]←REG_Cout
    ⎕←'[UPDATExreg] MAR1 is updated to: ' ⋄ REG_Cout
    →SKIP1×⍳(PSW[5]≠1)
    →L21×⍳(IR[10]=1) ⍝ READ and WRITE in one instruction
    XREG[1+19]←REG_Cout
    '[UPDATExreg] MSAR1 is also updated to: ' ⋄ REG_Cout ⋄ →SKIP1
    L21:MSAR1next←REG_Cout
    SKIP1:→MSAR×⍳(IR[7]≠1) ⍝ C=0, normal operation; C=1, coupled
    XREG[1+25]←REG_Cout_
    '[UPDATExreg] MAR2 is updated to: ' ⋄ REG_Cout_
    →MSAR×⍳(PSW[6]≠1)
    →L22×⍳(IR[10]=1) ⍝ READ and WRITE in one instruction
    XREG[1+27]←REG_Cout_
    '[UPDATExreg] MSAR2 is also updated to: ' ⋄ REG_Cout_ ⋄ →MSAR
    L22:MSAR2next←REG_Cout_
    MSAR:→E×⍳(IR[10]≠1) ⍝ W=0, NOT put to MSAR; W=1, put
    →L20×⍳(IR[9]=1) ⍝ READ and WRITE in one instruction
    XREG[1+19]←REG_Cout
    '[UPDATExreg] MSAR1 is updated to: ' ⋄ REG_Cout
    →E×⍳(IR[7]≠1) ⍝ C=0, normal operation; C=1, coupled
    XREG[1+27]←REG_Cout_
    '[UPDATExreg] MSAR2 is updated to: ' ⋄ REG_Cout_ ⋄ →E
    L20: ⍝ Delay UPDATE since the last WRITE hasn't been carried out
    MSAR1next←REG_Cout
    →E×⍳(IR[7]≠1) ⍝ C=0, normal operation; C=1, coupled
    MSAR2next←REG_Cout
    E:→0
∇
∇BITVEC←VtoB VALUE
    BITVEC←(32⍴2)⊤VALUE
∇
∇BITVEC←VtoB11 VALUE
    BITVEC←(11⍴2)⊤VALUE
∇
∇BITVEC←VtoB17 VALUE
    BITVEC←(17⍴2)⊤VALUE
∇
∇BITVEC←VtoB24 VALUE
    BITVEC←(24⍴2)⊤VALUE
∇
∇BITVEC←VtoB27 VALUE
    BITVEC←(27⍴2)⊤VALUE
∇
∇BITVEC←VtoB31 VALUE
    BITVEC←(31⍴2)⊤VALUE
∇
∇BITVEC←VtoB32 VALUE
    BITVEC←(32⍴2)⊤VALUE
∇
∇BITVEC←VtoB33 VALUE
    BITVEC←(33⍴2)⊤VALUE
∇
∇BITVEC←VtoB8 VALUE
    BITVEC←(8⍴2)⊤VALUE
∇
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 34
Objects: VtoHEX16 VtoHEX4 VtoHEX8 VtoIEEE64
∇VtoHEX16 A;RES
    A←(16⍴16)⊤A
    A←(48+7×A≥10)+A
    RES←82 ⎕DR A
    RES←RES[¯1+2×⍳16]
∇
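The VtoHEX functions turn a value into hexadecimal characters by splitting it into base-16 digits and adding 48 to each digit, plus 7 more for digits of 10 and above, so that 0..9 map to '0'..'9' and 10..15 to 'A'..'F'. A Python sketch of that arithmetic follows; `v_to_hex` is a hypothetical name, and the character-conversion step of the listing is replaced here by `chr`.

```python
def v_to_hex(value, ndigits):
    """Format a non-negative integer as `ndigits` upper-case hex characters."""
    digits = []
    for _ in range(ndigits):
        digits.append(value % 16)   # least significant digit first
        value //= 16
    digits.reverse()
    # 48 is the code of '0'; the extra 7 skips from ':' up to 'A'
    codes = [48 + 7 * (d >= 10) + d for d in digits]
    return "".join(chr(c) for c in codes)
```

As with the APL encode primitive, digits beyond `ndigits` are silently dropped.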
∇VtoHEX4 A;RES
    A←(4⍴16)⊤A
    A←(48+7×A≥10)+A
    RES←82 ⎕DR A
    RES←RES[¯1+2×⍳4]
∇
∇VtoHEX8 A;RES
    A←(8⍴16)⊤A
    A←(48+7×A≥10)+A




    →L0×⍳A≠0 ⋄ RES←64⍴0 ⋄ →END
    L0:SIGN←0 ⋄ →L1×⍳A>0
    SIGN←1 ⋄ A←-A
    L1:BD←⌊A ⋄ AD←A-⌊A
    BD←BDtoB BD ⋄ AD←ADtoB AD
    →L2×⍳(BtoV BD)=0
    EXP←(⍴BD)-1 ⋄ →L4
    L2:EXP←1 ⋄ BD←0⍴0
    L3:EXP←EXP+(AD[EXP]=0)
    →L7×⍳(AD[EXP]=1)∧(EXP=1)
    →L3×⍳(AD[EXP]≠1)∧(AD[EXP-1]=0)
    L7:EXP←-EXP
    L4:→L5×⍳(EXP≥1024)∨(EXP<¯1023)
    →L8×⍳(EXP<0)
    RES←SIGN,(11↑16⍴VtoB11 EXP+1023),(1↓53↑BD,AD) ⋄ →END ⍝ normalized
    L8:RES←SIGN,(11↑16⍴VtoB11 EXP+1023),((-EXP)↓(52-EXP)↑AD) ⋄ →END
    L5:→L6×⍳(EXP≥1024)
    RES←SIGN,(11⍴0),(1022↓(1022+52)↑AD) ⋄ →END ⍝ denormalized





Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 35
Objects: WRITE XALU
∇WRITE C
    ⍝ -- C[1] is DIS/ENC, C[2] is PORT number --

    →END×⍳(C[1]=0) ⍝ -- PORT DISABLED --
    →L10×⍳(C[2]<1)∨(C[2]>4)
    ⍝ WRITE TO MEMORY
    →(L2,L3,L4,L5)[C[2]]
    L2:XSDR1rdy←2 ⋄ →END
    L3:XSDR2rdy←2 ⋄ →END
    L4:FSDR1rdy←2 ⋄ →END
    L5:FSDR2rdy←2 ⋄ →END
    L10:EPUT '[WRITE] INTERNAL ERROR: MEMORY PORT NUMBER OUT OF RANGE'
    END:→0
∇
∇XALU;OP
    →END×⍳(NOX=1)∨(XDATArdy≠1) ⍝ -- XALU keeps idle
    →READY×⍳(XALUrdy=1) ⍝ XALU is working for last operation ---
    XALUrdy←XALUrdy+1
    →END
    READY:OP←1+(5⍴2)⊥uXIR[16+⍳5]
    →ERR×⍳(OP<1)∨(OP>32)
    →(NOP,ADD,SUB,MUL,DIV,R1,R2,R3,LSHT,RSHT,LROT,RROT,LCROT,RCROT,B,R4,AND,OR,XOR,NOT_B,CLR,SET,CLRa,SETa,R5,R6,R7,R8,BINC,BDEC,R9,R10)[OP]
    XALUrdy←1
    ADD:REG_Cout←Ctemp←REG_Aout+REG_Bout
    REG_Cout_←Ctemp_←REG_Aout_+REG_Bout_ ⋄ SETX ⋄ →END
    SUB:REG_Cout←Ctemp←REG_Aout-REG_Bout
    REG_Cout_←Ctemp_←REG_Aout_-REG_Bout_ ⋄ SETX ⋄ →END
    MUL:REG_Cout←Ctemp←REG_Aout×REG_Bout
    REG_Cout_←Ctemp_←REG_Aout_×REG_Bout_ ⋄ SETX ⋄ XALUrdy←0 ⋄ →END
    DIV:REG_Cout←Ctemp←⌊REG_Aout÷XTEST_DBZ REG_Bout
    REG_Cout_←Ctemp_←⌊REG_Aout_÷XTEST_DBZ REG_Bout_ ⋄ SETX ⋄ XALUrdy←¯2 ⋄ →END
    AND:REG_Cout←Ctemp←BtoV(VtoB32 REG_Aout)∧VtoB32 REG_Bout ⋄ →END
    OR:REG_Cout←Ctemp←BtoV(VtoB32 REG_Aout)∨VtoB32 REG_Bout ⋄ →END
    XOR:REG_Cout←Ctemp←BtoV(VtoB32 REG_Aout)≠VtoB32 REG_Bout ⋄ →END
    NOT_B:REG_Cout←Ctemp←BtoV~VtoB32 REG_Bout ⋄ →END
    B:REG_Cout←Ctemp←REG_Bout
    REG_Cout_←Ctemp_←REG_Bout_ ⋄ →END
    CLRa:REG_Cout←Ctemp←0 ⋄ →END
    SETa:REG_Cout←Ctemp←BtoV 32⍴1 ⋄ →END
    CLR:REG_Cout←Ctemp←BtoV(VtoB32 REG_Aout)∧REG_Bout ⋄ →END
    SET:REG_Cout←Ctemp←BtoV(VtoB32 REG_Aout)∨REG_Bout ⋄ →END
    LSHT:Ctemp←VtoB33 REG_Aout×2*REG_Bout ⋄ REG_Cout←BtoV ¯32↑Ctemp
    SETC Ctemp[1] ⋄ Ctemp←REG_Cout ⋄ →END ⍝ MAYBE WITH PROBLEM
    RSHT:REG_Cout←Ctemp←⌊REG_Aout÷2*REG_Bout ⋄ SETC 1↑(REG_Bout⍴2)⊤(2*REG_Bout)|REG_Aout ⋄ →END
    RROT:REG_Cout←Ctemp←BtoV(32-REG_Bout)⌽VtoB32 REG_Aout ⋄ SETC(VtoB32 Ctemp)[1] ⋄ →END
Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 36
Objects: XALU (Cont'd) XFILTER XMPX1
    LROT:REG_Cout←Ctemp←BtoV REG_Bout⌽VtoB32 REG_Aout ⋄ SETC(VtoB32 Ctemp)[32] ⋄ →END
    RCROT:Ctemp←(33-REG_Bout)⌽((VtoB32 REG_Aout),GETC) ⋄ SETC Ctemp[33] ⋄ REG_Cout←Ctemp←BtoV 32↑Ctemp ⋄ →END
    LCROT:Ctemp←REG_Bout⌽(GETC,(VtoB32 REG_Aout)) ⋄ SETC Ctemp[1] ⋄ REG_Cout←Ctemp←BtoV 1↓Ctemp ⋄ →END
    BINC:REG_Cout←Ctemp←REG_Bout+1 ⋄ →END
    BDEC:REG_Cout←Ctemp←REG_Bout-1 ⋄ →END
    ERR:EPUT '[XALU] X - INTERNAL ERROR: FIXED MICRO OP-CODE OUT OF RANGE'
    NOP:→END
    R1:→END
    R2:→END
    R3:→END
    R4:→END
    R5:→END
    R6:→END
    R7:→END
    R8:→END
    R9:→END




    TYPE←OP[1] ⋄ OP←1↓OP
    →(L1,L2)[1+TYPE]
    ⍝ RR FORMAT
    L1:UADDR←RRSTART[1+BtoV OP] ⋄ →END
    ⍝ RX FORMAT




    →(L1,L2,L3,L4)[1+2 2⊥uXIR[15 16]]
    L1:REG_Bout←Ctemp ⋄ →END ⍝ 00 : Ctemp
    L2:→END×⍳(IR[1]=0)∨(IR[6]=1) ⍝ -- immediate mode -- 01 : REG_B/IR.opd
    ⍝ REG_Bout←MEM[REG_Bout+XREG[30+1]] ⋄ →END -- NOT AVAILABLE
    L3:REG_Bout←TBLc 31-REG_Bout ⋄ →END ⍝ -- 10 : const TABLEclr
    L4:REG_Bout←TBLs 31-REG_Bout ⋄ →END ⍝ 11 : const TABLEset





Workspace: 2 MATRIX    Apr 11, 1994 6:44 PM    Page 37
Objects: XREADY
•R—XREADY；xyz；SELECTED 
|L] 
A — — TEST IF REG—A REQUIRES XDR1 
: ! ] x y z ^ ( 2 p 2 ) ± u X I R [ 2 2 2 3 ] n 一一 REG一A: 0=Rx, l = R y , 2=Rz 3=SP - 一 
} ] ">L2x l ( I R [ 1 ] = 1 ) 
>] A RR f o r m a t 
；] R — ( l = X D R l r d y ) v (SELECTED—(1+18)关（1+(2丄IR[ ( 1 0 + x y z x 5 ) + l 5 ] ) ) ) 
R M ( 工 R [ 7 ] 关 l ) v ( i = X D R 2 r d y ) v S E L E C T E D ) O —L3 
: l ] fl RX f o r m a t Rz i s n o t assumed t o be s e l e c t e d h e r e 一 一 -
:•] L 2 : R < - ( l = X D R l ] r d y ) v (SELECTED—(1+18)关（1+(2丄工R[ ( 1 0 + x y z x 4 ) + L 4 ] ) ) ) 
: .0] R—R八（（IR[7]关 1) v ( i =XDR2rdy ) vSELECTED) O —L3 
[ .1] 
.2] fl TEST IF REG—A REQUIRES XDR2 
.3] L3:xyz—(2p2)丄UXIR[22 23] fl -- REG—A: 0=Rx, l=Ry, 2=Rz 3=SP —— 
: .4 ] — L 4 x i ( I R [ l ] = i ) 
: .5 ] A RR f o r m a t 
R—Ra( ( l = X D R 2 r d y ) v (SELECTED—(1+2 6)关（1+(2丄工R[ ( 1 0 + x y z x 5 ) + L 5 ] ) ) ) ) 
v.7] R—RA ( ( I R [ 7 ]关 1 ) v ( I = X D R l r d y ) vSELECTED) O ->L5 
R RX f o r m a t Rz i s n o t assumed t o be s e l e c t e d h e r e 
R9] L4 ：R«-RA ( ( l = X D R 2 r d y ) v (SELECTED—(1+26)关（l+(2丄工R[ ( 1 0 + x y z x 4 ) + L 4 ] ) ) ) ) 
:0] R—RA ( ( I R [ 7 ] 关 1) v ( l = X D R l r d y ) vSELECTED) • —L5 
11] 
[::2] A TEST IF REG一B REQURIES XDR1 
i;,3] L 5 : x y z — ( 2 p 2 ) u X I R [ 2 4 2 5 ] fl - - REG一A: 0=Rx, l = R y , 2二Rz 3=SP 
I 4 ] —L6XL( IR[1]=1) 
|:5 ] R RR f o r m a t 
I 6 ] R—RA( ( l = X D R l r d y ) v (SELECTED—(1+18)关（ l+(2丄 IR[ ( 1 0 + x y z x 5 ) + L 5 ] ) ) ) ) 
I 7 ] R—RA ( ( I R [ 7 ] 关 1 ) v ( l = X D R 2 r d y ) vSELECTED) O —L7 • 
|i8 ] 一 RX forma七 “ 
1 9 ] L 6 : — ( L 8 , L 8 , L 7 ) [ x y z + 1 ] 
SLO] RA( (l=XDRlrdy)V (SELECTED—(1+18)关（L+(2 丄工 R[ (10+xyzx4)+L4 ] ) ) ) ) 
| , 1 ] R—RA ( ( I R [ 7 ] 关 1 ) v ( l = X D R 2 r d y ) vSELECTED) O ->L7 
麗’4^  J . 
|：3]. fi——TEST I F REG—B REQURIES XDR2 
|：4] L7 :xyz—(2p2)丄UXIR[24 25] fl ~ REG—A: 0=Rx, l=Ry , 2=Rz 3=SP—— 
| : 5 ] —L9x i ( I R [ 1 ] = 1 ) 
h6] fl RR f o r m a t 
7 ] R a ( ( i = X D R 2 r d y ) v (SELECTED— ( 1 + 2 6 )关（ 1 + (2丄工R [ ( 1 0 + x y z x 5 ) + L 5 ] ) ) ) ) 
|:8] R—RA((IR[7]尹 l)V ( l = X D R l r d y ) v S E L E C T E D ) END 
91 fl--一 RX f o r m a t 1 . . . . . . 
丨叫 L 9 : —(L10,L10 , E N D ) [ x y z + 1 ] 
L 1 0 : R . R A ( ( 1 = = X D R 2 r d y ) v ( S E L E C T E D ^ ( l + 2 6 ) ^ ( l + ( 2 x I R [ ( 1 0 - f x y z x 4 ) + , 4 ] ) ) ) ) 
] 2 ] R^RA ( ( i R [ 7 ] ^ i ) v ( i = X D R l r d y ) VSELECTED) O —END 
13] 
：L4 1 END • 
I J A W [ X R E A D Y ] DATArdy : , O R 
\[5] —0 
V 
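The first readiness test in XREADY can be paraphrased in Python. This is only a sketch of the RR-format REG_A/XDR1 check, not a transcription of the whole function: it assumes that register number 18 denotes XDR1 (inferred from the comparison against 1+18), that bit lists are most-significant-first, and it omits the IR[7]/XDR2 clause. The names `decode_bits`, `operand_ready` and `XDR1_REG` are illustrative and do not appear in the simulator.

```python
XDR1_REG = 18  # assumption: register number that stands for data register XDR1


def decode_bits(bits):
    """Base-2 decode of a most-significant-first bit list (APL: 2⊥bits)."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


def operand_ready(ir, selector_bits, xdr1_ready, field_width=5):
    """Return True when the selected operand does not have to wait for XDR1."""
    xyz = decode_bits(selector_bits)      # 0=Rx, 1=Ry, 2=Rz
    start = 10 + xyz * field_width        # APL IR[(10+xyz×5)+⍳5], shifted to 0-based
    reg = decode_bits(ir[start:start + field_width])
    return bool(xdr1_ready) or reg != XDR1_REG


# Hypothetical 32-bit instruction whose Rx field (bits 11..15) holds register 18.
ir = [0] * 32
ir[10:15] = [1, 0, 0, 1, 0]
```

With this instruction, selecting Rx while XDR1 is not ready reports "not ready", whereas selecting Ry (register 0) is always ready.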
Objects: XTEST_DBZ XtoF uBIRrdy uBMPX1 uBMPX2
∇OUT←XTEST_DBZ IN
OUT←IN
→0×⍳IN≠0
EPUT '[XTEST_DBZ] X-ARITHMETIC ERROR: DIVIDED BY ZERO' ⋄ OUT←1
∇
∇XtoF
→END×⍳(0=uFIR[26])
FREG[1]←IEEE64toV(VtoB32 XREG[1]),(VtoB32 XREG[2])
END:→0
∇
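XtoF concatenates the bit patterns of two 32-bit fixed-point registers and reinterprets the 64 bits as an IEEE double. A minimal Python sketch of that reinterpretation follows. It assumes the first register supplies the most significant word, which is how the concatenation in the listing reads; `words_to_double` is an illustrative name, not a simulator function.

```python
import struct


def words_to_double(high, low):
    """Reinterpret two unsigned 32-bit words as one big-endian IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">II", high, low))[0]


# 0x3FF00000:0x00000000 is the bit pattern of 1.0
one = words_to_double(0x3FF00000, 0)
```

The same helper applied to the word pair 1075576832:0 gives 7.0.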
∇uBIRrdy
⍝ '[uBIRrdy] uBRAMaddr=' ⋄ ⎕←uBMPX2out
uBIRin←VtoB27 uBRAM[uBMPX2out]
END:→0
∇
∇uBMPX1;PSW
PSW←VtoB32 XREG[1+16]
→(L1,L2,L3,L4,L5,L6,L7,10⍴L8)[1+BtoV uBIR[⍳4]]
L1: ⍝ -- Unconditional branch
BLDBr←1 ⋄ →END
L2: ⍝ -- Branch when XB=1
BLDBr←PSW[33-4] ⋄ →END
L3: ⍝ -- Branch when SX=1
BLDBr←PSW[33-3] ⋄ →END
L4: ⍝ -- Branch when XO=1
BLDBr←PSW[33-5] ⋄ →END
L5: ⍝ -- Branch when CF=1
BLDBr←PSW[33-2] ⋄ →END
L6: ⍝ -- Branch when XZ=1
BLDBr←PSW[33-6] ⋄ →END
L7: ⍝ -- Branch when (SX XOR XO)=1
BLDBr←PSW[33-3]≠PSW[33-5] ⋄ →END
L8: ⍝ -- No Branch
BLDBr←0 ⋄ →END
END:→0
∇
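The branch-condition multiplexer uBMPX1 maps a 4-bit condition code onto one PSW flag (or a combination of two). The sketch below restates that mapping in Python. It assumes the PSW is held as 32 bits, most significant first, so that the APL index PSW[33-k] selects the k-th bit from the least significant end; `PSW_BIT`, `psw_flag` and `load_branch` are illustrative names.

```python
# Bit positions counted from the least significant end, as in PSW[33-k].
PSW_BIT = {"CF": 2, "SX": 3, "XB": 4, "XO": 5, "XZ": 6}


def psw_flag(psw, name):
    """Read one flag from a list of 32 bits, most significant first."""
    return psw[32 - PSW_BIT[name]]


def load_branch(code, psw):
    """Return 1 when the micro-branch selected by the 4-bit code is taken."""
    if code == 0:                       # unconditional branch
        return 1
    single = {1: "XB", 2: "SX", 3: "XO", 4: "CF", 5: "XZ"}
    if code in single:                  # branch on a single flag
        return psw_flag(psw, single[code])
    if code == 6:                       # branch when SX XOR XO
        return psw_flag(psw, "SX") ^ psw_flag(psw, "XO")
    return 0                            # codes 7..15: no branch


# Example PSW with only SX set.
psw = [0] * 32
psw[32 - PSW_BIT["SX"]] = 1
```

The F-unit and X-unit multiplexers (uFMPX1, uXMPX1) follow the same table, differing only in which micro-instruction register supplies the code.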
∇uBMPX2;temp;temp1
→L1×⍳(INTR=1)
temp←(uBPC+1),temp1,BBrAddr,temp1←(BFILTER 4↑…↓IR)
uBMPX2out←temp[1+2 2⊥(BLDBr,BNEW)]
→END×⍳(BDATArdy∧(BALUrdy=1)∧(NOB=0)) ⍝ -- RESET uBPC, KEEP IT UNCHANGED --
uBMPX2out←uBPC ⋄ →END
L1:EPUT '[uBMPX2] INTR occurs'
END:→0
∇
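The next-address multiplexer uBMPX2 picks one of four candidate micro-addresses from the two control bits (BLDBr, BNEW) and then falls back to the current micro-PC when the unit is not ready. A hedged Python restatement follows. Interrupt handling is left out, the three readiness signals are collapsed into a single `ready` flag, and all parameter names are illustrative.

```python
def next_micro_address(upc, start_addr, branch_addr, load_branch, new_instr, ready):
    """Select the next micro-address the way temp[1+2 2⊥(LDBr,NEW)] does.

    upc         -- current micro program counter
    start_addr  -- entry point derived from the op-code of a new instruction
    branch_addr -- target of the micro-branch
    load_branch -- 1 when the branch condition holds
    new_instr   -- 1 when a new instruction has been issued
    ready       -- True when data and ALU are ready and the unit is not idle
    """
    candidates = (upc + 1, start_addr, branch_addr, start_addr)
    chosen = candidates[2 * load_branch + new_instr]
    # When the unit is not ready the micro-PC is kept unchanged.
    return chosen if ready else upc
```

For example, with `upc=5`, `start_addr=40` and `branch_addr=12`, the four control combinations give 6, 40, 12 and 40 when the unit is ready, and 5 otherwise.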
Objects: uFIRrdy uFMPX1 uFMPX2 uSTATUS uXIRrdy
∇uFIRrdy




∇uFMPX1;PSW
PSW←VtoB32 XREG[1+16]
→(L1,L2,L3,L4,L5,L6,L7,10⍴L8)[1+BtoV uFIR[⍳4]]
L1: ⍝ -- Unconditional branch
FLDBr←1 ⋄ →END
L2: ⍝ -- Branch when XB=1
FLDBr←PSW[33-4] ⋄ →END
L3: ⍝ -- Branch when SX=1
FLDBr←PSW[33-3] ⋄ →END
L4: ⍝ -- Branch when XO=1
FLDBr←PSW[33-5] ⋄ →END
L5: ⍝ -- Branch when CF=1
FLDBr←PSW[33-2] ⋄ →END
L6: ⍝ -- Branch when XZ=1
FLDBr←PSW[33-6] ⋄ →END
L7: ⍝ -- Branch when (SX XOR XO)=1
FLDBr←PSW[33-3]≠PSW[33-5] ⋄ →END
L8: ⍝ -- No Branch
FLDBr←0 ⋄ →END
END:→0
∇
∇uFMPX2;temp;temp1
→L1×⍳(INTR=1)
temp←(uFPC+1),temp1,FBrAddr,temp1←(FFILTER 4↑26↓IR)
uFMPX2out←temp[1+2 2⊥(FLDBr,FNEW)]
→END×⍳(FDATArdy∧(FALUrdy=1)∧(NOF=0)) ⍝ -- RESET uFPC, KEEP IT UNCHANGED --
uFMPX2out←uFPC ⋄ →END
L1:EPUT '[uFMPX2] INTR occurs'
END:→0
∇
∇uSTATUS
'uXIR: ' ⋄ VtoHEX8 BtoV uXIR ⋄ ' uFIR: ' ⋄ VtoHEX8 BtoV uFIR ⋄ ' uBIR: ' ⋄ VtoHEX8 BtoV uBIR ⋄ EPUT ''
'uXPC: ' ⋄ VtoHEX4 uXPC ⋄ ' uFPC: ' ⋄ VtoHEX4 uFPC ⋄ ' uBPC: ' ⋄ VtoHEX4 uBPC ⋄ EPUT ''
∇
∇uXIRrdy
uXIRin←VtoB31 uXRAM[uXMPX2out]
END:→0
∇
Objects: uXMPX1 uXMPX2
∇uXMPX1;PSW
PSW←VtoB32 XREG[1+16]
→(L1,L2,L3,L4,L5,L6,L7,10⍴L8)[1+BtoV uXIR[⍳4]]
L1: ⍝ -- Unconditional branch
XLDBr←1 ⋄ →END
L2: ⍝ -- Branch when XB=1
XLDBr←PSW[33-4] ⋄ →END
L3: ⍝ -- Branch when SX=1
XLDBr←PSW[33-3] ⋄ →END
L4: ⍝ -- Branch when XO=1
XLDBr←PSW[33-5] ⋄ →END
L5: ⍝ -- Branch when CF=1
XLDBr←PSW[33-2] ⋄ →END
L6: ⍝ -- Branch when XZ=1
XLDBr←PSW[33-6] ⋄ →END
L7: ⍝ -- Branch when (SX XOR XO)=1
XLDBr←PSW[33-3]≠PSW[33-5] ⋄ →END
L8: ⍝ -- No Branch
XLDBr←0 ⋄ →END
END:→0
∇
∇uXMPX2;temp;temp1
→L1×⍳(INTR=1)
temp←(uXPC+1),temp1,XBrAddr,temp1←(XFILTER 6↑IR)
uXMPX2out←temp[1+2 2⊥(XLDBr,XNEW)]
→L2×⍳~((XDATArdy)∧((XALUrdy=1))∧(NOX=0))
→END
L2:uXMPX2out←uXPC ⋄ →END ⍝ RESET uXPC, KEEP IT UNCHANGED
L1:EPUT '[uXMPX2] INTR occurs'
END:→0
∇
Appendix II: Screen dump of the simulation runs 
Case One: Gaussian Elimination Inner Loop 
********************** GAUSSIAN PROGRAM *********************** 
GAUSSIAN
| Initialization, register files
XREG #is 32 #rho 0 & FREG #is 16 #rho 0 & MEM #is 128 #rho 0

| The Program
MEM[#iota 8]#is GAUSSIAN_PROGRAM
| The Original 3x4 Matrix
PUTFtoMEM 51,-1   | -1  1  2  2
PUTFtoMEM 53,1    |  3 -1  1  6
PUTFtoMEM 55,2    | -1  3  4  4
PUTFtoMEM 57,2
PUTFtoMEM 59,3
PUTFtoMEM 61,-1
PUTFtoMEM 63,1
PUTFtoMEM 65,6
PUTFtoMEM 67,-1
PUTFtoMEM 69,3
PUTFtoMEM 71,4
PUTFtoMEM 73,4

| set the register to appropriate value
XREG[1+16]#is BtoV(0 0 1 1 1 1 ,26 #rho 0) | -- SET PSW TO READY MODE --

XREG[1+5]#is 3 | (N)
XREG[1+6]#is 1 | (K)
XREG[1+7]#is 1+1 | (J)
XREG[1+0]#is 60 | ( A[J,P] )
XREG[1+8]#is 52 | ( A[K,P] )
XREG[1+4]#is 50+(2 #x 8) | -- TERMINATION CONDITION --

FREG[1+0]#is -(3÷-1) | ( -M[J,K] := A[J,K]÷A[K,K] )

| execute the program
STATUS & DISPLAY 51 3 4 & #inkey
LIWC 1
DISPLAY 51 3 4
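For orientation, the inner loop that this setup exercises is the row update of Gaussian elimination: row J receives row J minus M[J,K] times row K, for the columns to the right of the pivot. The Python sketch below reproduces that update on the 3x4 matrix from the listing with K=1 and J=2 (0-based indices 0 and 1). It is a plain reference computation, not a model of the Triple-Instruction Computer; `eliminate_row` is an illustrative name.

```python
def eliminate_row(a, j, k):
    """Subtract m times row k from row j for the columns right of the pivot.

    Returns the multiplier m = a[j][k] / a[k][k].
    """
    m = a[j][k] / a[k][k]
    for p in range(k + 1, len(a[j])):
        a[j][p] -= m * a[k][p]
    return m


# The 3x4 matrix loaded by the GAUSSIAN listing.
a = [
    [-1.0, 1.0, 2.0, 2.0],
    [3.0, -1.0, 1.0, 6.0],
    [-1.0, 3.0, 4.0, 4.0],
]
m = eliminate_row(a, 1, 0)   # K=1, J=2 in the 1-based notation of the listing
```

The multiplier is -3, so FREG[0] holds -M = 3, and the updated entries of row 2 are 2, 7 and 12, the same values that the run reports for FSDR1.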
********************** GAUSSIAN SIMULATION RUN ****************** 
[LIWC] cycle: 1 PC =1
[MICROF] F1 start at cycle 1
[IRC] IEND: 0 $ INIT: 1 $ S_BR: 0 $ VC: 0
[IRC] REFRESH THE INSTRUCTION PIPELINE
[LIWC] cycle: 2 PC =1
[UPDATExreg] XREG[0] is updated to: 60
[UPDATExreg] XREG[8] is updated to: 52
[UPDATExreg] MAR1 is updated to: 60
[UPDATExreg] MSAR1 is also updated to: 60
[UPDATExreg] MAR2 is updated to: 52
[UPDATExreg] MSAR2 is also updated to: 52
[MICROX] X1 start at cycle 2
[READ] -1 READ FROM PORT3
[READ] -1 READ FROM PORT4
[MICROF] F1 start at cycle 2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 3 PC =1
[FETCHmemCOMPLETED] PORT3 FDR1<-1
[FETCHmemCOMPLETED] PORT4 FDR2<1
[MICROB] B1 start at cycle 3
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 4 PC =3
[UPDATExreg] XREG[0] is updated to: 62
[UPDATExreg] XREG[8] is updated to: 54
[UPDATExreg] MAR1 is updated to: 62
[UPDATExreg] MAR2 is updated to: 54
[MICROX] X2 start at cycle 4
[MICROF] F2 start at cycle 4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 5 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 6 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 7 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 8 PC =3
[READ] -1 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 9 PC =3
[FETCHmemCOMPLETED] PORT3 FDR1<1
[FETCHmemCOMPLETED] PORT4 FDR2<2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 10 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 11 PC =3
[UPDATEfreg] FSDR1 is also updated to: 2
[STOREmemCOMPLETED] WRITE 1073741824:0> PORT3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 12 PC =3
[MICROB] B2 start at cycle 12
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 1
[LIWC] cycle: 13 PC =3
[UPDATExreg] XREG[0] is updated to: 64
[UPDATExreg] XREG[8] is updated to: 56
[UPDATExreg] MAR1 is updated to: 64
[UPDATExreg] MAR2 is updated to: 56
[MICROX] X3 start at cycle 13
[MICROF] F3 start at cycle 13
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 14 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 15 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 16 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 17 PC =3
[READ] -2 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 18 PC =3
[FETCHmemCOMPLETED] PORT4 FDR2<2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 19 PC =3
[FETCHmemCOMPLETED] PORT3 FDR1<6
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 20 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 21 PC =3
[UPDATEfreg] FSDR1 is also updated to: 7
[STOREmemCOMPLETED] WRITE 1075576832:0> PORT3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 22 PC =3
[MICROB] B3 start at cycle 22
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 1
[LIWC] cycle: 23 PC =3
[UPDATExreg] XREG[0] is updated to: 66
[UPDATExreg] XREG[8] is updated to: 58
[UPDATExreg] MAR1 is updated to: 66
[UPDATExreg] MAR2 is updated to: 58
[MICROX] X4 start at cycle 23
[MICROF] F4 start at cycle 23
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 24 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 25 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 26 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 27 PC =3
[READ] -2 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 28 PC =3
[FETCHmemCOMPLETED] PORT4 FDR2<3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 29 PC =3
[FETCHmemCOMPLETED] PORT3 FDR1<-1
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 30 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 31 PC =3
[UPDATEfreg] FSDR1 is also updated to: 12
[STOREmemCOMPLETED] WRITE 1076363264:0> PORT3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 32 PC =3
[MICROB] B4 start at cycle 32
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 33 PC =5
[MICROX] X5 start at cycle 33
[MICROF] F5 start at cycle 33
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] END
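The large integers in the STOREmemCOMPLETED lines can be read as the two 32-bit halves of an IEEE double, written as high:low. That reading is an assumption, but it is supported by the values themselves: decoded this way, the three stores of the run are 2, 7 and 12, matching the FSDR1 updates printed just before each store. A small Python sketch of the decoding (the regular expression and the function name `stored_values` are illustrative):

```python
import re
import struct

# Matches e.g. "WRITE 1073741824:0> PORT3", tolerating spaces around the colon.
WRITE_RE = re.compile(r"WRITE\s+(\d+)\s*:\s*(\d+)>")


def stored_values(log_lines):
    """Decode every 'WRITE high:low>' pair in the log as a big-endian double."""
    values = []
    for line in log_lines:
        match = WRITE_RE.search(line)
        if match:
            high, low = int(match.group(1)), int(match.group(2))
            values.append(struct.unpack(">d", struct.pack(">II", high, low))[0])
    return values


log = [
    "[STOREmemCOMPLETED] WRITE 1073741824:0> PORT3",
    "[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0",
    "[STOREmemCOMPLETED] WRITE 1075576832:0> PORT3",
    "[STOREmemCOMPLETED] WRITE 1076363264:0> PORT3",
]
decoded = stored_values(log)
```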
Case Two: Matrix Multiplication Loop
**************** MATRIX MULTIPLICATION PROGRAM ******************** 
MAT_MUL
| -- Initialization, register files
XREG #is 32 #rho 0 & FREG #is 16 #rho 0 & MEM #is 128 #rho 0

| The Program
MEM[#iota #rho MAT_MUL_PROGRAM]#is MAT_MUL_PROGRAM
| The Original TWO 3x3 Matrix
PUTFtoMEM 51,1   | 1 2 3
PUTFtoMEM 53,2   | 4 5 6
PUTFtoMEM 55,3   | 7 8 9
PUTFtoMEM 57,4
PUTFtoMEM 59,5
PUTFtoMEM 61,6
PUTFtoMEM 63,7
PUTFtoMEM 65,8
PUTFtoMEM 67,9

PUTFtoMEM 71,1   | 1 1 1
PUTFtoMEM 73,1   | 0 1 1
PUTFtoMEM 75,1   | 0 0 1
PUTFtoMEM 77,0
PUTFtoMEM 79,1
PUTFtoMEM 81,1
PUTFtoMEM 83,0
PUTFtoMEM 85,0
PUTFtoMEM 87,1

| -- set the register to appropriate value --
XREG[1+16]#is BtoV(0 0 0 0 0 0 ,26 #rho 0) | -- SET PSW TO READY MODE --
XREG[1+0]#is 50 | (STARTaddr of A)
XREG[1+8]#is 70 | (STARTaddr of B)
XREG[1+1]#is 2 | (next A, size of a IEEE64 real)
XREG[…]#is … | (next B, size of a IEEE64 real x DIM(A))

XREG[1+3]#is 1 | ( row number, I=1 )
XREG[1+4]#is 1 | ( column number, J=1 )
XREG[1+5]#is 0 | ( TERMINATION CONDITION )

XREG[1+6]#is 88 | ( STARTaddr of C )

XREG[1+7]#is 50 | ( backup of the STARTaddr of A )
XREG[1+15]#is 68 | ( backup of the STARTaddr of B )

FREG[1+0]#is 0 | ( TEMP SUM OF EACH ELEMENT )

| execute the program
STATUS & DISPLAY 51 3 3 & DISPLAY 71 3 3 & #inkey
LIWC 1
DISPLAY 91 3 3
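As a reference for reading the run that follows, the product of the two 3x3 matrices loaded above can be computed directly. The Python sketch below uses the textbook triple loop, with an accumulator playing the role that the comment on FREG[0] describes (temporary sum of each element). It is not a model of the machine's instruction sequence, and `mat_mul` is an illustrative name.

```python
def mat_mul(a, b):
    """Textbook matrix product; one accumulator per result element."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0                      # temporary sum of this element
            for k in range(inner):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[1.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
c = mat_mul(a, b)
```

The first element of the product is 1, which agrees with the first value stored by the simulation run.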
************ MATRIX MULTIPLICATION SIMULATION RUN **************** 
[LIWC] cycle: 1 PC =1
[MICROB] B1 start at cycle 1
[IRC] IEND: 0 $ INIT: 1 $ S_BR: 0 $ VC: 0
[IRC] REFRESH THE INSTRUCTION PIPELINE
[LIWC] cycle: 2 PC =1
[UPDATExreg] XREG[5] is updated to: 50
[MICROX] X2 start at cycle 2
[MICROF] F2 start at cycle 2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 3 PC =1
[MICROB] B2 start at cycle 3
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 4 PC =3
[MICROX] X3 start at cycle 4
[MICROF] F3 start at cycle 4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 5 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 6 PC =3
[UPDATExreg] XREG[5] is updated to: 56
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 7 PC =3
[MICROB] B3 start at cycle 7
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 8 PC =5
[UPDATExreg] XREG[0] is updated to: 50
[UPDATExreg] XREG[8] is updated to: 70
[UPDATExreg] MAR1 is updated to: 50
[UPDATExreg] MAR2 is updated to: 70
[MICROX] X4 start at cycle 8
[READ] -2 READ FROM PORT3
[READ] -2 READ FROM PORT4
[UPDATEfreg] FREG[0] is updated to :0
[MICROF] F4 start at cycle 8
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 9 PC =5
[MICROB] B4 start at cycle 9
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 10 PC =7
[FETCHmemCOMPLETED] PORT3 FDR1<1
[FETCHmemCOMPLETED] PORT4 FDR2<1
[UPDATExreg] XREG[0] is updated to: 52
[UPDATExreg] XREG[8] is updated to: 76
[UPDATExreg] MAR1 is updated to: 52
[UPDATExreg] MAR2 is updated to: 76
[MICROX] X5 start at cycle 10
[MICROF] F5 start at cycle 10
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 11 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 12 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 13 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 14 PC =7
[READ] -2 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 15 PC =7
[FETCHmemCOMPLETED] PORT4 FDR2<0
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 16 PC =7
[FETCHmemCOMPLETED] PORT3 FDR1<2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 17 PC =7
[UPDATEfreg] FREG[0] is updated to :1
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 18 PC =7
[MICROB] B5 start at cycle 18
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 1
[LIWC] cycle: 19 PC =7
[UPDATExreg] XREG[0] is updated to: 54
[UPDATExreg] XREG[8] is updated to: 82
[UPDATExreg] MAR1 is updated to: 54
[UPDATExreg] MAR2 is updated to: 82
[MICROX] X6 start at cycle 19
[MICROF] F6 start at cycle 19
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 20 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 21 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 22 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 23 PC =7
[READ] -1 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 24 PC =7
[FETCHmemCOMPLETED] PORT3 FDR1<3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 25 PC =7
[FETCHmemCOMPLETED] PORT4 FDR2<0
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 26 PC =7
[UPDATEfreg] FREG[0] is updated to :1
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 27 PC =7
[MICROB] B6 start at cycle 27
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 1
[LIWC] cycle: 28 PC =7
[UPDATExreg] XREG[0] is updated to: 56
[UPDATExreg] XREG[8] is updated to: 88
[UPDATExreg] MAR1 is updated to: 56
[UPDATExreg] MAR2 is updated to: 88
[MICROX] X7 start at cycle 28
[MICROF] F7 start at cycle 28
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 29 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 30 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 31 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 32 PC =7
[READ] -2 READ FROM PORT3
[READ] -1 READ FROM PORT4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 33 PC =7
[FETCHmemCOMPLETED] PORT4 FDR2<0
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 34 PC =7
[FETCHmemCOMPLETED] PORT3 FDR1<4
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 35 PC =7
[UPDATEfreg] FREG[0] is updated to :1
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 36 PC =7
[MICROB] B7 start at cycle 36
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 1 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 37 PC =9
[UPDATExreg] XREG[6] is updated to: 90
[UPDATExreg] MSAR1 is updated to: 90
[MICROX] X8 start at cycle 37
[UPDATEfreg] FREG[7] is updated to :1
[STOREmemCOMPLETED] WRITE 1072693248:0> PORT3
[MICROF] F8 start at cycle 37
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 38 PC =9
[MICROB] B8 start at cycle 38
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 39 PC =11
[UPDATExreg] XREG[4] is updated to: 2
[MICROX] X9 start at cycle 39
[MICROF] F9 start at cycle 39
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 40 PC =11
[UPDATExreg] XREG[14] is updated to: 1
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 41 PC =11
[UPDATExreg] XREG[14] is updated to: 2
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 42 PC =11
[MICROB] B9 start at cycle 42
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 43 PC =13
[UPDATExreg] XREG[8] is updated to: 68
[UPDATExreg] XREG[0] is updated to: 50
[MICROX] X10 start at cycle 43
[MICROF] F10 start at cycle 43
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 44 PC =13
[MICROB] B10 start at cycle 44
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 45 PC =15
[MICROX] X11 start at cycle 45
[MICROF] F11 start at cycle 45
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 46 PC =15
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 47 PC =15
[UPDATExreg] XREG[8] is updated to: 72
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 48 PC =15
[MICROB] B11 start at cycle 48
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 49 PC =17
[UPDATExreg] XREG[5] is updated to: 50
[MICROX] X12 start at cycle 49
[MICROF] F12 start at cycle 49
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 50 PC =17
[MICROB] B12 start at cycle 50
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 1
[IRC] REFRESH THE INSTRUCTION PIPELINE
[LIWC] cycle: 51 PC =3
[MICROX] X13 start at cycle 51
[MICROF] F13 start at cycle 51
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 52 PC =3
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 53 PC =3
[UPDATExreg] XREG[5] is updated to: 56
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 54 PC =3
[MICROB] B13 start at cycle 54
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 55 PC =5
[UPDATExreg] XREG[0] is updated to: 50
[UPDATExreg] XREG[8] is updated to: 72
[UPDATExreg] MAR1 is updated to: 50
[UPDATExreg] MAR2 is updated to: 72
[MICROX] X14 start at cycle 55
[READ] -2 READ FROM PORT3
[READ] -1 READ FROM PORT4
[UPDATEfreg] FREG[0] is updated to :0
[MICROF] F14 start at cycle 55
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 56 PC =5
[FETCHmemCOMPLETED] PORT4 FDR2<1
[MICROB] B14 start at cycle 56
[IRC] IEND: 1 $ INIT: 0 $ S_BR: 0 $ VC: 0
[IRC] FETCH A NEW ISNTRUCTION IN
[LIWC] cycle: 57 PC =7
[FETCHmemCOMPLETED] PORT3 FDR1<1
[UPDATExreg] XREG[0] is updated to: 52
[UPDATExreg] XREG[8] is updated to: 78
[UPDATExreg] MAR1 is updated to: 52
[UPDATExreg] MAR2 is updated to: 78
[MICROX] X15 start at cycle 57
[MICROF] F15 start at cycle 57
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 58 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 59 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0
[LIWC] cycle: 60 PC =7
[IRC] IEND: 0 $ INIT: 0 $ S_BR: 0 $ VC: 0

