Fault-Tolerant Single-Chip Vector Processor : architecture and performance analysis using Livermore loop benchmarks by Ganesh, Sundararajan
THE FAULT-TOLERANT SINGLE-CHIP VECTOR
PROCESSOR: ARCHITECTURE AND
PERFORMANCE ANALYSIS USING
LIVERMORE LOOP BENCHMARKS









Submitted to the faculty of the 
Graduate College of the 
Oklahoma State University
in partial fulfillment of
the requirement for
the Degree of
MASTER OF SCIENCE 
May, 1992 
ACKNOWLEDGEMENTS 
I wish to express my sincere gratitude to all the people who have assisted me
in this work and during my stay at Oklahoma State University. I sincerely thank Dr.
J. J. Lee for entrusting to me this topic, which is one of his research areas, and for
serving as my graduate student advisor. I am also extremely grateful to Dr. Lee for
providing me with financial support during my graduate study.
My sincere thanks also go to Dr. Richard L. Cummins and Dr. Chris G.
Hutchens for serving on my graduate committee. I am thankful to Dr. Cummins for
giving technical help in thesis writing. I owe my gratitude to Dr. Hutchens for
offering good suggestions for improving the quality of my thesis. His pep talks were
always a morale booster, and I thank him for them. I also thank Leslie Fife and
Philis for their patient help in improving the quality of my thesis.
I am thankful for the unforgettable friendship and company of (breakdown)
Paneer, Krishnan, Shankar, Gopal (Mr. El Paso), Balachandar, Shekar, Vijay (Mr.
B), Ravi (Blue), Venkat and Saravanan (rat) in Stillwater. I am extremely grateful
to my childhood friends Sivakumar, Jagdish, Sriram, Sunil and Babu, who haven't
forgotten me even after my long period of absence from India.
I am greatly indebted to my parents and my brother Jayesh, who make me
feel that I have the best family in the world. My special thanks go to my uncle and
aunt for their constant support and encouragement. I also thank Manu, Priya,
Shubha, Shruthi and Latha for their ever-refreshing letters, which kept me going in
Stillwater. I also thank Ramesh and Uma for extending their hospitality in the U.S.,
and Rajendran for his telephone calls.
TABLE OF CONTENTS
Chapter
I. INTRODUCTION
II. SINGLE-CHIP VECTOR PROCESSOR
    Interconnection Network
    Pipeline Net
    Basic Structure of FTVP
        Structure of the Vector Register
        Structure of an Arithmetic Unit
    Chaining Capability
    Basic Instructions
III. FAULT-TOLERANCE IN THE FTVP
    Types of Fault and Fault Vectors
    Translation Procedure
    Example
        Fault Free Condition
        Pipeline Fault Condition
        Switch Fault Condition
        Pipeline and Switch Fault
        Link and Switch Fault
IV. EVALUATION
    Speedup and Throughput
    Proposed Architecture
V. CONCLUSIONS AND FUTURE WORK
REFERENCES
APPENDIX A - LIVERMORE LOOPS
APPENDIX B - PROGRAM GRAPH OF THE LIVERMORE LOOPS
APPENDIX C - WAFER SCALE INTEGRATION
LIST OF TABLES
Table
I. The Trend in Supercomputers and High-End Main Frame Systems
II. A Taxonomy of Fault Tolerance Techniques in Commercial Computing Systems
III. Basic Instructions for the FTVP
IV. Evaluation Time of the Livermore Loops
V. Livermore Loop Parameters
VI. Comparison of the Packages of Figure of Merits
LIST OF FIGURES
Figure
1. Architecture of a Register-Register Type Vector Processor
2. 8x8 Cube Switching Network
3. 8x8 Conventional Crossbar Network
4. Program Graph
5. Pipeline Net Implementation
6. Crossbar Network Implementation
7. Hardware Model of the FTVP
8. Structure of a Vector Register
9. Structure of an Arithmetic Unit
10. FTVP Implementation of the Example Loop
11. Chaining Operation for Livermore Loop 9
12. Instructions for the Example Loop
13. WSI Model for the Proposed Study
14. 1-to-1 Mapping Procedure
15. Architecture of an 8 Pipelined Structure With Switch Indexes
16. Routing for Livermore Loop 1 with a Fault Free Condition
17. Routing for Livermore Loop 1 in case of Faulty Pipelines
18. Routing for Livermore Loop 1 in case of Faulty Switches
19. Routing for Livermore Loop 1 in case of Pipeline and Switch Faults
20. Routing for Livermore Loop 1 in case of Link and Switch Fault
Speedup vs. Loop Length
Speedup vs. Loop Length for Livermore Loops with No Recurrence Relationship
24. Speedup vs. Loop Length for Livermore Loops with Recurrence Relationship
25. Program Graph for Livermore Loop 1










CHAPTER I
INTRODUCTION
The development of processors with pipelined arithmetic units has offered an
economical way of increasing the speed of vector supercomputers since the 1960s,
when the first-generation vector supercomputers arrived [1]. The hardware, called
the pipeline, is divided into a series of substages called pipeline stages. Each
stage executes a portion of the overall task (process) performed by the pipeline.
The input for the overall task is streamed into the first pipeline stage, and the
output emerges from the last stage. Each intermediate stage receives its input
datum from the previous stage, computes, and sends its result directly to the
subsequent stage. While the result is being sent to the subsequent stage, a new
input datum may be received from the previous stage. As soon as a pipeline stage
receives a new input datum, it starts computing its portion of the task,
independent of the other stages, resulting in an overlapped execution similar to
an industrial assembly line [1]. This overlapped execution makes it possible for
the input to be streamed continuously into the first stage of the pipeline without
waiting for the output to emerge from the last stage. Thus, the throughput of a
processor that has pipelines is increased. The architecture of processors that use the
pipelining technique has evolved from a single-pipeline architecture similar
to the TI ASC [2] to a multipipelined architecture similar to the NEC SX-2 [3 and




TABLE I
THE TREND IN SUPERCOMPUTERS AND HIGH-END
MAIN FRAME SYSTEMS
System model | Architecture configuration | Max. no. of processors | Processor type | Max. memory capacity | Peak performance
Cray X-MP/4 | MP with SM and direct interconnect | 4 processors | Custom ECL | 16 MW in CM, 128 MW in SSD | 840 Mflops
Cray 2 | MP with SM and direct connect | 4 processors, 1 IOP | Custom ECL | 256 MW | 2 Gflops
Cray 3 | MP with SM | 16 processors | GaAs/ECL | 2 GW | 16 Gflops
Cyber 205 | UP with scalar processor and 4 vector pipelines | 1 processor | Custom CMOS | 4 MW | 400 Mflops
ETA-10 | MP with SM | 8 processors, 18 IOPs | Custom | 256 MW | 10 Gflops
Fujitsu VP-200 | UP with multiple functional pipes | 1 processor | Custom ECL | 32 MW | 533 Mflops
NEC SX-2 | UP with 16 functional pipes | 1 processor | Custom | 32 MW | 1.3 Gflops
Hitachi S-810 | UP with multiple pipelines | 1 processor | Custom | 32 MW | 840 Mflops
HEP-1 | MP with SM and switch network | 16 processors | Custom | 256 MW | 160 Mflops
IBM 3090/400/VF | MP with SM and direct connect | 4 processors | Custom TCM | 2 GB CM, 16 TB EM | 480 Mflops
Univac 1194/ISP x 2 | MP with SM and direct connect | 4 processors, 4 IOPs, 2 ISPs | Custom | 16 MW | 67 Mflops
CDC Cyberplus | MC with DM and ring connect | 64 processors | Custom | 512 KW per processor | 65 Mflops and 620 MIPS per processor
Connection Machine | SIMD with DM hypercube embedded in a global mesh | 64 K processing elements | VLSI/CMOS gate array | 32 MBytes | >1000 Mflops, 250 Mflops
BBN Butterfly | MP with SM via butterfly switch network | 256 processors | M68020 custom coprocessor | 128 MW | 256 Mips
Loral MPP | SIMD 128x128 mesh with DM | 16 K processing elements | CMOS/SOS, 8 processing elements per chip | 128 MB | 470 Mflops
IBM GF 11 | SIMD with reconfigurable Benes network | 576 processing elements | Custom floating point processor | 2 MB per processor, 1.1 GB total | 20 Mflops per processor, 11 Gflops
IBM RP3 | MP with SM/DM and fast network | 512 processors | 32-bit RISC | 128 MW | 800 Mflops, 1300 Mips
Cedar | Hierarchical MP with SM clusters | 256 processors | Alliant/FX | 256 MW | 3.2 Gflops
MP - Multiprocessor, SM - Shared memory, CM - Central memory, SSD - Solid state device,
IOP - I/O processor, MM - Memory-Memory, UP - Uniprocessor, DM - Distributed
memory, TCM - Thermal Conduction Module, MC - Multicomputer, EM - Extended
Memory, ISP - Integrated Scientific Processors
and mainframe systems available today. The CRAY-1 [1, 3 and 6] has twelve
pipelines, with each pipeline executing a different function. The Cyber 205 [1] and
the Fujitsu VP-200 [1 and 6] are also multipipelined. The NEC SX-2 processor has
four sets of pipelines. Each pipeline set consists of an adder unit, a multiply/divide
unit, a logical unit and a shift unit [4]. Each of these machines has a separate scalar
processing unit and a separate vector processing unit. The Titan design [7 and 8]
combines a vector floating-point unit and a scalar floating-point unit into a unified
structure. The designers of the CDC/NASF [1] have proposed a fault-tolerant
architecture which has an extra pipeline to be used if a fault occurs in another
pipeline.
The vector processors in the computer systems shown in Table I employ the
pipelining technique to increase processor speed. The speed can be increased
further by dynamically linking the pipelines present in the processor; the linked
pipelines are termed a pipeline chain. Pipeline chaining is a linking process that
occurs when the result obtained from one pipeline unit is fed directly to another [1].
Cray calls this dynamic link chaining [1], while the Cyber 205 terms it
short-stopping [1]. Recent RISC processors, such as the Intel i860, achieve a
primitive form of pipeline chaining. The i860 can link a multiplier
pipeline and an adder pipeline in a pipeline chain in its dual-mode [9].
The architecture of the vector processor in commercially available computer
systems like the Cray and the NEC SX-2 is classified as a register-register
architecture. In a typical register-register architecture, the pipelines in the
processor are connected to the vector registers by an
interconnection network. Such an architecture is shown in Figure 1. The
interconnection network may be a multistage switching network, as in the case of
the HEP-1 (Table I), or a crossbar network [10 and 11], as in the case of the




Figure 1. Architecture of a Register-Register Type Vector Processor
As seen in Table I, most of the supercomputer systems were fabricated with 
the prevalent transistor and ECL technologies. These machines tend to be large 
and costly. Advances in the area of Very Large Scale Integrated (VLSI) circuit 
fabrication technology today have made the fabrication of such large systems as 
single-chip processors possible [12]; single-chip fabrication of these systems will lead 
to a reduction in cost and area occupied by these systems. Wafer Scale Integration 
(WSI) is a single-chip technique to fabricate such complex systems [12, 13 and 14]. 
In 1966, TI fabricated the first Large Scale Integrated (LSI) circuit by
fabricating much smaller components on an intact substrate and then wiring
the functional components directly to each other [12]. This process,
WSI, can be regarded as a special form of packaging in which the wiring
normally used to interconnect packages is fabricated on the surface of a wafer
substrate containing the components, which is mounted inside a single package [12].
Internal wiring significantly reduces problems such as the wiring capacitances
present in conventional circuits or ceramic carriers [12]; these capacitances
reduce the speed of the fabricated system. Further, in WSI, interconnection
densities are increased since the wire dimensions are smaller than those of
conventional circuits or ceramic carriers [12]. Finally, the closeness of the
components fabricated using WSI leads to shorter interconnection wiring, which in
turn enhances the speed of the fabricated circuit and decreases the power
requirement for its I/O drivers [12].
The major design challenge in WSI, however, is the presence of faulty modules in
the fabricated circuit [12]. With the increased application of computer systems in
important activities that control everyday life, such as telecommunications and
banking, it is essential to make computer systems reliable and cheap [15]. The
existing technique of replacing circuit boards would be impractical if faulty modules
are present in computer systems fabricated using WSI, because we cannot assure
a design which will be one-hundred percent fault-free. Therefore, we have to design
the system fabricated using WSI in such a way that it remains
functional even though there are faulty modules in the system; a system
designed in this manner is termed a fault-tolerant system. Since we cannot assure
one-hundred percent functional cells or modules in computer systems fabricated
using WSI, we need a technique that will identify the faulty modules in the
fabricated system. After this identification, the technique should be able to
construct a fault-free system using the fault-free modules or cells in the wafer [14].
Faults in a system fabricated using WSI can be classified into two broad 
types: static faults and dynamic faults [17]. The static faults are permanent ones, 
such as broken bonds or cracks in the wafer which lead to the loss of a submodule or 
part of a circuit in the wafer [17]. The dynamic faults are temporary ones; for 
example, a shift in the threshold voltage [17]. In this report we will deal only with 
the static faults. 
The usual method of compensating for a faulty module in a system
fabricated using WSI is to add a spare module to the system during the fabrication
process itself [17]. After fabrication, the faulty modules present in the system are
detected, and a suitable reconfiguration algorithm, developed prior to
fabrication, is applied to substitute the spare module for the faulty module
[17]. The designers of the WSI memory chip have used this technique, termed the
redundancy technique, to achieve fault-tolerance [13]. Designers of commercial
systems like the VAX 8600 [18] and the IBM 3090 [19] have also achieved
fault-tolerance by using the techniques shown in Table II [15], which also shows the
techniques applied for achieving fault-tolerance in some of the other computer
systems available commercially.
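The spare-substitution idea behind the redundancy technique can be sketched in a few lines of Python. This is a generic illustration only, not the translation procedure developed later in this thesis: the function name `assign_modules`, the list-of-booleans fault map, and the first-fit assignment policy are all assumptions made for the example.

```python
def assign_modules(fault_map, required):
    """Map logical module indices to fault-free physical modules.

    fault_map[i] is True when physical module i was found faulty after
    fabrication (a static fault). `required` is the number of working
    modules the design needs; physical modules beyond that act as spares.
    The first-fit policy is an assumption made for illustration.
    """
    healthy = [i for i, faulty in enumerate(fault_map) if not faulty]
    if len(healthy) < required:
        raise ValueError("not enough fault-free modules to build the system")
    return {logical: physical for logical, physical in enumerate(healthy[:required])}


# Five physical modules fabricated for a design that needs four; module 1 is faulty.
mapping = assign_modules([False, True, False, False, False], required=4)
print(mapping)  # {0: 0, 1: 2, 2: 3, 3: 4}
```

In this toy case the single spare absorbs the single fault; a second faulty module would leave too few working modules and the function would raise an error.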
The first objective of this thesis is to develop the architecture of a
register-register type single-chip vector processor. As our second objective, the vector
TABLE II
A TAXONOMY OF FAULT TOLERANCE TECHNIQUES
IN COMMERCIAL COMPUTING SYSTEMS
Structure | Detection | Recovery | Sources of failure tolerated | Techniques
Uniprocessor
VAX 8600 | Hardware | Software | Hardware | Hardware error detection
IBM 3090 | Hardware | Hardware/Software | Hardware | Hardware error detection, retry, workaround
Multicomputer
Tandem | Hardware/Software | Software | Hardware, design, environment | Checkpointing, "I'm alive" messages
VAX 3000 | Hardware | Hardware | Hardware, environment | Duplication and matching
Multiprocessor
Teradata | Hardware | Software | Hardware, environment | Duplication
Sequoia | Hardware | Software | Hardware, environment | Duplication and matching
processor designed must have the capability to dynamically link the pipelines in the
processor. WSI is the technique to be used in the proposed fabrication of the
single-chip vector processor. Because faulty modules will be present in the
fabricated single-chip processor, fault-tolerance has to be achieved in the intended
processor design; this is our third objective.
To achieve our first two objectives, we propose a three-level vector processor
structure, which is referred to as the Fault-Tolerant Vector Processor (FTVP). To meet
our third objective, a simple translation procedure is provided to achieve
fault-tolerance within the FTVP. The study of the FTVP originates from a two-level
pipeline structure, called the pipeline net, proposed in [20]. The pipeline net, which
will be discussed in Chapter II, is capable of dynamically linking pipelines with the
help of the interconnection networks in the structure.
We have used the crossbar network as the interconnection network in the
FTVP to provide full interconnection capability for fast data access. The dynamic
pipeline-linking capability (chaining) of the FTVP is demonstrated using the
Livermore loop benchmark programs [21] listed in Appendix A. Speedup and
throughput analyses for the FTVP are carried out using the Livermore loops,
following the steps taken in [20]. Based on the analysis presented here, an FTVP
hardware architecture is recommended for fabrication using WSI.
CHAPTER II 
SINGLE-CHIP VECTOR PROCESSOR 
Interconnection Network
As seen from Figure 1, the interconnection network routes the vector data
from the vector registers to the pipelines in the vector processor and vice versa;
therefore, it is an important aspect of the processor. The two major network
techniques that exist today are the multistage switching network technique and the
crossbar network technique. A three-stage cube switching network [1] and a
two-sided normal crossbar network are shown in Figure 2 and Figure 3, respectively.
The crossbar network provides full interconnection capability between any
input-output terminal pair. Note that we deal only with
input-output combinations in which no two input-output terminal pairs have the same
output; that is, each input terminal is associated with a unique output terminal.
Due to its full interconnection capability, the crossbar network can connect all
such multiple input-output combinations without any conflict, that is, without
"blocking" [1]. Hence, the crossbar network is termed a non-blocking network; it also
provides a fast data transfer rate due to its non-blocking nature [11]. The major
concern in the development of the crossbar network is its cost. An NxN crossbar
network contains N^2 switches, where N is the number of inputs or outputs in the
network, so its cost grows as O(N^2); this growth rate is prohibitively high for
large N. Therefore, the
O(N^2) cost growth rate has proved to be the major obstacle in the crossbar network










Figure 2. 8x8 Cube Switching Network

Figure 3. 8x8 Conventional Crossbar Network
On the other hand, an NxN multistage switching network has a cost growth
rate which is proportional to O(N log2 N) [1], where N is the number of inputs or
outputs and N log2 N is the number of switches in the network. The O(N log2 N) cost
growth rate is lower than the O(N^2) cost growth rate. For example, consider the
case N = 32. The crossbar network requires N^2 = 1024 switches, whereas the
multistage network requires N log2 N = 32 x 5 = 160 switches. However, the apparent
advantage in cost for a multistage
switching network is offset by an increase in the transfer delay between any
input-output terminal pair due to the complexity of each multistage switch, and due to
data blocks (explained in the next paragraph) in the network [11]. For example, in
a Banyan tree multistage switching network the transfer delay through the
network is proportional to O(N^a (log2 N)^2), where 0 < a < 1 [11], whereas the
transfer delay through a crossbar network is proportional to O(N) [11]. Consider
again the case N = 32. The transfer delay through the crossbar network is
proportional to 32, while the delay through the Banyan tree approaches
32 x (log2 32)^2 = 32 x 25 = 800 as a approaches 1. The transfer delay is therefore
far larger in the multistage switching network.
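The N = 32 comparison can be reproduced with a short Python sketch. The function names are invented for the illustration, the switch counts follow the expressions quoted in the text, and the delay values are only the proportionality terms (the constant factors are not given), with a = 1 taken as the limiting case.

```python
import math


def crossbar_switches(n):
    """Number of switches in an n x n crossbar."""
    return n * n


def multistage_switches(n):
    """Switch count n * log2(n), as quoted in the text for a multistage network."""
    return n * int(math.log2(n))


def crossbar_delay(n):
    """Proportionality term for the crossbar transfer delay, O(N)."""
    return n


def banyan_delay(n, a=1.0):
    """Proportionality term N^a * (log2 N)^2 for the Banyan tree delay."""
    return (n ** a) * math.log2(n) ** 2


n = 32
print(crossbar_switches(n), multistage_switches(n))  # 1024 160
print(crossbar_delay(n), banyan_delay(n))            # 32 800.0
```

Smaller values of a reduce the Banyan term; for example, a = 0.5 gives roughly 141, which is still more than four times the crossbar term for N = 32.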
Most multistage switching networks are blocking networks. A network is
defined as blocking when there are conflicts in the use of network communication
links for the simultaneous connection of more than one input-output terminal pair.
Once again, note that we deal only with multiple input-output
combinations in which no two input-output terminal pairs have the same output. For
example, as shown in Figure 2, in the multistage cube network there is a
conflict in the network communication link between stage 1 and stage 0 (enclosed
between hatched lines) for the input-output terminal pairs (5-0) and (7-1);
therefore, the data for one input-output connection has to be held and
sent after the other connection is completed, thereby increasing the overall data
transfer time.
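The conflict for the pairs (5-0) and (7-1) can be checked with a small routing sketch. It assumes the usual generalized cube organization, in which the data pass through stages 2, 1 and 0 in that order and stage i replaces bit i of the current line number with bit i of the destination; the figure itself is not reproduced here, so this topology and the function names `cube_path` and `link_conflicts` are assumptions made for illustration.

```python
def cube_path(src, dst, n_bits=3):
    """Return [(stage, line)] giving the line occupied on leaving each stage."""
    line = src
    path = []
    for stage in reversed(range(n_bits)):          # stages n-1, ..., 0
        mask = 1 << stage
        line = (line & ~mask) | (dst & mask)       # take bit `stage` from dst
        path.append((stage, line))
    return path


def link_conflicts(pairs, n_bits=3):
    """List links that two different input-output pairs both need."""
    used = {}
    conflicts = []
    for src, dst in pairs:
        for stage, line in cube_path(src, dst, n_bits):
            key = (stage, line)                    # link leaving `stage` on `line`
            if key in used and used[key] != (src, dst):
                conflicts.append((stage, line, used[key], (src, dst)))
            else:
                used[key] = (src, dst)
    return conflicts


print(cube_path(5, 0))                    # [(2, 1), (1, 1), (0, 0)]
print(cube_path(7, 1))                    # [(2, 3), (1, 1), (0, 1)]
print(link_conflicts([(5, 0), (7, 1)]))   # [(1, 1, (5, 0), (7, 1))]
```

Under these assumptions both connections need line 1 on leaving stage 1, that is, the same link between stage 1 and stage 0, which agrees with the conflict described in the text.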
Multistage switching networks may be used to effect a compromise between
the data transfer rate and the network cost: selecting a multistage network trades
speed for lower cost. If a
multistage switching network is present in a system with multiple data flows, it will
slow the system due to its blocking nature. Therefore, in applications where speed
is not of prime importance, multistage switching networks can be used to
achieve a low-cost design [11]. When any multistage switching
network is fabricated using WSI, the network is vulnerable to failures during the
fabrication process due to the complexity involved in the multistage switches and
interconnection links [11]; these failures lead to the presence of faulty switches and
links in the network. Therefore, to improve the ability of multistage networks to
tolerate the presence of faulty modules, designers have proposed adding one or
more extra switch stages to the existing networks [22, 23 and 24]; however,
fault-tolerance in these designs is restricted to faulty modules in certain locations
rather than in any general location. The extra stage cube proposed in [22] is an
example of such a design. An eight-input extra stage cube network
cannot tolerate faulty switches that occur simultaneously in stages n and 0 and in
stages n-1, ..., 1 (where stage n is the extra stage provided) [22]. This does not
match the real situation, in which faulty modules are distributed at random locations.
Further, adding an extra stage to the existing multistage network increases the
complexity of the network, making it more vulnerable to failures during fabrication.
Since a switch in the multistage network is used to connect more than one input-output
terminal pair, a fault in the switch disables all the input-output connections made
through that switch. In a crossbar network, by contrast,
a switch connects only one input-output pair, so a switch fault affects only that
input-output connection.
The multistage switching networks also result in an irregular hardware
structure due to the complex switches and the criss-crossing of interconnection links; if
system designers wish to modify the existing network structure, that structure must
be altered in its entirety [11]. The crossbar network, on the other hand, has a
regular matrix-like structure similar to a memory structure; this matrix-like
structure gives system designers more flexibility in adding switches and links without
redesigning the whole structure. In addition, the crossbar network has a simpler
control mechanism than any multistage switching network because of the nature of
switch access and switch function: multistage switches are usually
multifunctional, whereas crossbar switches perform only one function [11]. The
crossbar network is arranged in the form of a matrix, and the arrangement is
referred to as a switching matrix [11]. Only two control signals, the
row and column signals, need to be generated by the control unit to access any
particular crossbar switch in the switching matrix [10]. This simple access leads to the
design of a simple control unit, which reduces the area occupied by the control unit
and reduces the control overhead [11]. This reduction in control unit area implies
that more area is available for the expansion of the existing crossbar network.
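The row and column addressing of a switching matrix can be modelled very simply in Python. The sketch below is a software analogy under stated assumptions, not a description of the hardware in [10] or [11]: the class name `Crossbar` and its methods are invented for the example, and the one-to-one restriction on connections follows the restriction stated earlier in this section.

```python
class Crossbar:
    """N x N switching matrix; switch (row, col) joins input row to output col."""

    def __init__(self, n):
        self.n = n
        self.closed = [[False] * n for _ in range(n)]

    def connect(self, row, col):
        """Close one switch, addressed only by its row and column signals."""
        if any(self.closed[row]):
            raise ValueError(f"input {row} is already connected")
        if any(self.closed[r][col] for r in range(self.n)):
            raise ValueError(f"output {col} is already in use")
        self.closed[row][col] = True

    def output_for(self, row):
        """Return the output connected to an input, or None if unconnected."""
        for col, is_closed in enumerate(self.closed[row]):
            if is_closed:
                return col
        return None


xbar = Crossbar(8)
for src, dst in [(5, 0), (7, 1), (0, 7)]:
    xbar.connect(src, dst)
print([xbar.output_for(r) for r in range(8)])
# [7, None, None, None, None, 0, None, 1]
```

The pairs (5-0) and (7-1), which conflict in the cube network, are set up here independently because each uses its own switch; a connection fails only when two inputs ask for the same output.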
Further, the VLSI implementation of a multistage network does not
necessarily occupy less area than a crossbar network, contrary to what the switch
counts given earlier in this section suggest. The Banyan multistage switching
network, which is representative of other multistage switching networks such as
the Cube and the Omega,
has a cost growth rate which is proportional to O(N^2) and not the predicted
O(N log2 N) [11]. Moreover, crossbar networks of size 32x32 and 32x64 have already
been fabricated using WSI [11]. The areas occupied by these networks are 3.4x3.4
mm^2 and 3.4x7.8 mm^2, respectively.
Therefore, based on the above discussion, we can state that for the WSI
single-chip fabrication of a vector processor, a crossbar network is more suitable
than a multistage switching network. Since the vector processor requires a network
that provides full interconnection capability and has a fast data-transfer rate, the
crossbar network is ideal.
Pipeline Net 
As stated in Chapter I, the study of the FTVP to be introduced in the next
section originates from the pipeline net discussed in this section. Therefore, we
discuss the pipeline net as an introduction to the FTVP. Using pipelines
and crossbar networks, a pipeline net is proposed in [20]. The pipeline net is
constructed by interconnecting multiple functional pipelines through two buffered
crossbar networks [20]. The pipeline net is a two-level structure, and is made of
multiple functional pipelines (FP), two buffered crossbar networks (BCN) and a set
of vector registers (R) [20]. Multiplexers are used to connect the Rs to the FPs, or
the FPs to the FPs [20]. All FPs are identical and multifunctional [20], and each FP
can execute addition, subtraction, division, multiplication or a logic function during
a particular cycle [20]. The Rs hold the operands and results. The BCNs provide
dynamic connecting paths among the FPs and Rs. A collection of fetch/store
pipelines is used to transfer data between the main memory and the Rs, similar to the
memory access pipelines present in the Cray X-MP and the Fujitsu VP-200 [20].
The pipeline net is used for the computation of Vector Compound Functions
(VCF) [20]. A VCF is a collection of linked scalar operations to be executed
repeatedly in a looping structure [20]. This looping structure is referred
to as the forpipe loop. The VCFs are converted into forpipe loops, and are
evaluated by the pipeline net [20]. The syntax of a forpipe loop is
    forpipe i := 1 to n do <body>
All VCFs are represented in this syntax [20]. For example, consider the following
FORTRAN loop:
      DO 1 I = 1, 400
    1 X(I) = Q + Y(I) * (R * Z(I+10) + T * P(I+11))
This is represented in the syntax of the forpipe loop as
    forpipe i := 1 to 400 do
    begin
        x[i] := q + y[i] * (r * z[i+10] + t * p[i+11])
    end
where i is the loop index, and the compound statement within begin-end forms the
loop body.
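For readers who prefer a general-purpose language, the same vector compound function can be written as an ordinary Python loop. This is only a restatement of the example above: the function name `evaluate_vcf` and the test data are invented, and index 0 of each list is left unused so that the 1-based indices of the original loop carry over unchanged.

```python
def evaluate_vcf(q, r, t, y, z, p, n=400):
    """Compute x[i] = q + y[i] * (r * z[i+10] + t * p[i+11]) for i = 1..n."""
    x = [0.0] * (n + 1)                 # index 0 unused (1-based indexing)
    for i in range(1, n + 1):
        x[i] = q + y[i] * (r * z[i + 10] + t * p[i + 11])
    return x


n = 400
y = [1.0] * (n + 1)                     # y[1..400]
z = [2.0] * (n + 11)                    # z[11..410] are read
p = [3.0] * (n + 12)                    # p[12..411] are read
x = evaluate_vcf(q=0.5, r=2.0, t=1.0, y=y, z=z, p=p, n=n)
print(x[1], x[400])                     # 7.5 7.5
```

Each iteration performs three multiplications and two additions, and no iteration depends on the result of another, which is what makes the loop body suitable for pipelined evaluation.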
The VCFs are evaluated in two steps. In the first step, configuration of the
pipeline net is done using the SET instructions. The actual execution is done in the
second step by the START instruction, enabling the operations of a particular cycle.
The SET instruction is used to select the function of a pipeline, or the connection
pattern in a crossbar network. The syntax of a SET instruction is
    SET unit, value
The unit is either a functional pipeline or a crossbar network [20]. If the unit refers
to a functional pipeline, then the value denotes the arithmetic or logic operation
performed by the pipeline. If the unit refers to a crossbar network, then the value
denotes the connection pattern in the crossbar network [20].
A START instruction is issued to enable the pipeline net operation. The
syntax of a START instruction is
    START m, k
This implies "start to execute for m clock periods with an operand entering the
pipeline net every k clock periods" [20].
The FORTRAN loop presented previously is used as an example for the
pipeline net implementation. The program graph of the loop is shown in
Figure 4. Let the add and multiply functions require two and four clock periods,
respectively. The program graph is mapped to the pipeline net as shown in Figure 5.
The following sequence of SET instructions is needed to set up the pipeline net:
    SET FP1, *;      : Set FP1 to multiplication
    SET FP2, *;
    SET FP3, +;
    SET FP4, *;
    SET FP5, +;
    SET BCN1, α;     : Set BCN1 to connection pattern α
    SET BCN2, β;
Figure 6 shows the crossbar network implementation obtained after the execution of
the SET instructions. After the pipeline net is configured, the VCF is evaluated by
passing the operands from the Rs through the pipeline net by issuing a START
instruction. The final result is stored back in a vector register (R).
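The timing of this example can be estimated with a small Python model. It is a sketch under several assumptions and not the mechanism of [20]: the assignment of program-graph nodes to FP1 through FP5 is inferred from the SET sequence, delays in the networks and buffers are ignored, and the names `FP_CONFIG`, `WIRING`, `latency` and `start_clocks` are invented for the illustration.

```python
# Function selected for each pipeline by the SET instructions.
FP_CONFIG = {"FP1": "*", "FP2": "*", "FP3": "+", "FP4": "*", "FP5": "+"}

# Assumed wiring of the program graph: each pipeline lists its two sources.
# Names that are not pipelines are operands read from the vector registers.
WIRING = {
    "FP1": ("r", "z"),       # r * z[i+10]
    "FP2": ("t", "p"),       # t * p[i+11]
    "FP3": ("FP1", "FP2"),   # sum of the two products
    "FP4": ("y", "FP3"),     # y[i] * (...)
    "FP5": ("q", "FP4"),     # q + ...
}

CLOCKS = {"+": 2, "*": 4}    # add and multiply latencies given in the text


def latency(node):
    """Clock periods until the first result leaves `node`."""
    if node not in FP_CONFIG:
        return 0                         # register operands are ready at once
    inputs_ready = max(latency(src) for src in WIRING[node])
    return inputs_ready + CLOCKS[FP_CONFIG[node]]


def start_clocks(n, k=1, output="FP5"):
    """Clock periods needed for n operands entering every k periods."""
    return latency(output) + (n - 1) * k


print(latency("FP5"), start_clocks(400))   # 12 411
```

Under these assumptions the first result appears after 12 clock periods, and a START instruction for the 400-element loop with k = 1 would need m = 411. Operands such as y[i] and q arrive before the partial results they combine with, so they would have to be delayed to stay aligned; the model does not represent those delays.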
Even though the pipeline net is capable of vector processing, it does not
favor WSI fabrication due to the irregularity in its structure. Multiplexers, present
in the pipeline net, require additional control signals apart from the control signals
required for the crossbar switches. This leads to the design of an additional control
unit for the multiplexers. Functional pipelines that can execute all the basic
arithmetic and logic functions are difficult to design, and are vulnerable to failures
during fabrication due to their complexity. The methods for introducing
delays in the crossbar network and the procedures for converting program graphs to
pipeline nets are complicated, and it has not been clearly determined whether
software or hardware will execute them. Further, register-register
transfer is not considered in the pipeline net design. Finally, to start the execution
phase of any pipeline net operation, an exact prediction of the number of clock
periods required for the execution of the VCFs is needed, as shown in the START
instruction; this will be difficult.
The proposed FTVP overcomes all the above difficulties in various ways.




Figure 5. Pipeline Net Implementation

Figure 6. Crossbar Network Implementation
fabrication. No multiplexers are present, and the switch control is simple. A
pipeline which can execute either multiplication or addition is considered, which
leads to a simpler pipeline design than that required for the pipeline net. Buffers in
the interconnection networks are eliminated, and data buffering (to be explained in
the section on the arithmetic units of the FTVP) is done by the hardware.
Hardware data buffering eliminates the need for an accurate prediction of the
number of clock periods required for the execution of any FTVP operation. Finally,
the FTVP is designed to execute all types of data transfer involved in the vector
processor.
Basic Structure of FTVP
A structure of the FTVP suitable for WSI is shown in Figure 7. A three-level
crossbar network is used to interconnect the arithmetic units and vector registers.
The first-level network is Crossbar Network 1 (CBN1), the second-level is CBN2,
and the third-level is CBN4. CBN1 is logically separated into two parts, CBN1 and
CBN3, to simplify understanding; physically, however, CBN3 is a part of CBN1. Both
vector and scalar processing are supported by this architecture, as explained in the
subsequent sections. The vector registers are connected to the arithmetic units by
CBN1. Feedback connections from the arithmetic units to the arithmetic units are
made by CBN2 and CBN3. CBN2 and CBN4 connect the arithmetic units to the
vector registers. The register-register connection is made by CBN4 through CBN1.
Various connection patterns in the crossbar networks are accomplished by
crossbar switch settings. The control signals needed to accomplish the various
connection patterns are simple, because the nature of switch
access in the crossbar network is simple as explained in the section on










[Figure 7: vector registers and arithmetic units interconnected by crossbar networks. CBN - CROSSBAR NETWORK]





which can store only one vector datum. Each arithmetic unit of the FTVP consists
of two buffers and a pipeline. The pipeline in the arithmetic unit can execute either
addition or multiplication. The two buffers present in the arithmetic unit provide
data buffering (to be explained in the section on the arithmetic units of the FTVP).
Since the FTVP is proposed as a single-chip processor, the presence of faulty
modules in the FTVP due to WSI fabrication is of major concern. The types of
faulty modules that may be present in the fabricated FTVP, and the method for
achieving fault-tolerance, are discussed in Chapter III. Simple instructions are
proposed to allow the FTVP to build a long pipeline chain. This will be the major
application of the FTVP. The hardware resources are "exposed" to the software and
are controlled by simple instructions, as in the case of a RISC processor [24 and 37].
We will restrict our discussion to the setting up of one pipeline chain at a time in the
FTVP. Multiple pipeline chains are not allowed to be set up at once in the FTVP. If more
than one pipeline chain needs to be set up, the pipeline chains are set up
sequentially. Having given a general introduction to the FTVP, we next discuss the
structure of a vector register and of the arithmetic unit present in the FTVP. The
crossbar network used to interconnect the arithmetic units and vector registers has
been discussed earlier in this chapter and, therefore, will not be discussed further.
Structure of the Vector Register 
Figure 8 shows the structure of a vector register in the FTVP, each of which
consists of an array of scalar registers and a set of control logic. Each vector register
can hold only one vector datum and is associated with four control logic codes:
count, skip, SO/DEST flag and busy flag. The control logic code is set explicitly by
a special instruction which will be discussed in the section on basic instructions. The
count gives the number of scalar elements of a vector stored in the vector register.
Scalar data is stored in a vector register with count = 1. The skip gives the skip
distance to the next element of a vector stored in the vector register; the SO/DEST
flag indicates whether the particular vector register is the source register or
destination register for a pipeline chain operation; the busy flag indicates the







[Figure 8: vector register with control fields SKIP, COUNT, BUSY and SO/DEST]





Each vector register has one input port and one output port. Data is written
into the vector register through its input port, and read from the vector register
through its output port. Each port of the register has an enable signal
accompanying the data in the port. Data in the ports of any vector register are valid
only when the enable signal of the respective port is true. Each vector register
allows only one access at a time, as each register has only one set of control logic codes.
Consequently, either a read or a write operation, but not both, can be performed by
a vector register at any particular time; simultaneous read and write operations
are not allowed by the vector registers of the FTVP.
To access the data stored in a vector register, the busy flag of the register
must first be checked. If the busy flag of the register is set, the
register is busy with a read or write operation and cannot be accessed. If the
busy flag of the vector register is not set, the access mode of the register may
be set by initializing the count, skip and SO/DEST codes. A trigger signal from the
control unit is then sent to access the vector register. This trigger signal
automatically sets the busy flag of the register, thereby preventing another
access until the task assigned to it is completed.
If a vector register is specified as a source register, then, upon receipt of the
trigger signal, the data stored in the register are sent through the output port, one
element per clock cycle, and the control logic is updated. During the read
operation, the output port enable signal of the vector register is true until all the
elements of the stored data are sent, thus indicating the validity of the emerging
data. The count logic of the register is decremented for every element of the vector
sent through the output port. When count = 0, all the elements of the vector
have been sent, and the enable signal of the output port is automatically reset.
If a vector register is specified as a destination register, then when the enable
signal of the input port becomes true, data in the input port is written into the
register, one element per clock cycle, updating the control logic until the register
receives all the elements of the assigned vector. When the enable signal of the input
port becomes low, it implies that all the elements of the assigned vector have been
written into the register. After data are sent or received by a register, the busy flag
of the register is automatically reset. If all the busy flags of the vector registers
present in the FTVP are kept in a centralized control unit, the set of flags will be
equivalent to the scoreboard register in a RISC processor [25].
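The register behaviour described above can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the class name, method names and the interpretation of skip as a stride between stored elements are ours and are not part of the FTVP design; the busy flag is cleared here as soon as count reaches zero.

```python
from dataclasses import dataclass, field


@dataclass
class VectorRegister:
    """Toy model of one vector register (all names are illustrative)."""
    cells: list = field(default_factory=list)  # array of scalar registers
    count: int = 0      # elements still to be sent or received
    skip: int = 1       # assumed: stride to the next element
    mode: str = "SO"    # "SO" (source) or "DEST" (destination)
    busy: bool = False  # set by the trigger, reset when the task ends
    pos: int = 0        # index of the next element to access

    def reg(self, length, skip, mode):
        """Set count, skip and SO/DEST; refused while the register is busy."""
        if self.busy:
            raise RuntimeError("register is busy")
        self.count, self.skip, self.mode, self.pos = length, skip, mode, 0

    def trigger(self):
        """Trigger signal from the control unit: sets the busy flag."""
        self.busy = True

    def _advance(self):
        self.pos += self.skip
        self.count -= 1
        if self.count == 0:
            self.busy = False  # task complete, busy flag reset

    def read_cycle(self):
        """One clock period of a source read; returns (enable, value)."""
        if not (self.busy and self.mode == "SO" and self.count > 0):
            return (False, None)
        value = self.cells[self.pos]
        self._advance()
        return (True, value)

    def write_cycle(self, enable, value):
        """One clock period of a destination write."""
        if enable and self.busy and self.mode == "DEST" and self.count > 0:
            while len(self.cells) <= self.pos:
                self.cells.append(None)
            self.cells[self.pos] = value
            self._advance()


src = VectorRegister(cells=[10, 20, 30, 40])
src.reg(length=2, skip=2, mode="SO")
src.trigger()
sent = [src.read_cycle() for _ in range(3)]
```

In this model the third read returns a false enable signal because count has already reached zero, which mirrors the automatic reset of the output port enable signal.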
A vector register of the FTVP can be read or written by the external memory,
by an arithmetic unit, or by another vector register. All source registers for a
particular pipeline chain send the stored data synchronously to the arithmetic units
with a common trigger signal from the control unit. This is termed the execution
phase and is started by a special instruction discussed in the section on the basic
instructions of the FTVP.
Structure of an Arithmetic Unit 
Figure 9 shows the structure of an arithmetic unit in the FTVP. Each
arithmetic unit consists of two buffers and a pipeline which is either an adder or a
multiplier. Each arithmetic unit has two input ports and one output port.
Each port of the arithmetic unit has a signal, termed the enable signal, to
indicate the availability of valid data in the port.
Figure 9. Structure of an Arithmetic Unit
The enable signal to the input ports of the arithmetic unit will originate 
either from the output port of a vector register or from the output port of another 
arithmetic unit. When the results begin to emerge out of the last stage of the 
pipeline in the arithmetic unit, the enable signal of the output port automatically 
becomes true. The enable signal remains true until all the results have emerged out 
of the pipeline. 
Since an arithmetic unit may be involved in a long pipeline chain, the two
input vectors to that arithmetic unit may arrive at different times, traversing
different paths in the FTVP. These path delays have to be equalized in order to
synchronize the arrival of both input vectors to the pipeline. This is referred to as
data buffering and is done by the two buffers present in the arithmetic unit. For
example, consider an arithmetic unit which requires two vectors IN(1) and IN(2) for
evaluation. Suppose IN(1) has arrived at input port(1) of the arithmetic unit
before IN(2). IN(1) is then held in a buffer of the
arithmetic unit until IN(2) arrives. The arrival of IN(2) is indicated by its enable
signal. When the enable signal of IN(2) becomes true, IN(2) has
arrived; therefore, IN(1) held in the buffer and IN(2), which has arrived at input
port(2), are sent to the pipeline in the arithmetic unit for processing. In this way, we
can ensure that both the input vectors are fed to the pipeline in the arithmetic unit
at the same time. This hardware buffering frees the compiler from the need to
provide data buffering, as in the case of the pipeline net design [20].
In a situation where one input vector to an arithmetic unit is shorter than the
other input vector, the last element of the shorter input vector
is held in a buffer and used as the input to the pipeline until all the elements of
the other input vector are sent to the pipeline. For example, suppose IN(1) and IN(2) are
the two input vectors to an arithmetic unit, and that IN(1) has 10 elements and
IN(2) has 20. After the first 9 elements of both IN(1) and IN(2) have been sent to
the pipeline, the 10th element of IN(1) (which is its last element) is held in the
buffer of the arithmetic unit and used as the input to the pipeline until all the
remaining elements of IN(2) have been sent to the pipeline. In this way, the FTVP
can handle scalar-vector operations, because a scalar datum is a vector with one
element.
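The two buffering rules described above (wait for the later input, and hold the last element of the shorter input) can be sketched in Python as follows. This is a minimal illustration, not the hardware design: the function name and the arrival-time parameters are assumptions introduced here, and both input vectors are assumed to be non-empty.

```python
import operator


def arithmetic_unit(in1, in2, op, arrival1=0, arrival2=0):
    """Model one arithmetic unit fed by two input streams.

    in1, in2  : non-empty lists of elements
    arrival1/2: assumed clock period at which each stream's first element
                reaches its input port
    Returns (start, results): the clock period at which both inputs enter
    the pipeline together, and the element-wise results.
    """
    # The input that arrives first waits in its buffer for the other one.
    start = max(arrival1, arrival2)
    n = max(len(in1), len(in2))
    results = []
    for i in range(n):
        # The last element of the shorter vector is held and reused.
        a = in1[min(i, len(in1) - 1)]
        b = in2[min(i, len(in2) - 1)]
        results.append(op(a, b))
    return start, results


# Scalar-vector multiply: the scalar is a one-element vector.
start, out = arithmetic_unit([2], [1, 2, 3], operator.mul, arrival1=0, arrival2=6)
```

With these inputs the scalar waits six clock periods in the buffer and is then reused for every element of the longer vector.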
Chaining Capability 
Pipeline chaining in the FTVP is a linking process that occurs when the
results obtained from one arithmetic unit are fed directly to another arithmetic unit.
The FTVP is capable of providing long pipeline chains, and accomplishes
chaining by various switch settings in the interconnection networks. This differs
from the dual-mode of the i860, where two pipelines are linked through the special
dual-mode instructions [9]. Consider the FORTRAN loop example whose program
graph was given in Figure 4. The program graph is translated to the FTVP
implementation as shown in Figure 10. Since all the source registers for a particular
pipeline chain send the data stored in them simultaneously upon receipt of a trigger
signal, Y and Q arrive at the input ports of AU3 and AU6, respectively, before the
other input vectors to AU3 and AU6 arrive. Therefore, Y and Q are held in the
buffers of AU3 and AU6, respectively, until the other input vectors to AU3 and
AU6 arrive. The delay involved due to different path lengths is thus equalized
automatically by this hardware buffering.
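The amount of buffering needed in such a chain can be estimated with a short calculation. The Python sketch below uses the arithmetic instructions listed for the example loop in Figure 12, where r5 holds Y and r6 holds Q. The pipeline depths (four clock periods for multiply, two for add) are borrowed from the pipeline net example and are only an assumption for the FTVP; crossbar delays are ignored, and the function name is ours.

```python
# Assumed pipeline depths in clock periods (not specified for the FTVP).
LATENCY = {"mul": 4, "add": 2}

# Arithmetic instructions of the example loop: (operation, s1, s2, d).
CHAIN = [
    ("mul", "r1", "r2", "t1"),
    ("mul", "r3", "r4", "t2"),
    ("add", "t1", "t2", "t3"),
    ("mul", "r5", "t3", "t4"),
    ("add", "r6", "t4", "r7"),
]


def buffer_waits(chain, latency):
    """Return per-instruction buffer waits and first-result times.

    Source registers are assumed to start sending at clock period 0,
    because they share a common trigger signal.
    """
    ready = {}   # clock period at which the first element of a name appears
    waits = []   # (destination, wait of s1, wait of s2)
    for op, s1, s2, d in chain:
        t1 = ready.get(s1, 0)
        t2 = ready.get(s2, 0)
        start = max(t1, t2)            # both inputs enter the pipeline together
        waits.append((d, start - t1, start - t2))
        ready[d] = start + latency[op]
    return waits, ready


waits, ready = buffer_waits(CHAIN, LATENCY)
for d, w1, w2 in waits:
    print(d, "first input waits", w1, "second input waits", w2)
```

Under these assumptions Y waits six clock periods and Q waits ten, while the inputs of the first three arithmetic units need no buffering.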
A series of FORTRAN kernels have been developed by the US National 
laboratories to evaluate the performance of vector processors and supercomputers 
[21]. These FORTRAN kernels are called the Livermore loops. The Livermore 
loops have been extracted from various vector processing applications. Fourteen 

















[Figure 10: FTVP implementation of the example loop. Labels: REGISTERS R1 to R7, Buffer, CBN1, CBN3, PIPELINES]














obtain a simple snapshot of the complex architectural performance of a vector
processor [21]. The Livermore loops provide typical benchmark programs for
vector processing applications. Even though the operation and performance
measurement of a vector processor cannot be expressed as a simple set of numbers,
the Livermore loops have served as the primary benchmark programs for nearly a
decade [21]. Since we have not actually fabricated the FTVP shown in Figure 7, the
Livermore loops present a theoretical way of evaluating the FTVP performance.
The Livermore loops are listed in Appendix A.
Figure 11 shows the pipeline chaining performed for Livermore loop 9. The
result obtained from one arithmetic unit is sent directly to the next arithmetic unit
through CBN2 and CBN3. Only the final result is sent back to a destination vector
register. Similar chaining operations can also be done for the other Livermore
loops. Livermore loops 8, 13 and 14 are not evaluated because of the indirect array
addressing in loop 13, and the unknown index calculations in loop 8 (SIG) and loop
14 (GRD, DEX). Livermore loops 4, 5, 6, 9 and 11 can be evaluated by the FTVP
only if simultaneous read and write operations are allowed by the vector registers,
since they have recurrence relationships; that is, Livermore loops
4, 5, 6, 9 and 11 have the same source and destination registers. Even though the
FTVP cannot at present evaluate vector loops with recurrence relationships,
we proceed on the assumption that it will be able to do so once the
registers allow simultaneous read and write operations. Livermore loops 2 and
3, however, can still be evaluated by the FTVP even though there is a recurrence relationship,
because the scalar data stored in the register Q for loops 2 and 3 will be sent
towards the arithmetic units before the results of the respective pipeline chains are
written back into the register Q. For this reason, no simultaneous read and write
operation is required for the evaluation of these loops.
Figure 11. Chaining Operation for Livermore Loop 9
The evaluation of Livermore loops 4, 5, 6, 9 and 11 by the FTVP will be slow,
as they are essentially scalar operations. For example, consider Livermore loop
11 shown below:
DO 11 K = 2, 1000
11 X(K) = X(K-1) + Y(K)
We see that there is a recurrence relationship between X(K) and X(K-1). A new
iteration for the calculation of X(K) cannot be started until the previous iteration is
completed. If a pipeline with three stages is assumed, then a new iteration can be
initiated only when the pipeline has delivered the result of the previous iteration; therefore,
no overlapping is done. Due to this non-overlapping execution, a new iteration can
only be initiated every third clock period. Only one stage of the pipeline will be
active in any particular clock period, while the other two stages are idle.
This type of problem has been studied by Kogge, who
proposed a double cycling method to reduce the pipeline idle time in this kind of
situation [26]. The double cycling method serves to reduce a vector loop that has
a recurrence relationship to a vector loop that has a latency of one [26]. Latency is
defined as the number of clock periods that elapse between two successive
iterations. When full pipelining is achieved, a new iteration can be started every
clock period because of overlapping. But in Livermore loop 11, since no overlapping is
done, the latency is high (three, if a pipeline with three stages is assumed). To




X(K)   = X(K-1) + Y(K)       . . . (1)
X(K-1) = X(K-2) + Y(K-1)     . . . (2)
X(K-2) = X(K-3) + Y(K-2)     . . . (3)
Substituting (3) into (2), and the result into (1), and placing the outcome back in
the original loop, we have
DO 11 K = 2, 1000 
B(K) = Y(K-2) + Y(K-1) + Y(K) 
11 X(K) = X(K-3) + B(K) 
This implies that the calculation of X(K) depends on X(K-3) and not on X(K-1);
that is, each iteration in the calculation of X(K) depends on the
result obtained three iterations earlier and not on the result of the previous
iteration. Therefore, overlapped evaluation can be done, and latency can be
reduced. The calculation of B(K) can be overlapped with the calculation of X(K),
and a new iteration can be initiated every clock period.
Kogge's double cycling method is a two-step evaluation method. For
example, in the modified Livermore loop 11 shown above, calculation of B(K) is the
first step, and calculation of X(K) is the second step. The two steps can be
combined into a single step and evaluated, as shown here:
DO 11 K = 2, 1000
11 X(K) = Y(K-2) + Y(K-1) + Y(K) + X(K-3)
The Kogge double cycling method is also used to recast Livermore loop 5, as shown
in Appendix B. The Kogge double cycling method is effective only when the
number of stages in all the pipelines of the FTVP is the same, so that the method can
be applied uniformly to all the pipelines no matter what function they execute [26].
In addition, to apply Kogge's double cycling method in the FTVP, the vector
registers of the FTVP must allow simultaneous read and write operations.
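That the recast loop produces the same values as the original recurrence can be checked numerically. The Python sketch below is a check under our own assumptions: arrays are indexed from zero, the function names are ours, and the first terms (for which X(K-3) does not exist) are computed with the original form of the loop.

```python
def original_loop(x, y):
    """X(K) = X(K-1) + Y(K): each iteration needs the previous result."""
    x = list(x)
    for k in range(1, len(x)):
        x[k] = x[k - 1] + y[k]
    return x


def recast_loop(x, y):
    """X(K) = Y(K-2) + Y(K-1) + Y(K) + X(K-3): each iteration needs the
    result obtained three iterations earlier."""
    x = list(x)
    # Start-up terms: X(K-3) is not available yet, so use the original form.
    for k in range(1, min(3, len(x))):
        x[k] = x[k - 1] + y[k]
    for k in range(3, len(x)):
        x[k] = y[k - 2] + y[k - 1] + y[k] + x[k - 3]
    return x


x0 = [1] + [0] * 9          # only X(1) is given initially
y0 = list(range(10))        # integer data, so the comparison is exact
same = original_loop(x0, y0) == recast_loop(x0, y0)
```

Integer test data are used so that the comparison is not affected by floating-point rounding; with real data the two forms can differ in the last bits because the additions are performed in a different order.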
Basic Instructions 
The basic instructions required for the FTVP are divided into four groups:
the register read/write and register control instructions, the arithmetic instructions,
the network instructions and the execution phase instructions, all of which are
shown in Table III. Each instruction shown in Table III is assumed to be executed
in one clock period. The initial group consists of a memory-register write operation
and a register-memory read operation which are executed by the LOAD and




BASIC INSTRUCTIONS FOR THE FTVP

Instructions                    Type                          Comments

Group 1: Register read/write and register control instructions

LOAD reg., variable             Memory to register write      Load the variable in the memory
                                                              to the specified register.
STORE reg., variable            Register to memory write      Store the data in the register in
                                                              the memory.
MOVE source, destination        Register-register transfer    Register-register transfer
                                instruction                   instruction.
REG(reg., len, skip, SO/DEST)   Register control              Loads the control parameters
                                                              into the control logic of the
                                                              specified register.
                                                              reg - Specified register
                                                              len - Length of the data stream
                                                              skip - Skip distance
                                                              SO/DEST - Source or destination

Group 2: Arithmetic instructions

add s1, s2, d                   Pipelined add instruction     s1, s2 denote the two sources
                                                              (registers "r" or temporary
                                                              variables "t"); d denotes the
                                                              destination (registers "r" or
                                                              temporary variables "t").
mul s1, s2, d                   Pipelined multiply
                                instruction

Group 3: Network instructions

set(k, i, j)                                                  Set the crossbar switch specified
                                                              by k, i, j, where "k" is the
                                                              network index (CBN1, CBN2,
                                                              CBN3, CBN4), "i" and "j" are
                                                              the row and column indexes of
                                                              the crossbar switch.
reset(k, i, j)                                                Resets the crossbar switch
                                                              specified by k, i, j.

Group 4: Execution phase instructions

start(list of registers)        Starts the execution phase    Starts the execution phase and
                                                              triggers the vector registers.
wait(list of registers)                                       Holds the FTVP idle until all the
                                                              specified registers are ready.
register-register data transfer. The REG instruction sets the control logic code of
the specified vector register. All register read or write instructions must be
preceded by a REG control instruction to set the access mode of the register. The
second group consists of the add and mul arithmetic instructions. Each arithmetic
instruction has two source variables and one destination variable. The source and
destination variables of an arithmetic instruction can be either a register or a
temporary variable (denoted by t). Temporary variables are provided in the
arithmetic instructions to facilitate setting up of a pipeline chain. The third group of
instructions, set and reset, are used to set and reset a crossbar switch, respectively.
The final group of instructions, start and wait, start and hold the execution phase in
the FTVP, respectively. The wait instruction holds the FTVP from executing any
new instruction until all the registers specified in the list receive their data. At
the start of the execution phase, all vector registers that send their data towards the
arithmetic units are triggered by a common trigger signal from the control unit;
this is achieved by the start instruction.
Since we have restricted ourselves to the execution of one pipeline chain at a
time in the FTVP, the typical pattern of instructions that occur for every pipeline













Set the control logic of the registers for the memory-register access.
LOAD instruction.
Wait for the data to be loaded.
Set the control logic of the registers for the pipeline chain.
Generate the arithmetic instructions.
Virtual to physical translation. Generate the switch settings.
Start the execution phase.
Wait for the results.
Set the register control of the result register for the register-memory access.
Store the result in the memory.
The control logic for the registers that need data from the memory is
initially set by the REG instructions for memory-register access. Memory access is
then done by the LOAD instruction. A wait instruction is issued to hold the FTVP
from executing any new instruction until all the registers in the list receive data from
the memory; this is necessary because data loaded from the memory to the vector
registers will vary in length, and the FTVP needs to wait until every specified
register receives every element of the assigned vector from the memory. After the
memory access, the mode of all source registers involved in a pipeline chain is reset
for the pipeline chain operation. This is because, during the LOAD
operation, these registers received data from the memory and were therefore
destination registers. For a pipeline chain, these registers send their data towards
the arithmetic units and are therefore source registers. The arithmetic instructions
are then generated. The compiler then generates the virtual addresses of the
pipelines based on the arithmetic instructions. A translation procedure generates
the physical pipeline addresses from the virtual addresses. The crossbar switch
settings are then computed from the physical addresses. The translation procedure,
and the reasons for it, are discussed in Chapter III. Once a
pipeline chain is set, the start instruction is issued to start the execution phase. A
wait instruction is issued to hold the FTVP from executing any new instruction until
the destination registers of the pipeline chain specified in the list receive every
element of the results. After the results are received, the mode of the destination
registers is set for the register-memory access. The final results are sent back to
the memory after completion of the execution phase by the STORE instruction.
Figure 12 shows the instructions generated for the example FORTRAN loop
we have been considering so far. Scalar data loaded in the registers r1, r3 and r6
have count = 1. Once the execution phase is started, data stored in the source
Sequence  Instructions                          Comments

1         REG(r1, 1, 1, DEST)                   The register control logics are set
          REG(r2, 400, 1, DEST)                 for memory-to-register read
          REG(r3, 1, 1, DEST)                   operation.
          REG(r4, 400, 1, DEST)
          REG(r5, 400, 1, DEST)
          REG(r6, 1, 1, DEST)

2         LOAD r1, R                            The variables are read from the
          LOAD r2, Z                            memory to the registers.
          LOAD r3, T
          LOAD r4, P
          LOAD r5, Y
          LOAD r6, Q

3         wait(r1, r2, r3, r4, r5, r6)          Wait till all the registers have
                                                received their data.

4         REG(r1, 1, 1, SO)                     The source registers for the pipeline
          REG(r2, 400, 1, SO)                   chain operations are once again set.
          REG(r3, 1, 1, SO)                     Register r7 is the final destination
          REG(r4, 400, 1, SO)                   register.
          REG(r5, 400, 1, SO)
          REG(r6, 1, 1, SO)
          REG(r7, 400, 1, DEST)

5         mul r1, r2, t1                        Arithmetic instructions.
          mul r3, r4, t2
          add t1, t2, t3
          mul r5, t3, t4
          add r6, t4, r7

6         Virtual to physical address           Translation procedure.
          translation and generation of
          switch settings

7         start(r1, r2, r3, r4, r5, r6)         Start of the execution phase.

8         wait(r7)                              Wait till all results are written into
                                                register r7.

9         REG(r7, 400, 1, SO)                   Register-to-memory write.

10        STORE r7, X                           Final results are stored in the
                                                memory.

Figure 12. Instructions for the Example Loop
registers are sent to the arithmetic units, one element per cycle. The final results 
stored in register r7 are sent back to the memory. 
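The effect of the arithmetic instructions in sequence 5 of Figure 12 can be traced with a small interpreter. The Python sketch below is only an illustration: the function name and the test data are ours, scalars are represented as one-element lists whose last element is reused (as described for the arithmetic unit buffers), and the index offsets of the loop, the timing, and the crossbar settings are not modeled.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

# Sequence 5 of Figure 12: (operation, source 1, source 2, destination).
INSTRUCTIONS = [
    ("mul", "r1", "r2", "t1"),
    ("mul", "r3", "r4", "t2"),
    ("add", "t1", "t2", "t3"),
    ("mul", "r5", "t3", "t4"),
    ("add", "r6", "t4", "r7"),
]


def run_chain(instructions, registers):
    """Evaluate the chain element by element and return all values.

    registers maps register names to non-empty lists; a one-element list
    acts as a scalar because its last element is held and reused.
    """
    values = dict(registers)
    for op, s1, s2, d in instructions:
        a, b = values[s1], values[s2]
        n = max(len(a), len(b))
        values[d] = [
            OPS[op](a[min(i, len(a) - 1)], b[min(i, len(b) - 1)])
            for i in range(n)
        ]
    return values


# Illustrative data: r1 = R, r2 = Z, r3 = T, r4 = P, r5 = Y, r6 = Q.
regs = {
    "r1": [2], "r2": [1, 2, 3],
    "r3": [3], "r4": [4, 5, 6],
    "r5": [1, 1, 2], "r6": [10],
}
result = run_chain(INSTRUCTIONS, regs)["r7"]
```

For this data the chain computes Q + Y*(R*Z + T*P) element by element, which is the value written into register r7.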
The compiler is responsible for setting up one pipeline chain at a time. Since
the FTVP is a RISC-type processor, the compiler is assumed to know the exact
number of hardware resources in the FTVP. If the instructions for a
particular vector loop require more hardware resources than the FTVP provides, the
compiler divides the bigger loop into smaller vector loops, and evaluates the smaller
vector loops one by one. The rules for division of a big vector loop into smaller
vector loops are presented in Chapter IV. Further, the compiler is responsible for
the allocation of all source and destination registers for a particular pipeline chain.
A pipeline chain is removed by the reset instructions if the need arises. Otherwise,
the settings of the previous pipeline chain may be maintained for future use. The
method by which the compiler achieves these is beyond the scope of this thesis.
We choose the software approach to build a pipeline chain in the FTVP over
the hardware approach to minimize the control unit hardware required; this
minimization reduces the chip space occupied by the FTVP. This
software approach will, however, introduce a large control overhead, which we expect to be
offset by the speed enhancement achieved by single-chip implementation.
CHAPTER III
FAULT-TOLERANCE IN THE FTVP
As stated in Chapter I, the first objective of this thesis is the proposed
fabrication of the FTVP introduced in Chapter II as a single-chip processor. The
technique for the proposed single-chip fabrication is WSI. A short discussion on
WSI is included in Appendix C. As discussed in Chapter I, a major drawback to the
FTVP fabrication using WSI is the presence of faulty modules in the FTVP. We
must find a way to detect the faulty modules present in the fabricated FTVP, and
provide a technique that would make the FTVP fault-tolerant. However, detection
of faulty modules in the FTVP is not the area of study of this thesis. Various
techniques for fault-detection exist and are dealt with extensively in [27, 28 and 29]. One
technique for achieving fault-tolerance is to provide redundancy in the form of
extra modules during fabrication to compensate for the faulty modules in the
system. After fabrication, suitable routing algorithms developed prior to fabrication
are applied to avoid the faulty modules, and utilize the good ones to obtain a
functional design. In our proposed FTVP, a simple translation procedure achieves
fault-tolerance. No redundancy in the form of extra FTVP modules will be provided
during the proposed fabrication. Our translation procedure avoids the bad modules
in the FTVP, and uses the good ones to set up a pipeline chain. The translation
procedure converts the virtual addresses generated by the compiler to the physical
addresses in the FTVP. Based on the physical addresses, the compiler computes the
crossbar switch settings to build a pipeline chain.
Types of Fault and Fault Vectors 
The proposed WSI architecture of the FTVP is shown in Figure 13. The
general types of faulty modules that may be present in the FTVP are classified as a
faulty arithmetic unit, a faulty register, a faulty crossbar network link, and a faulty
crossbar network switch. A faulty arithmetic unit will include both the buffer and
pipeline faults, and will be henceforth referred to as a pipeline fault. If a pipeline in
the fabricated FTVP is faulty, a new fault-free pipeline is selected. A new fault-free
register is selected in the case of a faulty register. Links in the crossbar networks
are classified as horizontal and vertical. Links are represented by the vertical and
horizontal lines in Figure 13. The links interconnecting CBN2 and CBN3 are
considered as the horizontal links of CBN2. The links interconnecting CBN2 and
CBN4 are considered as the vertical links of CBN4. In the case of a faulty
horizontal link in CBN1 or CBN4, the register connected to that particular link
cannot be accessed, and a new register must therefore be selected. For this reason,
we consider a horizontal link fault in CBN1 or CBN4 as a
register fault. In the case of a faulty horizontal link in CBN2 or CBN3, a new
fault-free horizontal link is selected. A vertical link fault in CBN1, CBN2 or CBN3 is
considered as a pipeline fault and a new pipeline is selected, because
a faulty vertical link in CBN1, CBN2 or CBN3 denies access to the pipeline
connected to that particular vertical link. A new fault-free vertical link is selected
in the case of a faulty vertical link
in CBN4. A switch fault present in any of the four networks is considered a vertical
link fault for simplicity. Therefore, the steps taken for avoiding a faulty vertical link,
discussed previously, are implemented in the case of a switch fault.
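The fault classification rules above can be summarized as a small decision function. The Python sketch below is our own restatement of the rules in the text; the function name, the string labels for fault kinds, and the wording of the returned actions are illustrative assumptions.

```python
def classify_fault(kind, network=None):
    """Return the recovery action for a fault.

    kind    : "pipeline", "register", "hlink", "vlink" or "switch"
    network : "CBN1", "CBN2", "CBN3" or "CBN4" (links and switches only)
    """
    if kind == "switch":
        kind = "vlink"  # a switch fault is treated as a vertical link fault
    if kind == "pipeline":
        return "select new pipeline"
    if kind == "register":
        return "select new register"
    if kind == "hlink":
        if network in ("CBN1", "CBN4"):
            return "select new register"         # treated as a register fault
        if network in ("CBN2", "CBN3"):
            return "select new horizontal link"
    if kind == "vlink":
        if network in ("CBN1", "CBN2", "CBN3"):
            return "select new pipeline"         # treated as a pipeline fault
        if network == "CBN4":
            return "select new vertical link"
    raise ValueError(f"unknown fault: {kind!r} in {network!r}")


action = classify_fault("switch", "CBN2")
```

Here a switch fault in CBN2 is handled like a vertical link fault in CBN2, so the pipeline attached to that link is replaced.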
The compiler, while assigning the pipelines and networks through the 















[Figure 13: proposed WSI architecture of the FTVP. Labels: VECTOR REGISTERS, CBN1]




































modules in the fabricated FTVP. But since the proposed FTVP is considered to be
a RISC-type processor, the compiler is assumed to know that there are faulty
modules in the fabricated FTVP. Therefore, the compiler generates the virtual
addresses which have to be mapped to the exact physical addresses of the FTVP,
taking into account the faulty modules in the FTVP. This mapping is accomplished
by a translation procedure. The translation procedure uses the fault vectors
presented below while determining the physical address in the FTVP.
The fault-free modules in the FTVP are represented at the software level by
the fault vectors. It is important to note that the fault vectors contain the
physical addresses of the fault-free modules in the FTVP. The fault-free registers
are represented by the register fault vector. Similarly, the fault-free
pipelines and the fault-free links are represented by the pipeline fault vector and
the link fault vector. Since a switch fault is considered a vertical link fault, no
separate fault vector is provided for the fault-free switches.
The pipeline fault vector contains the physical addresses of the fault-free
pipelines in the FTVP. A pipeline is treated as faulty if the pipeline itself is
faulty or if, in CBN1, CBN2 or CBN3, a vertical link or a switch serving it is faulty.
The physical address of any such pipeline is therefore absent from the pipeline
fault vector. Since the FTVP has two sets of pipelines, namely adders and
multipliers, each set has its own pipeline fault vector.
The register fault vector contains the physical addresses of the fault-free
registers in the FTVP. A register is treated as faulty if the register itself is
faulty or if its horizontal link in CBN1 or CBN4 is faulty. The physical address
of any such register is therefore absent from the register fault vector.
The link fault vector contains the physical address of the fault-free horizontal 
links in CBN1, CBN2 or CBN3. In the case of CBN4, the link fault vector contains 
the physical address of the fault-free vertical links. 
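As an illustration only, a fault vector can be sketched as a list of the physical addresses of the fault-free modules, where the position (column) of an entry plays the role of the virtual address. The function name and the 8-pipeline layout below are assumptions made for this sketch, not part of the FTVP definition.

```python
def fault_vector(physical_addresses, faulty):
    """List the physical addresses of the fault-free modules in ascending order.

    The position (column) of an entry is the virtual address that maps to it.
    """
    faulty = set(faulty)
    return [addr for addr in sorted(physical_addresses) if addr not in faulty]


# Hypothetical layout: multipliers at physical addresses 0-3, adders at 4-7,
# with pipelines 0 and 4 marked faulty.
multiplier_vector = fault_vector(range(0, 4), faulty={0, 4})
adder_vector = fault_vector(range(4, 8), faulty={0, 4})
print(multiplier_vector, adder_vector)
```

A faulty vertical link or switch in CBN1, CBN2 or CBN3 would simply add the affected pipeline to the `faulty` set in this sketch.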
Translation Procedure 
Since the proposed FTVP is considered to be a RISC-type processor, the
compiler is assumed to know the number of fault-free hardware
resources available for processing. Further, the compiler is assumed to be aware
that there are faulty modules in the fabricated FTVP, but not of the exact location
of those modules. Therefore, the compiler generates virtual addresses, and a
translation procedure maps these virtual addresses to physical addresses of the
FTVP that avoid the faulty modules. The compiler then computes the switch settings
needed to set up a pipeline chain, using the fault-free pipelines and links. The steps
involved in the translation procedure for each pipeline chain are:
Step 1: Determine the type of network.
Step 2: Assign the virtual pipeline and row addresses.
Step 3: Obtain the physical pipeline and row addresses by a 1-to-1 mapping
procedure.
Step 4: Calculate the switch settings.
Based on the type of instruction generated by the compiler, the switching
networks for that instruction are assigned in step 1 of the translation
procedure. The virtual pipeline address (p') and the virtual horizontal link address
corresponding to CBN2 (i") for an instruction are assigned in step 2 of the
translation procedure. The virtual addresses generated in step 2 are mapped 1-to-1
to the physical addresses of the FTVP using the fault vectors in step 3. The switch
settings are then computed from the pipeline physical addresses to set up a pipeline
chain in step 4. Once a pipeline chain is set, a start instruction is issued to begin the
execution phase. The group I instructions shown in Table III are not considered
instructions required for a pipeline chain, as no pipelines are involved while these
instructions are executed; only arithmetic instructions use the pipelines in the
FTVP.
We now present the rules for assigning the switching networks for the
arithmetic instructions as per step 1 of the translation procedure. As seen in Table
III, each arithmetic instruction generated by the compiler has three variables: two
source variables and one destination variable. A source variable that is a register
means that the instruction takes an input datum from a register. A source variable
that is a temporary variable means that the instruction takes an input datum
from another arithmetic unit. A destination variable that is a register means that
the results of the arithmetic operation have to be stored in a register. A destination
variable that is a temporary variable means that the results are to be sent to
another arithmetic unit. Networks are assigned to each variable of the
arithmetic instruction so as to satisfy these conditions, and the rules for
assigning the networks are given below:
Rule 1: If the source variable in an arithmetic instruction is a register, the network
assigned for that source variable is CBN1. This is because the arithmetic operation
requires data from a register, and the data has to be fetched through CBN1.
Rule 2: If the source variable in an arithmetic instruction is a temporary variable
(denoted by t), the network assigned for that source variable is CBN3. This is
because the arithmetic operation requires data from the output port of another
arithmetic unit, and the data has to be fetched through CBN3.
Rule 3: If the destination variable in an arithmetic instruction is a temporary
variable, the network assigned for that destination variable is CBN2. This is
because the results of the arithmetic operation have to be sent to another
arithmetic unit, and this has to be done through CBN2.
Rule 4: If the destination variable in an arithmetic instruction is a register, the
networks assigned for that destination variable are CBN2 and CBN4. This is
because the results of the arithmetic operation have to be stored in a register,
and this has to be done through CBN2 and CBN4.
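Rules 1 to 4 can be sketched as a small function. This is only an illustration: the function name is hypothetical, and it assumes that register names begin with "r" and temporary variables with "t", as in the instructions used in this chapter.

```python
def assign_networks(src1, src2, dest):
    """Return the networks for (first source, second source, destination)."""

    def is_register(name):
        # Assumption: registers are written r1, r2, ...; temporaries t1, t2, ...
        return name.startswith("r")

    def source_network(name):
        # Rule 1 (register source) and rule 2 (temporary source)
        return "CBN1" if is_register(name) else "CBN3"

    # Rule 4 (register destination) and rule 3 (temporary destination)
    dest_networks = ["CBN2", "CBN4"] if is_register(dest) else ["CBN2"]
    return source_network(src1), source_network(src2), dest_networks


print(assign_networks("r1", "t1", "r2"))
```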
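For example, consider the instruction add r1, t1, r2. The switching networks
assigned to its variables by rules 1 to 4 are

Variable              Network
r1 (first source)     CBN1
t1 (second source)    CBN3
r2 (destination)      CBN2 and CBN4
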
After step 1 is executed, the virtual addresses for an instruction are
generated by the compiler in step 2. The virtual pipeline address for a particular
instruction ranges from 0 to (number of fault-free pipelines - 1). For example, if
there are four fault-free pipelines in an FTVP, then the virtual pipeline address
ranges from 0 to 3. The virtual horizontal link address for a particular instruction
ranges from 0 to (number of fault-free horizontal links of CBN2 - 1). The
virtual-to-physical mapping procedure is then executed to obtain the physical
addresses in step 3. The switch settings required for building a pipeline chain are
then computed from the physical addresses in step 4. For ease of understanding,
however, this final step is discussed before step 3.
In step 4 of the translation procedure, the crossbar switches are set in the
interconnection networks by the compiler through the set instructions. Each
network assignment in step 1 has a corresponding set instruction. As seen from
Table III, a set instruction requires three parameters: k, the type of network; i, the
row number of a crossbar switch in that network; and j, the column number of the
switch in that network. The following are the rules for determining the three
parameters:
Rule 5: k for a switch setting is determined by rules 1, 2, 3 and 4.
Rule 6: In the case of CBN1 or CBN4, the index i equals the physical address of
the assigned register in the arithmetic instruction. For CBN2, a virtual index i" is
generated by the compiler, which is mapped to i by a 1-to-1 mapping procedure to
be discussed later. The physical index i thus generated is also the physical index i of
CBN3 for the same variable.
Rule 7a: For CBN1, CBN2 or CBN3, j is computed from the pipeline physical
address (p) obtained from the mapping procedure in step 3 of the translation
procedure. If j corresponds to the first source variable in the arithmetic instruction,
then j = 2p. If j corresponds to the second source variable, then j = 2p + 1.
If j corresponds to the destination variable, then j = p.
Consider the instruction add r1, t1, r2. Let the pipeline physical address (p) assigned
to this instruction be 2. By rule 7a, the j indexes corresponding to the three variables
in the instruction are

Variable              j index
r1 (first source)     4
t1 (second source)    5
r2 (destination)      2

Rule 7b: For CBN4, the physical index j is equal to the physical index i of the
preceding CBN2 switch setting (for the arithmetic instructions, every CBN4
assignment is preceded by a CBN2 assignment, according to rule 4).
Rule 8: For the MOVE instruction only one network is assigned, and that network is
CBN4. This is because the register-register data transfer performed
by the instruction has to be accomplished through CBN4 in an FTVP.
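The column-index computation of rule 7a is simple enough to state as code. The sketch below is only an illustration; the function name is hypothetical.

```python
def column_indexes(p):
    """Rule 7a: column index j for the first source, second source and destination.

    p is the physical address of the pipeline assigned to the instruction.
    """
    first_source = 2 * p
    second_source = 2 * p + 1
    destination = p
    return first_source, second_source, destination


# add r1, t1, r2 on the pipeline with physical address 2
print(column_indexes(2))
```

Each pipeline has two input ports and one output port, which is why the source columns are numbered 2p and 2p + 1 while the destination column is simply p.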
The virtual-to-physical 1-to-1 mapping procedure executed in step 3 of the
translation procedure is shown in Figure 14. The mapping procedure involves two
steps. In the first step, the pipeline physical address (p) is obtained from the
pipeline virtual address (p'). In the second step, the virtual index i" generated for
CBN2 is mapped to the physical index (i). The two
steps involved in the mapping procedure are explained below.
Step 1: The virtual pipeline address (p') generated by the compiler in step 2 of the
translation procedure forms the column number of the pipeline fault vector. The
address stored in that column of the pipeline fault vector is the physical address of
the pipeline. The physical column indexes j required for the switch settings in CBN1,
CBN2 or CBN3 are computed by rule 7a.
Step 2: The virtual link index for CBN2 (i") generated by the compiler in
step 2 of the translation procedure forms the column number of the link fault vector
of CBN2. The address stored in that column of the link fault vector (i') forms the
column number of the link fault vector of CBN3. The address stored in that column
of the link fault vector of CBN3 is the physical index i for CBN2 and CBN3 by rule
6. But if a CBN2 assignment is followed by a CBN4 assignment, then the index i'
obtained forms the column number of the link fault vector of CBN4. The address
stored in that column of the link fault vector is the physical index i for CBN2 and the
physical index j for CBN4 according to rule 7b.
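The two mapping steps amount to table lookups. The following is a minimal sketch under the assumption that each fault vector is stored as a Python list indexed by column number; the function names are hypothetical.

```python
def map_pipeline(p_virtual, pipeline_fault_vector):
    """Mapping step 1: the virtual pipeline address selects a column of the
    pipeline fault vector; the entry stored there is the physical address p."""
    return pipeline_fault_vector[p_virtual]


def map_link(i_virtual, cbn2_link_vector, next_link_vector):
    """Mapping step 2: i" selects a column of the CBN2 link fault vector,
    giving i'; i' then selects a column of the link fault vector of CBN3
    (or of CBN4 when the CBN2 assignment is followed by a CBN4 assignment)."""
    i_prime = cbn2_link_vector[i_virtual]
    return next_link_vector[i_prime]


identity = list(range(8))  # fault-free link fault vector of an 8-link network
print(map_pipeline(0, [1, 2, 3]), map_link(4, identity, identity))
```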
Step 1 (for CBN1, CBN2 and CBN3): virtual pipeline address -> pipeline fault vector
-> physical pipeline address -> physical switch index.
Step 2 (for CBN2, CBN3 and CBN4): link fault vector of CBN2 -> link fault vector of
CBN3 or CBN4 -> physical index i for CBN2 and CBN3 by rule 6, or physical index i
for CBN2 and index j for CBN4 by rule 7b.
Figure 14. 1-to-1 Mapping Procedure
Example 
The translation procedure discussed in the previous section is demonstrated
using the example FORTRAN loop for various fault conditions. An eight-pipeline
structure, shown in Figure 15, is considered with no register fault. Let the index k for
CBN1, CBN2, CBN3 and CBN4 be 0, 1, 2 and 3, respectively. The indexes i and j
are shown in Figure 15. Let the physical addresses of r1 to r7 be 0 to 6, respectively.
The pipeline fault vector for the multipliers is denoted as pipeline fault vector(1) and
that for the adders as pipeline fault vector(2). The arithmetic instructions for the loop
(from Figure 12), and the networks assigned by rules 1, 2, 3 and 4, are
Arithmetic instruction    Networks (first source, second source, destination)
mul r1 r2 t1              CBN1   CBN1   CBN2
mul r3 r4 t2              CBN1   CBN1   CBN2
add t1 t2 t3              CBN3   CBN3   CBN2
mul r5 t3 t4              CBN1   CBN3   CBN2
add r6 t4 r7              CBN1   CBN3   CBN2   CBN4












Fault Free Condition
Column                       0 1 2 3 4 5 6 7
Link fault vector of CBN1    0 1 2 3 4 5 6 7    Horizontal link physical address
Link fault vector of CBN2    0 1 2 3 4 5 6 7    Horizontal link physical address
Link fault vector of CBN3    0 1 2 3 4 5 6 7    Horizontal link physical address
Link fault vector of CBN4    0 1 2 3 4 5 6 7    Vertical link physical address
Physical addresses of the pipelines stored in the pipeline fault vectors with 
fault free condition are 
Column 0 1 2 3 4 5 6 7
Pipeline fault vector(1): 0 1 2 3
Pipeline fault vector(2): 4 5 6 7

Figure 15. Architecture of an 8 Pipelined Structure with Switch Indexes
We will now consider two instructions, mul r1, r2, t1 and add r6, t4, r7, for obtaining
the switch settings. First consider the mul r1, r2, t1 instruction. The virtual pipeline
address assigned to this instruction is 0, which forms the column number of
pipeline fault vector(1) (because the arithmetic instruction is a multiplication). The
address, 0, stored in that column is the pipeline physical address for the instruction.
The virtual index i" generated for CBN2 is 0, and this forms the column number of
the link fault vector of CBN2; the address stored in that column is 0, and this forms
the column number for the link fault vector of CBN3. The address stored in that
column, 0, is the physical index i for CBN2 and CBN3 by rule 6.















Physical pipeline    Physical switch setting
0                    (0, 0, 0)   (0, 1, 1)   (1, 0, 0)
1                    (0, 2, 2)   (0, 3, 3)   (1, 1, 1)
4                    (2, 0, 8)   (2, 1, 9)   (1, 2, 4)
2                    (0, 4, 4)   (2, 2, 5)   (1, 3, 2)
However, the instruction add r6, t4, r7 is a different case. Since the CBN2
assignment is followed by a CBN4 assignment, rule 7b is applied. The virtual index
i" generated for this instruction is 4. This forms the column number of the link fault
vector of CBN2. The address stored in that column forms the column number of
the link fault vector of CBN4. The address stored in that column of CBN4, 4, is the
physical index i of CBN2 and the physical index j of CBN4. The switch settings are

Variable    Switch setting    Rules
r6          (0, 5, 10)        5, 6, 7a
t4          (2, 3, 11)        5, 6, 7a
r7          (1, 4, 5)         5, 6, 7a
            (3, 6, 4)         5, 6, 7b
Figure 16 shows the FTVP implementation.

[Figure 16: diagram not recoverable from the source.]
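The whole translation procedure for the example loop can be sketched in a few lines. This is an illustration under stated assumptions, not the compiler's actual implementation: the function name is hypothetical, virtual pipeline and link addresses are assumed to be handed out sequentially (as the worked example suggests), each fault vector is assumed to be indexed from column 0, and the network indexes k are taken as CBN1 = 0, CBN2 = 1, CBN3 = 2 and CBN4 = 3, as read from the switch settings of the example.

```python
K = {"CBN1": 0, "CBN2": 1, "CBN3": 2, "CBN4": 3}


def translate(instructions, pipeline_vectors, link_vectors, register_address):
    """Return one list of (k, i, j) switch settings per arithmetic instruction."""
    next_virtual_pipeline = {op: 0 for op in pipeline_vectors}
    next_virtual_link = 0
    temp_row = {}  # physical CBN2/CBN3 row index i that carries each temporary
    settings = []
    for op, src1, src2, dest in instructions:
        # Mapping step 1: virtual pipeline address -> physical address p
        p = pipeline_vectors[op][next_virtual_pipeline[op]]
        next_virtual_pipeline[op] += 1
        line = []
        for src, j in ((src1, 2 * p), (src2, 2 * p + 1)):  # rule 7a
            if src.startswith("r"):  # rules 1 and 6
                line.append((K["CBN1"], register_address[src], j))
            else:  # rules 2 and 6
                line.append((K["CBN3"], temp_row[src], j))
        # Mapping step 2: virtual link index i" -> i' -> physical index i
        i_prime = link_vectors["CBN2"][next_virtual_link]
        next_virtual_link += 1
        if dest.startswith("r"):  # rule 4, then rule 7b for the CBN4 setting
            i = link_vectors["CBN4"][i_prime]
            line.append((K["CBN2"], i, p))
            line.append((K["CBN4"], register_address[dest], i))
        else:  # rule 3
            i = link_vectors["CBN3"][i_prime]
            temp_row[dest] = i
            line.append((K["CBN2"], i, p))
        settings.append(line)
    return settings


loop = [("mul", "r1", "r2", "t1"), ("mul", "r3", "r4", "t2"),
        ("add", "t1", "t2", "t3"), ("mul", "r5", "t3", "t4"),
        ("add", "r6", "t4", "r7")]
registers = {f"r{n}": n - 1 for n in range(1, 8)}  # r1..r7 at addresses 0..6
identity = list(range(8))
links = {"CBN2": identity, "CBN3": identity, "CBN4": identity}

fault_free = translate(loop, {"mul": [0, 1, 2, 3], "add": [4, 5, 6, 7]}, links, registers)
pipelines_0_and_4_faulty = translate(loop, {"mul": [1, 2, 3], "add": [5, 6, 7]}, links, registers)
for line in fault_free:
    print(line)
```

Under these assumptions the sketch reproduces the switch settings listed for add r6, t4, r7 in the fault-free condition, and changing only the pipeline fault vectors yields the settings of the pipeline fault condition.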





Pipeline Fault Condition
Let the pipelines with physical addresses 0 and 4 be faulty. The pipeline fault
vectors corresponding to this situation are

Column 0 1 2 3 4 5 6
Pipeline fault vector(1): 1 2 3
Pipeline fault vector(2): 5 6 7

Physical pipeline    Physical switch setting
1                    (0, 0, 2)    (0, 1, 3)    (1, 0, 1)
2                    (0, 2, 4)    (0, 3, 5)    (1, 1, 2)
5                    (2, 0, 10)   (2, 1, 11)   (1, 2, 5)
3                    (0, 4, 6)    (2, 2, 7)    (1, 3, 3)
6                    (0, 5, 12)   (2, 3, 13)   (1, 4, 6)   (3, 6, 4)

Figure 17 shows the FTVP implementation.
Switch Fault Condition
Figure 18 shows the location of switch faults in the FTVP. Since switch faults
are considered vertical link faults, the corresponding pipelines cannot be
accessed. Therefore, the pipeline fault vectors in this situation are

Column 0 1 2 3 4 5 6
Pipeline fault vector(1): 1 2 3
Pipeline fault vector(2): 4 6 7

Physical pipeline    Physical switch setting
1                    (0, 0, 2)    (0, 1, 3)    (1, 0, 1)
2                    (0, 2, 4)    (0, 3, 5)    (1, 1, 2)
4                    (2, 0, 8)    (2, 1, 9)    (1, 2, 4)
3                    (0, 4, 6)    (2, 2, 7)    (1, 3, 3)
6                    (0, 5, 12)   (2, 3, 13)   (1, 4, 6)   (3, 6, 4)

Figure 18 shows the FTVP implementation.
Figure 17. Routing for Livermore Loop 1 in case of Faulty Pipelines

[Figure 18: caption and diagram not recoverable from the source.]





Pipeline and Switch Fault
Figure 19 shows the location of faulty pipelines and switches in the FTVP.

Column 0 1 2 3 4 5 6 7
Pipeline fault vector(1): 1 2 3
[Pipeline fault vector(2): row not recoverable from the source.]

Physical switch setting
(0, 0, 2)    (0, 1, 3)    (1, 0, 1)
(0, 2, 4)    (0, 3, 5)    (1, 1, 2)
(2, 0, 10)   (2, 1, 11)   (1, 2, 5)
(0, 4, 6)    (2, 2, 7)    (1, 3, 3)
(0, 5, 12)   (2, 3, 13)   (1, 4, 6)   (3, 6, 4)

Figure 19 shows the FTVP implementation.
Link and Switch Fault
Figure 20 shows the location of faulty links and switches. The pipeline fault
vectors and link fault vectors are

[The pipeline fault vectors, the three horizontal link fault vectors and the vertical
link fault vector for this condition are not recoverable from the source.]

Physical switch setting
(0, 0, 2)    (0, 1, 3)    (1, 2, 1)
(0, 2, 4)    (0, 3, 5)    (1, 3, 2)
(2, 2, 8)    (2, 3, 9)    (1, 4, 4)
(0, 4, 6)    (2, 4, 7)    (1, 5, 3)
(0, 5, 10)   (2, 5, 11)   (1, 6, 5)

Figure 20 shows the FTVP implementation.























[Figure 19: caption and diagram not recoverable from the source.]

Figure 20. Routing for Livermore Loop 1 in case of Link and Switch Faults
CHAPTER IV 
EVALUATION 
Speedup and Throughput 
Evaluation of the FTVP is accomplished by using the Livermore loops listed
in Appendix A. The evaluation follows the steps taken in [20]. The time taken by a
pipeline chain in evaluating an assigned vector loop is computed first. Since vectors
from all the source registers of a pipeline chain emerge simultaneously, the total
time taken for the evaluation of a vector loop is equal to the sum of the time required
to set up a pipeline chain and the time required by a vector datum to traverse
the longest path from the input to the output of this pipeline chain. The
longest path is termed the critical path. Let S be the time taken to build a pipeline
chain, C be the number of pipelines in the critical path, $\alpha$ be the number of stages
in an interconnection network (CBN1, CBN2, CBN3 or CBN4), B be the number of
stages in a pipeline of the FTVP, $\sigma$ be the number of clock periods elapsed before
the next element of the result vector emerges from a pipeline chain, and N be the
assigned vector loop length.
We call the parameter S the setup time for a pipeline chain. The setup time
depends upon the time taken for decoding the arithmetic instructions corresponding
to the pipeline chain, the translation procedure involved and the execution of set
instructions to build that pipeline chain. For the calculation of setup time we need
an accurate prediction of the time taken for decoding the arithmetic instructions,
the execution of the translation procedure and the execution of set instructions.
Even though we have assumed in Chapter II that each arithmetic instruction will be
decoded in one clock cycle, we do not have the exact time taken for execution of the
translation procedure. We therefore consider the setup time as a function of the
time taken to execute the set instructions only; even though this will not be an
accurate setup time, it serves our purpose for the evaluation of the FTVP's
performance. Each arithmetic instruction requires three set instructions
corresponding to the network assignment in step 1 of the translation procedure. The
exception is when the destination variable of an arithmetic instruction is a register,
in which case an extra set instruction is needed for the CBN4 assignment. In
Chapter II we assumed that each set instruction is executed in one clock cycle;
therefore, the setup time is a function of the number of set instructions generated for
a pipeline chain. Consider the example FORTRAN loop, in which there are 5 arithmetic
instructions. The number of set instructions for this loop is 5*3 + 1 = 16. Since
each set instruction is executed in one clock cycle, the setup time for this example is
16 clock cycles. It is possible to set more than one switch in a clock cycle, but we
will stick to the original assumption of setting only one switch in a clock cycle.
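The setup-time count can be sketched as follows. This is only an illustration of the counting rule above (three set instructions per arithmetic instruction, plus one when the destination is a register); the function name and the tuple layout of an instruction are assumptions of this sketch.

```python
def setup_time(instructions):
    """Number of set instructions, which equals the setup time S in clock cycles
    under the one-set-instruction-per-cycle assumption."""
    total = 0
    for _op, _src1, _src2, dest in instructions:
        total += 3                 # two source networks plus CBN2
        if dest.startswith("r"):   # register destination also needs CBN4
            total += 1
    return total


example_loop = [("mul", "r1", "r2", "t1"), ("mul", "r3", "r4", "t2"),
                ("add", "t1", "t2", "t3"), ("mul", "r5", "t3", "t4"),
                ("add", "r6", "t4", "r7")]
print(setup_time(example_loop))
```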
The parameter C is the number of pipelines in the critical path. In Figure 4,
C = 4. The latency between successive elements of the result vector emerging from the
pipeline chain is $\sigma$, where latency is the number of clock cycles elapsed between
results. If there is a recurrence relationship in a vector loop similar to the loops
discussed in Chapter II, $\sigma$ will be equal to the number of stages in a pipeline of the
FTVP, because the pipeline needs to be drained before the next iteration in the
loop can be started; otherwise, $\sigma$ will be equal to one, as explained in Chapter II.
To reduce the latency in the case of a recurrence relationship, we use Kogge's
double cycling method discussed in Chapter II [26]. But to apply Kogge's
method, the number of stages in all the pipelines of the FTVP must be equal (for
an explanation refer to Chapter II). To meet this requirement, we assume a pipeline
similar to the Intel i860's pipeline [9]. The i860 has a three-stage multiplier and a
three-stage adder. Further, to apply Kogge's double cycling method, the vector
registers of the FTVP must allow simultaneous read and write operations.
We represent the time taken by a pipeline chain to evaluate the given vector loop
as T(n). From Figure 7 we see that a vector datum in the critical path of a
pipeline chain crosses CBN1 once at the start of the execution phase, CBN2 and
CBN3 (C - 1) times each within the pipeline chain, and CBN2 and CBN4 once at the end
of the execution phase. The vector datum also passes C times through a pipeline of B
stages. Therefore, the time taken for the first element of the result to be
written into the destination register of the pipeline chain is $C(B + 2\alpha) + \alpha$ clock
periods. After the first element of the result is written into the register, the
remaining elements of the result are written into the register every $\sigma$ clock periods.
Therefore, the time taken for the N results of a pipeline chain to be written into the
register is $C(B + 2\alpha) + \alpha + \sigma(N - 1)$, and the total time taken for evaluation is
$$T(n) = S + C(B + 2\alpha) + \alpha + \sigma(N - 1)$$
For the FTVP, the number of stages ($\alpha$) in any interconnection network is 1. Each
pipeline of the i860 has three stages; that is, B = 3. In our initial discussion in
Chapter II we stated that each element of a vector is written into the vector register
every clock period. Therefore, $\sigma = 1$ and
$$T(n) = S + 1 + C(3 + 2) + 1(N - 1) = S + 5C + N$$
In the example FORTRAN loop, N = 400, S = 16 (derived before) and C =
4. So, the number of clock cycles taken for evaluation of the example loop is 436
cycles. The total evaluation times for the Livermore loops, based on the
program graphs in Appendix B, are shown in Table IV. We assume that the same
adder can perform the required subtraction for the Livermore loops, by two's
complement addition through a special instruction sub.
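The evaluation-time formula can be checked numerically. The sketch below assumes the general form T(n) = S + C(B + 2α) + α + σ(N - 1) with B = 3 and α = 1 as defaults; the function and parameter names are chosen for this illustration only.

```python
def evaluation_time(S, C, N, B=3, alpha=1, sigma=1):
    """T(n) = S + C(B + 2*alpha) + alpha + sigma*(N - 1), in clock cycles."""
    return S + C * (B + 2 * alpha) + alpha + sigma * (N - 1)


# Example FORTRAN loop: S = 16, C = 4, N = 400, no recurrence (sigma = 1)
print(evaluation_time(S=16, C=4, N=400))

# Livermore loop 2 as listed in Table IV: S = 31, C = 5, sigma = 3,
# with N = 200 iterations (K = 1, 996, 5)
print(evaluation_time(S=31, C=5, N=200, sigma=3))
```

With σ = 1 the expression reduces to S + 5C + N, which is the form used for the example loop.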
TABLE IV
EVALUATION TIME OF THE LIVERMORE LOOPS

Loop      Number of arithmetic
number    instructions            C     $\sigma$    S     T(n)
1         5                       4     1           16    436
2         10                      5     3           31    654
3         2                       2     3           7     3015
4^1       3                       2     3           11    532
5^1,2     10                      10    1           33    416
6^1       8                       6     3           28    1055
7         16                      7     1           49    209
9^1       15                      5     1           46    171
10^1,3    9                       9     1           46    1091
11^1,2    3                       2     1           10    1019
12        1                       1     1           4     208
1 Can be executed only if the registers allow simultaneous read and write
2 Modified by Kogge's double cycling method
3 Has to be divided into small vector loops

In a vector loop that has a recurrence relationship, we cannot achieve a shorter
evaluation time by dividing the vector loop into smaller loops and isolating the
recurrent portion of the loop. This can be demonstrated by an example. Consider
the case of Livermore loop 2, which is divided into two loops as
Step 1:      DO 2 K = 1, 996, 5
        2    T(K) = Z(K)*X(K) + Z(K+1)*X(K+1) + Z(K+2)*X(K+2) +
                    Z(K+3)*X(K+3) + Z(K+4)*X(K+4)
Step 2:      DO 3 K = 1, 996, 5
        3    Q = Q + T(K)
The $\sigma$ for step 1 is 1, as there is no recurrence relationship. The $\sigma$ for step 2 is 3,
because there is a recurrence relationship. For step 2, each element of the input
vector T(K) to the adder has to be delayed by a number of clock cycles equal to the
number of stages in a pipeline of the FTVP. The total execution time for step 1 is
268 cycles, and for step 2 is 607 cycles. The total execution time for Livermore loop
2 is therefore 875 cycles, as opposed to the 654 cycles shown in Table IV, because of
the delay caused by step 2. Therefore, it is more advantageous to build one long
pipeline chain to evaluate a large vector loop than a few short pipeline chains to
evaluate small vector loops. But a long pipeline chain may require large buffers.
Therefore, a trade-off has to be made between the maximum allowable buffer
size and the evaluation time.
The next step is to evaluate the FTVP's throughput, which is defined as the
ratio of the total number of arithmetic operations performed by the FTVP to the
total time taken for performing these operations. Let M be the total number of
pipelines present in an FTVP; this implies that the total number of arithmetic
operations performed by the FTVP is MN. Therefore, the throughput of the FTVP is
$$H_M = \frac{MN}{S + 5C + N}$$
Two parameters have been proposed in [30] to measure the throughput
performance of a vector processor. $H_\infty$ is the maximum throughput obtained when
N approaches infinity, and $N_{1/2}$ is the minimum vector length needed to obtain half
the maximum throughput. For the FTVP, $H_\infty = M$ and $N_{1/2} = S + 5C$; this implies
that, in order to achieve the highest throughput ($H_\infty$), all the pipelines in the FTVP
have to be utilized. The initial time to set up a pipeline chain has to be minimized
in order to achieve at least half the maximum throughput for short vectors. By
limiting the number of set instructions and executing more than one set instruction
at a time, the initial setup time can be minimized. Figure 21 shows the results of the
throughput analysis of the FTVP for various vector loop lengths with a constant
number of pipelines. While calculating the setup time for Figure 21, we assumed M
arithmetic instructions for M pipelines so that the highest throughput is achieved.
The number of pipelines in the critical path (C) is 4 (except for M = 2, where C =
2). The results of Figure 21 demonstrate that for large vector loop lengths, the



















Figure 21. Loop Length vs. Throughput (curves for M = 2, 4, 8 and 16; horizontal axis: loop length)
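The throughput expression and its two characteristic parameters can be explored with a short sketch. The function names are hypothetical, and the values S = 16 and C = 4 are borrowed from the example loop purely for illustration.

```python
def throughput(M, N, S, C):
    """Throughput M*N / (S + 5C + N), in arithmetic operations per clock cycle."""
    return M * N / (S + 5 * C + N)


def half_performance_length(S, C):
    """Vector length at which the throughput reaches half of its maximum M."""
    return S + 5 * C


S, C, M = 16, 4, 8
n_half = half_performance_length(S, C)
print(n_half, throughput(M, n_half, S, C))   # half of M at N = S + 5C
print(throughput(M, 10**9, S, C))            # approaches M for very long vectors
```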
The next step is to evaluate the relative speedup of the time taken for
evaluation of a vector loop using M pipelines in the FTVP over the time taken for
evaluation of that vector loop using a single pipeline in the FTVP. A pipeline in the
FTVP can evaluate in two modes. The first is the vector mode, used when there is no
input-output recurrence relationship in the assigned vector loop. The second is
the scalar mode, used when there is a recurrence relationship in the assigned vector
loop. In the FTVP, each vector element emerging from a pipeline in the scalar mode
will have a latency equal to the number of stages in the pipeline, as the pipeline
has to be drained before the next input element can be sent to the pipeline.
Therefore, the time taken for evaluation by a pipeline in the scalar mode is $T_s = BN = 3N$.
The time taken for evaluation by a pipeline in the vector mode is
$T_v = S_1 + 3\alpha + B + N - 1$, as the datum passes through CBN1, CBN2, CBN4 and a
pipeline once before the results are written into a register. $S_1$, the setup time for a
single pipeline, for which we need four set instructions (2 for CBN1, 1 for CBN2,
and 1 for CBN4), is 4. Moreover, for the FTVP $\alpha = 1$ and B = 3, resulting in
$T_v = N + 9$. If we need $M_1$ pipelines in the scalar mode and $M_2$
pipelines in the vector mode, the time needed to evaluate the vector loop by a single
pipeline is $T(1) = T_s M_1 + T_v M_2 = 3M_1N + M_2(N + 9)$.
Speedup ($S_p$) of the FTVP is the ratio of the time taken for evaluation of a
vector loop by a single pipeline to the total time taken for the evaluation of the vector
loop in a pipeline chain. Therefore,
$$S_p = \frac{T(1)}{T(n)} = \frac{3M_1N + M_2(N + 9)}{S + 5C + N}$$
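A minimal sketch of the speedup expression follows, assuming the forms T(1) = 3·M1·N + M2·(N + 9) and T(n) = S + 5C + N given above. The function name and the numeric values used in the call are hypothetical.

```python
def speedup(M1, M2, N, S, C):
    """Ratio of single-pipeline evaluation time T(1) to pipeline-chain time T(n)."""
    t_single = 3 * M1 * N + M2 * (N + 9)   # M1 scalar-mode and M2 vector-mode pipelines
    t_chain = S + 5 * C + N
    return t_single / t_chain


# For very long vectors the setup and fill terms vanish and the ratio
# tends to 3*M1 + M2, which is why the speedup flattens out.
print(speedup(M1=2, M2=2, N=10**9, S=16, C=4))
```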
Figure 22 shows the speedup analysis of the FTVP for various vector loop
lengths with a constant R (where $R = M_1/M$ and $M = M_1 + M_2$). The figure shows
that the speedup is constant for large vector loop lengths. We assumed that the
number of pipelines in the critical path (C) is 4, and the total number of pipelines

Figure 22. Speedup vs. Loop Length (curves for R = 0, 1/4, 1/2, 3/4 and 1)
Figure 23 shows the speedup analysis for the Livermore loops that have no
input-output recurrence relationship, and Figure 24 shows the speedup analysis for
the Livermore loops that have an input-output recurrence relationship. The speedup
is calculated as follows. Consider the evaluation of Livermore
loop 1 in an FTVP with one adder pipeline and one multiplier pipeline; that is, M =
2. Here $M = M_2$, as there is no recurrence relationship in this loop. Evaluation of the
loop is divided into three steps, as there are only two pipelines in the FTVP. The
three steps are
Step 1: 1 multiplication. T1(N) = 4 + 400 + 5 = 409
Step 2: 1 multiplication and 1 addition. T2(N) = 7 + 400 + 10 = 417
Step 3: 1 multiplication and 1 addition. T3(N) = 417
T(n) = T1(N) + T2(N) + T3(N) = 1243
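The three-step total can be reproduced with the simplified formula T(n) = S + 5C + N. The sketch below is illustrative; the helper name is hypothetical, and the (S, C) pairs are read off the three steps listed above.

```python
def chain_time(S, C, N):
    """Simplified evaluation time T(n) = S + 5C + N (alpha = 1, B = 3, sigma = 1)."""
    return S + 5 * C + N


N = 400
# (S, C) per step: one multiplication, then two chains of one multiply and one add
steps = [(4, 1), (7, 2), (7, 2)]
step_times = [chain_time(S, C, N) for S, C in steps]
total_time = sum(step_times)
print(step_times, total_time)
```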







Figure 23. Speedup vs. Loop Length for Livermore Loops (horizontal axis: number of pipelines)

Figure 24. Speedup vs. Loop Length for Livermore Loops
with Recurrence Relationship (horizontal axis: number of pipelines)
Proposed Architecture 
As a final step in the evaluation we propose an architecture of the FTVP
based on the Livermore loop analysis. This architecture is proposed to keep the
number of interconnection links and crossbar switches in the FTVP to a minimum.
The links in the FTVP are laid out using good conduction lines on the
semiconductor wafer. These links occupy much of the chip space. Therefore, the
number of links must be kept to a minimum. Further, reducing the number of links
reduces the number of crossbar switches in the interconnection networks, which
further conserves chip space. For example, since most of the vector operations
send only their results to the registers, the number of links connecting CBN2 and
CBN4 can be reduced; this leads to a reduction in the number of crossbar switches
in CBN2 and CBN4.
Figure 25 shows the program graph for Livermore loop 1. An FTVP
hardware with 5 pipelines, 4 pipeline-pipeline links and 1 pipeline-register link is
sufficient for evaluation of this loop. Similar kind of results are obtained for the

Figure 25. Program Graph for Livermore Loop 1 (legend: pipeline-pipeline link, pipeline-register link)

TABLE V
LIVERMORE LOOP PARAMETERS
3 2 4 1 
5 5 9 1 
i 1 1 1 
1 2 1 2 
5 5 9 3 
3 5 6 4 
8 8 15 1 
7 8 14 1 
9 8 9 
3 2 1 














4 Can be executed only if the registers allow simultaneous read and write
5 Modified by Kogge's double cycling method
As seen from Table V, an F1VP hardware with 16 pipelines (8 adders and 8 
multipliers}, 15 pipeline-pipeline links, 3 pipeline-register links and 4 register-
register links is sufficient for evaluation of most of the livermore loops. These 
requirements are incorporated into the proposed architecture shown in Figure 26. 
3 
Figure 26. Hardware of the Recommended Structure 
From Table V we can see that livermore loop 10 is an exception that has to 
be handled by the proposed hardware in Figure 26. In such cases the original vector 
loop has to divided into smaller vector loops and evaluated one by one. Certain 
criteria for handling these exceptional situations have been established. 
Criterion 1 for the division of a big vector loop is to divide according to the 
number of destination registers and register-register transfers required in the vector 
loop. Criterion 2 is to divide the vector loop according to the number of pipelines 
available in the FTVP. The register-register transfer instructions that can be moved 
outside the loop by rearranging the original vector loop and executed independently, 
are done so; this is criterion 3. We can see that by rearranging the instructions of 
Livermore loop 10, some of the register-register transfer instructions can be moved 
outside the loop and performed independently. The rearranged loop is 
No. 10    DO 10 I = 1, 100
          AR = CX(5,I)
          BR = AR-PX(5,I)
          CR = BR-PX(6,I)
          PX(5,I) = AR
          PX(6,I) = BR
          AR = CR-PX(7,I)
          BR = AR-PX(8,I)
          PX(7,I) = CR
          PX(8,I) = AR
          CR = BR-PX(9,I)
          AR = CR-PX(10,I)
          PX(9,I) = BR
          PX(10,I) = CR
          BR = AR-PX(11,I)
          CR = BR-PX(12,I)
          PX(14,I) = CR-PX(13,I)
          PX(11,I) = AR
          PX(12,I) = BR
          PX(13,I) = CR
       10 CONTINUE
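That the rearrangement leaves the results unchanged can be checked numerically. The sketch below is an illustration written for this purpose, not code from the thesis: the function names, the dictionary representation of the PX rows and the random test data are all assumptions. It executes the statement order of the original Livermore loop 10 and the rearranged order on the same data and compares every PX row.

```python
import random

ROWS = range(5, 15)  # PX rows 5..14 are the ones loop 10 touches


def loop10_original(cx5, px):
    """Statement order of the original Livermore loop 10."""
    px = {r: list(px[r]) for r in ROWS}
    for i in range(len(cx5)):
        ar = cx5[i]
        br = ar - px[5][i]
        px[5][i] = ar
        cr = br - px[6][i]
        px[6][i] = br
        ar = cr - px[7][i]
        px[7][i] = cr
        br = ar - px[8][i]
        px[8][i] = ar
        cr = br - px[9][i]
        px[9][i] = br
        ar = cr - px[10][i]
        px[10][i] = cr
        br = ar - px[11][i]
        px[11][i] = ar
        cr = br - px[12][i]
        px[12][i] = br
        px[14][i] = cr - px[13][i]
        px[13][i] = cr
    return px


def loop10_rearranged(cx5, px):
    """Statement order of the rearranged loop."""
    px = {r: list(px[r]) for r in ROWS}
    for i in range(len(cx5)):
        ar = cx5[i]
        br = ar - px[5][i]
        cr = br - px[6][i]
        px[5][i] = ar
        px[6][i] = br
        ar = cr - px[7][i]
        br = ar - px[8][i]
        px[7][i] = cr
        px[8][i] = ar
        cr = br - px[9][i]
        ar = cr - px[10][i]
        px[9][i] = br
        px[10][i] = cr
        br = ar - px[11][i]
        cr = br - px[12][i]
        px[14][i] = cr - px[13][i]
        px[11][i] = ar
        px[12][i] = br
        px[13][i] = cr
    return px


random.seed(0)
cx5 = [random.random() for _ in range(100)]
px0 = {r: [random.random() for _ in range(100)] for r in ROWS}
same = loop10_original(cx5, px0) == loop10_rearranged(cx5, px0)
print("rearranged loop matches original:", same)
```

Both orderings perform the same subtractions on the same operands, so the comparison is exact and needs no floating-point tolerance.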
The instructions AR = CX(5,I), PX(11,I) = AR, PX(12,I) = BR, and PX(13,I) =
CR can be moved outside the loop. The register-register transfer involved in AR =
CX(5,I) should be done before the pipeline chain operation begins, and
the other three instructions should be done after the pipeline chain operation. The
following is the vector loop after the four instructions have been moved outside the loop.
No. 10    DO 10 I = 1, 100
          BR = AR-PX(5,I)
          CR = BR-PX(6,I)
          PX(5,I) = AR
          PX(6,I) = BR
          AR = CR-PX(7,I)
          BR = AR-PX(8,I)
          PX(7,I) = CR
          PX(8,I) = AR
          CR = BR-PX(9,I)
          AR = CR-PX(10,I)
          PX(9,I) = BR
          PX(10,I) = CR
          BR = AR-PX(11,I)
          CR = BR-PX(12,I)
          PX(14,I) = CR-PX(13,I)
       10 CONTINUE
Since the architecture proposed in Figure 26 has only 4 register-register links,
applying criterion 1 to the above instructions divides Livermore loop 10 into two
vector loops:







          DO 10 I = 1, 100
          BR = AR-PX(5,I)
          CR = BR-PX(6,I)
          PX(5,I) = AR
          PX(6,I) = BR
          AR = CR-PX(7,I)
          BR = AR-PX(8,I)
          PX(7,I) = CR
          PX(8,I) = AR
          CR = BR-PX(9,I)
       10 CONTINUE

          DO 10 I = 1, 100
          AR = CR-PX(10,I)
          PX(9,I) = BR
          PX(10,I) = CR
          BR = AR-PX(11,I)
          CR = BR-PX(12,I)
          PX(14,I) = CR-PX(13,I)
       10 CONTINUE
Therefore, Livermore loop 10 can be evaluated in two steps by the proposed
architecture in Figure 26. In obtaining the two steps for the evaluation of Livermore
loop 10, we assumed that there are no faulty modules in the architecture proposed in
Figure 26. If faulty modules are present, fewer hardware resources are
available for processing; in such a case the original loop has to be further
subdivided. It is the duty of the compiler to identify these kinds of exceptional
situations and to divide the assigned vector loop by applying the three criteria.
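The check a compiler would make under criterion 1 can be sketched as a count of register-register transfers per sub-loop. The Python below is a hypothetical illustration, not the translation procedure of this thesis: the helper names `is_transfer` and `count_transfers` are assumptions, and a statement is treated as a register-register transfer when its right-hand side contains no arithmetic operator.

```python
REGISTER_LINKS = 4  # register-register links in the recommended hardware


def is_transfer(stmt):
    """Assumed rule: a plain copy (no arithmetic on the right-hand side)."""
    rhs = stmt.split("=", 1)[1]
    return not any(op in rhs for op in "+-*/")


def count_transfers(stmts):
    return sum(is_transfer(s) for s in stmts)


# Livermore loop 10 after four instructions were moved outside the loop.
loop10 = [
    "BR = AR-PX(5,I)", "CR = BR-PX(6,I)", "PX(5,I) = AR", "PX(6,I) = BR",
    "AR = CR-PX(7,I)", "BR = AR-PX(8,I)", "PX(7,I) = CR", "PX(8,I) = AR",
    "CR = BR-PX(9,I)",
    "AR = CR-PX(10,I)", "PX(9,I) = BR", "PX(10,I) = CR",
    "BR = AR-PX(11,I)", "CR = BR-PX(12,I)", "PX(14,I) = CR-PX(13,I)",
]
first, second = loop10[:9], loop10[9:]  # the two sub-loops given above

for name, part in (("whole", loop10), ("first", first), ("second", second)):
    n = count_transfers(part)
    print(name, n, "fits" if n <= REGISTER_LINKS else "must be divided")
```

Under this rule the undivided loop needs 6 transfers, more than the 4 links available, while the two sub-loops need 4 and 2 respectively. With faulty modules the limit would be lower, and the same count would indicate where further subdivision is needed.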
CHAPTER V
CONCLUSIONS AND FUTURE WORK
A vector processor that can be fabricated as a single-chip processor using the
WSI fabrication technique was designed in this thesis. The chaining capability of
the vector processor was demonstrated using the Livermore loops. A basic
instruction set and the translation procedure for the vector processor were
developed, a speedup analysis was done, and the fault-tolerant capability of the
vector processor was demonstrated. Based on the Livermore loop analysis, a
hardware structure was recommended for fabrication. A method of handling large
vector loops with this proposed hardware was also discussed.
As seen from the Livermore loop analysis, an FTVP with 16 pipelines is
sufficient for most practical problems. If more pipelines are required, the problem
to be executed can be broken down into smaller vector loops that are executed one
by one, or the number of pipelines in the FTVP can be increased.
Further study needs to be done on providing multiple pipeline chains in a
single FTVP. An intelligent compiler needs to be developed to implement the
proposed translation procedure. The vector registers of the FTVP need to be
expanded to store more than one vector datum, to allow irregular vector accesses
for complex vector applications, and to allow simultaneous read and write
operations. In conclusion, the FTVP provides efficient dynamic chaining and
fault-tolerance capability. The fault-tolerant capability, along with WSI
fabrication, paves the way for an efficient single-chip processor. The FTVP can be
used for applications such as wave equations, heat transfer and signal processing.
REFERENCES 
[1] Kai Hwang and Faye A. Briggs, "Computer Architecture and Parallel Processing",
McGraw-Hill series, 1984.
[2] Douglas J. Theis, "Vector Supercomputers", Computer, Volume 4, Number 7, pp
52-61, 1974.
[3] Olaf Lubeck, "Supercomputers Performance: The Theory, Practices and Results",
Advances in Computers, pp 308-360.
[4] Rod A. Fatoohi, "Vector Performance Analysis of NEC SX-2", Computer
Architecture News, Volume 18, Number 3, pp 389-400, Sep 1990.
[5] Kai Hwang, "Advanced Parallel Processing with Supercomputer Architectures",
Proceedings IEEE, Volume 75, Number 10, October 1987.
[6] Olaf Lubeck, James Moore and Raul Mendez, "A Benchmark Comparison of
Three Supercomputers: Fujitsu VP-200, Hitachi S810/20, Cray X-MP/2",
Computer, pp 10-23, December 1985.
[7] Tom Diede, Carl F. Hagenmaier, Glen S. Miranker, Jonathan J. Rubinstein
and William S. Worley, Jr., "The Titan Graphics Supercomputer
Architecture", Computer, pp 13-25, September 1988.
[8] Norman P. Jouppi, Jonathan Bertoni and David W. Wall, "A Unified
Vector/Scalar Floating-Point Architecture", Computer Architecture News,
Volume 17, Third International Conference on ASPLOS, April 3-6, pp 134-143,
1989.
[9] Les Kohn and Neal Margulis, "Introducing i860 64 Bit Microprocessor", IEEE
Micro, Volume 10, pp 15-29, August 1989.
[10] Christos J. Georgiou, "Fault tolerant Crosspoint Switching Networks",
Proceedings of 14th International Symposium on Fault-Tolerant Computing,
pp 240-245, 1984.
[11] W. Chen, Prof. J. Mavor, Prof. P. B. Denyer and D. Renshaw, "Superchip
Architecture for Implementing Large Integrated Systems", IEE Proceedings,
Volume 135, Number 3, pp 137-150, May 1988.
[12] Jack F. McDonald, Hans J. Greub, Randy H. Steinvorth, Brian J. Donlan and
Albert S. Bergendahl, "Wafer Scale Interconnection for GaAs Packaging -
Applications To RISC Architecture", Computer, pp 21-35, April 1987.
[13] Gilman Chesely, "WSI Architecture", Computer, V-17, pp 94-5, November 1984.
[14] Martin Gold, "Wafer Scale Integration is Still a Challenge to Design, Fabricate,
Test", Electronic Design, v32, pp 87, May 3, 1984.
[15] Daniel P. Siewiorek, "Fault Tolerance in Commercial Computers", Computer, pp
26-37, Volume 23, Number 7, July 1990.
[16] Adit D. Singh and Singaravel Murugesan, "Fault-Tolerant Systems", Computer,
Volume 23, Number 7, pp 15-17, July 1990.
[17] Rajiv Gupta, Alessandro Zorat and I. V. Ramakrishnan, "Reconfigurable
Multipipelines for Vector Supercomputers", IEEE Transactions on Computer,
Volume 38, Number 9, pp 1297-1307, September 1989.
[18] William F. Brockert and Ronald E. Josephson, "Designing Reliability into VAX
8600 Pipeline", Digital Technical Journal, pp 71-77, Number 1, August 1985.
[19] Werner Buchholz, "The IBM System/370 Vector Architecture", IBM Systems
Journal, Volume 25, Number 1, pp 51-62, 1986.
[20] Kai Hwang and Zhiwei Xu, "Multipipeline Networking for Compound Vector
Processing", IEEE Transactions on Computer, Volume 37, Number 1,
January 1988.
[21] John P. Riganati and Paul B. Schneck, "Supercomputing", Computer, Volume
17, Number 10, pp 91-113, October 1984.
[22] George B. Adams and Howard J. Siegel, "Extra Stage Cube: A Fault Tolerant
Interconnection Network for Super Systems", IEEE Transactions on Computer,
Volume c-31, Number 5, pp 443-454, May 1982.
[23] Nirependra N. Biswas, S. Srinivas and T. Dharanendra, "A Centrally Controlled
Shuffle Network for Reconfigurable and Fault Tolerant Architecture",
Computer Architecture News, Volume 15, Number 1, pp 81-87, March 1987.
[24] George B. Adams III, Dharma P. Agrawal and Howard J. Siegel, "Fault
Tolerant Multistage Interconnection Network", Computer, pp 14-27, June
1987.
[25] David A. Patterson, "Reduced Instruction Set Computers", Communications of
ACM, Volume 28, Number 1, pp 8-21, January 1985.
[26] Peter M. Kogge, "The Architecture of Pipelined Computers", New York:
McGraw Hill, 1981.
[27] Dharma P. Agrawal, "Testing and Fault Tolerance of Multistage Interconnection
Networks", Computer, pp 41-53, April 1982.
[28] Tse-yun Feng and Chuan-lin Wu, "Fault Diagnosis for a Class of Multistage
Interconnection Network", IEEE Transactions on Computer, Volume c-30,
Number 10, pp 743-758, October 1981.
[29] Israel Koren, "Defect and Fault Tolerance In VLSI Systems", Volume 1,
Plenum Press, 1989.
[30] R. W. Hockney and C. R. Jesshope, "Parallel Computer", Bristol: Adam
Hilger, 1981.
[31] A. J. Blodgett and D. R. Barbour, "Thermal Conduction Module: A High-
Performance Multilayer Ceramic Package", IBM Journal of Research and
Development, Volume 26, Number 1, pp 30-36, January 1982.
[32] Alessandro De Gloria, "VISA: A Variable Instruction Set Architecture",
Computer Architecture News, Volume 18, Number 2, pp 76-84, June 1990.
[33] Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth and
Paul K. Rodman, "A VLIW Architecture for a Trace Scheduling Compiler",
Computer Architecture News, Volume 15, Second International Conference
on ASPLOS, pp 180-192, 1987.
[34] Tse-yun Feng, "A Survey of Interconnection Network", Computer, pp 12-27,
December 1981.
[35] John P. Hayes, "Computer Architecture and Organization", McGraw-Hill series,
1988.
[36] Norman P. Jouppi, "Superscalar vs. Superpipelined Machines", Computer
Architecture News, Volume 16, Number 3, pp 71-80, June 1988.
[37] David A. Patterson and Carlo H. Sequin, "A VLSI RISC", Computer, pp 8-20,
September 1982.
[38] Dileep Bhandarkar and Richard Brunner, "Vax Vector Architecture", Computer
Architecture News, Volume 18, Number 2, pp 204-215, June 1990.
[39] Richard O. Carlson and Constantine A. Neugebauer, "Future Trends in Wafer
Scale Integration", Proceedings of the IEEE, Volume 74, Number 12, pp
1741-1751, December 1986.
[40] Koichi Yamashita, Akinori Kanasugi, Shinpei Hijiya, Gensuke Goto, Nobutake
Matsumura and Takehide Shirato, "A Wafer-Scale 170 000-Gate FFT
Processor with Built-In Test Circuits", IEEE Journal of Solid-State Circuits,


















No. 1     DO 1 K = 1, 400
        1 X(K) = Q+Y(K)*(R*Z(K+10)+T*Z(K+11))
No. 2     DO 2 K = 1, 996, 5
        2 Q = Q+Z(K)*X(K)+Z(K+1)*X(K+1)+Z(K+2)*X(K+2)+
              Z(K+3)*X(K+3)+Z(K+4)*X(K+4)
No. 3     DO 3 K = 1, 1000
        3 Q = Q+Z(K)*X(K)
No. 4     DO 4 J = 30, 870, 5
          X(L-1) = X(L-1)-X(LW)*Y(J)
        4 LW = LW+1
No. 5     DO 5 I = 2, 998, 3
          X(I) = Z(I)*(Y(I)-X(I-1))
          X(I+1) = Z(I+1)*(Y(I+1)-X(I))
        5 X(I+2) = Z(I+2)*(Y(I+2)-X(I+1))
No. 6     DO 6 J = 3, 999, 3
          I = 1000-J+3
          X(I) = X(I)-Z(I)*X(I+1)
          X(I-1) = X(I-1)-Z(I-1)*X(I)
        6 X(I-2) = X(I-2)-Z(I-2)*X(I-1)
No. 7     DO 7 M = 1, 120
        7 X(M) = U(M)+R*(Z(M)+R*Y(M))+T*(U(M+3)+R*(U(M+2)
              +R*U(M+1))+T*(U(M+6)+R*(U(M+5)+R*U(M+4))))
No. 8     DO 8 KX = 2, 3
          DO 8 KY = 2, 21
          DU1 = U1(KX,KY+1,NL1)-U1(KX,KY-1,NL1)
          DU2 = U2(KX,KY+1,NL1)-U2(KX,KY-1,NL1)
          DU3 = U3(KX,KY+1,NL1)-U3(KX,KY-1,NL1)
          U1(KX,KY,NL2) = U1(KX,KY,NL1)+A11*DU1+A12*DU2
              +A13*DU3+SIG*(U1(KX+1,KY,NL1)-2*U1(KX,KY,
              NL1)+U1(KX-1,KY,NL1))
          U2(KX,KY,NL2) = U2(KX,KY,NL1)+A21*DU1+A22*DU2
              +A23*DU3+SIG*(U2(KX+1,KY,NL1)-2*U2(KX,KY,
              NL1)+U2(KX-1,KY,NL1))
          U3(KX,KY,NL2) = U3(KX,KY,NL1)+A31*DU1+A32*DU2







No. 9     DO 9 I = 1, 100
          PX(1,I) = BM28*PX(13,I)+BM27*PX(12,I)




No. 10    DO 10 I = 1, 100
          AR = CX(5,I)
          BR = AR-PX(5,I)
          PX(5,I) = AR
          CR = BR-PX(6,I)
          PX(6,I) = BR
          AR = CR-PX(7,I)
          PX(7,I) = CR
          BR = AR-PX(8,I)
          PX(8,I) = AR
          CR = BR-PX(9,I)
          PX(9,I) = BR
          AR = CR-PX(10,I)
          PX(10,I) = CR
          BR = AR-PX(11,I)
          PX(11,I) = AR
          CR = BR-PX(12,I)
          PX(12,I) = BR
          PX(14,I) = CR-PX(13,I)
          PX(13,I) = CR
       10 CONTINUE
No. 11    X(1) = Y(1)
          DO 11 K = 2, 1000
       11 X(K) = X(K-1)+Y(K)
No. 12    DO 12 K = 1, 199
       12 X(K) = Y(K+1)-Y(K)
No. 13    DO 13 IP = 1, 128
          I1 = P(1,IP)
          J1 = P(2,IP)
          P(3,IP) = P(3,IP)+B(I1,J1)
          P(4,IP) = P(4,IP)+C(I1,J1)
          P(1,IP) = P(1,IP)+P(3,IP)
          P(2,IP) = P(2,IP)+P(4,IP)
          I2 = P(1,IP)
          J2 = P(2,IP)
          P(1,IP) = P(1,IP)+Y(I2+32)
          P(2,IP) = P(2,IP)+Z(J2+32)
          I2 = I2+E(I2+32)
          J2 = J2+F(J2+32)
          H(I2,J2) = H(I2,J2)+1.0
       13 CONTINUE
No. 14    DO 14 K = 1, 150
          IX = GRD(K)
          XI = IX
          VX(K) = VX(K)+EX(IX)+(XX(K)-XI)*DEX(IX)
          XX(K) = XX(K)+VX(K)+FLX
          IR = XX(K)
          RI = IR
          RXI = XX(K)-RI
          IR = IR-(IR/64)*64
          XX(K) = RI+RXI
          RH(IR) = RH(IR)+1.0-RXI












Modified by Kogge's double cycling method
          DO 5 I = 2, 998, 3
        5 X[I] = Z[I] * (Y[I] - Z[I-1] * (Y[I-1] - Z[I-2] * (Y[I-2] - X[I-3])))
          X[I+1] = Z[I+1] * (Y[I+1] - X[I])
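The first statement of the modified loop follows from substituting the loop 5 recurrence X(I) = Z(I)*(Y(I)-X(I-1)) into itself twice, so that X(I) depends on X(I-3) instead of X(I-1). The Python sketch below checks this identity numerically. It is an illustration only: the function names, the zero-based indexing and the random data are assumptions, and it verifies the algebra rather than modelling the FTVP pipelines.

```python
import math
import random


def sequential(x0, y, z):
    """Evaluate x[i] = z[i] * (y[i] - x[i-1]) one element at a time."""
    x = [x0]
    for i in range(1, len(y)):
        x.append(z[i] * (y[i] - x[i - 1]))
    return x


def unrolled_term(x, y, z, i):
    """Same element written in terms of x[i-3], as in the modified loop."""
    return z[i] * (y[i] - z[i - 1] * (y[i - 1] - z[i - 2] * (y[i - 2] - x[i - 3])))


random.seed(0)
n = 1000
y = [random.random() for _ in range(n)]
z = [0.9 * random.random() for _ in range(n)]
x = sequential(0.5, y, z)

agree = all(
    math.isclose(x[i], unrolled_term(x, y, z, i), rel_tol=1e-9, abs_tol=1e-12)
    for i in range(3, n)
)
print("unrolled form agrees with the recurrence:", agree)
```

The two forms differ only in the order of floating-point operations, which is why the comparison uses a small tolerance instead of exact equality.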










Modified by Kogge's double cycling method
          X[1] = Y[1]
          DO 11 K = 2, 1000




WAFER SCALE INTEGRATION
Wafer Scale Integration (WSI) can be regarded as a special form of 
packaging in which extra wiring, normally used to interconnect the package 
containing working components, is fabricated on the surface of a wafer substrate 
containing the components and mounted inside a single package [12]. This internal 
wiring eliminates many of the problems encountered in the conventional printed 
circuit boards or ceramic carriers [12]. Internal wiring leads to reduced wiring 
length, which, in turn, increases the system speed and decreases the power 
requirement of the I/O drivers. The reduced wiring length achievable in WSI will
not necessarily translate into reduced propagation delay unless the other wire
dimensions are scaled properly [12]. For example, suppose a metal wire of
rectangular cross section with length l, width w and thickness t is located at a
distance d from a ground plane of the same metal [12]. Then the RC charging delay
of the distributed system is approximately given by [12]

$$T_{RC} = \frac{\rho \epsilon l^2}{t d}$$

where ρ is the resistivity of the metal and ε is the permittivity of the dielectric.
A typical integrated circuit line is made of aluminium (ρ = 2.83 × 10^-6 Ω cm) on
SiO2 (ε_r = 3.9) [12]. If l = 20 cm and t = d = 0.5 µm, then T_RC = 160 ns [12]. If the
dielectric constant is reduced to unity, then the delay would be only 600 ps. This
is two orders of magnitude faster than a metal line produced by a conventional IC
processing technique. A line of l = 20 cm represents a worst-case length for WSI wiring
on wafers three to four inches in diameter. This is small when compared to the
chip-level wiring length of 25 m for the IBM 3081 processor unit fabricated using the
conventional LSI technology [31]. Wiring delay accounts for fifty percent of the
CPU time of the IBM 3081 processor [31]. Further, it is possible to obtain high
propagation speeds in WSI using simple extensions of the existing technology. High
propagation speeds can be achieved by fabricating thick-film LC transmission lines
rather than thin-film lines with RC charging behavior [12]. The thick-film lines that
favorably affect the propagation times can also help improve the discretionary
wiring yield, depending on the type of fabrication employed [12].
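The 160 ns figure quoted above can be reproduced from the distributed RC relation T_RC = ρ ε l² / (t d). The short Python sketch below is illustrative only: the function name `rc_delay`, the vacuum permittivity constant, and the aluminium resistivity of 2.83 × 10^-6 Ω cm are assumptions made for this calculation, and all lengths are expressed in centimetres.

```python
EPS0_F_PER_CM = 8.854e-14  # vacuum permittivity in F/cm (assumed constant)


def rc_delay(rho_ohm_cm, eps_r, length_cm, thickness_cm, distance_cm):
    """Distributed RC charging delay in seconds: rho * eps * l^2 / (t * d)."""
    eps = eps_r * EPS0_F_PER_CM
    return rho_ohm_cm * eps * length_cm ** 2 / (thickness_cm * distance_cm)


# Aluminium on SiO2, l = 20 cm, t = d = 0.5 um = 0.5e-4 cm
t_rc = rc_delay(2.83e-6, 3.9, 20.0, 0.5e-4, 0.5e-4)
print(f"{t_rc * 1e9:.0f} ns")  # about 156 ns, which rounds to the quoted 160 ns
```

The wire width w does not appear because it cancels between the line resistance and the line capacitance. In this relation the delay scales linearly with the dielectric constant and inversely with the product t d, so other operating points can be explored by changing the arguments.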
The statistical uncertainty in the wiring delays resulting from wiring a
random collection of working components has been one drawback of traditional
WSI. That is, since the locations of the working cells are not fixed, the wiring delay
of a given path may vary from wafer to wafer. However, since these delays can be
much shorter than those found in most other packaging arrangements, this could be
less of a problem than might be expected [12].
There are few subjects in solid-state electronics that bring forth as many
negative comments as WSI [39]. These are partly due to the dominance of the
prevailing VLSI technology, which is expected to dominate the field well into the
future and lessen the need for WSI [12]. But many designers overlook the fact that
VLSI actually makes poor use of the enormously large silicon area available. The
average VLSI chip area grows very slowly even as higher levels of integration are
achieved. The name WSI implies a quantum jump in the number of components
integrated on a monolithic piece of silicon compared with state-of-the-art VLSI.
The WSI silicon piece is much larger than the one used in state-of-the-art VLSI,
and is normally of wafer size [39].
The attractiveness of WSI lies in its promise of reduced cost, high 
performance, higher level of integration, greatly increased reliability and application 
potential. Traditionally, the increased component density of a VLSI chip is 
achieved primarily by a downscaling of the feature sizes; only in a secondary manner 
is this increased component density obtained by the use of larger chip dimensions. 
The increase in component density of a VLSI chip due to the shrinkage of the 
minimum feature size has been of several orders of magnitude, while the increase of 
maximum feasible chip area has been modest [39]. Since the practical limit of
scaling has not been reached, VLSI will continue to dominate WSI. But WSI tries
to increase the component density still further by increasing the chip area,
which has so far contributed very little to VLSI performance. Further, redundancy
and fault tolerance in WSI add reliability to the fabrication, adding to the
growing number of advantages it has over VLSI.
To avoid multilevel metallization in WSI, it is necessary that the circuit 
design avoid cross-wafer data communication as much as possible. Therefore, 
instead of cross-wafer data communication, the cells fabricated should communicate 
with the neighboring cells. This is possible with the pipelined and bus-oriented 
architectures; therefore, WSI is more suitable for pipelined and bus-oriented 
architectures [39]. The FTVP proposed in this study is a pipelined architecture 
favoring WSI fabrication. The important consequence of this pipelined structure 
fabrication is the avoidance of multilevel metallization. This, in turn, makes it 
practical to apply state-of-the-art VLSI fabrication technology to WSI, thereby
giving it a significant density advantage over the equivalent VLSI implementation.
To remain competitive with VLSI, any WSI process must satisfy the following
requirements [39]:
* Make use of the densest VLSI fabrication process available.
* Avoid the introduction of additional process steps as much as possible to keep the
complexity to a minimum.
* Avoid cross-wafer communication by using pipeline architectures.
* Provide multiple external power and ground contacts on the wafer at regular 
intervals. 
But it must be noted that any advances made in WSI will be reflected in VLSI and
vice versa. The major competitors of WSI are VLSI technology itself, whose
rapidly shrinking feature sizes yield ever higher chip density, and multichip
VLSI technology. Table VI shows the figures of merit of various packaging
technologies [39].
TABLE VI
COMPARISON OF THE FIGURES OF MERIT OF THE PACKAGES

Packaging Approach                  Power x   Size or   Cost    Overall Figure
                                    Delay     Weight            of Merit
Printed wiring board                1.00      1.00      1.00    1.00
Thick-film multilayer on ceramic    1.08      0.42      1.02    0.46
Ceramic multilayer hybrid           0.34      0.2       0.65    0.044
Thin-film multilayer hybrid,        0.19      0.14      0.6     0.016
  populated on one side
Wafer Scale Integration             0.10      0.09      0.46    0.0041
Thin-film multilayer hybrid,        0.08      0.07      0.44    0.0025
  populated on both sides
As seen from Table VI, the multichip hybrid technology has an advantage over WSI 
because it is risk-free and offers high performance due to the advancement in VLSI 
technology. But the multichip hybrid technology requires a large number of 
metallurgical bonds, and has difficulty in heat removal [39]. 
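The overall figure of merit in Table VI appears to be the product of the power-delay, size-or-weight and cost columns. This is an observation drawn from the tabulated numbers, not a definition stated in the text. The Python sketch below recomputes that product for each row; the variable names are illustrative assumptions.

```python
# (power x delay, size or weight, cost, overall figure of merit) from Table VI
table_vi = {
    "Printed wiring board": (1.00, 1.00, 1.00, 1.00),
    "Thick-film multilayer on ceramic": (1.08, 0.42, 1.02, 0.46),
    "Ceramic multilayer hybrid": (0.34, 0.2, 0.65, 0.044),
    "Thin-film multilayer hybrid, one side": (0.19, 0.14, 0.6, 0.016),
    "Wafer Scale Integration": (0.10, 0.09, 0.46, 0.0041),
    "Thin-film multilayer hybrid, both sides": (0.08, 0.07, 0.44, 0.0025),
}

products = {}
for name, (power_delay, size, cost, overall) in table_vi.items():
    products[name] = power_delay * size * cost
    print(f"{name:42s} product = {products[name]:.4g}  tabulated = {overall}")
```

Each product agrees with the tabulated value to within rounding. Lower values are better, since every entry is normalized to the printed wiring board.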
VITA
SUNDARARAJAN GANESH
Candidate for the Degree of
Master of Science
THE FAULT-TOLERANT SINGLE-CHIP VECTOR
PROCESSOR: ARCHITECTURE AND PERFORMANCE
ANALYSIS USING LIVERMORE LOOP BENCHMARKS
Major Field: Electrical Engineering 
Biographical: 
Personal Data: Born in Ganapathi Agraharam, Tamilnadu, India, 
September 24, 1967, the Son of V. Sundararajan and R. Malathy. 
Education: Graduated from St. Joseph of Cluny Higher Secondary School,
Neyveli, India in June 1984; received Bachelor of Engineering degree
in Electronics and Instrumentation from Annamalai University, 
Chidambaram, India in July 1988; completed requirements for 
Master of Science degree at Oklahoma State University in 
May 1992. 
Professional experience: Research assistant to Dr. J. J. Lee, Department of 
Electrical Engineering, Oklahoma State University, January, 1990 to 
December 1991; Technical assistant to Mr. Bill Barnes, Department 
of Biochemistry, Oklahoma State University, September, 1989 to 
April 1990; member of Tau Beta Phi and Eta Kappa Nu. 
