Scholars' Mine
Masters Theses

Student Theses and Dissertations

Fall 1980

A multiprocessor system using a switch matrix configuration
Rabah Aoufi

Follow this and additional works at: https://scholarsmine.mst.edu/masters_theses
Part of the Electrical and Computer Engineering Commons

Department:
Recommended Citation
Aoufi, Rabah, "A multiprocessor system using a switch matrix configuration" (1980). Masters Theses.
5794.
https://scholarsmine.mst.edu/masters_theses/5794

This thesis is brought to you by Scholars' Mine, a service of the Missouri S&T Library and Learning Resources. This
work is protected by U. S. Copyright Law. Unauthorized use including reproduction for redistribution requires the
permission of the copyright holder. For more information, please contact scholarsmine@mst.edu.

A MULTIPROCESSOR SYSTEM USING
A SWITCH MATRIX CONFIGURATION

BY

RABAH AOUFI, 1955 -

A THESIS

Presented to the Faculty of the Graduate School of the

UNIVERSITY OF MISSOURI-ROLLA

In Partial Fulfillment of the Requirements for the Degree

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

1980

T4669
c.1
57 pages

Approved by


ABSTRACT

This thesis describes a class of interconnection networks based on the use of a switch matrix to provide processor-to-memory communication. This switch allows a direct link between any processor and any memory module. The cost and performance of this network are examined analytically. The results are compared with those of a multiprocessor system using a time-shared bus configuration, and it is shown that for the two extreme cases of maximum and minimum throughput, the two approaches are equivalent from a performance point of view. However, in the general case, even with a higher cost, the switch matrix provides much better performance than the time-shared bus configuration. Furthermore, the architecture of a multiprocessor MIMD type computer using a switch matrix is investigated, and Petri net techniques are used to model process coordination among processors.


ACKNOWLEDGMENT

I would like to express my gratitude to Dr. Darrow Dawson for his guidance during my graduate work at the University of Missouri-Rolla. I also would like to thank Dr. Theodore E. McCracken and Dr. Min Ming Tang for serving on my Master's Committee and Mrs. Monique Helterbrand for her precision and promptness in preparing the typescript.

A special thanks to my mother, Sahara, for her patience, understanding and moral support during my stay in the United States of America.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF ILLUSTRATIONS
LIST OF TABLES

I. INTRODUCTION
   A. Review of Multiprocessing Systems
   B. Classification of Multiprocessor Systems
      1. Symmetric and Asymmetric Processor
      2. System Organization
         a. Switch Matrix
         b. Time-shared Bus
         c. Multiport Memory Systems
   C. Outline

II. SWITCH ORGANIZATION
   A. Principle of Operation
      1. Description
      2. Contention
      3. Reliability
   B. Control of the Switch

III. PERFORMANCE AND COST OF THE SYSTEM
   A. System Throughput
      1. Maximum Throughput
      2. Minimum Throughput
      3. Average Throughput
   B. System Cost

IV. THE MIMD MACHINE
   A. Definition of SIMD and MIMD Machines
   B. Parallelism Through The Switch Network
      1. Overview
      2. Node Switch Operation
      3. Interprocessor Control
      4. Resource Sharing and Scheduling

V. CONCLUSION

BIBLIOGRAPHY
VITA
APPENDIX: THE OCCUPANCY PROBLEM

LIST OF ILLUSTRATIONS

Figure
1. Multiprocessor Switch Organization
2. Conflict due to Absence of Priority Word
3. Solution to the Conflict Using Priority Word Presence
4. Block Diagram of a Switch Node
5. Multiprocessor Data Bus Organization
6. System Throughput Vs Number of Processors, m=3
7. System Throughput Vs Number of Processors, m=10
8. Transition States of a 4x4 System
9. Relative System Cost Vs Number of Processors for Two Cases, m=10
10. System Cost Vs System Throughput, m=10
11. Normalized Curves of System Cost/Throughput Vs Number of Processors, m=10
12. Block Diagram of a SIMD Computer
13. Block Diagram of a MIMD Computer
14. A Marked Petri Net
15. Switch Node Operation
16. Illustration of the Producer-Consumer Problem
17. Solution to the Reliability Problem
18. Illustration of The Deadlock Problem
19. Solution to The Deadlock Problem

LIST OF TABLES

Table
I. Table Mapping Memory Addresses Into Switch Nodes
II. Input X Coding
III. Discrete Markov Chain Model

I. INTRODUCTION

A. REVIEW OF MULTIPROCESSING SYSTEMS

During the past several years, multiprocessing systems have been discussed in the literature and a number of different systems have been implemented or proposed. Low-cost microprocessors are now being designed into multiprocessing systems. Parallel (SIMD) type processors, computer networking and multiprocessor systems are among the existing organizations. This thesis considers the multiprocessor MIMD feature. SIMD and MIMD are defined in Section IV.

It is important to recall the fundamental definition of a multiprocessor system as given by [1] before the advantages of multiprocessor systems are presented.

A multiprocessor computer is a system containing two or more processor units of approximately comparable capabilities. Each unit has access to shared common memory as well as having common access to at least a portion of the I/O devices. In addition, all processor units are controlled by one operating system that provides interaction between the processors and the programs they are executing at all levels.

Several advantages may be realized with multiprocessor systems. Throughput often increases almost directly with the number of processors while system cost increases by only a small amount. Shared resources provide an economic advantage by eliminating devices that would otherwise be duplicated in other systems. They also provide direct access to data without transmission from system to system. The cost of a standby unit is small, and a spare processor can be switched into the system to replace a failed processor.
B. CLASSIFICATION OF MULTIPROCESSOR SYSTEMS

This section contains a brief discussion of the classification of multiprocessor systems. Two distinguishing features that differentiate between designs are the use of processing units and the interconnection of processor units and memories.

1. Symmetric and Asymmetric Processor: The symmetric multiprocessor system consists of a network of functionally equivalent processors. This type of system is used in a general-purpose environment, where processing requirements are constantly changing. The advantage of this class is that a given task can be assigned to any idle processor for execution, since there is an equivalence among individual processors. Another significant advantage of this class is that the failure of a module does not cause the failure of the entire machine. However, symmetric systems require that every processor have full capabilities, which increases hardware expense and complicates the operating system, which becomes responsible for the identification and control of tasks.

A second group consists of heterogeneous processors specially configured for a set number of tasks. Tasks and their actions must be completely known in advance. In this case, processors may be specialized to carry out one particular type of task. One processor may perform all I/O operations, another may provide floating-point arithmetic capability, and a third may provide file maintenance. The operating system is greatly simplified and becomes a task scheduler.

2. System Organization: In addition to classifying multiprocessor systems according to processor use, they may be grouped in relation to the interconnection of processors with system memories and peripheral devices. Three main types of organizations are possible.

a. Switch Matrix: This scheme provides direct paths from any processor to any memory or peripheral. This allows many processors to simultaneously utilize many different memory modules, reducing memory reference interference between processors. However, the switching matrix may be extremely expensive (the cost increases rapidly with the number of processors), eliminating much of the cost advantage of a multiprocessor system.

b. Time-shared Bus: This method multiplexes all processors, memories and peripheral devices over one data bus. This is a lower-cost approach, but system throughput becomes limited by bus capacity.

c. Multiport Memory Systems: In this third method each processor has access through its own bus to all memory modules. Like the two previous organizations, the multiport system organization has the disadvantage of high-cost multiple-connection hardware.

All of the above organizations and their variations are useful and worthy of consideration. This thesis is concerned with the symmetric processor and its associated switching matrix. Much of the discussion can be applied to the other classes as well.
C. OUTLINE

The next section describes the switch matrix organization in detail. Section III analyzes the performance and cost of a typical multiprocessor system. Such a system is the MIMD type computer treated in Section IV, where process coordination is modeled using Petri net techniques.

II. SWITCH ORGANIZATION

A. PRINCIPLE OF OPERATION

A block diagram of the switch organization is shown in Figure 1.

1. Description: The switch that interconnects processors and data memories to allow memory sharing consists of a number of nodes connected via ports. Each node contains two input ports labeled A and B and two output ports labeled C and D. Each node can send a message on its output ports and receive one on its input ports. It is assumed that each memory can respond to a single request during one cycle so that there is no simultaneous double service. The message contains the address of the memory to be mapped into physical memory and a priority word.

When a switch node receives a message, it attempts to route it correctly through the appropriate path. This is accomplished by storing in each node a table which maps the recipient address into the port number as shown in Table I. Input A can be connected to either output D or output C, depending on the value of a generated control signal. However, input B can be connected only to output C. This technique reduces the complexity of the table mapping and defines a unique path between the message entry and its destination. It is clear that the inputs of the root node can be switched to any one of the outputs.
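The unique path described above can be sketched in a few lines of Python. This is only an illustration under an assumed layout inferred from Table I: node (i, j) is taken to sit on processor row i and memory column j, a request travels along its row (input A to output D), turns at its destination column (input A to output C), and then passes straight up the column (input B to output C). The function name `route_path` and the indexing convention are hypothetical, not taken from the thesis.

```python
from typing import List, Tuple


def route_path(processor: int, memory: int) -> List[Tuple[int, int]]:
    """Return the switch nodes visited from a processor to a memory module.

    Assumed layout (hypothetical): node (i, j) is on processor row i and
    memory column j. The request enters at (i, 1), moves along row i until
    it reaches column j, then moves up column j to row 1.
    """
    if processor < 1 or memory < 1:
        raise ValueError("processor and memory indices start at 1")
    # Along the row: input A -> output D, until the destination column.
    along_row = [(processor, col) for col in range(1, memory + 1)]
    # Up the column: input B -> output C, the only connection B allows.
    up_column = [(row, memory) for row in range(processor - 1, 0, -1)]
    return along_row + up_column


if __name__ == "__main__":
    for proc, mem in [(1, 1), (2, 1), (2, 2)]:
        print(proc, mem, route_path(proc, mem))
```

Because input B has a single legal connection, each node only has to decide where input A goes, which is what keeps the per-node table small.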
Figure 1. Multiprocessor Switch Organization

TABLE I
TABLE MAPPING MEMORY ADDRESSES INTO SWITCH NODES

PROCESSOR UNIT    SWITCH NODES                           MEMORY MODULE
1                 1,1                                    1
2                 2,1 - 1,1                              1
...
p                 p,1 - ... - 2,1 - 1,1                  1
1                 1,1 - 1,2                              2
2                 2,1 - 2,2 - 1,2                        2
...
p                 p,1 - p,2 - ... - 2,2 - 1,2            2
...
1                 1,1 - ... - 1,m                        m
2                 2,1 - ... - 2,m - 1,m                  m
...
p                 p,1 - ... - p,m - ... - 2,m - 1,m      m

2. Contention: The hierarchy in priority is set by incrementing the request priority as it passes through the node. Preference to route is then given to the request with the highest priority. When a request is accepted, it is followed by latching sequentially the high-order byte and then the low-order byte of the memory address. A message is therefore distributed at any instant between two nodes, and a conflict to acquire the node can arise between the beginning of one request and the middle part of another. To avoid such a conflict, each part of the message should have the priority word associated with it. This requires an increase of the message word length. Figures 2 and 3 show a possible solution to the contention problem.
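The priority rule can be illustrated with a small Python sketch. It is a simplified model, not the thesis hardware: the `Request` record, its field names, and the tie-breaking rule (input A wins a tie) are assumptions made for the example. Only the two rules stated in the text are modeled: the request with the highest priority is routed, and the priority is incremented as a request passes through the node.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    source: int        # originating processor (illustrative field)
    output: str        # requested output port: "C" or "D"
    priority: int = 0  # priority word carried with the message


def arbitrate(req_a: Optional[Request],
              req_b: Optional[Request]) -> Tuple[List[Request], List[Request]]:
    """Return (forwarded, rejected) for the requests on inputs A and B."""
    forwarded, rejected = [], []
    by_output = {}
    for req in (req_a, req_b):
        if req is not None:
            by_output.setdefault(req.output, []).append(req)
    for contenders in by_output.values():
        # Highest priority wins; on a tie max() keeps the first one (input A).
        winner = max(contenders, key=lambda r: r.priority)
        for req in contenders:
            if req is winner:
                # Priority is incremented as the request passes through the node.
                forwarded.append(replace(req, priority=req.priority + 1))
            else:
                rejected.append(req)
    return forwarded, rejected


if __name__ == "__main__":
    a = Request(source=1, output="C", priority=2)
    b = Request(source=2, output="C", priority=5)
    print(arbitrate(a, b))
```

A request that has already crossed several nodes therefore tends to beat a freshly issued one, which is the hierarchy the text describes.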
3. Reliability: When a request has been made by a processor to access a certain memory address, a signal message is reported back through the switch to the requesting processor to indicate whether the operation has been successful or not. In case of failure, the processor makes another attempt to have its request granted.

When a switch node fails, a misrouted message could be created. A leaking message should be inserted at the beginning of a request by every originator. This message leaks the misrouted request out of the switch. The leaking message consists of a message with a higher priority value, enabling the processor to gain access to the node. The number of bits associated with each request word should be sufficient to prevent the priority value from reaching its maximum and overflowing.

Figure 2. Conflict due to Absence of Priority Word (nodes 1 to 3; message 1 carries low-order data only, message 2 carries high-order data plus priority)

Figure 3. Solution to the Conflict Using Priority Word Presence (nodes 1 to 3; message 1 carries low-order data plus priority, message 2 carries high-order data plus priority)
B. CONTROL OF THE SWITCH

The functional block diagram of a switch node appears in Figure 4. All single lines in the figure are multiple-bit lines. The double lines on the INOUT box represent incoming and outgoing address and data lines. A read/write control line is also provided. The function of the INOUT box is to set up a connection between the incoming information port and one of the outgoing ones, according to the value of the input X. The input X may be encoded with as few as two bits as shown in Table II.

TABLE II
INPUT X CODING

Input X    Connection
00         A - C
01         A - D
10         B - C
11         B - D (forbidden)
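As a small illustration of the two-bit coding, the lookup below decodes X into an input/output connection. It is a sketch only: the names `X_CODING` and `decode_x` are hypothetical, and the assignment of code 00 to the A - C connection is an assumption, since only the codes 01, 10 and 11 are legible in the table.

```python
# Two-bit control word X -> (input port, output port).
X_CODING = {
    0b00: ("A", "C"),  # assumed code for the A - C connection
    0b01: ("A", "D"),
    0b10: ("B", "C"),
}


def decode_x(x: int) -> tuple:
    """Return the (input, output) pair selected by the control word X."""
    if x == 0b11:
        # B - D is the forbidden connection: input B may only reach output C.
        raise ValueError("X = 11 would connect B to D, which is forbidden")
    if x not in X_CODING:
        raise ValueError("X must be a two-bit value")
    return X_CODING[x]


if __name__ == "__main__":
    for code in (0b00, 0b01, 0b10):
        print(format(code, "02b"), decode_x(code))
```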

The function of the CONTROL box is to generate the signal X and provide arbitration. A request is generated when its line is presented at the input port. The memory address is mapped into the stored table to provide the correct routing of the selected message. A signal X is then issued to the INOUT box to specify the right exit port. When a request for a busy memory is rejected, a busy signal is eventually transmitted to the source which originated the blocked request. The DONE signal is supplied to each CONTROL box to guarantee information flow about the success or failure of operations. In case of failure, new attempts should be made until the operation is achieved correctly. To avoid any gate delay, the DONE signal is connected directly through the network.

Figure 4. Block Diagram of a Switch Node (signals shown: Request 1, Busy 1, Request 2, Busy 2, R/W)

Actual implementation of the switch in the real world requires additional practical considerations. An evaluation of this interconnection network in terms of system performance/cost and allowance of programming concurrency will be made in the next two sections.

III. PERFORMANCE AND COST OF THE SYSTEM

A. SYSTEM THROUGHPUT

The estimation of system performance and cost is motivated by the work of Reyling [2]. In his analysis, Reyling derived results based on the utilization of a time-sharing technique as shown in Figure 5. This section deals with the equivalent space-sharing technique. The analytical results concerning system performance and cost are compared with those of the time-sharing technique and validated through examples.

Space-sharing means that a set of resources is partitioned into non-intersecting blocks such that each block executes some application. The applications are executed independently in parallel. Time-sharing reduces the idle time. Space-sharing reduces the percentage of resources that are idle.

To determine multiprocessor throughput as a function of the number of microprocessors required, the characteristics of the system have been defined as:

Ts: System throughput, defined as the number of instructions executed per second by the system.
Tp: Throughput of an individual processor when there is no memory interference.
p: Number of processors in the system.
m: Number of memory modules in the system.

The effects of interference when memory is used for making single-word transfers have been considered here; contention for multiple-word transfer units also affects the throughput of a particular system and may be investigated in a manner similar to the following discussion.

Figure 5. Multiprocessor Data Bus Organization (time-shared system data bus)

When several processors simultaneously address the same memory module, memory interference occurs. If n generated requests are queued to the same memory module, then n-1 processors must wait for the module to become unlocked in order to gain access to it. Throughput of the entire system is reduced because each processor is slowed down.
In order to study memory interference in more general terms, the maximum, minimum, and average throughput Ts are determined. For this purpose, the following model is described.

At a given instant of time t, p different requests are generated and divided among the m modules. It is assumed that the processing time is null. Furthermore, it is assumed that a processor issues a new request immediately after its current request has been served, addressing each module with a uniform probability (1/m).

To illustrate the ideas, an example with p=4 and m=4 is considered. The number of requests simultaneously present at memory module j at time t will be indicated by X_t(j). In the case where

\[ X_t(1) = 0, \quad X_t(2) = 3, \quad X_t(3) = 1, \quad X_t(4) = 0 \]

the model will be illustrated by

p:  0  1  2  3
m:  2  2  2  3
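The request counts of the example can be reproduced with a short Python sketch. The function name `requests_per_module` is hypothetical; the input is the list of module numbers addressed by processors 0 to 3, as in the illustration above.

```python
from collections import Counter
from typing import List


def requests_per_module(requests: List[int], m: int) -> List[int]:
    """Return [X_t(1), ..., X_t(m)] for one list of requested module numbers."""
    counts = Counter(requests)
    # Modules are numbered from 1 to m; modules with no request count as 0.
    return [counts.get(j, 0) for j in range(1, m + 1)]


if __name__ == "__main__":
    # Processors 0, 1 and 2 address module 2; processor 3 addresses module 3.
    print(requests_per_module([2, 2, 2, 3], m=4))
```

For the example, three requests queue on module 2, one on module 3, and modules 1 and 4 stay idle during the cycle.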

1. Maximum Throughput: The maximum throughput Ts(MAX) will occur if each memory module receives a single request at a given instant of time t. In other terms,

\[ X_t(j) = 1 \quad \text{for } j = 1, 2, \ldots, m \]

In this case, all the processors are doing useful work since they are accessing different memory modules of the shared main memory. Clearly, Ts will equal pTp. This is shown graphically in Figures 6 and 7. This result is also true for time-sharing system performance.
2. Minimum Throughput: It is also of interest to find the minimum value of Ts. The worst possible case would be if all p requests had to be queued to the same memory module j, so that

\[ X_t(j) = p \quad \text{for some } j \in \{1, 2, \ldots, m\} \]

and consequently, p-1 processors will be waiting to gain access to the resource. It is assumed that the probability that a request will be pending is also (1/m).

The memory bandwidth B is defined as the number of requests serviced per cycle. It follows that, for the above example, the bandwidth would be:

\[ B = \frac{\text{number of processors}}{\text{number of cycles}} = \frac{4}{3} \approx 1.33 \]

where the number of cycles is equal to the maximum of X_t(j) for j = 1, ..., m. The decrease in throughput could be derived by considering the ratio

\[ R = \frac{\text{bandwidth with maximum interference}}{\text{bandwidth with no interference}} \]

Expressing the fact that p-1 processors would be waiting for the busy memory module during interference yields

\[ R = \frac{1}{1 + (p-1)(1/m)} \]

The minimum value of Ts is given as:

\[ T_s(\min) = (\text{throughput with no interference}) \times R = p\,T_p\,R = \frac{p\,T_p}{1 + (p-1)(1/m)} \]

This minimal value of throughput may be used to determine the range of possible throughputs and has been plotted in Figure 6 for m=3 and in Figure 7 for m=10.
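The two throughput bounds and the bandwidth of the example can be evaluated with the sketch below. It assumes the bounds take the forms Ts(MAX) = p Tp and Ts(min) = p Tp / (1 + (p-1)/m), and that the bandwidth is the number of requests divided by the largest queue length; the function names are hypothetical.

```python
from typing import List


def bandwidth(requests_per_module: List[int]) -> float:
    """Requests serviced per cycle: all requests over the longest queue."""
    total_requests = sum(requests_per_module)
    cycles = max(requests_per_module)  # the longest queue sets the cycle count
    return total_requests / cycles


def max_throughput(p: int, tp: float) -> float:
    """Upper bound: every processor reaches a different module."""
    return p * tp


def min_throughput(p: int, m: int, tp: float) -> float:
    """Lower bound under the assumed interference ratio R = 1 / (1 + (p-1)/m)."""
    return p * tp / (1 + (p - 1) / m)


if __name__ == "__main__":
    print(bandwidth([0, 3, 1, 0]))
    for p in range(1, 11):
        print(p, max_throughput(p, 1.0), round(min_throughput(p, 10, 1.0), 3))
```

With Tp set to 1, the loop lists both bounds in units of Tp for m = 10, which is the range the plotted curves are described as spanning.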

The two figures show that, under the hypothesis m=p, the results are equivalent to the time-sharing system performance results. Even with maximum interference, both analyses still show a substantial increase in Ts with p. However, it should be pointed out that these last two cases concerning performance bounds are events of rare occurrence. As an example, a system with parameters m=p=n has the random sampling probabilities

\[ g(1) = \frac{n!}{n^n} \quad \text{and} \quad g(p) = \frac{n}{n^n} = \frac{1}{n^{n-1}} \]

where g(h) is the probability that the maximum of X_t(j) is equal to h. For a 7x7 system, g(1) and g(7) are given by

\[ g(1) = \frac{7!}{7^7} \approx 0.00611 \quad \text{and} \quad g(7) = \frac{1}{7^6} \approx 0.00000849 \]

As one can see, these probabilities are too low for the maximum and minimum interference cases to occur frequently.
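These two probabilities can be checked numerically. The sketch below assumes that each of the n processors addresses one of the n modules independently and uniformly, so that g(1) = n!/n^n (all requests fall on distinct modules) and g(n) = n/n^n (all requests fall on the same module). The function names are hypothetical, and the brute-force enumeration is only practical for small n.

```python
from collections import Counter
from itertools import product
from math import factorial
from typing import Dict


def prob_no_interference(n: int) -> float:
    """g(1): all n requests fall on distinct modules."""
    return factorial(n) / n ** n


def prob_full_interference(n: int) -> float:
    """g(n): all n requests fall on the same module."""
    return n / n ** n


def max_queue_distribution(n: int) -> Dict[int, float]:
    """Brute-force g(h) for all h by enumerating the n**n request patterns."""
    counts = Counter()
    for pattern in product(range(n), repeat=n):
        counts[max(Counter(pattern).values())] += 1
    return {h: c / n ** n for h, c in sorted(counts.items())}


if __name__ == "__main__":
    print(prob_no_interference(7), prob_full_interference(7))
    print(max_queue_distribution(3))
```

For n = 7 the two closed forms come out near 0.0061 and 0.0000085, in line with the values quoted in the text.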
Figure 6. System Throughput Vs Number of Processors, m=3

3. Average Throughput: Average throughput is a determining factor of system performance. It is computed by considering a sequence of transition states viewed as a discrete Markovian process with state space (1,2,...,m) and with transition probability A(i,j) of reaching state i from state j. Let p(i) denote the steady-state probability of state i. Then,

\[ p(i) = \sum_{j=1}^{m} p(j)\,A(i,j), \qquad i = 1, 2, \ldots, m \]

To simplify the analysis, an assumption is made that all the states are inter-reachable. The number of busy modules is represented by a state of m-tuple (p_1, p_2, ..., p_m) with

\[ \sum_{i=1}^{m} p_i = p \]

A new state (j_1, j_2, ..., j_m) is reachable from state (i_1, i_2, ..., i_m) with the transition probability [3]

\[ \frac{x!}{(j_1 - i_1)! \cdots (j_m - i_m)!} \left(\frac{1}{m}\right)^{x} \]

where x is the number of nonzero elements in the new state vector. Furthermore, the probability distribution p(i) of all possible states obeys the normalizing equation

\[ \sum_{i=1}^{m} p(i) = 1 \]
Figure 7. System Throughput Vs Number of Processors, m=10

In order to compute the elements of the transition matrix A(i,j), the enumeration tree of a 4x4 system as in Figure 8 has been considered. The letters I_1, I_2, ..., I_5 denote the initial states, and the letters F_1, F_2, ..., F_5 denote the final states. The letter W denotes the number of ways in which a transition can occur. This number is the sum of the different combinations to traverse the tree; e.g., the number of ways to reach state (3,1,0,0) from state (3,1,0,0) is (1x3+3x1), which equals 6 ways. The matrix equation of the system considered has been derived as

\[
\begin{bmatrix} P_4 \\ P_3 \\ P_2 \\ P_1 \\ P_0 \end{bmatrix}
=
\begin{bmatrix}
0.25 & 0.0625 & 0.000 & 0.0156 & 0.0156 \\
0.75 & 0.375  & 0.125 & 0.1875 & 0.1875 \\
0.00 & 0.1875 & 0.125 & 0.1406 & 0.1406 \\
0.00 & 0.375  & 0.625 & 0.5625 & 0.5625 \\
0.00 & 0.000  & 0.125 & 0.0937 & 0.0937
\end{bmatrix}
\begin{bmatrix} P_4 \\ P_3 \\ P_2 \\ P_1 \\ P_0 \end{bmatrix}
\]

with the constraint:

\[ P_4 + P_3 + P_2 + P_1 + P_0 = 1 \]

The average number of busy memory modules is given by

\[ \sum_{i=1}^{m} i\,p(i) = \sum_{i=1}^{m} i \sum_{j=1}^{m} p(j)\,A(i,j), \]

it follows that the average throughput will be given by

\[ T_s(\mathrm{AVE}) = T_p \sum_{i=1}^{m} i \sum_{j=1}^{m} p(j)\,A(i,j) \]

Table III shows the average number of busy memory modules for an 8x8 discrete Markov chain model during one cycle. This is in contradiction with the assumption made that the processing time is null. Figure 6 shows the average throughput for a 3x3 system. Actually, the request generation rate follows a certain distribution. In order for the comparison of the present results with the time-shared bus performance results to hold, it is assumed that processors can generate new requests every cycle.

Figure 8. Transition States of a 4x4 System (enumeration tree from the initial states I to the final states F, with W the number of ways of each transition)

TABLE III
DISCRETE MARKOV CHAIN MODEL
NUMBER OF PROCESSORS Pc = 1, 2, ..., 8 (ROWS)
NUMBER OF MEMORY MODULES Mp = 1, 2, ..., 8 (COLUMNS)

        1       2       3       4       5       6       7       8
1    1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
2    1.0000  1.5000  1.6667  1.7500  1.8000  1.8333  1.8571  1.8750
3    1.0000  1.6667  2.0476  2.2692  2.4095  2.5054  2.5748  2.6272
4    1.0000  1.7500  2.2707  2.6210  2.8630  3.0365  3.1657  3.2652
5    1.0000  1.8000  2.4102  2.8633  3.1996  3.4530  3.6482  3.8019
6    1.0000  1.8333  2.5059  3.0370  3.4533  3.7809  4.0415  4.9471
7    1.0000  1.8571  2.5751  3.1663  3.6486  4.0418  4.3636  4.6292
8    1.0000  1.8750  2.6274  3.2657  3.9624  4.2521  4.6294  4.7491
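The Markov model can be explored numerically with the sketch below. It is an illustration under stated assumptions rather than the procedure used in the thesis: a state is taken to be the sorted vector of queue lengths over the m modules, every busy module serves one request per cycle, and each served processor immediately issues a new request to a module chosen uniformly at random. The steady state is found by simple power iteration; the function names are hypothetical, and the enumeration over request patterns is only practical for small p and m.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Tuple

State = Tuple[int, ...]


def transitions(state: State, m: int) -> Dict[State, float]:
    """Successor states and their probabilities for one memory cycle."""
    served = sum(1 for q in state if q > 0)      # one request served per busy module
    remaining = [q - 1 if q > 0 else 0 for q in state]
    out = defaultdict(float)
    weight = 1.0 / m ** served
    # Each served processor issues a new request, uniform over the m modules.
    for targets in product(range(m), repeat=served):
        queues = remaining[:]
        for j in targets:
            queues[j] += 1
        out[tuple(sorted(queues, reverse=True))] += weight
    return dict(out)


def average_busy_modules(p: int, m: int, tol: float = 1e-12) -> float:
    """Steady-state expected number of busy modules for p processors, m modules."""
    start = tuple([p] + [0] * (m - 1))
    table, frontier = {}, [start]
    while frontier:                              # discover the reachable states
        s = frontier.pop()
        if s not in table:
            table[s] = transitions(s, m)
            frontier.extend(table[s])
    dist = {s: 1.0 / len(table) for s in table}
    while True:                                  # power iteration to steady state
        new = dict.fromkeys(table, 0.0)
        for s, prob in dist.items():
            for nxt, a in table[s].items():
                new[nxt] += prob * a
        delta = max(abs(new[s] - dist[s]) for s in table)
        dist = new
        if delta < tol:
            break
    return sum(prob * sum(1 for q in s if q > 0) for s, prob in dist.items())


if __name__ == "__main__":
    for p, m in [(2, 2), (3, 2), (3, 3)]:
        print(p, m, round(average_busy_modules(p, m), 4))
```

Under these assumptions the 3x3 case works out to 43/21, about 2.0476, and the 2x2 case to 1.5, which agrees with the corresponding entries of Table III. Multiplying the result by Tp gives the average throughput in the sense of Ts(AVE).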
Feller [4], treating the "occupancy problem", gave the transition probability A^(n)(i,j) as a summation over an index v starting at v = 0 (cf. Appendix), where A^(n)(i,j) is the probability that there will be j occupied memories after n additional requests. The average throughput of the considered model has been plotted in Figure 7. Indeed, the plot shows that there is a more substantial increase of throughput with the number of processors than in the case of the time-sharing configuration.
B. SYSTEM COST

In this section, the system cost is studied in order to determine how much the potential increase in system throughput will cost. For this purpose the following subsystem costs have been defined:

Cr: Cost of system resources (including memory, mass storage, and peripheral devices).
Cp: Cost of an individual processor (including MOS LSI microprocessor chips, power supply cost, and mechanical assembly).
Cs: Cost of the switch (including wiring, control logic, arbitration and conflict solving, and mechanical assembly of the switch).

For a system with p processors, the total system cost is derived as

\[ C_t = C_r + p\,C_p + C_s \]

Two systems have been considered: one in which Cp = Cr/5 and Cs = p^2 Cj, the other in which Cp = Cr/30 and Cs = p^2 Ci. Ci and Cj are the costs of an individual switch node and its associated control and wiring, and have been chosen to be equal:

\[ C_i = C_j = C_r/50 \]

Tp is assumed to be the same in both cases.

Figure 9 shows the increase in Ct with p. A further assumption has been made that Cr is independent of the number of microprocessors p. In reality, an increase in p may require an increase in total storage. As opposed to the results based on the time-sharing technique, where the cost of the system has been found to be linear with p, the cost of the present system using a switch has been found to be parabolic, as expressed respectively by the equations of the two chosen systems:

\[ C_t = C_r \left(1 + \frac{p}{5} + \frac{p^2}{50}\right) \quad \text{and} \quad C_t = C_r \left(1 + \frac{p}{30} + \frac{p^2}{50}\right) \]
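The cost model can be tabulated with a short sketch. It assumes the switch cost grows as p squared times the cost of one node, consistent with the two closed forms Ct = Cr(1 + p/5 + p^2/50) and Ct = Cr(1 + p/30 + p^2/50). The function name `total_cost` and the normalization Cr = 1 are choices made for the example.

```python
def total_cost(p: int, cr: float, cp: float, node_cost: float) -> float:
    """Ct = Cr + p*Cp + Cs, with the switch cost taken as Cs = p**2 * node_cost."""
    return cr + p * cp + p ** 2 * node_cost


if __name__ == "__main__":
    cr = 1.0                      # costs expressed in units of Cr
    node = cr / 50                # Ci = Cj = Cr/50
    for p in range(1, 11):
        expensive_cpu = total_cost(p, cr, cr / 5, node)   # case Cp = Cr/5
        cheap_cpu = total_cost(p, cr, cr / 30, node)      # case Cp = Cr/30
        print(p, round(expensive_cpu, 3), round(cheap_cpu, 3))
```

Dividing each cost by a throughput estimate for the same p gives the cost per instruction ratio Ct/Ts discussed in the text.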

The information in Figures 7 and 9 has been combined in Figure 10 to indicate cost versus throughput. The cost of the system is a strong function of the ratio Cr/Cp, and system cost increases rapidly as p approaches 10, diminishing the cost-effectiveness of the system. In order to determine the optimum number of microprocessors in the system, the ratio Ct/Ts has been calculated. This ratio is the cost per instruction execution that the user would have to pay. The information obtained from Figures 7 and 10, and plotted in Figure 11, shows that, for two systems with analogous parameters, the time-shared system user will pay a lower price per instruction execution than the space-shared system user as long as the number of processors in the system does not exceed 10. For a larger number, the space-shared configuration seems to be more attractive. This illustrates the advantages of minimizing both Cp/Cr and the value of m as shown.

Figure 9. Relative System Cost Vs Number of Processors for Two Cases, m=10 (vertical axis: system cost Ct)

Figure 10. System Cost Vs System Throughput, m=10 (horizontal axis: system throughput, marked 2Tp to 8Tp)

Figure 11. Normalized Curves of System Cost/Throughput Vs Number of Processors, m=10

For today's microprocessors, the ratio of Cp to Cr is typically very low, since the cost of a complete microprocessor is in the range of several hundred dollars, while system memory and peripherals may be in the range of $5,000 to $20,000. However, the cost of a switch increases as p^2. For the C.mmp computer developed at Carnegie-Mellon University, the cost of the switch turned out to be half the cost of the entire system. A means of decreasing the number of shared memory modules m in the system is to provide some memory local to each microprocessor. This approach has the double advantage of reducing memory reference interference and providing high-speed access to data memory.

IV. THE MIMD MACHINE

A. DEFINITION OF SIMD AND MIMD MACHINES

The example of multiprocessor systems chosen in this study is the MIMD type. Two types of parallel processing systems are single instruction stream-multiple data stream (SIMD) machines and multiple instruction stream-multiple data stream (MIMD) machines. An SIMD machine typically consists of a set of p processors and m memories, an interconnection network, and a control unit. The control unit broadcasts instructions to the processors, and all active processors execute the same instruction at the same time. Thus a single instruction stream drives all the processors. Each processor executes instructions using data taken from a memory to which only it is connected. This provides a multiple data stream. The interconnection network allows interprocessor communications. An example of such a machine is the ILLIAC IV [5]. An MIMD machine typically consists of p processors and m memories, where each processor may follow an independent instruction stream. Hence, there are multiple instruction streams. As with SIMD, there is a multiple data stream and an interconnection network. An example of such a machine is the C.mmp [6]. Figures 12 and 13 show the SIMD and MIMD computers respectively.
Figure 12. Block Diagram of a SIMD Computer (control unit; processors PRO 1 to PRO p; memories MEM 1 to MEM m; interconnection network)

Figure 13. Block Diagram of a MIMD Computer (includes I/O channels)

B. PARALLELISM THROUGH THE SWITCH NETWORK

A typical MIMD multiprocessor using a switch network as described previously has been considered. The first part of this section is devoted to the modeling of the switch node operation; the rest of the section examines problems related to two sensitive areas, interprocessor control and resource sharing and scheduling. Petri nets, which appear to be a clear and convenient way to express process coordination, are used here to explore such problems. To avoid any ambiguity for the reader, the definition of Petri nets and the simulation rules are given explicitly.
1. Overview: James L. Peterson [7] defined a Petri net as follows:

"A Petri net is an abstract, formal model of information flow. The properties, concepts, and techniques of Petri nets are being developed in a search for natural, simple and powerful methods for describing and analyzing the flow of information and control in systems that may exhibit asynchronous and concurrent activities."

Figure 14 shows a simple Petri net. The graph contains two types of nodes: circles (called places) and bars (called transitions). The places and transitions are connected by directed arcs from places to transitions and from transitions to places.
?2

Place P^ is an input to transition T^ and places

an d P^ are output to transition T^.

The execution of

Petri nets is controlled by markers moving around the graph.
Each place has one or more markers in it or may be empty.

A

transition is said to be enabled if all its input places
contain at least one marker

(or token).

The transition fires

by removing the enabling tokens from their input places

*>

Figure 14.

t2

A Marked Petri Net

35
and generating new tokens which are deposited in the output
places of the transition.

Petri nets constitutes a broad

area of study and the reader should consult the literature
on this advanced theory for more information [8,9].
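The enabling and firing rules stated in the overview can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's simulator: the class name `PetriNet`, its methods, and the three-place example net are hypothetical, and the net is not a transcription of Figure 14. All arcs are assumed to have weight one.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """Marked Petri net: places hold token counts, transitions move tokens."""
    marking: dict                                  # place name -> token count
    inputs: dict = field(default_factory=dict)     # transition -> input places
    outputs: dict = field(default_factory=dict)    # transition -> output places

    def add_transition(self, name, inputs, outputs):
        self.inputs[name] = list(inputs)
        self.outputs[name] = list(outputs)

    def enabled(self, name):
        # Enabled when every input place holds at least one token.
        return all(self.marking.get(p, 0) >= 1 for p in self.inputs[name])

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        for p in self.inputs[name]:       # remove the enabling tokens
            self.marking[p] -= 1
        for p in self.outputs[name]:      # deposit new tokens in the outputs
            self.marking[p] = self.marking.get(p, 0) + 1


# Hypothetical example net: p1 -> t1 -> {p2, p3} and {p2, p3} -> t2 -> p1.
net = PetriNet(marking={"p1": 1, "p2": 0, "p3": 0})
net.add_transition("t1", inputs=["p1"], outputs=["p2", "p3"])
net.add_transition("t2", inputs=["p2", "p3"], outputs=["p1"])
```

Firing `t1` moves the single token out of `p1` and marks both `p2` and `p3`, which in turn enables `t2`; firing `t2` restores the initial marking.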
2. Node Switch Operation: When two requests arrive at a switch
node, contention is resolved as follows: the switch node selects
one packet and rejects the other one if the two packets are to be
passed to the same output. It takes time t^ to determine the
successor node to which the message is to be sent. If that output
is in use, the message waits its turn for the use of the output
link. When the selected output port becomes free, it takes time
tg for data to be available at the output port. Figure 15 models
the switch node operation as a timed Petri net. Input A can
select either output C or D, whereas input B can only select
output C. Place P^ cannot acquire a token and hence disables
transition tg from firing. Consequently output D is forbidden to
input B. When a message-packet is rejected by the switch node, it
is automatically lost. If there is no conflict at the switch node
level, processors carry out their tasks in a concurrent fashion,
creating parallelism, as will be seen in the remainder of the
section.
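The select-one, reject-the-other rule can be illustrated with a small arbitration function. This is only a sketch under stated assumptions: the function name `arbitrate` is hypothetical, the output port names C and D are assumed, the priority rule (first request in the list wins) is an assumption because the text does not say how the winner is chosen, and the timing of the node is not modeled.

```python
def arbitrate(requests, allowed):
    """Resolve contention at one switch node for a single cycle.

    requests: list of (input_port, output_port) pairs in priority order
              (assumed rule: the earlier request wins a conflict).
    allowed:  dict mapping each input port to the outputs it may select.
    Returns (granted, lost): granted maps output -> input, lost lists the
    rejected requests, which are dropped rather than queued.
    """
    granted = {}
    lost = []
    for inp, out in requests:
        if out not in allowed.get(inp, set()):
            lost.append((inp, out))     # forbidden output: cannot be routed
        elif out in granted:
            lost.append((inp, out))     # output already selected: packet lost
        else:
            granted[out] = inp
    return granted, lost


# Assumed connectivity: input A reaches outputs C and D, input B only C.
ALLOWED = {"A": {"C", "D"}, "B": {"C"}}
```

With both inputs requesting output C, `arbitrate([("A", "C"), ("B", "C")], ALLOWED)` grants C to A and reports B's packet as lost; when A asks for D instead, both requests go through concurrently.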
Figure 15. Switch Node Operation

3. Interprocessor Control: A major concern about interprocessor
control lies in the synchronization of the processors to carry
out a parallel computation correctly. To illustrate this problem,
the producer-consumer problem with one producer and two consumers
is considered. The items produced by the producer are passed to
the consumers to be picked up on a random basis. Only one item
can be consumed by one consumer at a time. In order to avoid
access to the produced item by both consumers simultaneously, one
of the consumers must lock the other out of trying to consume the
same item. The instruction to do this must be indivisible. The
indivisibility can be achieved by instructions of the form
"Test-and-Set" as implemented in many systems. The portions of
code executed by two processors wishing to access a common
resource are called "critical sections". To control the correct
execution of critical sections without conflict, Dijkstra [10]
introduced a new concept using semaphores. A semaphore is a
variable upon which a processor can execute a P and a V operation
as in the following:

V(S):     S ← S + 1
P(S): L:  if S = 0 then go to L else S ← S - 1
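The P and V operations above can be mimicked in Python as a busy-waiting semaphore. This is a hedged sketch: Python exposes no Test-and-Set instruction, so a `threading.Lock` is assumed here to stand in for the hardware indivisibility, and the names `SpinSemaphore` and `run_consumers` are hypothetical. Two consumer threads take items from a shared list, and the semaphore ensures no item is consumed twice.

```python
import threading


class SpinSemaphore:
    """Busy-waiting semaphore following the P/V definitions in the text."""

    def __init__(self, value=1):
        self._value = value
        self._guard = threading.Lock()   # assumed stand-in for indivisibility

    def P(self):
        while True:                      # L: if S = 0 then go to L
            with self._guard:
                if self._value > 0:
                    self._value -= 1     # else S <- S - 1
                    return

    def V(self):
        with self._guard:
            self._value += 1             # S <- S + 1


def run_consumers(n_items=200):
    """One pre-filled buffer, two consumers; returns the items consumed."""
    sem = SpinSemaphore(1)
    items = list(range(n_items))
    taken = []

    def consumer():
        while True:
            sem.P()                      # enter critical section
            try:
                if not items:
                    return
                taken.append(items.pop())
            finally:
                sem.V()                  # leave critical section

    threads = [threading.Thread(target=consumer) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken
```

The busy loop in `P` mirrors the "go to L" of the definition; a production system would normally block the waiting process instead of spinning.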

Figure 16 summarizes the producer-consumer mutual exclusion
protocol in terms of Petri nets. Places p^ and p^ represent the
producer, and places p^, p^, p^ and p^, p^, pg represent the
consumers. Place P^ models the semaphore setting. Transitions t^
and t^ are mutually exclusive: firing one disables the other
automatically. Transitions tg and t^ represent the critical
sections of process 1 and process 2. Transitions t^, t2, t^, t^
control the entry to and exit from the critical sections.
Transition t^ represents the production process. As can be seen,
synchronization of concurrent processes can be achieved using
semaphore techniques.

Figure 16. Illustration of the Producer-Consumer Problem

Unfortunately, the solution to these problems is related to
scheduling and reliability problems. The failure of the processor
whose process is in its critical section may lead to a dangerous
situation: the rest of the processors will be blocked in an
infinite testing loop. In the following discussion, a possible
approach to the problem is explored, in which a lock is used
instead of a semaphore. The lock consists of one part to test and
set the lock and a busy signal bit to indicate that the processor
executing the critical section code is successfully running. A
process wishing to obtain the lock tests the appropriate part of
the lock with a single indivisible instruction. If the result of
the test indicates that the lock is free, it is then locked and
the locking process can execute its critical code. If the lock
part is set, the processor performs a second test upon the busy
signal bit. If this bit is set, then the processor using the
resource is still executing properly. Otherwise, the operating
system unlocks the lock indivisibly, allowing one of the waiting
processors to proceed to use the lock and execute its critical
code. Figure 17 shows a Petri net model of two concurrent
processes using a lock. If either transition t5 or t^ does not
fire, then the firing of transition t^ or t^ resets the lock at
place Py. Either process 1 or process 2, at place p1 or p^, now
has a token in its place. Transition t^ (or transition t^) is now
enabled and process 1 (or process 2) is now ready to use the lock
and enter its critical section.
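The two-part lock described above (a test-and-set part plus a busy signal bit) can be sketched as follows. Everything named here is hypothetical: the class `WatchedLock`, its methods, and in particular the `fail` method, which simply clears the busy bit to simulate a failed processor. How the busy bit drops on a real failure is not specified in the text and is an assumption of this sketch; a `threading.Lock` again stands in for indivisible instructions.

```python
import threading


class WatchedLock:
    """Lock with a test-and-set part and a busy signal bit (illustrative)."""

    def __init__(self):
        self._guard = threading.Lock()   # assumed stand-in for indivisibility
        self.locked = False              # part that is tested and set
        self.busy = False                # set while the owner runs properly
        self.owner = None

    def try_acquire(self, proc):
        """First test: take the lock if it is free."""
        with self._guard:
            if not self.locked:
                self.locked, self.busy, self.owner = True, True, proc
                return True
            return False

    def recover_if_dead(self):
        """Second test: lock set but busy bit clear means the owner failed,
        so the lock is reset indivisibly (the operating system's role)."""
        with self._guard:
            if self.locked and not self.busy:
                self.locked, self.owner = False, None
                return True
            return False

    def release(self, proc):
        """Normal exit from the critical section."""
        with self._guard:
            if self.owner == proc:
                self.locked, self.busy, self.owner = False, False, None

    def fail(self, proc):
        """Simulated processor failure: busy bit drops, lock part stays set."""
        with self._guard:
            if self.owner == proc:
                self.busy = False
```

A waiting processor would call `try_acquire`, and on failure call `recover_if_dead` before retrying, so a crashed owner no longer leaves the others in an infinite testing loop.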

Figure 17. Solution to the Reliability Problem

4. Resource Sharing and Scheduling: In this section, the deadlock
problem is examined. To illustrate the ideas once again, an
example of a deadlock problem is considered. Two processes p1 and
p2 request use of memory modules M1 and M2, as shown in
Figure 18. Process p1 acquires memory module M1 and then needs
M2; process p2 acquires memory module M2 and needs M1 to carry
on. The operating system's resource scheduler services process
p1's first request (transition t1, M1) and process p2's first
request (transition t2, M2). From there, neither process can
continue (the places p^ and p^ are empty and neither transition
t1, M2 nor t2, M1 can fire).
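The deadlock just described can be reproduced with a small token game. The net below is a hypothetical encoding, not a transcription of Figure 18: place names such as `M1_free` and `p1_holds_M1`, and transition names such as `t1_M1`, are assumptions chosen for readability.

```python
def enabled(marking, transition):
    """A transition is enabled when each input place holds a token."""
    inputs, _outputs = transition
    return all(marking[p] >= 1 for p in inputs)


def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    inputs, outputs = transition
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] += 1
    return new


# Hypothetical net; each transition is (input places, output places).
T = {
    "t1_M1": (["p1_start", "M1_free"], ["p1_holds_M1"]),
    "t1_M2": (["p1_holds_M1", "M2_free"], ["p1_done", "M1_free", "M2_free"]),
    "t2_M2": (["p2_start", "M2_free"], ["p2_holds_M2"]),
    "t2_M1": (["p2_holds_M2", "M1_free"], ["p2_done", "M1_free", "M2_free"]),
}
INITIAL = {"p1_start": 1, "p2_start": 1, "M1_free": 1, "M2_free": 1,
           "p1_holds_M1": 0, "p2_holds_M2": 0, "p1_done": 0, "p2_done": 0}


def is_deadlocked(marking):
    """No transition can fire although some process has not finished."""
    finished = marking["p1_done"] == 1 and marking["p2_done"] == 1
    return not finished and not any(enabled(marking, t) for t in T.values())


# Scheduler grants both first requests: each process holds one module.
stuck = fire(fire(INITIAL, T["t1_M1"]), T["t2_M2"])
```

In the marking `stuck`, both `M1_free` and `M2_free` are empty, so neither second request can fire. Serving the requests in the order `t1_M1`, `t1_M2`, `t2_M2`, `t2_M1` lets both processes finish instead.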

To circumvent the deadlock problem, some prior conditions must be
set before the system requirements are met. Prevention of system
deadlock has been discussed and carefully analyzed in the
literature [11]. In the light of this analysis, and based on the
conditions derived in order to avoid deadlock problems, a
solution to the deadlock problem cited above is given and
illustrated in terms of Petri net techniques in Figure 19. The
graph is self-explanatory. At transitions t^ and t^, there is a
mutual exclusion set by place p^. Firing either transition
disables the other and enables its next transition in sequence by
placing a token in either place p^ or p^ accordingly. This
illustrates one of the conditions to avoid the deadlock problem,
which consists of preventing a process from holding exclusive
control of some resources while a request for more resources is
pending.

Figure 18. Illustration of the Deadlock Problem

Figure 19. Solution to the Deadlock Problem

The deadlock problem has been a burden in the domain of
multiprocessor task scheduling for years, and the only way to
circumvent it is to set preventive conditions, which
unfortunately increase operating system overhead.
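The effect of a mutual exclusion place on the two-process, two-module example can be checked by enumerating every reachable marking. The net below is a hypothetical illustration in the spirit of the description above, not a transcription of Figure 19: the place `mutex` and all other names are assumptions. Whichever process fires first takes the `mutex` token, which disables the other process's first request until both modules are released.

```python
from collections import deque


def enabled(marking, transition):
    """A transition is enabled when each input place holds a token."""
    inputs, _outputs = transition
    return all(marking[p] >= 1 for p in inputs)


def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    inputs, outputs = transition
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] += 1
    return new


# Hypothetical net: the first request of each process also needs `mutex`.
RELEASE = ["M1_free", "M2_free", "mutex"]
T = {
    "t1_M1": (["p1_start", "M1_free", "mutex"], ["p1_holds_M1"]),
    "t1_M2": (["p1_holds_M1", "M2_free"], ["p1_done"] + RELEASE),
    "t2_M2": (["p2_start", "M2_free", "mutex"], ["p2_holds_M2"]),
    "t2_M1": (["p2_holds_M2", "M1_free"], ["p2_done"] + RELEASE),
}
INITIAL = {"p1_start": 1, "p2_start": 1, "M1_free": 1, "M2_free": 1,
           "mutex": 1, "p1_holds_M1": 0, "p2_holds_M2": 0,
           "p1_done": 0, "p2_done": 0}


def reachable(initial):
    """Breadth-first enumeration of all markings reachable from `initial`."""
    seen = {tuple(sorted(initial.items()))}
    queue, found = deque([initial]), [initial]
    while queue:
        m = queue.popleft()
        for t in T.values():
            if enabled(m, t):
                n = fire(m, t)
                key = tuple(sorted(n.items()))
                if key not in seen:
                    seen.add(key)
                    queue.append(n)
                    found.append(n)
    return found


def is_deadlocked(marking):
    """No transition can fire although some process has not finished."""
    finished = marking["p1_done"] == 1 and marking["p2_done"] == 1
    return not finished and not any(enabled(marking, t) for t in T.values())
```

No reachable marking of this net is deadlocked, at the price of serializing the two processes, which is one concrete form of the overhead mentioned above.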

V. CONCLUSION

It has been shown that a switch matrix configuration for the
processor-memory interconnection network offers reliability and
expandability. If a switch node fails, the system can still
function with less memory and degraded performance.

A computer system model has been used to estimate the performance
of a computer using a switch network relative to that of a system
using a bus. The performance bounds have been found to be
equivalent in both systems. However, the average throughput has
been shown to increase more with the number of processors when
the switch is used. Certain simplifying assumptions have been
made to make the analysis tractable, but the model can be used to
at least approximate the performance of some computer systems.

Expressions for the cost of the system have been given for the
cases where the cost of system resources is thirty times and five
times the cost Cp of an individual processor. It appears that,
over a certain range of the number of processors, the space
configuration handles large parallel computations better than the
time configuration.

Parallelism through such a switch network has been viewed in
terms of Petri net modeling techniques. The switch node operation
has been investigated in detail. Problems related to
interprocessor control and to resource sharing and scheduling
have been studied. Possible approaches to their solution have
been given and validated through examples. When the switch node
operation was explored, it was stated that a non-selected
message-packet is rejected by the system and consequently lost.
A topic of significant importance would be the investigation of
networks with buffering capability to permit request queueing and
prevent this loss.

BIBLIOGRAPHY

1. Enslow, P.H. (Ed), Multiprocessors and Parallel Processing,
   John Wiley & Sons, N.Y. 1974
2. Reyling, G. Jr. Performance and Control of Multiple
   Microprocessors Systems, Computer Design, March 1974,
   pp 81-86.
3. Bhandarkar, D.P. Analysis of Memory Interference in
   Multiprocessors, IEEE Trans. on Comp. C-24 (Sep. 1975)
   pp 897-908
4. Feller, W. An Introduction to Probability Theory and Its
   Applications. Vol. I, John Wiley & Sons, N.Y. 1968
5. Davis, R.L. The Illiac IV Processing Element, Trans. on Comp.
   C-18 (Sep. 1968), pp 800-816
6. Wulf, W.A. and C.G. Bell, C.mmp A Multimini Processor, Proc.
   AFIPS 1972 Fall Joint Comp. Conf. 41, AFIPS Press, Montvale,
   N.J. 1972, pp 765-777
7. Peterson, J.L. Petri nets, Computing Surveys, Vol. 9 No. 3
   (Sep. 1977) pp 223-252
8. Murata, T. and Church, R.W. Analysis of Marked Graphs and
   Petri nets by Matrix Equations, Research Report MDC 1.1.8,
   Dept. Information Engineering, Uni. Illinois Chicago Circle
   (Nov. 1975) 25 pp
9. Petri, C.A. Concepts of net Theory, in Proceedings Symp. and
   Summer School on Mathematical Foundation of Computer Science,
   High Tatras, (Sep. 1973) pp 137-146
10. Dijkstra, E.W., Solution of a Problem in Concurrent
    Programming, Comm. ACM 8, (Sep. 1965) pp. 569-570.
11. Stone, H.S., Parallel Computers, Introduction to Computer
    Architecture, H.S. Stone, ed. SRA, Chicago, Ill. 1975,
    pp 318-374

VITA

Rabah Aoufi was born on March 2, 1955 in Medjana, Setif State,
Algeria. He received his primary and secondary education in
Medjana, Bordj-Bou-Arreridj, and Dellys (Algeria). In September
1974 he entered the University of Bab-Ezzouar, Algiers, which was
opened to receive students for the first time. He received the
equivalence of a Bachelor of Science Degree in Electrical
Engineering from l'Ecole Polytechnique d'Alger in May 1977. He
then came to the United States in January 1978 and attended an
English course at Columbia University, New York. In September
1978 he entered the University of Missouri-Rolla and held the
position of Graduate Teaching Assistant during the Fall of 1980.

APPENDIX
THE OCCUPANCY PROBLEM

Consider a sequence of independent trials, each consisting of
placing a request at random at one of m given memory modules. The
system is said to be in state pk if exactly k memory modules are
occupied. This determines a Markov chain with states p1, ..., pm
and transition probabilities Pjk. On expressing the binomial
coefficients in terms of factorials, the formula for these
probabilities simplifies, with Pjk = 0 if k < j. (For a more
specific demonstration of this formula see [4].)
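As a numerical companion to the chain described above, the sketch below builds the one-step transition matrix directly from the definition (a new request lands on an occupied module with probability j/m and on a free one with probability (m - j)/m) and compares its n-th power with a closed form obtained by inclusion-exclusion. Both expressions are derived here from the stated definition and are assumptions about the form the thesis's formula takes; the function names, and the extra state 0 for "no module occupied", are hypothetical additions.

```python
from fractions import Fraction
from math import comb


def one_step_matrix(m):
    """P[j][k]: probability of going from j to k occupied modules in one trial."""
    P = [[Fraction(0)] * (m + 1) for _ in range(m + 1)]
    for j in range(m + 1):
        P[j][j] = Fraction(j, m)               # request hits an occupied module
        if j < m:
            P[j][j + 1] = Fraction(m - j, m)   # request hits a free module
    return P


def mat_mul(A, B):
    """Exact product of two square matrices of Fractions."""
    size = len(A)
    return [[sum(A[i][r] * B[r][k] for r in range(size)) for k in range(size)]
            for i in range(size)]


def n_step_matrix(m, n):
    """n-step transition probabilities obtained by repeated multiplication."""
    result = [[Fraction(int(i == k)) for k in range(m + 1)]
              for i in range(m + 1)]
    step = one_step_matrix(m)
    for _ in range(n):
        result = mat_mul(result, step)
    return result


def closed_form(m, j, k, n):
    """Inclusion-exclusion form of the n-step probability (zero if k < j)."""
    if k < j:
        return Fraction(0)
    total = sum((-1) ** (k - j - v) * comb(k - j, v) * Fraction(j + v, m) ** n
                for v in range(k - j + 1))
    return comb(m - j, k - j) * total
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the matrix power and the closed form can be compared for equality rather than within a floating-point tolerance.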

