Multiprocessor with no contention by Marquart, David Wayne
MULTIPROCESSOR WITH NO CONTENTION 
BY 
DAVID WAYNE MARQUART 
B.S., University of Illinois, 1977 
THESIS 
Submitted in partial fulfillment of the requirements 
for the degree of Master of Science in 
Electrical Engineering in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1979 
Urbana, Illinois 
TABLE OF CONTENTS 
1. Introduction
2. Proposed Solution
3. Paging
4. Structure of a Processing Element
5. Messages and Interrupts
6. Input / Output
7. Operating System
8. Performance Evaluation
9. Reliability
10. Applications
11. Summary
References
CHAPTER 1 
Introduction 
Many different architectures have been proposed for
multiprocessing computers. Among these are array
processors, shared high-speed storage systems, and loosely
coupled systems communicating via I/O channels, direct
memory access, or common I/O devices. All of these methods
have drawbacks.
Array processors are good for matrix problems, but may 
not be attractive for a general computational task [1]. 
Processing systems communicating via external networks have 
problems in synchronization, data transfer speed, and 
interference between processors [1]. Systems using common 
storage are faster and conceptually simpler, but have
special problems with memory and bus contention, and with
the complexity of interconnections between storage and
processors [1]. Furthermore, there are problems in
designing operating systems for most multiprocessing
architectures.
The conceptual simplicity of a common high speed store 
is attractive for a multiprocessor architecture, but bus and 
memory conflicts must be prevented or minimized, and the 
complexity of the interconnection system between storage and
processors must be minimized. The complexity of the 
operating systems required to control the system should also 
be considered. 
CHAPTER 2 
Proposed Solution 
If the common store for a multiprocessing system were 
serial, and each processing node could access only one 
"window" past which the store circulated, there would be no 
possibility of bus or storage module contention. Each 
processing node could access only one element of the 
circulating store at a time, and the "windows" for all nodes
would be mutually disjoint. Therefore, no two processors 
could attempt to access an element of storage 
simultaneously. This schema is similar to other methods of 
preventing storage contention which allow a processor to 
access the shared storage only during the processor's time 
slot, during which all other processors are locked out of 
the store [1]. 
In the proposed storage access method, a processor can 
only access a storage location when the location passes the 
processor's access point into the circulating store. A
processor's action with the main store does not affect any
other processor's ability to transfer information to or from
the main store; any number of processors could be
transferring information to or from the main store
simultaneously, if the proper addresses are passing their 
ports. 
A conceptual model of the proposed scheme is presented
in Figure 1. Note that the store width can be as small as 
one bit. 
The proposed architecture also simplifies storage to 
processor interconnections. Each processor would need only 
a simple interface to the main store, consisting of a 
buffer, the hardware necessary to transfer data between the 
main store and the buffer, a counter to contain the address 
currently passing the port, and a data address register
holding the address to be read or written. The number of 
these interfaces would be the same as the number of 
processors, and would not grow rapidly as processors are 
added. This is in contrast to crossbar switching systems, 
which need NM bus interconnections, where N is the 
number of processors and M is the number of memory 
modules. 
The main store may be implemented with any serial
storage device, such as charge-coupled devices or magnetic
bubble devices. The ports may be distributed around the
loop in any fashion; they may be clustered at one point,
evenly spaced, or in multiple clusters. The only 
requirement is that the portion of the loop seen by any one 
port may not be shared with any other port. 
[Figure: processors, I/O processors, and dedicated operating
system processors (system control task, message control
task), each attached to the circulating main store]
Figure 1. Model of storage schema
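The contention-free property of the circulating store can be
illustrated with a small simulation. The sketch below is only
a model under stated assumptions: the names CirculatingStore,
cell_at_port, and shift are hypothetical, the store advances
by one cell per shift, and each port sits at its own fixed
position on the loop.

```python
class CirculatingStore:
    """Toy model of a serial store circulating past fixed ports.

    Assumption: the store advances by one cell per shift, and each
    port sits at its own fixed position on the loop.
    """

    def __init__(self, size, port_positions):
        if len(set(port_positions)) != len(port_positions):
            raise ValueError("ports must occupy disjoint positions")
        self.cells = [0] * size
        self.ports = list(port_positions)
        self.offset = 0  # number of shifts performed so far (mod size)

    def cell_at_port(self, port):
        """Return the address of the cell currently under a port."""
        return (self.ports[port] + self.offset) % len(self.cells)

    def shift(self):
        """Advance the store by one cell."""
        self.offset = (self.offset + 1) % len(self.cells)

    def write(self, port, value):
        """Write to whichever cell is passing the port right now."""
        self.cells[self.cell_at_port(port)] = value


def addresses_never_collide(store):
    """Check over one full circulation that no two ports see the same cell."""
    for _ in range(len(store.cells)):
        visible = [store.cell_at_port(p) for p in range(len(store.ports))]
        if len(set(visible)) != len(visible):
            return False
        store.shift()
    return True
```

Because every port sees a different cell at every shift, any
number of ports may read or write in the same shift without
arbitration; the only cost is waiting for the wanted address
to arrive.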
CHAPTER 3 
Paging 
The use of a serial memory exclusively would result in a
severe performance penalty. Therefore, each processing node
should have a local random access store and use the serial
main store for a page space. With the decreasing price of 
random access semiconductor storage, large executable stores 
for each processor in a system will become economical [2]. 
It has been demonstrated that small page sizes and a large 
executable store improve paging performance. With these 
large executable stores and the simplicity of moving small 
pages in and out of the main store, paging will be rather 
efficient. 
In paging systems using disk storage, the page sizes 
used are large to offset the overhead of the I/O operations
needed during paging. In the proposed architecture, the 
transfer mechanism between main store and local paging 
memory is simple enough to avoid excessive overhead and make 
small page sizes reasonable. The port only waits for the 
proper address to pass by and initiates a transfer to or 
from its buffer. This transfer must be fast enough to
complete before the next shift in the main store. Data
transfer between the port and the local random access 
storage page pool will probably involve little overhead, as 
the buffer and the page pool could reside in the same set of 
integrated circuits. 
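The port behavior described above, waiting for the proper
address and then transferring, can be sketched as follows.
This is a minimal model, not the thesis's hardware design: the
function names are hypothetical, and it assumes the address
counter advances by one (modulo the store size) per shift and
that one cell is transferred per shift.

```python
def shifts_until_address(current_address, target_address, store_size):
    """Shifts a port waits before target_address passes under it.

    Assumption: the address counter increases by one (modulo the
    store size) on every shift of the main store.
    """
    return (target_address - current_address) % store_size


def transfer_page(store, counter, page_start, page_length):
    """Copy page_length consecutive cells into a buffer, one cell per shift.

    Returns the buffer and the total number of shifts spent, which
    is the rotational wait plus the transfer itself.
    """
    size = len(store)
    wait = shifts_until_address(counter, page_start, size)
    buffer = [store[(page_start + i) % size] for i in range(page_length)]
    return buffer, wait + page_length
```

The cost is dominated by the rotational wait, not by the page
length, which is why small pages are reasonable here.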
CHAPTER 4 
Structure of a Processing Element 
Each processing element consists of a processor, an 
executable memory, equipment for communicating with the main 
serial store as described above, and equipment for 
communicating with other processors, which will be described 
later. Since the executable storage for a processor is 
paged, there must be dynamic address translation hardware to
allow transparent access to the store. Such hardware would 
probably consist of an associative memory. The processing 
nodes would also have protected storage for task status
information, page tables, and operating system data, as well 
as read only storage containing paging, interrupt handling, 
and other operating system support routines. Data 
protection for operating system data can be implemented via 
the dynamic address translation hardware. A block diagram
of a processing node may be found in Figure 2.
[Figure: a processing node, with blocks labeled message
system interface, dynamic address translation hardware, port
to main storage (connected to the main store), protected
system RAM, system ROM, and local page storage (RAM)]
Figure 2. Block Diagram of a Processing Node
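The dynamic address translation described for the processing
element can be pictured with a dictionary standing in for the
associative memory. The sketch below is illustrative only:
the page size, the class names, and the supervisor flag are
assumptions, not details given in the text.

```python
PAGE_SIZE = 64  # words per page; an assumed value


class PageFault(Exception):
    """Raised when a referenced page is not in the local page store."""


class ProtectionFault(Exception):
    """Raised when a task touches a protected operating system page."""


class AddressTranslator:
    """Dictionary stand-in for an associative translation memory."""

    def __init__(self):
        # virtual page number -> (local frame number, protected flag)
        self.entries = {}

    def map_page(self, virtual_page, frame, protected=False):
        """Record that a virtual page now resides in a local frame."""
        self.entries[virtual_page] = (frame, protected)

    def translate(self, virtual_address, supervisor=False):
        """Return the local store address for a virtual address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.entries:
            raise PageFault(page)
        frame, protected = self.entries[page]
        if protected and not supervisor:
            raise ProtectionFault(page)
        return frame * PAGE_SIZE + offset
```

A page fault here would start the paging operation against
the serial main store; a protection fault models the use of
the translation hardware to guard operating system data.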
CHAPTER 5 
Messages and Interrupts 
The operating system must have a way of interrupting 
running tasks, and there must be a way of alerting a task to 
events external to the task. This provides the basis for 
synchronization and data passing between tasks. A way to
provide such services is to use a storage system similar to 
the main serial store, but much smaller, which would carry 
messages between tasks. Being much smaller, the store would
have a shorter circulation time than the main store.
Messages should be fixed length and consist of a task 
address, a flag signalling whether the "slot" of the message 
store is being used or not, a rotation control flag, and 
data fields as necessary. The messages are addressed to 
tasks, not to processing nodes. Each processor has a 
mechanism scanning the address fields of the messages 
passing it in the message store, comparing them to the task 
identifier for the task executing on the processor. When a
match occurs, the message is read into a buffer, the "in
use" flag in the message store is set to "unused", and the 
task running on the processor is interrupted so that the 
local operating system routines can handle the message. A 
processor attempting to send a message would wait for an 
unused slot in which to insert the message. See Figure 3 
for a model of the message system. 
The rotation control flag is set to a null value on 
message entry. On passing the processor acting as the 
message system controller, the rotation control field is set 
to a "passed" value. If a message is not removed from the 
message store by the time it passes the message control 
processor a second time, it should be removed by the message 
control processor and stored, as it is addressed to a task 
which is not running. It should be queued to be delivered 
when the task is reactivated. The message control processor 
should be a dedicated processor. If any other task is run 
on the processor, it will be interrupted whenever a message 
to an inactive task is detected. If a processing node 
recognizes a message addressed to it, but cannot receive it 
immediately, the rotation control field should be reset to
prevent removal of the message by the message control
processor. The data fields of the message might consist of
the originating task's identification, command fields, and
main storage addresses of information to be passed to the
addressed task.
[Figure: the circulating message store carrying a message;
messages enter via the entry point at each processing element
and exit via the exits at each processing element; the
message control processor removes messages to inactive tasks
to a storage area, to be delivered upon task reactivation]
Figure 3. Message System
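The message protocol described in this chapter can be sketched
in a few lines. The model below is a simplification under
stated assumptions: slots are addressed by index instead of
circulating, the flag encodings NULL and PASSED are invented
for illustration, and all class and method names are
hypothetical.

```python
from dataclasses import dataclass, field

NULL, PASSED = 0, 1  # assumed encodings of the rotation control flag


@dataclass
class Slot:
    in_use: bool = False
    task: int = -1
    rotation: int = NULL
    data: tuple = ()


@dataclass
class MessageStore:
    slots: list
    held: dict = field(default_factory=dict)  # task -> queued messages

    def send(self, position, task, data):
        """Insert a message if the slot passing `position` is unused."""
        if self.slots[position].in_use:
            return False  # the sender must wait for a later slot
        self.slots[position] = Slot(True, task, NULL, data)
        return True

    def scan(self, position, running_task):
        """Processing node: take the message if addressed to its task."""
        slot = self.slots[position]
        if slot.in_use and slot.task == running_task:
            self.slots[position] = Slot()  # mark the slot unused
            return slot.data  # the node would now interrupt its task
        return None

    def control_pass(self, position):
        """Message control processor: mark, or remove on the second pass."""
        slot = self.slots[position]
        if not slot.in_use:
            return
        if slot.rotation == NULL:
            slot.rotation = PASSED
        else:
            # Addressed task is not running: queue for later delivery.
            self.held.setdefault(slot.task, []).append(slot.data)
            self.slots[position] = Slot()
```

A node that recognizes its message but cannot take it yet
would set the slot's rotation field back to NULL, so that the
next control pass marks it again and does not remove it.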
CHAPTER 6 
Input / Output
Input/output services are provided by special purpose
processors attached to the main store and message system. 
I/O requests are made by messages from tasks to the
operating system, which in turn signals the I/O processor
that should handle the request. Data transfer is
accomplished through the main store; main store addresses
are transmitted via the request message.
Only the I/O processors should be attached to any I/O
devices. This simplifies task scheduling, as tasks need not
be run on specific processors in order to access specific
devices; all non-I/O processors would remain identical.
The I/O processors use the same main storage interface as
the processing nodes, and the same message system interface;
the only differences are that the I/O processors are
attached to I/O devices and run only I/O related programs.
CHAPTER 7 
Operating System 
The operating system should consist of some routines 
residing in the individual processing units to handle 
program interrupts, operating system requests, message 
system interface, and paging, and a set of tasks which 
should handle task scheduling, I/O scheduling, operator
interface, and other concerns of the operating system. One 
of the operating system tasks would be the system control 
task; it would run continuously on a dedicated processor. 
Other operating tasks would run under its control, perhaps 
on dedicated processors. See Figure 4 for a diagram of the 
global control structure of an operating system for the 
machine. 
[Figure: operating system control structure; the recoverable
labels are queuing control and message system control]
Figure 4. Operating System Structure
Each processor, excluding I/O processors, should be 
identical; thus the task of allocating processors to tasks 
is one of allocating customers to identical servers. A 
single queue of ready tasks with priority considerations 
would be the simplest method of allocation [3]. 
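A single ready queue feeding identical processors might look
like the sketch below. It is an illustration only: the class
name, the convention that a lower number means a higher
priority, and the first-come ordering within a priority level
are all assumptions.

```python
import heapq
import itertools


class ReadyQueue:
    """Single queue of ready tasks shared by identical processors."""

    def __init__(self):
        self._heap = []
        # Tie-breaker that preserves arrival order within a priority.
        self._order = itertools.count()

    def add(self, task, priority):
        """Queue a task; a lower number means a higher priority (assumed)."""
        heapq.heappush(self._heap, (priority, next(self._order), task))

    def dispatch(self, idle_processors):
        """Assign the best ready tasks to whichever processors are idle."""
        assignments = {}
        for processor in idle_processors:
            if not self._heap:
                break
            _, _, task = heapq.heappop(self._heap)
            assignments[processor] = task
        return assignments
```

Since the servers are identical, the dispatcher never needs
to ask which processor suits a task, only which is idle.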
The working set concept [5] may be the most useful method
of page management. It shows some advantages over simple
first-in-first-out, least recently used (LRU), and other
methods of page management [4]. Furthermore, it would make
the identification of the pages
to be loaded for a task upon its reactivation following a 
suspension easier: only the pages in the working set would 
be loaded into the local paging store. 
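The working set of a task at a given moment is the set of
pages it referenced within a recent window. A minimal
sketch follows, assuming the window is measured in number of
references; the function names and the list representation
of the reference string are illustrative choices.

```python
def working_set(references, t, window):
    """Pages referenced in the last `window` references up to time t.

    `references` is the task's page reference string; index t is the
    most recent reference included.
    """
    start = max(0, t - window + 1)
    return set(references[start:t + 1])


def pages_to_reload(references, suspended_at, window):
    """Pages to bring back into the local store on reactivation."""
    return sorted(working_set(references, suspended_at, window))
```

For the reference string 1, 2, 1, 3, 4, 4, 2 and a window of
three references, the working set at the last reference is
pages 2 and 4, so only those two pages would be reloaded.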
CHAPTER 8 
Performance Evaluation 
For this architecture to be efficient, the amount of 
processor time lost to paging operations must be small. The 
number of paging operations which occur is dependent on the
particular algorithm being executed, the page replacement 
policy, the page size, and the executable storage size; in 
general, the distribution of page faults in time is not 
predictable. However, the amount of time lost for a page 
fault is computable. The page-wait time is the sum of 
several distinct times: page-out time (if a page is to be
replaced in executable storage), page-in time, and page 
replacement algorithm time. 
Page-out time is represented by the random variable X. 
If the timing of page faults and the position of the main 
store are uncorrelated, then X is uniform between 0 and C,
where C is the circulation time of the main store. In 
addition to X, the page-out operation requires some setup 
time and transfer time, lumped together as F, so that total 
page-out time is F + X. The page replacement algorithm may
decide that no page replacement is necessary. If P is the 
probability that a page-out is required, then P(X + F) is 
the average time requirement for page writing. 
Page-in is a more complicated process. If no page-out 
occurs, the analysis is the same as for page-out, without 
the factor P (alternatively, P = 1). If there is a
page-out, then the position of the serial main store may not 
be considered to be random when the page-in operation is 
started. The main store will be positioned just following 
the page written during page-out, and the page-in time 
depends on the relationship between the positions in the 
main store of the written out page and the page to be read 
in. If all the pages for a task were clumped together, the 
wait for the page to be read would be small, if the page to
be read followed the page to be written, or close to C if 
the page to be read preceded the written page. If there is 
any delay between the time page-write is completed and 
page-in may begin, then there would be a larger probability 
that the wait time would be closer to C than to zero. If 
the pages for a task were randomly scattered around the main
serial store, then the page-in wait time would be uniform
over (0, C), as X is. The mean for both situations is
C/2; however, the random distribution of pages would be
easier to implement, particularly if dynamic page allocation 
is to be allowed. Also, if there is any delay between 
page-write and being ready for page-in, the probability 
shift described above would cause the clumped pages scheme's 
mean to increase beyond C/2, while for random allocation 
the mean remains at C/2. Figure 5 shows the rotation delay 
density functions of the clumped and random allocation 
strategies for page-waits following a page-out operation. 
If the randomly scattered pages philosophy is used, the 
distribution of page-in time after a page-out is X + R, 
where R is a setup and transfer time similar to F. Since 
P is the probability that a page-out is required, the 
expression for page-in wait time is therefore 
P(X + R) + (1 - P)(X + R) = X + R
If the randomly scattered page philosophy were not used, and
the distribution of page-in wait times after page-out were
Z, the expression would be
P(Z + R) + (1 - P)(X + R)
Thus the total paging time for random page placement is
W = V + P(X + F) + X + R = V + X(1 + P) + PF + R
where V is the page replacement algorithm time. The mean
of the paging time is
mean(W) = (1 + P)mean(X) + V + PF + R
= (1 + P)C/2 + V + PF + R
[Figure: rotation delay probability density plotted against
time from 0 to C, with one curve for clumped allocation and
one for random allocation]
Figure 5. Probability Density of Page Wait Times
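The expression for mean(W) can be checked numerically. The
sketch below evaluates the formula and compares it with a
Monte Carlo estimate; the function names, the trial count,
and the choice to draw the page-out and page-in rotation
delays independently are assumptions of this illustration
(the mean is the same either way).

```python
import random


def mean_paging_time(C, V, F, R, P):
    """mean(W) = (1 + P) * C / 2 + V + P * F + R for random page placement."""
    return (1 + P) * C / 2 + V + P * F + R


def simulate_paging_time(C, V, F, R, P, trials=100_000, seed=0):
    """Monte Carlo estimate of mean(W) under the same assumptions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Page-in always happens: rotation wait plus setup and transfer.
        w = V + rng.uniform(0, C) + R
        # Page-out is needed only with probability P.
        if rng.random() < P:
            w += rng.uniform(0, C) + F
        total += w
    return total / trials
```

With C = 100, V = 2, F = 3, R = 4, and P = 0.5 (arbitrary
time units chosen for illustration), the formula gives 82.5,
and the simulated mean should land close to that value.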
An interesting special case arises when the serial main 
store is wide enough to hold all of the pages for a task. 
If concurrent read/write operations from disjoint portions 
of the width of the memory are possible, then all paging 
operations would take a setup time plus the rotation delay 
X. If not all the pages would fit into the width of the 
main store, and if the probability of the page to be read at
the same time as the swapped out page is written is Q,
then the paging time expression becomes
V + (1 + Q)X + QF + QR
It is of interest to find the proportion of useful 
computation time to total processor time; that is, the 
proportion of time that is not lost to paging operations. 
If K is the distribution of task running times between 
page faults, then the proportion of useful computation time 
is 
U = mean(K) / [mean(K) + mean(W)]
The distribution K cannot be derived in general. It 
depends heavily upon page size, page replacement policy, the 
number of pages which will fit in the executable storage, 
and the program and data being used. It has been 
demonstrated that small page sizes and large executable 
stores reduce the occurrence of page faults [3]. The
proposed architecture promotes the use of small page sizes, 
since there is little difference in difficulty between 
transferring one byte or many adjacent bytes between a
processor's store and the main store. 
If A is the speed of a processor in a machine of the 
proposed architecture (all processors in a system are 
identical) and the fraction of useful computation time is 
U, then the effective speed of a single processor is UA. 
If the number of general purpose processors (processors not 
dedicated to the operating system or I/O) is J, then the
effective speed of the machine is UAJ. 
The most interesting factor of the effective speed UAJ
is the fraction of useful computation time U. If U is
sufficiently large (in its range from zero to one), the
proposed architecture has a large advantage over
uniprocessor machines. To take a processor of speed A and
speed it up by a factor of, say, one hundred would be costly
and difficult. However, if U = 1/2, a pessimistic view,
then making the number of processors J equal to two
hundred would provide the same increase in speed. The
sharply declining prices of random access semiconductor 
memory, processors, and serial memories such as charge 
coupled devices and bubble memories may make the proposed 
architecture a more economical prospect than building single 
processors with the capabilities of the serial machine. 
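The quantities U and UAJ are simple enough to compute
directly. The sketch below restates them as functions; the
function names are hypothetical, and processors_needed is an
added convenience that inverts the effective speed formula.

```python
from math import ceil


def useful_fraction(mean_run_time, mean_paging_time):
    """U = mean(K) / (mean(K) + mean(W))."""
    return mean_run_time / (mean_run_time + mean_paging_time)


def effective_speed(U, A, J):
    """Effective speed U * A * J of J identical general purpose processors."""
    return U * A * J


def processors_needed(speedup, U):
    """Smallest J whose effective speed is `speedup` times a speed-A processor."""
    return ceil(speedup / U)
```

For example, with U = 1/2 a hundredfold speedup over a single
processor of speed A needs processors_needed(100, 0.5), that
is, two hundred processors, matching the figure in the text.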
Furthermore, if the uniprocessor to which a machine of
the proposed architecture is being compared is a virtual
storage machine running the same task mix, then its fraction
of useful processing time U must be equal to or lower than
the U of the serial store machine. If the uniprocessor
pages to a disk or drum, the page size overhead and slow
speed of magnetic device information transfer will cause the
paging speed to be reduced. Multiprogramming makes up for
part of this, but also introduces more overhead. 
The proposed architecture also has advantages over other 
multiprocessor architectures due to the simple processing 
node to main storage interconnection, elimination of bus and 
memory contention and the associated arbitration mechanisms, 
separation of computation and I/O functions, ease of 
expansion, and the applicability of the machine to general 
tasks, not only special purposes. 
CHAPTER 9 
Reliability 
The system is made up of identical processing elements; 
thus, if one or more fail, the system may still operate,
even if performance is degraded [1]. As repair time should 
be short compared to the mean time to processing element 
failure, it is unlikely that many processing elements would 
be out of service at any one time; a many-processor system
would rarely be degraded significantly in throughput due to 
the small proportion of processing elements expected to be 
out of service at a time. 
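The claim that few processing elements are out of service at
once can be made concrete with the standard steady-state
availability formula. The sketch below is an added
illustration, not part of the original analysis; it assumes
that elements fail and are repaired independently, and the
function names are hypothetical.

```python
from math import comb


def availability(mttf, mttr):
    """Steady-state probability that one processing element is in service."""
    return mttf / (mttf + mttr)


def prob_at_least_up(n, k, mttf, mttr):
    """Probability that at least k of n independent elements are in service."""
    a = availability(mttf, mttr)
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))
```

With a mean time to failure 99 times the repair time, each
element is in service 99 percent of the time, so on average
only about one element in a hundred is down.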
If special equipment is used in the message control 
processor to monitor the message store, then the failure of 
the message control processor would make the system 
inoperable. This suggests that the necessary hardware be 
implemented (but not enabled) on all, or at least several, 
processor units to provide backup. 
The I/O processors are not identical servers; they are
attached to separate I/O devices. If I/O devices can be
switched among I/O processors, then multiple I/O processors
provide a backup. If spare I/O processors are provided,
degradation-free processing can proceed even in the event of
an I/O processor failure.
The main store and the message system store are the most 
critical elements of the system. If a bit stream becomes 
inoperable, then many pages will be destroyed. If the store
were modular enough to be operable in part even if failure
occurred in sections of it, then machine restart would be
possible. If the data in the store were encoded in an error
correcting format, then multiple bit errors due to below
tolerance storage devices could be corrected and the data
transferred to another part of the circulating store until
the malfunctioning bit stream could be repaired.
Any of the above failure modes might result in the loss 
of some data. For an I/O processor or general processing
node failure, only one task would be affected. Such tasks 
could be restarted from checkpoints or cancelled to be 
rerun. In some cases, tasks could affect other tasks so 
heavily that a set of tasks would have to be cancelled. The
most serious effect that a task termination could have would
be a non-recoverable operating system task termination.
Such a termination might force system reinitialization. 
Proper design of the operating system would allow for 
recovery and continuation of normal processing, through 
checkpoint data kept in the main store. 
In order to recover from a fault, the fault must first 
be detected and isolated. Self diagnosable computers can 
monitor their own performance and take corrective action on 
below tolerance or failure conditions before they cause 
system failure or further loss of information [6]. A 
processor could continuously place data on the main store,
then check it for integrity on each circulation. Similar
methods could be used for the message store. 
Roving diagnosis is a possible method of checking the 
processing elements. Periodically, three processors could 
be freed from normal processing and two of them set to check 
a third. A special interconnection net would be used for
this purpose, with no network interference possible because 
only one test would be allowed at a time. The 
interconnection net would not necessarily connect all 
processors to all other processors [7]. The operating 
system would switch to a new set of processors at the end of 
each test, rotating through all processors. At the end of a 
test of all processors, the tests could be restarted, or 
restart could be delayed for a set time, or until the 
operator or an external event triggers a checking run. 
Delaying rechecking runs improves throughput at the expense 
of quick error detection: the delay between checking runs 
should be significantly less than the mean time to failure 
for the processors being used. 
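One way to rotate the test through all processors is sketched
below. It is only one possible schedule: the function name is
hypothetical, and choosing the next two processors in a fixed
circular order as testers is an assumption, since a real
interconnection net may restrict which processors can test
which.

```python
def roving_schedule(processors):
    """Yield (tester_a, tester_b, unit_under_test) for one checking run.

    Assumption: the two testers are simply the next two processors in
    a fixed circular order.
    """
    n = len(processors)
    if n < 3:
        raise ValueError("roving diagnosis needs at least three processors")
    for i in range(n):
        yield (processors[(i + 1) % n], processors[(i + 2) % n], processors[i])
```

Each step frees exactly three processors, and one checking
run tests every processor once.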
CHAPTER 10 
Applications 
The proposed architecture would support independent 
tasks quite well; the absence of data sharing would save
synchronization and communication overhead. Many business 
data processing tasks are presently performed by single 
tasks not communicating with other tasks, so business 
applications would be easily adapted to the machine. In 
addition, mildly interdependent tasks would be simple to 
implement. Mildly interdependent in this case means that 
semaphores and data queuing would be the only forms of
inter-task communication and data sharing. These 
communications can be handled by the message system, or by 
passing queue addresses through the message system. More 
difficult to implement would be shared-storage schemes, 
where data is continuously updated by more than one task. 
Read-only storage or applications where only one task can
write a main storage area at a time would not suffer 
greatly. 
It is evident that the system would not support well any 
algorithm which is not amenable to virtual storage
environments. 
Since the system should be useful for independent tasks, 
such a machine could do the work of several independent 
systems, while utilizing only one set of peripheral devices. 
This would result in some savings by eliminating any 
redundant devices not needed to support the throughput of
the multiprocessor. It would also eliminate or reduce the 
need for multiple copies of data on different computer 
systems and inter-system data transfer. 
An example of an application which would make good use 
of the multiprocessor is sorting. A common method of 
sorting large files is to form the incoming records into 
ordered "strings" and then merge strings together
successively until only one ordered string remains [8]. A
multiprocessor sort would involve one or more tasks 
performing input and string formation, then queuing these 
strings via an optimized operating system queuing mechanism 
to merge tasks. Merge tasks would merge sets of ordered 
strings into longer strings, and would pass merged strings 
on to another set of merge tasks or queue them back to 
themselves. They could also write strings to secondary 
storage in case a file is too large to fit in main 
storage. A final set of merge tasks would ultimately 
perform the final merge. This algorithm utilizes the full 
parallel processing power of a machine of the proposed
architecture; it consists of mildly interdependent tasks 
which only queue blocks of data to other tasks and do not 
require careful synchronization. The algorithm can make use
of as many processors as the amount of data to sort
justifies.
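The structure of the sort can be sketched sequentially. In
the model below each list comprehension element stands for
work that an independent task could perform; the function
names, the string length, and the merge fan-in are
illustrative assumptions, and no real task queuing is shown.

```python
import heapq


def form_strings(records, string_length):
    """Input task: cut the input into sorted strings (runs)."""
    return [sorted(records[i:i + string_length])
            for i in range(0, len(records), string_length)]


def merge_pass(strings, fan_in=2):
    """One level of merge tasks; each merges up to `fan_in` strings.

    Each group is independent, so each could run on its own processor.
    """
    return [list(heapq.merge(*strings[i:i + fan_in]))
            for i in range(0, len(strings), fan_in)]


def multiprocessor_sort(records, string_length=4, fan_in=2):
    """Repeat merge passes until a single ordered string remains."""
    strings = form_strings(records, string_length)
    while len(strings) > 1:
        strings = merge_pass(strings, fan_in)
    return strings[0] if strings else []
```

The number of independent merges in a pass falls by the
fan-in at each level, so the early passes offer the most
parallelism and the final merge is performed by one task.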
Compilation can similarly utilize the system. The 
lexical analyzer can pass a queue of tokens to the parser; 
the various passes of the compiler could be implemented as 
separate tasks receiving queued information from the 
preceding task. In addition, some functions might be
carried out in parallel during a pass, such as processing
subroutines in parallel. 
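The idea of compiler passes as tasks joined by queues can be
sketched with two threads standing in for tasks on separate
processors. Everything here is illustrative: the names are
hypothetical, the lexer merely splits on whitespace, and the
parser only sums integer literals to give the pipeline
something observable to do.

```python
import queue
import threading

END = object()  # sentinel marking the end of the token stream


def lexer_task(source, out_queue):
    """Split source text into tokens and queue them for the parser."""
    for token in source.split():
        out_queue.put(token)
    out_queue.put(END)


def parser_task(in_queue, results):
    """Consume queued tokens; here it only sums integer literals."""
    total = 0
    while (token := in_queue.get()) is not END:
        if token.isdigit():
            total += int(token)
    results.append(total)


def run_pipeline(source):
    """Run the two passes as separate tasks joined by a queue."""
    tokens, results = queue.Queue(), []
    tasks = [threading.Thread(target=lexer_task, args=(source, tokens)),
             threading.Thread(target=parser_task, args=(tokens, results))]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return results[0]
```

The only coupling between the passes is the queue, which is
the kind of mildly interdependent structure the message
system is meant to support.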
CHAPTER 11 
Summary 
A multiprocessing computer architecture has been 
presented that eliminates bus and storage module contention 
while simplifying the processor to storage interconnection. 
The system would also have some advantages over conventional
virtual storage computers, and would not be bound to a
special purpose, as many multiprocessors are. 
REFERENCES 
[1] A. J. Weissberger, "Analysis of Multi-Microprocessor
Architectures," Computer Design, vol. 16, pp. 151-163,
June 1977.
[2] G. Reyling, "Performance and Control of Multiprocessor
Systems," Computer Design, vol. 13, pp. 81-87, March
1974.
[3] E. G. Coffman and L. C. Varian, "Further Experimental
Data on the Behavior of Programs in a Paging
Environment," Comm. ACM, vol. 11, pp. 471-474, July
1968.
[4] A. C. Shaw, The Logical Design of Operating Systems,
Englewood Cliffs, NJ: Prentice-Hall, 1974.
[5] P. J. Denning, "The Working Set Model for Program
Behavior," Comm. ACM, vol. 11, pp. 323-333, May 1968.
[6] J. A. Abraham and G. Metze, "Roving Diagnosis for High
Performance Digital Systems," Proc. 1978 Conf. on
Information Sciences and Systems, The Johns Hopkins
University, Baltimore, MD, 1978.
[7] R. Nair, Diagnosis, Self-diagnosis, and Roving Diagnosis
in Distributed Digital Systems, Coordinated Science
Laboratory Report R-823, University of Illinois, Urbana,
Ill., September 1978.
[8] N. Wirth, Algorithms + Data Structures = Programs,
Englewood Cliffs, NJ: Prentice-Hall, 1976.
