This paper introduces a system architecture which
INTRODUCTION
Computer technology appears to be reaching a point of diminishing returns in attempts to increase the basic speed of a large-scale processor.
Regardless of the advances in hardware technology, there have traditionally been requirements for architectural innovations to gain increased speed and capabilities as well as added flexibility.
Historically, the major concerns for parallel designs centered upon the efficient utilization of hardware resources.
With the recent revolution in the capabilities and economics of large-scale integration technology, the cost of a basic central processing unit has decreased to the point where it is no longer a significant fraction of total system costs. At present, the cost of software development is of major concern even in conventional architectures and will likely be the limiting economic factor in architectures which have been developed to exploit program parallelism.
This paper is directed towards the development of fundamentally different computer architectures for the efficient utilization of an aggregate of the newly available microprocessors operating concurrently to gain increased computational power.
In order to realize the potential advantages of the concurrency of operations possible in multiple-processor systems, an adequate system for communication and control among a multitude of processors must be developed. In the past, multiple-processor systems employed only a small number of complete processors or large numbers of slaved functional units and were structured accordingly.
The communication between processors has often been achieved through the use of a dedicated set of channels, multi-port memories, a cross-bar switch, time-shared busses, or combinations of these methods; typically, the control arrangements have become less flexible as the number of processors increased.
Thus, systems of the past are often unwieldy and impractical in terms of current desires.
T. C. Chen 2,3 has demonstrated the weaknesses in traditional concurrent systems and provided motivation for the development of multiple-processor systems that are loosely coupled with a high degree of local intelligence and autonomy.
In his discussion on the efficiency of traditional, tightly coupled, concurrent systems, Chen shows that for small deviations from the ideal, perfectly parallel task to a real task with small amounts of serial or sequential requirements, the efficiency of a tightly coupled, concurrent system takes a precipitous drop.
The efficiency falls initially at a rate of M-1, where M is the number of parallel elements in the system; the greater M is, the more significant the impact of problems that are less than perfectly structured and perfectly parallel.
Since no two problems are ever quite the same, this also provides motivation to have the system adapt to fit the problem rather than distorting the problem to make it amenable to solution by the data processing system.
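The initial rate of fall quoted above can be illustrated with a small numerical sketch. It assumes an Amdahl-style model in which a fraction s of the task is serial, so that the efficiency of M elements is 1 / (1 + (M - 1) s); this model and the names `efficiency` and `initial_slope` are assumptions made for illustration and are not taken from Chen's analysis.

```python
def efficiency(serial_fraction: float, m: int) -> float:
    """Efficiency of m parallel elements under an assumed Amdahl-style model."""
    # Time on m elements: serial part s plus parallel part (1 - s) / m.
    # Speedup = 1 / (s + (1 - s) / m); efficiency = speedup / m,
    # which simplifies to 1 / (1 + (m - 1) * s).
    return 1.0 / (1.0 + (m - 1) * serial_fraction)


def initial_slope(m: int, step: float = 1e-6) -> float:
    """Finite-difference estimate of d(efficiency)/ds at s = 0."""
    return (efficiency(step, m) - efficiency(0.0, m)) / step


if __name__ == "__main__":
    for m in (4, 16, 64):
        # Columns: M, efficiency at 5% serial work, slope at s = 0.
        print(m, round(efficiency(0.05, m), 3), round(initial_slope(m), 1))
```

Under this assumed model the slope at s = 0 is -(M - 1), so the larger the system, the steeper the initial drop.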
As a result of considerations such as those previously mentioned, a new system of processors should meet the following requirements:
1. A large number of processor modules should be possible.
2. Uniformity of modules from the point of view of the communication/control structure should exist.
3. Each processor module should be capable of communication with all (or most) other processor modules.
4. Blocks of processors should be able to function as a team, independently of other teams.
5. A hierarchy of control should be possible, as shown in Fig. 1.
6. A dynamic ability to reconfigure the system (i.e., rearrange the hierarchy of control) should exist to fit the system to the problem, thus allowing the system to appear as a von Neumann machine, a parallel array processor, an associative parallel processor, etc., as required.
7. Considerations such as reliability, fault-tolerance, and graceful degradation demand the incorporation of redundancy and a capability for dynamic reconfiguration.
The system described here has been developed to satisfy the preceding requirements.
GENERAL SYSTEM OVERVIEW
As illustrated in Fig. 2, the proposed system consists of a number of modules containing microcomputers and ancillary circuits connected by a series of busses, loops, and SHORT or BLOCK/SHORT modules.
All interprocessor communication takes place on the various busses.
Each processor has its own independent memory and is capable of performing any of the system tasks assuming it has been suitably programmed.
Along with the various elements of hardware in the system, a basic system philosophy and a set of communication protocols are required.
It is intended that this system be restructurable and capable of being organized in a hierarchical fashion.
As such, there will generally be one processor (any processor) responsible for overall system action.
This processor designates subordinates, establishes the chain of command, and directs its immediate subordinates in the tasks they are to perform. In order to implement this philosophy, the following basic characteristics/protocols will be incorporated into the design:
1. Each module will be named both with a unique, permanent name, a "P-name," and with a changeable name, a "V-name." Each V-name consists of two parts: a block (or team) name and an element name. All communication is carried out by tagging or addressing the information with the destination V- or P-name and placing it on a bus. Thus, data or commands may be passed to a single module by specifying both the block and element names, or to all modules in a block by specifying the block name and "XX" for the element name, where "XX" is a "universal" name to which all modules respond.
Likewise, information can be passed to all modules simultaneously by specifying "XX, XX" as the V-name.
2. All commands sent by a master or controlling module must be taken in by its subordinate and acknowledged. The subordinate queues the commands pending the arrival of the appropriate operands.
3. Task completion must be signaled.
4. Several adjacent processors may be strung together to form a wider arithmetic ability than would otherwise be available.
5. All communication throughout the system will consist of information packets containing the data to be transferred and a series of tags. Since each processor is identified by a name, all ambiguities associated with the transfer of information are resolved through the use of the processor names. In addition to the destination P- or V-name, each packet will contain tags uniquely associating the operands with the commands in a possible queue or other temporary storage medium.
For data packets, a 1-bit tag will also indicate the order of the operands for non-commutative operations.
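The naming rule in item 1 can be sketched as a simple matching function. This is a minimal illustration, not the hardware comparison logic; the `Module` class, its field names, and the P-names used in the example are hypothetical. The treatment of "XX" as a block name combined with a specific element name is also an assumption, since the text only describes "block, XX" and "XX, XX".

```python
from dataclasses import dataclass

WILDCARD = "XX"  # the "universal" name to which all modules respond


@dataclass
class Module:
    p_name: str   # unique, permanent name
    block: str    # block (team) part of the changeable V-name
    element: str  # element part of the changeable V-name

    def accepts_v(self, block: str, element: str) -> bool:
        """True if a packet addressed to V-name (block, element) is for this module."""
        # Assumption: a wildcard in either position matches any value there.
        block_ok = block in (WILDCARD, self.block)
        element_ok = element in (WILDCARD, self.element)
        return block_ok and element_ok

    def accepts_p(self, p_name: str) -> bool:
        """True if a packet addressed by permanent name is for this module."""
        return p_name == self.p_name


def recipients(modules, block, element):
    """P-names of all modules that would respond to the given V-name."""
    return [m.p_name for m in modules if m.accepts_v(block, element)]


if __name__ == "__main__":
    system = [Module("P0", "1", "1"), Module("P1", "2", "2"),
              Module("P2", "3", "1"), Module("P3", "3", "2")]
    print(recipients(system, "3", "XX"))   # every module in block 3
    print(recipients(system, "XX", "XX"))  # every module in the system
```

Because V-names are changeable, regrouping modules into a different team amounts to rewriting the `block` and `element` fields; the P-name stays fixed and can still be used to reach a module.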
DESCRIPTION OF SYSTEM ELEMENTS
The heart of each system module is the microcomputer itself.
Each microcomputer, the microprocessor with its memory, will be microprogrammed to provide all the basic functions of a standard processor and to respond appropriately to the actions of the system.
It should automatically perform overhead type tasks.
For example, it should maintain a queue of commands and automatically acknowledge the receipt of commands.
Generally, the programs would consist of a series of subroutines whose calls would be initiated by commands received from superior elements of the hierarchy.
Each processor must have a priority interrupt capability such that interrupts occurring below the processor's priority level are masked.
It must also have lines for the "carry out" generated by an arithmetic operation or left shift.
Likewise, it should also have a "carry in" capability.
Currently available processors organized on a bit-slice basis provide these features. 9
There is no requirement as to word length, speed, etc., for the processor.
Communication between processors is provided by the system of busses.
There are two basic types of busses employed in the system: conventional time-shared busses and circulating busses.
The circulating bus or C-bus, often referred to as the Pierce Loop, 6, 8 can be conceptually considered to be a circulating loop that moves a packet of data in a fixed direction a uniform distance in each unit of time. Any user can transmit by placing an information packet on the bus anytime a gap in the circulating traffic appears at its location. Each user must continually monitor the traffic passing its location.
When a user recognizes that a packet passing its location contains its address (or name), the user removes the packet from the bus. The packet's former position in the traffic stream is now a gap, free to be filled with a new packet by any user. The C-bus thus provides a temporary memory or queue of information and is a means by which several independent data transfers can be carried out simultaneously. There are two classes of C-busses, the DONE busses and the DATA bus. Their functions will be explained later. Conventional busses are of the typical "party line" arrangement, having one transmitting user and many receiving users at a time; the CMD (Command) and ACK/NAK (Acknowledge/Negative-Acknowledge) busses are of this type.
Each of the conventional busses is equipped with a bus controller to arbitrate conflicting requests for the bus for transmission.
Any user desiring to transmit must be granted permission by the controller.
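The behavior of the circulating bus described above can be illustrated with a toy simulation. It is a sketch under simplifying assumptions that are not in the paper: the loop has exactly one slot per user, all slots advance one position per tick, and a user may refill a slot in the same tick in which it empties it. The class name `CBus` and its methods are hypothetical.

```python
from collections import deque


class CBus:
    """Toy model of a circulating bus with one slot per user position."""

    def __init__(self, n_users: int):
        self.slots = [None] * n_users                  # slot i is at user i
        self.outbox = [deque() for _ in range(n_users)]  # packets waiting to be sent
        self.inbox = [[] for _ in range(n_users)]        # data removed from the bus

    def send(self, src: int, dst: int, data) -> None:
        """Queue a packet at user src, tagged with destination dst."""
        self.outbox[src].append((dst, data))

    def tick(self) -> None:
        """Advance the loop by one unit of time."""
        # Move every slot one position in the fixed direction.
        self.slots = [self.slots[-1]] + self.slots[:-1]
        for user, packet in enumerate(self.slots):
            # A user removes a packet carrying its own address, leaving a gap.
            if packet is not None and packet[0] == user:
                self.inbox[user].append(packet[1])
                self.slots[user] = None
            # A gap at this location may be filled with a waiting packet.
            if self.slots[user] is None and self.outbox[user]:
                self.slots[user] = self.outbox[user].popleft()


if __name__ == "__main__":
    bus = CBus(4)
    bus.send(0, 2, "a")
    bus.send(1, 3, "b")
    for _ in range(3):
        bus.tick()
    print(bus.inbox)
```

In this run both packets are in flight at the same time, which is the property that distinguishes the C-bus from a conventional bus with a single transmitter.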
The functions of the busses can also be separated into two divisions: data transfer and control. In order to control the system efficiently, several sets of busses providing command and control capabilities have been grouped together.
Each set will collectively be termed a Control Group (C.G.).
Each Control Group competes for attention from each processor on a priority basis much as in the case of a priority interrupt system.
The Master Control Group (M.C.G.) is the highest, most significant priority or 0th level (C.G. 0). Each additional Control Group is on level 1, 2, etc. Each Control Group other than the Master can be blocked/terminated at the left edge of any processor by activation of the BLOCK/SHORT module.
By this it is meant that conventional busses are blocked and circulating busses are "shorted," i.e., the loop is closed. Each Control Group consists of a CMD bus, a DONE bus, and an ACK/NAK bus. The CMD bus carries commands to processors.
When a processor name matches the name attached to a command on a CMD bus at level "n," an interrupt to the processor is generated on interrupt priority "n." If the processor priority is lower than "n," the interrupt is accepted and the command is recognized as destined for this processor.
A processor recognizing a command is obligated to reply positively on the ACK/NAK bus if the command can be accepted into its command queue.
Otherwise, the processor replies with a negative acknowledge or NAK. The DONE bus provides the means by which the commanded processor can acknowledge the completion of the required task.
Data transfers are carried out on the DATA C-bus. A data item is placed on the DATA bus in the form of an information packet containing the data and the destination processor name. As the packet circulates around the bus, the destination name is compared to the name of each processor.
When a match occurs between the name on a data item and a processor, that processor is signaled and it is obligated to remove the data item from the bus.
Each BLOCK/SHORT module is controlled by the processor to its immediate left. Figure 4 illustrates the utility of the BLOCK/SHORT modules.
Here C.G. 1 is broken into two independent parts, with each section functioning just as if it were a complete C.G. Processor 2,1, for example, can then control 3,1 and 3,2 without any interaction with other processors on C.G. 1.
The basic bus formats for information packets, with an explanation of the various fields, are given below:
[Figure: bus formats, including the DONE bus]
Processor name: the name of a processor. When interpreted as a V-name, it consists of two parts, the block name and the element name.
Operation #: each command sent to a module is numbered and held in memory in numerical order by the receiving processor until its operands are present and there are no commands having a lower number in memory. The operands are uniquely identified as belonging with a particular command by a matching Operation #.
Operand order: in the DATA bus format, this indicates the order of the two operands for noncommutative operations.
Data: the actual operands, etc., transmitted on the DATA bus.
A/N: 1 bit that indicates the positive acknowledgement (A) of the receipt and acceptance of a command or a negative acknowledgement (N) indicating that the named module is unable to accept or perform the required operation.
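The queueing rule attached to the Operation # (commands are held in numerical order until their operands are present and no lower-numbered command remains) can be sketched as follows. This is an illustrative model only; the class name `CommandQueue`, the fixed queue capacity, and the choice to NAK a duplicate Operation # are assumptions that the paper does not specify.

```python
class CommandQueue:
    """Toy command queue for one processor module (capacity is an assumption)."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.pending = {}   # Operation # -> (opcode, number of operands needed)
        self.operands = {}  # Operation # -> {order bit: value}

    def receive_command(self, op_num: int, opcode: str, n_operands: int = 2) -> str:
        """Queue a command; reply 'A' (ACK) if accepted, 'N' (NAK) otherwise."""
        if len(self.pending) >= self.capacity or op_num in self.pending:
            return "N"
        self.pending[op_num] = (opcode, n_operands)
        self.operands.setdefault(op_num, {})  # keep operands that arrived early
        return "A"

    def receive_operand(self, op_num: int, order_bit: int, value) -> None:
        """Store an operand taken from the DATA bus under its Operation #."""
        self.operands.setdefault(op_num, {})[order_bit] = value

    def next_ready(self):
        """Return (op_num, opcode, operands) for the lowest-numbered command,
        or None if that command is still waiting for operands."""
        if not self.pending:
            return None
        op_num = min(self.pending)
        opcode, needed = self.pending[op_num]
        have = self.operands.get(op_num, {})
        if len(have) < needed:
            return None
        del self.pending[op_num]
        del self.operands[op_num]
        # The order bit fixes operand order for noncommutative operations.
        return op_num, opcode, [have[bit] for bit in sorted(have)]


if __name__ == "__main__":
    q = CommandQueue(capacity=2)
    print(q.receive_command(2, "SUB"), q.receive_command(1, "ADD"))
    q.receive_operand(2, 1, 5)
    q.receive_operand(2, 0, 9)
    print(q.next_ready())  # command 1 has no operands yet, so nothing is ready
```

Note that command 2 is not released even though its operands are complete, because command 1 is still pending; this follows the "no commands having a lower number" condition.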
The SYNC/CARRY LOOP also transfers data throughout the system.
It is designed to transfer information shifted or "carried out" from the arithmetic section of a processor to the arithmetic section of another processor.
This allows several processors to function as a single multiprecision arithmetic unit.
The SYNC/CARRY LOOP passes through each processor module and has no storage of information.
By activation of the appropriate SHORT modules as shown in Fig. 3, the LOOP may be gated through the processor proper or past it.
In a similar manner, it may also be "shorted" at the left edge of each processor module, i.e., it may be broken into two closed loops at the left end of the module.
In addition to the various busses, the items mentioned previously as BLOCK/SHORT modules and SHORT modules perform an important function in the implementation of a hierarchical structure.
The BLOCK/SHORT
BASIC ILLUSTRATIONS
The system, when viewed in an unstructured, idle configuration, will appear as a collection of processors arranged in a cylindric fashion connected by a collection of busses.
However, this structure, when viewed in an active state, will generally be divided into a collection of teams of processors in a hierarchy of responsibility and control.
Structuring takes place in the following fashion:
1. Initially, the user will designate a processor as the master and load its memory with the appropriate programs.
This processor then begins execution.
2. The master decides which of the various processors will perform particular tasks.
3. The master commands each processor in turn to load the program being sent to it over the DATA bus.
4. Each processor then sets its V-name and priority to the values sent it on the DATA bus upon command of the master.
5. The appropriate modules are then commanded to activate their BLOCK/SHORT or SHORT modules, as required.
For example, the hierarchy shown in Fig. 5 may be defined in the system by activating the appropriate BLOCK/SHORT modules, naming the processors appropriately, and specifying their priorities or the level on which each module expects commands.
The 0th module has been established with the V-name of "1,1" and designated as the most superior element in this structure.
Modules 1 and 5, assigned V-names of "2,2" and "2,1," respectively, are both directly controlled by "1,1" and expect commands at the 0th priority level, i.e., from the master control group.
Module 1 (named "2,2") directly controls the three modules 2, 3, and 4 (named "3,1," "3,2," and "3,3," respectively) through commands on the control group at the 1st priority level.
Note that since the BLOCK/SHORT modules between 0 and 1 and between 4 and 5 have been activated, this group of processors is capable of completely independent action without interaction with other modules on Control Group level 1. Assuming that the appropriate control modules in the SYNC/CARRY loop have been activated as shown, modules "3,XX" could be considered as an arithmetic functional unit of 3·n-bit precision, where n is the word size of a given module.
Module "2,2" would be the controller for this arithmetic section.
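The way a team of modules can act as one wider arithmetic unit can be sketched by chaining the carry out of each module into the carry in of the next. The sketch below models only the arithmetic, processed sequentially from the least significant word, and not the loop hardware or its timing; the function names are hypothetical.

```python
def chained_add(a_words, b_words, n_bits):
    """Add two multiword numbers, least significant word first.

    Each list element stands for the n-bit word held by one module; the
    carry out of one module becomes the carry in of the next.
    """
    mask = (1 << n_bits) - 1
    carry = 0
    result = []
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        result.append(total & mask)  # word kept by this module
        carry = total >> n_bits      # "carry out" passed to the next module
    return result, carry


def split_words(value, n_modules, n_bits):
    """Split an integer into n_modules words of n_bits each, low word first."""
    mask = (1 << n_bits) - 1
    return [(value >> (i * n_bits)) & mask for i in range(n_modules)]


def join_words(words, n_bits):
    """Reassemble an integer from words stored low word first."""
    return sum(w << (i * n_bits) for i, w in enumerate(words))


if __name__ == "__main__":
    n_bits, n_modules = 8, 3  # three 8-bit modules acting as one 24-bit unit
    x, y = 0x01FFFF, 0x000001
    words, carry = chained_add(split_words(x, n_modules, n_bits),
                               split_words(y, n_modules, n_bits), n_bits)
    print(hex(join_words(words, n_bits)), carry)
```

With three modules of word size n, the team handles operands of 3·n bits, and the final carry corresponds to overflow out of the most significant module.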
As another example, consider a parallel array processor.
This configuration, using an arithmetic capability of 2·n bits, would appear as in Fig. 6. Again, each level in the hierarchy is controlled through a control group on a different level.
Module "1,1" is the system controller and actually contains the program to be executed.
Each of the modules "3,1" through "M+2,2" contains the appropriate data elements as in any parallel array processor. Module "1,1" would control each of the functional groups A, B, ... by placing a command with the appropriate destination name on the Master Control Group CMD bus for the specific controlling module desired. Processor "1,1" can control all the functional groups simultaneously with one command addressed to "2,XX." Thus, as in the case of a parallel array processor, a single ADD, MULTIPLY, etc., command could cause all M functional groups to perform the required operation on the appropriate operands in each of the independent memories.
In the case that restructuring is required (due to problem changes or hardware failures), the master need only cause the system to pause while it proceeds through the structuring phase again, etc.
It is assumed that the master can interrupt any processor by commands on C.G. 0, which can never be blocked.
Although the preceding discussion and examples have only two C.G.'s and result in three levels of hierarchy, there could be several more C.G.'s. This would allow several more levels of hierarchy and, at each level, each processor would appear as a master to all those processors subordinate to it.
The following points should be noted:
1. All data transfers take place on the DATA bus.
Therefore, this bus will be a bottleneck and its performance will seriously affect the total system throughput. The DATA bus must therefore be a very high speed bus.
2. In order that a group of m processors be connected to form an m·n-bit arithmetic section, they must be adjacent or broken only by single modules operating on a different hierarchy level.
3. Although the master controller usually would communicate only with the modules one level below it in the hierarchy, it can send commands to any module through the master control group. It, therefore, can begin corrective action by reassigning names, etc., should a fault occur.
OBSERVATIONS AND CONCLUSIONS
The utility of the system proposed here depends upon the amount of hardware and software overhead required and the latency in the interprocessor communications.
Based on the work of Hayes and Sherman, it can be shown that, on the average, in a light to moderately loaded system, the expected delay to place an information packet on a C-bus, and consequently the total message communication rate, is well within the practical limits for useful systems. Assuming a number of processors communicating with 11 other processors symmetrically on a bus with a 1.5 × 10^6 word/sec rate, with each processor transmitting at a rate of 50 × 10^3 word/sec, Hayes and Sherman show that each processor can expect a delay of less than 0.7 µs. 7 On the other hand, Avi-Itzhak 1 has shown that, in the heavily loaded case, a deadlock situation can occur where competing groups of processors "see-saw" control of the bus, locking out all other processors. Therefore, it is important that the meaning of "heavy," "moderate," and "light" loading be determined quantitatively. Worst-case figures must be computed and the potential for deadlock eliminated.
As is evident from the examples, only a limited number of Control Groups are likely to be used. It will be necessary, however, to determine exactly how many levels of hierarchy and hence of Control Groups are required to provide a generally useful organization meeting the criteria mentioned earlier.
Since the system is to be constructed in a modular fashion with a uniform communications interface between modules, the direct physical configuration of a system to serve as a small real-time controller or other fixed-task system should be simple and straightforward. Following a "divide and conquer" philosophy, each module would be given a single fixed task and would be responsible for or to a small constant set of other modules, thus reducing the problems inherent in handling multiple tasks or interrupts in real time.
A point of major concern is the cost of the software required to support a system of the type proposed here and the difficulty of preparing user programs. While more research is necessary in this area before conclusions can be drawn, the complexity of the support software is reduced by the simplicity of the C-bus concept and by the intercommunications protocol which is largely hardware controlled.
An interesting point is that the proposed architecture can be configured for the execution of data-flow programs. 4,5 Preparing data-flow programs is no more difficult than preparing programs for conventional machines, since it is not necessary to explicitly detect parallelism. To execute data-flow programs, system processors, or perhaps processor teams, would be assigned as operators in the data-flow program. Each processor would be directed to distribute copies of its computational results to destinations indicated by the links of the program. The flow of data tokens is represented by the flow of operands on the DATA bus. The flow of control tokens, in the form of packets transmitted on the control busses, enforces the firing rules of the data-flow program.
In conclusion, an organization of microprocessors intercommunicating over a series of busses and having a restructurable, hierarchical control philosophy has been presented. Although the development of this architecture is by no means complete, it is hoped that further work on the problems indicated will yield a flexible multiprocessor architecture that allows restructuring of system resources to tailor them to processing requirements.
