An experimental distributed microcomputer concept has been developed and implemented, and is currently operational at the Naval Air Development Center, as a vehicle to investigate distributed processing concepts with respect to replacing larger computers with networks of microprocessors at the subsystem or node level. Major benefits being exploited include increased performance, flexibility, system availability, and survivability, obtained by using multiple processing elements with reduced cost, size, weight, and power consumption. This paper concentrates on defining the distributed processing concept in terms of control primitives, variables, and structures and their use in performing a decomposed DFT (Discrete Fourier Transform) application function on a laboratory model. The DFT was chosen as an experimental application to investigate distributed processing concepts because of its highly regular and decomposable structure for concurrent execution.
Introduction
Stimulated by the reduced size, weight, power consumption, and cost advantages of microprocessors, distributed computing is an evolving technology and is the subject of many investigative efforts.¹ The major potential benefits of distributed computing include increased system-wide real time performance, ease of adaptability to integration and change, high system availability, and decreased system vulnerability. The introduction of low cost microprocessor technology now permits experimenting with concepts that were previously restricted to paper studies.
In addition, the reduced size, weight, and power advantages of microprocessors enable applications that would otherwise not be financially or technically feasible. Because of these desirable characteristics, microprocessor and distributed computing technologies are very attractive for avionic processing system applications. However, before these technologies can be applied effectively, appropriate expertise and tools must first be in place.
In view of this need, this paper describes an experimental autonomous control structure² that was developed and implemented at the Naval Air Development Center on a laboratory model. It consists of a pool of concurrently executing microprocessors which serves as a vehicle for investigating distributed processing concepts.
Functional distribution in avionic processing systems
Future avionic systems are envisioned as having processing functions distributed over many physical processors, which communicate via a common medium such as the MIL-STD-1553 bus.
Typical functions may include navigation, communications, radar, and display. A generic configuration is depicted in Figure 1.
Depending on the reconfiguration scheme selected, sensors are connected to those processing subsystems providing primary and backup functions. In the event a primary subsystem fails, reconfiguration could be accomplished by loading the backup subsystem with appropriate programs over the system bus from a common bulk store. Because the backup subsystem is probably a primary subsystem for another function, a degraded mode of operation may result.
This distribution of major avionic processing functions on a system bus can be referred to as global distribution.
As a means of increasing the performance and availability of a subsystem, a higher performance processor could be replaced with a network of microprocessors as shown in subsystem 2 of Figure 1 . This distribution can be referred to as local distribution.
Since the local distribution of processors could increase both performance and availability, the need for system reconfiguration would be greatly reduced or may not even exist.
If this is the case, sensors would only be connected to primary subsystems.
In addition, the system bus would not experience the additional loading due to reconfiguration. Other advantages of distributing processors locally are that expansion and contraction of processing power as well as reliability can be performed in an incremental manner.
The concept described in this paper, therefore, is intended to address distributed processing as applied at a local subsystem or node level.
Experimental concept implementation
The approach being employed to determine the applicability, strengths, and weaknesses of distributed computing in avionic systems is that of experimental investigation.
A laboratory model has been established as a baseline and consists of off-the-shelf hardware comprising multiple microprocessors interconnected by means of a shared memory. An experimental control structure has been defined such that local knowledge of the existence of other processors is not required and global control and task scheduling are performed via a highly reliable shared memory.
To demonstrate concept feasibility, a well-known application, the DFT (Discrete Fourier Transform), was chosen.
Hardware configuration
Figure 2 depicts the experimental hardware configuration employed. It consists of five processors designated as task processors and one as an interface processor. Each of the processors has a private local memory and intercommunicates through a common shared memory. The interface processor is also equipped with a keyboard, display, and floppy disk, since it is used as an external communications medium.
The shared memory facility consists of a memory and a bus structure which enables up to six microcomputers to be interconnected.
It provides polling and arbitration logic so only one processor has access to shared memory at any one time. Normally, accesses to shared memory are limited to one word each time a processor is granted access. However, a special area (the flag block) can be designated in shared memory. When the flag block is accessed, a processor is guaranteed an uninterrupted access time that is hardware settable.
Assumptions
The task processors cooperate to perform a scenario concurrently where the interface processor provides external interface and display functions. With this implementation, commands to perform a computational scenario are provided through the interface processor. Each local memory contains an identical copy of the operational programs.
These consist of a control executive and application tasks. Shared memory is common to all processors and contains control variables and application data. It is accessed by means of control primitives, and access rights are enforced by semaphores. This implementation enables the number of processors to vary without affecting any software.
234 / SPIE Vol. 298 Real-Time Signal Processing IV (1981)
Task structure
A possible task structure can be characterized as shown in Figure 3.
Tasks 1 through N are sequential where each task consists of a number of subtasks that can be executed in parallel.
In this case, task 1 must be completed before task 2 and so forth. However, the associated subtasks can be performed concurrently in any specified order.
Many variations of the task structure are also possible and can be specified in terms of serial and parallel combinations.
The point to be made here is that, to take advantage of concurrent processing, the tasks must be partitioned to optimize execution time. The task structure shown in Figure 3 is provided to illustrate a method of synchronizing task execution by means of control variables in shared memory. This is accomplished by simply implementing a task list pointer which designates the next task to be performed. In addition, subtask pointers are required to properly synchronize the topology of the total task structure. For example, the subtasks in Figure 3 could each be composed of a number of lower level tasks.
Ensuring that lower level tasks providing inputs to higher level tasks are completed requires an additional control mechanism, which is designated a cumulative task counter.
This synchronization technique, through shared memory, enables from one to any number of processors to participate in a scenario without any processor being aware of the others' existence.
However, the total execution time is directly dependent on the number of processors available to execute subtasks in parallel. As can be observed, if N processors are required to optimize execution time then fewer processors would result in a degraded mode of operation.
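The dependence of execution time on the number of processors can be illustrated with a rough model. The sketch below is not taken from the paper: it assumes that every subtask takes the same time and that shared memory contention is negligible, so a task with S subtasks needs ceil(S / P) rounds on P processors. The function names `rounds_per_task` and `ideal_speedup` are hypothetical.

```python
import math


def rounds_per_task(subtasks: int, processors: int) -> int:
    """Rounds needed for one task if each processor runs one subtask per round.

    Assumption: all subtasks take equal time and shared memory access is free.
    """
    if subtasks < 1 or processors < 1:
        raise ValueError("subtasks and processors must both be at least 1")
    return math.ceil(subtasks / processors)


def ideal_speedup(subtasks: int, processors: int, tasks: int = 1) -> float:
    """Upper bound on speedup relative to the one-processor case.

    Tasks run one after another, so the number of tasks cancels out;
    it is kept as a parameter only to mirror the task structure.
    """
    one_processor = tasks * rounds_per_task(subtasks, 1)
    many_processors = tasks * rounds_per_task(subtasks, processors)
    return one_processor / many_processors


if __name__ == "__main__":
    # Three sequential tasks of 64 subtasks each, one to five processors.
    for p in range(1, 6):
        print(p, rounds_per_task(64, p), round(ideal_speedup(64, p, tasks=3), 2))
```

Under these assumptions the bound falls short of P whenever the subtask count is not a multiple of P, because some processors idle in the final round. Measured speedups would also be reduced by shared memory access time, which this model ignores.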
Control structure
A mechanism employed to permit a processor to capture shared memory is called a semaphore. In the experimental system, a semaphore enforces access rights to shared memory and also indicates various conditions. These conditions include whether shared memory is available or blocked and whether an application is in progress or completed. This semaphore, together with the control primitives and a task execution control algorithm, provides a method for dynamically distributing tasks among a pool of processors without using a processor for centralized control; this method is referred to as autonomous control.
Control primitives.
Control primitives are operations which give each process a means to guarantee mutually exclusive access to data shared among processes. Two primitives have been implemented to control process interactions and involve interactions with the semaphore. These primitives are depicted in Figure 4.
The SEIZE primitive is used to capture shared memory and the RELEASE primitive is used to give up shared memory to another processor. Other primitives are also used to fetch and store data in shared memory.
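The role of the two primitives can be sketched in software. The following is a minimal illustration, not the laboratory implementation: it assumes that a `threading.Lock` can stand in for the hardware-supported semaphore and that Python threads can stand in for the task processors. The class `SharedMemory` and its method names are hypothetical.

```python
import threading


class SharedMemory:
    """Stand-in for the shared memory: a word store guarded by one semaphore."""

    def __init__(self):
        # Assumption: a lock models the semaphore that enforces access rights.
        self._semaphore = threading.Lock()
        self._words = {}

    def seize(self):
        """SEIZE: block until this processor has captured shared memory."""
        self._semaphore.acquire()

    def release(self):
        """RELEASE: give up shared memory to another processor."""
        self._semaphore.release()

    def fetch(self, address, default=0):
        """Read one word; the caller is expected to hold the semaphore."""
        return self._words.get(address, default)

    def store(self, address, value):
        """Write one word; the caller is expected to hold the semaphore."""
        self._words[address] = value


def increment_counter(memory, address, times):
    """Read-modify-write a shared word, protected by SEIZE and RELEASE."""
    for _ in range(times):
        memory.seize()
        try:
            memory.store(address, memory.fetch(address) + 1)
        finally:
            memory.release()


def run_demo(workers=5, times=1000):
    """Let several simulated processors update the same shared counter."""
    memory = SharedMemory()
    threads = [
        threading.Thread(target=increment_counter, args=(memory, "counter", times))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return memory.fetch("counter")
```

Because every read-modify-write sequence is bracketed by `seize` and `release`, no update is lost, whatever the number of workers.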
Task execution control. The algorithm employed to control task execution is shown in Figure 5. When a scenario begins, the processor currently being polled seizes shared memory, determines the subtask to be performed, fetches the appropriate data, increments the task pointers as necessary, and releases shared memory to other processors which perform the same process. Once the appropriate data is fetched, the task is performed using the processor's local memory. Upon completing a task, the processor seizes shared memory, stores the data, and proceeds to perform another subtask. This process is repeated continually until all tasks are completed for the specified application scenario.
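The control loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of Figure 5: Python threads stand in for the task processors, a `threading.Lock` stands in for the semaphore, and the control variables (`task`, `next_subtask`, `completed`) are hypothetical names for the task list pointer, the subtask pointer, and the cumulative task counter.

```python
import threading
import time


def run_scenario(tasks, subtasks_per_task, processors, work):
    """Run sequential tasks whose subtasks are claimed dynamically by a pool.

    work(task, subtask) is the local computation. No processor knows how
    many others exist; all coordination goes through the shared variables.
    """
    semaphore = threading.Lock()  # assumption: models the shared memory semaphore
    shared = {"task": 0, "next_subtask": 0, "completed": 0, "results": {}}
    performed = [0] * processors  # subtasks performed by each processor

    def processor(pid):
        while True:
            with semaphore:  # SEIZE ... RELEASE
                task = shared["task"]
                if task >= tasks:
                    return  # scenario completed
                subtask = shared["next_subtask"]
                if subtask < subtasks_per_task:
                    shared["next_subtask"] = subtask + 1  # claim this subtask
                else:
                    subtask = None  # all claimed, but the task is not finished
            if subtask is None:
                time.sleep(0)  # yield, then poll shared memory again
                continue
            value = work(task, subtask)  # performed in "local memory"
            with semaphore:  # SEIZE again to store the result
                shared["results"][(task, subtask)] = value
                shared["completed"] += 1
                performed[pid] += 1
                if shared["completed"] == subtasks_per_task:
                    # Cumulative counter reached: advance to the next task.
                    shared["task"] = task + 1
                    shared["next_subtask"] = 0
                    shared["completed"] = 0

    threads = [threading.Thread(target=processor, args=(p,)) for p in range(processors)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["results"], performed
```

The same function runs unchanged with one processor or several, which mirrors the property that the number of processors can vary without affecting the software. The per-processor counts returned in `performed` generally differ from run to run because subtasks are claimed in whatever order the threads happen to reach shared memory.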
DFT decomposition
As indicated previously, to demonstrate concept feasibility, the DFT application was selected. The solution of the DFT can be approached several ways.
However, to provide for more shared memory interactions, an approach which results in three separate tasks, each of which is comprised of the same number of subtasks, was employed. The first task requires the computation of frequency terms as shown in Figure 6. The second task requires the [...] process-output to indicate shared memory accesses for input and output and processing that is performed by means of programs in local memory. Each of the three tasks is comprised of N = K subtasks, where N is the number of frequency points specified. These subtasks are executed concurrently by the task processors of Figure 2.
Figure: DFT decomposition for task 3
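To show why the DFT decomposes naturally into independent subtasks, the sketch below computes each frequency point X[k] = sum over n of x[n] exp(-j 2 pi k n / N) as a separate unit of work. It does not reproduce the three-task breakdown used on the laboratory model, which cannot be recovered in full from the text; it only illustrates one subtask per frequency point. The function names are hypothetical.

```python
import cmath


def dft_subtask(samples, k):
    """Compute the single frequency point X[k] of the DFT of samples.

    Each call depends only on the input samples and k, so the N calls
    can be executed concurrently in any order.
    """
    n_points = len(samples)
    return sum(
        x * cmath.exp(-2j * cmath.pi * k * n / n_points)
        for n, x in enumerate(samples)
    )


def dft(samples):
    """Assemble the full transform from N independent subtasks."""
    return [dft_subtask(samples, k) for k in range(len(samples))]


if __name__ == "__main__":
    # A constant input concentrates all energy in the k = 0 term.
    for k, value in enumerate(dft([1.0, 1.0, 1.0, 1.0])):
        print(k, abs(value))
```

Because no subtask reads another subtask's output, the only shared memory traffic in this simplified form is fetching the input samples and storing one result per subtask.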
Data collection
To determine execution time, the interface processor is equipped with a timer that measures the time required for the task processors to execute a DFT scenario.
In addition, to determine the manner in which the task processors cooperate to perform a DFT scenario, certain variables are maintained in shared memory for each of the task processors.
Each task processor maintains a record of the number of subtasks that are performed for each task by means of these variables.
The task processors also update a cumulative counter to indicate that all the required subtasks are completed for each task. Figure 9 illustrates the format of the output provided through the interface processor of Figure 2 for an assumed [...]. As shown in Figure 9, the output consists of time statistics, the distribution of subtasks for each of the three tasks among the five task processors, and the resulting transformed function.
Example of execution results
Table 1 provides an example of output statistics obtained by exercising the laboratory model of Figure 2 for a DFT scenario consisting of three tasks, each with 64 subtasks. The speed improvement relative to the one-processor case is shown in Table 1a, when the number of task processors is varied from one through five. It may be noted that the processor assignments are dynamic and that the manner by which processors select subtasks is not specified in advance. The distribution of subtasks per processor for a particular [...] Table 1b, when the number of task processors is varied from one through five. It may be observed that the distribution of subtasks is basically uniform. Since the assignment of tasks is random, the distribution of subtasks may change somewhat each time the very same scenario is executed.
Reliability considerations
A simplified reliability block diagram of the laboratory model, as it currently exists, is shown in Figure 10. Three critical areas have been identified for reliability consideration purposes, i.e., processors, bus, and shared memory.
Figure 10. Simplified reliability block diagram
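A reliability block diagram of this kind is commonly evaluated as a series arrangement of its critical areas. The sketch below is an assumption about how such a diagram could be evaluated, not a reproduction of Figure 10: it treats the processor pool as a k-out-of-n block of identical, independently failing processors, placed in series with the bus and the shared memory. The function names and the numeric values in the demonstration are hypothetical.

```python
from math import comb


def k_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n identical, independent units work."""
    if not (0 <= k <= n) or not (0.0 <= r <= 1.0):
        raise ValueError("require 0 <= k <= n and 0 <= r <= 1")
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def subsystem_reliability(k, n, r_processor, r_bus, r_memory):
    """Series combination: processor pool, then bus, then shared memory.

    Assumption: the bus and the shared memory are each a single point of
    failure, so their reliabilities multiply the pool reliability directly.
    """
    return k_of_n_reliability(k, n, r_processor) * r_bus * r_memory


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the model.
    for n in range(1, 6):
        print(n, subsystem_reliability(1, n, 0.9, 0.99, 0.99))
```

In this model, adding processors raises the pool term toward 1, after which the bus and shared memory terms dominate, which is consistent with treating them as the remaining single points of failure.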
Processors
As indicated previously, because of the existing autonomous control structure, any number of processors can be configured in or out of a subsystem, where the optimum number of processors required is a function of the particular application. However, additional processors could be made available, in an incremental manner, for increasing reliability to a desired level. It may be noted that even a minimum number of processors would enhance reliability, since the loss of a processor results only in a degraded mode of operation. The current implementation only permits a static configuration of processors. To facilitate dynamic reconfiguration, fault tolerant studies have been initiated to identify techniques that circumvent processor failures. Candidate techniques will be implemented and evaluated on the baseline laboratory model.
Bus
The current bus structure provides polling and arbitration logic and is the critical communications link between the processors and shared memory. Since the bus is a potential single point of failure, a study is being initiated to identify alternate fault tolerant techniques that would enhance reliability and availability.
Shared Memory
In order for this concept to be viable, the shared memory must be extremely reliable. Although it is currently a single point of failure, well-known fault tolerance techniques could be readily implemented. A single error correction and double error detection scheme would provide a certain increase in reliability. This could be further enhanced by providing a duplex configuration in which each memory has single error correction and double error detection. A study has been initiated to investigate these and other fault tolerant schemes.
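As an illustration of single error correction and double error detection, the sketch below uses an extended Hamming (8,4) code on a 4-bit word. The paper does not say which code or word size would be used for the shared memory, so this is only one well-known example of the technique; the function names `encode` and `decode` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1 to 7 are the usual
    Hamming positions, with check bits at positions 1, 2, and 4.
    """
    d1, d2, d3, d4 = data
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    code[2] = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    code[4] = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (data bits, status) for a possibly corrupted codeword."""
    word = list(code)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position  # XOR of the positions holding a 1
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity means a single error; syndrome 0 points at index 0,
        # i.e. the overall parity bit itself.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits are wrong.
        status = "double error detected"
    return [word[3], word[5], word[6], word[7]], status
```

When the status is "double error detected" the returned data bits are not trustworthy. A duplex arrangement could then fall back on the second memory copy, although how the two copies would be arbitrated is not described in the text.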
Concluding remarks
Distributed computing is a state-of-the-art technology and currently lacks formalism with respect to implementation techniques. This paper provides an approach that forms a baseline for future investigations.
The approach defined herein provides an autonomous control structure where all processors can access an entire common data base by use of control primitives. Since processing is performed locally, shared memory need only be accessed for communicating data and control purposes. In addition, since interprocessor communications is anonymous, each processor need not know of the existence of other processors. This approach can be contrasted to one in which processor communications is explicit and each processor needs to know which processor is to receive or send data to (or from) another processor. The approach described herein was chosen because it was felt that fault tolerant schemes could more readily be implemented. This is due to the fact that loss of processors does not affect the control structure.
Performance evaluations have been made to determine the effects of varying the number of processors as well as local processing and common data base access times. A software simulation was performed by employing a Generalized Computer Systems Simulator (GCSS) that is resident at the Naval Air Development Center, the results of which have been documented.³ Measurements have also been obtained by exercising the laboratory model. Fault tolerant studies have been initiated to identify techniques for circumventing processor, bus, and shared memory failures.
Candidate techniques will be directly implemented and evaluated on the laboratory model.
