efficient process managers for computer systems in which many identical CPUs have access to a shared memory.
Two factors degrade a process manager's efficiency: primitive operation times and busy waiting. Our design reduces operation times to the minimum by tightly coupling the hardware architecture and software data structure; this permits the primitive operations of process management to be fast microprograms in each processor's instruction set (see footnote 1). We will compare standard implementations of WAIT and SIGNAL with our proposals.
Busy waiting is more difficult to control. In a single-processor system it is completely eliminated by semaphores, which divert the processor from a waiting process to a ready process. But in a multiple-processor system, shared data structures within the process manager--especially the ready list--must be protected from simultaneous access by two or more processors. This can be achieved only with primitive locks, which are hardware bits tested periodically by processors. The processors cycling at locks steal memory cycles from the processors that will eventually release the locks. For a small number of processors, this can be controlled by reducing the times of operations that hold locks and by optimizing the retry delay between tests of the lock.
For a large number of processors, a more radical approach is needed. We will show that ready-list contention and the multiprocessor-priority problem can be virtually eliminated by implementing the ready list as a circulating ring of process indices.
II. Overview of a Process Manager
The process manager is the portion of the operating system that implements processes and semaphores. It hides the details of scheduling and of switching processors among the processes. It performs the WAIT and SIGNAL operations on semaphores. It creates and deletes processes and semaphores.
Our process manager has two levels. The lower level implements the primitive operations for starting, stopping, and scheduling a process on a processor. The upper level implements the WAIT, SIGNAL, CREATE, and DELETE operations. This division permits changing the architecture of processors or data structures without reprogramming the upper level operations [10].
A. Data Structure
The internal data structure of the process manager comprises the process list, the ready list, and the semaphore list. (See Figure 1.)
The process list (PL) is an array of process control blocks (PCBs) identified by process indices. Each PCB holds, among other fields, a stateword and a link field. Note that the allowable process indices are 0, ..., N. Process 0 is a special "null" process that will be discussed later. For convenience, we will use the notation LINK[i] for the link field in PL[i], rather than the formal notation PL[i].link.
Footnote 1: WAIT and SIGNAL were microprogrammed on the VENUS machine, an experimental uniprocessor [11]. The GEC 4080, a commercial machine, comes closest to meeting the design objectives discussed here [6].
The ready list (RL) is a queue containing indices of all processes enabled to run on a processor. It emanates from a descriptor containing a (head, tail) linked-list specifier and a lock bit. The queue itself is the chain of processes found by tracing through successive link fields starting from the head process and terminating at the tail process. The link fields of both the tail process and process 0 are always 0. The lock bit is used to restrict access to the ready list to at most one processor at a time. An Ada-like declaration of the ready list is:
The initial value of a semaphore's count must be nonnegative; thereafter, a nonnegative count indicates an empty queue, whereas a negative count indicates a queue whose length is the magnitude of the count.
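As an illustration of the data structure described above (not the authors' Ada-like declarations), here is a minimal Python sketch. The class names `PCB`, `QueueDescriptor`, and `Semaphore` and the helpers `trace_queue` and `queue_length` are assumptions introduced for this example; only the fields (stateword, link, head, tail, lock, count) come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PCB:
    """One entry of the process list PL (hypothetical layout)."""
    stateword: dict = field(default_factory=dict)
    link: int = 0  # LINK[i]; 0 marks the tail of a queue


@dataclass
class QueueDescriptor:
    """(head, tail) linked-list specifier plus a lock bit, as for the ready list."""
    head: int = 0
    tail: int = 0
    lock: bool = False


@dataclass
class Semaphore(QueueDescriptor):
    """A semaphore descriptor: a queue descriptor plus a count."""
    count: int = 0


def trace_queue(pl: List[PCB], q: QueueDescriptor) -> List[int]:
    """Follow successive link fields from the head until index 0 is reached."""
    indices = []
    i = q.head
    while i != 0:
        indices.append(i)
        i = pl[i].link
    return indices


def queue_length(count: int) -> int:
    """A negative count means a queue whose length is the count's magnitude."""
    return -count if count < 0 else 0
```

With process indices 0..3, setting `pl[2].link = 1` and a descriptor with head 2 and tail 1 yields the chain 2, 1 when traced.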
B. Locks
The foregoing definitions show a lock bit in the ready-list descriptor and in each semaphore descriptor. These bits must be set while any processor is using the associated data structure. They are set using a LOCK instruction and reset using UNLK (unlock). Let Mem[x].lock denote the bit of the memory word at address x used for a lock. The microprogram for LOCK x follows the schema (see footnote 2):
The LOCK instruction takes (at least) 2 memory reference times (at least one read and a write) and UNLK takes one.
A LOCK-UNLK pair is intended to enclose a critical section of instructions. While a processor is inside the critical section, its interrupts must be shut off by a DISABLE instruction to guarantee that it cannot be diverted until it has completed the critical operation. A processor's interrupts remain off until it executes an ENABLE instruction, even if it changes processes while interrupts are off. In other words, the interrupt status is part of a processor, not a process (see footnote 3).
Footnote 2: As specified, LOCK is not the same as test-and-set on the IBM 370. In fact, LOCK requires two instructions on the IBM 370.
The LOCK instruction itself must also be indivisible: once started, the addressed lock bit must be fetched, modified, and returned to memory; no other processor may access the addressed memory location until the instruction is complete. This requirement is easily enforced by the usual protocol at the processor-memory interface. Having placed an address in its memory address register, the processor raises an address-request line A and waits. When the addressed memory bank becomes idle, the memory arbiter selects a waiting processor (one with A = 1) and signals the proceed line to that processor. As soon as it receives the proceed signal, the processor performs the memory access (read and/or write) on the addressed location. At the completion of the memory access, the processor lowers the address-request line (sets A = 0), which informs the memory arbiter that the addressed bank is again idle. In the case of the LOCK instruction, both a read and a write operation are performed while A = 1. The protocol requires a processor to release the A line before loading a new address into its memory address register.
The dashed lines in the microprogram for LOCK and UNLK represent points at which the processor must relinquish its control over the memory bank. It must set A = 0 in order to cross a dashed line.
If the LOCK instruction is begun when the lock is set, the processor will perform the 'retry at LOCK' action. This means that the processor must release exclusive access to the addressed memory location and restart the LOCK operation. It also means that a processor waiting for a lock engages in busy waiting.
Busy waiting increases lock contention: by stealing memory cycles from the processor inside the critical section, the waiting processors prolong the holding time of the lock. This degradation can be mitigated by changing the retry action to "release memory, delay T, retry", where T should be about half the time a processor will remain in the critical section. Later we will calculate values of T appropriate for the specific cases of ready list and semaphore locks.
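The LOCK and UNLK behavior with the "release memory, delay T, retry" action can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the microprogram from the paper: a `threading.Lock` named `_bank` stands in for exclusive access to the memory bank (the interval during which A = 1), `time.sleep` stands in for the delay T, and the names `LockBit` and `run_demo` are hypothetical.

```python
import threading
import time


class LockBit:
    """Models Mem[x].lock; _bank stands in for exclusive memory-bank access."""

    def __init__(self):
        self._bank = threading.Lock()
        self.bit = 0
        self.retries = 0  # number of failed tests of the lock bit

    def lock(self, retry_delay):
        while True:
            with self._bank:  # indivisible read and write while A = 1
                if self.bit == 0:
                    self.bit = 1
                    return
                self.retries += 1
            # Retry action: release memory, delay T, retry.
            time.sleep(retry_delay)

    def unlk(self):
        with self._bank:
            self.bit = 0


def run_demo(workers=4, rounds=25, retry_delay=0.0005):
    """Several simulated processors increment a shared counter under the lock."""
    lock_bit = LockBit()
    counter = [0]

    def worker():
        for _ in range(rounds):
            lock_bit.lock(retry_delay)
            counter[0] += 1  # critical section
            lock_bit.unlk()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock_bit.bit
```

Because a waiting processor leaves the memory bank between tests, the processor inside the critical section is not slowed by continuous probing; the `retries` field gives a rough feel for how the delay affects the number of wasted tests.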
C. Starting, Stopping, and Scheduling Processes
At the lower level of the process manager, there are three (uninterruptible) operations for manipulating the process list and the ready list. These operations refer to a processor register 'self' that contains the process index of the process currently running on that processor.
Footnote 3: Do not confuse the interrupt status of the processor with the condition masks of each process. The masks control which exceptional conditions are enabled while a given process runs. The DISABLE instruction overrides all interrupts and conditions.
Note that LINK[i] must be set to zero to indicate that process i is the tail of the ready list. The interrupts must be disabled on any processor executing SAVESW, LOADSW, or READY.
If the ready list is empty (head = 0), the LOADSW operation starts process 0, the null process, which executes until there is useful work to do. A simple form for the program of process 0 is an infinite loop cycling between a PAUSE T instruction and a LOADSW instruction. The PAUSE instruction stops the processor for an interval selected by parameter T, thereby reducing busy waiting on the ready-list lock.
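The three lower-level operations can be sketched for a single simulated processor as follows. This is a minimal Python model under assumptions, not the authors' listings: locking and interrupt disabling are omitted, the register set is modeled as one opaque value, and the class name `Machine` and attribute names are hypothetical.

```python
class Machine:
    """Single-processor model of SAVESW, LOADSW, and READY (locks omitted)."""

    def __init__(self, n):
        self.link = [0] * (n + 1)         # LINK[i]; LINK[0] stays 0
        self.stateword = [None] * (n + 1)
        self.head = 0                     # ready-list head (0 = empty)
        self.tail = 0                     # ready-list tail
        self.self_ = 0                    # the 'self' register
        self.registers = None             # stand-in for processor registers

    def ready(self, i):
        """Append process i to the tail of the ready list."""
        self.link[i] = 0                  # i becomes the new tail
        if self.head == 0:                # empty list: i is head and tail
            self.head = self.tail = i
        else:
            self.link[self.tail] = i
            self.tail = i

    def savesw(self):
        """Copy the processor registers into the current PCB."""
        self.stateword[self.self_] = self.registers

    def loadsw(self):
        """Remove the head process and start it; start process 0 if empty."""
        i = self.head
        if i != 0:
            self.head = self.link[i]      # becomes 0 when i was the tail
        self.self_ = i
        self.registers = self.stateword[i]
```

Note that the tail pointer is left stale when the last process is removed; this is harmless in the sketch because `ready` tests for head = 0 and resets both pointers.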
D. WAIT and SIGNAL Operations
The WAIT and SIGNAL operations are in the upper level of the process manager. The WAIT operation is used to receive a signal from a semaphore; the calling process is delayed if the count is zero or less at the time of the attempted reception. The SIGNAL operation is used to transmit a signal through a semaphore; the waiting process at the head of the list is released if the count is less than zero at the time of the attempted transmission. Both operations must be indivisible in the sense that, while a WAIT or SIGNAL is in progress on a given semaphore, no other WAIT or SIGNAL on that same semaphore may be initiated. This implies that both operations must be uninterruptible. A program for the WAIT operation is: 
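As an illustration (not the authors' listing), the counting behavior of WAIT and SIGNAL described above can be sketched in Python. The sketch assumes a single simulated processor, so semaphore locks, the ready-list lock, and DISABLE/ENABLE are omitted; the names `Queue`, `Semaphore`, `Processor`, `enqueue`, and `dequeue` are hypothetical.

```python
class Queue:
    """A (head, tail) descriptor threaded through a shared link array."""

    def __init__(self):
        self.head = 0
        self.tail = 0


class Semaphore(Queue):
    def __init__(self, count=0):
        super().__init__()
        assert count >= 0      # initial count must be nonnegative
        self.count = count


def enqueue(link, q, i):
    link[i] = 0
    if q.head == 0:
        q.head = q.tail = i
    else:
        link[q.tail] = i
        q.tail = i


def dequeue(link, q):
    i = q.head
    if i != 0:
        q.head = link[i]
    return i


class Processor:
    def __init__(self, n):
        self.link = [0] * (n + 1)
        self.ready_list = Queue()
        self.self_ = 0

    def loadsw(self):
        self.self_ = dequeue(self.link, self.ready_list)

    def wait(self, sem):
        sem.count -= 1
        if sem.count < 0:      # count was zero or less: caller is delayed
            enqueue(self.link, sem, self.self_)   # SAVESW would precede this
            self.loadsw()      # switch to the next ready process

    def signal(self, sem):
        sem.count += 1
        if sem.count <= 0:     # count was negative: release the head waiter
            i = dequeue(self.link, sem)
            enqueue(self.link, self.ready_list, i)  # READY(i)
```

After a WAIT on a semaphore with count 0, the count is -1 and the caller sits in the semaphore queue, matching the rule that a negative count gives the queue length.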
E. Interrupt Handling
A fault is an error condition detected by the hardware of a given processor; it triggers a procedure call within the current process to a routine that attempts error recovery [15]. An interrupt is a signal from a device to its device driver process requesting more work; it triggers a context switch to the device driver process. Conventional hardware does not distinguish faults from interrupts: a device signal is treated as a fault with the following error recovery:
    "put self at head of RL";
    self := device driver process index;
    "Load registers from PL[self].stateword";
L:  ENABLE;
end INTERRUPT;
To avoid the high overhead of this approach, the context switch contained in the interrupt sequence is often omitted. This makes correct operation of the operating system extremely difficult to prove. An efficient method of responding to interrupts, based on private semaphores, will be discussed later.
F. Creation and Deletion
The process manager must provide for the creation and removal of processes and semaphores. This can be done with the two pairs of programs where i is a process index, j is a semaphore index, and c is a condition code indicating success or type of failure of the operation. The CREATE operations locate unused entries in the process or semaphore lists, initialize them, return their indices, and set up links from the created objects to the creators. Possible failures are "table full" and "initialization invalid". The DELETE operations deallocate process and semaphore list entries and any created objects linked to them. Possible failures include "no such object" and "not your object". These operations must set locks and disable interrupts to prevent interference with other process management operations. The WAIT and SIGNAL operations are performed much more frequently than the CREATE and DELETE operations. Because the CREATE and DELETE operations have much less impact on performance, we will not discuss them further.
G. Correctness
We will outline an argument to prove that the implementation is correct. The objective is to show that process indices properly traverse the states of the diagram in operation initiated by another process.) When a process index is in the self register of some processor, the process is running; when in the ready list, ready; and when in a semaphore queue, waiting.
It is important that the operations LOCK, UNLK, DISABLE, ENABLE, SAVESW, LOADSW, and READY be strictly private to the process manager. Otherwise, arbitrary processes can set locks, disable interrupts, or manipulate the ready list. The correctness of the process manager depends on these operations being used exactly as specified in WAIT, SIGNAL, and INTERRUPT.
The correctness argument has two parts. In the first, one shows that each LOADSW, WAIT, SIGNAL, or INTERRUPT operation moves exactly one process index between the pair of states designated in the diagram. The principles of these proofs, which rely on setting locks and disabling interrupts, have already been discussed in connection with the programs. In the second part, one shows that setting locks and disabling interrupts cannot lead to deadlock or indefinite waiting. There are five reasons for this:
H. Performance
To estimate space and time requirements, we hand-coded the five process management operations (LOADSW, SAVESW, READY, WAIT, and SIGNAL) for the IBM 370 and VAX-11/780 instruction sets [2]. The results are summarized in Table I.
In the IBM 370, the stateword comprises 16 general registers, 16 control registers, 4 floating-point registers, and the program status word (PSW). LOADSW and SAVESW each include s = 37 operand references for all these registers; they also include instructions for disabling and enabling interrupts. In the VAX, the stateword comprises 16 general registers and 2 control registers. LOADSW and SAVESW each include s = 18 operand references for these registers. The VAX programs are shorter because the stateword is smaller and because the VAX instruction set is more powerful.
The figures in Table I do not include the delays for busy waiting on locks or for memory cycles lost while other processors perform LOCK operations. The short path cases of instruction fetching and operand referencing arise when the semaphore count is positive. The long path cases arise when queues must be manipulated and contexts switched. Compared to the IBM 370, the VAX implementation requires roughly one-third the space and one-half the execution time, or roughly one-sixth the space-time.
III. A Solution
The WAIT and SIGNAL overheads in the worst case analysis are sufficiently high that communications among operating systems processes, which may occur from 100-300 times per sec., cannot be handled efficiently by programs such as those given earlier. These operations must be incorporated into the machine's instruction set.
Fig. 3. Semaphore Word (fields: tag, count, head, tail, lock).
A. Semaphore Words
Suppose that each word of memory contains a tag field containing the type of information stored within [5, 7, 12, 15]. Suppose that a semaphore is implemented as a semaphore word (see Figure 3) and that the WAIT and SIGNAL operations are part of the instruction set of each processor [3]. For this environment, the instructions WAIT x and SIGNAL x operate on a semaphore word stored in Mem[x]. These instructions are uninterruptible because their microprograms do not examine interrupt indicators.
Tagging permits distributing semaphores throughout data structures without endangering the integrity of wait and signal operations. Tagging increases software reliability by preventing locking operations from being applied to nonsemaphore locations. Note, however, that a type-checking compiler also provides the same advantage. For this application, tagging is more a convenience than a necessity.
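The tag check that guards semaphore instructions can be sketched as follows. This is a minimal Python illustration under assumptions: the tag value 'sem' appears in the paper, while the `Word` layout, the `TagFault` exception, the 'int' tag, and the function `fetch_semaphore` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A tagged memory word; semaphore words carry count, head, tail, lock."""
    tag: str          # 'sem' marks a semaphore word; 'int' is a placeholder
    count: int = 0
    head: int = 0
    tail: int = 0
    lock: int = 0


class TagFault(Exception):
    """Raised when a semaphore instruction addresses a nonsemaphore word."""


def fetch_semaphore(mem, x):
    """Return Mem[x] if it is tagged as a semaphore, else raise a fault."""
    word = mem[x]
    if word.tag != "sem":
        raise TagFault(f"Mem[{x}] has tag {word.tag!r}, expected 'sem'")
    return word
```

A type-checking compiler could reject the second kind of access statically, which is why the text treats tagging as a convenience rather than a necessity.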
While a semaphore operation is in progress on Mem[x], the lock bit is set. Any other processor attempting a semaphore operation on Mem[x] must pause, retrying the operation after a delay. The delay should be about half the time required to complete the operation.
Once in control of a semaphore word, a processor adjusts the count field and, if necessary, moves a process index between the semaphore and the ready list. The ready list emanates from a semaphore word stored in Mem[RL], whose count field is not used. A processor attempting use of the ready-list semaphore word must pause and retry after a delay if the ready list is locked.
Because lock manipulations are embedded in WAIT and SIGNAL microprograms that do not examine interrupt indicators, the LOCK, UNLK, DISABLE, and ENABLE instructions may be eliminated from the instruction set.
Communications of the ACM, October 1981, Volume 24, Number 10
B. Starting, Stopping, and Scheduling Processes
The SAVESW, LOADSW, and READY operations are implemented as microsubprograms callable only from the WAIT and SIGNAL microprograms. Since the processor's instruction pointer will be pointing to the next instruction after WAIT or SIGNAL, the return value of the instruction pointer does not have to be a parameter to SAVESW.
Tables II and III specify the microsubprograms for the SAVESW, LOADSW, and READY operations. The dashed lines represent points at which the addressing protocol requires the processor to release the addressed memory bank. The transfer of register contents to and from PL[self].stateword will require additional acquisitions and releases of memory banks. SAVESW simply copies the processor registers to the current process control block. LOADSW locks the ready list, removes the head process, unlocks the ready list, and loads the processor registers from the new process control block. Both LOADSW and READY begin with a tag check and test of the lock bit; after the ready list has been modified, the lock bit is reset.
LOADSW locks the ready list for four memory reference times; READY locks it for at most four. If the lock bit is set, the retry action in the high-level specifications of Tables II and III means: "release the memory, pause TRL memory reference times, then restart the microprogram." The delay TRL should be 2 memory reference times, which is half the time another processor executing LOADSW or READY will hold the ready-list lock.
Executing LOADSW when there is only one process, say k, in the ready list will leave Mem[RL].head = 0 because LINK[k] = 0. The subsequent READY(i) instruction tests for the empty ready list (head = 0) and sets both head and tail to i in this case. Manipulations of head and tail pointers can be performed during the same memory access that manipulates the lock bit.
C. WAIT and SIGNAL Operations
Tables IV and V specify microprograms for the WAIT and SIGNAL instructions. These microprograms begin with tag and lock checking. If the semaphore's lock is set, the retry delay is TSEM memory reference times, which is half the time another processor will hold the lock in the worst case. If the count c > 0, the WAIT instruction will write c - 1 into the count field without setting the lock, completing in 2 memory reference times. If the count c >= 0, the SIGNAL instruction will write c + 1 into the count field without setting the lock, completing in 2 memory reference times. Otherwise, these instructions set the semaphore lock and proceed to their critical sections.
The critical section of the WAIT instruction saves the current stateword and attaches self to the semaphore's queue. The new count (c - 1), head, tail, and lock value (0) are written to memory in one memory reference time. As in the earlier version of WAIT, the LOADSW operation must be placed outside the critical section in order to release the lock before performing the context switch. This reduces the holding time of the semaphore lock to 4 + s memory reference times in the worst case (where s is the size of the stateword), and makes the ready-list locking interval disjoint from the semaphore locking interval.
The critical section of the SIGNAL instruction removes the process index from the head of the semaphore's queue, holding it in a local register i. This permits the READY(i) microprogram to be executed outside the critical section (but within an uninterruptible microprogram). It reduces the holding time of the semaphore lock to four memory reference times in the worst case.
D. Correctness
The correctness of these instructions follows from that of the software WAIT and SIGNAL implementation given earlier: the microprograms simulate the previous case. 
E. Performance
The overall space and time requirements of this proposal are summarized in Table VI. Lock times are measured in memory references during which the data structure is inaccessible to other processors. The proposed solution has lock times from one-half to one-quarter of the VAX implementation.
In the long run, as many WAITs will be executed as SIGNALs. This means that the semaphore lock retry delay should be half the average of the (long path) semaphore lock times of WAIT and SIGNAL, or TSEM = 2 memory reference times.
IV. Private Semaphores and I/O Operations
Section II.E describes an approach to handling device signals. This mechanism can be improved considerably with device driver processes and private semaphores. A device driver process has charge of all interactions with a given I/O device; all other processes must use a driver as an intermediary. A private semaphore is a semaphore on which only one process can wait. Private semaphores are used for communicating with driver processes and for receiving device completion signals.
A device driver must run soon after the completion of the previous task on its device in order to maintain the device's utilization as high as possible. To this end, each process has a fixed priority. All user processes operate at priority 0, the lowest priority, and device driver processes operate at higher priorities. Each processor contains a register 'priority' that indicates the priority of the running process. (Alternatively, the priority may be a field of the self register.) A signaled process preempts the running process if the signaled process is of higher priority; thus, a driver process always preempts a user process that signals it.
A. Private Semaphores
A private semaphore word (Figure 4) can be stored in the process control block or in a separate data structure. The owner field (i) contains the index of the only process allowed to wait on the semaphore. The priority field (p) contains the priority of the owner; it is used to determine whether the owner preempts the running process when signaled.
The machine instructions PWAIT x and PSIGNAL x are used to receive and send, respectively, via a private semaphore word in Mem[x]. Alternatively, the WAIT and SIGNAL microprograms can be generic, taking the PWAIT or PSIGNAL actions if the tag is 'psem' rather than 'sem'.
Private semaphores are intended to be used only when one process waits for another to report the successful completion of a requested task. This implies that the number of PWAIT operations cannot differ by more than one from the number of PSIGNAL operations on a given semaphore. The correctness arguments below rely on this assumption.
Table VII specifies the microprogram for PWAIT. It is simpler than WAIT (Table IV) because there is no semaphore queue manipulation. PWAIT requires 2 memory references for the short path and 8 + 2s for the long. The semaphore is locked while SAVESW is in progress to prevent a PSIGNAL operation generated by another processor from loading the stateword before it has been completely saved.
Table VIII specifies the microprogram for PSIGNAL. Except for actions (a), which handle preemption, it is simpler than SIGNAL (Table V). It takes 2 memory reference times for the short path and 6 + 2s for the long. The delay for retrying the semaphore lock is TPSEM = (5 + s)/4 memory reference times.
If the owner is not waiting (w = 0), PSIGNAL sets the signal bit. If the owner is waiting and has higher priority than the running process, PSIGNAL moves self to the head of the ready list (using a new operation, PUSH, shown in Table IX) and switches to the waiting process. Otherwise, PSIGNAL moves the waiting process to either the head or the tail of the ready list, depending on its priority. The semaphore need not be locked by PSIGNAL because the next operation on the semaphore will be performed by PWAIT, which cannot commence until after the PUSH or READY operations inside PSIGNAL have completed.
Ideally, a multiprocessor system will have the priority property, which requires that the least running process priority be as high as the greatest ready process priority. In practice, this means that a preemption must occur within a short time if a process is enabled whose priority exceeds that of any running process.
The priority mechanism of PSIGNAL fails to have this property for two reasons. First, PSIGNAL may awaken a process whose priority is less than that of self but greater than that of a process running on another processor. Second, PUSH does not insert processes in the ready list by priority; hence the next LOADSW will initiate some high priority process, but maybe not the highest. An interprocessor broadcast mechanism is required to achieve the priority property. The ready-list ring proposed in the next section has this property; it obviates the PUSH operation and eliminates the steps in (a) from Table VIII.
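The three PSIGNAL cases described above can be sketched in Python. This is an illustration under assumptions, not the microprogram of Table VIII: the ready list is modeled as a deque of (index, priority) pairs, locks and stateword transfers are omitted, and the rule that a waiting process of nonzero priority goes to the head while a priority-0 process goes to the tail is a guess made for the example, since the text says only that the choice depends on priority. The names `PrivateSemaphore`, `Cpu`, and `psignal` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PrivateSemaphore:
    owner: int              # index of the only process allowed to wait
    priority: int           # priority of the owner
    waiting: bool = False   # the w bit
    signaled: bool = False  # the signal bit


@dataclass
class Cpu:
    self_: int              # index of the running process
    priority: int           # priority of the running process


def psignal(sem, cpu, ready):
    """Apply PSIGNAL; 'ready' is a deque of (index, priority) pairs."""
    if not sem.waiting:
        sem.signaled = True                 # owner not waiting: record signal
        return "signal-recorded"
    sem.waiting = False
    if sem.priority > cpu.priority:
        # PUSH self on the head of the ready list, switch to the owner.
        ready.appendleft((cpu.self_, cpu.priority))
        cpu.self_, cpu.priority = sem.owner, sem.priority
        return "preempted"
    if sem.priority > 0:                    # assumed head/tail rule
        ready.appendleft((sem.owner, sem.priority))
        return "readied-at-head"
    ready.append((sem.owner, sem.priority))
    return "readied-at-tail"
```

The sketch also makes the first shortcoming visible: the comparison uses only the priority of the processor executing PSIGNAL, so a process on another processor with lower priority is never considered for preemption.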
B. I/O Control
It is a good design principle to channel all requests to any given I/O unit through a device driver process. Only a driver process can issue STARTIO commands to its device and receive the completion signals from its device. A device driver maintains a work queue of requests from all other processes for tasks at that device. All the interactions with a given device, including scheduling tasks, setting up channel programs, and performing error recovery, are hidden within the driver process.
The work queue of a device driver contains entries of the form (i, r), where i is a process index and r an I/O request descriptor. A semaphore 'wsem' counts the number of entries in the work queue. A requesting process waits on PL[self].sem, the process's private semaphore, for notification that its task is complete. In the simplest case, where the device accepts only one request at a time, the driver process follows the schema:
The driver's private semaphore is used to receive the device completion signal that eventually results after a STARTIO. The driver then informs the requestor (i) of the task's completion via the requestor's private semaphore. Additional semaphores must be associated with the work queue to ensure that only one process attaches or removes a request at a time.
The last command (HALT) of a channel program instructs the device to enter its idle state and generate a completion signal interrupt on the processor that started it. The interrupt handler issues the command PSIGNAL x where x is the address of the private semaphore of the device driver. If a user process or a lower priority device driver is running on that processor, it will be preempted during the PSIGNAL operation.
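The interaction among requestors, the driver, and the device can be sketched with Python threads. This is an illustrative model, not the driver schema from the paper: `threading.Semaphore` objects stand in for 'wsem' and for private semaphores, a short-lived thread stands in for the device and its completion interrupt, and the names `Driver`, `request`, `start_io`, `run`, and `demo` are hypothetical.

```python
import threading
from collections import deque


class Driver:
    def __init__(self, private_sems):
        self.private_sems = private_sems     # process index -> private semaphore
        self.work = deque()                  # entries (i, r)
        self.mutex = threading.Lock()        # one attach/remove at a time
        self.wsem = threading.Semaphore(0)   # counts queued requests
        self.psem = threading.Semaphore(0)   # the driver's private semaphore
        self.completed = []

    def request(self, i, r):
        """Called by process i: queue request r and await its completion."""
        with self.mutex:
            self.work.append((i, r))
        self.wsem.release()                  # SIGNAL(wsem)
        self.private_sems[i].acquire()       # PWAIT on own private semaphore

    def start_io(self, r):
        """Hypothetical device: finishes r, then raises a completion signal."""
        def device():
            self.completed.append(r)
            self.psem.release()              # interrupt handler: PSIGNAL
        threading.Thread(target=device).start()

    def run(self, n_requests):
        """Driver loop for a device that accepts one request at a time."""
        for _ in range(n_requests):
            self.wsem.acquire()              # WAIT(wsem)
            with self.mutex:
                i, r = self.work.popleft()
            self.start_io(r)
            self.psem.acquire()              # PWAIT for device completion
            self.private_sems[i].release()   # PSIGNAL the requestor


def demo():
    sems = {1: threading.Semaphore(0), 2: threading.Semaphore(0)}
    driver = Driver(sems)
    threads = [threading.Thread(target=driver.run, args=(2,))]
    threads += [threading.Thread(target=driver.request, args=(i, f"read block {i}"))
                for i in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(driver.completed)
```

Each private semaphore here has exactly one waiter, and each PWAIT is matched by one PSIGNAL, which mirrors the usage assumption stated for private semaphores.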
V. The Ready List
Although the microcode implementations of LOADSW and READY(i) lock the ready list for only four memory reference times, lock contention may still degrade performance during periods of frequent process switching. This contention can be eliminated by redesigning the ready list to give all processors simultaneous access. The FIFO ready list must be replaced by a multiport data structure. As long as each ready process is guaranteed a fair share of CPU service, the FIFO order is unnecessary.
A. Ready-List Ring
Ready-list contention can be eliminated completely by storing the ready process indices in a shared hardware ring. As shown in Figure 5 , the ring consists of circulating slots (or packets), each capable of holding the index and priority number of a ready process. Each processor has a private window into the ring through an interface unit.
The low level microprograms in Tables II and III 
If the priority numbers are part of process indices, p is implicitly inserted by the ordinary READY(i) operation.
In this implementation, lock operations are absent from all low level microprograms. The only delays occur when waiting for a used or unused slot to be accessible at the ring interface unit. The length of this delay depends on the rotation speed of the ring and the frequency of used slots. Each processor's ring interface unit can continuously monitor the priorities of passing used slots. If a slot (i, p) is observed for which p > priority, the ring interface unit removes the packet, marks the slot as unused, and triggers the following sequence in the processor:
This solves the multiprocessor-priority problem noted earlier: a high priority process will preempt a running process of lower priority within one ring circuit time.
Using this scheme, the PSIGNAL microprogram (Table VIII) can be simplified by replacing steps (a) with the operation READY(i). The shortest and longest path times of PSIGNAL are then 2 memory reference times.
If there are no ready processes, an idle processor can continuously monitor packets without degrading the performance of other processors. This solves the idle processor problem discussed previously.
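The behavior of the ring can be sketched with a small single-threaded simulation. This is a model under assumptions, not a description of the proposed hardware: rotation is advanced by the function that is waiting rather than by an independent clock, each processor's window is a fixed position, and the names `Ring`, `ready`, `take`, `loadsw`, and `monitor` are hypothetical.

```python
class Ring:
    def __init__(self, size):
        self.slots = [None] * size   # None marks an unused slot
        self.offset = 0              # how far the ring has rotated

    def rotate(self):
        self.offset = (self.offset + 1) % len(self.slots)

    def slot_at(self, window):
        """Index of the slot currently visible at a processor's window."""
        return (window + self.offset) % len(self.slots)


def ready(ring, window, index, priority, max_ticks):
    """Insert (index, priority) into the first unused slot passing the window."""
    for _ in range(max_ticks):
        pos = ring.slot_at(window)
        if ring.slots[pos] is None:
            ring.slots[pos] = (index, priority)
            return True
        ring.rotate()
    return False


def take(ring, window, max_ticks, above_priority=None):
    """Remove the first used slot passing the window; with above_priority set,
    accept only packets whose priority exceeds it."""
    for _ in range(max_ticks):
        pos = ring.slot_at(window)
        packet = ring.slots[pos]
        if packet is not None and (above_priority is None
                                   or packet[1] > above_priority):
            ring.slots[pos] = None   # mark the slot unused
            return packet
        ring.rotate()
    return None


def loadsw(ring, window, max_ticks):
    return take(ring, window, max_ticks)


def monitor(ring, window, running_priority, max_ticks):
    return take(ring, window, max_ticks, above_priority=running_priority)
```

Setting `max_ticks` to the ring size corresponds to one ring circuit, which is the bound the text gives for a high-priority process to be noticed by some lower-priority processor.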
A special processor can perform ring management. Its goal is to separate each pair of used slots by an empty slot so as to equalize the LOADSW and READY times under heavy load. If the density of empty slots is too low, this processor can divert used slots into its local store; it can insert them back again when the density of empty slots rises. The capacity of the ring is increased by the size of the local store of the ring manager.
In some systems a device driver process can run only on a small set of processors which are physically connected to the device by a path. This can be handled by adding a processor group field to each ring slot and assigning a processor group number to each processor. A processor will load only processes from slots with matching group numbers. A user process would have a universal group code that matches every processor.
The ready-list ring can be modelled as a queue with random selection for service. If there are n process indices in the ready list, a given one will be selected by the next LOADSW operation with probability 1/n. The mean number of LOADSWs until selection is n, the same as for a FIFO queue. A process will remain unselected after k successive LOADSW operations with probability (1 - 1/n)^k, which is always less than e^(-k/n). Thus, the probability that a process is unselected after 4n LOADSW operations is less than e^(-4), or 1.8 percent. Because the nonselection probability decays exponentially, there may be no need for a special mechanism to guarantee the eventual selection of a ready process index. In other words, starvation is unlikely to be a problem.
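The figures in this argument can be checked numerically. The following Python sketch assumes, as the text does, that n stays fixed and each LOADSW selects a given index independently with probability 1/n; the function names are introduced here for illustration.

```python
import math


def unselected_probability(n, k):
    """Probability that a given index survives k successive LOADSWs."""
    return (1 - 1 / n) ** k


def exponential_bound(n, k):
    """The bound e^(-k/n) on the nonselection probability."""
    return math.exp(-k / n)


def mean_loadsws_until_selection(n, terms=10000):
    """Truncated sum of k * P(first selected on the k-th LOADSW)."""
    p = 1 / n
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))
```

For example, with n = 10 and k = 40 the exact nonselection probability is 0.9^40, which lies below the bound e^(-4) of about 0.0183, and the truncated sum for the mean converges to n.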
B. Other Implementations
Some of the features of the ready-list ring can be simulated in main memory without special hardware. The randomness of ready-list accesses will tend to distribute the windows uniformly through the ready list. Memory contention will be low if the RL vector spans several memory banks. The major potential delay in LOADSW results from the accumulation of ready-list probes requiring at least one memory reference time each.
This implementation does not solve the multiprocessor-priority problem. Null processes (cf. Sec. II.C) may degrade the processors engaged in useful work. Process starvation does not occur because each processor begins its search for a ready process at the entry where its previous search ended.
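A main-memory variant along these lines can be sketched as a vector of entries probed through per-processor windows. This is a guess at the general shape under assumptions, not the authors' design: 0 marks an empty entry, per-entry synchronization is omitted, each probe is counted as one memory reference, and the names `VectorReadyList` and `Window` are hypothetical.

```python
class VectorReadyList:
    def __init__(self, size):
        self.slots = [0] * size      # 0 marks an empty entry


class Window:
    """One processor's cursor into the RL vector."""

    def __init__(self, rl, start=0):
        self.rl = rl
        self.cursor = start

    def _probe(self, want_used):
        """Scan from the cursor; return (position, probes) or (None, probes)."""
        n = len(self.rl.slots)
        for probes in range(1, n + 1):
            pos = self.cursor
            self.cursor = (self.cursor + 1) % n   # next search resumes here
            if (self.rl.slots[pos] != 0) == want_used:
                return pos, probes
        return None, n

    def loadsw(self):
        """Take the next ready index; return 0 (null process) if none found."""
        pos, probes = self._probe(want_used=True)
        if pos is None:
            return 0, probes
        index = self.rl.slots[pos]
        self.rl.slots[pos] = 0
        return index, probes

    def ready(self, index):
        """Store index in the next empty entry; return the probe count."""
        pos, probes = self._probe(want_used=False)
        if pos is None:
            raise OverflowError("ready-list vector is full")
        self.rl.slots[pos] = index
        return probes
```

Because each window resumes where its previous search ended, every entry is eventually visited by every processor, which is the property the text relies on to rule out starvation; the probe counts show where the LOADSW delay accumulates when the vector is sparse.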
VI. Conclusion
We have demonstrated that a modest amount of hardware support can significantly reduce the space and time requirements of the primitive operations for process context switching and semaphore management. The proposed WAIT and SIGNAL operations have roughly 1/180 the space-time of the corresponding IBM 370 implementation, and roughly 1/40 the space-time of the corresponding VAX implementation. The proposed implementation reduces ready-list lock holding times to 4 memory reference times. These operations are efficient enough to permit process management without shortcuts and to permit a greater number of processors to be used.
Our results can be applied to any system in which a set of identical processors has access to shared memory. This is true of many large mainframe computer systems. It will also be true of single-user, multiprocess operating systems (e.g., Bell Labs's UNIX) on a personal computer: additional CPU boards can be plugged in, much like additional memory boards, to improve performance.
The proposed implementation simplifies the correctness proof of the process manager by eliminating LOCK, UNLK, DISABLE, and ENABLE instructions and by making context switching operations inaccessible except to WAIT and SIGNAL operations. It also permits a correct response to signals from devices to their driver processes.
The discussion of the ready-list ring illustrates that a multiport list is not prone to be a bottleneck under heavy use. The probability of ultimate starvation is zero even though the list becomes a random selection queue. The technology of ring networks is already well developed--e.g., the University of Cambridge Ring for connecting machines [13] and the University of Manchester's Dataflow Machine's ring of enabled instructions [8] .
