In recent years there has been increasing interest in message-based operating systems, particularly in distributed environments. Such systems consist of a small message-passing kernel supporting a collection of system server processes that provide such services as resource management, file service, and global communications. For such an architecture to be practical, it is essential that basic messages be fast, since they often replace what would be a simple procedure call or "kernel call" in a more traditional system. Careful study of several operating systems shows that the limiting factor, especially for small messages, is typically not network bandwidth but processing overhead. Therefore, we propose using a special-purpose coprocessor to support message passing. Our research has two parts: First, we partitioned an actual message-based operating system into communication and computation parts interacting through shared queues and measured its performance on a multiprocessor. Second, we designed hardware support in the form of a special-purpose smart bus and smart shared memory and demonstrated the benefits of these components through analytical modeling using Generalized Timed Petri Nets. Our analysis shows good agreement with the experimental results and indicates that substantial benefits may be obtained from both the partitioning of the software and the addition of a small amount of special-purpose hardware.
Access to system services is requested via protected procedure calls in a traditional system, whereas in a message-based operating system it is requested via message passing. While a simple procedure call costs just a few instructions, and a protected procedure call (kernel call) costs a few hundred instructions, IPC costs a few thousand instructions in several systems that we studied [Ramac 86]. Since message exchange is the basic kernel mechanism in message-based operating systems, the performance of the system depends crucially on the rate of message exchange. Our measurements (see § 3) as well as the measurements of others [Artsy 84, Cheri 83, Gagli 85] indicate that for small messages (which make up the vast majority of all messages sent [Cheri 83]), the limiting factor is the high processing overhead that is incurred in message passing rather than limited network bandwidth or the time to copy messages from buffer to buffer.
There are two important figures of merit in this environment: round-trip time and message throughput. Round-trip time is the elapsed time seen by an application between sending a message and receiving a reply from the intended receiver. This figure of merit affects an individual application's performance. Message throughput is a global figure of merit that determines the performance of the entire system. Informally, it is the number of messages that the system handles per unit time. We show that a modest amount of additional hardware can significantly improve message throughput and average round-trip time in a multiprogramming environment. We also show that additional hardware support in the form of high-level bus primitives affords even greater improvement in communication subsystem performance.
Available hardware support [ABLE 84, Inter 83] and previous modeling studies [Gorad 87, Woods 84] address the issue of off-loading communication protocols onto front-end processors, and provide evidence that this approach can have a significant performance payoff. However, these and other previous proposals of hardware support for interprocess communication (see survey in [Ramac 86]) are more limited than the study reported in this paper in several respects. First, previous work generally assumes "communication" to be the work that is performed by the operating system to satisfy non-local requests. However, for message-based operating systems, measurements by ourselves [Ramac 86] and others [Cheri 83] show that there is a high processing overhead for local communication as well. Second, many proposals include only limited, low-level support for communication, leaving out support for operations such as address translation, control block manipulation, and kernel buffering, which account for a substantial portion of message passing overhead. Finally, a front-end processor for a specific network protocol (such as TCP/IP [Posts 81]) may not mesh well with the operating system primitives, and therefore may incur higher overhead than necessary. The problem needs to be addressed at a much higher level.
Figure 1: Node Architecture

[...] operating system and the applications. The shared bus, the shared memory, the message coprocessor, and the network interfaces together function as a single unit in assisting the host in message-passing activities. The host, the message coprocessor, and the network interfaces interact and synchronize via the shared memory. This organization is similar to the ones assumed in studies of network front-ends such as [Gorad 87] and in commercial products such as the ABLE Easyway Port [ABLE 84]. However, what distinguishes our work from these earlier proposals is the level of message-passing support envisioned in our proposal. In this research, we provide support for message passing at the level of the operating system primitives. Our solution to the problem suggests a system architecture that has a software aspect and a hardware aspect.
Our goal in this research is to determine a system architecture within each node that improves the system performance over an organization that does not have such hardware assistance. Our objective reduces to answering two main questions:²
(1) How should the message-based operating system be partitioned between the host and the message coprocessor?
(2) What kind of bus architecture is appropriate to support interaction between the host, the message coprocessor, and the network devices?
For a given semantics of interprocess communication, round-trip time signifies the processing overhead that is incurred to effect the message transfer between the sender and the receiver. If there are exactly two processes communicating with each other in the entire system, clearly there would be an increase in the processing overhead due to the interaction between the host and the message coprocessor. However, we show that this increase can be kept very small by a careful partitioning of the message-based operating system. Moreover, through our performance analysis, we show that the per-process round-trip time improves as a result of improving the message throughput when there are several processes communicating with one another.
Software Partition
We studied the design and implementation of four operating systems in detail: Charlotte, Jasmin, 925, and Unix (see [Ramac 86] for more details), to ensure that we are not discovering coding inefficiencies of one operating system but see a trend that is common to all these systems. Our model of a distributed system assumes that processes communicate via explicit messages and that system services are provided by trusted server processes (as opposed to a monolithic kernel). Charlotte, Jasmin, and 925 belong to this model. However, all of these systems are experimental research projects. Therefore, we also studied Unix, which is not a message-based operating system, to see whether operating systems in extensive use suffer from similar problems. Our study revealed two important characteristics of distributed systems:
² We discuss other aspects of the system architecture in § 6.
Structure
System services in a distributed system are accomplished by a combination of computation and communication. By computation we mean processing done on behalf of servers. By communication we mean the system code that has to be executed to process a communication request.
Communication overhead
There is a fixed overhead incurred in communication (independent of the message size) that can be decomposed into components such as checking the validity of an IPC call, addressing and manipulating control blocks, and short-term scheduling. There is a variable overhead due to kernel buffering that is dependent on the size of the message and the number of times the message is copied from source to destination. This combined overhead is present for both local and non-local communication. For non-local communication there is additional overhead such as sending network packets, processing network interrupts, checksum calculation, and retransmission. For small messages, i.e., message size smaller than 100 bytes, copy-time is less than 20% of the total round-trip time.
For large messages, i.e., message size larger than 1000 bytes, copy-time begins to dominate the total round-trip time.
In a distributed system, users request system services by communicating with the servers. These servers compute to satisfy the requests, possibly communicating with other servers. Through our profiling studies, we showed that a large percentage of the round-trip time can be attributed to short-term scheduling and control block manipulation functions. These kernel functions are performed for both local and non-local communication. Therefore, it is clear that message-passing support should be provided transparently for both local and non-local communication. Further, timing measurements for performing typical services on Unix (see § 5.6) suggest that server computation times are comparable to communication times incurred in the message kernel. The structure of the distributed system and the results of our studies suggest the following partition of the message-based operating system between the host and the message coprocessor: computation on the host and communication on the message coprocessor. The shared memory is used for synchronization and communication between the host, the message coprocessor, and the network interfaces.
To verify the feasibility of such a software partition and to gather actual timing information, we implemented such a partition on an experimental system. 925 [IBM 83] is an early version of an experimental operating system (since renamed Quicksilver) for a network of workstations. For the purposes of this discussion, 925 is quite similar to the Stanford V Kernel [Cheri 83]. The version we modified ran on a multiprocessor workstation with three Motorola 68000 processors [Motor 82a], each with local memory, connected by a Versabus [Motor 82b] to each other, a shared memory, and an early experimental version of the IBM token ring [Bux 83]. We emulated the message coprocessor with one of the processors, and measured the implementation to obtain the processing times for the kernel activities involved in message passing.
Through the implementation, we established the feasibility of partitioning the message-based operating system between the host and the message coprocessor. Another important fruit of this exercise was insight into the kinds of system data structures that are used in communication processing, the operations that are done on them, and the overhead for these operations. Buffers and lists of control blocks are the data structures that are extensively manipulated in communication processing. Operations on these data structures include copying and atomic queue manipulation. With the Motorola 68000 implementation it takes 220 microseconds of processing time to copy 40 bytes, and 74 microseconds of processing time to perform an atomic queueing operation. There are four copy operations and sixteen queueing operations (see [Ramac 86] for details) in one round trip (non-local communication). Hence these times are important, since they constitute a significant portion of the total round-trip time. Based on our implementation experience, we have a proposal for a hardware organization that we describe in the next section.
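As a rough illustration of why these two operations matter, the measured figures above can be combined into a per-round-trip cost for manipulating shared data structures. This is only a back-of-the-envelope sketch: the names are ours, and it assumes that each of the four copies moves a 40-byte message.

```python
# Measured costs on the Motorola 68000 implementation (from the text above).
COPY_40_BYTES_US = 220    # one copy of a 40-byte message, in microseconds
ATOMIC_QUEUE_OP_US = 74   # one atomic queueing operation, in microseconds

# Operation counts for one non-local round trip (from the text above).
COPIES_PER_ROUND_TRIP = 4
QUEUE_OPS_PER_ROUND_TRIP = 16


def shared_structure_cost_us(copies=COPIES_PER_ROUND_TRIP,
                             queue_ops=QUEUE_OPS_PER_ROUND_TRIP):
    """Time spent copying and queueing in one round trip (microseconds)."""
    return copies * COPY_40_BYTES_US + queue_ops * ATOMIC_QUEUE_OP_US


copy_us = COPIES_PER_ROUND_TRIP * COPY_40_BYTES_US         # 880
queue_us = QUEUE_OPS_PER_ROUND_TRIP * ATOMIC_QUEUE_OP_US   # 1184
total_us = shared_structure_cost_us()                      # 2064
print(copy_us, queue_us, total_us)
```

Under these assumptions, copying and queueing together cost roughly two milliseconds per non-local round trip, with queue manipulation accounting for the larger share.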
Hardware Organization
The system data structures in shared memory are manipulated by all the units inside each node. The operations on these data structures can be grouped into three categories: movement of blocks of data, queue manipulation, and simple read/write. The above groups of operations are general and applicable for implementing the semantics of interprocess communication of any operating system. Hence it is appropriate to support these operations on the shared memory at the bus level.
Several recent bus proposals support block transfer primitives [Borri 85]. However, these bus proposals are intended for a versatile environment with multiple memory modules, processor modules, and device modules. In our environment, there is a limited shared memory holding task control blocks and kernel buffers. The units that access this memory are the message coprocessor, the host, and the network interfaces. This memory does not contain either "kernel programs" or "user programs". On the contrary, it contains only protected kernel data structures that are manipulated by trusted kernel code executing in the message coprocessor and the host. Each unit that accesses this memory has exactly one outstanding request. In a limited controlled environment it would be more cost effective for the memory to handle multiplexed block transfers. Moreover, none of the existing bus proposals support atomic queue manipulation primitives.
We propose a smart bus for message-passing support. To support the high-level primitives in this proposal, we propose a smart shared memory. To put our bus proposal in the proper perspective, we should point out that the intent is not to invent a standard for system buses. In fact, we view the bus, the message coprocessor, the shared memory, and the network interfaces together as a single unit that provides message-passing support to the host at the level of the operating system primitives. This unit coexists with the rest of the node architecture that includes the bus on which the host, the devices, and the host memory reside.
Smart Bus Overview
The smart bus connects the host, the message coprocessor, and the network interfaces to the shared memory. Multiplexed block transfer and atomic queue manipulation are the transactions supported on the smart bus. The smart bus decouples requests for block transfers from the actual data transfers. The shared memory caches information regarding block transfer requests (address and size) in an internal table, so that it can restart a lower-priority request after servicing a higher-priority one. The bus is never locked for arbitrary amounts of time, thus guaranteeing access for higher-priority requests in a finite time. A unit can have exactly one outstanding block transfer request. Therefore, the shared memory does not have to handle any flow control problems. Prioritized arbitration among competing units is supported on the bus. The bus transfer rate is scalable with device technology due to the asynchronous protocol.
Physically, the bus includes sixteen multiplexed address/data lines, four lines for commands, and four lines for a tag. In addition, there are arbitration lines for access control, protocol lines to complete the asynchronous handshake, and a system reset line for startup. We refer to a transition on a protocol line as a clock-edge.
The number of multiplexed address/data lines in our design is sixteen, stemming from the fact that our experimental results were obtained from a sixteen-bit Versabus [Motor 82b] implementation. To maintain compatibility with our experimental results we used sixteen-bit address/data lines. However, there is no inherent assumption in our design that would preclude extension to a wider bus.
The transactions we propose on the shared bus can be grouped into three categories: block requests, queue manipulation, and simple read/write. Table 1 gives a summary of the smart bus commands and the time (measured in clock-edges) for performing the operations.

There are three transactions provided in the block request category: block transfer, block read data, and block write data. These primitives allow movement of blocks of contiguous data between the shared memory and other units in each node. They allow the shared memory to be multiplexed for handling simultaneous requests. Block transfer and block write data are initiated by the CPUs and network devices. Henceforth, we refer to either a CPU or a network device as a processor. The processor that initiates block transfer specifies whether it is a read or a write. Block read data is initiated by the shared memory. While block transfer is the primitive used by the processor to convey the intent to the shared memory, block read data and block write data are the primitives used to effect the actual data movement.
In block transfer, the processor sends the starting address of the block and a count indicating the number of contiguous bytes of information to be transferred. The command (read or write) is specified on the command bus. Shared memory stores them in its internal table and responds by returning a tag that uniquely identifies the transaction. Block read data and block write data are primitives that are issued following the block transfer request. Both these primitives result in data transfer. Shared memory executes block read data to send the data along with the tag that uniquely identifies the processor of the block transfer request. The processor monitors the tag bus. When there is a tag match, the processor receives the data from the bus. Information transfer is in the opposite direction for block write data. Following a request to write a block of data, the processor executes block write data, sending the data along with the tag to the shared memory. Shared memory receives them and uses the tag as an index into its internal table to get the address where the data is to be stored.
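The tag-based decoupling of request and data movement described above can be sketched in software. The following toy model is an illustration only, not the hardware design: the class and method names are ours, the limit of sixteen tags is inferred from the four tag lines, and the two-byte transfer width is inferred from the sixteen address/data lines. Arbitration and priorities are left out.

```python
class SmartMemory:
    """Toy model of the shared memory's side of the block transfer protocol."""

    def __init__(self, size, max_tags=16):
        self.cells = bytearray(size)
        self.table = {}                       # tag -> [next address, bytes left, command]
        self.free_tags = list(range(max_tags))

    def block_transfer(self, address, count, command):
        """Record a request (address, count, read/write) and return its tag."""
        tag = self.free_tags.pop(0)
        self.table[tag] = [address, count, command]
        return tag

    def block_read_data(self, tag, width=2):
        """Memory-initiated: place the next `width` bytes of a read on the bus."""
        address, left, command = self.table[tag]
        assert command == "read"
        n = min(width, left)
        data = bytes(self.cells[address:address + n])
        self._advance(tag, n)
        return tag, data                      # the processor matches on the tag

    def block_write_data(self, tag, data):
        """Processor-initiated: store data at the address remembered for tag."""
        address, left, command = self.table[tag]
        assert command == "write" and len(data) <= left
        self.cells[address:address + len(data)] = data
        self._advance(tag, len(data))

    def _advance(self, tag, n):
        entry = self.table[tag]
        entry[0] += n
        entry[1] -= n
        if entry[1] == 0:                     # transfer complete: recycle the tag
            del self.table[tag]
            self.free_tags.append(tag)


# Two requests interleaved on the same memory (hypothetical addresses).
mem = SmartMemory(64)
w = mem.block_transfer(address=8, count=4, command="write")
mem.block_write_data(w, b"ab")
r = mem.block_transfer(address=8, count=2, command="read")
tag, data = mem.block_read_data(r)
mem.block_write_data(w, b"cd")
```

Because the memory keeps the address and remaining count for each tag, the write can be suspended while the read is serviced and then resumed where it left off, which is the property the internal table is meant to provide.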
Queue Manipulation
There are three primitives provided in this category: enqueue control block, first control block, and dequeue control block. By presenting the memory as a singly-linked circular list of control blocks, these primitives allow atomic queueing operations to be performed on these lists. The data structure is shown in Figure 2. When presented with a list address, the memory unit views it as the address of the location in memory ("List" in Figure 2) that points to the tail of the list. Definitions of these primitives are given below. In each case, "list" refers to the location in memory that points to the tail (last element) of the list (Figure 2).
In addition to the above transactions, the bus supports simple read/write primitives at byte granularity.
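To make the queue manipulation primitives concrete, the following is a minimal software sketch of a singly-linked circular list addressed through a tail pointer. It rests on assumptions of ours: enqueue appends at the tail, first returns the head without removing it, and dequeue removes the head. The class and function names are hypothetical. In the proposal each operation would be a single atomic bus transaction, whereas here they are plain function calls.

```python
class ControlBlock:
    def __init__(self, name):
        self.name = name
        self.next = None


class ListCell:
    """The memory location ("List") that points to the tail of the list."""

    def __init__(self):
        self.tail = None


def enqueue(lst, block):
    """Append block at the tail of the circular list."""
    if lst.tail is None:
        block.next = block             # a single element points to itself
    else:
        block.next = lst.tail.next     # the new tail points to the head
        lst.tail.next = block
    lst.tail = block


def first(lst):
    """Return the head (tail.next) without removing it, or None if empty."""
    return None if lst.tail is None else lst.tail.next


def dequeue(lst):
    """Remove and return the head, or None if the list is empty."""
    if lst.tail is None:
        return None
    head = lst.tail.next
    if head is lst.tail:               # removing the only element
        lst.tail = None
    else:
        lst.tail.next = head.next
    head.next = None
    return head


ready = ListCell()
for name in ("a", "b", "c"):
    enqueue(ready, ControlBlock(name))
head_name = first(ready).name
drained = [dequeue(ready).name for _ in range(3)]
```

Keeping only a tail pointer gives constant-time access to both ends of the list, since the head is always one link away from the tail.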
Smart Shared Memory
We proposed a bus architecture that is appropriate within the functional unit composed of the message coprocessor and the network interfaces. However, this proposal implicitly assumes that the shared memory has the necessary "intelligence" to handle the high-level requests of the smart bus. The proposal also assumes that the processors, either themselves or within their bus interfaces, have the necessary intelligence to generate these requests. Fortunately, even though the bus transactions are high-level, the nature of the environment makes these transactions feasible from the point of view of hardware implementation. Moreover, the nature of the environment makes it possible to provide these facilities at a reasonable "cost"³. We demonstrate this feasibility through a detailed design of a smart shared memory [Ramac 86]. The controller for the smart shared memory is microprogrammed, and has under 3000 bits of microcode. Based on the complexity of the design, we also show that the entire design can be packaged in two chips. The data path (without the memory system) can be implemented as a single chip with roughly 6000 active components. The sequencer can be implemented as a single chip with roughly 1000 active components.
Performance Analysis
Our solution to the message-passing problem in distributed systems has two parts: software partition and hardware organization. While we implemented the software partition, limitations on the time and money we had available for fabricating and testing the hardware led us to model rather than build the smart bus and the smart shared memory. Moreover, modeling allowed us to parameterize the design, thus enabling individual features to be evaluated. The results of our performance analysis were sufficiently encouraging that we are now considering an experimental implementation of the hardware (see § 6.4).
We modeled our design using Generalized Timed Petri Nets (GTPN) [Holli 87], an extension of Petri nets [Peter 81] that allows assignment of firing durations to transitions and relative probabilities to alternate paths in the net. We then used a tool that builds the set of reachable states for the GTPN model and solves the resulting Markov chain to determine steady-state performance measures. Aggregate performance measures specified by the user (e.g., system throughput) are also computed automatically by the tool. This approach provides more precise results than simulation, but the formulation of the model requires some care lest the number of states become excessive. Some of the techniques we used to avoid state explosion [...]

³ We measure "cost" by the complexity (component count) of the design.
Architectures
We compare three architectures. Architecture I (Figure 3) is a uniprocessor implementation of a distributed system. The message-based operating system executes on the host. The host is in control of the network interface.
Architecture II (Figure 4) is the organization we implemented in 925. The servers execute on the host and the message-passing kernel executes on the message coprocessor. The shared memory contains the task control blocks and the kernel buffers. The message coprocessor is in control of the network interface.
Architecture III is similar to architecture II, with the difference that a smart bus interconnects the different units within each node and a smart memory serves as shared memory.
One important fruit of the implementation is that it gave us the timing values needed for driving the different models. These timing values are the processing times for the different components of message passing. In the architectures we are comparing, we assume the processors to be identical. Hence the processing times we obtained from our implementation are applicable to all of them.
Workload Description
In this section, we describe the workload that we used as the basis for comparing the different architectures. While this is not the only possible workload, it is a typical workload in a distributed system. We plan to study other workloads in the future (see § 6.1).
[...] When the send and the receive match, a rendezvous takes place between the client and the server. The server then computes for a while, processing the request in the message from the client. At the end of the computation phase, the server answers the request from the client with a reply, completing the rendezvous between the client and the server. We call this extended request-reply sequence between the client and the server a conversation. Our workload contains both local and non-local conversations. The number of simultaneous conversations and the amount of computation specified in each conversation are the two parameters we vary in the workload. It is true that in a real system, clients compute as well. However, we designed our workload to stress the performance of the message-based operating system composed of the message kernel and the servers. Therefore, for these experiments we did not consider client computation in our workload.
Offered load is a measure of the communication load that is presented to the system by each conversation, defined as the ratio of communication time (in a round-trip) to the sum of communication time and compute time. As we mentioned earlier, by communication we mean the system code that has to be executed to process a communication request. Intuitively, a compute-bound conversation is characterized by an offered load near zero, while a communication-bound conversation has a load near unity.
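The definition above can be written directly as a small function. In this sketch, C denotes the communication time in one round trip and S the server compute time; the function name and the millisecond values in the examples are hypothetical and are not taken from our measurements.

```python
def offered_load(comm_time, compute_time):
    """Offered load L = C / (C + S) for one conversation."""
    if comm_time < 0 or compute_time < 0 or comm_time + compute_time == 0:
        raise ValueError("times must be non-negative and not both zero")
    return comm_time / (comm_time + compute_time)


# Hypothetical conversations, all with C = 4.0 ms of communication time.
communication_bound = offered_load(4.0, 0.0)    # no server computation
balanced = offered_load(4.0, 4.0)               # equal parts of each
compute_bound = offered_load(4.0, 36.0)         # mostly server computation
print(communication_bound, balanced, compute_bound)
```

A conversation with no server computation has a load of exactly one, and the load falls toward zero as the server computation grows relative to the communication time.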
Processing Times
In our implementation, we had an 8 MHz CPU clock. At an 8 MHz clock speed, the Motorola 68000 has an instruction execution rate of roughly 0.3 MIPS [Motor 82a]. The Versabus [Motor 82b] memory cycle time is on average one microsecond. In our models, we assume an instruction execution time of three microseconds and a Versabus memory cycle time of one microsecond. We also assume that the four-edge handshake of the smart bus equals the Versabus memory cycle time and that the two-edge handshake equals half the Versabus memory cycle time. We should point out that a much higher speed is achievable for the smart bus with current technology. However, these conservative times for smart bus primitives give a more realistic basis for comparing the different architectures.

Table 2 shows a comparison of implementing queue manipulation and block transfer operations for architectures II and III. For architecture II, each of enqueue, dequeue, and first involves the following steps to be performed by the message coprocessor: lock a semaphore, execute the queue manipulation algorithm (see § 4.1.2), and release the semaphore. The message coprocessor executes a program loop for reading or writing a block in architecture II. The processing time for this loop execution is shown in Table 2. The message coprocessor in architecture III executes three instructions to initiate any of the smart bus primitives. For example, to initiate block transfer the message coprocessor writes the starting address, count, and command to its bus interface.

Tables 3, 4, 5, 6, 7, and 8 are a breakdown of the communication time for one round-trip conversation into component message-passing activities. The breakdown gives the processing time and the time spent in accessing shared data structures for both local and non-local conversations. The times for architecture II were obtained directly from our implementation.
The times for architecture I were obtained from architecture II by eliminating the overhead for synchronization between the host and the message coprocessor. The times for architecture III were derived from architecture II after factoring in the primitives of the smart bus. 
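The timing assumptions stated above reduce to a handful of constants. The sketch below shows one way to turn a clock-edge count into a time estimate for a smart bus primitive. It assumes a uniform time per clock-edge, which is consistent with the four-edge and two-edge handshake times, and it assumes that the cost of a primitive is the initiation time plus the bus time. The function names and the eight-edge example are hypothetical and are not taken from Table 1.

```python
INSTRUCTION_US = 3.0        # assumed instruction time (0.3 MIPS is about 3.3 us)
VERSABUS_CYCLE_US = 1.0     # assumed Versabus memory cycle time
FOUR_EDGE_HANDSHAKE_US = VERSABUS_CYCLE_US
TWO_EDGE_HANDSHAKE_US = VERSABUS_CYCLE_US / 2
INITIATE_INSTRUCTIONS = 3   # instructions to start any smart bus primitive


def edges_to_us(clock_edges):
    """Convert clock-edges to microseconds, assuming a uniform per-edge time."""
    return clock_edges * (FOUR_EDGE_HANDSHAKE_US / 4)


def primitive_time_us(clock_edges):
    """Initiation by the coprocessor plus bus time for one primitive."""
    return INITIATE_INSTRUCTIONS * INSTRUCTION_US + edges_to_us(clock_edges)


initiation_us = INITIATE_INSTRUCTIONS * INSTRUCTION_US   # 9.0
example_us = primitive_time_us(8)                        # hypothetical 8-edge primitive
print(initiation_us, example_us)
```

With these constants, the three initiating instructions dominate the estimate for any primitive that needs only a few clock-edges on the bus.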
Validation
Our experimental implementation on the 925 system differed from architecture II in two ways:
(1) There were two hosts in each node instead of one.
(2) The network interfaces required an additional copy from the kernel buffers to the memory-mapped network buffers in shared memory.
We used the workload described in § 5.2 for performance measurements of the implementation. We validated a model for non-local conversations of our experimental implementation against these performance measures. The model results are within 10% of the experimental results at high offered loads, while at low offered loads the deviation is within 25%. The optimistic prediction in the case of low offered load (high computation) is partly due to a load-leveling effect in the model that is not present in the experimental implementation. In the implementation, a process is bound to a particular host, whereas in the model, a request can be serviced on any available host. When the load is less communication-intensive, server processes spend a larger fraction of time on the host, and as a result the throughput predicted by the model is higher. However, despite this effect, the model results show good overall agreement with the experimental results.
Results
In this section, we present and compare the results of solving the models for the three architectures for the workload we described earlier. 
Maximum Communication Load (40-byte Messages)
For architecture II, the throughput for one conversation is slightly less than that for architecture I. The loss represents the overhead involved in the information transfer between the host and the message coprocessor. However, note that this loss is very small (about 10%). The increase in throughput with the number of conversations is less than linear due to the finite bandwidth of the message coprocessor. Note that architecture III is significantly better than both architectures I and II. The smart bus reduces the overhead in communication processing by providing high-level bus transactions. These transactions are significantly faster than a software implementation (see § 5.3). Figure 6(b) illustrates the results for non-local conversations. The tendency to saturate with the number of conversations is less pronounced for non-local conversations than for local conversations, since the processing load is spread across two nodes. Once again we note that architecture III performs significantly better than architectures I and II.
We note that architecture II does not do significantly better than architecture I (both local and non-local conversations). However, these graphs are for maximum communication load. Under these conditions the host is idle most of the time since there is no computation in any conversation. However, the premise behind partitioning the software is that load in a distributed system consists of a good mix of computation and communication. In the next section we will discuss our results under such typical load.
Varying Workload
In this section, we compare architectures I, II, and III under the assumption that the server does a certain non-zero amount of computation before replying to the client. As we mentioned earlier (see § 5.2), offered load is defined as L = C/(C+S), where C is the communication processing requirement in one round trip and S is the server computation time. C is dependent on the architecture, while S is a workload parameter. Tables 9 and 10 give the offered loads for different server-computation times in the three architectures for local and non-local conversations respectively. Note that the offered load for a given server-computation time is the least for architecture III since it has the least C, and slightly higher for architecture II. The value of S for a given service is the same for each of the three architectures. For example, our measurements of Unix on a processor that is about two to three times the speed of the modeled architecture show service times ranging from 0.2 to 6.1 milliseconds (see § 5.6). Using Tables 9 and 10, we can read off the offered loads for each architecture given the server-computation time. We want to be able to compare the performance of the three architectures for various servers. Figure 7 illustrates how message throughput depends on offered load, as determined by the amount of computation done by a server for each request. Since the offered load L depends on C, which is a function of the particular architecture, we normalized the results by plotting throughput for each architecture as a function of the offered load a given server would produce on architecture I. For architecture I, with local conversation, the results are independent of the number of conversations. Architecture II does slightly worse than architecture I for one conversation due to the overhead in passing information between the host and the message coprocessor.
However, as the number of conversations is increased, the throughput improves considerably over architecture I. With a message coprocessor equal in processing speed to the host, the upper bound for throughput improvement (with no overhead between the host and the message coprocessor) is a factor of two. Architecture II approaches this limit over a range (0.5 to 0.9) of values for offered load. When the load is more computation-intensive there is no significant gain in partitioning the software. The graph defines a region of operation of the distributed system, in terms of the mixture of computation and communication, for which the message coprocessor is viable. By providing high-level bus primitives, architecture III does better than both architectures I and II, and over a wider range (0.4 to 0.95) of offered load. The tendency to saturate for three and four conversations is also less pronounced for architecture III.
Figure 7 (b) shows a comparison of results for non-local conversation. For architecture II, the improvement in throughput with offered load over architecture I is less pronounced for the number of conversations that we have modeled. However, note that for four conversations we see an improvement (about 20%) over architecture I in the range of offered loads 0.7 to 0.9. Thus the graphs do show a trend in predicting the improvement that is attainable for much larger systems. Unfortunately, given the limitations of existing modeling tools, we were unable to model larger systems. We note once again that architecture III shows a marked performance improvement over the first two architectures. Over the range of offered loads 0.6 to 1.0, architecture III does significantly better than both architectures I and II. The graph suggests that smart bus primitives are as important as software partitioning for improving the performance of the system for non-local conversations.
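The offered-load definition L = C/(C+S) and the normalization of the throughput plots to the load a server would produce on architecture I can be illustrated with a minimal sketch. The function names and the per-architecture values of C below are hypothetical placeholders chosen only for illustration; they are not the values of Tables 9 and 10. The only property taken from the text is that architecture III has the smallest C.

```python
def offered_load(c_ms: float, s_ms: float) -> float:
    """Offered load L = C / (C + S) for one round-trip."""
    if c_ms < 0 or s_ms < 0 or c_ms + s_ms == 0:
        raise ValueError("C and S must be non-negative and not both zero")
    return c_ms / (c_ms + s_ms)


def server_time_for_load(c_ms: float, load: float) -> float:
    """Invert the definition: S = C * (1 - L) / L."""
    if not 0.0 < load <= 1.0:
        raise ValueError("load must lie in (0, 1]")
    return c_ms * (1.0 - load) / load


# Hypothetical communication costs per round-trip, in milliseconds.
# These are NOT measured values; only "III has the least C" follows the text.
COMM_COST_MS = {"I": 4.0, "II": 4.2, "III": 2.5}


def loads_for_server(s_ms: float, costs=COMM_COST_MS) -> dict:
    """Offered load that the same server (same S) produces on each architecture."""
    return {arch: offered_load(c, s_ms) for arch, c in costs.items()}


def normalized_load(s_ms: float, costs=COMM_COST_MS) -> float:
    """Plot coordinate: the load this server would produce on architecture I."""
    return offered_load(costs["I"], s_ms)


if __name__ == "__main__":
    # Service times spanning the 0.2 to 6.1 ms range quoted in the text.
    for s in (0.2, 1.0, 6.1):
        per_arch = loads_for_server(s)
        print(f"S = {s} ms, x = {normalized_load(s):.3f}, loads = {per_arch}")
```

Because S is a workload parameter and C varies with the architecture, the same server lands at a different offered load on each architecture; using the architecture I load as the common plotting coordinate is what makes the throughput curves comparable.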
Partitioned Smart Bus
We analyzed a fourth architecture that was motivated by the observation that task control blocks are a shared data structure between the host and the message coprocessor, whereas kernel buffers are a shared data structure between the message coprocessor and the network interfaces. We partition the smart shared memory and the smart bus as follows: The task control blocks are on a partition that interconnects the host with the message coprocessor and the kernel buffers are on a partition that interconnects the message coprocessor with the network interfaces.
We found in all cases that the partitioned organization did not perform significantly better than architecture III. We would have expected such an improvement in performance had there been considerable contention for the shared memory. These performance results indicate that access to the shared memory is not the bottleneck limiting performance. For the same reason, we do not expect a multiported shared memory to perform better than a single-ported shared memory for any of the four architectures that we analyzed.
Summary
In summary, the graphs show the following: (1) Over ranges of offered loads (0.4 to 1.0 for local and 0.6 to 1.0 for non-local), partitioning the message-based operating system and providing high-level bus primitives result in improved performance over a uniprocessor implementation. Thus there is a range of mixes of computation and communication in which a message coprocessor is appropriate for improving the performance of the system. We observed that the times for typical system services (measured on a Microvax II (2) For one conversation there is a loss in performance due to software partitioning, but the loss is very small. Improvement in performance with the number of conversations is less than linear due to the finite bandwidth of the message coprocessor.
(3) Smart bus primitives improve the performance of the system significantly for both local and non-local conversations. (4) Software partitioning and high-level bus transactions (mirroring operating system functions) are a promising approach to solving the message-passing problem in distributed systems. (5) Multiported memories do not help significantly, since it is processing time and not access to shared memory that is the limiting bottleneck.
Directions for Future Research
Any interesting research answers a few questions and raises several more. Ours is no exception. In the following sub-sections we identify directions for future research.
Extensions to the Performance Studies
In comparing the different architectures we used synchronous remote invocation send as the basis. We are currently investigating the performance of these architectures in other communication scenarios such as producer-consumer, asynchronous remote invocation send, and no-wait send. Further, since the use of the smart bus and the use of the message coprocessor are independent, we are investigating the individual performance benefits due to these two components.
Instruction-Set Architecture
In our implementation of the software partition, we showed that an off-the-shelf processor is adequate to support the functionality of the message coprocessor. However, the message coprocessor is intended for a very specific purpose, namely, performing the different components of communication processing. The message coprocessor mostly manipulates structured data types such as lists of control blocks and buffers in performing its chores. We would like to study the potential of a special-purpose instruction-set for the coprocessor to exploit the cost benefits of simplicity due to functional dedication and performance benefits possible in an architecture tailored to the chores of communication processing.
Interaction with Host Architecture
Interaction of our system architecture with other system architecture features such as virtual memory and cache is another interesting and important future area of research.
Virtual memory introduces a number of interesting problems. In our implementation, processes execute from the local memory of the host. The message coprocessor can access the host's local memory from the shared bus. It uses this access path to perform data transfer between a process's address space and kernel buffers in the shared memory. Since the host architecture did not support virtual memory, this implementation was feasible. However, applicability of our software architecture to a more general environment would require re-evaluation in the presence of virtual memory. For instance, if we assume that the host maps process virtual addresses to physical addresses in local memory, we see at least two choices for handling the kernel buffering problem: the host handles the kernel buffering chore; or the host locks the relevant pages in local memory to enable the message coprocessor to perform kernel buffering. The first choice places a considerable processing burden on the host. Moreover, the buffer management algorithms become more complex and error-prone, since now both the host and the message coprocessor access the kernel buffers. The second choice seems more feasible. However, the message coprocessor would require some mechanism to inform the host that it is "done".
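The second choice can be illustrated with a minimal single-threaded sketch, assuming a completion queue as the "done" mechanism. This is one possible arrangement, not a design from our implementation: the class names, the page size, and the completion queue are all hypothetical, and address translation is not modeled.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Host:
    """Host side: locks pages before a transfer and unlocks them on completion."""

    def __init__(self):
        self.pin_counts = {}        # page number -> outstanding transfers using it
        self.pending = {}           # transfer id -> pages locked for that transfer
        self.completions = deque()  # "done" notices posted by the coprocessor
        self._next_id = 0

    def start_transfer(self, addr: int, length: int) -> int:
        """Lock every page touched by [addr, addr + length); return a transfer id."""
        if length <= 0:
            raise ValueError("length must be positive")
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        pages = list(range(first, last + 1))
        for page in pages:
            self.pin_counts[page] = self.pin_counts.get(page, 0) + 1
        tid = self._next_id
        self._next_id += 1
        self.pending[tid] = pages
        return tid

    def is_pinned(self, page: int) -> bool:
        return self.pin_counts.get(page, 0) > 0

    def drain_completions(self) -> None:
        """Unlock the pages of every transfer the coprocessor reported as done."""
        while self.completions:
            tid = self.completions.popleft()
            for page in self.pending.pop(tid):
                self.pin_counts[page] -= 1
                if self.pin_counts[page] == 0:
                    del self.pin_counts[page]


class MessageCoprocessor:
    """Coprocessor side: performs kernel buffering, then notifies the host."""

    def __init__(self, host: Host):
        self.host = host
        self.kernel_buffers = {}  # transfer id -> copied message bytes

    def copy_to_kernel_buffer(self, tid: int, data: bytes) -> None:
        self.kernel_buffers[tid] = bytes(data)  # the kernel buffering chore
        self.host.completions.append(tid)       # tell the host it is "done"
```

In this sketch the pages stay locked until the host drains the queue, so the cost of the scheme depends on how promptly the host services completions; a per-page count is kept so that overlapping transfers do not unlock a page still in use.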
A related problem is a cache for local memory on the host. Fortunately, the problems with multiprocessor caches are well understood [Goodm 83, Katz 85]. "Bus monitors" [Goodm 83] are a possible solution to the problem that we plan to study in the specific context of our environment.
VLSI Implementation
The hardware assists we proposed in this paper lend themselves to high performance VLSI implementation. For example, we mentioned earlier that a shared memory controller for the smart bus could be built with a couple of chips of reasonable complexity. From our performance results, we are convinced that realizing these subsystems in chip form is worthwhile. Building and studying the performance of such network front-ends is another important research area to be pursued.
Shared Memory Multiprocessors
Systems such as Balance 8000 and Multimax are just a few examples of the newly emerging class of computing structures, "multis" [Bell 85].
As we mentioned earlier (see § 4), we view the smart bus, the message coprocessor, the smart shared memory, and the network interfaces together as a single unit that provides message-passing support to the host at the level of the operating system primitives. In the "multis" environment, it is conceivable that this unit provides message-passing support to all the processors in each node (see Figure 8). The organization shown in Figure 8 raises several interesting issues, such as the semantics of interprocess communication, the interaction of the smart bus with the system bus, and maintaining cache coherency in the presence of the smart shared memory, and promises to be an exciting area of future research.
Conclusion
Local area networking has enhanced the interest of researchers in experimenting with distributed message-based operating systems. Current research [Artsy 84, Cheri 83, Gagli 85] and our own measurements of several operating systems show that interprocess communication (message passing) is roughly two orders of magnitude slower than a simple procedure call. Since system services are requested via message passing, the performance of message-based operating systems depends crucially on the rate of message passing. Our goal in this research was to study the problem of interprocess communication in a distributed system, and suggest a system architecture that improves the performance in this environment. In this paper, we suggested a system architecture that was composed of a software aspect and a hardware organization. Using GTPN as a modeling tool, we showed that software partitioning and high-level bus transactions (mirroring operating system functions) are a promising approach to solving the message-passing problem in distributed systems.
