One reason that Active Messages are more efficient than sends and receives is that an active message, upon arrival at a processor, is processed immediately by a specified routine that was designed explicitly for that message. Therefore, the operating system has very little to do with the message, and latencies can be controlled by the specified routine.
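The dispatch idea described above can be illustrated with a minimal sketch. It assumes that each message carries a small integer naming its handler; the handler table, the handler names (`accumulate`, `store`), and the `deliver` function are hypothetical and are not taken from any Active Messages implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical handler table: maps the handler id carried in the message
# header to the routine written for that kind of message.
HandlerTable = Dict[int, Callable[[List[float], float], None]]


def accumulate(state: List[float], value: float) -> None:
    """Handler for one message type: add the payload to a running sum."""
    state[0] += value


def store(state: List[float], value: float) -> None:
    """Handler for another message type: overwrite the stored value."""
    state[0] = value


HANDLERS: HandlerTable = {0: accumulate, 1: store}


def deliver(state: List[float], message: Tuple[int, float]) -> None:
    """Run the handler named in the message as soon as the message arrives.

    Nothing is buffered or matched against a posted receive; the message
    itself says which routine processes it.
    """
    handler_id, payload = message
    HANDLERS[handler_id](state, payload)
```

Delivering `(1, 2.0)`, then `(0, 3.5)`, to a state of `[0.0]` stores 2.0 and then adds 3.5. The lookup in `HANDLERS` stands in for the "select, verify, and call" step whose cost is discussed next.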
Although Active Messages will probably operate more efficiently than the send/receive model, it is not clear that no more efficient model exists. The overhead required to select, verify, and call the correct routine, along with the effects of interrupting the processor and the cache, will probably be considerably more than the hardware latency, a time that we would like to match with the software latency. An example of where such fine-grained communication would be required is a pipeline algorithm that will be described in more detail in the next section. In this algorithm, a message consists of a single floating point value and the number of operations between communication
The interrupt logic on the 860 chip, used to handle asynchronous communication events, is a very large component of the message latency. It should be noted, however, that in the test usually performed to measure latency, bouncing messages between neighboring nodes, interrupts will not occur. In this test each of two nodes repeatedly waits for an incoming message and then immediately sends a message to the other processor. When a processor waits for a message to arrive, the interrupt logic is turned off and the processor sits in a very tight loop waiting for the status register to change before pulling the message from the input buffer. In a more realistic situation where a message arrives before it is needed, causing an interrupt, the message latency can increase from 70 µs to 130 µs. (The higher time was measured when each process would wait for a message by continuously executing the probe function until a message arrived.)
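The structure of the bounce test can be sketched as follows. This is only an illustration under stated assumptions: two Python threads and two queues stand in for the two nodes and the network, and the function name `ping_pong` is hypothetical. It does not reproduce the NX/2 calls or the timings quoted above.

```python
import queue
import threading
import time


def ping_pong(round_trips: int) -> float:
    """Return the mean one-way latency in seconds over `round_trips` bounces."""
    to_a: "queue.Queue[float]" = queue.Queue()
    to_b: "queue.Queue[float]" = queue.Queue()

    def node_b() -> None:
        # Node B: wait for a message, then immediately send it back.
        for _ in range(round_trips):
            to_a.put(to_b.get())

    worker = threading.Thread(target=node_b)
    worker.start()

    start = time.perf_counter()
    for _ in range(round_trips):
        to_b.put(1.0)  # a single floating point value per message
        to_a.get()     # node A waits for the echo before sending again
    elapsed = time.perf_counter() - start
    worker.join()

    # Each round trip consists of two one-way messages.
    return elapsed / (2 * round_trips)
```

Because the receiver is always already waiting when a message arrives, a test of this shape measures the best case and never exercises the path in which a message arrives unexpectedly.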
Interrupts are so expensive on the 860 chip mostly because of the large state of the processor. This includes 32 floating point registers, 32 integer registers, an add, multiply, and load pipeline, and fairly complex instruction modes that all must be saved and reconstructed before resuming normal processing. The result is that it takes on the order of 1000 instructions to handle an interrupt. This does not include any of the time to process a message.
Other sources of increased latency include the time to do a trap into the operating system and the effects on the cache of messages asynchronously arriving at a node. The time to execute an operating system call, although not as severe as communication interrupts, is roughly 50 instructions. We have not measured the cost that an interrupt will have on the instruction and data caches, but we expect that it would be substantially more than the hardware latency.
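The figures quoted above can be put side by side with a few lines of arithmetic. The instruction counts and latencies below are the ones given in the text; the instruction rate passed to `instruction_time_us` is a free parameter, and the value of 40 instructions per microsecond used in the note after the block is purely an illustrative assumption, not a measured figure.

```python
INTERRUPT_INSTRUCTIONS = 1000  # "on the order of 1000 instructions" (from the text)
TRAP_INSTRUCTIONS = 50         # "roughly 50 instructions" (from the text)
POLLED_LATENCY_US = 70.0       # latency in the bounce test (from the text)
INTERRUPT_LATENCY_US = 130.0   # latency in the more realistic case (from the text)


def added_latency_us(polled_us: float, realistic_us: float) -> float:
    """Extra latency seen when the receiver is not already waiting."""
    return realistic_us - polled_us


def cost_ratio(interrupt_instr: int, trap_instr: int) -> float:
    """How many operating system traps cost as much as one interrupt."""
    return interrupt_instr / trap_instr


def instruction_time_us(instructions: int, instructions_per_us: float) -> float:
    """Time for a given instruction count at an ASSUMED instruction rate."""
    return instructions / instructions_per_us
```

With these numbers the added latency is 60 µs and one interrupt costs as much as 20 operating system calls. Under the assumed rate of 40 instructions per microsecond, `instruction_time_us(INTERRUPT_INSTRUCTIONS, 40.0)` gives 25 µs; a different rate scales this figure proportionally.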
The operating system on the iPSC/860 (NX/2) controls the communication hardware and interrupt mechanisms. The communication model is based on sending and receiving contiguous blocks of typed messages. NX/2 must handle the general case of having any message of any size arrive at any time without the operating system crashing. This requires a complex system that handles buffer management, handshake protocols, interrupts, security, and other issues. The operating system, although quite complex, handles this general case very well.
Another aspect of the operating system that adds to the latency of communication is that the operating system uses the same communication network as the applications. Because of this, NX/2 must be reliable and secure. This requirement adds to the overall message latency. Outgoing messages must be checked for valid addresses, and the processor must assume that any system message can arrive at any time and must be handled properly.
Using the same hardware for both the operating system and user applications, without hardware support for message security, is a problem that will make the goal of reducing the latency to a few instructions very difficult if not impossible. The only good solution for these problems is to handle them in hardware. It should be noted that the CM-5 is one machine that does not have these problems. We will not consider this problem any further and will assume that a single user has control of the entire machine.
The software latency associated with sending messages is primarily due to a complex general purpose operating system that can handle any situation. However, most of the time an application does not need the power of a general operating system. Many times the application writer has specific information that can be used to take advantage of the hardware using the Intel primitives. In the modified version the bulk of the time spent was in the trap handler saving and restoring the processor state. This final program is much more complicated than the other programs because of the complex nature of how the processes synchronize. In the other programs the synchronization is done based on the FIFOs, while the synchronization involved in the last program is similar to that found in shared memory systems. Although it was not measured, we also expect that the time required to handle the synchronization is significantly more than that in the other programs.
In summary, these tests imply several characteristics for any proposed communication system. First, interrupts are very expensive and should be avoided. This will become more important as the size of the processor state that must be saved and restored grows with the complexity of the microprocessors used. Second, the operating system should be minimally involved with communications. As seen above, where communications can be generated with one or two assembly instructions, any interaction with the operating system will be very expensive. This removal of the operating system implies that issues typically handled by the underlying system, such as security and IO, should be handled elsewhere. Finally, many numerical algorithms have a repetitive communication pattern, such as a pipeline, and the ability to set up the communication paths once and then reuse them can make very efficient use of the hardware.
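The last point, setting up paths once and reusing them, can be sketched with a small pipeline model. Everything here is an assumption made for illustration: the `Path` class models a FIFO between neighboring stages, `run_pipeline` is a hypothetical driver, and the stages run one after another in a single process rather than concurrently on separate nodes.

```python
from collections import deque
from typing import Callable, Deque, List


class Path:
    """A one-way FIFO path to a neighbor, created once and reused for every send."""

    def __init__(self) -> None:
        self.fifo: Deque[float] = deque()
        self.sends = 0  # number of messages that have used this path

    def send(self, value: float) -> None:
        self.fifo.append(value)
        self.sends += 1

    def receive(self) -> float:
        return self.fifo.popleft()


def run_pipeline(values: List[float],
                 stages: List[Callable[[float], float]]) -> List[float]:
    """Push each value through every stage over paths that are set up once."""
    # Setup phase, done once: paths[i] feeds stage i, paths[-1] holds results.
    paths = [Path() for _ in range(len(stages) + 1)]

    for value in values:
        paths[0].send(value)

    # Steady state: a stage only receives, computes, and sends on fixed paths.
    for i, stage in enumerate(stages):
        for _ in range(len(values)):
            paths[i + 1].send(stage(paths[i].receive()))

    return list(paths[-1].fifo)
```

For example, `run_pipeline([1.0, 2.0], [lambda x: x + 1.0, lambda x: x * 2.0])` returns `[4.0, 6.0]`. No path is created or looked up inside the steady-state loop, which is the property the summary argues for.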
5
Implementing are general purpose processors but in reality the communication processor could be a custom processor for handling communications. An interesting problem with using two processors as opposed to one is the effect of the cache. If a single processor handled both communication and computation, and the data to be sent were residing in the cache at the send, then there would be no traffic between the memory and the processor; the processor would send the data directly to the network. If two processors are used, then the data will first have to be flushed from the main processor's cache so that the second processor can use it. For small messages it will probably be more efficient to have the computation processor send the data.
The processors communicate with other nodes by receiving information from the network using the input buffer or by sending information 
