Due to advances in fiber optics and VLSI technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. This paper examines the performance of distributed-shared-memory (DSM) systems based on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) using queuing network models and develops theoretical results which predict processor utilization, message latency and other useful measures. It also presents simulation results which compare the performance of the SOME-Bus, the mesh and the torus using queuing-network models. The SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over one hundred nodes. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of the N nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or due to contention for shared switching logic. The entire N-receiver array can be integrated on a single chip at a comparatively minor cost, resulting in O(N) complexity. The SOME-Bus has much more functionality than a crossbar because it supports multiple simultaneous broadcasts of messages, allowing cache consistency protocols to complete much faster. The effect of collective communications due to cache coherence is examined. Results reveal that the performance of the SOME-Bus interconnection network is the least affected by large communication times, compared to the other two architectures considered here. Even in the presence of intense coherence traffic, processor utilization and message latency are much less affected than in the other architectures.
the peak when all memory accesses are directed to global memory, and 60% of the peak when memory accesses are directed to local or cluster or global memory with equal probability.
Processor performance in four space-science applications is studied on the HP/Convex Exemplar [18] , a computer with up to 16 clusters (of four processors) connected by four SCI rings. Reduced speedup in some applications is attributed to "irregular data access patterns, global communication between processors and load balancing".
A study of four architectures with hardware support of shared memory is reported in [8] . Significant latency is found, even under optimistic assumptions, especially in characterizing the cache misses which result in traffic over the interconnection network (as evidenced by the very small network utilization in three architectures).
A DSM implementation on a 16-node nCUBE is described in [2] . Experiments with four parallel programs show reduced performance in an application which performs matrix addition on distributed data and requires a significant amount of data-transfer time compared to node-intensive computation time. It is observed that such programs are unsuitable for DSM unless a technique can be found to reduce the communication.
A DSM multiprocessor based on a multistage bus network is studied in [4] . A processor can be either busy or waiting for the response to a memory request. For all values of request probability above 0.1, processor utilization stays below 65% and 40% when almost all packets are directed to local or global memory respectively. Utilization drops dramatically as the request probability increases, a fact attributed to "higher amount of traffic and queuing delays".
A shared memory multiprocessor based on a 4x4 mesh network with wormhole routing is studied in [6] . The performance of two hardware-based prefetching schemes is evaluated by simulation. Cache miss rates in the range from 1% to 10% are observed using five popular applications.
Even after extensive software tuning efforts, many researchers observe poor or moderate performance with many modern large-scale applications, mostly due to load imbalance, barrier synchronization, irregular and dynamic communication patterns, and multicast or broadcast communication due to DSM management, which place an excessive load on the interconnection network. They point out that interprocessor communication is what makes parallel programming challenging. The poor to moderate success encountered with many demanding applications is evidence of the truth of this statement. It is also a force behind the effort to develop interconnection networks that minimize this challenge as much as possible. The major reason for the moderate success lies in the nature of currently available interconnection topologies (small-degree networks with large diameters such as trees, hypercubes and meshes), regardless of their actual implementation medium, and in the mismatch between interconnection architecture and application structure.
High-performance (and high-complexity) interconnection networks have been proposed [5] . More practical networks have also been proposed, including networks based on the mesh with additional broadcast buses. The distributed crossbar switch hypermesh (an implementation of multidimensional hypergraph networks) is examined in [17] , where blocking probabilities and average values of message delays are calculated. Similarly, an optical implementation of hypermeshes using electrical and optical crossbars is examined in [19] . Although multiple wavelengths are used, multiple senders may use the same wavelength, requiring contention resolution. One of the earlier instances where multiple broadcasts are mentioned is in [7] where a 16x16 star coupler is used to deliver multiple wavelengths to different destinations. Another optical interconnection network using limited broadcasts is presented in [9] [10] [11] [12] . Two versions are compared in [3] . All assume the same architecture containing several point-to-point channels, one data broadcast channel, and one control broadcast channel used for arbitration.
The SOME-Bus Broadcast Architecture
The most useful properties of a parallel processor interconnection network are high bandwidth (scaling directly with the number of processors), low latency, no arbitration delay, and non-blocking communication. The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [14] directly connects each processor node to all other nodes and is capable of accommodating over one hundred nodes without suffering from bottlenecks. It uses Bragg gratings within a fiber as narrow-band, inexpensive outcouplers and amorphous silicon detectors built as superstructures on CMOS devices. By using wavelength-division-multiplexing (WDM), it utilizes the higher bandwidth capabilities of optics while optimizing distribution of optical power.
One of its key features is that each node has a dedicated broadcast channel, realized by a specific group of wavelengths in a specific fiber, and an input channel interface based on an array of receivers which simultaneously monitors all channels, resulting in an effectively fully-connected network. Figure 1 shows the receiver array. A separate wavelength is used to insert the clock signal which is common to all data streams on a fiber. A very modest implementation of a system with 128 nodes using CMOS receivers, amorphous Si detectors, 32 fibers with 17 wavelengths (4 per node plus common clock) and a clock rate of 155MHz can provide 77 MB/sec of bandwidth per channel. Using a more aggressive technology such as gallium arsenide lift-off with a clock rate of 1 GHz per wavelength, the same 32-fiber SOME-Bus interconnect can provide up to 500 MB/sec of bandwidth per node.
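The bandwidth figures quoted above can be checked with a short calculation. The sketch below assumes that each data wavelength carries one bit per clock cycle and that 1 MB is 10^6 bytes; both are assumptions made for this illustration, and the function names are hypothetical.

```python
def channel_bandwidth_mb_per_s(data_wavelengths_per_node: int,
                               clock_rate_hz: float) -> float:
    """Bandwidth of one node's output channel in MB/s.

    Assumes one bit per data wavelength per clock cycle (an assumption,
    not a statement from the text) and 1 MB = 10**6 bytes.
    """
    bits_per_second = data_wavelengths_per_node * clock_rate_hz
    return bits_per_second / 8 / 1e6


def wavelengths_per_fiber(nodes: int, fibers: int,
                          data_wavelengths_per_node: int,
                          clock_wavelengths: int = 1) -> int:
    """Wavelengths on one fiber: data wavelengths of its nodes plus the clock."""
    nodes_per_fiber = nodes // fibers
    return nodes_per_fiber * data_wavelengths_per_node + clock_wavelengths


# Configuration from the text: 128 nodes, 32 fibers, 4 data wavelengths per node.
modest = channel_bandwidth_mb_per_s(4, 155e6)      # 155 MHz clock
aggressive = channel_bandwidth_mb_per_s(4, 1e9)    # 1 GHz clock
per_fiber = wavelengths_per_fiber(128, 32, 4)
print(modest, aggressive, per_fiber)
```

Under these assumptions the modest configuration gives 77.5 MB/sec per channel, which matches the quoted figure of 77 MB/sec, the 1 GHz configuration gives 500 MB/sec, and each fiber carries 17 wavelengths.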
Messages exchanged between nodes contain a header field with information on the message type, length and destination address. The input channel interface performs address filtering, barrier processing, length monitoring and type decoding, and supports the arrival of messages from any number of nodes simultaneously, buffering them in a queue until the local processor is ready to remove them. One queue is associated with each input channel, allowing an arbitrary number of messages to arrive simultaneously. Synchronization messages are collected and processed at the receiver. In addition to recognizing its own individual address, the receiver logic can recognize multicast group addresses as well as broadcast addresses.
The SOME-Bus may appear to be equivalent to a crossbar but it has much more functionality. A major consequence of this architecture is that, due to the multiple broadcast capability, no node is ever blocked from transmitting by another transmitter, no arbitration is required and network bandwidth scales directly with the number of nodes. No communication is ever blocked through contention for shared switching logic. With N nodes, the diameter of the SOME-Bus is 1, the time needed for all-to-all communication with distinct messages is O(N) and the time needed for synchronization is O(1). Unlike a fully-connected point-to-point network, where the number of transmitters and channels increases as O(N^2), the number of transmitters and channels of the SOME-Bus is O(N), considerably smaller than the number required in other popular architectures, such as the hypercube or the torus. The number of receivers is N^2, which is larger than the number required in other architectures. They are arranged so that N receivers are fabricated as amorphous silicon structures constructed as a thin film directly on the surface of a digital CMOS device, with no lithography required. Because of the low conductivity of the amorphous silicon layer, no subsequent patterning is required, and therefore the yield and cost of the receiver are determined by the yield and cost of the CMOS device itself. Since the receiver does not need to perform any routing, its hardware complexity (including detector, logic, and packet memory storage) is small, which keeps its cost small as well. The full receiver array can be implemented on a single chip even for large values of N (N > 128). Therefore, the total receiver cost is approximately O(N) instead of O(N^2). The SOME-Bus can most readily support a CC-NUMA system where the shared virtual address space is distributed across local memories which can be accessed both by the local processor and by processors from remote nodes, with different access latencies.
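To make the channel-count comparison concrete, the following sketch counts unidirectional output channels for several topologies. The counting conventions (one directed link per neighbor for the hypercube and the two-dimensional torus, N(N-1) directed links for the fully-connected network) and the function names are assumptions introduced for this illustration.

```python
import math


def directed_channels(topology: str, n: int) -> int:
    """Number of unidirectional output channels for n nodes (illustrative)."""
    if topology == "some_bus":
        return n                      # one dedicated broadcast channel per node
    if topology == "fully_connected":
        return n * (n - 1)            # one directed link per ordered node pair
    if topology == "hypercube":
        dim = round(math.log2(n))
        if 2 ** dim != n:
            raise ValueError("hypercube size must be a power of two")
        return n * dim                # one directed link per dimension per node
    if topology == "torus_2d":
        return 4 * n                  # four nearest neighbors per node
    raise ValueError(f"unknown topology: {topology}")


def some_bus_receivers(n: int) -> int:
    """Each of the n nodes monitors all n channels."""
    return n * n


for name in ("some_bus", "torus_2d", "hypercube", "fully_connected"):
    print(name, directed_channels(name, 64))
print("some_bus receivers", some_bus_receivers(64))
```

For 64 nodes this gives 64 channels for the SOME-Bus against 256 for the torus, 384 for the hypercube and 4032 for the fully-connected network, while the SOME-Bus needs 4096 receivers.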
Traffic on the interconnection network consists of data messages, due to a cache miss on one (local) node, directed to the memory of another (remote) node, and additional messages which allow the caches to maintain data consistency. Although the SOME-bus can utilize software techniques for implementing cache coherence, it allows strong integration of the transmitter, receiver and cache controller hardware to produce a highly integrated system-wide cache coherence mechanism. This hardware-oriented system uses cache blocks to enforce coherence, resulting in reduced probability of false sharing and thrashing, compared to software-oriented systems which use larger blocks such as virtual memory pages.
Snooping is one common technique to maintain coherence. It requires that all caches see every write memory request from every processor, and has limited the scalability of DSM systems in the past because the interconnection network quickly saturates even with a few processors. The SOME-bus does not encounter the same problem. Every processor may simply broadcast, on its own channel, messages which cause updates or invalidations at remote caches. Every receiver can also monitor its input channels for invalidation messages and signal the cache controller to take appropriate action when locally cached data is affected. Although the possibility of interconnection network saturation is eliminated, intense cache consistency traffic can saturate the cache controller. A SOME-bus-based system can take advantage of directory-based techniques which notify only those remote caches with affected data blocks. This can be simply accomplished by including a list of destinations in the invalidation message header. Messages are still broadcast over the sending node's output channel, but the decision to accept or reject an input message is performed at the receiver input rather than the cache controller of each remote node.
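The receiver-side filtering described above can be sketched as follows. This is a minimal software illustration, assuming a header that carries an optional destination list; the class names, field names and the convention that a missing list means a full broadcast are hypothetical and are not taken from the SOME-Bus hardware design.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Message:
    source: int
    kind: str                                        # e.g. "invalidate"
    block: int
    destinations: Optional[FrozenSet[int]] = None    # None means broadcast to all


@dataclass
class Receiver:
    node_id: int
    accepted: List[Message] = field(default_factory=list)

    def on_channel_message(self, msg: Message) -> bool:
        """Accept or reject at the receiver input, before the cache controller."""
        wanted = msg.destinations is None or self.node_id in msg.destinations
        if wanted:
            self.accepted.append(msg)
        return wanted


def broadcast(msg: Message, receivers: List[Receiver]) -> List[int]:
    """Every receiver sees the message; return the ids of those that accept it."""
    return [r.node_id for r in receivers if r.on_channel_message(msg)]


receivers = [Receiver(i) for i in range(4)]
inval = Message(source=0, kind="invalidate", block=42,
                destinations=frozenset({1, 3}))
notified = broadcast(inval, receivers)
print(notified)
```

In this sketch only nodes 1 and 3 pass the invalidation on, so the cache controllers of the other nodes are never involved.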
The receiver array, which is integrated on a single chip, can appear on the processor-memory bus, so that the processor may access it as part of its memory. In addition, the receiver array is connected to the cache and directory controllers for efficient protocol implementation.
Distributed shared memory models
In this paper, two models of a SOME-Bus system are developed and results are compared to additional results from simulation of the SOME-Bus system and systems based on the circuit-switched crossbar and the torus. Theoretical analysis of the SOME-Bus system is based on queuing network models.
Model 1
In a SOME-Bus system with N nodes, each node contains a processor with cache, memory, an output channel and a receiver which can receive messages simultaneously on all N channels. To support the typical MSI protocol, each node also contains a directory which maintains coherence information on the section of the distributed memory which is implemented in that node (the home node). A multithreaded execution model is assumed: each processor executes a program which contains a set of M parallel threads. The node in which a group of M threads executes is called the host or owner node. Threads can only be executed in their owner node. A thread continues execution until it encounters a global cache miss that requires data or permission from a remote node. Then, it becomes suspended until the required action is completed (data transferred from a remote memory or permission received), at which time it becomes ready for execution and eventually resumes execution. A thread is executed on the processor for runtime R before becoming suspended on a global cache miss. A cache miss causes a request message to be enqueued for transmission on the output channel. After transfer time T expires, the message is enqueued in an input queue at the receiver of the destination (remote) node, is serviced by the directory at the remote node and another message is sent back to the originating node with data or acknowledgment. The remote node memory requires time L to assemble the response message. Time L is spent by the directory controller, which must perform the necessary memory accesses, create the response message and enqueue it for transmission on the output channel of the remote node. As part of servicing a message, a directory may send messages to other nodes and receive data or acknowledgments. A global cache miss may be due to a read or write miss at the local cache. Accordingly, a data-request or ownership-request message is sent to the directory of the home node.
When the message is received, the required data block may be in shared state or modified state. The home directory may send a data message back to the requestor (in the case of reading a shared block) or it may send downgrade or invalidation messages and collect invalidation acknowledge messages. Since M indicates the maximum number of outstanding requests that a node may have before it blocks, it is assumed that when a node has fewer than M outstanding requests, it generates messages with mean interval of R time units between requests. R is the mean thread runtime. A request message generated by a node is directed to any other node with equal probability. These types of assumptions on the node operations are similar to the ones made in [20] and to the ones used in [1] and [8] , which study the performance of DSM in a torus system. Since a message is sent by a node to the home node which responds with another message back to the originating node, this type of operation can be represented by a multi-chain, closed queuing network, where messages receive rather complicated forms of service at certain server stations. The M threads owned by each node form a separate class of messages with population equal to M. When m < M threads have outstanding messages, the remaining M-m messages are served in FCFS order at the processor. A message receives service (of geometrically distributed time with mean R) at the owner processor, is enqueued at the output channel of the owner node, receives service (of time T) by the channel, is enqueued in a receiver input queue at a remote node, receives service (of time L) by the directory at the remote node, and is similarly transferred back to the owner node through the remote output channel and owner receiver input queue.
Due to the symmetry of the system, all directory controllers behave in a similar fashion. Therefore, a system with N=M+1 nodes shows the same performance as a system with N > M+1 nodes. Typically, a small number of threads per node is possible. Then a relatively small queuing network representing a SOME-Bus system with M+1 nodes can be used to calculate performance measures. Node P (P=0,1,...,M) owns M regular data messages which circulate through the processor, the directory controllers of the other M nodes and the necessary channels. These messages represent the data-request and ownership-request messages from a remote node to the home node and the data- and ownership-acknowledge messages sent back to the originating node. We use different chains to distinguish between messages belonging to different processors, so that messages in chain P belong to node P. Figure 2 shows the queuing network resulting from a system with N=4 nodes (M=3). The complexity of the model arises from the fact that there are several other messages which are used only to maintain coherence. These messages do not pass through the processors. Instead they are generated by the directory controllers, go through the channels, possibly interact with cache controllers and return to the originating directory controllers. This additional coherence traffic has two direct effects on the regular data messages of the chains.
First, the coherence traffic determines the service time at the directory controller. We assume that data messages are due to a read miss with probability $p_{rd}$, and that a block may be found in shared state with probability $p_{sh}$. There are four distinct cases, each with a different action by the directory controller at the home node:
a) data-request to a block in shared state (probability $p_{rd} \cdot p_{sh}$). The directory controller discards the data-request message and returns a data-acknowledge message back to the requesting node. From the queuing network point of view there is no distinction between a data-request and a data-acknowledge message. This activity can simply be viewed as a data-request message performing one trip from its own processor to the home directory controller and back to its own processor. Its service time at the home directory controller is the time $L_1$ that the directory needs to compose the data-acknowledge message with the copy of the requested block.
b) data-request to a block in modified state (probability $p_{rd} \cdot (1-p_{sh})$). A downgrade message is sent by the home directory to the node with ownership of the block. That node broadcasts a downgrade-acknowledge (with write-back) message to the home node and the requesting node.
c) ownership-request to a block in modified state (probability $(1-p_{rd}) \cdot (1-p_{sh})$). An invalidate message is sent by the home directory to the node with ownership of the block. That node broadcasts an invalidate-acknowledge (with write-back) message to the home node and the requesting node.
d) ownership-request to a block in shared state (probability $(1-p_{rd}) \cdot p_{sh}$). The home directory controller broadcasts invalidations to all nodes having a copy of the requested block. It collects all invalidate-acknowledge messages and then sends an ownership-acknowledge message to the requesting node.
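The probabilities of the four directory cases follow directly from the read-miss probability and the shared-state probability. The sketch below evaluates them for the values 0.8 and 0.8 that the paper uses for its Figure 3 results; the function and key names are hypothetical.

```python
def directory_case_probabilities(p_rd: float, p_sh: float) -> dict:
    """Probability of each of the four cases a) to d) at the home directory."""
    if not (0.0 <= p_rd <= 1.0 and 0.0 <= p_sh <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return {
        "a_data_request_shared": p_rd * p_sh,
        "b_data_request_modified": p_rd * (1.0 - p_sh),
        "c_ownership_request_modified": (1.0 - p_rd) * (1.0 - p_sh),
        "d_ownership_request_shared": (1.0 - p_rd) * p_sh,
    }


cases = directory_case_probabilities(p_rd=0.8, p_sh=0.8)
for name, prob in cases.items():
    print(f"{name}: {prob:.2f}")
```

With these values, case a) accounts for 64% of the requests, cases b) and d) for 16% each, and case c) for 4%, so most requests need no coherence traffic beyond the reply itself.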
The average service time experienced by a data message at the home directory controller is therefore given by equation (2), where $L_1$ is the time that the directory needs to compose the data-acknowledge message with the copy of the requested block, $L_2$ is the channel transfer time, $L_3$ is the average channel queue time, and $L_4 = (L_2 + L_3) + L_5$ is the mean time to send the invalidations and collect the invalidation acknowledgments. We assume that the cache controller works at much higher speed than the network, so that the response time of the cache controller can be ignored. If the number of blocks which must be invalidated on each ownership request is fixed at $N_{inv}$, then $L_5$ is the mean value of a random variable equal to the maximum of $N_{inv}$ identical random variables, each being equal to the service and queuing time at the channel server. In several applications it is observed that typically one invalidation message is sent. In that case, $L_5 = L_2 + L_3$ and $L_4 = 2(L_2 + L_3)$.
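The time to collect the acknowledgments is the mean of a maximum of identical random variables, which can be evaluated in closed form once a distribution is chosen. The sketch below assumes, purely for illustration, that the per-invalidation service and queuing time is geometrically distributed on 1, 2, 3, ... with mean equal to the channel transfer time plus the channel queue time, and that the variables are independent; the paper does not state this distribution for the combined time, and the function names are hypothetical.

```python
def expected_max_geometric(mean: float, n: int, tol: float = 1e-12) -> float:
    """E[max of n i.i.d. geometric variables on {1, 2, ...} with the given mean].

    Uses E[max] = sum over k >= 0 of (1 - (1 - q**k)**n), where q = 1 - 1/mean
    is the probability that a variable exceeds any given step.
    """
    if mean < 1.0 or n < 1:
        raise ValueError("need mean >= 1 and n >= 1")
    q = 1.0 - 1.0 / mean
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if term < tol:          # remaining tail is negligible
            break
        total += term
        k += 1
    return total


def invalidation_time(l2: float, l3: float, n_inv: int) -> float:
    """Time to send the invalidations (l2 + l3) plus the mean collection time."""
    one_way = l2 + l3
    return one_way + expected_max_geometric(one_way, n_inv)


single = invalidation_time(l2=6.0, l3=4.0, n_inv=1)
triple = invalidation_time(l2=6.0, l3=4.0, n_inv=3)
print(single, triple)
```

With a single invalidation the result reduces to twice the sum of the channel transfer and queue times, as stated in the text; with three invalidations the collection time is larger because the directory must wait for the slowest acknowledgment.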
Second, the coherence traffic also passes through the channels. The interference with the data messages can be approximated by the fact that coherence messages absorb some fraction of the service rate of the channel server, making the channel service time of the data messages appear larger. In the expression for this apparent service time, T is the real channel mean transfer time, $K_D = M(M+1)$ is the total number of data messages in the system, and $K_C$ is the mean number of coherence traffic messages present in the system. To calculate $K_C$, we notice that coherence messages are created due to the arrival of data messages into the underlying coherence mechanism. Then, we
Using the reduced channel service rate, and the estimate of the directory service time, the closed queuing network with 3(M+1) queues, (M+1) chains and M messages in each chain can be solved using standard techniques. The only unknown parameter in the above equations is the channel queue time. An initial value can be used (for example, equal to the channel service time) and the model can be solved iteratively. The solution converges because an improper value of the channel queue time results in an improper value of the directory service time which causes the model to produce a new value for the channel queue time which is also improper but with the opposite effect (so that an initial value which is too small results in a new value of the channel queue time which is too large). In our experiments, within two iterations the model produced results which completely agreed with simulation results.
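One standard technique for closed queuing networks is mean value analysis (MVA). As a rough illustration only, the sketch below runs exact single-chain MVA on a three-station cycle of processor, channel and directory. This is a deliberate simplification: the model in the text is multi-chain with 3(M+1) queues and geometrically distributed service times, whereas single-chain MVA assumes a product-form network. The station layout, the visit counts and the numeric values are assumptions, not values from the paper.

```python
from typing import List, Sequence, Tuple


def single_chain_mva(service_times: Sequence[float],
                     visits: Sequence[float],
                     population: int) -> Tuple[float, List[float], List[float]]:
    """Exact MVA for one closed chain over single-server FCFS stations.

    Returns (throughput, residence time per station, mean queue length per station).
    """
    if population < 1 or len(service_times) != len(visits):
        raise ValueError("invalid model description")
    queue = [0.0] * len(service_times)
    throughput = 0.0
    residence = [0.0] * len(service_times)
    for n in range(1, population + 1):
        # An arriving customer sees the queue of a network with n - 1 customers.
        residence = [v * s * (1.0 + q)
                     for s, v, q in zip(service_times, visits, queue)]
        throughput = n / sum(residence)
        queue = [throughput * r for r in residence]
    return throughput, residence, queue


# Hypothetical parameters: R = 100, T = 20, L = 10 time units, M = 3 threads.
# The channel is visited twice per cycle (request and response).
R, T, L, M = 100.0, 20.0, 10.0, 3
throughput, residence, queue = single_chain_mva([R, T, L], [1.0, 2.0, 1.0], M)
processor_utilization = throughput * R
response_time = residence[1] + residence[2]
print(processor_utilization, response_time)
```

In the iterative scheme described above, a solver of this kind would be called repeatedly, with the channel queue time from one solution used to update the directory service time for the next.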
Under the assumption that all processing and transfer times are geometrically distributed, Figure 3 shows the performance of the SOME-Bus architecture measured as processor utilization and response time. Response time is the interval between the instant when a cache miss causes a message to be enqueued in the output channel and the instant when the corresponding data (or acknowledgment) message arrives at the input queue. In Figure 3 , the probability that a message is a data message (due to a global read miss) is $p_{rd} = 0.8$; the probability that it is an ownership-request (due to a global write miss) is 0.2. This assumption is consistent with commonly observed memory reference patterns, where write cycles constitute about 15% to 20% of all memory cycles. The probability that a block is found in shared state is $p_{sh} = 0.8$. The ratio of mean transfer time to mean thread run time varies between 0.05 and 1. This range of the ratio is sufficient to capture the system behavior under most common configurations and cache behavior. Specifically, let m be the miss rate and F the number of instructions per second performed by the processor at each node. Also, let S be the mean message size in bytes and C the channel bandwidth (in bytes per second). Then the mean thread runtime is equal to $R = 1/(mF)$, and the mean message transfer time (with no blocking) is $T = S/C$. Their ratio is $T/R = mSF/C$. In current high-performance architectures, the ratio F/C is in the range of 0.5 to 1. For example, in the Cray T3D, $F = 150 \times 10^6$, and byte-wide network links operate at 150 MHz; in the Cray T3E, network links have 2 to 4 times more bandwidth; in the ASCI RED, each node has two processors with peak $F = 200 \times 10^6$, and network links capable of 400 Mbytes per second in each direction. With small cache blocks and miss rate in the neighborhood of 10% or less, the resulting ratio T/R is in the range of 0.05 to 1.
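The relation between miss rate, message size and the ratio T/R can be evaluated directly. In the sketch below the processor rate and link bandwidth are the ASCI RED figures quoted above, while the miss rate of 5% and the message size of 32 bytes are hypothetical values chosen only for illustration; the function name is also an assumption.

```python
def transfer_to_runtime_ratio(miss_rate: float, message_bytes: float,
                              instr_per_sec: float,
                              channel_bytes_per_sec: float) -> float:
    """Return T/R, with R = 1/(m F) and T = S/C as defined in the text."""
    runtime = 1.0 / (miss_rate * instr_per_sec)        # mean thread runtime R
    transfer = message_bytes / channel_bytes_per_sec   # mean transfer time T
    return transfer / runtime


ratio = transfer_to_runtime_ratio(miss_rate=0.05,          # hypothetical
                                  message_bytes=32,         # hypothetical
                                  instr_per_sec=200e6,      # ASCI RED peak F
                                  channel_bytes_per_sec=400e6)
print(ratio)
```

For these inputs the ratio is 0.8, which falls inside the range of 0.05 to 1 considered in the analysis.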
Simulation results and comparisons
This section presents simulation results to compare the performance of the SOME-Bus architecture to the performance of the two-dimensional torus and the circuit-switched crossbar architectures using a queuing-network model.
In the torus architecture, each node is connected to its four nearest neighbors. A node has four output queues which transfer messages through the associated channel to the destination nodes. Wormhole routing is used with an adaptive procedure to choose the next free channel in the direction from source to destination. In the crossbar architecture, each node has one output and one input channel. The crossbar can connect the output channel of one node to the input channel of any other node. A message waits at the queue of the output channel, if the destination input channel is being used by another output channel. When a node needs to multicast the same message, it waits until all required input channels are free, reserves them, and simultaneously sends a single copy of the message to all nodes. The destination node selection is uniform over all nodes in the SOME-Bus and crossbar architectures and uniform in each direction in the torus architecture.
A single invalidation message is broadcast on the SOME-Bus architecture. In the crossbar architecture, a node waits until all needed input channels are free, reserves them, and broadcasts one invalidation message to all destination nodes. In the torus architecture, a spanning tree is created from the source node to all destination nodes and multidestination wormholes broadcast the invalidation message.
The major parameters of the simulation are the distribution of the thread runtime and directory time to assemble a response message, the distribution of the message sizes, and the destination node selection distribution. Additional parameters of the simulation are the fraction of write messages, the number of invalidation messages sent with every request-for-ownership message and the amount of service time they receive at the destination node. Several quantities are used to measure and compare the performance of the architectures. The primary ones are the average fraction of time that threads are executing, and the average response time.
The reference point for the timing parameters and measurements of the simulation is the thread runtime R, assumed geometrically distributed with mean of 100 time units. Message size is selected in such a way that the transfer time T is fixed in the range from 5 to 100 time units (so that the resulting ratio T/R is in the range of 0.05 to 1 as mentioned above). In the torus architecture, T is the real transfer time only if the wormhole is not blocked at some intermediate node between source and destination. In all architectures, the processor at each node is executing a program with M=3 threads. An ownership-request to a shared block results in three invalidations. Results are compared using systems with 64 or 256 nodes. Figure 4 shows the processor utilization (which is the fraction of time that the processor is busy executing its own threads) for the three architectures as the ratio of mean transfer time to mean thread run time varies between 0.05 and 1. We also assume that message transfer time is fixed rather than geometrically distributed. Due to the reduction in variance, the processor utilization shows a small increase compared to the utilization shown in Figure 3. (Similarly, the response time shows a decrease, due to the variance reduction.) As traffic intensity increases, utilization drops and becomes quite limited in the torus and the crossbar. The reason is the significant increase of channel waiting time with the resulting increase in response time, as Figure 4 shows. The simulation results on the torus system seem to agree with theoretical results reported in [2] where the authors find 90% utilization in a system with 64 nodes, four threads per node and light network traffic. These simulation results are significant as they indicate a more reasonable performance on the SOME-Bus architecture, especially compared to the other architectures when the ratio T/R becomes large.
Such large T/R ratios, due to miss rates around 10%, can reasonably be expected in many large applications. Utilization drops in all systems but it drops more quickly in the crossbar and torus architectures, where at values of T/R above 0.45, the crossbar and torus utilizations become less than 50%. In a system with 256 nodes, the performance of the SOME-Bus architecture and the crossbar remains the same while it is quite reduced in the torus.
The relative ability of the three networks to deliver messages without congestion can also be characterized by the average fraction of time that channels are idle. As Figure 5 shows, in the torus architecture, even as traffic intensity increases, channels remain free more than 50% of the time but cannot be used. A similar limitation is observed in the crossbar, while in the SOME-Bus architecture channel utilization increases even as traffic intensity increases. One reason for this phenomenon is that wormholes in the torus become blocked as the required channels are used by other wormholes. Simulation shows that the average number of times that a wormhole is blocked for a significant amount of time (more than a small number of time units) is between 1 and 2 in the 64-node system and between 3 and 4 in the 256-node system. Even at small values of the T/R ratio, it is apparent that wormholes encounter significant blocking all the time, and even under moderate traffic, wormhole routing tends to perform as store-and-forward routing. 
Model 2
One of the common assumptions in most DSM models available in the literature is that processor (and cache) behavior is unaffected by the activities of the directory controller (except through the protocol functions). However, the directory controller accesses memory in order to serve the incoming requests, and consequently uses some of the memory bandwidth which would otherwise be available to the local cache controller. As cache misses become more frequent and message arrival rate increases at the directory controller queue, the resulting reduction of bandwidth available to the cache controller cannot be ignored. In fact, if the node is implemented as a small symmetric multiprocessor, this phenomenon may occur even at relatively small miss rates.
The additional assumption that Model 2 makes is that while the directory controller is serving a request, it competes with the cache controller which is also generating accesses into the memory. With Model 2, we model the interference that a message creates while it is served at a remote node. The queuing network is similar to the one shown in Figure 2 , but there is interaction between the processor and the directory on a memory-cycle basis. The shared resource is the memory: when only one of the two controllers needs to access memory, it receives the full memory bandwidth. When both controllers need access, each receives a fraction of the bandwidth. This mode of operation is still modeled by the two queues labeled PROC and DIR in Figure 2 . When one queue has messages and the other is empty, the message at the head of the queue receives full service. When both queues have messages, then the two messages (and only those two) at the heads of the queues receive PS service, using fractions $\alpha_0$ and $\alpha_1$ of the server capacity ($\alpha_0 + \alpha_1 = 1$). Mean service time is different for messages in the two queues.
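The sharing rule for the memory server can be written as a small rate function. This is a sketch under stated assumptions: the name alpha0 for the processor's capacity fraction is hypothetical, the directory is assumed to receive the complementary fraction, and the directory rate is assumed to scale in the same way as the processor rate.

```python
from typing import Tuple


def memory_service_rates(proc_queue_len: int, dir_queue_len: int,
                         mean_proc_time: float, mean_dir_time: float,
                         alpha0: float) -> Tuple[float, float]:
    """Service rates seen by the heads of the PROC and DIR queues.

    A queue that is alone at the memory gets the full rate; when both queues
    are nonempty the heads share the server in fractions alpha0 and 1 - alpha0.
    """
    if not 0.0 <= alpha0 <= 1.0:
        raise ValueError("alpha0 must lie in [0, 1]")
    contended = proc_queue_len > 0 and dir_queue_len > 0
    proc_share = 0.0 if proc_queue_len == 0 else (alpha0 if contended else 1.0)
    dir_share = 0.0 if dir_queue_len == 0 else (1.0 - alpha0 if contended else 1.0)
    return proc_share / mean_proc_time, dir_share / mean_dir_time


alone = memory_service_rates(2, 0, mean_proc_time=100.0, mean_dir_time=10.0, alpha0=0.6)
shared = memory_service_rates(2, 1, mean_proc_time=100.0, mean_dir_time=10.0, alpha0=0.6)
print(alone, shared)
```

With a mean thread runtime of 100 time units, the processor completes service at rate 0.01 when the directory queue is empty and at rate 0.006 when a request is being served at the directory, using the hypothetical fraction of 0.6.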
As mentioned earlier, due to the symmetry of the architecture, a system with N = M+1 nodes shows the same performance as a system with N > M+1 nodes, where M is the number of threads owned by one processor node. Typically, a small number of threads per node is possible. For example, if M=3, a queuing network modeling a 4-node system is also sufficient to model any larger network (with three threads owned by each processor). However, because of the interaction between the processor-cache controller and the directory controller, a closed queuing network cannot be used as in Model 1. Simpler models of this type have been solved using a Markov chain [13]. The large number of states of the present model makes the full Markov chain approach impractical. Here, we present an approximate model, where one specific node is selected (with no loss of generality) and all remaining nodes and the network are aggregated into two queues. The results of this model show very good agreement with simulation results.
The queuing network is shown in Figure 6. Node 0 is the selected node, consisting of a processor with cache, a directory and a channel server, as in the previous model. Two chains of messages pass through this queuing network. Messages owned by node 0 form chain C_0, while messages owned by all other nodes belong to chain C_a. A message of chain C_0 passes through the processor server (service time is the average thread run time) and through the channel server. After exiting the channel server, it enters the other part of the network for a period of time equal to the latency experienced by that message. A message of chain C_a originates at a node in the other part of the network and passes through the directory controller (server) and the channel server. The latency experienced by that message is equal to the time it spends in node 0. In addition, the time a message of chain C_a spends in the other part of the network is equal to the time a message of chain C_0 spends in node 0 (think time). We assume that this queuing network can be approximated by a Markov chain, and we model the interaction of the processor and the directory through its transition rates.
Figure 7 shows the state transitions out of a representative state S_0 (when all message numbers are nonzero). The transitions from state S_0 to states S_a and S_d are the ones where the interaction between processor and directory appears. Departures from the processor server in node 0 occur with rate p_0a, which is equal to 1/R if there are no chain-C_a customers at the directory server, or to 1/R scaled by the processor's share of the memory bandwidth if there is at least one chain-C_a customer at the directory server.
Similarly, chain-C_a customers at the directory server depart at rate p_0d, which is equal to 1/L if there are no chain-C_0 customers at the processor server, or to 1/L scaled by the directory's share of the memory bandwidth if there is at least one. Messages depart from the channel server at different rates depending on which chain they belong to. Messages of chain C_0 depart with rate p_0c, which is 0.25/T', and messages of chain C_a depart with rate p_0f, which is 0.75/T', because there are three times as many messages in chain C_a as in chain C_0. (T' is the apparent service time given in Eq. (2).)
Finally, the transitions from state S_0 to states S_b and S_e indicate an arrival of a chain-C_0 message at the processor queue and an arrival of a chain-C_a message at the directory queue, which occur when the latency or think time, respectively, expires. The corresponding departure rates depend on the number of messages present in the remaining part of the network. Specifically, if N_2 messages of chain C_0 are out of node 0, then they are in a remote directory server (or the associated channel). All N_2 messages may be in the same or in different remote directories. Each message spends in Q_L, on average, a time LAT, which is the average latency that a message encounters. The rate of the transition to S_b follows by taking into account the probabilities that the N_2 messages in Q_L are served by one, two or three directories. Similarly, each message spends in Q_T, on average, a time TNK, which is the average think time that a message encounters, and the rate of the transition to S_e follows by taking into account the probabilities that the N_5 messages in Q_T reside in one, two or three processors.
Under the assumption that a Markov chain, with state transition probabilities as given above, describes the SOME-Bus architecture, we can calculate the state probabilities using standard techniques. There are two parameters in the solution of the Markov chain: the latency (LAT) encountered by messages of chain C_0 while they are out of Node 0, and the think time (TNK) encountered by messages of chain C_a while they are out of Node 0.
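The 0.25/0.75 split of the channel-server rate follows from counting messages in the reduced (M+1)-node model. The sketch below reproduces that arithmetic. It assumes that each thread contributes exactly one message to its owner's chain. The function name `channel_departure_rates` and the extension to values of M other than 3 are illustrative assumptions, not taken from the text.

```python
def channel_departure_rates(threads_per_node, t_apparent):
    """Split the channel-server rate 1/T' between chains C_0 and C_a.

    threads_per_node : M, threads (and hence messages) owned by each node
    t_apparent       : T', the apparent channel service time
    Returns (rate for chain C_0, rate for chain C_a).
    """
    if threads_per_node < 1:
        raise ValueError("threads_per_node must be at least 1")
    if t_apparent <= 0:
        raise ValueError("t_apparent must be positive")

    nodes = threads_per_node + 1                     # reduced model: N = M + 1
    own_messages = threads_per_node                  # chain C_0 (node 0 only)
    other_messages = threads_per_node * (nodes - 1)  # chain C_a (all other nodes)
    total = own_messages + other_messages

    # Each chain receives the channel rate in proportion to its message count.
    rate_c0 = (own_messages / total) / t_apparent
    rate_ca = (other_messages / total) / t_apparent
    return rate_c0, rate_ca
```

For M=3 the counts are 3 messages in chain C_0 and 9 in chain C_a, which gives the fractions 0.25 and 0.75 quoted above.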
Since values of these parameters can be calculated from the state probabilities of the Markov chain, we employ an iterative solution. Initially, the latency LAT is set to the average channel transfer time T' and the think time TNK is set equal to the average thread run time R. The state probabilities are then calculated, new values of LAT and TNK are computed from them, and the process is repeated until convergence is observed. The process converges because, if a value for one parameter is larger than the correct one, the solution of the Markov chain probabilities will result in a value that is smaller than the selected value (and similarly for selected values smaller than the correct one). For example, if a value selected for parameter LAT is larger than the correct one, then more messages will tend to remain in Q_L, which would result in smaller queues in the directory and channel of node 0, and the new value of LAT will be smaller. In our experiments, convergence was observed within 6-8 iterations.
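The iteration described above is a fixed-point computation on the pair (LAT, TNK). The following sketch shows only the structure of the loop. The actual update step (solving the Markov chain and recomputing LAT and TNK from its state probabilities) is not reproduced here. The function `toy_update` is a made-up stand-in with the same negative-feedback property, and all names and constants in it are assumptions chosen for illustration.

```python
def solve_fixed_point(update, lat0, tnk0, tol=1e-6, max_iter=100):
    """Iterate (LAT, TNK) <- update(LAT, TNK) until both values stop changing.

    update : callable mapping the current (LAT, TNK) to new (LAT, TNK); in the
             model this would solve the Markov chain for the given parameters
    lat0   : initial latency, e.g. the average channel transfer time T'
    tnk0   : initial think time, e.g. the average thread run time R
    Returns (LAT, TNK, number of iterations used).
    """
    lat, tnk = lat0, tnk0
    for iteration in range(1, max_iter + 1):
        new_lat, new_tnk = update(lat, tnk)
        change = max(abs(new_lat - lat), abs(new_tnk - tnk))
        lat, tnk = new_lat, new_tnk
        if change < tol:
            return lat, tnk, iteration
    raise RuntimeError("no convergence within max_iter iterations")


def toy_update(lat, tnk):
    """Hypothetical stand-in for the Markov-chain step (NOT the paper's model).

    A larger input yields a smaller output, mimicking the negative feedback
    that the text gives as the reason for convergence.
    """
    return 20.0 + 400.0 / (lat + 20.0), 100.0 + 400.0 / (tnk + 20.0)


lat, tnk, iterations = solve_fixed_point(toy_update, lat0=20.0, tnk0=100.0)
```

The iteration count obtained with the toy update says nothing about the 6-8 iterations reported for the real model; only the loop structure carries over.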
Under the assumption that all processing and transfer times are geometrically distributed, Figure 8 shows the performance of the SOME-Bus architecture measured by processor utilization, directory utilization, channel utilization and response time. Response time is the interval between the instant when a cache miss causes a message to be enqueued in the output channel and the instant when the corresponding data or acknowledgment message arrives at the input queue. When both are active, the cache and directory controllers are each allocated half the total memory bandwidth (both fractions equal 0.5). As before, the ratio of mean transfer time to mean thread run time varies between 0.05 and 1, which is sufficient to capture the system behavior under common configurations and cache miss rates. Performance results are obtained from the theoretical queuing model and from simulation of a SOME-Bus architecture with 64 nodes. The figure shows the SOME-Bus performance for two cases: a) when all messages sent to a remote node are data messages due to global read misses and b) when there is a mixture of data and ownership-request messages. In the second case, the probability that a message is a data message (due to a global read miss) is p_rd = 0.8; the probability that it is an ownership-request (due to a global write miss) is 0.2. The probability that a block is found in shared state is p_sh = 0.8. When the directory controller receives an ownership-request to a block in shared state, it sends invalidation messages to the nodes with a copy of the block. In Figure 8, this number of invalidations is assumed to be equal to 1. In case (a), where all messages are data-request messages to shared blocks, no invalidations are sent. These two cases are indicated in Figure 8 by Inv=0 and Inv=1. In both cases, there is very good agreement between theoretical and simulation results, especially in the processor utilization. As the miss rate increases, the T/R ratio also increases.
Messages spend more time (compared to the average thread run time R) in the network, with a resulting increase of the response time and channel utilization. In the presence of ownership-request messages,
Simulation results and comparisons
Figure 9 presents simulation results (processor utilization and response time) that compare the performance of the SOME-Bus architecture with that of the two-dimensional torus and the circuit-switched crossbar architectures using the assumptions of Model 2. These assumptions are similar to the assumptions of Model 1, and in addition we assume that the directory operation interferes with the cache in all three architectures. In all architectures, the processor at each node executes a program with M=3 threads. An ownership-request to a shared block results in three invalidations. Results are compared using systems with 64 or 256 nodes. As in the previous experiments, the thread run time R is geometrically distributed with a mean of 100 time units. Message size is selected so that the transfer time T is fixed (rather than geometrically distributed) in the range from 5 to 100 time units, and the resulting ratio T/R is in the range of 0.05 to 1. Due to the reduction in the variance of the message transfer time, the processor utilization shows a small increase compared to the utilization shown in Figure 8. As traffic intensity increases, utilization drops in all architectures. The SOME-Bus and crossbar architectures demonstrate complete scalability, as there is no difference in processor utilization between the 64-node and 256-node systems. In contrast, the 256-node torus shows very limited processor utilization even at low values of the T/R ratio. Although the torus has four times more channels than the SOME-Bus, latency becomes very large, mostly because wormholes tend to be blocked even at moderate traffic.
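The workload parameters of this experiment can be generated as in the sketch below. It assumes a geometric distribution supported on 1, 2, 3, ... (so that the mean is 1/p) and samples it by inverse transform. The function names, the seed and the particular list of transfer times are illustrative choices, not values taken from the simulator.

```python
import math
import random


def sample_geometric(mean, rng):
    """Draw one geometric variate on {1, 2, ...} with the given mean (mean > 1)."""
    p = 1.0 / mean
    u = rng.random()                      # uniform on [0, 1)
    # Inverse-transform sampling; max() guards the u == 0 corner case.
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))


def t_over_r(transfer_times, mean_run_time):
    """T/R ratio for each fixed transfer time T and mean thread run time R."""
    return [t / mean_run_time for t in transfer_times]


rng = random.Random(1)                    # fixed seed, for repeatability only
run_times = [sample_geometric(100.0, rng) for _ in range(100_000)]
sample_mean = sum(run_times) / len(run_times)

# Fixed transfer times spanning the stated range of 5 to 100 time units.
ratios = t_over_r([5, 10, 20, 50, 100], 100.0)
```

The sample mean of `run_times` should be close to 100 time units, and `ratios` spans the stated range from 0.05 to 1.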
Figure 10 shows the time to collect the invalidation acknowledge messages together with the overall response time (dotted lines) for each architecture. In the SOME-Bus, invalidation acknowledge messages do not contribute to latency any more than other messages. In contrast, these messages are a significant contributor to the latency in the other architectures although only a small fraction (20%) of the total number of messages are ownership requests.
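A back-of-the-envelope calculation shows how much invalidation traffic the stated parameters imply. The sketch assumes that the message type and the block state are independent, and that the shared-state probability of 0.8 used for Figure 8 also applies here. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def expected_invalidations(p_ownership, p_shared, invalidations_per_request):
    """Expected invalidation messages generated per remote request.

    p_ownership               : probability that a request is an ownership-request
    p_shared                  : probability that the target block is in shared state
    invalidations_per_request : invalidations sent for one such request
    Assumes request type and block state are independent (an assumption).
    """
    for prob in (p_ownership, p_shared):
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_ownership * p_shared * invalidations_per_request


one_inv = expected_invalidations(0.2, 0.8, 1)    # the Inv=1 setting of Figure 8
three_inv = expected_invalidations(0.2, 0.8, 3)  # three invalidations per request
```

Under these assumptions the three-invalidation setting adds roughly one invalidation for every two remote requests. In networks without broadcast, each invalidation also requires its own acknowledgment, which is why a small fraction of ownership requests can still contribute noticeably to latency.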
Conclusion
Distributed shared memory allows ease of programming in large-scale parallel computer systems by hiding the message-passing mechanism. Mapping the shared logical address space onto local memories and keeping it consistent requires significant traffic on the interconnection network. There is extensive research on protocols that may reduce the network traffic, but they require additional work from the programmer. Success of DSM relies on the ability of the interconnection network to transfer messages with as little latency as possible and to support the coherence protocols. Extensive research on currently common interconnection networks indicates reasonable performance when the network is lightly loaded. Performance is drastically reduced when more frequent cache misses result in heavier network traffic, leading some researchers to reject some applications as not suitable for DSM. Yet, as applications become larger and are distributed over a larger number of processors, more frequent cache misses and heavier network traffic should be expected. To avoid classifying a large number of applications as unsuitable, it is critical that interconnection networks with better properties be developed. Current advances in optical technology have made the SOME-Bus interconnection network a realistic, highly competitive candidate which promises to deliver the necessary performance. Its power is due to the fact that it requires no routing and no arbitration. No node is ever blocked from sending a message to any other node. Its simplicity results in reduced cost; although it works as a fully-connected network, the complexity of the SOME-Bus grows as O(N), and it minimizes both the number of expensive transmitters and the cost of the highly replicated receivers. Theoretical and simulation results presented in this paper show that the SOME-Bus is much less sensitive to heavy traffic than other popular networks.
Even when higher miss rates result in increased message traffic, processor utilization in a SOME-Bus-based system stays above 70%, while latency is almost negligible compared to latency observed in the torus and mesh architectures. In addition, invalidation messages, due to write accesses of local memory by the local processor, create no interference with the other messages transferred on the SOME-Bus and therefore cause no reduction in performance. This property is not observed in other common interconnection networks. 
