This paper develops a performance model of an optically interconnected parallel computer system operating in a distributed shared memory environment. The performance model is developed to reflect the impact of the low level optical media access protocol and the optical device switching latency on high level system performance. This enables the model to predict the performance impact of supporting distributed shared memory with different address allocation schemes and media access protocols. The passive star-coupled photonic network operates through wavelength division multiple access. Two media access protocols are examined for this WDM network; both are designed to operate in a multiple-channel multiple-access environment and require each node to possess a wavelength tunable transmitter and a fixed (or slow tunable) receiver. A semi-Markov model has been developed to study the interaction of the distributed shared memory architecture and the two access protocols of the photonic network. This analytical model has been validated by extensive simulation. The model is then used to examine the system performance with varying numbers of nodes and wavelength channels, and varying memory and channel access times.
Introduction
Distributed shared memory (DSM) systems have attracted much interest as a means of obtaining the topological advantages of distributed memory systems and the programming advantages of shared memory systems [1] [2] [3] . Distributed shared memory systems have a global address space that is shared by all processors [4] . Although there have been many proposed approaches, a common goal is a reduction of the network traffic required to support DSM.
The approach proposed in this paper eases the traffic design constraint by incorporating a wavelength-division multiplexed (WDM) photonic network to support interprocessor communication. This paper examines the resulting performance of the system through a detailed performance analysis and shows how system-level performance is affected by the choice of media access protocol required to support wavelength division multiple access (WDMA), the switching latency of the network interface and the tunable optical devices, and the address space allocation strategy.
Shared memory systems have the advantage of (relative) ease of programming but are hampered in terms of scalability due to interconnection network and memory contention and non-uniform access latency with increased system size [5] . DSM attempts to balance the disadvantages of both distributed memory and shared memory approaches by providing a shared memory view of a distributed memory system. However, providing this view places a heavy traffic load on the interconnection network [2] . Much effort has been placed in reducing the traffic load of a DSM system through memory allocation and configuration. For example, [6] introduced memory allocation schemes designed to reduce network access latency by staging data locally, Minnich and Farber [2] address the problem of network delay by reducing the network transfer time through a small block size, Cheriton [7] examined the coherency problem of multiprocessor systems that support shared memory at the instruction set level, while shared memory is implemented as a virtual memory with the operating system managing page faults in [1] .
The photonic network considered in this paper, based on a WDMA passive star-coupled configuration [8] [9] [10], eases the heavy communication burden required to support a shared memory view and relaxes this significant design constraint. This factor enables various issues such as DSM architecture, memory allocation and interface, process migration and load balancing to be viewed from a new perspective. The network interface is a crucial element of a system that includes a high speed network [11, 12], and the model developed in this paper assumes the low latency interface technique devised in [13].
A major problem hindering the development of photonic based communication networks is the speed mismatch between the electronic and optical components. The low loss region of a single mode optical fiber has a bandwidth of about 30THz [14] , so the optical media is capable of speeds far exceeding the maximum speeds of the electronic interface components. WDM circumvents the speed mismatch problem by partitioning the bandwidth into many, more manageable, high speed channels. Each channel operates at the data-rate limited by the electronic interface components. This achieves a significant improvement in bandwidth utilization, allowing concurrent transmission over the multiple channels. Depending on the architecture, optical self-routing is achievable where a node only receives data destined to it and the system has the non-blocking connectivity characteristics of a crossbar [10] .
The network under consideration operates in a multi-access mode. Multi-hop architectures, such as Shufflenet [15, 16] , are an alternative to multi-access that avoids the complexity of supporting an access protocol but introduce a routing latency at each hop. Other examples may be found in [17] [18] [19] . Virtual point-to-point architectures [20] [21] [22] map a logical topology typically to a star-coupled network via wavelength-, spatial-, and/or time-multiplexing. A formal model of the construction of large WDM-based networks is considered in [23] [24] [25] , [26, 27] devised WDM-based architectures that include space-and time-multiplexing to support scalability and reduce the design constraints (number of channels and device tuning speed), and [28] considered power limitations of typical channel configurations.
Media access protocols developed for photonic star-coupled WDM networks may be broadly classified into reservation and pre-allocation strategies [29] . Reservation techniques may designate a wavelength channel as the control channel that is used to reserve access on the remaining data channels. An access protocol is required for both data and control channels. Pre-allocation techniques pre-assign the channels to the nodes, where each node has a home channel it uses either for all data packet transmissions or all data packet receptions. This eliminates the requirement of a control channel, and that a node possess both a tunable transmitter and a tunable receiver. The protocols considered in this paper have channels pre-allocated for reception so each node requires a tunable transmitter and a fixed receiver.
Two address space allocation schemes proposed to map the global address space to the individual local memories are dynamic memory allocation scheme (DMAS) [30, 31] and fixed memory allocation scheme (FMAS). DMAS allows the address space to migrate based on access frequency so the hit rate of the local portion of the global memory (LGM) increases. This mechanism may be implemented to operate dynamically, or statically (per job) at compile time when the resource requirements of a job must be known. Sophisticated compiler assists have been developed for the data partitioning problem. Data redistribution can occur as the program moves from phase to phase. FMAS statically allocates the address space between the m nodes in an interleaved fashion [32] . DMAS was introduced to reduce network traffic, while the main objective of FMAS is to reduce memory management complexity. An objective of the system considered in this paper is to avoid the complexity of dynamic allocation with the simpler fixed allocation method since network bandwidth is no longer a principal limiting factor. Complexity is an important issue in photonic networks since the overhead of supporting a complex access protocol or allocation scheme may degrade or even eliminate the performance advantage of the optical network. The model developed in this paper is extended in [33] to consider a snooping based cache coherence protocol and a reservation-based WDMA protocol which exploits the broadcast nature of the control channel by piggybacking cache coherence control signals such as requests and invalidations on the media access reservation packets. This paper is organized as follows. Section 2 defines the system in terms of node architecture and photonic network. Section 3 describes the assumptions and operation of the system and develops the analytic performance model. 
Section 4 defines the performance metrics, verifies the model through a comparison to simulation, compares the relative performance with varying system parameters and media access schemes, and investigates the relative performance of FMAS and DMAS.
... 
System Architecture
The system, consisting of the star-coupled network shown in Fig. 1(a) and the node architecture shown in Fig. 1(b) , is a distributed memory system that appears to the application developer as a shared memory system. The number of nodes and channels are denoted by m and C, respectively. The following section describes the architecture and access protocols.
Node Architecture
The system consists of m identical nodes, where each node possesses a local processor(s), a memory management unit (MMU) and a receiver/transmitter subsystem as shown in Fig. 1(b) . A node has two levels of cache: the (typically on-chip) processor cache (PC), and an extended cache (EC) located in the physical memory of the node. The network interface is assumed to use the low-latency method of injecting responses to memory requests directly in memory as described in [13] . The physical memory of a node is partitioned into EC and LGM.
LGM is mapped into the global physical address space of the system. Each processor owns a portion of the physical global address space, and manages a portion of the global virtual memory depending on the address allocation scheme. The EC is used to cache the global virtual memory and stages for the PC. The virtual address space is mapped onto the physical global address space of the system, supported by the MMU at each node. The system is described from the view-point of a tagged node, so the memory at a (destination, target or external) node is denoted by NLGM (non-local global memory). Although a node is described as if one processor is contained per node, each node should be viewed as a cluster of processors as in [34, 35] . The performance model presented in Section 3 can support clustered nodes by viewing requests to the EC as the aggregate of the individual processor requests within a cluster.
When cache-able data required by a processor is not in its EC, a miss occurs that causes the MMU to issue a shared memory request for a block transfer. If the data is not resident in LGM, the MMU transmits a request to the block owner through the photonic network. The design of the memory organization and address allocation scheme was motivated by the goal of simplifying the MMU's task by restricting the global virtual memory to a single copy of each block and allowing fast identification of block ownership.
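The lookup path described above can be sketched as follows. This is only an illustration: the function name, the `Access` labels, and the `owner_of` callable are hypothetical and do not come from the system described here, and the interleaved owner map in the example is an assumption.

```python
from enum import Enum


class Access(Enum):
    EC_HIT = "extended cache hit"
    LGM_HIT = "local global memory hit"
    NLGM = "remote request over the photonic network"


def classify_access(block, node, ec_blocks, owner_of):
    """Classify one block reference made by `node`.

    ec_blocks: set of block ids currently held in the node's extended cache.
    owner_of:  hypothetical callable mapping a block id to its owning node.
    """
    if block in ec_blocks:
        return Access.EC_HIT
    if owner_of(block) == node:
        # EC miss, but this node owns the block: served from LGM.
        return Access.LGM_HIT
    # EC miss and LGM miss: the MMU sends a request to the block owner.
    return Access.NLGM


# Example with an interleaved owner map over m = 4 nodes (an assumption).
m = 4
owner = lambda block: block % m
print(classify_access(8, 0, {1, 2}, owner))  # block 8 is owned by node 0
```

Because only one copy of a block exists in the global virtual memory, the owner lookup is a pure function of the block id, which is what makes the third branch cheap to decide.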
Access Arbitration Strategies
Many reservation and pre-allocation WDMA protocols have been proposed as described in the survey [29] . The network architectures differ in the number and required tunability of the transmitters and receivers per node. The two protocols examined in this paper are based on channels pre-allocated for reception where each node has a tunable transmitter and a fixed or slow tunable receiver.
A pre-allocation based protocol operates in the following manner. The source node will:
1. Determine the home channel of the destination.
The home channel is determined in a decentralized fashion based on the channel allocation policy [36, 9, 10]. Let $c_j$ denote a channel, $0 \le j < C$; then the home channel of node $m_i$ is $c_{i \bmod C}$ for all $0 \le i < m$. Both protocols considered in this paper use interleaved allocation, so the protocols are denoted as I-x.
2. Switch its transmitter to the appropriate channel (requiring a time L), and
3. Transmit (requiring a time T) according to the particular access protocol.
A node receives and processes all traffic along its home channel. I-SA and I-TDMA both assume fixed sized packets. Time is slotted and transmission is synchronized to packet boundaries.
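The three steps of a pre-allocation protocol can be sketched as below. The function names are hypothetical, and the sketch assumes that the tuning time L is paid only when the transmitter has to change channel; queueing and contention are ignored.

```python
def home_channel(node, num_channels):
    """Interleaved allocation: node i receives on channel i mod C."""
    return node % num_channels


def send_cost(src_tuned_to, dst, num_channels, L, T):
    """Return (channel, time) for one transmission.

    src_tuned_to: channel the source transmitter is currently tuned to.
    Assumption of this sketch: L is charged only on a channel change.
    """
    channel = home_channel(dst, num_channels)       # step 1: find home channel
    tuning = 0 if channel == src_tuned_to else L    # step 2: tune if needed
    return channel, tuning + T                      # step 3: transmit


# With m = 8 nodes and C = 4 channels, nodes 1 and 5 share home channel 1.
print(home_channel(1, 4), home_channel(5, 4))
print(send_cost(src_tuned_to=0, dst=5, num_channels=4, L=4, T=10))
```

When C < m, several nodes share a home channel, which is why an access protocol is still needed on top of the channel mapping.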
Interleaved Slotted Aloha: I-SA uses slotted Aloha to control access. An idle node attempts transmission on the home channel of the destination node in the slot following packet generation. When a collision occurs, the transmitter backs off according to a geometric distribution, where $p_b$ denotes the backoff probability. In the slot following a collision, retransmission occurs with probability $1 - p_b$. If the retransmission is successful, the node transmits the next packet at the head of the transmission queue (after tuning, if necessary). If not, the node re-enters the backoff state.
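A minimal Monte Carlo sketch of this backoff rule on a single home channel is given below. It is not the simulator used in this paper: the function name and parameters are hypothetical, tuning latency and acknowledgments are ignored, and all packets are assumed to be queued at time zero.

```python
import random


def simulate_isa_channel(packets_per_node, p_b, seed=0, max_slots=100_000):
    """Slots needed to drain all queues on one home channel.

    packets_per_node: initial queue length of each contending sender.
    p_b: probability that a backed-off node stays silent in a slot.
    """
    rng = random.Random(seed)
    queue = list(packets_per_node)
    backed_off = [False] * len(queue)
    for slot in range(1, max_slots + 1):
        # Fresh senders always transmit; backed-off senders retry
        # with probability 1 - p_b.
        senders = [
            i for i, q in enumerate(queue)
            if q > 0 and (not backed_off[i] or rng.random() >= p_b)
        ]
        if len(senders) == 1:        # success: exactly one transmitter
            i = senders[0]
            queue[i] -= 1
            backed_off[i] = False
        else:                        # collision (or idle slot)
            for i in senders:
                backed_off[i] = True
        if not any(queue):
            return slot
    raise RuntimeError("queues did not drain within max_slots")


print(simulate_isa_channel([3], p_b=0.5))      # a lone sender never collides
print(simulate_isa_channel([2, 2, 2], p_b=0.5))
```

With `p_b = 0` two colliding senders retransmit in every slot and never separate, so the sketch raises `RuntimeError`; this is the degenerate case that the geometric backoff is meant to avoid.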
The receiver of the source node is tuned to its own home channel so successful packet transmissions cannot be sensed and acknowledgments are necessary. One approach is to use explicit (or piggybacked) acknowledgments. An alternative is slot extension where a packet slot is composed of two phases: the data transmission and acknowledgment (ACK) subslots [9, 26] . A source node transmits a data packet to the destination node during the data transmission subslot; and the destination node transmits an acknowledgment to the source node during the ACK subslot. The ACK subslot is composed of the propagation delays, the processing latency (the time the receiver needs to decode the packet header and tune its transmitter to the home channel of the source node), and transmit time of the ACK. The ACK subslot is collisionless when m = C but must be extended in a time division fashion when m > C. An advantage of slot extension is immediate detection of a collision, but system performance degrades as the propagation delay increases and when the processing latencies are significant relative to the packet transmission time [25, 37] . Slot extension is used here since the propagation delay in an optical backplane for processor interconnection is much less than in a general local area network environment.
Interleaved TDMA: I-TDMA eliminates collisions, and with them the complexity of supporting acknowledgments and retransmissions, by time multiplexing access to the destination nodes. Time is slotted on each channel, and the home channels are pre-allocated for packet reception. Every node in the system has a chance to transmit on each channel per frame. Determining the slot which is assigned to a particular source-destination pair is simple and decentralized and based on the home channel allocation policy defined above for I-SA.
A frame has a length of m slots. A node is assigned a total of C slots per cycle and remains inactive for the remaining $m - C$ slots. However, the channels are fully allocated. A possible slot allocation scheme is to assign node $m_i$ channel $c_0$ at slot $[i + 1] \bmod m$, with subsequent channels allocated one unit apart for the remaining $C - 1$ channels. Alternative allocation schemes have been studied with the objective of overlapping the switching latency to relax the design constraints on the optical devices [38].
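One possible reading of this slot allocation is sketched below. It assumes that "one unit apart" means node i owns channel k at slot (i + 1 + k) mod m; the function names are hypothetical.

```python
def itdma_channel(node, slot, m, C):
    """Channel that `node` may transmit on in `slot` of a frame, or None.

    Assumed reading of the allocation: node i owns channel 0 at slot
    (i + 1) mod m and channel k at slot (i + 1 + k) mod m.
    """
    k = (slot - node - 1) % m
    return k if k < C else None


def frame_schedule(m, C):
    """schedule[slot][channel] -> the single node allowed to transmit."""
    schedule = [[None] * C for _ in range(m)]
    for slot in range(m):
        for node in range(m):
            k = itdma_channel(node, slot, m, C)
            if k is not None:
                assert schedule[slot][k] is None  # one owner per slot/channel
                schedule[slot][k] = node
    return schedule


for slot, owners in enumerate(frame_schedule(m=4, C=2)):
    print(slot, owners)
```

Under this reading every (slot, channel) pair has exactly one owner, so the channels are fully allocated, while each node transmits in C slots per frame and is idle in the remaining m - C.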
It is possible to generalize the slot allocation where the number of slots per frame is not m and a node could be assigned multiple slots per channel per frame as with Universal Time-Slots [39] . One node could act as a scheduling agent and assign slots based on requests from the nodes, providing an added level of adaptability. This approach is very effective in a telecommunications environment where connections between communicating nodes have a duration that is typically orders of magnitude longer than the time required for connection setup. Communication patterns in a distributed shared memory computer system typically do not follow this pattern of long duration connections, particularly with clustered multithreaded processors, which is why fixed slot allocation is used in this paper.
System Model
This section defines the model used to evaluate the system performance. The model, based on a semiMarkov process, is used to determine the impact of the access protocol and memory allocation strategy on system performance. Refer to [40] and [41] for an analysis of a multibus multiprocessor based on a Markov and queueing network model, respectively.
A semi-Markov model was developed in [42] for a multibus system. In particular, it developed the three state cluster which is inherent in the model developed below. A semi-Markov model is an approximation technique, useful because of the significant reduction in state space, and a comparison to simulation is used to ensure that the implied approximations are reasonable.
Model Assumptions
The system is described in terms of normalized time. Time is normalized to the average time between successive EC accesses. The behavior of the system is described in terms of the state of the process executing at the tagged processor. The process is viewed as residing at the processor (EC hit), local memory (LGM hit), the communication network during transfer, or external access (NLGM hit). The process may be blocked due to contention at memory or the network. Network latency is highly dependent on the media access protocol. The system is assumed to possess C multiple access channels created through WDM on the photonic network. System operation is characterized as follows:
behavior of the nodes and channels. Channel traffic will not be uncorrelated in general, due to common transmit queues, but this approximation simplifies the transition probability derivations in Appendix A and is shown to be reasonable in the comparison to simulation in Section 4.2.
A 2 : Each processor submits a global memory request when an EC miss occurs, with an EC hit ratio of $\alpha$.
A 3 : The probability that more than one node requests arrival or release of memory or a channel between time $t$ and time $t + \Delta t$ is $o(\Delta t)$.
A 4 : Global memory allocation schemes.
FMAS:
Block ownership is preassigned, and the blocks do not migrate between local memories based on access frequency. Let $\beta$ denote the conditional probability of accessing LGM given an EC miss.
The result is a uniform memory reference pattern, so $\beta_{FMAS} = 1/m$ due to the definition of a uniformly distributed address space. Let $r_{ij}$ denote the conditional probability that $m_i$ accesses the local memory of node $m_j$ upon a local cache miss given that $m_i$ had an LGM miss. Due to the uniform allocation, $r_{ij} = 1/(m - 1)$ for $i \ne j$, $1 \le i \le m$ and $1 \le j \le m$.
DMAS:
Block ownership is preassigned but the memory blocks are allowed to migrate between local memories based on access frequency. The LGM hit rate would be much higher than with FMAS, so we expect $\beta_{FMAS} \le \beta_{DMAS} < 1.0$. The value of $\beta_{DMAS}$ is unknown and is varied in the performance analysis section. We assume that a desired memory block may be located in any of the other $m - 1$ nodes with equal probability given that an LGM miss occurs, so $r_{ij}$, $i \ne j$, has the same form as in the FMAS case.
A 5 : Packet-switched system. A source node sends a request message (a single packet) to the target node requesting a particular memory block. The target node receives the packet, decodes the request, accesses its local memory to satisfy the global memory request, then sends the requested data to the source node via a single packet along the optical network.
A 6 : Random selection for service (RSS) queue discipline for network and memory access. When more than one request is queued for a resource, the requests are selected for service in random order independent of the time of arrival.
A 7 : The memory access time is M time units. Access to memory is slotted on unit time. Time is slotted for network access based on the block transmission time T (there is an initial synchronization delay for network access to align to the packet slot boundaries).
A 8 : No context switching. A processor is not multi-threaded and waits for the response to an external request.
A 9 : Request and response packets have the same size (this is a channel slot requirement of the access protocols).
A 10 : Propagation delay is not explicitly considered since it can be overlapped with I-TDMA [37] and can be included in L for I-SA.
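The uniform reference pattern assumed for FMAS in A 4 can be checked with a short enumeration. The sketch below is illustrative only: the function names are hypothetical, nodes are numbered from 0 rather than 1, every block is assumed to be referenced with equal probability, and the number of blocks is assumed to be a multiple of m.

```python
from fractions import Fraction


def fmas_owner(block, m):
    """Interleaved (FMAS-style) ownership: block b lives at node b mod m."""
    return block % m


def fmas_probabilities(i, m, num_blocks):
    """Exact LGM probability and remote-target probabilities for node i.

    Returns (lgm_prob, r) where r[j] is the probability that node j owns
    the block, given that the block is not local to node i.
    """
    owners = [fmas_owner(b, m) for b in range(num_blocks)]
    local = owners.count(i)
    lgm_prob = Fraction(local, num_blocks)
    remote = num_blocks - local
    r = {j: Fraction(owners.count(j), remote) for j in range(m) if j != i}
    return lgm_prob, r


lgm_prob, r = fmas_probabilities(i=0, m=4, num_blocks=16)
print(lgm_prob)   # 1/4, i.e. 1/m
print(r)          # each remote node with probability 1/3, i.e. 1/(m - 1)
```

The enumeration reproduces the values 1/m and 1/(m - 1) stated for FMAS; under DMAS the first value would instead be a free parameter.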
The following section derives a performance model of the proposed system based on a semi-Markov model. A number of performance metrics can be derived from the analytical model. In particular, we are interested in processor utilization per node (u P ), memory utilization per node (u M ), total channel utilization (U C ), and average latency per access (t avg ). Table 1 summarizes the terms used in this paper.
The Semi-Markov Model
A semi-Markov model is developed to approximate the behavior of a node. The state diagram of the model is depicted in Fig. 2 . Note that the self-loops in this model could be eliminated by specifying the average sojourn time to be the mean of the appropriate distribution. However, the self-loops have been retained since we feel they aid in clarity and add little additional computational complexity.
State Definitions
In addition to $S_0$, the active state, let $S_{yz}$ denote a state with an average sojourn time of $\tau_{yz}$, where $y \in Y$ and $z \in Z$. The two sets, Y and Z, are defined as follows. Let $Y = \{L, E, O, I\}$, where L, E, O and I denote local memory access, external (remote) memory access, outbound memory request via network, and inbound memory response via network, respectively. Let $Z = \{R, F, A, B\}$, with the elements defined as residual wait (R), full wait (F), access resource (A), and backoff (B, only needed for I-SA network access).
As shown in Fig. 2 , states often appear as triples: the access state, the full wait state, and the residual state [42] . The residual wait state represents a synchronization delay: the delay until the beginning of the next access frame of the resource. The full wait state represents queueing (with a RSS service discipline) for the resource, once the initial residual time has been met, until access is obtained. Note that these states have self-loops which could be eliminated as noted above since this is a semi-Markov model.
The traffic waiting to be transmitted at a node can be of two types: requests and responses. A request packet is generated by the tagged node upon an EC and LGM miss. A response packet contains the data in response to a request packet. The state definitions are:
S 0 : Processor active (in cache). The process enters S 0 and remains there for a duration of time equivalent to the time between extended cache accesses.
S LR : Local residual wait for memory. The process enters S LR when the tagged request is blocked due to an already busy local memory. The request waits for the residual memory service time before re-attempting access to the local memory. The tagged request contends for access with (non-local) requests that are either in residual or full wait when the residual memory service time is completed. The tagged request moves to S LA if the request succeeds in accessing the local memory, or S LF (full wait) if it loses arbitration with a non-local request.
S LF : Local full wait for memory. The process enters S LF when LGM access is blocked at the beginning of a memory service cycle. This occurs when a newly released memory is requested simultaneously by the tagged node and at least one external request that was already queued waiting for access, and the tagged node loses the arbitration based on A 6 . In this case, the request waits for the entire memory access duration before access can be re-attempted ($\tau_{LF} = M$).
S LA : Local memory access. The process enters S LA if the local memory is idle (or arbitration is won) and the data required by an EC miss is contained in local memory (LGM hit). The process remains in S LA for a duration equivalent to the service time of memory. From S LA the process returns to S 0 and resumes its active state.
The tagged request may lose network arbitration with a response packet generated by the tagged node (in response to a received request packet) that was either in residual or full waiting for network access. If blocked, the request waits for the full packet transmission (network slot) time before re-attempting access. This occurs when the request from the tagged node loses arbitration (and is not in backoff with I-SA), or when in backoff (see State S OB and S IB below) for I-SA. The process re-attempts access at the end of the full network waiting state, and remains in S OF if unsuccessful or moves to S OA if it wins arbitration.
The average sojourn times and definitions of states S Oz and S Iz , $z \in \{R, F, B, A\}$, vary depending on the access protocol. Specifics are considered later in this section.
S OB : Outbound backoff channel access. Required only with I-SA, due to a collision while transmitting a request packet from the tagged node in S OA . The process remains at this state after the sojourn time (equal to the channel time slot) has completed with probability $1 - p_b$, otherwise the process moves to S OA for another attempt at transmission.
S OA : Outbound network access (transmission).
The request is transmitted to the target node according to the media access protocol. With I-SA, the process enters the backoff state to await retransmission if a collision occurs during transmission. With a successful transmission, the process enters S EA if the target memory is idle or S ER if the target memory is currently busy. Fig. 2 shows a transition from S OA to S OB that is used to include the possibility of packet collisions that require retransmission and is only required with I-SA (h = 0 with I-TDMA).

The above equations form a set of simultaneous nonlinear equations that must be solved. The nonlinearity is introduced because the transition probabilities are defined as functions of limiting probabilities, while the limiting probabilities are defined as functions of transition probabilities. The derivation of the transition probabilities may be found in Appendix A. The limiting probability of being in state S yz , denoted as $P_{yz}$, can be expressed as:

$$P_{yz} = \frac{v_{yz}\,\tau_{yz}}{\sum_{i,j} v_{ij}\,\tau_{ij}} \qquad (15)$$

for some $y \in Y$ and $z \in Z$, where $v_{yz}$ is the limiting probability of being in state S yz of the embedded Markov chain and the sum runs over all states of the model. A simple iterative algorithm, such as can be found in [43], can be used to solve for the limiting probabilities.
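The relation between the embedded chain and the limiting probabilities can be illustrated with the sketch below. It is not the algorithm of [43]: it assumes a fixed transition matrix, whereas in the model above the transition probabilities themselves depend on the limiting probabilities, so a routine like this would only be the inner step of an outer fixed-point loop. The function names and the three-state example are hypothetical.

```python
def embedded_stationary(P, iterations=10_000, tol=1e-12):
    """Stationary vector v of the embedded chain (v = v P).

    Uses a damped power iteration, v <- (v + v P) / 2, which has the same
    fixed point as v = v P and also converges for periodic chains.
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iterations):
        step = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        new = [0.5 * (a + b) for a, b in zip(v, step)]
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def limiting_probabilities(P, tau):
    """Time-average state probabilities: v_i * tau_i / sum_j v_j * tau_j."""
    v = embedded_stationary(P)
    weights = [vi * ti for vi, ti in zip(v, tau)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 3-state example: an active state with a self-loop,
# followed by two service states with sojourn time 10.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
tau = [1.0, 10.0, 10.0]
print(limiting_probabilities(P, tau))
```

In the example the embedded chain visits the active state half of the time, but the long sojourn times of the other two states reduce its time-average probability to 1/11.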
The average sojourn times common to both protocols are summarized as follows: $\tau_0 = 1$, $\tau_{yz} = M$ for all $y \in \{L, E\}$ and $z \in \{F, A\}$, $\tau_{LR} = \tau_{ER} = M/2$, and $\tau_{OA} = \tau_{IA} = T$. The memory and channel sojourn times are actually deterministic since the block size and slot length on the network are constant.
For I-SA, $\tau_{yz} = T$ for all $y \in \{O, I\}$ and $z \in \{F, B\}$, and
where L denotes the switching latency of the optical devices. This model is used in the following section to examine the performance impact of variations in system parameters.
Performance Analysis
The model developed in the previous sections is now used to analyze the behavior of the system. The analytic model is validated through a comparison with simulation. The performance of the system is analyzed in terms of the following metrics: average time per access, processor utilization, memory utilization and total channel utilization. The impact of two address allocation schemes and two optical media access protocols is investigated.
Performance Metrics
This section defines the performance metrics, which can be derived from the model as follows. The memory utilization per node is defined as

$$u_M = P_{LA} + P_{EA} \qquad (18)$$

and the total memory utilization is $U_M = m\,u_M$. The total channel utilization is

$$U_C = m(1 - h)(P_{OA} + P_{IA}) \qquad (19)$$
The processor utilization $u_P$ is defined as $u_P = P_0$, but can also be approximated as

$$u_P = \frac{t_0}{t_0 + T_{acc}} \qquad (20)$$

where $t_0$ is the average duration of the processor executing out of EC before an EC miss occurs, and $T_{acc}$ is the average time required to process an EC miss (the time from when the process leaves state $S_0$ until it returns). Due to the assumptions described in Section 3.1, $t_0$ can be approximated since the distribution of the interarrival time of an EC miss is geometric with parameter $\alpha$. This implies that $t_0 = \tau_0 / (1 - \alpha)$.

Let $t_{avg}$ denote the average transaction time per access, the weighted sum of an EC hit and miss.
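The utilization metrics can be collected into a small helper, sketched below under stated assumptions. The function name and the dictionary keys are hypothetical, `alpha` stands for the EC hit ratio, and `h` is taken to be the collision probability on transmission (0 for I-TDMA). The value of T_acc is obtained by equating the two expressions for processor utilization, P_0 and t_0 / (t_0 + T_acc), which is one way to read them rather than a formula quoted from this paper.

```python
def performance_metrics(P, m, h, alpha, tau0=1.0):
    """Utilization metrics from limiting probabilities.

    P: dict of limiting probabilities with keys "0", "LA", "EA", "OA", "IA".
    m: number of nodes.  h: assumed collision probability (0 for I-TDMA).
    alpha: EC hit ratio.  tau0: sojourn time of the active state.
    """
    u_M = P["LA"] + P["EA"]                    # memory utilization per node
    U_M = m * u_M                              # total memory utilization
    U_C = m * (1 - h) * (P["OA"] + P["IA"])    # total channel utilization
    u_P = P["0"]                               # processor utilization
    t0 = tau0 / (1 - alpha)                    # mean run time between EC misses
    # Solve u_P = t0 / (t0 + T_acc) for T_acc (assumption: both forms agree).
    T_acc = t0 * (1 - u_P) / u_P
    return {"u_M": u_M, "U_M": U_M, "U_C": U_C, "u_P": u_P,
            "t0": t0, "T_acc": T_acc}


example = {"0": 0.5, "LA": 0.1, "EA": 0.2, "OA": 0.1, "IA": 0.1}
print(performance_metrics(example, m=4, h=0.0, alpha=0.75))
```

The example probabilities are made up for illustration; in practice they would come from solving the semi-Markov model for a given configuration.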
Validation of the model
This section compares the values predicted by the analytic model to simulation results based on the assumptions of Section 3.1. The simulator is based on a stochastic self-driven discrete event model, written in the C programming language with a C-based library that provides discrete-event and random variate facilities [45]. Steady state limiting probabilities and transaction times were measured. Simulation convergence was obtained through replication/deletion, with 95% confidence of less than 2% variation from the mean.
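The replication/deletion stopping rule can be sketched as follows. This is not the simulator's code: the function name is hypothetical, and a normal quantile is used in place of the Student t quantile, which is an approximation that is only reasonable for a moderately large number of replications.

```python
from statistics import NormalDist, mean, stdev


def replication_deletion(replications, warmup, confidence=0.95):
    """Point estimate and relative half-width of its confidence interval.

    replications: list of independent runs, each a list of observations.
    warmup: number of initial (transient) observations deleted per run.
    """
    run_means = [mean(run[warmup:]) for run in replications]
    centre = mean(run_means)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # normal approximation
    half_width = z * stdev(run_means) / len(run_means) ** 0.5
    return centre, half_width / abs(centre)


runs = [[0.0, 9.9, 10.1], [3.0, 10.2, 10.0], [7.0, 9.8, 10.0]]
estimate, rel = replication_deletion(runs, warmup=1)
print(estimate, rel)
print("converged" if rel < 0.02 else "add more replications")
```

Replications would be added until the relative half-width falls below the 2% target.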
Figures 3-4 compare the analytical model and simulation for both I-SA and I-TDMA. The figures graph processor utilization and average transaction time per access for varying system sizes and switching latencies. All graphs plot variations in the performance metrics with increased $\alpha$. They consider a constant channels-to-nodes ratio of $C = m/2$. Lines and points represent the analytic and simulation results, respectively. The simulation with I-SA also included stability support [46]. The maximum deviation between the analytic and simulation models for I-SA is shown at $\alpha = 0.75$, $m = 32$ and $L = 8$ in Fig. 3 to be less than 2% and 5% for $u_P$ and $t_{avg}$, respectively. The maximum deviation between the analytic and simulation models for I-TDMA is shown at $\alpha = 0.75$, $m = 32$ and $L = 8$ in Fig. 4 to be less than 6% and 8% for $u_P$ and $t_{avg}$, respectively. Both graphs converge to a deviation of less than 0.1% as $\alpha \to 1.0$, which represents the case when an EC miss is extremely unlikely. The results show a high degree of correlation between the simulation and analytic models. The model developed in this paper has been extended in [33] to include the interaction of the cache coherence protocol, and was shown to retain a high degree of accuracy through a comparison to trace-based simulation.
[Figure: processor utilization and average access time versus α for FMAS and for DMAS (0.7, 0.8, 0.9), with C = m/4 and C = m.]
DMAS & FMAS
This section examines system performance with two different address allocation schemes. All graphs in this section assume a negligible switching latency of L = 0 and use I-SA as the media access protocol. The performance impact of the two allocation schemes is reflected in the selection of the value of $\beta$, which denotes the probability that the requested block resides in LGM given that an EC miss has occurred.
This section considers two cases: the system performance with DMAS and FMAS is compared in a highspeed environment, and then the channel slot time is varied with the FMAS case. The slot time is varied to determine how much faster the network needs to be before FMAS can attain the level of performance of DMAS and see when the complexity of DMAS is justified in a photonic domain.
DMAS migrates ownership of the blocks between local memories based on reference frequency, and FMAS retains static allocation regardless of the access frequency. FMAS statically allocates memory space between nodes in an interleaved fashion, so a requested block could be owned by any of the nodes with equal probability. The DMAS scheme migrates ownership, so the LGM hit rate is expected to be larger than the LGM hit rate with FMAS. To examine the relative performance of the two approaches, the following section considers $\beta \in \{0.7, 0.8, 0.9\}$ for DMAS and $\beta = 1/m$ for FMAS. The two address allocation schemes are compared through the processor utilization $u_P$ and the average transaction time per access $t_{avg}$. For $C = m/4$, there is a 69% increase in $u_P$ at $\alpha = 0.75$ when $\beta$ is increased from $1/m$ to 0.70. The improvement in processor utilization increases by 118% with $C = m$ channels. In terms of the average transaction time per access, the delay is reduced by 44% from FMAS to DMAS with $\beta = 0.70$ and $C = m/4$.
The graphs show that the increased traffic load of FMAS can be compensated by the increased bandwidth when the number of wavelength channels is increased, and that the attractiveness of FMAS in the photonic domain is increased when the memory management complexity is considered.

Fig. 6 varies the channel slot time to determine how much faster a network with FMAS needs to be to attain the level of performance achieved with DMAS. A system of 64 nodes is considered. The DMAS case assumes $\beta = 0.70$, $C = m/4$, $T = 10$ and $M = 10$. FMAS assumes $\beta = 1/m$, $C \in \{m/4, m/2, m\}$, $M = 10$, and $T \in \{1, 5, 10\}$. This set of graphs shows how the additional channels and higher media speed enable the lower complexity FMAS approach to achieve superior performance over a DMAS approach with only a modest increase in media transfer rates. DMAS was proposed primarily for the electronic domain and, in practice, the speed of a photonic network may be orders of magnitude larger than that of an electronic network.

The three graphs of Fig. 7(a),(c),(e) illustrate the per node memory utilization degradation as L is increased. For example, a 14% decrease in memory utilization for I-SA occurs when L is increased from 0 to 4 at $\alpha = 0.75$ and $C = m$. A further reduction by 12% occurs as L is increased to 8. I-TDMA, which suffers a degradation of 11% and 11% for the same cases, shows a slight improvement in sensitivity over I-SA since I-TDMA overlaps the switching latency with the synchronization time of the protocol. The graphs show that the differences in memory utilization with varying channels decrease as the switching latency increases. When $C < m$, I-SA can reduce its throughput degradation due to the switching latency by overlapping the initial tuning and transmitting times between nodes. A channel is held for only one slot (of time duration T) rather than a duration of $L + T$ due to the initial tuning of the optical device.
If slot extension is used, the tuning latency and the propagation delay in sending the acknowledgment are not overlapped, and the channel slot duration will be $L + T$, or $2L + T$ if the initial tuning is not performed. If slot extension is not used, the tuning time adds to the access delay but does not tie up any additional slots that could be used by other nodes, so the impact on throughput decreases as the ratio $m/C$ increases. When $C \ll m$, the throughput degradation can be significantly reduced. For example, $u_M$ increases by 43% and 19% for I-SA at = 0.75 and $L = 0$ when $C$ is increased from $m/4$ to $m/2$ and from $m/2$ to $m$, respectively. The corresponding increases for I-SA are 28% and 18% at $L = 4$, and 36% and 14% at $L = 8$. With I-TDMA, the corresponding memory utilization increases are 7% and 0% for $L = 0$, 5% and 0% for $L = 4$, and 5% and 0% for $L = 8$. This shows that although I-TDMA can reduce the performance degradation caused by the switching latency by overlapping it with the synchronization delay, it is not able to take full advantage of an increase in channels due to inherent head-of-line effects. A generalized approach, based on I-TDMA but with $C$ buffers at each node to pre-sort the outgoing traffic, has been examined [44]; it eliminates the head-of-line effect, so its performance will scale with an increase in channels.

I-SA achieves a higher channel utilization than I-TDMA as the ratio of channels to nodes increases, whereas I-TDMA has a flat performance that is insensitive to variation in the number of channels. For example, a 44% and a 19% increase in $U_C$ occur with I-SA at = 0.75 and $L = 0$ when $C$ is increased from $m/4$ to $m/2$ and from $m/2$ to $m$. A 6% increase in $U_C$ occurs with I-TDMA when $C$ is increased from $m/2$ to $m$, and a negligible increase in channel utilization occurs when $C$ is increased from $m/4$ to $m/2$. Fig. 8 demonstrates the limitations of I-TDMA as the system size increases.
The channel utilization of I-SA grows by 300% between $m = 8$ and $m = 32$ at $L = 0$, which illustrates a scalable shared memory system in which memory utilization is maintained as the system size is expanded. This holds for the case of an optical backplane for multiprocessor systems. In a LAN environment, the propagation delay would impact performance and severely limit scalability.
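Returning to the slot-duration cases described earlier in this section for I-SA, the rules can be summarized in a minimal sketch. The function name and the two boolean flags are illustrative assumptions, not names from the model. The branch without slot extension reflects one reading of the text, namely that the tuning time then adds to the access delay of the node rather than to the time the channel is held.

```python
def channel_hold_time(T: float, L: float, slot_extension: bool,
                      initial_tuning: bool = True) -> float:
    """Time a channel is tied up by one transmission, in the same units as T and L.

    T: channel slot time, L: switching (tuning) latency.
    Hypothetical helper; the flags are assumptions made for illustration.
    """
    if T <= 0 or L < 0:
        raise ValueError("T must be positive and L non-negative")
    if not slot_extension:
        # Tuning delays the node itself but occupies no extra slot time.
        return T
    # With slot extension the latency is not overlapped with other activity.
    return L + T if initial_tuning else 2 * L + T
```

With $T = 10$ and $L = 4$, the sketch gives a holding time of 10 without slot extension, 14 with slot extension and initial tuning, and 18 with slot extension but no initial tuning.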
Channel and Memory Utilization
Impact of access times on system performance
This section considers the impact of varying memory and channel access times on system performance for both I-SA and I-TDMA. To study the interdependence of the parameters, $M$ and $T$ are varied as follows. Fig. 9 considers memory access times of $M \in \{5, 10, 20\}$ with a constant channel access time of $T = 4$, and Fig. 10 considers channel access times of $T \in \{2, 4, 8\}$ with a constant memory access time of $M = 10$. All graphs assume a negligible switching latency of $L = 0$ and vary the system size with a constant channel ratio ($m \in \{8, 16, 32\}$ and $C = m/2$). Parts (a), (c), (e) of each figure plot processor utilization ($u_P$), while parts (b), (d), (f) plot the average access time ($t_{avg}$) with varying .

Fig. 9 illustrates the reduction in performance that occurs with I-TDMA when the system size is expanded. The performance degradation is due to the extended frame length and head-of-line effects. I-SA is shown to be fairly insensitive to increases in system size provided the channels-to-nodes ratio is maintained. Consider the percent increase in delay as the number of nodes in a system is increased from $m = 8$ to $m = 16$ to $m = 32$ at $C = m/2$. The percent increases for I-SA are 7% and 5% for $M = 5$, 8% and 4% for $M = 10$, and 7% and 3% for $M = 20$. The percent increases for I-TDMA are 67% and 79% for $M = 5$, 57% and 71% for $M = 10$, and 44% and 59% for $M = 20$. Fig. 9 shows that the delay of I-TDMA is 12% greater than that of I-SA at $m = 8$ and = 0.75, but the difference increases to 200% at $m = 32$. This shows that I-SA has greater potential to support a large scalable shared-memory system, while I-TDMA is not suitable in its current form: it needs the generalized buffer scheme of [44] to avoid the head-of-line effects, and even that scheme does not alleviate the sensitivity to frame length that is inherent to time-multiplexed approaches.
The set of graphs of Fig. 10 illustrates the sensitivity of I-TDMA to channel slot time, since the slot time directly extends the frame length. For example, Fig. 10(f) shows that the delay of I-TDMA increases by 220% at = 0.75, $m = 32$ and $C = m/2$ when $T$ is increased from 2 to 8. This is reduced to a 130% increase in delay at $m = 8$. The performance of I-TDMA is shown to degrade with variations in parameters that cause an increase in frame length, such as system size and slot size (channel access time). When both parameters increase, both the average access time and the processor utilization degrade severely. This figure illustrates that the delay of I-SA scales with increases in slot length and remains essentially constant with increases in system size, provided the ratio of channels to nodes is maintained.

Fig. 11 examines the situation when the number of nodes becomes large. Fig. 11(a) compares I-SA and I-TDMA with a varying system size of $m \in \{64, 128, 256, 512, 1024\}$ and a constant ratio of channels $C = m/2$. This graph illustrates the scalability characteristics of I-SA, where virtually the same level of processor utilization is achieved for all values of system size, even with non-negligible switching latency as shown in Fig. 11(b). However, this requires a constant ratio of nodes to channels, which becomes very difficult to sustain as the system size increases.
Conclusion
This paper considered a passive optical wavelength-division multiplexed star-coupled network to support interprocessor communication in a distributed shared memory environment. This approach combines recent advances in optical fiber communication with wavelength tunable transmitters/receivers to create multiple multiaccess channels on a single optical fiber through wavelength division multiple access. The result is a self-routing optical multiple-access channel that significantly reduces intermediary latencies. A semi-Markov model was developed to reflect the impact of low level optical media access issues on high level distributed shared memory system performance, enabling the model to predict the performance impact of different WDM access protocols and tuning latencies of the optical devices in the underlying photonic network. Two address space allocation schemes for distributed shared memory were compared. The performance was analyzed in terms of varying system parameters such as the number of nodes and WDM channels. The model was validated through a comparison to simulation. Two low complexity multiple access protocols were considered. The comparison showed that I-SA has better performance than I-TDMA when the nodes-to-channels ratio is low. I-TDMA was shown to be most suitable for an environment with a small system size and a high nodes-to-channels ratio. The performance analysis shows the system to be scalable with I-SA, with the individual performance levels remaining constant when the system size is expanded, provided the ratio of channels to nodes can be maintained.
A Transition Probability Derivations
The derivation of the transition probabilities of Fig. 2 is based on the behavior of one specific node, referred to as the tagged node. The terms external, destination or target are used to refer to any node in the system other than the tagged node. Assumption A3 states that the probability of more than one request arriving at the same time instant, and the probability of multiple requests arriving at the instant of a release of memory or channel, are negligible. This implies that the transition probabilities p 0 . An extended cache hit was previously defined as .

Let $b$ denote the probability of losing arbitration for the local memory when other requests are currently waiting. The probability that one of the other $(m - 1)$ external nodes has been blocked making an external access to the LGM at the tagged node is $\frac{1}{m-1}\left[P_{EF} + P_{ER}\right]$, since an external request is equally likely to target any other node. Considering all $(m - 1)$ external nodes, $P_{EF} + P_{ER}$ is the average number of requests from external nodes waiting to access local memory. The probability of a tagged node losing arbitration for its LGM is:

\[ b = \frac{P_{EF} + P_{ER}}{1 + P_{EF} + P_{ER}} \tag{22} \]

The probability that the local memory is already busy when the tagged processor tries to access its LGM is $P_{EA}$, since the probability that a request from any given one of the external nodes is being processed at the tagged node is $\frac{P_{EA}}{m-1}$ and $(m - 1)$ nodes could be accessing the local memory.
Let $e_f$ denote the probability that the target NLGM is already busy when a request from the tagged processor arrives at the target node:

\[ e_f = P_{LA} + \frac{m-2}{m-1} P_{EA} \tag{23} \]

where $P_{LA}$ is the probability that the target processor is accessing its local memory and $\frac{P_{EA}}{m-1}$ is the probability that any given one of the other $(m - 2)$ nodes is currently accessing the target memory, so $(m - 2)\frac{P_{EA}}{m-1}$ is the probability that the target memory is being accessed by one of the $(m - 2)$ nodes other than the tagged and target nodes.
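Equations (22) and (23) can be evaluated directly once the state probabilities are known. The following is a minimal sketch; the function and argument names are hypothetical, and the input probabilities are assumed to come from a solution of the semi-Markov model, which is not reproduced here.

```python
def arbitration_loss_probability(p_ef: float, p_er: float) -> float:
    """Equation (22): b = (P_EF + P_ER) / (1 + P_EF + P_ER)."""
    waiting = p_ef + p_er  # average number of external requests waiting
    return waiting / (1.0 + waiting)


def target_busy_probability(p_la: float, p_ea: float, m: int) -> float:
    """Equation (23): e_f = P_LA + (m - 2) / (m - 1) * P_EA."""
    if m < 2:
        raise ValueError("at least two nodes are needed for a remote access")
    return p_la + (m - 2) / (m - 1) * p_ea
```

Note that for $m = 2$ the second term of (23) vanishes, because no node other than the tagged and target nodes exists to compete for the target memory.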
Let $d_f$ denote the probability that an external request generated by the tagged node, in full waiting at the NLGM, loses arbitration to some other request.

The term $e_b$ denotes the probability that the local memory is currently busy processing a request generated by one of the $m - 1$ external nodes when the response to an external memory access request generated by the tagged node is returned to the tagged node. Since $\frac{P_{EA}}{m-1}$ is the probability that any given one of the $m - 1$ other nodes is currently accessing the memory of the tagged node,

\[ e_b = P_{EA} \tag{25} \]

Specific to I-SA: Now the transition probabilities specific to the case when I-SA is used as the access protocol are considered.
Let $f_f$ denote the probability that the tagged request loses access to the channel (outbound network) from either the residual or the full wait state. Access can only be gained from the full or residual wait states if the node is not in backoff.
Consider the event that the tagged request is blocked because a response packet directed to one of the other $m - 1$ nodes is in backoff at the tagged node. The probability of this event is $P_{IB}$, since $\frac{P_{IB}}{m-1}$ is the probability that any given one of the $m - 1$ other nodes is currently in backoff at the tagged node due to a response packet. $P_{IB}$ is the probability that the response packet directed to the tagged node is currently in backoff at an external node. Let be the event that the tagged request wins the arbitration given that the node is not in backoff:

P[ | ] = (27)

The probability of collision ($h$) when a packet is transmitted (outbound or inbound) is determined next. On average, $m/C$ nodes share the same channel for packet reception. Successful transmission occurs when the same channel is not targeted by more than one node during the channel time slot. Two cases must be considered,
If denotes the event that the tagged node has a successful transmission,

$\frac{m}{C} - 1$ (29)

where $\frac{(m - C)(P_{OA} + P_{IA})}{C(m - 1)}$ is the probability that one of the $\frac{m}{C} - 1$ nodes that share a home channel with the source is targeting a node that shares the same home channel, and

is the probability that one of the remaining $m - \frac{m}{C}$ nodes is targeting a node that shares the same home channel as the source.
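As a consistency check on the first of these two terms, the fraction can be recovered by simple counting. This is only a sketch under two assumptions implied by the text: a transmitting node targets each of the other $m - 1$ nodes with equal probability, and exactly $m/C$ nodes share each home channel. The sum $P_{OA} + P_{IA}$ is read here as the probability that a node is transmitting. A node that shares the source's home channel can then hit that channel by targeting any of the $\frac{m}{C} - 1$ nodes on it other than itself:

\[
(P_{OA} + P_{IA}) \cdot \frac{\frac{m}{C} - 1}{m - 1} = \frac{(m - C)(P_{OA} + P_{IA})}{C(m - 1)}
\]

For instance, with $m = 16$ and $C = 4$ the targeting fraction is $\frac{3}{15} = 0.2$, and the right-hand form gives $\frac{12}{4 \cdot 15} = 0.2$ as well. By the same counting, a node on a different home channel could target any of the $m/C$ nodes on the source's channel, which would suggest a targeting fraction of $\frac{m}{C(m - 1)}$; that value is an inference from this counting argument and is not stated in the text.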
With similar arguments, m C (30) . (31) .

Let $f_b$ denote the probability that the response packet generated at an external node and targeted to the tagged node is blocked from accessing the network (inbound) from either full or residual waiting. The access is blocked if the node is in backoff, or if the node is not in backoff but arbitration with other contending packets is lost. Consider the event that the tagged node's response packet is blocked due to the external node being in backoff. Backoff could have been caused by either a request packet generated by the processor at the external node or a response packet to one of the $m - 2$ other nodes, so the probability of this event is

\[ P_{OB} + \frac{m-2}{m-1} P_{IB} \tag{32} \]

where $P_{OB}$ is the probability that the target node resides in the backoff state due to a request (outbound), and $\frac{m-2}{m-1} P_{IB}$ is the probability that a response packet to one of the other $(m - 2)$ nodes caused the target node to enter the backoff state (inbound). If denotes the event that the response packet generated at the external node that is targeted to the tagged node wins arbitration given that the external node is not in backoff, then
P[ | ] =

Specific to I-TDMA: Now the transition probabilities specific to the case when I-TDMA is used as the access protocol are considered.
I-TDMA can be modeled with a slight modification of the I-SA model. Since I-TDMA is a collisionless protocol in the network ($h = 0$ and $p_b = 0$), states $S_{OB}$ and $S_{IB}$ are not required. The modified transition probabilities are as follows:
\[ f_f = 1 - \frac{1}{1 + P_{IF} + P_{IR}} \tag{35} \]

and $f_b = 1 -$
