Abstract-Scalable, hierarchical, all-optical WDM networks for processor interconnection in multiprocessor systems have been recently considered. The principal objective of this paper is to introduce an access protocol for this type of network which supports a distributed shared memory (DSM) environment. The objectives of the protocol are reduced average latency per packet, support of broadcast/multicast, collisionless communication, and exploitation of inherent DSM traffic characteristics. The protocol is based on a hybrid approach that combines reservation access and pre-allocated reception channels for a WDM system. The proposed approach trades maximum capacity for reduced communication latency to improve system response. The performance of the protocol is analyzed through semi-Markov analytic and simulation models with varying system parameters such as the number of nodes and channels. The performance of the new protocol is compared to a TDM-based protocol and their relative merits are examined.
I. INTRODUCTION
THE performance of multiprocessor systems is significantly impacted by the interprocessor communication network performance, particularly in support of large shared memory systems. Traditional electronic interconnects have inherent bandwidth and speed limitations which impose an upper limit on system performance. Optical interconnects have characteristics that enable significant advancement beyond the current state-of-the-art in computer architecture. A major obstacle faced in building photonic systems is the speed mismatch between the high-speed optical components and the interface electronics. This is particularly true in a packet switched environment. Wavelength division multiplexing (WDM) is a technique that reduces the impact of the speed mismatch by partitioning the enormous bandwidth into multiple, multi-access, more manageable channels that operate at speeds compatible with the electronic interface [1].
The objective of this paper is the design of a media access protocol for an optically interconnected distributed shared memory (DSM) multiprocessor system. The primary objective is low-latency, low-cost communication that is sufficiently scalable to support thousands of processors. This paper describes an approach that combines the support of the cache coherence protocol, required for DSM, with the access protocol to achieve the low-latency requirement by exploiting the natural traffic characteristics of this environment. A primary design constraint, needed to achieve the low-cost and scalability objectives, is the assumption that there will be only a few WDM channels (far fewer than the number of processors).
A scalable, hierarchical all-optical architecture has been developed to achieve the system objectives of low-latency and low-cost through a combination of spatial reuse of WDM channels [2] . The intention is to provide a reconfigurable structure that supports DSM and capitalize on any reference locality above uniform [3] . Each processor is associated with a portion of the system address space which can be accessed by other processors through message passing mechanisms. However, the communication appears to the user and operating system as a shared memory reference. Cache coherence protocols (snooping or directory based) have to be provided to maintain cache consistency across all the processors [4] . The performance of the DSM organization and the cache coherence protocol is dependent on the media access protocol. Careful attention has to be devoted to the design of the access schemes in an effort to minimize latency and deliver the performance advantage to the application level.
The traffic, generated by the cache coherence protocol and the operating system needed to achieve DSM, has two major forms: control (such as memory block requests, invalidations, cache-level acknowledgments) and memory blocks. Control packets are less than 64 bytes long while the memory blocks could be up to 8 Kbytes [5] . Furthermore, there are multiple control packets for every memory block that needs to be transferred. The protocol described in this paper exploits this characteristic.
WDM access protocols can be classified as reservation based and pre-allocation based. Reservation protocols typically allocate one channel as a control channel which is used to reserve access to the remaining data channels. Pre-allocation protocols allocate all channels for data transmissions and do not require control channels. These protocols typically require only one tunable device (transmitter or receiver) per node. An objective of this approach is to avoid having a control channel since the base assumption is that the system will only have a few WDM channels. Reservation based protocols such as TDMA-C [6] and PROTON [7] can support variable packet sizes, but have higher transceiver cost (multiple tunable devices per node) and significant protocol complexity. A control channel based protocol introduced in [8] requires 2M channels to support communication among M nodes. In addition to the other concerns with reservation based protocols, it links the number of nodes to the number of channels, which impacts scalability. A pre-allocation based protocol such as I-TDMA [9] achieves very high utilization but has high latency under light loads due to cycle synchronization and is inefficient in the support of variable sized packets. Variable packet sizes could be supported through segmentation, or by trading off utilization and setting the slot size equal to the longest packet size. Communication protocols often maximize throughput and sacrifice delay. In our environment, reduced latency is the primary objective.
The protocol introduced in this paper, called FatMAC, is a hybrid approach that combines the advantages of receiver pre-allocation and reservation access strategies. The protocol has the following characteristics: low implementational complexity, collisionless transmission, low-latency at low loads and stability at high loads, support of varying propagation delays through interleaving cycles, and support of variable packet sizes without segmentation. This protocol reserves access on pre-allocated channels through control packets. The control packets supply the access information, such as the size of the associated data packet (if any), and also support broadcast of control information. The proposed protocol is based on tunable transmitters and fixed receivers, and is optimized for a laser-array transmitter. This paper studies the performance of FatMAC using analytic and simulation models under varying system parameters. The performance is also compared to the time division multiplexed protocol studied in [2] , [10] .
The paper is organized as follows. Section II briefly describes the network architecture. Section III provides the protocol descriptions of FatMAC and I-TDMA. Section IV develops a semi-Markov model for FatMAC and Section V evaluates the relative performance of the two protocols. Section VI summarizes the paper.
II. NETWORK ARCHITECTURE
This section describes the network architecture. The single-level topology is described first, and a hierarchical generalization is then presented.
A. Single level architecture
A variety of optical channel topologies can be used to achieve a multiple access environment. Single-level interconnection can be accomplished using an optical passive star topology. Each node may have a tunable transmitter and a fixed receiver. The protocol considered here is based on channels pre-allocated for data reception where each node receives on a specific channel referred to as its home channel. The star configuration has been chosen for its passive nature, high fault tolerance, high node fanout, and complete unity distance connectivity. The following section describes the generalized hierarchical architecture.
B. Generalized hierarchical architecture
The all-optical architecture is a hierarchical structure with spatial wavelength reuse at each hierarchic level [2]. An example of a three level network is shown in Fig. 1. A bus based system is drawn for clarity; an actual implementation would be a star-coupled system. This system is based on a component called the Λ-partitioner, which is a space/wavelength switch. It is used as a wavelength-selective space-division switch to select a subset of wavelengths to switch. More details on this architecture are available in [2].
An $i$-level Λ-partitioner couples a total of $m_i + 1$ input fibers and $m_i + 1$ output fibers: one fiber to/from its parent and $m_i$ fibers to/from its children. The functionality of a Λ-partitioner can be described as a $2 \times 2$ switching element (the lower $m_i$ links are passively coupled to a single fiber), as found in multistage interconnection networks, where the set of wavelengths below a partition point are bar-connected and the wavelengths above the partition point are cross-connected. Functionally, this achieves wavelength re-use by retaining the wavelengths below the partition point to the local cluster, thereby allowing all peer-level clusters to independently use the same set of wavelengths [2]. It employs a wavelength partitioner (denoted as Λ-x) to partition the traffic between different levels of the hierarchy without electronic intervention.
The processor nodes are located at the leaves of the tree and each may actually be multiple processors. Fig. 2 shows the organization of a single processor including the Local Processor Cache (PC), the Extended Cache (EC) and the Local Global Memory (LGM). Only the processors are equipped with transmitters and receivers (a fixed receiver per level and a laser array transmitter). The nodes at all higher levels are Λ-partitioners that provide wavelength/spatial switching. An $r$-level system partitions the wavelength channels into $r$ non-overlapping subsets. The partition points are defined as $X = \{X_0, X_1, X_2, \ldots, X_r\}$ where $X_r = C$, $X_0 = 0$ and $0 < X_i < X_j < C$ if $i < j$ for all $0 < i < r-1$ and $1 < j < r$. Let $C_i = X_i - X_{i-1}$ denote the number of channels allocated for $i$-level communication, $1 \le i \le r$, such that
$$C = \sum_{i=1}^{r} C_i.$$
Let $M_j$ denote the number of $j$-level nodes in this hierarchy, given by the equation
$$M_j = \prod_{k=j+1}^{r} m_k.$$
Therefore an $r$-level architecture provides a total of $\sum_{j=1}^{r} C_j \prod_{k=j+1}^{r} m_k$ separate channels that may be concurrently accessed due to the combination of spatial and wavelength multiplexing.
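Consider a small system of $M = 4 \times 4$ nodes with $C = 4$ wavelengths. If the partition is set so that 3 channels are used for level one, the total number of effective channels increases to a maximum of 13 through wavelength re-use: the 3 level-one channels are reused in each of the 4 level-one clusters, and the remaining channel serves level two.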
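The channel count in this example can be checked with a short calculation. The following is a minimal sketch, assuming the effective channel count is the sum over levels of the channels at that level times the product of the fan-outs above it; the function and parameter names are illustrative and do not come from the paper.

```python
from math import prod


def effective_channels(channels_per_level, fanout_per_level):
    """Count concurrently usable channels in an r-level hierarchy.

    channels_per_level[j-1] holds C_j and fanout_per_level[k-1] holds m_k
    (both names are hypothetical, chosen for this sketch).
    """
    r = len(channels_per_level)
    total = 0
    for j in range(1, r + 1):
        # Level-j wavelengths are reused once per level-j cluster; the number
        # of such clusters is m_{j+1} * ... * m_r (empty product equals 1).
        reuse = prod(fanout_per_level[j:r])
        total += channels_per_level[j - 1] * reuse
    return total


# The 4 x 4 example: 3 channels at level one, 1 channel at level two.
print(effective_channels([3, 1], [4, 4]))
```

With a single level the function reduces to the plain wavelength count, since no reuse is possible.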
III. PROTOCOL DEFINITION
This section describes the hybrid access protocol. It is defined for a single level system, and then generalized to the multi-level architecture defined in Section II. Table I contains definitions used in the protocol description and analysis.
The traffic characteristics of the system play a crucial role in the access protocol design. Define two classes of traffic:
Class A: small amounts of data, such as control information generated by the cache control mechanism and the operating system. Examples are memory block requests, invalidations, cache-level acknowledgments, application-level low-latency messages, operating system control information, bandwidth reconfiguration (for the multi-level network) and other network management packets.
Class B: large amounts of data, such as a memory block. This class of traffic generates a reservation control packet for media access if a reservation based protocol is used. In this case, Class A control information can be piggybacked on a Class B reservation packet.
Memory block length is expressed in terms of control packet length: let $L$ denote the ratio of memory block packet length to control packet length.
Often only a control packet needs to be sent to transmit control signals. Block transfers occur only a fraction of the time. This provides the motivation to develop a protocol that can reduce communication latency and transmit unfragmented memory block packets. Define the Class B fraction as the fraction of packets generated in the system that are Class B; it takes values between 0 and 1. A value of 0 corresponds to no memory block packets and a value of 1 to all packets being memory block packets. The distribution of packet types is determined both by the media access protocol and the cache coherence protocol. Cache coherence protocols for multiprocessor systems can be broadly classified as snoopy and directory based [4]. The write-invalidate and write-update consistency commands generate varying levels of cache control traffic. Snoopy cache protocols use some form of broadcast mechanism (for fast invalidations, etc.), and directory based schemes store information on where copies of blocks reside and usually require explicit invalidations. The value of the Class B fraction depends largely on the amount of data sharing, which is determined by the address references generated. Snoopy schemes generate fewer control packets since all nodes can read the invalidate commands. Directory based schemes generate more node-to-node traffic and are characterized by many more cache control packets.
The following sections describe the time multiplexed protocol I-TDMA [9] and the proposed hybrid protocol FatMAC. Each node requires a tunable transmitter and a fixed receiver subsystem. Each node receives traffic on a pre-determined channel. A source node tunes its transmitter to the home channel of the destination node and transmits according to the access protocol. A node receives and processes all traffic along its home channel. In a multi-level system, each node has one receiver per level tuned to the home channel for that level. A source node can compute the home channel of the destination node in a decentralized fashion through a simple computation based on the channel allocation policy. Let $N_i$ and $C_i$ respectively denote the number of stations and channels within a level $i$ cluster. Node $m_k$ is assigned $j$ as its home channel based on the allocation policy, where $j \in \{0, 1, 2, \ldots, C_i - 1\}$ and $0 \le k \le N_i - 1$. An interleaved channel allocation scheme assigns channel $j$ to station $m_k$ where $j = k \bmod C_i$.
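The interleaved allocation can be illustrated in a few lines. This is a minimal sketch of the stated rule, home channel equal to station index modulo the number of channels in the cluster; the function names are hypothetical.

```python
def home_channel(station_index, channels_in_cluster):
    """Interleaved allocation: station k receives on channel k mod C_i."""
    if channels_in_cluster <= 0:
        raise ValueError("a cluster needs at least one channel")
    return station_index % channels_in_cluster


def channel_map(stations_in_cluster, channels_in_cluster):
    """Home channel of every station in one cluster (illustrative helper)."""
    return {k: home_channel(k, channels_in_cluster)
            for k in range(stations_in_cluster)}


# Six stations sharing three channels: stations 0 and 3 share channel 0, etc.
print(channel_map(6, 3))
```

Because every node can evaluate this rule locally, a source needs no lookup table or negotiation to find the channel of its destination.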
A. Single Level Protocols
This section provides the protocol definition for a single-level network such as a passive star coupled network.
A.1 I-TDMA
This protocol is a multi-channel extension to basic TDMA. The single- and multi-level versions of this protocol have been studied in [9] and [2], respectively.
Time is slotted on each channel where each slot corresponds to a memory-block packet transmission time. In the simplest version of this protocol, every node in the system has a chance to transmit on each channel once per cycle. Scheduling of slot allocation can be performed to support non-uniform access. This would generalize the number of slots per frame, with slots allocated by a resource scheduler based on demand. Scheduling would avoid static bandwidth allocation, providing a more dynamic environment, but creates a connection-style environment where link setup needs to be negotiated. This strategy works well with voice or video, since the connection duration is usually orders of magnitude longer than the setup time, but not when the link is only needed on a transient basis. Fig. 3(a) shows an example of a source node/channel allocation map for a single level system with $M = 6$ and $C = 3$. A node is assigned a total of $C$ slots per cycle and remains idle for the remaining $M - C$ slots.
A node can transmit up to C packets per cycle. Although I-TDMA can achieve the maximum theoretical utilization, its latency characteristics limit scalability. The delay suffered by short packets is high when there are many more Class A packets than Class B packets. There is also reduction in throughput efficiency since slots are not fully utilized when short packets are transmitted. I-TDMA is also unable to support broadcast. Broadcasting is very desirable, and significant system-level performance improvement is possible in support of cache coherence mechanisms such as the snoopy scheme.
The FatMAC protocol described in the next section has been designed to eliminate some of the disadvantages and accommodate broadcasting and variable packet sizes.
A.2 FatMAC
This protocol has the following main goals: (a) reduce the latency, especially under light loads; (b) minimize utilization loss and improve the system throughput when the fraction of Class B packets is small; (c) provide a broadcast mechanism which might be required by the applications or by cache coherence schemes; and (d) support variable sized packets with nonintrusive segmentation.
FatMAC combines reservation access and pre-allocated receivers. The satellite access protocol studied in [11] used a time multiplexed reservation cycle followed by a data cycle, overlapping the long satellite propagation delays through interleaving reservation and data cycles. FatMAC generalizes this two-phase approach to a multi-channel environment, retaining its propagation delay hiding capability. Typical WDM based reservation access protocols allocate one of the channels as a dedicated control channel [12] . This is not attractive in the target environment since the number of channels available is usually small.
FatMAC is a hybrid approach using reservation access but without a reserved control channel.
Transmission in FatMAC is organized into cycles where each cycle consists of a control phase and a data phase. The control phase operates in a broadcast environment. A control packet sent during the Reservation/Control Block (RCB) phase of a cycle may perform up to two functions: transmit a block reservation and/or include a Class A packet waiting to be transmitted through a small multi-purpose payload. A reservation specifies the destination, the data channel and the packet size (if variable sized packets are supported). Access during the control phase is TDM-based mainly to preserve a collisionless environment but other access strategies are possible. FatMAC exploits the laser-array orientation of the transmitter.
A control packet may contain only cache control information, such as an invalidation, and need not reserve a data slot. This allows the average cycle length to be significantly reduced as illustrated in Section V. A fast broadcast facility is achieved through this multipurpose control packet. Multicast/broadcast is achieved by modulating all lasers (allocated to level $i$) of the array with the same information. The operation with a single tunable laser would be similar, except that it would explicitly transmit on each channel in a pipelined fashion and the overall reservation interval would be extended by $C_i - 1$ control slots.
Time is slotted in FatMAC on control packet boundaries. In I-TDMA, time is slotted on memory block boundaries. The transmitter behavior can be described as follows:
1. When a packet is ready for transmission, the node waits for its turn on the reservation phase (cycle synchronization delay). Access during the reservation phase is based on TDM to support collisionless transmission. Access to the reservation phase could be based on other strategies, such as Slotted Aloha (SA), to further reduce packet latency. With SA, acknowledgments are not needed for the reservation phases but the sensitivity of the system to propagation delay is increased. The performance comparison of SA and TDM for the control cycle has been studied in [13].
2. The control packet is broadcast to all nodes during the reservation phase. This takes advantage of the structure of the transmitter which is an array of distinct wavelength lasers rather than a single tunable laser. The control packet contains a Class A packet and/or a block reservation if a block transfer is scheduled to follow the control packet.
3. The node waits until the end of the current control phase before it attempts to transmit the corresponding memory block, if any.
4. If the node has a data block to transmit, it calculates its offset within the data phase and transmits the memory block on the channel it specified in its control packet. Variable size blocks could be supported, but it is assumed in the analysis to follow that all nodes in a system use the same block size. Furthermore, it simplifies the network interface design if the block size remains constant.
Nodes that reserved access on the same channel transmit memory blocks in the order in which they transmitted their control packets. The service mechanism in this case is first-come first-serve though a general scheduling algorithm may be used. A reversing service during the control phase could improve fairness. The length of each data cycle is the maximum data cycle length of the individual channels within the channel set (the channels allocated to that level). Each node receives all control packets and knows the start of the next cycle. In the event that a node misses a control packet (through buffer overflow or corrupted transmission), start of a new cycle is detected by the synchronization sequence.
Note that the average cycle length of FatMAC is much shorter compared to I-TDMA especially under light loads. This is due to the fact that FatMAC is slotted on control packet boundaries whereas I-TDMA is slotted on memory block boundaries. The cycle length in I-TDMA is ML slots independent of traffic characteristics. The minimum cycle length in FatMAC is M slots where all packets in the cycle are control packets. The data cycle length depends on the number of memory block packets transmitted per cycle.
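The service order and the cycle lengths described above can be sketched as follows. This is an illustrative model only: it assumes fixed-size blocks of L control slots, at most one block reservation per node per cycle, first-come first-serve order per channel, and a data phase that lasts as long as the busiest channel. All function names are hypothetical.

```python
def data_phase_schedule(reservations, block_len):
    """Return (offsets, data_phase_len) in control-slot units.

    reservations: list of (node, channel) pairs in the order in which the
    control packets were sent; each node appears at most once (assumption).
    """
    next_free = {}  # channel -> first free slot within the data phase
    offsets = {}    # node -> slot at which its memory block starts
    for node, channel in reservations:
        start = next_free.get(channel, 0)
        offsets[node] = start
        next_free[channel] = start + block_len
    # The data phase is as long as the most heavily reserved channel.
    return offsets, max(next_free.values(), default=0)


def fatmac_cycle_length(num_nodes, reservations, block_len):
    """Control phase of M slots plus the longest per-channel data phase."""
    _, data_len = data_phase_schedule(reservations, block_len)
    return num_nodes + data_len


def itdma_cycle_length(num_nodes, block_len):
    """I-TDMA cycle: M slots, each one memory block (L control slots) long."""
    return num_nodes * block_len


# Six nodes, L = 4: nodes 0 and 3 both reserve channel 1, node 2 reserves channel 0.
print(data_phase_schedule([(0, 1), (2, 0), (3, 1)], 4))
print(fatmac_cycle_length(6, [(0, 1), (2, 0), (3, 1)], 4), itdma_cycle_length(6, 4))
```

With no block reservations the FatMAC cycle collapses to the M control slots, while the I-TDMA cycle stays at ML slots regardless of traffic.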
I-TDMA may instead be slotted on control packet boundaries. This reduces the cycle length to $M$ slots and improves delay performance for control packets. Memory block packets would then be segmented across different cycles, increasing their delay to about $LM$ slots. For the rest of this paper, we study I-TDMA slotted on memory block packet boundaries. FatMAC trades channel utilization for reduced latency. There are advantages with this strategy: (a) collision-less transfer, (b) a significant reduction in latency under light to moderate loads when compared with I-TDMA, and (c) improved channel utilization when the fraction of Class B packets is small, since most I-TDMA slots will then be only partially used. However, FatMAC does not match the maximum utilization achieved by I-TDMA when the fraction of Class B packets is high.
Both protocols are collision free and do not require acknowledgment support under normal circumstances. However, there is always the possibility of corrupted or dropped packets, or loss of slot synchronization.
B. Multi-Level Protocols
This section describes the multi-level extensions of I-TDMA and FatMAC.
B.1 I-TDMA
Fig. 4(a) shows the extension of I-TDMA to a multi-level system. Source-destination pairs are determined based on their addresses and the current time slot as explained in [2]. Each node computes its assigned channel in each slot using an identical distributed algorithm, ensuring conflict-free and fair access to the nodes. The cycle length in I-TDMA is fixed and is dependent only on the number of nodes in the system. High utilization can be achieved as with the single-level system at the expense of high latencies under light loads. As discussed earlier, an extension to I-TDMA is to generalize the number of slots per frame where slots are allocated by a resource scheduler based on demand.
B.2 FatMAC
Fig. 4(b) shows the extension of single-level FatMAC to a multi-level system. A cluster is defined as a set of nodes and channels operating at one level of the hierarchy. For example, consider a 2-level $4 \times 4$ system. There are 4 clusters of 4 nodes each at level 1 and one cluster comprising 16 nodes at level 2. Each node is thus associated with one cluster at each level. A cluster is associated with cycles composed of reservation and data phases as shown in the figure. A node is allocated one control slot in each of its clusters' control cycles. At any instant, a node transmits on only one cluster cycle and participates in a cycle on at most one level. The cycle lengths are seen to be different for each level. The figure shows the loss of utilization when channels are only partially utilized within a cluster during the memory block transfer phase of a cycle. Since the node-channel allocation for FatMAC is not fixed as in I-TDMA, a node may be assigned simultaneous reservation slots on different levels. However, the cluster which a node selects at any time is uniquely determined by the current packet chosen. The downside is that the node may miss its reserved control slot.
Each receiver of a node is passive since it receives all packets transmitted on its home channel. It updates cycle length information with the associated cluster if a control packet is received. A data packet is forwarded to the memory subsystem if it is the intended destination. A node is constantly aware of the data cycle status in all its associated clusters. In the event that a node fails to receive a control packet, it can be resynchronized by a synchronization sequence at the start of the next cycle.
The performance of FatMAC and comparison to I-TDMA is investigated in the following section.
IV. ANALYTIC MODEL
The performance of a simplified version of the protocol is analyzed through semi-Markov modeling. A semi-Markov model was developed in [14] for a multibus system. In particular, it developed the three state cluster which is inherent in the model developed below. A semi-Markov model is an approximation technique, useful because of the significant reduction in state space, and a comparison to simulation is used to ensure that the implied approximations are reasonable.
The analytic model enables prediction of channel utilization and average packet latency with varying system characteristics. Table I describes the notation used in the following analysis. The system is described in terms of time normalized to control packet transmission time. The model for the single-level protocol makes use of the following assumptions:
A1 : Independent behavior of the nodes and channels is assumed. Channel traffic will not be uncorrelated in general, due to common transmit queues, but this approximation simplifies the transition probability derivations and is shown to be reasonable in the comparison to simulation in Section IV-D.
A2 : Time is slotted on control packet boundaries and packet generation per slot follows a Poisson process with rate $\lambda$.
A3 : The traffic distribution is largely determined by locality of reference. Typically, reference locality would be exhibited within a cluster (level). Since the analytic model considers only nodes within a single level, the uniform reference model is used for performance evaluation. A packet generated by node $m_i$ is targeted to node $m_j$ with probability $\frac{1}{M-1}$ for $i \ne j$, $0 \le i, j \le M-1$, and with probability 0 for $i = j$.
A4 : A transmitter has a capacity to hold one packet when processing another packet. This is not an inherent assumption of the protocol and the model can be extended to include generalized transmitter buffer capacities as in [9] .
A5 : All memory block packets are of fixed length $L$ times the length of a control packet.
The performance of the protocol is modeled by studying the behavior of the transmitter, which changes states most often. The receiver is fixed at its home channel and does not actively affect protocol behavior. The following section describes the possible states of the transmitter and the transition probabilities between states. Fig. 5 shows the semi-Markov process depicting the behavior of the transmitter. To summarize the steps involved in the transmission of a packet: a node generates a packet and waits for its turn in the next reservation cycle. There is a memory block transmission in the corresponding data cycle only if it is a Class B packet. The states of the transmitter depict the behavior of the node as it sequences through these steps. The states represent one of the two cycles of the system: control or data. The state descriptions are as follows:
S1, S2, S3: These states represent the transmitter during the control cycle. The transmitter has zero, one and two packets queued on entering states S1, S2 and S3 respectively.
S4, S5, S6: These states represent the transmitter being idle during the current data cycle. The transmitter has zero, one and two packets queued on entering states S4, S5 and S6 respectively.
S7, S8: These states represent block transmission during the data cycle. The transmitter has sent a memory block reservation in the preceding control cycle. The transmitter has zero and one packet queued on entering states S7 and S8 respectively.
S9, S10: These states are entered after completion of packet transmission in the data phase. This is the residual wait in the data cycle until the start of the next control cycle. The transmitter has zero and one packet queued on entering states S9 and S10 respectively.
The probability of a transition from state Si to state Sj is denoted as p[i, j].
The following derives the transition probabilities between the states in the process. The transitions between states not mentioned below do not occur. Fig. 5 does not label all state transitions for the sake of clarity.
A. State Definitions
Let $\tau_i$ denote the average sojourn time of Si. A packet can be generated in any control slot of a cycle. Packet generation is assumed to be a Poisson process with a rate of $\lambda$ packets per slot per node. The probability that at least one packet is generated in a control slot is denoted as $\sigma$, where $\sigma = 1 - e^{-\lambda}$. Let $\alpha_i = e^{-\lambda \tau_i}$ represent the probability that no packet is generated during $\tau_i$; $\beta_i = \lambda \tau_i e^{-\lambda \tau_i}$ the probability that exactly one packet is generated during $\tau_i$; and $\gamma_i = 1 - \alpha_i - \beta_i$ the probability that at least two packets are generated during $\tau_i$.
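The three arrival probabilities for a state follow directly from the Poisson assumption. The sketch below evaluates them for a given generation rate and sojourn time; the function and argument names are assumptions made for illustration, not notation from the paper.

```python
from math import exp


def slot_generation_probability(rate):
    """Probability that at least one packet is generated in one control slot."""
    return 1.0 - exp(-rate)


def arrival_probabilities(rate, sojourn):
    """Poisson probabilities of 0, exactly 1, and 2 or more arrivals.

    rate:    packets per slot per node (hypothetical argument name)
    sojourn: average sojourn time of the state, in control slots
    """
    mean = rate * sojourn
    none = exp(-mean)
    exactly_one = mean * exp(-mean)
    two_or_more = 1.0 - none - exactly_one
    return none, exactly_one, two_or_more


# Example: a control-cycle state with M = 16 slots at 0.01 packets/slot/node.
print(arrival_probabilities(0.01, 16))
```

The three values always sum to one, which is a convenient sanity check when they are used to fill in rows of the transition matrix.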
The states of the transmitter can be broadly classified as control cycle and data cycle states. The states are differentiated based on the number of backlogged packets in the transmitter queue.
A.1 Control cycle states
States S1, S2 and S3 represent the transmitter during the control phase. The node has zero, one and two backlogged packets respectively on entering states S1, S2 and S3. The sojourn time of all three states is $M$, the control cycle length.
When a packet is generated during S1, the node may either transmit in the current control cycle or miss its turn. A rotating order of reservation slot assignment is assumed. Let $\psi$ represent the probability that the node does not miss its turn in the control cycle. It is derived as follows. With $\sigma = 1 - e^{-\lambda}$ the probability that at least one packet is generated in a slot, the conditional probability of a packet arriving in slot $i$, given that at least one packet arrives in $M$ slots, is given by
$$\frac{\sigma (1-\sigma)^{i}}{1 - (1-\sigma)^{M}},$$
where $0 \le i \le M-1$. This represents the fact that a packet is generated in slot $i$ but not in any preceding slot. If a packet arrives in slot $i$, the probability that any of the following $M - i$ slots is the node's turn in the cycle is $(M-i)/M$. This is used to derive the following equation:
$$\psi = \sum_{i=0}^{M-1} \frac{\sigma (1-\sigma)^{i}}{1 - (1-\sigma)^{M}} \cdot \frac{M-i}{M}.$$
Let $\rho$ denote the probability that a node will have a reservation to transmit in the current control cycle. This packet may be backlogged or generated during the control cycle. The probability that no other node in the system has sent a reservation in the current control cycle is denoted by $\omega = (1 - \rho)^{M-1}$, where $\rho$ is given by Eqn. 7.
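The probability of not missing a turn can be evaluated numerically from the verbal description above. The sketch below is one reading of that description and rests on two assumptions: the first arrival falls in slot i with a truncated geometric distribution over the M slots, and with a rotating slot order the node's turn lies among the following M - i slots with probability (M - i)/M. The function name is hypothetical.

```python
from math import exp


def prob_turn_not_missed(rate, num_nodes):
    """Probability that a packet generated during a control cycle of
    num_nodes slots can still be announced in that same cycle."""
    sigma = 1.0 - exp(-rate)  # at least one arrival in a single slot
    if sigma <= 0.0:
        raise ValueError("rate must be positive")
    norm = 1.0 - (1.0 - sigma) ** num_nodes  # at least one arrival in M slots
    total = 0.0
    for i in range(num_nodes):
        first_arrival_in_i = sigma * (1.0 - sigma) ** i / norm
        turn_still_ahead = (num_nodes - i) / num_nodes
        total += first_arrival_in_i * turn_still_ahead
    return total


print(prob_turn_not_missed(0.05, 16))
```

Under these assumptions the value tends to (M + 1)/(2M) at very light load, where the arrival slot is nearly uniform, and approaches 1 as the load grows and arrivals concentrate in the first slot.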
The transitions from state S1 are: 1. The node remains in state S1 if the node (i) generated exactly one class A packet which was sent in the same cycle or (ii) did not generate any packet, and if no node transmitted a reservation in the current cycle, so p 1; 1]
2. The node moves to state S2 if (i) the node generated two packets where the first is a Class A packet and was sent in the same control cycle, or (ii) the node generated one packet and missed its turn, and if no node has sent a reservation in the control cycle.
The probability is p 1; 2] = 1 (1 ? ) + 1 (1 ? )].
3. The node moves to state S3 if (i) the node generated two packets and the node missed its turn in the control cycle, and (ii) no node has sent a reservation in the control cycle. The probability is Table II . The transition probabilities are used to obtain the limiting probabilities of the embedded Markov chain from which the steady state probabilities of the semi-Markov process are derived.
B. Steady state probabilities and cycle length
The limiting probabilities of the embedded Markov chain are obtained using the following equation:
$$V = VP \qquad (2)$$
where $V$ represents the limiting probability vector and $P$ the transition probability matrix. The steady state probabilities of the semi-Markov process are obtained using:
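As an illustration of this step, the sketch below solves $V = VP$ by power iteration and then weights each limiting probability by the mean sojourn time of its state, which is the standard relation for a semi-Markov process. The function names, the parameter `sojourn`, and the two-state matrix are hypothetical and chosen only for the example; they are not the paper's state space.

```python
def limiting_probabilities(P, tol=1e-12, max_iter=100_000):
    """Solve V = V P by power iteration for a row-stochastic matrix P.

    Converges for irreducible, aperiodic chains (assumed here).
    """
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    raise RuntimeError("power iteration did not converge")


def semi_markov_steady_state(v, sojourn):
    """Weight embedded-chain probabilities by mean sojourn times and normalize."""
    weighted = [vi * ti for vi, ti in zip(v, sojourn)]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    P = [[0.9, 0.1],
         [0.5, 0.5]]                     # illustrative matrix, not from the paper
    v = limiting_probabilities(P)
    print(v)                             # close to [5/6, 1/6]
    print(semi_markov_steady_state(v, [1.0, 5.0]))
```

Power iteration is used only to keep the sketch free of external dependencies; a linear solver would serve equally well.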
The transition probabilities and sojourn times of the states are dependent on the expected data cycle length (CL) and expected delay within a cycle (D). The expected cycle length is dependent on the number of nodes that are active in a cycle. Given M nodes, C channels and the probability that a node has a packet to transmit in a cycle, the expected cycle length (CL) and average delay per packet within a cycle (D) are derived from hashing theory [15], [16], [17]. This problem has its equivalent in hashing, where there are C hash buckets and each element can be randomly placed in one of these buckets. The maximum bucket occupancy corresponds to the length of the data cycle in our problem.
Let k be the number of reservations for packets of length L in a cycle with C channels. Each packet is uniformly destined to any of the channels. Define # = k=C. If we let O# denote the maximum number of packets per channel, the expected data cycle length has been derived in [17] and is given by:
Given the probability that a node will make a data packet reservation, the number of reservations made in the control cycle is binomially distributed over the M nodes, which gives the probability that k reservations are made. Cl(k) denotes the average cycle length given that k reservations are made. The probability that at least one reservation is made is one minus the probability that none of the M nodes makes a reservation,
and the average delay per packet within a cycle after transmitting the control packet is given by
Dl(k) is equivalent to the expected number of searches for a given entry in the hash table with C buckets and is obtained from [17] . The remaining probabilities are derived as explained above.
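The hashing analogy can be made concrete with a small brute-force sketch. The paper takes closed-form results from [17]; the code below instead enumerates every placement of k packets over C channels, which is only feasible for small k and C. The reservation probability is written `p` here as a hypothetical name, and `expected_cycle_length` computes an unconditional average, which may differ from the conditioning used in the paper.

```python
from itertools import product
from math import comb


def reservation_pmf(M, p):
    """Binomial probabilities that k of M nodes make a reservation, k = 0..M."""
    return [comb(M, k) * p**k * (1 - p) ** (M - k) for k in range(M + 1)]


def expected_max_occupancy(k, C):
    """Exact expected maximum number of packets on one channel.

    k packets are placed independently and uniformly on C channels,
    which mirrors the maximum bucket occupancy in hashing.
    """
    if k == 0:
        return 0.0
    total = 0
    for assignment in product(range(C), repeat=k):
        counts = [0] * C
        for channel in assignment:
            counts[channel] += 1
        total += max(counts)
    return total / C**k


def expected_cycle_length(M, C, p, L):
    """Unconditional average data cycle length in slots, packets of length L."""
    pmf = reservation_pmf(M, p)
    return sum(pmf[k] * L * expected_max_occupancy(k, C) for k in range(M + 1))


if __name__ == "__main__":
    print(expected_max_occupancy(2, 2))          # 1.5
    print(expected_cycle_length(8, 2, 0.25, 16))
```

The enumeration grows as C to the power k, so a closed form or a Monte Carlo estimate would be needed for the system sizes studied in the paper.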
The expected cycle length and packet delay are derived from the semi-Markov transition probabilities which are dependent on CL and D. This leads to an iterative solution with the following steps:
1. Choose an initial value, strictly between 0 and 1, for the probability of a node reserving a memory block in a cycle.
2. Estimate CL and D.
3. Compute the steady state probabilities and update the reservation probability from P1 + P2 + P3 using Eqn. 7.
4. Repeat until the value of the reservation probability converges.

The following section derives the average packet delay and channel utilization based on the transition probabilities in this section.
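The iterative procedure above is a fixed-point iteration, which can be sketched generically as follows. The `update` callable stands for one pass through steps 2 and 3; the `toy_update` function used in the example is a made-up contraction and is not the paper's model. The tolerance and iteration limit are likewise assumptions.

```python
import math


def solve_fixed_point(update, x0, tol=1e-9, max_iter=1000):
    """Iterate x <- update(x) until successive values differ by less than tol.

    Returns the converged value and the number of iterations used.
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        x_new = update(x)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")


def toy_update(p):
    """Stand-in for 'estimate CL and D, then recompute the probability'.

    Purely illustrative; it only mimics a reservation probability that
    grows with the cycle length it induces.
    """
    return 1.0 - math.exp(-0.5 * (1.0 + p))


if __name__ == "__main__":
    p, steps = solve_fixed_point(toy_update, x0=0.5)
    print(p, steps)
```

In practice the initial value matters little when the update is a contraction, as it is for this toy function.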
C. Performance Metrics
The performance metrics of interest are channel utilization and average packet delay. Network channel utilization (Γ) is defined as the average number of slots utilized across all channels per control slot. Average packet delay (T ) is the time between packet generation at the source and packet reception at the destination.
C.1 Channel utilization
Channel utilization within a cycle is determined by the number of nodes active in the cycle. The probability of leaving a state is given by $\Psi_i$, the ratio of the steady state probability $P_i$ to the sojourn time of state $i$ [14]. A class A packet is sent during states S1, S2 and S3. A class B packet is sent during states S7 and S8. The class A and class B per-node throughput, and overall utilization are given by:
C.2 Average packet delay
Packet delay is defined as the time between packet generation and packet reception. The average packet delay (T) can be obtained by listing the possible paths a packet can traverse and applying Little's law.
D. Model Validation
This section validates the analytic model through discrete-event simulation. Simulation results have been obtained using stochastic self-driven discrete-event models written in C with YACSIM [18]. YACSIM is a C-based library of routines that provides discrete-event and random variate facilities. Steady state transaction times and utilization were measured. Simulation convergence was obtained through the replication/deletion method [19], with 95% confidence of less than 5% variation from the mean.

Fig. 6 compares the performance predicted by the analytic and simulation models for $M \in \{16, 32\}$, $C \in \{2, 4, 8\}$, a class B fraction of 0.5 or 1.0, and $L = 16$. The deviation of network channel utilization predicted by the analytic model was less than 5% and the deviation of average packet delay was less than 8%. Fig. 6(a) shows that the performance of the protocol improves with increasing C. This is due to the reduced traffic per channel, which decreases the expected cycle length, reducing latency and increasing utilization. The increase in maximum utilization is 100% when C is increased from 2 to 8 for M = 16. The increase in utilization with increasing C will be higher for a larger M/C ratio. Fig. 6(b) shows the validity of the model for a system with a class B fraction of 0.5. The graph plots the overall traffic delay and channel utilization for both control and block packets. The model also enables prediction of control packet and data packet performance separately, as explained in Section IV-C. The following section compares the performance of FatMAC and I-TDMA for varying system parameters.
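The convergence criterion used with the replication/deletion method can be sketched as a simple stopping rule. The code below is not the YACSIM-based simulator; it only shows how a warm-up period might be deleted from each replication and how the 95% confidence half-width could be compared with 5% of the mean. It uses a normal quantile as an approximation, whereas a Student t quantile would be more accurate for a small number of replications. All function names and the warm-up length are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def replication_mean(observations, warmup):
    """Mean of one replication after deleting the first `warmup` observations."""
    kept = observations[warmup:]
    if not kept:
        raise ValueError("warm-up removes every observation")
    return mean(kept)


def relative_half_width(replication_means, confidence=0.95):
    """Confidence-interval half-width divided by the mean of the replications."""
    n = len(replication_means)
    if n < 2:
        raise ValueError("need at least two replications")
    centre = mean(replication_means)
    if centre == 0:
        raise ValueError("relative error undefined for a zero mean")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # normal approximation
    half_width = z * stdev(replication_means) / n ** 0.5
    return half_width / abs(centre)


def has_converged(replication_means, confidence=0.95, max_relative_error=0.05):
    """True when the interval is within the allowed fraction of the mean."""
    return relative_half_width(replication_means, confidence) < max_relative_error


if __name__ == "__main__":
    delays = [10.0, 10.1, 9.9, 10.05, 9.95]   # made-up replication means
    print(has_converged(delays))
```

Additional replications would be run until `has_converged` returns true for each measured quantity.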
V. PERFORMANCE COMPARISON WITH I-TDMA
This section compares the performance of FatMAC and I-TDMA for single-level and multi-level systems. The performance of I-TDMA has been studied through simulation and analytic models in [2] . The system parameters of interest are C, M, and the traffic generation rate.
Other parameters include the number of levels ($r$), the fraction of class B packets, and the traffic fraction requiring level-$i$ communication ($p_i$), where $1 \le i \le r$. For the rest of this analysis, we will assume that each transmitter queue has the capacity to hold up to 5 packets. This value has been empirically chosen to ensure that there is virtually no loss of throughput due to packet blocking.
The class B fraction has a significant impact on system performance and on identifying the performance advantages of either protocol. A simulation study of snooping-based and directory-based cache coherence schemes was used to estimate it. The FFT and SPEECH multiprocessor traces [20] were used together with traces from the SPLASH benchmark suite [21] for this estimate.
Snooping-based coherence schemes, which generally require a broadcast facility for requests and invalidations, had a class B fraction of 0.3 when used with the mp3d and water SPLASH benchmarks. Directory-based coherence schemes, studied with the FFT and SPEECH traces, usually use explicit invalidations, and the measured class B fraction ranged between 0.03 and 0.06, depending on the amount of sharing. This shows that traffic is mostly composed of control packets in the studied applications, even with a snooping-based coherence strategy. This was the primary motivation for the development of FatMAC, since a low class B fraction results in utilization loss and high latency for I-TDMA. In the study below, we examine protocol performance for class B fractions of 0.1, 0.5 and 0.9.
A. One Level System Performance
This section compares the performance of FatMAC and I-TDMA for a single level system. The reduction in latency and loss of maximum utilization of FatMAC are quantitatively analyzed in this section. than that with I-TDMA. This represents a significant reduction in delay, mainly due to the large cycle length of I-TDMA and the fact that most traffic is control packets. The reduction in delay for a similar system at a utilization of 1.5 is 80% for a class B fraction of 0.9. In general, the latency advantage of FatMAC increases with increasing system size, increasing packet size, and decreasing class B fraction. When the fraction of class B packets increases, the expected cycle length of FatMAC increases, reducing the magnitude of the latency reduction. For a low class B fraction, the utilization of FatMAC is higher than that of I-TDMA for systems with small M/C ratios. Under heavy loads and for a large class B fraction, the maximum utilization of I-TDMA is higher.
A.1 Variations in C
The maximum throughput attained by FatMAC is less than that of I-TDMA since a node can transmit only one packet per data cycle in FatMAC compared to C packets per cycle in I-TDMA. The cycle length for FatMAC is approximately $M + ML/C$. This assumes that the packets generated by the M nodes are distributed evenly over C channels. The cycle length of I-TDMA is given by $ML$. Comparing the two cycle lengths yields:
$$\frac{M + ML/C}{ML} = \frac{1}{L} + \frac{1}{C}$$
For $L \gg C$, which is the case of our target environment, the second term in the above summation will be dominant. The cycle length of FatMAC will be $1/C$ times that of I-TDMA. The maximum utilization in this case will be roughly the same for both I-TDMA and FatMAC.
As the ratio $L/C$ increases, the throughput advantage of I-TDMA decreases. This is due to the longer data cycles, which reduce the relative overhead of the control cycle. Maximum utilization $\Gamma_{\max}$ with C channels for FatMAC and I-TDMA is given as:
The utilization advantage of I-TDMA is higher as C increases. The system under consideration contains a small number of channels, and in the case where $L \gg C$ and $M \gg C$, which is the case of our target environment, $\Gamma^{\mathrm{Fat}}_{\max} \to C$.
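As a rough numerical check of the limits discussed in this subsection, the sketch below evaluates the approximate cycle lengths quoted in the text and a FatMAC utilization bound derived from them. The bound assumes one packet of length L per node per data cycle and the approximate cycle length M + ML/C, so it is an estimate under those stated approximations and not necessarily the exact expression of the paper. The last function converts the I-TDMA cycle length into seconds for the 16-node, 4-channel, 2 Gbit/s configuration with 8 Kbyte packets used as an example in this subsection; taking 8 Kbyte as 8192 bytes is an assumption.

```python
def cycle_length_fatmac(M, C, L):
    """Approximate FatMAC cycle length in slots: control cycle plus data cycle."""
    return M + M * L / C


def cycle_length_itdma(M, L):
    """I-TDMA cycle length in slots."""
    return M * L


def max_utilization_fatmac(C, L):
    """Busy slots per cycle slot when every node sends one L-slot packet.

    M * L busy slots over M + M * L / C slots simplifies to C * L / (C + L).
    """
    return C * L / (C + L)


def itdma_cycle_time_seconds(M, packet_bytes, bit_rate):
    """Duration of one I-TDMA cycle of M packet-length slots."""
    return M * packet_bytes * 8 / bit_rate


if __name__ == "__main__":
    M, C, L = 16, 4, 128
    print(cycle_length_fatmac(M, C, L) / cycle_length_itdma(M, L))  # 1/L + 1/C
    print(max_utilization_fatmac(C, L))                             # close to C

    cycle = itdma_cycle_time_seconds(16, 8192, 2e9)
    print(cycle * 1e6)      # about 524 microseconds
    print(4 / cycle)        # blocks per second per node needed to fill C slots
```

The required generation rate of roughly 7600 blocks per second per node is consistent with the figure of about 7500 quoted for this configuration.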
Though I-TDMA can theoretically achieve the maximum utilization of C, this depends on the rate at which the system interface can generate outgoing packets. The network interface has the capability to send consecutive packets on the channels, up to C packets in an I-TDMA cycle. The system interface may not be able to sustain such high packet rates in a realistic situation. Consider a system with 16 nodes, 4 channels and a transmission speed of 2 Gbit/s. The cycle length with 8 Kbyte packets is 525 µs. If the system interface is able to generate 7500 memory blocks per second that are uniformly targeted to all the 16 nodes, the maximum utilization can be achieved. The protocol is thus limited by the maximum traffic the system interface can generate.

The above paragraphs quantified the reduction in latency achieved by FatMAC, particularly under light loads, at the expense of bandwidth. In a high speed environment, trading bandwidth for lower delay can be reasonably justified. The following section examines the protocols with respect to scalability in system size.

A.2 Variations in M

Fig. 8 compares the performance of the two protocols with respect to system size scalability. The graph shows the performance for $M \in \{16, 32, 64\}$, $C = 4$, a class B fraction of 0.1 or 0.5, and $L = 128$. Increasing M increases the average latency for both protocols since the cycle length is proportional to M. Under light loads, the increase in cycle length is proportional to the memory packet length for I-TDMA and to the control packet length for FatMAC. This contributes to the lower latency of FatMAC. For small values of the class B fraction, the delay reduction gained with FatMAC is higher since most of the slots in I-TDMA are under-utilized. The increase in delay under light loads when M is increased from 16 to 64 is 200% and 300% for FatMAC and I-TDMA respectively. However, the delay obtained with FatMAC for M = 64, C = 8 and a class B fraction of 0.1 is less than that of I-TDMA.

For higher values of the class B fraction, there is little variation in the delay of I-TDMA since its cycle length is independent of this fraction. The cycle length of FatMAC increases with the class B fraction and hence its delay is higher. The reduction in latency holds for higher values of the class B fraction but the magnitude is diminished, as seen from Figs. 8(a)-(b).
For very large values of M, the delay due to long control cycles may impact performance adversely. One solution is to start the system with fewer than M control slots. Each control slot may be accessed through a random access mechanism with collision resolution. The system could monitor collisions on the control channel and dynamically increase the number of control slots or switch over to TDM under heavy traffic. A more detailed analysis of this random access approach may be found in [13].
The following section compares the performance of the protocols for two level systems.

B. Two Level System Performance

The magnitude of latency reduction decreases with increasing class B fraction, as in the single level case. For small values of the class B fraction, the synchronization delay of I-TDMA is of the order of ML whereas the latency of FatMAC is of the order of M; as the class B fraction increases, the delay of FatMAC increases due to longer data cycles and hence the performance advantage decreases. It is also observed that maximum utilization is higher for FatMAC at lower class B fractions and higher for I-TDMA at higher class B fractions. This again illustrates the tradeoff of bandwidth to minimize latency. The level 2 traffic among nodes is quite low and does not benefit from an increase in channels beyond 2 at that level. The number of channels allocated to each level can be varied in accordance with $p_i$ to maintain uniform traffic intensity across levels. to that shown for $p_1 = 0.9$. There was little variation in performance with varying channel allocation for a class B fraction of 0.1. The general trend of low latency with FatMAC is observed in all the cases. When the number of channels used in the first level (C1) is increased from 1 to 2, there is improvement in performance for both protocols. When C1 is increased from 2 to 3, the performance degrades. This is due to the fact that only one channel is available for level 2 traffic, creating a bottleneck. The maximum utilization decreases for the case of C1 = 3 and C1 = 2 for both protocols.

For a multi-level system, the node does not transmit a control packet on another cluster until it has finished transmitting the current packet. This can lead to reduced utilization since a node misses its allocated control slot on a different level. The protocol can be modified to detect such a situation and utilize the bandwidth efficiently, but this results in added complexity and extra processing latency. The interaction of traffic between the two levels will have an impact on system performance. This is due to contention between the different levels for the single transmitter at the node. An expensive solution to overcome this contention would be to provide a transmitter per level.

C. Propagation Delay
Propagation delay is a serious concern with high speed optical networks even for short distances of the order of a few km. Two main synchronization issues impacted by propagation delay are Slot Synchronization, which defines when each node accesses the network during the control phase, and Frame Synchronization, which determines when the nodes transmit the data cycle after the corresponding control cycle packets have been processed [22].
Two schemes have been analyzed for slot synchronization in [22] which are defined as LockStep and Distributed Clock mechanisms. Similarly, two schemes have been studied for frame synchronization defined as Data Cycle Stall (DCS) and In-Flight Reservation (IFR). The reader is referred to [22] for the details of the analysis. The results are summarized here.
The studies in [22] show that the lock step scheme is suitable for low propagation delays, at distances of the order of up to 100 m. The distributed clock mechanism is suitable for propagation delays corresponding to distances of the order of up to 10 km. This shows the applicability of the simpler lock step scheme in a multiprocessor environment with locally distributed processors. The DCS scheme is appropriate for propagation delays corresponding to distances up to the order of a few km. The IFR scheme is suitable for larger propagation delays, at distances up to a few hundreds of km. Further research is required to extend the protocols to support longer distances.
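To give a sense of scale for these distances, the sketch below converts a fibre length into a one-way propagation delay and into the number of bit times that delay represents. The refractive index of 1.5 is an assumed, typical value for silica fibre, and the 2 Gbit/s rate is borrowed from the sizing example in Section V-A; neither value comes from [22].

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def propagation_delay_seconds(distance_m, refractive_index=1.5):
    """One-way propagation delay over a fibre of the given length."""
    return distance_m * refractive_index / SPEED_OF_LIGHT_M_PER_S


def bits_in_flight(distance_m, bit_rate, refractive_index=1.5):
    """Number of bit times that elapse during one-way propagation."""
    return propagation_delay_seconds(distance_m, refractive_index) * bit_rate


if __name__ == "__main__":
    for distance in (100.0, 10_000.0, 100_000.0):
        delay = propagation_delay_seconds(distance)
        print(distance, delay * 1e6, bits_in_flight(distance, 2e9))
```

At 100 m the delay is about half a microsecond, or roughly a thousand bit times at 2 Gbit/s, which suggests why a simple lock step scheme may suffice only at short range.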
D. Dedicated Control Channel Protocols
This paper compared the performance of FatMAC to I-TDMA since this was the protocol considered for the hierarchical architecture proposed in [2] and both protocols have the same transmitter/receiver complexity. Both protocols are based on pre-allocated receivers, and support the bimodal traffic studied in this paper.
To provide additional perspective, this section compares FatMAC to a protocol based on a dedicated control channel, denoted as TDMA-C [6]. Both FatMAC and TDMA-C are designed to support variable length packets and are based on a reservation mechanism.
Transceiver cost: TDMA-C requires a tunable transmitter, a fixed receiver and a tunable receiver. FatMAC requires only a tunable transmitter and a fixed receiver. Tunable receivers with wide tuning range and fast tuning times are still not readily available.

Control processing: TDMA-C requires the receiver subsystem to continuously monitor control packets. FatMAC also requires control processing but only at the end of data cycles. Control processing can be overlapped with data cycle transmission as in the In-Flight Reservation scheme [22].

Data channels available: One basic system constraint is that the system will only have a few WDM channels. The practical limitations imposed by building tunable laser transmitters reduce the number of available wavelengths. Dedicating one of these channels to control processing removes a significant fraction of the bandwidth available for data transmission.

Status information: TDMA-C requires status tables to keep track of channel information such as current owner and slots allocated. Each control packet that is received updates the status tables, which imposes high storage and processing requirements. FatMAC, on the other hand, maintains a single counter per channel to keep track of offsets within the data cycle.

Implementation simplicity: FatMAC was chosen over TDMA-C due to simplicity of implementation and also because the one level protocol could be easily extended to support multiple levels. Extending TDMA-C to support two levels will require more complexity due to more status tables and processing.

Performance: TDMA-C does provide additional flexibility in data channel availability. FatMAC trades off bandwidth in the data cycle, which can lead to performance problems for highly nonuniform traffic. A thorough performance comparison of FatMAC and TDMA-C remains to be carried out.

Data broadcast: Both FatMAC and TDMA-C provide a facility for control packet broadcast. However, it is easier to accomplish block data packet broadcast with FatMAC by allocating one slot in the data cycle to the node requesting broadcast. It is more complex to achieve this with TDMA-C since the node has to wait until all data channels are free before broadcast.

To summarize, the main advantages of FatMAC are light-weight protocol processing and lower component cost. The main advantage of TDMA-C is the flexibility of data channel availability.
VI. CONCLUSIONS
This paper studied the design of a media access protocol for an optically interconnected multiprocessor system. In particular, the protocol is optimized for the traffic characteristics of a distributed shared memory multiprocessor. One primary objective is minimum latency, and another is low system cost when $C \ll M$, requiring fixed receivers and tunable transmitters per node. A hybrid access scheme combining a reservation approach with a fixed receiver architecture was proposed in this paper. This protocol trades maximum capacity for a significant reduction in latency. The impact of system traffic characteristics and system parameters on protocol performance was studied. The proposed protocol offers lower latencies under light loads and, unlike random access schemes, is stable under heavy loads.

Table 2 lists the transition probabilities.
[Figure: average packet delay for channel allocations C = (1, 3) and C = (3, 1).]
