This paper introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation. Both architectures are based on wavelength division multiplexing (WDM) which enables multiple multi-access channels to be realized on a single optical fiber. The objective of the hierarchical architectures is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, both hierarchical architectures are single-hop: a packet remains in the optical form from source to destination and does not require cross dimensional intermediate routing. The first structure is physically hierarchical but wavelength flat: all nodes share the same wavelength space. The second structure is a wavelength multiplexed hierarchical structure with wavelength channel re-use at each level, allowing it to be scaled to very large system sizes. It employs acousto-optic tunable filters in conjunction with passive couplers to partition the traffic between different levels of the hierarchy without electronic intervention. An advantage of the second structure is its ability to dynamically vary the bandwidth provided to different levels of the hierarchy. The architectures are compared in terms of complexity and performance scalability.
Introduction
The Fat-Tree [1] was proposed as an interconnect strategy to support concurrent interprocessor communication through a hierarchy of limited-bandwidth spatial channels. Unlike conventional tree processor interconnection approaches, the Fat-Tree provides increased bandwidth at levels closer to the root. The processors of a Fat-Tree are located at the leaves of a complete binary tree, and the internal nodes are spatial switches. The number of links connecting a node to its parent increases as the level increases to provide increased communication bandwidth. The rate of growth is influenced by the desired scalability and cost-performance ratio. A variation of the Fat-Tree network has been employed in Thinking Machines' CM-5 [2, 3]. This paper introduces an optical-based approach, denoted as the space-wavelength hierarchical architecture (SWHA), that achieves the Fat-Tree objectives while also obtaining significant improvement in flexibility, performance and fault tolerance through wavelength-division multiple access (WDMA) photonic interconnection. Not only can the Fat-Tree strategy of providing increased bandwidth per link at levels closer to the root be obtained, but adaptable bandwidth allocation can now be achieved at each level: the bandwidth at each level does not need to remain fixed but can be dynamically reallocated to adapt to changing communication requirements. Furthermore, a highly scalable architecture is achieved, overcoming the fixed number of WDMA channels through spatial re-use. The hierarchy resembles a general $m_i$-ary tree, where an $i$-level node branches to one parent and $m_i$ children. The processors (0-level nodes) are located at the leaves and the internal nodes are wavelength/spatial routing switches, denoted as FatNodes because they are motivated by the Fat-Tree. The result is a system with a unity degree per processor yet unity diameter: any two processors can communicate without any intermediate routing.
This is significant since the intermediate routing latencies may be much larger than the packet transmission time in an optical environment. One additional advantage of the proposed architecture is the independence of the number of channels in the WDMA network and the number of nodes interconnected by it.
In addition to a very large bandwidth, optical interconnects offer many desirable features such as a relaxed distance-bandwidth product, large fanout capability, low power requirement, reduced crosstalk and immunity to electro-magnetic interference. Computer architecture may employ optical interconnects at multiple system levels: chip-to-chip, module-to-module, board-to-board and node-to-node [4] . However, little benefit is obtained by simply replacing metal interconnects with optical fiber in a one-to-one fashion since the result would be many expensive but under-utilized communication links. Furthermore, the true advantages of optical interconnections would not be harnessed due to the speed mismatch between the optical and electronic components. The speed mismatch is a major problem hindering the development of photonic based processor interconnection architectures. The low loss region of a single mode optical fiber has a bandwidth of about 30THz [5] : the optical media bit and packet throughput rates greatly exceed the capacity of the electronic interface components. Wavelength-division multiplexing (WDM) is an approach that circumvents the speed mismatch problem where the bandwidth is partitioned into many, more manageable, high speed channels that may be concurrently accessed. The WDM channels form a set that can be individually switched and routed. However, wavelength tunable transmitters and/or receivers are required. Section 2 contains a brief outline of their characteristics. Depending on the architecture, optical self-routing is achievable where a node only receives data destined to it and the system has the non-blocking connectivity characteristics of a crossbar [6] . Optical self-routing partitions the traffic, achieving a significant relaxation of the receiver subsystem design constraints since a node will not have to receive and process all network traffic. 
This is an important characteristic since the photonic network can support a throughput rate far beyond the packet processing capability of a typical processor.
WMCH is a hypercube-based structure that combines both spatial and wavelength multiplexing through optical WDMA channels spanning each dimensional axis [7]. The WMCH requires a greater than unity degree: an $r$-dimensional hypercube requires $r$ transmitter/receiver pairs per node. Each processor in an $r$-dimensional structure is connected to $r$ channels, each spanning a different dimension. An $i$-dimensional channel has $m_i$ processors attached, for a total of $M = \prod_{i=1}^{r} m_i$ processors. A major objective of the WMCH was to support system size expansion beyond the optical power budget (OPB) limitation of the single star-coupled configuration shown in Fig. 1, while also extending the number of available channels through a combination of spatial and wavelength switching. The WMCH does not require optical amplifiers and uses a multi-dimensional structure to achieve scalability. This requires that each node have $r$ transmitter/receiver pairs for an $r$-dimensional structure, which has significant cost implications since the cost of wavelength tunable devices is not negligible. Another limitation of the WMCH is that the system is multi-hop: a packet may not remain completely in the optical domain between the source and destination. There is a latency incurred at each intermediate node due to the optical/electrical conversion, routing and re-transmission that is required each time a packet crosses a dimensional boundary. Though the average distance and diameter of the WMCH are very low, the impact of routing latency on the total delay can be significant due to the high speeds of the optical media. Developing interconnection architectures that require a single transmitter/receiver pair per node yet achieve excellent performance through the avoidance of multi-hop routing is a major objective of this paper.
System scalability in this environment is limited by three main issues: physical scalability which bounds the maximum number of interconnected nodes through the OPB, performance scalability which is bound in the optical domain by the tunability of the optical devices (transmitters and receivers/filters) that limit the number of wavelength channels that are created, and scalability limitations when the cost or complexity grows faster than the performance. Physical scalability is examined by developing a framework to extend the system range beyond the OPB limit of a single star-coupled network as shown in Fig. 1 . Performance scalability is examined by using both spatial and wavelength multiplexing (multiple, but spatially separate, channels of the same wavelength) to extend the performance capacity beyond the limit of C channels imposed by the wavelength tunable device characteristics. Complexity scalability is supported through the hierarchical structures that retain a unity degree: a processor only requires a single transmitter and receiver regardless of system size. Our objective is to decouple the maximum system size and the number of wavelength channels, avoiding the situation where a system of M nodes requires M channels since this would create a situation where the scalability of the system is too dependent on the characteristics of the wavelength tunable devices.
A wavelength-flat but physically hierarchical structure, denoted as FHA, is described that provides both physical and complexity scalability but is not performance scalable since the same wavelength space is shared by all processors. The resulting system is logically similar to a single star network of Fig. 1 . The SWHA achieves (physical, performance and complexity) scalability through both spatial and wavelength multiplexing. The FHA and SWHA are both single hop architectures so the significant intermediate routing latencies of the WMCH, relative to the nominal packet transmission time, are eliminated. The proposed architectures are defined in Section 2, and the performance and cost behavior of each approach is considered in Section 3.
Photonic Interconnection Architectures
Both the physical limitations and performance requirements must be considered when developing scalable photonic architectures. The objective of this section is to provide a framework where system size can be varied within a desired range while maintaining a specified level of performance.
Photonic network architectures may adopt multi-access or virtual point-to-point communication strategies. Multichannel multiaccess protocols provide distributed access scheduling by employing either reservation or pre-allocation strategies [8]. Virtual point-to-point networks rely on embedding an interconnection topology in the wavelength domain [9]. The embedded network may be used as a virtual multi-hop interconnect [10, 11] (using only fixed wavelength transceivers) or as a topologically reconfigurable point-to-point interconnect [12, 13] (using tunable multi-channel receivers). Tunability may be employed to redefine the network's logical connectivity in response to traffic flow irregularities [14].
WDM networks require wavelength tunable transmitters and/or receivers to switch between the multiple channels created on the single optical fiber. A limited tuning range, transmitter linewidth and filter bandwidth constrain the maximum number of selectable wavelength channels. Wavelength tunable transmitters can be obtained through a number of techniques: fast tunable laser diodes, an array of non-tunable but different wavelength laser diodes, or LED spectral slicing [5, 15-17]. Wavelength selectivity can be achieved either with a coherent receiver, or a tunable filter with direct detection. The first approach is more expensive, but has higher channel selectivity. A lower cost alternative is a wavelength tunable filter with direct detection [18, 19]. Active filters have either electro-optic or acousto-optic control. Acousto-optic devices have a tuning range across the full 1.3-1.56 µm range, while electro-optic devices are limited to about 15 nm [19]. Acousto-optic devices typically have slower switching speeds (µs) than electro-optic devices (ns). Acousto-optic tunable filters (AOTF) have the advantages of a broad tuning range, electronic control of tunability, and narrow filter bandwidth. Several implementations of an optical cross-connect based on an AOTF have been described in [5, 19] where different wavelength channels can be routed independently to different output fibers. The selected wavelength is switched in an AOTF by varying the acoustic wave control frequency. This allows an individual wavelength to be switched or not switched. Multiple wavelengths may be switched with this device by the superposition of multiple acoustic control signals.
A mixed radix system is used to represent the node numbering. Let $M$ (the total number of processors) be a decimal integer represented as a product of $r$ factors:
$M = \prod_{i=1}^{r} m_i$ (1)
The processor identifier $P$, $1 \le P \le M$, can be represented as an $r$-tuple
$P = (p_r, p_{r-1}, \ldots, p_1)$ (2)
where $1 \le p_i \le m_i$ for all $1 \le i \le r$. The term $r$ denotes the number of hierarchical levels with the FHA and SWHA. The term processor is used to denote a node at the leaves of the tree (a 0-level node). An actual system might place $m_0$ processors at each 0-level node, perhaps interconnected through a shared metal bus as in [20], to amortize the cost of the communication subsystem over multiple processors depending on the cost, performance and typical communication requirements of the individual processors. The address of a processor is defined by the $r$-tuple of Eqn. (2).
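The mixed radix numbering can be illustrated with a short sketch. The text does not state how the decimal identifier $P$ maps onto the digits of the $r$-tuple, so the code below assumes the conventional positional mapping with $p_1$ as the least significant digit; the function names `to_tuple` and `from_tuple` are hypothetical.

```python
from typing import List, Sequence


def to_tuple(P: int, m: Sequence[int]) -> List[int]:
    """Convert identifier P (1..M) into digits [p_r, ..., p_1], each in 1..m_i.

    m is given as [m_1, ..., m_r]. Assumption (not stated in the text):
    p_1 is the least significant digit of P - 1.
    """
    M = 1
    for mi in m:
        M *= mi
    if not 1 <= P <= M:
        raise ValueError("P must satisfy 1 <= P <= M")
    rest = P - 1
    digits = []
    for mi in m:                      # peel off p_1 first, then p_2, ...
        digits.append(rest % mi + 1)  # shift into the 1..m_i range
        rest //= mi
    return digits[::-1]               # report as (p_r, ..., p_1)


def from_tuple(p: Sequence[int], m: Sequence[int]) -> int:
    """Inverse of to_tuple: p = [p_r, ..., p_1], m = [m_1, ..., m_r]."""
    value = 0
    for digit, mi in zip(p, reversed(m)):
        value = value * mi + (digit - 1)
    return value + 1


if __name__ == "__main__":
    m = [4, 3, 2]                     # m_1 = 4, m_2 = 3, m_3 = 2, so M = 24
    print(to_tuple(6, m))             # digits (p_3, p_2, p_1)
    print(from_tuple([2, 3, 4], m))
```

Under this assumed mapping every identifier from 1 to $M$ corresponds to exactly one $r$-tuple, which is the only property the later routing and slot assignment steps rely on.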
A Flat Hierarchical Architecture
A straightforward method to achieve physical scalability beyond the OPB limitation of the single star-coupled configuration of Fig. 1 is through the incorporation of optical amplifiers in a hierarchical fashion. This is now a practical approach due to the rapid advancement in the development of optical amplifiers [21-24]. However, the FHA is a physical hierarchy: the system is flat in the wavelength domain since all nodes share a common wavelength space. The FHA addresses the OPB limitation and achieves the complexity scalability objective but does not address the performance scalability limitation.
This hierarchical approach (and also that of the SWHA) is assumed to have $r$ levels, where an $i$-level node has one parent and $m_i$ child nodes for all $1 \le i \le r$. The number and placement of optical amplifiers are highly dependent on the OPB characteristics of the system. Let $M_j$ denote the number of $j$-level nodes in the hierarchy:
$M_j = \prod_{k=j+1}^{r} m_k$ (3)
for all $1 \le j \le r-1$.
An advantage of this approach is that the system is not multi-hop: no intermediate routing between source and destination is needed. Depending on the access protocol and configuration, optical self routing may be achieved due to the traffic partitioning characteristic of WDM. A packet undergoes optical/electrical conversion only at the source and destination. A limitation is that the extended system is still limited to a total of C wavelength channels. Furthermore, the network is now not completely passive and fault tolerance concerns arise since the failure of an optical amplifier disconnects a fraction of the network (m 1 nodes in the example of Fig. 2 ). This structure has excellent topological characteristics: unity degree and unity diameter. Routing is very simple since it is a single hop network.
High connectivity is achieved with optical multiple access channels. WDM creates multiple multi-access channels on a single fiber and a large number of nodes may be interconnected with a unity distance. A major concern with a multi-access approach is that a media access protocol is required. An efficient and simple (due to the high speeds of the optics) solution is essential to obtain the performance and complexity advantages of a multiple access environment. Reservation and pre-allocation are two possible strategies.
Reservation techniques designate a channel as the control channel that is used to reserve access on the remaining channels for data packet transmission. Media access protocols are required to provide arbitration on both the data and control channels. Pre-allocation techniques pre-assign the channels to the nodes, where each node has a home channel that it uses either for all data packet transmissions or all data packet receptions. Reduced system complexity is achieved with pre-allocation since a node is not required to possess both a fast tunable transmitter and a fast tunable receiver. Furthermore, the difficulty of supporting a control channel is eliminated and all channels are used for data transmission [25] .
A time multiplexed pre-allocation protocol is used in this paper to provide access arbitration. A system has $M$ nodes and $C$ WDM channels where each node has a tunable transmitter and a fixed (or slow tunable) receiver. The number of channels required for the interconnection pattern is independent of the number of processors. Techniques have been proposed to overlap the tuning time of the optical devices for this protocol [26]. This protocol is the basis for the multi-level protocol defined in Section 2.2. A source node tunes its transmitter to the home channel of the destination node and transmits according to the access protocol. A source node can determine the home channel of a destination node in a decentralized fashion.
Node $A$ is assigned $c_A$ as its home channel based on the allocation policy, where $0 \le c_A \le C-1$ and $1 \le A \le M$. An interleaved home channel allocation policy is defined as $c_A = A \bmod C$. With this approach, each node has the opportunity to transmit on each channel once per cycle as shown in Fig. 3.
Each node maintains $C$ transmit queues to avoid head-of-line effects [25]. Collisions are avoided and the difficulty of supporting acknowledgments and retransmissions is eliminated. A cycle, denoted by $L$, is defined as the length of time for all nodes to be assigned a slot on all channels. Typically, $L = M$, but the cycle length may differ depending on the strategy used to hide the switching latency (the protocol processing overhead plus the switching time of the optical devices) [26].
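The interleaved policy and the once-per-cycle property can be sketched as follows. The home channel rule $c_A = A \bmod C$ is taken from the text; the slot-to-channel rotation used in `schedule` is a hypothetical assignment chosen only because it is collision-free, and is not claimed to be the exact map of Fig. 3.

```python
from typing import List, Optional


def home_channel(A: int, C: int) -> int:
    """Interleaved home channel allocation from the text: c_A = A mod C."""
    return A % C


def schedule(M: int, C: int) -> List[List[Optional[int]]]:
    """Build table[s][c] = node allowed to transmit on channel c in slot s.

    Assumption: node A is offered channel (A + s) mod M in slot s and only
    transmits when that value names a real channel (less than C). For a fixed
    slot this is a one-to-one map of nodes, so no channel is shared.
    """
    if C > M:
        raise ValueError("this sketch assumes C <= M")
    table: List[List[Optional[int]]] = []
    for s in range(M):                # cycle length L = M slots
        row: List[Optional[int]] = [None] * C
        for A in range(1, M + 1):
            c = (A + s) % M
            if c < C:
                row[c] = A
        table.append(row)
    return table


if __name__ == "__main__":
    for s, row in enumerate(schedule(M=8, C=3)):
        print("slot", s, "transmitters per channel:", row)
```

With this rotation a source holding a packet for destination $Z$ waits for the slot in which it owns channel `home_channel(Z, C)`, which occurs exactly once in each cycle of $M$ slots.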
The common set of channels between all nodes with the FHA enables a reservation-based media access protocol [27] and a snooping-style cache coherence protocol to be considered. However, a reservation-based media access protocol has greater system complexity than a pre-allocation based configuration due to the requirement of multiple wavelength tunable components per node. A common channel would enable a snooping-based cache coherence scheme to be considered, rather than requiring explicit invalidations with a directory-based cache coherence scheme for distributed shared memory systems [20, 28]. The FHA supports physical scalability with stable complexity, implying that the snooping-based cache coherence approach would be possible with systems scalable up to the point where the limit of the $C$ channels in the common wavelength space and the overhead (mainly the synchronization delay due to the cycle length) of the media access protocol begin to restrict performance.
[Fig. 3: slot allocation map showing the transmitting nodes on each channel for slots 1 through M of a cycle.]
A Hierarchical Architecture in Space and Wavelength
The SWHA achieves the Fat-Tree objective of increased bandwidth per link at higher levels of the hierarchy while also obtaining a significant improvement in flexibility, performance and fault tolerance. The SWHA achieves adaptable bandwidth allocation: the bandwidth at each level does not need to remain fixed but can be dynamically reallocated to adapt to changing communication requirements. An advantage of this approach is that the system adapts to the users (programmers) instead of the users adapting to the system. Bandwidth reallocation is invisible to the users, and is initiated by the system to balance the per channel intensities described below. A highly scalable architecture is achieved, overcoming the fixed number of WDMA channels through wavelength channel spatial re-use. The SWHA interconnection network is based on a component called the FatNode which is a space/wavelength switch constructed from passive couplers and an acousto-optic tunable filter (AOTF). The AOTF is used as a wavelength-selective space-division switch to select a subset of wavelengths to switch. Section 2.2.2 presents the media access protocol employed in the SWHA architecture.
Architecture
As with the FHA, the term processor is used to denote a 0-level node which may actually be multiple ($m_0$) processors. Note that only the processors possess transmitters and receivers. All other $i$-level nodes, $1 \le i \le r$, are FatNodes that provide wavelength/spatial switching. An advantage of the SWHA, when compared with a structure such as the WMCH, is that only one tunable transmitter is required per processor. Note that the link from the upper level contains no optical energy on channels $\{\lambda_0, \lambda_1, \ldots, \lambda_{X_i-1}\}$, so all traffic arrives along channels $\{\lambda_{X_i}, \lambda_{X_i+1}, \ldots, \lambda_{C-1}\}$ and is directly routed to the lower level.
Although an AOTF can cross-connect or bar-connect wavelength channels individually through the superposition of multiple acoustic control frequencies, a FatNode does not require this degree of sophistication, which relaxes device design constraints and reduces device cost.
Spatial re-use of wavelength channels occurs at each of the $r$ levels in this hierarchy. Let $C$ denote the total number of wavelength channels that can be separated on a single fiber, a number limited by the characteristics of the optical devices, and let $\Lambda = \{\lambda_0, \lambda_1, \ldots, \lambda_{i-1}, \lambda_i, \lambda_{i+1}, \ldots, \lambda_{C-1}\}$ denote the set of wavelength channels. An $i$-level FatNode is electronically configured to a crossover channel denoted as $\lambda_{X_i}$, where $\lambda_{X_i} \in \Lambda$, such that bar-connections are established for channels $\{\lambda_0, \lambda_1, \ldots, \lambda_{X_i-1}\}$ and cross-connections are established for channels $\{\lambda_{X_i}, \lambda_{X_i+1}, \ldots, \lambda_{C-1}\}$. Essentially, this mapping is used to retain the local $i$-level traffic $\{\lambda_0, \lambda_1, \ldots, \lambda_{X_i-1}\}$, and forward the global traffic $\{\lambda_{X_i}, \lambda_{X_i+1}, \ldots, \lambda_{C-1}\}$ to the $(i+1)$-level FatNode. The dynamic reconfigurability characteristic is obtained by altering the wavelength channel partition at each level. Fig. 4(b) shows an implementation of a FatNode. Due to the traffic partitioning characteristics, traffic arriving from the upper level does not require partitioning and may avoid the AOTF and be coupled directly to the $m_i$ nodes attached at the lower level. Only the lower level traffic must be partitioned. Bar-connections are established for traffic arriving on channels $\{\lambda_0, \lambda_1, \ldots, \lambda_{X_i-1}\}$, locally retaining the traffic under the $i$-level FatNode. Traffic arriving from the lower level along channels $\{\lambda_{X_i}, \lambda_{X_i+1}, \ldots, \lambda_{C-1}\}$ is cross-connected and passed to the upper level as shown in Fig. 4(b). This exploits the locality of communication expected within a cluster: the communication generated by and destined to nodes beneath the $i$-level FatNode remains local, but global traffic, destined to a node under the $j$-level FatNode for $i < j \le k$, proceeds up the tree to the $j$-level FatNode. The term "$i$-level cluster" is used to denote all processors below an $i$-level FatNode.
Topology: An $i$-level FatNode, $1 \le i \le r$ ($r$ is the number of levels in the structure), partitions the traffic it receives according to its space-wavelength configuration. An $r$-level system partitions the wavelength channels into $r$ non-overlapping subsets. The partition points are defined as $X = \{X_0, X_1, X_2, \ldots, X_r\}$ where $X_r = C$, $X_0 = 0$ and $0 < X_i < X_j < C$ if $i < j$ for all $0 < i < r-1$ and $1 < j < r$. The configuration remains fixed at the partition point and is not altered until the system is reconfigured to adapt to a change in the locality characteristics. The partition points $X$ are shifted during reconfiguration, which allows the system to adapt to a shift in traffic locality characteristics and dynamically reconfigure the amount of communication bandwidth provided at each level of the hierarchy. This relaxes the design constraints on the FatNode AOTF since fast tunability beyond the currently achievable microsecond switching speed is not required. Furthermore, narrow filter bandwidth is not required since individual channel selection is not needed.
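As with the FHA, let $M_j$ denote the number of $j$-level nodes in this hierarchy, given by Eqn. (3). The system then provides a total of
$\sum_{i=1}^{r} (X_i - X_{i-1}) \prod_{k=i+1}^{r} m_k$ (4)
separate channels that may be concurrently accessed due to the combination of spatial and wavelength multiplexing, where the empty product for $i = r$ is taken as 1. For example, consider the case of $M = 24$ in Fig. 5. If $C = 7$ and $X = \{0, 4, 6, 7\}$, a total of $4 \cdot 6 + 2 \cdot 2 + 1 \cdot 1 = 29$ separate channels are effectively created.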
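The channel count in the example can be checked with a minimal sketch. It assumes the fan-outs of Fig. 5 are $m_1 = 4$, $m_2 = 3$, $m_3 = 2$ (the ordering is inferred from $M = 24$ and the stated total of 29), and the function names are hypothetical.

```python
from math import prod
from typing import List, Sequence


def nodes_per_level(m: Sequence[int]) -> List[int]:
    """Return [M_1, ..., M_r], where M_j is the product of m_k for k > j.

    m is given as [m_1, ..., m_r]; the empty product gives M_r = 1.
    """
    r = len(m)
    return [prod(m[j:]) for j in range(1, r + 1)]


def effective_channels(m: Sequence[int], X: Sequence[int]) -> int:
    """Total concurrently usable channels under spatial wavelength re-use.

    X = [X_0, ..., X_r] are the partition points with X_0 = 0 and X_r = C.
    Each level i owns X_i - X_{i-1} wavelengths, re-used in M_i clusters.
    """
    r = len(m)
    if len(X) != r + 1 or X[0] != 0:
        raise ValueError("X must hold r + 1 partition points starting at 0")
    if any(X[i] >= X[i + 1] for i in range(r)):
        raise ValueError("each level needs at least one channel")
    M_level = nodes_per_level(m)
    return sum((X[i] - X[i - 1]) * M_level[i - 1] for i in range(1, r + 1))


if __name__ == "__main__":
    m = [4, 3, 2]                     # assumed ordering: m_1, m_2, m_3
    print(nodes_per_level(m))         # clusters at levels 1, 2, 3
    print(effective_channels(m, [0, 4, 6, 7]))
```

Shifting a partition point toward the upper levels trades heavily re-used local channels for global ones, so the same function can be used to compare candidate partitions $X$.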
Low-latency, high-throughput interprocessor communication is achieved with a unity diameter and a unity degree with SWHA. There are additional advantages with this approach:
• The bandwidth provided to each level of the hierarchy can be dynamically increased or decreased depending on the local/global communication characteristics. Adaptive bandwidth allocation is achieved by shifting the local/global crossover point $X_i$, $1 \le i \le r$, at each level of the hierarchy through the tunable AOTF within each FatNode. At least one channel must be allocated to each level to maintain complete connectivity.
• The design of the FatNode does not require AOTF tuning speeds greater than the microsecond speeds already achieved.
• The design constraints on the AOTF filter within a FatNode are further relaxed since individual channels, which would require very narrow filter bandwidths, do not need to be selected. The control is simplified since only a single acoustic frequency needs to be specified to identify the crossover point.
Each processor is required to possess a receiver capable of receiving a total of $r$ channels. A typical implementation of the receiver subsystem would be a single (fixed or slow tunable) multi-channel filter such
Routing: Routing is simple since the network has unity diameter and intermediate routing is not necessary: all packets are transmitted directly from the source to the destination processor with no electronic intervention. The routing algorithm must only determine the appropriate level and channel based on the source and destination addresses, the channel partition $X$, $C$, and $M$. Once the level and channel have been determined, the packet is transmitted according to the access protocol on the home channel of the destination.
A processor is denoted by an $r$-tuple as defined in Eqn. (2). Consider two processors: $A = (a_r, a_{r-1}, \ldots, a_1)$, the source node, and $Z = (z_r, z_{r-1}, \ldots, z_1)$, the destination node. Fig. 5 illustrates a 3-level structure where $M = 2 \times 3 \times 4$ and $r = 3$. In this example, processor $A = (a_3, a_2, a_1)$ uses channels $\{\lambda_0, \lambda_1, \ldots, \lambda_{X_1-1}\}$ to communicate with processors with an address $(z_3, z_2, z_1)$ if $z_3 = a_3$ and $z_2 = a_2$ but $z_1 \ne a_1$. Channels $\{\lambda_{X_1}, \lambda_{X_1+1}, \ldots, \lambda_{X_2-1}\}$ would be used if $z_3 = a_3$ and $z_2 \ne a_2$; and channels $\{\lambda_{X_2}, \lambda_{X_2+1}, \ldots, \lambda_{C-1}\}$ would be used if $z_3 \ne a_3$. Determining the correct channel on which to transmit the packet is a two-stage process:
1. Determine the minimum level of communication required to reach the destination $Z$ through a digit-by-digit comparison of the source and destination address $r$-tuples: $i$-level communication is required if $z_i \ne a_i$ and $z_j = a_j$ for all $i < j \le r$.
2. Determine the home channel of the destination processor $Z$ at that level. The $i$-level home channel of $Z$ is determined by the allocation policy, which in this case is interleaved, and is given by
$H_{Z,i} = X_{i-1} + (Z \bmod (X_i - X_{i-1}))$
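The two stages above can be sketched as follows. The sketch treats $Z$ in the home channel formula as the decimal identifier of the destination and passes it separately from the address tuples; the function names and the example addresses are hypothetical.

```python
from typing import Sequence


def comm_level(a: Sequence[int], z: Sequence[int]) -> int:
    """Stage 1: minimum level needed to reach z from a.

    a and z are address tuples (x_r, ..., x_1). The level is the index of
    the most significant digit in which they differ; 0 means a == z.
    """
    if len(a) != len(z):
        raise ValueError("addresses must have the same number of digits")
    r = len(a)
    for pos, (ai, zi) in enumerate(zip(a, z)):
        if ai != zi:
            return r - pos            # position 0 holds digit r
    return 0


def level_home_channel(Z: int, i: int, X: Sequence[int]) -> int:
    """Stage 2: interleaved i-level home channel of destination Z.

    Implements H = X_{i-1} + (Z mod (X_i - X_{i-1})), so the result always
    falls inside the wavelength band [X_{i-1}, X_i - 1] owned by level i.
    """
    return X[i - 1] + Z % (X[i] - X[i - 1])


if __name__ == "__main__":
    X = [0, 4, 6, 7]                  # partition used with Fig. 5 in the text
    src, dst, dst_id = (1, 2, 3), (1, 3, 3), 10   # illustrative values only
    level = comm_level(src, dst)
    print("level", level, "channel", level_home_channel(dst_id, level, X))
```

Because both functions depend only on the two addresses and the partition $X$, a source can select the channel locally without consulting any central table.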
Multi-level Access Protocol
The SWHA architecture needs a multi-level media access protocol. Random access or static allocation protocols can be employed at the different levels of the hierarchy. The media access protocol considered in this paper for the SWHA is a multi-level generalization of the single level WDMA protocol described in Section 2.1. Time is slotted on the channels at all levels of the hierarchy. The time-multiplexed protocol avoids collisions on all channels, eliminates the need to support media access level acknowledgments (coherence level acknowledgments are still required as in [20]), and is insensitive to propagation delay.
[Fig. 6: time-wavelength slot allocation map for channels $\lambda_0$ through $\lambda_5$, with entries giving the three-digit addresses of the transmitting nodes.]
Performance comparisons of random and static access protocols for this environment showed that time-multiplexed protocols exhibit excellent performance but are limited in terms of scalability due to the long cycle lengths [25]. However, the protocol presented in this section limits the performance degradation due to long cycle lengths since an $i$-level cycle only depends on the number of processors reachable through $i$-level communication rather than on all processors. With increased communication locality within clusters, the impact of large higher level cycle lengths is reduced.
The multi-level protocol is described through a 3-level example. The time-wavelength diagram is shown in Fig. 6 where $M = 32 = 2 \times 2 \times 8$, $C = 6$, and $X = \{0, 3, 5, 6\}$. The allocation map shows time slotted on all channels, and every node is assigned a slot on each channel per cycle. Slots on every channel are combined into groups of $m_1$. The 1-level clusters of the system are allocated slots in the group assigned to each of the $M_1$ clusters. This ensures fairness among all the clusters in the system and among all the nodes within a cluster. Note that each processor does not maintain the table shown in Fig. 6: this figure is used only for illustrative purposes. An actual processor would determine its assigned slots in a decentralized fashion through the simple algorithm defined below, based on the channel set $\Lambda$, $X$, $M$, $C$ and its own address. Execution of the slot assignment algorithm is not required on slot or cycle boundaries. It is executed infrequently, only to support the dynamic re-allocation of bandwidth through reconfiguration by altering the partition points $X$. The characteristics of this multi-level protocol are:
Cycle Length: Let $W_i$ denote the number of processors reachable by a processor through $i$-level communication ($W_0 = 1$):
$W_i = \prod_{k=1}^{i} m_k$
The cycle length depends on the number of nodes connected at that level: $L_i = W_i$. Every node is assigned a slot on every channel at every level. In the example of Fig. 6, $L_1 = 8$, $L_2 = 16$, and $L_3 = 32$.
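The cycle lengths quoted for Fig. 6 follow from a running product of the fan-outs. The sketch below assumes $m_1 = 8$, $m_2 = 2$, $m_3 = 2$, which is the ordering consistent with $L_1 = 8$; the function name is hypothetical.

```python
from itertools import accumulate
from operator import mul
from typing import List, Sequence


def cycle_lengths(m: Sequence[int]) -> List[int]:
    """Return [L_1, ..., L_r] with L_i = W_i = m_1 * m_2 * ... * m_i.

    m is given as [m_1, ..., m_r]; W_i counts the processors reachable
    through i-level communication, which sets the i-level cycle length.
    """
    return list(accumulate(m, mul))


if __name__ == "__main__":
    print(cycle_lengths([8, 2, 2]))   # fan-outs assumed for the Fig. 6 example
```

Only the top-level cycle grows with the full system size $M$, which is why traffic that stays within a low-level cluster sees a much shorter wait for its slot.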
Node address: Every node is represented by an $r$-tuple address given by Eqn. (2). In the example of Fig. 6 the nodes have addresses with 3 digits.
... are assigned a slot on level-$k$ channels during the cycle with $s_1 \in \{0, \ldots, m_1 - 1\}$. The channel on which a node is permitted to transmit in any slot is described in the channel assignment discussed below. Nodes with $a_2 = 0$ are assigned level-2 channels during slots with $s_2 = 0$. Nodes with $a_2 = 1$ are assigned level-2 channels during slots with $s_2 = 1$. Nodes with $(a_3, a_2) = (0, 0)$ are assigned level-3 channels during slots with digits $(s_3, s_2) = (0, 0)$, nodes with $(a_3, a_2) = (0, 1)$ are assigned level-3 channels during slots with digits $(s_3, s_2) = (0, 1)$, nodes with $(a_3, a_2) = (1, 0)$ are assigned level-3 channels during slots with digits $(s_3, s_2) = (1, 0)$, and nodes with $(a_3, a_2) = (1, 1)$ are assigned level-3 channels during slots with digits $(s_3, s_2) = (1, 1)$.
Channel Assignment: The assignment of slots for $i$-level channels, $1 \le i \le r$, to the nodes can be summarized as follows. At higher levels, clusters are assigned to slots as described above. The channel on which a node with $a_1 = j$ transmits in a slot with $s_1 = k$ is given by c(i, j, k) = (...
A node can transmit on the assigned channel if the assigned channel number is within the valid range at that level. Nodes which have slots assigned on channels that are not within the valid range for that level do not transmit in that slot.
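Eqn. (2) is not reproduced in this section, so the sketch below assumes a plain mixed-radix encoding in which digit $a_1$ selects the node within its 1-level cluster and the higher digits select the enclosing clusters. Under that assumption it lists the address digits that the slot digits must match for a node to own a level-$k$ slot, following the level-2 and level-3 examples above. Both helper names are hypothetical, and the channel function $c(i, j, k)$ is deliberately not modeled.

```python
def node_address(index, m):
    """Mixed-radix digits (a_r, ..., a_1) of a node index (assumed encoding).

    m = [m_1, ..., m_r] lists the branching factors, least significant first.
    """
    digits = []
    for radix in m:
        digits.append(index % radix)
        index //= radix
    return tuple(reversed(digits))

def slot_digits_to_match(address, k):
    """Digits (a_k, ..., a_2) that (s_k, ..., s_2) must equal at level k.

    Returns an empty tuple for k = 1, where every node in the cluster
    receives a slot in each level-1 cycle.
    """
    r = len(address)          # address[0] is a_r, address[r - 1] is a_1
    return address[r - k : r - 1]

# Node 13 in the M = 2 x 2 x 8 example (m_1 = 8, m_2 = 2, m_3 = 2).
addr = node_address(13, [8, 2, 2])
level2 = slot_digits_to_match(addr, 2)
level3 = slot_digits_to_match(addr, 3)
```

For node 13 this yields the address $(a_3, a_2, a_1) = (0, 1, 5)$, so the node would use level-2 channels in slots with $s_2 = 1$ and level-3 channels in slots with $(s_3, s_2) = (0, 1)$.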
Channel Partition
An objective of the SWHA is to support communication when there is very little inherent locality, yet adapt to take advantage of locality when it exists by providing increased bandwidth to the lower levels. The SWHA allows the channel allocation to adapt to variations in reference locality on a process-by-process basis by shifting the partition configuration $X$ through the retuning of the FatNodes.
The model presented here assumes that locality may be exhibited in general at any level, only relative to the levels above it. Typically, reference locality would be exhibited within a cluster (level) range. Let the $i$-level uniform reference probability be denoted by $\tilde{p}_i$, and the actual $i$-level reference probability when locality is considered (non-uniform reference) be $p_i$. This represents the probability of targeting a destination through $i$-level communication that cannot be reached through $(i-1)$-level communication, $1 < i \le r$, which implies that the source and destination have the same $i$-level parent but reside in different $(i-1)$-level clusters. The uniform reference probabilities are $\tilde{p}_i = W_{i-1}(m_i - 1)/(M - 1)$ for all $1 \le i \le r$, with $W_0 = 1$.

The $i$-level traffic intensity is defined as the mean total traffic destined to $i$-level nodes (the fraction of the total traffic targeted to a node reachable through $i$-level communication). Assuming a geometric traffic generation process with a per-slot generation probability of $g$ per node, the $i$-level traffic intensity is

$I_i = g p_i W_i$ (7)

for all $1 \le i \le r$. $I_i$ represents the traffic intensity within a single $i$-level cluster. $M_i$ is defined by Eqn. (3) to indicate the total number of $i$-level nodes, so $I_i M_i$ accounts for the spatial re-use of the channels allocated to support $i$-level communication. The total traffic intensity is

$I = \sum_{i=1}^{r} I_i M_i$ (8)

which reduces to $I = gM$, because $W_i M_i = M$ and the $p_i$ sum to one.

A fair number of channels needs to be allocated to each level based on the relative traffic intensities, with a minimum of one channel assigned to each level to maintain complete connectivity. Let $C_i = X_i - X_{i-1}$ denote the number of channels allocated to support communication within a particular $i$-level cluster. Fair allocation can be determined from a channel partitioning factor $\omega = I_i / C_i$ for all $i$, $1 \le i \le r$, so

$\omega = \frac{1}{C} \sum_{i=1}^{r} I_i$ (9)

and therefore

$C_i = I_i / \omega$ (10)

for all $1 \le i \le r$. Eqn. (10) can be expanded to show that it is just a weighted sum:

$C_i = C \, \frac{p_i W_i}{\sum_{j=1}^{r} p_j W_j}$ (11)

Eqn. (11) shows that the channel partition is independent of $m_1$, since $m_1$ is a common factor of every $W_j$ and cancels. The optimal channel allocation (OCA) for a 2-level SWHA is

$C_1 = \frac{C p_1}{p_1 + p_2 m_2}$

with $C_2 = C - C_1$. Section 3.3 shows that this partition achieves a uniform saturation load at each level.
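The fair allocation can be illustrated numerically. The Python sketch below splits $C$ channels across the levels in proportion to $p_i W_i$, which is what $C_i = I_i / \omega$ amounts to once $g$ cancels. The text does not say how fractional shares are turned into whole channels, so the rounding rule used here (round to nearest, at least one channel per level, then absorb any mismatch in the largest share) is an assumption, as is the function name `fair_partition`.

```python
def fair_partition(p, m, C):
    """Split C channels across levels in proportion to I_i = g * p_i * W_i.

    p = [p_1, ..., p_r] are the reference probabilities and
    m = [m_1, ..., m_r] the branching factors. Returns (ideal, alloc):
    the real-valued shares and an integer allocation. The rounding rule
    is an assumption made for this sketch.
    """
    W, prod = [], 1
    for m_i in m:
        prod *= m_i
        W.append(prod)
    weights = [p_i * W_i for p_i, W_i in zip(p, W)]   # g cancels out
    total = sum(weights)
    ideal = [C * w / total for w in weights]
    alloc = [max(1, round(x)) for x in ideal]          # >= 1 channel per level
    largest = max(range(len(alloc)), key=lambda i: alloc[i])
    alloc[largest] += C - sum(alloc)                   # make the shares sum to C
    return ideal, alloc

# 2-level example used later in the paper: M = 32 x 32, C = 32.
ideal, alloc = fair_partition([0.8, 0.2], [32, 32], C=32)
```

With $\{p_1, p_2\} = \{0.8, 0.2\}$ the ideal share for level 1 is about 3.56 channels, which rounds to $C_1 = 4$ and $C_2 = 28$, that is, the partition $X = \{0, 4, 32\}$.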
Structural Analysis
System scalability, defined in Section 1, is primarily governed by three limitations: physical limitation due to optical power budget considerations, complexity limitations, and performance limitations due to the finite number of channels within the tuning range of optical sources and filters. Section 3.1 examines the physical limitation through an optical power budget calculation and determines the required amplification for the hierarchical architectures. Section 3.2 compares the system complexity of the architectures under consideration. Section 3.3 evaluates the delay-throughput performance via analytic models and discrete-event simulation.
Physical Scalability
The power budget calculations presented in this section are based on the device loss models given in Appendix A. The fanout of a single star coupler is OPB limited to a maximum of a few hundred ports, depending on its structure and realization technology. However, current implementations have not reached the OPB limitations, and current devices are limited to about 100 ports [29]. Physical scalability, defined in Section 1, is achieved for both the FHA and SWHA with optical amplifiers at the hierarchical level boundaries.
FHA: A tree coupler realization is assumed throughout the analysis. $L_T(n_1, n_2)$ denotes the power loss of an $n_1 \times n_2$ coupler and $P_M$ denotes the overall system power margin, as defined in Appendix A. The amplifier gain, denoted as $G$, is assumed to be uniform at all level boundaries. The total losses of an The required amplifier gain for an $r$-level FHA system is given by:
SWHA: One amplifier is incorporated within each FatNode, placed at the input to the AOTF. Due to practical device characteristics, it is assumed that the number of nodes per cluster at any level is less than 256 (64 is a typical figure). The power loss that the signal encounters through a FatNode depends on its direction (up or down the hierarchy), as can be noted from Fig. 4. For a two-level system, it will be shown that one amplifier per FatNode is sufficient to support a system size on the order of a few thousand nodes with a gain requirement of less than 25 dB. A signal path will encounter both the upward and downward directions of the FatNode component. The FatNode loss at level $i$ can be defined as
where $L_w$ represents the AOTF insertion loss. It is desirable to have the amplification requirement uniform throughout all levels. This minimizes the maximum required gain at any level, $\max\{G_k \mid 1 \le k \le r\}$, since some configurations require higher gain at different levels. The initial, nonuniform, gain requirements are:
for all $2 \le k \le r$. Having uniform amplification, $G_1 = \cdots = G_r = G$, requires that for all $1 \le k \le r$, $kG$
It can be shown that in a two-level SWHA the gain requirement is often independent of the node hierarchic distribution $(m_2, m_1)$. This characteristic is important since it alleviates any constraints on the node distribution due to amplification requirements, so the distribution can be mainly determined by a cost-performance metric. It is first shown that
Complexity
This section evaluates the complexity of the FHA, SWHA, and WMCH, and compares them to the hypertree structure of the CM-5. The architectures employ tunable transmitters and fixed (slow-tunable) receivers. The WMCH requires $r$ transmitters per node, while both the FHA and the SWHA require only a single transmitter. A single receiver is sufficient for the FHA. An $r$-level SWHA may use a single receiver with an $r$-channel AOTF to monitor the incoming traffic from all $r$ levels arriving on a single fiber. The WMCH needs one distinct receiver per dimension in each node as a result of its topological definition. Complexity expressions are summarized in Table 1, where the first two entries present the overall node complexity (transmitters and receivers) while the remaining entries present the interconnection network complexity (star couplers, AOTFs, amplifiers, and fiber links). Star couplers have different port configurations depending on the number of nodes attached to each. The number of fiber links in the system also represents the total number of star coupler ports employed. This could be used as a measure of the overall coupler complexity based on the way couplers are fabricated.
The CM-5 employs a hierarchical network with increased capacity at higher hierarchical levels. The network is based on the Fat-Tree [1] , but differs in being a quad-tree (rather than a binary tree) and in the rate of channel capacity increase at higher levels [3] . The CM-5 interconnection scheme is denoted as hypertree in [3] (not the same hypertree introduced in [30] ). The CM-5 hypertree employs data routers at the internal nodes and processors at the leaves of the tree as in the Fat-Tree. However, the CM-5 hypertree data routers provide multiple paths between the lower and upper level nodes [3] . The above comparison shows that the SWHA network has a significant asymptotic complexity advantage. Routing is also simpler since the internal nodes are configured only upon bandwidth re-allocation and the actual destination routing is furnished via the multi-level access protocol in a completely decentralized fashion. However, the CM-5 employs conventional electronic routers that are more readily available at lower cost than the optical filters used by the SWHA. The relatively high data router complexity in the CM-5 provides more data transfer simultaneity that serves its architectural design goal of supporting a cooperative conflict-free communication model. The SWHA employs fewer space channels and bandwidth allocation components (routers) due to the availability of multiple WDM channels on a single fiber. Conflict-free communication is provided as well, due to the time-multiplexed access protocol. Comparing the effective throughput of both networks would largely depend on the objective functionality (code execution mode) in each as well as other factors such as the communication locality in the SWHA and the adopted routing policy in the CM-5 hypertree. The SWHA throughput is evaluated in Section 3.3 based on a global interprocessor communication model with variable spatial reference locality.
Performance
This section is concerned with the performance of the photonic architectures under consideration. Analytic models and stochastic self-driven discrete-event simulation models are used to evaluate the delay-throughput characteristics. Section 3.3.1 presents the analytic models developed for the hierarchical structures. Section 3.3.2 validates the analytic models through comparison to simulation. The impact of varying system parameters on the performance is discussed in Section 3.3.3.
The performance metrics of interest are the total average access delay ($D$) and system throughput ($\Gamma$). The delay is defined as the time between the generation of a request and its delivery (and acknowledgment if necessary) to the destination. The total system throughput is defined as the total number of concurrent transfers per slot. $D_i$ and $\Gamma_i$ denote the delay and throughput of transfers that require $i$-level communication, for $1 \le i \le r$. Time is normalized to the channel slot length.
Analytic Models
The FHA architecture is only physically hierarchical: the communication medium is a flat shared wavelength spectrum of C channels. The hierarchy of the SWHA involves partitioning the channels among the system levels. The arrival process of new traffic generated at each processor has a geometric distribution with parameter g, so g represents the per slot probability of a node generating a communication request. The preallocation access protocol requires separate buffer space for each channel to avoid head-of-line blocking [25] . This allows a simple but accurate analytic model to be developed based on Geom/D/1 queues.
FHA: A typical channel queue is considered. The geometric arrival rate to a given channel queue is $g/C$, which assumes a uniform reference model and that $C$ divides $M$ exactly, so that each channel queue is responsible for an equal number of destinations and all queues have equal arrival rates. The model developed for the SWHA considers in detail the case when the number of channels allocated to a level does not exactly divide the number of nodes communicating through that level.
Instead of a uniform reference model, a non-uniform client/server reference model was also considered. Let $r_{ij}$ denote the probability of a packet being destined to node $P_j$ after being generated at node $P_i$. Let $P_1$ be the server, then for a client node $P_i$, $1 < i \le M$: 1.0. A uniform reference model is used in Section 3.3.3 because the simulation studies showed that traffic partitioning through the pre-allocation of channels limited the performance impact when traffic on one channel is significantly greater than on the other channels. Furthermore, the $C$ transmit queues at each node eliminated any head-of-line effect due to the server traffic. This resulted in performance characteristics of non-server traffic not significantly different from those with the uniform traffic model: the congestion was contained to the server channel and had limited impact on non-server channel traffic.
The service time equals the cycle length, $1/\mu = L = M$, so the channel queue utilization factor is $\rho = gM/C$. An average cycle synchronization time of $(M - 1)/2$ is incurred since arrivals are equally likely to occur at any time instant, so the average access delay is then:
The saturation load is determined by $g_{sat} = C/M$, so the total system throughput is:
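The FHA quantities stated above are simple enough to tabulate directly. The Python sketch below evaluates the utilization $gM/C$ and the saturation load $C/M$. The delay and throughput expressions themselves are not reproduced in this section, so `fha_throughput` rests on an assumption: the carried load equals the offered load $gM$ until the $C$ channels saturate. All function names are illustrative.

```python
def fha_utilization(g, M, C):
    """Channel queue utilization: arrival rate g/C times service time M."""
    return g * M / C

def fha_saturation_load(M, C):
    """Per-node generation probability at which the utilization reaches 1."""
    return C / M

def fha_throughput(g, M, C):
    """Transfers per slot, ASSUMING offered load g*M is carried up to C."""
    return min(g * M, C)

# One of the validation cases: M = 256 nodes sharing C = 8 channels.
g_sat = fha_saturation_load(256, 8)
rho = fha_utilization(0.01, 256, 8)
```

For $M = 256$ and $C = 8$ the saturation load is $g_{sat} = 0.03125$, and a load of $g = 0.01$ gives a utilization of 0.32.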
SWHA: Channel queues of all levels behave as Geom/D/1 queues when $m_1 \ge C$. This constraint guarantees that no source node is given a chance to access more than one destination in a single time slot, as shown in the allocation map of Fig. 6. The constraint is practical since $C$ is limited by the tuning range of the optical devices and the most cost-effective configuration of the system is the one with the largest possible value of $m_1$. Removing this constraint results in service interruptions at lower levels.
Apart from the above constraint, the number of channels is not related to the system size or configuration. This section determines the delay and throughput of traffic that requires $i$-level communication, for $1 \le i \le r$, and then determines the total average of the performance metrics.
$W_i$ denotes the number of processors reachable through $i$-level communication, and $C_i$ is the number of channels allocated to this level, so there are a total of $C_i$ transmit queues used to pre-sort outgoing $i$-level traffic. The transmit queues may be further classified based on the number of nodes that use the corresponding channel as a home channel. In general, there will be $N_{ci} = W_i \bmod C_i$ queues for channels that are home channels for $\lceil W_i / C_i \rceil$ destination nodes (called ceil-queues), and $N_{fi} = C_i - (W_i \bmod C_i)$ queues belonging to channels that are home for $\lfloor W_i / C_i \rfloor$ destination nodes (called floor-queues).
The behavior of the ceil-queues and floor-queues needs to be considered individually: a ceil-queue is served by a channel that is home for one more destination node than a floor-queue, so the likelihood of an outgoing request joining a ceil-queue is higher.
All queues are assumed for modeling purposes to have infinite capacity. The model considers one ceil-queue and one floor-queue at each level as typical queues. The probability of an outgoing packet joining an $i$-level ceil-queue is $q_{ci} = \lceil W_i / C_i \rceil / W_i$ and the probability of a packet joining a floor-queue is $q_{fi} = \lfloor W_i / C_i \rfloor / W_i$.
A node does not usually target itself, but this is ignored for simplicity in determining the probability of joining a queue. This simplification is shown in Section 3.3.2 to have no effect on the model accuracy for various system parameters in the model validation graphs of Fig. 9 .
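The ceil-queue and floor-queue bookkeeping can be checked with a few lines of Python. The sketch below follows the definitions of $N_{ci}$, $N_{fi}$, $q_{ci}$, and $q_{fi}$ given above; the function name and the dictionary keys are illustrative choices, and the example values ($W_i = 1024$, $C_i = 28$) are taken from the 2-level partition $X = \{0, 4, 32\}$ discussed later.

```python
def queue_classes(W, C):
    """Classify the C transmit queues of a level that reaches W nodes.

    Returns the number of ceil- and floor-queues, the number of home
    destinations behind each kind, and the per-queue joining probabilities.
    """
    ceil_dest = -(-W // C)        # integer ceiling of W / C
    floor_dest = W // C
    n_ceil = W % C
    n_floor = C - n_ceil
    return {
        "n_ceil": n_ceil, "n_floor": n_floor,
        "ceil_dest": ceil_dest, "floor_dest": floor_dest,
        "q_ceil": ceil_dest / W, "q_floor": floor_dest / W,
    }

# Level 2 of a 32 x 32 system with 28 channels allocated to that level.
lvl2 = queue_classes(1024, 28)
total_prob = lvl2["n_ceil"] * lvl2["q_ceil"] + lvl2["n_floor"] * lvl2["q_floor"]
```

Here 16 ceil-queues each serve 37 destinations and 12 floor-queues each serve 36, which accounts for all 1024 destinations, so the joining probabilities summed over all queues equal one.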
The arrival process to an $i$-level queue has a geometric distribution with parameter $g p_i q_{ci}$ for ceil-queues and $g p_i q_{fi}$ for floor-queues. The deterministic service time of any $i$-level queue is $W_i$. The average delay of an $i$-level ceil-queue is therefore given by Eqn. (20) and the average delay of an $i$-level floor-queue by Eqn. (21), so the total average $i$-level delay is

$D_i = q_{ci} D_{ci} + q_{fi} D_{fi}$ (22)

The overall average access delay for SWHA can now be expressed as a weighted sum of the individual level delays:
The $i$-level ceil- and floor-queues have slightly different saturation points of $g_{ci,sat} = 1/(p_i \lceil W_i / C_i \rceil)$ and $g_{fi,sat} = 1/(p_i \lfloor W_i / C_i \rfloor)$, respectively. The system saturation load is determined by $g_{sat} = \min\{g_{ci,sat} \mid \forall i \in [1, r]\}$, since $g_{ci,sat} \le g_{fi,sat}$. Due to the characteristics of the media access protocol, the throughput of an $i$-level ceil-queue is $\Gamma_i$. When the system has an optimal channel allocation, $g_{ci,sat} \approx g_{cj,sat}$ for $1 \le i \le r$ and $1 \le j \le r$, so the system saturates at all levels at the same traffic load. Stable behavior is still maintained when there is an imbalance, which the model accurately illustrates in the throughput graph of Fig. 10.
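A small Python sketch makes the saturation rule concrete. It evaluates the ceil-queue saturation point $1/(p_i \lceil W_i / C_i \rceil)$ at each level and takes the minimum as the system saturation load. The function name is hypothetical, and the example values are those of the 2-level, $32 \times 32$ configuration with $X = \{0, 4, 32\}$ and $\{p_1, p_2\} = \{0.8, 0.2\}$ used in Section 3.3.3.

```python
def level_saturation_loads(p, W, C_alloc):
    """Ceil-queue saturation load 1 / (p_i * ceil(W_i / C_i)) for each level."""
    loads = []
    for p_i, W_i, C_i in zip(p, W, C_alloc):
        ceil_dest = -(-W_i // C_i)      # integer ceiling of W_i / C_i
        loads.append(1.0 / (p_i * ceil_dest))
    return loads

# W_1 = 32, W_2 = 1024; C_1 = 4 and C_2 = 28 channels.
per_level = level_saturation_loads([0.8, 0.2], [32, 1024], [4, 28])
g_sat = min(per_level)
```

The two levels saturate at roughly 0.156 and 0.135, which are close to each other, as expected for a near-optimal partition; the system saturation load is the smaller of the two.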
Validation of Analytic Models
This section validates the analytic models developed for FHA and SWHA through a comparison to discrete-event simulation. The simulators for FHA and SWHA are written in C, with a C-based library of routines that provide discrete-event and random variate facilities. Steady state delay and throughput are measured. Simulation convergence is obtained through the replication/deletion method [31] with a 98% confidence in a less than 2% variation from the mean.
A comparison of the results predicted by the analytic model and simulation for FHA is presented in Fig. 8. A varying number of channels ($C \in \{8, 16, 32\}$) with a fixed system size ($M = 256$) is considered in Fig. 8(a), while Fig. 8(b) plots a fixed number of channels ($C = 8$) with a varying number of nodes ($M \in \{64, 256, 512\}$). The graphs show that the analytic model predicts the performance of the system with a high degree of accuracy under conditions of varying system parameters. This set of graphs shows that the analytic model is accurate for a wide variation in system parameters. The behavior of SWHA with varying system parameters is examined in the following sections.
Impact of Varying System Parameters
The impact of variations in communication locality and channel partitioning is investigated first. A 2-level SWHA is considered with the channel partition fixed at $X = \{0, 4, 32\}$. Six cases of reference probabilities to the two levels are considered: $\{p_1, p_2\} \in \{\{0.9, 0.1\}, \{0.8, 0.2\}, \{0.6, 0.4\}, \{0.4, 0.6\}, \{0.2, 0.8\}, \{0.06, 0.94\}\}$. The case of $\{p_1, p_2\} = \{0.06, 0.94\}$ corresponds to a uniform reference traffic pattern. The performance of FHA for the same system size ($M = 1024$) with $C = 32$ is also plotted.

Fig. 10(a) plots the average access time as a function of traffic rate for FHA and SWHA. The reference probability $\{p_1, p_2\} = \{0.8, 0.2\}$ indicates that 80% of all references generated by a processor are directed to processors within the same 1-level cluster, and 20% are directed to 2-level clusters. The channel partition $X = \{0, 4, 32\}$ allocates 4 channels for 1-level traffic and 28 channels for 2-level traffic, creating 156 separate channels. The uniform reference traffic pattern generates traffic destined to all the nodes with equal probability. The delay characteristics of the SWHA with a uniform reference probability are the same as those of the FHA. As the reference probability is varied to increase the references to 1-level nodes with a fixed channel partition, the average access delay decreases and capacity increases. The difference in cycle lengths at different levels, together with the increased 1-level reference probability, results in the decreased access delay. The given partition $X = \{0, 4, 32\}$ is the optimum partition for the reference probability $\{p_1, p_2\} = \{0.8, 0.2\}$. This case has the highest system capacity, as can be seen in Fig. 10(a).

Fig. 10(b) plots the throughput characteristics of SWHA and FHA. Although the FHA and SWHA with uniform reference probability have the same delay characteristics, Fig. 10(b) shows that the SWHA achieves a significant increase in capacity even when there is no locality, due to the spatial re-use of wavelength channels at each level of the hierarchy.
The FHA cannot exploit locality due to its common wavelength space. The maximum throughput of $\Gamma_{\max} = 156$ is obtained with the SWHA even with only a slight reference locality. The maximum throughput depends on $m_1$ and $m_2$, and the rate of increase of throughput depends on the reference probabilities as described in Section 3.3.1.
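The figure of 156 channels follows from counting each level's channels once per cluster in which they are re-used. The Python sketch below performs that count as $\sum_i C_i M_i$, assuming $M_i = M / W_i$ clusters at level $i$ and a $32 \times 32$ configuration for the $M = 1024$ system (the split is stated explicitly only for Fig. 11, so it is an assumption here). The function name is illustrative.

```python
def concurrent_channels(C_alloc, m):
    """Total usable channels sum_i C_i * M_i under spatial re-use.

    C_alloc = [C_1, ..., C_r] channels per level, m = [m_1, ..., m_r]
    branching factors; M_i = M / W_i is the number of i-level clusters.
    """
    M = 1
    for m_i in m:
        M *= m_i
    total, W = 0, 1
    for C_i, m_i in zip(C_alloc, m):
        W *= m_i
        total += C_i * (M // W)
    return total

# X = {0, 4, 32}: 4 channels re-used in each of 32 clusters, 28 at level 2.
channels = concurrent_channels([4, 28], [32, 32])
```

With 4 level-1 channels re-used in each of the 32 clusters and 28 level-2 channels used once, the count is $4 \cdot 32 + 28 = 156$, matching the maximum throughput quoted above.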
Variation in channel partition: The impact of variations in channel partitioning with a fixed communication locality is considered next. The behavior of SWHA at different levels is examined to investigate the impact of channel partitioning on the maximum capacity of each level. The system saturation point is determined by the individual saturation points as described in Section 3.3.1. Fig. 11 plots the access delay and throughput of a 2-level SWHA for $M = 1024 = 32 \times 32$ and $C = 32$, where the partition is varied as $X \in \{\{0, 2, 32\}, \{0, 4, 32\}, \{0, 8, 32\}, \{0, 12, 32\}, \{0, 16, 32\}\}$. The reference probability is fixed at $\{p_1, p_2\} = \{0.8, 0.2\}$. When the traffic load is low, the average 1-level access delay is the average cycle synchronization time $L_1 / 2 = 16$ slots and the average 2-level access delay is $L_2 / 2 = 512$ slots, as shown in Fig. 11(a). The low traffic delay is independent of channel partition. The total average access time is dependent on the reference probabilities and the level access delays as described in Section 3.3.1.
As the partition is varied to increase the channels assigned for 1-level communication with fixed reference probabilities, the 1-level delay decreases but the overall performance of the system may not improve. One case of $X$ is the optimum partition for a particular reference probability set, yielding the lowest total delay and the maximum total system capacity. The optimum partition is the case when all levels in the system have identical maximum capacities. The system saturation point depends on the saturation points of the individual levels, $g_{sat} = \min\{g_{sat,1}, g_{sat,2}\}$, as shown in Fig. 11(a). This illustrates that maximum system performance can be maintained with varying reference locality through the dynamic re-allocation of bandwidth feature of the SWHA.
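The effect of the partition on the saturation point can be explored with the ceil-queue saturation expression from Section 3.3.1. The Python sketch below sweeps the five partitions of Fig. 11 and picks the one with the highest system saturation load. It is an illustration built only on that expression, not a reproduction of the plotted curves, and the function name is hypothetical.

```python
def system_saturation_load(p, W, C_alloc):
    """min over levels of 1 / (p_i * ceil(W_i / C_i))."""
    return min(
        1.0 / (p_i * -(-W_i // C_i))     # -(-a // b) is the integer ceiling
        for p_i, W_i, C_i in zip(p, W, C_alloc)
    )

p = [0.8, 0.2]        # fixed reference probabilities
W = [32, 1024]        # M = 1024 = 32 x 32
C = 32
loads = {
    C1: system_saturation_load(p, W, [C1, C - C1])
    for C1 in (2, 4, 8, 12, 16)          # X = {0, C1, 32}
}
best_C1 = max(loads, key=loads.get)
```

Under these assumptions the sweep selects $C_1 = 4$, i.e. $X = \{0, 4, 32\}$: allocating fewer channels to level 1 makes level 1 the bottleneck, while allocating more starves level 2.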
Variation in system size: The performance impact of varying system sizes is examined through a two-level configuration where $m_1$ is held constant at $m_1 = 32$. Variation in size is accomplished for $M \in \{64, 128, 256, 512, 1024\}$ through an increase in $m_2$. The reference probability is fixed at $\{p_1, p_2\} = \{0.8, 0.2\}$, and $C = 32$. An optimum channel partition is used for each system configuration. Fig. 12(a) plots the access delay-throughput characteristics of SWHA. As described in the previous section, the average access delay at each level is dependent on the cycle length at the levels and the reference probabilities. The 1-level cycle is constant at $L_1 = 32$ in this example, but the 2-level cycle length $L_2$ increases as the system size increases.
Expansion results in increased average access delay if the reference probabilities remain fixed.
Although $C$ remains constant, the number of concurrently usable channels increases as the system size increases through channel re-use, so a larger system has larger capacity, as seen in Fig. 12(a). However, the marginal increase in capacity decreases as the system size increases for this example of fixed $m_1 = 32$ and $r = 2$. For example, a 60% increase in maximum throughput occurs when the system is increased from $M = 2 \times 32$ to $M = 4 \times 32$, but the increase is reduced to 13% when the system is expanded from $M = 16 \times 32$ to $M = 32 \times 32$.

Variation in number of channels: Fig

Variation in system configuration: The impact of variations in configuration for a particular system size is considered next. Fig. 13 plots the delay and throughput of 2-level and 3-level SWHA with a fixed system size of $M = 1024$ and $C = 8$. The 2-level SWHA has $\{p_1, p_2\} = \{0.8, 0.2\}$ and optimum channel allocation for varying configurations of $\{m_1, m_2\} \in \{\{64 \times 16\}, \{32 \times 32\}, \{16 \times 64\}, \{8 \times 128\}\}$. The 3-level SWHA is configured as $M = 2 \times 16 \times 32$ with varying locality $\{p_1, p_2, p_3\} \in \{\{0.8, 0.15, 0.05\}, \{0.8, 0.1, 0.1\}\}$.
System expansion should take place, for both cost and performance reasons, by first expanding the size of the lower levels to the maximum determined by OPB limitations. The improvement in throughput under this policy is due to the increased spatial re-use of lower level channels (1-level in the figure). This policy also results in lower system complexity when scaling up the system, as noted in Table 1. The benefit of larger lower level sizes is maximized when communication locality is higher, since the contribution of the lower level throughput to the system throughput is higher.
Conclusions
This paper introduced photonic-based structures for low-cost high-performance processor interconnection. Hierarchical architectures based on time, space and wavelength multiplexing were examined. The hierarchical approach preserved low cost while the combination of spatial and wavelength multiplexing, with a time multiplexed access technique, achieved excellent performance characteristics. Optical single-hop architectures were devised that eliminated all intermediate routing. In general, system scalability is hindered by the OPB, the number of channels realizable on a link, and the overall system complexity. The FHA overcomes the OPB limitation through optical amplification, but is still limited by the number of channels since all processors share the same wavelength space. The SWHA overcomes the OPB and channel limitations by employing wavelength multiplexing and spatial re-use of channels at each level of the hierarchical architecture. An adaptable system results since the channels can be dynamically allocated to each level of the SWHA based on the reference locality. The SWHA has lower node complexity than the WMCH, a hypercube based structure which also employs spatial and wavelength multiplexing. The performance of the architectures was examined through analytical models and discrete-event simulation. The analytical models are based on Geom/D/1 queues and were shown through a comparison to simulation to be accurate. The impact on the system performance was considered through variations in the communication locality, system size, configuration, number of channels, and channel configuration. The performance comparison between the two hierarchical structures shows that SWHA has consistently better performance than FHA both in terms of average access delay and system throughput. The optimum channel allocation for the SWHA was derived based on reference probability and shown to provide the best performance.
The configuration that provided maximum channel re-use for a given system size was shown to achieve optimal performance. The FHA could not take advantage of any reference locality while the SWHA was shown to achieve significant improvements in performance through its adaptability characteristic even when there was only a weak reference locality present.
Appendix A: System Optical Power Budget
The optical power budget (OPB) limits the maximum number of stations based on the optical power that must be delivered to each receiver to maintain a specified bit-error-rate. The OPB is mainly determined from three factors: power coupled from source, system losses, and receiver sensitivity. The difference between the source power and the receiver sensitivity is defined as the power margin (P M ) which also must account for device characteristic variation. This paper considers star-coupled systems due to their inherent OPB advantage over optical bus based configurations [32] .
Losses in an $n_1 \times n_2$ star-coupled system are mainly due to the coupler since fiber loss is typically less than 1 dB/km and may be ignored for local interconnection [32]. The system losses are mainly affected by the coupler insertion losses due to the following three components: power split losses ($L_s$), excess losses ($L_e$), and connector insertion losses ($L_c$). The losses are dependent on the size and configuration of the coupler. Three schemes to construct a general $n_1 \times n_2$ star-coupler with logarithmic loss factors have been considered: a star through, for example, fused fibers; a dual binary tree, constructed from $2 \times 1$ (integrated or discrete) passive couplers; and a multistage structure constructed from $2 \times 2$ (integrated or discrete) passive couplers.
The star configuration is typical for fiber-based couplers while both the tree and multistage configurations have been considered for integrated couplers or for constructing large arrays out of basic discrete elements. The star and multistage configurations have the same power loss for an identical square configuration. For $n_1 = n_2 = n$, both introduce lower excess loss than the tree configuration, by a factor of $\log_2 n$. Either a star or a single binary tree configuration can be used to construct $1 \times n$ combiner and $n \times 1$ splitter stages. The total insertion losses for the star and tree coupler configurations are characterized below:
$L_S(n) = (3 + L_e) \log_2 n + 2 L_c$ (27)

$L_S(n_1, n_2) = L_S(\max\{n_1, n_2\})$ (28)

$L_T(n_1, n_2) = L_e \log_2 n_1 + (3 + L_e) \log_2 n_2 + 2 L_c$
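The two loss expressions can be evaluated directly, as in the Python sketch below. The per-stage excess loss and per-connector loss used in the example (0.5 dB each) are placeholder values chosen only for illustration, not device data from the paper, and the function names are hypothetical. The constant 3 is the per-stage power split loss in dB that appears in the expressions above.

```python
import math

SPLIT_DB = 3.0   # per-stage power split loss, as in Eqns. (27) and onward

def star_loss(n1, n2, excess_db, connector_db):
    """L_S(n1, n2) = L_S(max(n1, n2)) for a star (or multistage) coupler, in dB."""
    n = max(n1, n2)
    return (SPLIT_DB + excess_db) * math.log2(n) + 2 * connector_db

def tree_loss(n1, n2, excess_db, connector_db):
    """L_T(n1, n2) for a dual binary tree coupler, in dB."""
    return (excess_db * math.log2(n1)
            + (SPLIT_DB + excess_db) * math.log2(n2)
            + 2 * connector_db)

# Placeholder device values (assumed): L_e = 0.5 dB, L_c = 0.5 dB.
ls = star_loss(64, 64, excess_db=0.5, connector_db=0.5)
lt = tree_loss(64, 64, excess_db=0.5, connector_db=0.5)
penalty = lt - ls    # equals L_e * log2(n) for a square coupler
```

For a $64 \times 64$ coupler with these placeholder values, the star loss is 22 dB and the tree loss is 25 dB; the 3 dB difference is the additional excess loss $L_e \log_2 n$ of the tree realization.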
