Introduction and motivation
The difficulty of providing sufficient communication resources between processor and memory elements in parallel, multiprocessor systems has led to many proposals to employ optical interconnects for improved bandwidth and latency [1-4]. These proposals are driven by communication requirements anticipated from significant increases in computing power per node (1 GFLOPS per CPU near term [5]) and system node count, and the recognition that traditional electronic interconnects will have increasing difficulty in meeting these requirements. Enhanced interconnects are required to provide sufficiently rapid access to remote, distributed memory so that available computing power is fully utilized for applications requiring tightly coupled multiprocessing. Cache-coherent, shared memory operation places additional stress on inter-element communications due to the short messages and rapid memory access associated with coherence traffic [6]. In addition, rapid remote access can significantly reduce memory costs for scientific codes in which complex, underlying physics is represented by look-up tables, because large quantities of read-only data need not be replicated locally.
Figure 1: Lambdabus schematic.
In this paper, we focus on the basic WDM star-coupled system, referred to as Lambdabus, rather than larger, massively parallel systems, because its scale conforms to our expectations for the future "sweet spot" of the multiprocessor market and to the needs of embedded systems on mobile platforms, while it also provides a building block for larger machines. Our concern is with the interconnect hardware requirements to provide robust, scalable performance at the level of 100 sustained GFLOPS and a few hundred nodes.
Optical interconnect hardware
Optical transmission over single-mode optical fiber (SMF) offers serialized channel transmission rates of 10-40 GHz and demonstrated potential for 100-channel WDM systems [9] . Unfortunately, such SMF technology is unsuitable for robust, cost-effective computer interconnects and embedded systems for several reasons.
* Tight SMF optical alignment tolerances (0.2 µm to 2 µm for efficient coupling) increase transceiver cost and sensitivity to shock, vibration, particulates, and temperature.
* More optical power is required for error-free transmission at higher serial rates, sacrificing connectivity/fanout and reliability by reducing the power budget [10].
* High-speed serialization adds complex and expensive clock recovery and multiplexing between interconnect and logic speeds. Serial data rates ≥2 GByte/s require ≥10:1 muxing to match anticipated 1 GHz logic speeds [5].
* High serial bitrate is incompatible with MMF dispersion, which limits 8 GByte/s streams to distances <6 m. In certain applications, this constraint restricts the technology's applicability, limiting commercial development and availability.
For these reasons, we pursue a technology based on parallel transmission over multimode fiber (MMF) optic ribbon cables. A similar philosophy governs current work on single-wavelength links by several organizations [11-14], which offer robust optical packaging (tolerances ≈10X looser than SMF) and high channel capacity via the aggregation of multiple (32 demonstrated [11]) fiber bitlines, each running at up to 2.8 GHz [14], while avoiding the difficulties with optical power budget, complexity, and dispersion associated with high-speed serialized links. These links can provide a few GBytes/sec bandwidth with end-to-end latencies of a few nsec (excluding time-of-flight) [15]. The electrical power consumption of this optical transceiver technology is comparable to that of high-performance electronic transceivers [15], and while the cost of the technology is currently high due to its recent commercial introduction, we anticipate significant cost improvements as the technology gains acceptance. The two major issues associated with building upon this technology for a Lambdabus architecture are (i) providing WDM capability and (ii) relatively high optical transceiver "costs".
While these "costs" will likely be acceptable for a small number of parallel transceiver arrays per node, they will prove prohibitive if many arrays are required at each node, for example if a large number of receiver circuits are used, as shown for large n in the "λn Rx's" of figure 1.
The cost of multiple arrays includes both raw financial costs and those deriving from footprint constraints (about 1 in² per array module), the associated packaging, and n:1 multiplexing to access intranode interconnect media.
To avoid a large number of receiver modules per node (as suggested by the above cost rationale), we cannot allocate one receiver array for each system wavelength on every node. Therefore, wavelength-selectable transmitter (Tx) and/or receiver (Rx) modules are required. Wavelength-selectable Rx's can be obtained by either (i) fixed 1:n optical wavelength demultiplexing to multiple receivers, followed by electronic selection of the associated WDM channel, or (ii) tunable optical demultiplexing to a single receiver module. The first approach requires many optoelectronic Rx modules and is precluded by the above cost rationale. The second approach is precluded by the slow (several hundred nsec) tuning times of MMF WDM demultiplexers. We therefore desire a system in which a few fixed wavelengths are received at each node, using fixed demultiplexers and one Rx module per received wavelength. This approach requires rapid selection of Tx wavelengths to achieve low latency, a capability not available in current versions of MMF array interconnects.
Figure 2 shows our proposed Tx module design, which provides ≈1 nsec wavelength selection, broadcast capability, and large output power using a single module containing two optoelectronic chips. The first chip contains an array of A laser diodes, each emitting at a different wavelength, with A equal to the total number of wavelengths in the system. The second chip contains two arrays of semiconductor optical amplifiers (SOAs) interconnected by a passive star coupler. The lasers emit continuously, and the Tx wavelength is selected in the optical domain by gating one SOA array (the leftmost in figure 2). The second SOA array provides modulators to impress word-wide electronic data onto the word-wide spatial channels realized via broadcast over the star coupler.
A similar split-and-modulate approach for single-wavelength parallel Tx's has been proposed elsewhere [16] ; our module differs in its WDM capability and use of SOAs to provide wavelength-insensitive modulation and high power output.
The integration technologies required to realize each of the two chips have already been demonstrated at several research labs [e.g., 17]. Particular advantages leading to the design of figure 2 are:
* Optical, rather than electronic, wavelength selection with ≈1 nsec SOA gating eliminates on-chip laser thermal transients, which cause wavelength drift [18].
* WDM multicast capability.
* SOAs improve the optical power budget for large fanout and hedge against degradation or high-temperature operation.
* All spatial channels (MMFs) are driven with exactly the same wavelengths.
* The two-chip approach simplifies fabrication (only one active device type per chip), and permits the use of cleaved end facets for laser cavity feedback.
From a link-level perspective, the proposed Tx provides rapid wavelength selection with bandwidth, latency, footprint and power consumption comparable to those of the current, single-wavelength Tx modules described above [12]. The number of wavelength channels A is limited by the SOA gain-bandwidth (60-90 nm) and stability constraints on the interchannel spectral spacing. We anticipate that modules with A=16 to 64 wavelengths should prove feasible. Preliminary, proof-of-principle link demonstrations at 1 Gbit/s per fiber show low bit-error rates even with a large mode-selective loss. We anticipate a skew comparable to that of single-wavelength transceivers [11-14], which will limit bitrate to a few Gbit/s per fiber.
Lambdabus system architecture
The preceding discussion leads to a Lambdabus configuration in which each node contains a single, wavelength-tunable Tx and a few fixed-wavelength Rx's. The number of system wavelengths A is less than the number of nodes N, and each node does not receive all A channels. In particular, we assumed the "lowest Rx cost" configuration in which each node receives only one wavelength channel carrying memory access traffic. While increasing the number of memory traffic wavelengths received per node will undoubtedly improve system performance, for example by enabling snoopy or partial snoopy coherence protocols [3, 19], this assumption was made to assess the performance of the minimal (low-cost) system using the simulations described below. These assumptions imply that each distributed portion of main memory is remotely accessed by one unique system wavelength, and that some form of directory-based coherence protocol [5] must be used if cache-coherent, shared memory operation is desired. Since several nodes receive on the same WDM bus, our assumed implementation incorporates multicast ability, which aids cache grouping schemes that minimize the cost of coherence directory memory [6]. It also adds contention, however, since messages destined to different nodes may require transmission on the same bus.
Since each of the A optical busses shares the fiber cable medium among all nodes, each requires a multi-access control (MAC) protocol to ensure transmission of only one message per bus at any given time. Previously proposed MACs for passive stars are random access (e.g., ALOHA) [1,2] and pre-allocation (time-slotting) MACs [3]. The former limits capacity under heavy traffic load (to 37% for slotted ALOHA), while the latter increases latency under light load. We therefore propose to use arbitration, as in [22], and envision a "replicated arbitration" approach in which control information (medium access requests) is broadcast and received by all nodes, using a control bus implemented with a time-slotted MAC and either separate fiber cabling or a separate, out-of-band wavelength (e.g., a 1300 nm control bus wavelength in addition to the A memory traffic wavelengths in the 800 nm band). All nodes process control information using identical replicas of the same VLSI arbiter, similar to those in electronic busses [23]. This approach enables the fast MAC associated with "centralized" bus arbitration, while maintaining the fault tolerance of "distributed" arbitration.
While arbitration adds latency to our interconnect, implementing the control channel with the same aggressive technology as the data channels minimizes delay. We estimate the achievable arbitration latency Larb as

$$L_{arb} = \frac{N \cdot I_{cntrl}}{B_{cntrl}} + TOF + T_{arbiter} \qquad (1)$$

where N is the number of nodes, Icntrl = log2 A is the control information required from each node, Bcntrl is the control bus bandwidth, TOF is the time of flight across the interconnect (5 nsec/m in glass + 2 nsec optoelectronic delay), and Tarbiter is the time to decode and arbitrate the control information. Aggressive optoelectronics can provide Bcntrl = 32 GB/s (128 fibers wide, 2 Gb/s per fiber), resulting in a control bandwidth latency N·Icntrl/Bcntrl = 5 nsec for N=256 and A=32. Since each wavelength is arbitrated separately and in parallel, the arbitration time is Tarbiter = 3·log2 N gate delays [23]. These values yield an estimated total arbitration latency Larb = 20 nsec for a 256-node, 32-bus system of 2 meter spatial extent. Using less control channel optoelectronics (32 fibers wide, 1 Gb/s per fiber) yields Larb = 53 nsec.
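The arbitration latency estimate can be checked numerically. The following is a minimal sketch, assuming the additive form Larb = N·Icntrl/Bcntrl + TOF + Tarbiter described in the text. The function name and its parameters are illustrative, and the 3 nsec arbiter time is an assumed value for the 3·log2 N = 24 gate delays, not a figure taken from the text.

```python
import math

def arbitration_latency_ns(n_nodes, n_buses, fibers, gbit_per_fiber,
                           extent_m, t_arbiter_ns):
    """Evaluate L_arb = N*I_cntrl/B_cntrl + TOF + T_arbiter in nanoseconds."""
    i_cntrl_bits = math.log2(n_buses)          # control bits sent by each node
    b_cntrl = fibers * gbit_per_fiber          # Gbit/s, i.e. bits per nanosecond
    t_control = n_nodes * i_cntrl_bits / b_cntrl
    tof = 5.0 * extent_m + 2.0                 # 5 ns/m in glass + 2 ns optoelectronics
    return t_control + tof + t_arbiter_ns

# 256 nodes, 32 busses, 2 m extent, control bus of 128 fibers at 2 Gb/s.
# t_arbiter_ns=3.0 is an assumption standing in for 24 gate delays.
fast = arbitration_latency_ns(256, 32, 128, 2.0, 2.0, t_arbiter_ns=3.0)
print(fast)  # 20.0
```

With the narrower control channel (32 fibers at 1 Gb/s), the control term alone grows from 5 to 40 nsec, which accounts for most of the increase in the quoted estimate.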
Details of simulated Lambdabus system
The performance of the Lambdabus system was assessed using "Cerberus" [24], a discrete event simulator for shared memory multiprocessors, in which algorithm execution at the instruction level is simulated in time steps equal to one CPU clock. The SMP maintains cache coherency using a write-invalidate, write-back, directory-based approach described elsewhere [20]. In brief, a full directory-based scheme is employed [5], using valid and dirty bits to track the state of each datum. Each datum has a "home" memory module, which contains its directory information. If a given memory address is shared or dirty, all other copies are invalidated before any node is permitted to modify the contents of that address. The remainder of this section details the characteristics of the simulated system elements.
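The write-invalidate rule described above can be sketched in a few lines. This is an illustrative toy, not the Cerberus implementation: the class, the function names, and the use of a sharer set are assumptions made only to show how valid and dirty bits at the home module drive invalidations.

```python
class DirectoryEntry:
    """Directory state kept for one datum at its home memory module."""
    def __init__(self):
        self.valid = True      # main memory holds a current copy
        self.dirty = False     # some cache holds a modified copy
        self.sharers = set()   # nodes currently caching the datum

def read(entry, node):
    """A node reads the datum; a dirty copy is written back first."""
    if entry.dirty:
        entry.dirty = False    # owner writes back, so memory is current again
        entry.valid = True
    entry.sharers.add(node)

def write(entry, node):
    """A node writes the datum; every other cached copy is invalidated first."""
    invalidated = entry.sharers - {node}
    entry.sharers = {node}
    entry.dirty = True
    entry.valid = False        # memory copy is stale until write-back
    return invalidated

entry = DirectoryEntry()
for n in (0, 1, 2):
    read(entry, n)
print(sorted(write(entry, 1)))  # [0, 2]
```

Each element of the returned set corresponds to one invalidate message that must cross the interconnect before the write proceeds, which is why shared writes add coherence traffic.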
Node details and memory hierarchy
Each Cerberus node consists of a RISC processor with an instruction set derived from the Ridge 32, a computer manufactured by the now-defunct Ridge Computers, Inc. The processor is compatible with a fully pipelined timing model and supports a large number (256) of (simulated) outstanding requests. The simulated CPU clock was adjusted to 0.4 nsec to obtain near-GFLOPS performance for a single node (single Cerberus CPU) on two algorithms (Table I). While our adjusted clock is somewhat faster than that expected [5], this approach is justified for investigating interconnect performance, due to anticipated near-term GFLOPS node performance from other improvements (e.g., more superscalar units) or increasing CPU count per node.
Table I: Uniprocessor performance (GFLOPS) of the inner product and 9-point stencil algorithms with fast (5 ns) and slow (70 ns) memory. Recoverable entries: 0.71, 0.28.
Cerberus simulates the flow of data and cache coherence traffic and assumes that all instructions are already cached locally. Data availability is determined by the memory subsystem performance. Each Cerberus node includes a 4-way set-associative data cache, with a typical size of 64 cache lines and 1-cycle access time. Cache lines were either 64 B or 128 B.
The main memory modules consist of memory, a memory access controller, and a directory for cache coherence information. If valid data is not locally cached, the processor transmits a request over Lambdabus to the appropriate memory module. Upon receipt, the memory controller queries the coherence directory for the status of the requested data. If the main memory contains unshared, valid data, main memory is accessed locally and the retrieved information is then transmitted back to the requesting node over the optical interconnect. If the data is invalid or shared, the memory controller initiates optical messages to other nodes in order to have the data sent to the requester, in a process described more fully below. Our simulations assume a memory controller response time of 20 nsec, including directory access, which is appropriate for directory implementation in a fast SRAM memory technology. Simulations were performed for two different memory access times of 70 and 5 nsec. The longer time is characteristic of today's DRAM access. The shorter time represents systems for which the entire problem set can be cached (L2) in fast SRAM.
Optical Interconnect
Based on the discussion of §2, the optical interconnect was simulated using the model outlined below.
* Bus architecture: The interconnect comprises A parallel, independent busses. Each bus is asynchronous; nodes issue transmission requests at any time, unconstrained by time-slot boundaries.
* Uncompelled, split-transaction bus protocol: After transmission of a request, the bus is not held by the transmitting node while waiting for a reply or acknowledgment, but rather is relinquished for use in other transactions.
* Arbitration: After a node requests access to a bus, there is a time delay Larb of 20 or 53 nsec due to arbitration latency. These values are justified in §3. The node can transmit on the bus following this delay only if there is no contention for that bus. Contention is resolved on a first-come, first-served basis. Ungranted requests remain queued and need not be retransmitted. Arbiter pipelining is assumed, so that a previously queued message can be transmitted immediately after transmission of the preceding message. While this implies no guard bands, the effect of guard bands during transfer of bus ownership is accounted for in our model for transmission latency. Simultaneous request arrivals are resolved with a pseudo-random algorithm.
* Transmission latency: After permission to transmit is granted, there is an additional transmission time delay given by equation (2) for message size M in bytes, channel bandwidth BA in GByte/s, combined optical and optoelectronic time of flight TOF, guard band for bus ownership transfer Tg, and channel efficiency eff<1 to account for message headers and coding. Data reads are performed with message size M equal to one cache line, and coherence messages (read/write requests, invalidates) are all assumed to be M=1 B. The channel bandwidth BA was treated as a variable parameter in the simulations. Since the efficiency and fixed latency TOF+Tg are implementation dependent, we assumed eff=1.00 and TOF+Tg=0 in the simulations. Our results for a given BA will be comparable to other systems with no pipelining, different efficiency eff, and nonzero latency TOF+Tg, if that system has a link bandwidth Blink given by equation (3). This model for Blink is conservative because it assumes no pipelining of bus transactions; i.e., multiple messages cannot be simultaneously in flight over the same bus.
* Interleaved addressing: Only one bus provides access to a given main memory address or to a given node. Memory addresses are interleaved on the cache line size across memory controllers. Nodes are interleaved across busses.
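The model parameters above can be combined into a rough estimate of the read miss latency on an idle interconnect. The sketch below is a back-of-envelope calculation under stated assumptions, not the simulator's model: it assumes the delays simply add (arbitration and transmission for a 1-byte request, the 20 nsec controller response, the memory access, then arbitration and transmission for one cache line), with eff=1 and TOF+Tg=0 as in the simulations. Function and parameter names are illustrative.

```python
def transmit_ns(message_bytes, bandwidth_gbyte_s):
    """Serialization time with eff = 1 and TOF + Tg = 0 (1 GByte/s = 1 byte/ns)."""
    return message_bytes / bandwidth_gbyte_s

def unloaded_read_miss_ns(line_bytes=128, bandwidth=8.0, l_arb=20.0,
                          controller=20.0, memory=5.0):
    """Additive estimate for reading unshared data with no bus contention."""
    request = l_arb + transmit_ns(1, bandwidth)          # 1-byte read request
    reply = l_arb + transmit_ns(line_bytes, bandwidth)   # one cache line returned
    return request + controller + memory + reply

CPU_CLOCK_NS = 0.4   # simulated CPU clock period
fast = unloaded_read_miss_ns(memory=5.0)
slow = unloaded_read_miss_ns(memory=70.0)
print(fast, fast / CPU_CLOCK_NS)
print(slow, slow / CPU_CLOCK_NS)
```

Under these assumptions the fast-memory case comes to about 81 nsec, or roughly 200 CPU cycles, and the slow-memory case to about 146 nsec. Arbitration accounts for about half of the fast-memory figure, so fixed latencies rather than serialization dominate a lightly loaded bus.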
Workload
We simulated four common scientific application kernels: 1024x1024 matrix-vector multiplication (MVPROD), 256x256 2-dimensional iterative relaxation using a 9-point stencil (RELAX), 256x256 2D complex fast Fourier transform (FFT), and a scatter/gather operation (SG) for a representative finite-element crash dynamics problem (an automobile in DYNA3D). Performance was evaluated from the reduction in total execution time as additional nodes were added to the interconnect, as quantified by the "speedup", the ratio of single-node execution time (e.g., Table I) to N-node execution time. Execution rate for the simpler numerical algorithms (MVPROD, RELAX) was also quantified in GFLOPS, with each FLOP corresponding to one single-precision operation (add or multiply).
MVPROD employs row-by-row assignment of matrix portions to nodes. The matrix is read from memory, without being shared or over-written. This stresses only memory bandwidth (no coherence traffic).
In RELAX, each node reads an overlapping submatrix and computes an (M/√N) × (M/√N) submatrix. Processors share a small amount of data, generating a small amount of cache coherence traffic.
FFT computes a 2D complex FFT using two sequences of row-by-row 1D FFTs followed by matrix transposing. Data sharing generates cache coherence traffic.
SG performs the gather and scatter operations of a finite element code, of the form:
gather:
where E is an element quantity (e.g., stress), X is a vertex quantity (e.g., force), and P is a connectivity array (which maps elements to vertices). The parallel algorithm assigns an equal number of elements to each node. The selected P represents a commercial auto (Ford Taurus) with a problem set of 26729 vertices, 340 hexahedral elements, 140 beam elements, and 27873 shell elements. Unlike the other codes simulated, the scatter operation involves significant write access competition among nodes, increasing both coherence traffic and miss latency.
Selected results
Figure 3 shows simulation results for Lambdabus performance. The figure shows performance in GFLOPS (MVPROD, RELAX) or speedup (FFT, scatter-gather) as a function of the number of ≈1 GFLOPS nodes in the system, for an interconnect with A=8 or 32 busses, each with BA=8 GByte/s bandwidth. Additional simulation parameters are 128 Byte cache line size, 8 KB 4-way set-associative cache, and fast arbitration (20 ns) and memory access (5 ns) times. These results show that an optical bus can support scalable computing to 256 nodes at the 100 to 150 GFLOPS level, using only one receiver per node for reduced optoelectronic interface cost as well as a small number of busses. A speedup of 50-200X was obtained for all algorithms, provided a sufficiently large problem was executed. For excessively small problem size, such as SG on a smaller data set than described above (2088 elements, lowest curve, figure 3), speedup was limited by the intrinsically small parallelism of the problem. Notably, system performance saturates at A=8 to 16 optical busses, which is significantly fewer than the number of nodes N. Performance did not vary dramatically with the details of the memory system within each node. This is shown in figure 4 for the MVPROD algorithm, for which ≈100 GFLOPS performance is achieved using either slow memory access (70 ns) or fast memory access (5 ns).
Similarly, relatively small (20 to 40% for MVPROD) effects were observed due to changes in cache line size (64 versus 128 B) or arbitration latency (20 versus 53 ns).
Scalable performance depends on several factors in addition to the interconnect performance, such as the simulation problem size (discussed above in reference to the scatter-gather algorithm), system size (scalability limited to N), algorithm properties (e.g., computation to communication ratio), and traffic details. To assess interconnect performance under the varying traffic patterns of the different algorithms, we measured the average cache miss latency, or the time delay required to fetch a datum that is not in local cache. This time delay includes all latency associated with network access and communication to fetch the datum, as well as any required invalidations (if the datum is remotely cached) and memory accesses. Figure 4 shows that this latency for the MVPROD algorithm is 100 to 300 CPU cycles under light traffic loading (small node count N), which is small enough for adequate performance. The interconnect enables scalable performance provided that the latency remains at this level as node count increases. Beyond a critical system size, the latency increases, marking the transition from an interconnect limited by fixed latencies to one limited by throughput capability. For large N, a simple shared medium model suggests that the latency should scale linearly with N. The simulations show similar behavior, N^x, with exponent x=1.04 to 1.22. Increasing the number of optical busses A increases throughput, and thus increases the system size supported without latency penalty.
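A scaling exponent of this kind is typically obtained from the slope of a log-log fit of miss latency against node count in the throughput-limited regime. The sketch below shows one way to do this with an ordinary least-squares fit. The function name is illustrative, and the input data are synthetic values generated from an exact power law, not simulation results.

```python
import math

def fit_exponent(node_counts, latencies):
    """Least-squares slope of log(latency) vs log(N), i.e. x in latency ~ N**x."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(t) for t in latencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic illustration only: latencies generated from an exact N**1.1 law.
nodes = [64, 128, 256]
synthetic = [3.0 * n ** 1.1 for n in nodes]
print(round(fit_exponent(nodes, synthetic), 3))  # 1.1
```

An exponent of exactly 1 corresponds to the simple shared-medium model; values slightly above 1 indicate mildly superlinear growth in latency with node count.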
Matrix-vector multiplication performance depends primarily on the speed of remote memory accesses because the kernel involves no coherence traffic. For this reason, the performance for MVPROD closely follows the behavior of the cache miss latency. Figure 4 shows that performance scales while the miss latency is dominated by fixed system latencies, and that performance saturates when the latency becomes throughput limited. The transition from latency-limited to throughput-limited performance clearly indicates the number of optical busses A required to support a given system size. Notably, performance can be improved by reducing memory access time only if the interconnect resources are sufficient to avoid the throughput-limited regime. The cache miss latency for the RELAX and FFT algorithms behaves in a qualitatively similar manner to that observed for MVPROD. Miss latency remains approximately constant below a critical system size, beyond which it increases in slightly superlinear fashion with node count N. The miss latency for RELAX and FFT is always larger than that for MVPROD in the throughput-limited regime because MVPROD communications involve mostly data reads, with little additional cache coherence traffic (write invalidates). The additional coherence traffic necessary to satisfy read requests increases the cache miss latency in the RELAX and FFT codes.
The relationship between cache miss latency and system performance is algorithm specific, due to dependencies on factors other than communication as indicated above. For example, FFT shows better scaling than MVPROD despite longer cache miss latency. The lower cache miss rate for this code reduces the overall dependence of speedup on communication. As a result, the onset of throughput-limited communications causes only weak saturation of FFT speedup (figure 5), as compared to MVPROD (figure 4).
The scatter-gather code behaved rather differently from the other codes simulated (figure 6). The miss latency shows a markedly weaker dependence on the number of optical busses A. This difference arises from the different SG communication pattern. The other codes do not compete for write access to the same data because none of their output data are shared, whereas the scatter operation allows cache lines to be written by several processors. In addition, the large cache line size (128 B) increases false sharing. The resultant increase in SG cache coherence traffic increases miss latency and degrades performance (figure 6). This traffic increases nonlinearly as the number of processors is increased, so that adding optical busses does less to reduce miss latency for this algorithm than for the others.
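The write competition in the scatter step can be illustrated by counting how many cache lines of the vertex array are written by more than one node. This sketch assumes a common scatter form, X[P[e][k]] += E[e], with elements assigned to nodes in equal contiguous blocks. The 1D mesh, the 4-byte vertex values, and all names are hypothetical choices for illustration and do not reproduce the DYNA3D data set.

```python
from collections import defaultdict

def lines_with_multiple_writers(connectivity, n_nodes,
                                line_bytes=128, value_bytes=4):
    """Count cache lines of X written by more than one node during scatter."""
    per_line = line_bytes // value_bytes       # vertex values per cache line
    n_elems = len(connectivity)
    writers = defaultdict(set)                 # cache line index -> writing nodes
    for e, vertices in enumerate(connectivity):
        node = e * n_nodes // n_elems          # equal block of elements per node
        for v in vertices:                     # scatter: X[v] += E[e]
            writers[v // per_line].add(node)
    return sum(1 for nodes in writers.values() if len(nodes) > 1)

# Hypothetical 1D mesh: element e touches vertices e and e + 1.
mesh = [(e, e + 1) for e in range(256)]
print(lines_with_multiple_writers(mesh, n_nodes=4))    # 3
print(lines_with_multiple_writers(mesh, n_nodes=64))   # 8
```

In this toy case only partition boundaries are contended with 4 nodes, whereas with 64 nodes nearly every 128 B line has several writers, and each such line requires invalidations whenever ownership moves between nodes.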
To optimize cost-performance tradeoffs for the multiple optical bus, it is necessary to quantify the requirements on link bandwidth. Notably, the channel bandwidth BA used in the above discussion differs significantly from the bandwidth of the optoelectronic link Blink due to the effect of fixed latencies, as described in eq. (3). For our system, we assume a message size M=128 B, an efficiency eff≈1/1.125, and a fixed latency L = TOF + Tg = 10 ns. Under these assumptions, we found similar behavior for all simulated algorithms. For small node count N, performance is proportional to the aggregate link bandwidth for all optical busses (A·Blink), and depends only weakly on A or link bandwidth provided their product is fixed. At larger node count, however, the performance saturates at lower levels for smaller channel number A. This occurs because the channel bandwidth saturates with increasing link bandwidth, at a value M·eff/L dominated by the fixed latency. Parallel optical busses are required to improve performance in the presence of fixed transmission latencies, whereas increasing link bandwidth results in limited improvement. Such behavior is portrayed in figure 7, which shows the maximum performance (optimized by selecting the best N≤256) as a function of aggregate link bandwidth for the RELAX code. Similar results were obtained for the other algorithms. The figure shows that a link bandwidth of 4 to 8 GByte/s per channel is sufficient to maintain high performance, and that greater bandwidth does not significantly improve performance. It should be stressed that these conclusions depend on the assumption of no transmission pipelining, as discussed in connection with equation 3. Pipelining transactions will improve the latency-limited channel bandwidth, to a value M·eff/Tg, by eliminating time-of-flight effects.
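The saturation of channel bandwidth with link bandwidth can be made concrete with a small calculation. The model below is an assumption chosen to be consistent with the two limiting values quoted in the text (M·eff/L without pipelining, M·eff/Tg with it); it is not necessarily the form of equation (3). It treats M bytes as sent at the raw link rate, eff·M of them as payload, and the fixed latency as paid once per message. The efficiency is taken as 1/1.125 and all names are illustrative.

```python
def channel_bandwidth(link_gbyte_s, msg_bytes=128, eff=1 / 1.125, fixed_ns=10.0):
    """Effective payload bandwidth (GByte/s) of one non-pipelined bus.

    Assumed model: msg_bytes are serialized at the link rate, eff*msg_bytes
    of them are payload, and fixed_ns (TOF + Tg) is added once per message.
    """
    time_ns = msg_bytes / link_gbyte_s + fixed_ns   # 1 GByte/s = 1 byte/ns
    return eff * msg_bytes / time_ns

for b_link in (2, 4, 8, 16, 32):
    print(b_link, round(channel_bandwidth(b_link), 2))

saturation = (1 / 1.125) * 128 / 10.0   # M*eff/L, the limit for very fast links
print(round(saturation, 2))
```

In this model each doubling of link bandwidth yields a smaller gain in effective channel bandwidth, with a ceiling near 11 GByte/s for the stated parameters, which is consistent with the observation that adding busses helps more than adding link bandwidth once fixed latencies dominate.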
Since the assumed latency (TOF +Tg =10 ns) in figure 7 is significantly larger than achievable guard bands Tg of 1-2 ns, we have conservatively evaluated the required link bandwidth. 
Summary and conclusions
We have proposed a robust, high-performance transceiver technology for star-coupled, optical interconnects based on WDM transport over multimode fiber ribbon cables, and shown that this approach enables multiprocessor scaling to at least 256 nodes and about 100 GFLOPS sustained performance for some algorithms. Because the proposed transceiver's wavelength tuning latency is less than that required for bus arbitration, WDM tuning does not impact system performance. Our results quantify requirements on the optical bus in order to realize such systems. Only a moderate number (8 to 32) of wavelengths, each supporting a moderate link bandwidth of ≈4 to 8 GByte/s, are required. Furthermore, each node needs only a single optical bus receiver operating at a fixed wavelength. These parameters are well within the capabilities of the proposed technology.
