Abstract: We propose a scalable all-optical switching architecture for contention-free data center networking. The arrayed-waveguide-grating router (AWGR) based switching fabric uses wavelength parallelism to alleviate contention, reduce latency, and improve scalability.
Introduction
As massively parallel computing architectures and large data storage systems on the scale of petaFLOPS and petabytes are deployed, the performance bottleneck of data center networks has shifted from the computing systems to the communication infrastructure. Today's data centers and computing centers are built by interconnecting many electronic switches, hierarchically concatenated in multi-stage interconnection topologies (e.g., Fat-Tree, Clos, Torus [1]). However, because of the multi-hop interconnections in these designs, the switching latency is relatively high, and the end-to-end throughput of many connection-oriented applications (e.g., TCP-based applications) is therefore relatively low. Moreover, the use of a large number of commercial electrical switches severely degrades the power efficiency of the data center by substantially increasing the energy consumed in the interconnection network.
All-optical switching, on the other hand, benefits from the inherent parallelism of optics, which maps naturally onto the need for massive concurrency and scalability in parallel computing. An Output Queued (OQ) switch with a "speedup" of N [2] is impractical for traditional electrical switches with high line rates and/or a large number of ports, since the fabric and memory of an N × N switch must run N times as fast as the line rate. Unfortunately, at high line rates, memories with sufficient bandwidth are simply not available [3]. However, the capability to transport and switch many parallel data flows on parallel wavelengths in an optical waveguide resolves the severe I/O limitations faced by all electrical data systems, thus making the OQ switch practical today. The scalability of optics in the data plane has already been verified by demonstrations of an arrayed waveguide grating router (AWGR) based all-optical packet switching system scalable to 2 million × 2 million ports and 42 petabit/second throughput with nanosecond switching time [4]. The synergy between the need for parallelism in computing and the inherent parallelism offered by optics therefore demands a scalable control plane as well. In this paper, we decrease the "speedup" from N to k: the N input ports are divided into k contention groups, so that contention is localized within each group. Scalability of the control plane can be achieved by choosing k appropriately for a given N. Simulation results show that, compared with a traditional Ethernet switch, the new optical interconnection system yields a significant improvement in switching capacity and latency. Fig. 1(a) presents a high-level overview of the optical switching system architecture.
The key features pursued in the optical interconnection system are: (a) ultralow latency from greatly simplified arbitration; (b) freedom from head-of-line blocking and contention-free optical switching; (c) optical concurrency; and (d) scalability. At the core of the switch is an optical switching fabric that includes tunable wavelength converters (TWCs), an AWGR, and a shared SDRAM buffer with parallel optical wavelength de-multiplexers (Optical DEMUX) and multiplexers (Optical MUX). The Optical Channel Adapter (OCA), residing between the output ports of the AWGR and the input port of each peripheral node, includes an Optical DEMUX interface at the input and an electrical aggregator at the output. In addition, the switching system includes a control block that processes the header field of each packet and then arbitrates the packet by checking wavelength availability on the output port and the buffer status in the OCA.
Optical switch architecture
The optical switch fabric supports a single-hop star-topology interconnection network with extremely high scalability. By virtue of its well-known wavelength routing characteristic [5], the N × N AWGR provides non-blocking interconnection of N nodes simply by tuning the wavelength on the TWCs. Another notable feature of the AWGR-based switch is that each output port interconnects with multiple input ports on separate and distinct wavelengths. Therefore, placing an Optical DEMUX in the OCA can separate the different flows by wavelength, thus achieving contention-free optical switching. The labels, which carry information including the destination address and packet length, are optically extracted and received at the control block, where arbitration takes place. While packets from different source nodes may contend for a common destination node at the same time, the packets will be on different wavelengths and will be directed to separate queuing buffers in the OCA after being de-multiplexed by the 1:k Optical DEMUX and then converted into electrical signals. When k = N, each buffer exclusively serves a single wavelength carrying packet flows from a specific node. A credit-based flow control mechanism [6] is used to prevent buffer overflow in the OCA. The buffer status, which records the number of available buffers (credits), is transmitted to the control block of the optical switch when certain thresholds are reached. Optionally, packets that fail to obtain either an output wavelength (when k < N is employed) or buffer space (due to insufficient electrical buffer space at the OCA) are temporarily stored in the shared SDRAM buffers via the (N+1)th AWGR output port (an (N+1) × (N+1) AWGR is used in this case). These stalled packets are re-circulated to the (N+1)th AWGR input port once the resources become available.
a338_1.pdf OSA / IPR/PS 2010 PMC2.pdf
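The wavelength routing property used above can be illustrated with a minimal sketch. It assumes the commonly quoted cyclic rule in which a signal entering input port i on wavelength index w leaves on output port (i + w) mod N; the text does not state the exact port and wavelength numbering, so this rule and the function names `awgr_output_port` and `twc_wavelength` are illustrative assumptions rather than the authors' implementation.

```python
def awgr_output_port(input_port: int, wavelength: int, n: int) -> int:
    """Output port reached from input_port on a given wavelength index.

    Assumed cyclic routing rule (not specified in the text):
    output = (input + wavelength) mod n.
    """
    return (input_port + wavelength) % n


def twc_wavelength(input_port: int, output_port: int, n: int) -> int:
    """Wavelength index a TWC would tune to in order to reach output_port,
    obtained by inverting the assumed cyclic rule."""
    return (output_port - input_port) % n


if __name__ == "__main__":
    n = 4  # small hypothetical port count for illustration
    for dst in range(n):
        # Wavelength each input port must use to reach this output port.
        wavelengths = [twc_wavelength(src, dst, n) for src in range(n)]
        # Every input arrives on a different wavelength, so an optical
        # DEMUX at the output can separate the flows.
        assert sorted(wavelengths) == list(range(n))
        print(f"output {dst}: wavelengths per input = {wavelengths}")
```

Under this assumed rule, all N inputs can address the same output at once without sharing a wavelength, which is the property the OCA's Optical DEMUX relies on.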
An electrical buffer is used instead of optical fiber delay lines because, at present, only an electrical buffer allows an arbitrary delay length. Fig. 1(b) presents the architecture of the control block of the proposed optical switch. The control block consists of 2N input ports, a buffer status scoreboard, and arbitration logic. The first N input ports of the control block receive the labels extracted from the N switch input ports. The second set of N input ports receives the buffer status reports from the OCAs and presents them to the scoreboard, which assigns the credits. Since all 2N inputs are signals in the optical domain, optical-to-electrical conversion must first be applied to each input, followed by serial-to-parallel conversion, before the signals are fed into the scoreboard and control logic. The arbitration logic implements a virtual-channel-like flow control mechanism. Each output port can receive optical packets on at most k wavelengths simultaneously. If k equals N, the switch is a non-blocking OQ switch. If k is smaller than N, the input ports are divided into k groups. When contention arises, packets from different groups are allowed to move through the switch without any blocking, but packets from the same group are arbitrated in a round-robin fashion. The blocked packets are switched to the output port connected to the shared SDRAM buffers. After allocating the switch resources, the arbitrator checks the scoreboard to determine whether each allotted packet has available buffer space at the receiving OCA. The buffer status scoreboard kept in the control block records the available space in each buffer. To reduce control overhead, the status scoreboard is updated only when the buffer size rises above the preset high threshold, when it falls below the preset low threshold, or when the status has not been updated for a certain period.
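The arbitration steps described above (grouping, round-robin within a group, then a scoreboard check) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' control logic: the mapping of input ports to groups (`input_port % k`), the per-(output, group) credit counters, and the names `group_of` and `arbitrate` are all hypothetical, and each input is assumed to present at most one request per arbitration cycle.

```python
from collections import defaultdict


def group_of(input_port: int, k: int) -> int:
    """Hypothetical group assignment; the text only says there are k groups."""
    return input_port % k


def arbitrate(requests, k, rr_pointer, credits):
    """One arbitration cycle.

    requests:   list of (input_port, output_port), one per requesting input
    rr_pointer: dict (output, group) -> input port granted most recently
    credits:    dict (output, group) -> free buffer slots at the receiving OCA
                (a stand-in for the buffer status scoreboard)
    Returns (granted, deferred); deferred packets would be sent to the
    shared buffer.
    """
    contenders = defaultdict(list)
    for src, dst in requests:
        contenders[(dst, group_of(src, k))].append(src)

    granted, deferred = [], []
    for (dst, grp), srcs in sorted(contenders.items()):
        srcs.sort()
        last = rr_pointer.get((dst, grp), -1)
        # Round-robin: first requester after the previous winner, wrapping.
        winner = next((s for s in srcs if s > last), srcs[0])
        if credits.get((dst, grp), 0) > 0:
            # Buffer space is available at the OCA: consume one credit.
            credits[(dst, grp)] -= 1
            rr_pointer[(dst, grp)] = winner
            granted.append((winner, dst))
        else:
            # No credit: the selected packet is deferred as well.
            deferred.append((winner, dst))
        deferred.extend((s, dst) for s in srcs if s != winner)
    return granted, deferred


if __name__ == "__main__":
    rr, cr = {}, {(3, 0): 2, (3, 1): 1}
    # Inputs 0 and 2 fall in group 0, input 1 in group 1 (k = 2).
    print(arbitrate([(0, 3), (1, 3), (2, 3)], 2, rr, cr))
    # Next cycle: round-robin now favors input 2 over input 0.
    print(arbitrate([(0, 3), (2, 3)], 2, rr, cr))
```

In the first cycle, inputs 0 and 1 belong to different groups and both proceed, while input 2 loses the round-robin within group 0 and is deferred; in the second cycle the pointer has advanced, so input 2 is served first.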
As shown in Fig. 1, the shared electrical buffer that holds the packets delayed due to contention uses only one input port of the AWGR. A delayed packet can therefore use only one of the k wavelengths on each output port, instead of choosing among all k wavelengths. To reduce the dwelling time of delayed packets in the optical switch, multiple AWGR input ports can be used for the shared electrical buffer. Since each output port can support k wavelengths, the maximum number of AWGR input ports used for the shared electrical buffer is k.
Results and discussion
The following simulation results compare the performance of the proposed AWGR-based optical switch against a commercial Ethernet switch. Both simulations were conducted in OPNET [7] and use a star topology comprising 24 peripheral nodes and one central node that serves as either the Ethernet switch or the optical switch. The Ethernet switch scenario uses the 10Gbps_Ethernet link model and the Cisco WS-C6509 switch model. The optical switch scenario uses a customized 10 Gbps optical fiber model and the proposed optical switch model. Each peripheral node generates traffic whose destinations are drawn uniformly from among the other peripheral nodes. Simulations are conducted using a constant packet size of 1500 bytes. The number of wavelengths used on each output port (denoted as k) is set to 2, 4, and 24 over the different simulation trials. We simulate the number of pairs of AWGR ports used for the shared electrical buffer (denoted as f) for the case where f is 1 (an (N+1) × (N+1) AWGR is required) and the case where f equals k (an (N+k) × (N+k) AWGR is required). We assume that the system's total buffer size, Btot, is fixed and equal to the total number of packets that can be buffered simultaneously. The electrical buffers are placed at either the central optical switch (BS) or each OCA (BOCA). All simulations use a relatively small number of buffers: Btot = 24 × 120 packets (around 4.32 Mbytes) for the entire switch system. Note that the OCA can often be part of an existing backplane switch equipped with gigabytes of total memory. Fig. 2(a) shows the normalized throughput as the load increases. It is clear that the Ethernet switch saturates at a low traffic load level (0.5), while the proposed optical switch can accommodate a heavy traffic load. Fig. 2(b) shows the performance with respect to the End-to-End (ETE) latency for the different switch models.
The ETE latency of the Ethernet switch is much higher than that of the optical switch and, as the figure again shows, the Ethernet switch saturates at a relatively low traffic load. In addition, as shown in Fig. 2(b), a flat ETE latency can be achieved for the optical switch when k is equal to N, since contention does not occur at the central switch. When k is less than N, the latency remains minimal under low and moderate traffic loads. Furthermore, Fig. 2(b) also indicates that increasing the number of AWGR input ports used for the shared electrical buffer allows the optical switch to drain delayed packets more quickly, thus further reducing the overall latency. Finally, the figure shows that placing more buffers at the peripheral nodes also helps to reduce ETE latency. Having sufficient memory on the OCA side triggers flow control less often and uses the switching capacity more fully, so that the number of packets delayed at the optical switch due to flow control is greatly reduced. Fig. 2(c) shows that the contention ratio is slightly higher when the shared electrical buffer of the optical switch uses multiple AWGR ports, while the switch remains contention-free for the case of k = N.
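The simulation parameters quoted earlier in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (24 nodes, Btot = 24 × 120 packets, 1500-byte packets, f = 1 or f = k); the helper names `buffer_bytes` and `awgr_ports` are hypothetical, and "Mbytes" is assumed to mean 10^6 bytes.

```python
N_NODES = 24                 # peripheral nodes in the simulated star topology
PACKET_BYTES = 1500          # constant packet size used in the simulations
B_TOT_PACKETS = 24 * 120     # total buffer budget, in packets


def buffer_bytes(packets: int, packet_bytes: int = PACKET_BYTES) -> int:
    """Total buffer capacity in bytes for a given number of packet slots."""
    return packets * packet_bytes


def awgr_ports(n: int, f: int) -> int:
    """AWGR port count when f port pairs serve the shared buffer: N + f."""
    return n + f


if __name__ == "__main__":
    total = buffer_bytes(B_TOT_PACKETS)
    print(f"Btot = {B_TOT_PACKETS} packets = {total / 1e6:.2f} Mbytes")
    for k in (2, 4, 24):
        for f in (1, k):
            p = awgr_ports(N_NODES, f)
            print(f"k={k:2d}, f={f:2d}: {p} x {p} AWGR")
```

The first line of output reproduces the 4.32 Mbytes figure; the remaining lines list the AWGR sizes implied by each (k, f) combination under these assumptions.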
Conclusion
A novel AWGR-based optical switching architecture is proposed for data center networking. Multiple packets can travel on different wavelengths to the same output port simultaneously, thereby reducing the contention probability at the core switch. Contention groups localize contention and therefore enhance the scalability of the control plane of the optical switch. Overall, the simulation results show that the proposed switch architecture significantly outperforms a commercial Ethernet switch in terms of latency and throughput.
