Abstract-We demonstrate a scalable optical datacenter architecture with multicasting capability. The modular architecture relies on pods interconnected on a WDM fiber ring. Inter-pod traffic is handled on a per-wavelength basis, whereas slotted network operation allows fine switching granularity within the pod. Network operation and dynamic reconfiguration are enabled by the frequency-selectivity of a novel, phase agnostic coherent receiver scheme that obviates the need for DSP at the receiver. System performance is investigated through numerical simulations and successful operation is validated experimentally. Slotted network operation is demonstrated by means of an FPGA implementation.
I. INTRODUCTION
Driven by the massive explosion of cloud applications and high-definition video services, the insatiable appetite for bandwidth resources places increasing demand on the datacenter side. Global annual datacenter IP traffic is expected to reach 10.4 zettabytes by the end of 2019, with the largest portion of this traffic (73.1%) lying within the datacenter [1]. As datacenter networks strive to cope with these emerging trends, the limitations of conventional architectures are surfacing. The traditional "fat tree" approach using higher rate ports towards the root of the tree has a number of drawbacks including underutilization of resources, poor scalability and low energy efficiency. A folded Clos topology can only partially tackle these shortcomings using low-port-rate commodity switches, but its overall cost still scales superlinearly [2]. Besides, such architectures are tailored to north-south traffic and fail to comply with the intra-datacenter traffic demand, which is expanding along the east-west orientation. It becomes apparent that a complete re-architecting of the underlying infrastructure is urgent. Moreover, network dynamicity is also crucial for scaling datacenter networks. The rigid allocation of bandwidth resources forces designers to use over-subscription, resulting in poor average network utilization and exorbitant power and cost curves [3]. To sustain growth and support the emerging disaggregated datacenter model it is imperative to improve utilization of the available resources, by means of architectures providing more fine-grained resource provisioning and better scalability [4].
Optically-switched architectures based on a ring topology have been proposed, using a wavelength-selective switch (WSS) in each rack to route traffic to and from the ring [5]. To further enhance scalability and functionality in the optically-switched network, more elaborate architectures have been proposed combining WSS-based top-of-rack switches with large-radix core switches [12]. The broadcast-and-select design of these concepts offers multicasting capability, but their implementation cost is high due to the use of multiple optical switches per rack. In addition, network dynamicity is limited by the reconfiguration time of the underlying components (WSS, MEMS switches). Wavelength-switched architectures based on fixed wavelength demultiplexers provide shorter reconfiguration times, yet at the expense of poorer scalability and reconfigurability [5].
In this paper we demonstrate a novel, scalable datacenter network architecture based on optical pod switches interconnected in an optical ring. Network operation relies on fast wavelength switching inside the pod (interconnecting multiple racks in a tree topology) with WSS switching in the optical ring interconnecting multiple pods to gracefully scale network dimensions. Fine switching granularity is obtained by means of a novel frequency-selective receiver and slotted network operation, addressing the limitations of current datacenter architectures. The proposed topology allows the network to expand along the east-west axis without the need for sophisticated hardware, thus complying with the stringent cost requirements of the Datacom market. Section II describes the overall methodology and network architecture. Sections III and IV present the evaluation of the phase-agnostic coherent receiver and the FPGA connectivity platform respectively.
II. CONCEPT AND SYSTEM ARCHITECTURE
The proposed architecture is shown in Fig. 1 and follows a modular approach that facilitates system scaling. The primary building block of the architecture is the pod, which in essence is a small, self-contained datacenter. The pod is controlled by an optical pod switch, which serves up to 64 server racks interconnected via identical top-of-rack (ToR) switches. The ToRs are connected to the pod switch in a tree topology through a 1x64 splitter. Scaling to larger network dimensions is achieved by interconnecting multiple pods through a WDM optical ring. This topology is flat in the sense that above the ToR-level switches there is only one additional level of pod switches, allowing the network to expand in the east-west direction instead of the north-south direction as in the case of legacy fat-tree topologies. In order to realize inter- and intra-pod communication, a pair of fast wavelength selective switches (WSS) and optical couplers are located at the east and west of each pod switch. Thus, the incoming signal is split and sent both to the input (west) and to the output (east) WSSs; the same holds for signals transmitted by any ToR switch inside the pod. As far as inter-pod communication is concerned, any incoming signal with destination inside the pod will be routed southbound to the ToR switches by the west WSS but will be blocked by the east WSS so that it will not propagate to the next pod. A signal entering the pod with destination outside the pod will be blocked by the west WSS but forwarded to the next pod by the east WSS. Similarly, a signal transmitted by a ToR within the pod with destination outside the pod will be sent northbound to the east WSS towards the destination pod, but will also enter the west WSS where it will be blocked.
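The forwarding rules above reduce to a single per-wavelength decision at each pod switch, which can be sketched as follows. This is a minimal illustration and not the control-plane implementation: the function names, the zero-based pod indexing and the assumption that ring traffic travels in one direction only (west to east) are ours.

```python
def wss_states(destination_pod, this_pod):
    """Return (west_passes, east_passes) for one wavelength at a pod switch.

    The west WSS passes a signal south to the ToRs only if it is destined
    for this pod; the east WSS passes it on to the ring only if it is not.
    """
    destination_is_local = destination_pod == this_pod
    return destination_is_local, not destination_is_local


def ring_path(source_pod, destination_pod, n_pods=20):
    """List the pods a signal visits, assuming a unidirectional ring
    (an assumption of this sketch) with pods numbered 0..n_pods-1."""
    if not (0 <= source_pod < n_pods and 0 <= destination_pod < n_pods):
        raise ValueError("pod index out of range")
    path = [source_pod]
    pod = source_pod
    while True:
        west_passes, _east_passes = wss_states(destination_pod, pod)
        if west_passes:          # dropped southbound here, blocked eastward
            return path
        pod = (pod + 1) % n_pods  # forwarded by the east WSS to the next pod
        path.append(pod)


if __name__ == "__main__":
    print(wss_states(destination_pod=3, this_pod=3))
    print(ring_path(18, 1))
```

Because exactly one of the two WSSs passes a given wavelength at each pod, a signal is dropped at its destination and does not continue around the ring.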
Intra-pod communication is implemented in the same manner, since any signal transmitted by a ToR with destination inside the same pod will be blocked by the east WSS but will be rerouted southbound towards the ToR switches by the west WSS. Optical amplifiers are located just before the two couplers in order to balance the power loss due to the signal splitting. At the southbound level, each rack is administered by a ToR switch that is equipped with a wavelength tunable transmitter and receiver, tuned to a given wavelength at a given point in time (specified by the control plane, e.g. as in [17]), i.e. all traffic at wavelength λ_i is received by rack j. In the remainder of this paper an implementation of this architecture is considered consisting of 20 pods and 64 ToRs per pod. Each ToR has a 10 Gb/s transmitter, tunable across 64 wavelengths in the C-band. With the use of 64 parallel fibers in the ring, the network under consideration provides full bisection bandwidth of 12.8 Tb/s.
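The dimensioning figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the ideal splitting loss of the 1x64 splitter is our own addition and ignores excess loss, so it should be read as a lower bound on what the amplifiers have to compensate.

```python
import math

PODS = 20             # pods on the ring, as in the text
TORS_PER_POD = 64     # ToR switches (racks) per pod, as in the text
TOR_RATE_GBPS = 10    # transmitter rate per ToR, as in the text

racks = PODS * TORS_PER_POD
aggregate_tbps = racks * TOR_RATE_GBPS / 1000

# Ideal power-splitting loss of a 1xN splitter (excess loss ignored).
ideal_split_loss_db = 10 * math.log10(TORS_PER_POD)

if __name__ == "__main__":
    print(f"racks: {racks}")
    print(f"aggregate ToR capacity: {aggregate_tbps} Tb/s")
    print(f"ideal 1x64 splitting loss: {ideal_split_loss_db:.2f} dB")
```

The aggregate transmitter capacity of 1280 racks at 10 Gb/s matches the 12.8 Tb/s figure stated for the network under consideration.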
The tunable receiver is a phase-agnostic coherent receiver, used for detection of intensity modulated signals (NRZ, PAM) exploiting the local oscillator (LO) of a standard coherent receiver so as to achieve the tunable wavelength selection functionality (section III). This avoids the use of fixed optical filters in the optical distribution network and supports dynamic allocation of wavelengths (i.e. resources) within the pod. Another significant advantage that stems from this receiver architecture is that multicast transmission can be achieved simply by setting the local oscillator of multiple receivers to the same wavelength. The tunable transmitter consists of a tunable laser providing a continuous wave (CW) optical carrier at the desired wavelength, and an optical modulator where the electrical data is converted into the optical domain. It should be noted that wavelength tuning at the transmitter is not a prerequisite for the architecture, but is highly beneficial to avoid wavelength contention and simplify scheduling of the slotted network. In order to leverage existing technology from WDM telecom networks, a 50 GHz wavelength grid in the C-band as specified in ITU's recommendation G.694.1 [6] is considered. As far as the laser is concerned, a digital supermode distributed Bragg reflector (DS-DBR) laser is considered as it achieves fast wavelength tuning in less than 50 ns [7] [8] [9] , providing dynamic reconfiguration to the network.
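As an illustration of wavelength selection and multicast by LO tuning, the sketch below places channels on the 50 GHz grid of G.694.1 (anchored at 193.1 THz) and returns the racks that receive a given transmission. The channel indexing, the function names and the example LO assignment are hypothetical; the text does not specify which 64 grid slots are used.

```python
SPEED_OF_LIGHT_M_S = 299_792_458
ANCHOR_GHZ = 193_100   # G.694.1 anchor frequency (193.1 THz)
SPACING_GHZ = 50       # grid spacing considered in the text


def channel_frequency_ghz(n):
    """Frequency of grid slot n (n may be negative), in GHz."""
    return ANCHOR_GHZ + n * SPACING_GHZ


def wavelength_nm(frequency_ghz):
    """Vacuum wavelength in nm for an optical frequency given in GHz."""
    return SPEED_OF_LIGHT_M_S / (frequency_ghz * 1e9) * 1e9


def multicast_receivers(lo_tuning, tx_channel):
    """Racks whose LO is tuned to the transmitter's channel in this slot.

    lo_tuning maps rack id -> channel index currently selected by its LO.
    """
    return sorted(rack for rack, ch in lo_tuning.items() if ch == tx_channel)


if __name__ == "__main__":
    print(round(wavelength_nm(channel_frequency_ghz(0)), 3))
    # Hypothetical slot: racks 2 and 3 both listen on channel 5.
    print(multicast_receivers({1: 0, 2: 5, 3: 5, 4: 7}, tx_channel=5))
```

Multicast requires no change at the transmitter: the group is defined entirely by which receivers tune their LO to the same channel.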
In order to enable dynamic allocation of the available bandwidth for optimum utilization of resources, slotted (TDMA) operation of the ToR transceivers inside the pod is envisaged. The packets have duration of 200 μs and the intermediate guard periods were chosen to be 10 μs, to match the switching time of the WSS under consideration [10] . This way, resources are dynamically allocated and traffic contention is avoided with sub-wavelength granularity.
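The cost of the guard period in terms of link utilization follows directly from the slot timing. The sketch below computes the fraction of time spent transmitting for the 200 μs packet and 10 μs guard period chosen here and, for comparison only, for a hypothetical guard period limited by the 50 ns laser tuning time; the function name is ours.

```python
def slot_efficiency(packet_us, guard_us):
    """Fraction of the slot period that carries a packet."""
    return packet_us / (packet_us + guard_us)


if __name__ == "__main__":
    # Values from the text: 200 us packets, 10 us guard (WSS switching).
    print(f"{slot_efficiency(200, 10):.4f}")
    # Hypothetical comparison: guard set by 50 ns laser tuning alone.
    print(f"{slot_efficiency(200, 0.05):.5f}")
```

With a 10 μs guard period roughly 95% of the slot period is used for transmission.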
III. PHASE-AGNOSTIC COHERENT RECEIVER
Coherent detection is used in the context of this work as a means for dynamically selecting the desired WDM channel by adjusting the wavelength of the local oscillator at the receiver. The proposed phase-agnostic coherent receiver retains the wavelength-selective properties of coherent detection (as applied in slotted metro networks [11]) but ignores phase information, which is not relevant in intensity modulated datacenter links, thus avoiding the use of computationally intensive DSP for phase estimation. The following subsections describe the architecture and demonstrate concept functionality through simulations and experiment.
A. Receiver architecture
The phase-agnostic coherent receiver is shown in Fig. 2 and is a simplified version of a typical heterodyne receiver. It employs a simple 50/50 coupler followed by a balanced photodetector, thus avoiding the complex 90° hybrid associated with intradyne implementations. The coupler operates as the optical mixer that allows interference of an incoming optical signal with the tunable laser local oscillator (LO). A WDM signal is fed into one of the two inputs of the optical coupler and a continuous wave laser that serves as the LO into the other. A balanced photodetector is employed so as to subtract the outputs detected at each coupler port. Thus, the DC terms of the resulting photocurrents cancel out after subtraction while the sinusoidal terms are added [13]. To accommodate the signal bandwidth at the heterodyne receiver, the LO laser is detuned from the selected channel by a small frequency offset. In the rest of this paper the bitrate is assumed to be 10 Gb/s and the offset is set to 15 GHz. As a result, the selected channel is down-converted to an intermediate frequency (IF) that equals the frequency offset between the LO and the signal, in this case fIF=15 GHz. Following the same procedure, any channel at any wavelength can be detected by tuning the LO to the channel's frequency, offset by fIF. However, since the detected signal is down-converted to the IF, further down-conversion has to be employed. Rather than using an electrical RF mixer, a rectifier circuit can be used instead, performing envelope detection. The rectifier block consists of a simple signal squarer and a low pass filter with appropriate cut-off. An additional low pass filter is usually employed after the receiver and before A/D conversion so as to avoid aliasing from the signal itself or from its neighboring WDM channels.
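The envelope-detection step can be illustrated with a noise-free toy model of the signal after the balanced photodetector. This is a sketch under simplifying assumptions that are ours and not part of the receiver design: ideal rectangular NRZ pulses, 16 samples per bit, and a per-bit average standing in for the low pass filter of the rectifier. The 10 Gb/s bitrate and the 15 GHz IF are taken from the text.

```python
import math

BIT_RATE = 10e9        # 10 Gb/s, as in the text
F_IF = 15e9            # intermediate frequency, as in the text
SAMPLES_PER_BIT = 16   # assumed simulation resolution


def heterodyne_if_signal(bits, phase):
    """Balanced-detector output: on-off envelope riding on the IF carrier.

    `phase` models the unknown signal/LO phase difference.
    """
    dt = 1.0 / (BIT_RATE * SAMPLES_PER_BIT)
    samples = []
    for n in range(len(bits) * SAMPLES_PER_BIT):
        amplitude = float(bits[n // SAMPLES_PER_BIT])
        samples.append(amplitude * math.cos(2 * math.pi * F_IF * n * dt + phase))
    return samples


def rectify_and_decide(if_signal, threshold=0.25):
    """Square the IF signal, average over each bit, then threshold.

    The per-bit average is a crude low-pass filter chosen for this sketch.
    """
    decided = []
    for start in range(0, len(if_signal), SAMPLES_PER_BIT):
        chunk = if_signal[start:start + SAMPLES_PER_BIT]
        mean_square = sum(s * s for s in chunk) / len(chunk)
        decided.append(1 if mean_square > threshold else 0)
    return decided


if __name__ == "__main__":
    tx_bits = [1, 0, 1, 1, 0, 0, 1, 0]
    for phase in (0.0, 0.7, 2.1):
        print(phase, rectify_and_decide(heterodyne_if_signal(tx_bits, phase)))
```

Squaring maps the IF tone to a baseband term plus a component at twice the IF, and the averaging removes the latter, so the decisions do not depend on the phase value. This is the sense in which the receiver is phase-agnostic.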
B. Receiver Simulation
The receiver architecture was simulated using the VPItransmissionMaker™ software. The test WDM signal was generated by an array of six optical transmitters (Tx1-Tx6) operating at 10 Gb/s with launch optical power of 0 dBm and frequencies from 193.1 THz to 193.35 THz, fed into an optical WDM multiplexer. At the receiver side, the WDM signal is mixed in a 2x2 optical coupler with a tunable continuous wave laser that serves as a local oscillator. The photoreceiver consists of two identical photodiodes followed by a filter in order to emulate their bandwidth limitation. A Variable Optical Attenuator (VOA) is employed in order to adjust the received optical power at the balanced photoreceiver so as to evaluate the system's BER performance against input optical power. In order to select a given wavelength, the frequency of the local oscillator (LO) was adjusted to a frequency offset of 15 GHz with respect to the corresponding WDM channel. Each pod has two inputs and two outputs. The optical flow from the ring is fed into Input1. Output1 forwards the signals whose destination is any of the following pods. These signals may originate from inside the current pod or from any of the previous pods. For simulation purposes, Input2 emulates the transmitters of the current pod (added traffic) while Output2 emulates its receivers (dropped traffic). Inside each pod, an EDFA with noise figure 3.8 dB is employed in order to maintain the signal power level constant.
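The LO tuning rule used in the simulation can be illustrated by computing where each of the six channels lands after mixing with the LO. The sketch assumes the six transmitters sit on a 50 GHz grid between 193.1 THz and 193.35 THz and that the LO is placed above the selected channel; the sign of the offset is our assumption, as the text only specifies its magnitude.

```python
FIRST_CHANNEL_GHZ = 193_100   # 193.1 THz, as in the text
CHANNEL_SPACING_GHZ = 50      # assumed grid spacing between Tx1..Tx6
N_CHANNELS = 6                # Tx1-Tx6
IF_OFFSET_GHZ = 15            # LO offset from the selected channel


def channel_grid():
    """Optical frequencies of the transmitters, in GHz."""
    return [FIRST_CHANNEL_GHZ + k * CHANNEL_SPACING_GHZ for k in range(N_CHANNELS)]


def lo_frequency(selected_ghz):
    """LO frequency for a selected channel (LO above the channel: assumption)."""
    return selected_ghz + IF_OFFSET_GHZ


def beat_frequencies(selected_ghz):
    """Electrical frequency at which each channel appears after mixing."""
    lo = lo_frequency(selected_ghz)
    return {ch: abs(ch - lo) for ch in channel_grid()}


if __name__ == "__main__":
    for channel, beat in beat_frequencies(193_150).items():
        print(channel, "GHz ->", beat, "GHz")
```

Under these assumptions the selected channel appears at 15 GHz while its nearest neighbors appear at 35 GHz and 65 GHz, which is what allows electrical low-pass filtering to isolate the selected channel.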
To assess the effect of EDFA noise accumulation for up to 20 pods as well as inter-channel crosstalk due to filtering or nonlinearities between multiple neighboring channels on a 50 GHz grid, ten transmitters Tx1-Tx10 and twenty pods were simulated (Fig. 4). The pods support add-drop functionality enabled by the WSSs of Fig. 1, i.e. each pod was configured to add or drop particular wavelengths coming from or heading to the ring respectively. In order to assess the receiver's efficiency as well as the network's scalability to a large number of pods, the signal was evaluated through Bit-Error-Rate (BER) measurements.
D. Experimental setup and discussion
The experimental setup used for the evaluation of the concept is shown in Fig. 6 . The main purpose of the experiment was to validate the wavelength selectivity of the proposed phase-agnostic coherent receiver, as part of the overall system. For this reason and due to the unavailability of a fast tunable laser at the time of the experiment the tests were carried out with a static wavelength assignment. A 10 Gb/s NRZ signal was produced by a pulse pattern generator (PPG) whose output was introduced to a LiNbO3 Mach-Zehnder Modulator (MZM). An off-the-shelf modulator driver was used to adjust the input power level of the data stream before entering the MZM to reach Vπ. Three Distributed-Feedback lasers (DFB) emitting at 1554.94 nm, 1556.55 nm and 1558.17 nm respectively were multiplexed in a 4x1 WDM Multiplexer (MUX), providing the optical carriers for the MZM. The optical power for each laser source was set at 10 dBm.
The modulated NRZ optical signals were de-correlated by means of optical fibers of different lengths for each signal path. To achieve this, the MZM output was fed to a 1x4 WDM demultiplexer and after decorrelation the signals were multiplexed again via a 4x1 MUX. At the receiver side, the WDM optical signal was fed into a 2x2 optical coupler along with a LO produced by a 13 dBm tunable distributed Bragg reflector (DBR) laser. The wavelength of the DBR LO laser was tuned in order to measure the different channels. More specifically, the difference between the LO and the signal frequencies was set at 15 GHz in order to realize heterodyne down-conversion of the desired signal at an IF of 15 GHz. The two outputs of the coupler served as respective inputs for a commercial balanced photoreceiver. For the time synchronization of the two data streams, an Optical Delay Line (ODL) was used. An Erbium Doped Fiber Amplifier (EDFA) and a Variable Optical Attenuator were employed to adjust the incident optical power in the balanced photoreceiver for BER measurements. The signal and LO were manually co-polarized in the experiment. Polarization issues could be resolved with a polarization diversity receiver [14]. Due to the unavailability of a rectifier for the experiment, its operation was emulated in MATLAB through offline DSP, after signal digitization in a 33 GHz Digital Sampling Oscilloscope with 80 GSa/s sampling rate. After resampling, a 25 GHz low pass filter was used in order to suppress the out-of-band neighboring channels. The rectifier was implemented using a squaring operation followed by low-pass filtering in order to recover the original signal. Finally, the square timing algorithm was employed to recover the optimum sampling point for symbol detection and BER estimation. The 3 dB cut-off frequency of the second low pass filter was varied and the optimum results were obtained for f3dB=8 GHz.
The LO optical power entering the coupler at the receiver was measured to be 11 dBm. In Fig. 7, transmission of a single channel is shown in order to validate the concept of the proposed receiver architecture. Channel 1 is received at the intermediate frequency fIF=15 GHz, and the signal is filtered but not yet squared and low-pass filtered, i.e. before entering the rectifier. As a result, the information of zeros and ones is encoded in the absence or presence of the 15 GHz IF respectively. Fig. 8 shows BER curves of all three channels as a function of the received optical power. For received optical power higher than -8.3 dBm, no errors were observed. It is worth noting that since the record length of the real-time oscilloscope was limited, the maximum number of bits that could be processed was ~2·10^6 and as a result the minimum BER that could be estimated was in the order of 4·10^-7 at 95% confidence level. However, significantly lower BER is expected for the cases where no errors were observed. All three channels exhibited operation below FEC threshold for a wide range of received optical power values. Moreover, no performance degradation was observed in the middle wavelength that has neighboring channels from both sides.
The experimental results prove successful wavelength-selective operation of the phase-agnostic coherent receiver. Thus, the proposed subsystem can serve as a key enabler for the optically-switched architecture described in section II, and could provide rapid network reconfiguration by leveraging the fast wavelength-tuning capabilities of broadly available laser technologies as demonstrated in [7]-[9], enabling guard periods as low as 50 ns for intra-pod traffic. Scaling to larger network dimensions is feasible with a reasonable penalty over the back-to-back performance demonstrated in this section, owing to EDFA noise accumulation as shown in the simulations of section III.
IV. FPGA CONNECTIVITY PLATFORM
Slotted TDMA functionality was evaluated in a separate experiment since real-time detection was not possible with the phase-agnostic coherent receiver due to unavailability of the rectifier circuitry. Instead, a standard photoreceiver with direct detection was used to verify that TDMA operation can be integrated in the proposed datacenter architecture, as shown in the transmission scenarios and results of this section. A 10 Gb/s optical link was established and two Xilinx 7-Series FPGA boards (Virtex VC707 and Kintex KC705) were used as transmitter and receiver. In the course of the experiments we tested the following scenarios: a) one FPGA board transmitting data through the optical path and b) both FPGA boards transmitting data alternately. In both cases there is only one FPGA receiver, which receives and analyzes the data. The FPGAs are configured to function in burst mode operation. The operation time of the FPGAs is divided into periods called transmission periods and silence periods (guard periods). Each transmission period for the Tx FPGA board is 200 μs long and is followed by a silence period. Two different silence period durations were tested: 20 μs and 10 μs. The guard period was realized by transmitting zeroes.
A. Architecture
Fig. 9 depicts the transmitter's architecture. The Preamble Generator and the Data Generator blocks are PRBS generators. The Frame Counter block consists of a counter, which increases each time a frame is sent and thus forms the sequence number of each frame. The control unit resets and enables each of the aforementioned blocks; it also controls MUX1 in order to create frames in the format described in subsection B (Frame Description). The output of MUX1 is forwarded to the Scrambler [15]. The Scrambler is used to facilitate more accurate and quick Clock and Data Recovery (CDR) on the receiver and reduce the inter-carrier signal interference. The synchronization pattern block is a register that contains the 64-bit sequence which delimits the beginning of the payload of the frame (Sync word). The synchronization word is the only part of the frame that is not scrambled before transmission. MUX2 is also controlled by the control unit. Fig. 10 presents the overall architecture of the receiver. The first block of the receiver is the Xilinx GTX transceiver [16]. The GTX transceiver's Serializer-Deserializer (SERDES) performs deserialization of the serial bitstream into 64-bit words. The 64-bit words are forwarded to the Synchronizer, which processes the incoming data in order to identify a sequence matching the Synchronization word. The Synchronizer checks all 64 possible concatenation schemes until it locates the Sync word. Once the Synchronizer detects the Sync word, it keeps delivering data in the format of 64-bit words which are aligned to the Sync word.
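The alignment search performed by the Synchronizer can be sketched in software as a scan over the 64 possible bit offsets within two consecutive deserialized words. This is an illustrative model rather than the FPGA logic: the function names and the MSB-first bit ordering are assumptions, while the 64-bit Sync word value is the one used in the frame format of this work.

```python
SYNC_WORD = 0xC5E51840FD59BB49   # 64-bit delimiter used in this work
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def find_alignment(previous_word, current_word):
    """Return the bit offset (0..63) at which the Sync word starts inside
    the 128-bit window formed by two consecutive words, or None.

    Bits are assumed to arrive MSB first (an assumption of this sketch).
    """
    window = (previous_word << WORD_BITS) | current_word
    for offset in range(WORD_BITS):
        candidate = (window >> (WORD_BITS - offset)) & WORD_MASK
        if candidate == SYNC_WORD:
            return offset
    return None


def realign(previous_word, current_word, offset):
    """Extract the 64-bit word that starts `offset` bits into the window."""
    window = (previous_word << WORD_BITS) | current_word
    return (window >> (WORD_BITS - offset)) & WORD_MASK


if __name__ == "__main__":
    # Build a 128-bit test window with the Sync word starting at bit 13.
    shifted = SYNC_WORD << (WORD_BITS - 13)
    prev, curr = shifted >> WORD_BITS, shifted & WORD_MASK
    print(find_alignment(prev, curr))
```

Once the offset is known, the same shift is applied to every subsequent pair of words so that the delivered 64-bit words remain aligned to the Sync word.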
The Clock Domain Crossing (CDC) FIFO is utilized to ensure that the internal logic of the receiver operates under a stable clock. The GTX performs CDR on the incoming data and provides 64-bit words accompanied by the recovered clock. Since burst mode operation is targeted, during the silence period of the transmitter the GTX is unable to recover a stable clock. At the time that the Synchronizer locates the Sync word the Control Unit enables the Scrambler and the PRBS Generator; both these units are identical with the transmitter's respective units. Therefore, the inputs of the BER counter are two 64-bit words: 1) the de-scrambled PRBS that was sent by the transmitter and 2) the same PRBS sequence that is generated locally. The comparison of these two 64-bit words allows the calculation of the Bit Error Rate (BER) for each received frame. The result for each frame is forwarded to the accumulator; the accumulator's contents and the number of received frames are used to calculate the overall BER. 
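The word-wise comparison and accumulation described above can be modeled as follows. The class and function names are hypothetical and the sketch operates on lists of 64-bit integers rather than on a hardware data path.

```python
WORD_BITS = 64


def count_bit_errors(received_word, reference_word):
    """Number of differing bits between two 64-bit words (XOR + popcount)."""
    return bin(received_word ^ reference_word).count("1")


class BerAccumulator:
    """Accumulates per-frame bit errors and reports the overall BER."""

    def __init__(self):
        self.bit_errors = 0
        self.bits_compared = 0
        self.frames = 0

    def add_frame(self, received_words, reference_words):
        """Compare one frame against the locally generated reference."""
        if len(received_words) != len(reference_words):
            raise ValueError("frame and reference lengths differ")
        errors = sum(
            count_bit_errors(r, e) for r, e in zip(received_words, reference_words)
        )
        self.bit_errors += errors
        self.bits_compared += WORD_BITS * len(received_words)
        self.frames += 1
        return errors

    def overall_ber(self):
        """Total bit errors divided by total bits compared so far."""
        return self.bit_errors / self.bits_compared if self.bits_compared else 0.0


if __name__ == "__main__":
    acc = BerAccumulator()
    acc.add_frame([0b1010, 0], [0b1000, 0])   # one flipped bit in 128
    print(acc.frames, acc.bit_errors, acc.overall_ber())
```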
B. Frame Description
The frame format is shown in Fig. 11. The first part of the frame consists of the preamble. The preamble precedes the payload of the frame to ensure that the receiver (CDR circuit) and the optical path are correctly configured before any valid data are received. The preamble consists of a PRBS sequence. A variety of preamble sizes was tested (from 50 μs down to 8 μs) and in the following sections we present the results (Bit-Error Rate, frame loss) for each size. The field following the preamble is the Sync Word (delimiter). The Sync Word is a unique sequence of 64 bits (0xc5e51840fd59bb49) that delimits the beginning of the frame. The following field is the frame sequence number. The frame sequence number is used to calculate the frame loss, i.e. how many frames were not recognized by the receiver. Finally, the last field of the frame is the Data field (Payload), which in our case contains a PRBS sequence. An additional field, called Tx_id, is used in the case of multiple transmitters to distinguish each frame's origin.
C. Testing Suites
Two different configurations of the interconnected FPGAs were evaluated. The first configuration consists of a single transmitter (KC705) and a single receiver (VC707). The transmitter is configured to transmit frames of 200 μs followed by a silence period of 10 μs. The receiver calculates the BER for each frame, stores the bit errors value in an accumulator, counts the received frames and compares the received frames value with the frame counter to calculate the frame loss.
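The frame-loss calculation can be sketched as a comparison between the span of sequence numbers seen and the number of frames actually received. The function name is ours, and the sketch assumes that the frame counter does not wrap around during the measurement and that losses before the first or after the last received frame are not visible to the receiver.

```python
def frame_loss(received_sequence_numbers):
    """Frames missing between the first and last sequence number received.

    Assumes a monotonically increasing counter with no wrap-around.
    """
    if not received_sequence_numbers:
        return 0
    unique = set(received_sequence_numbers)
    expected = max(unique) - min(unique) + 1
    return expected - len(unique)


if __name__ == "__main__":
    # Frames 3 and 6 were not recognized by the receiver.
    print(frame_loss([0, 1, 2, 4, 5, 7]))
```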
The second configuration involves two transmitters (one transmitter on each FPGA board) and a single receiver. One board is considered to be the Master transmitter (KC705) and the other one (VC707) includes the Slave transmitter and the receiver. The purpose of realizing this setup is to receive frames from both transmitters alternately, i.e. a frame transmitted by the KC705 is followed by a frame transmitted by the VC707 and vice versa. The Master transmitter is configured to transmit frames of 200 μs followed by a silence period of 240 μs. During this silence period the slave transmitter utilizes the link for 200 μs. The result on the receiver is frames of 200 μs followed by silence period of 20 μs. The frames in this case contain an additional field, which is the Tx_id. The Tx_id field is used by the receiver to distinguish which transmitter sent each frame. 
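The timing of the two-transmitter configuration can be checked with a short schedule model. The sketch assumes that the slave frame is centered within the master's silence period, which is our reading of the 20 μs silence observed at the receiver; the function names are hypothetical and all times are in microseconds.

```python
FRAME_US = 200            # frame duration, as in the text
MASTER_SILENCE_US = 240   # master silence period, as in the text


def transmit_windows(n_cycles):
    """Return (transmitter, start_us, end_us) tuples for n master cycles.

    The slave frame is assumed to be centered in the master's silence.
    """
    period = FRAME_US + MASTER_SILENCE_US
    guard = (MASTER_SILENCE_US - FRAME_US) // 2
    windows = []
    for k in range(n_cycles):
        start = k * period
        windows.append(("master", start, start + FRAME_US))
        slave_start = start + FRAME_US + guard
        windows.append(("slave", slave_start, slave_start + FRAME_US))
    return windows


def guard_gaps(windows):
    """Silence seen at the receiver between consecutive frames."""
    return [nxt[1] - cur[2] for cur, nxt in zip(windows, windows[1:])]


if __name__ == "__main__":
    schedule = transmit_windows(2)
    for entry in schedule:
        print(entry)
    print(guard_gaps(schedule))
```

Under this assumption every pair of consecutive frames at the receiver is separated by 20 μs, and the two transmitters alternate without overlap.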
D. Experimental Results

V. CONCLUSIONS
We demonstrated a novel scalable datacenter architecture employing WDM optical switching and slotted TDMA operation. The architecture is enabled by a simple phase-agnostic receiver topology that allows for multicast transmission. The system was simulated for different scenarios investigating its scalability up to 20 pods, corresponding to 1280 server racks. The receiver concept was validated experimentally for three WDM channels with the use of offline DSP for signal rectification, exhibiting robust performance and operation below the FEC threshold. Slotted TDMA operation was also experimentally verified in a 10 Gb/s optical link scenario with real-time FPGA implementation for the transmitter and receiver. Simulations and experimental results indicate robust performance and prove the successful operation of the overall system and the viability of the architecture in dynamically-reconfigurable datacenter networks.
