Abstract: High-speed, submicrosecond-latency, large-port-count (thousands) optical packet switches (OPSs) for intercluster communication networks can become a key element in the deployment of cloud-oriented large-scale data centers. In this work we numerically investigate the performance of a large-port-count wavelength-division multiplexing (WDM) OPS based on a Spanke-type architecture with highly distributed control. We analyze it under a data center traffic model to determine its suitability for this type of environment. Results indicate that the proposed architecture can be scaled to 4096 ports while providing packet loss below 10⁻⁶ and latency under 1 μs, with a total switching capacity over 55 Tbits∕s. Additionally, we propose and analyze two WDM OPS architectures. The first one detects and processes small and large-sized Ethernet packets with two parallel switches. The second architecture includes multiple receivers to decrease packet losses and latency while using very limited electronic buffers. Results indicate that both techniques can lead to substantial improvements. In terms of packet loss and latency, they allow up to 40% higher input load with respect to the original WDM OPS architecture.
I. INTRODUCTION
As Internet services increasingly grow, the trend toward cloud-oriented, server-side computing has led to the proliferation of warehouse-scale computers with hundreds or thousands of servers. Data center networks typically deploy a multilayer, hierarchical, fat tree topology based on gigabit Ethernet switches. Racks of computers grouped into clusters communicate among themselves by means of cluster switches. 10 Gbit∕s Ethernet switches are common in the top hierarchical levels of data centers. However, their limited bandwidth and the latency induced by the network topology create bottlenecks between groups of servers located at different clusters. Communication lag between different clusters can be on the order of a few hundreds of microseconds [1].
Optical packet switches (OPSs) scaling to thousands of ports can represent a valuable solution for removing the bottlenecks of electronic intercluster data center networks and reducing the power consumption, for three reasons. First, they speed up the network beyond 40 Gbits∕s, allowing higher communication bandwidth. Second, they break the fat tree structure and enable a flat network, as pictured in Fig. 1. A flat and faster network with less latency can fill the gap between the growth rate in current data centers' size and their interconnection costs. This can facilitate the sharing of resources between servers and extend parallel distributed computation. Third, no power-hungry optical-electrical-optical (O/E/O) conversions are needed. Indeed, in current data centers, optical technology is used only for point-to-point links, while the switching process is handled by electronic packet switches. This means that power-hungry O/E/O conversions are needed at each hop of the DC tree topology.
The general intercluster network system that we consider is schematically shown in Fig. 1 . The OPS interconnects a number of ingress/egress clusters. The clusters contain (electronic) queues of packets. The OPS is assumed to operate transparently in the optical domain, while the switch control is in electronics. In this scenario, OPS architectures with thousands of ports could be a valuable solution, flattening the DC network topology and avoiding the power-hungry O/E/O conversions. Achieving such interconnectivity requires OPSs that not only scale to a large port count but are also capable of switching high-bit-rate data channels at the expense of little latency.
Several OPS architectures have been reported in [2] [3] [4] [5] [6] [7] [8] [9] [10]. Besides the physical impairments, such as losses and noise accumulation as the port count increases, achieving a fast switch reconfiguration time to keep latencies low while scaling to large port counts remains an issue. The choice of switch architecture impacts network performance. The proposed architectures are based on rearrangeable, nonblocking, multistage Beneš, Banyan, or other architectures with centralized control. However, at any time slot in which the input state changes, the entire switch matrix needs to be reconfigured to establish a new connection map. The computation time of the best-known algorithm for reconfiguring the switch scales proportionally to N log₂ N (on a single CPU), where N equals the number of inputs and outputs of the switch; this is the well-known looping algorithm [11].
The N log₂ N scaling implies that, for instance, scaling up the switch to 1000 ports will increase the computation time by a factor of about 10,000. During the reconfiguration time, the switch is unable to handle data; thus incoming packets are either lost or need to be stored in electronic buffers in the ingress nodes, both at the expense of latency and large-size electronic buffers. If the number of switch ports exceeds a critical threshold, the switch is no longer capable of handling the incoming switching requests within the maximum latency of the system. An analytical comparison between different OPS architectures has been reported in [12]. As a result, latency and computation requirements limit the maximum port count of such switch architectures.
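As a quick numerical check of the scaling argument above, the following minimal sketch evaluates N log₂ N for a few port counts. The function name `looping_cost` is a hypothetical helper introduced here for illustration, and the result is a relative cost in arbitrary units, not a measured computation time.

```python
import math

def looping_cost(n_ports: int) -> float:
    """Relative reconfiguration cost, proportional to N * log2(N).

    Hypothetical helper: returns arbitrary units, not seconds.
    """
    return n_ports * math.log2(n_ports)

if __name__ == "__main__":
    for n in (16, 64, 256, 1000, 1024, 4096):
        print(f"N = {n:5d}  ->  N*log2(N) = {looping_cost(n):9.0f}")
```

For N = 1000 the value is slightly below 10,000, which matches the order of magnitude quoted in the text.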
Recently, we have investigated and experimentally demonstrated a modular wavelength-division multiplexing (WDM) OPS based on a strictly nonblocking Spanke-type architecture [12] [13] [14] [15] [16]. In order to avert the limitations described above, the proposed architecture features highly distributed control. Because there is no centralized control, computation times no longer depend on the architecture size. Thus, latency and control computation requirements do not limit scalability, and the number of ports can be increased beyond the thousands order [12, 13]. In [14] [15] [16] we have experimentally demonstrated 40 Gbit∕s operation of a modular 16 × 16 WDM OPS with control time as low as 25 ns and total power consumption of 76 pJ∕bit by using either off-the-shelf optical components or photonic integrated devices. In [17] we have experimentally demonstrated the performance of the WDM OPS with flow control and packet retransmission in the case of slotted uniform traffic. In [13] we present a discussion on the scalability in terms of optical components, optical losses, power consumption, and costs as the port count increases. The technological roadmap and feasibility to realize a large-port-count WDM OPS based on highly distributed control is also described in detail in [13].
We have also numerically analyzed the performance of the proposed WDM OPS under a very simple slotted Bernoulli traffic pattern [18] and then with an asynchronous, more complex traffic model [19] . The numerical results indicate that submicrosecond-latency operation is possible while scaling the switch to 4096 input/output ports. After those promising results, further investigation under realistic data center traffic is needed to validate whether the architecture is suitable for a data center environment.
In this work we analyze the performance (data loss, throughput, latency, scalability, and electronic buffer dimensioning) of the proposed WDM OPS architecture under a realistic data center traffic model. Numerical results show that the WDM OPS can support large port count (up to 4096 ports at 40 Gbit∕s data rate), packet loss <10⁻⁶, and submicrosecond-latency operation under typical data center traffic loads, providing more than 55 Tbits∕s of switching capacity. This performance indicates that the WDM OPS architecture could be used in real data center environments.
We have also investigated two novel architectures that allow for a higher input traffic load and therefore an enhanced switching capacity. The first one, based on two parallel OPSs, considers the bimodal distribution nature of data center traffic in short- and long-sized packets. This novel architecture is capable of handling the small- and the large-size packets separately. We found that the admissible input traffic load can be almost 30% higher than that of the original WDM OPS architecture, maintaining the same packet loss and latency performance while using the same electrical buffer size. This will result in a total switching capacity larger than 74 Tbits∕s.
The second investigated OPS architecture exploits multiple receivers for alleviating the packet losses and subsequent retransmission of contended packets. Results show that this novel WDM OPS architecture could extend operation up to an admissible input load of nearly 0.5. This means a 40% increase with regard to the original WDM OPS architecture, providing more than 80 Tbits∕s of switching capacity at the price of doubling the number of receivers.
The paper is organized as follows. In Section II the system under investigation is described. In Section III the simulation layout and the traffic model are presented. Several system performance metrics are analyzed, and the results obtained are discussed in Section IV. In Section V the two modified architectures to further improve the system performance are investigated. Finally, Section VI concludes the paper by summing up the most important results.
II. SWITCH ARCHITECTURES AND CONTROL
Figure 2 shows the WDM OPS architecture with distributed control as an intercluster communication network in which the total number of input ports is N = F × M, with F being the number of input fibers, where each fiber carries M wavelength channels. Each cluster groups M top-of-the-rack (TOR) switches. Each of the M TOR switches has a dedicated electrical buffer queue as shown in Fig. 1. Each cluster is connected to the flat OPS switch by optical fibers. Thus, the packets generated by each queue are electrical-to-optical (E/O) converted using commercial WDM optical transceivers. We assume in our simulation that the optical links between clusters and the WDM OPS have a length of 50 m. Longer or shorter links do not impact the performance of the OPS but can be seen as a larger or smaller latency offset to the latency introduced by the OPS. The OPS processes in parallel the N input ports by using parallel 1 × F switches, each of them with local control, and parallel F × 1 wavelength-selector (WS) contention-resolution blocks (CRBs), also with local control, enabling highly distributed control [12]. Contentions occur only between the F input ports of each F × 1 WS. This is true because the wavelength converters at the WSs' outputs prevent contentions between the WSs' outputs destined to the same output fiber.
The optical packets consist of the payload, which carries the real data, and the optical label that provides the packets' destination. In the OPS, the labels of the optical packets are separated from the payload and processed by the label processors [15]. The label processor thus controls the 1 × F semiconductor-based optical switch to forward the optical packets to one of the F possible output ports via the CRB. The CRBs, each one consisting of an M × 1 WS [14] and fixed wavelength converters (FWCs), avoid collisions between packets received from different input fibers. The outputs of the switch reach the destination clusters by means of an optical link, and a positive flow-control signal acknowledges the reception of the packets. At the destination cluster, the M WDM channels are then detected by O/E converters and then, similar to the ingress cluster, buffered (not reported in Fig. 1) and forwarded to the M TOR switches. Therefore, each of the F × M input channels can be connected to each of the F × M output destinations.
Contrary to an OPS with centralized control, in the architecture with distributed control shown in Fig. 2 the control complexity and the configuration time are largely determined by the label processing time. Because the reconfiguration time of F × 1 and M × 1 is on the order of subnanoseconds and there is on-the-fly operation of the FWC, one of the key issues for minimizing the latency of the OPS with distributed control is the implementation of an extremely fast label processor (and labeling technique) that allows for processing a large number of labels for controlling the large-port-count OPS with a limited increase of the latency. In [15] we have demonstrated a novel in-band RF labeling technique in which the label processing time is independent of the label count and is on the order of a few nanoseconds. This has a critical impact on the cluster electrical buffer size in a flow-controlled operation.
To investigate the performance of the WDM OPS system with flow-control mechanisms, our simulator implements one to one all the functionalities of the WDM OPS and the cluster buffer queues. First the packet is stored in the electronic buffer, and a copy is sent to the OPS via a 50 m optical link. At the OPS node the optical packet label is processed, and the photonic switch is reconfigured in a few nanoseconds to forward the packet to the appropriate destination. When the packet arrives at the CRB, if no collision occurs, it is forwarded to the output port that leads to the destination cluster. In that case, an acknowledgment (ACK) signal is sent after the system's round-trip time (RTT) to the input cluster and the original packet, stored in the buffer, is erased. Two or more packets coming from the same input fiber can have the same output port. When they reach the same CRB simultaneously, a contention occurs. In this case, only one of the packets reaches the output, whereas the others are simply dropped. At the input node, only one of the packets will be acknowledged and erased, while the other ones will need a retransmission. If the input buffer is full, new packets that arrive to the cluster electrical buffer cannot be served, and they will be counted to calculate the packet loss ratio.
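The store, send, acknowledge, and retransmit cycle described above can be sketched as follows. This is only an illustrative model, not the authors' OMNeT++ code: the function names `transmit_slot` and `enqueue` are hypothetical, the buffer capacity is counted in packets rather than bytes, timing (RTT) is ignored, and the CRB winner is picked at random because the text does not state a selection rule.

```python
import random
from collections import defaultdict, deque

def enqueue(queue, dest, capacity):
    """Store an arriving packet, or report it as lost if the buffer is full.

    Assumption: capacity is expressed in packets, not bytes.
    """
    if len(queue) >= capacity:
        return False                 # counted toward the packet loss ratio
    queue.append(dest)
    return True

def transmit_slot(queues, rng):
    """One transmission attempt for the M queues of a single input fiber.

    queues: list of deques; each stored item is a destination output index.
    Returns the sorted indices of the queues whose head packet was ACKed.
    """
    contenders = defaultdict(list)   # destination -> indices of sending queues
    for idx, q in enumerate(queues):
        if q:                        # a copy of the head packet is sent
            contenders[q[0]].append(idx)
    acked = []
    for idxs in contenders.values():
        winner = rng.choice(idxs)    # assumption: random winner at the CRB
        queues[winner].popleft()     # ACK received: erase the stored copy
        acked.append(winner)
        # the other packets stay buffered and are retransmitted later
    return sorted(acked)

if __name__ == "__main__":
    rng = random.Random(1)
    queues = [deque([0]), deque([0]), deque([1])]
    print("acknowledged queues:", transmit_slot(queues, rng))
    print("packets awaiting retransmission:", sum(len(q) for q in queues))
```

In the example, two queues target output 0 and one targets output 1, so two packets are acknowledged and one remains buffered for retransmission.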
III. SIMULATION SETUP
We simulate the system under investigation using the OMNeT++ Network Simulation Framework software [20] . Figure 3 shows the block diagram after the architecture is implemented in the simulation software.
The system is set to operate at a data rate of 40 Gbit∕s. In the first simulation set, several values for the number of nodes (F) and carried wavelengths (M) are tested. Afterward, the system is tested with a total of M · F = 1024 input/output ports (M = F = 32). The default distance between the input/output nodes and the actual OPS is set to 50 m. This, added to the delays caused by the optical modules, translates into an RTT in the system of 560 ns. This value is the minimum latency a packet will experience and acts as an offset for latency measurements.
During the simulations the traffic-generation modules create packets, according to a certain pattern and load value, and forward them to the switch. The electronic buffers, photonic switches, and CRB modules operate as described in the previous section. The output modules receive the transmitted packets and collect statistical information.
A. Traffic Generation
To evaluate the switch performance in an intercluster communication network environment, a data-center-like traffic pattern is generated. The traffic sources are programmed to create packets with any arbitrary length and to send them according to specific inter-arrival-time parameters. Each of these modules simulates the aggregated load of a large number of servers, which constitutes the traffic of each TOR switch. In our experiments, each of the M wavelengths in every input cluster receives the input traffic generated by 200 simulated servers. The amount of input traffic load is normalized and can be scaled from 0 to 1. The generation of data center traffic for our experiments is based on referred publications regarding this kind of traffic [21] [22] [23] . Data from more general studies about Internet traffic are taken into account too [24, 25] .
Packet length in real scenarios is mostly found to be a bimodal distribution around 40 bytes and 1500 bytes [21] [22] [23]. These two values match the Ethernet minimum and maximum lengths and are found in other network environments too [24, 25]. Figure 4 contains the cumulative distribution function (CDF) and the histogram of the packet length generated during the simulations.
Packet arrival times are modeled matching ON/OFF periods (with and without packet transmission). This is the traffic behavior found in data centers and, in general, in Internet traffic [21] [22] [23] [24] [25] [26]. The length of these ON/OFF periods is characterized by heavy-tailed random distributions. In our simulations, we model them with a Pareto distribution. Figure 5 contains the CDFs of both these periods. ON periods follow the same length distribution regardless of the simulated input traffic-load value. However, the time between them (in other words, the OFF periods) is scaled according to the chosen simulation load value: when a higher load is selected they become shorter, and vice versa. This way, only traffic density varies from one simulation to another while traffic complexity remains constant.
As a consequence of flattening the network, communication performance does not depend on the distance or hierarchical position of the nodes. Considering also the statistics of the aggregated traffic from hundreds of servers at the cluster level and not the server level, we assume that traffic in the network is equally spread through all the resources. Translating it to our model, all the packets in every transmission (ON) period are randomly sent, with uniform probability, to one of the possible destinations.
Data center traffic characteristics are strongly dependent on the applications running in the servers. This means that they vary from one environment to another. It is then difficult (if even possible) to define a concept such as typical data center traffic. The traffic model deployed here is only a generalized approximation to the very different scenarios that can be found in real data centers. However, it can be said that usually (during 95% of the time) traffic in data centers does not exceed 10% of the maximum network capacity [21]. Additionally, for more than 99% of the time, traffic is not beyond 30% of the total capacity. When typical data center traffic is mentioned in the forthcoming sections, these values reported in Table I should be considered.
The traffic sources we employed in the simulations operate independently. This means that the overall traffic injected in the switch is not constant even if all the sources are programmed to provide the same value of aggregated load. During the time of the simulations the normalized load fed into the switch may vary from 0 to 1 according to the traffic sources' independent ON/OFF periods.
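The ON/OFF source described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the traffic generator used in the simulations: the Pareto shape (1.5), the ON scale (1000 ns), the equal split between short and long packets, and all function names are hypothetical choices. The bimodal length distribution is reduced to its two modes (40 and 1500 bytes). The OFF scale is derived so that the mean ON fraction equals the requested load, which relies on ON and OFF sharing the same Pareto shape.

```python
import random

SHORT_BYTES, LONG_BYTES = 40, 1500      # the two modes reported in the text

def packet_length(rng, p_short=0.5):
    """Two-point approximation of the bimodal length distribution.

    Assumption: p_short = 0.5 is a placeholder, not a value from the paper.
    """
    return SHORT_BYTES if rng.random() < p_short else LONG_BYTES

def pareto_period(rng, scale_ns, shape=1.5):
    """Heavy-tailed period length: Pareto with minimum value scale_ns."""
    return scale_ns * rng.paretovariate(shape)

def off_scale_for_load(on_scale_ns, load):
    """OFF scale giving mean_on / (mean_on + mean_off) == load.

    Valid when ON and OFF use the same Pareto shape (> 1), because the
    mean of a Pareto variable is then proportional to its scale.
    """
    return on_scale_ns * (1.0 - load) / load

def on_off_periods(rng, load, on_scale_ns=1000.0, count=5):
    """Generate (on_ns, off_ns) pairs; only the OFF scale depends on load."""
    off_scale = off_scale_for_load(on_scale_ns, load)
    return [(pareto_period(rng, on_scale_ns), pareto_period(rng, off_scale))
            for _ in range(count)]

def pick_destination(rng, n_destinations):
    """Uniformly random destination, as assumed for the flat network."""
    return rng.randrange(n_destinations)

if __name__ == "__main__":
    rng = random.Random(0)
    for on_ns, off_ns in on_off_periods(rng, load=0.3):
        print(f"ON {on_ns:9.1f} ns, OFF {off_ns:9.1f} ns, "
              f"first packet {packet_length(rng)} B "
              f"to output {pick_destination(rng, 32)}")
```

Keeping the ON distribution fixed and rescaling only the OFF periods mirrors the statement that traffic density changes with load while traffic complexity stays the same.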
B. Simulation Sets
Several sets of simulations have been run in order to determine the performance of the system under different configurations. In particular we analyze the data loss, throughput, latency, scalability, and electronic buffer dimensioning of the proposed WDM OPS architecture.
First, in Subsection IV.A, we consider the possibility of scaling the architecture to a large port count. One of the most critical hardware dimensions of the proposed architecture is the electronic input buffer. The buffer stores the packets while they are traveling through the OPS so that, in case of collision, they can be retransmitted. Packet losses in the simulations occur when these buffers are full and new packets arriving to the cluster are discarded. Setting input buffers with large capacity should, in principle, help avoid losses. However, large buffers capable of working at 40 Gb∕s are still technologically difficult to obtain, and heavy parallelization is required. The purpose of these simulations, as described in Subsection IV.B, is the investigation and dimensioning of the minimum buffer size that fits the performance requirements. The optical links between the clusters and the WDM OPS architecture are set to 50 m. When the propagation delay is added to the processing delays of the OPS optical subsystem, a 560 ns RTT is obtained. If the physical dimensions of the links (and the data center) could be shrunk, the overall system latency would be reduced. The effect of this length reduction is studied in Subsection IV.C.
The results of the simulations are represented in several graphs. The average normalized input load refers to the traffic received at every port of the switch. For instance, input load of 0.1 means that during the total length of the simulation, each of the M · F ports received on average a traffic load of 0.1 · 40 Gbits∕s = 4 Gbits∕s.
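The conversion from normalized load to absolute traffic can be written out as a short sketch. The function names are hypothetical helpers; the 40 Gbit∕s line rate and the M · F port count come from the text.

```python
DATA_RATE_GBPS = 40.0   # per-port line rate used in the simulations

def per_port_traffic_gbps(load: float) -> float:
    """Average traffic offered to one port for a normalized load in [0, 1]."""
    return load * DATA_RATE_GBPS

def aggregate_traffic_tbps(load: float, fibers: int, wavelengths: int) -> float:
    """Average traffic offered to all M * F ports, in Tbit/s."""
    return per_port_traffic_gbps(load) * fibers * wavelengths / 1000.0

if __name__ == "__main__":
    print(per_port_traffic_gbps(0.1))            # per-port value for load 0.1
    print(aggregate_traffic_tbps(0.1, 32, 32))   # 1024-port configuration
```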
IV. SIMULATION RESULTS

A. Switch Scalability
WDM OPS scalability to a large port count as a function of load traffic is investigated. We simulate the system with F = M = {2, 4, 8, 16, 32, 64}, giving a total of M · F = {4, 16, 64, 256, 1024, 4096} ports. Figure 6(a) shows the packet loss as a function of the average input load for different system sizes. The first consideration is that for a larger number of ports performance degrades due to larger contention probability at the CRB. Thus, more packet retransmissions turn into higher input buffer occupation. For example, a 16 port OPS system starts losing packets with an input load 12% lower than in the 4 port case. However, when scaling the system beyond 1024 ports, this effect is no longer appreciable.
Scaling the system has a similar effect on throughput, as shown in Fig. 6(b) . Doubling the size of both F and M in a small architecture (from 4 ports to 16 ports) turns into a throughput decrease. However, systems sized 1024 or 4096 in/out ports provide almost identical results (less than 0.5% of difference in performance).
In terms of latency, Fig. 6(c) shows the same effects when the number of ports grows. For instance, at input load 0.4, growing from 4 to 16 in/out ports means 250 ns of added delay. The interesting point is that beyond 1024 ports, again the differences become minimal. Packet loss lower than 10⁻⁶ and an average packet latency of 909 ns are assured for input traffic up to 0.3 and a 20 kB buffer. Considering that for 99% of the time the load traffic is below 0.3 [21], even if in the remaining 1% of time the traffic load is between 0.3 and 1, these results confirm that, regardless of the port count, the switch can handle data center traffic if an electronic input buffer larger than 20 kB is employed.
B. Dimensioning the Electrical Buffers
Electrical buffers must be large enough to hold packets in case of packet collisions. However, technological constraints apply on SRAM memory speeds and capacities. In this section we investigate the effect of different buffer sizes on the system performance. We consider a 1024 input/output port OPS and electronic buffer sizes of 5, 10, 15, 20, 25, and 50 kB. In Fig. 7(a) we can see how, as the buffer size increases, the system can manage heavier traffic loads without packet losses. Too small sizes, like 5 or 10 kB per buffer, prove to be insufficient with data center traffic (see Table I). The minimum buffer size that allows good performance is 15 kB. With this value, the system could receive an input load up to 0.3 with a packet loss smaller than 10⁻⁶. A data center would operate with a load smaller than that during 99% of the time. Increasing the buffer size beyond the minimum of 15 kB provides better performance (in terms of packet loss). Because the architecture relies on the packet retransmission algorithm, the total amount of traffic inside the switch will always be higher than the input traffic. For this reason, even an infinite buffer would not achieve lossless operation beyond a 0.68 normalized input load. On the other hand, the buffer size also affects the average packet latency. In fact, the retransmission algorithm relies on a trade-off between packet losses and latency: if more packets can be held at the buffer, they will end up waiting for a longer time. Figure 7(b) shows the effect on average packet latency of the different buffer sizes tested. According to those results, buffers of 20 kB should be enough to ensure a packet loss below 10⁻⁶ for typical data center traffic (load <0.3) and up to 0.34 with submicrosecond latency (909 ns). This is the reason why 20 kB has been intentionally used in Subsection IV.A to evaluate the architecture scalability.
More buffer capacity could be considered in order to extend lossless operation to higher input loads. For example, 50 kB of buffer per port could attain load up to 0.45. However, the trade-off between data loss and latency implies that packets would suffer a latency of 2 μs or more. In a real operation environment, an optimal size should be chosen depending on the application requirements, latency limitations, and other factors.
C. Shortening the Optical Link Length
The packets' latency is composed of four factors: the input buffering time, switch reconfiguration time, delay due to the switch and the optical links, and retransmissions. Buffering, retransmissions, and switch reconfiguration times depend on the architecture. The reconfiguration time of the WDM OPS architecture with highly distributed control is constant, regardless of the port count, and is set as a fixed value of 60 ns (including delay due to the switch) [16]. The optical link length was set to 50 m in the simulation. Packets traveling through the switch have to cover this distance at least twice (to and from; see Fig. 1), so, at a propagation delay of about 5 ns∕m in fiber, the total link delay will be 500 ns. Adding the reconfiguration time (60 ns), the total RTT of the switch will be 560 ns. In this section we evaluate the effect of reducing the length of the optical links on the system performance. The tested RTTs are 560, 460, 360, 260, 160, and 60 ns.
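The RTT arithmetic above can be reproduced with a short sketch. The propagation delay of 5 ns∕m is an assumption (a typical value for silica fiber) chosen because it reproduces the 500 ns quoted for 2 × 50 m; the function name is hypothetical. Under that assumption, the tested RTTs correspond to link lengths of 50, 40, 30, 20, 10, and 0 m.

```python
FIBER_DELAY_NS_PER_M = 5.0   # assumption: about 5 ns per meter of fiber
SWITCH_DELAY_NS = 60.0       # fixed reconfiguration and switch delay [16]

def round_trip_time_ns(link_length_m: float) -> float:
    """RTT: the link is traversed twice (to the OPS and back), plus switch delay."""
    return 2.0 * link_length_m * FIBER_DELAY_NS_PER_M + SWITCH_DELAY_NS

if __name__ == "__main__":
    for length_m in (50, 40, 30, 20, 10, 0):
        print(f"{length_m:2d} m link -> RTT = {round_trip_time_ns(length_m):.0f} ns")
```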
The simulation results show three important things. First, the packet loss and throughput vary only very slightly from one setup to another. For this reason they are not plotted in this document. Second, the RTT acts only like an offset for the average packet latency measurements as shown in Fig. 8. At low input loads, the lines in the graph only differ in the offset value. Last, as the input traffic load increases, shortening the optical link no longer reduces the latency appreciably, because the latency is then mostly determined by the buffering time. Heavy loads produce high buffer occupation and long queuing times. In such a situation, the waiting times in the buffer become the dominant cause for latency.
At moderate load, designing the optical links as short as possible does not modify the packet loss but decreases the latency. This can be exploited for improving the trade-off between packet loss and latency. For instance, shortening the link allows the use of a larger buffer size to improve packet losses while keeping latency below the 1 μs boundary. Shortening the optical link is restricted by several factors, like having proper cabling paths, thermal and power design, and so forth. This topic is out of the scope of our investigation.
V. IMPROVED OPS ARCHITECTURES
A. Architecture With Two Parallel Switches
The packet length of data center traffic is a bimodal distribution around the minimum (40 bytes) and maximum (1500 bytes) sizes of Ethernet frames, as shown in Fig. 4. Because the difference between both values is more than one order of magnitude, we investigate the benefit of using two different OPSs to switch packets separately depending on their size. The idea comes from the fact that bigger packets are more prone to be dropped at the clusters when the buffers are almost full. This can be observed in Fig. 9. If we consider the packet length distribution shown in Fig. 4, most of the dropped packets are large Ethernet packets.
To investigate the effectiveness of employing two parallel OPSs, the software model has been modified to divert the traffic load arriving at the clusters to two distinct buffers, one for short packets and the other for long packets. The threshold that determines to which of the two buffers the packet has to be diverted is set at 1300 bytes. Other threshold values between 800 and 1400 bytes have provided almost identical results. Note that for such a threshold, the total normalized load is composed of 42% of short packets and 58% of long packets. The effective traffic load at the buffers for short packets and long packets is 50% of the input traffic load. The buffer for small packets is configured to be 5 kB, while the buffer for big packets is 15 kB. Thus, the total amount of electronic buffer is 20 kB, the same as in the previous simulations.
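The size-based diversion into two buffers described above can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, 1 kB is assumed to be 1000 bytes, and the text does not say which buffer receives a packet of exactly 1300 bytes, so sending it to the long-packet buffer is an arbitrary choice.

```python
THRESHOLD_BYTES = 1300        # size threshold from the text
SHORT_BUFFER_BYTES = 5_000    # assumption: 1 kB = 1000 bytes
LONG_BUFFER_BYTES = 15_000

class ByteBuffer:
    """Input buffer with a byte capacity; counts packets dropped when full."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.dropped = 0

    def offer(self, length: int) -> bool:
        """Store the packet if it fits; otherwise count it as lost."""
        if self.used + length > self.capacity:
            self.dropped += 1
            return False
        self.used += length
        return True

    def release(self, length: int) -> None:
        """Free space once the packet has been acknowledged."""
        self.used -= length

def classify(length: int) -> str:
    """Assumption: packets of exactly THRESHOLD_BYTES count as long."""
    return "short" if length < THRESHOLD_BYTES else "long"

if __name__ == "__main__":
    buffers = {"short": ByteBuffer(SHORT_BUFFER_BYTES),
               "long": ByteBuffer(LONG_BUFFER_BYTES)}
    for length in (40, 1500, 40, 1500, 900):
        kind = classify(length)
        stored = buffers[kind].offer(length)
        print(f"{length:4d} B -> {kind:5s} buffer, stored={stored}")
```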
With packets diverted to different networks, the effective load is lower, and thus the contention probability at the CRBs is lower. Consequently, fewer retransmissions are needed, and thus buffer occupation is lower and fewer packets are lost at the input buffers. The performance of the system architecture with two OPSs is shown in Fig. 10. As a reference, the performance of the system based on the single OPS (with total buffer size of 20 kB) discussed in Section IV (labeled "total") is also displayed in Fig. 10. Small packets are transmitted with no losses independently of the input load. Furthermore, average latency is kept well below 1 μs (in fact, below 750 ns). Long packets do experience losses, but they appear when total input load goes beyond 0.4. This means that the admissible traffic in the switch can be 30% higher than with the original architecture, even with a buffer size of 15 kB for the long packets. The benefits of two separate OPS networks come at the expense of deploying almost the same architecture twice. Although this would certainly be more expensive than the regular architecture, it could be cost-effective for some specific data center environments. For instance, urgent, application-critical short data packets could make use of a very fast semiconductor-based WDM OPS [14] rather than the congested network. Then, long-packet traffic with no latency constraints could be processed by a micro-electro-mechanical systems (MEMS)-based slow WDM OPS, which could feature arbitrarily larger buffer sizes to improve packet loss.
B. OPS Architecture With Multiple Receivers
The WDM OPS architecture relies on the CRB and the retransmission flow control to compensate packet losses. At the CRB, only one packet at a time can be saved and forwarded to the output [12, 13]. The other packets are simply retransmitted at a later time. This mechanism makes the WDM OPS feasible in terms of hardware and strongly decreases the latency for handling typical data traffic as reported in Table I. However, if the traffic load increases beyond 0.2-0.3, the performance requirements in terms of packet loss and latency might not be met. If more than one packet could be saved, some retransmissions would be avoided, leading to better performance even at higher load.
Here we study the effect of saving more packets in case of contention by using multiple receivers at each output port of the OPS. The same approach was applied in the Osmosis project [27] for an OPS architecture with centralized control. Figure 11 plots packet loss and latency in the system when the multiple receivers are employed for saving more than one packet. The simulations show results for one packet saved as reference (the original architecture configuration) and also 1, 2, 3, 5, and 10 extra packets saved. The benefits of introducing multiple receivers are notable. In the simplest case (saving one more packet), the system could extend lossless operation up to an input load of nearly 0.5 (around 40% increase). Saving two more packets would, of course, provide better performance, almost doubling the switching capacity. Improvements in terms of average packet latency, as seen in Fig. 11(b) , are very promising too because for every saved packet, retransmission delays are avoided. This allows for submicrosecond operation of the OPS even at input loads beyond 0.5.
The proposed modification will remarkably improve the performance of the WDM OPS. On the other hand, those improvements demand a more complex (but feasible) hardware implementation and higher costs of the multiple receivers instead of a single one.
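The effect of extra receivers on contention can be illustrated with a small sketch. It is not the simulation model of this paper: `resolve_contention` simply counts how many contending packets are forwarded and how many must be retransmitted, and `expected_retransmissions` uses a hypothetical binomial model in which each of `m` inputs independently targets a given output with probability `p` in a slot. One extra saved packet corresponds to `receivers=2`.

```python
from math import comb

def resolve_contention(contenders: int, receivers: int = 1):
    """Return (forwarded, retransmitted) packet counts for one output port."""
    forwarded = min(contenders, receivers)
    return forwarded, contenders - forwarded

def expected_retransmissions(m: int, p: float, receivers: int = 1) -> float:
    """Expected packets needing retransmission per slot at one output.

    Assumption (illustrative only): the number of contenders is
    Binomial(m, p), i.e., m independent inputs each send with probability p.
    """
    total = 0.0
    for k in range(receivers + 1, m + 1):
        prob_k = comb(m, k) * p**k * (1.0 - p)**(m - k)
        total += prob_k * (k - receivers)
    return total

if __name__ == "__main__":
    print(resolve_contention(3, receivers=1))
    print(resolve_contention(3, receivers=2))
    for r in (1, 2, 3):
        value = expected_retransmissions(32, 0.3 / 32, receivers=r)
        print(f"receivers={r}: expected retransmissions per slot = {value:.5f}")
```

Each additional receiver removes the most frequent contention events first, which is consistent with the large gain reported for a single extra receiver and the smaller incremental gains afterward.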
VI. CONCLUSION
We have numerically investigated a scalable WDM OPS architecture with highly distributed control as a flat intercluster data center network with high-speed operation (beyond 40 Gbit∕s) and submicrosecond latency. Numerical performance analyses show a packet loss lower than 10⁻⁶ and a latency of 909 ns for input loads up to 0.34 and electronic buffer size of 20 kB for a WDM OPS architecture with up to 4096 port count. As a result, the proposed architecture could lead to a system with a potential throughput larger than 55 Tbits∕s:
$$T = \mathrm{num}_{\mathrm{ports}} \cdot \mathrm{datarate}_{\mathrm{opt}} \cdot \mathrm{load}_{\mathrm{norm}} = 4096 \cdot 40\ \mathrm{Gb/s} \cdot 0.34 = 55.71\ \mathrm{Tb/s},$$
where $\mathrm{num}_{\mathrm{ports}}$ is the port count, $\mathrm{datarate}_{\mathrm{opt}}$ is the optical data rate per port, and $\mathrm{load}_{\mathrm{norm}}$ is the maximum normalized input load with lossless operation.
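The throughput expression can be evaluated directly, as in the following minimal sketch. The function name is a hypothetical helper; the inputs (4096 ports, 40 Gb∕s, loads of 0.34 and 0.5) are the values quoted in the text.

```python
def switching_capacity_tbps(num_ports: int, datarate_gbps: float,
                            load_norm: float) -> float:
    """T = num_ports * datarate_opt * load_norm, converted from Gb/s to Tb/s."""
    return num_ports * datarate_gbps * load_norm / 1000.0

if __name__ == "__main__":
    original = switching_capacity_tbps(4096, 40.0, 0.34)
    multi_rx = switching_capacity_tbps(4096, 40.0, 0.5)
    print(f"original architecture, load 0.34: {original:.2f} Tb/s")
    print(f"multiple receivers, load 0.5:     {multi_rx:.2f} Tb/s")
```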
In order to increase the overall performance, two modified architectures have also been proposed and analyzed. The first modified architecture is based on two parallel OPSs for switching large and small packets, respectively. This technique provides a 30% performance improvement, leading to a switching capacity of 73 Tbits∕s.
The second modified architecture is based on the usage of multiple receivers to improve contention resolution and decrease packet retransmission. Results indicate that, at the cost of doubling the receivers, over 0.5 input load traffic could be served with performance similar to the one obtained with the original architecture. This is more than 40% performance improvement with respect to the original WDM OPS architecture, leading to more than 80 Tbits∕s of switching capacity under typical data center traffic.
ACKNOWLEDGMENT
This work was supported in part by the Netherlands Organization of Scientific Research (NWO) through the VI and NRC Photonics programs and in part by the EU FP7 LIGHTNESS project. 
