Abstract-This paper presents a new petabit photonic packet switch architecture, called PetaStar. Using a new multidimensional photonic multiplexing scheme that includes space, time, wavelength, and subcarrier domains, PetaStar is based on a three-stage Clos-network photonic switch fabric to provide scalable large-dimension switch interconnections with nanosecond reconfiguration speed. Packet buffering is implemented electronically at the input and output port controllers, allowing the central photonic switch fabric to transport high-speed optical signals without electrical-to-optical conversion. Optical time-division multiplexing technology further scales port speed beyond electronic speed up to 160 Gb/s to minimize the fiber connections. To solve output port contention and internal blocking in the three-stage Clos-network switch, we present a new matching scheme, called c-MAC, a concurrent matching algorithm for Clos-network switches. It is highly distributed such that the input-output matching and routing-path finding are concurrently performed by scheduling modules. One feasible architecture for the c-MAC scheme, where a crosspoint switch is used to provide the interconnections between the arbitration modules, is also proposed. With the c-MAC scheme and an internal speedup of 1.5, PetaStar with a switch size of 6400 × 6400 and total capacity of 1.024 petabit/s can be achieved at a throughput close to 100% under various traffic conditions.
Index Terms-Clos network, optical time-division multiplexing (OTDM), packet scheduling, photonic switch.
I. INTRODUCTION
THE INTERNET has become a fundamental driving force for a variety of information technologies due to its ever-growing ability to handle traffic, its ubiquity, and its services [1]. New applications, such as sensor fusion, bio-informatics, grid computation, global data storage, and on-line video applications, are emerging. Common among these applications is their demand for a huge amount of bandwidth and a global packet switching infrastructure. In contrast to the success in increasing the raw bandwidth for terabit transmission capability using dense wavelength division multiplexing (DWDM) technology toward the end of the last century [2]-[5], today's electronic router technology may soon exhaust its capacity of a few terabit/s [6]-[8]. Finding a cost-effective way to build a router with a few tens of terabit or even petabit capacity will be the key to the continuing success of the next-generation Internet [9]-[12]. While electronics has been widely used in building large buffers and high-speed packet schedulers [13]-[16] in routers, it is still very difficult to build a large-dimension (beyond 1000 × 1000) electronic switch fabric with an OC-192 (10-Gb/s) line rate. Currently, electronic crossbar switches can be implemented on a single chip with a capacity of up to 320 Gb/s at a link rate of 2.5 Gb/s [17]. To build a petabit ($10^{15}$ b/s) switch fabric with a three-stage Clos network, the dimension of interconnection may reach 100 000 with the port rate at 10 Gb/s. The total number of interconnections can be estimated as: 100 000 (connections between stages) × 2 (stages) × 4 (number of 2.5-Gb/s electronic planes to form a 10-Gb/s port) = 800 000 (interconnections). This large number of interconnections for an electronic switch fabric becomes too complex to implement (or even manage). Thus, we are motivated to look into a photonic switch system as an alternative for building a petabit switch fabric.
Several promising photonic technologies have been proposed in the implementation of large-dimension space switches [18]-[20]. The optical microelectromechanical system (MEMS) has demonstrated a 1296 × 1296 cross-connection switch with a total capacity of 2.07 petabit/s [21]. The drawback to these types of switches is the slow reconfiguration speed (in the µs or ms range) because of moving and stabilizing the mechanics of the mirrors. For packet switching, the switch fabric needs to reconfigure its connections on a per-packet basis, usually on the nanosecond time scale. Another alternative for building a photonic switch fabric is to combine tunable lasers with an arrayed waveguide grating (AWG) [22]-[25] device as a space switch. The advantage of this approach is the rapid switching speed achieved by tuning the wavelength of the lasers at a nanosecond speed [26]-[28]. When cascading the AWG-based devices for a multistage switch fabric, wavelength conversion becomes an essential element because the switching property of the AWG depends on the change of the input wavelengths [29]. All-optical wavelength conversion has been achieved using nonlinear effects in either electro-optic crystals [30] or optical fibers [31]. However, to fully exploit nonlinearity, the power requirements of these methods may not be practical in system applications. Another approach is to use resonance nonlinearity in semiconductor optical amplifiers [32], [33], which have higher nonlinearity than the other media. The power requirement of such an approach can be as low as tens of femtojoules per pulse, suitable for large-scale system applications.
We previously proposed a single-stage photonic packet switch fabric with a capacity of up to several terabit/s [34] .
Here, we present a new photonic packet switch architecture with petabit/s switching capacity for next-generation routers, called PetaStar. The switching fabric uses multidimensional photonic technologies to achieve highly scalable, rapidly reconfigurable interconnections in a three-stage Clos network. Packet buffering is implemented electronically outside the central photonic switch fabric, allowing the high-speed optical signals to be transported without electrical-to-optical conversion. We believe this is more practical than switches that use optical delay lines as buffers [35], because optical memory technology is still immature and its size is very small, e.g., a few tens of packets. Optical time-division multiplexing (OTDM) techniques [36]-[38] are used to increase the port speed to hundreds of gigabit/s. Based on the PetaStar, a 6400 × 6400 switch can be realized using 80 wavelengths for each individual switch module in a three-stage Clos network architecture. With the port speed of 160 Gb/s, the total switch capacity reaches 1.024 petabit/s. The number of fiber interconnections is greatly reduced by aggregating more bandwidth on a single fiber. With the above example of building a 100 000 × 100 000 switch with the port rate at 10 Gb/s, the total fiber interconnections are reduced from 800 000 to 12 000, making the proposed architecture more feasible for implementation. The reduction comes from increasing the bandwidth of the fiber interconnections from 2.5 to 160 Gb/s.
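The headline figures above follow from simple arithmetic. The short Python sketch below reproduces them from the numbers quoted in the text; the variable names are illustrative only and do not come from the paper.

```python
# Figures quoted in the text; names are illustrative assumptions.
PORTS = 6400            # switch size (6400 x 6400)
PORT_RATE_GBPS = 160    # OTDM port speed in Gb/s

# Total capacity: ports times port speed, converted from Gb/s to Pb/s.
capacity_pbps = PORTS * PORT_RATE_GBPS / 1e6
print(f"total capacity: {capacity_pbps:.3f} Pb/s")

# Electronic three-stage fabric from the Introduction:
# 100 000 links between stages, 2 stage boundaries,
# 4 electronic planes at 2.5 Gb/s per 10-Gb/s port.
LINKS_BETWEEN_STAGES = 100_000
electronic_interconnects = LINKS_BETWEEN_STAGES * 2 * 4
print(f"electronic interconnections: {electronic_interconnects}")
```

The photonic figure of 12 000 interconnections is taken from the text as stated and is not rederived here.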
Although many effective scheduling schemes have been proposed to resolve output contention for single-stage switches [39]-[57], until now, only a few matching schemes for Clos-network switches have been proposed [15], [16], where the first and the third stages have shared-memory buffers, thus limiting the switch capacity. It is very challenging to find an efficient and fast scheduling scheme to provide high performance under various traffic conditions in a three-stage bufferless Clos-network switch, such as the PetaStar. The scheduling scheme must be able to scale gracefully with a large switching capacity because the time for resolving output contention becomes more constrained with the increase of switch size and port speed. This paper presents a new matching scheme, called c-MAC, a concurrent matching algorithm for Clos-network switches. It is highly distributed such that the input-output (I/O) matching and routing path finding are concurrently performed by scheduling modules. To relax the strict arbitration time constraint and round-trip delay between line cards and the packet scheduler (and to accommodate slow-configuration optical switch fabrics), c-MAC operates on a frame basis, where each frame consists of multiple cells. To improve I/O matching efficiency, c-MAC slightly modifies the exhaustive dual round-robin matching (EDRRM) scheme [54] to improve throughput under various traffic conditions. The throughput of the c-MAC scheme is close to 100% under various traffic distributions at the cost of an internal speedup of 1.5. A feasible implementation of the c-MAC scheme, where a crosspoint switch is used to provide the interconnection between the arbitration modules, is also proposed.
The paper is arranged as follows. Section II describes the system architecture including the data structure, the packet flow, and the design of electronic buffering at the input port controller and output port controller. Section III presents the c-MAC packet-scheduling scheme for the three-stage Clos-network switch. It also shows the implementation complexity of the c-MAC packet scheduler and its performance study results. Section IV details the implementations of the proposed photonic switch fabric. Section V presents our conclusions.
II. SYSTEM ARCHITECTURE

A. Petabit Photonic Packet Switch Architecture
Fig. 1 shows the system architecture of the proposed petabit photonic packet switch, called PetaStar. The basic building modules include the input and output port controllers (IPCs and OPCs), input grooming and output demultiplexing modules (IGMs and ODMs), photonic switch fabric (PSF), centralized packet scheduler (PS), and a system clock distribution unit. The PSF is a three-stage Clos network with columns of input switch modules (IMs), central switch modules (CMs), and output switch modules (OMs). The incoming and outgoing line rates are assumed to be 10 Gb/s. All incoming lines are first terminated at line cards (not shown in the figure), where packets are segmented into cells (fixed-length data units) and stored in a memory. The packet headers are extracted for IP address lookup. All cell buffering is implemented electronically at the IPCs and OPCs, leaving the central PSF bufferless, i.e., no photonic buffering is required in the system. As a result, the bit rate of each port can operate at a speed beyond the limits of the electronics. The port speed can be equal to or greater than (with a speedup) $g$ times the line rate, where $g$ is the grooming factor. Virtual output queues (VOQs) at IPCs, together with the PS, provide contention resolution for the packet switch. At the input end, the majority of incoming packets are stored in the ingress line cards, where packets are segmented and stored in their VOQs. Packets destined for the same output port are stored in the same VOQ. VOQs implemented at the IPC serve as the mirror of the VOQ memory structure in the line cards. As long as they can keep the cells flowing between the line card and IPC, the size of the VOQs at the IPC can be considerably smaller than its mirror part in the ingress line cards. Buffers at the OPC are used to store cells before they are sent out to the destined egress line cards. A large buffer with a virtual input queue (VIQ) structure implemented in the line card (not shown in the figure) is used to store the egress cells from the PSF and reassemble them into packets.
Fig. 2 shows how packets flow across the system. At the input, variable-length IP packets are first segmented into cells with a fixed length of 64 B (including some cell overhead), suitable to accommodate the shortest IP packet (40 B). At each IPC, a total of $g$ input lines at 10 Gb/s enter the system and terminate at the IPC. To reduce memory speed, each VOQ has a parallel memory structure to allow $g$ cells to be read at the same time. These cells form a photonic frame. Each cell, before entering the IGM for compression, is scrambled to guarantee sufficient transitions in the data bits for optical receiver and clock recovery. In the IGM, these cells are compressed in the optical time domain to form a time-interleaved OTDM frame at $10 g s$ Gb/s. Let $T_c$ be the cell time slot, with $T_c = 51.2$ ns for the 10-Gb/s line rate. Let $T_{cc}$ be the compressed cell time slot at the port speed ($10 g s$ Gb/s), with $T_{cc} = 64$ B/($10 g s$ Gb/s), where $s$ is the speedup factor. Then, the compressed photonic frame period is $T_f = g T_{cc} = 64$ B/($10 s$ Gb/s). With $g = 16$ and $s = 1$, the frame period is equal to the cell slot, $T_f = T_c$. Guardtime is added at the head of the frame to compensate for the phase misalignment of the photonic frames when passing through the PSF and to cover the transitions of optical devices.
At each stage of the photonic switching fabric, the corresponding subcarrier header is extracted and processed to control the switching fabric. Since the PS has already resolved the contention, the photonic frame is able to find a path by selecting the proper output links at each stage in the switching fabric. Once the photonic frame arrives successfully at the designated output port, it is demultiplexed and converted back to cells at 10 Gb/s in the electronic domain by the output demultiplexing module (ODM). The OPC then forwards the cells to their corresponding line cards based on the output line numbers (OLs).
As Figs. 1 and 2 show, the optical signals run between the IGM and the ODM at the port speed, e.g., 160 Gb/s for 16 groomed 10-Gb/s lines without speedup. All optical devices/subsystems between them operate at this port speed. However, the electronic devices operate at most at 10 Gb/s (with a speedup of one), or even lower with parallel wires, e.g., four SERDES signals, each at 2.5 Gb/s (or 3.125 Gb/s including 8B/10B coding).
Fig. 3 illustrates the data structure at each stage of the switch. Before the data payload, two header fields that contain the OL in the destined output port and the input line number (IL) of the switch are added to each incoming cell [see Fig. 3(a)]. The OL is used to deliver cells to the destined output lines when the photonic frame arrives at the OPC. A validity bit is inserted at the beginning of the cell to indicate whether the cell is valid. The overhead bits introduced by the OL and IL are determined by the number of output lines in each port and the number of input lines of the switch, respectively. For example, for the petabit system considered here, the cell header length is 21 bits (1 + 4 + 16). Bits in each cell are compressed and time-interleaved using optical time-division multiplexing (OTDM) techniques in the IGM to form the photonic frames that are ready to transmit through the PSF [see Fig. 3(b)]. Each photonic frame goes along with an out-of-band subcarrier (SC) header. Using the photonic frame as its carrier, the SC header is amplitude-modulated on the photonic frame at a much lower subcarrier frequency. The estimated raw bandwidth required for the SC header is about 600 MHz. Standard multilevel coding schemes can be applied to further compress the SC bandwidth to 80 MHz or less, allowing the SC header to be carried around the dc frequency. The first field in the SC header is a flag containing a specific pattern for frame delineation, since the photonic frames carrying the SC header do not precisely repeat in the time domain. The payload is 8B/10B coded so that the flag can be found correctly.
Three fields are attached to the SC header to provide routing information at each stage of the PSF. The three fields carry the CM, OM, and OPC numbers, whose bit widths are set by the number of CMs, the number of OMs, and the number of outputs at each OM, respectively. At the beginning of the frame, a validity bit is added to indicate whether the frame contains valid cells. As soon as a cell arrives at the input line memory, a request is sent to the packet scheduler, which keeps track of all the incoming cells. The scheduler, based on a new hierarchical frame-based exhaustive matching scheme, sends back the grant signal if the transmission has been granted. As a result, 16 cells of packet A from input line memory 1 are selected at the first frame period to form frame number 1, followed by 16 cells of packet B selected at the next frame period to form frame number 2. In this example, the remaining eight cells of packet A are aggregated with another eight cells from packet C to form frame number 3. Packet B is granted prior to the second half of packet A because packet B has a filled frame and thus has a higher priority for transmission. At the IGM, cells are compressed into the time-interleaved photonic frames and are thus ready to be routed through the PSF.
Following the above example, Fig. 5 shows how packets A, B, and C are processed as they are demultiplexed at the OPC and reassembled at the egress line cards. Photonic frames containing the compressed cells are demultiplexed in the ODM and sent into parallel inputs of the selector array. According to the cell header, the cells of frame number 1 go to the 16 first-in-first-out (FIFO) queues located in output line memory 1 at the first frame period. At the next frame period, the cells of frame number 2 are sent to the same 16 FIFOs in output line memory 1. At the next frame period, photonic frame number 3 arrives at the OPC. The remaining part of packet A is sent to output line memory 1, while the cells from packet C go to the output line memory of their own destined line. These cells are then read out from the FIFOs to the designated output line (output line 1 in this case) at a speed greater than 10 Gb/s. The VIQs at the line card are used to reassemble packets A, B, and C.
Synchronization can be challenging as the system scales. To achieve synchronization, a centralized frame clock will be supplied to each module in the system. Each switching action, including buffer reading and writing, switching of laser wavelength, and OTDM multiplexing and demultiplexing, will be synchronized according to the same frame clock signal with a frequency of, e.g., 1/51.2 ns, or 19.5 MHz. The clock signal will be distributed among the modules using optical signals through fibers to provide a sharp stroke edge for triggering operations. A sinusoidal signal at 10 GHz will be distributed to each module as the base frequency for the synchronization. The subcarrier provides a trigger signal at each switching stage. Upon detecting the subcarrier signal in the subcarrier unit (SCU), which indicates the arrival of a photonic frame at the input of the module, the SC processor will wait for a precise time delay before starting the switching operation. The time delay through the fiber connections will be chosen precisely [58] so that it can accommodate the longest processing delay in the header processing. The minimum timing tolerance contributed by both photonic and electronic devices, as well as fiber length mismatch in the system, will ultimately determine the guardtime between the photonic frames. For instance, with a frame period of 51.2 ns and 10% used for the guardtime overhead, the guardtime can be about 5 ns, which is sufficiently large to compensate for the phase misalignment, fiber length mismatch, and optical device transitions.
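The timing figures used in this section can be checked with a few lines of arithmetic. The following Python sketch assumes the values quoted in the text (64-B cells, 10-Gb/s lines, a 160-Gb/s port carrying 16 cells per frame, and a 10% guardtime); the variable names are illustrative.

```python
# All inputs are values quoted in the text; names are illustrative.
CELL_BITS = 64 * 8          # 64-B cell
LINE_RATE_GBPS = 10         # line rate
PORT_RATE_GBPS = 160        # OTDM port speed
CELLS_PER_FRAME = 16        # cells groomed into one photonic frame
GUARD_FRACTION = 0.10       # share of the frame period spent on guardtime

# Bits divided by Gb/s gives nanoseconds.
cell_slot_ns = CELL_BITS / LINE_RATE_GBPS
frame_period_ns = CELLS_PER_FRAME * CELL_BITS / PORT_RATE_GBPS
frame_clock_mhz = 1e3 / frame_period_ns
guard_ns = GUARD_FRACTION * frame_period_ns

print(f"cell slot:    {cell_slot_ns:.1f} ns")
print(f"frame period: {frame_period_ns:.1f} ns")
print(f"frame clock:  {frame_clock_mhz:.2f} MHz")
print(f"guardtime:    {guard_ns:.2f} ns")
```

With 16 cells per frame and no speedup, the compressed frame occupies exactly one line-rate cell slot, which is why a single frame clock can drive both the electronic and the photonic side.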
III. PACKET SCHEDULING FOR THE THREE-STAGE CLOS-NETWORK SWITCH
One of the most complex units in this architecture is the packet scheduler (PS). It first resolves the contention of the frames from different input ports that are destined for the same output port. It then determines a routing path through the center stage (i.e., chooses a CM) for each matched I/O pair. Since there can be multiple possible paths (determined by the number of CMs) for each matched I/O, how to choose a CM to reduce internal blocking and thus increase the throughput further complicates the scheduling algorithm design.
For an input-buffered switch with a VOQ structure, several maximum-weight matching (MWM) algorithms have been proposed to achieve 100% throughput for independent and identically distributed (i.i.d.) arrivals (uniform or nonuniform). Some theoretical results for delay bound have been determined [39] - [43] . But MWM is not practical to implement in hardware due to its high complexity. In [44] - [47] , some scheduling algorithms for VOQ switches with speedup 2 can exactly emulate an output-queued switch. The maximal matching algorithm [40] , [48] is more practical than MWM, and it has been proven to be stable with a speedup of 2 [49] , [50] . A randomized scheduling algorithm presented by Tassiulas et al. [51] guarantees 100% throughput with low complexity but high delay. Other practical iterative matching algorithms such as PIM [52] , iSLIP [53] , and DRRM [9] , use multiple iterations to converge on a maximal-size matching.
With the increase of switch size and port speed, arbitration time to resolve the output port contention becomes very stringent. To relax the stringent arbitration time constraint, several schemes that use large time scales have been proposed [55]-[57]. In [55], a new method of using large frame sizes is proposed for switching variable-sized packets over a crossbar switch to minimize the reconfiguration frequency of the switch fabric. This method provides attractive delay bounds. However, it not only requires a number of schedulers that grows with the switch size, it also requires each scheduler to perform sophisticated packet scheduling schemes, such as weighted fair queueing or shaped virtual clock, which are prohibitively costly. In [56], multiple time slots are grouped into a frame. Contentions are resolved and a set of matchings is generated at each frame boundary, and during a frame the switch fabric is updated only a limited number of times. In [57], a batch of requests is accumulated and a corresponding schedule for a constrained switch is generated. However, to achieve high performance, those schemes suffer from high time complexity.
A. Packet Scheduler Architecture
Fig. 6 shows an implementation architecture for the PS, which consists of scheduling input modules (SIMs) and scheduling output modules (SOMs), each of which corresponds to an input switch module (or output switch module) in the three-stage Clos-network switch (see Fig. 1). The input port arbiters (IPAs) are divided into groups, with one group in each SIM. Each SIM also contains virtual output port arbiters (VOPAs), each of which corresponds to an output port in the corresponding OM. Each SIM has an input module arbiter (IMA), and each SOM has an output module arbiter (OMA). A crosspoint switch with a predetermined reconfiguration pattern is used to interconnect these SIMs and SOMs.
B. Concurrent Matching Algorithm for Clos-Network Switch (c-MAC)
This section describes a new Concurrent Matching Algorithm for Clos-network switches, called c-MAC, to match the input and output ports of the PSF and to find routing paths through the CMs. Here, we extend the switching unit from a cell to a frame of cells that are contributed by the input lines in the same group. For instance, with 16 cells per frame, the period of a photonic frame at 160 Gb/s is equal to a cell time slot on a 10-Gb/s line, i.e., 51.2 ns. The c-MAC scheme divides one frame period into multiple matching cycles, as shown in Fig. 7. In each matching cycle, each SIM is matched with one of the SOMs. Each cycle of the c-MAC scheme includes two phases, which find the I/O matches and the routing paths, respectively. To achieve high I/O matching efficiency, the exhaustive dual round-robin matching (EDRRM) scheme [54] is slightly modified here. Most iterative matching schemes, such as iSLIP [53] and DRRM [9], suffer from throughput degradation under unbalanced traffic distribution. The EDRRM scheme improves throughput by maintaining the existing matched pairs between the inputs and outputs, so the number of unmatched inputs and outputs is drastically reduced (especially at high load), thus reducing the inefficiency caused by not being able to find matches among unmatched inputs and outputs. One of the major problems of the EDRRM is that it may cause starvation in some inputs. One way to overcome this problem is to limit the maximum number of cells that can be continuously transmitted from each input port.
At the beginning of each cycle, each SOM passes its state bits to the corresponding SIM, as shown in Fig. 7. One group of bits corresponds to the state of the input links of the corresponding OM; the other group corresponds to the state of the output ports of the corresponding OM. There are four possible states for each output port: "00" when the output port is unmatched; "01" when the output port is matched with low priority in the last frame period; "10" when the output port is matched with high priority in the last frame period; "11" when the output port is matched in this frame period.
Here, we assume the matching sequence between SIMs and SOMs is predetermined. For instance, in the first cycle, SIM is matched with SOM , where ; . In the second cycle, SIM is matched with SOM . The procedure is repeated times. To achieve matching uniformity for all the SIMs, the beginning matching sequence between SIMs and SOMs is skewed one position at the beginning of each frame period.
Phase 1: Find I/O Matches: Phase 1 consists of three steps as described below.
Step 1) Request:
• Each matched IPA only sends a high-priority request to its matched VOPA; each unmatched IPA (including a currently matched IPA whose matched VOQ's queue length is less than a threshold) sends a 2-bit request to every VOPA for which it has queued cells in the corresponding VOQ.
["00" means no request; "01" means a low-priority request because the queue length is less than the threshold; "10" means a high-priority request because the queue length is larger than the threshold; "11" means the highest priority because the waiting time of the head-of-line (HOL) frame is larger than a waiting-time threshold].
Step 2) Grant:
• Only the "available" VOPA performs the grant operation. A VOPA is defined to be "available," if its corresponding output port is a) unmatched; or b) matched in the last frame period with low priority (the VOPA receives at least one high-priority request at this frame period); or c) VOPA is matched in the last frame period with high priority, but it receives the request from the matched IPA and its priority is becoming low-priority at this frame period.
• If a VOPA is "available" and receives one or more high-priority requests, it grants the one that appears next in a fixed round-robin schedule starting from the current position of the high-priority pointer. If there are no high-priority requests, the VOPA grants one low-priority request in a fixed round-robin schedule starting from the current position of the low-priority pointer. The VOPA notifies each requesting IPA whether or not its request is granted.
Step 3) Accept:
• If the IPA receives one or more high-priority grants, it accepts the one that appears next in a fixed round-robin schedule starting from the current position of the high-priority pointer. If there are no high-priority grants, the input port arbiter accepts one low-priority grant in a fixed round-robin schedule starting from the current position of the low-priority pointer.
Phase 2: Find Routing Paths: When $m \ge 2n - 1$, where $m$ is the number of CMs and $n$ is the number of inputs of each IM, the three-stage Clos network is a nonblocking circuit switch. Although it has been theoretically proven that a three-stage bufferless Clos-network switch is rearrangeably nonblocking when $m \ge n$, the known rearrangement schemes are impractical to implement in a high-speed switch due to their prohibitively high time complexity. For instance, some recent research results show that the time complexity of the "Euler Split" scheme grows with the switch port size [59]. To reduce computation complexity, a simple parallel matching scheme [60], [61] is adopted. Consider the set of output links of each SIM and the set of input links of each SOM, as shown in Fig. 8. Each set contains exactly one element per physical link, denoting the state of that link: "0" means that the link has not been matched, and "1" means that it has. To find a path between a SIM and a SOM, one just needs to find a vertical pair of zeros in the two corresponding sets. To improve the matching efficiency of the parallel matching scheme, we propose the following two methods.
1) When searching for vertical zero pairs in the two sets, choose those that were not occupied by any SIM and SOM in the last frame period. This improves the efficiency of finding routing paths by intentionally keeping routing paths available to those I/O pairs that were matched in the last frame period. In other words, the spirit of exhaustive matching is extended from I/O matching to internal path finding.
2) Group multiple SIMs (or SOMs) together into the same device to achieve local optimization of routing resources. However, this increases the I/O bandwidth of the device, thus limiting the number of SIMs (or SOMs) that can be integrated in the same device.
Based on the exhaustive matching policy in phase 1, the c-MAC scheme maintains a register for each SIM and for each SOM. Each register has a structure similar to the corresponding link-state set: it contains exactly one element per physical link, denoting the matching state of that link at the end of the previous frame period. When an IMA receives the accept information from the VOPAs, it performs parallel matching between the SIM and the SOM to find the routing paths between them. While finding a vertical pair of zeros in the two link-state sets, it takes into account the state of the two registers. It first picks an available link that was unmatched in the last frame period. If no such link can be found, it picks any unmatched link instead. If no unmatched link can be found for the matched I/O pair, the "accept" in step 3 of phase 1 is changed to "reject."
Fig. 8(a) shows an example with a group size of two. Two SIMs are integrated into one SIM group (SIM-G), while two SOMs are integrated into one SOM group (SOM-G). When performing parallel matching between a SIM group and a SOM group, the link state information of both SOMs is sent to the SIM group. Within the SIM group, parallel matching between one SIM and one SOM is performed by finding a vertical pair of zeros in their link-state sets. The link state information of the other SOM in the group is also examined. We first pick those zero pairs for which the other SOM has already matched with another SIM. This leads to more matching opportunities when the same SIM is to match with the other SOM later. For instance, by choosing such a link first, Fig. 8(b) results in one more match than Fig. 8(c).
When phase 2 is complete, each SIM passes its state bits (link state information and output port matching results) to the corresponding SOM. Each SIM and each SOM updates its link-state set and its register accordingly. Note that at the beginning of each frame period, the link-state sets are reset to zero.
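The grant rule of phase 1 and the path search of phase 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the list-based link-state representation, and the reuse of the high-priority pointer for "11" requests are all hypothetical, and pointer updates after a match are omitted because the text does not specify them.

```python
from typing import Mapping, Optional, Sequence

# 2-bit request codes as listed in step 1 of phase 1.
NO_REQ, LOW, HIGH, HIGHEST = 0b00, 0b01, 0b10, 0b11


def round_robin_pick(candidates: Sequence[int], pointer: int, size: int) -> Optional[int]:
    """Return the first candidate met when scanning from `pointer`, wrapping around."""
    wanted = set(candidates)
    for offset in range(size):
        index = (pointer + offset) % size
        if index in wanted:
            return index
    return None


def grant(requests: Mapping[int, int], hi_ptr: int, lo_ptr: int, size: int) -> Optional[int]:
    """Grant step for one available VOPA; `requests` maps IPA index to a 2-bit code."""
    # Assumption: "11" requests are served first and reuse the high-priority pointer.
    for level, pointer in ((HIGHEST, hi_ptr), (HIGH, hi_ptr), (LOW, lo_ptr)):
        same_level = [ipa for ipa, code in requests.items() if code == level]
        chosen = round_robin_pick(same_level, pointer, size)
        if chosen is not None:
            return chosen
    return None  # no request at all


def find_path(sim_links: Sequence[int], som_links: Sequence[int],
              sim_prev: Sequence[int], som_prev: Sequence[int]) -> Optional[int]:
    """Return the index of a link that is free on both sides, or None (reject).

    `sim_links`/`som_links` hold the current link states (0 = unmatched);
    `sim_prev`/`som_prev` hold the states at the end of the previous frame.
    """
    free = [i for i, (a, b) in enumerate(zip(sim_links, som_links)) if a == 0 and b == 0]
    # Prefer links that nobody used in the last frame period.
    for i in free:
        if sim_prev[i] == 0 and som_prev[i] == 0:
            return i
    # Otherwise fall back to any vertical pair of zeros.
    return free[0] if free else None


if __name__ == "__main__":
    print(grant({0: LOW, 2: HIGH, 5: HIGH}, hi_ptr=3, lo_ptr=0, size=8))
    print(find_path([1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]))
```

In the path-search example, links 1 and 3 are both free, but link 1 was occupied in the previous frame, so the search prefers link 3 and leaves link 1 for the pair that held it before.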
C. Packet Scheduler (PS) Time Complexity
Here, we estimate the time complexity of the PS. For a 6400 × 6400 system with 16 lines per port, the amount of information that needs to be transmitted from each IPC to the PS is 26 B [$16 \times \lceil \log_2 6400 \rceil = 208$ bit] per time slot, or about 4 Gb/s. In return, the PS sends back the matching results to each IPC, where they become the cell headers of the frame, and to each IGM, where they become the in-band header of the frame. The former has a total of 93 bits (13 + 80) and the latter has a total of 30 bits (7 + 7 + 7 + 1 + 8). Thus, the bandwidth is about 1.8 Gb/s from the PS to the IPC and 600 Mb/s from the PS to the IGM.
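These bandwidth figures can be reproduced by dividing the bit counts by the 51.2-ns time slot. The Python sketch below does so; reading the 26-B request as 16 fields of ⌈log2 6400⌉ bits is an assumption made to match the stated total, and the names are illustrative.

```python
import math

SLOT_NS = 51.2          # one cell time slot at the 10-Gb/s line rate
PORTS = 6400            # switch size
LINES_PER_PORT = 16     # 10-Gb/s lines groomed into one port


def gbps(bits_per_slot: float) -> float:
    """Bits per slot divided by the slot length in ns gives Gb/s."""
    return bits_per_slot / SLOT_NS


# Assumed breakdown: one output-port number per line, 13 bits each.
request_bits = LINES_PER_PORT * math.ceil(math.log2(PORTS))
ipc_result_bits = 13 + 80          # matching result returned to each IPC
igm_result_bits = 7 + 7 + 7 + 1 + 8  # header returned to each IGM

print(f"IPC -> PS: {request_bits} bits ({request_bits // 8} B), {gbps(request_bits):.2f} Gb/s")
print(f"PS -> IPC: {ipc_result_bits} bits, {gbps(ipc_result_bits):.2f} Gb/s")
print(f"PS -> IGM: {igm_result_bits} bits, {gbps(igm_result_bits) * 1000:.0f} Mb/s")
```

The computed rates (roughly 4.06 Gb/s, 1.82 Gb/s, and 586 Mb/s) agree with the rounded values quoted in the text.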
Let us estimate the required frame period to finish the c-MAC arbitration. Assume that group size is set to modules as shown in Fig. 8; the I/O bandwidth of one chip is BW; the time spent for I/O matching and finding routes within the SIM in each matching cycle is . Because of grouping, the number of required matching cycles is reduced to , while the information between a SIM and a SOM increases to . So, can be estimated as (1) For example, when assuming , ns, Gb/s, , should be set larger than 496 ns, equivalent to 10 cells at the 10-Gb/s line rate, or 160 cells at the 160-Gb/s port speed. The total switching capacity of the crosspoint switch in Fig. 6 is 400 Gb/s × 80, or 32 Tb/s, about 1/30 of that of the photonic packet switch fabric. The crosspoint switch can be implemented with 160 switches in parallel, each with an 80 × 80 crosspoint chip at 2.5-Gb/s link speed (raw data rate at 3.125 Gb/s [17]).
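The crosspoint-switch sizing and the frame-length equivalence above can be verified with the following Python sketch. It uses only the numbers quoted in the text; the variable names are illustrative.

```python
import math

# Crosspoint switch figures quoted in the text.
LINK_GBPS = 2.5          # link speed of each crosspoint chip
PARALLEL_SWITCHES = 160  # chips operated in parallel
CHIP_PORTS = 80          # 80 x 80 crosspoint chip

port_gbps = LINK_GBPS * PARALLEL_SWITCHES        # bandwidth per crosspoint port
crosspoint_tbps = port_gbps * CHIP_PORTS / 1000  # total crosspoint capacity
fabric_tbps = 6400 * 160 / 1000                  # photonic fabric capacity
ratio = fabric_tbps / crosspoint_tbps

# Minimum frame period from the example, expressed in cell slots.
MIN_FRAME_NS = 496
CELL_SLOT_NS = 51.2
cells_at_line_rate = math.ceil(MIN_FRAME_NS / CELL_SLOT_NS)
cells_at_port_speed = cells_at_line_rate * 16    # 16 compressed cells per slot

print(f"crosspoint port bandwidth: {port_gbps:.0f} Gb/s")
print(f"crosspoint capacity: {crosspoint_tbps:.0f} Tb/s (1/{ratio:.0f} of the fabric)")
print(f"frame length: {cells_at_line_rate} cells at 10 Gb/s, {cells_at_port_speed} at 160 Gb/s")
```

The exact ratio is 1/32, which the text rounds to about 1/30.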
D. Performance Study
1) Performance of c-MAC in a Single-Stage Nonblocking Switch:
The performance of the c-MAC scheme was evaluated using computer simulations for uniform and nonuniform traffic models. We first studied the performance of the c-MAC in a single-stage nonblocking switch (it can also be a three-stage Clos-network switch with $m \ge 2n - 1$; the performance and impact of $m$ in a three-stage bufferless Clos network are further discussed in Section III-D2). The delay, in units of cell time slots, is measured from when a cell enters the IPC to when it arrives at the OPC. Fig. 9 shows the throughput of c-MAC and iSLIP under Bernoulli traffic with different unbalanced probabilities and frame sizes for a switch size of 64. The unbalanced probability is defined to be the portion of traffic that is sent to a particular output port, with the remaining traffic evenly distributed to all others. When it is 0, the traffic is uniformly distributed to all output ports. When it is 1, it is circuit switching, meaning that the traffic from each input port is all destined for a single output port. The number of iterations of finding matches in each frame period is set to one, and so is the average burst length. Both iSLIP and c-MAC achieve very high throughput under uniform traffic distributions. The former can even achieve 100% due to the desynchronization of pointers in the input and output arbiters. However, the latter provides much higher throughput (close to 100%) under unbalanced distributions. This is because c-MAC significantly reduces the number of unmatched inputs and outputs by maintaining the existing matches, thus reducing output port contention.
The c-MAC scheme still maintains high throughput with a large frame size of 16. This is because filled frames are served with higher priority than unfilled ones, thus providing higher frame utilization. As a result, currently matched input and output pairs may not be able to maintain matching if their corresponding queue lengths are less than .
To avoid some pathological traffic conditions that may lead to starvation, the head-of-line (HOL) frame's priority level is boosted to the highest when its waiting time exceeds a certain threshold, say . When the HOL frame's waiting time exceeds the threshold, the input port arbiter only sends these highest-priority requests to the VOPAs, which then first grant the highest-priority requests. We assume that , , , and in the simulations. Setting the threshold too small may break already established I/O pairs and adversely impact throughput. However, when the threshold is set as large as 16 000 cell times, as shown in Fig. 10 , high throughput, e.g., 0.98, can be achieved. This is because large waiting time thresholds will not break already established matches too often. Thus, starvation is prevented and high matching efficiency is maintained.
Fig. 11 shows the average delay of the c-MAC scheme with different frame and switch sizes under uniform burst traffic. We assume that the burst length is exponentially distributed with an average burst length of 16, observed at each input of the switch fabric. Under low traffic load, the average delay of the c-MAC scheme increases approximately proportionally with the frame size. This is because an arriving cell must wait for the beginning of the next frame if it is chosen under the frame-based mechanism. However, at high load, all VOQs tend to be nonempty and the frames transmitted through the switch fabric tend to be full due to the priority mechanism introduced in the c-MAC scheme. This explains why the impact of frame size on the average delay becomes smaller (i.e., the delay curves gradually converge) under heavy traffic load, and why close to 100% throughput is achieved even with large frame sizes and switch sizes as the traffic load approaches 100%.
2) Performance of c-MAC in a Three-Stage Clos-Network Switch:
Fig. 12 shows that the c-MAC scheme with local optimization, i.e., , [see Fig. 12(a) ] can achieve much higher throughput than that without local optimization [see Fig. 12(b) ] in the three-stage Clos-network switch by considering different CM numbers and speedup factors under unbalanced traffic distribution. We assume of 2. With local optimization, even without bandwidth expansion (i.e., ), an internal speedup of 1.5 can achieve close to 100% throughput in the three-stage Clos-network switch. The throughput improvement is due to the fact that the existing matched routing paths are maintained when performing parallel matching. Higher matching efficiency for routing paths is obtained by checking for more routing resources when SIMs and SOMs are grouped together.
Fig. 13 shows the delay performance, including input delay at the IPC, output delay at the OPC, and the total delay (sum of the input and output delays) for different switch sizes without internal bandwidth expansion, but with a small internal speedup of 1.5. We assume the service rate of the OPC is equal to the port speed. We evaluated the average delay under nonuniform burst traffic. Increasing the switch size to 256 does not affect the delay performance much, especially under nonuniform traffic distributions. This is because with an internal speedup of 1.5, most frames have been transferred to the ELCs, where excessive delay is expected. Thus, the overall delay performance approaches that of output-buffered switches.
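The frame priority and anti-starvation rule described in the single-stage study can be sketched as follows. The class, function, and level names are hypothetical and chosen only for illustration; the sketch covers the request-generation step at one input port arbiter, not the complete c-MAC matching.

```python
from dataclasses import dataclass


@dataclass
class HolFrame:
    queue_len: int   # cells waiting in this VOQ
    wait_time: int   # waiting time of the HOL frame, in cell times


# Hypothetical priority levels; a larger value is served first.
UNFILLED, FILLED, TIMED_OUT = 0, 1, 2


def priority(frame, frame_size, wait_threshold):
    """Priority of one HOL frame."""
    if frame.wait_time > wait_threshold:
        return TIMED_OUT              # boosted to avoid starvation
    if frame.queue_len >= frame_size:
        return FILLED                 # a full frame can be sent
    return UNFILLED


def requests_to_send(frames, frame_size, wait_threshold):
    """Map each requested output to its priority level.

    frames maps output index -> HolFrame. If any HOL frame has timed
    out, only those highest-priority requests are sent.
    """
    levels = {
        out: priority(f, frame_size, wait_threshold)
        for out, f in frames.items()
        if f.queue_len > 0
    }
    timed_out = {o: p for o, p in levels.items() if p == TIMED_OUT}
    return timed_out if timed_out else levels


voqs = {0: HolFrame(16, 5), 1: HolFrame(3, 20000), 2: HolFrame(0, 0)}
boosted = requests_to_send(voqs, frame_size=16, wait_threshold=16000)
normal = requests_to_send(voqs, frame_size=16, wait_threshold=100000)
```

In the first call the frame for output 1 has waited longer than the threshold, so it is the only request sent; in the second call no frame has timed out and the filled frame for output 0 outranks the unfilled one.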
IV. PHOTONIC SWITCH FABRIC (PSF)
A. Multistage PSF
Fig. 14 shows the structure of the PSF. It consists of input modules (IMs), central modules (CMs), and output modules (OMs) in a three-stage Clos network. The switch dimensions of the IM and OM are while CM is . The IM at the first stage is a simple AWG device. The CMs and OMs consist of a SCU, a wavelength switching unit (WSU), and an AWG. A 6400 × 6400 switch can be realized using 80 wavelengths, i.e., 80 ports on each of 80 modules (80 × 80 = 6400). With a port speed of 160 Gb/s, the total switch capacity reaches 1.024 petabit/s.
Based on the cyclic routing property of an AWG router, full connectivity between the inputs and outputs of the IM can be established by arranging input wavelengths. By switching the laser wavelength at each of the inputs, the incoming optical signal that carries the photonic frame can emerge at any one of the outputs, resulting in a nonblocking space switch. Since the AWG is a passive device, the reconfiguration of this space switch is solely determined by the active wavelength tuning of the input tunable laser. The wavelength switching time can be reduced to a couple of nanoseconds by rapidly changing the control currents for multiple sections in tunable semiconductor lasers [26] - [28] .
An example of a wavelength routing table for an 8 × 8 AWG is shown in Fig. 15 . A wavelength routing table can be established to map the inputs and outputs on a specific wavelength plan. In general, the wavelength from input ( ) to output ( ) can be calculated according to the following formula:
modulo . For example, input (5) needs to switch to wavelength to connect to output (3) for as highlighted in the router table.
To cascade AWGs for multistage switching, CMs and OMs have to add wavelength conversion capability, where the incoming wavelengths from the previous stage are converted to new wavelengths. An all-optical technique is deployed to provide the necessary wavelength conversion without O/E conversion.
Fig. 16 illustrates the detailed design of the CM and OM. Three key elements used to implement the switch module are the SCU for header processing and recognition, the WCU for performing all-optical wavelength conversion, and the AWG as a space switch (the same as the one in the IM). The main function of the SCU is to process the SC header information for setting up the switch path. The SC header information, which consists of 3 B, is readily available at each stage of the PSF as these bytes are carried out-of-band along with each photonic frame as shown in Fig. 3(c) . Upon arriving at each module, a portion of the power from the photonic frame is stripped by an optical tap and fed into the SCU for subcarrier demodulation. At the front end of the SCU, a low-bandwidth photo-detector and a low-pass filter are able to recover the header information from the photonic frame. The SC header information is used to set the wavelength of the continuous-wave (CW) tunable laser in the WSU. On the data path, a fixed fiber delay is added to allow the SCU to have sufficient time to perform header recognition and processing. The total propagation time between the input and output links is properly controlled to guarantee that the frame arrives at each switch module within system timing tolerance.
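The cyclic routing property that makes an AWG usable as a space switch can be illustrated with a short sketch. The rule used here, wavelength index = (i + j) mod N with zero-based indices, is an assumption chosen as one common cyclic convention; it is not taken from the formula above or from Fig. 15, whose indexing may differ.

```python
def awg_wavelength(i, j, n):
    """Wavelength index that routes input i to output j.

    Assumed cyclic rule with zero-based indices; the convention
    used in Fig. 15 may differ.
    """
    return (i + j) % n


def routing_table(n):
    """n x n table: entry [i][j] is the wavelength index for i -> j."""
    return [[awg_wavelength(i, j, n) for j in range(n)] for i in range(n)]


def is_latin_square(table):
    """True if every row and every column uses each index exactly once."""
    n = len(table)
    full = set(range(n))
    rows_ok = all(set(row) == full for row in table)
    cols_ok = all({table[i][j] for i in range(n)} == full for j in range(n))
    return rows_ok and cols_ok


table8 = routing_table(8)
```

Because the table is a Latin square, each input reaches every output by tuning to a distinct wavelength, and inputs addressing different outputs never need the same wavelength at the same output, which is what allows the passive AWG to act as a nonblocking space switch.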
Recently, wavelength conversion at OTDM rates up to 168 Gb/s was demonstrated using a symmetric Mach-Zehnder (SMZ)-type all-optical switch [32] , [33] . The strong refractive index change from the carrier-induced resonance nonlinearity in the semiconductor optical amplifiers (SOAs), coupled with the differential interferometric effect, provides an excellent platform for high-speed signal processing. A similar device has also been demonstrated in demultiplexing an ultra-high bit rate OTDM signal at 250 Gb/s, which shows its excellent high-speed capability. Therefore, we can consider using an array of such devices to accomplish the all-optical wavelength conversion at an ultra-high bit rate.
The basic structure of the WCU, based on a Mach-Zehnder (MZ) interferometer with in-line SOAs at each arm, is shown in Fig. 17 . The incoming signal at its original wavelength is split and injected to the signal inputs, entering the MZ from the opposite side of the switch. Fig. 17(b) shows the operation of the wavelength conversion. A switching window in the time domain can be set up (rising edge) by the femto-second ultrafast response induced by the signal pulses through the carrier resonance effect of the SOAs. The fast response of the SOA resonance is in the femto-second regime, considerably shorter than the desired rise time of the switching window. Although the resonance effect of each individual SOA suffers from a slow trailing response ( 100 picoseconds), the delayed differential phase in the MZ interferometer is able to cancel the slow-trailing effects, resulting in a fast response on the trailing edge of the switching window. By controlling the differential time between the two SOAs accurately, the falling edge of the switching window can be set at the picosecond time scale. The timing offset between the two SOAs located at each arm of the MZ interferometer controls the width of the switching window. To be able to precisely control the differential timing between the two arms, a phase shifter is also integrated in the interferometer.
The wavelength conversion occurs when continuous-wave (CW) light at a new wavelength enters the input of the MZ interferometer. An ultrafast data stream whose pattern is an exact copy of the incoming signal pulses is created at the new wavelength at the output of the MZ interferometer (marked switched output in the figure), completing the wavelength conversion. Using active elements (SOAs) in the WCU greatly increases the power budget while minimizing the possible coherent crosstalk in the multistage PSF. As shown in Fig. 17(a) , the incoming signal pulses, counter-propagating with the CW light from the tunable distributed feedback (DFB) laser, eventually emerge at the opposite side of the WCU, eliminating the crosstalk between the incoming and the converted outgoing wavelengths. The required switching energy from the incoming signal pulses can be as low as a couple of femto-joules due to the large resonance nonlinearity. After the wavelength conversion, the output power level for the new wavelength may reach the mW level, supplied by the CW laser. Therefore, an effective gain of 15-25 dB can be expected between the input and output optical signals through the WCU. This effective amplification is the key to maintaining effective power levels for the optical processing at each stage of the massively interconnected PSF.
The building modules used in the PSF have the potential to be monolithically integrated due to their similar architectures. There have already been attempts to build integrated SOAs in a waveguide structure on planar lightwave circuit (PLC) technologies [62] . As shown in Fig. 16 , the components in the dashed lines are the best candidates for integration due to their similarity in architecture and design. This integration provides dramatic savings on the power budget and component cost.
To reach a total switch capacity of 1.024 petabit/s, the required bandwidth can be estimated to be 192 nm assuming 80 wavelengths at a 160-Gb/s port speed, which ultimately limits the system scalability. It is necessary to apply techniques such as polarization multiplexing and the binary coding scheme to further reduce the total spectral width by a factor of two or more. The tuning range of the lasers and SOAs also limits the scalability. To address this, we propose to use multiple tunable lasers and SOAs, each of which is capable of tuning over a subset of the whole spectrum.
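The spectral estimate above implies a per-channel spacing that can be checked numerically. The following is a rough sketch: the 1550-nm center wavelength used to convert the spacing from nanometers to gigahertz is an assumption and is not stated in the text.

```python
C = 299_792_458.0   # speed of light in vacuum, m/s


def channel_spacing_nm(total_nm, channels):
    """Spectral width available to each wavelength channel."""
    return total_nm / channels


def spacing_ghz(spacing_nm, center_nm=1550.0):
    """Convert a wavelength spacing to a frequency spacing.

    Uses |dv| = c * dlambda / lambda^2; the center wavelength is an
    assumption made for this sketch.
    """
    return C * (spacing_nm * 1e-9) / (center_nm * 1e-9) ** 2 / 1e9


spacing = channel_spacing_nm(192.0, 80)      # nm per channel
spacing_freq = spacing_ghz(spacing)          # GHz per channel
halved_width = 192.0 / 2                     # after a factor-of-two reduction
capacity_pbps = 6400 * 160 / 1e6             # ports x Gb/s, in Pb/s
```

Under this assumption each 160-Gb/s channel occupies 2.4 nm, or roughly 300 GHz, and a factor-of-two reduction would bring the total width down to 96 nm.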
B. OTDM Input Grooming Module (IGM)
Optical time-division multiplexing (OTDM) can operate at ultrafast bit rates that are beyond the current electronics limit, which is around 40 Gb/s. By interleaving short optical pulses in the time domain, aggregated frames can be formed to carry data at bit rates of hundreds of gigabits per second. Using the OTDM technique, bandwidth can be increased by at least one order of magnitude compared with the existing electronics approach.
The IGM interfaces with parallel electronic inputs from the IPC. Fig. 18 shows the structure of the IGM based on the OTDM technology. It consists of a short-pulse generation unit, a modulator array, and a passive fiber coupler with proper time delays for time-interleaved multiplexing. Optical pulses with widths of several picoseconds can be generated using electroabsorption modulators (EAMs) over-driven by a 10-GHz sinusoidal clock signal. Using a tunable CW DFB laser as the light source, the wavelength of the output ultra-short pulses is also tunable. The pulse width generated by the cascaded EAMs is around 7-10 ps, which is suitable for data rates up to 100 Gb/s. To generate pulses suitable for higher bit rates (> 100 Gb/s), nonlinear compression with self-phase modulation (SPM) can be used. The pulses, generated from the EAMs, are injected into a nonlinear medium (a dispersion shifted or photonic bandgap fiber) followed by a compression fiber (dispersion compensation fiber) to further compress the pulse width to about 1 ps. The parallel input lines from the IPC electronically modulate the modulator array to encode the bit stream onto the optical pulse train. Precise time delays on each branch of the fiber coupler ensure time-division multiplexing of inputs. Through the parallel-to-serial conversion in the multiplexer, cells at 10 Gb/s from the IPC are now effectively compressed in the time domain as the RZ-type photonic frame that operates at Gb/s in serial. The fiber coupler and time delays can be integrated using planar waveguide structures [62] .
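The time-interleaved multiplexing performed by the fiber coupler can be sketched at the bit level. This is a logical model only, with hypothetical function names; it assumes 16 parallel 10-Gb/s inputs (so that the serial rate is 160 Gb/s) and ignores pulse shape, jitter, and all optical effects. The inverse function is included only to check the round trip.

```python
def branch_delays_ps(n_channels, channel_gbps=10.0):
    """Delay of each coupler branch so that pulses interleave evenly."""
    bit_period_ps = 1000.0 / channel_gbps      # 100 ps at 10 Gb/s
    slot_ps = bit_period_ps / n_channels       # serial bit interval
    return [k * slot_ps for k in range(n_channels)]


def otdm_mux(channels):
    """Bit-interleave equal-length parallel bit lists into one serial list."""
    return [bit for bits in zip(*channels) for bit in bits]


def otdm_demux(serial, n_channels):
    """Recover the parallel bit lists by taking every n-th serial bit."""
    return [serial[k::n_channels] for k in range(n_channels)]


delays = branch_delays_ps(16)                  # assumed 16 x 10 Gb/s inputs
inputs = [[1, 0], [0, 1], [1, 1]]
serial = otdm_mux(inputs)
recovered = otdm_demux(serial, len(inputs))
```

With 16 inputs the serial bit interval is 6.25 ps, which is why the pulses must be compressed well below the 7-10 ps produced by the EAMs alone.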
C. OTDM Output Demultiplexing Module (ODM)
At the receiving end of the system, the ODM demultiplexes photonic frames from the output of the PSF into parallel electronic signals at 10 Gb/s. As Fig. 19 shows, the ODM consists of a quarter-phase detector and quarter-phase shifter, an array of OTDM demultiplexers (DEMUX) based on EAMs, and the photo-detector (PD) array for O/E conversions. We have previously demonstrated ultrafast demultiplexing at 40, 80, 100, and 160 Gb/s using cascaded EAMs as the gating device. As shown in the inset of Fig. 19 , the OTDM demultiplexer consists of two cascaded EAMs based on multiple quantum well devices [37] , [38] . An SOA section is also integrated with the EAM to provide optical amplification at each stage. The optical transmission of the EAM, controlled by the driving electronic signal, responds highly nonlinearly and produces an ultra-short gating window in the time domain. Cascading the EAMs can further shorten the gating window compared with a single EAM. The incoming optical signal is split by a optical coupler into modulators located in the array structure. Each EAM is over-driven by a 10-GHz sinusoidal radio frequency (RF) clock to create the gating window for performing demultiplexing. The RF driving signals supplied to adjacent modulators in the array structure are shifted by a time , where is the bit interval inside the photonic frame. As a result, modulators are able to perform demultiplexing from Gb/s down to 10 Gb/s on consecutive time slots of the photonic frame.
The incoming frames may inherit timing jitter induced by either slow thermal effects (fiber, device, and component thermal lengthening) or system timing errors. The result is a slow (compared with the bit rate) walk-off from the initial timing (phase). Since the frames operate in burst mode, a traditional phase-locked loop cannot be applied here. To track the slowly varying jitter on the burst frames, we suggest a quarter-phase locking scheme using a phase detector and a phase shifter.
A quarter-phase detector is shown in Fig. 20 . Four OTDM demultiplexers, based on EAM technology, are used as the phase detectors because of their high-speed gating capability. The driving RF sinusoidal signal for each modulator is now shifted by . Depending on the phase (timing) of the incoming signal, one of the four demultiplexer outputs has the strongest signal intensity compared with the other three. A 4 : 2-bit decoder is then used to control the quarter-phase shifter to align the 10-GHz RF signal to the chosen phase. For example, assuming EA aligns best with the incoming signal at one instant, the output from Q would be the strongest signal and would be picked up by the comparator. The clock that is supplied to the OTDM demux is then adjusted according to the detected phase.
The quarter-phase shifter, also shown in Fig. 20 , is used to rapidly shift the phase according to the detected phase. The quarter-phase shifter has been demonstrated using a digital RF switched delay lattice. The semiconductor switch is used to set the state at each stage. Depending on the total delay through the lattice, the output phase can be shifted by changing the state at each lattice. The resulting clock is synchronized with the incoming packet with a timing error less than .
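The quarter-phase locking logic can be sketched as a select-and-shift step. The function names are hypothetical, and the period over which the four quarter steps are taken is left as a parameter because the text does not recover it; the 100-ps value in the usage lines (one cycle of the 10-GHz clock) is an assumption for illustration only.

```python
def detect_quarter(intensities):
    """Index (0 to 3) of the strongest of the four detector outputs."""
    if len(intensities) != 4:
        raise ValueError("expected four detector outputs")
    return max(range(4), key=lambda k: intensities[k])


def encode_2bit(index):
    """Two control bits (MSB, LSB) for the selected quarter."""
    return (index >> 1) & 1, index & 1


def clock_shift_ps(index, period_ps):
    """Clock delay that aligns the clock with the selected quarter."""
    return index * period_ps / 4.0


# Assumed example values: detector 1 sees the strongest signal.
outputs = [0.10, 0.70, 0.15, 0.05]
quarter = detect_quarter(outputs)
control_bits = encode_2bit(quarter)
shift = clock_shift_ps(quarter, period_ps=100.0)
```

Because the selection is a simple comparison rather than a feedback loop, it can settle on each burst independently, which is the property the scheme needs in place of a phase-locked loop.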
V. CONCLUSION
We have presented a highly scalable petabit photonic packet switch architecture, called PetaStar. The switching fabric utilizes photonic multiplexing techniques in the space, time, wavelength, and subcarrier domains to achieve switching capacity up to the petabit/s regime. Packet buffering is implemented electronically at the input and output port controllers to avoid buffering in the central photonic switching fabric. The result is a bufferless switch fabric, capable of transporting ultrafast photonic data frames without E/O conversion. All-optical wavelength conversion using semiconductor optical amplifiers (SOAs) and a Mach-Zehnder interferometer is used at each stage to enable the cascading of multiple stages of wavelength-to-space switches based on AWG devices. A 6400 × 6400 switch based on a three-stage Clos network can be realized using 80 wavelengths at each switch module while the port speed can be increased to 160 Gb/s by the OTDM technology, resulting in a total capacity of 1.024 petabit/s.
To resolve the contention in the three-stage Clos network, we present a new matching scheme, called c-MAC, a concurrent matching algorithm for Clos-network switches. It is highly distributed such that the I/O matching and routing-path finding are concurrently performed by scheduling modules. To relax the strict arbitration time constraint and round-trip delay between line cards and the packet scheduler and to accommodate slow-configuration optical switch fabrics, we introduce the concept of a frame as the basic switching unit in an arbitration (or switching) cycle. A frame aggregates multiple cells that are going to the same output from the input lines, allowing sufficient time to complete the arbitration and switching operation. The c-MAC scheme provides high throughput with few required arbitration steps by maintaining the existing matched pairs between the inputs and outputs so that the number of unmatched ports is drastically reduced at each arbitration cycle. By implementing a timer for each head-of-line frame and granting the highest priority to those ports whose timer expires, the starvation problem can be eliminated. At the expense of a moderate internal speedup of 1.5, a throughput close to 100% can be achieved under various traffic conditions. A feasible implementation of the c-MAC scheme, where a crosspoint switch is used to provide the interconnection between the arbitration modules, is also described.
