Abstract-We present experimental and numerical studies of a novel packet-switch architecture, the data vortex, designed for large-scale photonic interconnections. The self-routing multihop packet switch efficiently scales to large port counts (approximately 10 k) while maintaining low latencies, a narrow latency distribution, and high throughput. To facilitate optical implementation, the data-vortex architecture employs a novel hierarchical topology, traffic control, and synchronous timing that act to reduce the necessary routing logic operations and buffering. As a result of this architecture, all routing decisions for the data packets are based on a single logic operation at each node. The routing is further simplified by the employment of wavelength division multiplexing (WDM)-encoded header bits, which enable packet-header processing by simple wavelength filtering. The packet payload remains in the optical domain as it propagates through the data-vortex switch fabric, exploiting the transparency and high bandwidths achievable in fiber optic transmission. In this paper, we discuss numerical simulations of the data-vortex performance and report results from an experimental investigation of multihop WDM packet routing in a recirculating test bed.
Q. Yang is with the Electrical Engineering Department, Princeton University, Princeton, NJ 08544 USA (e-mail: qyang@ee.princeton.edu).
K. Bergman was with the Electrical Engineering Department, Princeton University, Princeton, NJ 08544 USA. She is now with Tellium, Inc., Oceanport, NJ 07757 USA.
I. INTRODUCTION

With the ever-growing demand for higher data rates and a wide variety of services, a fully transparent optical-switch network element presents an attractive way to overcome the electronic speed bottleneck. However, all-optical implementations encounter considerable challenges in the processing and buffering of optical data, which are essential to ensuring switch performance and to providing the necessary services and protections. Therefore, most optical-switch architectures actually employ hybrid optical-to-electronic technologies where the strengths of optics such as transparency and high bandwidth can be fully exploited while the weaknesses can be bypassed in the electronic domain. Nevertheless, issues such as contention resolution, buffering, scalability, and latency remain the key challenges in photonic packet switching [1]-[4]. To achieve high throughputs, a packet switch generally processes multiple packets simultaneously. Therefore, whether the architecture is based on a single hop or multiple hops, contention is inevitable when multiple packets are competing for the same output port. To solve the contention problem, either buffering or deflection techniques can be implemented. Simple deflection methods without buffers (hot-potato routing) usually introduce severe performance penalties in throughput, latency, and latency distribution [5]. Current optical buffering techniques configured as optical fiber delay lines in a traveling or a recirculating geometry do not have random access capability [6], [7]. The straight delay-line buffers can introduce large timing errors and the recirculating rings are generally bulky and expensive. Therefore, it is still impractical to employ optical buffers within a switching network element, especially for very large-scale switches.
In recent research, the use of wavelength division multiplexing (WDM) in packet switches has been explored to achieve greater flexibility. Technologies such as wavelength routing (WR) and wavelength conversion (WC) are proposed to solve the difficulty of optical buffering [8] - [12] . These techniques are very attractive in maintaining the throughput performance as well as the optical data transparency without resorting to optical buffers. The wavelengths are exploited as logical buffers within the existing architectures. Generally, however, the additional WR and WC devices add performance penalties and make such systems complex and costly.
In this paper, we present experimental and numerical studies of a new architecture, the data vortex, as a candidate for large-scale packet switching [13] . The multihop packet switch tightly couples the deflection method and a virtual buffering mechanism to achieve hardware simplicity, scalability, and high throughput. The timing and control algorithm permits only one packet to be processed at each node in a given clock frame and, therefore, the need to process contention resolution is eliminated. The wavelength domain is additionally used to enhance the throughput and to simplify the routing strategy. Numerical studies of the traffic flow within the switch architecture have shown that the data vortex can efficiently scale to greater than 10 000 ports while maintaining a low packet switching latency and a narrow latency distribution.
The rest of the paper is organized as follows. Section II provides an overview of the data-vortex switch architecture, and in Section III, the numerical studies of the system performance are presented. In Section IV, experimental results from the recirculating test bed are discussed. Finally, we present our conclusions in Section V.
0733-8724/01$10.00 © 2001 IEEE
II. ARCHITECTURE OVERVIEW [13]
The data-vortex architecture was designed specifically to facilitate optical implementation by minimizing the number of switching and logic operations, and by eliminating the use of internal buffering. The architecture employs a hierarchical structure, synchronous packet clocking, and distributed-control signaling to avoid packet contention and reduce the number of logic decisions required to route the data traffic. As a result of this novel topology and data flow, the node design for routing the optical packets is greatly simplified. All packets within the switch fabric are assumed to be the same size, and are aligned in timing when they arrive at the input ports. The routing paths of the switching architecture are designed to maintain this synchronous slotted operation. A detailed description of the switch architecture is provided in [13].
The data-vortex topology consists of routing nodes that lie on a collection of concentric cylinders. The cylinders are characterized by a height parameter H, corresponding to the number of nodes lying along the cylinder height, and an angle parameter A, typically selected as a small odd number, corresponding to the number of nodes along the circumference. The total number of nodes is A × H for each of the concentric cylinders. The number of cylinders C scales with the height parameter as C = log2(H) + 1. Because the maximum available number of input ports into the switch is given by A × H, which equals the available number of output ports, the total number of routing nodes is given by A × H × (log2(H) + 1) for a switch fabric with A × H input-output (I/O) ports. To illustrate the packet-routing paths through the cylinders, we show the routing tours from a top view of the concentric cylinders and from a side view of each cylinder level. In Fig. 1, an example switch fabric is shown. Each cross point shown is a routing node, labeled uniquely by its angle, cylinder, and height coordinates. As shown, the packets are injected at the outermost cylinder from the input ports, and emerge at the innermost cylinder toward the output ports. Each packet is self-routed in the fashion of binary-tree decoding as it propagates from the outer cylinder toward the inner cylinder. Each step inward to the next cylinder fixes a specific bit within the binary header address. By allowing the packet to circulate, the innermost cylinder also alleviates the overflow of the output-port buffers.
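The scaling relations above can be checked with a short calculation. The following is a minimal sketch, assuming that H denotes the height parameter, A the angle parameter, and that the number of cylinders is log2(H) + 1; the function name `vortex_size` and the small H = 4 case are illustrative choices, while A = 5 and H = 2048 correspond to the largest switch discussed in Section III.

```python
def vortex_size(angles: int, height: int) -> dict:
    """Return cylinder, port, and node counts for a data-vortex fabric.

    Assumes the height is a power of two, so that log2(height) + 1
    equals height.bit_length().
    """
    if height < 1 or height & (height - 1):
        raise ValueError("height must be a power of two")
    cylinders = height.bit_length()      # log2(H) + 1
    ports = angles * height              # A x H inputs (and outputs)
    nodes = ports * cylinders            # A x H x (log2(H) + 1)
    return {"cylinders": cylinders, "ports": ports, "nodes": nodes}


if __name__ == "__main__":
    print(vortex_size(5, 4))       # small illustrative fabric
    print(vortex_size(5, 2048))    # roughly 10 k ports
```

The node count grows only as the port count times log2(H) + 1, which is the basis of the scalability argument.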
Packets are processed synchronously in a highly parallel manner. Within each time slot, every packet within the switch progresses by one angle forward in the given direction, either along the solid line on the same cylinder or along the dashed line toward the inner cylinder. The solid routing pattern at the specific cylinder shown can be constructed as follows. First, we divide the total number of nodes along the height into 2^c subgroups, where c is the index of the cylinder. The first subgroup is then mapped as follows. For each step, we map half of the remaining nodes at angle a from the top to half of the remaining nodes at angle a + 1 from the bottom in a parallel way. This step is repeated until all nodes of the first subgroup are mapped from angle a to angle a + 1. If multiple subgroups exist, the rest of them copy the mapping pattern of the first subgroup. This mapping rule can easily be verified from the example switch fabric shown. The solid routing paths are repeated from angle to angle, which provides permutations between "1" and "0" for the specific header bit. At the same time, because of the twisting feature of the pattern, the packet-deflection probability is minimized owing to the reduced correlation between different cylinders. The dashed-line paths between neighboring cylinders maintain the same height index because they are only used to forward the packets.
The avoidance of contention, and thereby the reduction of processing necessary at the nodes, is accomplished with separate control bits. Control messages pass between nodes before the packet arrives at a given node to establish the right of way. Specifically, a node A on cylinder c has two input ports: one from a node B on the same cylinder c and one from a node C on the outer cylinder c - 1. A packet passing from B to A causes a control signal to be sent from B to C that blocks data at C from progressing to A. The blocked packet is deflected and remains on its current cylinder level. As mentioned above, the routing paths along the angle dimension provide permutations between "1" and "0" for the specific header bit. Therefore, after two node hops, the packet will be in a position to drop to an inner cylinder and maintain its original target path. The control messages thus permit only one packet to enter a node in any given time period. Because two or more packets never contend for the same output port in the data vortex, the logic operations at the node are significantly simplified and the switching time is reduced, contributing to the overall low latency of the switching fabric. Similar to convergence routing [14], the control mechanism and the routing topology of the data-vortex switch allow the packets to converge toward the destination after each routing stage. The fixed priority given to the packets at the inner cylinders by the control mechanism allows the routing fairness to be realized in a statistical sense. The data vortex has no internal buffers; however, the switch itself essentially acts as a delay-line buffer. Buffers are located at the input and output ports to control the data flow into and out of the switch. If there is congestion at an output buffer, the data waiting to leave to that buffer circulates around the lower cylinder and, thus, is optimally positioned to exit as soon as the output ports are free.
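The single routing decision made at each node can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: it assumes that cylinders are indexed from 0 at the outermost level, that cylinder c decodes the c-th most significant bit of the destination height, and that the control signal is available as a boolean. The function name `next_hop` and its parameters are hypothetical.

```python
def next_hop(cylinder: int, height: int, dest: int, n_bits: int,
             blocked: bool) -> str:
    """Decide where a packet goes in the next time slot.

    cylinder : index of the current cylinder (0 = outermost, assumed)
    height   : height index of the current node
    dest     : destination height encoded in the packet header
    n_bits   : number of header bits, i.e., log2 of the switch height
    blocked  : True if a control signal reserves the inner node

    Returns "inner" when the packet drops to the next cylinder, or
    "same" when it stays (and is deflected) on its current cylinder.
    """
    if not 0 <= cylinder < n_bits:
        raise ValueError("only cylinders that decode a header bit route inward")
    shift = n_bits - 1 - cylinder            # bit decoded at this cylinder
    bit_matches = ((height >> shift) & 1) == ((dest >> shift) & 1)
    # One logic operation: drop inward only if the bit matches and the
    # inner node has not been claimed by a packet on the inner cylinder.
    return "inner" if bit_matches and not blocked else "same"


if __name__ == "__main__":
    print(next_hop(0, 0b10, 0b11, n_bits=2, blocked=False))
    print(next_hop(0, 0b10, 0b11, n_bits=2, blocked=True))
```

A packet that returns "same" follows the crossing pattern on its own cylinder, where the relevant height bit alternates, so it can try again a couple of hops later.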
III. PERFORMANCE
The performance of the data-vortex architecture is investigated with numerical simulations. We assume a uniform, random traffic model in which all the launched packets have a balanced, uniform distribution of their destination addresses. In addition, each packet injection is statistically independent of injections occurring at other input ports or other clock cycles. The input-port injections are determined by a load parameter, which is defined as the fraction of time a new packet injection is attempted. For example, for a load parameter of 0.5, on average, a new packet injection is attempted at every other time slot. For simplicity, only the height information is encoded as the packet destination address, and all of the available angles at the output side accept the switched data packets. System performance characteristics such as injection rate, latency, and latency distribution are evaluated under various loads and various switch sizes.
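The traffic model described above can be sketched as a simple generator. This is a minimal illustration, not the simulator used for the reported results; the function name `injection_attempts`, its parameters, and the fixed seed are assumptions made for the example.

```python
import random


def injection_attempts(num_ports: int, height: int, load: float,
                       slots: int, seed: int = 0) -> list:
    """Generate (slot, port, destination_height) injection attempts.

    Each input port independently attempts an injection in each time
    slot with probability `load`, and every attempted packet draws its
    destination height uniformly at random.
    """
    rng = random.Random(seed)
    attempts = []
    for slot in range(slots):
        for port in range(num_ports):
            if rng.random() < load:
                attempts.append((slot, port, rng.randrange(height)))
    return attempts


if __name__ == "__main__":
    traffic = injection_attempts(num_ports=20, height=4, load=0.5, slots=1000)
    # At a load of 0.5, roughly half of the 20 000 port-slots carry an attempt.
    print(len(traffic))
```

Only the height is drawn as the destination, matching the simplification that all output angles accept switched packets.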
A. Injection Rate
We use an injection rate parameter to represent the sustained throughput of the switch. The injection rate is defined as the ratio of successful packet injections into the switch to the number of injection attempts. Generally, the maximum throughput supported by the switch will be less than the product of the number of input ports and the bandwidth supported by each port because packets may share the same routing paths. The actual throughput can be studied by numerical simulation of the traffic flow within the switch under various load conditions for various switch fabric sizes.
Considering the number of I/O ports, the switch can be operated in two different modes: symmetric or asymmetric. As mentioned, we always assume that all of the available output angles are operated as the receiving ports. At the input side, however, we either choose to inject packets to all the angles in the symmetric mode or choose to inject packets into a fraction of the angles in the asymmetric mode.
In Fig. 2, the injection rate is plotted as a function of the input load for various switch heights H with A = 5. The switch is operated in the symmetric mode, i.e., all five angles are used as input ports. The switch fabric in this symmetric mode has 5 × H input and 5 × H output ports. As shown, the difference in injection rate for various switch fabric sizes is more pronounced under heavier loads. However, as the switch size increases, the degradation in injection rate slows rapidly, even for heavy loads. For example, in a fully loaded (load = 1.0) switch with 640 (5 × 128) I/O ports, about 26.1% of the new injections are successful. For the same load condition, a symmetric switch with 10 k (5 × 2048) I/O ports can still maintain about 23.7% successful injections. Thus, the data vortex is shown to scale to large port counts (approximately 10 k) while maintaining a reasonable rate of successful packet injections even under fully loaded conditions. By injecting packets into only a fraction of the available input ports, i.e., operating asymmetrically, better injection rates and latency performance can be achieved. The downside, however, is that the switch cost per input port rises because the same hardware now serves a smaller number of input ports.
To study the performance improvement under asymmetric operation, a switch with A = 5 and H = 2048 is used. In this case, only some of the A input angles are active. The asymmetric switch now has H input ports per active angle and A × H output ports. In Fig. 3, the successful injection rate is shown as a function of the traffic load for varying numbers of active input angles. As indicated, under a fully loaded condition the injection rate improves from 23.7% in the symmetric mode to 39.2% and to as much as 97.5% as the number of active input angles is reduced. Therefore, the throughput performance can be improved dramatically even though the hardware cost increases by a factor of five given the same number of input ports. However, because each node requires only a simple routing function and is amenable to large-scale integration, the hardware implementation of the switching nodes can potentially be economical. As shown later, the asymmetric I/O operation also improves the mean latency performance.
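The trade-off between asymmetry and hardware cost can be made concrete with a short calculation. The sketch below assumes a fabric with five angles and a height of 2048, and it assumes that the factor-of-five cost increase corresponds to using one of the five angles for injection; the function name `asymmetric_ports` and the parameter `active_angles` are hypothetical.

```python
def asymmetric_ports(angles: int, height: int, active_angles: int) -> tuple:
    """Return (input_ports, output_ports, cost_factor) for a fabric.

    cost_factor is the hardware per input port relative to the symmetric
    mode, i.e., the ratio of output ports to active input ports.
    """
    if not 1 <= active_angles <= angles:
        raise ValueError("active_angles must be between 1 and angles")
    inputs = active_angles * height
    outputs = angles * height
    return inputs, outputs, outputs / inputs


if __name__ == "__main__":
    print(asymmetric_ports(5, 2048, 5))   # symmetric mode
    print(asymmetric_ports(5, 2048, 1))   # most asymmetric mode
```

With one active angle the fabric offers about 2 k inputs and 10 k outputs, so five times as many nodes serve each input port as in the symmetric mode.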
B. Latency
In addition to the efficient scalability of the data vortex, the switch fabric also exhibits low latencies. The efficient routing schemes and small deflection penalty act to reduce the time-of-flight latency. We study the switch latency, defined as the mean number of hops propagated by the packets, as a function of the switch size and under various loads. Figs. 4 and 5 show the latency as a function of the switch height H, with A = 5, for various loads and various asymmetric modes. The H axis is plotted on a log scale; therefore, the mean latency is linearly proportional to log(H) as shown, and the slope of the lines depends on the input load as well as the degree of asymmetry.
For the symmetric case (all five input angles active) shown in Fig. 4, the overall latency performance is improved (i.e., smaller slope) with reduced load simply because the switch is less crowded and the average deflection probability is decreased.
In Fig. 5, only a subset of the A angles at the input side are active injecting ports. The load at each injecting port is set to 1.0. The more asymmetric cases (fewer active input angles for a given A) have lower overall traffic within the switch and, therefore, smaller latencies due to the reduced deflections. The absolute latency in the data-vortex switch is low. For example, as shown in Fig. 5, in a switch fabric with 2-k input and 10-k output ports under a fully loaded (load = 1.0) condition, the mean latency is only 22 hops. Because each node hop requires very little processing, the physical delay time at each hop can be very small, e.g., less than 10 ns (as will be shown in the experimental section). Thus, with the data-vortex architecture, one could construct a switch fabric with 2 k input and 10 k output optical ports and an overall packet latency in the 200-ns to 300-ns range.
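The latency estimate follows from multiplying the mean hop count by the per-hop delay. The sketch below uses the 22-hop mean latency and the 10-ns per-hop bound quoted above; the function name `latency_ns` is illustrative, and treating every hop as having the same delay is a simplifying assumption.

```python
def latency_ns(mean_hops: float, ns_per_hop: float) -> float:
    """Estimate the packet latency as hops times the delay per hop."""
    return mean_hops * ns_per_hop


if __name__ == "__main__":
    # 22 hops at up to 10 ns per hop gives an estimate of 220 ns.
    print(latency_ns(22, 10))
```

The result, 220 ns, lies within the 200-ns to 300-ns range stated in the text.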
C. Latency Distribution
The latency distribution of the switched packets is an important aspect of the system performance. It is easier for the receivers to restore the order of packets if their switching latencies are narrowly distributed around the mean values. At the physical layer, packets also experience similar degradation because their routing-path lengths vary only slightly. The latency distribution is shown in a histogram format with the percentage of the total number of packets plotted as a function of the corresponding number of hops. The latency distribution is studied under various traffic loads and for different asymmetric modes. The results are shown in Figs. 6 and 7, respectively. Fig. 6 shows the latency distributions in a symmetrically operated switch under various traffic loads. Lighter traffic loads result in narrower distribution curves because packets are less likely to interfere with each other or accumulate large differences in deflection penalties. The coordinate of the distribution curve peak, which also reflects the mean latency of the system, shifts to the left as the load is reduced. Fig. 7 compares the latency distributions for various asymmetric modes. For all cases in Fig. 7, the load for the injecting ports is set to 1.0, and the same switch size is studied as in Fig. 6. With fewer active input angles, i.e., the more asymmetric I/O case, a narrower distribution is achieved. This is simply because fewer packets are injected into the switch, leading to fewer deflections and less variation in the number of deflections. We note that the improvement in latency distribution is not uniform across the asymmetric modes: the step between some of the modes in Fig. 7 is minor compared with others. In addition, we find that these results are similar for different switch sizes. Therefore, the overall latency of the data-vortex switch is considered narrowly distributed. For example, the 1% tail of the latency distribution extends only to the range of 40-50 hops, even for the worst traffic conditions.
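The histogram format and the tail measure used above can be computed from a list of per-packet hop counts. The following is a minimal sketch with made-up sample data; the function names `latency_histogram` and `tail_hops`, and the definition of the tail as the hop count exceeded by at most a given percentage of packets, are assumptions for illustration.

```python
from collections import Counter


def latency_histogram(hops: list) -> dict:
    """Map each hop count to the percentage of packets with that latency."""
    total = len(hops)
    counts = Counter(hops)
    return {h: 100.0 * n / total for h, n in sorted(counts.items())}


def tail_hops(hops: list, tail_percent: int = 1) -> int:
    """Smallest hop count exceeded by at most tail_percent of the packets."""
    if not hops:
        raise ValueError("hops must not be empty")
    ordered = sorted(hops)
    # Number of packets that must lie at or below the returned value
    # (ceiling of n * (100 - tail_percent) / 100, in integer arithmetic).
    keep = (len(ordered) * (100 - tail_percent) + 99) // 100
    return ordered[max(keep, 1) - 1]


if __name__ == "__main__":
    sample = [10] * 99 + [40]          # hypothetical hop counts
    print(latency_histogram(sample))
    print(tail_hops(sample, 1))
```

A narrow distribution shows up as a small gap between the histogram peak and the value returned by `tail_hops`.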
IV. EXPERIMENTAL TEST BED
The throughput of the data vortex can be enhanced by using WDM to encode the packet payload. In addition, encoding the header bits by WDM can greatly simplify the routing strategy at the node. Because each of the cylinders decodes a specific header bit in a binary tree fashion, passive wavelength filtering can be implemented in the nodes to perform the routing operations. Using WDM in the header field also reduces the switch latency. Each incoming packet into the switch is then constructed as in Fig. 8 . It consists of a payload field, i.e., the WDM data and a header field, which includes a framing bit to indicate the presence of a packet and the WDM header bits that are the destination address of the packet.
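The packet format and the filtering step can be pictured with a small data structure. This is only a sketch of the idea: the class name `WdmPacket`, the function `read_header_bit`, and the choice of 1550 nm for a header bit are hypothetical, and the dictionary lookup stands in for a passive optical filter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WdmPacket:
    framing: bool                                  # True if the slot holds a packet
    header: dict = field(default_factory=dict)     # wavelength (nm) -> header bit
    payload: dict = field(default_factory=dict)    # wavelength (nm) -> payload bits


def read_header_bit(packet: WdmPacket, filter_nm: int):
    """Emulate a node filtering out the one header wavelength it decodes.

    Returns the header bit carried on filter_nm, or None for an empty slot.
    """
    if not packet.framing:
        return None
    return packet.header[filter_nm]


if __name__ == "__main__":
    pkt = WdmPacket(framing=True, header={1550: 1},
                    payload={1552: "1010", 1553: "0110", 1555: "1100"})
    print(read_header_bit(pkt, 1550))
```

Because each cylinder needs only one header bit, a node inspects a single wavelength and never parses the payload.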
To experimentally demonstrate the routing functions of the data-vortex switch and measure the physical layer performance, we constructed a routing-node test bed [15] , [16] . The test bed is configured in a recirculating loop fashion to emulate packets propagating through multiple node hops.
The experimental setup is shown in Fig. 9 . Each routing node has two in (as the North and West) and two out (as the East and South) optical ports. At the packet generator, two continuous wave (CW) lasers LD1 (1547 nm) and LD2 (1550 nm) are used to carry framing and header information, and three other CW channels (1552 nm, 1553 nm, 1555 nm) are used to build the WDM payload. Electrical data of framing and header bits are generated from the pulse-pattern generator 1 (PPG1) HP81200 with a packet period of 32 ns. The modulated optical header and framing bits are aligned with the corresponding optical payload channels in time. The payload information is modulated at 10 Gb/s from the PPG2 (MP1763B, Anritsu Corp., Tokyo 106-8570, Japan), which includes a 4.8-ns guard-time band at the edge of packet boundary to allow for the switching transients. An erbium-doped fiber amplifier (EDFA) is used to compensate the loss through the modulating and multiplexing stages. A polarization controller (PC) is inserted at the input of the node to compensate for the polarization sensitivity of the switch.
The packet routing is realized by a LiNbO3 2 × 2 crossbar switch [17]. By properly applying the switching signal to the crossbar switch, the input optical packet can be directed to one of the two output ports. Based on this configuration, the electrical switching signal is derived from the routing information of the incoming packets [West's framing (Fw), West's header (Hw), and North's header (Hn)] and the control signal (Cs), which originates from the competing node at the neighbor cylinder. In the experimental setup, however, the control signal is programmed by PPG1 directly, and aligned with the corresponding routing information at the decision circuit. Fw, Hw, and Hn are tapped and filtered from the optical packets and converted into electrical signals that feed the decision circuit.
The recirculating configuration is achieved by connecting the output of the node (the South port) back as the North input. Within the loop, an EDFA with amplified spontaneous emission (ASE)-peak filtering is used to compensate the loss through the switches and through the taps and splices. The loop delay is designed to be five packets long to keep the routing synchronous. An optical delay line is inserted after the tap-off, allowing the appropriate temporal alignment of the optical packets with the switching decision. Depending on the associated destinations and control conditions programmed, each packet will recirculate through the node a varying number of times. A packet sequence entering the test bed will, therefore, emerge out of the original order because different packets travel a different number of hops through the node.
As an example, we tested a 40-packet sequence given by "0010 1110 1101 0111 1010 1110 0000 0000 0000 0000", where "0" represents an empty packet slot, and "1" represents an occupied packet slot. According to the specific header address and control for each packet, the output sequence is expected to be "0000 1010 1111 0101 1100 0000 0000 0000 0001 1111". As shown in Fig. 10, the experimental results are consistent, demonstrating the correct routing function of the node.
To verify the successful multihop routings, we also examine the detailed payload of each packet from input to output. For example, the fifteenth packet in the input sequence is programmed to travel through six node hops and arrive as the fortieth packet at the output sequence. The end portions of this fifteenth-input packet and fortieth-output packet are shown to match in Fig. 11 .
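The slot bookkeeping of the recirculating loop can be checked against the test sequences. The sketch below assumes that the first hop adds no slot delay and that each further hop adds one loop delay of five packet slots; under that assumption a packet entering in slot 15 and making six hops leaves in slot 40. The function name `output_slot` is illustrative.

```python
LOOP_DELAY_SLOTS = 5   # loop delay of five packets, as stated for the test bed


def output_slot(input_slot: int, hops: int) -> int:
    """Output slot (1-indexed) of a packet after `hops` node hops.

    Assumes the first hop adds no delay and each recirculation adds
    one loop delay.
    """
    return input_slot + (hops - 1) * LOOP_DELAY_SLOTS


INPUT_SEQ = "0010 1110 1101 0111 1010 1110 0000 0000 0000 0000".replace(" ", "")
OUTPUT_SEQ = "0000 1010 1111 0101 1100 0000 0000 0000 0001 1111".replace(" ", "")

if __name__ == "__main__":
    # The same number of packets enters and leaves the test bed.
    print(INPUT_SEQ.count("1"), OUTPUT_SEQ.count("1"))
    # The fifteenth input slot maps to the fortieth output slot after six hops.
    print(output_slot(15, 6))
```

Both sequences contain the same number of occupied slots, and slots 15 and 40 are occupied in the input and output sequences, respectively.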
The eye diagram results of the filtered payload channel at 1555 nm are given in Fig. 12 . The output shows the collective eye diagrams of all of the packets that have propagated through varying number of node hops. For a maximum of six node hops, we find clean and open eye diagrams. Payloads at the other two channels have similar results.
For practical large-scale switching fabrics, a LiNbO3 switch-based routing node will not be an economical implementation because of the difficulty of integration and a relatively large insertion loss. Therefore, a semiconductor optical amplifier gate-based node with a similar switching speed is more likely to be the solution. In addition to the integration advantage, it provides a higher ON-OFF ratio and is able to compensate the loss within the same device. A similar routing-node test bed based on this approach is currently being built and studied. To better study the switching scalability, physical models are also being built for detailed studies of the noise and crosstalk accumulation in such multihop systems.
V. SUMMARY
We have investigated, in detail, the performance and experimental implementation of the data vortex, a new photonic packet-switch architecture. The simplified bufferless packet-switch design enables the exploitation of the high bandwidth and transparency afforded by optical transmission. The numerical studies show that the data vortex is highly scalable to large port counts (approximately 10 k) while maintaining low latencies (22 hops) and a narrow latency distribution. A wavelength-encoding method is additionally employed for the header bits, allowing the use of passive wavelength filtering for the routing operation. Under modest asymmetric I/O operation of the switch, sustained high switching throughput and a narrow latency distribution can be achieved. Therefore, for large-scale packet-switched applications, it is feasible to build the data-vortex architecture with a slight penalty in switching complexity. An experimental test bed was constructed to demonstrate the basic function of the routing node as well as successful routing performance for six cascaded nodes.
