We present the theory, experimental results, and analytical modeling of high-speed CMOS switches, with a 2-D layout, suitable for the implementation of packet-switched free-space optoelectronic Multistage Interconnection Networks (MINs). These switches are fully connected, bi-directional, and scaleable. The first design is a proof of concept of the half-switch, which is a two-to-one multiplexer, and the 2-D layout. The second design introduces a novel self-routing concept, with contention detection and packet drop-and-resend capabilities. It uses three-valued logic, with 2.5V being the third value for a 5V power supply. Simulations show that for a 0.8 µm CMOS technology the switches can operate at speeds up to 250 Mbits/sec. Scaled-down versions of both designs have been successfully implemented in 2.0 µm CMOS. The analytical modeling of the switches shows that large-scale free-space optoelectronic MINs using this concept could offer close to Terabit/sec throughput capabilities and very reasonable power and area figures. For example, a 4096 channel system would offer 256 Gbits/sec aggregate throughput for a total silicon area of about 18 cm² and a total power consumption (optics plus electronics) of about 90 W.
1. Introduction
One of the most important features in a parallel processing system is the communications subsystem, linking processors, memories and input/output controllers.
The simplest implementation is to use a high bandwidth bus, which provides good performance when the number of processors is kept small. For large systems, with hundreds or more processors, an interconnection network, defined as a system of switches and links that connect N inputs to M outputs, provides a more viable architectural alternative [1] . A crossbar is one such network that provides a dedicated channel from each input to each output in a single stage. Such networks quickly become prohibitive in cost as the number of processors is increased since MN crosspoint switches are needed. A more scaleable method of providing high bandwidth communication at a reduced cost is to use a Multistage Interconnection Network (MIN) [2] .
The Optical Transpose Interconnection System (OTIS) is an optoelectronic MIN developed for parallel processing systems [4]. The N processor nodes are divided into √N groups of √N nodes each. The nodes of a group are connected to each other electrically in a hypercube network via the electronic switches. Connections between groups are made via free-space optics: each processor has an optical transmitter/receiver pair with which it sends and receives optical signals, and these pairs are connected via two planes of lenslet arrays. These optical links connect the i-th processor of the j-th group to the j-th processor of the i-th group: a transpose of group and position coordinates. In such a free-space optoelectronic MIN, electronic bypass-and-exchange switches are required to do the local routing. It has been shown that for a MIN with N inputs and N outputs, the bandwidth and the power consumption are optimized if the electronic switch planes are partitioned into √N switches [3]. Thus, each switch has √N inputs and √N outputs and is defined as a √N x √N switch.
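The transpose connection can be illustrated with a short Python sketch. The function name `otis_link` and the zero-based (group, position) pair are illustrative assumptions and do not come from the original notation.

```python
def otis_link(group, position):
    """Return the (group, position) reached over the optical link.

    Processor `position` of group `group` is linked to processor
    `group` of group `position`, i.e. the two coordinates are swapped.
    """
    return (position, group)


if __name__ == "__main__":
    # Assumed example node in a system of 16 groups of 16 nodes.
    src = (3, 7)
    print(src, "->", otis_link(*src))
```

Applying the link twice returns to the starting node, which reflects the bi-directional nature of each optical connection.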
Two separate circuits, that use an optimized 2-D layout and are compatible with the Optical Transpose Interconnection System (OTIS), have been designed, implemented, and analyzed. The first design (design A) was used as a proof of concept for the optimized 2-D layout, the second design (design B) is a bi-directional self-routing concept that uses 3 level logic. Note that the optoelectronic implementation of this second design based on the CMOS/SEED technology [6] is under way.
In section 2 of this paper, the optimization of the system is summarized and OTIS is introduced. In section 3, the overall construction of the switch is explained, along with the functionality of the building block, which is called a half-switch. The two designs of a halfswitch, referred to as design A and design B, are discussed in detail, and experimental results are presented. In section 4, the switch is modeled for various technologies and number of channels in terms of speed, area, and power consumption. Finally, section 5 presents a discussion of the results.
2. Optical Transpose Interconnection Network
In a free-space optoelectronic multistage interconnection network (MIN), as described in [3], local routing is done with electronic fully connected bypass-and-exchange switches, and longer/global connections are achieved with optical links. For a MIN with N inputs and N outputs (i.e. N channels), the bandwidth and the total power consumption, both electrical and optical, of the overall system have been shown to be optimized if the N channels are partitioned into √N switches with √N channels each [3]. In Figure 1, K is the number of channels per electronic switch, that is, each switch has K inputs and K outputs. To build a switching network with N channels, one can choose to have an all-electronic network with no optical links at all. In this case, the MIN will consist of a single switch (K = N) with optical input/output. At the other extreme, one can choose to implement the same network with as much optics as possible. Then, the network will have log2(N) stages of electronic switches with K = 2, and log2(N) - 1 stages of optics.
A third option is to choose a middle ground by setting K to be greater than 2 and smaller than N.
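The three options can be compared with a small Python sketch. It assumes that N is an exact power of K, so that a fully connected network needs log_K(N) electronic stages with one optical stage between each consecutive pair; the function name `stage_counts` is illustrative.

```python
def stage_counts(n_channels, k):
    """Return (electronic_stages, optical_stages) for K-channel switches.

    Assumes n_channels is an exact power of k (k >= 2).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    stages = 0
    size = 1
    # Count how many times k must be multiplied to reach n_channels.
    while size < n_channels:
        size *= k
        stages += 1
    if size != n_channels:
        raise ValueError("n_channels must be an exact power of k")
    # One optical stage sits between each pair of electronic stages.
    return stages, stages - 1


if __name__ == "__main__":
    for k in (2, 16, 256):
        print(k, stage_counts(256, k))
```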
The number of electrical and optical stages will then be adjusted to achieve a fully connected network. As shown in [3], as K grows beyond a certain point compared to N, that is, as more and more electronics is utilized, the bandwidth of the overall system drops dramatically. The reason for this drop is that as the electronic switch gets bigger, the maximum electronic delay inside the switch gets bigger, so the clock speed needs to be reduced, which in turn reduces the bandwidth of the network. Over the range where the bandwidth is constant, the total system power, including both the electrical and the optical power consumption, is minimized when K ≅ √N. Note that if K = √N, only log_√N(N) = 2 stages of electronics are required to achieve a fully connected network. In addition, since optics is used to connect the electronic stages, only 1 stage of optics is needed for the 2 stages of electronics. This optimized point of K ≅ √N is only valid for the given technology assumptions; however, further scaling investigations in [3] showed that the trend remains the same for other technologies. It is based on this optimization that the Optical Transpose Interconnection System (OTIS) was conceived [4].
There are two stages of electronic switches, one on either side of the single optical stage. Each of the 16 outputs of a switch in plane 1 (i.e. arbitrarily chosen to be the left plane in Figure 2) has a 1-to-1 optical link to an input of a different one of the 16 switches in plane 2 (i.e. the right plane in Figure 2). Routing of a data packet is illustrated in Figure 2. The incoming data enters one of the 16 switches. It is then routed within the electronic switch to the specific output that has the optical link to the switch on the other side which contains the final destination of the data packet. The data is routed one more time, inside the electronic switch in plane 2, to arrive at the desired output node. Both the electronics and the optics are bi-directional, so every input also acts as an output depending on the desired direction of the data flow.
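The two-hop routing just described can be sketched in Python. The sketch assumes zero-based indices and the transpose wiring in which output j of switch i in plane 1 is linked to input i of switch j in plane 2; the function name `otis_route` and the tuple layout are hypothetical.

```python
def otis_route(src_switch, src_port, dst_switch, dst_port):
    """List the hops of a packet as (plane, switch, port, role) tuples."""
    return [
        # The packet enters a plane-1 switch at its source port.
        (1, src_switch, src_port, "in"),
        # Plane 1 routes it to the output wired to the destination switch.
        (1, src_switch, dst_switch, "out"),
        # The optical transpose link swaps switch index and port index.
        (2, dst_switch, src_switch, "in"),
        # Plane 2 routes it to the desired output node.
        (2, dst_switch, dst_port, "out"),
    ]


if __name__ == "__main__":
    for hop in otis_route(2, 5, 11, 0):
        print(hop)
```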
3. Electronic Switches
Throughout this section, the design and experiments of a switch with 16 channels (i.e. K = 16, N = 256) are described. However, every derivation and the modeling analysis in section 4 apply to switches with an arbitrary number of channels.
3.1. Overview of a Switch
A switch is implemented by cascading 2x2 bypass-and-exchange switches, each partitioned into two 2-to-1 multiplexers, called half-switches. The two half-switches of such a pair are called partner half-switches. Every half-switch gets an input of its own, plus the input of its partner half-switch as its second input. Depending on the control signal, which can be fed externally as in Design A, or can be computed internally as in Design B, each half-switch transmits one of the two possible inputs to its single output. The complete switch consists of log2 √N = 4 stages of √N = 16 half-switches each, for a total of √N log2 √N = 64 half-switches. The block diagram of a 16 channel switch is shown in Figure 3.
As an example in Figure 3 , at the third stage, the 8th and the 12th half-switches are partner half-switches. Assuming the direction of data flow is from left to right, number 8 in the middle (i.e. in the third stage) gets two inputs, one from number 8 of the previous stage (i.e. the second stage) labeled A, and one from its partner labeled B, which is the output of number 12 of the previous stage. Equivalently, this is the first input of number 12 of the third stage. Then it outputs one of these inputs to node C, where D is a floating node, that is, no transistor is pulling node D up or down. On the other hand, when the direction of data flow is reversed so that it goes from right to left, C and D become the two inputs to number 8, with A as its single output, and B is floating.
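The partner relation in this example is consistent with flipping a single bit of the half-switch number. The Python sketch below assumes zero-based half-switch numbers and that stage s (counted from 1) pairs numbers that differ in bit s-1. This reproduces the 8/12 pair at the third stage, but it is only a guess at the general pattern, and the actual wiring in Figure 3 may differ. The function names are illustrative.

```python
def partner(index, stage):
    """Partner half-switch number at a given stage (assumed bit-flip rule)."""
    return index ^ (1 << (stage - 1))


def input_sources(index, stage):
    """Previous-stage half-switches feeding the two inputs of `index`.

    The first input comes from the same-numbered half-switch, the second
    from the partner's number, for left-to-right data flow.
    """
    return index, partner(index, stage)


if __name__ == "__main__":
    print(partner(8, 3), input_sources(8, 3))
```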
Laying out the switch as it is shown in the block diagram of Figure 3, with all of the 16 inputs on one side, and the 16 outputs on the other side, gives a 1-D layout, suitable for VLSI implementations. However, if all the inputs and outputs are distributed in a 4x4 array of constant pitch, one can achieve a 2-D layout, suitable for optoelectronic implementations. The log2 K half-switches with the same number, one from each stage, are grouped together in the layout as in Figure 4.
In Figure 4 , each triangle represents an input/output pair since the switch is bi-directional.
Every rectangle in the figure contains log2 K = 4 half-switches, which form a channel. In addition, every connection between two half-switches of different channels implies that those are partner half-switches. Therefore, their inputs and outputs are connected to each other as shown in Figure 3.
When 1-D and 2-D layout strategies are compared, the maximum wire length in the whole switch is much shorter in the 2-D layout. Thus, in terms of RC-limited maximum operation frequency, the 2-D layout has an advantage over the 1-D layout. In addition, if the 1-D layout is used for the optoelectronic implementation, extra routing is required between the actual inputs/outputs of the half-switches, and the transmitters/receivers of the system, since the latter are likely to be laid out on a 2-D array with a constant pitch. As a result, the 2-D layout is advantageous in terms of total area as well as maximum operation frequency [3] . This advantage is magnified when bigger size networks are employed.
3.2. Designs of Half-Switches
Two different half-switches have been designed for OTIS. The first one, design A, is a simple, bi-directional 2-to-1 multiplexer, that has no additional features that could be desired for a more powerful system. This was built as a proof of concept for the 2-D layout, and the operation of a half-switch as the building block of the complete switch. The second design, design B, is a novel self-routing half-switch, that can detect contention, and drop-and-resend data packets.
3.2.1. Design A
The block diagram of design A is shown in Figure 5-A [5]. The half-switch uses an external direction signal, dir, that is broadcast to every half-switch in the entire switch. This signal determines the direction of the data flow. The direction signal is arbitrarily chosen to be 1 for a left-to-right data flow. In this case, x0 and x1 are the two inputs and y0 is the output, while y1 is floating. Another external control signal, c, is sent to the half-switch, controlling which input channel it should transmit to its output. Again arbitrarily, c is chosen to be 1 when x1 is to be transmitted, and similarly, c = 0 causes x0 to go through. If the direction is reversed (i.e. dir = 0), then c = 0 causes y0 to be sent to x0, and c = 1 causes y1 to be sent to x0.
In this design, only four control bits are used for a four-stage switch, labeled c0 through c3 (refer to Figure 3). The same control bit is sent to all the half-switches on the same stage. As a result, only the final output destination of a single input can be determined, whereas the remaining 15 inputs go to the other 15 outputs without contention.
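The behavior of a design A half-switch can be summarized in a short Python sketch. The function name `half_switch_a` and the use of `None` for a floating node are illustrative choices. The reverse-direction case with c = 1 is taken, by symmetry with the forward direction, to send y1 to x0, which is an assumption of this sketch.

```python
FLOATING = None  # stands for a node that no transistor is driving


def half_switch_a(direction, c, x0=None, x1=None, y0=None, y1=None):
    """Return the node values (x0, x1, y0, y1) after the half-switch acts."""
    if direction == 1:
        # Left to right: x0 and x1 are inputs, y0 is driven, y1 floats.
        return x0, x1, (x1 if c == 1 else x0), FLOATING
    # Right to left: y0 and y1 are inputs, x0 is driven, x1 floats.
    return (y1 if c == 1 else y0), FLOATING, y0, y1


if __name__ == "__main__":
    print(half_switch_a(1, 0, x0=1, x1=0))
    print(half_switch_a(0, 1, y0=1, y1=0))
```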
To speed up the overall system and to make it scaleable, the control signals, c0-c3, are fed into the switch in a pipelined fashion. In other words, the control bit, c1, which belongs to the second stage is delayed externally by the same amount of time it takes for a signal to propagate through the first stage of half-switches. Similarly, the control bit to the third stage, c2, is delayed twice that amount, and so on. This method ensures that the control signal and the inputs of a given stage arrive at the same time at the desired half-switches. Then, the speed of the overall system is directly equal to the speed of a single half-switch. As the switch size increases, the number of stages and the total number of half-switches increase but the overall speed stays constant since the propagation delay of a single half-switch is constant.
This circuit has been implemented through MOSIS and its operation has been verified experimentally at 90 Mbits/s. The difference between the experimental and the simulated results is believed to arise from the experimental setup rather than the circuits themselves. The main problem may be that the probes being used are not suitable for higher-speed measurements.
3.2.2. Design B
Design B is built upon design A, but it adds functionality to the operation of the switch.
The block diagram is shown in Figure 6. It still acts as a 2-to-1 multiplexer. However, this time, it is self-routing, that is, the control bit for each half-switch is computed internally. Every input packet contains, as a header, the address of its desired output destination (i.e. for N = 256 channels, log2 N = 8 bits of address are needed). As data packets are presented, the half-switches in the first stage process the first bit of each of their inputs, and decide on their control signal. The remaining 23 bits are then transmitted untouched. The same processing is done in the next stages until the data packets arrive at their output destinations.
As packets are transmitted through the switches, two of them may have to use the same half-switch to arrive at their output, and thus, there is contention (hot spot). In this case, the half-switch transmits one of the inputs in a deterministic way, and drops its other input. At the same time, to ensure that the dropped data is not lost, a contention signal is generated within the half-switch where the blocking happened. This contention signal propagates in the direction opposite to the data flow, and follows backwards the path that the dropped packet had followed up to that point. Once it reaches the dropped packet's input buffer, it sets the input buffer to resend the same packet, so that all the information is eventually routed through the network.
In this design, the direction of data flow is again determined by an external direction signal supplied to all the half-switches. In addition, an external transmission signal, t, is provided to inform each half-switch that data transmission is occurring. This signal is set to 1 if the incoming bit is a data bit, and is set to 0 if it is an address bit. Therefore, for all the half-switches at a given stage, t = 0 during the first cycle of a data packet, and t = 1 for the remaining 23 cycles. Just like in design A, the transmission signal is pipelined, that is, delayed by the same amount of time that the input takes to reach that stage. As a result, a transmission signal of 0 for 1 cycle, and 1 for 23 cycles, propagates from stage to stage at the same speed that the data propagates, with the 0 bit arriving at a stage when the control signals are to be computed at that stage (i.e. the incoming bits are address bits). This way, a single pulse of t = 0 at the input buffer stage enables all the half-switches in the entire switch to know exactly when to process the incoming bits as their address bits rather than data bits. As each data packet is introduced into the pipeline, first the address bits are processed. This processing of the header of a data packet takes exactly 2 log2 N cycles, where a cycle is equal to the duration of a single bit. This is the time it takes for the last address bit (i.e. the (log2 N)-th bit) to be processed by the last stage (i.e. the (log2 N)-th stage). After that, the speed of that channel's throughput is equal to the speed of a single half-switch, since the pipeline is completely filled up at this point.
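The 2 log2 N figure can be checked with a short Python sketch. It assumes that one address bit enters the pipeline per cycle and that each of the log2 N stages adds exactly one cycle of delay; the function name `header_latency` is illustrative.

```python
def header_latency(n_channels):
    """Cycles until the last address bit has passed the last stage."""
    if n_channels < 2 or n_channels & (n_channels - 1):
        raise ValueError("n_channels must be a power of two")
    address_bits = n_channels.bit_length() - 1  # log2(N) header bits
    stages = address_bits                       # log2(N) half-switch stages
    # Cycles to stream the header in, plus cycles to traverse all stages.
    return address_bits + stages


if __name__ == "__main__":
    for n in (256, 4096):
        print(n, header_latency(n))
```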
The design of the half-switch consists of three separate circuits, namely, the output, the control, and the contention circuits.
3.2.2.1. Output Circuit
The circuit schematic of design B's output circuit is given in Figure 7. The numbers next to the transistors indicate their width/length ratio in units of λ. Because of the possibility of contention between data packets, the output circuit of design A is modified. In design B, the signal c0 is equivalent to the signal c of design A, that is, it determines which one of the inputs will be transmitted to the output. However, there is the possibility that neither of the two inputs will be active. This is equivalent to both inputs wanting to use the partner half-switch to reach their final destinations. Then c0 is a "don't care", that is, the half-switch does not care which input is transmitted, since neither of the inputs will be using that path. In that case, c1 is set to 1. In other words, c1 is a 1 if c0 is a "don't care", and is a 0 otherwise.
To increase the bandwidth of the overall system, if a half-switch, say on the second stage, finds out that none of its inputs will be occupying that path, then it should send a signal to the next stage (i.e. the third stage), indicating that the half-switches on the third stage should transmit their other inputs no matter what. Because of this third possible output state of a half-switch, the communication between stages has to be modified from design A. One way is to add a second wire between all the stages so that the three levels of output can be transmitted with two separate wires. However, as the system size grows, the global wiring becomes very difficult, especially in terms of area.
Instead, a novel technique is employed. The single output wire between consecutive stages is designed to be able to carry three levels of logic. For a 5V Vdd, these levels would be 0V, 5V, and 2.5V as the extra third level. Continuing our example from the previous paragraph, if a given half-switch computes its c0 to be a "don't care" when t = 0 (i.e. it is the cycle to compute its address bit), then in the very next cycle (i.e. when t(+1) = 0, where t(+1) is simply the signal t delayed by one cycle), the transistors M1 and M2 will be turned on (refer to Figure 7). At the same time, transistors M3 and M4 will be off, and disable the remainder of the circuit except for transistors M5 through M8. This will provide a direct feedback path between the input and the output of the final inverter (i.e. M5 through M8). By appropriately sizing these transistors, the only stable voltage level, when M1 and M2 are on, can be set at 2.5V. Note that the only possible cycle in which a half-switch will set its output to 2.5V is when the half-switches on the next stage are computing their own control signals.
At this point, there is a direct path from Vdd to GND, so to reduce the power consumption, after another cycle (i.e. when t(+2) = 0 and t(+1) goes back to 1), M2 turns off. At the same time, the transistors M9 and M10 pull the inverter's (i.e. M5-M8) input to 0V and cause its output to go to 5V. If this happens, then the power consumption of that half-switch for the remaining 22 cycles is exactly zero. Therefore, on average, a half-switch that has to output a 2.5V signal indeed consumes less power than an average half-switch that is active for the entire 24 cycles of a data packet.
The functionality of the direction signal and the remaining transistors is exactly the same as in design A. The direction signal disables the half of the circuit whose output is not necessary (i.e. in the wrong direction). The above example assumes that dir = 1 so that the transistors that are not numbered have no effect on the output of the half-switch.
In addition, since the whole design is asynchronous, the timing is crucial. Therefore, the transistors in Figure 7 are designed such that the propagation delay of the half-switch's output circuit will be the same whether it is transmitting 0V or 5V through the two inverters (i.e. M11-M16 and M5-M8) or it is outputting 2.5V through M1 and M2.
3.2.2.2. Control Circuit
In this part of the half-switch, the control signals for routing the data packets are computed from the address bits. As explained before, each half-switch knows that the incoming inputs are address bits when the transmission signal is set to zero (i.e. t = 0). The two transistors, M1 and M2 (refer to Figures 8-A and 8-B), ensure that the determination of the control signals, c0 and c1, occurs only when t = 0. Otherwise, the whole circuit is disabled, and the two control signals float at their previously computed values for 23 cycles until t is 0 again (i.e. until it is the beginning of a new data packet for that half-switch).
In addition, for both of the control signals, the transistors M3-M6 ensure that the output will be determined by the inputs coming from the right direction. In other words, if dir = 1, x0 and x1 determine c0 and c1, and otherwise, y0 and y1 determine c0 and c1.
For c0, refer to Figure 8-A. c0 is the control bit that tells the half-switch which input will be transmitted to the single output during the next 23 cycles. If c0 = 0, then x0 is sent to y0, and if c0 = 1, then x1 is sent to y0 (assuming dir = 1). During t = 0, if x0 = 0 (arbitrarily chosen), then that input uses the half-switch for transmission. On the other hand, if x1 = 1, then the second input uses that half-switch. If either one of them is 2.5V, then that input is neglected, since it means that there will not be any data coming from that channel. Again arbitrarily, if both x0 = 0 and x1 = 1 at the same time, that is, both inputs want to use that half-switch, then x0 is dropped and x1 is transmitted in a deterministic way. The complete truth table for c0 is given in Figure 8-C.
In the schematics, assuming that t = 0 (so that M1 and M2 are on) and dir = 1 (so that M3 and M4 are on, M5 and M6 are off), c0 is determined by the competition between M7 and M8 (i.e. x0 and x1). The two transistors are sized so that if either one of them is at 2.5V and the other one is completely on, then the transistor that is completely on wins. In other words, c0 is set so that the half-switch will transmit the data packet carried on the input channel that is connected to the completely turned-on transistor. If both are on completely, then M8 wins over M7 (or M10 wins over M9 when dir = 0). The speed of this competition is greatly enhanced with the addition of the two inverters, M11-M14, at the output.
In addition to c0, there is a second control bit, c1, which tells the half-switch whether c0 is a "don't care" or not (refer to Figure 8-B). If c0 is a "don't care", then the output circuitry will produce a 2.5V output in the next cycle (i.e. when t(+1) = 0). The truth table for c1 is given in Figure 8-C as well. Notice that c1 = 1 when c0 is an X (i.e. "don't care"), and is 0 otherwise.
The implementation of this signal is somewhat complex. The signals, t and dir, have the same functionality for c1 as they did for c0. When t = 0, M1 and M2 are on. Also, when dir = 1, M3 and M4 are on and M5 and M6 are off so that x0 and x1 determine c1. There are two sets of transistors, M11 and M12 for x0, and M7-M10 for x1. These transistors are sized so that if x0 is 0 or if x1 is 1, then c1 is 0, since at least one of the inputs wants to use that half-switch and thus c0 is not a "don't care". On the other hand, if x0 = 2.5V or 5V and if x1 = 2.5V or 0V, then c1 is set to 1. The transistors M19-M24 are again added to improve the speed performance, as well as to take care of the necessary logic calculations.
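The logic of c0 and c1 described in prose above can be collected into a Python sketch. It covers only the dir = 1 case during the t = 0 cycle and is derived from the text rather than from the truth table of Figure 8-C. The names `LOW`, `MID`, `HIGH`, and `control_bits` are illustrative, and the third returned value (whether x0 is dropped) is added here for convenience.

```python
LOW, MID, HIGH = 0, "Z", 1  # stand for 0 V, 2.5 V, and 5 V


def control_bits(x0, x1):
    """Return (c0, c1, x0_dropped) from the two address-bit inputs.

    c0 is None when it is a "don't care".
    """
    wants0 = (x0 == LOW)   # x0 = 0 requests this half-switch
    wants1 = (x1 == HIGH)  # x1 = 1 requests this half-switch
    if wants1:
        # x1 is transmitted; if x0 also requested the path, x0 is dropped.
        return 1, 0, wants0
    if wants0:
        return 0, 0, False
    # Neither input uses this half-switch: c0 is a "don't care".
    return None, 1, False


if __name__ == "__main__":
    for x0 in (LOW, MID, HIGH):
        for x1 in (LOW, MID, HIGH):
            print(x0, x1, control_bits(x0, x1))
```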
The key in this implementation is that both x0 and x1 are directly fed into the competition transistors (i.e. M7-M12) without being inverted or buffered at any point before the competitions.
The competition transistors refer to those that can provide a path to Vdd and Ground at the same time (i.e. they will try to pull the output up and down simultaneously) so that the outcome will depend on their input voltages. The reason is that the state of 2.5V is not stable at all once it is produced at the previous stage's output circuit. If the input is 2.5V into an inverter, a 0.1V variation in the input level corresponds to almost a 1V variation at the output. As a result, the noise margins would be greatly reduced. However, with our implementation methods, a 1V noise margin is achieved for all cases. In other words, the three logic levels were 0-1V, 1.5-3.5V, and 4-5V. If the output circuit was stabilizing the third logic level at anywhere between 1.5V and 3.5V, instead of exactly at 2.5V, due to variations in the fabrication process, the right results would still be obtained. This choice of a 1V noise margin is enough to cover the parameter variations, but still gives us a comfortable margin to distinguish the three levels from each other, as well as maintain the high speed of the network. If the margin was lower, the variations, that can change the speed, or the output current, of a transistor as much as 40%, could lead to a third level outside the expected and/or acceptable levels. On the other hand, if the margin was higher, then the transistors in competition would have to be sized closer to each other, and the net current that drives the load would be reduced, which in turn reduces the network's speed.
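The three voltage windows quoted above can be expressed as a small Python classifier. The function name `classify_level` and the choice to return `None` for voltages that fall between the windows are illustrative assumptions.

```python
def classify_level(volts):
    """Map a node voltage to 'low', 'mid', or 'high'.

    Returns None when the voltage falls in a gap between the windows.
    """
    if 0.0 <= volts <= 1.0:
        return "low"
    if 1.5 <= volts <= 3.5:
        return "mid"
    if 4.0 <= volts <= 5.0:
        return "high"
    return None


if __name__ == "__main__":
    for v in (0.3, 1.2, 2.5, 3.4, 4.6):
        print(v, classify_level(v))
```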
3.2.2.3. Contention Circuit
All the signals that relate to dropping-and-resending data packets are computed by the contention circuitry. All signals starting with the letter "s" are in this category (refer to Figure 6 ).
These signals are sx0, sx1, sx, sy0, sy1, and sy. First, sx0 and sx1 are computed. sx1 is then sent to the partner half-switch, and the partner half-switch's sx1, which is called sx1', is received. Then, sx0 and sx1' are processed to find out what sx needs to be. If either one of them is a 1, then sx = 1, which means that the data packet that used that specific half-switch was dropped either at that stage or at some following stage before it was able to arrive at its final output destination. If it was dropped at a following stage, then that information would be carried to the half-switch through sy (i.e. for dir = 1, sy is an input). Note that sx0 and sy0 cannot be on (i.e. arbitrarily chosen to be 1) at the same time, and similarly, sx1 and sy1 cannot be set to 1 at the same time. The reason is that the data flows in one of two directions, and the contention signals have to flow in the opposite direction. As a result, these two pairs of signals are treated as one (i.e. sx0 = sy0, sx1 = sy1), and the direction in which they should flow is determined at the final stage when sx and sy are computed, with the help of the direction signal. This allows us to reduce the necessary number of transistors, as well as to reduce the number of links between partner half-switches by one (i.e. sx1 = sy1, instead of two separate signals, sx1 and sy1).
The output of each half-switch goes to two partner half-switches on the next stage as input.
As an example, refer to Figure 3. Number 8 on the left sends its output to number 8 and 12 in the middle as their x0 and x1, respectively. If that data packet is dropped at some point in the switch, and it was using no. 8 in the middle, then sx0 of no. 8 would be set to 1, which in turn will set sx of no. 8 to 1. On the other hand, if the same packet was using no. 12 in the middle and was dropped, then sx1 of no. 12 would be set to 1. This signal would be sent to no. 8 in the middle, which in turn would set sx of no. 8 to 1. In either case, the intermediate signals (i.e. sx0 of no. 8 and sx1 of no. 12) determine whether sx of no. 8 is 1 or 0. In other words, they determine whether the primary input of no. 8 (i.e. x0 of no. 8 in the middle, or y0 of no. 8 on the left) was dropped or not.
Looking at Figure 9 -A, sx0_ (or sy0_) can be set to 0 in one of two ways (i.e. sx0 = sy0 = 1) after it is reset to 1 when t0 turns 0 and back to 1. One possibility is that there is a contention in that very half-switch. For this to happen, assuming dir = 1, x0 needs to be 0, which would turn on M23, and x1 needs to be 1, which would turn on M25. At the same time, it is required that it is the half-switch's time to compute its control signals (i.e. t = 0, and M24 is on), so that the incoming x0 and x1 are address bits. In this case, x0 would be dropped in a deterministic way, as explained before, and so sx0_ would be set to 0 (i.e. sx0 = 1).
The second possibility for sx0 to be 1 is that if the data packet is dropped at some later stage (i.e. sy = 1), and x0 was using that half-switch for transmission (i.e. c0 = 0). In this case, M26 and M28 would be turned on, and sx0_ would be 0 again.
On the other hand, when there is contention, always x0 is dropped and x1 is transmitted, so when sx1 is calculated, the first possibility mentioned above does not exist for sx1 (refer to Figure 9 -B). Then sx1_ = 0 (i.e. sx1 =1) only when the data packet was dropped at a later stage (i.e. sy = 1), and the half-switch was transmitting its second input, x1, at that time (i.e. c0 = 1).
Then, M31 and M33 are on, and sx1_ = 0. Of course, this is true when the half-switch is not resetting (i.e. when t0 is not 0). Figure 9 -C shows the final processing of sx0_ and sx1'_, or equivalently sy0_ and sy1'_, to compute sx and sy. Note that the sx1'_ that is used to compute sx and sy comes from the partner half-switch, where that half-switch's sx1_ is sent to its partner half-switch, so that its partner can compute its own sx and sy. If either sx0_ or sx1'_ is zero, then the output is set to 1.
M38-M41 determine which output needs to be computed. M42 and M43 ensure that if the system is being operated in unidirectional mode, the disabled (i.e. the unused) output, which is sy if dir = 1, and is sx if dir = 0, does not float up to 5V as time passes.
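The contention logic for dir = 1 can be summarized in a Python sketch. It uses active-high values (sx0, sx1, sx) rather than the active-low nodes (sx0_, sx1_) of Figure 9, and it leaves out the reset by t0 and the unused sy output. The function name `contention_signals` and its argument names are illustrative.

```python
def contention_signals(local_contention, sy, c0, partner_sx1):
    """Return (sx0, sx1, sx) for one half-switch with dir = 1.

    local_contention: x0 = 0 and x1 = 1 were seen during the t = 0 cycle.
    sy: a later stage reports that the transmitted packet was dropped.
    c0: 0 if x0 is being transmitted, 1 if x1 is (None for "don't care").
    partner_sx1: the sx1 value received from the partner half-switch.
    """
    # x0 was dropped here, or it passed through here and was dropped later.
    sx0 = bool(local_contention or (sy and c0 == 0))
    # x1 always wins locally, so it can only have been dropped later.
    sx1 = bool(sy and c0 == 1)
    # The primary input was dropped via this half-switch or via the partner.
    sx = bool(sx0 or partner_sx1)
    return sx0, sx1, sx


if __name__ == "__main__":
    print(contention_signals(True, 0, 1, False))
    print(contention_signals(False, 1, 1, False))
```

In this sketch, sx1 is the value that would be sent to the partner half-switch, where it plays the role of `partner_sx1`.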
4. Modeling
The switch is implemented with the HP 0.8 µm CMOS technology where λ=0.5 µm but during fabrication, the minimum gate length is reduced from 1.0 µm to 0.8 µm. This process allows a faster maximum speed for the transistors, while not increasing the contact or metal resistance. The following analysis applies to the above mentioned CMOS process for various system sizes. It could be extended to other technologies by adjusting the transistors' input gate capacitance and the layout area. The design of a switch is evaluated in terms of its speed, area, total power consumption, and power density.
4.1. Area
First, the area of the switches will be discussed. The system is implemented using the CMOS-SEED [6] technology, so the pads are located above the CMOS circuits, and thus, they are not included in the area analysis. These pads make contact to the transmitters and receivers through the third metal level (i.e. metal3 layer). The area of a switch is determined by the number of half-switches per channel (i.e. log2 K, where K = √N with N being the total number of channels of the system), the size of a single half-switch, the area for routing the wires, and the area for the transmitters and receivers. The area of each channel is:
The area of a half-switch is about 230 µm x 110 µm. For each stage, the routing requires 6 horizontal and 6 vertical wires with 2 µm thickness and 2 µm spacing (i.e. x0, x1, y0, y1, sx1_, 
Once the total area of a channel is known, the minimum required constant pitch between adjacent channels can be calculated, as well as the total area for a switch plane:
A(switch-plane) = N * A(channel)
Thus, for a system that is partitioned into K switches of K channels each, for a total of N channels, the length of a side of the complete electronic plane is:
K, A(channel), ∆, A(switch-plane), and L_plane are calculated for N = 256 and N = 4096 in Table 1.
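The geometric relations of this subsection can be illustrated with a short Python sketch. It rests on assumptions that are not stated explicitly in the text: a square array of channels, so that the pitch is ∆ = sqrt(A(channel)) and the plane side is sqrt(N) * ∆, and K = sqrt(N), as implied by the 4096-channel example with 64 switches of 64 channels each. The channel area used in the example is a hypothetical value chosen to give a plane side of 3 cm; it is not taken from Table 1.

```python
import math


def plane_geometry(n_channels, channel_area_um2):
    """Geometry of one switch plane under a square-array assumption.

    Assumed relations (not quoted from the paper's equations):
      K       = sqrt(N)            switches per plane, K channels each
      pitch   = sqrt(A(channel))   constant pitch between channels
      A_plane = N * A(channel)
      side    = sqrt(N) * pitch
    """
    k = math.isqrt(n_channels)
    if k * k != n_channels:
        raise ValueError("N must be a perfect square")
    pitch_um = math.sqrt(channel_area_um2)
    plane_area_cm2 = n_channels * channel_area_um2 * 1e-8  # 1 um^2 = 1e-8 cm^2
    side_cm = k * pitch_um * 1e-4                          # 1 um = 1e-4 cm
    return {
        "k": k,
        "pitch_um": pitch_um,
        "plane_area_cm2": plane_area_cm2,
        "side_cm": side_cm,
    }


# Hypothetical channel area (pitch of 468.75 um), for illustration only.
geometry = plane_geometry(4096, 468.75 ** 2)
print(geometry)
```

With this hypothetical pitch, the sketch gives K = 64, a plane area of about 9 cm² and a plane side of about 3 cm.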
4.2. Speed
The second variable is the speed of the system. Since the switch is pipelined by delaying the external signals by the time it takes the initial input to reach a given stage of half-switches, the overall system speed is equal to the speed of a single half-switch. From the simulations, this is 250 Mb/s. As the technology moves to smaller feature sizes, this speed will increase until it is limited by the RC time constant of the longest wire inside the switch.
For a system with N channels, each switch has K = √N channels, distributed in a √K x √K array configuration. If ∆ is the constant pitch between adjacent channels (eqn. (5)), then the maximum wire length in a switch is:
which is equal to half the length of the side of a switch. Because of possible extra routing that may be needed, another 20% is added to get:
To calculate the limiting frequency, the 0-90% rise/fall time of the circuit is calculated, where a driver transistor is driving the maximum length wire with a capacitive load at the end. The capacitive load is the input capacitance of all the transistors connected to that wire at the next stage.
A distributed RC product is weighted by 1.0 and a lumped one by 2.3, so the resulting time delay is given by [7]:
where R_int and C_int are the resistance and capacitance of the wire (i.e. of the interconnect), R_tr is the on resistance of the driving transistor, and C_load is the total load capacitance due to the input gate capacitance of the transistors of the next stage. R_int and C_int are equal to (L_max * R_wire) and (L_max * C_wire), respectively, where R_wire and C_wire are fabrication-dependent parameters given per unit length.
For the 0.8 µm process with λ = 0.5 µm, and for a 2.0 µm wide wire (i.e. width of the wire = 4λ) in metal2, R_wire = 0.03 Ω/µm and C_wire = 44 aF/µm.
In [7] , it is shown that the on resistance of a MOS transistor can be approximated as:
Since the driver PMOS and NMOS transistors are sized so that their output currents are equal (i.e. the rise and fall times are equal), they have the same on resistance, which for the specific run chosen is R_tr = 585 Ω.
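As an illustration of the delay estimate, the following Python sketch combines the quantities defined above. The form of the expression is an assumption: it is the commonly used approximation for a driver loading a distributed RC line with a lumped capacitance at its end, which is consistent with the weights of 1.0 and 2.3 quoted above but is not necessarily identical to the equation of [7]. The maximum wire length assumes a switch side of sqrt(K) * ∆, and the pitch and C_load values in the example are hypothetical.

```python
import math


def max_wire_length_um(k_channels, pitch_um, margin=0.20):
    """Half the side of a switch (assumed sqrt(K) * pitch), plus a routing margin."""
    return (1.0 + margin) * 0.5 * math.sqrt(k_channels) * pitch_um


def t90_seconds(l_max_um, c_load_f, r_tr_ohm=585.0,
                r_wire_ohm_per_um=0.03, c_wire_f_per_um=44e-18):
    """0-90% rise/fall time of a driver plus wire plus lumped load.

    Assumed form: 1.0 * R_int*C_int (distributed term)
                + 2.3 * (R_tr*C_int + R_tr*C_load + R_int*C_load) (lumped terms).
    """
    r_int = l_max_um * r_wire_ohm_per_um   # wire resistance in ohms
    c_int = l_max_um * c_wire_f_per_um     # wire capacitance in farads
    distributed = 1.0 * r_int * c_int
    lumped = 2.3 * (r_tr_ohm * c_int + r_tr_ohm * c_load_f + r_int * c_load_f)
    return distributed + lumped


# Hypothetical inputs: K = 64 channels per switch, 468.75 um pitch, 100 fF load.
l_max = max_wire_length_um(64, 468.75)
t_wire_only = t90_seconds(l_max, 0.0)
t_loaded = t90_seconds(l_max, 100e-15)
print(l_max, t_wire_only, t_loaded)
```

With these inputs the wire is 2250 µm long, and the wire alone (no gate load) already contributes roughly 0.14 ns, almost all of it from the R_tr * C_int term rather than from the wire resistance.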
To calculate C_load, note that each driver drives all the transistors labeled x0 in one half-switch on the next stage, and the ones labeled x1 in its partner half-switch on the next stage. In addition, it drives all the transistors with y0 as input within its own half-switch, and the ones that have y1 as input in its partner half-switch on the same stage. In the resulting expression (eqn. (11)), L_max is given in µm, and the resulting T_90% is in seconds.
Since T_90% is the rise/fall time of the maximum length wire, the maximum frequency in Mb/s (note that this is a bit rate, not a frequency in MHz) will be:
The above maximum frequency is labeled as RC because this is the frequency limited by the RC time constant of the longest wire. However, the maximum frequency of the circuits, from 
4.3. Power Consumption
In the power consumption calculations, the consumption of the wires inside the half-switches is neglected since these wires are quite short. Only the transistors in the half-switches, and the long wires between stages and between partner half-switches, are taken into account. For all cases, the power consumption is given as:
where C is the capacitive load (i.e. the input gate capacitance for a transistor, or the substrate-to-metal capacitance for a metal wire), V is the voltage swing (i.e. either 2.5V or 5V, depending on whether the third logic level is used or not, respectively), f is the operating frequency in Mb/s (i.e. divide by 2 to convert from Mb/s to MHz), and p is the probability that the signal changes from its previous state (i.e. that the data, or the information on the wire, switches).
First, the probability equations are examined. For a packet-switching network made up of k x k interchange boxes, if the initial probability for data to appear at the first stage is p_0, then the probability of data arriving at the end of the i-th stage can be approximated as [2]:
but in our case, p_0 = 1 (i.e. every input buffer gets a data packet, so the calculations are for a fully loaded network), and k = 2 (i.e. the switch is made up of 2 x 2 bypass-and-exchange switches partitioned into two half-switches), so eqn. (14) simplifies to:
The values of p_i for i = 0,...,12 are given in Appendix 1.
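As an illustration, p_i can be tabulated with a short Python sketch. Both forms used below are assumptions and are not necessarily eqns. (14) and (15): the first is the standard recurrence p_{i+1} = 1 - (1 - p_i/k)^k for a network of k x k switches under uniform random traffic, and the second is a closed-form approximation to it, 2k / ((k - 1) i + 2k/p_0). For k = 2 and p_0 = 1 the closed form reduces to 4 / (i + 4), which equals 1/4 at i = 12, the acceptance probability quoted for the 4096-channel network in the conclusions.

```python
def p_recurrence(stages, k=2, p0=1.0):
    """Arrival probabilities p_0..p_stages from the stage-by-stage recurrence.

    Assumed form: p_{i+1} = 1 - (1 - p_i / k) ** k.
    """
    p = [p0]
    for _ in range(stages):
        p.append(1.0 - (1.0 - p[-1] / k) ** k)
    return p


def p_closed_form(i, k=2, p0=1.0):
    """Closed-form approximation to the recurrence (assumed form)."""
    return 2.0 * k / ((k - 1) * i + 2.0 * k / p0)


exact = p_recurrence(12)
for i, p_i in enumerate(exact):
    print(i, round(p_i, 4), round(p_closed_form(i), 4))
```

The recurrence gives a somewhat lower value at i = 12 (about 0.23) than the closed form (0.25), so the two should be regarded as estimates of the same quantity rather than as interchangeable.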
In addition, let S = log2 N be the number of stages in the whole network, and note that there are only S/2 stages of half-switches in one switch plane, and that the whole network consists of two switch planes. When the power consumption per channel is calculated, the calculations are done for both switch planes together, and the total power consumption is the sum over all the switches on both switch planes.
The total power consumption per channel is grouped into three parts, namely, the half-switches, the output wires, and the contention wires. The half-switches consist of the transistors, since the short wires within the half-switches are neglected. To simplify the expressions, the input gate capacitance of an NMOS transistor with a given width W, expressed in terms of λ, is written as NM(W), and similarly the gate capacitance of a PMOS is written as PM(W). The gate length is not considered since the length of every transistor is equal to 2λ, which becomes 0.8 µm after fabrication. As an example, the capacitance of an NMOS with width = 10 µm = 20λ is NM(20).
Whether the output signals x0, x1, y0, and y1 switch does not depend on the direction of data flow, but on the probability of data arriving at that stage. Then, all the transistors in the three circuits of a half-switch are counted, and the total capacitance for these signals in one half-switch is found. The equations for the transistor capacitance due to contention for one cycle out of "d" cycles. Each data packet is preceded by a header of log2 N address bits, so for a packet with 16 bits of information, d = 16 + log2 N. Then the power consumption per channel due to x0, x1, y0, and y1 is:
where the first term in the first parentheses is for an active channel, which is transmitting data.
The 1/2 term is due to the fact that data has a 1/2 probability of switching from its previous value.
The second term in the first parentheses is for a disabled channel. Then it will output 2.5V (i.e. Table 2 The power consumption per channel due to these two signals is:
The first term inside the bracket says that c0 has a 1/2 probability of switching at every new data packet. The second term says that c1 will switch if there was an input in the previous cycle and there is not one in the following cycle (i.e. p_{i+1} * (1 - p_{i+1})), or if there was not one in the previous cycle and there is one now (hence that term is multiplied by 2). The 1/d factor reflects the fact that this switching occurs once every d bits (i.e. once per data packet). Again, the results are in Table 2.
The capacitance due to the external signals, t and dir, is:
The power consumption due to t and dir is:
Cap.(t) is multiplied by 2 since t gets reset to 0 for one cycle and then equals 1 for the remaining d-1 cycles (i.e. it switches twice every d bits). On the other hand, the direction bit has a 1/2 probability of switching at every data packet, assuming a bi-directional switch operation. In addition, it is divided by d because these signals switch once at every data packet, and is multiplied
If there is a contention, then sx or sy switches, but only one of the other two pairs will switch with it. In other words, the contention comes from one of two channels, so the effective total capacitance is obtained by adding Cap.(sx, sy) to 1/2 of the other two capacitances (i.e. there is a 1/2 probability for either of the channels). Then, the resulting power consumption is:
The total power per channel from the half-switches is simply the sum of eqns. (16)-(19), and is given in Table 2 for N = 256 and N = 4096.
In addition to the half-switches, power is consumed by the long wires between stages and between partner half-switches. These wires can be categorized into two groups: output wires and contention wires. The length of the wires is not the same at each stage. From the 2-D layout (refer to Figure 4), a formula is derived that gives the length of wire between partner half-switches at each stage in terms of ∆, the constant pitch. The length of wire at stage i is then f_i * ∆, where f_i is the multiplicative factor, given as:
The whole summation is pre-multiplied by ∆ since f_i is the length of the wires in units of ∆. The first set of parentheses reflects the fact that the output of a half-switch is connected to y0 and y1 of half-switches at the same stage (i.e. f_i), and to x0 and x1 of half-switches at the next stage (i.e. f_{i+1}), assuming dir = 1. The next bracket covers the two possibilities: the output level on the wire can either be 0 or 1 for d cycles, or it can be 2.5V for one cycle and 0 for the remaining d-1 cycles.
Similarly, the capacitance for the contention wires is the same, and the power consumption is given as:
Again, the summation is pre-multiplied by ∆. In addition, there is a 1/d term because the contention circuitry switches once at every data packet. The 1/2 term is due to the fact that only the sx1 signal drives a long wire between partner half-switches, while sx0 is a signal internal to the half-switch. As a result, the long wire switches only if c0 = 1. The (p_{i+1} - P_s) term is the probability that a contention signal is propagating. This signal is generated only if the data packet is dropped before it reaches its final destination but after it has gone through the given stage of half-switches.
Summing eqns. (16)-(22) gives the total power consumption per channel. As the different components in Table 2 show, the output signals dominate the overall power consumption. This is because the consumption while there is a direct path from Vdd to Ground (i.e. while the output is set at 2.5V) is included in this term. In addition, these transistors turn on and off at every bit, whereas the others are switched only once every d cycles.
The total power consumption for the whole system (only the electronic power) is:
The power density is given as:
In eqn. (24), ∆² is the area per channel per switch plane, and the 1/2 term is included because each channel occupies that much area on each switch plane. Effectively, the total area per channel is (2 * ∆²). From Table 2, the electronic chip will have no trouble handling the heat dissipation of these circuits, despite the fact that there is a direct path from Vdd to GND at certain times.
5. Discussion and Conclusions
Looking at the results from Tables 1 and 2, a network with 4096 channels seems feasible. The side length of a switch plane for a 4096 channel network is found to be approximately 3 cm. Note that a network with N channels is constructed with √N = K switches, and that these switches do not share any internal wires but only the external signals of direction and transmission. As a result, one can easily implement a network of this size by tiling together small switch chips, thus reducing potential fabrication problems on the electronic chips. Such a system has a √K x √K = 8 x 8 array of switches per switch plane, with each switch handling 64 channels. Thus, each switch only takes up a chip of side length 3 cm / 8 = 3.75 mm, which is readily available in 0.8 µm CMOS technology at commercial silicon foundries.
A maximum speed of 250 Mb/s has been simulated for the electronic switches and, knowing that the probability of acceptance of a 4096 channel system is 1/4 (refer to Appendix 1 with i = log2 N = 12 stages), this yields a total throughput of (250 Mb/s) x (1/4) x (4096) = 256 Gb/s. In a related work [8], the modeling presented in section 4 of this paper has been extended to include all the various components of the OTIS system, including the free-space optical system required to implement the network. For the network size and speed mentioned above, the power consumption of the entire optoelectronic system is found to be approximately 90 W. This yields 22 mW per channel for the entire optoelectronic system, which is very competitive with available networks. The power consumption of the switches is modeled to be about 40 W, while the optical power requirement is only 1 W, which is readily available with solid-state lasers. The rest of the power is consumed by the transmitter circuits (2 W) and mostly by the receiver circuits (50 W). It is interesting to note that most of the power consumption comes from the receiver circuits, and mainly from their DC power consumption, which is why these circuits are now receiving a lot of attention in the community in order to improve their performance.
The on-chip power density at 250 Mb/s is calculated to be 5 W/cm². Electronic chips do not require any special heat dissipating schemes until the power density approaches 10 W/cm², so there should not be any thermal dissipation problems for a 4096 channel network.
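The headline figures for the 4096-channel network follow from simple arithmetic, which the following Python sketch reproduces. All input values are the ones quoted in this section; the function and variable names are ours, and 1 Gb/s is taken as 1000 Mb/s.

```python
def aggregate_throughput_gbps(rate_mbps, p_accept, n_channels):
    """Aggregate throughput in Gb/s: per-channel rate x acceptance x channels."""
    return rate_mbps * p_accept * n_channels / 1000.0


N_CHANNELS = 4096

# (250 Mb/s) x (1/4) x 4096 channels
throughput_gbps = aggregate_throughput_gbps(250.0, 0.25, N_CHANNELS)

# About 90 W for the whole optoelectronic system, spread over all channels.
power_per_channel_mw = 90.0 / N_CHANNELS * 1000.0

# A 3 cm plane side tiled with an 8 x 8 array of switch chips.
chip_side_mm = 30.0 / 8

# Quoted power breakdown in watts; the sum is close to the 90 W total.
power_parts_w = {"switches": 40, "optical": 1, "transmitters": 2, "receivers": 50}
total_power_w = sum(power_parts_w.values())

print(throughput_gbps, round(power_per_channel_mw, 1), chip_side_mm, total_power_w)
```

The sketch gives 256 Gb/s, about 22 mW per channel, and a chip side of 3.75 mm, in agreement with the values above; the itemized power figures add up to 93 W, consistent with the rounded total of approximately 90 W.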
As smaller CMOS technologies are employed, the speed of the switches will increase. Adding to this the possibility of increasing the network size in terms of number of channels, one could reach a throughput in the Terabit regime in the near future. Note that as the system size is increased, the system speed will not be reduced, owing to the pipeline structure of the switches, and one can still maintain a relatively high yield on the electronic chips because the switches are independent of one another.
Finally, an optoelectronic switch chip is now being built based on the AT&T flip-chip bonded CMOS-SEED technology, which combines 0.8 µm CMOS chips (as modeled in the previous section) with GaAs MQW optical transmitters and receivers, for which operation at over 600 Mb/s [9] has been demonstrated. This chip is expected to be back from the foundry and packaged in early spring of 1996. Test results will be reported in a future publication.
. Acknowledgments

