Abstract. The continuing improvement of processor performance has increased the demand on interconnection bandwidth at a rate that outpaces the bandwidth provided by conventional electrical interconnects. By combining a high-bandwidth optical interconnect technology with the ubiquitous high-performance CMOS technology, optoelectronic routers show the potential to supply greater bandwidth capacity as well as complex functionality suitable for developing high-performance interconnection networks demanded by current and next-generation processors. However, developing optoelectronic chips at this level of complexity is not conventional and, hence, there are several issues to be investigated. In this work we evaluate design issues regarding the integration of complex CMOS core circuitry with optoelectronic SEEDs using a semi-analytical model. Our results show that complex optoelectronic chips can still yield better interconnection bandwidth compared to high-performance CMOS chips, albeit at the expense of decreased transistor density and longer critical paths.
Introduction
The interconnection network is the communication backbone of a parallel processor system on which all remote data accesses occur and, thus, has a strong influence on the overall performance of the system. While the bandwidth delivered by conventional electronic-based networks has increased slowly in recent years, the bandwidth demanded by processors has increased at a much faster pace, soon causing the network to become a performance bottleneck. Optoelectronic-based networks can potentially provide much higher bandwidth capacity to mitigate this problem [1, 2]. However, with the present growth rate of communication demand of distributed multiprocessor systems, even a high-bandwidth optoelectronic network (see figure 1) could become oversaturated unless advanced routing techniques are incorporated in the interconnect architecture to efficiently utilize the bandwidth.
In this paper, we evaluate the design trade-offs and present the architecture and operation of the WARRP II router chip (wormhole adaptive recovery-based routing via preemption). The router supports true fully adaptive routing that allows packets to use any profitable path through the network without restrictions. Network traffic can thus be evenly distributed throughout the links, achieving maximum bandwidth utilization. The router employs novel mechanisms to handle network deadlocks, i.e. hold-and-wait situations in which no packet can make progress due to cyclic dependences on network resources. Although previous work has indicated that deadlocks can occur infrequently [3], they must be guarded against to ensure proper network operation. The WARRP II router handles impending deadlocks by progressively recovering from them. Each router uses a centralized deadlock buffer, shared among its neighbours, to perform a deadlock recovery operation through what is essentially a deadlock recovery path used to route a potentially deadlocked packet.
On the optoelectronic side, the WARRP II router explores a critical design space that integrates large and complex circuitry with a dense array of optoelectronic devices. Although this is integral to the development of high-performance optoelectronic processors, memories, switches and network interfaces, it has not been heavily investigated. The rest of the paper is organized as follows. Section 2 gives background for this work. Section 3 presents the optoelectronic implementation of WARRP II and provides simulation results. Section 4 explains WARRP II design issues and constraints and models the effects of those constraints on complex CMOS/SEED integration. Section 5 gives the details of the WARRP II router operation. Section 6 concludes this work.
Background
Network and system performance can benefit from the increased bandwidth provided by optical interconnects in several ways. First, with dense optical I/O pin-outs, network latency can be reduced by means of wider channels and/or low-diameter topologies. Second, network fault tolerance can be improved by robust organization of the dense I/O pin-outs, e.g. by assigning redundant links to each channel. Third, high-bandwidth capacity can help support latency-hiding techniques (e.g. multithreading and prefetching) to mitigate the widening processor-memory performance gap. Because of these potential advantages, optoelectronic network routers are an immediate application for the high bandwidth provided by optical interconnects. Figure 2 plots estimated WARRP router complexity in terms of the number of transistors required and the number of I/Os required for various architectures. We show that the most complex WARRP router (8D-256B-Bi-3VC) can be implemented with ∼20 × 10⁶ transistors and ∼10 000 I/Os, which is near-term technology for both semiconductors and optoelectronics. In such a configuration, the WARRP router's performance would easily surpass that of the current- and next-generation state-of-the-art network routers [4, 5].
Our first optoelectronic router implementation was the WARRP core [6]. It integrated 1400 GaAs-MESFET transistors with six pairs of LED/OPFET photodetectors on a 2 × 1 mm² chip. The chip was designed to implement only those core functions sufficient to demonstrate deadlock recovery router mechanisms. Our next-generation optoelectronic router chip is the WARRP II router, which is a fully functional deadlock recovery router. In designing the chip, several architectures were explored and design trade-offs made in implementing the scaled-down version of the WARRP architecture [7]. In section 4, we explain some of these design issues and their implications, but first we present the basic chip implementation resulting from these architectural and technological constraints.
Figure 2. WARRP router complexity plotted in terms of the number of transistors and I/Os required (excluding power and ground pins), ranging from a small four-bit-wide unidirectional-link torus with one virtual channel (1D-4B-Uni-1VC) to a large 256-bit-wide bidirectional-link eight-dimensional torus with three virtual channels (8D-256B-Bi-3VC). Most data points (up to 8D-16B-Bi-3VC) were obtained by exhaustively synthesizing each design with the EPOCH CAD tool from CASCADE Inc. With current processor trends, 64-bit-wide or 256-bit-wide channels should soon be common.
WARRP II implementation
The WARRP II router chip implements a scaled-down, fully functional version of the WARRP network router architecture [7], integrating an array of 20 × 10 SEEDs with 2 × 2 mm² of CMOS circuitry via flip-chip bonding. Each self-electrooptic effect device (SEED) is 20 × 60 µm² with a horizontal pitch of 62.5 µm and a vertical pitch of 125 µm, and operates at a wavelength of 850 nm. Recent experiments show that this promising technology can provide more than 47 000 devices on a 3.7 × 3.7 mm² area in the near future [1], and each can currently operate at up to 2.48 Gb s⁻¹ with only 300 µW optical power input in dual-rail mode [2]. Using the HP14B CMOS process (a 0.5 µm, three-metal layer, 3.3 V supply voltage process), this chip contains approximately 15 000 transistors, of which 3500 are used for I/O pad drivers and optical transceivers. These peripheral circuits occupy almost 40% of the chip area, leaving the remaining 60% for the router circuitry. Figure 3 shows the internal modules of the WARRP II chip, which consists of four-flit-deep input buffers, three-flit-deep output buffers, an address decoder, a 2 × 3 crossbar, a crossbar arbitrator and a deadlock core module (i.e. a deadlock buffer and its associated flow controller and channel pre-emption logic implemented in the WARRP core chip [6]). This chip implements a four-bit-wide unidirectional torus-connected topology with one virtual channel and associated deadlock recovery mechanisms (1D-4B-Uni-1VC as shown in figure 2) using 20 optical I/O pin-outs (18 I/Os were used for router ports and two I/Os were used for testing purposes). Another 16 signals (for the processor port) were implemented electrically.
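The geometry and transistor figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the 20-device side of the array lies along the horizontal pitch and that the array span is counted as one full pitch per device; the constant and function names are illustrative only.

```python
# Figures quoted in the text; the 20-wide by 10-high orientation is an assumption.
COLS, ROWS = 20, 10
PITCH_X_UM, PITCH_Y_UM = 62.5, 125.0
TOTAL_TRANSISTORS, PERIPHERAL_TRANSISTORS = 15_000, 3_500


def array_span_mm(cols, rows, pitch_x_um, pitch_y_um):
    """Span of the SEED array in mm, counting one full pitch per device."""
    return cols * pitch_x_um / 1000.0, rows * pitch_y_um / 1000.0


span_x_mm, span_y_mm = array_span_mm(COLS, ROWS, PITCH_X_UM, PITCH_Y_UM)

# Transistors left for the router core once pad drivers and transceivers are removed.
core_transistors = TOTAL_TRANSISTORS - PERIPHERAL_TRANSISTORS
peripheral_share = PERIPHERAL_TRANSISTORS / TOTAL_TRANSISTORS

print(span_x_mm, span_y_mm)
print(core_transistors, round(peripheral_share, 3))
```

Under these assumptions the array spans roughly 1.25 mm on each side, and the peripheral circuits account for under a quarter of the transistors while occupying almost 40% of the chip area.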
Our design was simulated extensively using switch-level IRSIM (due to its complexity, exhaustive SPICE simulations were not possible given the limited design time). The maximum operation speed is estimated to be 25 MHz, about half that expected in the original design. This is due to a longer critical path resulting from limited metal-3 usage (see section 4). An FPGA version of the WARRP router was also designed and simulated, yielding a speed similar to that of its optoelectronic counterpart. A testboard for both versions is being built. We expect a system-level test set-up of the FPGA version to be completed by the fourth quarter of 1998. An electronic version of the WARRP II chip should be available by the third quarter of 1998, followed by an optoelectronic version sometime later.
Optoelectronic design issues and constraints
Implementation of complex optoelectronic designs such as the WARRP II router raises critical integration issues that must be addressed. First, the chip I/Os are randomly organized by the CAD design tools to optimize chip performance. This makes it difficult to connect those I/Os to a regularly distributed SEED array. Second, the I/Os should be laid on the dies in a structured pattern in order for chips to be connected by a space-invariant optical system. Third, to achieve production-level yield, the SEED array is limited to a 3.7 × 3.7 mm² area [1], which is typically smaller than that of a complex CMOS-VLSI circuit. Thus, optoelectronic I/O pin-outs can only be located over a specific region of the chip. Fourth, current integration techniques require that at least the top metal layer be exclusively used for SEED wiring. Finally, a large array of SEEDs must be efficiently integrated with complex CMOS core circuitry. Note that these issues are exclusive to large and complex hybrid CMOS/SEED integration, referred to as level-5 genius pixels in [8] or core-based designs in [9]. In contrast, smaller pixelated circuits are much easier to optimize and fit in a structured array under the SEEDs, which usually is the case for level 4 and lower pixel-based designs.
High-performance ball-grid array (BGA) packaging also features area-distributed I/Os (called 'balls') for the CMOS chips. However, this packaging technique is not restricted by any of the above issues, for the following reasons. First, it requires a transceiver that is already compatible with CMOS and can be seamlessly and efficiently integrated with the core circuitry. Second, each transceiver can be wired to any of the nearest balls; there is no space-invariant constraint. Third, metal layers are not used exclusively for ball wiring. Finally, the ball array can be distributed throughout the entire chip area.
Intuitively, the above design issues lead to a reduction of the transistor density of core-based designs compared to optimized pure-CMOS versions. The main reason is that longer wires are required to connect the SEED array to the corresponding transceivers and CMOS I/Os, which effectively reduces the availability of metal layers for circuit wiring. In addition, the critical paths are expected to lengthen, lowering the clock rates enabled by the technology. In what follows, we elaborate on these effects specific to our WARRP II router design and, further, project these effects onto current and next-generation technologies.
The design of WARRP II was split into two phases: CMOS circuitry optimization by EPOCH and manual integration of the CMOS circuitry with the SEED array using MAGIC. Because SEEDs are bonded to metal-3 pads and a large number of global connections are needed in the design, only two metal layers could be used by the CMOS circuitry. Another limitation was that the design had to fit in a ∼1.6 × 1.6 mm² area (which excluded the I/O pads). Our synthesis tool yielded a circuit density of ∼6000 transistors/mm² without the metal-3 layer or ∼19 000 transistors for the entire area. Our observations show that, on average, our layouts expanded by 35% without metal-3. Since the transistors are moved farther apart to make room for circuit wiring, the critical path length doubles. This severely affects the chip functionality and performance as stated in section 3. Figure 2 shows that, with this constraint, only a 1D-4B-Uni-1VC configuration of the WARRP router could be implemented, which uses ∼11 500 transistors and 34 I/O pin-outs. Thus, our design was transistor-limited as a result of the reduced transistor density due to optoelectronic integration. Our design would have been optical-I/O-limited in implementing a 2D-8B-Bi-1VC configuration of the WARRP router with 122 optical I/Os (dual-rail) if a 5 × 5 mm² chip area were available. Before discussing the operation of the WARRP II router, we speculate on the effects of the wiring constraints of core-based CMOS/SEED integration as they pertain to the implementation of future designs.
We estimate the effect of optoelectronic integration on transistor density by assuming that each SEED connection with CMOS I/O is randomly distributed. SEEDs are divided into two equal halves representing the transmitter and receiver groups, and are placed symmetrically about the optical axis, which is parallel to the x-axis. Collectively, these connections take away some of the interconnection length provided by the interconnect layers and, thus, support the connection of a lower number of transistors. Consider the case in which the SEED array is at least equal to the underlying CMOS circuit area. For this case, as is the case in our WARRP II design, at least two metal layers are used to connect the CMOS I/Os to the SEED array in the x- and y-directions [9].
We use the notion of wiring capacity to represent the number of wires that can be placed per unit SEED area (the area surrounding each SEED). The wiring capacity available in the x- and y-directions is given by equations (1) and (2), where K_i and K_j are the wiring utilizations of metal layers i and j, respectively, D is the total number of SEEDs, P is the pad size, X_pitch and Y_pitch are the pitches of the SEEDs in the x- and y-directions, and m_X-pitch and m_Y-pitch are the pitches of the metal layers used in the x- and y-directions, respectively. Here m_X is the top metal layer, which is also used for the bonding pads, and m_Y is the metal layer directly beneath m_X, which can therefore use all of the SEED area for routing.
In calculating the wiring cost, we assume that transceivers are in the proximity of the SEEDs to which they are to be connected and that all signals are dual-rail. The wiring cost per unit SEED area required to route the entire SEED array in the x- and y-directions is then given by equations (3) and (4), where D_x and D_y are the numbers of diodes in the x- and y-directions, respectively. The ratios (1)/(3) and (2)/(4) determine whether additional metal layers are required in order to route the SEED array. If so, the procedure is repeated until both the x and y wiring costs are covered by the wiring capacity, with P in equation (2) disappearing since bonding pads are no longer a wiring constraint.
All the parameters except wiring utilization are usually known during the design process. The methodology for estimating the wiring utilization is quite complicated and, hence, is not elaborated here (details can be found in [9]). An illustration of the models and the assumptions is shown in figure 4. By modelling the integration in this way, we can speculate on the effects of core-based CMOS/SEED integration on the number of metal layers required to route the SEED array and on the transistor density. Using projections from semiconductor and optoelectronic technology roadmaps [10, 11], we estimate that the required number of metal layers for SEED wiring increases steadily with the technology and could reach four metal layers for submicron technologies, as shown in figure 5. The primary reason is the increasing number of SEEDs in the array (see figure 6).
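The layer-counting procedure described above (add metal layers until the wiring capacity covers the wiring cost in each direction) can be sketched as follows. This is only an illustration under stated assumptions: the capacity and cost values are supplied as plain numbers rather than computed from equations (1) to (4), each layer is assumed to route in a single direction, and the function names and sample values are hypothetical.

```python
from itertools import accumulate


def layers_to_cover(cost, layer_capacities):
    """Number of layers, taken in order, whose summed capacity first covers the cost."""
    for count, total in enumerate(accumulate(layer_capacities), start=1):
        if total >= cost:
            return count
    raise ValueError("wiring cost exceeds the capacity of all available layers")


def seed_wiring_layers(cost_x, cost_y, caps_x, caps_y):
    """Total metal layers needed to route the SEED array in both directions.

    caps_x and caps_y list the per-layer wiring capacity (wires per unit SEED
    area) of successive layers assigned to the x- and y-directions.
    """
    return layers_to_cover(cost_x, caps_x) + layers_to_cover(cost_y, caps_y)


# Hypothetical values: the first x-layer has less capacity because it also
# carries the bonding pads; added layers do not.
layers = seed_wiring_layers(cost_x=3.0, cost_y=2.5,
                            caps_x=[2.0, 2.4], caps_y=[2.6, 2.6])
print(layers)
```

With these sample numbers the x-direction needs two layers and the y-direction one, so three layers would be devoted to SEED wiring.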
Consequently, fewer transistors can be connected on the chip because less interconnect is available for circuit wiring. Nevertheless, the decreased transistor density is not critical to the design of core-based CMOS/SEED chips because this effect exists only in the area beneath the SEED array and transistors are getting cheaper with time.
In addition, the reduction in transistor density also implies longer connections among transistors, some of which may lie in the critical paths and, hence, decrease the achievable clock rates. The results from exhaustive experiments on various core-based chip designs [9] show that the on-chip clock could be reduced by as much as 30% in the next decade.
In contrast, BGA packaging for pure-CMOS chips does not require as regular an I/O pattern and, thus, it is possible to utilize all the transistor density. However, in terms of I/O pin-out capacity and aggregate off-chip bandwidth (assuming that off-chip clock rates are equal to on-chip clock rates for CMOS/SEED chips), optoelectronic packaging can offer up to an order of magnitude higher I/O capacity and bandwidth. Because both qualities could pose more of a constraint, as suggested in figure 2, this emerging technology enables the design of pin-out-limited routers which are not possible with the BGA packaging technique (e.g. the 8D-256B-Bi configuration). These results are depicted in figure 6.
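As a rough illustration of this trade-off, the sketch below computes aggregate off-chip bandwidth as the product of I/O count and per-I/O signalling rate. The pin counts and clock rates are hypothetical placeholders, not values read from figure 6; they merely combine a tenfold I/O advantage with a 30% lower clock rate.

```python
def aggregate_bandwidth_gbps(io_count, clock_mhz, bits_per_cycle=1):
    """Aggregate off-chip bandwidth in Gb/s, assuming every I/O transfers
    bits_per_cycle bits on each clock cycle."""
    return io_count * clock_mhz * bits_per_cycle / 1000.0


# Hypothetical operating points (assumptions for illustration only).
bga_gbps = aggregate_bandwidth_gbps(io_count=1_000, clock_mhz=500)
seed_gbps = aggregate_bandwidth_gbps(io_count=10_000, clock_mhz=350)
ratio = seed_gbps / bga_gbps

print(f"BGA: {bga_gbps:.0f} Gb/s, CMOS/SEED: {seed_gbps:.0f} Gb/s, ratio: {ratio:.1f}x")
```

Even with the clock penalty applied, the much larger I/O count dominates the product in this example, which is the qualitative point made in the text.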
WARRP II router operation
We now describe the operation of the WARRP II router using two scenarios: normal packet transmission and deadlock recovery transmission. Under normal transmission, a packet is routed to the next node via one of the available output buffers. There is no restriction on which path can be selected so long as that path is minimal, i.e. it moves the packet closer to the destination. The packet is transmitted to the receiving node at a four-bit-wide flit (flow control digit) granularity upon the availability of the input buffer and the channel through the DATA lines.
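The rule that any minimal path may be selected can be sketched for a general torus as follows. The function name, the coordinate-tuple representation and the (dimension, direction) port encoding are illustrative assumptions, not the WARRP II address decoder logic; in the 1D unidirectional configuration implemented on the chip there is only one candidate port.

```python
def profitable_ports(current, destination, radix, bidirectional=True):
    """Return the (dimension, direction) output ports that lie on a minimal path.

    current and destination are coordinate tuples on a torus with `radix`
    nodes per dimension; direction is +1 or -1 along that dimension.
    """
    ports = []
    for dim, (cur, dst) in enumerate(zip(current, destination)):
        forward = (dst - cur) % radix      # hops needed in the + direction
        if forward == 0:
            continue                       # already aligned in this dimension
        backward = radix - forward         # hops needed in the - direction
        if not bidirectional:
            ports.append((dim, +1))        # only the + links exist
            continue
        if forward <= backward:
            ports.append((dim, +1))
        if backward <= forward:
            ports.append((dim, -1))
    return ports


print(profitable_ports((0, 0), (1, 3), radix=4))
print(profitable_ports((3,), (1,), radix=4, bidirectional=False))
```

A fully adaptive router would then choose among the returned ports according to which output buffers are currently available.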
Since the WARRP II router employs asynchronous data transmission, SEND STROBE and FULL signals are required to perform flow control. To fully utilize the link bandwidth, the flow control signals are fully pipelined. The input buffers are four-flit deep to account for the round-trip wire delay and the data latch delay. By exchanging both signals, the receiver knows when to latch the incoming flit via the SEND STROBE signal and responds back to the sender via the FULL signal. This process is illustrated in figure 7(a) .
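The relationship between buffer depth and round-trip delay can be illustrated with a small simulation. This is a simplified, cycle-based sketch (the actual link is asynchronous): it assumes a fixed one-way delay of `delay` cycles (at least one) for both the flit with its strobe and the returning FULL signal, and it asserts FULL whenever fewer than 2 × delay slots are free. The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate_link(num_flits, depth, delay, consumer_ready, cycles):
    """Return (flits delivered to the core, peak input-buffer occupancy)."""
    data_wire = deque([None] * delay)    # flits in flight toward the receiver
    full_wire = deque([False] * delay)   # FULL values in flight toward the sender
    buffer = deque()
    sent = delivered = peak = 0
    for t in range(cycles):
        # Receiver latches the flit whose strobe arrives this cycle.
        flit = data_wire.popleft()
        if flit is not None:
            buffer.append(flit)
        peak = max(peak, len(buffer))
        # Receiver hands one flit to the router core if the core can take it.
        if buffer and consumer_ready(t):
            buffer.popleft()
            delivered += 1
        # FULL is asserted while free slots could not absorb the flits that
        # may still be in flight during one round trip.
        full_now = (depth - len(buffer)) < 2 * delay
        # Sender sees FULL as it was `delay` cycles ago.
        full_seen = full_wire.popleft()
        full_wire.append(full_now)
        if sent < num_flits and not full_seen:
            data_wire.append(sent)
            sent += 1
        else:
            data_wire.append(None)
    return delivered, peak


stalled = simulate_link(20, depth=4, delay=2, consumer_ready=lambda t: False, cycles=50)
flowing = simulate_link(20, depth=4, delay=2, consumer_ready=lambda t: True, cycles=50)
print(stalled, flowing)
```

In this model a four-flit buffer never overflows for a one-way delay of up to two cycles, even when the core stalls indefinitely, while an unblocked core still receives one flit per cycle.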
A blocked packet is considered to be potentially deadlocked if it has made a dateline [5] crossing. If this condition is met, the packet becomes eligible to undergo the deadlock recovery process. Note that the deadlock detection process takes place concurrently on every node and, hence, can lead to multiple outstanding deadlocked packets requesting the shared deadlock buffers. To prevent deadlock from occurring during the deadlock recovery process, WARRP II implements mutually exclusive access to the shared deadlock buffers using an on-chip asynchronous token circuit. In normal transmission, each node lets the token propagate throughout the system in a cyclic path. Once a node has detected a potential deadlock, it captures the token when the token next arrives. Since only one node can capture the token at any instant and the router arbitrates among its packets, mutual exclusion is achieved.
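The token-based mutual exclusion can be sketched as a walk around a ring of nodes. This is a minimal software illustration, not the asynchronous token circuit itself; the function name, the list-of-flags representation and the single-lap limit are assumptions.

```python
def circulate_token(wants_token, start=0):
    """Pass the token around the ring once, starting at node `start`.

    wants_token[i] is True if node i has detected a potentially deadlocked
    packet. Returns the index of the node that captures the token, or None
    if the token completes a lap without being captured.
    """
    num_nodes = len(wants_token)
    node = start
    for _ in range(num_nodes):
        if wants_token[node]:
            return node                  # this node holds the token exclusively
        node = (node + 1) % num_nodes    # otherwise let the token propagate
    return None


requests = [False, True, False, True]
first = circulate_token(requests, start=2)
print(first)

# After recovery completes, the holder releases the token to its successor.
requests[first] = False
second = circulate_token(requests, start=(first + 1) % len(requests))
print(second)
```

Although two nodes request recovery at the same time in this example, they are served one after the other, because only the node currently holding the token may use the deadlock buffers.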
If the token is successfully captured, the potentially deadlocked packet is forwarded to the node's deadlock buffer. Recovery path formation begins by exchanging the DB REQ (deadlock buffer request) and the DB ACK (deadlock buffer grant) handshaking signals between nodes and repeating the process along the path to the destination. Physical channel bandwidth is pre-empted from normal packets along the way, using the DB PATH signal, to transmit the deadlocked packet to the deadlock buffer in the next node. When the deadlocked packet has reached its destination, the recovery path is torn down and normal transmission resumes. The deadlock recovery process between two nodes in the recovery path is illustrated in figure 7(b) .
Conclusion
The WARRP II chip is the first fully functional deadlock-free optoelectronic network router that explores a complex design space and integrates complex VLSI circuitry with a dense SEED array. Such integration capability fuels the case for complex optoelectronic-based processors, routers and memories, which appear likely in the near future owing to the increasing bandwidth demanded by multiprocessor systems. Some obstacles in integrating optical transceivers with complex electronic circuitry must still be addressed, such as connecting the irregularly distributed CMOS I/Os to a regularly patterned SEED array in such a way that it is supported by a space-invariant imaging system. This issue can diminish the fundamental advantages of CMOS/SEED integration technology because of long wires and reduced transistor density. Based on our semi-empirical models, however, the results show that such negative effects do not seriously impact the chip performance, especially in terms of off-chip bandwidth, which is shown to be up to an order of magnitude higher than that of BGA packaging.
