Abstract: Future inter- and intra-ULSI interconnect systems demand extremely high data rates (up to 100 Gbps/pin or 20-Tbps aggregate) as well as bidirectional multi-I/O concurrent service, re-configurable computing/processing architecture, and total compatibility with mainstream silicon system-on-chip and system-in-package technologies. In this paper, we review recent advances in interconnect schemes that promise to meet all of the above system requirements. Unlike traditional wired interconnects based solely on time-division multiple access for data transmission, these new interconnect schemes facilitate the use of additional multiple access techniques, including code-division multiple access and frequency-division multiple access, to greatly increase bandwidth and channel concurrency as well as to reduce channel latency. The physical transmission line is no longer limited to a direct-coupled metal wire. Rather, transmission can be accomplished via either wired or wireless media through capacitor couplers that reduce the baseband noise and dc power consumption while simplifying the fabrication process by eliminating the vertical metal studs needed in three-dimensional ICs. These new advances in interconnect schemes would fundamentally alter the paradigm of ULSI data communications and enable the design of next-generation computing/processing systems.
I. INTRODUCTION
THE SYSTEM performance of modern ULSI is being limited by its interconnect bandwidth in both on-chip and chip-to-chip communications [1]-[3]. The rapid evolution of ULSI has demanded that the interconnect system be fast and at the same time flexible, reliable, and low cost. Ideally, future interconnect systems must encompass the following important features:
• Ultra-high data rates (e.g., 100 Gbps/pin or 20-Tbps aggregate [2], defined as the total sum of data rate for each pin on a chip or within a system of chips)
Manuscript received September 8, 2004 . The review of this paper was arranged by Editor B. Zhao.
The authors are with the High Speed Electronics Laboratory, Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594 USA (e-mail: cchien@sst.com).
Digital Object Identifier 10.1109/TED.2005.850699
• Concurrent multi-I/O service for simultaneous and bidirectional communications on a shared transmission medium
• Real-time re-configurability in connectivity and bandwidth for optimized channel efficiency and fault tolerance
Additionally, the fabrication of future interconnect systems must be compatible with the mainstream system-on-chip (SOC) and system-in-package (SIP) technologies for low-cost system production.
Traditional inter-chip and intra-chip communications are based solely on time-division multiple access (TDMA). In a TDMA-interconnect (TDMA-I) system, each I/O pair communicates over a shared transmission medium by transmitting only during its scheduled time slot, in which no other I/O pair may transmit. In essence, time is divided among the individual I/O pairs so that a given transmission medium may be effectively shared. Furthermore, advanced TDMA-I (bus and links) in recent years has exploited multilevel signaling and dispersive signal equalization techniques to achieve multigigabit-per-second throughput [4]-[6]. Nevertheless, this type of system is limited to a fixed and nonreconfigurable architecture that has high data transmission latency and cannot support bidirectional and simultaneous transmission of multiple I/Os on the same physical channel.
To overcome the limitations of traditional TDMA-I, a number of new interconnect schemes have been investigated recently to greatly increase the aggregate data rate and concurrency as well as to reduce latency and power consumption [7]-[13]. These new schemes permit the use of a combination of other multiple access techniques, namely code-division multiple access (CDMA) and frequency-division multiple access (FDMA). In CDMA interconnect (CDMA-I), each I/O pair is assigned one or more pseudonoise (PN) codes with near-ideal correlation properties so that any other I/O pair assigned different PN codes contributes no interference when both transmit concurrently onto the same medium. In contrast, FDMA-I allows sharing of a transmission medium by assigning I/Os to different frequency channels. I/Os assigned different frequency channels may communicate concurrently with virtually no interference, provided that the undesired frequency channels for a given I/O are filtered out properly. FDMA-I and CDMA-I may be combined into a multicarrier CDMA-I, whereby concurrent I/O transmissions are accomplished by properly assigning codes and frequencies to each I/O pair. Fig. 1 compares new schemes based on CDMA-I and FDMA-I against the traditional TDMA-I in terms of two critical features for interconnects used in future ULSI systems: aggregate data rate and re-configurability. Interconnect systems based on the multicarrier CDMA-I achieve the highest aggregate data rate and re-configurability. The high aggregate data rate is a result of the increased bandwidth made available through the use of more than one frequency channel. The high re-configurability arises from the increased number of combined code and frequency channels (set by the number of code channels and the number of frequency channels) that the system can now assign dynamically based on the specific operation requirement.
Note that both FDMA-I and CDMA-I have similar degrees of re-configurability, but FDMA-I has higher aggregate data rates than CDMA-I due to the additional bandwidth made available through multiple frequency channels. In theory, TDMA-I has the same degree of re-configurability as either FDMA-I or CDMA-I in that a TDMA-I scheme can re-configure by rescheduling and reassigning time slots for a given set of I/Os depending on the operation needs. In principle, the number of time slots can be made equal to the number of code channels or the number of frequency channels. However, even though an increase in the number of time slots results in a proportional reduction in the average data rate per I/O pin, the burst data rate for each I/O pin still remains the same as the aggregate data rate. The high I/O burst data rate makes implementation difficult due to the need for high-order modulation and time-domain equalizers at high operating speed. For CDMA-I and FDMA-I, on the other hand, the burst data rate per I/O is inversely proportional to the number of code channels and the number of frequency channels, respectively, which simplifies the signal processing required for each I/O. In particular, for CDMA-I, a Rake receiver architecture may be employed to compensate for time dispersion in the transmission media, and it requires much less complexity than a time-domain equalizer [14]. To put this into the perspective of practical implementation, wired TDMA-I has limited re-configurability because the number of time slots is difficult to increase without excessive complexity and power dissipation in the transceiver system design.
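The contrast between average and burst data rates can be made concrete with a small calculation. The following Python sketch is illustrative only: the function name `per_io_rates` and the 20-Gbps example value are assumptions, and the model simply encodes the statements above (a TDMA I/O bursts at the full aggregate rate during its slot, while a CDMA or FDMA I/O runs continuously at a fraction of it).

```python
def per_io_rates(aggregate_gbps: float, n_channels: int, scheme: str) -> dict:
    """Average and burst data rate per I/O on a shared medium.

    n_channels is the number of time slots (TDMA), code channels (CDMA),
    or frequency channels (FDMA). Idealized model, no overhead or guard bands.
    """
    if n_channels < 1:
        raise ValueError("n_channels must be at least 1")
    average = aggregate_gbps / n_channels
    if scheme == "TDMA":
        # An I/O owns the whole medium during its slot, so it must
        # signal at the full aggregate rate while it is transmitting.
        burst = aggregate_gbps
    elif scheme in ("CDMA", "FDMA"):
        # All I/Os transmit concurrently, each at a fraction of the aggregate.
        burst = aggregate_gbps / n_channels
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return {"average_gbps": average, "burst_gbps": burst}


if __name__ == "__main__":
    for scheme in ("TDMA", "CDMA", "FDMA"):
        print(scheme, per_io_rates(20.0, 4, scheme))
```

Under these assumptions, each scheme delivers the same average rate per I/O, but only TDMA requires every transceiver to operate at the aggregate rate.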
The re-configurability of wired TDMA-I may be improved with the single-carrier radio frequency (RF)-interconnect scheme (or SCRFI) shown in Fig. 1. The SCRFI uses only one frequency channel and achieves throughput similar to that of a wired TDMA-I system. However, it is able to achieve higher re-configurability than TDMA-I, since the transmission medium is no longer limited to fixed wiring; instead, the signal may be wirelessly broadcast through coupling capacitors to communicate with different receivers. Such a scheme not only simplifies the fabrication process by eliminating the vertical metal studs needed in future three-dimensional (3-D) ICs but also reduces the noise and dc power consumption. Transmission through capacitor coupling can also be applied to FDMA-I, CDMA-I, and multicarrier CDMA-I (or MCCDMA-I).
In this paper, we will review recent progress in each of the new interconnect schemes described in Fig. 1 and discuss their applications in future ULSI interconnect implementations and architecture designs of next-generation computer/processor systems. We will first describe a wired CDMA-I that achieves channel re-configurability while providing simultaneous multi-I/O service. We will then discuss the wired FDMA-I, which is able to achieve multiband (or multimode) channel communications. Subsequently, the SCRFI specifically designed for 3-D ICs will be presented. Finally, wireless inter-chip communication based on MCCDMA-I will be discussed along with other system applications such as CDMA dynamic random access memory (CDMA-DRAM) and re-configurable interconnect for next-generation systems (RINGS).
II. CDMA-INTERCONNECT
Conventional TDMA-I bus and links are designed for nonreconfigurable system architectures and have extensive data bus latency and limited system performance. Therefore there is a great demand for improving the efficiency of both on-chip and off-chip interface circuits without increasing the cost and overhead. This could be achieved by bus-oriented and system-oriented approaches by changing the nature of the request stream, increasing the channel concurrency and decreasing the channel latency. To accomplish this, we have proposed using the CDMA-I for future re-configurable SOC [10] .
The proposed CDMA-I uses different PN codes to separate different users and I/O ports on a shared bus. The CDMA-I modulates the data into a spread-spectrum signal using orthogonal Walsh or PN codes [14]. For N I/O ports, the minimum spread ratio is N. The code-modulated signals from the I/Os are combined into a multilevel signal on the channel. Re-configurability is a unique advantage of the CDMA-I because the receiver can detect the data sequences of different I/Os by simply changing codes through firmware, rather than through the additional hardware needed for retiming and/or framing of different users. Because the same code can be used in different CDMA demodulators, the CDMA-I transceiver is more than a fixed port-to-port data switch and can implement arbitrary mappings between I/Os. Moreover, the CDMA-I allows re-configurability not only in connectivity but also in bandwidth, whereby one or more I/Os can be allocated more bandwidth by simply assigning more than one code to the specific I/O. It can also reduce the communication overhead in a packet-switched system by eliminating the need for I/O port addresses in the headers of the transmitted packets. The CDMA-I can be used for re-configurable chip interconnects [10], wired busses [11], and future gigascale integrated systems. Fig. 2(a) shows a 4-port system interconnected by the CDMA-I.
Each port on the bus sends coded data, which can be recovered by any other port with the corresponding code key. Fig. 2(b) gives an example of spreading, de-spreading, and correlating data by using orthogonal Walsh codes in a CDMA-I with two concurrent I/O users. Fig. 3 shows the CDMA-I transceiver architecture, which implements the transceiver for each of the ports shown in Fig. 2(a). The transmitter consists of a baseband CDMA modulator to modulate the user data with the assigned code, a Walsh code generator to provide the orthogonal data "keys", a data combiner to put together all the user data into a serial data sequence with multiple signal levels for transmission, a phase-locked-loop clock generator, and finally a 50-ohm matching output buffer to transmit the signal. The receiver picks up the multilevel signal transmitted through an inter-chip interconnect line; an input matching network similar to the matching used at the transmitter is employed. As the first step, the clock is recovered from the received multilevel signal, and the signal is quantized into multibit data. Each I/O's data can be recovered by demodulating the multibit data with the corresponding Walsh code. An asynchronous error-detection correlator is designed to obtain synchronization of the input data within one symbol clock period.
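The spreading, combining, and de-spreading steps described above can be sketched in a few lines of Python. This is a minimal baseband illustration, not the transceiver implementation: it assumes length-2 Walsh codes, ideal chip synchronization, and a noiseless channel, and the names `WALSH_2`, `spread`, `combine`, and `despread` are hypothetical.

```python
# Length-2 orthogonal Walsh codes, one per port (assumed assignment).
WALSH_2 = {"port0": [1, 1], "port1": [1, -1]}


def spread(bits, code):
    """Map bits {0,1} to symbols {-1,+1} and multiply each symbol by the code chips."""
    chips = []
    for b in bits:
        symbol = 1 if b else -1
        chips.extend(symbol * c for c in code)
    return chips


def combine(*streams):
    """Superimpose the chip streams of all transmitters into one multilevel signal."""
    return [sum(values) for values in zip(*streams)]


def despread(channel, code):
    """Correlate each symbol period with the code and slice the result to a bit."""
    n = len(code)
    bits = []
    for i in range(0, len(channel), n):
        corr = sum(x * c for x, c in zip(channel[i:i + n], code))
        bits.append(1 if corr > 0 else 0)
    return bits


if __name__ == "__main__":
    d0 = [1, 0, 1, 1]
    d1 = [0, 0, 1, 0]
    channel = combine(spread(d0, WALSH_2["port0"]), spread(d1, WALSH_2["port1"]))
    print("multilevel channel:", channel)
    print("recovered D0:", despread(channel, WALSH_2["port0"]))
    print("recovered D1:", despread(channel, WALSH_2["port1"]))
```

With two users the combined signal takes three levels (-2, 0, +2), which is why the receiver needs a multilevel quantizer before correlation. Selecting a different entry of `WALSH_2` at the receiver is the software analogue of re-configuring the link by changing codes.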
Since the received signal has multiple values, a multilevel clock-data-recovery (CDR) circuit is needed to synchronize the clock and recover the baseband multilevel data. To fully utilize the edge information of the input multilevel signal and to avoid extra coding that would decrease the transfer efficiency [15], a multilevel CDR circuit with an Alexander-type phase detector is designed that uses all data transitions to reduce jitter in the recovered clock and data. This is possible because the received data has more transitions due to modulation by different Walsh codes. Fig. 4(a) shows the Alexander-type CDR circuit architecture for the multilevel input signal. It consists of two A/D converters, which are used to quantize the input signal at the rising and falling edges of the clock. To achieve immunity to noise on the control line, and thereby satisfy the jitter requirement of the CDMA-I while using less silicon area, a differential-control VCO using symmetrical MIS varactor pairs is implemented [10].
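For readers unfamiliar with Alexander-type (bang-bang) phase detection, the decision rule can be sketched as follows. This is a simplified binary version given for illustration; the multilevel detector of [10] operates on quantized A/D samples and is not reproduced here. The function name and the sign convention (+1 for a late clock, -1 for an early clock) are assumptions.

```python
def alexander_pd(prev_data: int, edge: int, curr_data: int) -> int:
    """Binary Alexander phase detector decision.

    prev_data and curr_data are consecutive data samples; edge is the sample
    taken on the opposite clock edge, nominally at the data transition.
    Returns -1 if the clock is early, +1 if it is late, 0 if there was no
    transition and hence no timing information.
    """
    if prev_data == curr_data:
        return 0
    # If the edge sample still equals the previous bit, the edge was sampled
    # before the transition, so the clock is early; otherwise it is late.
    return -1 if edge == prev_data else 1


if __name__ == "__main__":
    samples = [(0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 0, 0)]
    for s in samples:
        print(s, "->", alexander_pd(*s))
```

Because only transitions produce a nonzero output, a signal with more transitions (as with Walsh-coded data) gives the loop more frequent phase updates.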
A 2 × 2 re-configurable CDMA-I transceiver has been implemented in TSMC 0.18-μm CMOS technology. The entire interconnect system has been implemented with the transceiver chip-set packaged in TQFP on a printed-circuit board with 31-mil-thick FR4. Fig. 5(a) shows the two transmitted data sequences D0 and D1. They are modulated by different Walsh codes and combined into a multilevel output signal whose peak-to-peak voltage is 400 mV. Fig. 5(b) shows the received data eye of the multilevel output signal. Fig. 6(a) shows the recovered clock spectrum at 2.65 GHz.
III. FDMA-INTERCONNECT
Conventional TDMA-I links transmit digital data either directly or in a PAM form [16]. Those signals occupy only the lowest frequency band (i.e., the baseband) of the physical channel. One may envision the possibility of transmitting data concurrently in RF-modulated frequency bands to extend the data rate of the same physical channel. With proper frequency band separations, transmitted RF-modulated signals can be allocated in the spectrum to minimize the cross-band interference among all available bands. With that in mind, we have developed a novel RF/baseband combined FDMA-Interconnect (FDMA-I) scheme, shown in Fig. 7(c), to allow additional data bandwidth over the conventional baseband-only interconnect. We also find that the FDMA-I data transmission can be designed to support a bidirectional communication link without extra overhead.
FDMA-I can also be adopted to improve the interface bus performance. Fig. 8(a) shows a conventional parallel bus, which suffers from interference between adjacent wires that becomes more serious as the wire spacing shrinks [17]. This problem can be largely eliminated by transmitting data on adjacent wires in different RF-modulated frequency bands, as shown in Fig. 8(c) [9]. Fig. 9 shows the advantage of an FDMA-I system in reducing the cross-band interference. According to system simulations, more than 20 dB of attenuation may be obtainable when transmitting data simultaneously via adjacent RF and baseband channels separated by a carrier frequency of 8.5 GHz, without extra signal filtering. Fig. 10 shows a dual-band example of a system based on FDMA-I. The physical transmission line is terminated at its characteristic impedance at both ends to eliminate signal reflections. Each I/O port contains baseband and RF-band transceivers to support simultaneous multiband data communication. This FDMA-I can be reconfigured to route between I/Os by selectively switching among the various frequency bands, and it can also support both unidirectional and bidirectional data transmission.
The FDMA-I transceiver prototype chip has been implemented in 0.18-μm CMOS [18]. The test setup consists of an inter-chip interconnection situated on an FR4 board with a 10-cm 50-ohm microstrip line and two prototype chips wire-bonded to the ends of the line. Measurements were taken at a total data rate of 4 Gbps/wire (RF band and baseband combined) in both unidirectional and bidirectional modes. Eye diagrams of received data under bidirectional transmission are shown in Fig. 11(a). The RMS jitter of the received data, shown in Fig. 11(b)-(c), is 19.4 and 34.6 ps for the baseband and RF band, respectively. Transmitted and received data patterns of the baseband and RF band (inverted data) are compared in Fig. 11(d). These results demonstrate that the FDMA-I system can provide high-data-rate, bidirectional communications with sufficient jitter performance. The jitter performance can be further improved by using more sophisticated signal filtering and impedance matching techniques.
IV. SINGLE CARRIER RF-INTERCONNECT FOR 3-D IC
With the dramatic developments in semiconductor technology and circuit design, more sophisticated systems have been implemented on a single chip. While the expanding market keeps pushing for higher-speed, lower-power, more powerful, and less costly single-chip systems, it is becoming more difficult to use conventional planar technology to design multifunction and low-cost chip systems, especially in deep-submicron technologies, due to high parasitic capacitance, short-channel effects, and strong crosstalk between interconnects [19]. Furthermore, conventional planar technology also faces fundamental physical limits and will encounter more significant interconnect issues in the future. All of these issues heavily impact next-generation IC development. The 3-D IC has been proposed to overcome the above drawbacks by stacking active device layers or chips. With this alternative, the 3-D IC is expected to surpass the traditional two-dimensional (2-D) IC in chip area, power consumption, timing constraints, and even cost [20]. Therefore, the 3-D IC has gradually become a mainstream direction in future ULSI development.
In 3-D ICs, several key obstacles must be overcome, one of which is to connect multiple device layers effectively. Traditionally, vertical interconnects are formed by etching vias through layers and depositing metal studs to physically connect the various active device layers (Fig. 12) [21]. This method, however, becomes less manufacturable when the total number of vertical active layers becomes large, leading to increased etching depth and vertical line parasitics. To overcome these drawbacks, a self-synchronized single-carrier RF-Interconnect (SCRFI) has been proposed [13]. Unlike the wired via/stud interconnects, the proposed SCRFI is based on wireless capacitor coupling and peak-signal detection. Since the coupling is accomplished through capacitors, there is no need to fabricate vias and studs, and no dc power is consumed along the transmission medium.
The SCRFI is also different from the previously discussed RF-Interconnect (RFI). The RFI reported previously in [9], [18] requires the transmitter to up-convert the baseband signal using an RF carrier and the receiver to down-convert the signal with an RF carrier of the same frequency for data recovery. Although it enables effective data transmission for inter-chip applications, the RFI is less efficient for extremely short-distance inter-layer communication in 3-D ICs. First, the RFI needs precise local oscillators (LOs) in both the transmitter and the receiver for signal modulation and demodulation, which increases the design complexity and manufacturing cost. Second, the LOs at both sides must be synchronized, which requires matched crystals, power-hungry oscillators, and phase-locking circuits.
To overcome the above drawbacks, an SCRFI with smaller chip area and less power dissipation has been realized by using a simple signal peak detector in the receiver for baseband signal recovery [13]. This self-synchronized (or noncoherent) interconnect scheme needs neither an LO for frequency demodulation in the receiver nor an extra phase-locked-loop circuit for data synchronization.
This self-synchronized SCRFI circuit architecture is shown in Fig. 13, in which the transmitter (Tx) includes an input buffer circuit and an ASK modulator [22]; the receiver (Rx) consists of a signal peak detector and an output buffer. Due to the high-pass nature of the inter-layer capacitor coupling (Fig. 13), the baseband signal is first up-converted by the LO carrier. ASK modulation is chosen for its simplicity and efficiency. The ASK signal can be generated by switching the LO carrier on and off through a modulator, as shown in Fig. 14. When the baseband signal is high, the LO carrier is switched on; otherwise the LO is blocked. The ASK-modulated signal, after passing through the channel, is recovered in the Rx by the peak detector, in which both pMOS and nMOS devices are used as diodes so that logic "1" and logic "0" can be passed equally and effectively without threshold loss, as shown in Fig. 15. An output buffer following the peak detector is used to rectify the signal and drive the off-chip load.
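The ASK modulation and noncoherent peak detection described above can be illustrated with an idealized discrete-time model. The sketch below is not the circuit of Figs. 13-15: it assumes an ideal on/off-keyed unit-amplitude sine carrier, a noiseless channel, known bit boundaries, and a fixed decision threshold; the function names and the parameters `samples_per_bit`, `cycles_per_bit`, and `threshold` are hypothetical.

```python
import math


def ask_modulate(bits, samples_per_bit=40, cycles_per_bit=4):
    """On/off-key a unit-amplitude sine carrier with the baseband bits."""
    out = []
    for b in bits:
        for k in range(samples_per_bit):
            carrier = math.sin(2 * math.pi * cycles_per_bit * k / samples_per_bit)
            # Carrier is passed when the bit is high and blocked otherwise.
            out.append(carrier if b else 0.0)
    return out


def peak_detect(signal, samples_per_bit=40, threshold=0.5):
    """Recover bits from the envelope peak in each bit period (no LO needed)."""
    bits = []
    for i in range(0, len(signal), samples_per_bit):
        peak = max(abs(x) for x in signal[i:i + samples_per_bit])
        bits.append(1 if peak > threshold else 0)
    return bits


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1]
    rx = peak_detect(ask_modulate(data))
    print("sent:     ", data)
    print("recovered:", rx)
```

The receiver never uses the carrier phase or frequency, only the envelope amplitude, which is the property that removes the need for a synchronized LO in the receiver.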
A prototype SCRFI has been implemented in 0.18-μm CMOS technology [13]. Fig. 16 shows the eye diagram of the received data for a 3-Gbps PRBS input data stream modulated by a 10-GHz LO. The eye height and width are about 220 mV rms and 257 ps rms, respectively. Fig. 17 shows an excellent output signal jitter of 1.28 ps. The measured bit error rate (BER) is below . The coupling capacitance is chosen as 60 fF in this particular demonstration. Theoretically it can be as low as 10-20 fF, assuming the bias resistance is set at 10-20 kΩ, to ensure that the high-pass corner frequency is significantly lower than the LO frequency. Since the required coupling capacitance is rather low, the implementation of such capacitors should be straightforward in future 3-D ICs.
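The claim that 10-20 fF suffices can be checked with a first-order estimate. The sketch below assumes the coupling capacitor and bias resistor form a single-pole RC high-pass filter with corner frequency f_c = 1/(2πRC); this single-pole model and the function name are assumptions, not taken from [13].

```python
import math


def highpass_corner_hz(r_ohm: float, c_farad: float) -> float:
    """Corner frequency of a first-order RC high-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


if __name__ == "__main__":
    lo_hz = 10e9  # LO frequency used in the demonstration
    for r_ohm, c_farad in [(10e3, 10e-15), (20e3, 20e-15)]:
        fc = highpass_corner_hz(r_ohm, c_farad)
        print(f"R = {r_ohm / 1e3:.0f} kOhm, C = {c_farad * 1e15:.0f} fF: "
              f"fc = {fc / 1e9:.2f} GHz, LO/fc = {lo_hz / fc:.1f}")
```

Under this model the corner frequency falls between roughly 0.4 and 1.6 GHz for the quoted ranges, below the 10-GHz LO in both cases.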
V. SYSTEM APPLICATIONS OF NEW INTERCONNECT TECHNOLOGIES
New generations of SOC or SIP require all the previously mentioned properties of interconnects with high throughput, low latency, and high bandwidths. Deep-submicron and nano technologies add numerous extra challenges to these requirements, such as wire delays, reliability issues, jitter effects, and so on. However, system and application requirements add two additional challenges to this list: low power (and low energy) and re-configurability.
In this section we will discuss the requirements of these next-generation systems. We will illustrate the advantages of, and needs for, new types of interconnect to address these issues. Three types of systems will be described: a general-purpose computing platform, a wireless multicarrier CDMA-interconnect (MCCDMA-I) for SIP integration, and the RINGS platform.
A. General Purpose Computing Platform
The communication between a CPU and the main memory system is performance-limited by the behavior of the bus. Conventional TDMA-I based memory busses are designed for fixed and nonreconfigurable systems and have extensive data bus latency. Therefore, there is a great demand for improving the efficiency of the processor-to-main-memory interface without increasing the cost and overhead.
The bus interface based on CDMA-I described in Section II can be used to achieve several goals: reduced latency, increased bandwidth, re-configurability, and reduced energy consumption. It achieves this by utilizing both CDMA-I and a variation of source-synchronous clocking for multilevel superposition. Multiple off-chip transmitters can occupy the same channel at the same time, but are separated from each other by a set of orthogonal codes [14]. This source-synchronous CDMA interconnect (SSCDMA-I) bus interface, which can be applied to small memory subsystems and chip-to-chip interconnects including the processor-to-cache bus, back-side bus, and shared-memory multiprocessor systems, improves system performance through increased channel concurrency and decreased data bus latency. A lower-power and more cost-effective system is also achieved owing to the reduced number of pins and PCB traces and the smaller die and package size [11], [12].
The cost associated with DRAM memories increases with the number of I/O pins on the DRAM package. The performance of a memory system could be increased somewhat by widening memory channels and/or providing independent DRAM banks. However, both of these approaches increase the cost; furthermore, channel latency and concurrency problems still exist. The high-speed narrow channels used in Direct Rambus DRAM (D-RDRAM) may suffer from long channel latency: for instance, if two read requests arrive at DRAM1 back-to-back, or two read requests arrive at DRAM1 and DRAM2, respectively, the second request must stall until the first request finishes using the shared data bus. Although increasing the bus speed improves the performance somewhat, channel request latency still exists.
The conventional bus interface, as shown in Fig. 18(a), uses binary signaling and requires two PCB channels to transfer 2 bits of data simultaneously. Only one transmitter and one receiver can access the shared bus at a time; the other devices must wait until the two devices finish their transfer. This end-to-end request delay increases the channel latency. In the SSCDMA-I bus interface, as shown in Fig. 18(b), two off-chip transmitters and two receivers can access the shared bus simultaneously using only one PCB channel for the 2-bit data transfer; thus the end-to-end data request delay is removed and the channel latency is decreased. In addition, for the same data rate, the required number of channels is halved. This reduced number of channels decreases the channel power consumption and the overall system cost.
The application of SSCDMA-I to a DRAM memory subsystem is illustrated in Fig. 19. It shows an example with two busses, each having a 4-to-4 CDMA DRAM interface. In the first case, the SSCDMA-I has been configured to read the first bit of each DRAM on bus0 and the fourth bit on bus1. By a simple re-configuration, the memory controller reads all 8 bits of DRAM0. This is one illustration of reconfiguration on the fly: it shows how the memory bandwidth and memory architecture can be adapted to the application running on the CPU. For instance, streaming video has a different memory transfer pattern than data packet routing or desktop applications. Thus we envision that in the future the software or the operating system can adjust the memory and cache hierarchy to bring the data being processed "closer" to the CPU, so that bandwidth and performance are improved while the energy consumption remains low. Fig. 20 briefly compares the characteristics of the CDMA DRAM system with those of conventional DRAM systems. Since the SSCDMA-I bus interface uses current-mode output drive, the power dissipation in a terminated bus is only a function of the total number of high-speed buses [11]. The proposed re-configurable DRAM uses half the number of high-speed buses of the conventional D-RDRAM and hence reduces the channel power dissipation by at least 50% [11], [12].
B. Wireless MultiCarrier CDMA-Interconnect (MCCDMA-I) for SIP
In Section IV, we described the use of a single-carrier RF-interconnect (SCRFI) for vertical integration of 3-D ICs through interlayer capacitor coupling. In that case, the communication distance is extremely short (on the order of 10 μm). Simple ASK modulation in the transmitter and self-synchronized peak detection in the receiver have proven sufficient for effective interlayer data communication. The situation, however, is different when forming an efficient interconnect system for longer-distance and multichannel communications inside the SIP. Future SIP interconnect systems would require wide bandwidth, short latency, real-time re-configurability, and simultaneous multi-I/O service. To meet such system needs, we have developed the MCCDMA-I (multicarrier CDMA-Interconnect) by combining the FDMA-I and CDMA-I to take advantage of both systems [7], [9]. The MCCDMA-I can be connected in either a wired (via direct coupling) or a wireless (via capacitor coupling) fashion depending on the specific system application. The possibility of using the MCCDMA-I in a wireless manner is discussed below; it has the advantages of eliminating the dc power consumption of the transmission line and the baseband switching noise, owing to the high-pass nature of the capacitor coupling.
The intended MCCDMA-I wireless interconnect system is illustrated in Fig. 21 as a miniature wireless local area network (LAN) located inside a SIP (or an MCM) [7]. Like any other wireless communication system, this miniature LAN contains ULSI I/Os as users, capacitor couplers as near-field antennas, RF transceivers, and an off-chip but in-package microwave transmission line (MTL) as a shared broadcasting medium. Output signals can be up-linked to the MTL via transmitting capacitive couplers, then down-linked via receiving capacitive couplers to input ports to fulfill the interconnect function. Electrodes of the capacitor couplers can be easily formed between the uppermost ULSI metal and the MTL. With a shared MTL, combined FDMA/CDMA multiple-access techniques can be used to alleviate the cross-channel interference. With orthogonal-coded and/or frequency-filtered RF transceivers, a passive MTL is suitable to relay ultra-broadband signals up to at least 150 GHz [23]. Fig. 22 shows such a representative MCCDMA-I channel shunted by multiple I/Os through capacitor couplers. Since the channel is designed to hold bidirectional communications, both ends of the MTL are terminated at its characteristic impedance to avoid signal reflection. The transceivers' I/O impedances must be orders of magnitude greater than the line and termination impedances to preserve the MTL's characteristic impedance and achieve dispersion-free signal transmission. Simulations in Fig. 23 show insignificant signal attenuation (0.3-0.8 dB/cm) and dispersion caused by 20/20 shunted I/O transceivers when the transceiver I/O resistances are chosen as 2-5 kΩ and the coupling capacitances are on the order of 10 fF. Assuming the vertical coupling distance is 25 μm and using polyimide as the dielectric between coupler electrodes, we estimate the coupler pad size to be on the order of 1000 μm². Capacitive couplers of this size can be easily implemented in ULSI.
Further simulation based on full-wave system analysis suggests that low-loss and dispersion-free transmission over a broad band is achievable with an SNR of 15 dB, as calculated in Fig. 24, for low-error signal transmission. Here the system is designed to hold five RF carrier channels over the complete 100 GHz; each carrier channel covers a 20-GHz band and contains four CDMA subchannels with 5-10 Gbps/subchannel, depending on the modulation scheme. Under such an arrangement, the S/N budget for the intended RF-interconnect system ...
Fig. 26 illustrates a schematic of a 2-user MCCDMA-I system where each I/O port is allocated a different RF carrier. Within each carrier band, the RF-modulated digital data is further mixed with orthogonal Walsh codes to spread the spectrum. The first RF-modulated CDMA interconnect system has been demonstrated in [9].
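The channel plan quoted above implies an aggregate capacity that follows from simple multiplication. The short sketch below works it out; the function name is hypothetical, and the calculation ignores guard bands, coding overhead, and any loss from imperfect code orthogonality.

```python
def mccdma_aggregate_gbps(n_carriers: int, subchannels_per_carrier: int,
                          gbps_per_subchannel: float) -> float:
    """Aggregate rate of a multicarrier CDMA channel plan (idealized)."""
    return n_carriers * subchannels_per_carrier * gbps_per_subchannel


if __name__ == "__main__":
    n_carriers = 5          # five RF carrier channels of 20 GHz each
    n_subchannels = 4       # four CDMA subchannels per carrier
    low = mccdma_aggregate_gbps(n_carriers, n_subchannels, 5.0)
    high = mccdma_aggregate_gbps(n_carriers, n_subchannels, 10.0)
    print(f"{n_carriers * n_subchannels} concurrent subchannels, "
          f"{low:.0f}-{high:.0f} Gbps aggregate on one shared MTL")
```

Under these idealizations, the 20 concurrent subchannels carry 100-200 Gbps over a single shared transmission line.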
C. RINGS for SOC and SOP
While the previous applications show the usage of SSCDMA-I for conventional CPU-memory communication and MCCDMA-I for wireless interconnect in SIP, the new interconnect technologies can also be used in low-power embedded SOCs. Examples are next-generation cell phones, personal digital assistants (PDAs), portable game devices, etc. The trend is to merge multiple functions into one device, e.g., a cell phone with gaming and video capabilities, while at the same time maintaining the battery life and reducing power consumption. The approach to combining high performance with reduced energy is a SOC architecture with a heterogeneous mix of several dedicated cores, IP modules, embedded processors, etc. For instance, there will be modules for radio communication (e.g., 802.11, Bluetooth, and multiband cellular standards), DSP processors for wireless baseband communication, video and audio decoders, encryption and security modules, and so on. Typically, for the applications running on such a device, the traffic patterns between the IP cores are also heterogeneous: they differ in data rate and may be bursty or regular, periodic or sporadic, etc.
A typical example is illustrated in Fig. 27 : it shows a multimedia mobile phone system on a RINGS platform [24] . Typical for a RINGS architecture is the co-existence of very different optimized modules, connected together by a re-configurable interconnect.
This example illustrates that CDMA, FDMA, and TDMA interconnects can co-exist and provide the right communication channel for the very different traffic flows that must share a single SOC or SIP. The CT-Bus (CDMA/TDMA bus), proposed in [25], integrates CDMA and TDMA in a hierarchical structure and draws on the strengths of both. As shown in Fig. 28, a fixed number of CDMA subchannels are separated by different spreading codes. Two or more subchannels can be grouped into one subchannel group. In Fig. 28, there are four CDMA subchannels, divided into three subchannel groups. DF1 to DF4 denote data flows that need to be assigned to subchannel groups. Because of the channel isolation of the CDMA scheme, data flows on different subchannel groups are well isolated and do not affect one another. Within each CDMA subchannel group, the data flows are further assigned to different TDMA time slots. Traffic flows can be reallocated to subchannel groups simply by re-assigning the spreading codes, which enables reconfiguration of the CT-Bus. Table I specifies four data flows similar to those of the mobile phone system mentioned previously. The CT-Bus has a total bandwidth of 1 Mbps and supports four subchannels. Three subchannel groups have been created, with bandwidths of 512 kbps, 256 kbps, and 256 kbps. The regular data flows, DF1 and DF3, are assigned to a dedicated 256 kbps subchannel. DF2 is assigned to the 512 kbps subchannel. DF4, with periodic data, is assigned to the other 256 kbps subchannel.
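The hierarchical allocation described above can be sketched as a small data structure. This is only an illustration under stated assumptions, not the CT-Bus implementation of [25]: the four subchannels are assumed to be equal at 256 kbps each, the flow on the second 256 kbps subchannel is assumed to be DF4, TDMA slots within a group are assumed to be shared round-robin, and the group names and helper functions are hypothetical.

```python
SUBCHANNEL_KBPS = 256  # assumption: four equal subchannels on a ~1 Mbps bus

# Spreading-code indices 0..3 stand for the four CDMA subchannels.
groups = {
    "G1": {"codes": [0], "flows": ["DF1", "DF3"]},   # regular flows share one subchannel
    "G2": {"codes": [1, 2], "flows": ["DF2"]},       # bursty flow gets two subchannels
    "G3": {"codes": [3], "flows": ["DF4"]},          # assumed: periodic flow DF4
}


def group_bandwidth_kbps(group):
    """Bandwidth of a subchannel group is the number of codes times the subchannel rate."""
    return len(group["codes"]) * SUBCHANNEL_KBPS


def slot_table(group, n_slots):
    """Assumed round-robin TDMA slot assignment among the flows of one group."""
    flows = group["flows"]
    return [flows[i % len(flows)] for i in range(n_slots)]


def reassign_code(all_groups, code, src, dst):
    """Reconfigure the bus by moving one spreading code between groups."""
    all_groups[src]["codes"].remove(code)
    all_groups[dst]["codes"].append(code)


bandwidths = {name: group_bandwidth_kbps(g) for name, g in groups.items()}
g1_slots = slot_table(groups["G1"], 4)

# Reconfiguration: hand one of DF2's codes to the group that carries DF4.
reassign_code(groups, 2, "G2", "G3")
reconfigured = {name: group_bandwidth_kbps(g) for name, g in groups.items()}
print(bandwidths, g1_slots, reconfigured)
```

In this sketch, reconfiguration touches only the code-to-group mapping; the TDMA slot tables inside unaffected groups stay as they are.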
As a comparison, the same data flow patterns have been applied to a WFQ (weighted fair queuing) TDMA bus with a bandwidth of 1 Mbps. The arbiter of the WFQ-TDMA bus uses block mode, in which it transmits the data flow (DF) with the highest priority. If data flows have the same priority, time slots are assigned in proportion to their average data rates.
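The two arbitration rules just stated can be sketched as follows. This is a simplified illustration, not a full WFQ scheduler and not the arbiter used in the reported simulation: the data rates and priorities below are hypothetical placeholders (the values of Table I are not reproduced here), and the largest-remainder rounding is an assumed way to turn proportions into whole slots.

```python
def allocate_slots(avg_rates_kbps, total_slots):
    """Split total_slots among equal-priority flows in proportion to average rate.

    Uses largest-remainder rounding (an assumption) so the slots sum to total_slots.
    """
    total_rate = sum(avg_rates_kbps.values())
    exact = {f: total_slots * r / total_rate for f, r in avg_rates_kbps.items()}
    slots = {f: int(x) for f, x in exact.items()}
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(exact, key=lambda f: exact[f] - slots[f], reverse=True)
    for f in by_remainder[:leftover]:
        slots[f] += 1
    return slots


def pick_flow(pending, priority):
    """Block mode: among flows with pending data, serve the highest priority."""
    ready = [f for f, n in pending.items() if n > 0]
    if not ready:
        return None
    return max(ready, key=lambda f: priority[f])


# Hypothetical average rates (kbps) for three equal-priority flows.
example_slots = allocate_slots({"DF1": 64, "DF3": 128, "DF4": 64}, 16)

# Hypothetical priorities: DF2 outranks DF1, so a DF2 burst blocks DF1.
served = pick_flow({"DF1": 3, "DF2": 5}, {"DF1": 1, "DF2": 2})
print(example_slots, served)
```

The second rule is what lets a high-priority burst delay regular flows on a shared TDMA bus, since lower-priority flows wait until the burst has drained.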
We ran simulations of the CT-Bus and the WFQ-TDMA bus for a period of one second and recorded the average latencies of the data flows. Fig. 29 shows the average latencies of the data flows on the WFQ-TDMA bus. DF2 has the highest priority and has three periodic bursts in a second. The regular data flows DF1 and DF3 are clearly affected by the higher-priority bursty flow DF2. Fig. 30 shows the average latencies of the data flows on the CT-Bus. Because DF1 and DF3 have their own subchannel, their maximum latency is less than 100 µs and the burst traffic from DF2 does not interfere with them. The average latency of DF2 is twice as large as on the WFQ-TDMA bus. However, DF2 still meets its requirement: each burst is delayed by no more than 0.2 s.
The three examples in this section clearly indicate that next-generation systems can no longer rely on TDMA busses alone. Interesting future research includes modeling the new interconnect technologies so that they can be used in architecture and system design.
VI. SYSTEM BENEFITS
With the feature size of CMOS shrinking below 90 nm, many complex data processing and function features that previously could not be integrated on a single chip, because of either large area or high speed, can now be successfully implemented, as evidenced by emerging products with complete SOC and/or SIP. The major challenge in pushing the frontiers of system integration further in the next decade or more lies in the low-cost, low-power interconnect technology needed to provide the aggregate data rate, re-configurability, and latency required by chips or systems of chips with ever-increasing complexity. In this paper, we have reviewed several new interconnect schemes that can potentially meet this challenge, and we now discuss their benefits to the performance of next-generation ULSI systems.
1) Aggregate Data Rate: Several research efforts on inter- and/or intra-chip high-speed interconnects have been pursued in the recent past. For example, high-speed end-to-end wire interconnects (with signal-processing hardware, such as channel equalizers, at both ends) for data transmission speeds of 4-10 Gbps have been successfully demonstrated [4], [18]. However, such wire interconnects are often based on TDMA-I, and they require sophisticated equalizers to compensate for the frequency-dependent attenuation of the wire. In particular, since the I/O data rate equals the aggregate data rate and cannot be separately controlled in such a system, the aggregate data rate is likely to be constrained to 40 Gbps/pin even in 65-nm CMOS technology (limited by of the CMOS technology [26], [27], where GHz forecasted by ITRS [28]). In contrast, our proposed interconnect schemes based on CDMA and FDMA can achieve an aggregate data rate beyond 100 Gbps/pin given the availability of the same 65-nm technology (in principle, the maximum carrier frequency could approach ). This is because, in CDMA and FDMA, the I/O data rate can be reduced by increasing the number of codes and the number of frequency channels, respectively. For example, given an MCCDMA-I scheme with 32 frequency channels, 4 codes, and an aggregate data rate of 128 Gbps/pin, each I/O then runs at 1 Gbps. At this speed and with four codes, the correlator required for each I/O can be implemented in less than 1000 m while dissipating less than 1 mW in 65-nm CMOS technology [7], [28].
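The arithmetic behind this example can be written out as a short sketch. The function and parameter names are hypothetical, and reading the 32 channels as frequency channels and the 4 as codes is an interpretation of the example rather than notation taken from the text.

```python
def per_io_rate_gbps(aggregate_gbps, n_freq_channels, n_codes):
    """Per-I/O data rate when the aggregate rate is split across channels and codes."""
    return aggregate_gbps / (n_freq_channels * n_codes)


# MCCDMA-I example from the text: 128 Gbps/pin over 32 channels and 4 codes.
mccdma_rate = per_io_rate_gbps(128, 32, 4)

# TDMA-I baseline: a single channel and no code multiplexing, so the I/O
# must run at the full aggregate rate.
tdma_rate = per_io_rate_gbps(128, 1, 1)

print(mccdma_rate, tdma_rate)
```

The comparison shows why the multiple-access schemes relax the circuit speed requirement: the same aggregate rate is carried while each I/O runs 128 times slower than it would on a TDMA-only link.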
2) Concurrent MultiI/O Services: As systems become more heterogeneous, data from a variety of sources with different latency constraints and/or priorities must be communicated between various subsystems. It becomes important to provide concurrent multiI/O services to these subsystems so as to reduce latency, as discussed in the two earlier examples on DRAM and RINGS. Using FDMA and/or CDMA, one can readily provide concurrent multiI/O services while maintaining reliability. Another potential benefit of multiI/O services to a complex SOC is that they may enable the architect to cut down the number of I/O ports on the chip: because FDMA-I can be used together with CDMA-I to interleave multiple data sets on the network, one I/O port can serve multiple I/O needs simultaneously.
3) Re-Configurability: As systems become more versatile, next-generation devices may become configurable in real time into different functions. For instance, the next-generation smart phone can morph into a portable digital television, DVD player, or handheld computer. The architectural flexibility increasingly demanded by such a device is the ability to reconfigure any given subsystem and its interconnections in real time. In our system, CDMA-I, FDMA-I, or MCCDMA-I allows each I/O pair to choose an orthogonal address code and/or frequency channel. The address code can be changed electronically to reconfigure the interconnect on the fly.
VII. SUMMARY
Future ULSI interconnect systems, whether in SOC or SIP, must not only be extremely fast but also be re-configurable and capable of providing bidirectional and multiI/O services. In this paper, we have reviewed recently developed advanced RF/baseband interconnects that can be further developed to satisfy all of the above system needs. Unlike traditional wired interconnects based only on TDMA-I for data transmission, these new interconnect schemes allow the use of additional modern multiple access algorithms, including CDMA-I, FDMA-I, SCRFI, and MCCDMA-I and their derivatives, to boost the bandwidth and the channel concurrency while reducing the channel latency. The physical transmission medium is also no longer limited to the direct-coupled metal wire. It can be capacitor-coupled and/or run through any guided/free transmission medium. The capacitor-coupled SCRFI can also play an important role in future 3-D ICs by eliminating the costly vertical metal interconnects. The continued development of the new interconnect schemes would inevitably change the scope of the ULSI data communication structure and impact next-generation computer system design.
