Approved for public release; distribution unlimited.
Introduction
Several technology trends have converged on a need for ultra-wideband digital-to-analog converters (DACs). The first is the push for higher data rates in communication systems: with wider instantaneous bandwidths, more data can be sent in parallel, increasing the data rate. The second is the push for software-defined radio (SDR) (1). SDR looks to replace many parallel radios with a single radio that can perform all of the same functions through software reprogramming. One of the key requirements of SDR is a DAC that can handle the bandwidth requirements of all the individual radios. The third is the military's need to reduce the size, weight, power, and cost (SWAP-C) of existing electronic equipment. This can be accomplished through the use of SDR, or by replacing legacy systems that were implemented in older technologies. Those technologies were, at the time, the only way to achieve the high performance the warfighter needed, but they are often bulky and very expensive. A low-power, low-cost, high-performance ultra-wideband DAC can reduce the SWAP-C of these existing systems. The fourth is the push for higher performance electronics for electronic warfare (EW) and radar. EW/radar systems typically have instantaneous bandwidths that span from 500 MHz to above 20 GHz, depending on the application. A single DAC that can handle the entire transmit spectrum will allow for higher performance EW/radar while simultaneously lowering SWAP-C.
DACs in complementary metal-oxide-semiconductor (CMOS) technologies have so far been limited to sampling rates in the low GHz range (2-6), while designs reaching 20 GSps have used silicon germanium (SiGe) (7) (8) (9). This report details the design of a 40 GSps, 6-bit DAC that could be expanded to much higher resolution and speed. The goal of the design is to demonstrate a new time-interleaving architecture that provides beyond-Nyquist operation, enabling this extremely high speed operation.
DAC Architecture
The DAC works by time interleaving individual subDACs to provide a composite output that is much faster than any single subDAC. Figure 1 shows the output from an ideal DAC that has a zero-order-hold output (i.e., a non-return-to-zero [NRZ] output). The black sinusoid is the original, ideal analog signal that has been digitized and fed to the DAC input. The blue stair-step pattern is the ideal output from the DAC. Time interleaving uses multiple subDACs in parallel, with the outputs from the subDACs added to produce the composite output, as shown in figure 2. Here, 4 subDACs are combined to produce 1 composite output. Notice that the clock period for each subDAC is the same; however, the clocks are delayed with respect to each other. This allows each subDAC to run at the same data rate, while the composite output updates M times faster, where M is the number of subDACs. For this to work, the digital data must be generated at, and aligned to, the clock edge of the subDAC that receives it. This means that data word D1[n] must be the data that corresponds to the clock edge for clk1, D2[n] must be the data for clk2, etc. This can be seen more clearly in figure 3. The data for each subDAC cannot be a time-delayed copy of the same data: data word D1[1] cannot be sent to DAC1 on clk1, DAC2 on clk2, etc. The data must be sampled at the correct clock edge to provide correct output summation.
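The summation described above can be illustrated with a minimal behavioral sketch. It assumes ideal subDACs with zero-order-hold outputs, integer codes, and subDACs that all start at zero; the function name `interleave` is hypothetical and is not taken from the design.

```python
def interleave(samples, m=4):
    """Idealized model of M time-interleaved subDACs with summed outputs.

    samples: input codes taken at the composite rate (M times the subDAC rate).
    Returns the composite output code after each composite clock tick.
    """
    held = [0] * m           # value currently held by each subDAC (assumed to start at 0)
    composite = []
    for k, code in enumerate(samples):
        held[k % m] = code   # only subDAC (k mod M) sees a clock edge on this tick
        composite.append(sum(held))  # outputs of all subDACs are added
    return composite


print(interleave([1, 2, 3, 4, 5]))
```

Each subDAC updates only once every M ticks, yet the composite output changes on every tick. Each new sample is taken at its own clock edge, so the sum always contains the M most recent samples rather than M copies of one sample.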
One of the consequences of this data sampling is that even though each subDAC is running at a clock rate of F_clk_sub, the data has to be sampled at M × F_clk_sub. This sample rate corresponds to the overall sample rate of the composite DAC. For the DAC that was designed, there are 4 subDACs. Each subDAC is running at 10 GSps, such that the overall sample rate of the composite DAC is 40 GSps. This is how extremely high sampling rates can be achieved, beyond the rate possible by a single subDAC.
Another benefit of the time-interleaved nature of the DAC is the increase in DAC resolution. By summing the outputs, the maximum level is increased by M, while the least significant bit (LSB) remains the same. This provides a factor of M higher resolution, or 1 extra bit per doubling of the number of subDACs. In this design, the 4 subDACs are designed to be 4-bit DACs, and the composite output will provide 6 bits of accuracy.
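The rate and resolution arithmetic can be checked with a short sketch. The function name `composite_dac` is hypothetical; the level count assumes each subDAC produces codes 0 through 2^N - 1 and that the outputs add ideally.

```python
import math


def composite_dac(m, f_sub_gsps, sub_bits):
    """Return (composite rate in GSps, distinct output levels, nominal bits)."""
    rate = m * f_sub_gsps                      # M subDACs, each at f_sub
    levels = m * (2 ** sub_bits - 1) + 1       # sum of M codes spans 0 .. M*(2^N - 1)
    nominal_bits = sub_bits + math.log2(m)     # one extra bit per doubling of M
    return rate, levels, nominal_bits


print(composite_dac(m=4, f_sub_gsps=10, sub_bits=4))
```

Under these assumptions the summed output spans 61 distinct levels, slightly fewer than the 64 of a full 6-bit code, so the 6-bit figure is the nominal value implied by the factor-of-M rule.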
Challenges
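There are several challenges that need to be addressed to successfully design and test the DAC. These challenges include the design of the 10 GSps, 4-bit subDAC; the generation of the four 10 GHz clocks; and the reception of the 160 Gbps data stream. Each of these challenges will be discussed in the following sections.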
Design of the 10 GSps, 4 Bit subDAC
The process used for this design is IBM 45 nm silicon-on-insulator (SOI) CMOS, which offers a minimum gate length of 45 nm. This process is fast enough that reaching 10 GSps for the subDAC should not be a problem. The 4 bits of accuracy should also not be a problem, as this only requires 6.25% (1 part in 16) matching between unit cells, including all non-ideal effects.
Since the technical specifications are not pushing the edge of the technology, previous work was used as the basis for the design. The DAC presented in reference 5 is used as the template for this design.
The top-level schematic for a subDAC is shown in figure 4. The DAC is a 4-bit design; as such, the data control word is 4 bits long, d<3:0>. There are four current sources, one for each bit of the subDAC. The current sources are composed of unit current cells in a binary-weighted fashion. The least significant bit of the data word, d<0>, controls a single unit current source; the second bit, d<1>, controls two unit current sources; the third, d<2>, controls four; and the most significant bit (MSB), d<3>, controls eight. It is hard to read in figure 4, but the instance names reflect the array nature of the unit current sources: d<0> connects to instance I5, d<1> to I6<1:0>, d<2> to I7<3:0>, and d<3> to I8<7:0>. The last unit current source on the right side of figure 4 is I9<23:0>, which forms the dummy cells that are placed around the array in the layout to preserve matching.
The biasing of the current cells is accomplished through the use of current mirrors. On the left side of figure 4 is the reference current source, composed of transistors T0 and T1 in a cascode current mirror. The voltages generated by these cascode transistors are used to bias each of the unit current cells. Also shown in figure 4 is a cell between the inputs (the data, d<3:0>, and clk) and the current cells. This is the buffer latch. The purpose of this cell is to align the data edges of the digital word d<3:0> so that the subDAC switches all the current sources simultaneously. Because of random process variations, each of the individual bits, d<3>, d<2>, d<1>, and d<0>, can arrive with a slight delay relative to the others. The buffer latch realigns all 4 bits to the common signal, clk. It also has a secondary purpose of creating a differential signal from the single-ended data bit. The schematic of this cell is shown in figure 5.
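The binary weighting can be sketched as a mapping from the data word to the number of unit current cells that each bit drives. The instance names follow the text; the function name `cells_driven` and the dictionary layout are illustrative assumptions.

```python
# Unit cells per data bit, following the instance arrays named in the text:
# d<0> -> I5 (1 cell), d<1> -> I6<1:0> (2), d<2> -> I7<3:0> (4), d<3> -> I8<7:0> (8).
CELLS_PER_BIT = {0: ("I5", 1), 1: ("I6", 2), 2: ("I7", 4), 3: ("I8", 8)}


def cells_driven(word):
    """Return {instance name: number of unit cells switched on} for a 4-bit word."""
    if not 0 <= word <= 15:
        raise ValueError("word must be a 4-bit value")
    return {
        name: (count if (word >> bit) & 1 else 0)
        for bit, (name, count) in CELLS_PER_BIT.items()
    }


driven = cells_driven(0b1010)
print(driven, sum(driven.values()))
```

The total number of active unit cells always equals the value of the data word, which is what makes the output current proportional to the code.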
Unit Current Cell
The schematic of the unit current cell is shown in figure 6. The structure is that of a current source formed from transistors T2 and T3, with a differential switch composed of transistors T6 and T7. The differential switch control bits, d and db, are specified so that one bit is high and the other is low. This forces all of the current from the current cell to flow either through transistor T6 to the output port "outp," or through transistor T7 to the output port "outn." The transistors T9 and T8 are used as cascode devices to provide isolation between the output nodes "outp" and "outn" and the internal nodes of the current cell, "vtp" and "vtn."
Figure 6. Schematic of the unit current cell.
Several design goals have to be met by the unit current cell. The output impedance of the current mirror needs to be very high. This is controlled through the length of the current mirror transistors T2 and T3: the longer the length, the higher the output impedance. The IBM 45 nm process does not allow the transistor length to be varied continuously; it only provides transistors with fixed lengths. The choices of length for thin gate devices are 45 nm, 56 nm, 112 nm, and 232 nm. The 45 nm device is a digital-only SOI device. This means that the bulk voltage can vary between devices by several hundred millivolts, causing a large mismatch between devices. To avoid this problem, only the devices that provide a bulk access pin are considered in the design of the current mirror, limiting the lengths to 56 nm, 112 nm, and 232 nm. Since 232 nm is the longest device that provides a bulk access pin, it is the transistor of choice for the current mirror.
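The current steering can be sketched at the level of a whole subDAC, counting in units of the unit cell current (whose value is not given here). The sketch assumes ideal switches that steer all of a cell's current to one side; the function name `steer` is hypothetical.

```python
def steer(word, bits=4):
    """Return (outp, outn) currents in unit-cell units for a binary-weighted subDAC.

    A cell whose control bit d is high sends its current to outp (via T6);
    otherwise db is high and the current goes to outn (via T7).
    """
    total_cells = 2 ** bits - 1      # 1 + 2 + 4 + 8 = 15 for a 4-bit subDAC
    if not 0 <= word <= total_cells:
        raise ValueError("word out of range")
    outp = word                      # cells with d high
    outn = total_cells - word        # remaining cells, with db high
    return outp, outn


for w in (0, 5, 15):
    p, n = steer(w)
    print(w, p, n, p - n)            # last column is the differential output
```

Because every cell always conducts to one side or the other, the sum outp + outn stays constant, so the load on the current sources does not depend on the code.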
In contrast to the current mirror, where the transistors are always on and very large, the switches T6 and T7 need to be as fast as possible. To achieve this fast switching speed, the shortest device available is chosen, the 45 nm digital device. In this situation, mismatch between switches has minimal effect on the DAC performance, and so the small size and lack of bulk access pin are not an impediment. The width of the transistors is kept small, 1.007 µm, to keep the parasitic capacitances small, allowing for fast operation. These are the same considerations given to the cascode transistors T9 and T8 as well.
Generation of the Four 10 GHz Clocks
As was mentioned in the section "2. DAC Architecture," the DAC needs four clocks in quadrature, each running at 10 GHz. Generating accurate 10 GHz clocks is a challenging design task. To accomplish it, a 20 GHz differential signal is applied to the DAC. This signal is then fed into a divide-by-two circuit, which generates the four 10 GHz clocks in quadrature, which are in turn buffered by a pair of inverters. This is shown in figure 7.
The first part of the clock generation is to buffer the input signal so that the clock can be fed into the chip. The schematic of the input clock buffer is shown in figure 8. This block must provide 100 Ω differential impedance to the test equipment to maximize the amount of energy that is fed into the chip. This is accomplished through the use of the terminating resistors R0-R3. Each resistor is 50 Ω. Resistors R0 and R1 are in series, connected between "vinp" and "vinn," while resistors R2 and R3 are shunt from the inputs to ground. On either side of the resistor connections are capacitors to couple the radio frequency (RF) signal while rejecting the DC. Transistors T0-T3 and resistors R4 and R5 are used to amplify the potentially weak differential clock signal up to a large signal that can be used by the following blocks. The amplifier is a common-source configuration with cascode devices. The cascode topology provides reverse isolation to improve stability, while simultaneously lowering the Miller capacitance of the input devices, resulting in faster operation. The load resistors are chosen as 250 Ω devices. The resistor size is a tradeoff between gain and voltage drop across the resistor. The larger the resistor, the more gain is achievable from the amplifier; however, if the resistor is made too large, the voltage drop across it will push the transistors into the triode region, reducing the gain.
After the clock buffer is the divide-by-two, shown in figure 9. The divide-by-two is two D flip-flops (DFFs) connected in negative feedback: the positive output of the right DFF, Q, is fed back to the negative input of the left DFF, Db. The clock for the left DFF is clk, and the clock for the right DFF is clkb. The design of the DFF is shown in figure 10. The details of the design can be found in reference 10.
The simulation results for the blocks from figure 7 are shown in figure 11. The plot shows the 4 clocks in quadrature, with 100 ps periods, corresponding to 10 GHz operation. Notice that the clocks are not 50% duty cycle, but rather 25% duty cycle.
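The quadrature behavior of such a divider can be illustrated with an idealized logic-level sketch. It treats the two stages as level-sensitive storage elements clocked on opposite phases with inverting feedback, which is an assumption about the behavior and not a description of the transistor-level circuit in reference 10. The function name `divide_by_two` is hypothetical.

```python
def divide_by_two(n_half_periods):
    """Idealized two-stage divider, stepped once per half period of the input clock.

    Returns a list of (phase0, phase90, phase180, phase270) logic levels.
    """
    q1, q2 = 0, 0                    # stage outputs, assumed to start low
    phases = []
    for h in range(n_half_periods):
        clk_high = (h % 2 == 0)
        if clk_high:
            q1 = 1 - q2              # left stage: updates on clk, inverted feedback
        else:
            q2 = q1                  # right stage: updates on clkb
        phases.append((q1, q2, 1 - q1, 1 - q2))
    return phases


for row in divide_by_two(8):
    print(row)
```

The pattern repeats every four half periods, that is, every two input clock periods, so each output runs at half the input frequency. Each output lags the previous one by one input half period, a quarter of the output period. This idealized model produces 50% duty cycle waveforms and does not reproduce the 25% duty cycle seen in the simulation results.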
Reception of 160 Gbps Data Stream
One of the major design issues associated with this RF DAC is feeding the digital data to the chip. Each of the 4 subDACs requires 4 bits of data at a rate of 10 GSps. This corresponds to a data rate of 40 Gbps per subDAC, or 160 Gbps overall. The data reception can be broken down into two areas: physical reception of the data, and data sampling and alignment.
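The data rate arithmetic can be written out as a short sketch; the function name `input_data_rates` and the parameter names are illustrative only.

```python
def input_data_rates(n_subdacs=4, bits_per_subdac=4, rate_gsps=10):
    """Return (lanes, Gbps per lane, Gbps per subDAC, total Gbps)."""
    lanes = n_subdacs * bits_per_subdac        # one input channel per data bit
    per_lane = rate_gsps                       # each bit toggles at the subDAC rate
    per_subdac = bits_per_subdac * rate_gsps
    total = n_subdacs * per_subdac
    return lanes, per_lane, per_subdac, total


print(input_data_rates())
```

With the values from the text this gives 16 input bits at 10 Gbps each, 40 Gbps per subDAC, and 160 Gbps in total.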
LVDS Input
The physical reception of the data is performed by a low-voltage differential signaling (LVDS) input block, shown schematically in figure 12. The LVDS design is based upon the work in reference 5. The 45 nm transistor in this technology is designed to handle a 1 V supply. However, LVDS usually has a 1.2 V common mode. Even though the differential input is AC coupled by capacitors C0 and C1, the choice was made to use the thick gate transistors for their higher breakdown voltage. The tradeoff is that these devices are slower than the thin gate transistors, and so a higher bias current is needed to achieve operation at 10 GHz. But this allows choosing an optimal gate bias without having to worry about gate dielectric breakdown from large signal swings of the LVDS input signals. This can be seen in figure 13, where the current mirror transistor T0 is using a 1.8 V supply, and the gates of the input pair T11 and T12 are DC biased at 1 V. Even if an 800 mV peak signal is input to this stage, the input transistors will still be below the breakdown voltage of the device. The output from the PMOS input pair is fed through a differential regenerative stage consisting of NMOS devices T4-T7. The output from this stage is fed through a single-ended to differential stage consisting of the devices T8, T9, T10, and T13. The supply of this stage is the normal vdd of 1 V, providing single-ended CMOS level outputs from the LVDS stage. The output of the LVDS is buffered through several stages to drive the large parasitic wiring loads on the way to the digital core.
Data Sampling
After the digital inputs leave the LVDS and the buffers, they arrive at the digital core. The first part of the digital logic, called the clock and data recovery (CDR), is responsible for choosing the optimal sampling instant. Because four 10 GHz clocks in quadrature are available on chip, there is an effective 40 GHz sampling clock. This is very important, because the data path of each individual bit will have a slightly different delay than every other bit. In the system design specification, it was assumed that the 4 bits of a given subDAC would have negligible delay relative to each other. However, there is no guarantee that the phase delay between the bits of different subDACs will be negligible. Thus, the proper sampling instant is determined at the subDAC level, rather than at the bit level or system level. Having a 40 GHz sampling clock with a 10 GHz data rate gives four times oversampling, providing good confidence that an optimal clock can be found to sample the data.
Once the proper sampling clock is chosen for the input data, this clock is used to sample the four bits out of the LVDS. The result is that all 16 bits of data are now aligned to the same clock edges, within one clock period of each other. However, the data for each 4-bit subDAC can be delayed with respect to the other subDACs by up to 4 bits. This is due to the data source, a field-programmable gate array (FPGA), having a different phase-locked loop (PLL) for each quad of I/O. The next stage after the CDR is the data alignment stage, which aligns the data words of each subDAC to each other. This is accomplished through the use of a synchronization pattern at the start of all data transmissions. The final step after the alignment is to re-time the data such that they are presented in quadrature on the correct sampling clock to the subDACs.
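The selection rule used by the CDR is not described here, so the following is only one plausible sketch of how a four times oversampled stream could be used to pick a sampling phase. It locates the phase interval where data transitions occur and then chooses a phase away from it. The function name `pick_sampling_phase` and the selection rule itself are assumptions.

```python
def pick_sampling_phase(samples, oversample=4):
    """Choose a sampling phase from a bit stream captured at oversample x data rate.

    samples[k] is the value seen on clock phase (k mod oversample).
    Assumed rule: find the phase after which transitions occur most often,
    then sample half a bit period later, near the middle of the data eye.
    """
    edges = [0] * oversample
    for k in range(len(samples) - 1):
        if samples[k] != samples[k + 1]:
            edges[k % oversample] += 1    # transition between phase k and phase k+1
    edge_phase = max(range(oversample), key=lambda p: edges[p])
    return (edge_phase + oversample // 2) % oversample


# Hypothetical capture: data bits change between phase 1 and phase 2.
bits = [0, 1, 0, 1, 1, 0, 1]
stream = [bits[(k + 2) // 4] for k in range(4 * (len(bits) - 1))]
print(pick_sampling_phase(stream))
```

In this example every transition falls between phases 1 and 2, so the sketch selects phase 3, which is at least one full sampling interval away from the nearest data edge on either side.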
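The alignment step can be sketched as a search for the synchronization pattern in each subDAC's word stream, followed by trimming every stream to start just after it. The pattern value, the function name `align_lanes`, and the list-based representation are hypothetical; the actual pattern and hardware implementation are not given here.

```python
def align_lanes(lanes, sync):
    """Align per-subDAC word streams using a known synchronization pattern.

    lanes: one list of 4-bit words per subDAC, each with an unknown start delay.
    sync:  the word sequence sent at the start of every transmission (assumed).
    Returns the payload words of each lane, trimmed to a common length.
    """
    payloads = []
    for lane in lanes:
        for offset in range(len(lane) - len(sync) + 1):
            if lane[offset:offset + len(sync)] == sync:
                break                     # first occurrence of the pattern
        else:
            raise ValueError("synchronization pattern not found")
        payloads.append(lane[offset + len(sync):])
    n = min(len(p) for p in payloads)
    return [p[:n] for p in payloads]


sync = [15, 0, 15]                        # hypothetical pattern
lanes = [
    [0, 15, 0, 15, 1, 2, 3],              # delayed by one word
    [15, 0, 15, 4, 5, 6, 7],              # no delay
    [0, 0, 15, 0, 15, 7, 8],              # delayed by two words
]
print(align_lanes(lanes, sync))
```

After alignment, the words at the same index in each lane belong to the same group of four consecutive composite samples, which is what the re-timing stage needs before presenting them to the subDACs in quadrature.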
Layout
The top-level layout of the RF DAC is shown in figures 15 and 16. Figure 15 is the layout including the power supply grid, routed on the 2nd and 3rd highest metals. However, this routing hides the blocks underneath, so figure 16 shows the layout with these power grids removed. The interface to the chip is through an 8 by 8 array of pads, using IBM's C4 technology. This is a flip-chip style ball bond in a 3-on-6 pattern. This means that the pads, and the solder balls applied to the pads, are 3 mils wide (75 µm), with a center-to-center spacing of 6 mils (150 µm). To allow the chip to be oriented correctly, the pad layout is non-symmetric: the pad at the top right is removed.
The pin layout is shown in figure 17.
Interface
The pin list is shown in table 1. The first column is the pin name, the second is the (x, y) location of the pin in the pad array, the third is the pin direction, and the fourth is the pin purpose. The signals Dx<3:0> and Dxb<3:0> correspond to the positive input signal and negative input signal of the LVDS channel, respectively. There is a 3-bit enable control word, which corresponds to the enable state shown in table 2. The chip is biased through a current source supplying 50 µA of current at a minimum voltage of 1 V on pin i_50u_in. The output is fed differentially through pins outp and outn into a 100 Ω differential load biased at 2 V, as shown in figure 18. Power is supplied through vdda at 1 V, vddd at 1 V, and vddio at 1.8 V.
Figure 18. Biasing and termination of the output pins.
