Article

Semiconductor Technology

# **Chinese Science Bulletin**

July 2012 Vol.57 No.19: 2480–2487 doi: 10.1007/s11434-012-5157-4

# A 5.3-GHz 32-bit accumulator designed for direct digital frequency synthesizer

CHEN JianWu<sup>1,2</sup>, WU DanYu<sup>1,2</sup>, ZHOU Lei<sup>1,2</sup>, WU Jin<sup>1,2</sup>, JIN Zhi<sup>1,2\*</sup> & LIU XinYu<sup>1,2\*</sup>

<sup>1</sup>Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China;

<sup>2</sup> Key Laboratory of Microelectronics Devices & Integrated Technology, Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, China

Received November 1, 2011; accepted March 22, 2012

A 32-bit pipeline accumulator with carry-ripple topology is implemented for a direct digital frequency synthesizer. To increase the throughput while holding down the area and power consumption, a method to reduce the number of pre-skewing registers is proposed. The number is reduced to 29% of that of a conventional pipeline accumulator. The propagation delay versus bias current of the adder circuit with transistors of different sizes is investigated. We analyze the delay by employing the open-circuit time constant method. Compared with the simulation results, the maximum error is less than  $\pm 8\%$ . A method to optimize the design of the adder based on the propagation delay is discussed. The clock traces for the 32-bit adder are heavily loaded, as 40 registers are connected to them. Moreover, the differential clock traces, which are much longer than the critical length, should be treated as transmission lines. Thus a clock distribution method and a termination scheme are proposed to obtain high-quality, low-skew clock signals. A multiple  $\pi$ -type termination scheme is proposed to match the transmission line impedance. The 32-bit accumulator was measured to work functionally at 5.3 GHz.

**Keywords:** accumulator, carry-ripple adder, RCA, pipeline, DDS, HBT, GaAs

Citation: Chen J W, Wu D Y, Zhou L, et al. A 5.3-GHz 32-bit accumulator designed for direct digital frequency synthesizer. Chin Sci Bull, 2012, 57: 2480–2487, doi: 10.1007/s11434-012-5157-4

A direct digital frequency synthesizer (DDFS) can generate sinusoids with sub-hertz resolution, good spectral purity, fast frequency switching, and phase continuity on switching. Next-generation radar and communication systems require DDFSs that operate at GHz-range clock rates. However, the operating speed of a DDFS is limited by the phase accumulator (PA), which also consumes large area and power. Many architectures and designs have been reported in the literature for the PA in a DDFS, such as the carry-ripple adder (RCA) [1], the carry look-ahead adder (CLA) [2], and the pipelined adder [3,4]. Although the CLA is faster than the RCA, its significant fan-in and fan-out requirements lead to a lower clock rate as the bit width increases. The pipeline accumulator offers a significant speed improvement over the CLA, which has large propagation delays before a valid sum is produced. In terms of compact layout and hardware implementation, the carry-ripple pipeline accumulator is the more practical choice for a GHz DDFS with 32-bit resolution.

A 32-bit pipeline accumulator with an 8-stage pipeline is illustrated in Figure 1. A number of pre-skewing and de-skewing registers are required to keep the input and output of the accumulator coherent. To minimize the area while achieving the desired operating speed, a new hardware-efficient PA is proposed in this work. The method reduces the number of registers required in the pipeline without performance degradation. The optimum design in terms of power and propagation delay is investigated. Because the clock is critical for high-speed operation of the accumulator, a clock distribution method and a termination scheme are also proposed.

<sup>\*</sup>Corresponding authors (email: jinzhi@ime.ac.cn; xyliu@ime.ac.cn)

<sup>©</sup> The Author(s) 2012. This article is published with open access at Springerlink.com



Figure 1 Conventional 32-bit pipeline accumulator.
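The role of the pre-skewing and de-skewing registers in Figure 1 can be illustrated with a small behavioral model. The Python sketch below is not the authors' circuit. It assumes eight 4-bit ripple stages with a registered carry between stages, a delay line of k cycles on the FCW nibble feeding stage k, and a delay line of 7 − k cycles on the output of stage k. The function name and its parameters are hypothetical.

```python
from collections import deque


def simulate_pipelined_accumulator(fcw, n_ticks, n_bits=32, n_stages=8):
    """Behavioral model of a carry-ripple pipeline accumulator (assumed structure)."""
    width = n_bits // n_stages              # bits per pipeline stage (4 here)
    mask = (1 << width) - 1
    acc = [0] * n_stages                    # per-stage accumulator registers
    carry = [0] * n_stages                  # registered carry out of each stage
    # Pre-skew: stage k sees its FCW nibble k cycles late.
    preskew = [deque([0] * k) for k in range(n_stages)]
    # De-skew: stage k's result is delayed by (n_stages - 1 - k) cycles.
    deskew = [deque([0] * (n_stages - 1 - k)) for k in range(n_stages)]
    outputs = []
    for _ in range(n_ticks):
        new_acc, new_carry = [], []
        for k in range(n_stages):
            preskew[k].append((fcw >> (k * width)) & mask)
            operand = preskew[k].popleft()
            # Carry-in is the value registered by the previous stage one tick ago.
            c_in = carry[k - 1] if k > 0 else 0
            total = acc[k] + operand + c_in
            new_acc.append(total & mask)
            new_carry.append(total >> width)
        acc, carry = new_acc, new_carry     # all registers update together
        word = 0
        for k in range(n_stages):
            deskew[k].append(acc[k])
            word |= deskew[k].popleft() << (k * width)
        outputs.append(word)
    return outputs
```

With a constant FCW, the de-skewed output after tick t in this model equals the value an ideal 32-bit accumulator holds after t − 7 additions, which is the latency implied by the eight assumed pipeline stages.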

## 1 Optimum design of the accumulator

The PA in a DDFS receives the frequency control word (FCW) and updates its output every clock cycle. All of the registers in the conventional PA shown in Figure 1 operate at full clock speed, consuming large power and area. The basic blocks of the PA are the full adders and the registers, which are implemented with D flip-flops (DFFs). Two methods are usually adopted to reduce the power consumption. One is to reduce the number of registers; the other is to reduce the power consumption of the full adder and the register, which can be modeled through the propagation delay. An optimum design based on an in-depth analysis of the propagation delay is discussed.

### 1.1 Reducing the number of the pre-skewing registers

The output of the PA in a DDFS is often truncated before addressing the phase-to-amplitude block, so the number of de-skewing registers is much smaller than that of the pre-skewing registers. In an *N*-bit accumulator with *M* pipeline stages, the number of pre-skewing registers is  $N \times (M+1)/2$ . The power and area therefore increase considerably as the depth of the pipeline increases. A 32-bit PA with a reduced number of registers was introduced in [5]. There the power is reduced by a gated clocking technique, which requires a complicated clocking design and is not suitable for a GHz DDFS. A new scheme is proposed, as shown in Figure 2.

The pre-skewing registers are clocked in sequence by pipelined pulses, each with a width of one clock cycle, as shown in Figure 3. The new scheme is based on shifted clock pulses, whereas in the conventional scheme the FCW itself is shifted through registers in series. The first clock pulse str<0> in Figure 3 is triggered by the external FCW storing signal. The other clock pulses are generated by cascaded registers. For the 32-bit PA with 8 stages, the number of pre-skewing registers is reduced from 144 to 41 when the new scheme is applied.
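The register counts quoted above can be checked with a few lines of Python. This is a minimal sketch: the per-stage breakdown (stage k delaying its N/M bits through k registers) is an interpretation that reproduces the formula N × (M + 1)/2, not a description taken from Figure 1, and the function name is hypothetical. The figure of 41 registers is taken directly from the text.

```python
def conventional_preskew_registers(n_bits: int, n_stages: int) -> int:
    """Pre-skewing register count, assuming stage k delays its bits by k registers."""
    bits_per_stage = n_bits // n_stages
    return sum(bits_per_stage * k for k in range(1, n_stages + 1))


conventional = conventional_preskew_registers(32, 8)  # equals 32 * (8 + 1) / 2
proposed = 41                                          # value quoted in the text
ratio = proposed / conventional

print(f"conventional: {conventional}, proposed: {proposed}, ratio: {100 * ratio:.1f}%")
```

The ratio evaluates to roughly 28.5%, which the paper quotes as 29%.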

### 1.2 Optimizing the critical path

The 32-bit pipeline accumulator consists of eight 4-bit full-adder stages, one of which is shown in Figure 4. The maximum clock rate of a pipelined adder is limited by the propagation delay of the critical path, as illustrated in Figure 4. Optimizing the propagation delay of the carry cell is therefore much more important than optimizing that of the sum cell.

The sum and carry-out of the 1-bit full adder can be expressed as:

$$\begin{split} sum &= A \oplus B \oplus C_{in}, \\ C_{out} &= AB + BC_{in} + C_{in}A, \end{split}\tag{1}$$

where *A* and *B* are the input bits and  $C_{in}$  is the input carry. Both the sum and the carry can be implemented with two 2-level logic gates in cascade or with one 3-level logic gate. For the 5.3 GHz accumulator, the adder is designed with 3-level logic gates, which tend to have a smaller propagation delay.
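As a quick sanity check of eq. (1), the Boolean expressions can be compared against integer addition over all eight input combinations. The sketch below is purely logical and says nothing about the 3-level gate implementation; the function names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) of a 1-bit full adder as written in eq. (1)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (b & c_in) | (c_in & a)
    return s, c_out


def check_full_adder() -> bool:
    """Verify that sum + 2*carry_out equals a + b + c_in for every input."""
    for a, b, c_in in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c_in)
        if a + b + c_in != s + 2 * c_out:
            return False
    return True


print(check_full_adder())
```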

Various analysis methods have been introduced to model the gate delay. The approaches proposed in [6] for the optimized design of CML and ECL show a maximum error close to 20% compared with SPICE simulations. We calculate the gate delay based on the open-circuit time constant method. The nonlinear elements are represented by small- or large-signal equivalent circuits as described below [7]. The current-switch transistors and the emitter followers are represented by the large- and small-signal equivalent circuits, respectively.

Figure 2 The proposed 32-bit 8-stage pipeline accumulator.

Figure 3 The generation of the clock pulses for the proposed pre-skewing scheme.

Figure 4 The 4-bit pipelined full adder with D flip-flops.

The base-emitter capacitance could be modeled with the depletion capacitance and the diffusion capacitance. The base-emitter depletion capacitance  $C_{je}$  is defined as

$$C_{\rm je} = \begin{cases} \frac{1}{\Delta V} \int_{V_{\rm be,on} -\Delta V}^{V_{\rm be,on}} C_{\rm je}(V) dV & \text{large signal,} \\ C_{\rm je}(V) = \frac{C_{\rm je}(0)}{\left(1 + \frac{V}{\phi_{\rm be}}\right)^{m_{\rm be}}} & \text{small signal,} \end{cases}\tag{2}$$

where *V* is the base-emitter reverse bias voltage,  $\Delta V$  the logic swing,  $V_{be,on}$  the base-emitter junction voltage in the on-state,  $C_{je}(0)$  the base-emitter zero-bias junction capacitance,  $\phi_{be}$  the base-emitter built-in potential, and  $m_{be}$  the base-emitter junction grading exponent. The base-collector junction capacitance  $C_{cb}$  is calculated in the same way.

The large and small-signal transconductances are

$$\begin{split} G_{\rm m}(\text{large-signal}) &= \frac{\Delta I}{\Delta V} = \frac{1}{R_{\rm L}}, \\ g_{\rm m}(\text{small-signal}) &= \frac{dI}{dV} = \frac{I_{\rm C}}{\eta V_{\rm T}}, \end{split}\tag{3}$$

where  $R_{\rm L}$  is the load resistor,  $I_{\rm C}$  the switched current,  $\eta$  the ideality factor, and  $V_{\rm T}$  the thermal voltage.

The small-signal diffusion capacitance  $C_{t,diff}$  equals  $g_m \times \tau_f$ , while the large-signal diffusion capacitance is  $G_m \times \tau_f$ , where  $\tau_f$  is the forward transit time.
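To make the difference between the two operating regimes concrete, the sketch below evaluates eq. (3) and the corresponding diffusion capacitances. All numerical values (load resistor, switched current, ideality factor, thermal voltage, transit time) are hypothetical placeholders chosen for illustration and are not parameters of the authors' GaAs HBT process.

```python
def transconductances(r_load: float, i_c: float, eta: float = 1.0,
                      v_t: float = 0.02585) -> tuple[float, float]:
    """Return (G_m large-signal, g_m small-signal) in siemens, following eq. (3)."""
    g_large = 1.0 / r_load
    g_small = i_c / (eta * v_t)
    return g_large, g_small


def diffusion_capacitance(g: float, tau_f: float) -> float:
    """Diffusion capacitance as transconductance times forward transit time."""
    return g * tau_f


# Hypothetical example values, for illustration only.
R_L = 100.0      # load resistor, ohms
I_C = 3e-3       # switched current, amperes
TAU_F = 2e-12    # forward transit time, seconds

G_m, g_m = transconductances(R_L, I_C)
C_large = diffusion_capacitance(G_m, TAU_F)
C_small = diffusion_capacitance(g_m, TAU_F)
print(f"G_m = {G_m:.4f} S, g_m = {g_m:.4f} S")
print(f"C_diff(large) = {C_large * 1e15:.1f} fF, C_diff(small) = {C_small * 1e15:.1f} fF")
```

With these assumed numbers the large-signal values are much smaller than the small-signal ones, which is why the choice of equivalent circuit for each transistor affects the estimated delay.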

Both the sum and carry cells of the full adder are designed with 3-level gates. The analysis of propagation delay versus current density is shown in detail for the sum cell. To compute the propagation delay, the resistance across each capacitor in Figure 5 is calculated with the transistor represented by the hybrid- $\pi$  equivalent model. The sum cell is designed with an emitter follower as the output stage, and the next logic stage is connected to the sum cell. Assuming that the sum cell operates in differential mode, we can limit the analysis to the half-circuit. Moreover, the transistors at the lower level contribute more to the delay than those at the upper levels. The lowest-level transistors are driven by pulses, while the other inputs are held constant. The propagation delay is defined as the time delay between the input node *A* and the output node O. We model the propagation delay by summing the time constants along the signal path.

Assume that the bases of  $Q_3$  and  $Q_6$  are at logic high while the bases of  $Q_7$  and  $Q_{10}$  are at logic low.  $Q_1$  is driven by an input going from logic low to high, so the output node at the emitter of  $Q_{11}$  shows a transition from high to low. The calculation of the delay associated with  $Q_1$  is shown in Figure 6; the other time constants are modeled in the same way.

The resistance across each capacitor is calculated with all other capacitors open-circuited. The resistances in Figure 6 are given below.

$$\begin{split} R_{be1} &= \frac{r_{bb1} + R_{s1} + r_{ex1}}{1 + G_{m1} r_{ex1}}, \\ R_{cb1} &= r_{bb1} + R_{s1} + \left(r_{ex4} + 1/G_{m4}\right) \\ &\quad + \frac{G_{m1}(r_{bb1} + R_{s1})(r_{ex4} + 1/G_{m4})}{1 + G_{m1} r_{ex1}}, \\ R_{cbx1} &= R_{s1} + \left(r_{ex4} + 1/G_{m4}\right) + \frac{G_{m1} R_{s1}(r_{ex4} + 1/G_{m4})}{1 + G_{m1} r_{ex1}}. \end{split}\tag{4}$$

The time constant associated with  $Q_1$  is

$$\tau_{Q_1} = R_{be1} C_{be1} + R_{cb1} C_{cb1} + R_{cbx1} C_{cbx1}.\tag{5}$$
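Equations (4) and (5) translate directly into a short calculation. The following sketch evaluates the three open-circuit resistances and the time constant of $Q_1$. The dictionary keys mirror the symbols in eq. (4), but every numerical value is a hypothetical placeholder, not extracted from the authors' device model.

```python
import math


def q1_time_constant(p: dict) -> tuple[float, float, float, float]:
    """Return (R_be1, R_cb1, R_cbx1, tau_Q1) following eqs. (4) and (5)."""
    load = p["r_ex4"] + 1.0 / p["G_m4"]        # resistance seen at the collector of Q1
    degen = 1.0 + p["G_m1"] * p["r_ex1"]       # emitter degeneration factor
    src = p["r_bb1"] + p["R_s1"]               # base resistance plus source resistance
    r_be1 = (src + p["r_ex1"]) / degen
    r_cb1 = src + load + p["G_m1"] * src * load / degen
    r_cbx1 = p["R_s1"] + load + p["G_m1"] * p["R_s1"] * load / degen
    tau = r_be1 * p["C_be1"] + r_cb1 * p["C_cb1"] + r_cbx1 * p["C_cbx1"]
    return r_be1, r_cb1, r_cbx1, tau


# Hypothetical element values (ohms, siemens, farads), for illustration only.
EXAMPLE = {
    "r_bb1": 20.0, "R_s1": 10.0, "r_ex1": 5.0, "r_ex4": 5.0,
    "G_m1": 0.01, "G_m4": 0.01,
    "C_be1": 50e-15, "C_cb1": 10e-15, "C_cbx1": 10e-15,
}

r_be1, r_cb1, r_cbx1, tau_q1 = q1_time_constant(EXAMPLE)
print(f"tau_Q1 = {tau_q1 * 1e12:.2f} ps")
print(f"ln(2) * tau_Q1 = {math.log(2) * tau_q1 * 1e12:.2f} ps")
```

In the paper's method, the time constants of all transistors along the signal path are summed before the factor ln(2) is applied; the last line above applies it to $Q_1$ alone only to show the scale of a single contribution.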

The total propagation delay is  $\ln(2)\tau$ , where  $\tau$  is the sum of the time constants presented in Figure 5. Low propagation delay is preferred for high-speed logic circuits; however, the delay varies with the bias current. Large transistors have a higher  $f_t$ , but this alone is not indicative of high-speed operation. Both the bias current and the scaling of the emitter length must be considered in the optimum design of the logic cell. To evaluate the accuracy of the propagation delay model, we compare the model with the simulation results.

Figure 5 Model of the propagation delay of the sum cell with capacitances.

Figure 6 Equivalent circuit for the calculation of the resistances across the capacitances of  $Q_1$ .

The delay associated with transistors of various emitter lengths was investigated, as shown in Figure 7(a). When the emitter length is scaled from 3 to 5 and 10  $\mu$m, the delay falls from 51 to 41 and 37 ps, respectively, at a current density of 0.35 mA/ $\mu$m<sup>2</sup>. When the emitter length is increased from 3 to 5  $\mu$m, a 53% increase in the bias current results in a 20% decrease in delay, whereas doubling the bias current reduces the delay by only 10% when the emitter length is increased from 5 to 10  $\mu$m. Transistors with an emitter length of 5  $\mu$m are therefore preferred when low propagation delay is required. The optimum design of the sum cell is based on transistors with an emitter area of 1.4  $\mu$m × 5  $\mu$m.
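The diminishing return from longer emitters can be read off the three delay values quoted above. The sketch below only repeats that arithmetic; the delays are the figures from the text, and the helper name is hypothetical.

```python
# Delays quoted in the text at 0.35 mA/um^2, keyed by emitter length in um.
delays_ps = {3: 51.0, 5: 41.0, 10: 37.0}


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


step_3_to_5 = percent_reduction(delays_ps[3], delays_ps[5])
step_5_to_10 = percent_reduction(delays_ps[5], delays_ps[10])
print(f"3 -> 5 um: {step_3_to_5:.1f}% less delay")
print(f"5 -> 10 um: {step_5_to_10:.1f}% less delay")
```

The two steps come out at about 20% and 10%, consistent with the percentages stated in the paragraph.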

Both the hand analysis and the simulation results show that the propagation delay decreases as the bias current increases. The maximum error between the model and the simulation result is less than  $\pm 8\%$ , as shown in Figure 7(b). The error arises from the equivalent transistor model in two respects. First, the parameters in the equivalent model vary with the bias current, whereas the capacitances in our analysis are fixed at their zero-bias values. Second, the equivalent model is simplified; in particular, high-current effects are not considered. The delay error for the emitter length of 10 µm is larger than for the others, because these transistors are biased at a higher current even though the current density is the same.

The delay falls more steeply with current when the current density is below 0.35 mA/ $\mu$m<sup>2</sup>. A 40% decrease in current increases the delay by about 32%, while a 40% increase in current reduces the delay by only about 11%. Increasing the bias current is therefore an inefficient way to reduce the delay when the current is already high. The delay analysis of the carry cell is the same as that of the sum cell, as both are designed with 3-level logic. The bias current for the critical path is selected according to the maximum delay allowed, which is set by the timing constraint.

## 2 Clock distribution and termination scheme

Figure 7 (a) Propagation delay versus bias current for both the hand analysis and the simulation; (b) error between the hand analysis and the simulation.

Clock distribution with equal phase delay and driving capability is critical for the registers along the clock tree. Controlled and precise clock distribution techniques are required to maintain a synchronous system. To obtain the maximum available clock frequency, the clock jitter and skew must be minimized. At higher frequencies, with their associated fast edge rates, long traces behave like transmission lines. Ringing, overshoot, and undershoot occur as a result of poor termination of transmission lines. The critical length above which the transmission line effect must be considered is estimated as [8]

$$l = \frac{0.35 \times v_{\rm p}}{6 \times 7 \times f_{\rm clk}} = \frac{0.35 \times 0.84 \times 10^8 \,\mathrm{m/s}}{6 \times 7 \times 6 \times 10^9 \,\mathrm{Hz}} = 117 \,\,\mathrm{\mu m},\tag{6}$$

where  $v_p$  is the phase velocity of the GaAs interconnect, and  $f_{clk}$  is the clock frequency.
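The numbers in eq. (6) can be reproduced in a couple of lines. The sketch below takes the constants 0.35, 6 and 7 as they appear in the equation, without assigning them further meaning, and compares the result with the 3.8 mm clock trace reported for this design. The function name is hypothetical.

```python
def critical_length(v_p: float, f_clk: float) -> float:
    """Critical trace length in metres, using the constants of eq. (6) as given."""
    return 0.35 * v_p / (6 * 7 * f_clk)


V_P = 0.84e8      # phase velocity of the GaAs interconnect, m/s (from eq. 6)
F_CLK = 6e9       # clock frequency used in eq. (6), Hz
TRACE_LEN = 3.8e-3  # clock trace length reported in the paper, m

l_crit = critical_length(V_P, F_CLK)
print(f"critical length = {l_crit * 1e6:.0f} um")
print(f"trace / critical length = {TRACE_LEN / l_crit:.1f}")
```

The trace is more than thirty times longer than the critical length, which supports treating it as a transmission line.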

The clock trace of the 32-bit accumulator is about 3.8 mm long, which is much longer than the critical length. The clock trace must therefore be treated as a transmission line and terminated in its characteristic impedance. However, the effective characteristic impedance of the clock trace decreases as a result of the layout and the capacitive loading along the trace. Moreover, the registers are not equally spaced along the trace. To distribute high-quality clock signals, multiple reflections along the clock traces must be suppressed. The proposed clock distribution and termination scheme are shown in Figure 8.

The delay of the unloaded clock trace 2 in Figure 8 is simulated to be 30 ps. When the 40 registers of the 32-bit RCA are connected to the clock trace, the simulated delay increases further. It is not feasible to drive the clock trace from both ends. A tree-type clock distribution is proposed, as shown in Figure 8. For a compact layout, clock trace 1 is driven from the left side, and a delay cell is added before the clock driver at the left. Because both clock traces 1 and 2 are much longer than the critical length, they must be terminated carefully and separately.

The load of clock trace 1 is located at its two ends, so its impedance is not interrupted along its length. A  $\pi$ -type termination scheme with a DC-blocking capacitor at both ends is proposed for clock trace 1. In contrast, clock trace 2 is heavily loaded. The clock signal along the trace before termination is seriously disturbed by reflected signals. Multiple  $\pi$ -type terminations are implemented to match the impedance and absorb the reflections. The resistors and capacitors are selected based on the even- and odd-mode impedances of the differential clock traces. Layout parasitic RLC models are employed in all simulations to include the effects of metal interconnect parasitics. With the proposed clock distribution and termination schemes, the clock skew is simulated to be less than 13 ps at 6 GHz, and the 32-bit accumulator is simulated to operate functionally at 6 GHz.

## 3 Experimental results and discussion

The accumulator was fabricated in a 1.4  $\mu$m 60-GHz  $f_t$  GaAs HBT technology. The output of the 32-bit accumulator was truncated to 11 bits, and the accumulator was integrated as part of a DDFS. Using 2400 GaAs HBTs, the total area of the accumulator including the pre-skewing registers is 3.8 mm × 0.8 mm. The accumulator, including the pre-skewing and de-skewing registers, draws a current of 460 mA from a -5.2 V power supply. Note that the pre-skewing registers consume 33% and the de-skewing registers account for 29% of the total power dissipation.
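The power and area entries for this work in Table 1 follow from the figures in this paragraph. A minimal arithmetic sketch, using only values stated in the text (the variable names are hypothetical):

```python
SUPPLY_V = 5.2        # magnitude of the supply voltage, volts
CURRENT_A = 0.460     # supply current, amperes
DIE_MM = (3.8, 0.8)   # accumulator dimensions, millimetres

power_mw = SUPPLY_V * CURRENT_A * 1e3
area_mm2 = DIE_MM[0] * DIE_MM[1]
preskew_mw = 0.33 * power_mw   # share quoted for the pre-skewing registers
deskew_mw = 0.29 * power_mw    # share quoted for the de-skewing registers

print(f"power = {power_mw:.0f} mW, area = {area_mm2:.2f} mm^2")
print(f"pre-skew = {preskew_mw:.0f} mW, de-skew = {deskew_mw:.0f} mW")
```

On these numbers the skewing registers together account for roughly 62% of the total power, leaving about 38% for the adder core and the remaining circuitry.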

The die photo of the DDFS that integrates the 32-bit accumulator is shown in Figure 9. The MSB carry is buffered to drive the 50- $\Omega$  off-chip measurement instrument. The pre-skewing registers, which are triggered by the external differential trigger signals, update the FCW in series. The frequency of the carry out is equal to that of the DDFS output, which is FCW× $f_{clk}/2^{32}$ , where  $f_{clk}$  is the clock frequency. The pulse width of the carry out is equal to the clock cycle time. The measured waveforms of the DDFS and the carry out are shown in Figures 10 and 11. The waveform of the DDFS is shown in the C1 channel of the oscilloscope, and the carry out in the C2 channel. As shown in Figure 10, when the DDFS operates at 5.3 GHz, the frequency of the carry out is 2.6923 MHz, which is close to the theoretical result 0x00214AC3×5300 MHz/ $2^{32}$ =2.692 MHz. When the FCW is changed to 0x7F161391 and the clock is 4.7 GHz, the result is 0x7F161391×4.7 GHz/ $2^{32}$ =2.333 GHz, as shown in Figure 11. The highest stable operating frequency was measured to be 5.3 GHz, which is lower than the simulated 6 GHz. If the bias current of the accumulator is increased, the accumulator tends to operate at a higher frequency.

Figure 8 Clock distribution for the 32-bit accumulator and the proposed termination scheme.

Figure 9 Microphotograph of the DDFS integrated with the proposed 32-bit accumulator.

Figure 10 Waveforms of the DDFS and carry out of the 32-bit accumulator under 5.3 GHz and FCW = 0x00214AC3.
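The two theoretical output frequencies quoted in the measurement discussion follow from the relation FCW × $f_{clk}/2^{32}$. The sketch below reproduces them; the function name is hypothetical, and the FCW and clock values are those given in the text.

```python
def ddfs_output_frequency(fcw: int, f_clk_hz: float, n_bits: int = 32) -> float:
    """Output (and carry-out) frequency in hertz for a given frequency control word."""
    return fcw * f_clk_hz / 2 ** n_bits


f_low = ddfs_output_frequency(0x00214AC3, 5.3e9)   # case of Figure 10
f_high = ddfs_output_frequency(0x7F161391, 4.7e9)  # case of Figure 11

print(f"FCW = 0x00214AC3 at 5.3 GHz: {f_low / 1e6:.3f} MHz")
print(f"FCW = 0x7F161391 at 4.7 GHz: {f_high / 1e9:.3f} GHz")
```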

As shown in Table 1, the proposed 32-bit accumulator achieves the highest operating speed among the accumulators with a resolution of more than 24 bits. The area of the proposed accumulator is somewhat large; this is due to the FCW input interface, which is critical for phase continuity.

Figure 11 Waveforms of the DDFS and carry out of the 32-bit accumulator under 4.7 GHz and FCW=0x7F161391.

## 4 Conclusion

A 5.3-GHz 32-bit accumulator for a DDFS is presented. To increase the throughput while holding down the area and power consumption, a carry-ripple pipeline topology with a reduced number of pre-skewing registers is proposed. The number of pre-skewing registers is reduced to 29% of that of a conventional pipelined accumulator. The propagation delay of the adder is modeled with the open-circuit time constant method. The maximum error between the model and the simulation result is less than  $\pm 8\%$ . The optimum selection of the bias current and the scaling of the transistors based on the delay model are discussed. A multiple  $\pi$ -type termination scheme is proposed for the 5.3 GHz on-chip clock traces with a length of 3.8 mm. The experimental results indicate that the proposed termination scheme is feasible and could be applied to other high-speed clock systems. The accumulator could be extended to 48 bits when a high-resolution DDFS is required.

Table 1 Comparison with various recent high-speed pipeline accumulator designs

| Technology              | 1.2 μm CMOS [9] | 0.25 μm CMOS [3] | 90 nm CMOS [4] | InP DHBT [10] | SiGe HBT [11] | GaAs HBT (This work) |
|-------------------------|-----------------|------------------|----------------|---------------|---------------|----------------------|
| Transistor $f_t$ (GHz)  | -               | -                | -              | 300           | 200           | 60                   |
| Resolution (bits)       | 32              | 32               | 24             | 4             | 10            | 32                   |
| $f_{clk,max}$ (GHz)     | 0.7             | 0.63             | 1.3            | 41            | 7             | 5.3                  |
| Area (mm<sup>2</sup>)   | 0.81            | -                | -              | 1.77          | 1.16          | 3.04                 |
| Power (mW)              | 850             | 25.84            | 49             | 4100          | 237           | 2392                 |

The authors wish to express their sincere thanks to LI YanKui and OUYAN SiHua for measurement guidance, and SA Bin at RFMC for wire bonding. This work was supported by the National Basic Research Program of China (2010CB327505).

- 1 Thuries S, Tournier E, Cathelin A, et al. A 6-GHz low-power BiCMOS SiGe:C 0.25 μm direct digital synthesizer. IEEE Microwave Wireless Compon Lett, 2008, 18: 46–48
- 2 Bhansali P, Hosseini K, Kennedy M P. Performance analysis of low power high speed pipelined adders for digital ΣΔ modulators. Electron Lett, 2006, 42: 1442–1444
- 3 Strollo A G M, Caro D D, Petra N. A 630 MHz, 76 mW direct digital frequency synthesizer using enhanced ROM compression technique. IEEE J Solid-State Circuits, 2007, 42: 350–360
- 4 Yeoh H C, Jung J H, Jung Y H, et al. A 1.3-GHz 350-mW hybrid direct digital frequency synthesizer in 90-nm CMOS. IEEE J Solid-State Circuits, 2010, 45: 1845–1855
- 5 Kim Y S, Kang S-M. A high speed low-power accumulator for direct digital frequency synthesizer. In: Sowers J, Thorburn M, eds. IEEE MTT-S Int Microw Symp Dig, 2006 June 11–16, San Francisco, CA. 502–505

- 6 Alioto M, Palumbo G. CML and ECL: Optimized design and comparison. IEEE Trans Circuits Syst Fundam Theory Appl, 1999, 46: 1330–1341
- 7 Rodwell M J W, Urteaga M, Betser Y, et al. Scaling of InGaAs/ InAlAs HBTs for high speed mixed-signal and mm-wave ICs. Int J High Speed Electron Syst, 2001, 11: 159–215
- 8 Bogatin E. Signal Integrity: Simplified. Beijing: Publishing House of Electronics Industry, 2007
- 9 Lu F, Samueli H, Yuan J, et al. A 700-MHz 24-b pipelined accumulator in 1.2-µm CMOS for application as a numerically controlled oscillator. IEEE J Solid-State Circuits, 1993, 28: 878–886
- 10 Turner S E, Elder R B, Jansen D S, et al. 4-bit adder-accumulator at 41-GHz clock frequency in InP DHBT technology. IEEE Microwave Wireless Compon Lett, 2005, 4: 144–146
- 11 Laemmle B, Wagner C, Knapp H, et al. High speed low power phase accumulators for DDS applications in SiGe bipolar technology. In: IEEE Bipolar/BiCMOS Circuits and Technol Meeting, 2009 Oct. 12–14, Capri, Italy. 162–165

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.