This paper presents high-speed, low-power, small-area accumulator designs for direct digital frequency synthesizer (DDFS) systems. To reduce the complexity and size of the Numerically Controlled Oscillator (NCO), only the most significant bits of the accumulator drive the phase-to-amplitude mapping block. Those bits need to be updated on every sampling clock, while the least significant bits of the accumulator are not visible to the rest of the DDFS design and can be updated less frequently, which motivated the development of new accumulator designs. Without performance degradation, the proposed designs relieve implementation constraints, so they can be employed for GHz-range DDFS, reduce power consumption by up to 82% compared to the standard accumulator design, and minimize chip area. For further power reduction, the proposed designs place the phase modulation adder at the front of the accumulator.
Introduction
With today's technology, direct digital frequency synthesizer (DDFS) systems are becoming an alternative to analog-based phase locked loop (PLL) synthesizers. Advantages of DDFS systems over analog frequency synthesizers include sub-Hertz resolution, fast switching between frequencies and phases, and less susceptibility to aging and temperature changes [1]. A simplified block diagram for a conventional sine output DDFS with phase modulation is shown in Fig. 1. The standard approach to implementing an accumulator for high-speed DDFS systems is to use a pipelined architecture [2], [3]. Although it can achieve high throughput, it has high power consumption and uses a large area due to the number of registers needed to keep the data coherent.
To overcome the aforementioned problems, the accumulator is divided into two sections in the proposed designs. The upper section, which contains the most significant bits and drives the sine wave mapping block, runs at full operating speed, while the lower section runs at lower operating frequencies. Because the lower section of the accumulator runs at lower frequencies, the number of pipeline stages in that section can be reduced. Therefore, the proposed accumulator designs relieve constraints in the implementation of DDFS, use less power, and minimize chip area while keeping the same performance, making them suitable for high-speed DDFS. The proposed designs also eliminate some of the pipeline skewing registers and move the phase modulator adder (the adder after the accumulator in Fig. 1) to the front of the accumulator. This change implies that the phase change due to modulation is done only once, when a new phase control word (PCW) is programmed, instead of on every clock cycle as in typical designs, helping to further reduce power consumption.
This paper is organized as follows. Sect. 2 describes the background of DDFS and the conventional pipelined accumulator. The proposed split and exponential split accumulator designs are presented in Sect. 3. Sect. 4 presents simulation results for the proposed accumulators. Finally, Sect. 5 gives a conclusion and future work.
Conventional Accumulator Design

Background
The basic idea behind a DDFS is to have an accumulator generate a ramp representing phase values in the 0 to 2π interval. That ramp addresses the sine wave mapping block, where the phase is translated into amplitude and scaled to the range of the digital-to-analog converter (DAC) input, as shown in Fig. 1. With M bits, the phase accumulator generates a ramp sweeping from 0 to $2^{M} - 1$. In present designs, M ranges from 24 up to 64 bits. The frequency of the sweep is controlled by the frequency control word (FCW) and the device operating frequency (CLOCK), making the frequency of the output sine wave $F_{OUT} = (FCW \times CLOCK) / 2^{M}$. Sine wave mapping implementations fall into two categories: computational methods and lookup table (LUT) methods. Computational methods such as Taylor series [5] and parabolic approximation [6] use arithmetic to approximate the waveform, while the LUT method uses a read only memory (ROM) that contains the amplitude values of specific phases. The ROM is addressed by the output of the phase accumulator. With either method, the number of bits coming out of the accumulator has a direct influence on the size and complexity of the sine wave mapping. In the case of a 32-bit accumulator with a LUT mapping and a 12-bit D/A converter, the size of the ROM needed to implement the LUT is $2^{32} \times 12$, or more than $51 \times 10^{9}$ bits. The solution is not to use the accumulator's full output but just a fraction of it, as shown in Fig. 2. The accumulator output word is shown in Fig. 3. The effect of this truncation is a smaller ROM and a loss of phase precision. This loss appears as spurious frequencies in the output spectrum, and their amplitude dictates how much truncation the system goals allow. Although truncation eases the sine wave mapping implementation, it does little to help the accumulator design. The accumulator still needs to operate at full resolution, adding a full M bits every clock cycle.
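The relations in this subsection can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the function names are illustrative, and the 1 GHz clock is borrowed from the simulation parameters quoted later in the Results section. It evaluates the output frequency formula, the ROM size with and without truncation, and the truncated phase ramp that addresses the mapping block.

```python
def output_frequency(fcw: int, clock_hz: float, m_bits: int) -> float:
    """F_OUT = (FCW * CLOCK) / 2^M for an M-bit phase accumulator."""
    return fcw * clock_hz / (1 << m_bits)


def lut_rom_bits(address_bits: int, dac_bits: int) -> int:
    """ROM size in bits for a LUT with 2^address_bits words of dac_bits each."""
    return (1 << address_bits) * dac_bits


def truncated_phase_ramp(fcw: int, m_bits: int, n_bits: int, cycles: int) -> list:
    """Accumulate FCW modulo 2^M and keep only the N most significant bits."""
    mask = (1 << m_bits) - 1
    acc = 0
    ramp = []
    for _ in range(cycles):
        acc = (acc + fcw) & mask               # full M-bit addition every clock
        ramp.append(acc >> (m_bits - n_bits))  # N MSBs address the mapping block
    return ramp


if __name__ == "__main__":
    # Frequency resolution (FCW = 1) of a 32-bit accumulator at 1 GHz, in Hz.
    print(output_frequency(1, 1e9, 32))
    # ROM bits for a 12-bit DAC: full 32-bit address vs. 14-bit truncated address.
    print(lut_rom_bits(32, 12), lut_rom_bits(14, 12))
```

With a 14-bit truncated address the ROM shrinks from roughly 51.5 Gbit to 196,608 bits, while the accumulator loop itself is unchanged: it still performs a full M-bit addition per clock, which is the cost the proposed designs target.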
Conventional Pipelined Accumulator
To overcome technology restrictions, several architectures have been devised, including pipelined [2], [3]; progression of states [7]; pipelined parallel [8]; and others. The basic pipelined architecture is exemplified in Fig. 4. A number of registers are needed to keep the input and output of the accumulator coherent. Label A in Fig. 4 points to the registers that are programmed with the FCW to be used by the accumulator. To carry the FCW value in a coherent manner to the adders (Fig. 4 D), pre-skewing registers (Fig. 4 B) are used, and in the same fashion the de-skewing registers (Fig. 4 C) are used to keep the output coherent [4]. De-skewing registers are needed only on the N bits that are part of the clipped output of the accumulator. The size of each stage is defined by the pipeline clock period. Although this architecture allows for a higher throughput, it introduces latency: it now takes 8 clock cycles to go from stage 1 to stage 8 in Fig. 4 and compute a full addition in the accumulator. The adder shown after the accumulator in Fig. 1 performs the phase modulation. This extra adder can also be pipelined and inserted between the adders and the de-skewing registers, as shown in Fig. 9.
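The role of the pre-skewing and de-skewing registers can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden simplification, not the circuit of Fig. 4: it uses 8 equal 4-bit stages, models each skewing register chain as a delay line, and, unlike the text above, de-skews all M bits rather than only the N output bits so that the whole word can be compared against a plain accumulator.

```python
def direct_accumulate(fcw_seq, m_bits=32):
    """Reference: one full-width addition per clock."""
    mask = (1 << m_bits) - 1
    acc, out = 0, []
    for fcw in fcw_seq:
        acc = (acc + fcw) & mask
        out.append(acc)
    return out


def pipelined_accumulate(fcw_seq, m_bits=32, stages=8):
    """Behavioral model; m_bits is assumed to be a multiple of stages."""
    width = m_bits // stages
    mask = (1 << width) - 1
    sums = [0] * stages       # per-stage sum registers
    carries = [0] * stages    # registered carry INTO each stage
    # Stage i sees its FCW slice i clocks late (pre-skew) ...
    pre_skew = [[0] * i for i in range(stages)]
    # ... and its result is delayed (stages - 1 - i) clocks (de-skew).
    de_skew = [[0] * (stages - 1 - i) for i in range(stages)]
    out = []
    for fcw in fcw_seq:
        next_sums = [0] * stages
        next_carries = [0] * stages
        for i in range(stages):
            pre_skew[i].append((fcw >> (i * width)) & mask)
            fcw_slice = pre_skew[i].pop(0)
            total = sums[i] + fcw_slice + carries[i]
            next_sums[i] = total & mask
            if i + 1 < stages:
                next_carries[i + 1] = total >> width  # used one clock later
        sums, carries = next_sums, next_carries
        word = 0
        for i in range(stages):
            de_skew[i].append(sums[i])
            word |= de_skew[i].pop(0) << (i * width)
        out.append(word)
    return out


if __name__ == "__main__":
    fcws = [0x01234567] * 20
    ref = direct_accumulate(fcws)
    pipe = pipelined_accumulate(fcws)
    print(pipe[7:] == ref[:-7])  # same values, shifted by the pipeline depth
```

In this model the pipelined output reproduces the reference sequence delayed by `stages - 1` samples, and the number of delay elements grows with the square of the stage count, which is the register overhead the proposed designs aim to cut.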
Proposed Accumulator Designs
As seen above, the pipelined phase accumulator runs at full clock speed and the pre- and de-skewing registers use a large area. The proposed accumulators reduce the number of registers needed in the pipeline and run the portion of the accumulator not visible to the mapping block at lower clock frequencies. Therefore, they are suitable for GHz-range DDFS, reduce power consumption, and at the same time minimize the chip area without performance degradation.
Split Accumulator
In DDFS applications, a wide accumulator is desirable because it improves resolution. To reduce the size or the computational effort of the sine wave mapping block, the accumulator output is truncated to a smaller width. The least significant bits (the discarded part) of the accumulator do not directly influence the sine wave mapping. Therefore, the only section of the accumulator that needs to be updated on every clock is the one that supplies the bits used by the mapping function. The accumulator can then be divided into two sections: the upper section, containing the bits that address the sine wave mapping block, and the lower section, which is discarded. The only connection between the two is the carry data from the lower to the upper section. This means that as long as the upper section of the accumulator is updated with the correct carry data, there are no constraints on how to implement the lower section. The idea, shown in Fig. 5 without the pipeline, is to have the lower section operate at half the speed of the upper section. This relieves constraints on the implementation of the lower section, uses less power (it operates at a lower frequency), and saves area, since fewer pipeline stages are needed. It is important to make sure that the carry data between sections is updated correctly. Since the lower section operates at half the frequency, the FCW for that section needs to be twice the original value. In Fig. 6 this is done by the X2 block, that is, by shifting the lower part of the FCW. This generates a carry that is used by the upper section every two clock cycles. The lower section of the accumulator also generates its own carry every two clock cycles. Those two carry bits are interleaved by the MUX shown in Fig. 5, which is controlled by the clock signal. Note that the carry fed to the upper section has jitter: it can be off by one clock cycle compared to the full implementation of the accumulator. 
This means that the least significant bit of the upper section can differ from the correct value by one. To avoid this problem, the upper section is given at least two more bits than the number of bits that address the sine wave mapping. This is illustrated in Fig. 7, where the accumulator is split into an upper section with M − S bits and a lower section with S bits when N bits must be transferred to the sine wave mapping block. A fully pipelined split accumulator is shown in Fig. 8. Compared to the standard pipelined accumulator in Fig. 4, the lower section has a two-stage pipeline (stages 1 and 2 in Fig. 8) instead of a four-stage pipeline (stages 1 through 4 in Fig. 4). This can be accomplished because the lower section now works at half speed, allowing the use of bigger adders. Fig. 8 also shows that there are fewer pre-skewing registers. Instead of the array of registers needed to synchronize the data in Fig. 4 B, step registers (in gray in Fig. 8) are used. Writing a new value to the FCW register (Fig. 4 A) triggers a one-shot clock pulse that propagates through the pipeline, updating the step registers in the accumulator in a predefined sequence. Eliminating most of the pre-skewing registers reduces the total area of the accumulator and the total number of registers, which mitigates the increase in power caused by increasing the number of pipeline stages [9]. Although the number of pipeline stages is reduced, the latency is still the same, since stages 1 and 2 in Fig. 8 operate at half speed and use two clock cycles each.
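The carry interleaving described above can be exercised with a short behavioral model. This is a hedged Python sketch of one possible reading of Fig. 5 and Fig. 6, without pipelining: the function names, the choice of which clock phase carries which carry bit, and the 16/16 split are assumptions made for illustration.

```python
def reference_upper(fcw, cycles, m_bits=32, s_bits=16):
    """Upper (M - S) bits of a conventional full-width accumulator."""
    mask = (1 << m_bits) - 1
    acc, out = 0, []
    for _ in range(cycles):
        acc = (acc + fcw) & mask
        out.append(acc >> s_bits)
    return out


def split_upper(fcw, cycles, m_bits=32, s_bits=16):
    """Upper section fed by a half-rate lower section through a carry MUX."""
    low_mask = (1 << s_bits) - 1
    up_mask = (1 << (m_bits - s_bits)) - 1
    fcw_up = fcw >> s_bits
    doubled = (fcw & low_mask) << 1   # X2 block: lower FCW shifted left by one
    carry_x2 = doubled >> s_bits      # bit shifted out of the lower section
    step_low = doubled & low_mask     # value the half-rate adder accumulates
    low, upper, out = 0, 0, []
    for t in range(cycles):
        if t % 2 == 0:
            # Lower section updates once every two clocks.
            total = low + step_low
            low = total & low_mask
            carry = total >> s_bits   # MUX selects the lower adder's carry
        else:
            carry = carry_x2          # MUX selects the X2 carry
        upper = (upper + fcw_up + carry) & up_mask
        out.append(upper)
    return out


def max_upper_error(fcw, cycles, m_bits=32, s_bits=16):
    """Largest |split - reference| in upper-section LSBs (modulo wraparound)."""
    modulus = 1 << (m_bits - s_bits)
    worst = 0
    for r, s in zip(reference_upper(fcw, cycles, m_bits, s_bits),
                    split_upper(fcw, cycles, m_bits, s_bits)):
        diff = (s - r) % modulus
        worst = max(worst, min(diff, modulus - diff))
    return worst


if __name__ == "__main__":
    for fcw in (0x00007777, 0x77777777, 0x0000FFFF):
        print(hex(fcw), max_upper_error(fcw, 2000))
```

In this model the two carries delivered over each pair of clocks add up to exactly the carries a full-rate lower section would have produced, so the upper section matches the reference at the end of every pair and differs by at most one LSB in between, consistent with the one-cycle carry jitter noted in the text.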
Phase Modulator
As mentioned earlier, in an accumulator with phase modulation, the adder that implements the modulation can be embedded into the accumulator by pipelining it, as shown in Fig. 9. It still operates at full clock speed and adds one clock cycle to the total latency of the accumulator. To improve the design, the adder is moved from the back to the front of the accumulator. Instead of performing an addition every clock cycle, it now operates only when the PCW is updated. The lower part of Fig. 6 shows what is needed to move the phase modulation adder to the front of the accumulator. The objective of the phase modulation adder is to add a constant (the phase shift) to the output of the accumulator. Assuming that this constant is updated less frequently than the operating frequency, moving the adder to the front of the accumulator uses less power, since it operates only when the PCW is updated. In this architecture, instead of adding a constant to the accumulator's output on every clock cycle, the value of the accumulator is moved to a new phase by switching the operating FCW for one that is equal to the operating FCW plus the necessary phase shift. In Fig. 6, register A holds the current phase shift. When the PCW is updated, the current phase shift is subtracted from the new PCW, giving the phase difference to the new waveform. This difference is then added to the current FCW to create the shifted frequency control word (SFCW), which is loaded into register B. To move the waveform to the new phase, the SFCW is passed to the accumulator through the MUX for one clock cycle, after which the accumulator resumes using the original FCW.
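The equivalence between the two adder placements can be checked with a small model. The following Python sketch is illustrative only: the function names are hypothetical, the PCW is treated as a full M-bit phase offset, and pipelining and register timing are ignored. It compares an adder at the accumulator output with the SFCW scheme, where the difference between the new and the current phase shift is folded into the FCW for a single clock.

```python
def back_adder(fcw, pcw_seq, m_bits=32):
    """Conventional placement: the phase adder runs on every clock."""
    mask = (1 << m_bits) - 1
    acc, out = 0, []
    for pcw in pcw_seq:
        acc = (acc + fcw) & mask
        out.append((acc + pcw) & mask)
    return out


def front_adder(fcw, pcw_seq, m_bits=32):
    """Front placement: the adder runs only when the PCW changes.

    Returns the output sequence and the number of PCW updates, i.e. how
    many times the modulation adder was actually used.
    """
    mask = (1 << m_bits) - 1
    acc = 0
    current_shift = 0   # register A: phase shift already applied
    updates = 0
    out = []
    for pcw in pcw_seq:
        if pcw != current_shift:
            # Register B: FCW plus the phase difference, used for one clock.
            step = (fcw + pcw - current_shift) & mask
            current_shift = pcw
            updates += 1
        else:
            step = fcw
        acc = (acc + step) & mask
        out.append(acc)
    return out, updates


if __name__ == "__main__":
    pcws = [0] * 5 + [0x40000000] * 5 + [0x10000000] * 5
    shifted, updates = front_adder(0x01234567, pcws)
    print(shifted == back_adder(0x01234567, pcws), updates)
```

Both placements yield the same phase sequence in this model, but the front adder is exercised once per PCW change instead of once per clock, which is the source of the power saving argued above.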
Exponential Split Accumulator
In the split accumulator, the interface between the lower section and the upper section is a carry signal. This carry signal is generated by interleaving the carry out of the X2 block, which is the most significant bit (MSB) of the S section of the accumulator input, with the carry from the lower part of the accumulator, which now has (S − 1) bits and runs at half the clock frequency. The same reasoning can be applied to the remaining (S − 1)-bit section of the lower part of the accumulator. The connection between the (S − 1)-bit section and the rest of the accumulator can be made by a carry signal that is the interleaved result of the MSB of the (S − 1)-bit section and the carry from the remaining (S − 2) bits, which run at 1/4 the speed. By recursively applying this reasoning down to the least significant bit of the accumulator, we arrive at the exponential split accumulator design, shown with 8 bits in the split section in Fig. 10. The main difference between the exponential split accumulator and the previous designs is that no adders are used in the lower part of the accumulator. A MUX coupled to every input bit in the lower part selects the appropriate carry to be propagated to the upper part of the accumulator. The selection of which input is propagated through the MUX is made by a control signal generated from the operating clock. For every bit further away from the upper part of the accumulator, the MUX is controlled by the control signal of the previous bit divided by 2. Another advantage is that only one pipeline stage is needed for the lower part of the accumulator, which reduces its latency. Figure 11 shows the pipelined exponential split accumulator, where A is the exponential split section shown in Fig. 10 but with 16 bits instead of 8 bits. Compared with the previous pipelined designs, it has 5 stages, while the split accumulator has 6 stages and the traditional pipelined accumulator has 8 stages. 
It should be noted that the same phase modulation design described in the previous section can be implemented in the exponential split accumulator.
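The recursion in this section can be unrolled into a simple selection rule. The Python sketch below is one reading of that recursion, not a transcription of Fig. 10: it assumes clocks are numbered from 1, that odd clocks select the MSB of the lower FCW, clocks divisible by 2 but not 4 select the next bit, and so on, and that the one slot left over every $2^{S}$ clocks carries nothing. The function names are hypothetical.

```python
def exp_split_carry(fcw_low: int, t: int, s_bits: int) -> int:
    """Carry sent to the upper section on clock t (t = 1, 2, 3, ...).

    Level 0 toggles every clock, level 1 every 2 clocks, level 2 every 4,
    mirroring MUX control signals that are successively divided by 2.
    """
    for level in range(s_bits):
        if (t >> level) & 1:
            # The lowest set bit of the clock count picks one FCW bit:
            # level 0 -> MSB of the lower FCW, level s_bits - 1 -> its LSB.
            return (fcw_low >> (s_bits - 1 - level)) & 1
    return 0  # leftover slot once every 2**s_bits clocks


def carries_delivered(fcw_low: int, cycles: int, s_bits: int) -> int:
    """Total carries delivered to the upper section over `cycles` clocks."""
    return sum(exp_split_carry(fcw_low, t, s_bits) for t in range(1, cycles + 1))


if __name__ == "__main__":
    s_bits = 8
    fcw_low = 0x77
    # A full-rate S-bit adder accumulating fcw_low overflows exactly
    # fcw_low times in 2**s_bits clocks; compare with the MUX-only count.
    print(carries_delivered(fcw_low, 1 << s_bits, s_bits), fcw_low)
```

Under these assumptions, bit k of the lower FCW is selected $2^{k}$ times per $2^{S}$ clocks, so the carries delivered over a full period sum to the lower FCW value, the same total a full-rate S-bit adder would overflow. Within a period the running count can lead or lag the full-width accumulator; the sketch does not attempt to bound that deviation.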
Results
All three architectures are simulated in MATLAB® to record the SFDR degradation due to the truncation of the accumulator's output. For the case considered here, a 32-bit accumulator with its output truncated to 14 bits, the FCW that yields the worst-case phase truncation spurs is the one with bit 17 (M − N − 1) set to 1, or 0x00020000 [10]. Figure 12 shows that all three architectures yield the same SFDR value of −80.36 dBc. This is expected, since the upper section of the accumulator is the same for all architectures. Using an FCW that exercises the lower section, with bit 12 set to 1 or 0x00001000, the results were still the same (−83.37 dBc) for all three. To evaluate the advantage of the proposed architecture over the standard and split accumulators, all three designs were created and simulated in HSPICE® using the standard cell library of the ATMEL 0.35 µm process. The configuration chosen is a 32-bit accumulator with its output truncated to 14 bits. In the split and exponential split architectures, the accumulator is divided in half, with 16-bit upper and 16-bit lower sections. Figure 13 shows that the exponential split accumulator saves 47%/20% in area and 61%/21% in power over the standard/split accumulator, respectively. Figure 13 (b) also shows the normalized power results for the three architectures for three different frequency control words. FCW1 (7777 7777h) exercises the whole accumulator, while FCW2 (7777 0000h) and FCW3 (0000 7777h) exercise only the upper and only the lower section, respectively. There is no difference between the split and exponential split accumulators when FCW2 is used, due to the fact that the upper part of the accumulator is the same for both architectures. The difference becomes obvious with FCW3. With only the lower section of the accumulator operating, the exponential split accumulator shows a power improvement of 82% over the standard accumulator and 46% over the split accumulator. Table 1 compares the performance of the previous and the proposed accumulators. 
Results for the pipeline accumulator, parallel accumulator, and pipelined parallel accumulator are from [8]. Results for the proposed split accumulator and exponential split accumulator were obtained using the following parameters: FCW = 0000 7777h, $F_{CLOCK}$ = 1 GHz, and $V_{DD}$ = 3.3 V. Table 1 also summarizes the number of logic gates used in the accumulators.
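The control words used in this section follow directly from the chosen widths. As a small illustrative Python sketch (the helper names are hypothetical; the widths M = 32, N = 14, and S = 16 are the ones stated above), the worst-case FCW and the section each test word exercises can be derived as follows.

```python
M_BITS, N_BITS, S_BITS = 32, 14, 16  # accumulator, output, lower-section widths


def worst_case_fcw(m_bits: int, n_bits: int) -> int:
    """FCW with only bit (M - N - 1) set, the worst case quoted from [10]."""
    return 1 << (m_bits - n_bits - 1)


def sections_exercised(fcw: int, s_bits: int) -> tuple:
    """Return (upper_active, lower_active) for a given FCW and split point."""
    lower = fcw & ((1 << s_bits) - 1)
    upper = fcw >> s_bits
    return (upper != 0, lower != 0)


if __name__ == "__main__":
    print(hex(worst_case_fcw(M_BITS, N_BITS)))
    for name, fcw in (("FCW1", 0x77777777), ("FCW2", 0x77770000), ("FCW3", 0x00007777)):
        print(name, sections_exercised(fcw, S_BITS))
```

This only reproduces the bit arithmetic behind the test vectors; it says nothing about the resulting SFDR or power figures, which come from the MATLAB and HSPICE simulations.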
Conclusion and Future Work
This paper proposes two accumulator designs (the split and exponential split accumulators) for high-speed, low-power, small-area DDFS systems. The proposed designs have several advantages over conventional designs: they ease the restrictions on the implementation of high-resolution, high-speed DDFS systems, consume less power, and use less area without any performance degradation. Simulation results show that the exponential split accumulator reduces power consumption by up to 82% compared to the conventional accumulator. This and other publications show that the regularity of an accumulator allows new and creative architectures. We plan to further explore this regularity to achieve architectures with fewer pipeline stages, allowing high-speed, low-power designs to be implemented.
