In this paper, we propose a low-computation cycle and power-efficient recursive discrete Fourier transform (DFT)/inverse DFT (IDFT) architecture that adopts a hybrid of input strength reduction, the Chebyshev polynomial, and register-splitting schemes. Compared with the existing recursive DFT/IDFT architectures, the proposed recursive architecture halves the number of computation cycles. Applying this low-computation cycle architecture to the dual tone multi-frequency (DTMF) detector in the high channel density voice over packet (VoP) application, we can double the throughput rate and the channel density without increasing the operating frequency. The chip implementation results show that the proposed architecture can process over 128 channels, with each channel consuming 9.77 µW at 1.2 V and 20 MHz in the TSMC 0.13 µm 1P8M CMOS process. The VLSI implementation thus demonstrates the power-efficiency advantage of the low-computation cycle architecture.
key words: channel density, high density voice over packet, high throughput, low-computation cycle, power efficiency, recursive DFT/IDFT
Introduction
The discrete Fourier transform (DFT) and its inverse (IDFT) are essential in digital signal processing (DSP) and communication systems [14]. In practice, many applications require spectrum analysis over only a subset of the N center frequencies of the DFT rather than the full result of the fast Fourier transform (FFT). An effective derivative of the DFT/IDFT is the Goertzel algorithm [1], [2], which outperforms the FFT algorithm when only a few sparse DFT results are needed, since it produces a single complex DFT spectral bin value for every N input time instances. The Goertzel algorithm has been widely applied to the dual tone multi-frequency (DTMF) standards [3]-[8] for voice over packet (VoP) networks [9]-[11] to compute the spectra of interest, to the discrete multitone equalizer of multicarrier modulation systems [12], [13], and to speed detection. For state-of-the-art applications, a high channel-density dual-tone detector [9]-[11] is in demand. Some advanced DTMF detectors for the high density VoP network application have been realized on an embedded DSP processor [4]-[6], [9]-[11]. Although a DSP-processor-based design keeps maximum flexibility, it may not be cost effective. Moreover, it may lose the advantages of high throughput, low power, and small area offered by application-specific integrated circuit (ASIC) designs [14].
†† The author is with the Dean Office of Academic Affairs, National Chiao-Tung University, Hsinchu, 300, Taiwan, R.O.C.
††† The author is with the Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, 300, Taiwan, R.O.C.
* A portion of this paper was presented at the 2005 IEEE Workshop on Signal Processing Systems (SiPS), Athens, Greece. This work was supported in part by the National Science Council of Taiwan Grant NSC93-2218-E-009-061 and MOEA94-EC-17-A-01-S1-048.
a) E-mail: Vincent yu@emc.com.tw
DOI: 10.1093/ietfec/e90-a.8.1644
In [5], the DSP-processor-based DTMF detector needs a large amount of memory to decode only 24 channels: 800 words of data memory and 1000 words of program memory, each word being 16 bits long. It also has to operate at the higher frequency of 24 MHz. To optimize the overall system performance and cost, much research [15]-[22] has concentrated on dedicated core design. In [15]-[17], recursive expressions for the DCT computation that are suitable for VLSI implementation are presented. It is worth noting that in [15]-[17] the recursive algorithms are used solely to design recursive DCT architectures rather than recursive DFT architectures. In the past two decades, several recursive DFT algorithms and architectures have been explored [18]-[22]. Compared with the conventional second-order recursive DFT/IDFT architecture, Van et al. [20] utilized resource-sharing and register-splitting schemes to save two multipliers and to speed up the computation, respectively. Yang et al. [21] proposed two unified IIR filter structures to save hardware cost for the DFT computation. Nevertheless, neither Van et al. [20] nor Yang et al. [21] reduced the number of computation cycles. In [22], Fan et al. applied a previously proposed method to reduce the computation cycles, but the improvement is limited. Moreover, Fan et al. proposed only the recursive DFT algorithm in [22]; the IDFT algorithm is not provided. A short description of the proposed algorithm has been presented in the associated conference paper [23]. This paper provides a detailed description of a high-performance and power-efficient VLSI algorithm and architecture for the DTMF application, based on a hybrid of the input strength reduction scheme, the Chebyshev polynomial, and the register-splitting scheme. The derived algorithm and devised architecture [23] possess the following features: low computation cycle (i.e., high throughput) and power efficiency, at the expense of a slight area overhead compared with the existing recursive DFT/IDFT structures.
Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
Based on the proposed architecture, a high-throughput (i.e., high channel density) and power-efficient DTMF detector is proposed. To achieve high power efficiency, we perform a bit-level SNR simulation to decide the best configuration for the DTMF detector system. The results show that the proposed design needs only a 9-bit word-length, one bit less than the second-order Goertzel structure, to reach satisfactory resolution in a 15 dB SNR environment. The resulting DTMF detector uses a 12-bit word-length, where the additional 3 bits serve as design margin for better performance. Even so, the design saves 4 bits compared with the 16-bit DSP-processor-based designs [4]-[6]. In summary, the proposed DTMF structure not only saves area but also reduces power consumption, owing to the register-splitting scheme [20] and the smaller word-length requirement. Most importantly, the computation cycles are reduced to 50%, so the throughput rate and channel density can be doubled without increasing the operating frequency. The proposed DFT/IDFT chip can serve over 128 telephone channels for the high channel density DTMF detector [8] without any DSP processor inside. Each channel consumes 9.77 µW at 1.2 V and 20 MHz in the TSMC 0.13 µm 1P8M CMOS process. This is a significant contribution, as high channel density and low power are both demanded in communication systems.
The paper is organized as follows. A new recursive DFT/IDFT algorithm and architecture based on a hybrid of input strength reduction, the Chebyshev polynomial, and register-splitting schemes is presented in Sect. 2. In Sect. 3, the DTMF application of this new architecture is demonstrated; after the bit-level SNR simulation, the 212/106-point DFT/IDFT chip has been implemented for the DTMF detector system. In Sect. 4, comparison results are tabulated in terms of the number of computation cycles for each output as well as for the N-point DFT/IDFT, the maximum channel density, the clock period, and the number of real multipliers. Finally, Sect. 5 concludes the paper.
New Recursive Algorithm and Architecture for DFT/IDFT
The DFT of the N-point input x[n] is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where $W_N = e^{-j2\pi/N}$. By reducing the input strength of the DFT algorithm, Eq. (1) can be folded as
where 
and
In (3), we can define (3) can be rewritten as
where
Let $\theta_k = 2\pi k/N$, and $g_{N/2-1}(k)$ can be generalized as
It is known that Chebyshev polynomials are well defined as
Using the recursive identity stated in (7), Eq. (6) can be deduced as
The z-transform of (9) can be denoted as
For the DST part in (4), by letting
The z-transform of (12) can be denoted as
Equations (10) and (13) can be easily mapped into the recursive DFT structures shown in Figs. 1(a) and (b), respectively. Compared with the conventional architectures [2], [20], [21], the proposed DFT algorithm and architecture clearly reduce the computation cycles by 50%. In other words, as a direct result of the algorithm derivation, the throughput rate can be doubled without increasing the operating frequency. For power efficiency, we adopt the register-splitting scheme [20] (i.e., a type of retiming scheme) to shorten the critical path. Retiming [24] offers two main advantages: high speed and low power. In this paper, we use the technique to lower the power consumption, since the speed does not need to be increased. The resulting DCT part is depicted in the upper diagram of Fig. 2, which includes a hardwired shifter performing a one-bit left shift. Similarly, the DST part can be modified as shown in the lower diagram of Fig. 2. To maintain the minimum clock period for the recursive DFT computation, a forward pipeline register is used for the final sum output. Combining these two parts yields a novel recursive DFT architecture with a lower computation cycle count and higher power efficiency than the conventional DFT structures.
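As a reading aid, the folding and the Chebyshev step can be sketched as follows. This is a sketch under stated assumptions and need not match Eqs. (2)-(13) term by term: it assumes an even N, pairs x[n] with x[N−1−n], and introduces the symbols a[n], b[n], C[k], and S[k] purely for illustration.

```latex
\[
\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk}
      = \sum_{n=0}^{N/2-1} \Bigl( x[n]\, W_N^{nk} + x[N-1-n]\, W_N^{-(n+1)k} \Bigr)
      = W_N^{-k/2} \bigl( C[k] - j\, S[k] \bigr), \\
C[k] &= \sum_{n=0}^{N/2-1} a[n] \cos\bigl((n+\tfrac{1}{2})\theta_k\bigr),
      \qquad a[n] = x[n] + x[N-1-n], \\
S[k] &= \sum_{n=0}^{N/2-1} b[n] \sin\bigl((n+\tfrac{1}{2})\theta_k\bigr),
      \qquad b[n] = x[n] - x[N-1-n],
      \qquad \theta_k = \frac{2\pi k}{N}, \\
\cos\bigl((n+\tfrac{1}{2})\theta_k\bigr) &= 2\cos\theta_k \cos\bigl((n-\tfrac{1}{2})\theta_k\bigr) - \cos\bigl((n-\tfrac{3}{2})\theta_k\bigr), \\
\sin\bigl((n+\tfrac{1}{2})\theta_k\bigr) &= 2\cos\theta_k \sin\bigl((n-\tfrac{1}{2})\theta_k\bigr) - \sin\bigl((n-\tfrac{3}{2})\theta_k\bigr).
\end{aligned}
\]
```

Both sums run over only N/2 terms, and the last two identities allow each sum to be evaluated by a second-order recursion with the single real coefficient 2cos θ_k. This is consistent with the halved cycle count stated in the text.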
The IDFT of the N-point input y[k] is defined as
To develop the low-computation cycle recursive IDFT algorithm, Eq. (14) can be modified using the input strength reduction scheme as
Similarly, Eq. (15) can be split into the IDCT and IDST parts, $x_{\mathrm{IDCT}}[n]$ and $x_{\mathrm{IDST}}[n]$, respectively, as
In (16), we can define (16) can be rewritten as
Let $\theta_n = 2\pi n/N$, and $g_{N/2-1}(n)$ can be generalized as
Using the recursive identity stated in (7), Eq. (19) can be deduced as
The z-transform of (20) can be denoted as
For the IDST part in (17), by letting
can be derived in a similar manner as
where
Applying (8), $h_{N/2-1}(n)$ can be generalized as
The z-transform of (23) can be denoted as
After applying the register-splitting scheme, Eqs. (21) and (24) can be easily mapped into the modified structures shown in Fig. 3. Again, the proposed algorithm and architecture achieve a 50% reduction in computation cycles compared with [2], [20], [21]. That is, the throughput rate can be doubled at the same operating frequency.
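The inverse transform can be folded in the same way. The sketch below assumes an even N and the common textbook 1/N normalization for the IDFT, which may differ from the convention of Eq. (14). The symbols c[k], d[k], P[n], and Q[n] are introduced only for illustration.

```latex
\[
\begin{aligned}
x[n] &= \frac{1}{N} \sum_{k=0}^{N-1} y[k]\, W_N^{-nk}
      = \frac{1}{N} \sum_{k=0}^{N/2-1} \Bigl( y[k]\, W_N^{-nk} + y[N-1-k]\, W_N^{(k+1)n} \Bigr)
      = \frac{1}{N}\, W_N^{n/2} \bigl( P[n] + j\, Q[n] \bigr), \\
P[n] &= \sum_{k=0}^{N/2-1} c[k] \cos\bigl((k+\tfrac{1}{2})\theta_n\bigr),
      \qquad c[k] = y[k] + y[N-1-k], \\
Q[n] &= \sum_{k=0}^{N/2-1} d[k] \sin\bigl((k+\tfrac{1}{2})\theta_n\bigr),
      \qquad d[k] = y[k] - y[N-1-k],
      \qquad \theta_n = \frac{2\pi n}{N}.
\end{aligned}
\]
```

Compared with the forward transform, only the sign of the sine part and the direction of the half-sample phase rotation change, so the same recursive datapath can serve both directions.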
Application and Chip Implementation
In this paper, we aim to design a low-computation cycle (i.e., high throughput) and power-efficient (i.e., cost-effective) recursive DFT/IDFT architecture for the high channel density DTMF detector in the VoP application. To this end, we optimize the target design in two steps. First, according to the dataflow of DTMF detection shown in Fig. 4 [5], DTMF detection for one telephone channel [5] requires 14 different recursive DFT computations: six 106-sample frames and eight 212-sample frames. We therefore propose a high channel density DTMF detector that handles both 212- and 106-sample frames based on the proposed recursive core architecture, as shown in Fig. 5. In the proposed architecture, the first 106-sample frame needs the full 106 clock cycles because it involves an extra 53 clock cycles of input data latency. The other five 106-sample frames require only 53 × 5 clock cycles, and the eight 212-sample frames require only 106 × 8 clock cycles. In addition, the RDFT unit needs 14 reset clock cycles, one to initialize each frame computation. In total, DTMF detection for one channel requires only 1,233 clock cycles per window (106 + 53 × 5 + 106 × 8 + 14). In contrast, with the second-order Goertzel structure, one channel would require 2,346 clock cycles per window, almost twice the latency of the proposed framework.
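The cycle counts quoted above can be checked with a few lines of arithmetic. In this sketch the breakdown for the proposed design follows the text; the breakdown for the Goertzel structure (N cycles per N-sample frame plus the same 14 reset cycles) is an assumption that happens to reproduce the stated 2,346. The function names are hypothetical.

```python
# Frame mix per detection window, as stated in the text.
FRAMES_106 = 6
FRAMES_212 = 8
RESET_CYCLES = FRAMES_106 + FRAMES_212  # one reset cycle per frame, 14 in total


def proposed_cycles():
    """Cycles per window for the proposed N/2-cycle recursive core."""
    first_106 = 106                      # 53 compute cycles + 53 cycles of input latency
    other_106 = 53 * (FRAMES_106 - 1)    # remaining five 106-sample frames
    all_212 = 106 * FRAMES_212           # eight 212-sample frames
    return first_106 + other_106 + all_212 + RESET_CYCLES


def goertzel_cycles():
    """Cycles per window for a second-order Goertzel core.

    Assumption: N cycles per N-sample frame plus the same reset overhead.
    """
    return 106 * FRAMES_106 + 212 * FRAMES_212 + RESET_CYCLES


if __name__ == "__main__":
    print(proposed_cycles())   # 1233
    print(goertzel_cycles())   # 2346
```

The ratio 2346/1233 is about 1.9 rather than exactly 2 because of the one-time input latency of the first frame and the reset cycles, which do not shrink with the folding.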
The high channel density DTMF detector as depicted in Fig. 5 consists of the recursive DFT (RDFT) units, an input unit, and a control unit. The behaviors of the above units are described as follows:
RDFT Unit: The RDFT unit depicted in Fig. 5 consists of one pre-processing element and one recursive processing element (PE). The pre-processing element provides the intermediate data $s_k$ and $r_k$ to the recursive PE that follows. Recalling (5), (11), (18), and (22), the proposed VLSI algorithm needs only N/2 clock cycles to produce each output data sequence.
Input Unit: The input unit is composed of a dual-port SRAM that can store 318 complex data samples. It serves two input buffer sizes: 106 and 212 samples. With proper scheduling, the input unit provides the data pair x[n] and x[N−1−n] to the pre-processing element of the RDFT unit.
Control Unit: The control unit acts not only as the data sequence controller but also as a parameter controller, which feeds the proper coefficients to the RDFT units. Since the input and output data of the proposed architecture are all handled serially, the desired output data are obtained every N/2 clock cycles.
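The division of labor among these units can be illustrated in software. The following is a minimal floating-point sketch, not the authors' RTL: the function names are hypothetical, N is assumed even, and the recursive PE is modeled as a Goertzel-style second-order recursion fed in reverse order, which is one possible realization of an N/2-step recursion and may differ from the structures in Figs. 1 and 2.

```python
import cmath
import math


def preprocess(x):
    """Pre-processing element: pair x[n] with x[N-1-n] (sum and difference)."""
    n_points = len(x)
    if n_points % 2:
        raise ValueError("this sketch assumes an even frame length N")
    half = n_points // 2
    sums = [x[n] + x[n_points - 1 - n] for n in range(half)]
    diffs = [x[n] - x[n_points - 1 - n] for n in range(half)]
    return sums, diffs


def folded_recursive_dft_bin(x, k):
    """Return X[k] using second-order recursions that run for only N/2 steps."""
    n_points = len(x)
    sums, diffs = preprocess(x)
    theta = 2.0 * math.pi * k / n_points
    coeff = 2.0 * math.cos(theta)        # the single real recursion coefficient
    v1 = v2 = w1 = w2 = 0.0
    # Feed the folded data in reverse order: v[m] = u[m] + coeff*v[m-1] - v[m-2].
    for s, d in zip(reversed(sums), reversed(diffs)):
        v1, v2 = s + coeff * v1 - v2, v1
        w1, w2 = d + coeff * w1 - w2, w1
    cos_part = math.cos(theta / 2.0) * (v1 - v2)   # sum of sums[n]*cos((n+1/2)*theta)
    sin_part = math.sin(theta / 2.0) * (w1 + w2)   # sum of diffs[n]*sin((n+1/2)*theta)
    # Undo the half-sample phase shift introduced by the folding.
    return cmath.exp(1j * theta / 2.0) * (cos_part - 1j * sin_part)


def direct_dft_bin(x, k):
    """Reference: direct evaluation of the DFT definition for one bin."""
    n_points = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_points)
               for n in range(n_points))


if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 697 * n / 8000) for n in range(212)]
    err = max(abs(folded_recursive_dft_bin(frame, k) - direct_dft_bin(frame, k))
              for k in range(212))
    print(err < 1e-7)
```

The loop body executes N/2 times per bin, compared with N times for a conventional Goertzel recursion over the unfolded input. The cost is the extra adder pair in `preprocess` and a second recursion for the sine part.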
Next, we use a bit-level SNR simulation to estimate the appropriate word-length under the ITU specification [3], in order to further reduce the chip area and power consumption. The DTMF detector must operate properly at an SNR of 15 dB or higher. We therefore set up the simulation environment depicted in Fig. 6 at 15 dB with an additive white Gaussian noise (AWGN) channel model, and consider only the DFT part at the receiver side of the DTMF detector. In Fig. 6, the input signal x[n] passes through an IDFT block and then propagates through the channel; these operations run in floating point. At the receiver side, the received signal is quantized to a fixed number of bits and the DFT is computed in fixed point. We simulate the system with 212/106-sample frames at the 8 DTMF signal frequency bins: 697, 770, 852, 941, 1209, 1336, 1477, and 1633 Hz, as shown in Fig. 4. In Fig. 7, the x-axis and y-axis denote the data word-length and the overall system output SNR, respectively. The output SNR saturates as the data word-length increases. The proposed recursive architecture evidently needs only 9-bit resolution, less than the 10 bits of the second-order Goertzel structure. That is, the proposed architecture needs fewer hardware resources to meet the ITU performance requirements. Conversely, for the same word-length, the proposed design offers a larger design margin, and hence better system performance, than the Goertzel-based design. Because a 3-bit design margin is sufficient in this case, we choose a data word-length of 12 bits.
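The saturation of the output SNR with word-length can be illustrated with a much simplified experiment. The sketch below is not the simulation of Fig. 6 and is not meant to reproduce Fig. 7: it only quantizes the received time-domain samples and does not model the fixed-point arithmetic inside the recursive DFT. The test tone pair (697 Hz and 1209 Hz), the amplitudes, the signed fractional number format, and the function name are all assumptions made for illustration.

```python
import math
import random


def output_snr_db(word_length, n_samples=212, fs=8000.0, snr_db=15.0, seed=0):
    """SNR of a quantized noisy two-tone frame relative to the clean frame."""
    rng = random.Random(seed)
    # Clean frame: one low-group and one high-group DTMF tone (assumed digit).
    clean = [0.25 * math.sin(2 * math.pi * 697 * n / fs)
             + 0.25 * math.sin(2 * math.pi * 1209 * n / fs)
             for n in range(n_samples)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    p_sig = sum(s * s for s in clean) / n_samples
    p_noise = sum(v * v for v in noise) / n_samples
    # Scale the noise so that the channel SNR of this frame is exactly snr_db.
    gain = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    noisy = [s + gain * v for s, v in zip(clean, noise)]
    # Signed fractional format covering [-1, 1) with the given word-length.
    step = 2.0 ** (1 - word_length)
    lo, hi = -1.0, 1.0 - step
    quantized = [min(hi, max(lo, round(v / step) * step)) for v in noisy]
    p_err = sum((q - s) ** 2 for q, s in zip(quantized, clean)) / n_samples
    return 10 * math.log10(p_sig / p_err)


if __name__ == "__main__":
    for bits in range(4, 17, 2):
        print(bits, round(output_snr_db(bits), 2))
```

For short word-lengths the quantization error adds to the channel noise and lowers the output SNR. Once the quantization step is small compared with the noise, the curve flattens at the channel SNR, which is the saturation behavior described in the text.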
Concerning the chip implementation, our target is a 212/106-point DFT/IDFT for the high channel density DTMF detector [9]-[11]. The ITU timing specification requires that the durations of DTMF signal detection and non-detection be at least 40 ms and less than 23 ms, respectively. At a sampling rate of 8 kHz, a 106-sample frame corresponds to a 13.3 ms window. After each window, the detected signal is compared with the last and second-to-last values. If the result of the new window is the same as the last but different from the second-to-last, a new valid DTMF signal has been found [5]. Recall that the proposed architecture requires 1233 clock cycles to finish DTMF detection for one channel in each window. In this paper, the operating frequency and guard time are targeted at 20 MHz and 31.6 ms, respectively. We therefore need only 61.65 µs (i.e., 1233 × 50 ns) to finish one window computation for one channel. Together with the DTMF FSM controller [5], the proposed design can detect up to 128 channels of DTMF signals, which is superior to [4]-[6]. The implementation flow is as follows. First, Cadence NC-Simulator is used for Verilog functional verification, and the outputs of the RTL model are validated against a standard LabVIEW model. Then, the 212/106-point recursive DFT/IDFT architecture with a 12-bit internal word-length is synthesized with Design Compiler in TSMC 0.13 µm CMOS technology. After post-simulation, the critical path at the present stage is 43.12 ns in the TSMC 0.13 µm CMOS process, which fits within the 50 ns clock period at 20 MHz. Consequently, the proposed design is well suited to the DTMF detector system. The floorplan and the post-layout are carried out using Astro. After back-annotation from the Star-RC extractor, post-layout simulation is run in NC-Simulator to verify the functionality. Static timing is signed off with PrimeTime. Finally, power analysis and LVS are done with Astro Rail and Dracula, respectively. After layout, the core area is 0.18 mm². The chip characteristics listed in Table 1 show that the average power dissipation of the proposed high channel density DTMF detector is 1.25 mW at 20 MHz and a 1.2 V supply voltage. Since the proposed design handles 128 DTMF channels, each channel consumes only 9.77 µW (1.25 mW divided by 128). The 212/106-point recursive DFT/IDFT core, whose microphotograph is shown in Fig. 8, has been implemented as a hard IP (intellectual property), so the proposed architecture and chip can be reused in a system-on-a-chip (SoC) platform. The proposed 212/106-point recursive DFT/IDFT design not only meets the 40 ms timing specification of the ITU standard, but also achieves low power consumption owing to the register-splitting scheme and the smaller bit-width requirement compared with the designs of [4]-[6].
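The timing and power figures in this section follow from simple arithmetic, sketched below. The constants are taken from the text; the assumption that all 128 channels are served one after another within a single 106-sample window is ours, and the guard-time accounting of the actual controller is not modeled. Function names are hypothetical.

```python
CLOCK_HZ = 20e6            # operating frequency
CYCLES_PER_WINDOW = 1233   # cycles per channel per window (from the text)
CHANNELS = 128
TOTAL_POWER_W = 1.25e-3    # average power of the detector (Table 1)
FRAME_SAMPLES = 106
SAMPLE_RATE_HZ = 8000


def per_channel_time_s():
    """Processing time for one channel in one window."""
    return CYCLES_PER_WINDOW / CLOCK_HZ


def window_time_s():
    """Duration of one 106-sample window at 8 kHz."""
    return FRAME_SAMPLES / SAMPLE_RATE_HZ


def all_channels_time_s():
    """Time to serve every channel once, assuming sequential processing."""
    return CHANNELS * per_channel_time_s()


def per_channel_power_w():
    """Average power attributed to one channel."""
    return TOTAL_POWER_W / CHANNELS


if __name__ == "__main__":
    print(f"{per_channel_time_s() * 1e6:.2f} us per channel")      # about 61.65 us
    print(f"{all_channels_time_s() * 1e3:.2f} ms for all channels")
    print(f"{window_time_s() * 1e3:.2f} ms per window")            # 13.25 ms
    print(f"{per_channel_power_w() * 1e6:.2f} uW per channel")     # about 9.77 uW
```

Under these assumptions, serving 128 channels takes roughly 7.9 ms, which fits inside one 13.25 ms window.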
Comparison Discussion
In this section, we give a comprehensive comparison, listed in Table 2, in terms of the number of computation cycles for each DFT/IDFT output as well as for the N-point DFT/IDFT calculation, the maximum channel density, the clock period, and the number of real multipliers. Note that a complex multiplication takes $T_m + T_a$. Our proposed work [23], based on the input strength reduction scheme, saves half of the computation cycles for each DFT/IDFT output compared with the existing works [2], [20], [21], at the expense of a slightly increased area cost. Note that in Table 2 we compare our proposed work with the best-case design of [21], the FAST fixed-coefficient recursive DFT (FFR-DFT). The reference structure of [2] is the block diagram shown in Fig. 9.2 of [2]. The recursive algorithm in [22] requires, for example, 2794 computation cycles to obtain all 64-point DFT outputs, whereas the proposed core-type architecture requires 2048. In other words, our proposed work exploiting the input strength reduction scheme has the lowest computation cycle count among the existing structures [2], [20]-[22]. As a consequence, our proposed architecture is capable of providing the highest channel density in the DTMF communication system. The implementation results show that the number of channels of the proposed architecture is double that of the other designs [2], [20], [21]. Because it exploits the register-splitting scheme, the proposed design inherently has higher speed than the recursive structures of [2], [21], [22] and has the same operating frequency as our previous work [20]. According to the critical path comparison in Table 2, the proposed DFT/IDFT fabric has a clock period of $T_m + 2T_a$, and the clock periods in [2], [21], [22] are $T_m + 3T_a$, $T_m + 2T_a$, and $2T_m + 5T_a$, respectively.
As mentioned in Sect. 2, the register-splitting scheme can be used to achieve either high speed or low power. In this article, we use the technique to lower the power consumption, since the speed does not need to be increased [24]. In Table 2, an architecture with a shorter clock period can achieve lower power consumption while keeping the same clock rate. However, in terms of hardware complexity, the proposed DFT/IDFT architecture requires two more multipliers than the previously proposed one [20]. Furthermore, based on the proposed work, we can easily construct a parallel-type recursive DFT/IDFT architecture for other applications such as the matched filter and the equalizer. The parallel-type architecture can significantly reduce the number of computation cycles for the N-point DFT/IDFT from $N^2/2$ to $\frac{N}{2} \cdot \lceil N/P \rceil$, where P is the number of RDFT units and $\lceil \cdot \rceil$ denotes the smallest integer greater than or equal to its argument. Thus, the maximum throughput can be achieved. As a consequence, Table 2 reveals that our proposed architecture offers the lowest computation cycle count (i.e., highest throughput), the maximum channel density, and power efficiency.
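The cycle count of the parallel-type architecture can be tabulated with a short helper. This sketch reads the expression in the text as (N/2)·⌈N/P⌉ for an even N, with P RDFT units each producing one output every N/2 cycles; the function name is hypothetical.

```python
import math


def dft_cycles(n_points, n_rdft_units=1):
    """Cycles to produce all N outputs with P RDFT units working in parallel."""
    if n_points % 2:
        raise ValueError("this sketch assumes an even transform length N")
    if n_rdft_units < 1:
        raise ValueError("at least one RDFT unit is required")
    # Each unit needs N/2 cycles per output and handles ceil(N/P) outputs.
    return (n_points // 2) * math.ceil(n_points / n_rdft_units)


if __name__ == "__main__":
    print(dft_cycles(64))        # 2048, the core-type figure quoted for 64 points
    print(dft_cycles(64, 4))     # 512
    print(dft_cycles(212, 8))    # 2862
```

With a single unit the expression reduces to N²/2, and with P = N units it reaches N/2 cycles, the time one unit needs for a single output.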
Conclusion
A new recursive DFT/IDFT algorithm and architecture based on a hybrid of the input strength reduction scheme, the Chebyshev polynomial, and the register-splitting scheme has been devised in this work. The analysis shows that the proposed VLSI algorithm yields the fewest computation cycles and the highest throughput rate. Moreover, the proposed 212/106-point recursive DFT/IDFT chip design has been successfully implemented in 0.13 µm CMOS technology and consumes 9.77 µW per channel at 20 MHz and a 1.2 V supply voltage.
These features make the proposed high-throughput and power-efficient VLSI architecture well suited to high channel density DTMF systems.
