This paper presents the implementation of a 31-tap FIR Hilbert transform digital filter chip for digital-IF receivers, in order to confirm the effectiveness of our new design method. The design method, which we reported previously, is based on a computation sharing multiplier that uses new horizontal and vertical common subexpression elimination techniques. A 31-tap FIR Hilbert transform digital filter was implemented and fabricated using a 0.35 µm CMOS standard cell library. The chip core contains approximately 33k transistors and occupies 0.86 mm². The chip operates at over 70 MHz. The implementation results show that the proposed Hilbert transformer has the smallest cost factor among the compared designs and is therefore a high-performance filter.
Introduction
The digital Hilbert transformer is a basic signal processing operator used in numerous application domains, e.g., telecommunications, wireless multimedia communication, speech/audio/image processing, and many more. The advantages of digital Hilbert transformer VLSI implementations include the ability to easily program the hardware to accommodate different data rates and modulation formats. For example, by simply scaling the clock frequency, the same chip set can be used across several production lines, e.g., an Inphase/Quadrature (I/Q) demodulator used in HDTV receivers [1] and in an array antenna radar system [5]. These I/Q demodulators must operate at sampling rates in excess of 70 MHz to accommodate intermediate frequencies (IF) in the range of 20-40 MHz.
The digital Hilbert transformer used in the I/Q demodulator chip can be realized as either a finite impulse response (FIR) filter [2]-[5], [7] or an infinite impulse response (IIR) filter [6]. Since the Hilbert transformer is required to shift the phase of the signal by exactly 90 degrees over a wide frequency range, an FIR filter with linear phase is generally the preferred choice. However, a Hilbert transformer based on an FIR filter suffers from amplitude ripple and a large circuit scale.
Manuscript received September 11, 2006. Manuscript revised February 8, 2007. Final manuscript received March 16, 2007.
† The authors are with the Department of Electrical and Electronic Engineering, Faculty of Engineering, Gifu University, Gifu-shi, 501-1193 Japan.
†† The author is with the Department of Bio-system Engineering, Faculty of Engineering, Yamagata University, Yonezawa-shi, 992-8510 Japan.
a) E-mail: yasut@gifu-u.ac.jp
DOI: 10.1093/ietfec/e90-a.7.1376
The primary sources of complexity in a Hilbert transformer built from FIR digital filters are the filter coefficients and the filter word length. These coefficients have traditionally been represented in binary number format and implemented as constant multipliers [2], [3]. One way of implementing more efficient multipliers is to represent the filter coefficients in canonic signed digit (CSD) format [4], [5]. The FIR filter's latency of N clock cycles, where N is the number of coefficients, is the same as that of the Hilbert transformer design. However, comparing two (or more) designs under the assumption of a constant operating time is unfair. For a Hilbert transformer design, the operating time depends on the filter coefficients, since longer interconnects are involved and larger loads must be driven as N grows. It is conceivable that the area saving still leads to a cost-effective design even though the broadcast overhead increases the operating time, but a realistic analysis requires an architecture- and technology-dependent model to justify the tradeoff. We should therefore compare filter designs using some kind of performance function.
In this paper, we present high-performance implementations of digital Hilbert transformers based on our recently proposed constant multiplier [8], [9]. The constant multiplier is designed by using common subexpression elimination (CSE) techniques, which results in low hardware complexity and a small VLSI chip area. The rest of this paper is organized in four sections. Section 2 describes the design method of the digital FIR Hilbert transformer. Section 3 presents the architecture of the digital Hilbert transformer based on a computation sharing multiplier using horizontal and vertical CSE techniques. Section 4 shows the implementation results of the 31-tap digital FIR Hilbert transformer fabricated in a 0.35 µm CMOS process using a standard static logic cell library. The target application of the Hilbert transformer is an I/Q demodulator for digital-IF receivers operating at over 70 MHz. Section 4 also discusses the cost-performance tradeoffs and the scaling of the proposed design. Finally, conclusions are drawn in Sect. 5.
Design Method of FIR Hilbert Transformer

Discrete Hilbert Transform
The Hilbert transformer shifts the phase of the incoming time samples by −90 degrees for positive frequencies and by +90 degrees for negative frequencies. In the frequency spectrum relationship of Eq. (1), X̂(ω) represents the Hilbert transform of X(ω).
The frequency relationship expressed in Eq. (1) is obtained from a system whose impulse response is h(t) = −1/πt. The Hilbert transform relationship for a time signal is given by the convolution integral
where x̂(t) denotes the Hilbert transform of x(t), * represents convolution, and τ is the variable of integration. For discrete-time sequences, the Hilbert transform frequency relationship is slightly different, because the spectra of discrete signals are periodic, with one period spanning −π to +π. In Ref. [10], Rabiner and Schafer give the frequency response of an ideal Hilbert transformer as:
The impulse response sequence becomes:
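For reference, a common textbook statement of these two results is sketched below. It is an assumption that this matches the notation of Ref. [10] and of this paper; sign and scaling conventions differ between references, and the symbols $H(e^{j\omega})$ and $h(n)$ are chosen here for illustration only.

```latex
H(e^{j\omega}) =
\begin{cases}
-j, & 0 < \omega < \pi \\
+j, & -\pi < \omega < 0
\end{cases}
\qquad
h(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} H(e^{j\omega})\, e^{j\omega n}\, d\omega =
\begin{cases}
\dfrac{2}{\pi}\,\dfrac{\sin^{2}(\pi n/2)}{n}, & n \neq 0 \\[1ex]
0, & n = 0
\end{cases}
```

In this form the response vanishes at every even index and is odd in n, which agrees with the −90 degree shift for positive frequencies stated at the start of this subsection.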
Note that h(N) = 0 for all even N and h(−N) = −h(N), facts which are exploited in order to reduce the required circuitry. Of course, the ideal transformer is not physically realizable, since it is non-causal and of infinite duration. In order to create a realizable filter, the impulse response is windowed and shifted to make it causal.
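The windowing and shifting step can be illustrated with a minimal sketch. It assumes the textbook ideal response 2/(πn) at odd indices and a Hamming window; the window choice and the function name `hilbert_taps` are assumptions made for illustration and are not taken from the paper, whose coefficients were optimized separately.

```python
import math

def hilbert_taps(num_taps=31):
    """Windowed, causal approximation of the ideal Hilbert impulse response."""
    if num_taps % 2 == 0:
        raise ValueError("this sketch assumes an odd number of taps")
    mid = (num_taps - 1) // 2            # shift that makes the filter causal
    taps = []
    for k in range(num_taps):
        n = k - mid                      # index into the non-causal response
        # Ideal response: zero at even indices, 2/(pi*n) at odd indices
        ideal = 0.0 if n % 2 == 0 else 2.0 / (math.pi * n)
        # Hamming window (assumed choice, not from the paper)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    return taps

h = hilbert_taps(31)
```

Every second coefficient of `h` is zero and the remaining ones are anti-symmetric about the centre tap, which is the structure the hardware exploits.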
Architectural Design of FIR Hilbert Transformer
The FIR Hilbert transformer is usually realized as a parallel connection of a case 3 FIR filter (i.e., an anti-symmetric impulse response of odd length) and delay elements. The analytic signal can be generated by using a Hilbert transformer. The basic causal system for generating the analytic signal differs from the non-causal one only by a time shift of (N − 1)/2 samples, so the causal system can be realized with linear phase characteristics.
The actual filter is based on the design of a 31-tap Hilbert transform filter that was optimized for minimum stopband energy. An ideal 31-tap Hilbert transform filter was first designed using a program based on the design technique presented in Ref. [10]. The resulting floating-point coefficients were then converted to an optimized CSD representation as shown in Fig. 2, where n denotes −1. The frequency response of the CSD Hilbert transform filter is shown in Fig. 3. The passband lies between 0.05 f_s and 0.45 f_s, where f_s is the sampling frequency. The passband ripple is less than 1 dB and the stopband attenuation is greater than 40 dB.
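The CSD notation used for the coefficients can be illustrated with a small sketch. It converts an already quantized integer coefficient into canonic signed digits, written with n for −1 as in the text. The function names `to_csd` and `csd_value` are hypothetical, and the sketch performs a plain conversion only; it does not reproduce the coefficient optimization behind Fig. 2.

```python
def to_csd(value, width):
    """Return the CSD digits of an integer, MSB first, with 'n' meaning -1."""
    digits = []
    v = value
    while v != 0:
        if v % 2 == 0:
            d = 0
        else:
            d = 2 - (v % 4)      # +1 if v = 1 (mod 4), -1 if v = 3 (mod 4)
        digits.append(d)
        v = (v - d) // 2         # remove the digit and move to the next bit
    if len(digits) > width:
        raise ValueError("value does not fit in the requested width")
    digits += [0] * (width - len(digits))
    symbols = {1: "1", 0: "0", -1: "n"}
    return "".join(symbols[d] for d in reversed(digits))

def csd_value(csd):
    """Evaluate a CSD string back to its integer value."""
    weights = {"1": 1, "0": 0, "n": -1}
    total = 0
    for ch in csd:
        total = 2 * total + weights[ch]
    return total
```

For example, `to_csd(7, 4)` gives `100n` (8 − 1), which needs one subtracter instead of the two adders required by the binary form 0111.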
FIR Hilbert Transformer Chip Architecture
Folded Transpose Filter Structure
Parallel architectures typically require a substantial amount of die area and consume a large amount of power. For example, it has been shown in Ref. [4] that high-performance fixed-coefficient FIR filters can be realized with an architecture based on the transposed filter structure. The advantages of this structure include a short critical path (data broadcast delay + one multiplier + one adder + one register setup time). Furthermore, it is well known that for linear phase filters the folded transpose filter structure may be used, halving the number of required general multipliers. In addition, more efficient multipliers can be implemented by representing the filter coefficients in CSD format.
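The multiplier saving of the folded transposed structure can be sketched in software for the anti-symmetric, odd-length case. This is a behavioural illustration under stated assumptions, not the chip's data path: the function names are hypothetical, floating-point arithmetic stands in for the fixed-point hardware, and the centre coefficient is assumed to be zero.

```python
def fir_direct(h, x):
    """Reference convolution sum: one multiplication per tap."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

def fir_transposed_folded(h, x):
    """Transposed FIR for h[N-1-k] = -h[k], sharing one product per tap pair."""
    n_taps = len(h)
    half = (n_taps - 1) // 2
    state = [0.0] * (n_taps - 1)         # registers between the adders
    y = []
    for sample in x:
        # The input is broadcast to only `half` multipliers
        prod = [h[k] * sample for k in range(half)]

        def tap(i):
            if i < half:
                return prod[i]
            if i == half:
                return 0.0               # centre tap assumed to be zero
            return -prod[n_taps - 1 - i] # mirrored tap reuses the product

        y.append(tap(0) + state[0])
        for i in range(n_taps - 2):
            state[i] = tap(i + 1) + state[i + 1]
        state[n_taps - 2] = tap(n_taps - 1)
    return y
```

Both functions produce the same output for an anti-symmetric coefficient set, while the folded version evaluates only (N − 1)/2 products per input sample.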
Constant Multiplier Structure
Multiple constant multiplication (MCM) can be implemented efficiently by using dedicated shift-and-add multipliers. Common subexpression elimination (CSE) is one way to tackle the MCM problem, and is a possible method for optimizing the area of FIR filters through the reduction of the multiplier block logic [8], [9], [11]-[16]. In general, the goal of CSE can be defined as follows.
1. Identify multiple patterns in the coefficient set.
2. Remove these patterns and calculate them only once.
Our CSE techniques [8], [9] provide an efficient way to find the correct bit patterns for horizontal and vertical CSE. The proposed CSE is stated as the problem of minimizing the numbers of delay and adder/subtracter blocks needed to perform all of the multiplications. That is, the objective cost function (CF) to be minimized is written as:
where N_reg and N_as are the numbers of registers and adders/subtracters, respectively, and β is a weight. We set the parameter β to 0.15-0.2 on the assumption that the FIR filters are fabricated in a 0.35 µm standard CMOS process. Using the proposed method, the MCM area of FIR filters, including the Hilbert transform filter, is reduced by an average of 20% [9]. The idea of the proposed CSE is demonstrated on the 31st order Hilbert transform filter design shown in Fig. 4. The proposed technique is also described by the pseudo C language code shown in Fig. 5. The proposed CSE is completed in the following two steps.
Step 1: Horizontal CSE Method
In the horizontal CSE method, we must examine all combinations of non-zero bit patterns in a coefficient. Since a bit pattern can only be eliminated once, we must also detect occurrences of the same pattern that overlap each other. For example, the valid non-zero bit patterns of the coefficient 010n010n are summarized in Table 1, where n denotes −1.
This table shows that we should select 10n as the most frequent non-zero bit pattern. Note that 10n and 10001 are equivalent in their implementation to n01 and n000n, respectively. Table 2 summarizes the frequency of the valid non-zero bit patterns in the 31st order Hilbert transform filter coefficients. In this case, pattern 10n is identified as the most frequent for the coefficients. If two patterns have the same frequency (> 1), the smaller pattern is chosen, because adder/subtracter structures with a larger wordlength cause a larger implementation area. The most common horizontal subexpressions resulting from the proposed method are extracted from the coefficient table, represented in canonic signed digit (CSD) form, shown in Fig. 4(a).
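The pattern search of Step 1 can be sketched as follows. This is an illustration of the counting idea only, not the pseudo code of Fig. 5: it enumerates every pair of non-zero digits in one CSD coefficient, treats a pattern and its negation as the same subexpression, and counts candidates including overlapping ones. The function names are hypothetical, and the selection of non-overlapping occurrences is left out.

```python
from collections import Counter

NEGATE = {"1": "n", "n": "1", "0": "0"}

def normalize(pattern):
    """Treat a pattern and its negation as one subexpression (n01 -> 10n)."""
    if pattern[0] == "n":
        return "".join(NEGATE[d] for d in pattern)
    return pattern

def horizontal_patterns(coeff):
    """Count two-digit patterns formed by pairs of non-zero CSD digits."""
    positions = [i for i, d in enumerate(coeff) if d != "0"]
    counts = Counter()
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            i, j = positions[a], positions[b]
            # Keep only the two end digits and the zeros between them
            pattern = coeff[i] + "0" * (j - i - 1) + coeff[j]
            counts[normalize(pattern)] += 1
    return counts

def most_frequent(coeffs):
    """Pick the most frequent pattern; ties go to the shorter pattern."""
    total = Counter()
    for c in coeffs:
        total += horizontal_patterns(c)
    return min(total, key=lambda p: (-total[p], len(p)))
```

Applied to the coefficient 010n010n, this enumeration reports 10n more often than any other candidate, in line with the choice described above.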
Step 2: Vertical CSE Method
The remaining non-zero bits are then examined for the optimum vertical common subexpressions. Pattern identification in vertical CSE is the same as in horizontal CSE. In this case, the vertical pattern n0n shown in Fig. 4(b) is identified as the most frequent for the coefficients. Figure 4(b) also displays the final coefficient table of the 31st order Hilbert transform filter after processing by the pseudo code.
Adder Structure
Carry Save Adder
For a high-speed architecture, a simple carry propagate adder (CPA) is not sufficient to achieve the required throughput rate†. This problem can be dealt with in one of two ways: by using a high-speed CPA technique such as a conditional sum adder or carry look-ahead adder, or by eliminating the need for carry propagation through a redundant addition scheme such as signed-digit arithmetic or carry-save arithmetic. In this paper, a carry-save adder (CSA) is chosen rather than a fast CPA, because of the large number of such adders that would be required in the filter.
The tradeoff is that two z⁻¹ registers are required: one to delay the sum and one to delay the carry. This is an efficient tradeoff: doubling the register hardware leads to an area increase of 10%, but results in a speed increase by a factor of F:
where b is the wordlength, T_c is the carry-ripple delay per bit, and T_s is the sum delay per bit. Typically T_c/T_s ≤ 1.0, which means that the CSA provides a speedup by a factor of up to b [18].
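The carry-save principle can be shown with a short bit-level sketch. It models a row of full adders working independently on each bit position, so no carry ripples between positions until one final ordinary addition. The function names are hypothetical, and Python integers stand in for fixed-width hardware words.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return a sum vector and a carry vector."""
    partial_sum = a ^ b ^ c                        # per-bit sum, no carry chain
    carry = ((a & b) | (b & c) | (a & c)) << 1     # per-bit majority, shifted up
    return partial_sum, carry

def accumulate(values):
    """Add many operands while keeping the result in carry-save form."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)             # sum and carry are kept apart
    return s + c                                   # one final carry-propagate add
```

The pair `(s, c)` corresponds to the two registers mentioned above, one for the sum and one for the carry; only the last line performs a carry-propagating addition.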
Vector Merge Adder
A vector merge adder (VMA) has to be used in the final stage to add the final sum and carry at the base of the carry-save adder tree. The vector merge adder is a traditional adder responsible for calculating the final filter output, and it is the only part of the system whose combinational delay is longer than that of a single full adder. The natural choice in a VLSI system is to implement the VMA with a sophisticated addition technique, such as carry look-ahead addition or square-root carry-select addition, to improve the delay. As the filter in this paper requires a high-speed architecture†, we choose the square-root carry-select adder.
† For example, Zimmermann [17] has reported that the delays of a 16-bit carry propagate adder and a 16-bit carry look-ahead adder are 12-20 ns and 8-11 ns, respectively, whereas those of a 16-bit carry-select adder (CSLA) and a carry-save adder (CSA) are 7-10 ns and 6-8 ns, respectively. In this paper, the target of the Hilbert transformer is an I/Q demodulator with an operating speed of 70 MHz (≈ 1/14 ns), so we consider the CSLA and the CSA to be the preferable choices.
Final FIR Hilbert Transformer Structure
By using the structures presented in the previous subsections, we obtain the final FIR Hilbert transformer structure shown in Fig. 6, which yields a smaller chip size, a higher speed, and lower power dissipation after ASIC implementation.
Hilbert Transformer VLSI Implementation
Considered Application Example: Wideband Radar Receiver
In this application example, we assume that the receiver with the FIR Hilbert transformer is to be used in an array antenna radar system [5]. This receiver consists of two parts: an RF part with only one mixer stage and an I/Q demodulation part. The signal is bandpass sampled at IF by the analog-to-digital converter (ADC). The I/Q demodulation is performed in the digital domain and is based on the FIR Hilbert transformer (HT). Such a receiver is shown in Fig. 7. The input signal to the receiver is in the range of 8-12 GHz. The IF is 360 MHz and the sampling frequency of the ADC is 80 MHz. After the I/Q demodulation, the sampling frequency is 40 MHz.
Implementation and Testing Results
We designed an experimental chip for the 31st order FIR Hilbert transformer, as shown in Fig. 8. The chip uses a 0.35 µm 3.3 V CMOS process provided through the VLSI Design and Education Center (VDEC), the University of Tokyo. The target library is the VDEC EXD library for Rohm 0.35 µm CMOS technology. The Register Transfer Level (RTL) and gate-level netlists are written in Verilog-HDL. We use Synopsys Design Compiler for logic synthesis, Avant! Apollo for physical layout, and Cadence Verilog-XL for simulation. Table 3 summarizes the main specifications of the FIR Hilbert transformer. From the implementation results, the core size is 0.86 mm² and it integrates about 33k transistors. In order to verify the operation of the chip, a pseudo noise signal generated by a data generator (Tektronix DG-2040) is input to the proposed FIR Hilbert transformer. The resulting waveforms were digitized with a Tektronix TDS-7054 digital oscilloscope. Figure 9 shows a digital oscilloscope trace of the MSB output of a chip operating at 71 MHz. The power dissipation is 263 mW at 70 MHz.
Figure 10 shows the technical roadmap of Hilbert transformers reported over the past decade or so. The straight line represents the trend for semi-custom LSI design. If a Hilbert transformer that uses a new approach breaks the trend line, we can say that it is characterized by a high bit rate and a small circuit scale. Trends regarding the chip area of Hilbert transformers are summarized in Fig. 10(a). We found that this is the minimum reported area for a Hilbert transformer fabricated using semi-custom design, and we estimate a 47% reduction in area compared with the earlier Hilbert transformer made using the same fabrication technology. Figure 10(b) shows the trends regarding the sampling time of Hilbert transformers.
From this figure, we found that this is the shortest reported sampling time for a Hilbert transformer, and we estimate a 24% speed-up compared with the earlier Hilbert transformer made using the same fabrication technology.
As the reported Hilbert transformers have been fabricated in various types of integrated circuits (e.g., FPGA, ASIC, gate array) and have been designed from analog or digital FIR/IIR filters, it is unfair to evaluate them under a single condition. Therefore, in order to compare the various filter designs, we define a cost factor C for Hilbert transformer evaluation that reflects the performance-cost ratio:
$$C = \frac{A \times T}{N \times b}$$
where A × T is the area-time product and N × b is the taps-wordlength product. This equation means that as C decreases, the area-time product of the Hilbert transformer per unit of taps-wordlength product becomes smaller. Table 4 summarizes the comparison of different Hilbert transformers. The table shows that the proposed Hilbert transformer has the smallest cost factor and is therefore the best-performing filter regardless of process, design style, etc.
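The cost factor described above, read as the area-time product divided by the taps-wordlength product, can be evaluated with a few lines of code. The sketch below is illustrative only: the function name, the choice of mm² and ns as units, and in particular the wordlength of 12 bits in the usage lines are assumptions, since the wordlength and the units used in Table 4 are not stated in the text.

```python
def cost_factor(area_mm2, period_ns, taps, wordlength_bits):
    """Area-time product per taps-wordlength product (smaller is better)."""
    if min(area_mm2, period_ns, taps, wordlength_bits) <= 0:
        raise ValueError("all arguments must be positive")
    return (area_mm2 * period_ns) / (taps * wordlength_bits)

# Area and clock rate are the values reported for the chip;
# the wordlength below is a hypothetical placeholder, not a reported figure.
period_ns = 1e3 / 71.0                 # clock period at 71 MHz, in ns
example = cost_factor(0.86, period_ns, taps=31, wordlength_bits=12)
```

Because the units of A and T scale C directly, values computed this way are comparable only between designs evaluated with the same units.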
Conclusions
We have presented the design and implementation of a 31-tap FIR Hilbert transform digital filter chip. The architecture of the filter is based on a computation sharing multiplier using horizontal and vertical common subexpression elimination techniques. The chip has been implemented in a 0.35 µm CMOS process technology with an area of 0.86 mm². The chip has a clock frequency of 71 MHz and a power consumption of 263 mW at 70 MHz.
His current research interests include electromagnetic compatibility (EMC) analysis, lossy transmission line modeling, and microwave system and high-speed PCB signal integrity analysis. Dr. Sekine is a member of IEEE.
