FPGA implementation of digital timing recovery in software radio receiver by Ng, TS & Wu, YC
Title FPGA implementation of digital timing recovery in software radio receiver
Author(s) Wu, YC; Ng, TS
Citation The 1st Asia-Pacific Conference on Quality Software Proceedings, Tianjin, China, 4-6 December 2000, p. 703-707
Issued Date 2000
URL http://hdl.handle.net/10722/46229
Rights
©2000 IEEE. Personal use of this material is permitted. However,
permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for
resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be
obtained from the IEEE.
FPGA Implementation of Digital Timing Recovery in Software Radio Receiver 
Yik-Chung Wu and Tung-Sang Ng 
{ycwu, tsng}@eee.hku.hk
Department of Electrical and Electronic Engineering 
The University of Hong Kong 
Pokfulam Road, Hong Kong, China 
Tel.: +852 2857 8406
Fax: +852 2559 8738
ABSTRACT 
This paper describes an implementation of an all-digital timing
recovery scheme. A squaring nonlinearity is employed to generate
the timing estimate, and an IIR filter is used to extract the
spectral component at the symbol rate. The hardware design is
written in VHDL and realized in an FPGA. The whole design fits
into an Altera EPF10K70 FPGA chip, with 95.5% utilization of
logic elements and 22% utilization of memory bits. The
implementation exploits two features of FPGAs: easy
implementation of look-up tables and variable data precision at
different nodes.
I. INTRODUCTION
As the complexity of digital devices has increased dramatically
in the last decade, more and more functions in a communication
receiver can be shifted from analog to digital. An obvious
example is the development of software radio. In a software
radio receiver, the received signal is sampled by a free-running
local clock at IF, and all subsequent operations are performed in
the digital domain. The traditional symbol timing recovery
algorithm, which alters the timing phase of the sampling clock,
must therefore be replaced by a fully digital realization.
When implementing an algorithm in hardware, we face a tradeoff
between performance and flexibility. An ASIC, which has the
advantages of low power consumption and high speed, is designed
for one specific task only. A DSP, which can be programmed to
implement different functions, requires many clock cycles to
complete a task. An FPGA lies between the two extremes: it is
configurable, and parallelism can be exploited to achieve high
speed. Numerous communication-system applications implemented
using FPGAs are reported in the literature (e.g., [6], [7], [8]).
Apart from flexibility and high performance, an FPGA has two
further advantages: look-up tables (LUTs) are easy to implement,
and different precision can easily be accommodated at various
nodes in the system, so that exactly the required processing is
realized.
While VLSI implementations of digital timing recovery algorithms
are popular, very few have been implemented on an FPGA platform.
This paper presents a digital timing recovery scheme based on the
squaring method [1], with an infinite impulse response (IIR)
filter replacing the DFT operation [2]. The system was
implemented in VHDL and realized in an FPGA. Emphasis is placed
on how the features of the FPGA are exploited in the
implementation of LUTs and of variable data precision at
different nodes.
This paper is organized as follows. The system is described in
Section II. Section III presents how the property of
user-definable precision is exploited, and Section IV discusses
how look-up tables are used to simplify the design. Simulation
results are presented in Section V. Conclusions are drawn in
Section VI.
II. SYSTEM DESCRIPTION
The overall system is shown in Figure 1. The sampled signal is
first filtered by the digital matched filter and then split into
two paths. The lower path estimates the unknown timing offset.
The upper path, which is a first-in-first-out (FIFO) buffer,
holds the samples until the timing estimate has been determined.
In the timing offset estimation path, an IIR filter is used to
extract the harmonic at the symbol rate [2]. Finally, the
estimated timing information is output to the interpolator. In
the following, some important components are discussed in
detail.
Figure 1. Block diagram of the overall system 
0-7803-6253-5/00/$10.00 ©2000 IEEE.
a. Interpolation 
Interpolation in timing recovery is the process of calculating
sample values in between the existing ones (i.e., computing
samples $r(k - \mu_m T_s)$ from $r(k)$), such that optimum
decisions can be made from synchronized samples. The key equation
[3] of the interpolation process is

$$y[(m_k + \mu_k) T_s] = \sum_{i=I_1}^{I_2} x[(m_k - i) T_s] \, h_I[(i + \mu_k) T_s].$$

An interpolant is computed using $I = I_2 - I_1 + 1$ adjacent
input samples about the base-point $x(m_k T_s)$ and $I$ samples
of the impulse response $h_I$ of the interpolation filter,
identified by the fractional interval $\mu_k$.
The cubic interpolator, a member of the family of
polynomial-based approximating interpolation filters, works well
in typical modem applications [4]. It can be implemented using
either online calculation or the LUT method. For online
calculation, Farrow [5] proposed an efficient structure for
calculating interpolants. For the LUT method, $\mu_k$ has to be
quantized into, say, $L$ levels. The sets of impulse response
samples corresponding to each quantized $\mu_k$ are then
pre-computed and stored in a LUT, and the correct set is
addressed by the fractional interval $\mu_k$. The advantage of
this method is that the computing burden is independent of the
impulse response. However, the recovered clock suffers from a
timing jitter of maximum value $T_s / L$.
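The online calculation can be illustrated with a minimal Python
sketch of a Farrow-style cubic interpolator. It assumes the cubic
Lagrange polynomial as the interpolating filter; the coefficient
matrix `B`, the function name `farrow_cubic` and the sample
ordering are illustrative choices and are not taken from the
hardware design described in this paper.

```python
from fractions import Fraction as F

# Assumed coefficients of a cubic Lagrange interpolator.
# Row l holds the weights of mu**l; the columns correspond to the
# input samples ordered x(m+2), x(m+1), x(m), x(m-1).
B = [
    [F(0),     F(0),     F(1),     F(0)],
    [F(-1, 6), F(1),     F(-1, 2), F(-1, 3)],
    [F(0),     F(1, 2),  F(-1),    F(1, 2)],
    [F(1, 6),  F(-1, 2), F(1, 2),  F(-1, 6)],
]


def farrow_cubic(samples, mu):
    """Interpolate x(m + mu) from samples = [x(m+2), x(m+1), x(m), x(m-1)]."""
    # Four fixed FIR branches, one per power of mu.
    v = [sum(b * x for b, x in zip(row, samples)) for row in B]
    # Horner evaluation: three multiplications by mu.
    y = v[3]
    for l in (2, 1, 0):
        y = y * mu + v[l]
    return y
```

Because the FIR branches do not depend on the fractional
interval, only the three multiplications by `mu` change from one
interpolant to the next. A cubic interpolator reproduces any
polynomial of degree three or lower exactly, which gives a simple
check of the sketch.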
b. Squaring Timing Recovery 
The received signal for a linear modulation (PAM, QAM, PSK) is
given by

$$r(t) = \sum_{n=-\infty}^{\infty} a_n \, g_T(t - nT - \varepsilon T) + n(t),$$

where $a_n$ is the transmitted data symbol, $g_T(t)$ is the
transmission signal pulse, $T$ is the symbol duration, $n(t)$ is
Gaussian white noise with power spectral density $N_0$, and
$\varepsilon$ is an unknown delay that is assumed to be slowly
varying.
After the received signal passes through the RF front end, where
out-of-band noise is rejected, $r(t)$ is sampled at rate
$1/T_s = N/T$ and filtered by a digital matched filter with
impulse response $g_R(kT/N)$. The output of the digital matched
filter is then given by

$$\tilde{r}_k = \sum_{n=-\infty}^{\infty} a_n \, g(kT/N - nT - \varepsilon T) + \tilde{n}(kT/N),$$

where $g(t) = g_T(t) * g_R(t)$ and $\tilde{n}(kT/N)$ is the
filtered and sampled noise.
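M. Oerder and H. Meyr [1] proposed that the unknown timing delay
can be estimated by computing the complex Fourier coefficient at
the symbol rate for every segment of $LN$ samples ($L$ symbols)
of $|\tilde{r}_k|^2$. That is, the estimate is given by

$$\hat{\varepsilon}_m = -\frac{1}{2\pi} \arg\left( \sum_{k=mLN}^{(m+1)LN-1} |\tilde{r}_k|^2 \, e^{-j 2\pi k / N} \right). \qquad (1)$$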
One obvious implementation of (1) is the DFT. However, since only
the Fourier coefficient at the symbol rate is needed, computing
the other Fourier coefficients is unnecessary. M. Rahnema [2]
showed that if $N = 4$, the discrete Fourier coefficient at the
symbol rate equals the output of an IIR filter with system
transfer function $H(z) = j z^{-1} / (1 - j z^{-1})$ at time
$n = 4L$. This method reduces the computation of the discrete
Fourier transform (DFT), which is a very hardware-demanding
process, to a simple filtering operation.
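The equivalence between the Fourier coefficient at the symbol
rate and the IIR filter output can be checked numerically. The
following Python sketch assumes $N = 4$ samples per symbol and
uses the recursion $y[n] = j\,(y[n-1] + x[n-1])$, which follows
from $H(z) = j z^{-1} / (1 - j z^{-1})$. The function names and
the synthetic test signal are hypothetical and serve only as an
illustration.

```python
import cmath
import math


def coeff_direct(x):
    """Fourier coefficient at the symbol rate, N = 4 samples per symbol."""
    return sum(xk * cmath.exp(-2j * math.pi * k / 4) for k, xk in enumerate(x))


def coeff_iir(x):
    """Output of H(z) = j z^-1 / (1 - j z^-1) at time n = len(x)."""
    y = 0j
    for xk in x:
        y = 1j * (y + xk)  # y[n] = j * (y[n-1] + x[n-1])
    return y


def timing_estimate(x):
    """Timing estimate in symbol periods from the squared samples x."""
    return -cmath.phase(coeff_iir(x)) / (2 * math.pi)


# Hypothetical squared-signal segment of L = 64 symbols with delay 0.2 T.
EPS = 0.2
SEGMENT = [1 + math.cos(2 * math.pi * (k / 4 - EPS)) for k in range(4 * 64)]
```

When the segment length is a multiple of four, the two functions
agree up to rounding, and multiplication by $j$ only swaps the
real and imaginary parts with one sign change, so the recursion
needs no multiplier.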
c. Determination of interpolation control parameters
The interpolator, however, does not use $\hat{\varepsilon}_m$
directly, because the estimated timing delay
$\hat{\varepsilon}_m$ is expressed in units of $T$. An
interpolator needs the fractional interval $\mu_m$, which is in
units of $T_s$, and the basepoint set $n_m$, which is an integer.
Therefore, $\hat{\varepsilon}_m$ must be converted to the
fractional offset time $\mu_m$ and the basepoint set $n_m$. The
relationship between $\hat{\varepsilon}_m$, $\mu_m$ and $n_m$ is
given by

$$\hat{\varepsilon}_m T = \mu_m T_s + n_m T_s.$$

Rearranging terms gives

$$\hat{\varepsilon}_m \times T / T_s = \mu_m + n_m. \qquad (2)$$

The timing delay estimate $\hat{\varepsilon}_m$ can be divided
into five sub-ranges, each of which is handled individually, as
shown in Table 1. The cases for $\hat{\varepsilon}_m \ge -0.125$
are straightforward. For the case
$-0.375 \le \hat{\varepsilon}_m < -0.125$, $n_m$ should equal
$-1$ according to (2). Since the signal is sampled at rate
$4/T$, a basepoint at $-1$ is equivalent to a basepoint at 3. The
same principle applies to the case
$-0.5 \le \hat{\varepsilon}_m < -0.375$.
Table 1. Conversion from $\hat{\varepsilon}_m$ to $n_m$ and $\mu_m$
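The conversion in (2) can be sketched in a few lines of Python.
The sketch assumes $T / T_s = 4$ and a fractional interval
restricted to $-0.5 \le \mu_m < 0.5$, so that $n_m$ is the
integer nearest to $4\hat{\varepsilon}_m$. Negative basepoints
are wrapped modulo 4, as described above. The function name
`to_control` is hypothetical, and the sketch follows the
sub-ranges described in the text rather than the entries of
Table 1.

```python
import math

SAMPLES_PER_SYMBOL = 4  # T / T_s


def to_control(eps_hat):
    """Split a timing estimate (in units of T) into (basepoint, mu)."""
    scaled = eps_hat * SAMPLES_PER_SYMBOL   # eps_hat * T / T_s = mu + n
    n = math.floor(scaled + 0.5)            # nearest integer, so -0.5 <= mu < 0.5
    mu = scaled - n
    basepoint = n % SAMPLES_PER_SYMBOL      # -1 wraps to 3, -2 wraps to 2
    return basepoint, mu
```

For example, `to_control(-0.25)` gives a basepoint of 3 with a
zero fractional interval, in line with the case
$-0.375 \le \hat{\varepsilon}_m < -0.125$.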
III. USER-DEFINABLE PRECISION
One of the features of an FPGA is that designers can define
different precision at various nodes in the system. This allows
the design to match the required processing precision and to
minimize the use of resources. In the following, some subsystems
implemented using different precision at various nodes are
described.
a. Matched filter
The matched filter is a root raised cosine filter with a
roll-off factor of 0.3. It is a 17-tap FIR filter with
coefficients quantized to 8 bits. The filter architecture is
shown in Figure 2, and Table 2 lists the data formats at
different nodes. Taking advantage of the coefficient symmetry,
each pair of symmetric taps is added together before being
multiplied by the shared coefficient. Assuming the input data
samples are 8-bit numbers with 7 bits representing the fraction
(i.e., the magnitude is smaller than unity), the result of the
addition may grow by one bit. Therefore, in order to avoid
overflow, 9 bits are needed at node A, with 7 bits representing
the fraction.
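The pre-addition of symmetric taps can be illustrated with a
short Python sketch. It compares a direct 17-tap FIR sum with a
folded version that needs only 9 multiplications. The
coefficient values in `HALF` are hypothetical integers chosen for
illustration and are not the root raised cosine coefficients of
the actual design.

```python
def fir_direct(window, coeffs):
    """One output sample of a direct-form FIR filter."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_folded(window, half_coeffs):
    """Same output using coefficient symmetry.

    half_coeffs holds h[0] .. h[centre]; window holds
    2 * len(half_coeffs) - 1 samples.
    """
    last = len(window) - 1
    centre = len(half_coeffs) - 1
    acc = half_coeffs[centre] * window[centre]
    for i in range(centre):
        # Pre-add the two samples that share coefficient h[i].
        acc += half_coeffs[i] * (window[i] + window[last - i])
    return acc


# Hypothetical symmetric 17-tap filter and an arbitrary input window.
HALF = [1, -2, 3, -4, 5, -6, 7, 8, 20]
COEFFS = HALF + HALF[-2::-1]
WINDOW = [(3 * k * k - 7 * k) % 11 - 5 for k in range(17)]
```

The pre-addition is the reason why node A needs one extra bit:
the sum of two samples can be twice as large as either sample.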
Figure 2. Filter architecture for the matched filter 
b. IIR filter
The implementation of the IIR filter is shown in Figure 3, along
with the internal register names. Theoretically, the output of
this IIR filter can grow without bound as the output index
increases. However, in this application only the output at
$n = 4L$ is needed, after which the IIR filter is cleared for the
next estimate. Since the output of the matched filter is always
smaller than 0.709 (5.6719/8), the output of the squarer is
always smaller than 0.503 ($[5.6719/8]^2$). MATLAB simulation
shows that, for an IIR filter input with magnitude smaller than
0.503, the magnitude of the output is always smaller than 1 for
$L \le 64$. Therefore, if a segment of 64 symbols is used for
each timing estimate, the data formats of the input, the outputs
and the internal registers of the IIR filter are the same.
Figure 3. IIR filter used in timing delay estimation 
c. Interpolator (online calculation)
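For the interpolator implemented using the Farrow structure [5],
the output is given by

$$y(k) = \{[v(3)\mu_k + v(2)]\mu_k + v(1)\}\mu_k + v(0),$$

where $v(n)$, $n = 0, \ldots, 3$, are the outputs of the FIR
branches of the Farrow structure.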
This equation shows that if the results of the multiplications
by $\mu_k$ are not truncated, a very long register is needed to
store the intermediate result. If each intermediate result is
truncated, the quantization effect has to be analyzed in order to
determine how many bits should be truncated at each step. Figure
4 shows the linear noise model [9] for the Farrow structure.
Assuming there is no quantization error in the Farrow
coefficients, the first source of quantization error is
introduced if $v(n)$ is truncated to a smaller number of bits;
this error is denoted by $e_1$ in Figure 4.
Table 2. Data format at various nodes in the matched filter
(columns: node, data format; only the entry "8 bits (xx.xxxxxx)" is recoverable)
Since the matched filter coefficients are 8 bits with 6 bits
representing the fraction, multiplication yields 17-bit data
with 13 bits representing the fraction (node B). Moreover, as
$\sum_n |h(n)| = 5.6719$, the output of the final summation has a
maximum value of 5.6719. This can still be represented by the
same 17-bit word, so the data format at node C is the same as
that at node B. At the output of the filter, the result is
truncated to 8 bits with the binary point shifted left by 3 bits
(i.e., the result is divided by 8), so that the output
MF_out[7..0] has the same format as the input Din[7..0].
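The word lengths quoted for the matched filter can be reproduced
with a small bookkeeping sketch in Python. A format is written as
a pair (total bits, fraction bits) of a signed fixed-point
number. The helper names are hypothetical, and the rules are the
usual worst-case growth rules, assumed here rather than taken
from the VHDL source.

```python
def add_format(a, b):
    """Format of the sum of two signed fixed-point values."""
    int_bits = max(a[0] - a[1], b[0] - b[1]) + 1  # one extra bit for the carry
    frac_bits = max(a[1], b[1])
    return (int_bits + frac_bits, frac_bits)


def mul_format(a, b):
    """Format of a full-precision product: bit counts add."""
    return (a[0] + b[0], a[1] + b[1])


def max_value(fmt):
    """Largest positive value representable in the given format."""
    total, frac = fmt
    return (2 ** (total - 1) - 1) / 2 ** frac


DIN = (8, 7)    # input samples, magnitude below unity
COEF = (8, 6)   # filter coefficients
NODE_A = add_format(DIN, DIN)       # after the symmetric pre-addition
NODE_B = mul_format(NODE_A, COEF)   # after the multiplication
```

With these rules node A is (9, 7) and node B is (17, 13). The
largest value of that 17-bit word is just below 8, which is why
the maximum sum of 5.6719 fits at node C without extra bits, and
why dividing by 8 brings the output back below unity.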
Figure 4. Linear noise model for the Farrow structure
The second source of quantization error occurs when the product
after multiplication by $\mu_k$ is truncated; it is denoted by
$e_2$. The final source of error is the truncation before the
interpolator's output, denoted by $e_3$. Analysis shows that the
total quantization error at the output of the interpolator is
given by

$$e = e_1(\mu_k^3 + \mu_k^2 + \mu_k + 1) + e_2(\mu_k^2 + \mu_k + 1) + e_3.$$

Since $-0.5 \le \mu_k < 0.5$, it follows that

$$e < 2e_1 + 2e_2 + e_3.$$

If all the internal registers in the Farrow structure are 8 bits
long (one byte) with 7 bits representing the fraction, the total
error $e$ at the output is bounded by

$$e < 1/2^5 + 1/2^7.$$

If truncation is performed at the output only, the total error
is bounded by

$$e < 1/2^7.$$

As a compromise between the above two cases, suppose $v(n)$ is
not truncated, the product after multiplication by $\mu_k$ is
truncated to 13 bits with an 11-bit fraction, and the final
result is truncated to 8 bits with a 7-bit fraction before
output. Then the total error at the output is bounded by

$$e < 1/2^{10} + 1/2^7.$$

The total error at the output in this case is about the same as
that of truncation at the output only, yet the internal registers
are much shorter. Therefore, this truncation approach was
implemented for online calculation of the interpolator.
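The three bounds can be recomputed from the inequality
$e < 2e_1 + 2e_2 + e_3$ with exact rational arithmetic. The
Python sketch below assumes that truncating to $b$ fractional
bits contributes an error of at most $2^{-b}$ and that an
untruncated node contributes none. The function names are
hypothetical.

```python
from fractions import Fraction as F


def truncation_bound(frac_bits):
    """Worst-case truncation error; None means the node is not truncated."""
    return F(0) if frac_bits is None else F(1, 2 ** frac_bits)


def output_error_bound(v_frac, prod_frac, out_frac):
    """Bound 2*e1 + 2*e2 + e3 for the given fractional word lengths."""
    e1, e2, e3 = (truncation_bound(b) for b in (v_frac, prod_frac, out_frac))
    return 2 * e1 + 2 * e2 + e3
```

Calling `output_error_bound(7, 7, 7)`,
`output_error_bound(None, None, 7)` and
`output_error_bound(None, 11, 7)` covers the all-byte case, the
output-only case and the compromise, respectively. The compromise
exceeds the output-only bound by only $1/2^{10}$.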
IV. LUT EXPLOITATION 
Another property of an FPGA is the ease with which LUTs can be
implemented. This property can be exploited to greatly simplify
the circuit. Two examples are described in the following.
a. Interpolator (LUT approach) 
If a small amount of jitter can be tolerated, the parameter
$\mu_k$ can be quantized and the Farrow structure can be replaced
by a simple 4-tap FIR filter whose coefficients (impulse response
samples $h_I[(i + \mu_k) T_s]$) are pre-computed and stored in a
LUT. Figure 5 shows a simplified diagram of the interpolator
using the LUT approach. In this case, $\mu_k$ is quantized into 8
levels. Figure 5 also includes a multiplexer, controlled by the
parameter base_pt[1..0], which determines which set of samples
is used in the interpolation. Actual implementation in the FPGA
shows that the LUT method uses about 30% fewer logic elements
than the online calculation method.
Figure 5. Interpolator using LUT method with multiplexer
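The LUT approach can be sketched in Python as a table of 4-tap
coefficient sets, one per quantized value of the fractional
interval. The sketch assumes a cubic Lagrange interpolator and 8
uniformly spaced levels starting at $-0.5$. The placement of the
levels, the sample ordering and all names are assumptions made
for illustration, not details of the implemented design.

```python
from fractions import Fraction

# Sample positions relative to the basepoint, in units of T_s.
NODES = (-1, 0, 1, 2)
LEVELS = 8


def lagrange_taps(mu):
    """Four FIR coefficients that evaluate the cubic through NODES at mu."""
    taps = []
    for i in NODES:
        c = Fraction(1)
        for j in NODES:
            if j != i:
                c *= Fraction(mu - j) / (i - j)
        taps.append(c)
    return taps


# Pre-computed table: entry q holds the taps for mu = -1/2 + q/8.
LUT = [lagrange_taps(Fraction(-1, 2) + Fraction(q, LEVELS))
       for q in range(LEVELS)]


def interpolate(samples, q):
    """samples = [x(m-1), x(m), x(m+1), x(m+2)]; q addresses the LUT."""
    return sum(h * x for h, x in zip(LUT[q], samples))
```

At run time only four multiplications and three additions remain,
and the cost does not depend on how the taps were derived. The
price is that the fractional interval can only take the 8 stored
values.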
b. Calculation of interpolation control parameters
LUTs are used to calculate the control parameters for
interpolation. The block diagram of this process is shown in
Figure 6, and Table 3 lists the data formats at various nodes.
The timing delay estimate $\hat{\varepsilon}_m$ is determined in
four steps. First, the sign bits and magnitudes of the IIR filter
outputs are separated. Second, the absolute value of the
imaginary part is divided by the absolute value of the real part;
in this step only the first quadrant is considered. Third, the
arctan value of the quotient is obtained from a LUT, shown in
Table 4. The use of the arctan LUT, whose resolution is
determined by the number of quantization levels of $\mu_k$,
avoids the difficult implementation of the arctan calculation in
the FPGA. Fourth, the quadrant in which the angle lies is
determined from the sign bits extracted in the first step,
according to Table 5. The output of the quadrant LUT is the
timing delay estimate defined in (1), quantized into 32 levels.
Finally, the timing delay estimate is converted to the fractional
offset time $\mu_m$ and the basepoint set $n_m$ by the offset
time conversion LUT, which is identical to Table 1. The use of a
LUT in this step reduces the calculation of (2) to simple
addition and shift operations.
Figure 6. Block diagram for the calculation of interpolation 
control parameters 
mu | 3 bits (.xxx)
Table 3. Data format at various nodes in fractional offset time
calculation module
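The sign separation, first-quadrant arctan and quadrant
correction can be sketched in Python as follows. The sketch uses
`math.atan` where the hardware uses the arctan LUT and applies no
quantization, so it shows only the quadrant logic. The function
name is hypothetical, and the quadrant mapping is derived from
the definition $\hat{\varepsilon}_m = -\arg(\cdot) / 2\pi$ in (1)
rather than copied from Table 5.

```python
import cmath
import math


def timing_from_iir(re, im):
    """Timing estimate in symbol periods from the IIR filter output."""
    a_re, a_im = abs(re), abs(im)
    # First-quadrant angle as a fraction of a full turn (0 to 0.25).
    # In hardware a LUT addressed by the quotient would replace atan.
    if a_re == 0:
        angle_a = 0.25
    else:
        angle_a = math.atan(a_im / a_re) / (2 * math.pi)
    # Quadrant correction from the two sign bits.
    if re >= 0:
        return -angle_a if im >= 0 else angle_a
    return -0.5 + angle_a if im >= 0 else 0.5 - angle_a
```

Working with magnitudes keeps the quotient non-negative, so a
single first-quadrant table is sufficient and the remaining
quadrants cost only a sign change and an offset of 0.5.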
V. SIMULATION RESULTS 
After each individual module was compiled and tested, the
modules were integrated and compiled together. The whole receiver
fits into an Altera EPF10K70 FPGA chip, with the resources
allocated to the various components shown in Table 6. The overall
design is slightly smaller than the sum of the individual design
blocks because, when the blocks are compiled individually, some
logic elements may not be fully utilized. The overall receiver
design occupies 95.5% of the logic elements and 22% of the memory
bits of the EPF10K70 FPGA.
[Table fragment: sign_re, angle_b; entries: angle_a, -angle_a, 0.5 - angle_a, -0.5 + angle_a]
VI. CONCLUSIONS 
This paper presented an implementation of the 
digital square timing recovery using FPGA. The 
effects of roundoff noise inside the interpolation 
filter have been analyzed. Look-up tables (LUT) are 
employed to replace some computationally intensive 
tasks. The flexibility of assigning different 
precision at various nodes in the system has been 
demonstrated. 
ACKNOWLEDGEMENT 
This work was supported by the Hong Kong 
Research Grants Council and by the University 
Research Committee of The University of Hong 
Kong, Hong Kong. 
REFERENCES
[1] M. Oerder, H. Meyr, “Digital filter and square timing
recovery,” IEEE Trans. Commun., vol. 36, pp. 605-611, May 1988.
[2] M. Rahnema, “Symbol timing recovery and tracking method for
burst-mode digital communications,” US Patent, Patent Number:
5870443.
[3] F. M. Gardner, “Interpolation in digital modems - part I:
fundamentals,” IEEE Trans. Commun., vol. 41, pp. 501-507, Mar.
1993.
[4] L. Erup, F. M. Gardner, R. A. Harris, “Interpolation in
digital modems - part II: implementation and performance,” IEEE
Trans. Commun., vol. 41, pp. 998-1008, Jun. 1993.
[5] C. W. Farrow, “A continuously variable digital delay
element,” Proceedings of ISCAS 88, pp. 2641-2645.
[6] H. S. Park, K. Y. Sohn, D. H. Kim, “The implementation of
modulator using FPGA technology for W-CDMA WLL,” Proceedings of
IEEE ASIC Conference and Exhibit 1997, pp. 79-83.
[7] D. H. Lee, A. Choi, J. M. Koo, J. I. Lee, B. M. Kim, “A
wideband DS-CDMA modem for a mobile station,” IEEE Trans.
Consumer Electronics, vol. 45, no. 4, Nov. 1999, pp. 1259-1269.
[8] J. M. P. Langlois, D. A. Khalili, R. J. Inkol, “A high
performance, wide bandwidth, low cost FPGA-based quadrature
demodulator,” Proceedings of the 1999 IEEE Canadian Conference
on Electrical and Computer Engineering, pp. 497-502.
[9] A. V. Oppenheim, R. W. Schafer, “Discrete-time signal
processing,” Prentice-Hall, 1989.
Table 4. Arctan look-up-table
sign_re | sign_im = + | sign_im = -
+ | -angle_a | angle_a
- | -0.5 + angle_a | 0.5 - angle_a
Table 5. Quadrant look-up-table
Table 6. Resource allocation to various subsystems
