I. INTRODUCTION
Orthogonal Frequency Division Multiplexing (OFDM) has gained acceptance worldwide, mainly because it simplifies receiver implementation. It is also used as a multiple access technique in which the bandwidth is divided into a number of orthogonal sub-carriers. OFDM systems are sensitive to frequency offsets, which have several sources: mismatch between the local oscillators at the two ends of the link, mismatch between the master clocks in transmitter and receiver, and effects inherent to the channel, such as multi-path and Doppler spread. All of these effects contribute to a phase rotation of the transmitted samples that needs to be corrected.
Frequency offset is typically estimated in two stages: a fast but less accurate acquisition stage, followed by a more precise tracking stage. The work presented in this paper is concerned with the first stage, and relies on the assumption that the initial acquisition only has to be accurate enough to establish a communication link. A refined estimation is performed in the subsequent stage to ensure reliable demodulation.
In this paper, an architecture capable of time and frequency estimation when the signal components I and Q are quantized to one bit is presented. The architecture builds on the feasibility study presented in [1], where analysis and simulations demonstrate the reduction in algorithmic complexity. A preliminary architecture based on this theoretical background is introduced in [2], where time and frequency are estimated by two independent branches. The current paper evolves from [2]: both the algorithm and the architecture are improved by giving more importance to the phase than to the signal components I and Q. Furthermore, more attention is given to the performance of the proposed architecture under various SNRs. Using only the sign bit was considered in [3] for frame synchronization, but under the assumption of no frequency error. In [4], the idea of using only the sign bit was mentioned for DVB, and a large reduction in area was reported; however, no performance analysis was provided. In [5], the sign bit alone is used together with an angle-calculation algorithm to estimate the frequency error, with a resolution of 10 effective bits. The proposed architecture is synthesized in a 65 nm low-leakage, high-threshold standard-cell CMOS library.
This work has been carried out under the MULTI-BASE [6] project supported by the 7th Framework Programme (FP7) of the European Commission.
II. ACQUISITION WITH SIGN-BIT
OFDM standards contain a Cyclic Prefix (CP) to handle Inter Symbol Interference (ISI). In order to successfully decode the received data, the CP should preferably be longer than the channel's impulse response. The samples from the CP are used to estimate the start of the symbol and the fractional frequency offset [7] . The fractional frequency offset is defined as the fractional part of the actual frequency error divided by the sub-carrier spacing of the OFDM system.
Symbol start estimation and fractional frequency offset estimation are efficiently performed in the time domain [8]. The symbol start is estimated as the time instant of maximum auto-correlation, while the frequency offset is the argument of this maximum detected peak. Rather than only operating on I and Q, directly operating on the phase is considered as well. Specifically, the IQ branch is used for estimating the symbol start, whereas the phase branch is used to estimate the frequency offset, as depicted in Fig. 1. Note that (·)* denotes the complex conjugate, whereas (·)_q denotes the quantized version of the signal.
Operating on the phase is typically not desired, since the amplitude information is not considered. In this study, only the sign-bit is used, and therefore there is no amplitude information available. The auto-correlation can be realized as a moving sum, carrying the values of a complex conjugate multiplication as:
where r is the baseband representation of the received signal, k the sample index, N the auto-correlation distance, and S the length of the moving sum. The symbol start is estimated from (1) as the index k where |y(k)| has its maximum value. Suppose that the received signal is only affected by the frequency error, i.e. no time-dispersion, Doppler spread, or noise. Then it is seen that
and ∆ϕ = 2πεN, where ε is the fractional frequency error between transmitter and receiver. If the received signal is quantized to only the sign bit, the auto-correlation function in (1) still contains sufficient information to accurately estimate the symbol start. For the quantized signal, the argument of the complex multiplication result, γ_q from (2), only takes values with π/2 resolution. Assuming that 0 ≤ ∆ϕ < π/2, the phase of γ_q is either 0 or π/2, depending on whether the quantized samples r_q(k) and r_q(k−N) are located in the same quadrant or not. This is illustrated in Fig. 2, where arg(γ_q(k)) = π/2 when γ_q(k) falls in the gray region defined by the arc ∆ϕ. The mean value of ∆ϕ is computed as
with a corresponding variance given in (5). The variance of the estimator is shown in Fig. 3. It is largest when the true offset is π/4, and it is inversely proportional to S, the number of elements in the moving sum.
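The behaviour of the mean and the variance can be checked numerically. The following is a minimal sketch, not the authors' simulation setup: it assumes a noise-free signal with uniformly distributed phase (complex Gaussian samples stand in for the OFDM signal), a pure phase rotation ∆ϕ between r(k−N) and r(k), and perfect knowledge of the symbol start. The function names and the closed-form variance ∆ϕ(π/2 − ∆ϕ)/S used for comparison are assumptions, the latter derived from treating each quadrant change as an independent Bernoulli event with probability ∆ϕ/(π/2).

```python
import numpy as np

def sign_quantize(r):
    """Keep only the sign bits of I and Q (values +/-1 +/-1j)."""
    return np.where(r.real >= 0, 1.0, -1.0) + 1j * np.where(r.imag >= 0, 1.0, -1.0)

def quadrant_steps(gamma_q):
    """Phase of each quantized product in units of pi/2, in {-1, 0, 1, 2}."""
    re, im = gamma_q.real, gamma_q.imag
    return np.where(im > 0, 1, np.where(im < 0, -1, np.where(re < 0, 2, 0)))

def simulate(delta_phi, S, trials, seed=0):
    """Return (mean, variance) of the sign-bit phase estimate over many trials."""
    rng = np.random.default_rng(seed)
    # Stand-in for the delayed samples r(k - N): uniform phase, random amplitude.
    delayed = rng.standard_normal((trials, S)) + 1j * rng.standard_normal((trials, S))
    # Frequency error only: r(k) is r(k - N) rotated by delta_phi.
    current = delayed * np.exp(1j * delta_phi)
    gamma_q = sign_quantize(current) * np.conj(sign_quantize(delayed))
    # Moving sum of the phases, scaled back to radians.
    estimates = (np.pi / 2) * quadrant_steps(gamma_q).mean(axis=1)
    return estimates.mean(), estimates.var()

if __name__ == "__main__":
    S = 144
    for delta_phi in (np.pi / 8, np.pi / 4):
        mean, var = simulate(delta_phi, S, trials=4000)
        predicted = delta_phi * (np.pi / 2 - delta_phi) / S  # assumed closed form
        print(f"true={delta_phi:.4f} mean={mean:.4f} var={var:.5f} predicted={predicted:.5f}")
```

Under these assumptions the sample mean should track the true offset, and the sample variance should be largest at ∆ϕ = π/4 and shrink as S grows.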
The analysis above assumes perfect knowledge of where the start of the symbol is located. However, if an error in the symbol start is taken into consideration, E[∆ϕ] in (4) becomes
where β is the number of samples by which the estimate is away from the true symbol start. The change arises because β uncorrelated samples then fall inside the auto-correlation window. These samples contribute a random phase arg(γ(k|β)) ∈ {0, π/2, π, −π/2}, each value with probability 1/4. Equation (6) shows that an error in the symbol start has a negative impact on the phase estimation. However, as long as β is small compared to S, the frequency estimate remains sufficiently good.
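For reference, the relations this section relies on can be written out as follows. This is a sketch derived from the definitions above rather than a quotation of the numbered equations: the summation index convention, the indicator b(k), the estimator symbol, and the closed forms of the variance and of the mean under a timing error are assumptions. They follow from treating each quadrant change as an independent event with probability ∆ϕ/(π/2), which presumes a uniformly distributed signal phase, and from the mean π/4 of the four equiprobable phases of the β misaligned samples.

```latex
\begin{align*}
y(k) &= \sum_{m=0}^{S-1} \gamma(k-m), &
\gamma(k) &= r(k)\, r^{*}(k-N), \\
\gamma(k) &= |r(k)|^{2} e^{i\Delta\phi}, &
\Delta\phi &= 2\pi\varepsilon N, \\
\hat{\Delta\phi} &= \frac{\pi}{2}\cdot\frac{1}{S}\sum_{m=0}^{S-1} b(k-m), &
b(k) &= \begin{cases} 1, & \arg\gamma_q(k) = \pi/2 \\ 0, & \arg\gamma_q(k) = 0 \end{cases} \\
\mathrm{E}[\hat{\Delta\phi}] &= \Delta\phi, &
\mathrm{Var}[\hat{\Delta\phi}] &= \frac{\Delta\phi\,(\pi/2 - \Delta\phi)}{S}, \\
\mathrm{E}[\hat{\Delta\phi} \mid \beta] &= \frac{S-\beta}{S}\,\Delta\phi + \frac{\beta}{S}\cdot\frac{\pi}{4}.
\end{align*}
```

The variance expression is consistent with the two properties stated in the text: it peaks at ∆ϕ = π/4 and is inversely proportional to S. The last line shows the bias growing linearly with β/S, which is why a small timing error relative to S is tolerable.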
III. APPLICATION TO WLAN, LTE, AND DVB-H
In this section the suggested sign-bit based estimation is applied to the three standards IEEE 802.11n (WLAN), LTE, and DVB-H. The three standards are OFDM based and, in terms of acquisition, fall into two categories: continuous and packet based. LTE and DVB-H belong to the continuous category, while IEEE 802.11n is packet based. For the latter category, this implies that once a packet is sent, the terminal has a limited period of time to acquire the packet and decode its data.
In order to achieve fast acquisition in WLAN, a preamble containing known information is part of the standard. The preamble consists of a short training field for signal detection, automatic gain control and coarse synchronization, and a long training field for finer synchronization and channel estimation. The ten short training symbols of 16 samples each, contained in the short training field, are used for the coarse estimation in this study.
In the case of LTE and DVB-H, the CP is used. LTE symbols contain 2048 samples plus 144 samples corresponding to the CP. For DVB-H there are three different transmission modes (8K, 4K and 2K) with four different cyclic prefix lengths, which results in various combinations of N and S. In this study the 4K transmission mode with a 1/4 CP was somewhat arbitrarily chosen.
S and N in (1) and (2) are given in Table I where S is the length of the moving sum, and N is the distance between repetitive samples. The numbers in the table are extracted from the respective standard specification.
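As an illustration of how these parameters drive the memory sizes, the sketch below collects one plausible parameter set and derives the FIFO sizes from it. The LTE values (N = 2048, S = 144) and S = 144 for IEEE 802.11n are stated in the text; N = 16 for IEEE 802.11n (one short training symbol, leaving 160 − 16 = 144 samples in the sum) and the DVB-H values (N = 4096, S = 1024 for 4K mode with 1/4 CP) are inferred here and should be checked against Table I. The names `ACQ_PARAMS` and `fifo_bits` are hypothetical.

```python
# Acquisition parameters per standard, in samples.
# N: auto-correlation distance, S: moving-sum length.
# Assumption: the 802.11n N and the DVB-H row are inferred, not quoted.
ACQ_PARAMS = {
    "IEEE 802.11n": {"N": 16, "S": 144},
    "LTE": {"N": 2048, "S": 144},
    "DVB-H 4K, 1/4 CP": {"N": 4096, "S": 1024},
}

# Each FIFO entry holds two bits: the sign bits of I and Q in the delay
# FIFO, or one 2-bit phase code in the phase-accumulation FIFO.
BITS_PER_ENTRY = 2

def fifo_bits(standard):
    """Return the storage needed by the two FIFOs for one standard, in bits."""
    p = ACQ_PARAMS[standard]
    return {
        "delay_fifo": p["N"] * BITS_PER_ENTRY,
        "phase_fifo": p["S"] * BITS_PER_ENTRY,
    }

if __name__ == "__main__":
    for name in ACQ_PARAMS:
        print(name, fifo_bits(name))
```

A multi-standard design would size both FIFOs for the largest entry and adjust the active depth at run time.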
IV. ARCHITECTURE
An architecture capable of auto-correlation processing and phase estimation using only the sign bit is depicted in Fig. 4. The structure is divided into three sections. First, a complex conjugate multiplication calculates γ_q from (2). Second, a complex-valued accumulation carries the I and Q components and produces the absolute value of the auto-correlation function y_q from (1) and, consequently, the symbol start. Third, a phase accumulation computes and holds arg(γ_q). Note that the dashed lines carry phase information, while the solid lines carry I and Q components.
A. Complex Conjugate Multiplication
In communication applications, a complex multiplication is an expensive operation in terms of silicon area. However, when only the sign bit is used, the result of the complex conjugate multiplication γ_q(k) from (2) takes exclusively the values {1, i, −1, −i}. Consequently, the multiplications reduce to simple 1-bit XOR operations, and the FIFO used to delay the received signal by N samples shrinks dramatically. The depth of the FIFO is dynamically adjusted to match any of the considered standards. Further details about these reductions are presented in Section VI.
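The reduction to XOR operations can be illustrated in software. The sketch below is one possible bit-level formulation, not necessarily the gate structure used in the architecture: it assumes a sign bit of 0 for non-negative and 1 for negative values, and returns the product scaled by 1/2 so that it lies in {1, i, −1, −i}. The function names are hypothetical, and the second function is only a floating-point reference for checking the first.

```python
import itertools

def sign_bit_conj_mult(a_i, a_q, b_i, b_q):
    """Compute a * conj(b) / 2 from sign bits only.

    a_i, a_q are the sign bits of r_q(k); b_i, b_q those of r_q(k - N).
    Returns (re, im) with exactly one of them equal to +1 or -1.
    """
    x_i = a_i ^ b_i  # sign of Re(a) * Re(b)
    x_q = a_q ^ b_q  # sign of Im(a) * Im(b)
    if x_i == x_q:
        # The two partial products add up in the real part: result is +/-1.
        return (1 - 2 * x_i, 0)
    # The partial products cancel in the real part: result is +/-i.
    return (0, 1 - 2 * (a_q ^ b_i))

def reference_conj_mult(a_i, a_q, b_i, b_q):
    """Same product computed with ordinary complex arithmetic."""
    a = complex(1 - 2 * a_i, 1 - 2 * a_q)
    b = complex(1 - 2 * b_i, 1 - 2 * b_q)
    p = a * b.conjugate() / 2
    return (int(p.real), int(p.imag))

if __name__ == "__main__":
    for bits in itertools.product((0, 1), repeat=4):
        assert sign_bit_conj_mult(*bits) == reference_conj_mult(*bits)
    print("all 16 sign-bit combinations match the complex reference")
```

Only three XOR operations and one comparison are needed per sample, and the delay line stores two bits per sample instead of two multi-bit words.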
B. Complex Value Accumulation
The moving sum used to calculate the auto-correlation output |y_q| is realized as the accumulation of the last S complex values of γ_q. One input to the moving sum comes directly from the complex conjugate multiplier. The sample corresponding to γ_q(k − S − 1), which is subtracted from the moving sum, comes from the phase accumulation and therefore requires the extraction of its real and imaginary components. The absolute value |y_q| is approximated by adding the I and Q components. One more advantage of using only the sign bit is that the number of bits in the auto-correlation function depends solely on the length of the moving sum S, which in turn depends solely on the standard in operation.
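The recursive update behind such a moving sum can be sketched as follows. This is a behavioural model under stated assumptions, not the hardware description: the products are given as (re, im) pairs in {(1,0), (0,1), (−1,0), (0,−1)}, the magnitude is approximated as |I| + |Q| (the use of absolute values is an assumption), and the FIFO here stores the pairs directly instead of reusing the phase FIFO. The function name is hypothetical.

```python
from collections import deque

def moving_sum_metric(gamma_q, S):
    """Slide a length-S window over the products and return the metric per sample.

    Each step adds the newest product and subtracts the one leaving the
    window, so only two accumulators and one FIFO are needed.
    """
    fifo = deque([(0, 0)] * S)  # last S products, initially empty window
    acc_i = acc_q = 0
    metric = []
    for re, im in gamma_q:
        old_re, old_im = fifo.popleft()
        fifo.append((re, im))
        acc_i += re - old_re
        acc_q += im - old_im
        # Approximate |y_q| without a square root.
        metric.append(abs(acc_i) + abs(acc_q))
    return metric

if __name__ == "__main__":
    S = 8
    # Uncorrelated-looking products, then S aligned products, then more clutter.
    before = [(1, 0), (0, 1), (-1, 0), (0, -1)] * 5
    aligned = [(0, 1)] * S
    after = [(0, -1), (1, 0), (0, 1), (-1, 0)] * 5
    metric = moving_sum_metric(before + aligned + after, S)
    peak = metric.index(max(metric))
    print("peak index:", peak, "peak value:", metric[peak])
```

Because every product has unit magnitude, each accumulator stays within ±S, which is why the word length depends only on S.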
C. Phase Accumulation
Typically the frequency offset is calculated as the phase corresponding to the maximum auto-correlation peak y(k) [9] . This is frequently implemented using a CORDIC module to perform the angle calculation. A CORDIC module makes angle calculations using low complexity modules like adders and shifters, and computes one basic rotation per clock cycle. Throughput is increased at a cost of hardware complexity by concatenating stages and pipelining.
In the proposed architecture, phase estimation is simplified by avoiding the use of CORDIC modules. The argument of the output of the complex conjugate multiplication in (1) only takes the values {0, π/2, π, −π/2}. Thus, the phase calculation is simplified to an angle computation with π/2 resolution, plus a moving sum.
The FIFO in the moving sum from [2] is reduced in this architecture by storing the phase of the complex components instead of the components themselves. Thus the FIFO in the phase accumulation is only two bits wide, since there are only four possible angles. The FIFO has adjustable depth according to the values of S shown in Table I. The output from the phase accumulation is the estimated offset times S, expressed in units of π/2; it is depicted as (2/π)S∆ϕ in Fig. 4, where S is a constant dependent on the selected standard.
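A behavioural sketch of the phase accumulation is given below. It assumes a particular 2-bit encoding of the four angles (0, 1, 2, 3 for 0, π/2, π, −π/2), which the text does not specify, and hypothetical function names. The accumulator holds the sum of the last S phases in units of π/2, so the offset in radians is recovered by scaling with π/(2S).

```python
import math
from collections import deque

# Assumed 2-bit phase code for each possible product value.
PHASE_CODE = {(1, 0): 0, (0, 1): 1, (-1, 0): 2, (0, -1): 3}
# Signed phase in units of pi/2 for each code: 0, pi/2, pi, -pi/2.
CODE_TO_STEPS = (0, 1, 2, -1)

def phase_accumulate(gamma_q, S):
    """Return the running sum of the last S phases, in units of pi/2."""
    fifo = deque([0] * S)  # 2-bit codes; code 0 contributes zero phase
    acc = 0
    out = []
    for g in gamma_q:
        code = PHASE_CODE[g]
        old = fifo.popleft()
        fifo.append(code)
        acc += CODE_TO_STEPS[code] - CODE_TO_STEPS[old]
        out.append(acc)
    return out

def offset_from_accumulator(acc, S):
    """Convert the accumulator value (2/pi) * S * delta_phi back to radians."""
    return (math.pi / 2) * acc / S

if __name__ == "__main__":
    S = 8
    products = [(1, 0)] * 5 + [(0, 1)] * 3
    acc = phase_accumulate(products, S)[-1]
    print("accumulator:", acc, "estimate [rad]:", offset_from_accumulator(acc, S))
```

Since S is fixed per standard, the final division can be a constant scaling or can be deferred to the fine synchronization stage.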
V. SIMULATIONS AND RESULTS
Simulations for the three standards were carried out and compared with the theoretical analysis from Section II. The simulated expected value of ∆ϕ is shown in Fig. 5, where the channel is unitary. For DVB-H and LTE the expected value has no bias and matches the true offset. For IEEE 802.11n the expected value follows the true offset, yet errors are present at certain frequencies. IEEE 802.11n has a known, predefined preamble (see Section III), while the only channel effect considered is a frequency offset due to the oscillator mismatch between transmitter and receiver. Since the preamble remains unchanged and the frequency offset is fixed, the result is a deterministic perturbation of 4 samples in the estimate of the symbol start. This prevents the estimated frequency offset from converging closer to the true value. Note that this situation does not arise in real applications, since no channel is unitary.
The mean square error (MSE) of the proposed algorithm is shown in Fig. 6. The algorithm is simulated for an additive white Gaussian noise (AWGN) channel with a frequency offset corresponding to π/8 for the three standards. The solid markers show the MSE for full precision, i.e. with no quantization error. It is clear from the figure that there is a significant performance loss compared to a full-precision implementation. For instance, LTE at an SNR of 8 dB shows a performance loss of around 10^−1, corresponding to a frequency error mismatch of about ±0.316 radians (the square root of the MSE). This is tolerated by the fine synchronization stage. Moreover, at low SNRs the performance loss is smaller. This is because the floating-point accuracy increases considerably at high SNRs, whereas the quantization effect of the proposed algorithm remains large at high SNRs. DVB-H shows the best performance, because its CP is considerably longer. For LTE and IEEE 802.11n, the same S = 144 is used. However, better performance is achieved for IEEE 802.11n, since the use of the preamble reduces the variance of the estimator.
VI. IMPLEMENTATION RESULTS
The proposed architecture is synthesized in a 65 nm low-leakage, high-threshold standard-cell CMOS library. Power simulations are performed using Synopsys PrimeTime with back-annotated toggling information for a DVB-H transmission over an AWGN channel at 10 dB. The proposed architecture dissipates on average 4.8 µW per symbol.
With a memory of 4K samples of 2 bits each for DVB-H, the design is clearly memory dominated. Around 70% of the area is dedicated to memories. Compared with a typical 8-bit implementation, the number of bits is reduced by 87% in the memory of the complex conjugate multiplication, and by around 97% in the memory of the phase accumulation. The logic is reduced by 70%. The total area is 0.047 mm²; Table II summarizes these results.
VII. CONCLUSIONS
In this paper, an algorithm and an architecture to estimate the symbol start and the frequency offset for three OFDM standards are presented. The effect of quantization to one bit is analyzed by means of expected value and variance. There is an evident loss of performance; however, the objective of this study is to simplify the acquisition stage under the assumption that the subsequent stage can handle this loss. Taking all this into consideration, a reduction in silicon area of 93% is achieved compared to an equivalent 8-bit architecture. Moreover, overall complexity is dramatically reduced, which in turn makes the proposed architecture suitable for use cases where several standards operate in a single terminal.
