Abstract-This letter proposes a suboptimal implementation of a binary correlator suitable for detecting a known fixed pattern in a binary stream. The theoretical performance, in terms of the probability of nondetection and the probability of false alarm, is evaluated and shows that the degradation relative to the exact correlator is negligible. Compared to a proprietary core provided by an FPGA vendor, this implementation allows a 15 % look-up table reduction, a 30 % register reduction and up to a 30 % higher clock frequency in an FPGA.
I. INTRODUCTION

A binary correlator is a common tool in digital signal processing systems, used for synchronisation in a binary stream. The principle is based on the detection of a known fixed pattern in a binary stream that can be corrupted by noise with a bit error probability p. If the degree of match between the last N received bits and the N bits of the training sequence seq(k), k ∈ {0,...,N−1}, is above a given threshold, a new frame is detected. This principle is widely used for symbol timing synchronisation in OFDM transmission [1], [2], [3].
In this letter, we propose a suboptimal algorithm which simplifies the hardware implementation of the bit correlator in an FPGA. The probability of false alarm (P_fa) and the probability of nondetection (P_nd) are given for both the optimal and the proposed bit correlators.
II. BIT CORRELATION
The exact match score c_e(n) for the n-th bit between the training sequence seq(k), k ∈ {0,...,N−1}, and the last N received bits of the sequence x(k), k ∈ ℕ, is defined in eq. (1) as the number of common bits.
If the bit rate is equal to the clock rate (i.e., one bit is received at each clock cycle), the computation of c_e(n) typically requires a shift register of size N to store the sequence x(k), k ∈ {n−N+1,...,n} [2]. The b_k(n) values can then be computed directly from eq. (2).
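The exact match score described above can be illustrated with a short software model. This is only a sketch, not the hardware design: it assumes that b_k(n) is 1 when the received bit x(n−N+1+k) equals seq(k) (an XNOR), which is one reading of "number of common bits". The function name and the test sequence are hypothetical.

```python
def exact_match_scores(x, seq):
    """Return c_e(n) for every n >= N-1 (software model of the exact correlator)."""
    n_bits = len(seq)
    scores = []
    for n in range(n_bits - 1, len(x)):
        # The slice plays the role of the N-bit shift register.
        window = x[n - n_bits + 1 : n + 1]
        # Assumed form of b_k(n): 1 when the two bits agree (XNOR), else 0.
        b = [1 - (xb ^ sb) for xb, sb in zip(window, seq)]
        # The sum stands in for the addition tree.
        scores.append(sum(b))
    return scores


# Hypothetical 8-bit pattern embedded in a short stream.
seq = [1, 0, 1, 1, 0, 0, 1, 0]
x = [0, 1] + seq + [1, 1]
scores = exact_match_scores(x, seq)
print(scores)
```

The score reaches N only for the window that is aligned with the embedded pattern; a detector would compare each score with a threshold.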
Manuscript received January 24, 2007. The associate editor coordinating the review of this letter and approving it for publication was Jérôme Louveaux.
Finally, a pipelined addition tree sums the N partial results; the output of the tree is c_e(n). For the sake of simplicity, the sequence length N is assumed to be a multiple of 4. To implement the bit correlation function described above efficiently on an FPGA, the internal logic structure of the FPGA is taken into account. This structure consists of a set of 4-input/1-output Look-Up Tables (LUT) [4], [5], and it is found in the FPGAs of the two companies that represent about 90 % of the FPGA market [6]. To fully exploit the 4 inputs of a LUT, eq. (1) can be rewritten as in eq. (3):
$$c_e(n) = \sum_{k=0}^{N/4-1} f_k(n), \qquad f_k(n) = \sum_{i=0}^{3} b_{4k+i}(n). \tag{3}$$
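It can be noticed that each f_k is the sum of 4 binary values. Each f_k therefore takes a value between 0 and 4, and 3 bits (f_k^2, f_k^1 and f_k^0) are needed to code the result. As each of these bits is a function of 4 inputs, it occupies one LUT.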
Let p_e be the channel bit error probability and p_b the probability that bit b_k equals 0. When the received sequence x(k) is synchronised, p_b = p_e. Otherwise, p_b equals 0.5: when not synchronised, at least P (P ∈ [1..N]) of the last N received bits are random data not correlated with the training sequence. If P < N, the remaining N − P bits are non-synchronised bits from the training sequence, and their partial match score is on average around (N − P)/2 (training sequences are generally chosen to have good auto-correlation properties). This partial score is also obtained for p_e = 0.5. As p_b and p_e can be seen as equivalent, the probability of error is denoted p in the following.
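The probability density function (p.d.f.) for the sum of 4 binary values is the binomial distribution
$$P(f_k = i) = \binom{4}{i} (1-p)^{i}\, p^{4-i}, \qquad i \in \{0,\dots,4\}. \tag{4}$$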
In the functional domain of the system, p is significantly lower than 0.5 when synchronised, and around 0.5 otherwise. Thus, when synchronised, P(f_k = 1) = 4p^3(1 − p) can be considered negligible compared to P(f_k > 1), and it is not useful to discriminate between the cases where f_k is equal to 0 or 1. The function f_k expressed in eq. (5) is chosen for implementation.
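The claim that P(f_k = 1) is negligible when synchronised can be checked numerically. The sketch below assumes independent bit errors, so that the sum of 4 binary values is binomial; the function name is hypothetical, and the two values of p are the synchronised operating points quoted in Section IV.

```python
from math import comb


def pdf_sum_of_4(p):
    """P(f_k = i) for i = 0..4, each bit b_k being 0 with probability p."""
    return [comb(4, i) * (1 - p) ** i * p ** (4 - i) for i in range(5)]


for p in (0.05, 0.3):
    pdf = pdf_sum_of_4(p)
    p_one = pdf[1]          # equals 4 p^3 (1 - p)
    p_more = sum(pdf[2:])   # P(f_k > 1)
    print(f"p = {p}: P(f_k = 1) = {p_one:.6f}, P(f_k > 1) = {p_more:.6f}")
```

The smaller p is, the smaller the ratio between the two probabilities, which is the regime in which merging the cases f_k = 0 and f_k = 1 costs the least.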
1089-7798/07$25.00 © 2007 IEEE
This function takes a value between 0 and 3: 2 bits, f_k^1 and f_k^0, are needed to code the result. It can therefore be implemented using only 2 LUTs, which is a significantly lower complexity than the optimal solution. Note that the proposed approximated match score c_a(n) takes a value between 0 and 3N/4.
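The difference between the exact and approximated scores can be sketched in software. Eq. (5) is not reproduced here, so the mapping below is an assumption: it merges the cases f_k = 0 and f_k = 1 by using max(f_k − 1, 0), which matches the stated output range of 0 to 3. The grouping of consecutive bits by 4 and all function names are also hypothetical.

```python
def approx_group(f):
    """Assumed approximated group score: f = 0 and f = 1 both map to 0."""
    return max(f - 1, 0)


def match_scores(window, seq):
    """Return (c_e, c_a) for one window of N received bits, N a multiple of 4."""
    if len(window) != len(seq) or len(seq) % 4 != 0:
        raise ValueError("window and seq must have the same length, a multiple of 4")
    # b_k = 1 when the received bit agrees with the training bit.
    b = [1 - (w ^ s) for w, s in zip(window, seq)]
    # One partial sum f_k per group of 4 bits (one group per set of LUTs).
    groups = [sum(b[i:i + 4]) for i in range(0, len(b), 4)]
    c_e = sum(groups)                           # exact score, 0..N
    c_a = sum(approx_group(f) for f in groups)  # approximated score, 0..3N/4
    return c_e, c_a


seq = [1, 0, 1, 1, 0, 0, 1, 0]
one_error = [0] + seq[1:]           # first bit flipped
print(match_scores(seq, seq))       # perfect match
print(match_scores(one_error, seq)) # one bit error
```

Under this assumed mapping, a perfect match scores 3 per group instead of 4, so any detection threshold has to be rescaled for the approximated scheme.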
III. THEORETICAL PERFORMANCE FOR THE PROPOSED IMPLEMENTATION
The performance can be evaluated theoretically as a function of the bit error probability p. In the case of the exact computation, c_e is the sum of N independent binary values, so its p.d.f. is binomial:
$$P(c_e = i) = \binom{N}{i} (1-p)^{i}\, p^{N-i}, \qquad i \in \{0,\dots,N\}.$$
A detection threshold s_t(p), an integer between 0 and N for the exact scheme or between 0 and 3N/4 for the approximated scheme, defines whether or not a detection occurs. From such a threshold, the probability of false alarm P_fa and the probability of nondetection P_nd are defined in eq. (6) and eq. (7), respectively. The receiver operating characteristic (R.O.C.) can be derived from these expressions: Fig. 2 presents the R.O.C. (with p = 0.35) for the two schemes. The two R.O.C. curves do not show any significant difference, and they are modified in the same manner for any variation of p.
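The R.O.C. comparison can be reproduced approximately with a few lines of code. The sketch below rests on several assumptions that are not taken from eq. (6) and eq. (7): bit errors are independent, a detection occurs when the score is greater than or equal to the threshold, a false alarm is a detection with p = 0.5, and the approximated group function is max(f_k − 1, 0). The sequence length N = 64 is a hypothetical value; p = 0.35 is the value used for Fig. 2.

```python
from math import comb


def group_pdf(p, approximate):
    """P.d.f. of one 4-bit group score, exact (0..4) or approximated (0..3)."""
    pdf = [comb(4, i) * (1 - p) ** i * p ** (4 - i) for i in range(5)]
    if approximate:
        # Assumed mapping: f = 0 and f = 1 are merged into score 0.
        return [pdf[0] + pdf[1], pdf[2], pdf[3], pdf[4]]
    return pdf


def convolve(a, b):
    """Discrete convolution: p.d.f. of the sum of two independent scores."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def score_pdf(n_bits, p, approximate=False):
    """P.d.f. of the total score over N/4 independent groups."""
    pdf = [1.0]
    group = group_pdf(p, approximate)
    for _ in range(n_bits // 4):
        pdf = convolve(pdf, group)
    return pdf


def roc(n_bits, p, approximate=False):
    """List of (s_t, P_fa, P_nd) for every integer threshold s_t."""
    sync = score_pdf(n_bits, p, approximate)
    unsync = score_pdf(n_bits, 0.5, approximate)
    return [(s_t, sum(unsync[s_t:]), sum(sync[:s_t])) for s_t in range(len(sync))]


exact = roc(64, 0.35)                      # hypothetical N = 64
approx = roc(64, 0.35, approximate=True)
for s_t, p_fa, p_nd in exact[36:41]:
    print(f"exact  s_t={s_t}: P_fa={p_fa:.4f}, P_nd={p_nd:.4f}")
for s_t, p_fa, p_nd in approx[22:27]:
    print(f"approx s_t={s_t}: P_fa={p_fa:.4f}, P_nd={p_nd:.4f}")
```

Plotting P_nd against P_fa for both lists gives two curves that can be compared in the same way as the two schemes in Fig. 2.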
An optimal threshold is defined as expressed in eq. (8). It varies with p and is different for each computation scheme. Fig. 3 does not show significant differences between the two computation schemes.
IV. IMPLEMENTATION RESULTS
Both the exact and approximated correlators have been implemented on a Xilinx Virtex2-2000 FPGA [4]. For a realistic implementation, the pseudo-random sequence is a Chu sequence [7] quantised with a single bit of precision. This well-known sequence is a CAZAC (Constant Amplitude Zero AutoCorrelation) sequence like the ones used in [8]. Typically, over a multipath channel [9], with a detection threshold of SNR = 10 dB, p_b is around 0.3 when synchronised and 0.5 when not synchronised. Over an additive white Gaussian noise channel with the same detection threshold, p_b is around 0.05 when synchronisation is achieved and 0.5 when it is not.
The exact correlation core of Xilinx (Bit Correlator from the CORE Generator tool [10]) is used as the state-of-the-art design (denoted Ref.). A generic VHDL component implementing both the exact and the approximated correlators has been written: the final addition can be chosen as an optimized Carry-Save Adder (CSA) or a classical tree adder [11]. Table I presents the place-and-route results (in terms of logic (LUT), registers (DFF) and clock frequency) for the exact correlators (reference, and component with and without CSA) and the approximated correlators (component with and without CSA). The targeted clock frequency is 280 MHz, with no other specific timing constraint given. Compared to the reference design, the implementation of the approximated scheme requires about 15 % fewer LUTs and about 30 % fewer DFFs, and it reaches a clock frequency that is about 30 % higher.
