INTRODUCTION
The modern global navigation satellite systems (GNSS) signals, such as the Global Positioning System (GPS) L5 and L1C, and Galileo E5 and E1, have brought several innovations: the introduction of a pilot channel that does not contain any data to allow very long coherent integrations; the introduction of a secondary code to offer better cross-correlations, to facilitate the synchronization with the data, and to help interference mitigation; the introduction of new modulations to reduce the impact of multipath; and the use of higher chipping rates to have better accuracy and interference mitigation.
Although having a secondary code brings some advantages, it also presents some drawbacks. Indeed, with the modern GNSS signals, there is now a potential sign transition (i.e., a carrier phase shift of 180°) between each period of the primary code, unlike the GPS L1 C/A signal, which has a potential sign transition only every 20 code periods. These sign transitions are one of the limitations of the coherent integration time, and thus of the receiver sensitivity [1]-[4]. Therefore, to use a long coherent integration time and get high sensitivity, the delay of the secondary code must be estimated.
There have been several proposals to address this problem. In [5] , [6] , it was proposed to synchronize with the primary code first, and then synchronize with the secondary code. However, this implies the ability to detect the signals using only one period of the primary code, which is not the case in the high sensitivity context. In [7] , [8] , it was proposed to extend the coherent integration time by estimating the possible combinations of several secondary code chips, and using this to determine the secondary code delay [9] , but these methods are still not adapted to the high sensitivity context.
To get high sensitivity, the coherent integration time should be at least one period of the secondary code, or a multiple of it. In [10], it was proposed to determine the primary code delay with a serial search and the secondary code delay with a fast Fourier transform (FFT) based correlation; however, the serial search is too time-consuming for a realistic implementation. In [11], the authors proposed to perform an FFT-based correlation over one period of the secondary code with the L5 signal; nevertheless, this requires very large FFTs (length greater than $2^{18}$), which are not compatible with a hardware implementation. Finally, [12] proposed to perform FFT-based correlations over one period of the primary code (doubling the length to manage the sign transition), and to combine the results according to the secondary code chips.
In this article, we will focus on this last method. More specifically, we will compare different hardware implementations of this method. Indeed, the combinations can be performed before or after the correlations with the local primary code; they can be computed sequentially or in parallel; and the output can be computed in different orders (checking all the primary code delays for one secondary code delay, or checking all the secondary code delays for one primary code delay). The objective is therefore to identify the most efficient implementations. Note that these different implementations are not approximations; they all provide the same output and thus the same performance in terms of sensitivity. We will also present a method that approximately halves the number of operations related to the secondary code correlation, still without impacting the sensitivity, and see how it can reduce the processing time with the hardware implementations.
ACQUISITION OF GNSS SIGNALS

SIGNAL DEFINITION
The signal received by a GNSS receiver is the combination of several GNSS signals coming from U different satellites, plus a noise term. Thus, after the front-end, the discrete baseband signal, sampled at the instants $nT_S$ with $T_S = 1/f_S$ and $f_S$ being the sampling frequency, is the sum of the U baseband satellite signals and of the noise component $\eta_b(nT_S)$ ([13], chapter 1). Considering a real sampling front-end, the discrete baseband signal from satellite u having a data and a pilot channel can be expressed as in [13], [14]. Note that this model is simplified, since it does not take into account the Doppler effect on the code, or the local oscillator effect on the sampling frequency for example (see [13], chapter 1), but it is enough for our problem.
The pseudorandom codes are composed of a primary code and of a secondary code (and potentially of a subcarrier, not considered here but without impact on our discussion), and are also called tiered codes. In these tiered codes, the primary code is repeated several times and each period is multiplied by a chip of the secondary code. Since the primary and secondary codes are binary codes taking +1 or -1 as value, the tiered code is also a binary code taking +1 or -1 as value. Using vector notation, denoting $\mathbf{p}$ the primary code of length $N_P$, and $\mathbf{s}$ the secondary code of length $N_S$, they can be defined as

$$\mathbf{p} = [p_0, p_1, \ldots, p_{N_P-1}]^T, \qquad \mathbf{s} = [s_0, s_1, \ldots, s_{N_S-1}]^T, \qquad (3)$$

where the subscript represents the sample for $\mathbf{p}$ and the chip for $\mathbf{s}$. The tiered code, denoted $\mathbf{c}$, thus has a length $N = N_S N_P$ and is defined as

$$\mathbf{c} = \mathbf{s} \otimes \mathbf{p}, \qquad (4)$$
where ⊗ denotes the Kronecker product. The length $N_P$ of the primary code depends on the signal and on the sampling frequency. For example, the L5, E5a, and E5b pilot signals are binary phase shift keying signals with a chipping rate of 10.23 MHz, therefore the usual minimum sampling frequency considered for these signals is 20.46 MHz (twice the chipping rate, although this exact frequency is never used because of position accuracy problems [15]). Since the duration of the primary code is 1 ms, the minimum value of $N_P$ is 20,460. The length $N_S$ of the secondary code is not related to the receiver and depends only on the signal. For example, the lengths of the secondary codes on the data and pilot channels are respectively 10 and 20 chips for the L5 signal, 20 and 100 chips for the E5a signal, and 4 and 100 chips for the E5b signal. Therefore, the minimum value of $N$ is 409,200 for the L5 pilot signal, and 2,046,000 for the E5a and E5b pilot signals.
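The construction of the tiered code can be illustrated with a minimal NumPy sketch. The code values below are random placeholders (not real L5 or Galileo codes), and the toy lengths are assumptions chosen only to keep the example small; the last lines simply reproduce the minimum lengths quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption); real signals use e.g. N_P >= 20460 and N_S = 20.
N_P, N_S = 6, 4
p = rng.choice([-1, 1], size=N_P)   # hypothetical primary code samples
s = rng.choice([-1, 1], size=N_S)   # hypothetical secondary code chips

# Tiered code as the Kronecker product of the secondary and primary codes.
c = np.kron(s, p)

# Each period of the tiered code is the primary code times one secondary chip.
periods = c.reshape(N_S, N_P)

# Minimum tiered code lengths quoted in the text (20.46 MHz, 1 ms primary code).
N_P_min = 20460
N_L5_pilot = 20 * N_P_min
N_E5_pilot = 100 * N_P_min
```

Reshaping the tiered code into $N_S$ rows makes the structure visible: row $i$ equals the primary code multiplied by chip $s_i$.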
PARALLEL CODE SEARCH ACQUISITION
The aim of the acquisition is to detect the visible GNSS satellites, and to estimate their baseband frequency $f_b^u$ and code delay $\tau^u$, by synchronizing local replicas with the incoming signal. The acquisition is thus a two-dimensional problem for each satellite. There are different methods to perform the acquisition, such as the serial search, which tests the different combinations of carrier frequency and code delay one by one [16]; the parallel frequency search, which tests one code delay and several or all the carrier frequencies in parallel using an FFT [17], [18], [19]; the parallel code search, which tests one carrier frequency and all the code delays in parallel using an FFT-based correlation [20], [16], [13]; and methods that parallelize the search in the two dimensions [21], [22], [23]. For a high sensitivity hardware receiver, the parallel code search seems the most suitable method because of its high level of parallelization, its moderate memory requirements, and because it can compensate for the code Doppler whereas the parallel frequency search and its derivatives cannot [13], [24].
AUGUST 2017 IEEE A&E SYSTEMS MAGAZINE 47
The basic diagram of the parallel code search implemented on a field-programmable gate array (FPGA) is shown in Figure 1 . In this figure, the incoming signal is stored in a memory at the sampling frequency for a faster processing during the acquisition. For different frequencies of the local carrier replica, the circular correlation between the incoming code and the local code is computed using FFTs. Then additional coherent or noncoherent integration can be performed. This process is performed at the FPGA frequency, which is usually much higher than the sampling frequency, allowing a speeding up of the acquisition [24] .
For the following, we will concentrate only on the processing between the carrier removal and the extra coherent integration, i.e. the circular correlation computed using FFTs.
DIRECT CORRELATION OVER THE SECONDARY CODE PERIOD
The circular correlation can be performed over the entire tiered code to synchronize with both primary and secondary codes simultaneously, as proposed in [11]. Using matrix notation, the circular correlation can be written as

$$\mathbf{y} = \mathbf{C}\mathbf{x} = \mathbf{X}\mathbf{c}, \qquad (5)$$

where $\mathbf{C}$ is an $N \times N$ right circulant matrix with $\mathbf{c}^T$ as first row, $\mathbf{x}$ is the signal after the carrier removal, and $\mathbf{X}$ is an $N \times N$ left circulant matrix with $\mathbf{x}$ as first column [25]. Since a circulant matrix can be diagonalized by the discrete Fourier transform matrix $\mathbf{F}$, we can write

$$\mathbf{y} = \mathbf{F}^{-1}\left((\mathbf{F}\mathbf{c})^{*} \circ (\mathbf{F}\mathbf{x})\right), \qquad (6)$$

where * denotes the conjugate operator and ○ denotes the Hadamard product (element by element product) [25]. Therefore, this circular correlation can be implemented using FFTs as shown in Figure 2, where the length of the FFTs is $N$. The corresponding timing diagram is shown in Figure 14 (all the timing diagrams are provided in the appendix so as not to overload the core of the article), assuming that several FFTs can be computed consecutively without pause (this corresponds to the streaming implementation of some FFTs [26]), and that the FFT has a latency of $L_N$ clock cycles (i.e., there are $L_N$ clock cycles between the last sample of the input sequence and the first sample of the first output sequence).
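The equivalence between the matrix form of the circular correlation and its FFT form can be checked numerically. The following is a minimal sketch, assuming a real ±1 local code and a random complex input of toy length; the variable names are illustrative and not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16                                   # toy length (assumption)
c = rng.choice([-1.0, 1.0], size=N)      # hypothetical local code (real, +/-1)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)  # hypothetical input

# Direct form: y[k] = sum_n c[n] * x[(n + k) mod N], i.e. a circulant
# matrix built from c applied to x.
idx = (np.arange(N)[:, None] + np.arange(N)[None, :]) % N   # idx[k, n] = (n + k) mod N
y_direct = (x[idx] * c[None, :]).sum(axis=1)

# FFT form: inverse FFT of conj(FFT(c)) times FFT(x), element by element.
y_fft = np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(x))
```

Both results agree up to floating-point rounding, while the FFT form needs on the order of $N \log N$ operations instead of $N^2$.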
For an FPGA implementation, the FFT cores available require an FFT length that is a power of two [26]-[29]. As mentioned previously, the minimum value of $N$ is 409,200 for the L5 signal, thus the smallest power of two possible is $2^{19}$ = 524,288. To have this FFT length, the sampling frequency must be 26.2144 MHz (524,288/(20 ms)). Otherwise, if another sampling frequency is considered, zero-padding must be used, and the equivalent of two code periods is needed (to keep the periodicity of the code and avoid losses [30], [31]), and in this case the FFT length would be 1,048,576.
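The FFT lengths quoted above follow from simple arithmetic, sketched below. The helper `next_pow2` is a hypothetical name introduced for this illustration only.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two greater than or equal to n."""
    return 1 << (n - 1).bit_length()

N_min = 409_200      # minimum N for the L5 pilot signal (from the text)
T_sec = 20e-3        # L5 pilot secondary code period: 20 periods of 1 ms

# Case without zero-padding: the sampling frequency is chosen so that one
# secondary code period contains exactly a power-of-two number of samples.
n_fft_exact = next_pow2(N_min)
fs_exact_mhz = n_fft_exact / T_sec / 1e6

# Case with zero-padding: two code periods are needed before padding.
n_fft_padded = next_pow2(2 * N_min)
```

With these inputs, `n_fft_exact` is 524,288, `fs_exact_mhz` is 26.2144 and `n_fft_padded` is 1,048,576, matching the values in the paragraph above.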
In any case, it is not possible to implement such an FFT directly since the required length is too large. Indeed, the maximum length currently available with the Altera FFT core is 262,144 with the variable streaming data flow (which consumes a tremendous amount of resources) and 65,536 with the streaming and burst data flows [26]; the maximum length is 65,536 with the Xilinx FFT core [27]; 16,384 with the Lattice FFT core [28]; and 8,192 with the Microsemi FFT core [29]. Nevertheless, the processing time of the theoretical implementation of the direct correlation is given in Table 1, without and with zero-padding. In the next section, we will consider the computation of the circular correlation by combining the results of smaller circular correlations, which is more practical for hardware implementations. The processing time and memory usage of all the implementations are given in Table 1.

Figure 2.
Implementation of the direct correlation over the secondary code period (see (5) and (6)). See the timing diagram in Figure 14.
CORRELATION OVER THE PRIMARY CODE PERIOD AND COMBINATIONS
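Instead of computing the circular correlation over one entire period of the secondary code, it is possible to perform a circular correlation over one period of the primary code, repeat this for multiple consecutive periods, and then combine the results according to the chips of the secondary code, as proposed in [12]. Therefore, the output $\mathbf{y}$ can be computed by portions equivalent to the period of the primary code. For example, considering that the secondary code has four chips (an example that we will use throughout this article for the illustrations), the four portions of the output can be obtained as

$$\mathbf{y} = \begin{bmatrix} \mathbf{y}_0 \\ \mathbf{y}_1 \\ \mathbf{y}_2 \\ \mathbf{y}_3 \end{bmatrix} = \begin{bmatrix} s_0 \mathbf{P}^T\mathbf{x}_0 + s_1 \mathbf{P}^T\mathbf{x}_1 + s_2 \mathbf{P}^T\mathbf{x}_2 + s_3 \mathbf{P}^T\mathbf{x}_3 \\ s_0 \mathbf{P}^T\mathbf{x}_1 + s_1 \mathbf{P}^T\mathbf{x}_2 + s_2 \mathbf{P}^T\mathbf{x}_3 + s_3 \mathbf{P}^T\mathbf{x}_0 \\ s_0 \mathbf{P}^T\mathbf{x}_2 + s_1 \mathbf{P}^T\mathbf{x}_3 + s_2 \mathbf{P}^T\mathbf{x}_0 + s_3 \mathbf{P}^T\mathbf{x}_1 \\ s_0 \mathbf{P}^T\mathbf{x}_3 + s_1 \mathbf{P}^T\mathbf{x}_0 + s_2 \mathbf{P}^T\mathbf{x}_1 + s_3 \mathbf{P}^T\mathbf{x}_2 \end{bmatrix}, \qquad (7)$$

where the $\mathbf{y}_i$ are the different portions of the output containing $N_P$ samples, $\mathbf{P}^T$ is a Toeplitz matrix of size $N_P \times 2N_P$ whose first row is the primary code $\mathbf{p}^T$ padded with $N_P$ zeros, and the $\mathbf{x}_i$ are built from two consecutive periods of the incoming code, i.e.,

$$\mathbf{x}_i = [x_{iN_P}, x_{iN_P+1}, \ldots, x_{(i+2)N_P-1}]^T,$$

with the sample indices taken modulo $N$; they thus contain $2N_P$ samples (see [13], chapter 6 for more details about how to obtain this equation). Note that (7) is not an approximation of (5); the output $\mathbf{y}$ is exactly the same in both cases. Only the way to compute $\mathbf{y}$ is different.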
The Toeplitz matrix $\mathbf{P}^T$ can be embedded into a circulant matrix of size $2N_P \times 2N_P$ by adding $N_P$ rows [25], therefore the product between $\mathbf{P}^T$ and a vector of $2N_P$ points can be computed as a circular correlation using three FFTs of length $2N_P$, where the second half of the output is discarded. If the length of the FFTs has a constraint (such as being a power of two), zeros can be added to the local and incoming sequences to achieve the desired length. Since we focus on FPGA implementations, we consider this constraint, and thus for the acquisition of the L5, E5a, and E5b signals, the length of the FFT will be $N_{FFT} = 2N_P + N_Z = 65{,}536$, where $N_Z$ is the number of zeros added (since the minimum for $2N_P$ is 40,920). This value for the FFT length can be used for sampling frequencies up to 32.768 MHz. Note that there are methods to optimize this double length circular correlation for FPGA implementations [31], [32], but we do not consider them in the following discussions since this circular correlation is present in all the implementations discussed.
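The product between the Toeplitz matrix and a double-length input portion can be checked against its zero-padded FFT implementation. This is a minimal sketch with toy sizes and random placeholder data (assumptions); `np.fft.fft(..., n=N_FFT)` appends zeros up to the requested length, which plays the role of the $N_Z$ padding zeros.

```python
import numpy as np

rng = np.random.default_rng(2)
N_P = 10                                   # toy primary code length (assumption)
N_FFT = 32                                 # power of two >= 2*N_P, so N_Z = 12
p = rng.choice([-1.0, 1.0], size=N_P)      # hypothetical primary code
x_i = rng.standard_normal(2 * N_P) + 1j * rng.standard_normal(2 * N_P)

# Toeplitz matrix (N_P x 2N_P): row l is the primary code shifted right by l.
PT = np.zeros((N_P, 2 * N_P))
for l in range(N_P):
    PT[l, l:l + N_P] = p
r_direct = PT @ x_i

# Same product using three FFTs of length N_FFT; only the first N_P output
# samples are kept, the rest is discarded.
P_f = np.fft.fft(p, n=N_FFT)
X_f = np.fft.fft(x_i, n=N_FFT)
r_fft = np.fft.ifft(np.conj(P_f) * X_f)[:N_P]

# Highest sampling frequency for which 2*N_P fits in 65,536 points (1 ms code).
fs_max_mhz = 65536 / 2 / 1e-3 / 1e6
```

The first $N_P$ outputs are unaffected by the padding because, for these delays, the shifted primary code never wraps around the end of the $2N_P$ input samples.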
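* SCCC stands for secondary code circular correlation.

Coming back to (7), the multiplication by the secondary code chips can be done at different stages. If the different combinations according to the secondary code delay are performed before the FFT-based correlation, (7) becomes

$$\mathbf{y} = \begin{bmatrix} \mathbf{P}^T\mathbf{a}_0 \\ \mathbf{P}^T\mathbf{a}_1 \\ \mathbf{P}^T\mathbf{a}_2 \\ \mathbf{P}^T\mathbf{a}_3 \end{bmatrix} = \begin{bmatrix} \mathbf{P}^T(s_0\mathbf{x}_0 + s_1\mathbf{x}_1 + s_2\mathbf{x}_2 + s_3\mathbf{x}_3) \\ \mathbf{P}^T(s_0\mathbf{x}_1 + s_1\mathbf{x}_2 + s_2\mathbf{x}_3 + s_3\mathbf{x}_0) \\ \mathbf{P}^T(s_0\mathbf{x}_2 + s_1\mathbf{x}_3 + s_2\mathbf{x}_0 + s_3\mathbf{x}_1) \\ \mathbf{P}^T(s_0\mathbf{x}_3 + s_1\mathbf{x}_0 + s_2\mathbf{x}_1 + s_3\mathbf{x}_2) \end{bmatrix}, \qquad (8)$$

where $\mathbf{a}_i$ is the combination of the input portions for the $i$th secondary code delay. Since the secondary code is removed before the FFTs, we will talk about pre-FFT secondary code removal. If the combinations are instead performed after the FFT-based correlation, (7) becomes

$$\mathbf{y} = \begin{bmatrix} s_0\mathbf{r}_0 + s_1\mathbf{r}_1 + s_2\mathbf{r}_2 + s_3\mathbf{r}_3 \\ s_0\mathbf{r}_1 + s_1\mathbf{r}_2 + s_2\mathbf{r}_3 + s_3\mathbf{r}_0 \\ s_0\mathbf{r}_2 + s_1\mathbf{r}_3 + s_2\mathbf{r}_0 + s_3\mathbf{r}_1 \\ s_0\mathbf{r}_3 + s_1\mathbf{r}_0 + s_2\mathbf{r}_1 + s_3\mathbf{r}_2 \end{bmatrix}, \qquad (9)$$

where $\mathbf{r}_i = \mathbf{P}^T\mathbf{x}_i$. Since the secondary code is removed after the FFTs, we will talk about post-FFT secondary code removal. The notations $\mathbf{a}_i$ and $\mathbf{r}_i$ are used to facilitate the link between the equations and the figures. Note that $\mathbf{r}_i$ has a clear meaning, since we can write (assuming the Doppler is correctly removed)

$$\mathbf{r}_i = s_{(i-\Delta) \bmod N_S}\, \mathbf{r}_p + \boldsymbol{\eta}_i, \qquad (10)$$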
where Δ is the unknown delay of the incoming secondary code, $\mathbf{r}_p$ is the autocorrelation of the primary code, and $\boldsymbol{\eta}_i$ is the noise. In both (8) and (9), there are $N_S$ FFT-based correlations over at least $2N_P$ points. For the combinations, (8) uses vectors of $2N_P$ points, whereas (9) uses vectors of $N_P$ points. Thus, (9) requires slightly fewer operations than (8). One can check that these two equations require more operations than the direct correlation of (5).
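The claim that the pre-FFT and post-FFT forms give exactly the same output as the direct correlation can be verified numerically. The sketch below uses toy sizes and random placeholder codes (assumptions), and the helper `corr_primary` is a hypothetical name for the double-length primary code correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
N_P, N_S = 8, 4                              # toy sizes (assumption)
N = N_P * N_S
p = rng.choice([-1.0, 1.0], size=N_P)        # hypothetical primary code
s = rng.choice([-1.0, 1.0], size=N_S)        # hypothetical secondary code
c = np.kron(s, p)                            # tiered code
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Reference: circular correlation over the whole tiered code.
y_ref = np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(x))

# Input portions: two consecutive primary code periods, taken circularly.
x_ext = np.concatenate([x, x[:N_P]])
xs = [x_ext[i * N_P:(i + 2) * N_P] for i in range(N_S)]

def corr_primary(v):
    """Circular correlation of length 2*N_P with the primary code, first half kept."""
    return np.fft.ifft(np.conj(np.fft.fft(p, n=2 * N_P)) * np.fft.fft(v))[:N_P]

# Pre-FFT removal: combine the input portions first, then correlate.
y_pre = np.concatenate([
    corr_primary(sum(s[j] * xs[(i + j) % N_S] for j in range(N_S)))
    for i in range(N_S)])

# Post-FFT removal: correlate each portion once, then combine the results.
r = [corr_primary(v) for v in xs]
y_post = np.concatenate([
    sum(s[j] * r[(i + j) % N_S] for j in range(N_S))
    for i in range(N_S)])
```

In both variants `corr_primary` is called $N_S$ times; the difference lies in whether the combinations operate on vectors of $2N_P$ points (before the correlation) or $N_P$ points (after it).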
In the next subsections, we will study the FPGA implementation of both equations, testing the secondary code delays sequentially or in parallel, and using or not memory to save temporary results. For the evaluation of the processing time, we will consider a streaming data flow, i.e., an FFT that can process the data in a continuous way.
IMPLEMENTATION OF THE PRE-FFT SECONDARY CODE REMOVAL IN A SEQUENTIAL WAY
In (8), there are N S correlations between the local primary code p and the combinations of the different periods of the incoming signal (a 0 , a 1 , ...). The corresponding implementation computing each combination sequentially is shown in Figure 3 . The accumulator used before the FFT is implemented with an adder and a memory having 2N P addresses, to accumulate over N S samples (one sample of each period), as shown in Figures 12 and 13 . The processing starts by accessing all the portions of the input (x 0 , x 1 , ...), and when the last one is accessed, the first combination a 0 is available and its correlation with the local code p is computed to obtain y 0 . Then, x 0 , x 1 , ... can be accessed again immediately to compute the second combination, and so on and so forth, until the N S combinations have been tested. The processing time is approximately 2N S times higher than the one of the direct correlation implementation.
With this implementation, the memory needed is twice $2N_P(B_0 + \lceil \log_2 N_S \rceil)$ bits for the accumulation because the signal is complex, where $B_0$ denotes the number of bits used to quantize $\mathbf{x}_i$.
IMPLEMENTATION OF THE PRE-FFT SECONDARY CODE REMOVAL IN A PARALLEL WAY
It is also possible to compute the different combinations (a 0 , a 1 , ...) in parallel using N S accumulators, as shown in Figure 4 . In this case, the processing also starts by accessing x 0 , x 1 , ..., and when the last one is accessed, a 0 , a 1 , ... are available in the accumulators' memory. Then, each a i is read successively and the correlation with the local code p is computed to obtain y i . Then, for the next data stream, the portions of the input can be accessed again only when the last combination is read, which implies that the processing time is divided by a factor lower than N S compared to the previous sequential implementation.
With this implementation, the memory requirements are higher since $2N_P N_S (B_0 + \lceil \log_2 N_S \rceil)$ bits need to be stored for the accumulation.
These two pre-FFT implementations are the extreme cases, where either only one combination is computed at a time, or all the combinations are computed simultaneously. However, it is also possible to test only a few combinations at a time, using one accumulator per combination. For example, the timing diagram considering two accumulators is given in Figure 17.

Figure 3.
Implementation of the pre-FFT secondary code removal (see (8)) computing each combination of the input sequentially. See the timing diagram in Figure 15.
IMPLEMENTATION OF THE POST-FFT SECONDARY CODE REMOVAL WITH A MEMORY
Looking at (9), it can be seen that the correlation between the local primary code $\mathbf{p}$ and each portion of the incoming code ($\mathbf{x}_0, \mathbf{x}_1, \ldots$) needs to be performed only once. Only the combinations of the different portions according to the secondary code delays differ. However, this requires the storage of the correlation portions ($\mathbf{r}_0, \mathbf{r}_1, \ldots$). The corresponding implementation using a memory to store $\mathbf{r}_0, \mathbf{r}_1, \ldots$, and computing $\mathbf{y}_0, \mathbf{y}_1, \ldots$ sequentially is shown in Figure 5.
The processing starts by accessing x 0 , x 1 , ..., computing their correlation with the local code p, and storing the results into the memory. Then, the memory is read and a combination is tested, then the memory is read again and another combination is tested, and so on and so forth. The process is then repeated for the next data stream, as soon as it is possible to write again into the memory without overwriting data not yet read. With this implementation, the combinations are performed over vectors of N P instead of 2N P for the pre-FFT implementations, which implies that the processing time is approximately halved compared to the pre-FFT sequential implementation.
With this implementation, the memory needed is twice $N_P N_S \times B_1$ bits to store the FFT outputs and twice $N_P(B_1 + \lceil \log_2 N_S \rceil)$ bits for the accumulation, where $B_1$ denotes the number of bits used to quantize the outputs of the inverse FFT ($\mathbf{r}_i$).
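To give an order of magnitude for the memory expressions stated so far, the following sketch evaluates them for the minimum L5 pilot dimensions. The function names are hypothetical, and the value $B_0 = 4$ bits is an assumption made only for this illustration ($B_1 = 16$ bits is a commonly used FFT resolution); the actual parameters of the article's tables are not reproduced here.

```python
from math import ceil, log2

def mem_pre_fft_sequential(N_P, N_S, B0):
    """Accumulator memory in bits: twice 2*N_P*(B0 + ceil(log2(N_S)))."""
    return 2 * 2 * N_P * (B0 + ceil(log2(N_S)))

def mem_post_fft_with_memory(N_P, N_S, B1):
    """Storage of the correlation portions plus accumulator, in bits.

    Both terms are doubled because the data are complex.
    """
    store = 2 * N_P * N_S * B1
    accum = 2 * N_P * (B1 + ceil(log2(N_S)))
    return store + accum

N_P, N_S = 20_460, 20      # minimum L5 pilot dimensions (from the text)
B0, B1 = 4, 16             # assumed word lengths, for illustration only

pre_bits = mem_pre_fft_sequential(N_P, N_S, B0)
post_bits = mem_post_fft_with_memory(N_P, N_S, B1)
ratio = post_bits / pre_bits
```

Under these assumed word lengths, storing the correlation portions dominates the memory of the post-FFT implementation with a memory, which is why its requirement grows roughly in proportion to $N_S$.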
IMPLEMENTATION OF THE POST-FFT SECONDARY CODE REMOVAL IN A SEQUENTIAL WAY
It is also possible to implement (9) without storing r 0 , r 1 , ..., but in this case, they must be recomputed several times. The corresponding implementation computing y 0 , y 1 , ... sequentially is shown in Figure 6 .
The processing starts by accessing x 0 , x 1 , ..., computing their correlation with the local code p, and combining the results according to the secondary code chips. The process is then repeated to test the next combinations. Then, the process is repeated for the next data streams. With this implementation, since the zero-padding is present at the input of the FFTs and for the combinations, the processing time is higher than with the pre-FFT sequential implementation and with the post-FFT implementation with a memory.
IMPLEMENTATION OF THE POST-FFT SECONDARY CODE REMOVAL IN A PARALLEL WAY
As previously, it is also possible to compute each portion of the output in parallel using $N_S$ accumulators, as shown in Figure 7. The processing is similar to the previous post-FFT implementation, except that the FFTs are computed only once, since each accumulator accumulates when a new correlation portion is available, and that there are $N_S$ outputs available simultaneously (which will require a slightly different detection process after that). Contrary to the pre-FFT parallel implementation, there is no need to stop the stream between different data streams, therefore the processing time is lower. Note also that the processing time is divided by approximately $N_S$ compared to the post-FFT sequential implementation.

Figure 5.
Implementation of the post-FFT secondary code removal (see (9)) using a memory to store correlation portions and computing each combination of the output sequentially. See the timing diagram in Figure 18.

Figure 7.
Implementation of the post-FFT secondary code removal (see (9)) computing each combination of the output in parallel. See the timing diagram in Figure 20.
IMPLEMENTATION OF THE POST-FFT SECONDARY CODE REMOVAL USING CIRCULAR CORRELATION
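In the previous implementations, the output is computed by consecutive portions corresponding to one primary code period. However, it is also possible to compute the output in a different order. Indeed, the $l$th samples of the outputs $\mathbf{y}_0, \mathbf{y}_1, \ldots$, can be obtained from the circular correlation between the secondary code and the $l$th samples of the correlation portions ($\mathbf{r}_0, \mathbf{r}_1, \ldots$). Starting from (9), we can write

$$\begin{bmatrix} y_{0,l} \\ y_{1,l} \\ y_{2,l} \\ y_{3,l} \end{bmatrix} = \begin{bmatrix} s_0 & s_1 & s_2 & s_3 \\ s_3 & s_0 & s_1 & s_2 \\ s_2 & s_3 & s_0 & s_1 \\ s_1 & s_2 & s_3 & s_0 \end{bmatrix} \begin{bmatrix} r_{0,l} \\ r_{1,l} \\ r_{2,l} \\ r_{3,l} \end{bmatrix}, \qquad (11)$$

where $y_{i,l}$ and $r_{i,l}$ denote the $l$th samples of $\mathbf{y}_i$ and $\mathbf{r}_i$, respectively.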
This circular correlation can be computed traditionally in the time domain, or using FFTs. However, this means that we need to have access to the different correlations portions at the same time, therefore, they should be stored into memory as in the implementation of the post-FFT secondary code removal with a memory.
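The change of computation order described in this subsection can be sketched as follows: the same outputs are obtained either portion by portion, or sample by sample through a circular correlation with the secondary code. The array `R` holds random placeholders standing for the stored correlation portions, and the sizes are toy values (assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)
N_P, N_S = 5, 4                               # toy sizes (assumption)
s = rng.choice([-1.0, 1.0], size=N_S)         # hypothetical secondary code
# R[i, l] stands for sample l of correlation portion i (random placeholders).
R = rng.standard_normal((N_S, N_P)) + 1j * rng.standard_normal((N_S, N_P))

# Portion by portion: output portion i is the sum over j of s[j] times
# correlation portion (i + j) mod N_S.
Y_portions = np.array([
    sum(s[j] * R[(i + j) % N_S] for j in range(N_S))
    for i in range(N_S)])

# Sample by sample: a right circulant matrix built from s is applied to the
# l-th samples of all portions (one column of R at a time).
S = np.array([np.roll(s, i) for i in range(N_S)])   # row i is s shifted right by i
Y_samples = np.empty_like(R)
for l in range(N_P):
    Y_samples[:, l] = S @ R[:, l]
```

The loop over `l` reads one column of `R` at a time, which mirrors the different writing and reading orders of the memory: portions are written row by row but read column by column.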
Implementation of the Secondary Code Circular Correlation in a Sequential Way
If the different combinations in (11) are computed in a sequential way, the accumulation can be done with a simple adder, without using a memory. The corresponding implementation is shown in Figure 8 .
The processing up to the storage of the correlation portions is similar to the post-FFT implementation with a memory. What differs afterwards is the reading order of the memory: we now read the first sample of each correlation portion ($r_{0,0}, r_{1,0}, \ldots$), multiply them by the secondary code chips, and accumulate the result. These samples are then accessed again to test another delay of the secondary code, and so on. Thus, they will be accessed $N_S$ times. Then, we read the second sample of each portion ($r_{0,1}, r_{1,1}, \ldots$) and the same process is performed, and this is repeated $N_P$ times for the $N_P$ delays of the primary code.
Because of the different writing and reading order of the memory, there is an additional latency introduced compared to the post-FFT implementation with a memory (this can be clearly seen comparing Figures 18 and 21) , and therefore the processing time is slightly longer.
With this implementation, the memory needed is twice $N_P N_S \times B_1$ bits to store the FFT outputs.
Implementation of the Secondary Code Circular Correlation in a Parallel Way
It is also possible to compute the N S samples of the output in (11) in parallel using N S accumulators, as shown in Figure 9 .
The processing up to the storage of the correlation portions is similar to the previous implementation. The only difference is that we need to read the $N_S$ samples $r_{0,l}, r_{1,l}, \ldots$ only once to test the $N_S$ combinations. Therefore, compared to the previous implementation, the processing time is reduced substantially (by a factor of up to $N_S/3$) in exchange for only $N_S$ accumulators implemented in logic. However, compared to the post-FFT parallel implementation, the processing time is slightly higher because the different order of writing and reading in the memory introduces a latency (this can be seen by comparing Figures 20 and 22).
Implementation of the Secondary Code Circular Correlation Using FFTs
As indicated previously, since (11) corresponds to a circular correlation, the operation can be performed using FFTs. The corresponding implementation is shown in Figure 10, where $N_{FFT,S}$ denotes the length of these small FFTs. Following our constraints, these FFTs need sequences that have a length that is a power of two. None of the secondary codes currently available has such a length (except on the data channel of the E5b signal). Therefore, zero-padding must be used, and the length of the sequences must at least double (to keep the periodicity and avoid losses). For example, with the GPS L5 pilot secondary code that has 20 chips, the FFT length will be 64 points.

Figure 8.
Implementation of the post-FFT secondary code removal using circular correlation (see (11)) computing each sample of the output sequentially (the writing and reading orders of the memory are different). See the timing diagram in Figure 21.

Figure 9.
Implementation of the post-FFT secondary code removal using circular correlation (see (11)) computing each sample of the output in parallel (the writing and reading orders of the memory are different). See the timing diagram in Figure 22.

Figure 10.
Implementation of the post-FFT secondary code removal using circular correlation (see (11)) computing each sample of the output using an FFT (the writing and reading orders of the memory are different). See the timing diagram in Figure 23.
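The need to double the sequence before zero-padding, discussed in this subsection, can be illustrated with a short sketch. It assumes one possible way of keeping the periodicity, namely repeating the $N_S$ samples once before padding to the FFT length; the secondary code and the samples are random placeholders, not the real L5 code.

```python
import numpy as np

rng = np.random.default_rng(5)
N_S = 20                                   # L5 pilot secondary code length (from the text)
N_FFT_S = 64                               # smallest power of two >= 2 * N_S
s = rng.choice([-1.0, 1.0], size=N_S)      # hypothetical secondary code
r_l = rng.standard_normal(N_S) + 1j * rng.standard_normal(N_S)  # placeholder samples

# Reference: circular correlation of length N_S computed in the time domain.
y_time = np.array([sum(s[j] * r_l[(i + j) % N_S] for j in range(N_S))
                   for i in range(N_S)])

# FFT version: repeat the samples once to keep the periodicity, then let
# np.fft.fft zero-pad both sequences to 64 points; keep the first N_S outputs.
r_rep = np.concatenate([r_l, r_l])
S_f = np.fft.fft(s, n=N_FFT_S)
y_fft = np.fft.ifft(np.conj(S_f) * np.fft.fft(r_rep, n=N_FFT_S))[:N_S]

# Zero-padding a single period without repetition breaks the periodicity.
y_bad = np.fft.ifft(np.conj(S_f) * np.fft.fft(r_l, n=N_FFT_S))[:N_S]
```

Here `y_fft` matches the time-domain result, whereas `y_bad` does not in general, because the samples that should wrap around have been replaced by zeros.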
The process is similar to the previous implementation, except that more samples are needed to compute the circular correlation, and therefore the processing time is longer. Moreover, the resources required by an FFT of 64 points in terms of logic, memory, and multipliers are not negligible, therefore such an FFT will likely require more resources than the implementation of $N_S$ accumulators (except maybe if $N_S = 100$, as with the E5a and E5b signals). Consequently, the use of the FFT for the circular correlation over the secondary code is not recommended.
SUMMARY
Table 1 provides a summary of the memory needed and of the processing time for each considered implementation. Let's first look at the sequential implementations. Comparing the pre-FFT and post-FFT sequential implementations (Figures 3 and 6), the second one requires a higher processing time due to the zero-padding (this extra time can be significant if $N_Z$ is large), and its required memory is multiplied by $(B_1 + \lceil \log_2 N_S \rceil) / \left(2(B_0 + \lceil \log_2 N_S \rceil)\right)$. Usually, $B_0$ is rather small (since the incoming signal is typically quantized with 2 bits and the local carrier replica as well [33]), and $B_1$ is not small because the FFT requires a certain number of bits to provide accurate results (typically 16 bits, from experience). Thus, the memory requirements for both implementations can be relatively close. Therefore, the pre-FFT sequential implementation appears preferable to the post-FFT sequential implementation.
For the post-FFT sequential implementation using a memory ( Figure 5 ), its processing time is roughly half the one of the post-FFT sequential implementation (Figure 6 ), whereas the memory is multiplied by a factor close to N S . Note however that the FFTs require a significant amount of memory, and that the incoming signal is also stored (see Figure 1) , therefore the total amount of memory needed for the acquisition is multiplied by a factor less than N S . For the post-FFT implementation using a memory with a sequential secondary code circular correlation (Figure 8 ), there is a slight increase in the processing time and a slight decrease in the memory requirements. Thus, the most suitable of these three post-FFT sequential implementations will depend on the context and design constraints.
Let's now compare the parallel implementations. Comparing the pre-FFT and post-FFT parallel implementations (Figures 4 and 7), the second one has a lower processing time (by a factor of at most two), whereas the memory is multiplied again by a factor $(B_1 + \lceil \log_2 N_S \rceil) / \left(2(B_0 + \lceil \log_2 N_S \rceil)\right)$. Therefore, there is probably an advantage for the post-FFT implementation, but the context and the design should be taken into account to make a precise evaluation.
For the post-FFT implementation using a memory with a parallel secondary code circular correlation (Figure 9 ), its processing time is higher than the one of the post-FFT parallel implementation ( Figure 7 ) by a factor less than 3/2, whereas its memory is multiplied by a factor B 1 /(B 1 + ⌈log 2 N S ⌉), which is smaller than one. Therefore, it is again difficult to decide between these two implementations without more information about the context and the design.
For the post-FFT implementation using a memory with an FFTbased secondary code circular correlation (Figure 10 ), the processing time is longer than the one of the post-FFT implementation using a memory with a parallel secondary code circular correlation ( Figure 9 ) by a factor of at least 4/3, and the memory requirement is slightly higher due to the small FFTs. Therefore, this implementation is less efficient and not interesting.
To have a more concrete evaluation, let's consider two examples, one corresponding to a "low-cost" receiver where the incoming signal is quantized with few bits and sampled with a low frequency, and one corresponding to a "high-end" receiver using more bits for the quantization and a higher sampling frequency. The parameters selected considering the GPS L5 pilot signal are shown in Table 2 , and the results are shown in Tables 3 and 4 .
For the evaluation of the memory required by the FFTs, we have considered the FFT core provided by Altera. Such an FFT of 65,536 points, using a streaming data flow and 16 bits of resolution and implemented on an Altera Stratix V FPGA, requires about 12.5 Mbits of memory [32]. The memory required by an Altera FFT roughly doubles when the length is doubled [34]; therefore, we can assume that, if it existed, an FFT of 1,048,576 points would require approximately 200 Mbits. Note that these amounts of memory could nonetheless be significantly reduced (by about 75%) by using an alternative implementation of the circular correlation [32], although this is not reported in Tables 3 and 4. Note also that to store the incoming signal (see Figure 1), an additional memory is needed, for example of $2N_P N_S \times B$ = 1,636,800 bits if $B$ = 2 bits are used for the quantization.
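The two memory figures quoted in this paragraph can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers given in the text (12.5 Mbits for 65,536 points, doubling with the length, and the minimum L5 pilot dimensions); the variable names are illustrative.

```python
# FFT memory extrapolation: the memory roughly doubles each time the length doubles.
fft_mem_65536_mbits = 12.5
doublings = (1_048_576 // 65_536).bit_length() - 1    # 16 times longer, i.e. 4 doublings
fft_mem_1m_mbits = fft_mem_65536_mbits * 2 ** doublings

# Memory storing the incoming signal, following the expression 2 * N_P * N_S * B.
N_P, N_S, B = 20_460, 20, 2
input_mem_bits = 2 * N_P * N_S * B
```

This gives 200 Mbits for the hypothetical 1,048,576-point FFT and 1,636,800 bits for the input signal memory, in agreement with the values above.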
In Tables 3 and 4, we clearly see that the pre-FFT and post-FFT sequential implementations require much less memory than the other implementations, but they have a much longer processing time. It can also be seen that when there is no zero-padding (high-end receiver case), the post-FFT implementation has the same processing time as the pre-FFT one but uses less memory. Still considering a sequential implementation, the use of a memory to store the correlation results greatly increases the memory for a small decrease of the processing time. Finally, the parallel implementations use a lot of memory but greatly decrease the processing time, and the post-FFT parallel implementations are better than the pre-FFT parallel implementation since the processing time and the memory can both be lower. The parallel FFT implementations have a processing time close to that of the theoretical direct correlation, or even better for the high-end receiver (because the FFT for the direct correlation uses $2^{21}$ points due to the chosen sampling frequency), whereas the direct correlation requires a much higher amount of memory for the very large FFTs (and a higher amount of logic, not mentioned in the tables). In conclusion, we can say that with both receivers, the most suitable implementations are the post-FFT parallel implementations. Comparing both receivers, the high-end one uses more memory and its processing time is longer due to the higher quantization and sampling frequency. Of course, the sequential and parallel implementations considered here are the two extremes; it is also possible to test only a few delays of the secondary code in parallel, which would balance the memory requirements and the processing time.

* These values are for unoptimized FFT implementations and could be reduced by about 75% [32].
USE OF DUAL READ ACCESS MEMORY
In the previous discussions, it was assumed that only one sample could be read from a memory at each clock cycle. However, the memories inside FPGAs usually offer dual read access, so two samples stored at different addresses can be read simultaneously. This can be used to reduce the processing time of the implementations discussed previously, but not all of them can benefit from it, as discussed next.
For Figure 3, if we can access two samples of $x_{i,n}$ at the same time, the processing time can be halved since the bottleneck is the access to the input signal. However, since $x_{i,n}$ is taken after the mixer with the local carrier, this would require two local carrier generators, so it is not straightforward to implement. For Figure 4, the processing time can be reduced only slightly, by a factor of at most 4/3, because the bottleneck is the correlation computation; the implementation complexity is the same as for Figure 3. For Figures 5 and 8, the processing time can be almost halved since the bottleneck is mainly the memory reading, and this is simple to implement since it concerns the memory storing the correlation results and does not complicate the access to $x_{i,n}$. For Figure 9, the processing time can be reduced only slightly, by a factor of at most 6/5, because the bottleneck is mostly the correlation computation; the implementation is as simple as for Figures 5 and 8. For the other implementations (Figures 6, 7, and 10), dual read access cannot be exploited and thus the processing time stays the same.
NEW METHOD TO REDUCE THE PROCESSING TIME
In this section, we describe a method that reduces the theoretical number of operations related to the secondary code correlation by about 50%, and discuss its application for a hardware implementation. Note that this method is not an approximation, i.e. the output will be exactly the same as previously, and thus the performance in terms of sensitivity is exactly the same.
The main idea is to rewrite the local secondary code as

$$ s = \mathbf{1} + s' \qquad (12) $$

where $\mathbf{1}$ is a vector composed of ones only. In this case, the elements of $s'$ take the value 0 or -2. Note that the local secondary code is not modified; it is simply expressed as the sum of two codes, and this concerns only the local code, not the incoming one. Thus, (8) and (9) can respectively be rewritten as (13) and (14).

Note that (13) and (14) are not approximations of (8) and (9); the output $y$ is exactly the same in all cases. Only the way to compute $y$ is different. Since $x_\Sigma$ and $r_\Sigma$ are sums of signals still containing a secondary code, one may think that they contain mostly noise, are therefore not useful, and could be removed from the computation, but this would be wrong. Even if they indeed contain mostly noise, they are simply intermediate results: their noise terms cancel against the same noise terms when $x_\Sigma$ and $a'_i$, or $r_\Sigma$ and the combinations of $r_i$, are added, and in the end the output $y$ has the same noise component as with the traditional method. Removing $x_\Sigma$ or $r_\Sigma$ from (13) and (14) would change the operation performed, add more noise, and therefore degrade the sensitivity. Therefore, (13) and (14) should be applied as they are.

In (8), the computation of one combination requires $(N_S - 1)2N_P$ additions; thus the computation of the $N_S$ combinations requires $N_S(N_S - 1)2N_P$ additions. Table 5 shows the number of additions for (8) and (13), considering 50% of zeros in $s'$ and the actual number of zeros for the GNSS secondary codes. It can be seen that when $N_S$ increases, the reduction of the number of operations approaches 50% in the worst case, and it is slightly above 50% for the GNSS signals. The same reduction is obtained for the post-FFT equation. Therefore, since this method reduces the number of operations, it can be useful, for example, for receivers based on a digital signal processor. Let us now examine the applicability for FPGA-based receivers.
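The equivalence claimed above can be checked numerically with a small sketch. Everything in it is illustrative: the 20-chip code is a randomly ordered sequence with 12 ones and 8 minus ones (the counts given for the L5 pilot code, not the actual code), the index convention `s[(n - i) % N_S]` for the i-th combination is an assumption, and the addition count for the split form (one pass to build the sum of all portions, then one block addition per nonzero element of s′ and per output) is a plausible reading rather than the exact expression of Table 5. Additions are counted in blocks of $2N_P$ samples.

```python
import random


def combine_direct(x, s, i):
    """Traditional combination: sum_n s[(n - i) mod N_S] * x[n], sample by sample."""
    n_s, length = len(s), len(x[0])
    out = [0] * length
    for n in range(n_s):
        w = s[(n - i) % n_s]
        for k in range(length):
            out[k] += w * x[n][k]
    return out


def combine_split(x, s, i, x_sum):
    """Split form: start from the sum of all portions, then add s' * x[n] with s' = s - 1."""
    n_s = len(s)
    out = list(x_sum)
    reads = 0  # portions of the input actually read for this output
    for n in range(n_s):
        if s[(n - i) % n_s] == 1:
            continue  # s' is zero here, so the portion is skipped
        reads += 1
        for k in range(len(out)):
            out[k] -= 2 * x[n][k]  # s' = -2
    return out, reads


def additions_direct(n_s):
    """Block additions for all N_S combinations with the traditional form."""
    return n_s * (n_s - 1)


def additions_split(n_s, n_nonzero):
    """Block additions with the split form (assumed counting, see lead-in)."""
    return (n_s - 1) + n_s * n_nonzero


random.seed(0)
s = [1] * 12 + [-1] * 8          # hypothetical code with the L5 pilot counts
random.shuffle(s)
x = [[random.randint(-3, 3) for _ in range(16)] for _ in s]  # integer test data
x_sum = [sum(col) for col in zip(*x)]

results = [combine_split(x, s, i, x_sum) for i in range(len(s))]
all_equal = all(results[i][0] == combine_direct(x, s, i) for i in range(len(s)))
reads_per_output = {r for _, r in results}

print(all_equal, reads_per_output, additions_direct(20), additions_split(20, 8))
```

Under these counting assumptions, a 20-chip code with 8 nonzero elements in s′ needs 179 block additions instead of 380, i.e., about 53% fewer, which is consistent with the "slightly above 50%" reduction mentioned for the GNSS codes. Each output also needs only 8 reads of the input instead of 20.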
For FPGA-based receivers, we focus on the pre-FFT sequential implementation and (13). Previously, with (8), for each portion of the output ($y_0$, $y_1$, ...), it was necessary to combine $N_S$ portions of the incoming code ($x_0$, $x_1$, ...) before performing one FFT-based correlation, as already shown in Figure 15. Now, with (13), for each portion of the output, it is necessary to combine only about half of the portions of the incoming code, since on average half of the elements $s'_n$ are zero. If a portion of the incoming code is multiplied by 0, we simply do not read it from the memory, so reading the memory is about twice as fast. However, we also need to add a special combination of the incoming code ($x_\Sigma$, the sum of all the portions). Since this special combination is identical for all the portions of the output, we can compute it only once and store it in another memory. This memory is then read when $x_\Sigma$ and $a'_i$ are added. Accessing this second memory does not impact the processing time, because it is read at the same time as the last $x_i$ used to compute $a'_i$, as shown in Figure 24. The corresponding implementation is shown in Figure 11.
Therefore, to compute each portion of the output ($y_0$, $y_1$, ...), it is necessary to read only two portions of the incoming code ($x_0$, $x_1$, ...) instead of four, as illustrated in Figure 24. The processing starts by accessing all the portions of the input and summing them to compute and store $x_\Sigma$. Then, it works as the pre-FFT sequential implementation, except that only the portions of the input that are not multiplied by zero are accessed, and that $x_\Sigma$ is added when each $a'_i$ becomes available.
With this implementation, the memory needed is twice $4N_P(B_0 + \lceil \log_2 N_S \rceil)$ bits for the accumulation, as for the pre-FFT parallel implementation using two accumulators. However, looking at the processing time of both implementations (Figures 17 and 24), the one using the new method can have a lower processing time because more than half of the elements of $s'$ may be zeros, and because the zero-padding has less impact.
For example, the L5 pilot secondary code contains 12 ones and 8 minus ones. Therefore, the code $s'$ contains 12 zeros, i.e., 60% of the total length. Using the same numerical values as previously for the "low-cost" receiver, the memory needed for both implementations is $72N_P$ = 1,474,560 bits. The processing time is $36{,}003N_P + 2N_Z$ = 737,390,592 clock cycles for the pre-FFT sequential implementation using the new method, and $40{,}005N_P + 1{,}002N_Z$ = 843,927,552 clock cycles for the pre-FFT parallel implementation using two accumulators, which means a reduction of about 12.6%. Therefore, the proposed technique may be attractive for a hardware implementation. Note that dual read access can be exploited to approximately halve the processing time.
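The figures above can be reproduced with a few lines of arithmetic. In this sketch, the values $N_P$ = 20,480 and $N_Z$ = 24,576 are not taken from the text of this section: they are inferred from the quoted totals (1,474,560 / 72 and the remainders of the two cycle counts) and should be read as assumptions about the "low-cost" receiver configuration.

```python
# Values inferred from the quoted totals (assumptions, see lead-in).
N_P = 20_480   # 1,474,560 bits / 72
N_Z = 24_576   # solves both cycle-count expressions for the quoted totals

memory_bits = 72 * N_P

# Pre-FFT sequential implementation using the new method.
cycles_new = 36_003 * N_P + 2 * N_Z

# Pre-FFT parallel implementation using two accumulators.
cycles_parallel = 40_005 * N_P + 1_002 * N_Z

# Relative reduction of the processing time.
reduction = 1 - cycles_new / cycles_parallel

print(memory_bits, cycles_new, cycles_parallel, f"{100 * reduction:.1f}%")
```

With these inferred values, both cycle counts match the totals given in the text, and the ratio gives the reduction of about 12.6%.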
Of course, the choice of subtracting or adding one to the secondary code in (12) depends on the code that we have. The goal is to have as many zeros as possible in s′. Note that there are also some variants of this method providing better performance but not applicable to every code [35], [36].
Figure 11. Implementation of the pre-FFT secondary code removal using the proposed technique (see (13)), computing each combination of the input sequentially. See the timing diagram in Figure 24.
CONCLUSION
In this article, we have compared the possible hardware implementations of the parallel code search acquisition for GNSS signals having a secondary code. Since directly applying the FFT over the entire tiered code is not possible, or at least extremely resource-consuming in hardware, a better solution already suggested in the literature is to perform FFT-based circular correlations over the primary code and to combine the results. Focusing on this solution, we have compared different hardware implementations, including the cases where the combinations are performed before or after the FFT-based correlations, where they are performed sequentially or in parallel, and where the output is provided in different orders. Moreover, we have analyzed the memory requirements and the processing time of each implementation. From these comparisons, it has been shown that some implementations are not attractive (such as the one using a second FFT-based circular correlation for the secondary code), since they consume more memory and have a longer processing time than other implementations. It has also been shown that the direct correlation that applies the FFT over the entire tiered code would not be attractive, because some of the proposed parallel implementations have only slightly longer processing times but require much less memory.
Generally, the choice of the most suitable implementation is a compromise between the memory used and the processing time. However, if the various parameters are specified (such as the quantization of the signals, the sampling frequency, the number of coherent or noncoherent accumulations, and the number of frequency bins to test), both the memory and the processing time can easily be evaluated using our results, since all the formulas are provided.
In addition, we have proposed a new method that approximately halves the number of operations related to the secondary code correlation, and slightly reduces the total processing time (by 12.6% in the example considered) for a hardware implementation. The idea of this method (which is not an approximation) is to add 1 to or subtract 1 from the binary secondary code to obtain a code in which at least half of the elements are zeros, and to use this code to perform the correlation.
APPENDIX B
TIMING DIAGRAM OF THE IMPLEMENTATIONS 
