Abstract
I. INTRODUCTION
A GPS satellite transmits navigation data that is used by the receiver to calculate position. The navigation data is modulated or "spread" by the coarse acquisition (C/A) code, which is a pseudorandom noise (PN) binary sequence, as shown in Figure 1 . The navigation data bandwidth is 50 Hz, while the C/A code bandwidth is 2.046 MHz. Spreading of a narrow band signal by a modulating code is known as spread spectrum technology. At the cost of increased bandwidth, spread spectrum transmission provides benefits such as more secure communication and multiple users at the same signal frequency.
In many cases the GPS signal is 20 dB below the noise floor, so instead of the 1 ms of data usually required, a receiver may have to process 10 ms of data to acquire the signal [1] . This provides special challenges for extracting the phase and carrier information from the received signal. The computational time requirements are quite large, and if the acquisition is attempted in the time domain, then approximately the order of 4*N 2 multiplications and additions are required, where N represents the number of data samples [2] . These computational demands discourage the performing acquisition, especially in a weak signal environment where repeated locks may be required.
A number of approaches, based in the frequency domain, have been developed to deal with this challenge. If a FFT is used, the number of calculations is reduced to (2*log 2 2N+l) * 2N additions and half of the number of multiplications as before. Compared with the time domain approach, the overall efficiency of the system improves exponentially as the number of data points increases.
Even with the use of the FFT approach, time is still a limiting factor using a purely software implementation. As to hardware implementation the FPGA allows a relatively inexpensive way to map the algorithm onto dedicated hardware and thus achieve the speeds necessary to complete the algorithm.
This paper describes the architecture and implementation of circular correlation in the frequency domain [3] . A Xilinx Virtex -II Pro FPGA implementation of a GPS C/A code acquisition system is presented. This system is designed to operate on signal-to-noise (S/N) ratio around -20 dB, using 10 ms of received data to achieve lock on the C/A code. 
II. THE FFT APPROACH ON AN FPGA

A. Circular Correlation using FFT
The acquisition of a spread spectrum system involves correlation of the incoming signal with a locally generated synchronized version of the transmitted spreading code. Because of Doppler shift, there is some uncertainty in the carrier frequency. When the incoming signal is downconverted to a base band signal, the Doppler shift may affect the center frequency, causing the base band signal to be "off" by a few KHz. The magnitude and position of the correlation peak provides information about the relative phase of the code and can also be used to determine the carrier frequency. If a number of correlations are performed on the same input data down converted by slightly different amounts, then the correlation that gives the highest peak can be used to infer the Doppler shifted carrier frequency. The use of the discrete Fourier transform (DFT) to perform circular correlation allows a more rapid operation and with respect to signal processing gives the same result as a time domain operation [4] . Many simulations using data with a known frequency were run to determine the optimal data length for the GPS receiver. The results of signal detection probability are Table I [1] . The criterion for a successful run is the measured frequency must be within ±2 Hz of the known frequency. 
B. Algorithm and Implementation
The data streaming into this design represent a signal that has already been down converted to around 1.25 MHz (Doppler effect brings the uncertainty) and sampled at 5 MHz. It has been quantized to 8 bits in the 2's complement format. The design takes the data in blocks of 50,000 and down coverts these to base band by multiplying with a complex exponential of frequency 1.25 MHz. The output is now complex. This output is "subsampled" to generate 4,096 samples from each frame of 5,000. The system then performs an equivalence of a sliding correlation with the 4,096 samples (generated by "subsampling") of 1,023 chips of the reference C/A code on the entire block. The correlation operation is actually realized by performing a series of ten 4,096 point FFT's on the input block and then multiplying the first 2,048 of each of these FFT outputs with the first 2,048 outputs of the 4,096 point FFT of the C/A code. These multiplications are then followed by a 2,048 point IFFT to yield 10 time-domain circular correlations. Each of the 10 correlation results is stored and read out in parallel one at a time and fed into a 10 point FFT. This results in 2,048 * 10 point FFTs. These results are then stored in an array. The final step is to search the results and find the peak with the highest magnitude; the magnitude and position of the peak is recorded. This process is repeated with 10 different complex exponentials ranging from (1.25 MHz -4 KHz) to (1.25 MHz + 5 KHz). The algorithm then extracts the global peak and position from the various down conversions and uses this information to infer the code phase and Doppler shift of the incoming signal.
III. ARCHITECTURE AND DESIGN
A. Design Choices
An effective architecture will require that while the first 10 ms of data are being processed; the second 10 ms worth of data is being buffered. The 10 ms worth of data is equal to 50,000 samples. This leads to a decision of designing a 2-port RAM of depth 100,000 to read from and write to the buffer at the same time. The designed 2-port RAM is more effective than the FIFO generated by the Xilinx core generator for three reasons:
1. Xilinx IP cores only generate FIFOs with memory size of radix-2, so the closest size to 100,000 is around 131,072, leading to a waste of resources. 2. The FIFO has a "quirk" called first-word-fall-through after reset, which would add extra hardware for the control logic. 3. The need to reuse the input data 10 times.
Another design challenge was the generation of the C/A code samples. The code samples correspond to the code that is generated at 1.023 MHz and sampled at 5 MHz. One could generate a 1.023 MHz clock for a LFSR that generates the code, and then use a separate 5 MHz clock for the sampling circuit. This however would lead to a synchronization issue between the two clock domains. Both clocks would have to be rising at exactly the same time to ensure alignment of the code samples with the originally generated code. Even if this issue could be overcome, there is yet another challenge to design FPGA circuits that can produce a 1.023 MHz clock from the existing crystal-generated on-chip clock. The ratio of 5 to 1.023 is not a rational number and hence cannot be used as an input to the delay locked loop (DLL) that is used to generate the clocks of desired frequency.
The solution that was arrived at is as follows. A counter is set to increment by multiples of 1,023, and 5,000 is subtracted from the count every time the count exceeds 5,000. So basically we are dividing 5,000 by 1,023 and only keeping the remainder. This cycle repeats indefinitely. An output is enabled for one clock every time the count exceeds 5,000. This signal enables the shift logic of the code generator, advancing it by one state. To the external environment it then appears that the 1.023 MHz code is being sampled at 5 MHz. A check with the Matlab of generated and sampled C/A code reveals identical sequences.
The other challenge with this design was the mapping of a 5,000 point DFT from the Matlab onto the hardware. In the Matlab model, the input data is processed in frames of 5,000. A 5,000 point DFT is straightforward in Matlab because of the use of floating point calculations. In a fixed-point hardware, this is a challenge. Most hardware implemented DFTs are mainly either radix-2 or radix-4 FFTs. The closest hardware implementation of a 5,000-point DFT is a 4,096 point FFT. Therefore, there needs to be a way to convert the 5,000 samples to 4096 samples. One choice explored was the averaging correlation technique [5] . In this technique, the 5,000 samples would be averaged to 4,096 samples prior to application of the FFT. Although there has been success using the averaging correlation method [5, 6] , it is relatively hardware and time-intensive. It worked well for a 1 ms acquisition, but for a 10 ms acquisition, it would be too cumbersome. In the interest of hardware simplicity a pseudorandom sampling technique is proposed instead. If a pseudorandom scheme is used, then the data set retains most of its auto-correlation properties. Auto-correlation properties are important for direct sequence spread spectrum acquisition [7] .
After experimenting with a few "sub-sampling" algorithms, we propose an effective algorithm which proves to give the desired results. The algorithm is described below: 1) Out of every 11 samples, the 5 th and 11 th samples are dropped. This procedure is repeated 452 times to give 4,068 samples, and 2) The remaining 28 samples are appended to the 4,068 samples to give a total of 4,096 samples for further possessing. This "sub-sampling" technique is applied to the base band data in order to preserve the relative sample position of the code to the incoming data.
One limitation caused by the conversion of 5,000 samples to 4,096 samples is that when the final C/A code phase is extracted, there needs to be a multiplication by a 5,000/4,096 correction factor. This ratio is 1.2207 which is not a whole number and is difficult to represent exactly in a fixed-point notation. Figure 2 summarizes the dataflow through the system for one frame of 50,000 samples of data. These samples constitute a possessing "unit". Each unit is operated on at 10 slightly different down conversion frequencies, giving rise to a total of 10 "stacks" for sorting by taking into account the uncertainty caused by the Doppler shift. The stack with the greatest global maximum is considered to correspond to the optimal down conversion frequency.
B. Global Data Flow
There is a waiting period between the time the first 50,000 unit is completed with processing and the time the next unit is completely read into the two-port RAM. The dataflow is partitioned into four functional domains: 1) Data Capture Domain, 2) 4,096 FFT domain, 3) 2,048 IFFT domain, and 4) 10 point DFT/Sorting Domain. Considering these breakdowns, each functional domain will be discussed as follows.
B.1 Data Capture Domain
This domain's function is to capture the data that is streaming in at 5 MHz, down convert it to baseband, "subsample" it, and stream it to the 4,096 FFT at 100 MHz. This domain also regulates the generation and storage of the C/A code. It consists of two dual-port RAMs, a direct digital synthesizer (DDS), a C/A code generator, a complex multiplier, counters, and some multiplexing ability to allow re-use of hardware. The block diagram is shown in Figure 3 .
Each dual-port RAM has two address ports, one read port and one write port, and allows data to be written to it from one port, while data is being read from it via a different port. The ports can operate on different clock domains. The control in this domain serves to: 1) prevent "collisions" i.e. trying to read from and write to the same memory location and 2) synchronize the timing of both write and read operations to minimize time wastage. As shown in Figure 3 , the Dual Port RAM 0 is of width 8 bits and depth of 100,000, and is configured as a circular buffer that continuously reads in the data at 5 MHz. The Dual Port RAM 1 is actually two dual port RAMs, each of depth 40,960 and width 16 bits. Both the real and imaginary "subsampled" downconverted data are stored for streaming to the FFT module. When the system is initialized for a given satellite, the Dual Port RAM 1 stores the C/A code (which is all real) hence the need for a multiplexer. This C/A code is streamed to the 4,096 FFT for further processing before the primary data is written to the Dual Port RAM 1. This initialization process occurs very early on in the cycle and is completed within about 95 µs of asserting global start. The counters are used for write and read address generation for the RAMs as well as coordination of the write signal to the Dual Port RAM 1 during the "subsampling" stage. The write signal is disabled at certain counts to "drop" specific data points for the "subsampling".
The direct digital synthesizer (DDS) generates the equivalence of complex sinusoids ranging in frequency from (1.25MHz -4 KHz) to (1.25 MHz + 5 KHz), and sampled at 5 MHz. Because of the need to operate the system at 100 MHz, the DDS is configured to produce frequencies 20 times the actual value stated.
The Complex Multiplier (cplx mult) is a two-stage pipelined complex multiplier that performs a point multiplication of y(n) and the complex output c_e(n) of the DDS. The output of the multiplier is therefore complex. This multiplication is used to down convert the input sequence to base band. The complex data from the multiplier are "subsampled" and stored in the Dual Port RAM 1.
The C/A code generator is a gold code generation module, which consists of a pair of LFSRs, a controller that schedules loading and shifting the code generator, a ROM for the initial fills, and a clock division circuit for generating a sampled version of the code. This module, although clocked at 100 MHz, generates the equivalent of 5 MHz C/A code samples.
The control in this domain also regulates the streaming of the data from the Dual Port RAM 1 to the FFT module and the starting of the 4,096 FFT core. These data are streamed out from the RAM to the FFT in bursts of 40,960 except for the first burst of 4,096, which is the C/A code. This controller uses counters to synchronize the read address from the Dual RAM 1 and also to ensure that an "fft_start" is pulsed every 4,096 clocks. This allows 40,960 samples to be processed as 10 blocks of 4,096 by the FFT.
B.2 FFT 4,096 Domain
As shown in Figure 4 , this FFT 40,960 Domain serves to: 1) regulate the storage of the FFT of the C/A code (C_A(F)) and 2) synchronize the C_A(F) signal with the FFT of the down converted data (D_C(F)).
The first 2,048 samples of 4,096 C_A(F) samples are stored in a RAM via the demux, and within 155 µs of asserting global start, the 2,048 FFT output samples from these data are stored and ready for use. The demux then switches to the alternate (lower) output and stays in this mode for the remainder of the acquisition process until a new satellite is selected.
As shown in Figure 4 , the Complex Multiplier (cplx mult) multiplies D_C(F) and C_A(F) and generates the output D_S(F). This multiplication is a point multiplication equivalent to circular correlation of the base band y (n) and c_a (n) in the time domain. The actual conversion of D_C(F) to complex conjugate takes place in the complex multiplier. The multiplier is modified so the subtraction and addition operations are reversed after operands are collected. If the two operands are (a+jb) and (c+jd), then a straightforward complex multiplier would yield: (ac-bd) for the real and (bc+ad) for the imaginary. If the second operand is the complex conjugate of the first, i.e
. (c-jd), then the result is (ac+bd) and j(bc-ad).
The 4,096 FFT module is a pipelined streaming I/O 4,096 point FFT engine generated by the Xilinx Logicore Software.
The core can, "simultaneously perform transform calculations on the current frame of data, load input data for the next frame of data, and unload the results of the previous frame of data. The user can stream in input data and, after the calculation latency, can continuously unload the results" [8] . The core is designed to stream in both real and imaginary inputs and stream out complex outputs after a 4,096 clock latency. The input and output are serial, and once the module is started, it will continuously compute FFTs of data in 4,096 frames. The only control required is to activate the core for each new 4,096 frame of d_c (n). This control comes from the previous stage. The algorithm requires only the first half of the output D_C(F) for use in the next stage, but the core needs to output the results of the entire transform before the next frame output can be accessed. This causes an unavoidable delay. Because one processing unit is effectively 4,096 * 10 samples long, the 4,096 FFT must process the unit in 10 frames.
B.3 2,048 IFFT Domain
This domain is primarily concerned with storage and parallel processing of the frames of data from the 2,048 IFFT. It consists of a 2,048 IFFT, a bank of 10 FIFOs (each has a depth 2,048), and demultiplexing capability to distribute data and control to each of the FIFOs, as shown in Fig. 5 . The 2,048 IFFT is a pipelined streaming I/O architecture. This module calculates the IFFT of D_S (F) to translate it back to the time domain of d_s(n). This represents the circular correlation of the base band y(n) and c_a(n). It is equivalent to re-sampling the time domain signal at half of the original sampling rate [9] . The 2,048 FIFO stores the complex d_s(n) outputs of the IFFT module. There is in fact a parallel bank of ten of these FIFOs; one for each 2,048 frame of d_s(n) samples from the IFFT. Between each frame is a pause of 2,048 clocks because only the first 2,048 of each set of 4,096 D_C(F) samples are used. The 1-10 demultiplexer functions to route data and write enable signals to the appropriate FIFO for each frame. The control signal from the 2,048 IFFT domain signals to the subsequent domain when all FIFOs are full and ready to present their data samples in parallel with each other.
B.4 10 Point DFT/Sorting Domain
The final domain is depicted in Figure 6 , and is for the final mathematic operation on the data and sorting of results. The 10 point DFT module carries out a discrete Fourier transform in 3 clocks. It accepts 10 real and 10 imaginary inputs in parallel and after 3 clocks, presents the valid 10 point transform (10 real and 10 imaginary numbers) in parallel at its outputs. The module performs a total of 2,048 10-point DFT for each 50,000 input unit. The Max_find_0 searches the outputs of the 10 point DFT to find the maximum absolute value of each set of ten and presents this value along with an associated row index, ranging from 1 to 10 to the max_find_1 module. The max_find_0 carries out a total of 2,048 bubble sorts to give a total of 2,048 maxima. Within max_find_0 are a parallel to serial converter, a serial bubble sorter and a control module. It requires 10 clocks to perform its function. At this point the data have been converted to the unsigned 23-bit numbers.
Max_find_1 serially sorts through the 2,048 outputs of max_find_0 to obtain a local maximum for a given down conversion frequency. Max_find_1 generates an additional column index from 1 to 2,048 to accompany the row index from max_find_0. There is now a single unsigned 23-bit number, a row index column index stored as a result of processing 50,000 points of input data. Max_find_2 serially sorts through the 10 numbers from Max_find_1 to find a global maximum, associating a third index, a stack index, with the final maximum.
The Final Calculation Circuitry takes the three indices as inputs and via some simple calculations decodes them to obtain the carrier frequency and the relative phase of the C/A code of the incoming code. These data are continually fed into an Early Prompt Late module that keeps the relative phase of code adjusted and keeps the down conversion frequency as close to the Doppler shifted frequency as possible within the frequency resolution limits of the system.
The control circuitry of this domain ensures that: 1) the 10 point DFT starts after the FIFOs from the previous stage are filled and 2) the bubble sorters are flushed after each batch of data. The entire acquisition cycle requires about 6 ms.
C. FPGA and Memory Considerations
The chosen Virtex II Pro FPGA has 99,000 equivalent gate count, 400+MHz clock rate, 196 18x18 embedded multipliers, and up to 10 Mbytes of on-chip RAM [10] to provide enough resources and computing power for the proposed 10 ms design. Depending on the relative strength of the signal, both the amount of data needed to achieve lock and the time budget increase. As an example, for 20 ms acquisition, the buffer and 10 point DFT sizes double and acquisition time is estimated to increase to 10.4 ms. For acquisition times longer than 100 ms off-chip memory is needed. Table II shows the memory and estimated power requirements for various acquisition times. 
D. Simulation and Synthesis
The design was simulated to verify functionality using Modelsim. The simulation setup used real data collected form indoor antennas and digitized at 5 MHz. The signal data already had noise as part of them. The data were saved in a file and fed to the VHDL test bench during simulation as 8 bit numbers. Bit truncations were simulated in Matlab using divide by powers of 2 followed by a "floor" operation. The simulated circuit used slightly less bit truncation than Matlab at each stage in order to limit the width of the data path and still keep enough information to extract useful results. A sample simulation highlighted in Figure 7 matched the Matlab results. The acquisition time was around 5.4 ms. One point to note is that the final conversion for the C/A code phase calculation requires either a very large number of bits to represent it as an integer, or it requires a floating point calculation.
The larger modules of the design were synthesized individually using the Xilinx XST synthesis tool in order to gain a better understanding of which components contributed the most to the overall area of the system. As shown in Table III , the 10 point DFT consumes the most resources, followed by the 4,096 FFT and the 2,048 IFFT. This synthesis breakdown gives insight into where further optimizations for this design could focus. The 10 point DFT is a logical place to start for further optimization of the design. The Xilinx IP cores are "black boxes", so basically the only changes that could be made to minimize hardware would be options like switching from streaming I/O to burst mode. 
IV. CONCLUSION
Acquisition is the most important step to a GPS receiver because one must detect the signal. The acquisition generates two important parameters: the carrier frequency and the initial phase of the C/A code. In general, acquisition performed on long data will increase the receiver sensitivity. This paper describes the implementation of a weak signal C/A code acquisition circuit that uses 10 ms of data to perform a frequency domain based operation to realize its function. The novel "subsampling" technique demonstrated in this paper is an alternative to averaging correlation as a way to map an ideal software approach acquisition to hardware.
