INTRODUCTION
The M-code signal, the modernized GPS military signal designed in the late 1990s, is scheduled to be first transmitted by a Block IIR-M satellite in 2005. As described in [1] , the M-code signal's revolutionary design includes a novel modulation [2] , new data message, and new security architecture. M-code signal acquisition relies primarily upon direct acquisition, where in effect the receiver correlates (over time and frequency shifts) a locally generated replica of an M-code signal with the received waveform. When there is a match between the replica and a received signal, coarse synchronization is achieved, and the receiver commences signal tracking, data message demodulation, and position calculation. Since the M-code signal, like the current GPS military signal called Y-code signal, uses very long spreading codes, signal acquisition cannot take advantage of the short spreading codes that simplify acquisition processing in civilian signals such as the GPS C/A-code signal.
While direct acquisition circuits were developed for Y code receivers in the 1990s [3] , these circuits provided much less capability than would be needed for direct acquisition to be the primary mechanism for acquiring the M-code signal. During the design of the M-code signal, studies demonstrated that a combination of factors would allow direct acquisition to surpass the design requirements for the M-code signal. But the results of these studies did not lead to consensus that ICs ready for receiver production in the latter half of this decade could meet the performance requirements while providing adequately low complexity, low parts cost, low peak and average power consumption, and low thermal dissipation. Functioning silicon was needed to remove remaining doubts.
A team with expertise in systems engineering, digital signal processing, and IC design took on the challenge of developing a prototype IC for direct acquisition of the M-code signal. The team identified ways to exploit the unique characteristics of the M-code signal, evaluated processing architectures that balanced risk and capability, developed predictions of performance and of IC complexity, designed and applied algorithms, developed detailed simulations, and traded off processing implementations, yielding design files that were sent to the foundry only 12 calendar months after the design effort began. The resulting "DirAc" ICs have been packaged and tested, confirming that they provide full functionality and meet or exceed performance predictions. Software and hardware development is underway to integrate the IC into a test receiver for further testing.
The next section of this paper discusses direct acquisition of the M-code signal, outlining issues and opportunities to be considered. The following section describes the DirAc architecture. The subsequent section describes the first DirAc IC's design, including digital signal processing, architecture, and layout, built using 180 nm lithography readily accessible in 2001. The succeeding section outlines a second version DirAc that could be developed using 130 nm technology available in 2003. Fundamental performance characteristics are provided in the subsequent section, while the final section summarizes the findings of this paper.
DIRECT ACQUISITION OF THE M-CODE SIGNAL
Signal acquisition involves the steps that take a receiver from a state of being powered on and having passed self-test, to providing an initial estimate of position, time, or velocity (PVT) at specified accuracy. Time to first fix (TTFF) denotes the delay between starting the acquisition process and providing PVT with specified accuracy. In conventional receivers, TTFF then involves the time for coarse initial synchronization (obtaining initial alignment between the receiver's timing and frequency and those of the received signal), signal tracking or other processing that produces refined and repeated estimates of signal timing and frequency, reading the data message to obtain position and time at the satellite transmitting the signal (if needed), obtaining signal tracking and satellite position and time for three or more additional signals, and then calculating PVT using the estimates of signal timing and frequency along with positions and times at the satellites.
Even though coarse initial synchronization is only one part of acquisition processing, this paper complies with common terminology, calling the circuit that performs coarse initial synchronization an "acquisition circuit."
Typically, an acquisition circuit crosscorrelates a locally generated signal replica against the received waveform containing multiple signals, interference and possibly jamming, and noise. In concept, the locally generated reference is shifted in time and frequency, forming a segment of a cross ambiguity function (CAF) [4] between the replica and the desired received signal. The time duration of the signal segments used in the crosscorrelation is called the coherent integration time. Noncoherent integration can be accomplished by adding the magnitudes of multiple CAFs, computed over the same initial time uncertainty (ITU) and initial frequency uncertainty (IFU). This noncoherent integration enhances performance in noise and jamming, but consumes additional time to collect and process the longer segment of received waveform.
Digital processing actually searches discrete values in time and frequency space, called time-frequency cells. The time span and frequency span searched in parallel by an acquisition circuit may be called a time-frequency tile, composed of multiple cells in a rectangular array. If the ITU or IFU is larger than the span of the tile, sequential tiles are computed serially to compute the CAF over the entire ITU and IFU. Figure 1 shows how cells and tiles fit into the ITU and IFU. For signals with short periodic spreading sequences, the largest ITU that must be searched corresponds to the period of the spreading sequence; signals whose spreading sequences have much longer periods require search of the entire ITU. The magnitude CAF, computed with or without noncoherent integration, is used to form a test statistic. A threshold setting algorithm establishes a criterion, and the time and frequency of any test statistic value that exceeds this threshold criterion correspond to a possible coarse initial synchronization. Because setting the threshold too high excessively reduces the probability of a valid detection, the threshold is set low enough that some of the reported detections correspond to false initial synchronization points, and various techniques are used to distinguish between valid and false synchronization points.
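The search over cells within one tile can be sketched in a few lines. The following is a minimal illustration only, assuming a noise-free toy signal, a brute-force time-domain correlation, and hypothetical names and parameter values (`magnitude_squared_caf`, `detections`, a 64-sample replica, a 6400 Hz sample rate); it is not the DirAc implementation.

```python
import cmath
import math
import random


def magnitude_squared_caf(received, replica, delays, freqs_hz, sample_rate_hz):
    """Return {(delay, freq_hz): |CAF|^2} for every cell in one tile."""
    caf = {}
    for delay in delays:
        for freq in freqs_hz:
            acc = 0j
            for k, chip in enumerate(replica):
                # Remove the trial frequency shift, then correlate with the replica.
                rotation = cmath.exp(-2j * math.pi * freq * k / sample_rate_hz)
                acc += received[delay + k] * rotation * chip
            caf[(delay, freq)] = abs(acc) ** 2
    return caf


def detections(caf, threshold):
    """Cells whose test statistic exceeds the threshold."""
    return sorted(cell for cell, value in caf.items() if value > threshold)


# Toy scenario (all values hypothetical): 64 samples at 6400 Hz = 10 msec coherent time.
SAMPLE_RATE_HZ = 6400.0
rng = random.Random(1)
replica = [rng.choice((-1.0, 1.0)) for _ in range(64)]
TRUE_DELAY, TRUE_FREQ_HZ = 5, 100.0
received = [0j] * 80
for k, chip in enumerate(replica):
    m = TRUE_DELAY + k
    received[m] = chip * cmath.exp(2j * math.pi * TRUE_FREQ_HZ * m / SAMPLE_RATE_HZ)

tile = magnitude_squared_caf(received, replica, range(11),
                             [-200.0, -100.0, 0.0, 100.0, 200.0], SAMPLE_RATE_HZ)
peak_cell = max(tile, key=tile.get)
print("peak cell (delay, Hz):", peak_cell)
```

Here the tile is 11 delay cells by 5 frequency cells; a larger ITU or IFU would be covered by calling the function again for the next tile.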
If spacing between time-frequency cells is too wide, a peak in the CAF signifying coarse initial synchronization may occur in between sample points, degrading the opportunity to detect this peak. In general, modulations having narrower peaks must be sampled faster to avoid this problem in the time domain. The required cell spacing in the frequency domain is independent of modulation design, and instead is proportional to the reciprocal of the coherent integration time used in computing CAFs: longer coherent integration times require finer frequency spacing.
When custom hardware is used for size and power efficiency, crosscorrelations are often implemented in the time domain, since two or fewer bits of quantization are needed, and less silicon area is needed. When the crosscorrelations used to compute the CAF are implemented using time-domain computations, the number of arithmetic operations required to compute a CAF that covers a given ITU increases with the square of the sampling rate. Thus, for signals whose modulation is binary phase shift keying with rectangular spreading symbols, the computational burden (quantified as the rate of arithmetic operations) increases with the square of the spreading code rate. Since the correlation peak is very narrow for signals with binary offset carrier (BOC) modulation, the same logic would motivate even higher sampling rates, and thus higher computational burdens, particularly when the subcarrier frequency is greater than the spreading symbol rate.
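The quadratic growth can be seen from a simple operation count. This is a rough sketch under the assumption that each delay cell costs one multiply-accumulate per sample in the coherent integration time; the function name and the numerical values are illustrative only.

```python
def time_domain_caf_ops(itu_s, coherent_time_s, sample_rate_hz):
    """Multiply-accumulates to correlate every delay cell across the ITU once."""
    delay_cells = itu_s * sample_rate_hz                 # delays to test across the ITU
    macs_per_cell = coherent_time_s * sample_rate_hz     # samples in one correlation
    return delay_cells * macs_per_cell


# Hypothetical numbers: 10 msec ITU and 10 msec coherent integration time.
base = time_domain_caf_ops(0.010, 0.010, 5.0e6)
doubled = time_domain_caf_ops(0.010, 0.010, 10.0e6)
print("ratio when the sampling rate doubles:", doubled / base)
```

Both the number of delay cells and the length of each correlation scale with the sampling rate, so doubling the rate multiplies the count by four.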
Fortunately, sideband acquisition processing of BOC modulations [5] significantly reduces the computational burden for signals having BOC modulations with subcarrier frequency greater than the spreading code rate. Since the signal spectrum has distinct upper and lower sidebands, they can be separately filtered, downconverted to DC, and decimated at a sample rate commensurate with the spreading code rate, independent of the subcarrier frequency. Separate CAFs can be formed from the resulting waveforms from the upper and lower sidebands, using as a replica signal the spreading sequence without subcarrier modulation. The CAFs from upper and lower sidebands are noncoherently integrated. 
DIRAC ARCHITECTURE
This section describes the DirAc architecture. An overview is provided first, followed by more detailed discussion of separate portions.
DirAc Architecture Overview
The DirAc signal processing architecture is shown in Figure 1. The input is four separate sampled input streams: the inphase upper sideband (I_USB), the quadraphase upper sideband (Q_USB), the inphase lower sideband (I_LSB), and the quadraphase lower sideband (Q_LSB), each clocked at 5. The fast Fourier transform (FFT) performs a 32-point zero-padded complex FFT of the I and Q data. Zero-padding interpolates between frequency bins to reduce scalloping loss caused when the true frequency value falls between the discrete frequencies computed in the CAF. The FFT provides a coherent integration time of 16 × 0.625 = 10 msec while also computing the CAF over different frequency values. Each FFT output bin is 100 Hz wide, with adjacent bins overlapped by 50 Hz. To further reduce losses [6], only the 16 center bins in the FFT are retained, covering ±400 Hz.
Magnitudes of the separate CAFs computed from upper and lower sidebands are then summed, forming a single magnitude CAF over 10 msec and ±400 Hz.
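The frequency-bin figures quoted for the FFT follow from the short-time correlation length and the FFT size. The arithmetic can be checked with a short sketch; the variable names are illustrative, and placing the retained bin centers at indices -8 through 7 is an assumption about how the 16 center bins are chosen.

```python
import math

SHORT_TIME_S = 0.625e-3      # duration of one short-time correlation
SHORT_TIMES_PER_FFT = 16     # nonzero FFT inputs
FFT_SIZE = 32                # after zero-padding
KEPT_BINS = 16               # center bins retained

coherent_time_s = SHORT_TIMES_PER_FFT * SHORT_TIME_S       # 10 msec
bin_width_hz = 1.0 / coherent_time_s                        # resolution set by the data length
bin_spacing_hz = 1.0 / (FFT_SIZE * SHORT_TIME_S)            # spacing set by the padded length
covered_span_hz = KEPT_BINS * bin_spacing_hz                # total retained coverage

# Assumed indexing of the retained bins: k = -8 ... 7 around zero frequency.
bin_centers_hz = [k * bin_spacing_hz for k in range(-KEPT_BINS // 2, KEPT_BINS // 2)]

print(coherent_time_s, bin_width_hz, bin_spacing_hz, covered_span_hz)
```

Zero-padding by a factor of two halves the spacing between bin centers without narrowing each bin, which is why adjacent 100 Hz bins overlap by 50 Hz.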
When the noncoherent integration time is long, code Doppler can degrade performance unless compensation is used. Code Doppler compensation (CDC) prevents time smearing of a CAF peak due to differences between the receiver and satellite spreading code rates. 
CMF Bank Description
Simpler versions of the CMF bank design can be found in [6, 7]. An advantage of the CMF bank design implemented in DirAc, as compared to other parallel cross-correlation methods (e.g. [8, 9, 10] The tap structure in the short-time CMFs, shown in Figure 4, exploits the fact that multiple input data streams (I_USB, Q_USB, I_LSB, Q_LSB) are being crosscorrelated with the same reference spreading sequence. Interleaving the data and running at 20.46 MHz allows the same multiplier, code registers, and adder tree to be reused for each input data stream. Each short-time CMF has 3197 taps, and each tap structure includes a shift register of length 4, a multiplier, and two code registers. The two code registers allow for a seamless transition from one code to another. This is especially advantageous when noncoherent integration over multiple time segments is required.
Each CMF consists of multiple tap structures connected to a dual adder tree structure. One adder tree sums the products corresponding to the odd numbered taps and the other sums the products corresponding to the even numbered taps. The resulting sums are kept separate and can be noncoherently added together after the FFT processing to support M-code signal characteristics [11]. In the current design, every 10 milliseconds the CMF bank and FFT processing generates a 51150 × 16 array of samples corresponding to a 10 msec by 800 Hz region of time-frequency space. If there is an ITU greater than 10 msec and noncoherent integration is not needed, then the same code reference PN sequence remains in the CMF taps for the entire search time. However, if noncoherent integration over multiple 10 msec intervals is required, the reference PN code is updated at every 10 msec interval, so the code offset for the current tile is the same as that of the previous tile.
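The dual adder tree can be pictured as two partial sums over the same set of tap products. The sketch below is illustrative only: the function name is hypothetical, taps are numbered from zero, and a plain `sum` stands in for the hardware adder tree. It also checks that 16 short-time CMFs of 3197 taps span the 51150 time samples of a tile.

```python
def dual_adder_tree(samples, code):
    """Return (even_sum, odd_sum) of the tap products of one short-time CMF."""
    products = [s * c for s, c in zip(samples, code)]
    even_sum = sum(products[0::2])   # taps 0, 2, 4, ...
    odd_sum = sum(products[1::2])    # taps 1, 3, 5, ...
    return even_sum, odd_sum


# Tiny worked case with four taps.
even, odd = dual_adder_tree([1, 2, 3, 4], [1, -1, 1, -1])
print(even, odd)

# Sizing check using the figures quoted in the text.
CMF_COUNT = 16
TAPS_PER_CMF = 3197
TILE_TIME_SAMPLES = 51150
total_taps = CMF_COUNT * TAPS_PER_CMF
print(total_taps, total_taps >= TILE_TIME_SAMPLES)
```

Keeping the two sums separate until after the FFT costs a second adder tree but leaves the choice between coherent and noncoherent combining open.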
This noncoherent integration occurs after code Doppler compensation. The multiple magnitude CAFs corresponding to the same code offsets but computed at different times are summed together before detection. The first magnitude CAF is stored on the off-chip memory and then, for subsequent integrations, the stored magnitude CAF is added to the current magnitude CAF and then the sum is written back into the off-chip memory.
Code Doppler Compensation Function
Prior to summing the current and stored magnitude CAFs, some frequency-dependent delay compensation is required. Any frequency offset between transmitter and receiver results in a time compression or expansion (known as companding). This companding affects both the carrier frequency and the spreading code rate of the received signal. While offsets in carrier frequency are addressed by the FFT, the change in spreading code rate produces a lack of correlation between the local PN code and the received signal over long integration times.
The loss of coherence due to time companding of the baseband signal can be compensated by the use of short-time correlations followed by post-processing (e.g., [12, 13]). The companding effect causes a correlation peak to "drift" along the time axis of the short-time correlation tile from one tile to the next. If no compensation is made for this drift, which grows with the code Doppler offset and with the number of noncoherent integrations, increasing the number of integrations beyond a certain point provides no additional benefit. Code Doppler compensation predicts the relative location of the correlation peak from one correlation block to the next and applies the necessary delays to ensure that the correlation peaks remain aligned from tile to tile.
Code Doppler compensation consists of a bank of integer and fractional delay lines that are used to maintain the correct peak alignment from tile to tile. The fractional delay lines employ a 4-tap Lagrangian interpolator that uses a table to assist in proper delay coefficient selection. As each CAF is processed, the integer delay lines and fractional delay lines are initialized and/or updated to counteract the correlation peak drift. The delay of the integer variable delay line is a function of both the code Doppler offset and the number of integrations performed.
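A 4-tap Lagrange fractional delay can be sketched as follows. This is a minimal illustration, not the DirAc circuit: the coefficients are computed directly from the Lagrange formula rather than read from a table, and the function names and the way the predicted drift is split into integer and fractional parts are assumptions.

```python
import math


def lagrange4_coeffs(frac):
    """4-tap Lagrange FIR coefficients for a total delay of 1 + frac samples."""
    total_delay = 1.0 + frac          # keeps the delay between the two middle taps
    coeffs = []
    for k in range(4):
        h = 1.0
        for m in range(4):
            if m != k:
                h *= (total_delay - m) / (k - m)
        coeffs.append(h)
    return coeffs


def split_drift(drift_samples):
    """Split a predicted drift into integer and fractional delays."""
    integer = math.floor(drift_samples)
    return integer, drift_samples - integer


def delay_column(column, drift_samples):
    """Delay one CAF column (values along the time axis) by drift_samples."""
    integer, frac = split_drift(drift_samples)
    h = lagrange4_coeffs(frac)
    out = []
    for n in range(len(column)):
        acc = 0.0
        for k in range(4):
            idx = n - integer - k + 1     # +1 removes the filter's one-sample bulk delay
            if 0 <= idx < len(column):
                acc += h[k] * column[idx]
        out.append(acc)
    return out


ramp = [float(i) for i in range(20)]
shifted = delay_column(ramp, 2.25)
print(shifted[10])
```

A cubic Lagrange interpolator is exact for polynomials up to third degree, so delaying a ramp by 2.25 samples reproduces the ramp value 2.25 samples earlier at interior points.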
DIRAC IC DESIGN DETAILS
VLSI implementation decisions for the DirAc IC were driven by the CMF bank since it represents 95% of the hardware resources and processing. This section provides an overview of the IC architecture and discusses the important issues of clock and power distribution.
DirAc IC Architecture Overview
Two implementation strategies were evaluated: hardware reuse and massive parallel processing. An architecture implementing a hardware reuse strategy stores data and processes it iteratively, using a subset of processing elements (in this case, correlator taps) at an increased rate. This approach limits the required hardware for processing but requires overhead for coordinating the reuse. The major VLSI implementation drawback of this approach, especially when implementing a CMF bank, is the requirement to store input samples and intermediate partial correlations while reusing the hardware. In addition, high-speed clock management is required to obtain the reuse factor. Memory bandwidth requirements also force the architecture to support multiple memories on chip. These high-speed memories are challenging to place and route due to route congestion, global routes that introduce signal integrity issues, and routing blockages.
The alternative architecture in contrast facilitates VLSI implementation. We selected a parallel datapath implementation of the CMF. Figure 5 depicts the DirAc IC layout with an overlay of the 16 CMFs and their signal data flow. In this architecture the CMF is processed in parallel. The systolic processing associated with this architecture minimizes global routing and in fact is by definition limited to local routing. This has two benefits: it eliminates global routes minimizing signal integrity issues, and it minimizes power dissipation related to interconnect. In addition, the inherent symmetry of the systolic array facilitates clock and power distribution. Another major benefit is that since the 16 CMF elements are identical the CMF designs can be reused, simplifying the design specification, layout, and verification process allowing for a timely implementation. 
Clock Distribution
The DirAc IC requires 519,500 register elements, representing 43% of the die's core logic area as shown in Figure 6 and over 60% of the total number of transistors.
Figure 6. DirAc Layout Highlighting Die Area Associated with Registers
In order to minimize clock skew without impacting overall design performance, clock distribution was addressed at two levels: macro-level clock distribution and top-level. Clocks were routed on the top two metal layers for each macro (i.e., a CMF). Restricting clock routing to metal 5 and 6 opened routing resources at the lower metal layers for signal routing. All clock routes were also double spaced to minimize interference to signal routes. State-of-the-art place and route tools automatically introduced a clock tree for each macro.
At the top level, clock trees were hand placed. Then an automated router was used to distribute clock interconnects. These routes were double spaced and interleaved with ground.
The final clock network has a clock skew of only 83 psec. The clock tree contains 10 levels and over 15,000 buffers. Power dissipation related to the clock tree was on the order of 10% of total dynamic power dissipation.
Power Distribution
A major design consideration for the DirAc IC was distributing power across the die. The IR drop across the DirAc IC had to be limited to 25 mV. Failure to meet this requirement could have resulted in timing errors or logic failures.
The power grid was constructed based on three design criteria: achieving an IR drop below 25 mV, balancing power grid routing tracks with signal routing requirements, and not interfering with the placement and routing of the clock tree.
In order to gain insight early in the power grid design process, a power grid analysis tool was developed using SPICE. Estimates of macro power dissipation based on RTL power analysis were incorporated with a resistor and current source network that modeled the power routes and power dissipation. Insight from this tool drove the final macro placement, the coarse and fine power grid sizes, and their connections. Multiple iterations and modifications to the power grid resulted, along with architectural modifications to the DirAc IC.
In the end, the SPICE-based analysis tool predicted an 18 mV final IR drop. We confirmed this IR drop using a commercial gate-level power analysis tool after the DirAc IC was completely placed and routed. Figure 7 shows an IR drop map for the DirAc IC. The largest IR drop is 22 mV. Note that the IR drop map is symmetric and that the hot spot is located toward the center of the die.
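The idea of modeling the grid as resistors and current sources can be illustrated with a far simpler network than the one analyzed for DirAc. The sketch below assumes a single one-dimensional power rail fed from pads at both ends, with identical segment resistances and identical current sinks at each node; all values are hypothetical, and the tridiagonal nodal equations are solved with the Thomas algorithm.

```python
def rail_ir_drop(n_nodes, segment_ohm, node_current_a):
    """IR drop (volts) at each node of a rail fed from both ends.

    Nodal equations: 2*v[k] - v[k-1] - v[k+1] = R*I, with zero drop at both pads.
    """
    diag = [2.0] * n_nodes
    rhs = [segment_ohm * node_current_a] * n_nodes
    # Forward elimination (sub- and super-diagonal entries are -1).
    for k in range(1, n_nodes):
        diag[k] -= 1.0 / diag[k - 1]
        rhs[k] += rhs[k - 1] / diag[k - 1]
    # Back substitution.
    drop = [0.0] * n_nodes
    drop[-1] = rhs[-1] / diag[-1]
    for k in range(n_nodes - 2, -1, -1):
        drop[k] = (rhs[k] + drop[k + 1]) / diag[k]
    return drop


# Hypothetical rail: 9 nodes, 10 milliohm segments, 100 mA drawn at each node.
drops = rail_ir_drop(9, 0.010, 0.100)
worst_node = max(range(len(drops)), key=drops.__getitem__)
print([round(1e3 * v, 2) for v in drops], "worst node index:", worst_node)
```

Even this toy model shows the qualitative behavior noted above: the drop profile is symmetric and the worst case sits farthest from the supply pads, at the center.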
Operational Features
The DirAc IC supports either 10 or 5 msec coherent integration in support of different features and modes of M code [11] . DirAc provides a maximum of 128 noncoherent integrations and supports programmable detection processing. Detection reports are available through a simple memory map interface. On-chip 4 Kbits of dual-port RAM provide for storage of two detection reports which include raw peak data, detection threshold levels, calculated noise floor, and time and frequency location.
DirAc supports a power management mode that allows the user to selectively power down individual short-time CMFs as well as the overall device.
The DirAc IC also supports an extensive test/data collection mode. The DirAc datapath has embedded hardware control, snap-shot memory, and dedicated inputs and outputs to support data collection for post-analysis and pre-fabrication verification. Data can be collected at each major processing element in DirAc's datapath through the external noncoherent integration memory by leveraging the hardware test fabric. For example, data can be collected from the short-time CMFs, FFT, or detection processing along with any specified time-frequency tile. Similarly, the test fabric can reduce simulation time. The DirAc IC requires approximately 1 million clock cycles to process one time-frequency tile at 40.92 MHz. The test structures allow the CMF bank to be preloaded in parallel, reducing the required simulation cycles.
External Noncoherent Memory Interface
An external memory provides intermediate storage of the CAF during noncoherent integration. At each clock cycle, a column of the time-frequency tile is read from memory, modified by DirAc, and written back to external memory. The number of memory bits required is as follows.
Power and Energy
The DirAc IC dissipates 1.9 Watts when performing 10 msec coherent integrations and 1.1 Watts for 5 msec coherent integrations. DirAc also provides a power management mode for selectively clock-gating the code matched filter bank. In standby mode the DirAc IC dissipates 10 mW.
To assess average power consumption in an operational context, suppose a handheld receiver must support a 72-hour mission. Most of the time, the receiver is in timekeeping mode, but the user needs to obtain a fix periodically (once every 15 minutes, or once every 4 hours, or once every 24 hours). In timekeeping mode, the receiver uses a 1 part per million timekeeping circuit, and must search ±400 Hz of frequency uncertainty. The acquisition circuit uses 5 msec coherent integration time and performs 60 noncoherent integrations to obtain adequate detection performance.
The receiver is powered by two AA alkaline batteries that provide 1400 mA-hr at 3.0 volts. If the DC-DC converter has 95% efficiency, the batteries provide 14,364 W-sec of energy.
Under these conditions, the greatest power consumption occurs when the fix is updated every 15 minutes, rather than with longer update intervals. DirAc IC has to operate for 0.3 seconds each time the receiver is activated. Over the 72 hours, approximately 100 W-sec of energy is used, or an average of less than 0.4 mW over the mission. This represents only 0.7% of the total available battery energy provided by the AA-alkaline batteries.
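The energy budget can be reproduced from the figures given above. The sketch assumes that the IC draws the 1.1 W quoted for 5 msec coherent integration while active, that it operates for 60 integrations of 5 msec per fix, and that it consumes nothing between fixes (standby power is not counted); the variable names are illustrative.

```python
import math

MISSION_S = 72 * 3600.0
FIX_INTERVAL_S = 15 * 60.0
ACTIVE_POWER_W = 1.1                  # 5 msec coherent integration mode
ON_TIME_PER_FIX_S = 60 * 5e-3         # 60 noncoherent integrations of 5 msec

battery_wsec = 1.400 * 3.0 * 3600.0 * 0.95      # 1400 mA-hr at 3.0 V, 95% converter
fixes = MISSION_S / FIX_INTERVAL_S
energy_wsec = fixes * ON_TIME_PER_FIX_S * ACTIVE_POWER_W
average_power_w = energy_wsec / MISSION_S
battery_fraction = energy_wsec / battery_wsec

print(f"battery energy : {battery_wsec:.0f} W-sec")
print(f"fixes          : {fixes:.0f}")
print(f"energy used    : {energy_wsec:.1f} W-sec")
print(f"average power  : {1e3 * average_power_w:.3f} mW")
print(f"battery share  : {100 * battery_fraction:.2f} %")
```

Under these assumptions the energy used comes out slightly below the rounded 100 W-sec quoted in the text, consistent with the stated average of less than 0.4 mW and roughly 0.7% of the battery energy.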
Acquisition Performance
Acquisition performance of the DirAc IC can be predicted using standard theory. The output signal-to-noise-plus-interference ratio (SNIR) after a crosscorrelation is given by
where T is the coherent integration time used in the correlations, L is the implementation loss expressed as a number greater than or equal to unity, the factor of 0.25 applied to the received signal power accounts for splitting that power into four distinct segments (upper and lower sidebands, even and odd spreading symbols) in each coherent integration time, and the noise terms are the power spectral density of the thermal noise at the receiver front end and J, the effective power spectral density of the received jamming signal. Details on typical implementation losses can be found in [3, 14].
Thus, for a given received signal power, coherent integration time, implementation loss, receiver noise level, and effective jamming level, (1) yields the output SNIR.
The detection probability is found using the generalized Marcum Q function. Using the notation Q_N(X, Τ) [15] for the probability that the random variable with 2N degrees of freedom and SNIR of X exceeds a threshold value of Τ allows the detection probability to be expressed
where n is the number of coherent integration times used, and Τ_N is the detection threshold calculated to provide the needed false alarm probability for the given number of noncoherent integrations. The factor of four in the subscript in (2) accounts for the fact that the number of complex quantities being noncoherently combined is four times the number of coherent integration times used, reflecting the combination of upper and lower sidebands, and even and odd spreading symbols.
The expressions (1) and (2) can be used to determine the number of coherent integration times needed to achieve a specific detection probability at a given false alarm probability.
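A calculation of this kind can be sketched with the standard library alone. The sketch below evaluates the exceedance probability of a noncentral chi-square variable with an even number of degrees of freedom as a Poisson-weighted sum of central chi-square tail probabilities, which is one way to evaluate the generalized Marcum Q function. The function names, the parameterization by a total noncentrality value (rather than by the SNIR of (1), whose normalization is not assumed here), and the numerical values are all illustrative assumptions.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability, computed in the log domain to avoid overflow."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def central_sf(pairs, threshold):
    """P(central chi-square with 2*pairs degrees of freedom > threshold)."""
    return sum(poisson_pmf(k, threshold / 2.0) for k in range(pairs))


def detection_probability(pairs, noncentrality, threshold):
    """P(noncentral chi-square with 2*pairs degrees of freedom > threshold)."""
    mean = noncentrality / 2.0
    j_max = int(mean + 10.0 * math.sqrt(mean) + 50)   # truncate the Poisson mixture
    return sum(poisson_pmf(j, mean) * central_sf(pairs + j, threshold)
               for j in range(j_max + 1))


def threshold_for_pfa(pairs, pfa):
    """Bisection for the threshold giving the requested false alarm probability."""
    lo, hi = 0.0, 1.0
    while central_sf(pairs, hi) > pfa:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if central_sf(pairs, mid) > pfa:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical case: n = 60 coherent integration times, so 4n = 240 combined quantities.
PAIRS = 4 * 60
threshold = threshold_for_pfa(PAIRS, 1e-6)
for total_noncentrality in (0.0, 100.0, 200.0):
    pd = detection_probability(PAIRS, total_noncentrality, threshold)
    print(total_noncentrality, pd)
```

Sweeping the number of integration times in such a loop, with the noncentrality tied to the SNIR of (1), is one way to find the smallest number that meets a target detection probability.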
The time (in seconds) to search the initial time uncertainty of ±∆ sec and an initial frequency uncertainty of ±Φ Hz is then
where is the coherent integration time, is the number of short-time correlations within the coherent integration time, and 
FUTURE DIRECTIONS
The DirAc IC prototype provides a baseline for validating the practicality of using direct acquisition to acquire the M-code signal. Next generation acquisition circuits can use more aggressive architectures and can leverage more advanced silicon process technologies to obtain even lower-power, lower-cost, and smaller-size acquisition circuits with even higher performance. Currently, 130 nm CMOS technology is widely available and 90 nm CMOS technology is emerging. Migration to the next generation 130 nm process alone will reduce power dissipation by a factor of three while reducing die area and cost of nonrecurring engineering by a factor of two, compared to the prototype DirAc IC. Migration to 90 nm will produce similar improvements yet again over the 130 nm implementation.
Since the DirAc IC design began in 2001, we have considered various enhancements motivated by a combination of lessons learned, refinements of the M-code signal design, evolution of operational concepts for the M-code signal, and developments in semiconductor technology. The enhancements described in this section are divided into two groups: those with minor effect on transistor count and clock speed, and those with more significant effect on transistor count and clock speed.
The following enhancements would affect transistor count and clock speed of prototype DirAc by minimal amounts.
• Minor changes to design logic that would support 256 noncoherent integrations when 5 msec coherent integration time is used.
• Adding another 32-point FFT to allow parallel computation of two CAFs when 5 msec coherent integration time is used.
• Increasing the FFT size to 64 points, providing either lower implementation loss with the same frequency coverage, or twice the frequency coverage with the same implementation loss. If the larger FFT is used to extend the frequency coverage, the size of the external memory used for noncoherent integration would be doubled.
• Postprocessing of magnitude CAFs to reduce worst-case implementation losses.
These enhancements could further improve performance by more than a factor of two. Since they have little effect on complexity of the IC, implementation in 130 nm or 90 nm technology provides a small IC with very low power consumption and excellent capability.
More significant enhancements to the prototype DirAc are also under consideration, including the following:
• The coherent integration time could be doubled to 20 msec, providing significant performance benefits at low data rate.
• The sampling rate could also be doubled with the coherent integration time kept the same as in the prototype, providing lower implementation loss and thus better performance.
While these variants would approximately double the number of transistors relative to the Version 1 prototype, the use of 130 nm or 90 nm technology would make the resulting IC smaller and lower power than the DirAc prototype is today.
SUMMARY
This paper has described the DirAc prototype integrated circuit for direct acquisition of the M-code signal. DirAc represents the beginning of a new generation of direct acquisition circuitry, where sophisticated processing algorithms and advanced architectures combine with tens of thousands of physical correlators to execute millions of virtual correlations in parallel. DirAc demonstrates that this level of capability can be obtained using mature semiconductor technology, yielding circuits that are high-yield, small, and low-power.
DirAc's code-matched filter architecture allows for low clock rates and minimizes on-chip storage. The resulting architecture is dominated by systolic arrays that can be designed and laid out for maximum efficiency, then replicated. The systolic array architecture also facilitates clock and power distribution.
The IC has been extensively tested in an IC tester, demonstrating that its operation matches design specifications and computer simulations. The IC is currently being integrated into a test receiver to allow more extensive testing.
As semiconductor technology advances, enabling the use of even more transistors and of higher clock rates, there are many opportunities to develop even more capable circuits for direct acquisition of the M-code signal, either using extensions of the DirAc architecture or entirely different approaches.
