Abstract-We present a cross-correlator ASIC for synthetic aperture imaging of Earth's atmosphere. Reconfigurability as a 2-level 96-channel or 3-level 48-channel cross-correlator provides adaptability to a wider array of applications. Implemented in a 65-nm CMOS process, the cross-correlator is capable of running at clock speeds of up to 3 GHz. In 2-level 96-channel mode, the cross-correlator consumes only 1.1 W at 2.5 GHz and 1.2 V, yielding a power efficiency of 96 µW/prod/GHz. The 450-Mb/s readout speed and double-buffering reduce blanking time of the interferometer system to a minimum.
I. INTRODUCTION
This work tackles the challenge of reconciling the strict power constraints of on-satellite implementation with the massively parallel computing task of cross-correlating all baselines of an aperture synthesis instrument. While the use of aperture synthesis for Earth observation from orbit is not new [1], the stringent computational requirements of such instruments have thus far kept them from deployment in geostationary orbit (GEO); however, ongoing initiatives aim to change this [2] [3]. In addition to Earth observation, short-range interferometric imaging applications such as personnel security screening [4] have emerged.
In an aperture synthesis instrument, the zero-lag cross-correlation, f ⋆ g, samples visibilities, V(u, v), in the uv-plane, which by inverse transform reveal the brightness temperature, T(x, y), in the image plane. A number of key factors determine the performance of the instrument. The parameters that most affect the design of the digital cross-correlator back-end are the number of antenna elements, the required bandwidth, the required image update time, and the required SNR. Many, if not all, of these parameters are interdependent. The RMS noise, σ_T, of the system is given by the radiometer equation, σ_T = T_sys/√(τB) [5], such that longer integration time, τ, and higher bandwidth, B, mitigate the effect of the system noise temperature, T_sys. The maximum bandwidth can be extended by sub-band division, with the sub-bands processed in parallel.
The cross-correlator presented here reuses the backbone architecture from an earlier 64-channel 2-level cross-correlator [6], but supports new features, such as reconfigurability, channel monitors, and double buffering, and significantly improves on channel count, power efficiency, and readout speed. The cross-correlator can be configured in either a 2-level 96-channel or a 3-level 48-channel mode. The quantization degradation factor, i.e., the reduction in SNR compared to fully analog sampling, is of great importance for the choice of both quantization levels and sampling rate. The degradation factor for 2-level (1-bit) sampling at the Nyquist rate is 1.57, while for 3-level quantization it is reduced to 1.24 [7]. For oversampling at two times the Nyquist rate, the degradation factor for 2-level quantization is reduced to 1.35, while for 3-level quantization it is reduced to 1.13, which is comparable to 4 levels at the Nyquist rate. Oversampling capability and a reduced number of sub-bands together make a case for cross-correlators that can handle high sample rates.
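As a rough illustration of how the degradation factor enters the radiometer equation, the sketch below scales σ_T by the factors quoted above. The function name is our own, and the system temperature, integration time, and bandwidth in the usage lines are hypothetical values chosen only for illustration. The 2-level Nyquist factor of 1.57 corresponds to the well-known value π/2.

```python
import math

# Degradation factors quoted in the text (relative to ideal analog correlation).
DEGRADATION = {
    ("2-level", "nyquist"): 1.57,
    ("3-level", "nyquist"): 1.24,
    ("2-level", "2x-nyquist"): 1.35,
    ("3-level", "2x-nyquist"): 1.13,
}

def rms_noise(t_sys_k, tau_s, bandwidth_hz, degradation=1.0):
    """Radiometer equation sigma_T = T_sys / sqrt(tau * B),
    scaled by a quantization degradation factor (1.0 = analog)."""
    return degradation * t_sys_k / math.sqrt(tau_s * bandwidth_hz)

# Hypothetical example values, not taken from the paper.
t_sys, tau, bandwidth = 500.0, 1.0, 1.0e9
for mode, factor in DEGRADATION.items():
    print(mode, f"{rms_noise(t_sys, tau, bandwidth, factor):.4f} K")
```

For a fixed T_sys and B, the integration time needed to reach a given σ_T grows with the square of the degradation factor, which is why the step from 2-level to 3-level quantization matters for image update time.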
While other ongoing initiatives place the analog-to-digital converters (ADCs) on the same die as the cross-correlator [8], we have chosen to separate these. Due to the extra signal interfaces, this approach incurs extra system complexity and power dissipation, but in return offers benefits such as higher channel isolation and ease of system upscaling. While the presented cross-correlator ASIC supports 96 2-level channels, current scientific proposals for full-hemisphere-covering GEO sounders require significantly more channels. Using separate ADC ICs, system upscaling can be supported by splitting signals to several cross-correlators once digitization has been performed, leading to reduced signal power requirements on the front-end and reduced ADC power dissipation.
II. ARCHITECTURE
The cross-correlator is built as an array of 2-input correlator blocks, each consisting of a synchronizer, a 1-bit multiplier, a 4-bit prescaler, and a 24-bit counter with storage buffers and readout control. This section will outline features which go beyond the previous design [6] in order to address many of the current needs for synthetic aperture instruments.
A. Data path and system integration
The signal inputs are divided into 12 banks, each with 8 data inputs and one clock input, matching the output interface of a custom-built ADC [9]. Clock inputs are differential and use a current mode logic (CML) based termination with 100-Ω resistors to a positive voltage level, V_term. Data inputs are single-ended to save on I/O-pin count but are terminated to V_term in the same fashion. To save power, data inputs are CML compatible and do not require full swing. Each input bank has a separate termination pin which is externally tuned to center the input swing around the input buffer switching threshold, V_th, such that V_term = V_th + V_swing/2. Having single-ended inputs means the differential pair of the ADC can be split and connected to two cross-correlator ASICs with no need for additional splitter circuitry, except for the differential clock inputs. We use a data routing and clocking scheme that is similar to that used in the previous cross-correlator ASIC [6]. As shown in Fig. 1, each data channel flows together with a clock. All channels enter the correlator core from one side, where each channel is split up into two paths, one going straight through the core and the other going diagonally. This way, each channel will intersect all other channels at some point in the routing, and a largely rectangular chip layout is maintained.
Having 96 input channels corresponds to 4,560 correlation products (more than double the 2,016 products of [6]). Each channel has been supplied with an input monitor (M) that counts the number of logic 1s, which in combination with an internal clock-pulse counter (CM) can be used to determine offsets for use in ADC calibration and post-processing correction.
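The product counts follow from pairing every input with every other input exactly once, i.e., n(n − 1)/2 products for n inputs. A minimal check (the function name is our own):

```python
def correlation_products(inputs):
    """Number of unique pairwise products among `inputs` 1-bit inputs: n(n-1)/2."""
    return inputs * (inputs - 1) // 2

print(correlation_products(96))  # 4560, this design
print(correlation_products(64))  # 2016, the earlier 64-channel design
```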
To aid in correlation timing, a reference clock output has been implemented, dividing the correlation clock rate by a factor of 256, making cycle-accurate integration timing possible. The divider can also be reset, making correlation start time completely independent of integration length if required.
B. Reconfigurability
The 2- or 3-level reconfigurability is achieved by using a control signal to switch the mode of the multipliers from XOR to AND operation, as shown in Fig. 1. The resulting pseudo-cross-correlation product is defined such that, for two 1-bit signals, f and g, f ⋆_⊙ g ≝ f ∧ g. The overhead for this reconfigurability is negligible; one extra transistor per correlator block is required along with a multiplexer at the input of each channel.
In 3-level mode, two signals, f and g, with high and low threshold sampling, will produce four inputs, f_H, f_L, g_H, and g_L. The cross-correlation, f ⋆ g, is calculated from the four pseudo-cross-correlation products f_H ⋆_⊙ g_H, f_H ⋆_⊙ g_L, f_L ⋆_⊙ g_H, and f_L ⋆_⊙ g_L.
The remaining products, f_H ⋆_⊙ f_L and g_H ⋆_⊙ g_L, make up only 1% of the total number of products and are not used for calculating correlations. These products can instead be used for detecting incorrect AD-conversion thresholds.
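The bookkeeping for 3-level mode can be sketched as follows. This is an illustrative sketch under stated assumptions and not necessarily the combining equation used in the design: it assumes thermometer-coded inputs (f_H = 1 above the high threshold, f_L = 1 above the low threshold, so the 3-level value is f_H + f_L − 1), models the pseudo-product as a bitwise AND count, and assumes that the per-input counts of logic 1s and the number of samples are available, for example from the channel monitors and the clock-pulse counter. All function names are hypothetical.

```python
import random

def and_count(a, b):
    """Pseudo-cross-correlation product: number of samples where both bits are 1."""
    return sum(x & y for x, y in zip(a, b))

def three_level_correlation(f_h, f_l, g_h, g_l):
    """Sum of f*g with f = f_h + f_l - 1 and g = g_h + g_l - 1,
    rebuilt from four AND counts, four ones counts, and the sample count."""
    n = len(f_h)
    products = (and_count(f_h, g_h) + and_count(f_h, g_l)
                + and_count(f_l, g_h) + and_count(f_l, g_l))
    ones = sum(f_h) + sum(f_l) + sum(g_h) + sum(g_l)
    return products - ones + n

def quantize(samples, threshold):
    """Assumed thermometer coding with thresholds at +threshold and -threshold."""
    high = [1 if s > threshold else 0 for s in samples]
    low = [1 if s > -threshold else 0 for s in samples]
    return high, low

random.seed(0)
f = [random.gauss(0, 1) for _ in range(10000)]
g = [0.5 * a + random.gauss(0, 1) for a in f]  # partially correlated test signal
f_h, f_l = quantize(f, 0.612)
g_h, g_l = quantize(g, 0.612)

# Direct 3-level multiply-accumulate, used only as a reference.
direct = sum((a + b - 1) * (c + d - 1) for a, b, c, d in zip(f_h, f_l, g_h, g_l))
assert three_level_correlation(f_h, f_l, g_h, g_l) == direct

# With correctly ordered thresholds, f_h = 1 implies f_l = 1,
# so the same-channel product must equal the ones count of f_h.
assert and_count(f_h, f_l) == sum(f_h)
```

With 48 channels there are 48 same-channel products out of 4,560, about 1.05%, which matches the 1% figure above. Under the coding assumed here, a same-channel product that differs from the ones count of the high-threshold input would indicate swapped or otherwise incorrect thresholds.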
The reconfigurability can easily be extended to PCB-level by simply setting high and low threshold levels for half of the ADCs respectively, as shown in Fig. 2. The custom-built ADCs [9] allow for per-channel offset tuning, making this task simple. At the system level, each input signal would then have to be split between two ADCs.
C. Data readout
The storage registers make it possible to read earlier data while simultaneously performing the next correlation. The buffered readout strategy makes very short integration times possible without severely affecting the efficiency in terms of time spent correlating, compared to time spent between correlations. This is important for interferometric applications, where short blanking times will improve instrument sensitivity.
The storage registers are connected in a byte-wide daisy chain for each row. To save on transistor count, each register consists of conventional latches. Therefore a special serial clocking algorithm with a propagating enable bit had to be devised. A small readout controller was added to each storage byte to handle this clocking. We estimate this to save 30% on transistor count for the storage registers compared to a DFF-chain with conventional serial clock. Power dissipation is minimized by reading one row at a time. Readout is performed through a differential LVDS serial interface with return clock.
While hard errors due to radiation are not expected in short-to-medium-term GEO missions [10], testing of the earlier correlator, implemented in the same CMOS technology, predicts about one soft error per day in GEO [11]. Any soft error occurring in the readout logic is likely to invalidate an entire data set, as opposed to single value changes for errors in any other part of the chip. To reduce the risk of data loss at an acceptable level of overhead, only the readout control logic is designed with radiation tolerance in mind. DICE-latches [12], most of which use node spacing to reduce charge sharing [13], are used for row selection control, parallel-to-serial counters, and readout enable bit propagation.
III. IMPLEMENTATION
The chip in Fig. 3 is implemented in a 65-nm CMOS process rated at nominal voltages of 1.0-1.2 V. Two rows of staggered pads with a pitch of 73 µm surround the active logic. All banks are routed with matched path lengths to keep skew between inputs to a minimum. The chip is packaged in a 169-ball 10 × 10 mm² BGA on a custom-designed impedance-controlled substrate. Since most of the chip consists of integrators and storage registers that are clocked at relatively low frequencies, low power (LP) high threshold voltage (HVT) transistors are used extensively to reduce leakage. The total layout area is 3.04 mm², half of which is active logic. This constitutes a 2X improvement in efficiency in terms of logic area per correlation product compared to the earlier design [6]. With this more compact layout, data paths between correlation products are shorter, making it possible to downsize buffers to save power. Since current cross-correlator applications can use clock rates as low as tens of MHz, all logic is implemented with static techniques, meaning the lower clock-rate limit of the previous dynamic-logic design [6] is entirely eliminated.
The correlator core layout and schematic are mostly generated via scripts, with channel count as a parameter and custom cells as building blocks. Accurate circuit simulation of a smaller array configuration, thus, enables full chip verification.
IV. TEST
A. Methodology
Our test setup consisted of a custom PCB connected to a PC via a commercial FPGA-board with a soft processor. This setup enabled software control of correlator supply voltage, input common mode level, input and output differential buffer bias, clock rate, clock duty-cycle, clock-to-clock skew, data-to-clock skew, and readout speed. We also had the ability to set 2- or 3-level mode, set 256 different static input patterns, and connect 16 of the channels through on-board ADCs to external analog sources. A LabVIEW interface controlled automated sweep sequences and data handling.
Tests were performed in three different performance corners: C1 uses 1.0-V supply and 1.5-GHz clock, C2 uses 1.2-V supply and 2.5-GHz clock, while C3 uses higher-than-nominal 1.4-V supply and 3-GHz clock. Functionality testing for all corners was performed by sweeping through all 256 available input patterns in both 2- and 3-level modes (512 integrations in total). Integration time was set to 15 minutes for each pattern, making the integrators wrap around to zero several thousand times. This way, enough low-probability errors accumulate to not be masked by the prescalers. An additional, reduced functionality test, using a single input pattern that activates most products, was used for sweeps of clock rate versus supply voltage; the integration time for this test was 10 s.
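The wrap-around count can be estimated from the integrator width. The sketch below assumes that the 4-bit prescaler and the 24-bit counter together behave as a 28-bit integrator and that a product evaluates to 1 on every clock cycle, which gives an upper bound; the function name is our own.

```python
def counter_wraps(clock_hz, integration_s, prescaler_bits=4, counter_bits=24):
    """Upper bound on integrator wrap-arounds, assuming the product is 1
    on every clock cycle and prescaler + counter act as one wide counter."""
    cycles = int(clock_hz * integration_s)
    return cycles // (1 << (prescaler_bits + counter_bits))

# The three test corners with the 15-minute integration time from the text.
for corner, clock_hz in (("C1", 1.5e9), ("C2", 2.5e9), ("C3", 3.0e9)):
    print(corner, counter_wraps(clock_hz, 15 * 60))
```

Under these assumptions the bound lies between roughly 5,000 and 10,000 wrap-arounds across the three corners, in line with the "several thousand" stated above.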
Since the cross-correlator operates on the principle that incoming clocks from the different banks are merged throughout the chip, clock skew must be carefully considered. At PCB-level, this means clock routing has to be handled carefully to meet maximum skew requirements. Note, however, that inaccuracies of test bench timing will still be present and will naturally degrade all skew and duty cycle measurements. The 12 correlator clock inputs were divided into two groups which can be individually delayed. Two more clock paths, also individually delayed, were each connected to half of the ADCs. While not an exhaustive test of skew margin for all synchronizers in the chip, it does exercise 188 of the synchronizers and 16 of the data inputs.
The test bench cannot supply switching signals to all 96 input channels, so the total chip power dissipation in a typical usage scenario (mostly uncorrelated noise on all input channels) could not be measured directly. Instead, we compared the power dissipation with stable inputs of different patterns against that with noise inputs on the 16 available analog channels. From these data, per-channel clocking power, per-channel data path switching power, per-product multiplier+counter power, and idle power can be resolved. Total chip power and efficiency were then estimated from these measurements.
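A minimal sketch of this kind of estimate is given below. The additive power model is our assumption about how the resolved components combine, and the function and parameter names are hypothetical; no component values are given here, so the model is shown without numbers. The efficiency metric is evaluated with the figures from the abstract (1.1 W, 4,560 products, 2.5 GHz).

```python
def estimated_chip_power_w(idle_w, clock_w_per_channel, data_w_per_channel,
                           product_w, channels, products):
    """Assumed additive model: idle power plus per-channel and per-product terms."""
    return (idle_w
            + channels * (clock_w_per_channel + data_w_per_channel)
            + products * product_w)

def efficiency_uw_per_product_per_ghz(power_w, products, clock_ghz):
    """Power efficiency expressed in microwatts per correlation product per GHz."""
    return power_w * 1e6 / products / clock_ghz

# 2-level 96-channel mode at 2.5 GHz and 1.2 V, figures from the abstract.
print(round(efficiency_uw_per_product_per_ghz(1.1, 4560, 2.5), 1))
```

The result is approximately 96 µW/prod/GHz, consistent with the efficiency quoted in the abstract.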
The readout speed test was performed using the FPGA-board, on which a PLL was swept through all available combinations of multiplier and divider of the system clock (100 MHz). This made the test points very dense at low speed but sparse (50-MHz steps) at higher speeds. Due to limitations in the test setup, readout speed could not be verified beyond 450 Mbit/s.
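The time for a complete readout at this speed can be estimated as follows. The sketch assumes that only the 24-bit counter value of each of the 4,560 products is read out and ignores any framing or control overhead; the function name is our own.

```python
def readout_time_s(products=4560, bits_per_product=24, rate_bps=450e6):
    """Time to shift out every product value once, ignoring protocol overhead."""
    return products * bits_per_product / rate_bps

print(f"{readout_time_s() * 1e3:.3f} ms")
```

Under these assumptions a full data set takes about 0.24 ms, so with double buffering the readout can overlap with integration whenever the integration time exceeds this value.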
B. Results
All corners passed the full functionality test with no detected errors. The results of the reduced test for voltage and clock rate dependence are presented in Fig. 4. The lower supply voltage limit has been found to be caused by the DICE-latches in the readout logic. Since the readout logic works at up to 450 Mbit/s, a complete data set readout can be performed in less than 0.3 ms.
Fig. 5 shows the power dissipation of the cross-correlator as a function of operating frequency for six different modes: 2- and 3-level operation at 1.0-V, 1.2-V, and 1.4-V supply. Note again that the power dissipation is derived from measurements at each operating point and not directly observed. The power dissipation is lower for 3-level operation, which is due to the lower switching activity per input expected when sampling thresholds are set at ±0.612 of RMS (which is optimal for a 3-level cross-correlator [7]).
Table I summarizes key results for each of the three test corners, while Table II lists some properties and contrasts them with the previous design. We have chosen to also list the power efficiency of the cross-correlator as µW per correlation product per GHz for comparison between different corners. For the 3-level case, the total power is lower due to less activity; however, the efficiency is also lower due to the reduced number of complete correlation products.
V. CONCLUSION
We have designed, implemented, and tested a 65-nm low-power cross-correlator ASIC. This custom ASIC features reconfigurable precision and double-buffered readout to meet the requirements associated with aperture synthesis radiometers. In addition, this ASIC delivers the performance and power efficiency needed for Earth remote sensing from geostationary orbit (GEO). Thus, a solid case can now be made that the amount of signal processing required can comfortably be accommodated, in a cost-effective 65-nm CMOS process, well within the power budget of a GEO satellite.
