ABSTRACT Interferometric aperture synthesis is a proven technique in radio astronomy and earth remote sensing, and it also shows great potential in security screening. An aperture synthesis passive millimeter-wave (PMMW) imager is under development at Beihang University for detecting contraband concealed on the human body in an indoor environment. This imager uses 256 antenna-receiver channels with 1 GHz bandwidth and can obtain a radiometric sensitivity of less than 1 K at a video imaging rate (∼25 frames/s). One of the greatest challenges in this system is the development of a digital correlation subsystem capable of analog-to-digital (A/D) conversion and subsequent signal processing across the system's 256 channels. In this paper, a comparator-based 1-bit/2-level (1B/2L) A/D conversion architecture is presented. The main error sources during sampling are identified as the timing error of the sampling clocks and the threshold offset of the comparators, and both are analyzed in detail. The sampled data are captured by field programmable gate arrays (FPGAs) for further signal processing. A data capture module performing serial-to-parallel conversion and per-bit deskew is designed in the FPGA to transfer sampled data from the sampling clock domain to the internal processing clock domain. A 64-channel test system is built to verify the design, and a correlation efficiency of 92.5% to 99.6% is observed at 1 GHz sampling frequency. Correlation efficiencies below 98% are found to be caused by the threshold offsets of the comparators, which can be compensated using a digital-to-analog converter (DAC) or a programmable potentiometer.
I. INTRODUCTION
Interferometric aperture synthesis has been used in radio astronomy [1] and earth remote sensing [2] , [3] for decades. It synthesizes a large aperture by sparsely arranging a number of small-aperture antennas. The signals received by each pair of antennas are cross-correlated to obtain the so-called visibility function samples. According to the van Cittert-Zernike theorem [4] , the brightness temperature within the field of view (FOV) can be approximated by the inverse Fourier transform of the visibility function samples.
In the recent past, the technique has been found useful for a few other applications, one of which is passive millimeter-wave (PMMW) imaging for security screening [5] .
The associate editor coordinating the review of this manuscript and approving it for publication was Bora Onat.
The first laboratory demonstrator was built by Salmon [6] and has 32 antenna-receiver channels with 330 MHz bandwidth. Successive generations of prototypes have also been developed at Beihang University, termed BHU-2D [7] and BHU-2D-U [8] . BHU-2D has 24 channels with 160 MHz bandwidth and BHU-2D-U has 48 channels with 200 MHz bandwidth. These prototypes have verified the capability of detecting threats and validated the advantages of a high imaging rate and a large FOV. However, to obtain satisfactory radiometric sensitivity (<1 K) and spatial resolution at video rate in an indoor environment, a two-dimensional (2-D) aperture synthesis PMMW imager needs several hundred channels with a bandwidth of 1 GHz or larger [5] .
An improved imager using 256 antenna-receiver channels with 1 GHz bandwidth is under consideration at Beihang University [9] . This imager is designed for indoor human body security screening, aiming to enable the detection of metallic and non-metallic threats without having to resort to complex and bulky illumination architectures. Because correlation processing must be performed between each pair of channels, 32,640 complex correlators with 1 GHz bandwidth are needed in this system. Although an analog correlator can obtain 1 GHz bandwidth easily, an array of 32,640 analog complex correlators would be too bulky, so digital correlators are adopted. The block diagram of this imager is shown in FIGURE 1, where 512 baseband in-phase and quadrature (IQ) signals are digitized and transmitted to a digital correlator array. The host computer controls the whole system, including the correlator array, and runs the algorithms that create images.
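The channel counts above follow from simple combinatorics, which can be checked with a short sketch. The function name and its parameters are illustrative assumptions, not part of the imager design; the 1 GHz sampling rate is the figure quoted in the abstract.

```python
def correlator_budget(n_channels, sample_rate_hz, bits_per_sample=1):
    """Return (complex correlators, real baseband signals, aggregate bit rate in bit/s)."""
    n_correlators = n_channels * (n_channels - 1) // 2  # one complex correlator per antenna pair
    n_signals = 2 * n_channels                          # one I and one Q signal per channel
    bit_rate = n_signals * bits_per_sample * sample_rate_hz
    return n_correlators, n_signals, bit_rate


# 256 channels sampled at 1 GHz with 1-bit quantization
print(correlator_budget(256, 1e9))
```

For 256 channels this gives 32,640 antenna pairs and 512 IQ signals, and at 1 bit per sample the aggregate rate equals 512 times the sampling rate.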
The radiometric sensitivity (the smallest detectable temperature difference) of an aperture synthesis imager is inversely proportional to the quantization efficiency of the analog-to-digital converter (ADC), which in turn depends on the number of quantization levels. For 1-bit 2-level (1B/2L) quantization the efficiency is 64%, and for 2-bit 3-level (2B/3L) quantization the efficiency increases to 81%. Going to 4-level digitization and beyond offers only relatively minor further improvements in efficiency. Besides, both the price and the power consumption of commercial multibit ADCs are too high for a system with hundreds of channels. 2B/3L digitization offers the best balance between performance and complexity, while 1B/2L digitization is the simplest solution and can be implemented using comparators. If 2B/3L quantization offers a radiometric sensitivity of about 0.5 K, 1B/2L quantization offers a radiometric sensitivity of about 0.63 K, which is still acceptable for indoor human body security screening. Thus, 1B/2L digitization is used in this imager. When 1B/2L correlation is adopted, the timing skew of the sampling clocks and the threshold offset of the comparators must be handled, because both would otherwise degrade the radiometric sensitivity.
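The 0.5 K to 0.63 K comparison can be reproduced with a one-line scaling, assuming only that the sensitivity figure scales inversely with quantization efficiency and that everything else in the system is unchanged. The function name is hypothetical.

```python
def scaled_sensitivity(delta_t_ref_k, eff_ref, eff_new):
    """Sensitivity in kelvin after changing quantization efficiency (smaller is better)."""
    return delta_t_ref_k * eff_ref / eff_new


# 2B/3L (81% efficiency) at 0.5 K, rescaled to 1B/2L (64% efficiency)
print(round(scaled_sensitivity(0.5, 0.81, 0.64), 2))
```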
The digital correlator array can be implemented with Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs) or Application Specific Integrated Circuits (ASICs). However, after 1-bit quantization the aggregate data rate is up to 512 Gbps, which is too high for the PCI Express transceivers of GPUs. ASICs can offer excellent performance with low power consumption, but the cost is too high and the design cycle is too long. State-of-the-art FPGAs have flexible interfaces and abundant logic resources at relatively low cost, which makes them suitable for indoor security applications. Because it is almost impossible to finish all the correlation processing in a single FPGA chip at the current level of technology, the digital correlation system must be assembled from smaller units. On the other hand, each FPGA chip is expected to integrate more than one hundred channels to limit the complexity of data distribution. Before correlation processing can be performed, some extra issues have to be considered for the FPGA to capture the sampled data correctly. First, it is impossible for the FPGA to perform cross-correlation synchronously to a clock of up to 1 GHz, so a mechanism such as serial-to-parallel conversion is needed to reduce the working frequency without data loss. Second, unlike the ASIC presented in [15] , common comparator chips have no source-synchronous clock output for data reception, so a high-speed data reception interface must be designed. Third, physical pin variations introduce non-negligible bit skew when more than one hundred data lines across several banks are synchronized to the same clock domain at 1 GHz, even if the arriving data are perfectly synchronous at the input/output (IO) pins of the FPGA. If not calibrated, this bit skew may cause setup-time and hold-time violations when the FPGA captures the incoming data, which leads to a large number of bit errors and reduces the measurement signal-to-noise ratio.
In Section II, the method used to achieve 1 GHz 1B/2L A/D conversion based on comparators and an FPGA is presented together with the design considerations. The test results obtained from a 64-channel test system are presented in Section III. Finally, conclusions are presented in Section IV.
II. DIGITIZATION AND DATA RECEPTION
A. DIGITIZATION
As shown in FIGURE 2, the comparators used for 1-bit quantization can be divided into two categories: the level-latched comparators and the clocked or edge-latched comparators. The reason for using edge-latched comparators instead of level-latched comparators is that edge-latched comparators hold data outputs stable for an entire clock period, making it easier for the FPGA to achieve high-speed data capture.
Timing skew between sampling clocks has an effect of reducing the correlator output, which can be expressed by [10] 
where γ is the percentage reduction in correlator output caused by timing skew, B is the bandwidth of the input white noise signal, and t is the skew between sampling clocks. To guarantee the radiometric sensitivity, the reduction should be less than 5%, which corresponds to t = 0.087/B. For a 500 MHz bandwidth, this gives t = 174 ps. If single-channel comparators were used, a complex clock network generating 512 synchronous sampling clocks at 1 GHz would be needed, and it is difficult to obtain such a low skew across so many sampling clocks. It is better to use multi-channel comparator chips to reduce the complexity of the clock distribution network. A 16-channel comparator ASIC is under consideration, for which the clock network only needs to generate 32 sampling clocks. There are two methods to deal with the threshold offset of the comparators. One is to use programmable potentiometers or digital-to-analog converters (DACs) to compensate the input offset voltage of the comparators, which makes it possible to reduce system power dissipation by lowering input swings to levels otherwise not detectable [11] . The other is the statistical method presented by Zheng et al. in [12] , which needs a probability estimation module in the FPGA together with a threshold offset calibration module in the host computer. The statistical method simplifies the hardware design by connecting the comparator threshold to the ground of the printed circuit board (PCB), but a higher input level is needed to achieve good performance. Because ground-based applications place no strict constraints on component power dissipation, the statistical method is preferred here to simplify the hardware design.
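The skew budget quoted above can be sanity-checked numerically. The sketch below does not reproduce the expression from [10]; it assumes an ideal flat baseband spectrum from 0 to B, whose normalized autocorrelation is sin(2πBt)/(2πBt), and takes the reduction as one minus that value. Under this assumed model the figures in the text (0.087/B, and 174 ps at 500 MHz) come out just under 5%. The function name is hypothetical.

```python
import math


def skew_reduction(bandwidth_hz, skew_s):
    """Fractional loss 1 - sin(x)/x with x = 2*pi*B*t (assumed flat baseband spectrum)."""
    x = 2.0 * math.pi * bandwidth_hz * skew_s
    if x == 0.0:
        return 0.0  # no skew, no loss
    return 1.0 - math.sin(x) / x


# Loss at the 174 ps budget for 500 MHz bandwidth
print(f"{skew_reduction(500e6, 174e-12):.2%}")
```

The same function can be evaluated at other skew values to see how quickly the loss shrinks as the clocks are brought closer together.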
B. DATA RECEPTION
A data reception module performing serial-to-parallel conversion and per-bit deskew is designed based on the delay component IDELAYE3 and the deserialization component ISERDESE3 available in Xilinx Kintex UltraScale FPGA [13] . The IDELAYE3 can delay any input signal except global clocks, and the ISERDESE3 can avoid the additional timing complexities encountered when designing deserializers in the device logic [14] . The schematic of a data reception channel is shown in FIGURE 3. Because the maximum input clock frequency of the general IO interface is less than 1GHz, a double-data-rate (DDR) reception method is adopted. The frequency of the receiver clock is equal to half that of the sampling clock, which can be generated using a clock divider from the sampling clock.
The DDR clock is routed from a global clock input pin pair, via an IBUFDS input buffer, to two global clock buffers: BUFG and BUFG_DIV. The BUFG_DIV divides the input clock by n, where n is half of the required serial-to-parallel ratio (1:4 or 1:8) [14] . The BUFG clock is used to sample the serial data at the input of the ISERDESE3, while the BUFG_DIV clock is used to clock parallel data out of the ISERDESE3 and to clock the per-bit deskew state machine. The output of the BUFG is also used to clock the user logic. The incoming differential data lines are routed to a master IDELAYE3 and a slave IDELAYE3 via the IBUFDS_DIFF_OUT input buffer. After delay adjustment, these signals are connected to the master and slave ISERDESE3s. Parallel data from the master ISERDESE3 are forwarded to the per-bit deskew state machine and to the internal logic via a first-in/first-out (FIFO) memory. Parallel data from the slave ISERDESE3 are used only by the per-bit deskew state machine.
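The clock relationships described above reduce to two divisions. The following sketch assumes ideal dividers and uses hypothetical names; it only restates the arithmetic in the text (DDR receive clock at half the sampling rate, divider n equal to half the serial-to-parallel ratio).

```python
def reception_clocks(sample_rate_hz, serdes_ratio):
    """Return (DDR receive clock in Hz, divider n, parallel-word clock in Hz)."""
    if serdes_ratio not in (4, 8):
        raise ValueError("the text considers 1:4 or 1:8 only")
    ddr_clock = sample_rate_hz / 2  # DDR captures two bits per clock cycle
    n = serdes_ratio // 2           # divider is half the serial-to-parallel ratio
    return ddr_clock, n, ddr_clock / n


# 1 GHz sampling with 1:4 conversion
print(reception_clocks(1e9, 4))
```

With 1 GHz sampling and 1:4 conversion the parallel words arrive at 250 MHz, which matches the processing frequency used in the test system.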
The initial value of the master data delay is set to only compensate for the data-to-clock skew resulting from PCB routing and chip propagation delays, which are easy to obtain from a fixed PCB design. This ensures that the initial sample point is almost positioned in the correct place, and the per-bit deskew state machine is used to fine-tune each data line from that point onwards to improve data reception performance.
Following a power-up or reset, the per-bit deskew state machine starts running. The algorithm used to perform per-bit deskew originates from [15] and has been used in source-synchronous interfaces [16] . It works as follows. Two samples are taken half a bit period apart. If, following a data transition, the two samples are the same, then the master sampling point is too late, as shown in FIGURE 4.(a), and the input data delay needs to be increased by one step. If, following a data transition, the two samples are different, then the master sampling point is too early, as shown in FIGURE 4.(b), and the input data delay needs to be decreased by one step.
The master sampling point moves toward the middle of the bit period after every adjustment of the data delay. Because it is almost impossible to set the master sampling point exactly at the middle of a bit period, the master sampling point keeps switching around the middle once it is less than one delay step from the ideal point, as shown in FIGURE 4.(c) . This mechanism requires transitions in the incoming data. If a data line is static-zero or static-one, its delay remains at the initial value. Because it is impossible to do this comparison in real time synchronously to the sampling clock, the parallel received data are used. The deskew algorithm offers the benefit of removing other sources of skew, such as pin delays and package skew, from the timing analysis, and it makes PCB routing easier by relaxing the requirement for clock-to-data alignment at the pins of the FPGA.
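The decision rule described above can be illustrated with a small behavioral simulation. This is a sketch under simplifying assumptions, not the FPGA state machine: the sampling position is a normalized phase within the bit period (0.5 is the middle), the second sample is modeled as looking half a bit period earlier in the data, jitter and transition-region uncertainty are ignored, and the function name and step size are hypothetical.

```python
import random


def deskew(initial_phase, step=0.02, n_bits=2000, seed=0):
    """Return the master sampling phase (fraction of a bit period) after each bit."""
    rng = random.Random(seed)
    phase = initial_phase
    prev_bit = rng.getrandbits(1)
    history = []
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        if bit != prev_bit:  # only data transitions carry timing information
            master = bit
            # Half a bit period earlier lands in the same bit only if phase >= 0.5
            slave = bit if phase >= 0.5 else prev_bit
            if master == slave:
                phase -= step  # same: too late, one more delay step samples earlier in the bit
            else:
                phase += step  # different: too early, one less delay step samples later
        history.append(phase)
        prev_bit = bit
    return history


late = deskew(0.8)
early = deskew(0.2)
print(round(late[-1], 2), round(early[-1], 2))
```

Starting from either side, the phase walks toward 0.5 and then toggles within one step of it, which is the dynamically stable behavior described for FIGURE 4.(c). A data line with no transitions would never enter the adjustment branch, so its delay would stay at the initial value.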
The device utilization of a 64-channel data reception module performing 1:4 serial-to-parallel conversion and per-bit deskew at 1GSps using KCU040-2FFVA1156E is presented in Table 1 . 
III. VERIFICATION AND TEST
A. TEST SETUP
A 64-channel test system is built to verify the design, as shown in FIGURE 5. A noise source is used to generate white noise with 500 MHz bandwidth, which is then connected to a 64-channel power splitter network to produce fully correlated signals with a root-mean-square amplitude of 39 mV. The divided signals are AC coupled to a 64-channel sampling board and sampled by HMC874 clocked comparators [17] at 1 GHz. Because this is only a test board and the typical threshold offset voltage of the HMC874 is ±5 mV, a small normalized threshold offset can be obtained by increasing the power level of the input signal. The threshold input ports of the comparators are connected directly to the ground of the PCB to simplify the hardware design. A Xilinx KCU105 evaluation board containing a KCU040-2FFVA1156E FPGA is used to perform data reception and real-time processing, with the data pins distributed over 4 I/O banks. The processing results are sent to a computer via a UART interface. FIGURE 6 shows the block diagram of the FPGA design. The incoming data are first connected to a data capture module that performs 1:4 serial-to-parallel conversion and per-bit deskew. To make sure the per-bit deskew algorithm works, the delay value of the master IDELAYE3 in every data reception channel is sampled by the integrated logic analyzer (ILA) in the Vivado Design Suite with a sampling depth of 1024 and a sampling frequency of 250 MHz. A cross-correlation module is used to process the parallel data at 250 MHz. A counter module is used to count the total number of sampled bits and the number of ''ones'' in each data channel, which are used to estimate the threshold offset of the comparator.
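The internals of the cross-correlation and counter modules are not detailed in the text. As an illustration only, one common way to correlate 1-bit data delivered as parallel words is an XNOR followed by a population count, with a separate ones counter per channel. The function names, the 4-bit word width default, and the normalization below are assumptions.

```python
def correlate_words(words_x, words_y, width=4):
    """1-bit correlation of two word streams: (agreements - disagreements) / total bits."""
    mask = (1 << width) - 1
    agree = 0
    n_words = 0
    for wx, wy in zip(words_x, words_y):
        agree += bin(~(wx ^ wy) & mask).count("1")  # XNOR, then population count
        n_words += 1
    total = width * n_words
    return (2 * agree - total) / total


def count_ones(words, width=4):
    """Return (number of ones, total bits) for one channel."""
    mask = (1 << width) - 1
    return sum(bin(w & mask).count("1") for w in words), width * len(words)


x = [0b1010, 0b0110, 0b1111]
print(correlate_words(x, x), count_ones(x))
```

Identical streams give a correlation of 1 and bitwise-inverted streams give -1; the ones count and total bit count are the two quantities the threshold estimate needs.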
B. EYE DIAGRAM
The eye diagram obtained when sampling the noise signal at 1 GHz is shown in FIGURE 7. With the edge-latched comparators used for digitization, the stable bit period is about 1 ns.
C. PER-BIT DESKEW
FIGURE 8 shows the delay values of 8 data reception channels across 4 FPGA banks, including the channels having the maximum and the minimum delay values. The data delay is dynamically stable, which means the sampling point is moving around the middle point of every bit period and proves that the per-bit deskew algorithm works.
D. THRESHOLD OFFSET
As discussed in [12] , the threshold voltage can be estimated by
where V_T is the threshold offset voltage, σ is the standard deviation of the input noise signal, N_total is the total number of samples, N_1 is the number of ''1'' samples, erf(x) is the error function, and erf^{-1}(x) is the inverse error function. As shown in FIGURE 9, the threshold offset voltage ranges from about -3 mV to 5 mV, which is consistent with the typical comparator offset voltage of ±5 mV given in the chip specifications [17] .
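A statistical threshold estimate of this kind can be sketched in a few lines. The version below is not taken from [12]; it assumes a zero-mean Gaussian input and a comparator that outputs ''1'' when the input exceeds the threshold, so the fraction of ones equals the Gaussian tail probability above V_T. Under that sign convention the estimate is equivalent to √2·σ·erf^{-1}(1 − 2·N_1/N_total). The function name is hypothetical.

```python
from statistics import NormalDist


def estimate_threshold(n_ones, n_total, sigma):
    """Threshold offset in the units of sigma, assuming output 1 means input > threshold."""
    p_one = n_ones / n_total  # measured probability of a ''1''
    # P(input > V_T) = p_one, so V_T is the (1 - p_one) quantile of N(0, sigma)
    return NormalDist(0.0, sigma).inv_cdf(1.0 - p_one)


# Example: 39 mV RMS noise and slightly fewer ones than zeros
print(round(estimate_threshold(470_000_000, 1_000_000_000, 39.0), 2))
```

A fraction of ones below one half yields a positive offset and a fraction above one half yields a negative one; a channel that never toggles (all ones or all zeros) cannot be estimated and raises an error in `inv_cdf`.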
E. CORRELATION EFFICIENCY
Correlation efficiency is the ratio between the correlation coefficient obtained and the correlation coefficient of the input signal (here, unity). The reduction of correlation efficiency is mainly caused by the timing errors between sampling clocks and threshold offset of comparators. In the test board, the skew between 64 sampling clocks is less than 46 ps, which leads to a reduction less than 0.5% for 500MHz bandwidth according to Eq. (1). Only considering the threshold offsets of comparators, the measured correlation coefficient should be
where x_n and y_n are the 1B/2L quantization results of the two input noise signals of a correlator, and P{x_n = y_n} is the probability that x_n = y_n. Let T_x and T_y be the thresholds of the two comparators, with T_x < T_y. Because both comparators receive the same signal in this test, their outputs differ only when the input noise signal falls in the interval (T_x, T_y), so P{x_n = y_n} is one minus the probability of that event. If the input noise signals follow the standard normal distribution, with zero mean and unit standard deviation, then P{x_n = y_n} can be expressed as
and the correlation efficiency is
where E_offset is the correlation efficiency considering only the threshold offsets of the comparators. FIGURE 10 shows the measured correlation efficiency and the correlation efficiency calculated by Eq. (3) using the measured threshold offset voltages shown in FIGURE 9. The two results fit quite well when the correlation efficiency ranges from 0.92 to 0.98, except for the data inside the red rectangle and some isolated points. This close match indicates that the degradation is mainly caused by the threshold offsets of the comparators. The mismatches are found to occur between data channel 8 and the other data channels, because the input signal connected to channel 8 has a larger propagation delay from the power splitter network than the other channels.
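The effect of mismatched thresholds on 1-bit data can be illustrated with a short calculation. This sketch does not reproduce the paper's Eq. (3), which may include further normalization; it only computes the raw correlation coefficient of the two ±1 output sequences when both comparators receive the same unit-variance Gaussian signal, in which case the outputs disagree exactly when the input lies between the two thresholds. Thresholds are expressed in units of the input standard deviation, and the function name is hypothetical.

```python
from statistics import NormalDist


def one_bit_correlation_same_input(t_x, t_y):
    """Raw 1-bit correlation for two comparators fed the same unit-variance Gaussian."""
    lo, hi = sorted((t_x, t_y))
    std_normal = NormalDist()
    p_differ = std_normal.cdf(hi) - std_normal.cdf(lo)  # input lands between the thresholds
    return 1.0 - 2.0 * p_differ  # P(agree) - P(differ)


# Thresholds of -0.1 and +0.1 standard deviations
print(round(one_bit_correlation_same_input(-0.1, 0.1), 3))
```

Equal thresholds give a coefficient of 1 regardless of their common value, so only the difference between the two offsets matters in this model; a measured offset in millivolts would first be divided by the RMS input level to normalize it.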
IV. CONCLUSION
A 1-bit/2-level high-speed A/D conversion architecture based on comparators and FPGAs is presented in this paper, aiming to digitize 512 analog signals synchronously in an aperture synthesis imager. Multi-channel comparator chips are used to reduce the complexity of the clock network. A statistical method is used to calibrate the threshold offset of the comparators, which simplifies the PCB design. A data reception module performing serial-to-parallel conversion and per-bit deskew is designed, which enables an FPGA chip to receive data from more than one hundred data lines and transfer them from the sampling clock domain to the internal processing clock domain. A 64-channel test system is built to verify the design. The test results show that the skew between sampling clocks and the threshold offsets of the comparators are under control, and that the data reception module works correctly at 1 GHz sampling frequency. Although this architecture is designed for an aperture synthesis imager, the design methodology can also be useful for other systems that need to sample a large number of analog signals synchronously.
