Abstract-This paper presents a front-end application-specific integrated circuit (ASIC) integrated with a 2-D PZT matrix transducer that enables in-probe digitization with acceptable power dissipation for the next-generation endoscopic and catheter-based 3-D ultrasound imaging systems. To achieve power-efficient massively parallel analog-to-digital conversion (ADC) in a 2-D array, a 10-bit 30 MS/s beamforming ADC that merges the subarray beamforming and digitization functions in the charge domain is proposed. It eliminates the need for costly intermediate buffers, thus significantly reducing both power consumption and silicon area. Self-calibrated charge references are implemented in each subarray to further optimize the system-level power efficiency. High-speed datalinks are employed in combination with the subarray beamforming scheme to realize a 36-fold channel-count reduction and an aggregate output data rate of 6 Gb/s for a prototype receive array of 24 × 6 elements. The ASIC achieves a record power efficiency of 0.91 mW/element during receive. Its functionality has been demonstrated in both electrical and acoustic imaging experiments.
I. INTRODUCTION

Data acquisition from 2-D transducer arrays has become one of the main challenges for the development of endoscopic and catheter-based 3-D ultrasound imaging devices, such as transesophageal echocardiography (TEE) [1], intracardiac echocardiography (ICE) [2], and intravascular ultrasound (IVUS) [3], [4] probes. The main obstacle lies in the mismatch between the large number of transducer elements needed for 3-D imaging and the limited number of cables that can be accommodated in these systems. Recent advances in transducer-on-CMOS integration methods [5], [6] have enabled the use of front-end application-specific integrated circuits (ASICs) performing signal conditioning and data reduction close to the transducer. The concept of subarray beamforming [7], which divides the transducer array into subarrays and combines the signals received by the elements in each subarray by a local delay-and-sum operation, is capable of reducing the channel count by an order of magnitude. This has been successfully demonstrated in several ASIC prototypes [1], [8] and has made it possible to develop probes with 3000+ transducer elements [9]. However, it is still an arduous engineering problem to assemble hundreds of cables within endoscopes or catheters of ≤5 mm diameter [2]. Such constraints have forced system designers to trade off imaging quality against physical dimensions and fabrication cost [2].
A variety of efforts have been made in recent years to further reduce the cable count by making better use of the cable capacity. Time-multiplexing in the analog domain [10] allows the signals received by multiple elements to share a single cable. However, the limited bandwidth and transmission-line effects of the micro-coaxial cables result in significant channel-to-channel crosstalk [11], [12]. Other analog modulation methods, such as frequency-division multiplexing (FDM) [13], also suffer from the cable non-idealities.
A more radical solution is to move the analog-to-digital converters (ADCs) into the probe and perform the channel reduction in the digital domain, where complex modulated signals can be transmitted with much better robustness against noise, interference, and crosstalk. Moreover, in-probe digitization would open up the possibility to migrate more signal processing functions into the probe, such as post-beamforming [14] and compressive beamforming [15], which are expected to further reduce the channel count. A prerequisite to achieve this goal is an efficient way to implement a massively parallel ADC array within the stringent power and area constraints of miniature ultrasound probes.
Based on the framework of subarray beamforming, in-probe digitization can be realized by digitizing the output of analog subarray beamformers with per-subarray ADCs [16]. Alternatively, digital beamforming with per-element ADCs can be considered [17]-[19]. The latter approach requires $D^2$ ADCs and input buffers for a D × D subarray, plus the associated digital FIFOs for the realization of delays. Such circuitry is difficult to integrate directly underneath a pitch-constrained 2-D transducer with an affordable power consumption. In [17], the area problem was addressed by using nanoscale CMOS technologies. Nevertheless, the reported element-level ADC is larger than the ideal half-wavelength pitch of 150 μm at 5 MHz, and the power dissipation is more than two orders of magnitude higher than its analog counterpart [1]. Moving the ADC to the output of an analog beamformer helps in reducing the area and power cost of the ADC. The work of [16] employs subarray beamformers based on S/H circuits followed by a stand-alone Nyquist-rate ADC. The reported silicon area and power are both dominated by the analog beamformer, resulting in a per-channel footprint that is 12× larger than the dimension of a 5-MHz transducer element and a power consumption far exceeding the heat dissipation constraint for miniature probes [20]. More recently, [21] proposes to embed charge-redistribution successive approximation register (SAR) ADCs with a subarray beamformer, so as to save power. This approach, however, requires N SAR ADCs to implement an N-channel beamformer, resulting in a poor area efficiency.
In this paper, we propose an element-pitch-matched ASIC architecture to demonstrate the feasibility of efficient in-probe digitization in miniature ultrasound probes [22]. It provides subarray beamforming for a directly integrated 5-MHz PZT matrix with a half-wavelength element pitch of 150 μm. In each 3 × 3 subarray, a compact Nyquist-rate beamforming ADC is implemented following the analog front-end (AFE) circuits. By merging the beamforming and digitization functions coherently in the charge domain, there is no need for intermediate ADC buffers, resulting in significant power and area reduction. The outputs of four beamforming ADCs are fed into a high-speed digital serializer on the periphery of the ASIC to reduce the total output channel count by a factor of 4. Thus, the ASIC achieves a 36-fold channel-count reduction, while consuming less than 1 mW/element in receive mode. The effectiveness of the proposed architecture has been successfully demonstrated in both electrical tests and a 3-D imaging experiment.
This paper is organized as follows. The architecture of the proposed ASIC is discussed in Section II. Section III describes the circuit implementation details of the subarray receiver and the datalink. Section IV presents the experimental setup and results. The conclusion is given in Section V.
II. SYSTEM ARCHITECTURE
A. Overview
Fig. 1 shows an overview of the proposed system. It consists of a front-end ASIC, a 5-MHz 150-μm-pitch PZT matrix, and the associated external electronics. The PZT matrix is constructed from a bulk piezoelectric material (CTS 3202 HD) that is stacked on the ASIC using the PZT-on-CMOS heterogeneous integration process described in [6]. A metallic interconnection layer and a conductive glue layer connect the bottom electrodes of the PZT elements to a bondpad array on the ASIC.
As a proof-of-concept, we use a PZT matrix with a relatively small aperture in this work, while the proposed circuit architecture and layout are both designed to facilitate the future extension to a larger array size, such as the 32 × 32 array presented in [1]. The matrix is divided into 3 × 24 transmit (TX) elements and 6 × 24 receive (RX) elements. Similar to [1] and [6], the TX elements are directly wired out to external high-voltage pulsers using metal traces in the ASIC, thus enabling the use of a low-voltage CMOS process [3]. Nevertheless, the concepts presented in this paper are equally applicable to an ASIC in which local TX pulsers are realized using a high-voltage CMOS technology. In this prototype, every three TX elements in the same column are connected together in the ASIC to reduce the overall I/O count, turning them into a 1-D phased array with a broad beam profile in the elevation direction. In a future extension to a larger array, a more symmetrical transmit layout would be preferred, such as the central transmit subarray described in [1]. In RX, subarray beamforming is applied on 3 × 3 elements to realize a 9-fold channel reduction, yielding 16 subarrays in total. Fig. 2(a) shows the architecture of the RX electronics in the ASIC. Each subarray receiver consists of a nine-channel AFE that acquires echo signals from the transducer elements and conveys them to programmable delay lines. The outputs of the nine delay lines are added and digitized in the charge domain by an SAR ADC. The digitized data are transferred to a datalink at the periphery of the ASIC, where the data from four subarrays are combined on a shared low-voltage differential signaling (LVDS) output channel, thus realizing an extra 4-fold channel-count reduction. At the system side, a field-programmable gate array (FPGA) [23] receives and stores the data, which are then transmitted to a PC for image reconstruction.
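The channel-count bookkeeping described above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the variable names are ours, and the numbers are the ones quoted in the text.

```python
# Channel-count bookkeeping for the prototype receive array (values from the text).
RX_ROWS, RX_COLS = 6, 24       # receive elements
SUB_ROWS, SUB_COLS = 3, 3      # elements per subarray
SUBARRAYS_PER_LINK = 4         # subarrays sharing one LVDS output

rx_elements = RX_ROWS * RX_COLS
subarrays = (RX_ROWS // SUB_ROWS) * (RX_COLS // SUB_COLS)
lvds_channels = subarrays // SUBARRAYS_PER_LINK
reduction = rx_elements // lvds_channels   # overall channel-count reduction

print(rx_elements, subarrays, lvds_channels, reduction)
```

The 9-fold reduction from beamforming and the 4-fold reduction from the shared datalink multiply to the 36-fold figure quoted in the abstract.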
B. AFE
The AFE in each channel consists of a low-noise amplifier (LNA) and a programmable-gain amplifier (PGA). Fig. 2(b) illustrates the expected peak-to-peak voltage signal received by a transducer element as a function of time. Note that the time axis, assuming a constant speed of sound, is equivalent to the axial depth. Echoes resulting from deeper scatterers arrive later and are more attenuated, leading to an overall peak-to-peak amplitude range from 30 μV to 500 mV, while the instantaneous dynamic range (DR), i.e., the ratio between the largest and smallest echoes at a given imaging depth, is about 40 dB. The depth-dependent (and hence time-dependent) attenuation can be compensated by applying time-varying gains in the AFE. This time-gain compensation (TGC) function is implemented by distributing discrete gain steps between the two stages in the AFE. The LNA provides a high gain for small echo signals from deep scatterers, where the acoustic and electrical noise determine the detection limit, and a lower gain for large echoes from nearby scatterers, where linearity matters. With a gain step of 18 dB, the DR is thus compressed to approximately 58 dB [Fig. 2(c)]. This is further reduced by the PGA, which provides gain steps of 6 dB, resulting in an output DR of 46 dB, with peak-to-peak amplitudes ranging from 4 to 800 mV [Fig. 2(d)]. This arrangement effectively reduces the noise and DR requirements of the succeeding circuits, while keeping the complexity of the TGC implementation modest, thus bringing significant power and area advantages for the whole system. It can be combined with a fine-gain compensation in the digital back-end to avoid imaging artifacts associated with the gain steps.
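As a quick sanity check on the amplitude ranges quoted above, the ratios can be converted to decibels. This is only a sketch; the helper `db` is a hypothetical name, and only the end points stated in the text are used.

```python
import math

def db(ratio):
    """Voltage ratio expressed in decibels."""
    return 20.0 * math.log10(ratio)

# Overall amplitude range at the transducer: 30 uV to 500 mV peak-to-peak.
input_range_db = db(500e-3 / 30e-6)

# Range at the PGA output after the TGC gain steps: 4 mV to 800 mV.
output_range_db = db(800e-3 / 4e-3)

print(round(input_range_db, 1), round(output_range_db, 1))
```

The second figure reproduces the 46-dB output DR stated in the text; the first shows how much range the two-stage TGC has to absorb.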
C. Beamforming ADC
The beamformer in each subarray is similar to [1]. Analog delay lines based on pipeline-operated switched-capacitor memory cells are employed because of their simplicity and flexibility in delay control, as well as their good immunity to process/voltage/temperature (PVT) variations [16], [24].
Each delay line consists of eight memory cells operating in a time-interleaved fashion at a sampling rate of 30 MHz, corresponding to a delay resolution of 33 ns and a maximum range of ∼233 ns. This delay range allows for pre-steering up to ±37° in both the azimuthal and elevation directions [6]. The chosen delay quantization causes negligible degradation to the image quality with the aid of the imaging scheme proposed in [25], while requiring only a modest number of memory elements that fit within the available die area. The delay control logic is implemented based on a delay stage index rotator as described in [1], which can be programmed via a built-in serial-peripheral-interface (SPI) embedded in each subarray.
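The delay resolution and range follow directly from the sampling rate and the number of memory cells. The sketch below reproduces them; `delay_code` is a hypothetical helper (not part of the ASIC's SPI interface) that only illustrates how a desired delay would map onto the available steps.

```python
F_S = 30e6      # sampling rate in Hz (from the text)
N_CELLS = 8     # memory cells per delay line (from the text)

step_s = 1.0 / F_S                    # delay resolution, about 33 ns
max_delay_s = (N_CELLS - 1) * step_s  # longest delay, about 233 ns

def delay_code(delay_s):
    """Hypothetical helper: nearest programmable delay step, clamped to range."""
    code = round(delay_s * F_S)
    return max(0, min(N_CELLS - 1, code))

print(step_s * 1e9, max_delay_s * 1e9, delay_code(100e-9))
```

With eight cells, seven sampling periods separate the shortest and longest delay, which gives the ∼233-ns range.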
Dynamic focusing in the subarray beamformer is not implemented in this design as it is not required given the relatively small subarray size (3 × 3) [7], [26]. However, if desired, it can be readily added to the proposed beamforming ADC architecture by adopting the delay extension/skipping approach as described in [9].
As discussed in Section II-B, the signal DR at the output of each AFE channel is 46 dB. Considering the extra √9 (9.5 dB) SNR gain provided by a nine-channel beamformer, a 10-bit ADC is required to achieve an adequate quantization resolution. For an ultrasound transducer with a ≥50% fractional bandwidth, the ADC sampling rate must be 4-10 times the transducer central frequency to maintain a satisfactory sidelobe level [27], which corresponds to 20 to 50 MS/s for our 5-MHz array. Given these specifications, an SAR ADC stands out as the architecture of choice for its superior power efficiency [28].
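The ADC specification can be retraced numerically. The sketch below adds the beamforming gain to the AFE output DR and compares the result with the ideal quantization SNR of an N-bit converter. The 6.02N + 1.76 dB expression is the standard textbook formula for a full-scale sine wave; it is our assumption here and is not quoted in the text.

```python
import math

AFE_DR_DB = 46.0     # DR at each AFE output (from the text)
N_CHANNELS = 9       # channels per subarray (from the text)
F_CENTER = 5e6       # transducer center frequency in Hz (from the text)

bf_gain_db = 20.0 * math.log10(math.sqrt(N_CHANNELS))  # about 9.5 dB
required_dr_db = AFE_DR_DB + bf_gain_db

def ideal_sqnr_db(bits):
    """Ideal quantization SNR for a full-scale sine (textbook assumption)."""
    return 6.02 * bits + 1.76

margin_db = ideal_sqnr_db(10) - required_dr_db

fs_min, fs_max = 4 * F_CENTER, 10 * F_CENTER  # 4x to 10x the center frequency

print(round(required_dr_db, 1), round(margin_db, 1), fs_min, fs_max)
```

Under this assumption a 10-bit converter leaves roughly 6 dB of margin over the beamformed DR, and the 30-MS/s rate used in this work sits inside the 20 to 50 MS/s window.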
Most SAR ADCs perform the quantization in the voltage domain [29], [30], while switched-capacitor-based (S/H) delay lines essentially operate in the charge domain. Therefore, a charge-to-voltage conversion is required for driving the ADC. An active summing amplifier was used in [16] to sum the charges stored on the delay-line capacitors $C_M$, while an extra voltage buffer drives the ADC [Fig. 3(a)]. To implement a unity voltage gain, the feedback capacitance must be equal to the total memory capacitance involved in each cycle, i.e., $N \times C_M$, leading to a considerable power consumption in the amplifier, more than 10× the power of the ADC in [16]. The summing amplifier can be eliminated by adopting passive charge summation [1] [Fig. 3(b)]. However, the power consumed by the ADC driver is still problematic; it is usually comparable to that of the ADC itself, as a result of the relatively large capacitance it needs to drive during a constrained input sampling time.
To eliminate the ADC driver, we propose to perform the digitization in the charge domain, rather than in the voltage domain [Fig. 3(c)]. This is achieved by sequentially neutralizing the passively summed signal charge with binary-scaled charge references through a successive approximation process. In practice, the charge references can be realized as a precharged capacitor DAC (CDAC) array. By doing so, the beamformer and the digitizer are essentially merged together: the delay lines act as a multichannel time-interleaved input sampler in a charge-sharing SAR ADC [32]. We will refer to this circuit as a beamforming ADC. Both the beamforming and the digitization are performed differentially to mitigate the impact of common-mode noise and interference. In each channel, the PGA converts the single-ended LNA output to a differential voltage, which is cyclically sampled and held on memory cells under the control of non-overlapping sampling clocks, $S_{1:8}$. The charge sampled on the memory cells is then released to the summing nodes, $V_{XP}$ and $V_{XN}$, at the rising edges of channel-specific readout clocks, $R^{k}_{1:8}$, where k ranges from 1 to 9. The delay of a channel is thus defined by the time interval between the falling edges of its sampling and readout clocks. Before the start of each readout phase, a CDAC array is precharged to a reference voltage ($V_{REF}$). In each readout phase, the successive approximation charge-balancing process starts after a short time interval reserved for the passive charge redistribution on the joint memory cells. In every bit cycle, a dynamic comparator detects the sign of the differential voltage on the summing nodes ($V_{XP} - V_{XN}$) and dictates the polarity of the charge reference for use in the next cycle. To obviate the need for distributing an oversampled clock, self-timed SAR logic [33] is employed. By the end of the readout phase, a digital representation of the delayed-and-summed charge is available.
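The charge-balancing idea can be illustrated with an idealized behavioral model. This is a sketch under strong assumptions: charges are treated as plain numbers, the comparator is ideal, and parasitics, noise, and the actual CDAC topology are ignored. The function names and the normalization of the MSB charge reference are ours.

```python
def charge_sar(q_signal, q_ref_msb, n_bits=10):
    """Idealized charge-balancing SAR: return bit decisions, MSB first."""
    residue = q_signal          # differential charge left on the summing nodes
    q_ref = q_ref_msb           # charge reference used in the current bit cycle
    bits = []
    for _ in range(n_bits):
        bit = 1 if residue >= 0 else 0        # comparator senses the sign
        bits.append(bit)
        residue += -q_ref if bit else q_ref   # neutralize with opposite polarity
        q_ref /= 2.0                          # binary-scaled references
    return bits

def bits_to_charge(bits, q_ref_msb):
    """Reconstruct the charge represented by a bit sequence."""
    q, q_ref = 0.0, q_ref_msb
    for bit in bits:
        q += q_ref if bit else -q_ref
        q_ref /= 2.0
    return q

code = charge_sar(0.3, 1.0)
print(code, bits_to_charge(code, 1.0))
```

In this model the reconstruction error is bounded by the smallest charge reference, which is the charge-domain counterpart of one LSB.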
To simplify the output routing, the differential outputs of the dynamic comparator (CPout+/−) are buffered and transmitted to the periphery of the ASIC, where the 10-bit parallel data are recovered and synchronized to a high-speed system clock for further processing.
Upon completion of a conversion, the summing nodes ($V_{XP}$ and $V_{XN}$) are reset to prevent undesired signal attenuation associated with residue charge on the parasitic capacitors [1], as shown in Fig. 4(a) (CPRST). This operation also enables the calibration of the comparator offset in the background, as will be discussed in Section III-D.
At the periphery of the ASIC, a clock-and-data-recovery (CDR) circuit converts the differential return-to-zero (RZ) outputs of the SAR comparator to a single-bit non-return-to-zero (NRZ) data stream, and extracts an asynchronous clock that is aligned with the recovered data. The data stream is then synchronized to a 300-MHz global clock in a FIFO for further processing.
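A behavioral sketch of this RZ-to-NRZ recovery is given below. It assumes, as Section III-E describes, that the clock is formed as the OR of the two comparator outputs and that the data flip-flop is cleared by the negative output. The sampled-waveform representation and the function name `recover_nrz` are illustrative assumptions, not the circuit itself.

```python
def recover_nrz(samples):
    """Recover NRZ bits from sampled RZ comparator outputs.

    samples: iterable of (pos, neg) logic levels over time. Both outputs
    are low while the comparator is reset; one pulses high per decision.
    """
    bits = []
    prev_clk = 0
    for pos, neg in samples:
        clk = pos | neg                   # recovered clock: OR of both outputs
        if clk and not prev_clk:          # rising edge marks a new decision
            bits.append(0 if neg else 1)  # negative pulse clears the data bit
        prev_clk = clk
    return bits

waveform = [(0, 0), (1, 0), (0, 0), (0, 1), (0, 0), (1, 0), (1, 0), (0, 0)]
print(recover_nrz(waveform))
```

Because the clock is derived from the data pulses themselves, the recovered clock and data stay aligned without a separately distributed oversampling clock.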
Before the serialization, every two consecutive 4-bit-wide FIFO outputs are concatenated to an 8-bit byte by a 4b-to-8b converter, which is then mapped to a 10-bit code in an 8b/10b encoder [34]. This coding scheme facilitates clock recovery at the system side without relying on a per-channel clock line. Moreover, it ensures DC balance in the data stream, which helps both data recovery and error detection at the system side [34]. The 10-bit data are then serialized to a single-bit data stream at 1.5 Gb/s, which is buffered by an LVDS driver and transmitted over a twin-axial cable to the imaging system.
The ASIC receives a 300-MHz global LVDS clock from the system, which is converted to CMOS logic levels at the periphery of the ASIC. Here, it serves as the main clock for the core of the datalink, and is multiplied by 5 in a delay-locked loop (DLL) to generate the clock phases needed for serialization. The 300-MHz clock is also divided by 10 to produce the 30-MHz beamforming clock ($CLK_{BF}$), which is distributed across the subarray receivers via a balanced clock tree.
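The datalink rates quoted in this section are mutually consistent, as the following arithmetic sketch shows. All numbers come from the text; the number of LVDS links (four) follows from 16 subarrays sharing links in groups of four, and the variable names are ours.

```python
F_S = 30e6               # beamforming ADC sample rate in Hz
ADC_BITS = 10            # bits per conversion
SUBARRAYS_PER_LINK = 4   # subarrays multiplexed onto one LVDS link
LINKS = 4                # 16 subarrays / 4 per link
F_CLK = 300e6            # global clock in Hz

payload_rate = F_S * ADC_BITS * SUBARRAYS_PER_LINK  # raw ADC data per link
line_rate = payload_rate * 10 / 8                   # after 8b/10b encoding
aggregate_rate = line_rate * LINKS                  # all LVDS outputs together

serializer_clock = F_CLK * 5    # DLL multiplies the global clock by 5
beamforming_clock = F_CLK / 10  # divided-down beamforming clock

print(payload_rate, line_rate, aggregate_rate, serializer_clock, beamforming_clock)
```

The 1.2-Gb/s payload matches a 4-bit-wide FIFO output read at 300 MHz, and the 8b/10b overhead raises it to the 1.5-Gb/s line rate per link, or 6 Gb/s in aggregate.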
III. CIRCUIT IMPLEMENTATION
A. AFE
The LNA in each AFE channel is an improved version of the design described in [1] and [35]. It is implemented as a single-ended capacitive feedback current-reuse amplifier with a split capacitor network to achieve a compact layout, and consumes 75 μA. The PGA implements three functions: 1) providing four fine gain steps to define the TGC gain resolution; 2) converting the LNA output to differential signals to drive the delay lines; and 3) low-pass filtering prior to sampling to minimize aliasing. As shown in Fig. 6, a programmable capacitor network provides the desired gain levels ranging from 6 to 24 dB according to the control code map shown below. To save area, a T-type capacitor network [36], employing unit capacitors of 33 fF, is used as the feedback element across a compact differential telescopic amplifier. Each PGA consumes 100 μA.
The PGA drives a delay line that consists of eight stages of time-interleaved differential S/H memory cells. Each cell comprises a pair of grounded metal-insulator-metal (MIM) capacitors and a set of nMOS switches for sampling and readout. The capacitors are sized as large as possible (133 fF) within the available area to minimize the kT/C noise contribution to the front end. The worst-case settling time constant of approximately 1.5 ns is less than 1/20 of the sampling period, adequate for the required linearity in spite of the signal-dependent ON-resistance of the switches.
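To put the capacitor size and settling claim in perspective, the sketch below evaluates the kT/C noise of a single 133-fF memory capacitor and compares the quoted time constant with the sampling period. The temperature of 300 K is our assumption, and the noise figure is for one capacitor only; it does not model the differential pair or the rest of the signal chain.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant in J/K
TEMP_K = 300.0       # assumed temperature (not stated in the text)
C_M = 133e-15        # memory capacitor in F (from the text)
TAU = 1.5e-9         # worst-case settling time constant in s (from the text)
F_S = 30e6           # sampling rate in Hz (from the text)

ktc_noise_v = math.sqrt(K_B * TEMP_K / C_M)  # rms sampled noise per capacitor
period_s = 1.0 / F_S
tau_budget_s = period_s / 20.0               # the 1/20-period bound in the text

print(round(ktc_noise_v * 1e6, 1), TAU < tau_budget_s)
```

Under these assumptions the sampled noise is on the order of 180 μV rms per capacitor, and the 1.5-ns time constant stays below the 1/20-period bound of about 1.67 ns.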
As discussed in [1], the mismatch of the S/H memory cells would introduce a ripple pattern with a period of $M/f_S$, where M = 8 is the number of delay steps and $f_S$ is the sampling frequency. The mismatch-scrambling technique proposed in [1] could equally be applied in this work to mitigate this interference. As an alternative, however, since the beamformer outputs are digitized synchronously to the same system clock that controls the beamformer, the ripple patterns at different delay settings are pre-recorded and stored in memory, and then subtracted from the outputs during the normal receive phase. This calibration process is realized off-chip in the back-end digital processing unit (FPGA). This approach takes advantage of the integrated beamforming ADC, and thus not only saves area but also prevents adding the excess noise associated with the mismatch-scrambling technique [1], at the cost of an increased complexity in the back-end signal processing. Such extra complexity is modest, as only a relatively small set of ripple patterns needs to be recorded, given that only a modest number of pre-steering directions are required, e.g., 25 in [25].
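The back-end ripple calibration can be sketched as follows for a single delay setting. This is a minimal illustration that assumes every capture starts at the same phase of the eight-cell rotation and contains at least one full period; the function names are hypothetical, and the actual FPGA implementation is not described at this level in the text.

```python
def record_ripple(calibration_runs, period=8):
    """Average grounded-input captures sample-by-sample, modulo the period."""
    sums = [0.0] * period
    counts = [0] * period
    for run in calibration_runs:
        for i, sample in enumerate(run):
            sums[i % period] += sample
            counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]

def subtract_ripple(samples, pattern):
    """Remove the pre-recorded periodic pattern from a receive capture."""
    return [x - pattern[i % len(pattern)] for i, x in enumerate(samples)]

ripple = [1.0, -1.0, 2.0, 0.0, 0.5, -0.5, 0.0, 3.0]
pattern = record_ripple([ripple * 4 for _ in range(3)])
echo = [float(i) for i in range(16)]
measured = [x + ripple[i % 8] for i, x in enumerate(echo)]
print(subtract_ripple(measured, pattern))
```

Averaging over many grounded-input captures suppresses random noise in the recorded pattern, so the subtraction removes the deterministic ripple without raising the noise floor appreciably.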
B. Charge-Reference Generation
The generation and distribution of references for a massively parallel ADC array is challenging. In prior implementations of charge-sharing SAR ADCs [32] , [37] , the CDAC is precharged by an external voltage source before the start of each conversion. Due to the significant load that this source needs to drive in a nanosecond time slot, this approach is prone to suffer from errors due to di/dt transients caused by the bondwire inductance, unless large on-chip decoupling capacitors are used. The alternative of employing on-chip reference buffers would introduce a significant power overhead [38] .
In this work, we propose to precharge the CDAC with current sources to relax the power and area requirements for charge reference generation. Unlike the approach described in [38], the precharging current is locally generated in each subarray. To mitigate mismatch and PVT sensitivity, it is self-calibrated during the TX phase with reference to an external voltage ($V_{REF}$). This simplifies the system-level layout, as no global current reference distribution or high-current voltage reference routing is required. Fig. 7 shows the schematic of the charge reference generator and its timing details. It consists of a gated current source ($M_P$), a charge pump, and a calibration comparator. Intuitively, the charge reference generated by a gated current source can be written as
$$Q_{REF} = I_P \cdot T_{int} \tag{1}$$
where $I_P$ is the precharging current and $T_{int}$ is the precharging duration defined on-chip. It is, however, difficult to maintain uniformity of $Q_{REF}$ across the whole array, since both $I_P$ and $T_{int}$ are subject to process variations and mismatch. Therefore, we define the charge reference as
$$Q_{REF} = C_{DAC} \cdot V_{REF} \tag{2}$$
where $C_{DAC}$ is the total capacitance of the DAC array and $V_{REF}$ is the desired voltage reference with respect to the AFE output.
As the absolute value of capacitors in modern CMOS processes is typically more tightly controlled [39], calibration is applied to $I_P$ so that the voltage on $C_{DAC}$ ($V_{DAC}$) after precharging approaches $V_{REF}$. This is accomplished by introducing a calibration phase in synchronization with the TX phase, when digitization is not required. During this phase, the charge pump and the calibration comparator are periodically activated in a short time interval (CAL) following each precharging cycle. The calibration comparator compares $V_{DAC}$ and $V_{REF}$, and according to the result, a unit charge packet is pumped into or pulled out of a MOS memory capacitor ($C_{MOS}$) at the gate of $M_P$ to adjust the value of $I_P$ for the next cycle. The size of the charge packet, which dictates the reference calibration resolution ($LSB_{CAL}$), is set by both the pulse duration time and the magnitude of the sourcing/sinking current in the charge pump. This process repeats for a defined number of cycles, at the end of which $V_{DAC}$ has converged to $V_{REF}$ to within ±1 $LSB_{CAL}$.
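The convergence behavior of this calibration loop can be illustrated with a simple bang-bang model. It is a sketch under simplifying assumptions: each charge packet is taken to shift the precharged DAC voltage by a fixed amount (one calibration LSB), whereas in the circuit the packet changes the gate voltage of the current-source transistor. The function name, starting current, and LSB size are illustrative choices, not values from the text.

```python
def calibrate_reference(v_ref, c_dac, t_int, i_start, lsb_cal_v, cycles):
    """Bang-bang calibration of the precharge current; returns final V_DAC."""
    i_step = lsb_cal_v * c_dac / t_int   # current step equal to one LSB on V_DAC
    i_p = i_start
    for _ in range(cycles):
        v_dac = i_p * t_int / c_dac      # voltage reached after precharging
        # Comparator decision: pump charge in or out for the next cycle.
        i_p += i_step if v_dac < v_ref else -i_step
    return i_p * t_int / c_dac

v_final = calibrate_reference(
    v_ref=0.160, c_dac=1.5e-12, t_int=25e-9,
    i_start=5e-6, lsb_cal_v=1e-3, cycles=200,
)
print(round(v_final * 1e3, 2))
```

Once the loop has crossed the target it toggles around it, which is why the residual error stays within one calibration LSB.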
During the RX phase, both the charge pump and the calibration comparator are disabled, and the gated current source precharges $C_{DAC}$ based on the bias voltage stored on $C_{MOS}$. The leakage on $C_{MOS}$ (∼4 pF) would lead to a worst-case gain drift of 0.3 dB within one RX cycle (100 μs), which is negligible given the significant time-dependent attenuation of ultrasound. The broadband white noise of the charge pump is sampled on $C_{MOS}$ and held constant throughout the RX phase, and therefore does not contribute any in-band noise.
Both the precharging current noise and the jitter of $T_{int}$ lead to noise charge sampled on $C_{DAC}$. The former is minimized by appropriate sizing of transistor $M_P$ ($W_P/L_P \ll 1$), leaving jitter of the precharging clock as the dominant noise source. This clock is derived from the ASIC input clock, whose jitter performance is, therefore, crucial. Maximizing $T_{int}$ helps in relaxing the jitter requirements. To do so, we use a ping-pong charge reference that consists of two identical CDAC arrays, as shown in Fig. 7. A duration of 25 ns (3/4 of the sampling clock period) is allocated for $T_{int}$, permitting the use of a system clock with moderate jitter (∼20 ps).
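Since the reference charge is the product of a current and a time interval, timing jitter translates into a relative charge error equal to the ratio of the jitter to the precharging duration. The sketch below evaluates this first-order estimate for the quoted values; the 5-ns comparison case is a hypothetical shorter window chosen only to show the trend.

```python
F_S = 30e6           # sampling rate in Hz (from the text)
T_INT = 25e-9        # precharging duration in s (from the text)
JITTER_RMS = 20e-12  # clock jitter in s (from the text)

def relative_charge_noise(jitter_s, t_int_s):
    """First-order relative error of Q = I * T caused by timing jitter."""
    return jitter_s / t_int_s

fraction_of_period = T_INT * F_S                     # share of the clock period
noise_25ns = relative_charge_noise(JITTER_RMS, T_INT)
noise_5ns = relative_charge_noise(JITTER_RMS, 5e-9)  # hypothetical short window

print(fraction_of_period, noise_25ns, noise_5ns)
```

A longer precharging window reduces the relative error in proportion, which is the motivation for the ping-pong arrangement.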
In the calibration phase, only one DAC array is connected to the gated current source, while in the RX phase they are alternately used for precharging and conversions. By sharing the current source for precharging and the timing logic for generating T int , the ping-pong charge reference is free from interleaving spurs caused by the DAC capacitance mismatch.
The topology of each DAC array is similar to [37] . The charge references corresponding to the first 7 MSBs are produced by precharging a bank of binary-scaled capacitors, while those corresponding to the last 3 LSBs are generated using charge redistribution. Metal-oxide-metal (MOM) capacitors of 23 fF with symmetric plate parasitic capacitance are utilized as unit capacitors to ensure adequate matching for a 10-bit linearity. In total, 67 unit capacitors are used in each DAC, leading to a total capacitance of about 1.5 pF.
The required reference voltage is determined by the following charge-balancing equation:
$$4 \, C_{DAC} \, V_{REF} = N \, C_M \, V_{in,max} \tag{3}$$
where N is the number of subarray elements, $V_{in,max}$ is the maximum peak-to-peak differential PGA output swing, and $C_M$ is the capacitance of a unit memory cell in a delay line of the beamformer. The factor of 4 comes from the single-ended-to-differential conversion of $C_{DAC}$. For N = 9, $V_{in,max}$ = 800 mV, $C_M$ = 133 fF, and $C_{DAC}$ ≈ 1.5 pF, $V_{REF}$ is approximately 160 mV.
The calibration comparator is implemented as a StrongArm latch following a class-A preamplifier that is only powered on when CALEN is high (Fig. 8). The preamplifier consists of two cascaded stages of resistively loaded differential pairs to guarantee a sufficient gain. The first stage is dimensioned to minimize both noise and offset. Given the relatively low reference voltage, a pMOS input pair is used. Since the comparator is powered down during the RX phase, its contribution to the overall power consumption is negligible.
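The quoted reference voltage can be reproduced from the stated component values. The sketch assumes the charge balance takes the form N·C_M·V_in,max = 4·C_DAC·V_REF, which is our reading of the factor of 4 mentioned in the text.

```python
N = 9              # subarray elements (from the text)
V_IN_MAX = 0.8     # max peak-to-peak differential PGA output in V (from the text)
C_M = 133e-15      # unit memory capacitor in F (from the text)
C_DAC = 1.5e-12    # total DAC capacitance in F (approximate, from the text)

# Assumed charge balance: N * C_M * V_in,max = 4 * C_DAC * V_REF
v_ref = N * C_M * V_IN_MAX / (4 * C_DAC)

print(round(v_ref * 1e3, 1))  # reference voltage in mV
```

The result is close to the 160 mV quoted above, which supports this reading of the charge-balancing relation.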
C. SAR Logic
For an implementation in 0.18-μm CMOS, the high-speed SAR logic readily dominates the power consumption of the ADC, and therefore needs to be carefully minimized. Compared with conventional implementations [29], [30], the proposed scheme minimizes the time delay between the comparator decision and the DAC switching, thus relaxing the timing for charge sharing. During the conversion, each DFF pair reads the comparator decision by sensing the rising edges of the comparator outputs. Once a rising edge on either side is detected, the DFF pair is immediately disabled and no longer responsive to succeeding comparison events until the end of the cycle. The data are then captured and frozen to control the switching of corresponding DAC elements. To identify the completion of a conversion, the enable signal of the LSB DFF is used as the DataReady signal. An additional DFF pair is used for comparator offset calibration, as will be discussed in Section III-D.
To further reduce the dynamic power consumption, each DFF pair is kept deactivated until the preceding stage has come to a decision. Thus, in every bit cycle only one DFF pair is clocked for data reading. This is achieved by embedding a local clock-gating buffer within each DFF cell, which defines a bit-wise timing window based on the outputs of the previous and the present stages [ Fig. 9(c) ]. The clock-gating buffer is implemented as a dynamic NAND gate followed by a simple latch with a weak feedback inverter [40] [ Fig. 9(b) ]. To prevent undesired switching events, the output of the previous stage is delayed to guarantee that its rising edge always arrives during the reset phase of the dynamic comparator. Simulation results indicate a 33% dynamic power reduction in the SAR logic thanks to the adoption of the clock-gating scheme. 
D. Dynamic Comparator
An inherent limitation of the charge-sharing SAR conversion is the discrepancy between the charge-domain signal approximation and the voltage-domain quantization, which leads to more stringent requirements on the input-referred noise and offset of the dynamic comparator [37] . Fig. 10 shows the schematic of the dynamic comparator, the core of which is a double-rail latch-type voltage sense amplifier [41] . Its first stage is sized to ensure a sufficiently low input-referred noise. A self-timer circuit takes the comparator outputs and the DataReady signal to generate an oversampling clock that schedules the evaluation and reset of the dynamic comparator.
In contrast to a voltage-domain SAR ADC, the input offset of the comparator in a charge-sharing ADC would result in a dynamic charge offset that degrades the conversion linearity [37] , and therefore should be minimized. Such offset is dependent on the input common-mode voltage [30] during charge sharing, which, in turn, depends on the parasitic capacitance at the summing nodes and therefore varies between subarrays. To avoid the need for individual offset trimming for each subarray, the offset is self-calibrated in a way similar to [42] , which involves a charge pump and an auxiliary input pair with one gate connecting to an external calibration voltage. In contrast with [42] , the offset calibration is performed in the background while the SAR conversion is on-going. As described in Section II-C, the comparator input nodes are shorted to clear the residual charge at the end of each conversion, resulting in an input voltage that is close to the common-mode voltage during the LSB cycles. Therefore, by triggering one more comparison, the polarity of the offset can be detected, allowing the charge pump to adjust the bias voltage of the auxiliary input pair. This additional comparison is realized by adding an extra stage in the asynchronous SAR logic, as shown in Fig. 9 . By repeating this process for successive SAR conversions, the offset voltage is progressively minimized within a finite number of ADC cycles. The background calibration can be disabled by nulling the input of the extra logic stage.
Because of the common-mode charge stored on the parasitic capacitances of the CDAC and switches during precharging, as well as the relatively low reference voltage, the input common mode of the comparator slightly decreases as the SAR conversion proceeds, which leads to a bit-dependent dynamic offset. However, since the variation of the input common mode is only a small portion of its absolute value, the resulting offset variation and dynamic error charge have a negligible impact on the linearity.
E. CDR and FIFO
The differential comparator outputs from each subarray are received by CDR circuits at the ASIC periphery, which reconstruct the serial ADC output and a corresponding asynchronous clock (Fig. 11). Since the comparator outputs are in RZ format, the clock can be reconstructed from an "OR" operation of the two outputs. A DFF driven by this clock reconstructs the serial ADC data. The DFF has a constant logic high input and is reset by the negative comparator output.
The recovered clock and data are fed to a dual-clock FIFO for further synchronization. The "read" operation of the FIFO is driven by the 300-MHz global clock. To simplify the data reconstruction at the system side, once a valid data stream is received, the FIFO must never enter the "empty" or the "full" state. The "full" state is avoided by selecting a FIFO queue depth of 16, which exceeds the 10 bits of a single conversion result. To avoid the "empty" state, a five-cycle delay is applied between the start of the "write" operation and the start of the "read" operation to make sure that enough data are written to the FIFO before reading. The FIFO and the following encoders were implemented and optimized using logic synthesis tools. To enable a bit-error-rate (BER) test for the high-speed data exportation, the FIFO inputs can be switched to the output of an on-chip pseudo-random bit sequence (PRBS) generator with a sequence length of 2¹⁶ − 1.
F. DLL
Fig. 12 shows a block diagram of the DLL, which is based on [43]. It consists of a five-stage differential voltage-controlled delay line (VCDL), a phase detector, a charge pump, and an edge combiner. The delay cell in the VCDL is implemented by cascading two differential cross-coupled inverters, the first of which is loaded by RC branches consisting of an nMOS switch and a MOS capacitor. By increasing the switch control voltage V_CTRL, more capacitance is added to the inverter's load, thus increasing the delay. Once the loop is stable, the edge combiner receives the outputs of the delay cells and generates five consecutive equal-width pulses, which are fed into a 10:1 multiplexer for data serialization.
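The clock and data recovery described in the "CDR and FIFO" subsection can be illustrated with a small behavioral model. The following is only a sketch under stated assumptions, not the implemented circuit: the function names, the number of time slots per bit, and the choice to sample the DFF output on the falling edge of the recovered clock are all hypothetical.

```python
# Behavioral sketch of RZ clock/data recovery. All names and timing
# parameters are illustrative assumptions, not taken from the ASIC.

def rz_encode(bits, slots_per_bit=4, high_slots=2):
    """Model the two comparator outputs in RZ format.

    For each decision, exactly one output pulses high for `high_slots`
    time slots, after which both outputs return to zero.
    """
    pos, neg = [], []
    for b in bits:
        for s in range(slots_per_bit):
            active = s < high_slots
            pos.append(1 if active and b == 1 else 0)
            neg.append(1 if active and b == 0 else 0)
    return pos, neg


def recover(pos, neg):
    """Recover the clock as pos OR neg and the data with a DFF model.

    The DFF has a constant logic high input, is clocked by the recovered
    clock, and is asynchronously reset by the negative comparator output.
    Sampling Q on the falling clock edge is an assumption of this sketch.
    """
    clock = [p | n for p, n in zip(pos, neg)]
    data = []
    q = 0
    prev_clk = 0
    for clk, n in zip(clock, neg):
        if clk and not prev_clk:
            q = 1            # rising edge captures the constant high input
        if n:
            q = 0            # asynchronous reset dominates
        if prev_clk and not clk:
            data.append(q)   # read the settled value on the falling edge
        prev_clk = clk
    return clock, data


adc_word = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]   # one hypothetical 10-bit result
pos_out, neg_out = rz_encode(adc_word)
rec_clock, rec_data = recover(pos_out, neg_out)
print(rec_data)
```

In this model the recovered clock carries one pulse per comparator decision regardless of the bit value, which is why no separate clock line is needed between the subarray and the periphery.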
IV. EXPERIMENTAL RESULTS
The ASIC has been fabricated in a 0.18-μm 1P6M low-voltage CMOS process and has an area of 4.8 × 2.5 mm², as shown in Fig. 13(a). The floor plan of a 3 × 3 subarray receiver is shown in Fig. 13(b), while its power and area breakdown are shown in Fig. 14. The bondpads for transducer interconnection are implemented in the top (6th) metal layer, while the 5th metal layer is reserved as a grounded shield to protect the LNA inputs from digital interference. All building blocks are powered by a 1.8-V supply except for the VCDL in the DLL, which is powered by a separate 1.2-V supply. While receiving, each subarray receiver consumes 4.3 mW, corresponding to 0.46 mW/element. The beamforming ADC along with the delay programming logic occupies about half of the subarray area, while consuming only 36% of the subarray power (1.58 mW). The total power including the datalink and LVDS drivers is 130.5 mW, corresponding to 0.91 mW/element. Fig. 13(c) shows a fabricated prototype with an integrated 24 × 9 PZT matrix transducer. It is wire-bonded to a daughter PCB for both electrical and acoustic tests. The daughter PCB is mounted on a custom-designed mother PCB, where an FPGA receives and buffers the high-speed RF data before transmitting it to a PC for image reconstruction.
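The system-level power figure quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the paper (a 24 × 6 receive array, 3 × 3 subarrays, 4.3 mW per subarray, 1.58 mW for the beamforming ADC, 130.5 mW in total). Attributing the remaining power to the datalink, LVDS drivers, and other shared circuits is an assumption of this sketch, and the variable names are illustrative.

```python
# Power-budget cross-check based on the figures quoted in the text.
RX_ROWS, RX_COLS = 24, 6          # receive array size
ELEMENTS_PER_SUBARRAY = 9         # 3 x 3 subarray
P_SUBARRAY_MW = 4.3               # per subarray receiver, receive mode
P_ADC_MW = 1.58                   # beamforming ADC + delay programming logic
P_TOTAL_MW = 130.5                # including datalink and LVDS drivers

rx_elements = RX_ROWS * RX_COLS                        # 144 elements
subarrays = rx_elements // ELEMENTS_PER_SUBARRAY       # 16 subarrays

total_per_element_mw = P_TOTAL_MW / rx_elements        # system-level figure
adc_share = P_ADC_MW / P_SUBARRAY_MW                   # fraction of subarray power

# Assumption: everything not consumed inside the subarray receivers is
# shared periphery (datalink, LVDS drivers, and similar circuits).
shared_power_mw = P_TOTAL_MW - subarrays * P_SUBARRAY_MW

print(f"{rx_elements} elements in {subarrays} subarrays")
print(f"total power per element: {total_per_element_mw:.2f} mW")
print(f"ADC share of subarray power: {adc_share:.1%}")
print(f"shared periphery power: {shared_power_mw:.1f} mW")
```

Under this assumption, the shared periphery accounts for a little under half of the total receive power, which is consistent with the later remark that the high-speed datalink introduces a non-negligible overhead.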
A. Electrical Measurements
The electrical performance of the prototype ASIC has been characterized by wire-bonding test input signals to the selected RX transducer pads. The reconstructed digital outputs of each subarray receiver are converted back to a voltage signal according to (3) to facilitate the performance evaluation. Fig. 15 shows the measured subarray receiver transfer function at 12 AFE gain settings. It achieves an overall mid-band gain range of 49 dB, stepping from −7 to 42 dB with an average step size of 4.5 dB. The deviation from the ideal gain step (6 dB) is mainly caused by the insufficient open-loop gain of the PGA core amplifier in the high-gain modes. The average −3 dB bandwidth of the subarray receiver is measured as 11.9 MHz. Fig. 16 shows the measured subarray input-referred voltage noise spectrum at the highest AFE gain setting, which indicates an input-referred voltage noise density of 6.3 nV/√Hz at 5 MHz. Before applying the digital back-end calibration (Section III-A), the ripple pattern introduced by delay-line mismatches appears as in-band interference tones at f_S/8 (3.75 MHz) and its harmonics. Subtracting the pre-recorded ripple pattern (obtained from 100 iterations with grounded inputs) from the output signal significantly reduces these interference tones without deteriorating the noise floor. Fig. 17 shows the measured output spectrum of one subarray receiver at the highest AFE gain setting with a 4.95-MHz sinusoidal test input with a peak value of 3.1 mV.
It achieves a peak SNDR of 51.8 dB within an 80% bandwidth (3-7 MHz) around the center frequency (5 MHz), where the AFE dominates the noise floor. Fig. 18 shows the transient response of one subarray output with the proposed background comparator offset calibration enabled. Upon initialization, the offset settling process takes about 120 ADC cycles (∼4 μs) to converge. Without calibration, the original output offset of the tested subarray (also shown in Fig. 18 ) is about 6 LSB, which is reduced to −1 LSB after calibration.
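The ripple-pattern subtraction used in the digital back-end calibration can be sketched as follows. Since the interference tones sit at f_S/8 and its harmonics, the pattern is modeled here as repeating every eight samples. The ripple amplitudes, noise level, record length, and function names are hypothetical, and the sketch assumes that every record starts at the same ripple phase.

```python
import math
import random

PERIOD = 8  # tones at f_S/8 and harmonics imply an 8-sample periodic pattern


def estimate_ripple(records):
    """Average grounded-input records sample-by-sample, modulo PERIOD."""
    sums = [0.0] * PERIOD
    counts = [0] * PERIOD
    for record in records:
        for i, sample in enumerate(record):
            sums[i % PERIOD] += sample
            counts[i % PERIOD] += 1
    return [s / c for s, c in zip(sums, counts)]


def subtract_ripple(signal, pattern):
    """Remove the pre-recorded pattern from a phase-aligned output record."""
    return [x - pattern[i % PERIOD] for i, x in enumerate(signal)]


random.seed(0)
# Hypothetical ripple (in LSB) and unit-variance noise, for illustration only.
TRUE_RIPPLE = [3.0, -1.5, 0.5, 2.0, -2.5, 1.0, -0.5, -2.0]


def grounded_record(n=64):
    return [TRUE_RIPPLE[i % PERIOD] + random.gauss(0.0, 1.0) for i in range(n)]


# 100 iterations with grounded inputs, as in the measurement procedure.
pattern = estimate_ripple([grounded_record() for _ in range(100)])

# A 4.95-MHz tone sampled at 30 MS/s, corrupted by the same ripple.
tone = [50.0 * math.sin(2 * math.pi * 4.95e6 * i / 30e6) for i in range(64)]
measured = [t + TRUE_RIPPLE[i % PERIOD] for i, t in enumerate(tone)]
corrected = subtract_ripple(measured, pattern)

residual = max(abs(c - t) for c, t in zip(corrected, tone))
print(f"worst-case residual ripple: {residual:.3f} LSB")
```

Averaging many grounded-input records suppresses the random noise in the stored pattern, so the subtraction removes the deterministic ripple while adding little noise of its own.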
The high-speed datalink has been evaluated separately using the on-chip PRBS generator, which shows a BER better than 10⁻⁹ across 1-m coaxial cables. To better demonstrate the channel-reduction capability of the datalink, we programmed four subarrays that share the same high-speed data output channel with different, uniformly spaced delays (33, 100, 166, and 233 ns). A three-cycle sinusoidal signal is chosen as the common input to these subarrays, with a frequency of 2 MHz so as to better illustrate the relative time delay. Fig. 19 depicts the reconstructed time-domain output waveform of these four subarrays, recovered from the shared LVDS output port, which clearly shows the expected relative time delays. The worst-case inter-subarray crosstalk is measured as −57 dBc. The measured gain mismatch across 16 subarrays is less than 0.1 dB. Table I compares this work with the state-of-the-art digitization solutions for 3-D ultrasound imaging systems [16]-[18], [44]. Based on Table I, this work achieves a 10× improvement in power efficiency, as well as a 3.3× improvement in integration density. When compared with our previous analog output receiver ASIC [1], the subarray digitization only costs about 70% extra power and is realized within the same die size. On the other hand, the high-speed datalink introduces a non-negligible power overhead due to the relatively large feature size of the chosen technology. This, however, can be reduced in the future by adopting a more advanced CMOS technology, or applying on-chip digital beamforming [14] or compression techniques [15].
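The four delays programmed in the shared-channel experiment above (33, 100, 166, and 233 ns) can be related to the 30-MS/s ADC clock with a short calculation. The sketch assumes that the programmed delays are integer multiples of the ADC sample period, which is an interpretation of the quoted values rather than a statement from the measurement setup; variable names are illustrative.

```python
# Relate the programmed subarray delays to ADC sample periods.
F_S_HZ = 30e6            # ADC sample rate
F_IN_HZ = 2e6            # test input frequency
DELAYS_NS = [33, 100, 166, 233]

# Assumption: delays are quantized to the ADC sample period (about 33.3 ns).
delay_cycles = [round(d * 1e-9 * F_S_HZ) for d in DELAYS_NS]
steps = [b - a for a, b in zip(delay_cycles, delay_cycles[1:])]

# Phase shift of the 2-MHz input between neighboring subarrays.
step_time_s = steps[0] / F_S_HZ
phase_step_deg = 360.0 * F_IN_HZ * step_time_s

print("delays in sample periods:", delay_cycles)
print("step between subarrays:", steps)
print(f"phase step at 2 MHz: {phase_step_deg:.0f} degrees")
```

Under this assumption the delays map to 1, 3, 5, and 7 sample periods, so neighboring subarrays are offset by two samples, which is large enough at 2 MHz for the relative shift to be visible in a time-domain plot.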
B. Acoustic Measurements
The acoustic performance of the fabricated prototype has been characterized by mounting a waterbag on top of the PZT-on-ASIC assembly, as shown in Fig. 20(a). A three-needle phantom was immersed in water and placed about 20 mm in front of the PZT matrix. A diverging wave was transmitted from the prototype by driving six elements at the center of the TX subarray (Fig. 2) using 20-V (peak-to-peak) three-cycle pulses. In several successive TX-RX cycles, the 16 subarrays in the prototype were steered to different angles to scan the volume. Fig. 20(b) shows the recorded digital outputs of one subarray receiver with different programmed steering angles in the lateral direction. It clearly shows an increase of the echo amplitude when the subarray beamformer is steered toward the specific needle. Fig. 21 illustrates a reconstructed B-mode image in the lateral direction. It is obtained by recording and combining the digital outputs of all subarrays, and performing the post-beamforming computation in software. The positions of all three needles are clearly shown in the image with a spatial resolution in line with the relatively small RX aperture.
The image was reconstructed from 25 beams (TX/RX cycles) with a pulse-repetition frequency of 5 kHz, leading to a theoretical volume rate of 200 volumes/s. In practice, however, the imaging rate is limited by the data transfer speed between the FPGA and the PC, as well as the software post-beamforming computing time. This constraint could be resolved by migrating the image reconstruction function to the FPGA [45] , or implementing it in a digital ASIC [14] .
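The theoretical volume rate follows directly from the pulse-repetition frequency and the number of beams per volume. The sketch below reproduces that arithmetic and, as an additional sanity check, compares the round-trip time to the phantom with the pulse-repetition interval. The speed of sound in water (1480 m/s) is an assumed value that is not given in the text, and the variable names are illustrative.

```python
# Volume-rate arithmetic for the acoustic experiment.
PRF_HZ = 5000                 # pulse-repetition frequency
BEAMS_PER_VOLUME = 25         # TX/RX cycles per reconstructed image
PHANTOM_DEPTH_M = 0.020       # needles placed about 20 mm from the array
SOUND_SPEED_M_S = 1480.0      # assumed speed of sound in water

volume_rate = PRF_HZ / BEAMS_PER_VOLUME                 # volumes per second
pulse_interval_us = 1e6 / PRF_HZ                        # time between pulses
round_trip_us = 2 * PHANTOM_DEPTH_M / SOUND_SPEED_M_S * 1e6

print(f"theoretical volume rate: {volume_rate:.0f} volumes/s")
print(f"pulse-repetition interval: {pulse_interval_us:.0f} us")
print(f"round trip to phantom: {round_trip_us:.1f} us")
```

With these numbers the echo from the phantom returns well within one pulse-repetition interval, so the acoustic round trip is not what limits the frame rate in this setup.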
V. CONCLUSION
We have presented a front-end ASIC that enables power- and area-efficient in-probe digitization for the next-generation miniature 3-D ultrasound probes. It employs beamforming ADCs and high-speed datalinks to realize an additional 4-fold channel-count reduction compared to prior analog subarray beamformer designs. The ADC directly digitizes the subarray beamformer output in the charge domain to eliminate the need for intermediate power-hungry buffers. Self-calibrated charge references are proposed to further optimize the power efficiency as well as to facilitate the system-level design. The ASIC provides an overall 36-fold channel-count reduction and consumes less than 1 mW/element in receive, thus paving the way toward sub-Watt digital probes with 1000+ elements. A prototype with integrated transducers has been successfully applied in a 3-D imaging experiment.
