Abstract-This paper presents a power- and area-efficient front-end application-specific integrated circuit (ASIC) that is directly integrated with an array of 32 × 32 piezoelectric transducer elements to enable next-generation miniature ultrasound probes for real-time 3-D transesophageal echocardiography. The 6.1 × 6.1 mm² ASIC, implemented in a low-voltage 0.18-µm CMOS process, reduces the number of receive (RX) cables required in the probe's narrow shaft ninefold with the aid of 96 delay-and-sum beamformers, each of which locally combines the signals received by a sub-array of 3 × 3 elements. These beamformers are based on pipeline-operated analog sample-and-hold stages and employ a mismatch-scrambling technique to prevent the ripple signal associated with the mismatch between these stages from limiting the dynamic range. In addition, an ultralow-power low-noise amplifier architecture is proposed to increase the power efficiency of the RX circuitry. The ASIC has a compact element-matched layout and consumes only 0.27 mW/channel while receiving, which is lower than that of state-of-the-art circuits. Its functionality has been successfully demonstrated in 3-D imaging experiments.
I. INTRODUCTION
VOLUMETRIC visualization of the human heart is essential for the accurate diagnosis of cardiovascular diseases and the guidance of interventional cardiac procedures. Echocardiography, which images the heart using ultrasound, has become an indispensable modality in cardiology because it is safe, relatively inexpensive, and capable of providing real-time images. Transesophageal echocardiography (TEE), as its name indicates, generates ultrasonic images from the esophagus by utilizing an ultrasound transducer array mounted at the tip of a gastroscopic tube (Fig. 1). Conventionally, the elements of the transducer array are connected using microcoaxial cables to an external imaging system, where properly timed high-voltage pulses are generated to transmit an acoustic pulse, and the resulting echoes are recorded and processed to form an image.
Two-dimensional (2-D) TEE probes are widely used in clinical practice. They employ a 1-D phased array transducer to obtain cross-sectional images of the heart. However, such 2-D images often fall short in providing comprehensive visual information for complex cardiac interventions, such as minimally invasive valve replacements and septal defect closures. Appropriate real-time 3-D imaging would be very beneficial for improving the success rate of such procedures [1].
The relatively large probe heads (typically >10 cm³) of current 3-D TEE probes cannot be tolerated by the patient during longer procedures (unless general anesthesia is applied) and are too large for pediatric use. For longer term monitoring and pediatric use, the volume of the probe tip should be constrained to an upper limit of 1 cm³ and the tube diameter to 5-7 mm [2]. To enable real-time 3-D imaging, a 2-D phased array is required. For an array of aperture size D × D, the achievable signal-to-noise ratio (SNR) and the lateral resolution both improve linearly with D. Therefore, it is desirable to make full use of the available array aperture within the probe tip (5 × 5 mm²). In addition, the pitch of the transducer elements should not exceed half of the acoustic wavelength (λ) to minimize grating lobes and to ensure proper spatial imaging resolution [3]. For a 2-D array with a center frequency of 5 MHz, this corresponds to a pitch of 150 μm, leading to at least 32 × 32 elements. Accommodating the corresponding number of microcoaxial cables within the narrow gastroscopic tube is difficult or even impossible. Decreasing the aperture size to reduce the number of channels will lead to a significant deterioration in both the SNR and the lateral resolution. As a result, channel reduction should be performed locally to reduce the number of cables with the aid of miniaturized in-probe electronics [4].
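The pitch and element-count figures above follow from simple arithmetic, which can be sketched as below. This is only an illustrative check: the speed of sound of 1540 m/s (a common soft-tissue value) and the function names are assumptions, not values taken from the paper.

```python
def half_wavelength_pitch_um(f_center_hz, c_m_per_s=1540.0):
    """Upper bound on the element pitch (half the acoustic wavelength), in um.

    c_m_per_s is an assumed soft-tissue speed of sound.
    """
    wavelength_m = c_m_per_s / f_center_hz
    return 0.5 * wavelength_m * 1e6


def elements_per_side(aperture_mm, pitch_um):
    """Number of whole elements that fit along one side of the aperture."""
    return int((aperture_mm * 1000.0) // pitch_um)


pitch_limit_um = half_wavelength_pitch_um(5e6)   # roughly 154 um at 5 MHz
n_side = elements_per_side(5.0, 150.0)           # elements along a 5-mm side
print(f"pitch limit: {pitch_limit_um:.0f} um, elements per side: {n_side}")
```

Under these assumptions the half-wavelength limit lands slightly above the 150-μm pitch, and a 5-mm side accommodates at least 32 elements, in line with the text.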
A variety of approaches have been proposed to reduce the cable count in endoscopic and catheter-based ultrasound systems. Part of the beamforming function, which is conventionally performed in the external imaging system to achieve spatial directivity and enhance the SNR, can be moved into the probe [5], [6]. Time-division multiplexing approaches have been applied in [7] and [8] to allow multiple elements to share a single cable. Solutions based on element switching schemes [9], [10] have also been reported. All these approaches rely on the realization of a front-end application-specific integrated circuit (ASIC) that is closely integrated with the transducer array.
Design of such front-end ASICs is challenging in several aspects. First, the power consumption of the ASIC, which contributes to the overall self-heating of the probe, should be kept below an estimated 0.5 W [11], to avoid excessive tissue temperature rise [12]. This translates to 0.5 mW/element for a 1000-element array and is beyond the state-of-the-art of front-end ultrasound ASICs, which consume at least 1.4 mW/element [10], [13], [14]. Another challenge comes from the dense interconnection between the ASIC and the transducer array. Direct transducer-on-chip integration is desired, as it not only helps to get a small form factor but also reduces the parasitic interconnect capacitance added to each transducer element. This calls for an element-matched ASIC layout, with a pitch identical to that of the transducer elements, and hence for a highly compact circuit implementation. Prior works [13], [15] compromised somewhat on the imaging quality by opting for a slightly larger pitch. Indirect transducer-to-chip integration via interposer PCBs [6], [10] allows the use of a different pitch for the transducer array and the ASIC. However, the limited space within the TEE probe tip precludes this option.
In this paper, we present a front-end ASIC that is optimized in both system architecture and circuit-level implementation to meet the stringent requirements of 3-D TEE probes [16]. It is directly integrated with an array of 32 × 32 piezoelectric transducer elements, which are split into a transmit (TX) and a receive (RX) array to facilitate the power and area optimization of the ASIC [17]. The RX elements are further divided into 96 sub-arrays, each with a switched-capacitor-based beamformer, to realize a ninefold cable reduction.
In addition, an ultralow-power low-noise amplifier (LNA) architecture [18], which incorporates an inverter-based operational transconductance amplifier (OTA) with a bias scheme tailored for ultrasound imaging, is proposed to increase the power efficiency of the RX circuitry while keeping the area compact. Furthermore, a mismatch-scrambling technique is applied to mitigate the effects of mismatch between the beamformer stages and thus improve the overall dynamic range of the ASIC while receiving. These circuit techniques, while designed for lead zirconate titanate (PZT) matrix transducers, are also relevant for other types of ultrasound transducers, such as capacitive micromachined ultrasonic transducers (CMUTs). The functionality of the ASIC as well as the effectiveness of the proposed techniques has been successfully demonstrated by imaging experiments.
This paper is organized as follows. Section II describes the proposed system architecture. Section III discusses the details of the circuit implementation. The experimental results are presented in Section IV. Conclusions are given in Section V.
II. SYSTEM ARCHITECTURE

A. Transducer Matrix Configuration
In conventional ultrasound probes, each transducer element is used as both transmitter and receiver. A high-voltage CMOS process is then needed to generate the TX pulses of typically tens of volts [14] . The integration density of high-voltage processes is generally lower than that of their low-voltage counterparts with the same feature size, which is disadvantageous for ASICs that directly interconnect with 2-D transducer arrays with a tiny element pitch.
In this paper, we use an array of 32 × 32 PZT elements with separate TX and RX elements (Fig. 2). An 8 × 8 central sub-array is directly wired out to TX channels in the external imaging system using metal traces in the ASIC that run underneath 96 unconnected elements to bond pads on the chip's periphery. These traces are not connected to any junctions in the substrate and can hence support high TX voltages provided that they are sufficiently spaced to prevent dielectric breakdown and routed in the top metal layers to minimize capacitive coupling to the substrate. All other 864 elements are connected directly to on-chip receiver circuits, whose outputs are fed to the imaging system's RX channels.
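The element bookkeeping of this configuration can be verified with a short sketch. The function and key names are illustrative; the 3 × 3 sub-array size is taken from the abstract and Section II-B.

```python
def partition_elements(n_side=32, tx_side=8, n_unconnected=96, subarray_side=3):
    """Split an n_side x n_side matrix into TX, unconnected and RX elements."""
    total = n_side * n_side
    n_tx = tx_side * tx_side
    n_rx = total - n_tx - n_unconnected
    per_subarray = subarray_side * subarray_side
    if n_rx % per_subarray != 0:
        raise ValueError("RX elements do not divide evenly into sub-arrays")
    return {
        "total": total,
        "tx": n_tx,
        "unconnected": n_unconnected,
        "rx": n_rx,
        "rx_subarrays": n_rx // per_subarray,
    }


counts = partition_elements()
print(counts)
```

With the default values this reproduces the 864 RX elements quoted above and the 96 sub-array receivers, i.e., one RX cable per nine elements.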
The use of a small central TX array helps in reducing the overall cable count as well as obtaining a large opening angle while receiving. With respect to a conventional array configuration in which each transducer element is used for both TX and RX purposes, our scheme trades lateral resolution for a higher frame rate. In our scanning procedure, the transmitter is used to generate only a few wide beams, illuminating an area that can accommodate a number of parallel RX beams per TX pulse, thus yielding a high frame rate. On the other hand, it should be also ensured that the generated acoustic pressure is adequate for the target imaging depth. According to our numerical simulations in PZFlex (Weidlinger Associates Inc., Mountain View, CA, USA), 64 elements should be capable of generating sufficient pressure for an imaging depth up to 10 cm. Moreover, despite the missing elements in the receiver aperture, the point spread function is comparable to a fully populated receiver, as shown by simulations in [19] . This configuration allows the use of a dense low-voltage CMOS technology, thus saving power and circuit area. Compared with [13] , which uses the majority of elements to transmit and a sparse array to receive, it achieves better receiving sensitivity as well as lower side lobes. Moreover, it also helps to reduce the overall in-probe heat dissipation, as TX circuits normally consume more power [10] .
The transducer array was constructed by dicing a bulk piezoelectric material (CTS 3203 HD) into a matrix. It is directly mounted on top of the front-end ASIC using the PZT-on-CMOS integration scheme described in [11]. The PZT
B. Sub-array Beamforming in RX
The cable count reduction approach that we adopted in this work is to perform partial RX beamforming in the ASIC. The basic principle of ultrasound beamforming is to apply appropriate relative delays to the received signals in such a way that ultrasound waves coming from the focal point arrive simultaneously and can be constructively combined. Full-array beamforming for 32 × 32 transducer elements is impractical for circuit implementation due to the large delay depth required for each element, which is typically a few microseconds. The sub-array beamforming scheme [5] , also known as "micro-beamforming" [17] , mitigates this issue by dividing the beamforming task into two steps. A coarse delay that is common for all elements within one sub-array is applied in the external imaging system, while only fine delays for the individual elements (less than 1 μs) are applied by the sub-array beamformers in the ASIC, which significantly reduces the implementation complexity of the required on-chip delay lines.
The sub-array size is determined based on the following concerns. First, in order to keep the symmetry of the beamforming in lateral and elevation directions, a square sub-array is desired. Second, a larger sub-array brings a more aggressive cable count reduction, but comes at the cost of an elevated grating-lobe level and a greater maximum fine delay in the sub-array beamformers. We selected a 3 × 3 configuration to achieve a reasonable acoustic imaging quality, while reducing the number of cables by a factor of 9 [20]. Accordingly, the 864 RX elements of the transducer matrix are divided into 96 sub-arrays and interfaced with 96 sub-array receiver circuits in the ASIC. The fine delays are programmable in steps of 30 ns up to 210 ns, allowing the sub-array's directivity to be steered over angles of 0°, ±17°, and ±37° in both azimuthal and elevation directions [11]. All sub-arrays can be programmed identically, which is appropriate for far-field beamforming and requires loading of only nine delay settings into the ASIC, which has a negligible impact on the frame rate. The ASIC is also equipped with a mode in which all sub-arrays can be programmed individually (i.e., 96 × 9 settings), allowing near-field focusing at the expense of a longer programming time and hence a slightly slower frame rate.
Fig. 3 shows the schematic of a 3 × 3 sub-array receiver. It consists of nine LNAs, nine buffers, nine analog delay lines, a programmable gain amplifier (PGA), and a cable driver. A pair of protection diodes is implemented at the input of the LNA to prevent the input from exceeding the supply voltages by more than a diode drop. The LNA output is AC-coupled to a flipped source follower buffer that drives the analog delay line. The joint output of all the nine analog delay lines is then amplified by the PGA. A cable driver buffers the output signal of the PGA to drive the microcoaxial cable connecting to the imaging system. A local bias circuit (not shown) is implemented within each sub-array.
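The relation between the 30-ns delay step and the steering angles quoted in this subsection can be sketched with the usual plane-wave relation sin θ = c·Δt/pitch. Two assumptions are made here and are not stated in the paper: that the quoted angles correspond to inter-element delay increments of one and two delay steps, and that the speed of sound is about 1480 m/s (water).

```python
import math


def steering_angle_deg(delay_per_element_s, pitch_m=150e-6, c_m_per_s=1480.0):
    """Steering angle for a linear delay gradient across adjacent elements.

    c_m_per_s is an assumed speed of sound (water); pitch_m is the element pitch.
    """
    s = c_m_per_s * delay_per_element_s / pitch_m
    if abs(s) > 1.0:
        raise ValueError("delay gradient too large for this pitch")
    return math.degrees(math.asin(s))


angle_one_step = steering_angle_deg(30e-9)   # one 30-ns step per element
angle_two_steps = steering_angle_deg(60e-9)  # two 30-ns steps per element
print(f"{angle_one_step:.1f} deg, {angle_two_steps:.1f} deg")
```

Under these assumptions the two gradients give angles close to the 17° and 37° settings mentioned in the text; a different assumed speed of sound shifts them by about a degree.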
III. CIRCUIT IMPLEMENTATION
The echo signals received by the transducer elements have a dynamic range of about 80 dB, 40 dB of which is associated with the fact that echoes from deeper tissue are attenuated more along their propagation paths. The gains of the LNA and the PGA are programmable to compensate for this attenuation. The LNA is optimized for a low noise figure (<3 dB) and provides a voltage gain of up to 24 dB to suppress the impact of the noise of the subsequent stages at small signal levels. The gain can be reduced to 6 or −12 dB to avoid output saturation at high signal levels. The PGA provides an additional switchable gain with finer steps (0, 6, and 12 dB) to interpolate between the gain steps of the LNA. Thus, an overall dynamic range of more than 80 dB, which is sufficient for TEE imaging, can be achieved.
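How the PGA settings interpolate between the LNA gain steps can be illustrated by enumerating all combinations. The gain values are those given in the text; the names are illustrative.

```python
from itertools import product

LNA_GAINS_DB = (-12, 6, 24)  # LNA gain settings from the text
PGA_GAINS_DB = (0, 6, 12)    # PGA gain settings from the text


def overall_gains_db():
    """All distinct nominal LNA + PGA gain combinations, sorted, in dB."""
    return sorted({lna + pga for lna, pga in product(LNA_GAINS_DB, PGA_GAINS_DB)})


gains = overall_gains_db()
print(gains)
```

The nine combinations form a uniform ladder with 6-dB steps, so the 18-dB LNA steps are filled in without gaps or overlaps.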
As described in Section I, all the above circuits, along with their biasing and digital control circuits, must be implemented within the area of a 3 × 3 sub-array, i.e., 450 μm × 450 μm, while consuming less than 4.5 mW. Dedicated circuit techniques have been applied to meet these requirements, which will be discussed in this section. 
A. LNA
The choice of the ultrasound LNA topology is dictated by the electrical impedance of the target transducer. Transimpedance amplifiers (TIAs) are widely used in readout ICs for CMUTs because of the relatively high impedance of these transducers [21]. However, a similarly sized PZT transducer has a much lower impedance around the resonance frequency, typically a couple of kiloohms for our transducers (Fig. 4). In view of this, the TIA topology falls short in achieving an optimal noise/power tradeoff, since creating a low enough input impedance requires extra power spent on increasing the open-loop gain, rather than on suppressing the input-referred noise [18]. In this paper, instead, we use a capacitive feedback voltage amplifier, shown in Fig. 5, which offers a midband voltage gain of $A_M = C_I/C_F$. Its input impedance is dictated by the input capacitor $C_I$ and can be easily sized to tens of kiloohms within the transducer bandwidth, so as to sense the transducer's voltage rather than its current.
A current-reuse OTA based on a CMOS inverter is employed to enhance the power efficiency of the LNA. In previous inverter-based designs [22], extra level-shifting capacitors ($C_{LS}$) are used to independently bias the nMOS and pMOS transistors, as shown in Fig. 6(a). These level-shifting capacitors and the associated parasitic capacitors at the virtual ground node form a capacitive divider, which attenuates the input signal and thus increases the input-referred noise of the LNA. Enlarging $C_{LS}$ helps in reducing this noise penalty, at the cost of increased die area. In this paper, the level-shifting capacitors are eliminated by applying a split-capacitor feedback network [18], [23]. As shown in Fig. 6(b), the input bias points for the nMOS and pMOS transistors are decoupled by splitting the input and feedback capacitors into two equal pairs, which maintains the same midband gain $C_I/C_F$ and the same input impedance.
To maximize the output swing, the bias voltage of the inverter-based OTA should be properly defined. This is usually achieved with the aid of a DC control loop, in which a slow auxiliary amplifier keeps the output at the desired operating point [22]. However, such a DC control loop will recover too slowly from disturbances caused by the high-voltage pulses propagating across the ASIC during the TX phase. Therefore, we instead dynamically activate the bias control loop in synchronization with the TX/RX cycles of the ultrasound system, as shown in Fig. 7. During the TX phase, the input of the LNA is grounded and the inverter is essentially auto-zeroed, while the auxiliary amplifier drives the gate of the nMOS transistor so as to bias the output at mid-supply. During the RX phase, the auxiliary amplifier is disconnected and both its inputs are shorted to the mid-supply. Meanwhile, the LNA starts receiving the echo signal by operating at the "memorized" bias points. Given that the typical TX/RX cycle in cardiac imaging is relatively short, ranging from 100 to 200 μs, the bias voltage hardly drifts during the RX phase. The relatively large sizes of the input transistors ($(W/L)_N = 75/0.2$ and $(W/L)_P = 60/0.2$), needed for flicker noise reduction, also help to keep the bias voltages stable. The sample-and-hold (S/H) operation associated with the auto-zeroing causes broadband white noise to be sampled on the gate of the nMOS transistor and held constant during the RX phase. Therefore, it appears as a small offset voltage that is superimposed on the "memorized" bias point during each TX/RX cycle, and does not deteriorate the in-band noise performance of the LNA. Moreover, it is further filtered out by the AC-coupler following the LNA and has no impact on the bias condition of succeeding stages.
A well-known downside of a single-ended inverter-based OTA is its poor power supply rejection ratio (PSRR) [24]. As the LNAs are closely integrated with high-frequency digital circuits for beamformer control, the supply line and the ground are inevitably noisy. To improve the PSRR, we generate two internal power rails within each sub-array by means of two regulators (see $REG_P$ and $REG_N$ in Fig. 8) that are shared by the nine LNAs of a sub-array. Given the fact that the loading currents of these regulators are known and approximately constant, their implementation can be kept rather simple to save area. A capacitorless low-dropout regulator based on a super source follower [25], capable of providing a PSRR better than 40 dB at 5 MHz, is adopted as the topology for both regulators.
Fig. 8 shows the complete schematic of the proposed LNA. The inverter-based OTA is cascoded to ensure an accurate closed-loop gain, and input transistors $M_1$ and $M_4$ are biased in weak inversion to optimize their current efficiency. The bias voltage of $M_1$, $V_{refP}$, which is derived from a diode-connected pMOS transistor via a high-impedance pseudoresistor, is shared by the input gate of the positive-rail regulator $REG_P$. Thus, the bias current of the OTA can be defined by the difference in the reference currents ($I_{p1} − I_{p2}$) and the dimension ratio of $M_1$ and $M_{p1}$. In each channel, a unity-gain-connected inverter, implemented with long-channel transistors and consuming only 0.4 μA, is connected between the two regulated power rails to generate a mid-supply reference that is approximately 900 mV. The auxiliary amplifier for DC bias control is realized as a simple differential pair. With a current consumption of less than 1 μA, it is capable of settling within the 10-μs TX phase.
A switchable capacitive feedback network, involving capacitors 14C and 7C that can be switched in or out under the control of digital gain-control inputs of the ASIC, is implemented to provide the three gain levels mentioned above for dynamic range enhancement. An explicit loading capacitor (not shown in Fig. 8) is added at the output of the LNA to limit its −3 dB bandwidth to below 10 MHz.
B. Sub-array Beamformer
Fig. 9 shows the circuit implementation and timing diagram of the sub-array beamformer. It consists of nine programmable analog delay lines, each of which is built from pipeline-operated S/H memory cells that run at a sampling rate of 33 MHz, corresponding to the target delay resolution of 30 ns. Because the sampling rate is more than twice the designed bandwidth of the LNA, the increase in the noise floor caused by aliasing is negligible.
The capacitor in each memory cell is carefully sized to ensure that the associated kT/C noise is not dominant, while meeting the area requirement. With 300-fF metal-insulator-metal (MIM) capacitors, an input-referred rms noise voltage of about 118 μV is expected for each delay line, which is smaller than the output noise of the LNA at its highest gain setting.
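The order of magnitude of this figure can be checked with the standard sampled-noise expression v_rms = sqrt(kT/C). The sketch below assumes a temperature of 300 K and counts a single sampling event; the paper does not state how its value was derived, so the match is indicative only.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def ktc_noise_vrms(c_farad, temp_k=300.0):
    """RMS voltage of kT/C noise sampled on a capacitor (temp_k is assumed)."""
    return math.sqrt(K_BOLTZMANN * temp_k / c_farad)


v_noise = ktc_noise_vrms(300e-15)
print(f"kT/C noise on 300 fF: {v_noise * 1e6:.1f} uV rms")
```

For a 300-fF capacitor this evaluates to a value just below 118 μV, consistent with the number quoted above.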
The outputs of all the nine delay lines are passively joined together to sum up and average the charge sampled on the capacitors that are connected to the output node [11]. Compared with voltage-mode summation [26], [27], this scheme eliminates the need for a summing amplifier, and is thus more compact and power efficient. However, a potential source of errors is the residual charge stored on the parasitic capacitance at the output node, which causes a fraction of the output of the previous clock cycle to be added to the output signal. This is equivalent to an undesired first-order infinite-impulse-response low-pass filter. While this filtering can be eliminated by periodically removing the charge from the output node using a reset switch [11], here we opt for the simpler solution of minimizing the parasitic capacitance at the output node. It can be shown that an acceptable signal attenuation of less than 3 dB within the bandwidth of 0-10 MHz is obtained if this parasitic is less than 20% of the total capacitance at the output node, which can be easily achieved with a careful layout.
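The 20% bound can be made plausible with a simple charge-sharing model. The sketch below assumes that, in every clock cycle, the freshly sampled charge is shared with the residual charge on the parasitic capacitance, giving y[n] = (1 − a)·x[n] + a·y[n − 1], where a is the parasitic fraction of the total output capacitance. This difference equation is an assumed model, not one given in the paper.

```python
import cmath
import math


def charge_sharing_gain_db(f_hz, parasitic_fraction, fs_hz=33e6):
    """Gain of the assumed filter H(z) = (1 - a) / (1 - a z^-1) at f_hz, in dB."""
    a = parasitic_fraction
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    h = (1.0 - a) / (1.0 - a * z_inv)
    return 20.0 * math.log10(abs(h))


gain_dc = charge_sharing_gain_db(0.0, 0.2)
gain_10mhz = charge_sharing_gain_db(10e6, 0.2)
print(f"DC: {gain_dc:.2f} dB, 10 MHz: {gain_10mhz:.2f} dB")
```

In this model, a parasitic fraction of 20% gives an attenuation at 10 MHz that stays within 3 dB of the DC gain, and a smaller fraction reduces it further.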
The control logic for programming the delay lines is also integrated within each sub-array. Its core is a delay stage index rotator that determines the sequence in which the memory cells are used, as conceptually shown in Fig. 10 . The detailed circuit implementation is shown in Fig. 11 code, provided by a built-in serial peripheral interface (SPI), decides which of these candidates is used, allowing the delay of the individual delay line to be programmed. One-hot codes derived from the selected 4-b binary indices are re-timed by nonoverlapped clocks to control the sample/readout switches in the memory cells.
As mentioned in Section II, the SPIs in all sub-arrays can be either loaded in parallel or configured as a daisy chain to load different delay patterns to individual sub-arrays. With a 50 MHz SPI clock, only 0.54 μs is needed to program the ASIC's delay pattern in the parallel mode, while for the daisy chain mode, it takes about 13 μs (sub-arrays in each quadrant of the ASIC form one daisy chain), leading to a 9% frame rate reduction for an imaging depth of 10 cm. As such, the daisy chain mode enables near-field focusing at the expense of a slightly slower frame rate.
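The programming times above can be reproduced under two assumptions that the text does not state explicitly: each delay setting is encoded in 3 bits (eight delay values from 0 to 210 ns), and the delay pattern is reloaded once per TX/RX event. The speed of sound of 1540 m/s is likewise an assumed value.

```python
BITS_PER_DELAY = 3          # assumption: 8 delay values need 3 bits
ELEMENTS_PER_SUBARRAY = 9
SUBARRAYS_PER_CHAIN = 24    # 96 sub-arrays split over four quadrants


def spi_load_time_us(n_bits, f_clk_hz=50e6):
    """Time to shift n_bits through an SPI clocked at f_clk_hz, in us."""
    return n_bits / f_clk_hz * 1e6


def round_trip_time_us(depth_m, c_m_per_s=1540.0):
    """Acoustic round-trip time for a given imaging depth, in us."""
    return 2.0 * depth_m / c_m_per_s * 1e6


bits_per_subarray = BITS_PER_DELAY * ELEMENTS_PER_SUBARRAY
t_parallel = spi_load_time_us(bits_per_subarray)
t_daisy = spi_load_time_us(bits_per_subarray * SUBARRAYS_PER_CHAIN)
t_echo = round_trip_time_us(0.10)
frame_rate_loss = t_daisy / (t_echo + t_daisy)
print(f"parallel {t_parallel:.2f} us, daisy chain {t_daisy:.2f} us, "
      f"frame-rate loss {frame_rate_loss:.1%}")
```

With these assumptions the sketch yields 0.54 μs, about 13 μs, and a frame-rate penalty of roughly 9%, matching the quoted figures.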
C. Mismatch-Scrambling
The S/H memory cells suffer from charge injection and clock feedthrough errors, the mismatch of which introduces a ripple pattern with a period of eight delay steps (240 ns) at the output of the delay lines. Such a ripple pattern manifests itself as undesired in-band tones in the output spectrum of the beamformer, which limits the dynamic range of the signal chain.
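The periodic cell usage behind this ripple, and the way the scrambled index rotation described in this subsection breaks it, can be sketched in a few lines. This is a behavioral illustration only: the register list models $D_1$ to $D_9$, the bit source is a 12-bit Galois LFSR with toggle mask 0xE08 (an assumed polynomial, since the paper does not give one), and all function names are hypothetical.

```python
LFSR_MASK = 0xE08  # assumed 12-bit Galois toggle mask; not from the paper


def lfsr_step(state, mask=LFSR_MASK):
    """Advance a 12-bit Galois LFSR by one step; return (new_state, output_bit)."""
    bit = state & 1
    state >>= 1
    if bit:
        state ^= mask
    return state, bit


def rotate_indices(d, swap):
    """One clock of the index rotator; d lists the cell indices in D1..D9.

    Without swap, D8 feeds D1 and D9 keeps its index (plain 8-cell rotation).
    With swap, D9 feeds D1 and the index of D8 is parked in D9.
    """
    d8, d9 = d[7], d[8]
    incoming, parked = (d9, d8) if swap else (d8, d9)
    return [incoming] + d[:7] + [parked]


def cell_sequence(n_steps, scramble, seed=1):
    """Index of the memory cell held in D1 at each clock cycle."""
    if not 0 < seed < 4096:
        raise ValueError("seed must be a nonzero 12-bit value")
    d, state, seq = list(range(9)), seed, []
    for _ in range(n_steps):
        seq.append(d[0])
        state, bit = lfsr_step(state)
        d = rotate_indices(d, scramble and bit == 1)
    return seq


plain = cell_sequence(64, scramble=False)
scrambled = cell_sequence(4000, scramble=True)
print(plain[:9])
print(scrambled[:9])
```

Without scrambling, the same eight cells repeat every eight clock cycles, so their mismatch errors form a periodic pattern. With scrambling, the ninth cell is swapped in at pseudorandom moments and the strict periodicity is lost, while any eight consecutive cycles still use eight different cells.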
To mitigate this interference, we propose a mismatch-scrambling technique that adds an extra memory cell and a redundant index register $D_9$, as shown in Figs. 10 and 11. A pseudorandom number generator (PRNG) embedded in each sub-array generates a pseudorandom bit sequence that decides whether the index of $D_8$ or $D_9$ shifts into $D_1$, while the other index shifts into $D_9$. Thus, memory cells are randomly taken out and inserted back into the sequence. This operation randomizes the ripple pattern and converts the interfering tones into broadband noise. The mismatch-scrambling function can be switched on or off with a control bit (MS_EN in Fig. 11).
The PRNG in each sub-array is implemented as a 12-b Galois linear feedback shift register [28]. It can be reconfigured as a shift register to allow the sequential loading of its initial state, i.e., the seed. Similar to the daisy chain mode of the delay pattern SPI, these shift registers can also be cascaded to allow different seeds to be loaded into the individual sub-arrays. Applying a set of randomized seeds for all sub-arrays is expected to further decorrelate the sequences of memory cell rotation on the scale of the full array. As a result, the excess noise generated by the scrambling process can be suppressed when the output signals of the sub-arrays are combined by the beamforming operation in the imaging system, thus improving the SNR.
D. PGA
Fig. 12 shows the schematic of the PGA, which is implemented as a current-feedback instrumentation amplifier [17], [29] with a single-ended output. It consists of a differential pair of super source followers with a tunable source-degeneration resistor $R_S$, which acts as a linearized transconductor, and a current mirror with a constant load resistor $R_L$, which converts the transconductor's output current into a voltage. The voltage gain of the PGA is defined by the resistor ratio $R_L/R_S$. $R_S$ is implemented as a switchable resistor array that ranges from 6 to 18 kΩ, while $R_L$ is constant (24 kΩ). To avoid the very large CMOS switches that a small on-resistance would require, Kelvin connections are used to eliminate errors caused by the on-resistance of those switches (Fig. 12). Compensation capacitors ($C_C$) are added to ensure the loop stability. These capacitors are switched along with the gain settings from 800 fF at the lowest gain setting to 400 fF at the highest gain setting. A differential topology is applied to improve the PGA's immunity to interference. The negative input terminal ($V_{in-}$) is connected to the output of a replica delay line buffer, whose input node is AC-coupled to ground while sharing the same DC bias voltage with the other buffers.
Fig. 13. Schematic of the cable driver.
The PGA follows the sub-array beamformer. Therefore, when comparing its noise contribution with that of the preceding stages, the noise-averaging effect [10] of the beamformer should be taken into account. The PGA is designed to have an input-referred noise density below 30 nV/√Hz to prevent adding excess noise when referred to the input of the LNA.
E. Cable Driver
The cable driver is required to fan out the output signal of each sub-array across a microcoaxial cable with a capacitance of up to 300 pF. To maximize its power efficiency, a class-AB super source follower [30], as depicted in Fig. 13, is adopted as the topology for the cable driver. Instead of using a high-impedance pseudoresistor to form a quasi-floating gate, the gate of the pMOS transistor is only connected to the bias circuit during the TX phase, but kept floating during the RX phase, similar to the dynamic DC bias scheme used in the LNA. When referred back to the input of the signal chain, the noise contribution of the cable driver is negligible as it is compressed by the gain of the PGA.
IV. EXPERIMENTAL RESULTS
The ASIC has been realized in a 0.18-μm low-voltage CMOS process with a total area of 6.1 × 6.1 mm², as shown in Fig. 14(a). Fig. 14(b) presents a zoomed-in view of one sub-array receiver that is matched to a 3 × 3 group of transducer elements with a pitch of 150 μm. While receiving, the ASIC consumes only 230 mW, which is less than half of the power budget for a 3-D TEE probe. Fig. 15(a) shows a fabricated prototype with an integrated 32 × 32 PZT matrix transducer. The assembly has been bonded to a daughter PCB to facilitate acoustic measurements [Fig. 15(b)]. A matching layer and a ground foil are applied on top of the PZT matrix. The ground foil is directly connected to the ground potential of the ASIC via PCB traces. Bonding wires on the periphery of the ASIC are covered by a nonconductive epoxy layer for waterproofing.
The ASIC's 96-channel sub-array outputs and 64-channel high-voltage TX inputs are connected to a mother PCB via microcoaxial cables with a length of 1.5 m. The mother PCB is directly mounted on a programmable imaging system (Verasonics V-1 system, Verasonics Inc., Redmond, WA, USA), which acquires the RF data from the ASIC and drives high-voltage pulses via metal traces in the ASIC to TX elements in the transducer array. Including the required power supply and digital control lines, the total number of cables required for connecting the ASIC to the imaging system is around 190.
Using this setup, the ASIC's electrical and acoustic performances have been characterized experimentally, the results of which are presented in this section.
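The per-channel receive power and the budget fraction implied by the figures at the start of this section can be cross-checked as follows. The sketch assumes that the 230 mW is shared evenly over the 864 RX channels and uses the 0.5-W budget from Section I; the function names are illustrative.

```python
def rx_power_per_channel_mw(total_rx_power_mw=230.0, n_rx_channels=864):
    """Average receive power per RX channel, in mW."""
    return total_rx_power_mw / n_rx_channels


def budget_fraction(total_rx_power_mw=230.0, budget_mw=500.0):
    """Fraction of the in-probe power budget used while receiving."""
    return total_rx_power_mw / budget_mw


p_channel = rx_power_per_channel_mw()
fraction = budget_fraction()
print(f"{p_channel:.3f} mW/channel, {fraction:.0%} of the budget")
```

This gives about 0.27 mW per channel, in agreement with the abstract, and a little under half of the 0.5-W budget.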
A. Electrical Characterization
The electrical performance of the proposed LNA architecture has been fully characterized and evaluated with a separate test IC [18]. It demonstrates a 9.8-MHz bandwidth, an 81-dB dynamic range, and an input-referred noise density of 5.5 nV/√Hz at 5 MHz at its highest gain, while consuming only 0.135 mW per channel. When interfaced with an external small PZT array, which gives an RX sensitivity of about 10 μV/Pa, the LNA achieves a noise efficiency factor (NEF) [31] that is 2.5× better than the prior state-of-the-art.
Fig. 16 shows the measured transfer function of a 3 × 3 sub-array receiver in the ASIC, with a uniform delay pattern applied to the sub-array beamformer. Various combinations of LNA/PGA gain settings were applied to achieve a programmable midband gain ranging from −12 to 36 dB with a gain step of 6 dB. The measured absolute values of the midband gain levels are approximately 6 dB lower than the theoretical values of the LNA/PGA gain combinations, which can mainly be attributed to signal attenuation in the delay line buffers and cable drivers and to the attenuation associated with the parasitic capacitance at the beamformer's summing node. This deviation does not deteriorate the imaging quality, as long as an adequate SNR can be maintained at the sub-array output by an appropriate selection of gain settings. The −3 dB bandwidth is about 6 MHz, ranging from 0.3 to 6.3 MHz. Note that the sinc-filtering effect of the S/H operation in the beamformer also contributes to the gain roll-off at higher frequencies, which introduces a 4-dB extra attenuation at 16.5 MHz (half the sampling frequency).
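The attenuation at half the sampling frequency follows from the zero-order-hold response |sin(πf/f_S)/(πf/f_S)|. The sketch below evaluates it under the assumption that each sample is held for a full clock period; the function name is illustrative.

```python
import math


def zoh_gain_db(f_hz, fs_hz=33e6):
    """Magnitude of the zero-order-hold (sinc) response at f_hz, in dB.

    Assumes each sample is held for one full sampling period.
    """
    x = math.pi * f_hz / fs_hz
    if x == 0.0:
        return 0.0
    return 20.0 * math.log10(abs(math.sin(x) / x))


gain_nyquist = zoh_gain_db(16.5e6)
gain_center = zoh_gain_db(5e6)
print(f"16.5 MHz: {gain_nyquist:.2f} dB, 5 MHz: {gain_center:.2f} dB")
```

At 16.5 MHz the response is close to −4 dB, as stated above, whereas at the 5-MHz center frequency the droop is only a fraction of a decibel.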
To investigate the output noise level of the sub-array receiver circuits, we use an ASIC without an integrated transducer matrix, in which all bond pads for transducer interconnection are electrically shorted to ground by means of wire bonding. With the highest LNA and PGA gain settings, the electrical output noise density of a 3 × 3 sub-array is measured as 120 nV/√Hz at 5 MHz. This is in good agreement with the simulated value of 106 nV/√Hz. With a 300-mV maximum peak-to-peak output amplitude, the peak SNR at the highest gain setting thus found is about 51 dB.
Fig. 17(a) shows the measured output noise spectrum without enabling the mismatch-scrambling function. Two interference tones appear at fractions of the sampling frequency ($f_S/8$, $f_S/4$), which dominate the noise floor and thus reduce the dynamic range. After enabling mismatch-scrambling [Fig. 17(b)], these tones are eliminated from the output spectrum at the expense of a small increase in the noise floor.
The noise power reduction associated with the system-level beamforming has been measured by combining the sub-array output signals acquired using the Verasonics system. Fig. 18 shows the measured rms noise voltage after beamforming as a function of the number of sub-arrays. Ideally, if the noise at the outputs of the sub-arrays is uncorrelated, the noise power after beamforming should decrease inversely proportionally to the number of sub-arrays involved. Without mismatch-scrambling, this is not the case, because the sub-array output signals are dominated by (correlated) mismatch-related tones. With mismatch-scrambling enabled, the noise level shows the expected improvement, i.e., decreasing at a slope close to 10 dB/dec, provided that randomized seeds are delivered to the different pseudorandom number generators.
With the same seed used in all sub-arrays, the tones disappear from the output spectrum, but the randomized mismatch signals of the different sub-arrays are still correlated and hence are not reduced by the averaging operation of the system-level beamformer. Table I summarizes the measured electrical performance of the ASIC. A system-level comparison with previously reported ASICs for 3-D ultrasound imaging is given in Table II. Our ASIC achieves both the best power efficiency in receiving and the highest integration density.
Fig. 19. Schematic of the acoustic experiment setup. For the beamsteering measurements and the characterization of TX pressure, scatterers were replaced by single-element transducers and a hydrophone, respectively.
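Two of the noise figures reported in this subsection can be cross-checked numerically. The sketch below is illustrative only. The peak-SNR estimate assumes a sinusoidal output and a flat noise density integrated over the 6-MHz bandwidth, which is a simplification. The averaging experiment uses Gaussian samples from Python's `random` module as a stand-in for the scrambled mismatch ripple; it does not model the actual pseudorandom number generators in the ASIC, and all function names are hypothetical.

```python
import math
import random


def peak_snr_db(vpp, noise_density, bandwidth_hz):
    """Peak SNR of a sine wave with peak-to-peak amplitude vpp.

    Assumes a flat noise density (V/sqrt(Hz)) over bandwidth_hz.
    """
    signal_rms = vpp / (2.0 * math.sqrt(2.0))
    noise_rms = noise_density * math.sqrt(bandwidth_hz)
    return 20.0 * math.log10(signal_rms / noise_rms)


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def averaged_noise_rms(n_subarrays, n_samples, shared_seed):
    """RMS of the average of n_subarrays unit-variance noise sequences.

    shared_seed=True mimics every sub-array receiving the same seed
    (identical, fully correlated sequences); shared_seed=False gives
    each sub-array its own seed (independent sequences).
    """
    if shared_seed:
        seeds = [1] * n_subarrays
    else:
        seeds = list(range(1, n_subarrays + 1))
    generators = [random.Random(seed) for seed in seeds]
    averaged = [
        sum(g.gauss(0.0, 1.0) for g in generators) / n_subarrays
        for _ in range(n_samples)
    ]
    return rms(averaged)


snr_db = peak_snr_db(vpp=0.3, noise_density=120e-9, bandwidth_hz=6e6)
print(f"peak SNR: {snr_db:.1f} dB")

for n in (1, 4, 16):
    same = averaged_noise_rms(n, 4000, shared_seed=True)
    diff = averaged_noise_rms(n, 4000, shared_seed=False)
    print(f"N={n:2d}  same seed: {same:.3f}  different seeds: {diff:.3f}")
```

Under these assumptions the SNR estimate comes out close to the reported 51 dB. With distinct seeds the rms value falls roughly as 1/√N, which corresponds to the 10-dB/dec slope in noise power, whereas with a shared seed the averaged sequence keeps its original rms value.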
B. Acoustic Experiments
The fabricated prototype shown in Fig. 15 was immersed in a water tank (Fig. 19) for the evaluation of its acoustic performance. To measure the TX efficiency of the center sub-array, all 64 TX elements were driven simultaneously by the Verasonics system and the pressure was measured at a distance of 5 cm using a hydrophone. With a 50-V excitation, a TX pressure of 300 kPa was measured, corresponding to a TX efficiency of about 6 kPa/V.
To characterize the receive beamsteering function of the ASIC, a single-element transducer with a 0.5-in diameter and a 5-MHz center frequency (Olympus) was used as an external source, generating a quasi-continuous plane wave at the surface of the prototype transducer. The prototype was mounted on a rotating stage and turned from −50° to +50° with a step size of 2°. The delays of the sub-array beamformers in the ASIC were programmed successively to steer the sub-arrays' maximum sensitivity toward 0°, 17°, and 37°. The corresponding measured sub-array beam profiles, shown in Fig. 20, are in good agreement with expectations, with the peaks of the beams corresponding well to the programmed steering angles.
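The relation between a programmed inter-element delay and the resulting steering angle can be sketched with the standard plane-wave formula sin θ = c·τ/p. The element pitch (150 µm) and delay step (30 ns) used below are assumptions that are not stated in this section; they are chosen because they reproduce the three steering angles used in the measurement. The sound speed of 1500 m/s in water is likewise an assumption, and the function name is hypothetical.

```python
import math


def steering_angle_deg(delay_steps, delay_step_s=30e-9,
                       pitch_m=150e-6, c_m_s=1500.0):
    """Steering angle for a linear delay gradient across adjacent elements.

    delay_steps  : number of delay steps between neighbouring elements
    delay_step_s : assumed delay resolution (not given in this section)
    pitch_m      : assumed element pitch (not given in this section)
    c_m_s        : assumed speed of sound in water
    """
    sin_theta = c_m_s * delay_steps * delay_step_s / pitch_m
    return math.degrees(math.asin(sin_theta))


for steps in (0, 1, 2):
    print(f"{steps} step(s): {steering_angle_deg(steps):.1f} deg")
```

Under these assumed values, zero, one, and two delay steps between neighbouring elements map to angles that round to 0°, 17°, and 37°.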
C. Imaging Results
To demonstrate the 3-D imaging capability of the prototype, a pattern of seven point scatterers (six steel balls and one needle), forming a letter "M" [Fig. 21(a)], was placed at a distance of approximately 35 mm in front of the transducer array. A diverging wave was transmitted from the prototype, using a pulse of 18 V (peak-to-peak), generated by the Verasonics system and applied to the TX sub-array through the connections on the ASIC. A 3-D volume image was reconstructed by combining the sub-array output signals recorded using the Verasonics system from multiple TX-RX events and rendered to obtain a frontal view of the point scatterers [Fig. 21(b)], which clearly shows the layout of the scatterers.
Currently, the 3-D image reconstruction is done offline, and 169 TX-RX events were used to generate one volume, as shown in Fig. 21(b) [32]. In a future real-time implementation, this would correspond to a frame rate of 44.4 volumes/s for an imaging depth of 10 cm. When the daisy-chain mode for delay pattern programming is enabled, the frame rate reduces to about 40 volumes/s. We have also noted that volumes can be reconstructed from as few as 25 TX-RX events, at the cost of slightly degraded image quality [32]. This results in a frame rate of 300 volumes/s in the fast imaging mode.
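The quoted frame rates follow from the round-trip acoustic travel time per TX-RX event. The sketch below assumes a sound speed of 1500 m/s, which is not stated in the text but reproduces the 44.4 and 300 volumes/s figures. The per-event overhead derived for the daisy-chain mode is an inference from the quoted 40 volumes/s, not a value reported here, and the function names are hypothetical.

```python
def volume_rate(events_per_volume, depth_m=0.10, c_m_s=1500.0,
                overhead_s=0.0):
    """Volumes per second when each TX-RX event lasts one round trip
    to depth_m plus an optional fixed overhead."""
    round_trip_s = 2.0 * depth_m / c_m_s
    return 1.0 / (events_per_volume * (round_trip_s + overhead_s))


def implied_overhead_s(target_rate, events_per_volume, depth_m=0.10,
                       c_m_s=1500.0):
    """Per-event overhead that would yield target_rate volumes/s."""
    round_trip_s = 2.0 * depth_m / c_m_s
    return 1.0 / (target_rate * events_per_volume) - round_trip_s


full_rate = volume_rate(169)   # full-quality mode
fast_rate = volume_rate(25)    # fast imaging mode
daisy_overhead = implied_overhead_s(40.0, 169)

print(f"169 events: {full_rate:.1f} volumes/s")
print(f" 25 events: {fast_rate:.1f} volumes/s")
print(f"implied daisy-chain overhead: {daisy_overhead * 1e6:.1f} us/event")
```

Under these assumptions, the drop from 44.4 to about 40 volumes/s would correspond to roughly 15 µs of additional time per TX-RX event.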
V. CONCLUSION
A front-end ASIC with an integrated 32 × 32 PZT matrix transducer has been designed and implemented to enable next-generation miniature ultrasound probes for real-time 3-D TEE. The transducer array is split into a TX and an RX sub-array to facilitate the power and area optimization of the ASIC. To address the critical challenge of cable count reduction, sub-array receive beamforming is realized in the ASIC with a highly compact and power-efficient circuit-level implementation, which utilizes the mismatch-scrambling technique to optimize the dynamic range. A power- and area-efficient LNA architecture is proposed to further optimize the performance. Based on these techniques, the ASIC demonstrates state-of-the-art power and area efficiency, and has been successfully applied in 3-D imaging experiments.
