A Column-Row-Parallel ASIC architecture is proposed to enable 3D wearable/portable medical ultrasound. Its interconnection count, acquisition time, and programming time all scale linearly with the array dimension, while it still supports rich functionality. A high-voltage MUX in Tx and a specially sized source follower in Rx implement element parallelization for improved SNR. A fault-tolerant transceiver handles defective transducer elements, which increases assembly yield and allows a successful system demonstration.
active transceivers and I/Os. A single-channel transceiver takes N² repetitions to sweep the array. Both are difficult to scale to a large element count. Various sub-array architectures mitigate the N² trend, but either have worse side-lobes [1] or need bulky analog delay lines in a column-parallel architecture [2]. A column-row transducer is hard-wired into Tx rows and Rx columns in [3], but its functionality is limited. This work proposes a Column-Row-Parallel (CRP) architecture at the circuit level, achieving a rich set of 3D beam-formation (BF) functionality and a better tradeoff between complexity and speed. In Fig. 1(a-b), a 16x16 capacitive micromachined ultrasonic transducer (CMUT) is biased at 20-40 V (VB) [1]. Each CMUT element and its ASIC transceiver (a Tx pulser and an Rx LNA) have the same size of 250x250 µm². The CMUT and ASIC chips are both flip-chip bonded through an interposer PCB as in Fig. 1(b). At the ASIC perimeter, 16 column and 16 row pulser gate drivers and buffer amplifiers interface to the transceiver array. Their I/Os are multiplexed to share 16 input and 16 output ports (not shown), reducing the ASIC I/O count to N. Fig. 1(c) shows the ASIC die photo in a 0.18 µm HV CMOS process. Although a CMUT device is used here, the CRP ASIC architecture is equally applicable to other 2D transducers.
In column-parallel mode, the column circuitry is active while the row select logic determines which elements are parallelized along each column. For example, in Fig. 2(a), two Tx elements are activated in parallel for each column, driven by a shared column driver. Similarly, in row-parallel mode, the row circuitry and the column select logic are active, as in Fig. 2(b). Beamforming is realized by applying relative delays to the 16 Tx drivers in real time and to the 16 Rx buffer outputs in post-processing. The CRP architecture is both scalable and flexible. First, the row-by-row or column-by-column operations allow the interconnection count and data acquisition time to scale with N. Second, the column and row select logic can be reprogrammed quickly and frequently to activate different rows or columns for 3D BF, as in Fig. 2(a-b). The programming time of this select logic also scales with N (0.16 µs with a 100 MHz clock). Third, per-element enable bits are programmed by snake-chained shift registers (SRs) through the array. They offer fine granularity for application-specific patterns, making CRP compatible with existing BF schemes [1-3] while enabling new ones, such as the checkerboard and the annular ring in Fig. 2(c-d), and the fault-tolerant front-end discussed next. Lastly, each control set (column, row, per-element) has two multiplexed SR banks. For example, "If (SEL=1), R_en=Rbank1; else, R_en=Rbank2". This allows the array to operate from one bank while the other is reprogrammed, or to alternate between two pre-programmed banks for fast switching of the imaging aperture.
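The select and enable logic described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the ASIC's actual logic: the function names are hypothetical, the line-select bit and the per-element enable bit are assumed to combine as a simple AND, and the select shift register is assumed to load one bit per clock cycle.

```python
# Behavioral sketch of the CRP select logic (illustrative only, not the ASIC RTL).
N = 16  # the prototype array is N x N = 16 x 16


def checkerboard(n=N):
    """Per-element enable bits with every other element switched on."""
    return [[(r + c) % 2 == 0 for c in range(n)] for r in range(n)]


def annular_ring(r_in, r_out, n=N):
    """Per-element enable bits for a ring centered on the array.

    r_in and r_out are radii in units of element pitch (hypothetical parameters).
    """
    center = (n - 1) / 2.0
    mask = []
    for r in range(n):
        row = []
        for c in range(n):
            d2 = (r - center) ** 2 + (c - center) ** 2
            row.append(r_in ** 2 <= d2 <= r_out ** 2)
        mask.append(row)
    return mask


def bank_mux(sel, bank1, bank2):
    """Mirrors 'If (SEL=1), R_en=Rbank1; else, R_en=Rbank2'."""
    return bank1 if sel == 1 else bank2


def active_elements(mode, line_en, elem_en):
    """Return the N x N map of elements taking part in the current shot.

    mode == "column": column circuitry is active, line_en holds row-select bits.
    mode == "row":    row circuitry is active, line_en holds column-select bits.
    Assumption: select bit AND per-element enable bit gates each element.
    """
    n = len(elem_en)
    if mode == "column":
        return [[line_en[r] and elem_en[r][c] for c in range(n)] for r in range(n)]
    if mode == "row":
        return [[line_en[c] and elem_en[r][c] for c in range(n)] for r in range(n)]
    raise ValueError("mode must be 'column' or 'row'")


def select_program_time_s(n=N, f_clk_hz=100e6):
    """Time to shift n select bits, assuming one bit per clock cycle."""
    return n / f_clk_hz
```

For instance, selecting two rows with all per-element bits set gives two parallel elements per column, as in the Fig. 2(a) example, and `select_program_time_s()` reproduces the 0.16 µs figure for N = 16 at 100 MHz.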
Circuit Design & Measurement Results

A. Implementing 2D Parallelism in Tx and Rx
The pulser and LNA block designs are revised based on [4], but the focus here is to integrate blocks into the 2D CRP array.
For Tx (Fig. 3(c)), HV pass-gate MUXes implement Tr&Tc inside each 3-level 30 Vpp pulse-shaping pulser (see Fig. 5(a) or [4]). If the pulser is off, the MUX sends a "hold" voltage to the pulser gate, ignoring the column and row drivers. The 2D grid of column and row lines uses minimum-width metal wires to minimize parasitic capacitance. When multiple pulsers on one line are driven by the same driver, the acoustic outputs of their CMUT elements combine in space. These elements are effectively in parallel and behave as one larger CMUT element with higher acoustic output energy.
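As a rough illustration of the relative delays applied to the 16 Tx line drivers, the sketch below computes firing delays for simple plane-wave steering. It is not taken from the paper: the function name is hypothetical, the line pitch is assumed equal to the 250 µm element size, and the 1540 m/s speed of sound is a typical soft-tissue value chosen for the example.

```python
import math


def tx_steering_delays(n_lines=16, pitch_m=250e-6, angle_deg=0.0, c_m_s=1540.0):
    """Relative firing delays in seconds, one per Tx line driver.

    Steers a plane wave by angle_deg from the array normal. The delays are
    shifted so that the earliest-firing line has zero delay.
    """
    step = pitch_m * math.sin(math.radians(angle_deg)) / c_m_s
    raw = [i * step for i in range(n_lines)]
    t0 = min(raw)
    return [t - t0 for t in raw]


# Example: a 30 degree steer gives roughly 81 ns between neighboring lines.
delays = tx_steering_delays(angle_deg=30.0)
```

Because every element on a line shares one driver, a single delay per line is all this model needs; which elements actually fire on that line is decided separately by the select and enable bits.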
For Rx, LNA outputs on the same line can be combined so that signals are averaged and noise is reduced, effectively achieving parallelism as if receiving from a larger CMUT element with multiple parallel LNAs. In Fig. 4, a source follower stage (M11-M12) provides a suitable LNA output impedance (Ro) for analog combining. First, Ro must be high enough to allow current summing. Otherwise, device mismatch and line resistance (Rp) will disturb the DC condition and distort signals when parallelizing. Second, Ro must be low enough to preserve the performance of a single LNA. Otherwise, the bias current (Io) is too low to maintain good output linearity, and the line capacitance (Cp) limits the bandwidth. The design starts with a slew constraint that fixes Io > 34 µA with Ro < 2.2 kΩ. Metal wires of 10x minimum width are then chosen to provide Rp << Ro for current summing. Simulations ensure that the final design in Fig. 4 meets all specs. Table I shows that the measured SNR increase with parallel LNAs is close to theory; the discrepancy is likely due to correlated noise sources. Other measured results are in Table II. The Rx design improves on [4] with a more linear output stage and a much lower-power sleep mode, achieving the best linearity and power efficiency among reported CMUT amplifiers [1, 2, 4]; it also has the first implementation of programmable gain, greatly enhancing system flexibility.
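The theoretical SNR gain from averaging parallel LNA outputs can be sketched as follows. This is a textbook averaging model under assumptions, not the paper's own analysis: each LNA is assumed to see the same signal and equal noise power, and `rho` is a hypothetical pairwise noise correlation coefficient added to show how correlated noise reduces the gain.

```python
import math


def snr_gain_db(m, rho=0.0):
    """SNR gain in dB from averaging the outputs of m parallel LNAs.

    The signal adds coherently, while the variance of the averaged noise is
    sigma^2 * (1 + (m - 1) * rho) / m, so the power-SNR gain is
    m / (1 + (m - 1) * rho). rho = 0 means fully uncorrelated noise.
    """
    if m < 1 or not 0.0 <= rho <= 1.0:
        raise ValueError("need m >= 1 and 0 <= rho <= 1")
    return 10.0 * math.log10(m / (1.0 + (m - 1) * rho))


# Ideal (uncorrelated) gain for a few parallelization factors.
ideal = {m: snr_gain_db(m) for m in (1, 2, 4, 8, 16)}
```

With uncorrelated noise the gain is 10·log10(m) dB, about 3 dB per doubling of the number of LNAs. Any positive `rho` lowers it, which is consistent in direction with the measured shortfall being attributed to correlated noise sources.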
978-1-4799-3328-0/14/$31.00 ©2014 IEEE

B. Fault-Tolerant Front-end for Defective Transducers
2D CMUTs with large element count inevitably suffer from short and open defects. A shorted element propagates the HV bias VB to the transceiver, potentially damaging LV circuits. If the circuit provides a path to ground, the shared VB is pulled down to 0 V, rendering the whole array useless. Previously, shorts were detected on a probe station, and solder bumps were removed to avoid electrical contact with the ASIC [1]. This manual process is ill-suited to mass production, as defect positions are random and vary across devices. New shorts can also emerge under a different VB, invalidating the fixed removal pattern. In contrast, the CRP ASIC can isolate shorts in a fast, automated electrical process. Fig. 5(a) shows the front-end HV transistors for one CMUT element. To detect a short, M1 is briefly turned on while the others are off; the short current is sensed by R_T. The per-element enable bits are first programmed to test individual elements, then to disable all detected shorts, whose HV transistors are kept constantly off. A simple reprogramming adapts to new shorts in the future. Finally, using a 16x16 assembly with 4 shorts isolated, 3D images are acquired with a low-power software BF scheme designed for CRP wearable/portable systems (plane-wave Tx, row-by-row Rx). Fig. 5(b-c) shows the cross-sectional images.
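The test-then-disable sequence above can be sketched in software as a two-step loop. This is a minimal sketch under assumptions: `sense_current_a` is a hypothetical callback standing in for "enable one element, pulse M1, read the current through R_T", and `threshold_a` is an illustrative detection threshold that the paper does not specify.

```python
def find_shorts(n, sense_current_a, threshold_a):
    """Test elements one at a time and return the set of shorted (row, col).

    sense_current_a(row, col) is a hypothetical measurement hook that returns
    the sensed current in amperes for the single element under test.
    """
    shorts = set()
    for r in range(n):
        for c in range(n):
            if sense_current_a(r, c) > threshold_a:
                shorts.add((r, c))
    return shorts


def enable_mask(n, shorts):
    """Per-element enable bits with every detected short disabled."""
    return [[(r, c) not in shorts for c in range(n)] for r in range(n)]


# Example with a simulated array containing 4 shorted elements.
simulated_bad = {(0, 0), (3, 7), (9, 2), (15, 15)}
fake_sense = lambda r, c: 1e-3 if (r, c) in simulated_bad else 1e-9
mask = enable_mask(16, find_shorts(16, fake_sense, threshold_a=1e-6))
```

Rerunning `find_shorts` after a change in VB and reloading the resulting mask corresponds to the simple reprogramming step that adapts to newly emerged shorts.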
