# Sonic Millip3De with Dynamic Receive Focusing and Apodization Optimization

Richard Sampson\*, Ming Yang<sup>†</sup>, Siyuan Wei<sup>†</sup>, Chaitali Chakrabarti<sup>†</sup>, and Thomas F. Wenisch\*

\*Department of EECS, University of Michigan <sup>†</sup>School of ECEE, Arizona State University

*Abstract*—3D ultrasound is becoming common for noninvasive medical imaging because of its accuracy, safety, and ease of use. However, the extreme computational requirements (and associated power requirements) of image formation for a large 3D system have, to date, precluded hand-held 3D-capable devices.

Sonic Millip3De is a recently proposed hardware design that leverages modern computer architecture techniques, such as 3D die stacking, massive parallelism, and streaming data flow, to enable high-resolution synthetic aperture 3D ultrasound imaging in a single, low-power chip. In this paper, we enhance Sonic Millip3De with a new virtual source firing sequence and dynamic receive focusing scheme to optimize receive apertures in multiple depth focal zones. These enhancements further reduce power requirements while maintaining image quality over a large depth range. We present image quality analysis using Field II simulations of cysts in tissue at varying depths to show that our methods do not degrade CNR relative to an ideal system with no power constraints. Then, using RTL-level design for an industrial 45nm ASIC process, we demonstrate 3D synthetic aperture imaging with a 120x88 transducer array within a 15W full-system power budget (400x less than a conventional DSP solution). We project that continued semiconductor scaling will enable a sub-5W power budget in 16nm technology.

# I. INTRODUCTION

Ultrasound systems have been a safe and effective tool for internal imaging, posing none of the dangers of other modalities such as X-ray and MRI. With the growing capabilities of handheld ultrasound systems, many new uses for ultrasound are emerging, from disaster relief to battlefield triage. Tremendous advances have also been made in 3D imaging for full-size systems, enabling greater technician productivity, accurate volumetric measurements, and images that are easier to interpret. Unfortunately, the formidable computational requirements and high data rates of the 3D digital front-end have to-date precluded 3D imaging in hand-held platforms.

Sonic Millip3De [7] is a recently proposed hardware design that leverages modern computer architecture techniques to enable 3D imaging within the tight constraints of a hand-held platform. The design combines a new algorithm for synthetic aperture beamforming, a massively parallel hardware accelerator, and 3D silicon die stacking to tightly integrate transducers, memory, and the accelerator in a single, low-power chip. In this work, we extend Sonic Millip3De with a new sub-aperture firing scheme and dynamic focus apodization that allow the system to generate images over a large depth range that are nearly indistinguishable from those of an ideal system with no power constraints. Using RTL-level synthesis in an industrial 45nm ASIC technology, we show that Sonic Millip3De can be implemented within a 15W power budget. We further project that, based on current silicon scaling trends, Sonic Millip3De will achieve a sub-5W power budget by the 16nm technology node.

# II. ALGORITHM DESIGN

#### A. Iterative Delay Calculation

For a large 3D synthetic aperture system, over 100 billion round-trip delay calculations are needed to remap the received raw channel data into the image space for beamforming. Computing delays on-the-fly is too costly due to complex trigonometric and square root functions. Smaller systems avoid this problem by storing pre-calculated values in a look-up table (LUT), but due to the storage limitations of a handheld device, a naive LUT approach is not feasible.

Sonic Millip3De instead uses a new method of estimating delay values quickly and efficiently using an iterative approach [7]. The key insight of our algorithm is that, along a scanline, the change in the channel index value from one focal point to the next can be approximated accurately with quadratic equations. Rather than storing pre-calculated delays, we store only the coefficients of a piece-wise quadratic approximation (3 sections in the current design), reducing storage requirements from 4096 delay constants per scanline to a mere 16 coefficients (5 per section plus a starting offset), a roughly 250× reduction. Using these formulae, our design computes delays efficiently with only shifts and adds.
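To illustrate how a piece-wise quadratic can be evaluated with only shifts and adds, the following minimal sketch uses forward differencing in fixed-point arithmetic. It is not the coefficient layout of the actual design: the `(length, first_difference, second_difference)` section format, the 16-bit fraction width, and all coefficient values are assumptions made up for illustration.

```python
FRAC_BITS = 16                 # hypothetical fixed-point fraction width
ONE = 1 << FRAC_BITS


def iterate_indices(start, sections):
    """Return one integer sample index per focal point using only adds and shifts.

    start    -- fixed-point index of the first focal point
    sections -- (length, first_difference, second_difference) per section,
                all in fixed point; this layout is illustrative only
    """
    indices = []
    acc = start
    for length, d1, d2 in sections:
        for _ in range(length):
            indices.append(acc >> FRAC_BITS)  # shift drops the fractional bits
            acc += d1                         # step to the next focal point
            d1 += d2                          # quadratic term: update the step
    return indices


# Made-up coefficients for three sections covering 4096 focal points.
SECTIONS = [
    (1366, 5 * ONE, -40),
    (1365, 4 * ONE + ONE // 2, -8),
    (1365, 4 * ONE + ONE // 4, -1),
]
indices = iterate_indices(40 * ONE, SECTIONS)
```

Each focal point costs two additions and one shift, and no multiplication or square root appears in the inner loop. The storage ratio quoted above follows from 4096 / 16 = 256.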

#### B. Sub-aperture Design

The massive number of transducers (thousands) and high sampling rate (40MHz) of large-aperture 3D systems make fully sampling the aperture prohibitive from both a data rate and analog-to-digital conversion (ADC) power perspective. Hence, a common practice is to split the aperture into smaller, distinct receive sub-apertures. In such a design, only a single sub-aperture receives during each firing, greatly reducing the data rate. The trade-off, however, is that the number of transmissions must be increased so that the full aperture can be reconstructed from a series of firings.

Fig. 1: Sliding virtual source & receive sub-aperture.

In a typical design (including our initial work [7]), sub-apertures are disjoint. We propose a new virtual source firing scheme, illustrated in Fig. 1, wherein a sliding 32x32 sub-aperture and its corresponding virtual source are shifted by 8 elements on every firing, achieving a better distribution of firing angles for each receive element. This scheme allows us to achieve the same image quality as a conventional sub-aperture firing sequence while more than halving the total number of firings.
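The sliding scheme can be sketched by enumerating the sub-aperture positions. The sketch below assumes that the window steps by 8 elements along both array dimensions and that positions are visited in raster order; neither detail is stated in the text, although the resulting count agrees with the 96 transmits per frame listed in Table I.

```python
ARRAY_W, ARRAY_H = 120, 88   # transducer grid, from the text
SUB = 32                     # sub-aperture edge length, from the text
STEP = 8                     # shift per firing, from the text


def subaperture_origins(width=ARRAY_W, height=ARRAY_H, sub=SUB, step=STEP):
    """Top-left corner of every sub-aperture position (raster order is an assumption)."""
    return [(x, y)
            for y in range(0, height - sub + 1, step)
            for x in range(0, width - sub + 1, step)]


def firings_covering(x, y, origins, sub=SUB):
    """Number of firings in which element (x, y) lies inside the receive sub-aperture."""
    return sum(1 for x0, y0 in origins
               if x0 <= x < x0 + sub and y0 <= y < y0 + sub)


origins = subaperture_origins()
print(len(origins))  # 12 positions across x 8 positions down
```

Under these assumptions a corner element receives during a single firing, while an element near the array center receives during 16 firings.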

#### C. Dynamic Apodization Focusing

Our initial design used a static apodization window over the entire aperture, resulting in poor balance of main and side lobe levels across depths. A fully dynamic apodization (with a distinct constant for each of 4096 focal points per scanline) requires prohibitive storage. We instead adopt a zonal scheme: we divide the image into three depth zones, each with its own optimized apodization window. The zonal apodization focus improves image quality at depth with only a modest increase in storage (3 apodization constants per scanline).
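As a minimal sketch of the zonal lookup, the code below selects one of three stored constants from the focal-point index. The zone boundaries (equal thirds of the scanline) and the example weights are hypothetical, since the text specifies neither the boundaries nor the window shapes.

```python
from bisect import bisect_right

N_POINTS = 4096                 # focal points per scanline, from the text
ZONE_STARTS = (1365, 2730)      # hypothetical boundaries: equal thirds


def zone_of(focal_idx):
    """Depth zone (0, 1 or 2) containing the given focal-point index."""
    if not 0 <= focal_idx < N_POINTS:
        raise ValueError("focal point index out of range")
    return bisect_right(ZONE_STARTS, focal_idx)


def apodize(sample, focal_idx, zone_weights):
    """Scale one selected sample by its channel's constant for the current zone."""
    return sample * zone_weights[zone_of(focal_idx)]


# Illustrative weights for one channel and scanline: three constants, not 4096.
weights = (1.0, 0.5, 0.25)
print(apodize(100, 3000, weights))  # 25.0
```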

# III. HARDWARE OVERVIEW

The Sonic Millip3De hardware comprises three stacked silicon die layers (transducers, ADC/SRAM, and beamformer) connected vertically using through-silicon vias (TSVs) as shown in Fig. 2. Using a 3D-stacked design provides several architectural benefits. First, it is possible to stack dies manufactured in different technologies. Hence, the transducer layer can be manufactured in a cost-effective process for large capacitive micromachined ultrasonic transducers (CMUTs), while the beamforming accelerator can exploit the latest digital logic process technology. Second, stacking allows far more TSV links between dies than conventional chip pins, resolving the bandwidth bottleneck that plagues existing 3D systems where the probe and compute units are connected via cable. Finally, face-to-face connections via TSVs remove the long wires that would be required in such a massively parallel system, reducing interconnect power requirements.

The top die layer comprises a 120x88 grid of CMUTs with $\lambda/2$ spacing. The area between the transducers is used for additional analog components and routing to the TSV interface. Transducers are grouped into banks such that only one transducer per bank receives data in any sub-aperture. With this banking design, only a single beamforming channel is necessary for each of the 1,024 banks rather than for each of the 10,560 transducers.
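One way to obtain the one-transducer-per-bank property is to assign banks by element position modulo the sub-aperture size. The text does not give the actual bank assignment, so the `bank_of` mapping below is an assumption; the sketch only shows that such a mapping places exactly one element of every bank in any 32x32 window.

```python
from collections import Counter

ARRAY_W, ARRAY_H = 120, 88   # transducer grid, from the text
SUB = 32                     # sub-aperture edge, giving 32 * 32 = 1,024 banks


def bank_of(x, y):
    """Hypothetical bank index: element position modulo the sub-aperture size."""
    return (y % SUB) * SUB + (x % SUB)


def banks_in_window(x0, y0):
    """Bank indices of the elements in the 32x32 window with top-left corner (x0, y0)."""
    return [bank_of(x, y)
            for y in range(y0, y0 + SUB)
            for x in range(x0, x0 + SUB)]


# Number of transducers that share each bank under this mapping.
bank_sizes = Counter(bank_of(x, y) for y in range(ARRAY_H) for x in range(ARRAY_W))
```

Because 120 and 88 are not multiples of 32, the banks under this mapping hold between 6 and 12 transducers each.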

The second layer includes 1,024 12-bit ADCs and SRAM arrays, each corresponding to a transducer bank. The ADCs sample at a frequency of 40MHz, storing the digital output into the channel's corresponding 6kB SRAM array. The SRAMs are clocked at 1GHz and connect vertically to the corresponding computational unit on the beamforming layer, requiring a total of 24,000 face-to-face TSVs for data and address signals.
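The storage and data-rate figures of this layer can be cross-checked with simple arithmetic. The sketch below uses only values stated in the text and in Table I; the variable names are illustrative.

```python
SAMPLES_PER_CHANNEL = 4096     # storage per receive transducer (Table I)
BITS_PER_SAMPLE = 12           # ADC resolution
CHANNELS = 1024                # ADC/SRAM pairs, one per bank
ADC_RATE_HZ = 40_000_000       # sampling frequency

# Per-channel SRAM: 4096 samples x 12 bits = 6144 bytes, i.e. the 6kB array.
sram_bytes = SAMPLES_PER_CHANNEL * BITS_PER_SAMPLE // 8

# Total on-die sample storage across all channels.
total_sram_bytes = sram_bytes * CHANNELS

# Aggregate ADC output rate while a sub-aperture is receiving.
raw_bits_per_second = CHANNELS * BITS_PER_SAMPLE * ADC_RATE_HZ

# Time needed to fill one channel's SRAM at the ADC rate.
capture_window_s = SAMPLES_PER_CHANNEL / ADC_RATE_HZ

print(sram_bytes, total_sram_bytes, raw_bits_per_second, capture_window_s)
```

The aggregate rate of roughly 491.5 Gb/s during receive illustrates why the samples are buffered on-die and moved over TSVs rather than over a cable.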

The final die layer comprises the beamforming accelerator, which itself is made up of 1,024 independent computational pipelines. Each pipeline includes three primary components. The first is the interpolation unit, which loads the channel data stream from SRAM and performs a linear 4× interpolation. The second component, the select unit, maps the interpolated receive data to focal points on a scanline using the algorithm described above. The select unit computes delay index estimates using only shift and add operations, determining how far to advance the receive data stream to locate the sample closest to the next focal point. Each select unit includes 10 sub-units which work concurrently on 10 different scanlines. Finally, the summing unit adds scanline data across channels, producing the beam-formed scanlines for the final image. The channels' summing units together form a 1024-stage pipelined network that connects all of the channels (shown in Fig. 2). The head and tail of this pipeline are connected to a low-power ARM Cortex-A3 processor, which serves as the memory interface to the off-chip LPDDR2 DRAM.

**Fig. 2: Sonic Millip3De Hardware Overview.** Layer 1 ($24 \times 17$mm) comprises 120x88 transducers grouped into banks with one transducer per bank in each sub-aperture. Analog transducer outputs from each bank are multiplexed and routed over TSVs to Layer 2, comprising 1024 12-bit ADC units operating at 40MHz and SRAM arrays to store incoming samples. The stored data is passed via face-to-face links to Layer 3 for processing in the 3 stages of the 1024-unit beamsum accelerator. The transform stage upsamples the signal to 160MHz. The 10 units in the select stage map signal data from the receive time domain to the image space domain in parallel for 10 scanlines. The reduce stage combines previously stored data from memory with the incoming signal from all 1024 beamsum nodes over a unidirectional pipelined interconnect, and the resulting updated image is written back to memory.

TABLE I: 3D ultrasound system parameters.

| Parameter                            | Value          |
|--------------------------------------|----------------|
| Total Transmits per Frame            | 96             |
| Total Transducers                    | 10,560         |
| Receive Transducers per Sub-aperture | 1024           |
| Storage per Receive Transducer       | 4096 x 12 bits |
| Focal Points per Scanline            | 4096           |
| Image Depth                          | 10cm           |
| Image Total Angular Width            | π/6            |
| Sampling Frequency                   | 40MHz          |
| Interpolation Factor                 | 4x             |
| Interpolated Sampling Frequency      | 160MHz         |
| Speed of Sound (tissue)              | 1540m/s        |
| Target Frame Rate                    | 1fps           |
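The interpolate, select, and sum stages of the beamforming pipelines described in this section can be sketched in software as follows. This is a functional illustration only, under several simplifying assumptions: the delay indices are supplied as precomputed lists rather than generated iteratively, apodization is omitted, a single scanline is processed, and the function names are hypothetical.

```python
def interpolate4(samples):
    """Linear 4x upsampling: insert three evenly spaced points between neighbours."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.extend(a + (b - a) * k / 4 for k in range(4))
    out.append(samples[-1])
    return out


def select(upsampled, indices):
    """Pick the upsampled sample at each focal point's delay index."""
    return [upsampled[i] for i in indices]


def beamsum(channels, delay_indices):
    """Sum the selected samples across channels to form one scanline."""
    scanline = [0.0] * len(delay_indices[0])
    for samples, indices in zip(channels, delay_indices):
        picked = select(interpolate4(samples), indices)
        scanline = [acc + v for acc, v in zip(scanline, picked)]
    return scanline


# Two toy channels with three focal points each.
channels = [[0, 4, 8], [8, 4, 0]]
delays = [[0, 2, 8], [0, 2, 8]]
print(beamsum(channels, delays))  # [8.0, 8.0, 8.0]
```

In the hardware, the per-channel loop corresponds to the 1,024 pipelines operating in parallel, and the running `scanline` accumulation corresponds to the pipelined summing network.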

# IV. RESULTS

#### A. Methodology

We evaluate our design in terms of image quality of simulated cysts in tissue and system power requirements. Our evaluation parameters are shown in Table I. We analyze image quality by simulating two rows of variable-sized cysts (2-7mm diameter) using Field II [3, 4] and comparing the contrast-to-noise ratios (CNR) of images produced by our design and an ideal system. To measure hardware power and performance, we synthesize an RTL-level specification of our design in Verilog using an industrial 45nm standard cell library. Additionally, we use SPICE models for interconnect power and published values for ADC [8] and DRAM power [5].
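The text does not state which CNR formula is used. As a minimal sketch, the code below assumes one common definition, the absolute difference of the mean intensities inside the cyst and in the surrounding tissue divided by the square root of the sum of their variances. The pixel lists are made-up values, not simulation output.

```python
import math
import statistics


def cnr(inside, outside):
    """Contrast-to-noise ratio between two pixel regions (assumed definition)."""
    contrast = abs(statistics.fmean(inside) - statistics.fmean(outside))
    noise = math.sqrt(statistics.pvariance(inside) + statistics.pvariance(outside))
    return contrast / noise


# Toy regions: dark cyst interior versus brighter background tissue.
cyst_pixels = [1.0, 2.0, 3.0]
tissue_pixels = [7.0, 8.0, 9.0]
print(cnr(cyst_pixels, tissue_pixels))
```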

#### B. Image Quality

We contrast CNR of simulated cysts in tissue for images generated using an ideal system (precise index calculation and double-precision floating-point) against our new Sonic Millip3De design, which uses 14-bit fixed-point beamsum, iterative delay calculation, and dynamic focus. An x-z slice through the middle of the cysts is shown in Fig. 3 for both the ideal case and our design. Table II shows a CNR breakdown for all cysts for both configurations. Neither design is effective in resolving the smallest (2mm) cyst at depth, but Sonic Millip3De's image quality is nearly indistinguishable from the ideal case, providing high image quality at all depths for the larger cysts.
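The size of the gap between the two configurations can be read directly from Table II. The short sketch below copies the (Ideal, SM3D) pairs from that table and reports the largest absolute CNR difference per column; the dictionary layout and function name are illustrative.

```python
# (Ideal, SM3D) CNR pairs copied from Table II, in table order.
TABLE_II = {
    "left": [(3.59, 3.58), (3.18, 3.21), (2.68, 2.67),
             (1.61, 1.62), (1.10, 1.18), (0.33, 0.39)],
    "right": [(1.93, 1.85), (1.51, 1.41), (1.94, 1.85),
              (2.10, 2.01), (2.39, 2.30), (2.43, 2.34)],
}


def max_abs_gap(rows):
    """Largest absolute CNR difference between the ideal system and SM3D."""
    return max(abs(ideal - sm3d) for ideal, sm3d in rows)


for column, rows in TABLE_II.items():
    print(column, round(max_abs_gap(rows), 2))
```

The largest gap is 0.08 in the left column and 0.10 in the right column.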

#### C. Power Analysis and Scaling

To evaluate full-system power requirements, we use a combination of RTL-level synthesis for the beamformer, SRAM, and interconnect and published estimates [1, 5, 8] for other system components. We determine that our revised design requires a full-system power of 14.6W in 45nm technology (Fig. 4). Using published scaling trends for ADCs [6] and CMOS logic [2], we project that this design will achieve a 5W power budget (our target for safe contact with human skin) by the 16nm node.

# V. CONCLUSIONS

3D ultrasound is a safe, easy-to-use method of producing internal images with a wide range of applications. However, high computational requirements and bandwidth bottlenecks limit current 3D systems and preclude handheld 3D devices. In this paper, we have outlined Sonic Millip3De, a new hardware accelerator for handheld 3D ultrasound, which combines a new iterative delay algorithm with a highly parallel, 3D-stacked, streaming architecture. Using image quality analysis of cysts in tissue, we have shown that our design is able to produce high-quality images over a large depth range, comparable to those of an ideal system. Using RTL-level synthesis, we have shown that Sonic Millip3De can enable 3D ultrasound for a 10,560 transducer array in 45nm technology within a 15W full-system power budget. We project our system will achieve a 5W power target for safe contact with human skin by the 16nm technology node.

**Fig. 3: Image Quality Comparison.** (a) X-Z (horizontal) slice through a series of cysts from a 3D simulation using Field II [3, 4], generated with double-precision floating point and exact delay index calculation. (b) The same slice generated via our delay algorithm, fixed-point precision, and dynamic focus. CNR for both given in Table II.

**Fig. 4: Power Breakdown Across Technology Nodes.** Scaling projections based on trends reported in [2, 6]. We project meeting the 5W power budget at the 16nm node.

**TABLE II:** CNR values for both ideal system and Sonic Millip3De (SM3D). Slice image of cysts shown in Fig. 3.

| Left, Ideal | Left, SM3D | Right, Ideal | Right, SM3D |
|-------------|------------|--------------|-------------|
| 3.59        | 3.58       | 1.93         | 1.85        |
| 3.18        | 3.21       | 1.51         | 1.41        |
| 2.68        | 2.67       | 1.94         | 1.85        |
| 1.61        | 1.62       | 2.10         | 2.01        |
| 1.10        | 1.18       | 2.39         | 2.30        |
| 0.33        | 0.39       | 2.43         | 2.34        |

# ACKNOWLEDGEMENTS

This work was partially supported by NSF CCF-0815457, CSR-0910699 and grants from ARM, Inc. The authors wish to thank J. Brian Fowlkes, Oliver Kripfgans, and Paul Carson for feedback and assistance with image quality analysis and Ron Dreslinski for assistance with SPICE.

# REFERENCES

- [1] ARM. Cortex-M3 40G Specifications. http://www.arm.com/products/processors/cortex-m/cortex-m3.php.
- [2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. *Proc. of the 38th International Symp. on Computer Architecture (ISCA '11)*, pages 365–376, June 2011.
- [3] J. Jensen. FIELD: A Program for Simulating Ultrasound Systems. In Nordic-Baltic Conf. on Biomedical Imaging, 1996.
- [4] J. Jensen and N. Svendsen. Calculation of pressure fields from arbitrarily shaped, apodized, and excited ultrasound transducers. *IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control*, 39(2):262–267, March 1992.
- [5] K. Malladi, F. Nothaft, K. Periyathambi, B. Lee, C. Kozyrakis, and M. Horowitz. Towards energy-proportional datacenter memory with mobile DRAM. *Proc. of 39th International Symp. on Computer Architecture (ISCA '12)*, pages 37–48, June 2012.
- [6] B. Murmann. "ADC Performance Survey 1997-2013". http://www.stanford.edu/~murmann/adcsurvey.html.
- [7] R. Sampson, M. Yang, S. Wei, C. Chakrabarti, and T. F. Wenisch. Sonic Millip3De: A massively parallel 3D-stacked accelerator for 3D ultrasound. *Proc. of IEEE 19th International Symp. on High Performance Computer Architecture (HPCA '13)*, pages 318–329, Feb. 2013.
- [8] B. Verbruggen, M. Iriguchi, and J. Craninckx. A 1.7mW 11b 250MS/s 2x interleaved fully dynamic pipelined SAR ADC in 40nm digital CMOS. Feb. 2012.