## 1024-Channel Single 5W FPGA Towards High-quality Portable 3D Ultrasound Platform

A. Ibrahim<sup>\*</sup>, W. Simon<sup>\*</sup>, A. C. Yüzügüler<sup>\*</sup>, F. Angiolini<sup>\*</sup>, M. Arditi<sup>\*</sup>, J.-P. Thiran<sup>\*†</sup> and G. De Micheli<sup>\*</sup>

<sup>\*</sup>École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

<sup>†</sup>Department of Radiology, University Hospital Center (CHUV) and University of Lausanne (UNIL), Switzerland

## ABSTRACT

Volumetric ultrasound (US) imaging is an emerging technology for medical US applications. Conventional US imaging is 2D: a number of vibrating elements, arranged in an array, scan 2D cross-sections of the human body. Volumetric US instead uses a matrix probe of vibrating elements and reconstructs conical volumes rather than 2D cross-sections. Today, cardiology and obstetrics benefit most from 3D imaging, which provides better assessment of chamber volumes and more expressive imaging, respectively. 3D US images an entire volume in a single scan, whereas 2D imaging requires a trained sonographer to acquire multiple slices precisely before the entire structure can be diagnosed. As a result, 3D US imaging shortens the acquisition time and removes the dependency on the presence of a trained operator during the scan. These characteristics make 3D US ideal for situations where trained sonographers are scarce and fast acquisition is paramount, such as battlefields and rescue environments. However, today's 3D systems [1] are bulky, expensive, and power hungry because the processing load of 3D US is orders of magnitude higher than that of conventional 2D imaging. For this reason, 3D systems are currently available only in well-equipped hospitals, and not in rural areas and underdeveloped regions where even electricity supply is an issue.

In US imaging, an acoustic wave with a center frequency between 2 MHz and 30 MHz is transmitted from the transducer into the body, a process called insonification. Part of the transmitted wave is reflected by the body's scatterers, which represent changes in tissue density, and is received back by the transducer. The received echoes are mainly amplified and digitized before entering the digital back-end system, where the image is reconstructed. The core processing step in the US imaging pipeline is called beamforming (BF). BF identifies the location and density (i.e. stiffness of the tissue) of the scatterers in the scanned body structure by summing the returned echoes after delaying them according to a delay profile. This delay profile represents the round-trip time-of-flight of the US wave from the transmission origin O to a scatterer S and back to a transducer element D. The BF process also includes a function called apodization, which weights the echoes received by each transducer element differently to account for the limited geometric directivity of the elements. The apodization window size increases with the imaging depth (expanding apodization). Finally, a visualization step, known as scan conversion, is performed.
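The delay-and-sum principle described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 1540 m/s speed of sound, the 40 MHz sampling rate, the 8-element toy probe, and the synthetic echo traces are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s; assumed constant, typical soft-tissue value


def hann_weights(n):
    """Static Hann apodization weights for n elements (n > 1)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def round_trip_delay(origin, scatterer, element, c=SPEED_OF_SOUND):
    """Time of flight O -> S -> D in seconds."""
    return (math.dist(origin, scatterer) + math.dist(scatterer, element)) / c


def delay_and_sum(traces, elements, origin, scatterer, fs, weights):
    """Beamform one focal point from per-element sampled echo traces."""
    total = 0.0
    for trace, element, w in zip(traces, elements, weights):
        # The delay, converted to a sample index, selects one echo sample.
        idx = round(round_trip_delay(origin, scatterer, element) * fs)
        if 0 <= idx < len(trace):  # ignore delays outside the recording
            total += w * trace[idx]
    return total


# Toy usage: 8 elements on a line, one ideal point scatterer 2 cm deep.
fs = 40e6  # hypothetical sampling rate in Hz
origin = (0.0, 0.0, 0.0)
elements = [((i - 3.5) * 0.3e-3, 0.0, 0.0) for i in range(8)]
target = (0.0, 0.0, 0.02)
traces = []
for element in elements:
    trace = [0.0] * 2048
    trace[round(round_trip_delay(origin, target, element) * fs)] = 1.0
    traces.append(trace)

uniform = [1.0] * len(elements)
on_target = delay_and_sum(traces, elements, origin, target, fs, uniform)
off_target = delay_and_sum(traces, elements, origin, (0.005, 0.0, 0.03), fs, uniform)
```

Focusing on the true scatterer position aligns all echoes so that they add coherently, while focusing elsewhere picks up little or no energy. Passing `hann_weights(len(elements))` instead of `uniform` applies a static apodization window across the aperture.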

In 3D US, delay calculation is the most computationally challenging part of the BF process and of the whole processing pipeline. In principle, trillions of time delays need to be calculated per second. Each time delay is obtained by dividing the Euclidean distance (i.e. a square root) between a scatterer in the volume and an element in the matrix probe by the speed of sound in the medium (commonly assumed constant). Today's state-of-the-art systems, both commercial and research, deal with this bottleneck by reducing the number of receiving channels to far fewer than the number of probe elements, which lightens the computational load. Analog pre-beamforming [1] is one of the most common techniques used to accomplish this downscaling: each subset of the probe elements is delayed according to a fixed analog delay profile, then summed and mapped to a single receiving channel. These approaches degrade the quality and resolution of the reconstruction because they raise the side-lobe level and reduce the amount of collected information. Even so, today's systems remain bulky and expensive.
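For reference, the delay described above can be written explicitly. The notation is an assumption made here for illustration: $\vec{O}$, $\vec{S}$ and $\vec{D}$ are the positions of the transmission origin, the scatterer and the receiving element, $c$ is the speed of sound, and the probe is taken to lie in the plane $z = 0$.

$$
t(S, D) = \frac{\lVert \vec{S} - \vec{O} \rVert + \lVert \vec{S} - \vec{D} \rVert}{c},
\qquad
\lVert \vec{S} - \vec{D} \rVert = \sqrt{(x_S - x_D)^2 + (y_S - y_D)^2 + z_S^2}
$$

One such receive distance, and hence one square root, is nominally required for every combination of voxel and probe element in every reconstructed volume.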

Our objective is to tackle the bottlenecks of the 3D US processing pipeline efficiently, with the aim of developing a portable, battery-operated, and cheap platform that supports as many receiving channels as possible to provide high-quality volumetric reconstruction. Existing high-channel-count systems are far from this goal. For example, the SARUS system [2], which is very advanced and supports 1024 channels, runs on 320 FPGAs. The new ULA-OP platform [3] uses 8 high-end FPGAs and 16 DSPs to support only 256-channel BF.

In this work, we develop a fully digital, high-quality, single-FPGA beamformer that supports 1024 channels, the highest number of receiving channels supported by today's imagers, within 5 W of power consumption. We consider this a crucial step towards our final target of a complete 3D US platform. We have demonstrated our architecture on a single Kintex UltraScale KU040 FPGA [4]; the proposed beamformer architecture is shown in Fig. 1(a). First, the echoes received by the  $32 \times 32$  channels are mapped in pairs, so that each pair of channels shares a BRAM. Since UltraScale BRAM macros offer dual-port read access, 512 BRAMs provide compact storage while offering full throughput. The echoes are pre-apodized using a *static* Hanning window to reduce the implementation and hardware cost. This choice is justified because, according to our probe specification, the expanding window has an effect only within the shallowest 1.1 cm of depth, which is clinically less critical. We have proposed a novel delay calculation algorithm [5], [6] that reduces the calculation of trillions of square roots to the calculation of just a few square roots along the reference central line-of-sight using a Xilinx CORDIC IP, followed by two "steering" addition operations per delay (see Fig. 1(b)). The calculated delays are used as indices into the apodized echoes, which are then summed by a 1024:1 adder tree, reconstructing one voxel per clock cycle. As we reconstruct a volume of  $64 \times 64 \times 600$  voxels, we need approximately 2.5M clock cycles per volume. Finally, a demodulation step is required since the beamformed voxels are still in radio-frequency (RF) form. We have implemented a simple method that takes the absolute value of the data and applies a low-pass FIR filter of length 5 (i.e. filter order of 4). This is implemented with the help of a circular buffer that is able to store five nappes of voxels.
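The demodulation step can be sketched in software as follows. This is an illustrative model only: the filter coefficients are not given in the text, so a uniform 5-tap moving average is assumed, and the filter is assumed to run across successive nappes (i.e. along depth). The function name `demodulate` is hypothetical.

```python
from collections import deque


def demodulate(rf_nappes, taps=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Rectify RF voxels, then low-pass filter across successive nappes.

    rf_nappes: iterable of nappes, each a list of beamformed RF voxel values.
    taps: FIR coefficients (assumed symmetric, so tap ordering is irrelevant).
    """
    window = deque(maxlen=len(taps))  # circular buffer: the last len(taps) nappes
    envelope = []
    for nappe in rf_nappes:
        window.append([abs(v) for v in nappe])  # absolute value of the RF data
        if len(window) == len(taps):  # buffer full: emit one filtered nappe
            envelope.append([
                sum(t * past[i] for t, past in zip(taps, window))
                for i in range(len(nappe))
            ])
    return envelope
```

The `deque` with `maxlen=5` plays the role of the circular buffer: appending a sixth nappe silently discards the oldest one, so only five nappes are ever held at a time.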



Fig. 1. Proposed beamformer architecture. (a) The input samples stored in BRAMs are statically pre-apodized, delayed (Fig. 1(b)), and summed to reconstruct a voxel, then demodulated. (b) TX and reference RX delays are computed and steering coefficients are applied, thus calculating  $32 \times 32$  delays that are used to index the input sample BRAMs.
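The idea of a reference delay plus two steering corrections, as in Fig. 1(b), can be illustrated numerically. The sketch below assumes a standard far-field (first-order) approximation, which may differ in detail from the algorithm of [5], [6]. The element pitch of 0.2 mm, the angle convention, and all function names are hypothetical, and it does not model how the reference delays are generated in hardware.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s, assumed constant


def exact_rx_distance(r, theta, phi, xd, yd):
    """Exact distance from scatterer S (range r, steered by theta/phi) to element D."""
    s = (r * math.sin(theta) * math.cos(phi),
         r * math.sin(phi),
         r * math.cos(theta) * math.cos(phi))
    return math.dist(s, (xd, yd, 0.0))


def steered_rx_distance(r, theta, phi, xd, yd):
    """Reference distance on the central line-of-sight plus two steering terms."""
    # The reference term does not depend on theta or phi, so it can be
    # shared by every steering direction at the same range r.
    reference = math.sqrt(r * r + xd * xd + yd * yd)
    # Two additive corrections, one per probe axis (assumed far-field form).
    return reference - xd * math.sin(theta) * math.cos(phi) - yd * math.sin(phi)


def max_delay_error(r, theta, phi, n=32, pitch=0.2e-3):
    """Worst-case delay error in seconds over an n x n probe (pitch is hypothetical)."""
    coords = [(i - (n - 1) / 2) * pitch for i in range(n)]
    return max(
        abs(exact_rx_distance(r, theta, phi, x, y)
            - steered_rx_distance(r, theta, phi, x, y))
        for x in coords for y in coords
    ) / SPEED_OF_SOUND
```

Calling `max_delay_error(0.05, math.radians(20), math.radians(10))` gives the worst-case deviation from the exact receive delay at 5 cm range under these assumptions. Because each steering term depends only on one element coordinate and the steering angles, it could be precomputed, leaving two additions per delay.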

Fig. 2 shows the current setup of the proposed volumetric US processing platform. The beamformer is implemented on a KU040 FPGA [4] and communicates via Ethernet with a computer running a custom Visual C# application that hosts a scan-conversion algorithm to visualize the output. Table I shows that our  $32 \times 32$  channel beamformer fits in a single Kintex UltraScale KU040 FPGA [4]. The operating frequency of the design is 133 MHz, which ideally yields a reconstruction rate of 53.2 volumes per second (vps) for our reference volume of 2.5M voxels. However, the reconstruction rate is currently restricted to 14 vps by the limited Ethernet bandwidth. We plan to improve the reconstruction rate and all the I/O interfaces by using a more efficient communication link, such as optical cables or PCI Express. The estimated power consumption of the proposed beamformer is 5 W, which meets the power budget for a portable device. The proposed implementation can scale to a higher number of channels, but on the KU040 it is limited by the BRAM resources. The channel count could be increased up to  $90 \times 90$  on a larger FPGA such as the Virtex UltraScale XCVU190 [4].
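The throughput figures above follow from simple arithmetic, sketched here with values taken from the text. The helper `volume_rate` is a hypothetical name, and the comparison with the exact voxel count is added for illustration.

```python
def volume_rate(clock_hz, voxels_per_volume, voxels_per_cycle=1):
    """Ideal volumes per second when voxels_per_cycle voxels finish each clock."""
    return clock_hz * voxels_per_cycle / voxels_per_volume


CLOCK_HZ = 133e6                 # operating frequency from the text
exact_voxels = 64 * 64 * 600     # reference volume: 2,457,600 voxels
rounded_voxels = 2.5e6           # rounded figure used in the text

ideal_vps = volume_rate(CLOCK_HZ, rounded_voxels)  # rate quoted in the text
exact_vps = volume_rate(CLOCK_HZ, exact_voxels)    # slightly higher, unrounded
link_fraction = 14 / ideal_vps   # share of the ideal rate reached over Ethernet

print(f"ideal: {ideal_vps:.1f} vps, exact count: {exact_vps:.1f} vps, "
      f"Ethernet-limited share: {link_fraction:.0%}")
```

With the rounded 2.5M voxels the ideal rate matches the 53.2 vps quoted above. The 14 vps achieved over Ethernet is roughly a quarter of that, which is why the link, not the beamformer, is the current bottleneck.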



Fig. 2. The setup of the proposed volumetric US processing platform. The beamformer is implemented on a Kintex UltraScale KU040 FPGA [4].

TABLE I BEAMFORMER ARCHITECTURE RESULTS. \*Kintex UltraScale KU040 implementation results. \*\*Virtex UltraScale XCVU190 extrapolated results.

| Supported<br>Channels | Logic<br>LUTs | Regs  | BRAM  | DSP  | Clock   | Volume<br>Rate |
|-----------------------|---------------|-------|-------|------|---------|----------------|
| 32×32*                | 49.6%         | 24.1% | 71%   | 2.2% | 133 MHz | 53.2 vps       |
| 90×90**               | 65.0%         | 33.3% | 91.2% | 2.3% | 133 MHz | 53.2 vps       |

## ACKNOWLEDGMENT

The authors would like to acknowledge funding from the Swiss Confederation through the UltrasoundToGo project of the Nano-Tera.ch initiative.

## REFERENCES

- [1] Philips Electronics N.V., "iE33 xMATRIX echocardiography system," www.healthcare.philips.com.
- [2] J. Jensen, H. Holten-Lund, R. Nilsson, M. Hansen, U. Larsen, R. Domsten, B. Tomov, M. Stuart, S. Nikolov, M. Pihl, Y. Du, J. Rasmussen, and M. Rasmussen, "SARUS: A synthetic aperture real-time ultrasound system," *Ultrasonics, Ferroelectrics, and Frequency Control, IEEE Transactions on*, vol. 60, no. 9, pp. 1838–1852, Sep 2013.
- [3] E. Boni, L. Bassi, A. Dallai, F. Guidi, V. Meacci, A. Ramalli, S. Ricci, and P. Tortoli, "ULA-OP 256: A 256-channel open scanner for development and real-time implementation of new ultrasound methods," *Ultrasonics, Ferroelectrics, and Frequency Control, IEEE Transactions on*, in press 2016.
- [4] Xilinx Inc., "Ultrascale FPGA: Product tables and product selection guide," 2016, http://www.xilinx.com/support/documentation/selection-guides/ultrascale-fpga-product-selection-guide.pdf#KU.
- [5] A. Ibrahim, P. A. Hager, A. Bartolini, F. Angiolini, M. Arditi, L. Benini, and G. De Micheli, "Tackling the bottleneck of delay tables in 3D ultrasound imaging," in *Proceedings of the 2015 Design Automation and Test in Europe (DATE 2015) Conference*, March 2015, pp. 1683–1688.
- [6] W. Simon, A. C. Yüzügüler, A. Ibrahim, F. Angiolini, M. Arditi, J.-P. Thiran, and G. De Micheli, "Single-FPGA, scalable, low-power, and high-quality 3D ultrasound beamformer," in *The 26th International Conference on Field-Programmable Logic and Applications (FPL)*, 2016.