## Demo: Efficient Delay and Apodization for on-FPGA 3D Ultrasound

A. C. Yüzügüler\*, W. Simon\*, A. Ibrahim\*, F. Angiolini\*, M. Arditi\*, J.-P. Thiran\*† and G. De Micheli\*

\* École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

<sup>†</sup>Department of Radiology, University Hospital Center (CHUV) and University of Lausanne (UNIL), Switzerland

## ABSTRACT

In medical diagnosis, ultrasound (US) imaging is one of the most common, safe, and powerful techniques. Volumetric (3D) US is potentially very attractive compared to 2D US because it might enable telesonography: decoupling the local image acquisition, performed by an untrained person, from the diagnosis, made by a trained sonographer who can be remote. Unfortunately, current 3D systems are hospital-oriented, bulky, and expensive, so they are unavailable in emergency operations or rural areas. This motivates us to develop a portable US platform with cheap, battery-operated, more efficient electronics.

The core of any US imaging system is beamforming (BF), which is the most computationally challenging and materially expensive step. BF consists of delay calculation and apodization. For each volume location, to identify whether it comprises fully reflective (white voxel) or non-reflective (black voxel) tissue, it is first necessary to compute the two-way traveling delay of the sound wave from the sound origin to this location and back to each piezoelectric element on the transducer. Apodization is a weighting used to eliminate side lobes arising from the transducer's directivity function. Typically, apodization can be performed with a Hanning function, whose bell profile smoothly attenuates sensitivity towards the transducer edges. The width of the apodization profile can also expand with the imaging depth, optimizing resolution and minimizing clutter at all depths. Different systems, either commercial or research-based [1], have dealt with the processing demands of 3D BF by reducing the number of receive channels, which simplifies computation but sacrifices resolution. To date, there is no satisfactory answer for a portable, low-power, low-cost 3D US imaging system that can still process high-channel-count, or even full-resolution, probe readouts for better resolution and contrast.
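The two BF steps described above can be sketched in software as follows. This is a minimal illustration, not the hardware implementation: the speed of sound (1540 m/s), the element pitch, the placement of the sound origin at the probe center, and all function names are assumptions made for the example.

```python
import math

C = 1540.0   # assumed speed of sound in tissue, m/s
FS = 32e6    # sampling frequency from the text, Hz

def make_probe(n=32, pitch=0.2e-3):
    """n x n matrix probe in the z = 0 plane, centered on the origin (pitch is assumed)."""
    half = (n - 1) / 2.0
    return [((ix - half) * pitch, (iy - half) * pitch, 0.0, ix, iy)
            for iy in range(n) for ix in range(n)]

def two_way_delay_samples(voxel, element, origin=(0.0, 0.0, 0.0)):
    """Delay in samples: sound origin -> voxel -> receiving element."""
    tx = math.dist(origin, voxel)    # transmit path
    rx = math.dist(voxel, element)   # receive path
    return (tx + rx) / C * FS

def hann_weight(i, n):
    """Hann (Hanning) weight for element index i of n along one axis (n >= 2)."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))

def beamform_voxel(voxel, probe, echoes, n=32):
    """Delay-and-sum for one voxel; echoes[k] holds the samples of probe[k]."""
    total = 0.0
    for (x, y, z, ix, iy), samples in zip(probe, echoes):
        idx = round(two_way_delay_samples(voxel, (x, y, z)))
        if 0 <= idx < len(samples):
            # Separable 2D apodization: product of the two 1D Hann weights.
            total += hann_weight(ix, n) * hann_weight(iy, n) * samples[idx]
    return total
```

Each voxel needs one square root per element in this direct form, which is the cost that motivates the cheaper delay calculation discussed next.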

We have previously proposed an approach [2] to calculate delays more efficiently. Instead of attempting to compute trillions of square roots per second, this method calculates only a small reference set of delays (a few square roots produced by a Xilinx CORDIC IP) and then, leveraging geometric considerations, derives each delay sample from the reference with two additions. In this paper we show a scalable beamformer architecture capable of supporting over 1024 transducer elements in a single, latest-generation FPGA.
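To illustrate how a reference delay plus two additions can stand in for a square root, the sketch below uses a first-order (far-field) expansion, $|S - D| \approx r - x_D u_x - y_D u_y$, where $u$ is the unit vector towards the voxel. This expansion, the angle convention, the speed of sound, and the function names are assumptions chosen for the example; the exact scheme is the one described in [2].

```python
import math

C = 1540.0   # assumed speed of sound, m/s
FS = 32e6    # sampling frequency from the text, Hz

def direction(theta, phi):
    """Unit vector towards the voxel (assumed azimuth/elevation convention)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(phi),
            math.cos(theta) * math.cos(phi))

def exact_rx_delay(r, theta, phi, xd, yd):
    """Exact receive delay in samples: one square root per element."""
    ux, uy, uz = direction(theta, phi)
    return math.dist((r * ux, r * uy, r * uz), (xd, yd, 0.0)) / C * FS

def steering_terms(theta, phi, xd, yd):
    """Corrections in samples; each depends on one element coordinate only,
    so they could be precomputed per column and per row of the probe."""
    ux, uy, _ = direction(theta, phi)
    return (-xd * ux / C * FS, -yd * uy / C * FS)

def steered_rx_delay(ref_delay, cx, cy):
    """Approximate delay: the reference value plus two additions."""
    return ref_delay + cx + cy

if __name__ == "__main__":
    r, theta, phi = 0.05, 0.2, 0.1     # voxel at 50 mm depth
    xd, yd = 2e-3, -1e-3               # one element position
    ref = r / C * FS                   # reference delay for the probe center
    cx, cy = steering_terms(theta, phi, xd, yd)
    print(exact_rx_delay(r, theta, phi, xd, yd), steered_rx_delay(ref, cx, cy))
```

Under this expansion the neglected term grows roughly as $|D|^2 / (2r)$, so the approximation is weakest for outer elements at shallow depths.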

Fig. 1 shows the whole FPGA system, including our custom beamformer block, which communicates via an AXI interface. The overall system includes a MicroBlaze processor subsystem and an Ethernet interface that is presently used for all I/O. The proposed architecture of the beamformer is shown in Fig. 2. Table I shows results for the proposed beamformer architecture, including the resource utilization for reconstructing a 2.5M-voxel volume, using a 4MHz center frequency and a 32MHz sampling frequency, supporting a 32×32-element probe. The results show that a theoretical reconstruction rate of 50 volumes/s can be achieved. However, due to the bandwidth limitation of the Ethernet interface, the maximum ideal reconstruction rate is 14 volumes/s. Due to further protocol inefficiencies at the Ethernet and on-chip level, we can currently achieve a volume rate of 1-3 volumes/s, which is being continuously improved. One of our future goals is to establish the communication between the probe and the backend system using a higher-bandwidth and more efficient means such as PCIe. We have managed to fit all this into a single Kintex UltraScale KU040 [3] FPGA, which is unprecedented, with an estimated power consumption of around 4W, which is very suitable for a portable implementation. It should be noted that, compared to [2], we have optimized the BRAM utilization by sharing one BRAM between two receive channels instead of using one BRAM per channel. Further, we extrapolated the requirements of the architecture for 80×80 channels on a Virtex UltraScale XCVU190 [3].
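The figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes the beamformer completes one voxel per clock cycle; that is an interpretation of the numbers (125 MHz clock, 2.5M voxels, 50 volumes/s), not something stated in the text, and the function names are illustrative.

```python
def volume_rate(clock_hz, voxels_per_volume, voxels_per_cycle=1.0):
    """Volumes per second if `voxels_per_cycle` voxels finish each clock cycle (assumed)."""
    return clock_hz * voxels_per_cycle / voxels_per_volume

def delays_per_second(voxels_per_volume, elements, volumes_per_second):
    """Number of per-element delay values that must be produced each second."""
    return voxels_per_volume * elements * volumes_per_second

if __name__ == "__main__":
    vps = volume_rate(125e6, 2.5e6)                   # matches the 50 vps in Table I
    print(vps)
    print(delays_per_second(2.5e6, 32 * 32, vps))     # 32x32 probe
    print(delays_per_second(2.5e6, 80 * 80, vps))     # 80x80 probe
    print(14 / vps)                                   # Ethernet-limited fraction of the peak rate
```

At 50 volumes/s this gives on the order of 10^11 delay values per second for the 32×32 probe and close to 10^12 for 80×80, which is the scale of work that the reference-delay approach avoids computing with square roots.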



Fig. 1. The block diagram of the FPGA including the *Beamformer* custom block and its interconnection with other blocks.

TABLE I
BEAMFORMER ARCHITECTURE RESULTS.
\*Kintex UltraScale KU040 implementation results.
\*\*Virtex UltraScale XCVU190 extrapolated results.

| Supported<br>Channels | Logic<br>LUTs | Regs | BRAM | DSP  | Clock   | Volume<br>Rate |
|-----------------------|---------------|------|------|------|---------|----------------|
| 32×32\*               | 78%           | 25%  | 57%  | 0.3% | 125 MHz | 50 vps         |
| 80×80\*\*             | 86%           | 19%  | 45%  | 0.3% | 125 MHz | 50 vps         |

Fig. 2. Proposed architecture of the delay computation blocks. The receive delay is computed by applying steering coefficients to the calculated reference delay (a), then the echo samples indexed by the calculated $32 \times 32$ delays are summed to reconstruct a voxel (b).

Despite the computational and material efficiency of the proposed delay calculation, its approximation of the delays introduces errors of up to around $3\mu s$ (i.e. 96 samples off at a 32MHz sampling frequency) at some locations close to the probe and at broad angles (Fig. 3(a)). In order to account for the elements that need to be discarded due to either the side lobes generated by the directivity of the piezoelectric transducer or the inaccuracy introduced by the approximated delays, a trimmed apodization scheme has been designed [4]. The proposed apodization is tighter than usual, which minimizes the inaccuracy due to delay approximations while not affecting image quality severely. The model was initially developed by manually studying different voxels in the volume located at different $(r, \theta, \phi)$. A script was then used to iterate on the initial set of equations for further refinement (i.e. discarding as many "inaccurate" elements as possible while keeping as many of the "accurate" ones as possible). Fig. 3(b) shows a significant reduction of the inaccuracy in the delay calculation after applying the trimmed apodization. The proposed apodization requires 33 KB of memory (the same as the default expanding apodization), which makes it readily implementable on an FPGA.
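The selection criterion behind the trimmed apodization (keep an element only if its delay error stays within 2 samples, as in Fig. 3) can be illustrated with a brute-force mask. This is only a sketch of the criterion, not the closed-form model of [4]: the first-order delay approximation, the angle convention, the element pitch, the speed of sound, and all names are assumptions made for the example.

```python
import math

C, FS = 1540.0, 32e6     # assumed speed of sound (m/s); sampling rate from the text (Hz)
N, PITCH = 32, 0.2e-3    # 32x32 probe from the text; the pitch is an assumed value
MAX_ERR = 2.0            # samples, the threshold used in Fig. 3

def direction(theta, phi):
    """Unit vector towards the voxel (assumed azimuth/elevation convention)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(phi),
            math.cos(theta) * math.cos(phi))

def delay_error(r, theta, phi, xd, yd):
    """|exact - approximate| receive delay in samples for one element."""
    ux, uy, uz = direction(theta, phi)
    exact = math.dist((r * ux, r * uy, r * uz), (xd, yd, 0.0))
    approx = r - xd * ux - yd * uy      # assumed first-order approximation
    return abs(exact - approx) / C * FS

def trimmed_mask(r, theta, phi):
    """True where an element is accurate enough to contribute to this voxel."""
    half = (N - 1) / 2.0
    return [[delay_error(r, theta, phi,
                         (ix - half) * PITCH, (iy - half) * PITCH) <= MAX_ERR
             for ix in range(N)]
            for iy in range(N)]

def kept_fraction(mask):
    """Share of probe elements that survive the trimming."""
    return sum(map(sum, mask)) / (N * N)

if __name__ == "__main__":
    print(kept_fraction(trimmed_mask(0.2, 0.0, 0.0)))     # deep voxel
    print(kept_fraction(trimmed_mask(0.005, 0.0, 0.0)))   # voxel close to the probe
```

A mask computed this way would need to be evaluated per voxel, whereas the model described above reduces the decision to a compact set of equations in $(r, \theta, \phi)$ that fits in a small memory.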

In this demo, we will show the reconstruction of a 2.5M-voxel volume, using a 4MHz center frequency and a 32MHz sampling frequency, from echo samples derived from a simulated $32\times32$ transducer. The demo setup is presented in Fig. 4, where the 3D beamformer is implemented on the FPGA, with only the last processing step (scan conversion) performed on the laptop that currently displays the output images.

## ACKNOWLEDGMENT

The authors would like to acknowledge funding from the Swiss Confederation through the UltrasoundToGo project of the Nano-Tera.ch initiative.



Fig. 3. The inaccuracy in delay calculation becomes negligible by discarding all the echoes from elements that incur more than 2 samples of calculation inaccuracy. These plots show, for each voxel S on the XZ plane of the image, the percentage of such elements when using (a) a standard expanding-aperture apodization, (b) a tighter apodization as proposed in this paper. A significant inaccuracy reduction can be noticed. The remaining inaccuracy is confined to the edges of the image, which are clinically less essential.



Fig. 4. The setup of the beamformer demo. The beamformer is implemented on a Kintex UltraScale KU040 FPGA [3].

## REFERENCES

- [1] Philips Electronics N.V., "iE33 xMATRIX echocardiography system," www.healthcare.philips.com.
- [2] W. Simon, A. C. Yüzügüler, A. Ibrahim, F. Angiolini, M. Arditi, J.-P. Thiran, and G. De Micheli, "Single-FPGA, scalable, low-power, and high-quality 3D ultrasound beamformer," in *The 26th International Conference on Field-Programmable Logic and Applications (FPL)*, 2016.
- [3] Xilinx Inc., "Ultrascale FPGA: Product tables and product selection guide," 2016, http://www.xilinx.com/support/documentation/selection-guides/ultrascale-fpga-product-selection-guide.pdf#KU.

- [4] A. Ibrahim, F. Angiolini, M. Arditi, J.-P. Thiran, and G. De Micheli, "Apodization scheme for hardware-efficient beamformer," in *Proceedings of the 12th Conference on PhD Research in Microelectronics and Electronics (PRIME)*, 2016.