Real Time Fast Ultrasound Imaging Technology and Possible Applications  by Cruza, J.F. et al.
 Physics Procedia  63 ( 2015 )  79 – 84 
Available online at www.sciencedirect.com
1875-3892 © 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license 
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Ultrasonic Industry Association
doi: 10.1016/j.phpro.2015.03.013 
ScienceDirect
43rd Annual Symposium of the Ultrasonic Industry Association, UIA Symposium 2014 
Real time fast ultrasound imaging technology 
 and possible applications 
J. F. Cruza *, M. Perez, J. M. Moreno, C. Fritsch 
Instituto de Tecnologías Físicas y de la Información “Torres Quevedo” 
c/ Serrano, 144, 28006 Madrid (Spain) 
  
Abstract 
In this work, a novel hardware architecture for fast ultrasound imaging based on FPGA devices is proposed. A key 
difference from other approaches is the unlimited scalability in the number of active channels without performance losses. 
Acquisition and processing tasks share the same hardware, which eliminates communication bottlenecks and reduces size 
and power consumption. These features make the system suitable for the most demanding imaging applications, 
such as 3D Phased Array, Total Focusing Method, Vector Doppler, Image Compounding, High Speed Part Scanning 
and advanced elastographic techniques. A single medium-sized FPGA can beamform up to 200 scan lines 
simultaneously, which is enough to perform most of the above-mentioned applications in strict real time. 
 
Keywords: Fast ultrasound  imaging; High frame rate; Beamformer Architecture; Image Compounding; TFM; 3D Phased Array 
1. Introduction 
Ultrafast ultrasonic imaging is nowadays receiving increased attention due to the large number of new 
applications that could be addressed.  Ultrafast imaging assumes operation at rates above one thousand frames per 
second. Since the sound propagation velocity sets a physical limit to the frame rate, new imaging algorithms and 
hardware architectures with real-time parallel processing capabilities are required. 
 
 
* Corresponding author. Tel.: +3491 561 88 06; fax: +3491 411 76 51. 
E-mail address: jorge.f.cruza@csic.es 
The widely used phased array technique builds the image on a line-by-line basis, generating a single beam with a single 
focus in emission and dynamic focusing in reception for improved image quality. A single image line is obtained in 
the two-way transit time of the ultrasound pulse to the desired depth. For example, for a depth of 60 mm in biological 
tissue (propagation speed c ≈ 1500 m·s⁻¹), acquiring a single image line takes 80 μs; for an N=128-element array with 
linear scan, the time to acquire the image is over 10 ms, limiting the maximum frame rate to less than 100 images/s. 
Moreover, several emissions (F) with foci at different depths, followed by reception beamforming, are commonly 
used, which further reduces the imaging frame rate. 
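The arithmetic in this example can be reproduced with a short script. It is only an illustrative sketch: the function names and the value F = 4 are assumptions made here, not figures taken from the text.

```python
def line_time(depth_m, c=1500.0):
    """Two-way transit time (s) of the pulse to depth_m and back."""
    return 2.0 * depth_m / c


def frame_rate(depth_m, n_lines, n_foci=1, c=1500.0):
    """Images per second for line-by-line acquisition.

    n_foci is the number of emissions F used per image line.
    """
    return 1.0 / (n_lines * n_foci * line_time(depth_m, c))


t_line = line_time(0.060)  # 60 mm depth in biological tissue
print(f"{t_line * 1e6:.0f} us per image line")
print(f"{frame_rate(0.060, 128):.1f} images/s with a single focus")
print(f"{frame_rate(0.060, 128, n_foci=4):.1f} images/s with F = 4 (assumed value)")
```

With a single focus the result stays just below 100 images/s, and every additional emission focus divides the rate by the same factor.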
Ultrafast ultrasonic imaging opens opportunities for new applications. In the medical field, high frame rates are 
required for 3D/4D imaging (Wygant et al., 2006). It will also improve the evaluation of myocardial function by 
allowing accurate real-time imaging of the heart movements (Hasegawa and Kanai, 2011). New diagnostic tools 
linked to the viscoelastic properties of tissue, like Transient Elastography and Supersonic Shear Imaging, also 
require very high frame rate imaging, typically around 5000 frames/s (Tanter et al., 2002; Bercoff et al., 2004). A single reflection 
tomogram for breast cancer diagnosis by incoherent circular image compounding can be achieved in a few 
milliseconds; this allows taking hundreds of tomograms to image the full breast volume in a short time, which could 
be an alternative to mammography for breast cancer screening (Camacho et al., 2012). Automated Ductal 
Echography (Teboul, 2010) could also be achieved in a few seconds, thus improving medical practice. 
For Non Destructive Testing (NDT), ultrafast ultrasonic imaging will allow scanning parts at high speed with high 
spatial resolution. For example, obtaining one image per mm while scanning at 1 m/s requires a frame rate of 1000 
images/s. Scanning speed at this resolution is currently limited to less than 0.07 m/s (Smith et al., 2003). 
Achieving high frame rates requires new image formation algorithms linked to new processing architectures. The 
explososcan technique composes several image lines in parallel by adding small individual delays to every scan line 
(Shattuck et al., 1984). Simultaneous beamforming of four to sixteen lines has been reported using 2D arrays (von 
Ramm et al., 1991; Rasmussen et al., 2012), which increases the frame rate by a factor of 4 to 16. 
Another technique reported to increase the frame rate is Synthetic Aperture Imaging (SAI), also called Total 
Focusing Method (TFM). In this case, a single array element or a small de-focused aperture is used in emission to 
illuminate the whole region of interest, obtaining a low-resolution partial image with a larger aperture. After 
averaging K images taken from a set of different emitter positions, a high quality image is formed (Nikolov et al., 
2005). Assuming that the hardware resources are fast enough, the frame rate increases with regard to a multi-focal 
phased array by about N/K. SAI or TFM image formation is usually carried out by software following the acquisition 
phase. Some recent implementations based on GPUs achieve this task in real time (Yiu et al., 2011), but without 
considering the time involved in transferring the acquired data, which is currently the bottleneck of the procedure. 
However, the improvements in frame rate given by these techniques are insufficient for the requirements of ultra-fast 
imaging applications. Other recent ideas have allowed reaching rates of thousands of images/s. One is based on plane 
wave emission (simultaneous triggering of all array elements) followed by RF echo data recording for 
post-processing (Bercoff et al., 2004). Frame rates of 6000 images/s, sustained for a time limited by the available memory 
resources, have been reported, although with limited image quality due to the lack of emission focusing. 
Plane wave emission has also been used for coherent image compounding (Montaldo et al., 2009). In this case, 
the image is formed by superposition of RF images acquired with plane waves propagating at different angles. In 
this respect, the technique is analogous to SAI, producing images that are well focused in emission and reception at all 
depths with frame rates above 1000 images/s. 
These new concepts require the development of hardware with very fast performance to achieve these ultrafast 
imaging rates, which is the main objective of this work, where we propose a new architecture that exploits the 
possibilities offered by state-of-the-art FPGAs. 
2. Parallel beamforming for ultrafast imaging 
     Conventional phased array beamforming introduces time-varying delays to the signals received by array elements 
that compensate the two-way time-of-flight differences from every focus to every array element. This operation 
produces the aperture data, which are coherently added together (in RF) to provide the focused A-scan along the 
steering direction. The focusing delay Ti for element i must be modified for every output sample k to get strict 
Dynamic Depth Focusing (DDF, all samples in focus). Expressing Ti in sampling periods, the implemented operation 
is: 
$$ s(k) = \sum_{i=1}^{N} r_i[k - T_i(k)], \qquad (1) $$
where ri is the signal received by element i (possibly after multiplication by an apodization coefficient) and N is the 
number of elements of the active aperture. High timing resolution is required to keep the lobes caused by delay 
quantization low (Peterson and Kino, 1984). The delay Ti contains an integer and a fractional part. The latter can be conveniently 
obtained by means of a fractional delay filter, which splits the sampling period into an integer number L of 
sub-sampling intervals (Laakso et al., 1996). The interpolated sample is obtained by means of a FIR filter of P 
coefficients hm(j), 1 ≤ j ≤ P, that change with sample k for every element i and fraction m(k): 
$$ r_i[k - T_i(k)] = r_i\left[k - Z_{i,k} - \frac{m_{i,k}}{L}\right] \approx \sum_{j=1}^{P} h_{m_{i,k}}(j)\, r_i[n_{i,k} - j], \qquad (2) $$
where Z represents the integer part and m/L is the fractional part of the delay. Inserting (2) in (1), 
$$ s(k) = \sum_{i=1}^{N} \sum_{j=1}^{P} h_{m_{i,k}}(j)\, r_i[n_{i,k} - j] \qquad (3) $$
Changing the order of the summations: 
$$ s(k) = \sum_{j=1}^{P} \sum_{i=1}^{N} h_{m_{i,k}}(j)\, r_i[n_{i,k} - j] \qquad (4) $$
Interpolation and the N-channel beamforming processes can therefore be interleaved. In a simple implementation driven 
by a clock P times faster than the sampling frequency, an output sample containing the coherent sum of the N signals 
is obtained every P cycles. Moreover, multiple image lines can be obtained simultaneously with the same hardware. 
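The equivalence between Eqs. (3) and (4) can be checked numerically. The sketch below is illustrative only: it assumes cubic Lagrange interpolation for the coefficients h (the text does not specify the filter design), the values P = 4 and L = 8, an index convention n = k - Z + 2 for the filter taps, and synthetic sinusoidal channel data with fixed delays.

```python
import math

P, L = 4, 8  # filter taps and sub-sampling intervals (assumed values)


def lagrange_taps(m):
    """P-tap Lagrange filter approximating a delay of 1 + m/L samples."""
    d = 1.0 + m / L
    return [math.prod((d - q) / (t - q) for q in range(P) if q != t)
            for t in range(P)]


H = [lagrange_taps(m) for m in range(L)]  # coefficient memory, one row per fraction m


def split_delay(T):
    """Split a delay T (in sampling periods) into integer part Z and fraction index m."""
    steps = round(T * L)
    return steps // L, steps % L


def sample(r, n):
    """Return r[n], or zero outside the recorded trace."""
    return r[n] if 0 <= n < len(r) else 0.0


def beamform(channels, delays, k):
    """Eq. (3): interpolate every channel, then add the channels."""
    s = 0.0
    for r, T in zip(channels, delays):
        Z, m = split_delay(T(k))
        n = k - Z + 2  # assumed index convention for n_{i,k}
        s += sum(H[m][j - 1] * sample(r, n - j) for j in range(1, P + 1))
    return s


def beamform_swapped(channels, delays, k):
    """Eq. (4): tap index j outermost, channel sum innermost."""
    s = 0.0
    for j in range(1, P + 1):
        for r, T in zip(channels, delays):
            Z, m = split_delay(T(k))
            s += H[m][j - 1] * sample(r, k - Z + 2 - j)
    return s


def x(t):
    """Synthetic echo: a sinusoid with 16 samples per period."""
    return math.sin(2 * math.pi * t / 16.0)


taus = [0.0, 1.25, 2.5, 3.75]  # hypothetical focusing delays, in sampling periods
channels = [[x(n + tau) for n in range(64)] for tau in taus]
delays = [(lambda k, tau=tau: tau) for tau in taus]

a = beamform(channels, delays, 36)
b = beamform_swapped(channels, delays, 36)
print(a, b, len(taus) * x(36))
```

Both orderings give the same sum up to floating-point rounding, and the result is close to N times the aligned signal value, which is what allows the accumulation to run along a chain of multiply-add cells.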
Figure 1 shows a simple architecture that obtains the interpolated data sk from the received samples rk for one of 
the N channels. Four DSP-cells are used for this purpose (P=4) and the coefficients h are read from a local memory in 
every cell. The number L of coefficients provides the timing resolution but, in practice, L is chosen to be of the same order 
as P. Once the latency period has elapsed (9 cycles), an interpolated sample is obtained at every sampling clock 
cycle. The focusing hardware chooses one among the set of L coefficients at every clock cycle, common to all the cells 
for the same sample, which is easily achieved by shifting the h-value used in one cell to the next one. 
 
 
 
 
 
 
 
 
Fig. 1.  Implementation of the interpolation filter with DSP cells. 
The chain in Fig. 1 is extended to include other channels, as in Fig. 2. This architecture does not require logic 
resources external to the DSP-cells other than the memories that store the set of coefficients and the small delays 
required to compensate latency. Given the high operating frequency of the DSP-cells, which is several times the 
sampling frequency, several scan lines can be beamformed simultaneously with the same hardware by 
time-multiplexing. If the switching frequency of the DSP-cells is fDSP and the sampling rate is fS, the ratio R=fDSP/fS yields 
the number of lines that can be beamformed concurrently. The switching frequency of Xilinx mid-range and 
high-end FPGAs is at least 400 MHz. Then, for arrays of up to 10 MHz and a sampling frequency of 40 
MHz, R=10 and ten scan lines can be simultaneously beamformed with the hardware shown in Fig. 2. 
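As an illustration of the time-multiplexing idea, the following sketch computes R and lists how the DSP clock cycles inside each sampling period could be shared among scan lines. The round-robin assignment and the function names are assumptions made for this example; the text does not state the actual scheduling policy.

```python
def lines_per_beamformer(f_dsp_hz, f_s_hz):
    """R = f_DSP / f_S: DSP clock cycles available per sampling period."""
    return int(f_dsp_hz // f_s_hz)


def slot_schedule(f_dsp_hz, f_s_hz, n_sample_periods):
    """Assign each DSP clock cycle to a (sample index, scan line) pair.

    Round-robin assignment is an assumed policy, used only for illustration.
    """
    R = lines_per_beamformer(f_dsp_hz, f_s_hz)
    return [(cycle // R, cycle % R) for cycle in range(R * n_sample_periods)]


R = lines_per_beamformer(400e6, 40e6)
print(f"R = {R} scan lines per beamformer")
for cycle, (k, line) in enumerate(slot_schedule(400e6, 40e6, 2)):
    print(f"DSP cycle {cycle:2d}: sample k={k}, scan line {line}")
```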
Fig. 2. A systolic architecture for interpolation and beamforming on N-channels based on DSP-cells. 
3. Memory resources 
     Every channel must have a memory to compensate the differences in focusing delays, usually implemented as a 
FIFO. For an aperture D, the maximum time difference is 2D/c and the FIFO depth is given by: 
$$ M = \frac{2D}{c} f_S \qquad (6) $$
The sampling rate fS must meet the Nyquist criterion with regard to the signal frequency fR and, typically, fS = 4fR. 
With λ = c/fR = 4c/fS, this gives M = 8D/λ. 
For a maximum aperture D=128λ, M = 1 K samples. This memory can be implemented in FPGA using a single 
Block-RAM. Thus, a single beamformer would require N Block-RAMs and the simultaneous composition of R scan 
lines would demand N·R Block-RAMs. For example, beamforming R=32 lines for an N=128 element array would 
require 4096 Block-RAMs, a figure too high even for high density FPGAs. 
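The memory cost of the FIFO-per-line approach follows directly from these expressions. The short sketch below reproduces the figures quoted above; the function names are illustrative, and it assumes, as the text does, that one Block-RAM holds the M samples of one FIFO.

```python
def fifo_depth(aperture_in_wavelengths, samples_per_period=4):
    """M = (2D/c) * f_S with f_S = samples_per_period * f_R.

    Expressed in wavelengths this is M = 2 * samples_per_period * D / lambda.
    """
    return 2 * samples_per_period * aperture_in_wavelengths


def fifo_block_rams(n_channels, n_lines):
    """One Block-RAM FIFO per channel and per simultaneously formed scan line."""
    return n_channels * n_lines


M = fifo_depth(128)  # D = 128 wavelengths, f_S = 4 f_R
print(f"FIFO depth M = {M} samples")
print(f"Block-RAMs, single line: {fifo_block_rams(128, 1)}")
print(f"Block-RAMs, R = 32 lines: {fifo_block_rams(128, 32)}")
```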
A different approach is followed in the proposed architecture. It takes into account that the time distance between 
the input samples that compose a given output sample in different lines is lower than the maximum delay, so that 
all of them are held in the delay-compensating memory. In other words, the beamformer has access to any sample taken up 
to M sampling clock periods before the current one, for any scan line. 
For this purpose, instead of having a FIFO for every scan line, a single ring buffer with M positions is assigned to 
every channel. The contents of this memory are provided to every beamformer input r(i) in Fig. 3 as required for the 
current sample and scan line, under control of the focusing logic. This approach requires, theoretically, N instead of 
N·R Block-RAMs to beamform simultaneously any number R of scan lines. However, the Block-RAM bandwidth 
(fRAM) limits the number of data samples provided in every sampling period to S = fRAM/fS. Since the multi-beamformer 
requires R samples per sampling clock period, a number B = ⌈R/S⌉ of Block-RAMs per channel achieves the required 
throughput. Typically, Xilinx Block-RAMs in current FPGAs operate above 200 MHz. Thus, the example above 
(fS = 40 MHz, R=10) gives S=5 and B=2, that is, two Block-RAMs are required per channel, increasing the 
theoretical number of required Block-RAMs to B·N = 256 for N=128, a number that fits well in medium-sized FPGAs. 
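A software model may help to picture the shared ring buffer. The sketch below is a behavioural illustration, not the VHDL implementation: the class, the per-line delays and the helper names are hypothetical, and the Block-RAM count simply evaluates B = ceil(R/S) for the example figures.

```python
import math


class RingBuffer:
    """Per-channel ring buffer that keeps the last `depth` samples."""

    def __init__(self, depth):
        self.depth = depth
        self.data = [0.0] * depth
        self.count = 0  # total number of samples written so far

    def write(self, sample):
        self.data[self.count % self.depth] = sample
        self.count += 1

    def read_back(self, delay):
        """Return the sample written `delay` sampling periods before the newest one."""
        if not 0 <= delay < min(self.depth, self.count):
            raise IndexError("delay outside the buffered window")
        return self.data[(self.count - 1 - delay) % self.depth]


def block_rams_per_channel(n_lines, f_ram_hz, f_s_hz):
    """B = ceil(R / S), where S = f_RAM / f_S reads fit in one sampling period."""
    reads_per_period = int(f_ram_hz // f_s_hz)
    return math.ceil(n_lines / reads_per_period)


buf = RingBuffer(depth=1024)
for n in range(3000):
    buf.write(float(n))  # the sample value equals its acquisition index

line_delays = [0, 17, 500, 1023]  # hypothetical delays requested by four scan lines
print([buf.read_back(d) for d in line_delays])

B = block_rams_per_channel(10, 200e6, 40e6)
print(f"B = {B} Block-RAMs per channel, {B * 128} for N = 128 channels")
```

Every scan line reads the same stored samples with its own delay, so the memory depth does not grow with R; only the read bandwidth does, which is what B accounts for.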
Figure 3 summarizes these results and shows the proposed architecture. Data from the A/D converters included in 
the Analog Front End (AFE) are temporarily stored in a ring buffer (~ 2 BRAMs) addressed by the focusing logic to 
provide the samples required by every scan line. This logic also selects the set of coefficients for the beamformer 
cell of the current channel. Every beamformer cell has the structure shown in Figs. 1 and 2, where the pipeline 
registers are replaced by small buffers. 
With this architecture, the same hardware can be used for different applications, with a range of parallel 
beamforming capabilities given by the ratio R as long as the ring buffer is given the same bandwidth by using 
multiple Block-RAMs. Furthermore, mid-range and high-end FPGAs can integrate several parallel beamformers of 
this kind. This naturally multiplies the number of lines simultaneously beamformed. For example, if the application 
parameters yield R=32 and the FPGA's available resources fit 4 beamformers (mid-range FPGA), 128 scan lines can 
be built simultaneously at the sampling rate. This allows performing TFM in real-time. 
 
 
 
 
 
 
 
 
 
 
 
 
Fig. 3. Structure of the DSP-cells based parallel beamforming architecture. 
4. Implementation and validation 
The parallel beamforming architecture has been coded in VHDL. In order to get maximum performance for 
different applications and on different devices, some parameters of the architecture are user defined: number of channels, 
filter order, ring buffer depth and working frequency. Other parameters can be changed dynamically, 
such as the sampling rate, which defines the ratio R with regard to the device performance. 
A 32 channel beamformer has been implemented in a Xilinx XC7K325T-2 FPGA. To take advantage of the high 
DSP-cell switching frequency, two ring buffers and two focusing circuits have been attached to each filter to adapt 
their bandwidths. With this configuration, filters work at 400 MHz, while ring buffers and focusing circuits operate 
at 200 MHz. A single beamformer builds simultaneously 5 scan lines for a 10 MHz array, or 20 scan lines for a 2.5 
MHz array. Resource utilization in this device for this implementation is as shown in Table 1. 
Table 1. Resource utilization for a 32-channel beamformer in a XC7K325T-2 
             Slices   Registers   LUTs     DSP-cells   Block-RAMs
Used         4629     11632       11554    64          32
Available    50950    407600      203200   840         890
% Used       9.1%     2.9%        5.7%     7.6%        3.6%
 
A number of parallel beamformers can be integrated in a single device, depending on the available 
resources. In this case, the XC7K325T-2 FPGA can accommodate up to 10 parallel beamformers of 32 channels for 
signals up to 10 MHz, getting up to 50 scan lines in real time. For signals up to 2.5 MHz, 200 scan lines can be 
obtained in parallel, which allows implementing TFM in real time. The implementation was validated with synthetic 
signals on a Xilinx KC705 evaluation board (XC7K325T-2 FPGA): results read out via a JTAG probe were compared 
with those of a Matlab® software reference, which yielded exactly the same image. 
It must be highlighted that the proposed architecture is fully scalable by increasing the number of FPGAs and 
analog front-ends. For this purpose, the communication bandwidth among FPGAs should be equal to that of the 
DSP-cells on a bus wide enough to accommodate partial results (typically, 20 bits). This calls for communication 
bandwidths of 480 × 20 = 9.6 Gb/s. Such a high rate can be attained using one or more of the available GTX pairs on 
the same FPGA, which can run at up to 10 Gb/s each. 
5. Conclusions 
A new beamforming architecture that takes advantage of the availability of high-performance DSP-cells in 
state-of-the-art FPGAs has been proposed. It performs focusing delay interpolation and beamforming simultaneously in a 
chain of DSP-cells, using the high switching speed of the DSP-cells for parallel beamforming by time-multiplexing 
the same resources. Multiple parallel beamformers fit in medium sized FPGAs, their number being given by the 
ratio of the DSP-cell bandwidth to the sampling frequency. The architecture is fully scalable to deal with different 
demands and applications. A full DDF image can be obtained in real time, providing a good alternative to software-
based TFM by avoiding the bottleneck of data transfer from the analogue front-end to the processing platform. 
Acknowledgements 
This work was carried out within the project EuroStar E!6771, funded by the EU, MINECO and DASEL, S.L. 
References 
Bercoff, J., Tanter, M., Fink, M., 2004. Supersonic Shear Imaging: A New Technique for Soft Tissue Elasticity Mapping, IEEE Trans. UFFC, 51, 
4, 396-409. 
Camacho, J., Medina, L., Cruza, J.F., Moreno, J.M., Fritsch, C., 2012. Multimodal Ultrasonic Imaging for Breast Cancer Detection, Archives of 
Acoustics, 37, 3, 253-260. 
Hasegawa, H., Kanai, H., 2011. High-frame rate echocardiography using diverging transmit beams and parallel receive beamforming, J. of 
Medical Ultrasonics, May, 1-12. 
Laakso, T. I., Valimaki, V., Karjalainen, M., Laine, U.K., 1996. Splitting the Unit Delay, IEEE Signal Processing Magazine, Jan., 30-58. 
Montaldo, G., Tanter, M., Bercoff, J., Benech, N., Fink, M., 2009. Coherent Plane-Wave Compounding for Very High Frame Rate Ultrasonography 
and Transient Elastography, IEEE Trans. UFFC, 56, 3, 489-506. 
Nikolov, M., Behar, V., 2005. Analysis and optimization of synthetic aperture ultrasound imaging using the effective aperture approach, Int. J. 
Information Theory & Applications, 12, 257-265. 
Peterson, D.K., Kino, G.S., 1984. Real-Time Digital Image Reconstruction: A Description of Imaging Hardware and an Analysis of Quantization 
Errors, IEEE Trans. Sonics Ultrason., 31, 4, 337-351. 
Rasmussen, M. F., Férin, G., Dufait, R., Jensen, J.A., 2012. Comparison of 3D Synthetic Aperture Imaging and Explososcan using Phantom 
Measurements, IEEE Int. Ultrasonics Symp. Proc., 113-116. 
Shattuck, D. P., Weinshenker, M.D., Smith, S. W., von Ramm, O. T., 1984. Explososcan: a parallel processing technique for high speed ultrasound 
imaging with linear phased arrays. 
Smith, R. A., Bending, J. M., Jones, L. D., Harman, T. R. C., Lines, D. I. A., 2003. Rapid ultrasonic scan of ageing aircraft, Insight-Non 
Destructive Testing and Condition Monitoring, 45, 3, 174-177. 
Tanter, M., Bercoff, J., Sandrin, L., Fink, M., 2002. Ultrafast Compound Imaging for 2-D Vector Estimation: Application to Transient 
Elastography. IEEE Trans. UFFC, 49, 10, 1363-1374. 
Teboul, M., 2010. Advantages of Ductal Echography (DE) over Conventional Breast Investigation in the diagnosis of breast malignancies, 
Medical Ultrasonography, 12, 1, 32-42. 
von Ramm, O. T., Smith, S. W., Pavy, H. G., 1991. High-Speed Ultrasound Volumetric Imaging System. Part II: Parallel Processing and Image 
Display. IEEE Trans. UFFC., 38, 2, 109-115. 
Wygant, I.O., Karaman, M., Oralkan, O., Khuri-Yakub, B., 2006. Beamforming and hardware design for a multichannel front-end integrated 
circuit for real-time 3D catheter-based ultrasonic imaging, in Medical Imaging, Proc. SPIE, 6147, 61470A-1. 
Yiu, B. Y. S., Tsang, I. K. H., Yu, A. C. H., 2011. GPU-Based Beamformer: Fast Realization of Plane Wave Compounding and Synthetic Aperture 
Imaging, IEEE Trans. UFFC, 58, 8, 1698-1705. 
 
