GPU-based beamformer: Fast realization of plane wave compounding and synthetic aperture imaging by Yu, ACH et al.
Title GPU-based beamformer: Fast realization of plane wave compounding and synthetic aperture imaging
Author(s) Yiu, BYS; Tsang, IKH; Yu, ACH
Citation IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2011, v. 58 n. 8, p. 1698-1705
Issued Date 2011
URL http://hdl.handle.net/10722/139255
Rights
©2011 IEEE. Personal use of this material is permitted. However,
permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for
resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be
obtained from the IEEE.
IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 58, no. 8, August 2011, p. 1698
0885–3010/$25.00 © 2011 IEEE
Abstract—Although they show potential to improve ultra-
sound image quality, plane wave (PW) compounding and syn-
thetic aperture (SA) imaging are computationally demanding 
and are known to be challenging to implement in real-time. In 
this work, we have developed a novel beamformer architecture 
with the real-time parallel processing capacity needed to en-
able fast realization of PW compounding and SA imaging. The 
beamformer hardware comprises an array of graphics process-
ing units (GPUs) that are hosted within the same computer 
workstation. Their parallel computational resources are con-
trolled by a pixel-based software processor that includes the 
operations of analytic signal conversion, delay-and-sum beam-
forming, and recursive compounding as required to generate 
images from the channel-domain data samples acquired using 
PW compounding and SA imaging principles. When using two 
GTX-480 GPUs for beamforming and one GTX-470 GPU for 
recursive compounding, the beamformer can compute com-
pounded 512 × 255 pixel PW and SA images at throughputs of 
over 4700 fps and 3000 fps, respectively, for imaging depths of 
5 cm and 15 cm (32 receive channels, 40 MHz sampling rate). 
Its processing capacity can be further increased if additional 
GPUs or more advanced models of GPU are used.
I. Introduction
Ultrasound imaging is conventionally based upon a pulse-echo sensing mechanism that sequentially acquires image data over a group of beamlines [1]. This imaging paradigm typically allows a single transmit axial focus to be defined. If better focusing quality is desired, then the pulse-echo sensing is repeated a few times over each beamline with different nominal transmit focusing locations. However, in doing so, the overall data acquisition time is lengthened concomitantly, meaning that the imaging frame rate is reduced. To improve the image quality without affecting the frame rate, it is necessary to make use of non-beamline-based imaging paradigms that can form focused broad-view images without requiring additional pulse-echo firings. One approach is to use plane-wave (PW) compounding principles that insonate broad-field pulses with different steering angles [2], [3]. Another non-beamline-based imaging method that has been proposed is the synthetic aperture (SA) imaging technique, which transmits unfocused point-source firings from different lateral positions [4]. For both methods, each firing’s pulse echoes are acquired over all array channels; because the transmit firings are essentially broad-field insonations, the received channel-domain raw data can be used to form a low-resolution image (LRI) by performing delay-and-sum beamforming at each pixel position. High-resolution images (HRI) with whole field-of-view focusing may also be obtained computationally via recursive summation of a series of LRIs [5].
Although PW compounding and SA imaging have shown potential to improve image quality, their real-time realization is inherently a nontrivial implementation task. One technical challenge that concerns many system developers is the massive computational demand of these imaging methods compared with conventional ultrasound image formation [6]. Because both PW compounding and SA imaging beamform every pixel directly from the channel-domain data based on different sets of focusing delays [3], [4], their image formation process is inherently more complicated than the conventional approach, which works on a line-by-line basis using the same set of focusing delays. In existing ultrasound scanners, field programmable gate arrays (FPGAs) and digital signal processors (DSPs) are usually used to handle computations related to real-time image formation [7]. Nevertheless, they are merely intended to work with the conventional beamline-based imaging paradigm, so their computational capacity is not sufficient to facilitate all of the computation processes required for advanced ultrasound imaging methods. Thus, it is necessary to develop another real-time computing platform to address this computational bottleneck.
Recently, the emergence of graphics processing units (GPUs) has spurred the pace of development in high-performance computing because they can be readily converted into parallel processors through the use of application programming interfaces provided by the vendors. Leveraging this powerful computational hardware, we present in this paper a GPU-based beamformer architecture that can carry out the image formation steps of PW compounding and SA imaging at real-time processing frame rates. This is in line with our intent to develop a programmable software processor module that can form ultrasound images from channel-domain data samples that are acquired in their RF form using an array transducer.
Billy Y. S. Yiu, Member, IEEE, Ivan K. H. Tsang, and Alfred C. H. Yu, Member, IEEE
Manuscript received May 9, 2011; accepted May 17, 2011. This work was supported in part by the Hong Kong Innovation and Technology Fund (ITS/492/09, InP/210/10, InP/211/10). All three authors have contributed equally to the preparation of this article.
The authors are with the Medical Engineering Program, The University of Hong Kong, Pokfulam, Hong Kong SAR (e-mail: alfred.yu@hku.hk).
Digital Object Identifier 10.1109/TUFFC.2011.1999
II. Beamforming Principles for PW Compounding and SA Imaging
A. Overall Description
To facilitate discussion of our beamformer architecture, we begin by reviewing the mathematical principles of PW compounding and SA imaging. The overall goal of these two imaging paradigms is to derive HRIs that are compounded recursively from a set of LRIs. For the $i$th HRI and a compound frame size of $M$, the image value for pixel $P_o$ can be denoted mathematically as:
$$H_i(P_o) = \sum_{m=i-M+1}^{i} L_m(P_o), \tag{1a}$$
where $L_m(P_o)$ represents the corresponding pixel value in the $m$th LRI. By inspection, the recursive form of (1a) is given by
$$H_i(P_o) = H_{i-1}(P_o) + L_i(P_o) - L_{i-M}(P_o), \tag{1b}$$
where $H_{i-1}(P_o)$, $L_i(P_o)$, and $L_{i-M}(P_o)$ are, respectively, the pixel value for the previous HRI, the latest LRI, and the earliest LRI in the compounding group. In general, $L_m(P_o)$ can be computed from the channel-domain RF data received for a particular transmit firing (steered planar pulses or point-source excitations, depending on whether PW compounding or SA imaging is performed). This computation process involves multiple stages which are described in the next two subsections.
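As an illustration of the recursive update in (1b), the following minimal Python sketch keeps a sliding window of the last M LRIs and produces one HRI per incoming LRI. The function name `recursive_compound` and the flat-list image layout are assumptions made for this example; they are not taken from the beamformer software described in this paper, which is written in C++ with CUDA.

```python
from collections import deque

def recursive_compound(lri_stream, M):
    """Yield one HRI per incoming LRI using the recursive update of (1b).

    Each LRI is a flat list of pixel values (real or complex).
    Until M LRIs have arrived, the HRI is the sum of those seen so far.
    """
    window = deque()   # the LRIs currently inside the compounding group
    hri = None
    for lri in lri_stream:
        if hri is None:
            hri = [0.0] * len(lri)
        # H_i = H_{i-1} + L_i
        hri = [h + l for h, l in zip(hri, lri)]
        window.append(lri)
        if len(window) > M:
            # ... - L_{i-M}: drop the earliest LRI from the running sum
            oldest = window.popleft()
            hri = [h - o for h, o in zip(hri, oldest)]
        yield list(hri)
```

Each update touches every pixel twice (one addition, one subtraction) regardless of M, whereas direct evaluation of (1a) would need M additions per pixel.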
B. Analytic Signal Conversion
As a precursor step in LRI formation, the analytic signal is first computed for each channel-domain RF data vector (with $K$ depth samples). This quantity requires computation of the RF signal’s Hilbert transform, which can, in practice, be found via a finite-impulse response (FIR) filtering operation whose impulse response is set equal to the following definition of the Hilbert transform:
$$h(l) = \begin{cases} 0 & \text{for even } l \\ 2/(\pi l) & \text{for odd } l. \end{cases} \tag{2}$$
As such, for the $k$th depth sample in the $n$th receive channel for the $m$th transmit firing, each Hilbert-transformed data sample is essentially given by the following convolution output:
$$\rho_{n,m}(k) = \sum_{l=k}^{k+D-1} h(l)\, r_{n,m}(l), \tag{3}$$
where $r_{n,m}(l)$ denotes the corresponding channel-domain RF sample and $D$ is the number of taps in the FIR filter kernel.
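The FIR-based Hilbert transform of (2) and (3) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' CUDA kernel: the kernel is taken to be zero-centred with an odd number of taps, samples beyond the ends of the RF vector are treated as zero, and the analytic sample is formed as the RF sample plus j times its Hilbert-transformed value. The names `hilbert_taps` and `analytic_signal` are hypothetical.

```python
import math

def hilbert_taps(D):
    """Zero-centred FIR taps following (2): 0 for even l, 2/(pi*l) for odd l."""
    half = (D - 1) // 2
    return [0.0 if l % 2 == 0 else 2.0 / (math.pi * l)
            for l in range(-half, half + 1)]

def analytic_signal(rf, D=51):
    """Return complex analytic samples for one channel of RF data."""
    taps = hilbert_taps(D)
    half = (D - 1) // 2
    out = []
    for k in range(len(rf)):
        acc = 0.0
        for i, h in enumerate(taps):
            j = k - (i - half)          # convolution index k - l
            if 0 <= j < len(rf):        # assumption: zero padding at the edges
                acc += h * rf[j]
        out.append(complex(rf[k], acc))
    return out
```

For a 10 MHz tone sampled at 40 MHz, the magnitude of the resulting samples stays close to the tone amplitude away from the edges, which is the envelope-preserving behaviour expected of an analytic signal.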
C. Delay-and-Sum Procedure
Given the analytic channel-domain signals, delay-and-sum beamforming is performed to compute the LRI of the $m$th transmit firing. This procedure can essentially be considered as a weighted summation of interpolated channel-domain samples $\alpha_{n,m}(P_o)$ for all $N$ array channels. For the pixel $P_o$, its value can be expressed as follows for an apodization weight $w_n$ in the $n$th channel:
$$L_m(P_o) = \sum_{n=1}^{N} w_n \cdot \alpha_{n,m}(P_o). \tag{4}$$
In the simplest case, $\alpha_{n,m}(P_o)$ can be found via a linear interpolation of the two adjacent analytic data samples $a_{n,m}(\kappa)$ and $a_{n,m}(\kappa + 1)$ that correspond most closely with the pixel’s focusing delay in the $n$th channel. This is mathematically given by
$$\alpha_{n,m}(P_o) = \lambda \cdot a_{n,m}(\kappa) + [1 - \lambda] \cdot a_{n,m}(\kappa + 1), \tag{5a}$$
where $\kappa$ is the proximal depth sample number corresponding to the focusing delay $\tau_{n,m}(P_o)$, and $\lambda$ is the interpolation weight (between 0 and 1). For a given RF sampling rate $f_s$, these two quantities can be found as
$$\kappa = \lfloor f_s \cdot \tau_{n,m}(P_o) \rfloor, \tag{5b}$$
$$\lambda = \kappa + 1 - f_s \cdot \tau_{n,m}(P_o), \tag{5c}$$
where $\lfloor \cdot \rfloor$ represents a floor operator that gives the largest integer not greater than the operand.
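The interpolation in (5a)–(5c) amounts to only a few lines of code. The sketch below is a hypothetical helper (`interpolate_sample` is not a name from the paper) that assumes the delay falls inside the recorded data, so that both neighbouring samples exist.

```python
import math

def interpolate_sample(a, tau, fs):
    """Linearly interpolate analytic data `a` at focusing delay `tau` (seconds)."""
    position = fs * tau                  # delay expressed in samples
    kappa = math.floor(position)         # (5b): proximal sample index
    lam = kappa + 1 - position           # (5c): weight of the earlier sample
    # (5a); assumes 0 <= kappa and kappa + 1 < len(a)
    return lam * a[kappa] + (1 - lam) * a[kappa + 1]
```

For instance, a delay of 1.25 samples gives a weight of 0.75 to sample 1 and 0.25 to sample 2.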
In PW compounding and SA imaging, the beamforming delay to be used for each channel corresponds to the two-way time-of-flight from the transmit source to the pixel position and back to receive element position [3]–[5]. For a pixel with lateral-depth position coordinates $\{x_o, z_o\}$, this is equal to the following for the $n$th channel at the $m$th firing:
$$\tau_{n,m}(P_o) = \frac{d_T(P_o; m) + d_R(P_o; n)}{c_o}, \tag{6a}$$
where $d_T(P_o; m)$ and $d_R(P_o; n)$ respectively denote the transmit propagation distance for the $m$th firing and the receive propagation distance for the $n$th channel with respect to the pixel of interest. These two distance quantities can be found from geometrical principles as
$$d_T(P_o; m) = \sqrt{(x_o - x_T(m))^2 + (z_o - z_T(m))^2}, \tag{6b}$$
$$d_R(P_o; n) = \sqrt{(x_o - x_R(n))^2 + (z_o - z_R(n))^2}, \tag{6c}$$
for a given transmit-center position $\{x_T(m), z_T(m)\}$ and a receiver position $\{x_R(n), z_R(n)\}$.
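A direct transcription of (6a)–(6c) is given below as a minimal sketch. It follows the equations as written, treating the transmit center as a point; the function name `two_way_delay`, the tuple-based coordinates in metres, and the default speed of 1540 m/s (the nominal value quoted later in this paper) are choices made for the example.

```python
import math

def two_way_delay(pixel, tx_pos, rx_pos, c=1540.0):
    """Two-way time of flight (seconds) from transmit center to pixel to receiver."""
    xo, zo = pixel
    d_t = math.hypot(xo - tx_pos[0], zo - tx_pos[1])   # (6b)
    d_r = math.hypot(xo - rx_pos[0], zo - rx_pos[1])   # (6c)
    return (d_t + d_r) / c                             # (6a)
```

With the transmit center and receiver both at the origin and a pixel 5 mm away, the delay is 10 mm divided by the acoustic speed, or roughly 6.5 µs.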
III. GPU Beamformer Architecture
A. Hardware Setup
Fig. 1 shows a high-level illustration of the hardware organization for our GPU-based beamformer that is intended to facilitate real-time realization of PW compounding and SA imaging. As can be seen, the GPUs are housed inside a PC workstation. They are connected as expansion boards through PCI-Express buses with real-time data-transfer bandwidth (maximum of 8 GB/s for 16-lane buses). Their parallel processing resources are managed through a software-based application programming interface known as CUDA (Compute Unified Device Architecture; NVIDIA, Santa Clara, CA). Note that the computational hardware involved in this beamformer is inherently different from those seen in a few ultrasound research platforms that are based on computer clusters [8], pipelined DSP networks [9], distributed groups of FPGAs [10], and multi-core CPUs [11].
During operation, our GPU-based beamformer takes in channel-domain RF data acquired from an array transducer and calculates HRIs recursively in real-time. In approaching this task, the beamformer’s computational resources have been partitioned into two sectors in which one of the GPUs in the group is designated for HRI processing and the rest are responsible for LRI processing (as shown in Fig. 1). The channel-domain data samples are presumed to be first streamed from the scanner’s front-end electronics to the PC’s RAM. They are transferred into each LRI-processing GPU in frame batches (controlled by a master CPU) to compute a set of consecutive LRIs, which are subsequently fed into the HRI-processing GPU for recursive compounding with other LRIs. The HRIs are then shown on the PC display in real-time and are sent to the storage device for archival.
B. Computation of Low-Resolution Images
To facilitate computation of LRIs in our beamformer, a batch-based pipelining approach has been adopted. Each available GPU in the LRI-processing group is assigned to handle one batch of channel-domain data that are acquired from a set of transmit firings. As described in Section II, for the processing of each LRI, the beamformer must perform the operations of analytic signal conversion and delay-and-sum; the GPU processing architecture for this is described as follows.
1) Analytic Signal Conversion: Fig. 2(a) gives a conceptual illustration of how this stage is implemented on the GPU. As can be seen, one block of threads in the GPU is assigned to compute the analytic signal for one channel of pre-beamform RF data. In turn, each thread in the block is instructed to carry out a Hilbert transform operation [see (2) and (3)] and derive the analytic signal samples for a consecutive group of depth samples in the same channel (group size is kept down to a few samples to achieve a high parallel processing efficiency).
Fig. 1. System-level overview of the hardware setup for the GPU-based beamformer. During operation, each frame of channel-domain data is fed into an idle GPU in the LRI processing group to facilitate beamforming. The HRI-processing GPU is then used to perform recursive compounding of multiple LRIs.
Fig. 2. Multi-thread processing architecture for LRI computation. During analytic signal conversion (a), latency is reduced by copying an entire channel of RF data to the thread block’s shared memory. For the delay-and-sum stage (b), the position of the sample in each channel to be beam-summed in a thread is denoted as a dashed curve in the analytic data array.
2) Delay-and-Sum Operation: Fig. 2(b) illustrates how delay-and-sum beamforming is performed in the GPU-based beamformer to obtain the LRI pixel values. In this processing stage, each block of threads is allocated to compute the LRI pixel values within a two-dimensional grid. Each individual thread is instructed to calculate a single LRI pixel value via the following four-step procedure:
 1)  Estimate the focusing delays for all channels with respect to the pixel position for that thread [see (6a)–(6c)];
 2)  Retrieve each channel’s analytic data samples whose index position corresponds most closely with the pixel’s focusing delay for that channel;
 3)  Interpolate each channel’s signal value to be used for LRI pixel summation based on the retrieved analytic samples [see (5a)–(5c)];
 4)  Obtain the LRI value by multiplying each of the interpolated signal values by an apodization weight and summing the apodized values [see (4)].
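The four-step per-pixel procedure above can be sketched as a single serial Python function, which corresponds to the work that one GPU thread would perform for its pixel. This is an illustrative sketch only: the name `lri_pixel`, the argument layout, and the choice to skip channels whose delay falls outside the recorded data are assumptions, not details reported in this paper.

```python
import math

def lri_pixel(pixel, tx_pos, rx_positions, analytic, weights, fs, c=1540.0):
    """Compute one LRI pixel value by delay-and-sum over all receive channels.

    analytic[n] holds the analytic samples of channel n; positions are (x, z) in metres.
    """
    xo, zo = pixel
    d_t = math.hypot(xo - tx_pos[0], zo - tx_pos[1])        # transmit distance, (6b)
    total = 0j
    for (xr, zr), a_n, w_n in zip(rx_positions, analytic, weights):
        # Step 1: focusing delay for this channel, (6a) and (6c)
        tau = (d_t + math.hypot(xo - xr, zo - zr)) / c
        # Step 2: locate the two nearest analytic samples, (5b)
        position = fs * tau
        kappa = math.floor(position)
        if kappa < 0 or kappa + 1 >= len(a_n):
            continue   # assumption: ignore delays outside the recorded data
        # Step 3: linear interpolation, (5a) and (5c)
        lam = kappa + 1 - position
        alpha = lam * a_n[kappa] + (1 - lam) * a_n[kappa + 1]
        # Step 4: apodize and accumulate, (4)
        total += w_n * alpha
    return total
```

Because each pixel depends only on read-only channel data, every pixel can be evaluated independently, which is what makes the one-thread-per-pixel assignment possible.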
C. Computation of High-Resolution Images
Once a batch of LRIs has been computed, it is transferred to the GPU designated for HRI processing to carry out recursive compounding with other LRIs. As noted in (1a) and (1b), each HRI pixel value is the sum of multiple LRI values at the same pixel position, and it can be calculated recursively from the previous HRI, latest LRI, and earliest LRI in the compounding group. In our beamformer, each GPU thread in the HRI processor has been assigned to handle one pixel of recursive summation. This thread assignment scheme is quite similar to the one used for the delay-and-sum operation, except for the trivial difference of retrieving values from HRI/LRI frames rather than from the analytic data array.
D. Processing Speed Enhancement Strategies
To reduce latency overheads in GPU-based parallel processing, efficient management of memory access is known to be crucial, given that each memory read-write operation may require up to several hundred clock cycles. In general, this task can be facilitated through agile use of the GPU’s two-tier memory structure, which comprises shared memory for each thread block (small in size, but with fast access speed) and texture and global memory residing in the GPU’s device core (slower access speed, which may improve if cached). As a general rule of thumb, it is desirable to exploit the shared memory to store data values that are repeatedly accessed by the beamformer.
In the first stage of LRI computation [Fig. 2(a)], processing latency is lowered by using the shared memory to store an entire channel of RF data and thereby facilitating fast data access by each thread in a block. The Hilbert transform filter coefficients are also stored in the shared memory to accelerate the analytic signal computation process. For the delay-and-sum stage [Fig. 2(b)], latency is kept low by either: 1) creating texture memory pointers to cache analytic data samples that are repeatedly fetched to different threads (for Tesla-class GPUs), or 2) simply exploiting the global memory cache (for Fermi-class GPUs). Note that the analytic data samples are not transferred to the shared memory during delay-and-sum because the size of this fast-access memory in currently available GPUs is not large enough to store the entire data array. Instead, the shared memory is used in this stage to store the apodization weights and the set of pre-calculated channel-domain receive delays for the corresponding scanline (only transmit delays must be computed during operation because the transmit-center position is different for each firing). For the HRI compounding operation, latency overhead is rather nominal because each thread only requires retrieval of a few data samples from memory to calculate an HRI pixel.
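The split between pre-calculated receive delays and per-firing transmit delays can be illustrated with the short Python sketch below. It is a conceptual sketch rather than a description of the CUDA implementation: the table layout (one row per pixel depth, one column per receive element), the assumption that receive elements lie at zero depth, and both function names are hypothetical.

```python
import math

def receive_delay_table(x_line, z_values, rx_x, c=1540.0):
    """Receive delays (s) for each pixel depth on one scanline and each element.

    Computed once, since neither the pixel grid nor the elements move between firings.
    Assumes the receive elements sit at depth z = 0.
    """
    return [[math.hypot(x_line - xr, z) / c for xr in rx_x] for z in z_values]

def total_delays(rx_table, x_line, z_values, tx_pos, c=1540.0):
    """Add the firing-specific transmit delay to the stored receive delays."""
    out = []
    for z, row in zip(z_values, rx_table):
        t_tx = math.hypot(x_line - tx_pos[0], z - tx_pos[1]) / c   # changes per firing
        out.append([t_tx + t_rx for t_rx in row])
    return out
```

Only `total_delays` has to run for every firing; the table from `receive_delay_table` plays the role of the data that the text describes as being held in shared memory.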
IV. Prototype Implementation
A. PC Backbone
Based on our beamformer architecture, we have assembled a prototype PC platform that can support different combinations of GPU devices. This prototype operates on a motherboard with seven PCI-Express 16-lane expansion slots (X58; EVGA Corporation, Brea, CA), and it uses a quad-core, 2.66-GHz CPU as the host controller (i7-920; Intel Corporation, Santa Clara, CA); 6 GB of DDR3 RAM is included in the prototype PC as a data buffer for channel-domain samples.
B. GPU Computational Platform
In this work, the prototype PC has been used to test the efficiency of various multi-GPU combinations in carrying out beamforming for PW compounding and SA imaging. In all of the multi-GPU combinations, HRI processing is performed using a Fermi-class GTX-470 GPU (with 448 cores, 1.215 GHz clock speed, 1280 MB of global memory, and a 768 kB level-two cache). For LRI processing, different dual-GPU configurations have been considered based on the Tesla-class GTX-275 GPU and the Fermi-class GTX-480 GPU (three possible combinations of these have been attempted). Specifications for these GPU models are readily available from the manufacturer, so they will not be repeated here. Nevertheless, a few important differences should be noted between them. First, the GTX-480 has distinctly more processor cores than the GTX-275 (480 versus 240), although their processor clock is similar (1.401 GHz versus 1.404 GHz). Second, the GTX-480 includes a level-two global memory cache (768 kB) that is not found in the GTX-275, but its texture memory filling rate is slower (42 billion/sec versus 50.6 billion/sec). Third, the GTX-480 allocates 48 kB of shared memory for every 32 cores, in contrast to the GTX-275, which assigns 16 kB for every 8 cores. Note that to facilitate comparative analysis, our work has also included the evaluation of single-GPU configurations that involve the use of a GTX-275 GPU or a GTX-480 GPU. For these configurations, the LRI computation and HRI compounding operations were performed on the same GPU.
C. Beamformer Software
The GPU-based beamformer is coded in C++ via a functional programming approach, and various CUDA syntaxes and functions (ver. 3.0) are invoked to realize multi-thread processing on the GPUs. In the current version of the beamformer software, LRI processing is carried out in batches of 50 frames, and this parameter is chosen to maintain balance between processing overheads and data transfer bandwidth (confirmed via low-level performance tests, data not shown). For the analytic signal conversion stage of the beamformer, a 51-tap FIR filter is implemented for the Hilbert transform, and its coefficients are computed during beamformer initialization. This filter order is empirically chosen to achieve consistent Hilbert transform results. For the delay-and-sum stage, the grid size handled by each thread block is empirically tuned to be 16 × 16 pixels for GTX-275 GPUs and 64 × 8 pixels for GTX-480 GPUs (because the two GPU models had different memory caching performance). Also, the beamforming delays are computed as the two-way pulse-echo propagation times for a nominal acoustic speed of 1540 m/s, and a Hanning window is used as the apodization weight.
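For readers unfamiliar with Hanning apodization, the sketch below generates one weight per receive channel using the common symmetric definition with zero-valued end points. The exact window variant used in the beamformer software is not stated in this paper, so this particular formula and the name `hanning_weights` are assumptions.

```python
import math

def hanning_weights(n_channels):
    """Symmetric Hanning window: w_n = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    if n_channels == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_channels - 1)))
            for n in range(n_channels)]
```

The weights taper smoothly from the aperture center towards the edge channels, which suppresses side lobes at the cost of a somewhat wider main lobe.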
V. Performance Assessment
A. Overview of Study
To assess our beamformer’s efficacy in executing image formation operations in real-time, we have performed a series of imaging experiments using typical data acquisition parameters for PW compounding and SA imaging (see Table I). In these studies, channel-domain RF data were synthesized for a field-of-view that encompasses a group of point targets located at different depths. We have used our beamformer to compute HRIs for various imaging depths (between 5 and 15 cm, in 1-cm increments) that essentially correspond to different RF data sizes in each channel (with a 40 MHz RF sampling rate, there are 520 additional RF samples acquired per channel for every 1-cm depth increase). Each HRI is 512 × 255 pixels (for all imaging depths examined), and it is derived from recursive compounding of 49 LRIs that correspond to the total number of independent firing positions in SA imaging or planar steering directions in PW compounding.
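Two of the figures quoted above follow from simple arithmetic, which the sketch below reproduces. The 520 samples per centimeter come from the round-trip travel time over 1 cm at 1540 m/s sampled at 40 MHz (about 519.5, rounded up here, which is an assumption about how the figure was obtained), and the 49 LRIs match a steering range of −12° to +12° in 0.5° steps as listed in Table I. Function names are hypothetical.

```python
import math

def rf_samples_per_cm(fs=40e6, c=1540.0):
    """RF samples recorded per channel for each extra centimeter of depth."""
    round_trip_s = 2.0 * 0.01 / c          # 1 cm down and 1 cm back
    return math.ceil(round_trip_s * fs)    # assumption: round up to whole samples

def steering_angles(first=-12.0, last=12.0, step=0.5):
    """Plane-wave steering angles in degrees, both end points included."""
    count = int(round((last - first) / step)) + 1
    return [first + i * step for i in range(count)]
```

At the nominal rate of 520 samples per centimeter, the shallowest (5 cm) and deepest (15 cm) cases correspond to 2600 and 7800 samples per channel, respectively.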
Fig. 3 shows an example of HRIs produced by using the GPU-based beamformer to process channel-domain RF data obtained from the field-of-view. As can be seen, the images for both PW compounding and SA imaging gave a sharper visualization of the point targets than those for beamline-based imaging (generated from another codec that we have written). In particular, the defocusing effects that appeared in the beamline-based images are less apparent in the compounded PW and SA images. Such an observation is consistent with previous reports on the anticipated benefits of PW compounding and SA imaging [3]–[5].
To quantitatively analyze the GPU beamformer’s performance, we have measured the processing throughput by counting the mean number of HRIs that can be computed in 1 s. This was averaged over a 10-s observation period (with synchronized availability of channel-domain RF data samples). Such a performance measure provides a practical account of the processing capacity that includes various overhead sources like memory transfer from RAM (for loading channel-domain RF samples) and between GPUs (for transferring batches of LRI data into the HRI-processing GPU).
TABLE I. Parameters for the Imaging Example.
General imaging parameters
  Acoustic speed of tissue layer      1540 m/s
  Attenuation coefficient             0.5 dB/(cm·MHz)
  Array size                          128 elements
  Array pitch                         0.3048 mm
  Imaging frequency                   10 MHz
  Pulse repetition frequency          5 kHz
  RF sampling frequency               40 MHz
  Number of RF samples per channel    520 for every 1 cm
  Transmit waveform                   Two-cycle sinusoid
Plane wave compounding parameters
  Steering angle range                −12° to +12°
  Number of steering angles           49 (in 0.5° increments)
  Plane wave transmit aperture        128 channels
  Receive aperture                    32 channels (spanned over array)
Synthetic aperture imaging parameters
  Virtual source axial position       −2 cm
  Point source lateral spacing        0.6 mm
  Number of virtual sources           49
  Virtual source transmit aperture    64 channels
  Receive aperture                    32 channels (centered upon Tx aperture)
Beamline imaging parameters
  Transmit focus                      3 cm
  Transmit aperture                   64 channels
  Receive aperture                    32 channels
  Number of beamlines                 127
B. HRI Processing Throughput
1) Overview of Results: As an indicator of whether our GPU-based beamformer is capable of producing HRIs in real-time, Fig. 4 plots the processing frame rate as a function of imaging depth for different LRI-computation array compositions and beamformer sizes. In general, it should be noted that the HRI processing throughput (inclusive of LRI computation, recursive compounding, and latency overheads) decreases at greater depths, as expected, given the increased amount of RF data samples that must be processed over each channel. Nevertheless, the HRI throughput for all the multi-GPU configurations is still greater than 1500 fps even at an imaging depth of 15 cm that is used in cardiac imaging scenarios. This is much faster than the real-time display frame rate requirement (usually 100 fps, at most), and it makes instant replay of the HRI cineloop at higher frame rates easily possible. In general, from a real-time data streaming perspective, the processing throughput should be faster than the data acquisition frame rate (i.e., pulse repetition frequency (PRF) in PW compounding and SA imaging [5]) to avoid dropping raw data frames in the beamformer. This implies that, in our example, the PRF may need to be adjusted depending on the beamformer’s processing capacity and the size of the raw data array.
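The PRF adjustment mentioned above can be made concrete with a small calculation. The sketch below combines two limits: the acoustic limit, which follows from requiring one round trip to the maximum depth between firings, and the beamformer throughput, assuming one LRI (and hence one recursively updated HRI) per firing. The function names are hypothetical, and the throughput values used with them should be read as illustrative lower bounds from this paper, not exact measurements.

```python
def acoustic_prf_limit(depth_m, c=1540.0):
    """Highest PRF (Hz) that lets echoes from depth_m return before the next firing."""
    return c / (2.0 * depth_m)

def sustainable_prf(depth_m, throughput_fps, c=1540.0):
    """PRF (Hz) that neither overlaps echoes nor outruns the beamformer."""
    return min(acoustic_prf_limit(depth_m, c), throughput_fps)
```

At 15 cm depth the acoustic limit is about 5.1 kHz, so a beamformer running at roughly 3000 fps would be the bottleneck, and the 5 kHz PRF in Table I would have to be lowered to avoid dropping raw data frames.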
2) Scaling of Processing Power Using Multiple GPUs and Advanced GPU Models: Among the GPU array compositions [Fig. 4(a)], the one that uses two GTX-480s for LRI computations and one GTX-470 for HRI compounding has achieved the highest HRI processing throughput (see black curve). For 32 receive beamforming channels, it is capable of computing more than 4700 and 3000 HRIs per second, respectively, for imaging depths of 5 cm (used in carotid imaging) and 15 cm (needed for cardiac imaging). This shows that multi-GPU configurations can effectively boost the throughput beyond that achievable with the use of a single GPU (the two light-gray curves). Note that the throughput for the GPU array involving two GTX-275s for LRI processing (dark-gray dashed curve) is similar to that for the one that uses a single GTX-480 for LRI processing (light-gray solid curve). This is an expected result because the total number of parallel computation cores available for LRI processing is the same for the two configurations (both have 480 cores in total). This indicates that our beamformer’s HRI processing throughput scales directly with the number of computation cores in the platform. Another point of interest is that for the hybrid configuration (GTX-275 & GTX-480 for LRI computation plus GTX-470 for HRI compounding), the HRI processing throughput (dark-gray solid curve) is between that for two GTX-275s and two GTX-480s, as expected. This shows that the combined use of both Tesla-class and Fermi-class GPUs is possible in our beamformer architecture.
Fig. 3. Point-target image examples for three imaging paradigms: (a) SA imaging (49 virtual point sources); (b) PW compounding (49 steered plane waves); (c) conventional beamline-based imaging (3 cm axial focus). All images were obtained using a 32-channel beamformer on receive. See Table I for data acquisition parameters.
Fig. 4. HRI processing throughput as a function of imaging depth for (a) various LRI-computation array types with a 32-channel beamformer size; (b) various beamformer channel sizes with a dual GTX-480 array for LRI computation. For each 1 cm increase in imaging depth, the RF data size is raised by 520 samples for each channel (in accordance with the sampling parameters). A GTX-470 GPU is used for HRI processing in all multi-GPU configurations.
3) Balancing Between Processing Throughput and Beamformer Size: When more receive channels are used in the GPU-based beamformer, a reduction in HRI processing throughput can be noticed [see Fig. 4(b)]. In particular, when using two GTX-480s for LRI computation and a GTX-470 for HRI compounding, the throughput for 128 channels (light gray curve) is, respectively, less than two-thirds and one-third of those for 64 channels (dark gray curve) and 32 channels (black curve). Such a performance gap remains apparent at all imaging depths examined in our example. As such, depending on image quality requirements, it may be beneficial to tune the beamformer channel size to ensure that the gain in image contrast is commensurate with the increase in processing demand.
C. Individual Process Timings
1) LRI Computation Accounted for a Large Portion of the Process Time: As a more detailed evaluation of the GPU-based beamformer’s performance, Fig. 5 shows the timing breakdown analysis for a representative time window in a case that uses a GPU array comprising two GTX-480s (for LRI computation) and one GTX-470 (for HRI compounding). This timing diagram is obtained for an imaging depth of 5 cm and a beamformer size of 32 channels. In general, it has shown that LRI computation occupies more computational resources than HRI compounding. Indeed, the two LRI-processing GPUs are mostly engaged during the beamformer’s operation, while there is idle time in the HRI-processing GPU. As such, the beamformer’s processing throughput seems to be constrained by LRI computations. On the other hand, the spare resources available on the HRI-processing GPU provide opportunities for implementing various post-processing operations related to the analysis and interpretation of HRIs (e.g., image segmentation algorithms).
2) Delay-and-Sum Operation Dominated the LRI Computation Time: Another observation to be noted from Fig. 5 is that, between the two LRI processing stages, the delay-and-sum operation seems to occupy a significantly longer timeframe than analytic signal conversion. Nevertheless, the operational time is still well within real-time constraints (Fig. 5 shows that batch processing of 50 LRI frames requires about 25 ms). This computational time can be expected to decrease further as newer GPU models with more parallel computation cores and faster memory access schemes become available.
VI. Concluding Remarks
High computational demand is known to be a technical hurdle for real-time implementation of advanced imaging methods, like PW compounding and SA imaging, that work with the pre-beamform data of each array element. To address this processing demand, we have designed a GPU-based beamformer architecture that offers real-time processing frame rates of more than 1000 fps when using two Fermi-class GPUs as the LRI-processing core and another GPU as the HRI processor. The advantages of using GPUs as a beamforming processor are two-fold. First, their hardware architecture, comprising hundreds of processor cores, has been specifically developed to facilitate single-instruction, multiple-thread computations. This parallelism can help accelerate the beamforming operation of multiple image pixels without running into power wall issues faced by CPU microprocessors. Second, the software-based programmability of GPUs (via the C++ language) is perhaps a more accessible alternative than FPGAs that must be configured using low-level hardware description languages.
To our knowledge, this work is the first demonstration of using GPUs to develop an ultrafast processor for advanced imaging methods that work with pre-beamform data. We expect that this platform can be readily extended to facilitate real-time realization of other advanced techniques, such as adaptive beamforming. It is our anticipation that, with the advent of GPUs, high-speed back-end processing of various novel imaging methods can be performed on a stand-alone PC workstation. Such a back-end processor design approach is seemingly a less labor-demanding alternative to others that are based on the use of computing devices such as FPGAs and DSPs, whose programming is more complicated than that of GPUs.
Fig. 5. Individual process timings (obtained from actual measurements) for a GPU array with two GTX-480s for LRI computation and one GTX-470 for HRI compounding. Results are shown over a time window encompassing a few batch-processing cycles, and they correspond to the case with 32 receive channels and 5 cm imaging depth. The processing is carried out in batches of 50 frames.
References
[1] T. A. Whittingham, “Medical diagnostic applications and sources,” Prog. Biophys. Mol. Biol., vol. 93, no. 1–3, pp. 84–110, 2007.
[2] J. Y. Lu, J. Cheng, and J. Wang, “High frame rate imaging system for limited diffraction array beam imaging with square-wave aperture weightings,” IEEE Trans. Ultrason. Ferroelectr. Freq. Control, vol. 53, no. 10, pp. 1796–1812, 2006.
[3] G. Montaldo, M. Tanter, J. Bercoff, N. Benech, and M. Fink, “Coherent plane-wave compounding for very high frame rate ultrasonography and transient elastography,” IEEE Trans. Ultrason. Ferroelectr. Freq. Control, vol. 56, no. 3, pp. 489–506, 2009.
[4] J. A. Jensen, S. I. Nikolov, K. L. Gammelmark, and M. H. Pedersen, “Synthetic aperture ultrasound imaging,” Ultrasonics, vol. 44, suppl. 1, pp. e5–e15, 2006.
[5] S. I. Nikolov, K. L. Gammelmark, and J. A. Jensen, “Recursive ultrasound imaging,” in Proc. IEEE Ultrasonics Symp., 1999, pp. 1621–1625.
[6] S. I. Nikolov, B. G. Tomov, and J. A. Jensen, “Real-time synthetic aperture imaging: Opportunities and challenges,” in Proc. Asilomar Conf. Signals Systems and Computers, 2006, pp. 1548–1552.
[7] G. York and Y. Kim, “Ultrasound processing and computing: Review and future directions,” Annu. Rev. Biomed. Eng., vol. 1, pp. 559–588, 1999.
[8] F. Zhang, A. Bilas, A. Dhanantwari, K. N. Plataniotis, R. Abiprojo, and S. Steriopoulos, “Parallelization and performance of 3D ultrasound imaging beamforming algorithms on modern clusters,” in Proc. ACM Int. Conf. Supercomputing, 2002, pp. 294–304.
[9] C. R. Hazard and G. R. Lockwood, “Theoretical assessment of a synthetic aperture beamformer for real-time 3-D imaging,” IEEE Trans. Ultrason. Ferroelectr. Freq. Control, vol. 46, no. 4, pp. 972–980, 1999.
[10] J. A. Jensen, M. Hansen, B. G. Tomov, S. I. Nikolov, and H. Holten-Lund, “System architecture of an experimental synthetic aperture real-time ultrasound system,” in Proc. IEEE Ultrasonics Symp., 2007, pp. 636–640.
[11] R. Diagle, “Ultrasound imaging system with pixel oriented processing,” WIPO Patent Application WO/2006/113445.
