Abstract-In this paper a parametric beamformer, which can handle all imaging modalities including synthetic aperture imaging, is presented. The image lines and apodization coefficients are specified parametrically, and the lines can have arbitrary orientation and starting point in 3D coordinates. The beamformer consists of a number of identical beamforming blocks, each processing data from several channels and producing part of the image. A number of these blocks can be accommodated in a modern field-programmable gate array (FPGA) device, and a whole synthetic aperture system can be implemented using several FPGAs. For the current implementation, the input data is sampled at 4 times the center frequency of the excitation pulse and is match-filtered in the frequency domain. In-phase and quadrature data are beamformed with a sub-sample precision of the focusing delays of 1/16th of the sampling period. Each line is completely specified by 3 input parameters. The focusing delays are calculated iteratively in an 8-stage deep pipeline, and focusing information for 8 different lines is interleaved to produce delays at every clock cycle. The apodization is specified using piecewise linear approximation with 255 levels. A beamforming block uses input data from 4 elements and produces a set of 10 lines. Linear interpolation is used to implement sub-sample delays. The VHDL code for the beamformer has been synthesized for a Xilinx V4FX100 speed grade 11 FPGA, where it can operate at a maximum clock frequency of 167.8 MHz. Each beamformation block requires 12 multipliers, 5 buffers for parameters, 8 buffers for input data and 32 buffers for output data (I and Q). Furthermore, double buffering is used for the input data, thus simplifying the synchronization. Up to six beamforming blocks can fit in one FPGA. Clocked at 150 MHz they produce 900 × 10^6 I and Q samples/second.
Assuming a pulse repetition frequency of 5000 Hz, these blocks can be configured to beamform in real time 256 B-mode lines of synthetic aperture data from 4 transducer elements, or 64 lines from 16 elements.
I. INTRODUCTION
Medical ultrasound imaging is a widely used imaging modality, which is characterized by high mobility and short preparation time. The current sequential line acquisition image generation approach in commercial scanners has been around for 30 years and is essentially based on sequential processing [13]. A very promising algorithm for ultrasound image generation is synthetic aperture (SA) imaging [1], [3], [9], [10], [12], which provides uniform resolution across the image and fast data acquisition. The latter is especially desired for cardiac examinations. The SA technique also makes 3D imaging feasible, in which a large number of image lines/points/planes has to be created quickly for the purpose of real-time display.
Storing and accessing the focusing information for each image line is a major problem for real-time beamformers implementing synthetic aperture imaging or advanced flow imaging techniques. The focusing precision is of paramount importance for the image quality [4], and simple compression is not an option. Parametric recursive delay generators have been suggested for the case of image lines originating from the transducer [2], [14]. For the purpose of vector flow imaging [5]-[8], though, a free choice of origin and direction of the lines has to be possible. The purpose of this paper is to present a parametric beamformer capable of fast focusing in an arbitrary direction in 3D space, using a recursive parametric delay generation algorithm that requires only 3 input parameters per line [11]. The algorithm uses successive approximation with error accumulation/compensation. In the current hardware implementation, the delay approximation is pipelined, and the parameter sets for 8 lines are fed into the pipeline in an interleaved fashion to keep all pipeline stages active.
Section II presents the theory behind the beamformer, Section III describes the implementation of the beamformer in hardware, and Section IV presents the performance estimates and the resource utilization with a Xilinx V4FX100 (speed grade 11) FPGA.
II. BEAMFORMER OVERVIEW
The presented beamformer handles images acquired using conventional line-by-line acquisition as well as synthetic aperture techniques. In conventional imaging a focused wave is transmitted from the transducer in a given direction. Echoes are scattered back by inhomogeneities on the path of the propagating wave. The received signals are coherently summed to form a beam. To reconstruct the reflectivity at a given spatial location, the distance from the origin of the beam to that location and back to the transducer element that has recorded the echo signal must be calculated.
In synthetic transmit aperture imaging, a spherical wave rather than a focused wave is transmitted to acquire data. In receive, all elements are used to record the back-scattered signal. A full image is reconstructed for every emission. It has low resolution, because there is no transmit focusing. The measurement is repeated by transmitting a spherical wave with another transducer element, and a new low resolution image is created. After all elements have been used in transmit, all low resolution images are summed and an image with high resolution is formed.
Delay-and-sum beamforming is used to reconstruct images for both SA and conventional imaging. Because each transmission covers the whole region of interest, the notion of "beams" loses its meaning in the case of synthetic aperture imaging, where an image could be specified as a set of picture elements. The distance to the reconstructed points in SA images does not necessarily increase from point to point, and the sampled RF data must be stored until all points have been beamformed. The beamforming of the high resolution image $H(\vec{r}_i)$ at spatial location $\vec{r}_i$ can be expressed as:
$$H(\vec{r}_i) = \sum_{e} a_e \sum_{r} a_r \, s_r\big(\tau_{ToF}(\vec{r}_e, \vec{r}_i, \vec{r}_r)\big) \tag{1}$$
where $\tau_{ToF}$ is the time of flight of the echo $s_r(t)$ received by the transducer element at $\vec{r}_r$ after an emission with the transducer element located at $\vec{r}_e$ (see Fig. 1(b)). The coefficients $a_e$ and $a_r$ are the transmit and receive apodization, respectively. Digital beamformers operate on sampled signals, and interpolation is used to reconstruct the signal $s_r(\tau_{ToF})$ when $\tau_{ToF}$ is not an integer multiple of the sampling period. The time of flight $\tau_{ToF}$ is the sum of the forward and backward propagation times $\tau_f$ and $\tau_b$, which can be calculated with the same algorithm. In the rest of the paper only one of the propagation times is considered, and the subscript indicating forward or backward propagation is omitted. For convenience the image points are placed along lines defined by an origin $\vec{r}_o$, direction $\vec{\zeta}$, length $L$ and distance between two points $\Delta r$, as shown in Fig. 1(a). The coordinates $\vec{r}_i = (x, y, z)^T$ of the $i$:th point along the line are:
$$\vec{r}_i = \vec{r}_o + i \, \Delta r \, \vec{\zeta} \tag{2}$$
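The distance $l_i$ from the transmission origin $\vec{r}_e$ to a point $\vec{r}_i$ is:
$$l_i = |\vec{r}_i - \vec{r}_e|, \qquad \vec{r}_i - \vec{r}_e = \vec{r}_{oe} + i \, (\Delta x, \Delta y, \Delta z)^T \tag{3}$$
where $\vec{r}_{oe} = \vec{r}_o - \vec{r}_e$ is the origin of the line expressed in coordinates relative to the position $\vec{r}_e$ of the element, the distance to which is sought. For each of the three coordinates the squared distance can be expressed as
$$(x_i - x_e)^2 = (x_{oe} + i \, \Delta x)^2 = x_{oe}^2 + 2 i \, x_{oe} \Delta x + i^2 \Delta x^2 \tag{4}$$
The value at the previous focal point is
$$(x_{i-1} - x_e)^2 = x_{oe}^2 + 2 (i - 1) \, x_{oe} \Delta x + (i - 1)^2 \Delta x^2 \tag{5}$$
Subtracting (5) from (4) results in
$$(x_i - x_e)^2 - (x_{i-1} - x_e)^2 = 2 x_{oe} \Delta x + (2 i - 1) \Delta x^2 \tag{6}$$
which is the increment from sample to sample when performing the focusing. The origin of the line $x_{oe}$ and the increment from sample to sample $\Delta x$ are constants and can be precalculated. The difference between the squared distances to two consecutive points is the sum of (6) over the three coordinates:
$$\Lambda_i = l_i^2 - l_{i-1}^2 = A + B \, i \tag{7}$$
where the constants $A$ and $B$ are calculated as:
$$A = 2 (x_{oe} \Delta x + y_{oe} \Delta y + z_{oe} \Delta z) - \Delta r^2, \qquad B = 2 \Delta r^2, \qquad \Delta r^2 = \Delta x^2 + \Delta y^2 + \Delta z^2 \tag{8}$$
$x_{oe}$, $y_{oe}$, $z_{oe}$ are the components of the vector $\vec{r}_{oe}$ from (3). The square of the distance $L_i = l_i^2$ can be recursively found from the previous squared distance:
$$L_i = L_{i-1} + \Lambda_i \tag{9}$$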
Multiplying this expression with $(f_s/c)^2$ gives the squared sample index corresponding to the propagation time. It has been shown in [11] that only $\Lambda_i$ is needed to recursively calculate $\tau_i = \frac{f_s}{c} l_i$. The apodization coefficients are also calculated parametrically, using a piecewise linear approximation of the ideal curve.
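The recursive update of the squared distance can be checked numerically with a short Python sketch. It is an illustration only, not the hardware implementation: the function name `squared_distances`, the tuple arguments, and the grouping of the precomputed constants into `a` and `b` (so that the increment for point i is `a + b * i`) are assumptions made for this example.

```python
def squared_distances(r_oe, step, n_points):
    """Squared distances from an element to the points r_oe + i * step.

    r_oe : line origin relative to the element (x, y, z)
    step : per-point increment (dx, dy, dz) along the line
    Assumed grouping of the constants: increment_i = a + b * i.
    """
    dot = sum(o * d for o, d in zip(r_oe, step))
    step_sq = sum(d * d for d in step)
    a = 2 * dot - step_sq          # precomputed once per line
    b = 2 * step_sq                # precomputed once per line
    dist_sq = sum(o * o for o in r_oe)   # squared distance to point 0
    result = [dist_sq]
    increment = a + b              # increment for point 1
    for _ in range(1, n_points):
        dist_sq += increment       # one addition per point
        result.append(dist_sq)
        increment += b             # the increment itself grows linearly
    return result


def direct_squared_distances(r_oe, step, n_points):
    """Reference: evaluate the squared distance of every point directly."""
    return [sum((o + i * d) ** 2 for o, d in zip(r_oe, step))
            for i in range(n_points)]
```

For the line origin (3, 0, 4) and step (1, 2, -1), both functions return 25, 29, 45, 73, 113 for the first five points, but the recursive version needs only additions inside the loop.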
III. IMPLEMENTATION
A block diagram of the developed unit is shown in Fig. 2. An image is specified as a set of lines to accommodate both SA and conventional imaging, and each unit can beamform 8 lines in parallel using data from 4 channels. A beamformer unit contains 5 delay calculation units: 4 for the paths to the four receive elements and 1 for the transmit path. The apodization coefficients and delays are calculated parametrically. Linear interpolation is used to generate samples with sub-sample delay precision. The sum of the 4 receive channels represents a partial low-resolution image in the case of SA imaging. It is multiplied by the coefficient for transmit apodization before it is summed with the rest of the low-resolution images.
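The data flow through one unit for a single emission can be summarized in a small Python sketch. It assumes that the channel samples have already been delayed and interpolated, and that the transmit apodization is given per image point; the function name `accumulate_emission` and the list-based layout are hypothetical and only serve the illustration.

```python
def accumulate_emission(line, channel_samples, rx_apod, tx_apod):
    """Add the contribution of one emission to a running high-resolution line.

    line            : accumulated samples of one image line (modified in place)
    channel_samples : one list of delayed samples per receive channel
    rx_apod         : one list of receive apodization weights per channel
    tx_apod         : transmit apodization weight per point (assumed layout)
    """
    for i in range(len(line)):
        # Weighted sum over the receive channels: partial low-resolution sample.
        partial = sum(w[i] * s[i] for w, s in zip(rx_apod, channel_samples))
        # Transmit apodization is applied once to the channel sum.
        line[i] += tx_apod[i] * partial
    return line
```

Calling the function once per emission with the same `line` buffer mirrors the accumulation of low-resolution images into a high-resolution image.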
A. Delay generation
The delay-generation unit recursively calculates the propagation time to points along an image line (see Fig. 1(a)). It consists of two blocks as shown in Fig. 3(a). The first block calculates the difference
where $\tau_i$ is the propagation time from the focal point $i$ to a transducer element. Using this difference and the time of propagation from a transducer element to the origin of the image line, it is possible to recursively calculate the distance to all points along the line using the RISQRT unit shown in Fig. 3(c). The search for the right value of $\tau_i$ starts from $\tau_{i-1}$. To decrease the complexity of the circuit, the search always starts from a value that is less than $\tau_i$. In other words, if $\Lambda_i < 0$ then the start value for the search is $\tau_{start} = \tau_{i-1}$, otherwise it is $\tau_{start} = \tau_{i-1} - \Delta t$. The circuit consists of a pipeline which is 8 stages deep.
Each stage adds one bit of precision to the estimate. The output of the pipeline is the estimate of the propagation time $\tau_i$ and the residual error $\Delta_i$, where
$$\Delta_i = \tau_i^2 - T_i \tag{11}$$
and $T_i$ is the exact squared propagation time.
Both the estimate $\tau_i$ and the residual error $\Delta_i$ are fed back to the input as shown in Fig. 3(c). The $k$:th stage (shown in Fig. 4) performs the following operation:
$$\tau_m = \tau_{in} + \varepsilon(k), \qquad \Delta_m = \Delta_{in} + 2 \, \varepsilon(k) \, \tau_{in} + \varepsilon(k)^2 \tag{12}$$
where $\varepsilon(k)$ is the step size at the $k$:th iteration. At the start, $\varepsilon$ is equal to $\Delta t$. At every subsequent stage, the step is divided by two. $\tau_{in}$ and $\Delta_{in}$ are the approximation of the propagation time and the residual error calculated in the previous stage. $\tau_m$ and $\Delta_m$ are the new candidates for the final result. The results $\Delta_{out}$ and $\tau_{out}$ from the stage depend on the sign of $\Delta_m$. If $\Delta_m < 0$, then $\tau_m < \sqrt{T_i}$ and $\tau_{out} = \tau_m$, else $\tau_{out} = \tau_{in}$. The initial step size is chosen to be a power of two, and all multiplications are reduced to shifting operations as shown in Fig. 4. Furthermore, each stage is fixed to a given iteration number, and the shifting operation is therefore reduced to suitable signal wiring.
The pipeline consists of 8 stages, and it is therefore necessary to calculate delays for 8 lines simultaneously to keep the pipeline full. The algorithm needs the delay for the first point in the line to start the recursive procedure. This delay is sent to the output of the RISQRT circuit to avoid idle clock cycles in the result.
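The successive-approximation search can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the RISQRT circuit itself: delays are integers in units of the least significant bit, the initial step `DT` is assumed to be 128 LSB so that the eighth step equals 1 LSB, the residual is taken as the squared estimate minus the exact squared delay, and for simplicity the search always starts `DT` below the previous estimate instead of selecting the start value from the sign of the increment. The names `risqrt_step`, `delays_along_line` and `reference_delay` are hypothetical.

```python
import math

STAGES = 8                  # pipeline depth quoted in the text
DT = 1 << (STAGES - 1)      # assumed initial step (128 LSB); last step is 1 LSB


def risqrt_step(tau_prev, res_prev, lam):
    """Update the delay estimate for the next point.

    tau_prev : previous estimate (LSB units)
    res_prev : previous residual, tau_prev**2 - T_prev
    lam      : increment of the exact squared delay, T_new - T_prev
    Returns (tau, res) with res = tau**2 - T_new.
    """
    # Start below the target: (tau_prev - DT)**2 - T_new, expanded.
    tau = tau_prev - DT
    res = res_prev - lam - 2 * tau_prev * DT + DT * DT
    eps = DT
    for _ in range(STAGES):
        # Residual of the candidate tau + eps; eps is a power of two,
        # so both products are plain shifts in hardware.
        res_m = res + 2 * tau * eps + eps * eps
        if res_m < 0:               # candidate is still below sqrt(T_new)
            tau, res = tau + eps, res_m
        eps >>= 1
    return tau, res


def delays_along_line(tau0, a, b, n_points):
    """Delays for n_points points, with the assumed increment a + b * i."""
    tau, res = tau0, 0              # the first delay is supplied exactly
    taus = [tau]
    for i in range(1, n_points):
        tau, res = risqrt_step(tau, res, a + b * i)
        taus.append(tau)
    return taus


def reference_delay(t_sq):
    """Largest integer whose square is strictly below t_sq."""
    return math.isqrt(t_sq - 1)
```

The sketch is valid as long as the delay changes by less than `DT` between consecutive points, since the eight steps cover a range of just under `2 * DT` above the start value. Feeding the residual back keeps the squared-delay bookkeeping exact, so the estimate never drifts along the line.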
B. Apodization
The apodization curves are described using piecewise linear approximation with maximum error of 1% of the full scale. A greedy algorithm is used to find the segments off-line. Each segment is specified by a start value, slope and a segment length in samples. The start value and the slope for each segment are encoded using 14 bits and the number of samples is encoded using 8 bits. Although the internal calculations are performed with 14-bit precision, the result is 8-bit, thus introducing up to 0.4% numerical error.
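One possible form of the off-line segment search is sketched below in Python. The text only states that a greedy algorithm is used, so the particular strategy here (extend each chord until the tolerance is first exceeded) is an assumption, as are the names `greedy_segments`, `evaluate` and the Hann-shaped test curve `hann`. Quantization of the start value, slope and output to 14 and 8 bits is left out.

```python
import math


def hann(n):
    """Example apodization curve with n samples (illustrative only)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def fits(curve, i, e, tol):
    """True if the chord from sample i to sample e stays within tol."""
    slope = (curve[e] - curve[i]) / (e - i)
    return all(abs(curve[i] + slope * (k - i) - curve[k]) <= tol
               for k in range(i, e + 1))


def greedy_segments(curve, tol):
    """Return a list of (start value, slope, length in samples)."""
    segments, i, last = [], 0, len(curve) - 1
    while i < last:
        e = i + 1
        # Extend the segment while the longer chord is still acceptable.
        while e < last and fits(curve, i, e + 1, tol):
            e += 1
        segments.append((curve[i], (curve[e] - curve[i]) / (e - i), e - i))
        i = e                       # next segment starts at the chord end
    segments.append((curve[last], 0.0, 1))   # final sample
    return segments


def evaluate(segments):
    """Regenerate the curve the way a start/slope/length generator would."""
    out = []
    for start, slope, length in segments:
        out.extend(start + slope * k for k in range(length))
    return out
```

With `tol = 0.01` the regenerated curve stays within 1% of full scale for a curve normalized to the range 0 to 1; the number of segments needed depends on the shape of the curve.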
C. Interpolation
Linear interpolation is used to generate samples for time instances that are not integer multiples of the sampling period. The linear interpolation uses the built-in digital signal processing blocks available in the Xilinx Virtex-4 family of FPGAs. Each such block can multiply two 18-bit numbers and accumulate the result in a 48-bit register. The unit must operate at twice the clock frequency of the delay generator, because two input samples are needed for each output sample. The output sample is calculated as follows:
$$s(n + \alpha) = (1 - \alpha) \, s[n] + \alpha \, s[n + 1] \tag{13}$$
The index $n$ is the integer part of the delay. It is represented by 12 bits and can address up to 4096 samples. The sub-sample delay $\alpha$ is formed by the 4 least significant bits of the propagation time. The circuit is shown in Fig. 5.
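The split of the delay word into a sample index and a sub-sample fraction can be illustrated with a few lines of Python. The function name `interpolate` is hypothetical, and the final truncating shift is an assumption of this sketch; the text does not state how the product is rounded.

```python
FRAC_BITS = 4                        # sub-sample bits (1/16 of a sample)
FRAC_ONE = 1 << FRAC_BITS            # fixed-point representation of 1.0


def interpolate(samples, delay):
    """Linearly interpolated sample for a fixed-point delay.

    delay : integer delay in units of 1/16 sample; the upper bits are the
            sample index n, the 4 least significant bits the fraction.
    """
    n = delay >> FRAC_BITS
    alpha = delay & (FRAC_ONE - 1)
    # Two multiply-accumulate operations, one per input sample.
    acc = (FRAC_ONE - alpha) * samples[n]
    acc += alpha * samples[n + 1]
    return acc >> FRAC_BITS          # assumed truncation back to sample scale
```

For `samples = [0, 16, 48, -32]` a delay of 20 (index 1, fraction 4/16) gives 24, which is one quarter of the way from 16 to 48. The caller has to make sure that `n + 1` is a valid index.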
IV. RESOURCES
Each beamformation unit produces 1 sample at every clock cycle. The circuit is clocked at 150 MHz. The total number of samples is about 1 × 10^9 per second (about 200 lines, 1000 samples per line, 5000 transmissions/second). A total of 6 beamformation blocks are required to beamform SA images in real time. Each block must therefore beamform 4 sets of 8 lines in parallel. The resources necessary for the beamformation fall into 3 categories: built-in block RAM to store parameters, input samples and results; dedicated digital signal processing blocks to perform the interpolation and apodization; and logic resources used to control the circuit and generate delay and apodization coefficients.
A. Block RAM:
The dedicated Block RAM (BRAM) units can be configured to store 512 36-bit words, which is sufficient to hold the parameters for one transducer element for 32 lines. Three 32-bit parameters are required to describe the delays for a single line. Up to eight segments can be used to approximate the apodization curve for each element. The parameters for every segment are packed in 36 bits (14-bit start value, 14-bit slope and 8-bit length). The total number of words is thus (32 × 3 + 8) + (8 × 4 × 8) = 360 < 512, and 1 BRAM is sufficient. The transmit parameters must be changed at every transmission because the transmit element changes from emission to emission; therefore double buffering is used. The total number of BRAMs for parameters per beamformer unit is thus 6.
In-phase and quadrature signals are beamformed simultaneously. For each channel 4096 16-bit I and Q samples are stored in Block RAM. This requires 64 BRAMs, including those used for double buffering.
Each focusing block produces 32 lines which must be accumulated over all emissions. The precision is set to 24 bits. The number of samples is 1024. Thus, 96 BRAMs are used to store the beamformed data.
A XC4VFX100 FPGA has 376 BRAMs, and can therefore accommodate 2 beamformation units.
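The block RAM budget can be reproduced with a few lines of arithmetic. The sketch below assumes that each BRAM contributes 16 Kbit of data storage (parity bits unused), which is the figure under which the quoted counts of 64 and 96 come out exactly; the parameter count of 6 is taken directly from the text. All variable names are illustrative.

```python
BRAM_DATA_BITS = 16 * 1024       # assumed usable data bits per block RAM
FPGA_BRAMS = 376                 # XC4VFX100, as quoted in the text


def brams(bits):
    """Number of block RAMs needed for `bits` bits, rounded up."""
    return -(-bits // BRAM_DATA_BITS)


# Parameter storage: word count and BRAM count as quoted in the text.
param_words = (32 * 3 + 8) + (8 * 4 * 8)
param_brams = 6

# Input data: 4 channels, I and Q, 4096 samples of 16 bits, double buffered.
input_brams = 2 * brams(4 * 2 * 4096 * 16)

# Output data: 32 lines, I and Q, 1024 samples of 24 bits.
output_brams = brams(32 * 2 * 1024 * 24)

per_unit = param_brams + input_brams + output_brams
units_per_fpga = FPGA_BRAMS // per_unit

print(param_words, input_brams, output_brams, per_unit, units_per_fpga)
```

Under these assumptions one unit needs 166 BRAMs, so two units (332 BRAMs) fit in the device while three would not.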
B. Dedicated DSP blocks:
Three DSP blocks are used per channel for interpolation and apodization. One extra DSP block provides transmit apodization (see Fig. 2), giving a total of 26 dedicated DSP blocks per beamformation unit to process I and Q data. An XC4VFX100 FPGA has 160 XtremeDSP slices, so the number of DSP blocks is not a restricting factor for the implementation.
C. Logic requirements:
According to the synthesis report generated by the Xilinx ISE tool, a beamformation block requires about 3 300 out of 42 176 available slices. The maximum clock frequency for a device of speed grade 11 is 167.8 MHz and is limited by the RISQRT pipeline logic.
V. CONCLUSION
The present paper describes the beamformer building blocks of a synthetic aperture imaging system. The beamformer can beamform groups of 8 lines originating from any point inside the region of investigation and having an arbitrary orientation, inter-sample distance and length. This makes it possible to beamform lines that are suitable for vector flow estimation. Furthermore, the calculation of the forward and backward propagation times has been separated, enabling the calculation of the time of flight for conventional and synthetic aperture images. The description of the beamformed lines uses an exact formula in 3 dimensions, making it suitable for 3D imaging. The error in the delay calculation is fed back into the delay calculation unit, thus limiting the maximum error to half of the value of the least significant bit. In the presented case, this error is less than 1/16 of the sampling period. Adding more stages to the RISQRT unit can increase the precision.
There are two sources of limitations in the presented design. The speed of calculations is limited by the logic involved in the time-of-flight calculations (the RISQRT circuit). Additional pipelining could increase this speed, but in that case the speed limitation will be imposed by the RAM blocks which must deliver 2 input samples for each beamformed pixel.
The second limitation is imposed by the available bandwidth, which makes it necessary to use the built-in RAM blocks for buffers. The number of these RAM blocks limits the number of beamformation units per FPGA to 2 if synthetic aperture data are to be beamformed.
VI. ACKNOWLEDGMENTS
This work was supported by grant 26-04-0024 by the Danish Science Foundation and B-K Medical A/S, Herlev, Denmark.
