I. Introduction
S ynthetic aperture (SA) focusing was conceived in the 1950s and has been implemented in radar systems using digital computers since the late 1970s [1] . Originally these systems used a single transmitting and receiving antenna, and the approach was known as monostatic SA imaging. The first attempts to build systems using ultrasound were made in the late 1970s and 1980s [2] - [7] . These systems were more or less a direct implementation of the monostatic SA imaging and had several limitations imposed by the technology. Poor signal-to-noise ratio and image quality prevented them from being used in medical scanners. From the beginning of the 1990s, more focus has turned toward the real-time implementation of synthetic aperture imaging [8] - [13] prompted by the advances in digital technology. During this period, several techniques to increase the signal-to-noise ratio have been introduced, such as the use of multiple elements in transmit [9] , [10] , [14] - [20] , and new ways of calculating the propagation path have been defined [21] . The use of coded excitations has further increased the signal-to-noise ratio, image contrast, and penetration depth [22] - [26] . The real appeal of using SA methods for medical ultrasound is the use of synthetic transmit aperture (STA) focusing, where all transducer elements are used to receive at every transmission [19] , [27] - [29] . The combination of all these approaches has made it possible to generate not only images at high frame rates, but also images that have superior contrast and resolution [30] , [31] .
In the last decade, effects of motion in SA imaging have been studied [32] , and new methods for estimating the blood velocity were developed [19] , [33] - [35] . In many cases, estimating the blood velocity using STA focusing performs better than traditional methods due to the uninterrupted flow of information for every point in the image. The methods are complicated to implement as they require beamforming along the flow direction, which may change from point to point in the image.
All these advances have spurred new interest in the realtime implementation of STA imaging [36] . The most critical problem is the beamformer and the very large number of points to generate. This paper describes a parametric delay-and-sum beamformer capable of creating conventional 2-D ultrasound images, 3-D ultrasound volumes, and images acquired using synthetic aperture techniques. Among these applications, synthetic aperture imaging imposes the largest computational demands, which has influenced the architecture of the beamformer. The main challenge addressed in this work is the large number of image points beamformed per second-about 2 × 10 9 or 200 lines × 1000 samples/line × 5000 emissions/sec × 2 in-phase and quadrature phase (IQ) samples using signals from 256 channels acquired from up to 256 emissions. The delays and apodization coefficients used in the beamformation process are unique for every channel and transmission, and reading them from look-up tables exceeds 2 times the available bandwidth in the system. Using a flow-through architecture common for most scanners exceeds the available bandwidth 3 times. These 2 problems have been solved by suitable partitioning of processing and calculating the delay and apodization coefficients at run-time. The beamformer is implemented in an experimental scanner [37] , [38] , which will be used to test the clinical relevance of the developed STA methods. The paper is organized as follows. Section II describes the algorithms used in the beamformer and its overall design. Section III gives implementation details in terms of data flow and required hardware resources. All calculations are done using integer numbers (fixed-point numbers) and the errors introduced are analyzed in Section IV. Beamforming results, shown in Section V, comply with the expected side-lobe levels predicted by the error analysis. Section VI concludes the paper.
II. Beamformer Design
The beamformer consists of several beamforming units, each processing data from a limited number of channels and producing a part of the image. Fig. 1 gives an overview of a single beamforming unit. An image is specified as a set of lines to accommodate both SA and conventional imaging, and each unit can beamform 8 lines in parallel using data from 4 channels. It contains 5 delay calculation units-4 for the receive propagation time and 1 for the transmit propagation time. The unit contains also 5 units calculating the apodization coefficients. Both the apodization coefficients and the propagation times are calculated parametrically. Linear interpolation is used to generate samples with subsample delay. The sum of 4 receive channels represents a partial low-resolution image in the case of SA imaging. It is multiplied by the coefficient for transmit apodization before it summed with the rest of the low-resolution images. Several line-buffers accumulate the result until all transmissions have finished. The following sections describe each of these units.
A. Line and Image Definition
The goal of this work has been to design and implement a beamformer that can handle arbitrary image and line geometry. It has been decided that images are defined as sets of lines and that each line consists of equispaced points as shown in Fig. 1 . An image line is described by an origin r o , direction ζ, distance between successive points ∆r and the number of points in the line:
Another design objective is that the beamformer should handle both conventional beamforming, in which a focused transmit beam is created, and synthetic aperture imaging, in which spherical waves are transmitted and an image is created by coherent summation of data acquired over several transmissions. Delay-and-sum beamforming has the needed flexibility to accommodate both imaging situations. For phased and linear array images, only data from a single emission are beamformed to create a line or a set of lines in the case of a parallel beamformer:
where g j (t) is the signal received by the jth element, w j is the dynamic apodization coefficient, and the time of flight is
where r p , r o , and r j are the coordinates of the beamformed point, the origin of transmission and the receive element, respectively. The geometry is illustrated in Fig. 1(a) . The speed of sound is c and the number of elements used in receive is N .
In synthetic aperture imaging, a full low-resolution image is formed after every transmission [19] , and these lowresolution images must be added coherently to form the final high-resolution image. A double summation is thus needed to form a single point:
where M is the number of transmissions, i is the index of the transmitting element, and w ij is an apodization coefficient that depends both on the position of the transmit and receive elements. Notice that the propagation time is now calculated from r i , the position of the transmitting element, rather than r o , the origin of the line as it was done in (3). The envelope of the beamformed signal is used both to form a B-mode image and to estimate the blood velocity. Thus, both the in-phase and quadrature-phase signals are needed in the beamformation process.
At every emission all points in the image are beamformed using all channel data, and a low-resolution image is formed. The resulting low-resolution image is accumulated in a buffer for the number of emissions M . Large buffers are needed for the storage of input data and intermediate results. This precludes the use look-up tables as done in [36] . Instead we have developed a recursive procedure for the calculation of time-of-flight.
B. Recursive Calculation of Time-of-Flight
There exist many approaches to calculating the time of flight such as the ones described in [39] , [40] . Most designs use either some form of symmetry or treat special cases. In this work, we want to calculate general propagation times in 3-D space. This enables beamforming along the direction of flow, which is used for vector flow estimation [35] , [41] .
The time of flight is made of 2 components t ip and t jp , where i, j, and p indicate indexes of transmit element, receive element, and a point along a line. Each unit calculating time of flight calculates the time of flight from a single element to a point along the line, and the index of the transmit or receive element will be skipped in the following. The calculation procedure is given by
where p is the index of the point along the image line (0 ≤ p < P ) and r oi is the position of the origin of the line expressed in a coordinate system, whose origin coincides with the center of the transducer element. The squared distance is given by
For each of the 3 coordinates, the squared distance can be written as
The value at the previous point p − 1 is
Subtracting (7) from (6) gives
where
are constants unique for a given combination of line and element and are passed as parameters to the delay calculation unit. The calculation of the difference between the squared propagation times expressed in number of samples, d p = f s t p , is straightforward: where
and f s is the sampling frequency. The squared propagation time can be found by accumulating the following differences:
The circuit that finds Λ p is shown in Fig. 2 . It consists of 3 registers. The product pD is found by adding D to the register M at every step.
What is needed is the propagation time d p = d 2 p . Several methods exist to find the square root of a number such as Newton's method [42] , a coordinate rotation digital computer (CORDIC) processor [43] , [44] , and recursive iterative calculation (RISQRT) [45] . The RISQRT method was previously found to occupy the least amount of resources [45] and is the one that has been chosen for implementation. The calculation procedure is described in the following paragraphs.
It can be shown that 
The outlined procedure needs multiplications and comparisons. These can be avoided by keeping track of bothd p [m] and the difference∆
The condition that 
The multiplications can be avoided by making sure that only multiplications by 2
x are used. The initial value ∆d = 2 N b , where N b represents the number of bits needed to encode ∆d. At every iteration, we form
The procedure to find the propagation times d p expressed in number of samples thus becomes
5: {Output data:dp -estimate of dp} 6: {Output data:∆p -residual errord Two implementations of this circuit are possibleiterative and pipelined. The former outputs a new result at every N b clock cycles, and the latter outputs a result at every clock cycle. The calculation is recursive, and the pipelined circuit needs to calculate delays for N b lines interleaved as illustrated in Fig. 3 , where 8 lines are being beamformed simultaneously. The processing that is performed in a single stage of the pipeline is shown in Fig. 4 k is reduced to appropriate wiring.
C. Apodization
Apodization curves must be calculated separately for every combination of line and channel. The apodization curves are not fixed, and it has been found that piecewise linear interpolation describes the curves with sufficient precision. Each segment is described by a start value, increment, and number of samples. The start value prevents the accumulation of errors from the use of integer numbers. Furthermore, discontinuities are handled in an easy manner.
D. Interpolation
To achieve higher precision, subsample delays must be used. This is achieved using interpolation [46] , [47] . Linear interpolation is used to generate samples for time instances that are not integer multiples of the sampling period. It gives better results than nearest neighbor interpolation [36] and is still relatively simple to implement. The propagation time d p consists of an integer part and a fraction:
The range of the integer part n is determined by the maximum number of samples stored by the system. The output from the beamforming is
where g [n] is the sampled signal of the channel that is processed, p is the index of the beamformed sample along the image line. For every output sample, 2 input samples must be read from the buffer memory. The interpolation unit must therefore work at twice the clock frequency as the rest of the circuit. Fig. 5 shows a block diagram of the interpolation circuit. Two registers are used to store the address n and the interpolation coefficient α. The sample at address n is fetched in register R1 during the first halfperiod of the clock cycle. During this period, the address n+1 is formed, and g[n+1] is read in R2. Both R1 and R2 are clocked at the double clock frequency. The register for the result is clocked at the working clock frequency. This allows for an uninterrupted output of samples s [p] , which are produced at every clock cycle of the fundamental clock. The total delay in the circuit is 1 clock cycle.
III. Implementation
This section gives the implementation details. The goal has been to implement a versatile beamformer capable of beamforming 40 high-resolution images per second consisting of up to 192 lines and 1024 samples/line. Each highresolution image is made by adding together the beamformed low-resolution images from up to 128 emissions. The analog-to-digital converter operates at a sampling frequency of 75 MHz, and the basic clock frequency has been chosen to be 150 MHz. The processing has been distributed over 64 boards. Data from 256 channels are used to create synthetic-aperture images in real time. The total number of beamformed samples per second is 881 million. Each beamformation unit can process 150 million IQ samples hence, 6 beamformation units per board are needed. As it will be shown below, 3 beamformation units can be fitted in a Virtex-4 (Xilinx, Inc., San Jose, CA) field-programmable gate array (FPGA), used in the system. The beamformation of high-resolution images from 4 receive channels is therefore distributed over 2 FPGAs. Fig. 6 shows one of the FPGAs. The device chosen for the implementation is XC4VFX100 (Xilinx) , and its logic resources, which are essential to the design of the beamformer, are shown in Table I . The serial transceivers are configured to operate at 3.2 Gb/sec. The data communication between every 2 FPGAs is implemented with 4 serial links, giving a bandwidth of 12.8 Gb/s. Separate The buffers for parameters, IQ samples and beamformed lines are held internally in the FPGA using the block RAMs. The multiplications and some of the additions are implemented using the dedicated digital signal processing (DSP) blocks, which can perform a multiplyand-accumulate operation in a single clock cycle [49] .
The limitations imposed by the resources are detailed in the following paragraphs.
A. I/O Bandwidth
The data, sampled at 75 MHz, are consequently downsampled so that the channels' data do not exceed 4096 samples [37] . The samples are sampled at 12 bits/sample. Then they are matched-filtered, and the resulting size is 16 bit. Both IQ samples are created in the filtration process. The matched filtered samples are sent for further processing to another FPGA. The bandwidth B in of this communication link sets a limit on the highest achievable pulse repetition frequency f prf ; f prf is proportional to the input bandwidth of the FPGA 1 , B in , and is inversely proportional to the number of channels N chnl , the number of input samples N s , and the precision N bin :
A typical imaging situation will have a f prf between 3000 Hz and 10 000 Hz, which leaves a comfortable margin.
The target is to create up to 40 high-resolution frames per second to capture fast movements. One FPGA houses 3 beam-former blocks, as shown in Fig. 3 , and processes half image. The required output bandwidth is thus
= 180 Mb/s (20) where N l is the number of beamformed lines, N p is the number of samples per line, N b is the number of bits per sample, and f r is the frame rate. It can be seen that the output bandwidth is well below the peak bandwidth of 12.8 Gb/s. Notice that the samples are represented by 24 bits to decrease the effects of truncation in the calculations (see also Section IV-C).
B. Block RAM
A full low-resolution image is created for every emission. Then the low-resolution images are summed together to create a high-resolution image. Typically between 64 and 128 emissions are used to create a high-quality B-mode image. Two approaches are possible: 1) create low-resolution images in one FPGA and then send them for summation in another, and 2) accumulate them inside the FPGA. In this work, the latter approach has been adopted because of the high bandwidth the former scheme requires. All buffers shown in Fig. 3 are implemented using the dedicated block RAMs. They are synchronous and dual-ported and can be configured to address 1 bit, 4 bit, 8 bit, 16 bit, and 32 bit. They feature additional bits, so that one can use 36 bits per address.
To simplify the synchronization, double buffering (sometimes also called ping-pong buffering) is used to load the data. Buffers are needed to hold the parameters for the delay and apodization generation, the input IQ samples, and the output beamformed lines.
Generation of Delays and Apodization Coefficients:
To calculate the delays, 5 parameters are needed per line per channel-the 2 constants B and C, see (12) and (10) The apodization is calculated using piecewise linear approximation. The number of line segments is limited to 8. For each line-segment, 36 bits are required. The slope and initial value are represented with 14-bit internal precision and the number of samples in a segment with 8 bits. The total number of 36-bit words is 32 lines × 8 segments = 256 36-bit words. The block RAMs can be configured as 512 addresses × 36-bit words. Hence, one block RAM is sufficient to describe all lines for a single channel. Every acquisition is performed using a different transmit element, and new parameters must be downloaded at every transmission using double buffering. A single beamformer block like the one in Fig. 1 requires 6 block RAMs.
Input Samples:
Input samples must be stored while a full image is being beamformed. Inside the FPGA, the samples from 4 channels for 1 emission must be stored. It has been decided that each beamformation block operates independently and will have its own copy of the input samples. Up to 4096 IQ 16-bit samples are stored for each channel. Double buffering is used for communications. The number of RAM blocks per channel thus is 2 buffers × 2 bytes/sample × 4096 samples/2048 bytes/RAM block = 16 blocks of RAM. A single beamformer handles 4 channels and thus needs 64 RAM blocks.
Output Samples/Line Buffers:
Output samples are stored in 24 bits to minimize the effects of rounding in the multiplication. A beamforming block has to process 32 lines, which is 1/6th of all the lines in an image. Storing all lines internally in the FPGA leaves room only for 2 beamforming blocks. Thus it has been decided that a beamformer block will store internally only 8 lines. Once they are beamformed, they are sent further while the beamforming block scans the next 32 lines. To beamform 8 lines fully, the information from all transmissions is needed. Because only the data from a single transmission is kept in the FPGA, the sampled data are first saved in one of the external DDR RAMs, while the data from the other DDR RAM are being processed. The amount of data needed for 1 image is 128 transmissions × 2(IQ) × 2 bytes/sample × 4096 samples/channel × 4 channels = 8 MB. The required bandwidth is 8 MB × 40 frames/s = 320 MB/s. The bandwidth is 2.4 GB/s (see Fig. 5 ), which makes it possible to read up to 8 times the data necessary to beamform an image or a part of it. The beamformer blocks create only 1/4th of what is required at a time, thus the sample data must be read 4 times, thus using 50% of the available bandwidth.
The memory buffers needed to store 8 lines is 2 buffers × 8 lines × 1024 samples/line × 3 bytes/sample × 2(IQ)/2048 bytes/RAM block = 48 blocks of RAM.
So one beamformation block uses a total of 118 RAM blocks-6 for parameters, 64 for input samples, and 48 for output samples. The total number of RAM blocks is 376 (see Table I ) allowing for 3 beamformer blocks to fit in the FPGA.
C. DSP Slices
The DSP slices are needed because of the built-in 18 − bit × 18 − bit multipliers. These multipliers can operate at frequencies of up to 450 MHz and can perform a multiply-and-accumulate operation at every clock cycle. The data from a single channel is first interpolated as shown in Fig. 5 using one multiplier. Then the 8-bit receive apodization is multiplied onto the result. Finally, the summed data are multiplied by the transmit apodization before summation. Thus, a beamformer block requires 10 Extreme DSP slices for the I channel and 10 Extreme DSP slices for the Q channel. The total number of Extreme DSP slices for 3 beamforming blocks is thus 60. The total number of available Extreme DSP slices is 160 as given in Table I .
D. Logic Cells
Using the synthesis report generated by the Xilinx ISE tool, a beamformation block requires about 3400 to 3500 out of 42 176 available slices. Three beamformation blocks thus occupy about 25% of the available logic resources. The maximum clock frequency for a device speed grade 11 is 167.8 MHz in our implementation and is limited by the RISQRT pipeline logic.
IV. Error Analysis
There are 2 errors introduced in the beamformation process-errors from the calculated time of flight and the calculated apodization coefficients.
A. Errors in the Time of Flight
The results in this section are based on calculations using the parameters listed in Table II . To illustrate the errors, the time of flight to the points of a line with origin r o = (0, 0, 0.003) T m and direction ζ = (0, 0, 1) T has been used in this section. The number of points in the line was set to 1024 and the end depth to 0.139 m. Fig. 7 shows an example of how the error for elements with coordinates r 5 = (−0.038, 0, 0)
T m looks as a function of distance along the line. The error is characterized by a dcoffset that changes as a function of depth and a random ac-component. The middle plot shows how the dc-offset (bias) changes. It is due to the error accumulated recursively in the calculation of the squared distance; see (9) and (13) . This bias varies from channel to channel; it can start from a negative value and move toward a positive value, or it can start from a positive value and move toward a negative value, or it can be 0. Subtracting this bias from the total error gives the error due to the RISQRT algorithm, which is shown in the bottom plot of Fig. 7 . The time-of-flightd p is calculated in samples using fixedpoint numbers with 12 bit for integer part and 4 bit for the fractional part. The error due to RISQRT varies in the interval [0 : 2 −4 ] samples. Previously, Holm and Kristoffersen have given an analysis of the effect of quantizing the delays to the side lobes in a focused system [50] . In their work, they distinguish between random and periodic errors. Periodic errors give a rise to undesired grating lobes. To characterize the nature of the error, we look at the power density spectrum of the error, which is shown in Fig. 8 . The figure shows the power-density spectrum of the error for a single point in a line across all channels. The spectrum has been averaged over all points in a line. The top graph shows the error caused by quantizing the exact time of flight with 4-bit precision for the fractional part. The error is white with uniform distribution. The expected level is 10 log((2 −4 ) 2 /12) = −34.88 dB. The total error is not white, but the lack of dominant periodic components prevents the rise of grating lobes. The distribution of the error (not shown) can be described by a Gaussian function. The error introduced by the RISQRT algorithm is white in nature, and its power is the same as the power of a quantization error.
The delay calculation circuit has been tested for several imaging geometries-linear and phased array imaging and beamforming for directional flow estimation. These tests are illustrated in [45] . The mean absolute error for all these cases is half of the least significant bit (2 −5 ). The sampling frequency is set to f s ≥ 4f 0 , which means that the mean absolute error is on the order of λ/128. The reason is that for the same channel, the bias-as seen in the top plot in Fig. 7 -changes from line to line and has a mean of zero. Thus, the only error that is always present is the error from the RISQRT circuit, which in magnitude and probability distribution is equal to the error introduced by quantization of the exact time of flight.
B. Errors in Calculation of Apodization
We have chosen to use 7 significant bits to encode the apodization (plus a sign bit). The slope is encoded with 14 bits to minimize the effect of accumulated error. This error manifests itself as a discontinuity at the point when a new segment starts. Fig. 9 shows an example of dynamic apodization based on expanding Hamming window. The parameters of the transducer are the same as in the previous section (see Table II ). The top plot shows a view of the approximated apodization as a function of depth and element number. The bottom plot shows the maximum absolute error of the calculation. For comparison, the maximum errors when the apodization is quantized with 6 and 7 significant bits are shown too. The mean absolute error tends toward the mean absolute error of 7-bit quantization. For a given point, the calculated apodization curve is "smooth." The average increase in the side-lobe level of the magnitude spectrum of the apodization function is less than 1 to 2 dB. 
C. Errors from Truncation of Intermediate Calculations
The data format for the sampled data are 16 bit, but data are sampled with 12-bit precision. The linear interpolation uses the built-in 18 × 18 multipliers and adds 4-bit precision from the interpolation. The 16-bit result is multiplied by the 7-bit apodization coefficient giving a 23-bit result without any rounding employed. The data from the 4 channels are added together without any loss of precision adding extra 3 bits. The 17 most significant bits of the sum are then multiplied by the 7-bit transmit appodization. The system is dimensioned so that 128 transmissions can be added together without loss of precision. Again, only the 17 most significant bits of the weighted sum are accumulated in the line buffers, which are 24 bits wide. Thus, there are 2 sources of noise from truncation-one before the multiplication with transmit apodization and one before the accumulation to create the final high-resolution image. Assuming that the apodization coefficients are 1, then the total power of the error is
where M is the number of transmissions, and P 17-bit is the noise of rounding the result at 17 bits. It is also assumed that the error can be modeled as white noise with uniform distribution. As mentioned before, 4 channels are processed on a single board. To create the final image, the results from each board are sent to be added to the sum from the other board. The system is dimensioned for up to 256 channels, distributed over 64 boards. The sum is 16 bit. The summation over 64 boards adds 6 bits to the result. The result from accumulating 4 channels must be downshifted, so that only 10 bits remain to prevent overflow. The noise from this cascade operation, when the resulting sum is 16 bit, is
where P 10-bit is the noise introduced by rounding the result at 10 bits. The total noise is the sum of the 2 components. For a system with N = 256 receive elements (64 boards) and M = 128 transmissions, the total noise is
This implies that the side-lobe level will be dominated by the noise introduced by the rounding operations.
Another implementation is to use a 24-bit accumulation. In this case, the result from summation must be rounded at 17-bit level. The power of the noise, thus becomes
This level is lower than the typical display range of 60 dB as shown in Fig. 10 . Fig. 10 shows a B-mode image of a sector scan formed by a 128-element transducer using synthetic aperture focusing with 128 transmissions. The signals for the beamformer have been created by a simulation using Field II [51] , [52] . The transducer parameters are the same as in Table II , except for the number of elements, which is 128, and the pitch, which is λ/2.
V. Imaging Results and Discussion
The goal of the simulation is to reveal the influence of the different imaging parameters. A reference image has been created in which the input samples are represented with floating point numbers with double precision. The result from the implemented beamformer is shown in the bottom image of Fig. 10 . The input signal has been quantized with 12 bits (range: −2 11 , 2 11 − 1). The precision of the delays and apodization is as described in the previous sections. The delays were calculated with 12 bits for integer part and 4 bits for the fractional part. The apodization was calculated with 7-bit signed numbers. The samples have been right-shifted by 11 bits before accumulation in the line buffers. Table III summarizes which processing steps are present in each test case. All point spread functions have been calculated using exactly the same steps. Gradually the double-precision parameters are replaced by integer parameters as shown in the table. As it can be seen from Fig. 11 , the results from all test cases, except for the 16-bit accumulation, are rather similar. The main lobes are virtually indistinguishable and the side-lobe levels are concentrated within 1 to 2 dB and are below −60 dB from the peak value.
The largest influence on the image is the noise introduced by the rounding operations. The results in the 16-bit accumulation represent the most conservative design. In this case, the side-lobe level is about 50 dB below the peak value of the point spread function. The energy in the side lobes is about 40 dB below the energy of the main lobe as shown in Fig. 11 . Due to attenuation, signals from larger depths will rarely use the full 12-bit range of the A/D converter. The 4 × λ sampling means that the 2 adjacent samples used by the linear interpolation will not be close to the maximum value simultaneously. Therefore, there is no need to downshift the result from the linear interpolation. Summing 24 bit across boards will, thus, not introduce any noise, and the resulting image will be comparable to the reference image.
VI. Summary and Conclusion
This paper describes a beamformer implemented using modern programmable devices (FPGA). The beamformer consists of beamforming blocks that process the data in parallel. A single device can house 3 beamforming blocks and is capable of beamforming data from 4 transducer ele- 
1 Implemented using serial links. Fig. 11 . Projection of the point spread function as a function of angle. The top plot shows the RMS and the bottom, the maximum value. The precision of the delays and apodization coefficients is sufficient to get a result that is less then 1 dB different than the reference. The major contributors to reduced image quality are the rounding operations in the beamforming process.
ments and producing 450 (150 per block) million complex (IQ) samples per second. This is equivalent to forming 4600 low-resolution images consisting of 96 lines, each long 1024 samples, because this number is sufficient to meet Nyquist's sampling criterion. If the lines are shorter, then their number can be increased. The suggested beamformer can handle images with arbitrary geometry, as long as they can be described as a collection of lines. Each line can have arbitrary origin, direction, and inter-sample distance. The length of the line is limited to 1024 samples. Another restriction of the beamformer is the size of the input buffer, which is 4096 complex samples (IQ).
To make the implementation of the beamformer possible, several design ideas have been employed. Because it is impossible to read all imaging parameters from an external memory for the desired image size, we have developed a recursive parametric delay calculation unit. It can calculate the delays with a precision of λ/128 without employing any multipliers, thus sparing resources for the interpolation unit. Similarly, the apodization coefficients are calculated using a piecewise linear approximation. The results from the individual transmissions are accumulated internally in the device, thus reducing the requirements for input/output bandwidth by the number of transmissions M . To increase the image size, we have used the fact that the memory bandwidth to read data are 8 times faster than the rate at which data are coming to the FPGA. The output of the FPGA is time-delayed by 1 frame. Only 1/4th of the image is beamformed using all transmission. Then the frame is sent further. Then another quarter of the image is processed, and so on.
The delay generation has been tested and shown to perform with the expected accuracy. Tests have shown that the nonperiodic nature of the error prevents the forming of secondary grating lobes. The use of integer delays and apodization coefficients does not worsen the image quality, which is mainly determined by the sampling frequency in combination with linear interpolation. The single largest source of reduction in image quality is the introduction of noise due to rounding operations in the data path. Experiments have, however, shown that it is not necessary to truncate the result, because the results are well within the range of numbers handled by the hardware. The point spread function of the images created using the described beamformer have the same main lobe as the reference images produced using double-precision floating point calculations for the delays, the apodization coefficients, and in the interpolation procedure. The side-lobe level is about 1 dB higher, but still below −60 dB when the suggested architecture is used in combination with 24-bit summation for the end result. When 16-bit summation is used, the side lobe is about 50 dB below the peak value.
The present work has shown that it is possible to create real-time synthetic transmit aperture system using present technology. The quality of the images is close to what is attainable using off-line processing with high precision. The described beamformer architecture is implemented in hardware and will be used further to assess the clinical relevance of the use of synthetic aperture focusing for medical applications.
