Synthetic aperture (SA) ultrasound imaging has not been introduced in commercial scanners mainly due to the computational cost associated with the hardware implementation of this imaging modality.
I. INTRODUCTION
This paper describes the possibilities offered by synthetic aperture (SA) ultrasound imaging and the problems associated with the implementation of a real-time imaging system. SA imaging, originally developed for radar applications in the 1950s, is a radically different method from the sequential scanning commonly employed in medical ultrasound [1].
Attempts to implement an ultrasound scanner based on SA imaging date back to the late 1960s and early 1970s [2], [3]. In the 1970s and 1980s, SA imaging was investigated primarily for nondestructive testing. Although transducer arrays were used, the acquisition method was a direct implementation of radar SA focusing, known as monostatic synthetic aperture imaging [4]. Until the start of the 1990s, SA imaging was rarely considered for medical applications. A number of investigations have been conducted in recent years [5]-[11], including some attempts to implement a real-time SA scanner [12]. These efforts have led to novel imaging algorithms such as recursive imaging [13] and the use of orthogonal codes [7], [14]. It has also been demonstrated, both with phantoms and in vivo, that SA imaging can be used for high-precision blood flow estimation [9], [15], [16]. These developments are described in greater detail in Section II. Since a number of in-vivo investigations have already been carried out successfully, the logical next step is to develop and implement a real-time SA scanner. Rather than developing a scanner that supports one or two imaging modalities, we set out to develop a platform flexible enough to support all of the enumerated algorithms. Section III describes some of the problems that we have encountered in doing so and the choices that we have made.
II. ADVANTAGES OF SA IMAGING
SA imaging offers a number of advantages over conventional ultrasound scanning: perfectly focused images, higher contrast and depth of penetration, fast image acquisition suitable for real-time 3D imaging, and precise vector velocity estimates.
A. Better images
The principle of SA imaging is illustrated in Fig. 1. A spherical wave is transmitted using one or several elements. The back-scattered echo stems from the whole region under investigation. All elements are used to record the returned signal and a full image is reconstructed. The image has a low resolution because the transmitted wave is not focused. The measurement is repeated using another origin of the spherical wave until the whole aperture is covered. All of the reconstructed images are added together and a high-resolution image is created. The beamformation is performed by finding the geometric distance from the origin of the spherical wave to the image point and back to the receiving element. Dividing this distance by the speed of sound gives the propagation time. The reconstruction process can be expressed as:
where r_p = [x, y, z]^T is the point at which the image H is being reconstructed, s_j(t) is the signal received by the j-th element, t_p is the propagation time from the transmit location to the point and back to the receive element, and a denotes the weighting (apodization) coefficients applied to the signal. The propagation time is calculated exactly for every point (see Section III-C) and a perfectly focused image can be reconstructed [9], [17]. A problem in SA ultrasound imaging is the lower signal-to-noise ratio (SNR) due to the use of unfocused transmit waves. This can be solved by combining multiple-element transmissions with frequency-modulated signals [9], [18].
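The reconstruction described above amounts to a double sum over emissions and receive elements, each term being a weighted sample taken at the propagation time t_p. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a 2-D geometry, nearest-sample lookup instead of interpolation, uniform apodization by default, and idealized impulse echoes. All function names, the 4-element array and the 40 MHz sampling frequency are hypothetical.

```python
import math

def time_of_flight(tx_origin, point, rx_pos, c):
    """Propagation time: transmit origin -> image point -> receive element."""
    return (math.dist(tx_origin, point) + math.dist(point, rx_pos)) / c

def sa_beamform_point(point, tx_origins, rx_positions, signals, fs, c=1540.0, apod=None):
    """Sum signals[i][j] at the sample nearest the time of flight, over all
    emissions i and receive elements j. Sample 0 is the instant of transmission."""
    total = 0.0
    for i, origin in enumerate(tx_origins):
        for j, rx in enumerate(rx_positions):
            n = round(time_of_flight(origin, point, rx, c) * fs)
            if 0 <= n < len(signals[i][j]):
                weight = 1.0 if apod is None else apod[j]
                total += weight * signals[i][j][n]
    return total

def simulate_point_target(target, tx_origins, rx_positions, fs, n_samples, c=1540.0):
    """Idealized echoes: a unit impulse at the round-trip delay of one scatterer."""
    signals = []
    for origin in tx_origins:
        row = []
        for rx in rx_positions:
            trace = [0.0] * n_samples
            n = round(time_of_flight(origin, target, rx, c) * fs)
            if n < n_samples:
                trace[n] = 1.0
            row.append(trace)
        signals.append(row)
    return signals

# Hypothetical 4-element array (positions in metres); every element is also
# used once as the origin of an unfocused emission.
elements = [(-3e-3, 0.0), (-1e-3, 0.0), (1e-3, 0.0), (3e-3, 0.0)]
target = (0.0, 20e-3)
fs = 40e6
signals = simulate_point_target(target, elements, elements, fs, n_samples=2048)

on_target = sa_beamform_point(target, elements, elements, signals, fs)
off_target = sa_beamform_point((5e-3, 20e-3), elements, elements, signals, fs)
```

In this toy setup all 16 emission/receiver pairs add coherently at the scatterer position, while a point 5 mm to the side collects less energy. This is the mechanism by which summing low-resolution images yields a focused high-resolution one.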
A pre-clinical study was conducted by Pedersen, Gammelmark and Jensen [17] . In this study several sonographers were asked to score images with respect to image quality and penetration depth. The SA images had both larger penetration depth and better image quality. Figure 2 shows two screenshots of the movies presented to the physicians participating in the study.
B. Better flow estimates
Color flow maps can be created with high precision using SA imaging and cross-correlation [9], [15], [16]. Two factors contribute to the precision of the flow estimates: the continuous availability of data in all points and the possibility to beamform a signal along the direction of the flow, thus increasing the correlation between signals acquired at different time instances. Figure 3 shows an in-vivo color flow map produced using cross-correlation, directional beamforming and SA imaging. Long FM pulses have been used in transmit, resulting in a loss of information in the first 13 mm. A full color flow map can be created from as few as 28 emissions [9], [15], resulting in a very high frame rate. Vector flow imaging has been shown to visualize turbulent flow and, combined with the high frame rates of SA imaging, could make it possible to visualize the complex flow patterns in the heart.
C. Real-time 3D imaging
The main obstacle for real-time 3D imaging has been the sequential (line-by-line) acquisition of conventional scanners. Great progress has been made in the area of real-time 3D imaging with the use of 2D matrix transducer arrays. A typical approach is to transmit a broad beam and to form multiple receive beams within its limits [20]. A tradeoff exists, however, between frame rate and volume size. The situation becomes worse when blood flow is to be estimated.
SA imaging gives the opportunity to create high-quality volumetric images at high frame rates, while at the same time estimating the blood flow [21], [22]. This is enabled by the fact that the number of emissions is not dependent on the image size, and the same acquisitions can be used both for B-mode images and for color flow maps.
III. REAL-TIME IMPLEMENTATION
The objective of the project is to build a research scanner that can transmit arbitrary signals, store raw sampled data, and perform conventional and synthetic aperture focusing in real time. All this should be supported by the same hardware, which must be flexible enough to accommodate different imaging geometries.
The basic stages of the scanner are pre-processing, beamforming, flow estimation and post-processing. The pre-processing stage is responsible for the decoding of the received signals, which in its simplest form consists of a per-channel matched filter. The beamforming stage consists of a number of focusing units, followed by accumulators for the case of SA imaging. The flow estimation is based on cross-correlation. Both I and Q values are produced by the beamformer, and the envelope detection, logarithmic compression, scan conversion and image display are left for a separate PC-based unit [23].
A. Overall system design
The system under development has 1024 channels distributed over 64 boards. The boards are mounted in 19-inch racks in groups of 8. Each board processes data from 16 channels and handles both transmit and receive signal processing. There are 4 Virtex-4 FPGAs on each board for signal processing. A 5th FPGA, with its PowerPC processor enabled and running the Linux operating system, is responsible for parameter setup and communications over a local area network. Inside a board, all FPGAs can communicate with each other using serial links; the bandwidth between any 2 FPGAs is 12.8 Gb/s. The boards are chained so that each of them can communicate with its two neighbors. One of the FPGAs is connected to the analog-to-digital converters and another one is connected to the digital-to-analog converters. These two FPGAs will be denoted as the Rx and Tx FPGAs, respectively.
The FPGAs are XC4VFX100 devices by Xilinx. They have been chosen because of the availability of multi-gigabit serial links, dedicated digital-signal-processing (DSP) slices and the presence of a PowerPC core. There are three distinct types of resources available in these FPGAs: RAM blocks (BRAM), DSP slices capable of single-cycle multiply-and-accumulate operations, and general logic slices. Each device has 376 18-Kbit BRAMs, 160 DSP slices and 42176 logic slices.
The A/D and D/A converters operate at 70 MHz. The processing of samples is performed at double this frequency: f_clk = 140 MHz. The dedicated DSP and RAM blocks can operate at frequencies of about 400 MHz; in practice they will be clocked at 280 MHz, which leaves a comfortable margin.
B. Matched filtration
The matched filtration is performed in the Rx FPGA, which is responsible for the reception of data, its decimation, storage in external memory, and matched filtration. The data is sent to the other FPGAs via the multi-gigabit serial links.
Present-day systems have a penetration depth of about 500 wavelengths (500λ). SA imaging can produce images with up to 50% larger penetration when a combination of coded excitations and virtual sources is used. A choice has been made to limit the system to 1000λ. Using a sampling frequency which is 4 times the center frequency, a maximum of 4000 samples are acquired per channel at every emission. Frequency-modulated signals are used in transmit. Their length is up to 20 μs, and the transmit clock frequency is 70 MHz. Thus, a waveform is up to 1400 samples long. The pulse compression is performed with the time-reversed transmit waveform weighted with a suitable window. The pulse compression is implemented in the frequency domain, before beamforming, in the Rx FPGA. An 8192-point fast Fourier transform (FFT) is used so that the circular convolution implied by the FFT does not wrap around.
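The choice of an 8192-point transform follows from the record and waveform lengths quoted above. Filtering L samples with an M-tap filter produces L + M - 1 output samples, and the transform must be at least that long for the FFT-based product to equal the linear convolution. The sketch below checks this arithmetic and illustrates the wrap-around on a toy example. The helper names are hypothetical, and the direct O(N^2) convolutions stand in for the FFT purely for illustration.

```python
def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def linear_convolve(x, h):
    """Direct linear convolution; output length is len(x) + len(h) - 1."""
    out = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            out[i + k] += xi * hk
    return out

def circular_convolve(x, h, n):
    """Length-n circular convolution, i.e. what an n-point FFT product computes."""
    xp = x + [0] * (n - len(x))
    hp = h + [0] * (n - len(h))
    return [sum(xp[k] * hp[(m - k) % n] for k in range(n)) for m in range(n)]

# Figures quoted in the text.
max_rx_samples = 4000                      # samples per channel per emission
pulse_samples = round(70e6 * 20e-6)        # 70 MHz transmit clock, 20 us pulse
min_len = max_rx_samples + pulse_samples - 1
fft_len = next_pow2(min_len)

# Toy illustration of the wrap-around that a too-short transform would cause.
x, h = [1, 2, 3, 4], [1, 1, 1]
padded = circular_convolve(x, h, len(x) + len(h) - 1)   # equals linear result
wrapped = circular_convolve(x, h, len(x))               # tail folds onto the head
```

With these numbers the minimum length is 5399 samples, and the next power of two is 8192. A 4096-point transform would be too short.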
The FFT unit is implemented using an intellectual property core by Xilinx (Xilinx LogiCore FFT v3.2). A pipelined version of the FFT algorithm is used as it gives the best performance/resources ratio. The FFT blocks are clocked at 280 MHz. The number of required FFT blocks depends on how many sets of data must be processed per second. A pulse repetition frequency of 4 kHz provides 70000 clock cycles per emission for data processing. A pipelined FFT block needs 8192 clock cycles to process data from one channel, so it can process 8 channels per emission. A total of 4 FFT blocks (2 for the forward and 2 for the inverse transform) are required to process data from 16 channels, which is needed for 3D imaging. They will occupy 34% of the logic slices, 26% of the RAM blocks and 60% of the DSP slices.
Six FFT blocks are required to process data in real time from 16 channels at an f_prf of 5 kHz. They will use 90% of the available DSP slices. The samples from the FFT block are processed one at a time, thus 2 complex multiplications are performed per sample. Double (ping-pong) buffering is employed as a means of communication between the matched filtration unit and the other units along the data path.
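The block counts quoted for 4 kHz and 5 kHz can be reproduced with a short calculation. The sketch below assumes that a pipelined block accepts one sample per clock cycle, so that one 8192-point transform occupies it for 8192 cycles, that pipeline latency is negligible, and that forward and inverse transforms use separate blocks. The function name is hypothetical.

```python
import math

def fft_blocks_needed(f_clk_hz, f_prf_hz, fft_len, channels):
    """Return (channels one block can handle per emission, total blocks needed)."""
    cycles_per_emission = f_clk_hz // f_prf_hz
    channels_per_block = cycles_per_emission // fft_len
    blocks_per_direction = math.ceil(channels / channels_per_block)
    # One set of blocks for the forward and one for the inverse transform.
    return channels_per_block, 2 * blocks_per_direction

at_4khz = fft_blocks_needed(280_000_000, 4_000, 8192, channels=16)
at_5khz = fft_blocks_needed(280_000_000, 5_000, 8192, channels=16)
```

At 4 kHz one block covers 8 channels, giving 4 blocks in total. At 5 kHz only 56000 cycles are available, one block covers 6 channels, and 3 blocks per direction (6 in total) are needed.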
C. Beamforming
Delay-and-sum beamforming is used for both conventional and SA imaging. The images are described as sets of lines. Each line is specified in 3-D space by its origin r_o, direction ζ, length L and distance between two samples Δr. In the case of SA imaging, r_o can be placed anywhere in front of the transducer and the line can have an arbitrary direction, so that it can track the flow.
In real-time SA imaging mode, each board processes data from 4 receive channels and produces 192 lines of 1024 samples each. The beamforming is performed on complex data and both I and Q samples are produced. The board produces a partial contribution to the high-resolution image: a high-resolution image for 4 receive channels (M = 4 in (1)). The image beamformed in this way is sent to the next board in the chain, where it is summed with the high-resolution image for another 4 channels, and so forth. One of the challenges is how to partition the beamforming. There are two possibilities: (a) to use one FPGA purely for the accumulation of low-resolution images until a high-resolution one is created, and (b) to divide the high-resolution image into regions and process each region in a separate FPGA. 5000 low-resolution images of 192 lines of 1024 I and Q 16-bit samples must be created per second. The bandwidth required to transfer the low-resolution images is 31.5 Gb/s. The bandwidth available in the system is 12.8 Gb/s, which makes option (a) infeasible.
It has therefore been decided that each FPGA will produce a part of the high-resolution image. This relaxes the bandwidth requirements by an order of magnitude, but larger amounts of RAM blocks are required internally to store the intermediate results as it is shown in the following paragraphs.
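The bandwidth figure that rules out a dedicated accumulation FPGA can be checked directly from the image dimensions quoted above. The sketch below only reproduces that arithmetic. The function name is hypothetical, and the link rate is the 12.8 Gb/s stated for the serial links.

```python
def low_res_image_bandwidth_bps(images_per_s, lines, samples_per_line,
                                bits_per_component, components=2):
    """Bits per second needed to move every low-resolution image (I and Q)."""
    return images_per_s * lines * samples_per_line * bits_per_component * components

required_bps = low_res_image_bandwidth_bps(5000, 192, 1024, 16)
link_bps = 12_800_000_000  # 12.8 Gb/s between two FPGAs

ratio = required_bps / link_bps
```

The result is 31 457 280 000 bit/s, which rounds to the quoted 31.5 Gb/s and is roughly 2.5 times what a single link can carry.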
The beamformer contains the following main blocks: delay calculation unit, apodization generator, interpolation unit and line accumulators. The delay and apodization generators supply the interpolation unit with parameters. The interpolation unit creates samples when a sub-sample delay is needed to produce the result. The outputs of the interpolation units are summed together. The resulting sum is multiplied with the transmit apodization coefficient and is accumulated in the output line buffers. All calculations are made on 16-bit data. The line-buffers are 24-bit to decrease rounding noise. The delay-generation unit and the beamformer are described in greater detail in [24] , [25] .
1) Delay generation: One of the challenges in describing an SA image is that the notion of "beam" loses its meaning: one can treat an image as a collection of points. Storing delay and apodization coefficients for every point is not practical. More than 5×10^9 sets of parameters must be read per second. Each set consists of a 16-bit delay value and an 8-bit apodization coefficient. Thus 15 GByte/s must be read from an external memory. Assuming a double-data-rate 128-bit interface, only 6.4 GByte/s can be read at an operating frequency of 200 MHz. Thus the parameters must be calculated dynamically in the FPGA.
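The two data rates in this argument follow from simple arithmetic. The sketch below takes the quoted 5×10^9 parameter sets per second as given and assumes that a double-data-rate interface transfers its full width twice per clock cycle. The variable names are hypothetical.

```python
# Required rate: each parameter set is a 16-bit delay plus an 8-bit apodization value.
sets_per_second = 5 * 10**9
bytes_per_set = (16 + 8) // 8
required_bytes_per_s = sets_per_second * bytes_per_set

# Available rate: 128-bit double-data-rate interface clocked at 200 MHz.
bus_width_bytes = 128 // 8
transfers_per_clock = 2          # DDR: one transfer on each clock edge
clock_hz = 200_000_000
available_bytes_per_s = bus_width_bytes * transfers_per_clock * clock_hz

shortfall = required_bytes_per_s / available_bytes_per_s
```

The memory interface would have to be more than twice as fast as assumed, which is why the parameters are generated on the fly.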
The delay-and-sum beamformation requires the calculation of the time-of-flight t_p(i, j) from the origin of transmission r_i to the point being beamformed r_p and back to a receive element r_j (see (1)). It is given by:
t_p(i, j) = (|r_p - r_i| + |r_j - r_p|) / c
where c is the speed of sound. As said, the points are uniformly distributed along a line. The forward and backward propagation times t_f and t_b are calculated independently, and only one of them will be considered in the following. An exact recursive equation exists for the squared propagation times t_i^2 and t_{i-1}^2, where i is the index of the point along the line:
where the constants A and B are calculated as:
where x_oe, y_oe, z_oe are the coordinates of the first point in the line relative to the position of the transmitting (or receiving) element. The increment of the coordinates between two successive points along the line is given by Δr = [Δx, Δy, Δz]^T. Then the square root of the time is found using a recursive equation. The procedure is described in detail in [24].
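One way to see why such a recursion is exact is to write the i-th point relative to the element as r_oe + i·Δr and expand the squared distance. The difference between successive squared distances is then 2(r_oe·Δr) + (2i - 1)|Δr|^2, which is linear in i. The sketch below uses the grouping t_i^2 = t_{i-1}^2 + A + B·i with A = (2(r_oe·Δr) - |Δr|^2)/c^2 and B = 2|Δr|^2/c^2. This grouping is an assumption made for illustration and may differ from the constants A and B defined in [24]. The recursive square root used in the hardware is not reproduced here, and the function names are hypothetical.

```python
import math

def squared_delays_recursive(first_point_rel, step, n_points, c=1540.0):
    """One-way squared propagation times along a line, one addition pair per point."""
    x, y, z = first_point_rel
    dx, dy, dz = step
    step_sq = dx * dx + dy * dy + dz * dz
    a = (2.0 * (x * dx + y * dy + z * dz) - step_sq) / c**2   # assumed grouping
    b = 2.0 * step_sq / c**2                                  # assumed grouping
    t_sq = (x * x + y * y + z * z) / c**2                     # first point, i = 0
    out = [t_sq]
    for i in range(1, n_points):
        t_sq += a + b * i
        out.append(t_sq)
    return out

def squared_delays_direct(first_point_rel, step, n_points, c=1540.0):
    """Reference: evaluate the squared distance of every point explicitly."""
    x, y, z = first_point_rel
    dx, dy, dz = step
    return [((x + i * dx) ** 2 + (y + i * dy) ** 2 + (z + i * dz) ** 2) / c**2
            for i in range(n_points)]

# Hypothetical line: starts 5 mm to the side and 10 mm in front of the element,
# 1024 points spaced 0.1 mm apart along z.
start = (5e-3, 0.0, 10e-3)
step = (0.0, 0.0, 0.1e-3)
recursive = squared_delays_recursive(start, step, 1024)
direct = squared_delays_direct(start, step, 1024)
max_rel_err = max(abs(r - d) / d for r, d in zip(recursive, direct))
```

In floating point the two agree to within rounding error. A fixed-point implementation would additionally have to choose word lengths so that the accumulated error stays below the delay resolution.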
2) Apodization: The apodization curves are described using piecewise linear approximation with maximum error of 1% of the full scale. A greedy algorithm is used to find the segments off-line. Each segment is specified by a start value, slope and a segment length in samples. The start value and the slope for each segment are encoded using 14 bits and the number of samples is encoded using 8 bits. Although the internal calculations are performed with 14-bit precision, the result is 8-bit, thus introducing up to 0.4% error in the apodization coefficient.
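A greedy segmentation of this kind can be sketched as follows: starting at the current sample, extend the segment as long as the straight line between its end points stays within the error bound, then start a new segment. The sketch is one possible greedy strategy, not necessarily the one used by the authors. It works in floating point, omits the 14-bit encoding and the 8-bit output quantization, and uses a Hann window as a hypothetical test curve. The function names are hypothetical, and the 255-sample cap reflects the 8-bit segment length.

```python
import math

def _chord_fits(curve, start, end, max_err):
    """True if the chord from curve[start] to curve[end] approximates all
    intermediate samples to within max_err."""
    slope = (curve[end] - curve[start]) / (end - start)
    return all(abs(curve[start] + slope * (k - start) - curve[k]) <= max_err
               for k in range(start + 1, end))

def greedy_segments(curve, max_err=0.01, max_len=255):
    """Return a list of (start_value, slope, length_in_samples) tuples."""
    segments = []
    start, last = 0, len(curve) - 1
    while start < last:
        end = start + 1
        while (end < last and end - start < max_len
               and _chord_fits(curve, start, end + 1, max_err)):
            end += 1
        slope = (curve[end] - curve[start]) / (end - start)
        segments.append((curve[start], slope, end - start))
        start = end
    return segments

def evaluate(segments):
    """Regenerate the curve sample by sample from its segment description."""
    out = []
    for start_value, slope, length in segments:
        out.extend(start_value + slope * k for k in range(length))
    start_value, slope, length = segments[-1]
    out.append(start_value + slope * length)   # closing sample of the last segment
    return out

# Hypothetical apodization curve: 1024-sample Hann window, full scale = 1.0.
n = 1024
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]
segments = greedy_segments(hann)
approx = evaluate(segments)
max_err = max(abs(a - b) for a, b in zip(approx, hann))
```

Quantizing the result to 8 bits adds at most one step of 1/256 of full scale, which is the 0.4% error quoted above.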
3) Interpolation: Linear interpolation is used to generate samples for time instances that are not integer multiples of the sampling period. The linear interpolation uses the built-in digital signal processing blocks available in the Xilinx Virtex-4 family of FPGAs. Each such block can multiply two 18-bit numbers and accumulate the result in a 48-bit register. The unit must operate at a clock frequency which is twice the clock frequency of the delay generator, because two samples are needed for each output sample. The output sample is calculated as follows:
The index n is the integer part of the delay. It is represented by 12 bits and can address up to 4096 samples. The subsample delay α is formed by the 4 least significant bits of the propagation time.
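The split of the delay word and the interpolation itself can be illustrated in integer arithmetic. The sketch below assumes the standard linear interpolation form y = s[n] + α(s[n+1] - s[n]) with α equal to the 4-bit fraction divided by 16. This is an assumption, since the exact expression and rounding used in the hardware are not reproduced here. The function names are hypothetical.

```python
def split_delay(delay_word):
    """Split a 16-bit delay into a 12-bit sample index and a 4-bit fraction."""
    n = (delay_word >> 4) & 0xFFF     # integer part: addresses up to 4096 samples
    frac = delay_word & 0xF           # sub-sample part, in units of 1/16 sample
    return n, frac

def interpolate(samples, delay_word):
    """Linear interpolation between samples[n] and samples[n + 1].

    The caller must ensure that n + 1 is a valid index."""
    n, frac = split_delay(delay_word)
    diff = samples[n + 1] - samples[n]
    return samples[n] + ((diff * frac) >> 4)   # shift by 4 divides by 16

trace = [0, 16, 32, 48]
halfway = interpolate(trace, 0x18)    # index 1, fraction 8/16
on_grid = interpolate(trace, 0x20)    # index 2, fraction 0
```

Each output needs the two neighbouring input samples, which is consistent with the interpolation unit running at twice the clock rate of the delay generator.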
4) Used resources: Each of the beamformation units can beamform 32 lines during one pulse-repetition period. To achieve the goal of processing close to 200 lines, one must use 6 beamformation blocks.
As mentioned previously, there are 3 critical resources in the FPGAs: dual-port RAM blocks, dedicated DSP computational units and logic blocks. The most critical resource is the number of RAM blocks in the FPGA. A single block RAM (BRAM) can store all of the parameters that describe both apodization and delays for a single element for 32 lines. Parameters for 5 channels must be calculated inside a beamformation unit: 4 receiving and one transmitting. Since the transmitting position changes at every transmission, a new set of parameters must be downloaded from an external memory. Double buffering is used in this case too, and the number of BRAMs for parameter storage is 6 per beamformation unit.
Both I and Q signals are processed in parallel. The raw channel data is 4096 samples long with a 16-bit precision. Accounting for the double-buffering operation, a total of 64 RAM blocks is required to store the raw channel data.
The number of samples in a line is limited to 1024 and the number of lines per beamformation unit is fixed to 32. The precision for the accumulation operation is chosen to be 24-bit, although the data processed further in the system is truncated to 16-bit. A total of 96 BRAMs are required for the formation of a high-resolution image inside the FPGA. Since the chosen device has 376 BRAMs, only 2 beamformation units can fit in a single FPGA. In contrast to the memory usage, these units together utilize only 33% of the available computational resources.
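The memory figures in this subsection can be reproduced with a simple accounting. The sketch below is one breakdown that yields the quoted 64 and 96 blocks, and it rests on assumptions that the text does not state explicitly. It assumes that each 18-Kbit BRAM holds 16 Kbit of data, with the remaining bits being parity. It assumes that the raw data buffer holds 4 receive channels of I and Q samples and is double-buffered. It assumes that the line accumulators hold I and Q for 32 lines. All names are hypothetical.

```python
BRAM_DATA_BITS = 16 * 1024   # assumption: usable data bits per 18-Kbit block

def brams_for(bits):
    """Number of block RAMs needed to hold the given number of bits."""
    return -(-bits // BRAM_DATA_BITS)   # ceiling division

# Raw channel data: 4 channels x 4096 samples x 16 bit x (I, Q) x double buffer.
raw_bits = 4 * 4096 * 16 * 2 * 2
raw_brams = brams_for(raw_bits)

# Line accumulators: 32 lines x 1024 samples x 24 bit x (I, Q).
acc_bits = 32 * 1024 * 24 * 2
acc_brams = brams_for(acc_bits)

# Beamformation units needed for 192 lines at 32 lines per unit.
units_needed = -(-192 // 32)
```

Under these assumptions the raw data needs 64 blocks and the accumulators 96, which matches the figures in the text. The accumulators dominate, which is why block RAM rather than arithmetic limits how many beamformation units fit in one device.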
IV. CONCLUSION
This paper has described the opportunities that SA imaging offers, as well as some of the problems that have been considered in our real-time implementation. SA imaging yields B-mode images of higher quality. The number of emissions is not coupled to the image size, and high frame rates are achievable using sparse transmissions. The imaging is no longer confined to the beams traditionally used in ultrasound imaging: one reconstructs points inside the region under investigation in a manner that fits one's needs. This has been used to obtain precise blood flow estimates, but the areas of application can be extended to tracking of specific regions, elasticity imaging, etc.
There are a number of challenges that must be overcome in the implementation of synthetic aperture imaging in real time. In our implementation, two resources have been critical: the communication bandwidth between devices and the number of available built-in block RAMs.
We have shown that it is possible to develop a system for real-time synthetic aperture imaging using commercially available high-end FPGAs.
