Abstract-A multilevel Lagrangian carotid strain imaging algorithm is analyzed to identify computational bottlenecks for implementation on a graphics processing unit (GPU). Displacement tracking including regularization was found to be the most computationally expensive aspect of this strain imaging algorithm taking about 2.2 h for an entire cardiac cycle. This intensive displacement tracking was essential to obtain Lagrangian strain tensors. However, most of the computational techniques used for displacement tracking are parallelizable, and hence GPU implementation is expected to be beneficial. A new scheme for subsample displacement estimation referred to as a multilevel global peak finder was also developed since the Nelder-Mead simplex optimization technique used in the CPU implementation was not suitable for GPU implementation. GPU optimizations to minimize thread divergence and utilization of shared and texture memories were also implemented. This enables efficient use of the GPU computational hardware and memory bandwidth. Overall, an application speedup of 168.75× was obtained enabling the algorithm to finish in about 50 s for a cardiac cycle. Last, comparison of GPU and CPU implementations demonstrated no significant difference in the quality of displacement vector and strain tensor estimation with the two implementations up to a 5% interframe deformation. Hence, a GPU implementation is feasible for clinical adoption and opens opportunity for other computationally intensive techniques.
I. INTRODUCTION
U LTRASOUND strain imaging [1] is a modality which provides information about mechanical properties of tissues. This modality has been used for several noninvasive clinical diagnostic applications over the last decade. For example, quasi-static breast strain imaging has been used to both detect and discriminate between benign and malignant breast lesions [2] . Several researchers have also shown the usefulness of carotid strain imaging in vivo as it can predict plaque vulnerability [3] - [9] . Carotid strain imaging is implemented using the natural cardiac pulsation of the carotid arteries as stimuli. Along longitudinal views, the effect is such that motion of the near wall and far wall of the artery are in the opposite directions as the artery expands and contracts. This inherent discontinuous motion makes the task of estimating the displacement map for carotid strain imaging a challenging one [10] . In addition, there may be additional physiological motion which causes both walls to move differently.
Recently, graphics processing units (GPUs) have been used in strain imaging to allow for faster computational and processing efficiency. NVIDIA (Santa Clara, CA, USA) has introduced compute unified device architecture (CUDA) which is a parallel programming platform allowing easy programming of GPUs. Idzenga et al. [11] identified that about 90% of their strain imaging algorithm computation time was spent in performing normalized cross correlation (NCC) computations and hence implemented NCC on a GPU using CUDA. Verma and Doyley [12] used a GPU to perform beam forming and 2-D cross correlation and achieved frame rates 400 times faster than conventional methods. Rosenzweig et al. [13] implemented acoustic radiation force impulse imaging on GPU to improve the efficiency of cubic spline interpolation and Loupas et al.'s [14] 2-D autocorrelator. Yang et al. [15] implemented a hybrid CPU-GPU strain imaging algorithm using 1-D cross correlation where median filtering of displacement estimates and strain estimation was done on the CPU for a previous frame while the remaining compute intensive calculations were done on GPU for the current frame in parallel. Recently, Peng et al. [16] implemented a 3-D coupled subsample estimation algorithm for breast elastography using GPU. This was accomplished by using a K20 NVIDIA GPU to obtain a 3-D cross correlation map which was upsampled using spline interpolation with the marching-cube algorithm [17] and then fit to an ellipsoidal model to estimate the final displacements [18] , [19] .
Lagrangian carotid strain imaging is generally performed using the following steps. First, a displacement vector map based on the interframe displacements estimated is generated. Second, accumulated displacement vectors are obtained prior to the estimation of the corresponding accumulated strain tensors. Then, the gradient of the displacement vectors is used for calculating the local strain tensors. The first step can be viewed as a deformable registration problem, and possible solutions are block matching methods [20] , [21] . In these methods, blocks of data are matched between consecutive frames using a similarity measure such as NCC. To form the displacement map accurately, Shi and Varghese [10] used a multilevel pyramid scheme where coarser blocks of data are used to initialize search regions for higher levels that use finer blocks to improve spatial resolution. Multilevel block matching has allowed for the accurate calculation of discontinuous displacement maps. Here, strain calculated at lower levels can be used to scale the data at higher levels to reduce signal decorrelation and enhance the accuracy of displacement estimation [22] , [23] . Block matching methods, however, do not utilize global information and hence can be susceptible to signal decorrelation and peak hopping errors. In order to address this problem, McCormick et al. [24] used a Bayesian regularization approach where displacement estimates from neighboring blocks are used as a prior to aid accurate displacement estimation. The amount of deformation per frame can be small, and micrometer precision is desired; hence, subsample displacement estimation is necessary. McCormick and Varghese [25] used a 2-D windowed sinc interpolator to enable accurate and precise subsample displacement vector estimation. The techniques developed in [10] and [22] - [25] have been combined together in [6] and have successfully shown to quantify carotid plaque instability.
Since the strain imaging algorithm described in [6] is a combination of techniques developed over the years, the improved quality observed in the strain maps formed comes at the cost of significantly increased computation time. Moreover, for carotid strain imaging, displacements accumulated over an entire cardiac cycle and corresponding accumulated strain estimation are required to characterize plaque vulnerability. Here, around 25 frames of data over a cardiac cycle have to be processed which takes around 2 h with the current algorithm on a singlethreaded CPU [6] . In this paper, we implement the algorithm presented in [6] on a GPU to allow for faster processing and hence making this approach feasible for clinical adoption.
This paper reports on two main contributions. First, we quantify computation time for different techniques which have been incorporated to our strain imaging algorithm. Second, we implement these algorithms on a GPU without loss of image quality and demonstrate the speedup in processing time making the algorithm feasible in a clinical setting. In this process, we redesigned algorithms to make them suitable for GPU computing and explored standard optimizations for the same.
II. MATERIALS AND METHODS
An Intel(R) Xeon(R) CPU E5-2640 v4 at 2.40 GHz was used to run the algorithm. The GPU used was a Tesla K40c which belongs to the Kepler architecture with compute capability 3.5. The original algorithm was implemented in C++ using the insight toolkit (ITK) library [26] . CPU multithreading was not used. Table I presents different stages used in the original algorithm which will be discussed in the following paragraphs.
Computational bottlenecks in the original implementation by McCormick et al. [6] were first identified. A timing analysis for a single interframe displacement calculation with this algorithm is presented in Fig. 1 . These interframe displacement computation stages take a total of 317.25 s per frame pair. The next step is displacement accumulation and strain tensor calculation. This is done after plaque segmentation, and hence the time taken for this step varies depending on the plaque size. Our timing analysis on a carotid wall segmentation versus a large plaque segmentation shows that this took between 2.38 and 3.07 s per frame pair, respectively. This time includes the time spent in writing out the data to files with each plaque pixel location and corresponding strain values. However, this is very small compared to the 317.25 s taken for displacement estimation and hence we focus on speeding up the displacement estimation process.
We use the term current image and next image to describe the initial and the second frame that is used in interframe displacement estimation, respectively. The upsample and prepare multilevel data (UPMD) stage calculates the current and next images that will be used by the three levels of the multilevel displacement estimation algorithm. Note that these images are sampled differently for each level. First, a windowed sinc interpolator is used to upsample the original current and next image by a factor of 2 in both the axial and lateral directions. Following this upsampling, the decimation factors presented in Table II are applied since the lower levels are operated at a coarser granularity to improve computational speed and reduce erroneous displacement tracking. Higher decimation is allowed in the axial direction after conversion to envelope signals (to satisfy Nyquist criterion) since we have higher resolution in that direction. Following the decimation, Gaussian smoothing is applied which allows for derivative operations to be applied without singularities.
The data update (DU) stage is used to load the appropriate current and next image before proceeding to further stages for a given level. The affine transformation (AT) stage is used only for level 2 and level 3 where the current image blocks are scaled according to strain observed in previous levels as discussed in [22] and [23] . The correlation helper (CH) and the NCC stages perform NCC of data blocks from current image to search regions in the next image. For performing NCC, the mean and variances of data block being crosscorrelated are required. These values are calculated in the CH stage, and hence it is a helper to the NCC stage. The radius and size of the data blocks formed from the current images along with the search region ratios and resulting NCC matrix sizes are shown in Table III . The data block size is calculated as 2 × radius + 1. The NCC matrix sizes are calculated as 2 × ceil(searchratio × radius) + 1. The last column determines the number of NCC operations performed of the current image data block to different offsets of the same sized block in the next image. The essence of multilevel tracking is having large data block sizes at the lower levels accompanied with a large search region as one is less likely to generate false positives with a large data block size. However, the resolution of displacement vectors generated is low with large blocks. Hence, the displacement vectors from the lower levels are simply used to guide the search region initialization of the higher levels which perform NCC at smaller block sizes to achieve high spatial resolution for the displacement vectors.
The RC stage performs a Bayesian regularization process on the NCC matrices and is described in detail in [24] . The key process was, given an NCC matrix, corresponding values from left, right, top and bottom NCC matrices were used to remove noisy NCC values in the given matrix. This is an iterative process, and three iterations have been found to be sufficient to reduce noise in the estimation. Following this, the regularized NCC matrices were used to form the displacement vectors. To accurately estimate the displacement vector, subsample displacement calculation (SDC) [25] , [27] is done using interpolation. A simple parabolic interpolation is used for level 1 and level 2 while the final level uses a 2-D windowed sinc interpolation, a reconstructive method that does not have a closed form expression for the peak, but provides unbiased and accurate subsample estimates. The CPU implementation for sinc interpolation uses a Nelder-Mead simplex optimization to find this peak [25] .
A. GPU Version 1
To implement this algorithm on a GPU, we first identified bottlenecks of the original ITK algorithm. Fig. 1 illustrates that Bayesian regularization was the main bottleneck. The next bottleneck being the CH and NCC stages together. We initially implemented these three stages on the GPU with the frames copied to the GPU before the CH stage and the regularized matrices copied back to the CPU after the RC stage. The improved computational results were presented at a conference where Meshram and Varghese [28] reported an application speedup of 13.75×. In this paper, we implement all the remaining stages of the algorithm on the GPU and hence achieve a significantly faster implementation with higher speedup.
Data movement between GPU and CPU memories in the final implementation is as follows. The current and next frames are copied to the GPU memory before the computational stages. After the end of SDC stage in each level, displacement estimates are copied back to CPU memory and used to initialize offsets for the NCC search region for the next level. These offsets are copied to the GPU memory from the CPU memory at the beginning of each NCC stage. For the final level, the displacement estimates generated are the final results for the given frame pair and the algorithm moves on to the next frame pair. The GPU implementation of the UPMD stage performs upsampling using windowed sinc interpolation. Following this, it performs the appropriate decimation and smoothening of the images and stores the resulting images for all three levels in GPU memory. The DU stage was not required for GPU implementation and hence was removed.
Radio frequency (RF) data from the Siemens S2000 system are stored in a signed short format which is a 2-B format. However, in order to reduce any quantization errors in the upsampling stage, we moved to a single precision floating point format which is a 4-B format. This makes the GPU NCC stage more computationally expensive when compared to the corresponding CPU stage. The current GPU images are then used as input for the AT stage for level 2 and level 3 which generates, transformed current image data blocks based on the strain computed in prior levels. The data blocks are formed from the original current image for level 1. The size of these data blocks are presented in column 3 of Table III . The CH stage is now used only for setting up offsets for search regions based on displacements computed in prior levels. The CH stage was relieved of calculating the mean and variances of data blocks as this is now done in the NCC stage itself. The NCC stage forms the NCC matrices which are stored in GPU memory. The size of the NCC matrices formed is presented in column 5 of Table III. The NCC matrices are then regularized in the RC stage. Following this, a parallel reduction operation is performed on the regularized NCC matrices to find the maximum NCC value. An optimized GPU-based reduction algorithm presented by Harris [29] was used for this purpose. For levels 1 and 2, parabolic interpolation was performed using the maximum NCC value and its neighbors [27] . For level 3, the CPU implementation used the Nelder-Mead simplex optimization to interpolate the peak in the NCC matrix using 2-D windowed sinc interpolation. Although this process can be done in parallel for each NCC matrix, enough data blocks do not exist to make this approach suitable for CUDA. Hence, an alternate approach that works efficiently on a GPU, namely, a multilevel global peak finder (MLGPF) was implemented.
MLGPF starts with finding the maximum or peak NCC value. Following this, we set a region around this maximum which ranges from the matrix element prior to the maximum and ends at matrix element after the maximum. When done in 2-D, this approach forms a rectangular region. A grid is formed in this region and a 2-D windowed sinc interpolator is utilized to compute in parallel with multiple threads as shown by red points in Fig. 2. Fig. 2 shows how this process was performed with 64 threads. Our initial attempt with MLGPF used 1024 threads to generate a finer grid. Following this, we again use the Harris [29] reduction approach to find the maximum of all grid points. After this, the process is repeated with a smaller region around the new interpolated maximum. For the first step, where the sampling distance is large, we use a relaxed shrinking factor shown in green in Fig. 2 for the new region because the actual global peak may be closer to another grid point than the one we calculated as the maximum to avoid any jitter errors. However, once we have a small search region, we are less likely to miss the global peak, and hence we can use a tighter shrinking factor shown in blue. Thus, relaxed shrinking was used only for the first iteration and tighter shrinking used for further iterations. Five iterations of the algorithm with 1024 samples in each iteration were found to match or exceed the subsample peak computed using the Nelder-Mead simplex optimization approach. If we consider the distance between neighboring values in the estimation/interpolation matrix as a single unit, the relaxed shrinking factor denotes two units (one unit in each direction), and the tight shrinking factor represents one unit (half unit in each direction).
The intermediate results with GPU Version 1 are presented in the Appendix.
B. GPU Version 2
Speedup for level 3 of the SDC stage was just 4.67×, which was quite low indicating that the current configuration of the MLGPF algorithm utilizing 1024 grid points with 1024 threads was not optimal. SDC is not a notable part of the timing analysis in Fig. 1 ; however, it does become a notable component in GPU Version 1. The new bottlenecks for GPU Version 1 are the RC stage and CH + NCC stage. Hence, we further optimize RC, CH + NCC, and SDC stages.
For both the RC and CH + NCC stages, the NCC matrix sizes for levels 1-3, respectively, are 67 × 81 (5427), 41 × 59 (2419), and 23 × 41 (943). These NCC matrices sizes are the optimal CUDA block size to use for the respective levels since if we use these block sizes, a unique thread can operate on each NCC matrix element. However, CUDA block sizes are limited to 1024, and sizes above these are not supported. Hence, the sizes used for level 1, level 2, and level 3 for both these stages were 32 × 32, 32 × 32, and 16 × 32, respectively, and the same threads stepped through multiple NCC value calculation. This appears to be reasonable but is inefficient use of hardware. The problem requires understanding of the execution model on the GPU.
The execution model can be thought of as launching multiple CUDA blocks where each block has multiple threads. Execution of threads happens in the form of CUDA warps which are groups of 32 threads that run in lock-step fashion. If all 32 threads in each warp are not always performing useful calculations, it results in inefficient use of the GPU hardware. GPU Version 1 suffers from this issue and an example of this would be if we consider how warp 0 of level 1 in the RC or NCC stage would function. The threads ids in the form of (lateral, axial) or (X, Y ) for warp 0 would be (0, 0), (1, 0) , . . . , (31, 0) . After doing the computation for the same offset as the thread ids, these threads would step ahead to calculate for offsets (32, 0), (33, 0) , . . ., (63, 0). Next, they would step to (64, 0), (65, 0), …, (95, 0). The problem now is offsets from (67, 0) to (96, 0) that do not exist, and the threads working with these offsets are not doing useful work. This issue is called thread divergence. One solution is to simply change the block dimensions used. Thus, for level 1, since NCC size is 67 × 81, we launch CUDA blocks with size (67, 15) using 1005 threads that still do not exceed 1024 threads. Therefore, the threads no longer iterate through the x or lateral dimension but only in the y or axial dimension. Here, the last warp is always divergent since the execution model will still use 1024 threads. However, all the other warps have nondivergent execution in the common case except for the last iteration which may have divergence for one warp computing the bottom rightmost offsets.
Furthermore, optimizations include use on chip shared memory and the texture cache which is a read-only data cache. Shared memory can be thought of a cache managed by the programmer. Shared memory is especially useful with data reuse, and we want this data to be loaded in shared memory before the threads start using it. Hence, the threads typically load data in parallel then synchronize and start using the loaded data. This works great for the current image data block for NCC. However, since the search region is quite large for level 1 and level 2, the search region does not fit in the shared memory or it brings down occupancy of the implementation. Hence, we use texture cache for this which is very easy to use in compute capability 3.5 by simply giving compiler directives or intrinsics to mark the data to be loaded in texture cache. This can only be done if the data are read and never written to. For the RC stage although shared memory helps, there is not enough shared memory to load data from every neighboring NCC matrix into separate locations. Hence, the GPU implementation first loaded each neighbor synchronized threads, performed the computations, synchronized threads, overwrote the shared memory with a new neighbor, synchronized threads, and then performed the next computation. Doing this with texture memory achieved the same effect without any synchronization overhead and hence provided better performance for the RC stage.
Next, we attempt to improve the performance of the MLGPF scheme. The motivation behind using 1024 threads was to use fewer numbers of iterations for the scheme to converge. This assumed that the 2-D interpolation using windowed sinc function was computationally inexpensive. To test this assumption, we evaluate the difference between computation times with an implementation where the interpolator is removed. Of course, this produces incorrect results. The correct implementation takes 0.176 s, while after removing interpolation, it only takes 0.004 s. This shows that the windowed sinc function was the bottleneck in this scheme. Hence, the next goal was to converge while performing fewer number of interpolations even if it meant increasing the number of iterations. Ignoring the relaxed shrinking factor, in every iteration, we performed 1024 calculations to converge by a factor of 1/32 in each dimension. However, if we used just 64 threads and implemented shrinking of 1/8 in each iteration, we get a shrinking factor of 1/64 in two iterations while performing only 128 calculations. Hence, the number of threads used was changed to 64 instead of 1024. Nine iterations were needed instead of five iterations with 1024 threads. We also needed to use relaxed shrinking factors for the first four iterations. However, the scheme now takes 0.019 s instead of 0.176 required with 1024 threads.
C. Comparison of GPU and CPU Performance
Last, we compare the quality of estimates generated by the CPU and final GPU implementations, respectively. To achieve this, we use the SNR e metric as reported in [25] and [27] for where m e and s e are the mean and standard deviation, respectively, of the resulting strain estimates. We also present strain maps from both implementations to provide a qualitative visualization. The SNR e study followed the same protocol used in McCormick and Varghese [25] with a uniform tissuemimicking (TM) phantom undergoing precise deformation achieved using a motion table. Ten independent realizations were used to obtain statistically significant results. Pointwise strain comparison on the TM phantom was also performed. To quantify the accuracy of the algorithm, we report on the expected and observed strains by the CPU and GPU algorithms on the phantom. Finally, we report on the strain value difference in a human in vivo acquisition.
III. RESULTS
Performance comparison results are reported for RF data acquired using a Siemens S2000 system (Siemens Ultrasound, Mountain View, CA, USA), using an 18L6-HD transducer operated at a center frequency of 11.4 MHz. The dimensions of the acquired frames and the decimated sizes for different levels are presented in Table IV. Table V lists the number of data blocks that were required at different levels. No overlap between data blocks was used. As mentioned in [6] , overlap is not used because guidance is already provided by higher levels for the final level and independent matching blocks are ideal for Bayesian regularization. Note that both CPU and GPU implementations are flexible for any frame size, upsampling ratios, decimation factors, search region ratios, and desired amount of overlap between blocks. The configuration specified in Tables II-V is that used in our carotid plaque patient studies [3] - [5] , [30] - [32] . GPU speedups and optimization can vary depending on the configuration, and hence the configuration that was successfully used in our previous analysis on human subjects was optimized.
Tables VI and VII show the effect of different optimizations for the CH + NCC and RC stage, respectively. The second column presents the computation time for Version 1 which does not include these optimizations. In the third column, we changed the CUDA block dimensions to 67 × 15, 41 × 24, and 23 × 41 for level 1-3, respectively, and both stages completed faster. The shared memory and/or the texture cache configuration that worked best is shown in the last columns and is used in the final version.
Computation time taken by both the GPU implementations to complete different stages is presented in Fig. 3 . Detailed GPU implementation timing analysis of the final version is presented in Fig. 4 . The speedups for all levels and different stages and net speedups are shown in Table VIII . Finally, the time taken by different implementations and the associated speedup for a frame pair is shown in Table IX . Quantitative comparisons of the strain SNR e with the CPU and final GPU implementations are shown in Fig. 5 for a uniform TM phantom for 10 independent realizations. Note that the two implementations have very similar performance and the GPU version performs slightly better at increased deformation of 7% and 3.5% in the axial and lateral directions, respectively. Fig. 6 presents the median strain value obtained in the same uniformly elastic TM phantom. The expected strain is plotted with the estimated CPU and GPU strains. The average median value with error bars indicating the standard error is shown in Fig. 6 . Note that there is excellent agreement between the CPU and GPU estimates with the expected strain up to 5% uniaxial deformation. However, when the uniaxial deformation is increased to 7%, the CPU algorithm provides a worse performance when compared to the GPU also seen in the SNR e results in Fig. 5 . To quantify differences between GPU and CPU implementations, a pixel-to-pixel absolute difference of the corresponding strain maps was computed and the median and maximum absolute differences obtained, as shown in Fig. 7 . Note that the absolute differences are nearly zero up to a 5% applied deformation but the CPU performs poorly for the 7% uniaxial deformation. In a similar manner, the absolute differences in maximum strain observed for an in vivo plaque was 6 × 10 −4 , 6 × 10 −4 , and 4 × 10 −4 for accumulated axial, lateral and shear strains, respectively, between the CPU and GPU implementations.
As mentioned previously, we use a 4-B single precision floating point format for storing the upsampled RF data for the GPU, whereas the CPU used a 2-B format. This leads to minor differences in values computed with the algorithm for low-tomoderate strains. However, as observed for the 7% axial strain case in Figs. 5-7 , the GPU provides better estimation than the CPU.
Finally, a qualitative comparison is presented in Fig. 8 on an in vivo acquisition on a human patient with carotid plaque in the longitudinal view. The two approaches identify similar areas in plaque with high strains.
IV. DISCUSSION AND CONCLUSION
Note that the largest speedups were obtained for the regularization step in our algorithm. Bayesian regularization is a computationally intensive process with a high amount of data reuse leading to the high speedups. Data reuse is defined as the same or contiguous data elements used by threads running in parallel. Algorithm with high data reuse has high speedups because it is typically more computationally expensive to fetch data. The GPU execution model enables the performance of more computations with less fetches if there is good data reuse. NCC stage speedups are slightly lower as we go to higher stages, with smaller data block sizes, and hence the amount of data reuse is less when compared to lower stages.
Many researchers have described implementations of NCC on GPU's previously. Idzenga et al. [11] noted that approximately 90% of their computation time was spent performing NCC in their CPU implementation. In comparison, in our approach, we spent 15.08% of our computation time on NCC, excluding the DU stage. Most of the time 77.35% was spent on Bayesian regularization. The remaining 8% was spent on upsampling, ATs, and SDC using a 2-D windowed sinc function for interpolation.
For NCC, after accounting for MATLAB to C/C++ transfer speedup, Idzenga et al. [11] reported speedups of 67.33× for large cross correlation block sizes of 256 × 9 and speedup of 10.25× for smaller cross correlation block sizes of 16 × 9 using a NVIDIA K20 GPU. Peng et al. [16] also used NCC for 3-D cross correlation estimation and reported speedups in the range of 120.21×-107.69× for block sizes of 61 × 11 × 3-61 × 7 × 3, respectively. These speedups are comparable to those reported in this paper of 75.54×-35.47× for larger and smaller blocks, respectively, for NCC with the NVIDIA K40 GPU. In addition, the literature reports also corroborate that smaller blocks have lower speedups because of lower data reuse.
However, only NCC computations have been previously reported with parallel computing implementations to the best of our knowledge. Thus, implementing Bayesian regularization, 2-D windowed sinc interpolation which was used for upsampling, ATs, and subsample estimation, and the MLGPF scheme in parallel are unique contributions of this work. For SDC Peng et al. [16] used a matching cube and ellipsoid model for 3-D coupled estimation and this took 27 ms with a NVIDIA K20 GPU. Our 2-D windowed sinc interpolation based on the MLGPF scheme for subsample estimation took 19 ms for the final level. Both approaches are comparable in their computation times.
Overall, the GPU implementation of displacement tracking provided a 168.75× speedup. However, this speedup considers a single-threaded CPU implementation. The DU stage which was used in the CPU does not provide any advantage and was removed in the CPU as well. After removing the DU stage from the CPU version, we obtained a speedup of 151.53× with the GPU. Thus, if we were to use a multithreaded optimized CPU version, it would need to run at least 152 threads in order to perform better than the GPU. If we assume that each core could run two threads without slowing down either of the threads, we would need a 76 core CPU. In fact, even with such a powerful CPU, it would be unlikely that the CPU could match the performance because perfect thread to speed up scaling is never achieved due to intricate issues like false sharing [33] , i.e., when two threads write into neighboring memory locations and cause a cache miss. There may also be memory bandwidth bottlenecks, load balancing, and core resource availability issues in such a CPU implementation.
Transferring the computation from CPU to GPU allowed for speedups in the displacement calculation. The strategies that helped in achieving this were first identifying the computational bottlenecks and then working toward removing those with GPU implementation. This approach follows Amdahl [34] and is famously known as the Amdahl's law in computer science. Second, we focused on minimizing data movement between the CPU and GPU as this is expensive. Finally, we explored GPU optimizations which focused on making the most efficient use of the GPU cores and memory bandwidth. The NVIDIA visual profiler was found to be a useful tool to implement all the above strategies.
GPU implementation makes it feasible for the algorithm to be clinically adopted since a cardiac cycle with 25 frames would now take about 50 s, whereas the original implementation would have taken 2.2 h. Moreover, this opens up the opportunity for more computationally intensive approaches such as further upsampling of the data in the lateral dimension since ultrasound acquisitions have poor lateral resolution.
One should also note that there is scope for further speedup of the algorithm. With a speedup of NCC and RC stages, these are still our main bottlenecks as shown in Fig. 6 . This is possible if we divide our data block calculation between multiple GPUs as we have enough data blocks for this to scale well. Last, it is important to note that the Tesla K40c GPU used in this work is now three-generation old as it belongs to the Kepler architecture (2012), since then we have had Maxwell (2014), Pascal (2016), and Volta (expected) (2018) architectures. The newer generations will enable the GPU implementation to run even faster on these platforms.
APPENDIX
A timing analysis for GPU Version 1 is shown in Fig. 9 . The NCC and RC stages become the main computational bottlenecks in GPU Version 1. To provide a clear comparison between GPU Version 1 and the CPU implementation, a log plot comparing the timing is presented in Fig. 10 . Log plot better represents the low computation times obtained with the GPU.
