Abstract Medical imaging is fundamental for improvements in diagnostic accuracy. However, noise frequently corrupts the images acquired, and this can lead to erroneous diagnoses. Fortunately, image preprocessing algorithms can enhance corrupted images, particularly in noise smoothing and removal. In the medical field, time is always a very critical factor, and so there is a need for implementations which are fast and, if possible, in real time. This study presents and discusses an implementation of a highly efficient algorithm for image noise smoothing based on general purpose computing on graphics processing units techniques. The use of these techniques facilitates the quick and efficient smoothing of images corrupted by noise, even when performed on large-dimensional data sets. This is particularly relevant since GPU cards are becoming more affordable, powerful and common in medical environments.
Introduction
Computational image processing is a field that has seen tremendous advances in recent years. These advances are the result of huge demands coming from areas such as medicine [1] , agriculture [2] , security [3] , traffic and satellite data analysis [4] and industry [5] . These fields require image processing tasks such as noise and artifact removal and smoothing [6] , geometrical correction [7] , contrast enhancement [8] , image restoration, [9] and illumination correction [10] . Briefly, the use of image processing techniques, particularly of image preprocessing, is mainly intended to enhance the data presented in the original images so that the processed data can be analyzed more easily using higher-level techniques of computational image analysis, such as image segmentation [11] or image registration [12] . However, many of the original images that need to be enhanced have large dimensions [13] and need to be processed in real time or near real time [14] . This is the case, for example, in the fields of robotic navigation or assisted surgery, or even when the input data are long sequences of 2D or 3D images, such as in ultrasound imaging [15] . Additionally, to obtain more robust and efficient results, the computational complexity of the more recent methods has considerably increased, leading to slower runtimes. Therefore, the use of parallel computing strategies has attracted attention, and this has led to higher speeds, particularly in time-constrained applications for medical diagnosis [17] .
Frequently, noise corrupts images, and this may be due to the image acquisition procedure involved or to artifacts generated by data transmission or other processes [17] . The image smoothing method proposed by Jin and Yang [18] has obtained very promising results, particularly when applied to medical images. However, a long computational time when performing several iterations on the input image is required by this method, especially when applied to large-scale images; as a result, its use has become less attractive for some potential applications. Additionally, there is a frequent and increasing demand for fast responses from computational methods in high-resolution image processing, and real time is preferable due to the severe time constraints that characterize medical imaging. Therefore, we have developed a parallel implementation of the smoothing method proposed by Jin and Yang [18] using general purpose computing on graphics processing units (GPGPU) [19] and compute unified device architecture (CUDA) [20] in order to speed up its runtime. We have assessed the performance of this strategy by comparing the runtime of parallel implementation (GPU) against that of sequential implementation (CPU).
The method adopted for image smoothing selects each pixel from the input image and thus requires a large number of calculations; this leads to the long runtime mentioned. Briefly, the method involves the use of an m Â n ð Þmatrix, which is processed for T iterations. Thus, the computational complexity of the processing of the input image is equal to O m Â n Â T ð Þ , where m and n are the number of rows and columns of the input image, respectively.
In our parallel implementation, the input image data are stored in the GPU's memory, where the highest number of accesses occurs, in order to eliminate as many data accesses as possible within the main memory system [19, 21] . Hence, input image processing is executed in parallel in the GPU. The experimental findings confirmed that the combination of the CUDA architecture and GPGPU techniques was very promising in terms of speeding up the runtime of image processing and computational analyses. These approaches led to high processing performance at a low cost, mainly when compared with parallel implementations in multicomputers.
As far as the authors known, this was the first time that the smoothing method adopted was parallelized using CUDA architecture and GPGPU techniques. The findings are of great interest for image processing and analysis, mainly within the medical community. In this case, medical images of ever higher resolution need to be smoothed as fast as possible in real clinical scenarios. Nowadays, computers with GPUs are commonly available in medical environments and, although these computers are not always the most up-to-date models, their computational power is still sufficient to achieve efficient fast results. This paper is organized as follows: Sect. 2 introduces the image smoothing method proposed by Jin and Yang [18] ; Sect. 3 describes the metrics of structural similarity (SSIM), peak signal-to-noise ratio (PSNR) and normalized cross-correlation (NCC), all of which are used to assess the quality of the smoothing results; Sect. 4.3 presents the parallel implementation of the image smoothing method; the computational runtimes demanded by the CPU-and GPU-based implementations are discussed in Sect. 5; and finally, Sect. 6, presents the concluding remarks.
Image smoothing method
Images frequently have multiplicative noise, which comes from multiplying an original image I by a noisy image I n [22] . This type of noise is present, for example, in microscopy, ultrasound and infrared imaging [23] . Multiplicative noise is usually more difficult to remove than additive noise [24] . Therefore, to overcome this problem, variational models for multiplicative noise removal have been integrated into smoothing methods specially developed for such images [24, 25] . In 2011, Jin and Yang [18] proposed a very promising method for removing and smoothing multiplicative noise from corrupted images using the variational model for additive noise removal proposed by Rudin et al. [26] , as shown here:
where X is a closed area belonging to R 2 , f is the image corrupted by additive noise, u is the image in the current smoothing iteration, J(u) is a regulator term and k is a weight parameter. Jin and Yang designed the method specifically to remove multiplicative noise from ultrasound images, and they concluded that the function proposed by Krissian et al. [27] could be adopted to solve the variational model of Eq. (1), using:
where u is the original image, f ¼ u þ ffiffiffiffiffi ug p is now the input image corrupted by multiplicative noise and g is a Gaussian variable with a nonzero mean. Thus, the variational model adopted by Jin and Yang [18] is:
where k [ 0 is a weight parameter. As such, the model given by Eq. (3) deals with the problem of multiplicative noise removal by adopting:
where r and div are the gradient and divergent operators, respectively. In order to discretize the continuous part of Eq. (4), Rudin et al. [26] used the finite difference scheme, adopting h = 1 for the step size and Dt for the time interval, which leads to:
where the parameter d [ 0 is a constant defined close to zero, and term m is defined as:
where min a j j; b j j ð Þ is a function that returns the smallest absolute value between a and b and sign a ð Þ is a function that determines the sign of a, returning 1 if a is positive, -1 if it is negative and 0 if a is equal to 0. Assuming the iterations of the model k = 1, 2, …, Eq. (4) can be rewritten as:
where f is the input image affected by multiplicative noise. In this method, the k parameter is automatically calculated for each new iteration as: 
where r 2 is the variance of the image at iteration k. As an example, Fig. 1 shows the result of the smoothing method when applied to ultrasound images.
Assessment metrics
The comparison between two images is a natural task for the human visual system, but it is not so natural for computer systems. Therefore, various authors have proposed different solutions which assess the similarities between two images and, in particular, evaluate the performance of image preprocessing methods [28] [29] [30] [31] [32] . Basically, there are two classes of solutions: One is based on intensity error and the other on structural information.
Based on intensity error
For image smoothing, the comparative solutions or similarity indices use intensity error in order to estimate the error between the enhanced image, i.e., the smoothed image, and the original image before noise corruption. The main disadvantage of these similarity indices is the possibility of failure where there are displacements between the images under comparison. Moreover, these indices compare the intensity variation of each pixel of the input images, which can lead to similar results for images with different types of geometrical distortions [29] . Nevertheless, indices based on intensity error are frequently used to compare the performance of image enhancement [33] [34] [35] and smoothing [13, 17] methods, due to their simplicity.
In particular, the peak signal-to-noise ratio (PSNR) index has been widely used to assess the performance of image restoration and smoothing methods. This index determines the ratio between the highest possible strength of a signal, which in the case of images is the highest intensity value, and its strength as affected by noise [15, 17] . For simplicity, the PSNR is represented according to a logarithmic scale (base 10), since some signals can have very high values.
The PSNR can be calculated from the mean squared error (MSE), which is computed as: 
where m and n are the dimensions of the images I and I r to be compared, as follows:
where MAX I is the maximum intensity value that a pixel can assume, which is equal to 255 for 8-bit grayscale images. Thus, the higher the PSNR value is, the more efficient the performance of the preprocessing algorithm is. The two images are considered identical, when the MSE value is 0 (zero), and the PSNR value is undefined. Normalized cross-correlation (NCC) is another metric based on pixel intensity. It is widely used in image registration [30, 36] to compare the degree of similarity between two input images. NCC is as follows:
where x i and y i denote the intensity values of each pixel of the m Â n ð Þimages under comparison, leading to values in the interval ½0; 1, where 1 (one) indicates a best match [37] .
Based on structural information
In this class of quality metrics, the goal is to find changes in the structural information of the images under comparison. The analysis of the structural information represented in the input images assumes that the human vision system is adapted to extract, i.e., segment, structural information from what is seen, and to search for changes in the structures detected. In other words, any possible differences, such as those due to artifacts generated by noise processes [37] , are quantified.
The structural similarity index (SSIM) is the main index in this category which analyzes the performance of computational image processing methods [13, 38] . Wang et al. [29] proposed this similarity index in an attempt to prevent images with very different visual qualities ending up with high similarity values, as can happen when the similarity indices are based on intensity error. The index measures the change in three components of each image under comparison: luminance, contrast and image structure. The former is defined as average pixel intensity. The contrast component is modeled using the standard deviation of the intensity, while image structure comes from the normalized image using the standard deviation of images under comparison. The SSIM is as follows:
where l refers to luminance, c to contrast and s to structure, and a [ 0, b [ 0 and c [ 0 are weights. These three components are relatively independent, and therefore, modifying one of them does not affect the others. More details of the calculation of these components, as well as a detailed analysis of them, are presented in [29] . The SSIM is an index which applies to each pixel of the input image, and for convenience, the mean SSIM is usually adopted. The mean structure similarity index (MSSIM) is the average of all the SSIM values obtained. For identical images, this index is equal to 1 (one). As the images become different, the index becomes lower until it is equal to -1 when the images are exactly opposite, i.e., one is the negative of the other.
Parallelization of the smoothing method
Studies have shown that GPU-based parallel methods have focused on massively parallel programming [40] , and most common image processing methods can operate with parallelization strategies based on the data decomposition technique. This section describes the steps involved in the GPU-based parallel implementation, which was developed in order to optimize the runtime performance of the adopted smoothing method.
The smoothing method adopted in this study, as described in Sect. 2, made use of four fundamental equations to find a solution for the multiplicative noise smoothing process given by Eq. (4). The method starts by solving the finite difference scheme adopted in Eq. (5). Then, Eq. (7) obtains the final value for each pixel according to the ongoing iteration, and Eq. (8) finds the associated weight parameter. Thus, Eqs. (4), (5), (7) and (8) define a sequence of steps for the parallelization of the smoothing method. The implementation procedure was based on the NVIDIA programming best practices guide [19] .
The CUDA architecture was developed with the objective of using data parallelism, by establishing a new model named single instruction multiple thread (SIMT). In this model, data are represented as a stream, which is structured as an array, and when running one or more instructions using this array, the instructions are defined as a kernel [17, 19] . A kernel performs operations in parallel along the entire stream, using it as both input and output [19, 39] .
In the SIMT model, the calling of multiple kernels follows a hierarchy of thread groups. This feature divides each kernel into independent blocks, and as a result, the efficient threading support in the GPUs ensures transparency, portability and scalability, besides allowing a CUDA program to be executed in any number of processor cores. Threads are used for fine-grained parallelism; groups of threads, defined as ''blocks,'' are used for coarse-grained parallelism; groups of blocks are placed in a grid which represents a kernel call. As illustrated in Fig. 2 , this hierarchy allows each thread within a block and each block in a grid to have a unique identifying index [39] .
Setting the occupancy level
The setting of each kernel must be adjusted to use the correct number of blocks and threads in order to optimize the occupancy of the CUDA cores (code lines in Fig. 3); i.e., if the number of blocks and threads is not sufficient, some cores will not be able to execute the code, wasting some of the processing power. In our implementation, we used 256 threads per block in all the kernels. The number of blocks B is given by:
where T px is the total number of pixels of the input image and T Tb is the number of threads per block. In this case, we defined one two-dimensional variable (numBlocks), which has the image height (HImage) as the first dimension (T px ) and the image width as the second dimension (T px ). These calculations determine the settings used to perform one thread per pixel. When there are excess threads, a stopping criterion discards them. Applications developed for massively parallel architectures achieve greater performance when the graphics card resources are used efficiently. The occupancy level of the GPU measures the proportion of active processors in the graphics card during a kernel execution. This calculation takes into account the following specification query attributes acquired from the CUDA device: the maximum number of threads per block, the number of blocks per multiprocessor, the number of registers per multiprocessor and the shared memory per multiprocessor. Increasing the number of concurrent threads is a good strategy for the purpose of making full use of the GPU, and the limit of threads is defined by the architecture. However, a high level of GPU occupancy does not guarantee an additional performance gain [19] because there is a problem of memory latency, and a high level of occupancy may reduce the overall performance [40] .
Optimizing the memory hierarchy in CUDA
As shown in Fig. 4 , each multiprocessor can use four types of memory: a set of registers for each stream multiprocessor (SM), a shared memory between the SMs, a constant cache shared between the SMs, and a texture cache which optimizes the bandwidth of the texture memory. Registers have the largest bandwidth, and like other kinds of memory, threads can access them; threads can also access data in different memory spaces. Each SM used in the experiments has 256 kB worth of memory registers [19] .
In the case of shared memory, the bandwidth is similar to the registers, and threads can cooperate to load and compute data shared by them. Each memory module has a set of 32-bit registers, which makes the threads access consecutive positions of a data vector more efficiently. A module can receive multiple requests for the same data, but this creates conflicts. However, automatic serialization satisfies all memory access requests. As this serialization can reduce bandwidth performance, a broadcast device is set up to prevent the reading of all the threads at the same memory address [19] . On the other hand, all threads can access the GPU global memory (GDRAM) simultaneously. However, there are some restrictions which improve the bandwidth. Global memory has the lowest bandwidth but has the largest storage capacity. In order to obtain the maximum possible speedup, a group of threads are used which has consecutive indices and is bundled into a unit named a warp. Thus, a single SM can run multiple warps simultaneously. The size of a warp depends on the GPU specification [19, 40] . All threads have read-only access to the GPU memory cache, which has 48 kB for each SM; moreover, the threads of a half-warp can read only one memory address. Only instructions from the GPU can write into this kind of memory, and these processes persist throughout the execution of multiple kernel calls [19] .
All threads can also access the texture memory, which is only read by kernels. This kind of memory uses a separate cache with a capacity of 32 kB per SM and provides highperformance access when all threads perform operations on memory addresses close to them [40] . The types of access of on-chip memory for Compute 3.5 and later devices are indicated in Table 1 .
Implementation of kernels in CUDA C
Tasks of computational image processing and analysis usually involve a large amount of data processing. Thus, the first strategy is to allocate the required space in the GDRAM and then copy the input image as a data matrix from the host memory (RAM) to the device's memory (GDRAM); this process allows data to be managed directly in the GPU. Accesses to the coalesced memory are performed in contiguous segments; half-warps access the segments simultaneously. Such accesses are known as coalesced memory accesses, and they enable parallel operations, thereby reducing the number of memory transactions [19, 39] . The data are then loaded into contiguous segments, and this allows a thread block to process an input image more efficiently; moreover, both the global memory and the texture memory are used.
Equations (5), (7) and (8) were implemented in the kernels called kDiFinitas, kVariancia and kFinal, respectively. The threads from the kDiFinitas kernel perform the computations in Eq. (5) in each image pixel independently. This kernel has several threads, each of which represents a matrix index and processes a specific image pixel. Thus, to manage access to a set of image pixels in the ''for-loops,'' each pixel has an access condition.
First of all, in the kDiFinitas kernel, each pixel from the input image is associated with a thread, and then the thread blocks are stored in the texture memory. After running the kDiFinitas kernel, the kVariancia kernel performs the parallelized computation of the k parameter according to Eq. (7). In the parallel implementation, an auxiliary vector stores the values of the operations involved in each iteration, i.e., each thread calculates the resultant value of each iteration. Figure 5 presents the pseudocode of the developed algorithm. The kFinal kernel computes the weight parameter, used previously in the kVariancia kernel, and then applies it to each image pixel, giving access to the texture cache and coalesced access to the global memory. Each thread attributes the resulting value to the corresponding memory, providing the data needed to calculate the final sum of each image pixel. Finally, the SomaElem kernel assists with the calculation of the vector values. The vector is divided into two equal parts, their values are summed, and each thread sums two values and keeps them in the lowest available vector position. This procedure continues until only one vector position remains for the storage of the resulting sum. In the case of a vector with an odd number of elements, an extra element with a zero value is added. This procedure uses the partitioning strategy of the global memory to optimize the bandwidth of the active warps during memory access; the warps are organized into partitions. This is the slowest kernel used, and this is because the memory blocks become less contiguous while the elements are processed. Figure 6 illustrates the implemented parallelization technique.
An image corrupted by multiplicative noise is used as input (step 1), and after running the kernels described previously in steps 4-8, the result will be a new noisesmoothed image. Steps 4, 6 and 7 perform the reading of data in the texture memory. On the other hand, the results of each step of memory writing go into the global memory, where the output images are stored.
Equations (5), (7) and (8) were implemented as a nested ''for-loop.'' A CPU-based implementation was also developed as a comparison with the GPU-based implementation. The main memory system was accessed contiguously for all of these loops in order to optimize execution, and the GDRAM was accessed contiguously as well, creating a fair comparison [19] between the implementations.
Experiments and discussion
This section describes the infrastructure used to perform the experiments and also discusses the results.
Test infrastructure
The used test infrastructure includes a desktop computer equipped with an Intel(R) Core(TM) i7-4790 3.60 GHz processor, 16 GB of RAM (DDR3-1600 MHz), Linux Ubuntu 14.04 operating system, CUDA 1 nvcc release 7.5 compiler driver and GNU gcc/g?? compiler version 4.8.4. Additionally, there was a GPU NVIDIA Tesla K20c, with 2496 CUDA cores and 5 GB of GDRAM.
Results and discussion
In this section, we present results of experiments aimed at evaluating the performance of the method adopted. The runtime performance of GPU-based implementation is the Compute the new pixel values for each new iteration using Eq. (7) 13 Copy image from GPU memory to CPU memory 14 Output: Smoothed image Fig. 5 Pseudocode of the developed parallel implementation focus of this study; however, the PSNR, SSIM and NCC metrics were used to confirm the smoothing method's accuracy. In the tests, 15, 25 and 50 smoothing iterations were adopted. We used a set of six images with different resolutions (128 9 128, 256 9 256, 512 9 512, 1024 9 1024, 2048 9 2048 and 4096 9 4096 pixels) built synthetically with an image editor software and then corrupted with synthetic multiplicative noise of a variance equal to 0.3. There were 100 iterations for each test, and the mean and the standard deviation values of the time spent smoothing each input image were calculated. The total time spent (Table 2) was computed from the moment the data were loaded into the main memory system until the end of the smoothing process when the resultant image was produced. The function cudaThreadSynchronize was performed after each kernel call, forcing the CPU to wait for the complete kernel execution, and the sdkResetTimer, sdkStartTimer and sdkStopTimer timing functions were used to obtain the kernel execution time. The execution times of each kernel were added together to obtain the total execution time. Table 2 shows that the execution times of the CPU-based implementation were longer than those of the GPU-based implementation except in the case of the smallest test image (128 9 128 pixels). This distinct behavior occurred because the speedup achieved with the data processed in Fig. 6 Parallel CUDA-based implementation of the adopted image smoothing method the CUDA cores did not justify the computational effort involved in transferring a small amount of data to the GPU memory or the latency times necessary for the initialization of the GPU. For large images, the speedup of the GPU was around 10, but less for smaller ones. Moreover, the GPUbased implementation achieved noise smoothing in real time for all tested images. The parallel implementation had transparent and portable scalability in GPUs based on CUDA architecture; besides, the performance scales increased exponentially, as shown in Fig. 7 . Furthermore, when we considered images with dimensions greater than 256 9 256 pixels, a speedup of the GPU-based implementation was evident; for example, it was about 10.65 times faster for images with 4096 9 4096 pixels. As an illustrative example, Fig. 8 shows the results of the CPU-and GPU-based implementations applied to the Fig. 7 Processing time of the proposed GPU-based implementation, which scales up exponentially Fig. 8 Original noisy test image with 4096 9 4096 pixels and the smoothed images obtained by the CPU-and GPU-based smoothing implementations, respectively test images. Figure 8 shows, from left to right, the image affected by the multiplicative noise and the images smoothed by the CPU-and GPU-based implementations. The values listed in Table 3 were computed using NCC and SSIM metrics in order to confirm that the structural information resulting from the noise images corresponded to smoothed images, since all values were close to 1 (one). As for the smoothing method's accuracy and time performance, the optimal number of iterations for better image preservation seems to be 15.
The PSNR values were also computed for each static test image before and after being smoothed by the CPUand GPU-based implementations ( Table 4 ). The values demonstrate the efficiency of the smoothing method and confirmed that the two implementations smoothed the images using the method adopted.
We also tested three synthetic videos with 240 frames and different resolutions (128 9 128, 256 9 256, 512 9 512 pixels) and one real ultrasound video with 255 frames of 320 9 240 pixels. The smoothing method was applied only once for each video frame.
The average runtime for the real ultrasound video was 5.92 s for the CPU-based implementation and 2.87 s for the parallel implementation in CUDA. Thus, the processing time of the parallel implementation was about 2.06 times faster when processing the entire ultrasound video. Figure 9 shows an example of the smoothing of a video frame selected randomly from the tested video. Table 5 indicates the frame rates of the CPU-and the GPU-based implementations when smoothing the four test videos. In this table, the values in bold can be considered in line with real-time processing ([20 frames per second) and therefore acceptable for routine medical image processing [43, 44] . As given in Table 5 , the experiments using the parallel GPU-based implementation revealed an even higher reduction in the runtime of the smoothing method in The values in bold can be considered in line with real-time processing relation to the CPU-based implementation, confirming initial expectations. The CUDA architecture as a computational infrastructure for image preprocessing has revealed to be a viable, capable and alternative option to deliver high-performance processing in many applications; moreover, it can even provide real-time processing at an affordable cost [40] . Here, the performance gain of the parallel GPU-based implementation confirms the high processing capacity available in the CUDA architecture, with all videos used in the experiments processed in real time. Today, the available resources in these graphic cards have increased the performance gain more efficiently, taking into consideration the number of cores and GDRAM memory as well as the SIMT parallel model associated with memory optimization techniques.
Therefore, the benefit of using GPU-based implementations can be totally justified since the reduction in the runtime can minimize or even eliminate the time restrictions; such restrictions are common in many applications (such as in the medical field) that use image processing and analysis methods, requiring fast or real-time results for image-based diagnosis [45, 46] . However, optimal implementation requires maximum efforts, particularly when using the CUDA architecture.
Conclusions
The use of parallel computing techniques to fully explore the high-performance multiprocessor architecture is not new. However, the cost of the more traditional hardware for high-performance computing is not low; thus, more affordable alternatives such as GPU hardware should be considered.
The present work has described how to use the highperformance computing CUDA-based architecture as a computational infrastructure to accelerate an algorithm for noise image removal. The parallel GPU-based implementation developed was compared against the corresponding sequential CPU-based implementation in several experiments, and image quality metrics confirmed the similarity of the smoothing results achieved by each implementation. The parallelization of the image smoothing method based on a variational model using the CUDA architecture reduced the runtime by up to 10.65 times in comparison with the CPU-based implementation.
The novel CUDA-based implementation developed to smoothing multiplicative noise by using an effective variational method seems to be a high-performance solution for applications with images susceptible to this type of noise, and which have high processing time constraints. Moreover, the proposed GPU-based parallelization approach has transparency, portability and scalability, thanks to the adopted SIMT model.
More and more complex methods and larger and larger data sets are used in the medical imaging domain that has high time constraints, which makes the use of the CUDA architecture extremely attractive as the study conducted here confirms. As a future works, we intend to extend the proposed CUDA-based implementation to enable it to perform in multi-GPUs, besides combining it with multithread (OpenMP) and multicomputer (MPI) in order to achieve higher performances using heterogeneous parallel computing platforms.
