Performance Analysis of Non Local Means Algorithm using Hardware Accelerators by Antony, Daniel Sanju
Abstract
Image denoising is an integral part of image processing. It is used as a standalone
algorithm for improving the quality of images captured by a camera, and as a
preprocessing stage for applications such as face recognition and super resolution.
Non Local Means (NL-Means) and the Bilateral Filter are two denoising algorithms that
give good results but are computationally complex. This complexity limits their use
in real time applications.
In this thesis, we propose the use of hardware accelerators, namely GPUs (Graphics
Processing Units) and FPGAs (Field Programmable Gate Arrays), to speed up the
execution of these filters and to implement them efficiently. The GPU based
implementation is carried out using the Open Computing Language (OpenCL). The basic
objective of this research is to perform high speed denoising without compromising
on quality. Here
we implement a basic NL-Means filter, a Fast NL-Means filter, and a Bilateral filter
using Gauss Polynomial decomposition on the GPU. We also propose a modification to the
existing NL-Means algorithm and to the Gauss Polynomial Bilateral filter: instead of the
Gaussian spatial kernel used in the standard algorithms, a box spatial kernel is
introduced to improve the speed of execution. This research work is a step towards
making real time implementation of these algorithms possible. Our results show
that the NL-Means implementation on the GPU using OpenCL is about 25x faster than
a regular CPU based implementation for larger images (1024x1024). For Fast NL-Means,
the GPU based implementation is about 90x faster than the CPU implementation. Even
with the improved execution time, the use of NL-Means in embedded systems is limited
by the power and thermal restrictions of the GPU device. To obtain a faster
implementation with lower power consumption, we have implemented the algorithm on an FPGA.
FPGAs are reconfigurable devices that enable us to create a custom architecture for the
parallel execution of the algorithm. We found that, for smaller images (256x256), the
FPGA implementation is about 200x faster than the CPU implementation and about 25x
faster than the GPU implementation. Moreover, the power requirement of the FPGA design
(0.53 W) is much lower than that of the CPU (30 W) and the GPU (200 W).
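
As an illustration of the box kernel idea mentioned above, the following is a minimal reference sketch of NL-Means in plain Python, not the OpenCL or FPGA implementation described in this thesis. It assumes that the box kernel replaces the Gaussian weighting of pixels inside the patch distance, so that every pixel of a patch contributes equally; the thesis body may define the modification differently. The function name `nl_means_box` and the parameters `patch_radius`, `search_radius` and `h` are hypothetical names chosen for this example.

```python
import math


def nl_means_box(img, patch_radius=1, search_radius=2, h=0.1):
    """Denoise a 2-D list of floats with NL-Means and a box (uniform) patch kernel.

    Illustrative assumption: the patch distance is the plain mean of squared
    differences, i.e. no Gaussian weighting inside the patch.
    """
    rows, cols = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so patches near the border reuse edge pixels.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return img[r][c]

    # Offsets of all pixels in a (2 * patch_radius + 1)^2 patch.
    offsets = [(dr, dc)
               for dr in range(-patch_radius, patch_radius + 1)
               for dc in range(-patch_radius, patch_radius + 1)]

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum = 0.0
            value_sum = 0.0
            # Compare the patch at (r, c) with every patch in the search window.
            for sr in range(max(r - search_radius, 0),
                            min(r + search_radius, rows - 1) + 1):
                for sc in range(max(c - search_radius, 0),
                                min(c + search_radius, cols - 1) + 1):
                    # Box kernel: each patch pixel has the same weight.
                    d2 = sum((px(r + dr, c + dc) - px(sr + dr, sc + dc)) ** 2
                             for dr, dc in offsets) / len(offsets)
                    w = math.exp(-d2 / (h * h))
                    weight_sum += w
                    value_sum += w * img[sr][sc]
            # weight_sum >= 1 because the patch always matches itself (w = 1).
            out[r][c] = value_sum / weight_sum
    return out


# Usage: a flat 5x5 image with one bright outlier in the centre.
noisy = [[0.5] * 5 for _ in range(5)]
noisy[2][2] = 1.0
denoised = nl_means_box(noisy, h=0.5)
```

Each output pixel depends only on its own search window, which is why the algorithm maps naturally onto parallel hardware. With uniform patch weights, the patch distances for neighbouring pixels can in principle be updated with running sums rather than recomputed from scratch, which is one common reason a box kernel is faster than a Gaussian one.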
