Digital Signal Processing Filtering with GPU by Anwar, Sajid & Sung, Wonyong
Digital Signal Processing Filtering with GPU 
Sajid Anwar, Wonyong Sung 
School of Electrical Engineering 
Seoul National University  





Abstract— This work implements digital filter kernels on an
Nvidia GTX 260 GPU. Various FIR and IIR kernels are implemented.
IIR filters, due to their recursive nature, depend on previously
computed output samples. This dependency problem can be solved
by computing the particular and homogeneous solutions
separately. Compared to a CPU-based implementation,
experimental results show speed improvements of 3 to 40 times
for the IIR and FIR filter kernels, respectively.
 
Keywords— FIR, IIR, GPU, CUDA, Homogeneous Solution,
Particular Solution, GPGPU
I. INTRODUCTION 
A Graphics Processing Unit (GPU) is responsible for
manipulating and displaying graphical data. Present GPUs
have hundreds of processing cores and are thus well suited to
parallel processing. Each of these cores can run hundreds of
threads in parallel. This large processing capability has led to
General Purpose Computing on GPU (GPGPU). However,
GPGPU originally limited the developer to a set of predefined
graphics APIs: the task had to be expressed through
standardized graphics APIs such as OpenGL and DirectX. With
the advent of the Compute Unified Device Architecture (CUDA),
these limitations have been removed [2]. This work implements
both recursive and non-recursive filters on a GPU with CUDA.
In contrast to FIR filters, recursive filters depend on
previous output samples. A technique is used to reduce the
dependency time. An Nvidia GTX 260 with 192 cores is the
implementation platform.
The next two sections discuss the FIR and IIR kernels and the
corresponding CUDA implementations. Section IV discusses the
mapping of the proposed solution onto CUDA threads and
blocks. Section V concludes.
II. FIR FILTER KERNELS 
FIR filters are inherently parallel and are easily
implemented with CUDA. A 16-tap FIR filter is implemented.
Fast shared memory on the GPU is used to save cycles.
Profiling results are given in Table 3. The reported time
includes memory transfers.
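
As a minimal sketch of why FIR filtering parallelizes well, the Python snippet below computes every output sample from the input alone, which is the property a one-thread-per-output GPU kernel would rely on. The function name `fir_output`, the moving-average coefficients, and the zero-padding before the start of the signal are assumptions made for illustration; the paper does not list its coefficients or kernel code.

```python
def fir_output(x, h, n):
    """Return y[n] = sum_k h[k] * x[n - k], treating x[m] as 0 for m < 0."""
    acc = 0.0
    for k, coeff in enumerate(h):
        if n - k >= 0:
            acc += coeff * x[n - k]
    return acc


# Hypothetical 16-tap moving-average filter (coefficients are illustrative only).
TAPS = 16
h = [1.0 / TAPS] * TAPS
x = [float(i % 5) for i in range(64)]

# Each call reads only the input x, never another output, so every index n
# could be handled by its own thread, in any order.
y = [fir_output(x, h, n) for n in range(len(x))]
```

Because no output depends on another output, the list comprehension could be evaluated in any order without changing the result.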
III. IIR FILTER KERNELS 
A first-order recursive filter is shown in Fig. 1. It is evident
from the figure that the current output depends on the current
input sample and the previous output sample. This dependence
on the previous output sample makes this category of filters
inherently sequential. The dependency time can be reduced by
decomposing the computation into two separate solutions:
homogeneous and particular [1]. Figure 2 shows this
decomposition process. It can be observed that the
homogeneous solution depends on the initial condition while
the particular solution does not. The particular solution can
therefore be computed in parallel. The next section discusses
how this scheme is mapped onto the GPU.
 
 






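\[
\begin{aligned}
y[n+1] &= a\,y[n] + x[n+1] \\
y[n+1] &= a^{2}\,y[n-1] + a\,x[n] + x[n+1] \\
&\text{similarly} \\
y[n+2] &= a^{3}\,y[n-1] + a^{2}\,x[n] + a\,x[n+1] + x[n+2] \\
&\;\;\vdots \\
y[n+p] &= \underbrace{a^{p+1}\,y[n-1]}_{\text{Homogeneous Solution}}
        + \underbrace{a^{p}\,x[n] + a^{p-1}\,x[n+1] + a^{p-2}\,x[n+2] + \dots + x[n+p]}_{\text{Particular Solution}}
\end{aligned}
\]

Figure 2 Decomposition Process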
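
The decomposition can be checked numerically. The following Python sketch, assuming the first-order filter y[n] = a*y[n-1] + x[n] and a block that starts at index 0, compares the plain recursion with the sum of the homogeneous and particular terms. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def iir_sequential(x, a, y_prev=0.0):
    """Reference recursion y[n] = a*y[n-1] + x[n], with y[-1] = y_prev."""
    y = []
    for sample in x:
        y_prev = a * y_prev + sample
        y.append(y_prev)
    return y


def iir_decomposed(x, a, y_prev, p):
    """Return y[p] as homogeneous + particular for a block starting at n = 0."""
    # Homogeneous term: depends only on the initial condition y[-1].
    homogeneous = a ** (p + 1) * y_prev
    # Particular term: depends only on the inputs x[0..p].
    particular = sum(a ** (p - k) * x[k] for k in range(p + 1))
    return homogeneous + particular


# Illustrative values (not taken from the paper).
a, y_prev = 0.5, 8.0
x = [1.0, 2.0, 3.0, 4.0]
reference = iir_sequential(x, a, y_prev)
decomposed = [iir_decomposed(x, a, y_prev, p) for p in range(len(x))]
```

With these values both routes give y[3] = 6.625, made up of a homogeneous part of 0.5 and a particular part of 6.125. Only the homogeneous part needs the previous output, which is why the particular part can be computed ahead of time.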
IV. MAPPING ONTO GPU 
The particular solution is computed in parallel by a total of L
blocks. Each block contains M threads, and each thread
processes N input elements sequentially. All threads run in
parallel. Table 2 and Fig. 3 show this process. The threads are
synchronised at their finishing point. Once this is done, the
thread-level initial condition is propagated as depicted in
Fig. 4. This propagation is conducted sequentially. At the end
of step 2 [...]
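
The mapping can be sketched on the CPU as follows. This is a simplified, single-level illustration in Python, not the authors' kernel: each chunk stands in for the N samples handled by one thread, the block/thread hierarchy is flattened into one list of chunks, and the final step that adds the homogeneous term back to every sample is an assumption based on the decomposition in Figure 2. The name `blocked_iir` and the parameter `chunk_len` are hypothetical.

```python
def blocked_iir(x, a, chunk_len, y_init=0.0):
    """First-order IIR y[n] = a*y[n-1] + x[n], computed chunk by chunk."""
    chunks = [x[i:i + chunk_len] for i in range(0, len(x), chunk_len)]

    # Step 1 (parallel on a GPU): particular solution of each chunk,
    # i.e. the recursion run with a zero initial condition.
    particular = []
    for chunk in chunks:
        acc, out = 0.0, []
        for sample in chunk:
            acc = a * acc + sample
            out.append(acc)
        particular.append(out)

    # Step 2 (sequential): propagate the initial condition chunk to chunk.
    # The true last output of a chunk is its particular value plus the
    # homogeneous term a**len(chunk) * (initial condition of that chunk).
    init = []
    carry = y_init
    for out in particular:
        init.append(carry)
        carry = a ** len(out) * carry + out[-1]

    # Step 3 (parallel on a GPU): add the homogeneous term a**(k+1) * init.
    y = []
    for out, y0 in zip(particular, init):
        y.extend(p + a ** (k + 1) * y0 for k, p in enumerate(out))
    return y


# Illustrative input (not from the paper): 7 samples in chunks of 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = blocked_iir(x, a=0.5, chunk_len=3)
```

Only step 2 is sequential, and it touches one value per chunk rather than one value per sample, which is where the reduction in dependency time comes from.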

[...] put can be re [...] hreads per bl [...] may reduce [...] lock level [...]

[...] -1,1,0] to [...] -1,1,N-1] ...

[...] 3 to 40 times [...]. Future work is [...] r recursive filters.

[...] r the Brain K [...] igher Educati [...] in part by [...] es Developm [...] the Ministry of [...] ent (MOEHRD) [...] in part supported [...] istan.

[...] . K. Mitra, “Effi [...] essor Implementation [...] [Available Online] [...] uting/207402986
 
