Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
Xiaoming Chen, Jianxu Chen, Danny Z. Chen, and Xiaobo Sharon Hu
Department of Computer Science and Engineering, University of Notre Dame
Notre Dame, IN 46556, USA
{xchen7, jchen16, dchen, shu}@nd.edu
ABSTRACT
Convolution is a fundamental operation in many applica-
tions, such as computer vision, natural language processing,
image processing, etc. Recent successes of convolutional
neural networks in various deep learning applications put
even higher demand on fast convolution. The high com-
putation throughput and memory bandwidth of graphics
processing units (GPUs) make GPUs a natural choice for
accelerating convolution operations. However, maximally
exploiting the available memory bandwidth of GPUs for
convolution is a challenging task. This paper introduces
a general model to address the mismatch between the mem-
ory bank width of GPUs and computation data width of
threads. Based on this model, we develop two convolution
kernels, one for the general case and the other for a special
case with one input channel. By carefully optimizing mem-
ory access patterns and computation patterns, we design
a communication-optimized kernel for the special case and
a communication-reduced kernel for the general case. Ex-
perimental data based on implementations on Kepler GPUs
show that our kernels achieve 5.16× and 35.5% average per-
formance improvement over the latest cuDNN library, for
the special case and the general case, respectively.
CCS Concepts
• Computing methodologies → Massively parallel algorithms;
Keywords
Convolution; graphics processing unit; memory bandwidth
1. INTRODUCTION
Convolution is a fundamental operation in many image
processing and computer vision applications. For example,
image convolution is a key component in numerous basic image
processing routines, such as edge detection [1], smoothing [1],
template-based object detection [2], etc. Recently,
convolutional neural networks (CNNs) [3] have become a
powerful deep learning model which has been widely adopted
in various computer vision applications, such as image
recognition [4], image classification [5], object detection [6], etc.
State-of-the-art CNNs typically have quite a few convolutional
layers. Propagating through these convolutional layers
is always a computation bottleneck in both the training
and inference phases of CNNs.
DAC '17, June 18-22, 2017, Austin, TX, USA. © 2017 ACM. ISBN 978-1-4503-4927-7/17/06.
DOI: http://dx.doi.org/10.1145/3061639.3062297
With the rapid development of many-core parallel pro-
cessors, new methods have been developed by leveraging
high computation throughput and memory bandwidth of
graphics processing units (GPUs) to accelerate convolution
operations. These methods can be roughly classified into
four categories: (1) general matrix multiplication (GEMM)
based convolution [7,8], (2) direct convolution [9–11], (3) fast
Fourier transform (FFT) based convolution [12–14], and (4)
the Winograd algorithm [15,16].
Convolution can be easily converted into a multiplica-
tion of two matrices by unrolling all the involved convolu-
tion operations [7]. Highly optimized GEMM kernels (e.g.,
cuBLAS [17]) can be invoked to compute matrix multipli-
cations. This is the default method in Caffe [18], a popular
deep learning framework. Although good performance can
be attained, it requires a huge amount of additional mem-
ory. Recently, cuDNN [8] adopted a GEMM-like method,
in which sub-blocks of the input matrices are constructed in
on-chip memory at run-time, and thus no additional memory
is needed. A direct method was proposed in [9], but the re-
ported performance is not good enough when there are more
than 100 channels. In [10], optimization techniques were
discussed for direct convolution on GPUs, but the proposed
method was not compared with any public library. Cuda-
convnet2 [11] also implemented direct convolution on GPUs,
but there is no detailed document to describe its methodol-
ogy or performance. FFT-based convolution [12–14] can re-
duce the arithmetic complexity compared with direct meth-
ods. However, the filters need to be padded to the same size
as the input image, which incurs additional memory and
computation time. In addition, in order to reuse the Fourier
transform of the filters, the batch size should be big enough.
Recent studies have shown that the Winograd algorithm can
significantly reduce the arithmetic complexity for the 3×3
filter [15, 16], at the cost of increased memory usage and
filter size dependent specialized processing.
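To make the GEMM-based approach and its memory cost concrete, the following minimal Python sketch unrolls an image into a matrix whose rows are the K×K patches and then computes the convolution as a matrix-vector product. It is only an illustration under simplifying assumptions (one channel, one filter, unit stride, no padding); the function names `im2col`, `conv_direct`, and `conv_gemm` are hypothetical and are not taken from Caffe or cuDNN.

```python
def im2col(img, K):
    """Unroll every K×K patch of a 2D image into one row of a matrix."""
    out_h, out_w = len(img) - K + 1, len(img[0]) - K + 1
    return [[img[y + i][x + j] for i in range(K) for j in range(K)]
            for y in range(out_h) for x in range(out_w)]

def conv_direct(img, flt):
    """Reference: slide the filter over the image directly."""
    K = len(flt)
    out_h, out_w = len(img) - K + 1, len(img[0]) - K + 1
    return [[sum(img[y + i][x + j] * flt[i][j]
                 for i in range(K) for j in range(K))
             for x in range(out_w)] for y in range(out_h)]

def conv_gemm(img, flt):
    """Same result, computed as (unrolled patches) x (flattened filter)."""
    K = len(flt)
    out_w = len(img[0]) - K + 1
    cols = im2col(img, K)
    flat = [v for row in flt for v in row]
    prod = [sum(a * b for a, b in zip(patch, flat)) for patch in cols]
    return [prod[r:r + out_w] for r in range(0, len(prod), out_w)]

# Small made-up 5x6 image and a 3x3 filter.
img = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]
flt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
unrolled = im2col(img, 3)
```

In this example the unrolled matrix has 12 rows of 9 values, i.e., 108 values for a 30-pixel image, which is the additional memory that the explicit unrolling requires.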
Although FFT-based methods and the Winograd algo-
rithm can be faster than direct methods in some cases, they
are not universal. Direct convolution is still fundamental
and considered the best in general. In this paper, we aim
to improve the memory efficiency of direct convolution on
GPUs, targeting two cases: (1) a special case with one
input channel, which appears in numerous image processing
applications and the input layer of CNNs (for grayscale
images), and (2) the general case for CNNs. Specifically, we
introduce a general model to address the mismatch between
the shared memory bank width of GPUs and computation
data width of threads. Based on this model, by carefully
optimizing the memory access patterns and computation
patterns, we design a communication-optimized kernel for
the special case and a communication-reduced kernel for the
general case. Experimental data based on implementations
on Kepler GPUs show that our convolution kernels achieve
5.16× and 35.5% average performance improvement compared
with the latest cuDNN library, for the special case
and the general case, respectively.
arXiv:1705.10591v1 [cs.DC] 29 May 2017
2. PROBLEM FORMULATION
In this section, we present the problem formulation, which
illustrates the main challenges and the general model that
we propose to overcome such challenges.
2.1 GPU Memory Constraints and Modeling
Most GPU programs are memory bandwidth hungry. GPUs
usually have a complex memory hierarchy subject to differ-
ent constraints. Global memory (GM) accesses should be
coalesced in order to reduce latency. Bank conflict should
be avoided when accessing the shared memory (SM). For
the constant memory (CM), all the access addresses within
one warp should be identical to take full advantage of the
broadcast mechanism. These are basic constraints that GPU
programs should satisfy to achieve good performance.
The SM bank width also plays an important role in GPU
performance, but it has received less attention from
programmers and researchers. We elaborate on this problem
below. Let WSMB be the SM bank width. (WSMB is 8 bytes
on Kepler and 4 bytes on other GPU architectures.) Further, let
WCD be the computation data width for each thread. For
example, if each thread takes float as the minimum unit
for computation, then WCD = 4. The relation between
WSMB and WCD can be described by
WSMB = n · WCD. (1)
If n = 1, the SM bank width and computation data width
are matched; otherwise, they are unmatched. Mismatch be-
tween WSMB and WCD frequently occurs in practice. Even
when WSMB is 4, WCD can be 2 (for short or fp16) or 1 (for
char). Fig. 1 illustrates the impact of a mismatch. Con-
sider multiple threads reading from or writing to the SM. A
conventional method shown in Fig. 1a is often used, where
contiguous threads access contiguous elements, as it is easy
to program. But, such a method may fail to fully utilize the
available SM bandwidth. For example, if n = 2, as shown
in Fig. 1a, any two accesses that fall into the same bank
have to be serialized. Yet, if we can double the computation
data width through intelligent thread layout and computation
pattern redesign so that WSMB = WCD, as in Fig. 1b,
then each thread can obtain 2 elements together in a single
access, resulting in a 2× improvement in the SM bandwidth.
Figure 1: Different SM access patterns. (a) Conventional
approach. (b) Matched approach. (Legend: element in shared
memory (size WCD); thread. Banks 0 to 3 are shown.)
Figure 2: Execution time comparison for the single-precision
GEMM on a Kepler K40m GPU. (Series: cuBLAS, MAGMA,
MAGMA mod.; x-axis: matrix dimension, 2K to 8K; y-axis:
execution time in ms.)
Figure 3: Basics of convolution. (a) A general convolution
operation in CNN (C is the # of channels, K is the filter size,
and F is the # of filters). (b) Data reuse in convolution (the
solid circles mark pixels that are being used for convolution).
To demonstrate the importance of this problem, we com-
pare the performance of the single-precision GEMM on a Ke-
pler K40m GPU, as shown in Fig. 2. MAGMA is highly op-
timized for Fermi and is faster than cuBLAS on the Fermi ar-
chitecture [19]; however, it becomes 2.4× slower than cuBLAS
on the Kepler architecture. The SM bank width of the Kepler
architecture is twice that of Fermi, causing a mismatch
between WCD and WSMB for the MAGMA kernel that
operates on float, which results in the loss of half of the SM
bandwidth. Yet, a modification to the MAGMA kernel by
matching WCD with WSMB saves 36% of the execution time
on average.
Consequently, for applications that are sensitive to the
SM bandwidth, memory access patterns and computation
patterns should be reorganized to match WCD with WSMB .
That is, each thread should be designed such that it accesses
and computes n basic elements as a single unit. In this way,
we can obtain an n× improvement in the SM bandwidth.
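The effect of Eq. (1) can be illustrated with a small Python sketch that counts how many thread accesses start in the same bank for the two patterns of Fig. 1. This is a deliberately simplified model that follows the description in the text (accesses falling into the same bank are serialized); it is not a simulator of the actual hardware, and the function name `serialization_factor` and the 4-bank, 8-element setup are assumptions chosen to mirror the figure.

```python
from collections import Counter

def serialization_factor(num_threads, bytes_per_thread, bank_width, num_banks):
    """Largest number of threads whose accesses start in the same bank.

    Simplified model: thread t accesses bytes_per_thread bytes starting at
    address t * bytes_per_thread, and accesses that hit the same bank are
    serialized (the behaviour described in the text for Fig. 1a).
    """
    banks = Counter((t * bytes_per_thread // bank_width) % num_banks
                    for t in range(num_threads))
    return max(banks.values())

W_SMB, W_CD = 8, 4          # Kepler bank width and float width, in bytes
n = W_SMB // W_CD           # n from Eq. (1)

# Conventional: 8 threads, one float each. Matched: 4 threads, n floats each.
conventional = serialization_factor(8, W_CD, W_SMB, num_banks=4)
matched = serialization_factor(8 // n, n * W_CD, W_SMB, num_banks=4)
```

Under this model the conventional pattern places two accesses in every bank, whereas the matched pattern places one access of WSMB bytes in each bank, which corresponds to the n× bandwidth improvement discussed above.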
2.2 Data Sharing in Convolution
One key aspect in developing highly efficient convolution
operations on GPUs is to maximize data sharing, which is
also a key factor for communication reduction. Consider the
general case of convolution in CNNs (see Fig. 3a). Fig. 3b
illustrates a simple data reuse method, in which pixels can
be reused in both the horizontal and vertical directions as
indicated by the solid circles within the dashed boxes. A
simple analysis shows that an input pixel can be used up
to K×K×F times, where K is the filter size and F is the
number of filters. This feature should be fully exploited to
reduce both GM and SM accesses. For this aim, elaborate
memory access patterns and computation patterns need to
be used, while still satisfying the basic constraints of the
GPU memory hierarchy.
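The K×K×F reuse bound can be checked by brute force. The sketch below counts how often each input pixel is touched when F filters of size K×K slide over an N×N single-channel image (unit stride, no padding, both of which are assumptions made here for simplicity); the function name `max_pixel_uses` is hypothetical.

```python
def max_pixel_uses(N, K, F):
    """Count, by brute force, how often the most-used input pixel is read
    when F filters of size K×K slide over an N×N single-channel image."""
    uses = [[0] * N for _ in range(N)]
    for _ in range(F):
        for y in range(N - K + 1):
            for x in range(N - K + 1):
                for i in range(K):
                    for j in range(K):
                        uses[y + i][x + j] += 1
    return max(max(row) for row in uses)

interior = max_pixel_uses(N=8, K=3, F=4)   # image large enough for interior pixels
tiny = max_pixel_uses(N=4, K=3, F=1)       # image too small to reach the bound
```

Interior pixels of a sufficiently large image reach the bound K×K×F, while pixels near the border (or all pixels of a very small image) are used fewer times, which is why the text says "up to".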
Another challenge arises from the SM bank width model
presented in the previous subsection. As convolution is bound
by the SM bandwidth, when designing convolution kernels on
GPUs, we must strive to match WCD and WSMB , in order
to fully utilize the SM bandwidth.
Figure 4: Image partitioning. (Each block has size H×W; the
pixels required for a block extend beyond its boundaries.)
Figure 5: Our convolution method for the special case (K =
3 in this example). (a) For matched WCD and WSMB. (b)
For unmatched WCD and WSMB (n = 2 in this illustration;
each thread computes 2 convolutions).
Taking all these requirements into consideration, the goal
of this paper is to develop general solutions for convolution
on GPUs, such that (i) the memory communication is re-
duced as much as possible, (ii) the basic constraints of the
GPU memory hierarchy are satisfied, and (iii) the SM bank
width and the computation data width are matched.
3. CONVOLUTION FOR SPECIAL CASE
This section presents our convolution kernel for the spe-
cial case, in which the input has only one channel (C = 1
in Fig. 3a). This case arises at the first layer of CNNs (for
grayscale images) and in many image processing applica-
tions. We first show how we design the thread layout and
then discuss how we achieve optimal memory accesses.
3.1 Thread Layout
The goal of thread layout is to judiciously allocate com-
putation to thread blocks (TBs) and individual threads to
maximize both coarse-grained and fine-grained parallelism.
Fig. 3b depicts the general concept for our parallelization
methodology, in which each thread keeps K×K pixels of the
input image used by a convolution operation in the register.
However, directly applying this method would cause a prob-
lem. A thread cannot move right and down simultaneously.
In other words, once a thread moves to the right to compute
the next convolution, it loses some pixels needed by the con-
volution below. One could use additional registers to store
such lost pixels, but a lot more registers would be required.
To resolve this issue, we propose an alternative scheme to
maximize both parallelism and 2D data sharing.
To achieve coarse-grained parallelism, we partition the in-
put image into blocks of size H×W each (see Fig. 4). Such
partitioning enables data sharing along the vertical direc-
tion, since one row of the input image can be used by the
convolutions of K rows. A TB with W threads handles one
image block. Different image blocks are assigned to differ-
ent TBs so that they are computed in parallel. Each block
needs some additional pixels outside its right and bottom
boundaries to compute convolutions.
In terms of fine-grained parallelism, all the W threads
in a TB compute convolutions of one row in parallel, as
illustrated in Fig. 5a. Once a row is finished, the W threads
move down to compute the next row. Thus we read a new
row from the input image for each down movement. This
process continues until reaching the bottom of the block.
3.2 Optimizing Memory Accesses
We now discuss how we schedule memory accesses in coor-
dination with the thread layout design in Section 3.1. Since
the filters in the special case (with only 1 input channel) are
typically small, they can reside in the CM and no further
scheduling is needed. We focus our discussion on accessing
the input image in the GM to minimize GM communication.
We first consider the simple case of WCD = WSMB . For
each row of the block, we first read it into the SM (includ-
ing the needed pixels outside the block boundaries), and
then the W threads read their corresponding pixels into re-
spective registers. This process allows horizontal data shar-
ing and avoids redundant reads from the GM as adjacent
threads share some common pixels. Hence, our 2D data
sharing method works as follows: in the horizontal direc-
tion, the SM provides inter-thread data sharing; in the ver-
tical direction, intra-thread data sharing is achieved through
the private registers of the threads.
A simple analysis shows that each pixel in a block is read
from the GM only once, which is, of course, the theoret-
ical lower bound. For the entire image, only those pixels
which are needed by a block and outside the block bound-
aries are read more than once. But, the proportion of such
halo pixels is small. As a result, this method is (almost)
communication-optimal for GM accesses.
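How far the GM traffic is from the lower bound depends on the block size. As a rough sketch, a block of H×W output pixels needs (H+K−1)×(W+K−1) input pixels, so the ratio between the two gives the read overhead per block. The helper name `gm_read_ratio` and the block sizes used below are hypothetical and serve only as an illustration of the formula.

```python
from fractions import Fraction

def gm_read_ratio(H, W, K):
    """Pixels read from the GM per block (halo included) divided by the
    H×W pixels that the block itself covers."""
    return Fraction((H + K - 1) * (W + K - 1), H * W)

small_block = gm_read_ratio(64, 64, 3)     # hypothetical 64×64 block, 3×3 filter
large_block = gm_read_ratio(128, 128, 3)   # hypothetical 128×128 block
no_halo = gm_read_ratio(64, 64, 1)         # a 1×1 filter needs no halo
```

The ratio approaches 1 as H and W grow relative to K, so the block shape is a trade-off between halo overhead and the resources available to a thread block.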
When considering the SM bank width model presented
in Section 2.1, the above thread layout and memory access
schedule are suitable only when n = 1, i.e., WCD = WSMB .
On the Kepler architecture, n = 2 if we use float as the
basic computation unit. Following the general idea depicted
in Fig. 1b, we propose to have each thread read, write, and
compute n pixels together (using built-in data types such as
float2 or float4). Each thread is responsible for n con-
tiguous output pixels in each row and n×H output pixels
in the block (for one output feature map). The number of
threads in a TB is reduced to W/n. With this approach, each
thread needs a few more registers (O(K×(n−1))) to store
K×(K+n−1) pixels that are used by the convolutions for n
contiguous output pixels. Fig. 5b shows our modified con-
volution method specifically for the Kepler architecture.
3.3 Implementation
Algorithm 1 outlines the flow of our special case convolu-
tion method at the TB level. The algorithm starts by read-
ing K image rows of the block into the SM (line 1). After
that, the first K−1 rows are read into the threads’ regis-
ters (line 3). Then convolution is performed on all the rows
within the block iteratively in a loop (lines 4-11). For each
row, the data are first read from the SM into the threads’
registers (line 6), and then each thread computes convolu-
tions for all the filters (lines 7 and 8). We use a prefetch-
ing mechanism to overlap computations and GM accesses.
Before the threads’ computation tasks, the next image row
of the block is prefetched into the threads’ registers (line
5). Although this operation may take a long time, it can
be overlapped with convolution computations, since they
have no data dependency. After prefetching is finished, the
prefetched data are written into the SM (line 10).
Algorithm 1: Convolution for the special case on GPUs.
 1  Load rows 0 to K−1 (the first K rows) of the block into the SM;
 2  __syncthreads();
 3  Each thread loads (K−1)×(K+n−1) pixels from the SM into register;
 4  for k=K−1; k<H+K−1; ++k do
 5      Prefetch row k+1 of the block into register;
 6      Each thread loads the latest row from the SM into register;
 7      for f=0; f<F; ++f do
 8          Each thread computes n convolutions for filter f and
            writes the results back to the GM;
 9      __syncthreads();
10      Store the prefetched row into the SM;
11      __syncthreads();
The above algorithm is quite memory efficient. When
computing convolutions, the involved pixels are in registers
so the latency can be ignored. The filters are fetched from
the CM. In our method, all the threads within a warp al-
ways compute convolutions using the same filter at the same
time, so they always access the identical address, which is
the best case for the CM. As the filters are quite small in
the special case, a high hit rate of the constant cache can
be expected. When accessing the SM and GM, contiguous
threads always read or write contiguous addresses (at the
granularity of n pixels as a single unit), so both coalesced
GM access and conflict-free SM access are achieved. Our
experimental results in Section 5 support our analysis here.
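The data flow of Algorithm 1 can be emulated sequentially to check that the register window produces the same results as a direct convolution. The following Python sketch is only a functional model under stated assumptions: each loop iteration over `t` plays the role of one thread, a `deque` of K rows stands in for the thread's registers, and the SM staging, prefetching, and synchronization of the real kernel are omitted. It handles a single filter, and the names `conv_reference` and `conv_special_case` are hypothetical.

```python
from collections import deque

def conv_reference(img, flt):
    """Direct convolution (unit stride, no padding) used as ground truth."""
    K = len(flt)
    out_h, out_w = len(img) - K + 1, len(img[0]) - K + 1
    return [[sum(img[y + i][x + j] * flt[i][j]
                 for i in range(K) for j in range(K))
             for x in range(out_w)] for y in range(out_h)]

def conv_special_case(img, flt, n):
    """Each 'thread' owns n contiguous output columns and slides down the rows."""
    K = len(flt)
    out_h, out_w = len(img) - K + 1, len(img[0]) - K + 1
    assert out_w % n == 0, "block width must be a multiple of n"
    out = [[0] * out_w for _ in range(out_h)]
    for t in range(out_w // n):              # each iteration plays one thread
        x0 = t * n
        # "Registers": the first K-1 rows, K+n-1 pixels each (Algorithm 1, line 3).
        regs = deque((img[r][x0:x0 + K + n - 1] for r in range(K - 1)), maxlen=K)
        for y in range(out_h):
            # Load the newest row (line 6); the oldest row drops out of the window.
            regs.append(img[y + K - 1][x0:x0 + K + n - 1])
            for p in range(n):               # n contiguous outputs per thread
                out[y][x0 + p] = sum(regs[i][p + j] * flt[i][j]
                                     for i in range(K) for j in range(K))
    return out

# Made-up 6×10 image and a 3×3 filter; n = 2 as on Kepler with float data.
img = [[(7 * y + 3 * x) % 11 for x in range(10)] for y in range(6)]
flt = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
result = conv_special_case(img, flt, n=2)
```

The window holds K×(K+n−1) pixels per thread, matching the register count given in Section 3.2, and each input row is loaded into a thread's window exactly once as the thread moves down.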
4. CONVOLUTION FOR GENERAL CASE
This section presents our convolution kernel for the gen-
eral case, where the input has multiple channels (see Fig. 3a).
Note that the method for the special case cannot be applied
here for the following reasons. For the special case, we can
keep the needed pixels in the threads’ registers since the fil-
ters are small. Thus, we can finish one convolution at once
with K×K fused multiply-add (FMA) operations. With multiple
channels, the involved pixels of one convolution cannot
entirely reside in the registers. Hence, the computation of
one convolution must be divided into multiple steps, and
the intermediate results should be accumulated in the regis-
ters. In addition, the filters (proportional to the number of
channels) become larger and may no longer fit in the CM.
Instead, the GM needs to store both the filters and the in-
put image. Our basic idea for the general case is inspired by
the blocked GEMM method for GPUs [19], but we optimize
memory communication by maximally sharing data.
4.1 Thread Layout
Similar to the special case, each input channel of the in-
put image is partitioned into blocks of size H ×W each
(see Fig. 4). We use a 2D TB layout which is similar to
that adopted by the blocked GEMM method [19]. Since in
the general case, a TB cannot be responsible for all the fil-
ters, we divide the computation into a 2D TB layout of size
TBX×TBY. In the X dimension, a TB is responsible for
FTB contiguous filters, where TBX = ⌈F/FTB⌉. In the Y
dimension, a TB is responsible for C image blocks at the
same location of all the C channels. Within a TB, we use
a 2D thread layout of size TX×TY. Each thread is responsible
for WT output pixels and FT filters, where TX = FTB/FT
and TY = (W×H)/WT. Each thread keeps FT×WT pixels in the
register to store the intermediate convolution results for the
WT pixels and FT filters. In the following subsections, our
discussion will focus on optimizing memory access patterns
to reduce memory communication.
Figure 6: Our convolution method for the general case (n=2
in this illustration). (The figure shows CSH channels of image
blocks of size (H+K−1)×(W+K−1) and CSH channels of filters
of size K*K×FTB plus padding in shared memory, and, per
thread, WT+K−1 pixels, FT filter values, and FT*WT result
pixels in registers, for the first and second rounds.)
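The layout formulas of this subsection can be evaluated for a concrete configuration. The short Python sketch below does so for the 3×3 configuration listed in Table 1 (Section 5.2); the filter count F = 128 is a made-up value for illustration, and the helper name `thread_layout` is hypothetical.

```python
from math import ceil

def thread_layout(F, W, H, F_TB, W_T, F_T):
    """Grid and block shape from the formulas in Section 4.1."""
    assert F_TB % F_T == 0 and (W * H) % W_T == 0
    return {
        "TB_X": ceil(F / F_TB),        # thread blocks along the filter dimension
        "T_X": F_TB // F_T,            # threads along X inside a TB
        "T_Y": (W * H) // W_T,         # threads along Y inside a TB
        "acc_regs": F_T * W_T,         # accumulators kept per thread
    }

# Table 1 configuration for the 3×3 filter; F = 128 is a made-up filter count.
layout = thread_layout(F=128, W=32, H=4, F_TB=64, W_T=16, F_T=4)
threads_per_tb = layout["T_X"] * layout["T_Y"]
```

For this configuration the formulas give a 16×8 thread layout, i.e., 128 threads per TB, with 64 accumulator values per thread.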
4.2 Optimizing Memory Accesses
Fig. 6 depicts our convolution method for the general case
at the TB level. To improve the GM efficiency, we store CSH
channels of image blocks (including the needed halo pixels
outside the block) and the filters in the SM (shown as the
two blue boxes in Fig. 6). When reading filter values from
the GM to the SM, since the block is transposed, padding
(the gray box in Fig. 6) is required for the SM to avoid bank
conflict. For the image blocks, pixels are directly read into
the SM without transposition, so padding is not needed.
To increase data sharing within a thread, the WT out-
put pixels computed by one thread are contiguous along
the horizontal direction. This is a major difference from
the blocked GEMM method [19] where contiguous output
pixels are computed by contiguous threads. Computing
WT contiguous output pixels by one thread involves read-
ing (WT +K−1) ×K×C pixels from the SM, instead of
WT×K×K×C if they are computed by different threads. As
C may be large, it is impossible to put all of these involved
pixels in the register. So we only keep a row of WT +K−1
input pixels in each thread’s register, and the convolution
results are accumulated iteratively. The WT +K−1 input
pixels are used in K rounds of computation of WT output
pixels. A round of computation refers to an FMA operation
in a convolution.
When reading filter values from the SM, we use the same
thread layout as that used in the blocked GEMM method [19].
However, in order to meet the requirement of the SM bank
width model, each thread should read n contiguous values
as a single unit along the horizontal direction, as illustrated
in Fig. 6 (the upper blue box). This method is conflict-free
as contiguous threads in the X dimension read contiguous
units from the SM.
Our thread layout and memory access patterns also avoid
bank conflict when accessing the SM for image blocks. As
the X dimension of the 2D thread layout is assigned along
the feature direction, TX contiguous threads in the X dimen-
sion access the identical address of the SM for image blocks,
which benefits from the broadcast mechanism of the SM.
Algorithm 2: Convolution for the general case on GPUs.
 1  Register: rAcc[FT][WT], rImg[WT+K−1], rFlt[FT];
 2  Shared memory: shImg[CSH][H+K−1][W+K−1],
    shFlt[CSH][K×K][FTB+padding];
 3  Clear rAcc;
 4  Load CSH channels of image blocks into shImg;
 5  Load CSH channels of filters into shFlt;
 6  __syncthreads();
 7  for c=0; c<C; c+=CSH do
 8      Prefetch next CSH channels of image blocks into register;
 9      Prefetch next CSH channels of filters into register;
10      for i=0; i<CSH; ++i do
11          for j=0; j<K; ++j do
12              Each thread loads WT+K−1 pixels into rImg;
13              for k=0; k<K; ++k do
14                  Each thread loads FT filter values into rFlt;
15                  rAcc[0, ···, FT−1][0, ···, WT−1] +=
                    rFlt[0, ···, FT−1] × rImg[k, ···, WT+k−1];
16      __syncthreads();
17      Store prefetched image blocks into shImg;
18      Store prefetched filters into shFlt;
19      __syncthreads();
20  Write rAcc back to the GM;
The only issue of this approach is at the writing back
phase. Writing the results back to the GM is not coalesced,
as contiguous threads in the X dimension compute different
output feature maps. However, we have found that in the
general case convolution, the writing back phase consumes
very little time, so we do not optimize the uncoalesced writ-
ing back operations. If one wants to make the writing back
coalesced, the SM can be used as a buffer to reorganize the
data layout. However, this would lead to additional cost
including the SM latency and TB barriers.
4.3 Implementation
Algorithm 2 outlines our general case convolution method
at the TB level. The algorithm starts by clearing the results
(line 3) and loading the first CSH channels of image blocks
and filters into the SM (lines 4 and 5). After that, a loop
iteratively accumulates the results for all the channels (lines
7–19). For each channel, a thread needs to conduct K rows
of computations (lines 11–15), and for each row, K rounds
of computation are conducted (lines 13–15). The image data
are loaded into each thread’s register only for each row (line
12), and these data are used by K rounds of computation.
The filter data are loaded into the register in each round
(line 14). We also use a prefetching method to overlap GM
accesses and computation (lines 8, 9, 17 and 18). After the
intermediate results of all the channels are accumulated, the
final results are written back to the GM (line 20).
Fig. 6 illustrates the first two rounds of computation for
thread (0, 0). The thread first loads a row of WT+K−1 pixels
from the SM into the register. In the first round, it loads FT
filter values from the first row of the SM storing the filters,
and then updates the intermediate results by multiplying the
FT filter values and the first WT pixels (the green dashed
lines). In the second round, the FT filter values are loaded
from the second row of the SM, but the pixels are not loaded
again, as they are already in the row of WT +K−1 pixels
with an offset (the purple dashed lines).
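The accumulation performed by a single thread (lines 10 to 15 of Algorithm 2) can be written out sequentially to show how one row of WT+K−1 pixels is reused over K rounds with an increasing offset. The Python sketch below is a functional model only: it reads directly from nested lists instead of shared memory, merges the two channel loops into one, and ignores prefetching and synchronization. The names `conv_general_thread` and `conv_direct` and the test data are hypothetical.

```python
def conv_general_thread(img, flts, x0, y0, W_T):
    """Accumulate W_T contiguous outputs of row y0, starting at column x0,
    for every filter in flts, the way lines 10-15 of Algorithm 2 do.

    img[c][y][x] is the input, flts[f][c][j][k] the filters.
    """
    C, K, F_T = len(img), len(flts[0][0]), len(flts)
    r_acc = [[0] * W_T for _ in range(F_T)]
    for c in range(C):                       # channels (lines 7 and 10 merged)
        for j in range(K):                   # filter rows
            r_img = img[c][y0 + j][x0:x0 + W_T + K - 1]        # line 12
            for k in range(K):               # rounds: same r_img, offset k
                r_flt = [flts[f][c][j][k] for f in range(F_T)]  # line 14
                for f in range(F_T):         # line 15
                    for w in range(W_T):
                        r_acc[f][w] += r_flt[f] * r_img[k + w]
    return r_acc

def conv_direct(img, flt, x, y):
    """One output pixel of a multi-channel convolution, computed directly."""
    C, K = len(img), len(flt[0])
    return sum(img[c][y + j][x + k] * flt[c][j][k]
               for c in range(C) for j in range(K) for k in range(K))

# Made-up data: 2 channels of 5×8 pixels, 2 filters of size 3×3.
img = [[[(c + 2 * y + 3 * x) % 5 for x in range(8)] for y in range(5)]
       for c in range(2)]
flts = [[[[f + c + j - k for k in range(3)] for j in range(3)]
         for c in range(2)] for f in range(2)]
acc = conv_general_thread(img, flts, x0=2, y0=1, W_T=4)
```

Each row `r_img` is fetched once and then used in K rounds, which is the source of the reduction in SM reads discussed in Section 4.2.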
Compared with direct GEMM-based convolution methods,
our method reduces GM communication to approximately
1/K of the original, since one image row is used by the
convolution of K rows. The SM communication for fetching
image pixels is reduced to (WT+K−1)/(WT·K) of the original.
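For concreteness, the ratio (WT+K−1)/(WT·K) can be evaluated for the WT values that Table 1 (Section 5.2) lists for each filter size. The sketch below is plain arithmetic on that formula; the helper name `sm_pixel_read_ratio` is hypothetical.

```python
from fractions import Fraction

def sm_pixel_read_ratio(W_T, K):
    """SM pixel reads of one thread computing W_T contiguous outputs,
    relative to W_T separate threads each reading K pixels per row."""
    return Fraction(W_T + K - 1, W_T * K)

# (K, W_T) pairs taken from Table 1.
ratios = {K: sm_pixel_read_ratio(W_T, K) for K, W_T in [(3, 16), (5, 8), (7, 8)]}
```

With these settings the ratio is 3/8 for the 3×3 filter, 3/10 for the 5×5 filter, and 1/4 for the 7×7 filter, so larger filters benefit more from keeping a row of pixels in registers.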
Figure 7: Performance of the special case convolution for
different convolution parameters (N,K,F). (a) 1×1 filter.
(b) 3×3 filter. (c) 5×5 filter. (Each panel plots GFlop/s for
cuDNN and our kernel; panel (b) also includes the unmatched
kernel.)
5. EXPERIMENTAL RESULTS
We have implemented our proposed methods and con-
ducted experiments on a Kepler K40m GPU with peak per-
formance of 4290 giga floating-point operations per second
(GFlop/s) for single-precision. Our code is compiled with
compute capability 3.5. As we aim at direct convolution,
we compare our kernels with the GEMM-based convolution
provided by cuDNN [8] (version 5.1).
5.1 Results of Special Case
Through design space exploration, we determined that
the best block size for the special case convolution kernel is
W = 256 and H = 8. The performance comparison between
our kernel and cuDNN for different convolution parameters
(image size N , filter size K, and number of filters F ) is
shown in Fig. 7. For the 1×1 filter, actually there is no data
sharing; however, our kernel still obtains an average 6.16×
performance gain, due to the well-designed communication-
optimal kernel. For the 3×3 and 5×5 filters, our kernel
obtains 6.43× and 2.90× average performance gains over
cuDNN, respectively. The average performance gain of the
three filters we have tested is 5.16×. The performance is
lower when F = 1, due to the low overlap between commu-
nication and computation, as the computation workload is
quite low for F = 1. However, our kernel can be more than
10× faster than cuDNN when F =1.
For the 3×3 filter, we have also implemented another ker-
nel in which WCD and WSMB are unmatched, i.e., the basic
unit for computation is float. As seen from Fig. 7b, the
performance is reduced by 19% if WCD and WSMB are un-
matched. It can be expected that the performance degrada-
tion will be higher for the general case if WCD and WSMB
are unmatched, as the SM is used to store both the input
image and the filters in the general case. If we compare the
unmatched kernel with cuDNN, even if WCD and WSMB are
unmatched, our parallelization strategy is still much better
than cuDNN for the special case.
5.2 Results of General Case
Table 1: Best configurations of our general case convolution
kernel for different filter sizes for Kepler K40m.
Filter size    3×3    5×5    7×7
W               32     32     64
H                4      8      4
FTB             64     32     32
WT              16      8      8
FT               4      8      8
CSH              2      1      1
Figure 8: Performance of the general case convolution for
different convolution parameters (N,K,C,F). (a) 3×3 filter.
(b) 5×5 filter. (c) 7×7 filter. (Each panel plots GFlop/s for
cuDNN and our kernel.)
Table 1 lists the best configurations of our general case
convolution kernel for different filter sizes for the Kepler
K40m GPU. The performance comparison between our ker-
nel and cuDNN for different convolution parameters (image
size N , filter size K, number of channels C, and number
of filters F ) is shown in Fig. 8. For the three filter sizes
we tested, we get 30.5%, 45.3%, and 30.8% average improvements
over cuDNN. Only when the image is very small (32×32) may our
kernel be slightly slower than cuDNN; in all the other cases
we tested, our kernel is faster than cuDNN. The
average performance improvement of the three filter sizes is
35.5%. The highest performance we have achieved is 2020
GFlop/s, which is 47% of the hardware peak performance.
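The headline figures of Sections 5.1 and 5.2 follow from simple arithmetic on the reported per-filter results, as the following short Python check shows. All numbers are taken from the text above; only the variable names are introduced here.

```python
# Average speedups over cuDNN for the special case (1×1, 3×3, 5×5 filters).
special_gains = [6.16, 6.43, 2.90]
special_mean = sum(special_gains) / len(special_gains)

# Average improvements (%) over cuDNN for the general case (3×3, 5×5, 7×7).
general_gains = [30.5, 45.3, 30.8]
general_mean = sum(general_gains) / len(general_gains)

# Best measured throughput relative to the K40m single-precision peak.
peak_fraction = 2020 / 4290
```

Rounded to the precision used in the text, these evaluate to 5.16×, 35.5%, and 47% of peak, respectively.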
6. CONCLUSIONS
In this paper, we introduced a general model to address
the mismatch between the SM bank width and computation
data width of threads. Based on this model, we designed
and optimized two convolution kernels on GPUs. By care-
fully optimizing the thread layout and memory access pat-
terns, we attained 5.16× and 35.5% average performance
improvements over the latest cuDNN library, for the special
case and the general case, respectively.
Although we have only implemented our convolution ker-
nels on the Kepler architecture, our proposed ideas can be
applied to other applications and architectures. For exam-
ple, one of the recent development trends of CNNs is to use
shorter data types, such as half-precision floating-points and
16- or 8-bit fixed-points, to reduce both the storage require-
ment and execution time. For these data types, mismatch
between the SM bank width and the computation data width
exists even for architectures with 4-byte SM bank width. As
a result, our proposed model and method will benefit appli-
cations using these data types.
7. ACKNOWLEDGMENTS
This project was supported by the National Science Foundation
under grants 1640081, 1217906, 1629914 and 1617735, and the
Nanoelectronics Research Corporation (NERC), a wholly owned
subsidiary of the Semiconductor Research Corporation (SRC),
through Extremely Energy Efficient Collective Electronics (EX-
CEL), an SRC-NRI Nanoelectronics Research Initiative under Re-
search Task IDs 2698.004 and 2698.005.
8. REFERENCES
[1] Rafael C Gonzalez and Richard E Woods. Digital Image
Processing. Pearson, 3rd edition, 2007.
[2] S. Chaudhuri, S. Chatterjee, N. Katz, M. Nelson, and
M. Goldbaum. Detection of Blood Vessels in Retinal Images
Using Two-dimensional Matched Filters. IEEE Transactions
on Medical Imaging, 8(3):263–269, 1989.
[3] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel. Backpropagation Applied to
Handwritten Zip Code Recognition. Neural Comput.,
1(4):541–551, 1989.
[4] Karen Simonyan and Andrew Zisserman. Very Deep
Convolutional Networks for Large-Scale Image Recognition.
CoRR, abs/1409.1556, 2014.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet Classification with Deep convolutional Neural
Networks. In NIPS, pages 1097–1105, 2012.
[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra
Malik. Rich Feature Hierarchies for Accurate Object Detection
and Semantic Segmentation. In CVPR, pages 580–587, 2014.
[7] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High
Performance Convolutional Neural Networks for Document
Processing. In ICFHR, 2006.
[8] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch,
Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan
Shelhamer. cuDNN: Efficient Primitives for Deep Learning.
CoRR, abs/1410.0759, 2014.
[9] S. Li, Y. Zhang, C. Xiang, and L. Shi. Fast Convolution
Operations on Many-Core Architectures. In HPCC, pages
316–323, 2015.
[10] Ben Van Werkhoven, Jason Maassen, Henri E. Bal, and
Frank J. Seinstra. Optimizing Convolution Operations on
GPUs Using Adaptive Tiling. Future Gener. Comput. Syst.,
30:14–26, 2014.
[11] cuda-convnet2. Url:
https://code.google.com/archive/p/cuda-convnet2/.
[12] Michaël Mathieu, Mikael Henaff, and Yann LeCun. Fast
Training of Convolutional Networks through FFTs. CoRR,
abs/1312.5851, 2013.
[13] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith
Chintala, Serkan Piantino, and Yann LeCun. Fast
Convolutional Nets With fbfft: A GPU Performance
Evaluation. CoRR, abs/1412.7580, 2014.
[14] Tyler Highlander and Andres Rodriguez. Very Efficient
Training of Convolutional Neural Networks using Fast Fourier
Transform and Overlap-and-Add. CoRR, abs/1601.06815, 2016.
[15] Andrew Lavin and Scott Gray. Fast Algorithms for Convolu-
tional Neural Networks. In CVPR, pages 4013–4021, 2016.
[16] Hyunsun Park, Dongyoung Kim, Junwhan Ahn, and Sungjoo
Yoo. Zero and Data Reuse-aware Fast Convolution for Deep
Neural Networks on GPU. In CODES, pages 33:1–33:10, 2016.
[17] cuBLAS. Url: http://docs.nvidia.com/cuda/cublas/.
[18] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and
Trevor Darrell. Caffe: Convolutional Architecture for Fast
Feature Embedding. CoRR, abs/1408.5093, 2014.
[19] Rajib Nath, Stanimire Tomov, and Jack Dongarra. An
Improved MAGMA Gemm For Fermi Graphics Processing
Units. Int. J. High Perform. Comput. Appl., 24(4):511–515,
2010.
