CUDA ENHANCED FILTERING IN A PIPELINED VIDEO PROCESSING FRAMEWORK by Dworaczyk Wiltshire, Austin Aaron
CUDA ENHANCED FILTERING IN A PIPELINED VIDEO PROCESSING
FRAMEWORK
A Thesis
presented to
the Faculty of California Polytechnic State University
San Luis Obispo
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Computer Science
by
Austin Dworaczyk Wiltshire
June 2013
© 2013
Austin Dworaczyk Wiltshire
ALL RIGHTS RESERVED
COMMITTEE MEMBERSHIP
TITLE: CUDA Enhanced Filtering in a Pipelined
Video Processing Framework
AUTHOR: Austin Dworaczyk Wiltshire
DATE SUBMITTED: June 2013
COMMITTEE CHAIR: Professor Christopher Lupo, Ph.D., Com-
puter Science Department
COMMITTEE MEMBER: Professor Alexander Dekhtyar, Ph.D.,
Computer Science Department
COMMITTEE MEMBER: Professor John Seng, Ph.D., Computer Sci-
ence Department
Abstract
CUDA Enhanced Filtering in a Pipelined Video Processing Framework
Austin Dworaczyk Wiltshire
The processing of digital video has long been a significant computational
task for modern x86 processors. With every video frame composed of one to
three planes, each consisting of a two-dimensional array of pixel data, and a
video clip comprising thousands of such frames, the sheer volume of data is
significant. With the introduction of new high definition video formats such as
4K or stereoscopic 3D, the volume of uncompressed frame data is growing ever
larger.
Modern CPUs offer performance enhancements for processing digital video
through SIMD instructions such as SSE2 or AVX. However, even with these
instruction sets, CPUs are limited by their inherently sequential design, and can
only operate on a handful of bytes in parallel. Even processors with a multitude
of cores only execute on an elementary level of parallelism.
GPUs provide an alternative, massively parallel architecture. GPUs differ
from CPUs by providing thousands of throughput-oriented cores, instead of a
maximum of tens of generalized “good enough at everything” x86 cores. The
GPU’s throughput-oriented cores are far more adept at handling large arrays of
pixel data, as many video filtering operations can be performed independently.
This computational independence allows for pixel processing to scale across hun-
dreds or even thousands of device cores.
This thesis explores the utilization of GPUs for video processing, and evaluates
the advantages and caveats of porting the modern video filtering framework,
Vapoursynth, to run entirely on the GPU. Compute-heavy GPU-enabled
video processing results in up to a 108% speedup over an SSE2-optimized,
multithreaded CPU implementation.
Contents
List of Tables
List of Figures
1 Introduction
2 Background
2.1 Planar Video
2.2 Byte Representation
2.3 Vapoursynth
2.4 CUDA Architecture
2.4.1 Thread Block Model
3 Design
3.1 From Core to CUDA
3.2 CUDA Performance Optimizations
4 Implementation
4.1 Lut
4.2 Merge
4.3 Transpose
4.4 Expr
4.4.1 L1 Cache vs Shared Memory
4.4.2 Constant Memory
4.5 CUDA Streams
4.6 Validation
5 Results
5.1 Lut Results
5.2 Merge Results
5.3 Transpose Results
5.4 Expr Results
5.5 CUDA Stream Results
5.6 Complex Script Results
5.7 Fermi and Kepler
6 Conclusion
7 Future Work
7.1 Multi-GPU Support
7.2 Extended Filter Support
7.3 Expanded Bits Per Pixel Support
7.4 Investigate CPU Threading Problems
7.5 Providing Support for Non-CUDA Devices
Bibliography
List of Tables
4.1 An overview of the core filters available in Vapoursynth
List of Figures
2.1 An example of banding [1].
2.2 An example layout of blocks in a grid and each block’s associated thread array [18].
2.3 The basic layout of Streaming Multiprocessors in the Fermi architecture [18].
3.1 The effects of containing all shipping logic in one filter versus a filter-by-filter basis. Larger is better.
4.1 Invert results, with unoptimized and optimized kernels running on the CUDA Kepler and Fermi architectures.
5.1 Lut execution speed in frames per second. Higher is better.
5.2 Merge execution speed in frames per second. Higher is better.
5.3 Transpose execution speed in frames per second. Higher is better.
5.4 Expr execution speed in frames per second. Higher is better.
5.5 Expr execution speed in frames per second for 1 filter iteration.
5.6 Expr execution speed in frames per second for 16 filter iterations.
5.7 Expr execution speed in frames per second for 32 filter iterations.
5.8 CUDA Stream results for one filter iteration on the Fermi architecture. Greater is better.
5.9 CUDA Stream results for one filter iteration on the Kepler architecture. Greater is better.
5.10 The results of a complex script run on an i7 Xeon E5-2650, i7 3770k, GTX 560 Ti, and Kepler K20Xm GPU.
Chapter 1
Introduction
Video processing (filtering, noise removal, transformation, etc.) is a compu-
tationally intensive process, frequently requiring several seconds per frame in the
case of extreme motion compensation or similar tasks. Historically, video pro-
cessing speed has been directly proportional to the sequential processing speed
of a single core CPU.
With the move to multiple CPUs, and eventually multicore CPUs, editing
and processing speeds improved greatly. This performance boost was eventually
counteracted with the move to larger resolution video formats such as 720p, 1080p
(2K), and eventually 4K and 8K. These massive increases in video resolution led
to a heavier and heavier burden on inherently sequential CPUs.
In order to improve multimedia handling in desktop CPUs, chip makers such
as Intel and AMD created an instruction set enabling programmers to guide the
CPU in performing a uniform operation, or instruction, on multiple chunks of
pixels, or data, at the same time. This form of parallelism is frequently referred
to as SIMD, or Single Instruction Multiple Data.
These SIMD instruction sets (like MMX, SSE, SSE2, 3DNow!, AVX, etc.) [7][9]
often operate on either a 64-bit or 128-bit data block [12]. This instruction width,
combined with the 8-bit representation of pixel data in modern video [5], allows
for a maximum of 16 pixels to be operated on in parallel by a single core
processor. While an excellent improvement over standard sequential processing, it
still offers a relatively low performance ceiling, considering that a single
1920 x 1080 frame consists of 2,073,600 pixels.
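
The arithmetic behind this ceiling can be made concrete with a short Python sketch. The function names are illustrative and do not belong to any SIMD library; the sketch only counts how many pixels fit in one register and how many register-wide operations a single plane would need, assuming one operation per full register of pixels.

```python
def pixels_per_simd_op(register_bits: int, bits_per_pixel: int = 8) -> int:
    """Number of pixels that fit in one SIMD register."""
    return register_bits // bits_per_pixel


def simd_ops_per_plane(width: int, height: int,
                       register_bits: int = 128,
                       bits_per_pixel: int = 8) -> int:
    """Register-wide operations needed to touch every pixel of a plane once."""
    lanes = pixels_per_simd_op(register_bits, bits_per_pixel)
    total_pixels = width * height
    # Round up so that a partially filled final register still counts.
    return (total_pixels + lanes - 1) // lanes
```

Under these assumptions a 128-bit register holds 16 pixels, so one pass over a 1920 x 1080 plane still requires 129,600 register-wide operations on a single core.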
Recently, an industry shift towards GPU-oriented processing has occurred.
GPU architectures offer an alternative approach to video processing. By provid-
ing hundreds or thousands of cores, along with a programming and scheduling
model aimed specifically at massive compute scalability across all available data,
GPUs achieve a high amount of data parallelism. The massively multithreaded
approach to general purpose GPU (GPGPU) programming allows for each thread
to process one pixel (or more, as covered later in this paper) at a time, completely
independent of all other threads. Video pixels are processed by the thousands in
GPUs compared to the tens in the case of a standard CPU core.
This thesis details an extension of the Vapoursynth [10] video processing
framework that supports a fully GPU-enabled filtering pipeline. It aims to run
as much filtering logic on the GPU as possible, resulting in a performance boost
of up to 108% for particular, compute-bound, filter implementations.
All background information will be detailed in Chapter 2, followed by overall
design in Chapter 3, implementation in Chapter 4, results in Chapter 5, and
conclusion and future work in Chapters 6 and 7.
Chapter 2
Background
Video data is represented in a multitude of formats, which vary in colorspace,
resolution, and bit depth. While these formats vary, they are all combinations of
several uniform principles, which are described in the following sections. Addi-
tionally, the general history and architecture of the Vapoursynth video processing
framework are detailed in Section 2.3.
2.1 Planar Video
The core of video data is represented by a two dimensional array of pixel val-
ues, where each value corresponds to a color or intensity. Video data is commonly
represented using three separate planes, and thus three separate arrays. In the
case of RGB video, each plane corresponds to red-only, green-only, or blue-only
color values. In the case of YV12 video, the base plane corresponds to only luma,
or luminance, values, while the two remaining planes correspond to subsampled
chroma, or chrominance, values. The chroma planes are subsampled in order to
conserve space, and take advantage of the fact that the human eye is more sen-
sitive to luma information and has trouble discerning variation in chroma-only
information [8].
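
The space saved by chroma subsampling can be estimated with a small Python sketch. It assumes that YV12 halves the chroma planes in both dimensions (4:2:0 subsampling) and that frame dimensions are even; the helper names are hypothetical and are not part of Vapoursynth.

```python
def plane_sizes(width: int, height: int,
                subsample_w: int = 1, subsample_h: int = 1) -> list:
    """Return (width, height) for the three planes of a planar frame.

    subsample_w and subsample_h are the horizontal and vertical chroma
    divisors: 1 and 1 for RGB, 2 and 2 for YV12 (assumed 4:2:0).
    """
    chroma = (width // subsample_w, height // subsample_h)
    return [(width, height), chroma, chroma]


def frame_bytes(sizes: list, bytes_per_sample: int = 1) -> int:
    """Total storage for one frame, given per-plane dimensions."""
    return sum(w * h for w, h in sizes) * bytes_per_sample
```

For an 8-bit 1920 x 1080 frame this gives 6,220,800 bytes for RGB and 3,110,400 bytes for YV12, so the subsampled format needs half the storage.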
2.2 Byte Representation
Video formats are not only distinguished by plane count, but by the word size
used to represent a pixel value in each plane. Most video formats represent a pixel
using only 8 bits [5], or 256 color values. This is done to conserve space, while
still providing an acceptable quality image. Professional video is often stored
using 9, 10, or 16 bits (512, 1024, or 65536 color values, respectively) in order
to retain more color information and thus a closer representation to the original
film content [25]. Most consumer grade equipment cannot play formats with
greater than 8 bits per pixel, but this is starting to change as consumers recognize
the benefits of an enhanced color range, particularly for computer generated or
animated content.
A recent movement in Japan has seen the increased usage of 10-bit video to
store anime content [14], as its increased color range prevents a form of visual
distortion referred to as “banding” (see the 8-bit gradient in Figure 2.1). Banding
is often seen in 8-bit video containing large swaths of solid color or gradients.
These large swaths of color are often seen in anime due to its artistic, or drawn,
nature, and can be particularly distracting in scenes containing large blank walls
or close up shots of character faces.
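
A rough way to see why extra bit depth reduces banding is to quantize the same smooth ramp at two bit depths and count the distinct steps. The sketch below is a simplified model, not a measurement: it assumes a horizontal gradient covering one tenth of the full code range, and the function names are illustrative.

```python
def quantized_ramp(width: int, bits: int, span_divisor: int = 10) -> list:
    """Quantize a left-to-right ramp covering 1/span_divisor of the code range."""
    max_code = (1 << bits) - 1
    top = max_code // span_divisor  # highest code value reached by the ramp
    return [(x * top) // (width - 1) for x in range(width)]


def band_count(width: int, bits: int) -> int:
    """Number of distinct code values, i.e. visible steps, in the ramp."""
    return len(set(quantized_ramp(width, bits)))
```

Across 1920 pixels the 8-bit ramp has only 26 steps (bands roughly 74 pixels wide), while the 10-bit ramp has 103 steps (bands under 20 pixels wide), which is far harder to notice.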
True 10-bit video playback requires a 10-bit capable display, which is not a
common feature in standard desktop monitors. Still, 10-bit video processing is
becoming preferred by many encoding experts in the industry, as it provides a
greater pixel precision, allowing for greater accuracy in color interpolation and
downsampling.

Figure 2.1: An example of banding [1].
2.3 Vapoursynth
Vapoursynth was written by Fredrik Mellbin [10] and designed from the
ground up to be a portable and efficient video filtering framework based on
the much older framework, Avisynth. Avisynth was started back in 2003 by
Ben Rudiak-Gould and began as a Windows-only, single-threaded application
[23]. Later iterations expanded colorspace support [6] and added multithreading
through specialized plugins [24].
Vapoursynth provides a strong API that can be ported to any number of
scripting language frontends, including Python, Lua [4], and more. For the sake of
simplicity, and the fact that a variety of processing options are already available,
audio processing is excluded from initial versions of Vapoursynth.
Vapoursynth operates through a filter chain paradigm. A video is imported
(or created through the use of Vapoursynth’s BlankClip() function), and then
processed in sequence by any number of filters in a sequential chain. Each frame
of the input video is processed by every filter in the order dictated in the loaded
script. Additional filters outside of the standard library can be loaded dynami-
cally at runtime, providing a fully modular filtering experience.
This modular approach allows for plugin authors to ignore the specifics of
frame loading, frame caching, and even the threading model, and instead focus
on strong, optimized algorithms for their plugin implementations.
Additionally, Vapoursynth provides two performance enhancements out of the
box: a central threadpool and a frame cache. The thread pool allows for the easy
utilization of multicore CPUs. Each thread is assigned to a single frame, which
is then pushed through the filter pipeline. After this frame has been processed,
it is readied for output in the central core and the processing thread releases its
resources and returns to the thread pool to process another frame.
The decision to implement frame-level parallelism helps to keep filter au-
thors from manually managing thread resources. This separation encourages
clean and concise code while simultaneously preventing race conditions and non-
deterministic bug tracking.
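
The combination of a filter chain and frame-level parallelism can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than Vapoursynth's actual implementation: frames are plain lists of pixel values, filters are ordinary callables, and the names run_chain and process_clip are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain(frame: list, filters: list) -> list:
    """Push one frame through every filter, in script order."""
    for apply_filter in filters:
        frame = apply_filter(frame)
    return frame


def process_clip(frames: list, filters: list, workers: int = 4) -> list:
    """Frame-level parallelism: each worker owns one frame at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so output frame order
        # matches the source clip regardless of completion order.
        return list(pool.map(lambda f: run_chain(f, filters), frames))
```

Because a filter only ever sees one frame and never touches the pool, filter code in this model needs no locking of its own, which is the separation described above.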
Vapoursynth’s frame cache acts as a central list for currently processing
frames. By limiting the list size, Vapoursynth can precisely manage its mem-
ory usage, preventing large memory consumption spikes. The frame cache is
especially useful in the case of temporal filtering plugins, which require specific
frames within a radius before or after the current frame. Frames within that
radius are likely to be stored in the current frame cache, preventing expensive
retrieval and processing stalls while a temporal filter waits for dependent frames
to be readied.
2.4 CUDA Architecture
NVIDIA’s Compute Unified Device Architecture (CUDA) [22] embodies a
massively parallel computation platform, with a scalable programming model
that efficiently adapts to any problem size. The CUDA programming SDK al-
lows for a write once, deploy everywhere approach to data computation, as all
computation functions (referred to as “kernels” on the CUDA platform) scale
inherently to the capabilities of the NVIDIA device on which they run. It is for
this reason that hardware upgrades to a CUDA-capable machine result in a di-
rectly proportional performance increase with no code recompilation or additional
optimization.
Of course, architectures evolve and certain programming paradigms should
be followed in order to take full advantage of all compute devices. A complete
understanding of CUDA’s thread block model is required to correctly and ef-
ficiently optimize GPU kernels, and since the remainder of this paper revolves
around the CUDA architecture, a quick summary of CUDA is provided in the
following subsections.
2.4.1 Thread Block Model
CUDA operates around the idea of a block grid, where each block contains
an array of threads (see Figure 2.2). Both grids and blocks can be expressed
in 1, 2, or 3 dimensions as long as they do not surpass the limits of the base
compute architecture. These limits change (generally by growing) from one
architecture version (commonly referred to as “Compute Capability” (CC) in
CUDA documentation) to the next, thus requiring a bit of adaptability by
the kernel runner when executing. Luckily, CUDA provides support for querying
all device capabilities (across multiple devices), enabling kernel executions to be
tuned at runtime to a device’s CC.

[Figure 2.2 diagram: a grid of 3 x 2 blocks, Block (0, 0) through Block (2, 1), with Block (1, 1) expanded into a 4 x 3 array of threads, Thread (0, 0) through Thread (3, 2).]
Figure 2.2: An example layout of blocks in a grid and each block’s
associated thread array [18].
As previously mentioned, each thread block consists of an array of threads.
Thread blocks are limited to a maximum of 512 threads for devices less than
CC 2.0, and 1024 threads for CC 2.0 and above [15]. Grids have much larger
dimensions, with the x-dimension limited to 65535 blocks for CC less than 3.0
and $2^{31} - 1$ blocks for CC 3.0 and above. The y- and z-dimensions of a grid
are limited to 65535 blocks.

[Figure 2.3 diagram: a Fermi Streaming Multiprocessor, showing the instruction cache, two warp schedulers with their dispatch units, a register file (32,768 x 32-bit), 32 cores, 16 load/store (LD/ST) units, 4 special function units (SFU), an interconnect network, 64 KB of shared memory / L1 cache, and a uniform cache.]
Figure 2.3: The basic layout of Streaming Multiprocessors in the Fermi
architecture [18].
Threads are executed by Streaming Multiprocessors (SM), which partition
and enumerate incoming thread blocks. All threads in a block execute concur-
rently on a SM, and multiple thread blocks can execute concurrently on a SM.
The threads of a block are partitioned into warps of 32 threads and each thread
operates in lockstep with the rest of the warp. All threads in a warp execute the
same instruction at the same time, thus if a particular thread diverges, the warp
executes each alternate code path in a serial fashion. Therefore it is important
to reduce branch divergence as much as possible to obtain optimal performance.
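
The relationship between a plane, its grid of blocks, and the warps inside each block can be sketched in Python. The 32 x 16 block shape is an assumption chosen for illustration, not a value taken from this work, and the function names are hypothetical.

```python
import math

WARP_SIZE = 32


def grid_dims(width: int, height: int, block=(32, 16)) -> tuple:
    """Blocks needed in x and y so that every pixel has a thread."""
    return (math.ceil(width / block[0]), math.ceil(height / block[1]))


def warps_per_block(block=(32, 16)) -> int:
    """Warps the hardware forms from one block's threads."""
    return math.ceil(block[0] * block[1] / WARP_SIZE)


def global_pixel(block_idx: tuple, thread_idx: tuple, block=(32, 16)) -> tuple:
    """Mirror of the usual blockIdx * blockDim + threadIdx computation."""
    return (block_idx[0] * block[0] + thread_idx[0],
            block_idx[1] * block[1] + thread_idx[1])
```

For a 1920 x 1080 plane this yields a 60 x 68 grid of 16-warp blocks. Because 1080 is not a multiple of 16, the last row of blocks contains threads whose y coordinate exceeds 1079, so a kernel laid out this way needs a bounds check before touching memory.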
CUDA offers a variety of architecture enhancements that provide a strong
toolset for programmers to write high performance kernels. These enhancements
include block-level shared memory, texture and constant memory caches, a va-
riety of atomic and synchronization functions, and much more. All of these
enhancements are used in the kernels described in subsequent chapters and will
be detailed as needed.
Chapter 3
Design
The goal behind the CUDA port of Vapoursynth is to create a high-performance
library of GPU-enabled kernels for use in dynamic video filtering. Additionally,
all enhancements are designed to be API compatible and support backwards
compatibility with all legacy plugins.
All CUDA kernels are required to be in separate source files, which are exter-
nally callable via a C wrapper. To keep things simple, each filter examines the
frame storage location (either host side or device side) of an incoming frame and
picks the appropriate processing kernel. If a frame is located in the host’s local
memory, then a CPU kernel is executed. If the frame is located in the video card
memory, then a GPU kernel is executed.
Initial design efforts abstracted out several steps of GPU processing in the
core. This was done to encourage additional ports to OpenCL or alternative
GPU programming frameworks, preventing Vapoursynth from being locked into
CUDA, and thus only NVIDIA GPUs.
During the design phase of this project, a second alternative came to light
through the GPU Ocelot framework [2], but its utilization is left for future work.
3.1 From Core to CUDA
The CUDA enhancement layer is designed to integrate as closely with the core
as possible, taking advantage of already written memory management routines,
while also enhancing the filtering pipeline to ensure optimal processing speeds
when working with GPU data.
A separate memory counter was established in order to allow for the inde-
pendent tracking of host memory and device memory. Ideally, this will allow for
completely separate frame caches, but this is not currently implemented and is
intended for future work.
One of the central tenets of GPU processing is to ship all necessary data to
the GPU and keep it there as long as possible. This is recommended to reduce
the cost of communication over the relatively slow PCI Express bus between the
GPU and the host system. If each filter had to ship frame data to and from the
GPU every time it’s executed, processing would be severely limited to the PCI
Express bus speed, instead of the device computation speed. Additionally, the
frame transfer logic would be redundant between all GPU-capable filters, as each
would require a copy of the transportation code.
In order to prevent the PCI Express bus bottleneck, an additional filter has
been added to Vapoursynth’s standard library. This new filter has the sole pur-
pose of shipping data to and from the GPU, and must be explicitly called by
the user in the loaded script. This central filter contains all frame shipping logic
in one place, allowing for optimal PCI Express bus utilization and cleaner filter
code.

[Figure 3.1 chart: TransferFrame Integration Results. Y-axis: framerate (fps), 0 to 700. Measured quantity: H.264 encoding speed. Series: No TransferFrame, TransferFrame.]
Figure 3.1: The effects of containing all shipping logic in one filter
versus a filter-by-filter basis. Larger is better.
The effects of this filter are expressed in Figure 3.1. It is clear that containing
frame handling code in one filter leads to a significant performance increase in
the filter pipeline.
3.2 CUDA Performance Optimizations
Several design considerations are key to ensuring optimal kernel performance
for each filter. The first consideration is the aforementioned frame shipping filter,
hereafter referred to as TransferFrame. TransferFrame is a user-controlled filter
whose sole purpose is to ship data to and from the GPU.
Another design consideration, one that applies directly to every kernel, is co-
alesced global memory accesses. Global memory coalescing is a key optimization
point in CUDA kernels, and is distinctly detailed in the CUDA Best Practices
guide [21]. In order to ensure every filter’s GPU kernel operates at optimal speed,
slight modifications are necessary. These modifications are generally not easy to
comprehend at first glance.
For example, a naïve kernel operates on a source plane, with each GPU thread
responsible for processing a single pixel. This is a simplistic design that normally
works well for matrix data. There is just one problem when it comes to video data
that separates it from standard computational mathematics. Standard matrix
operations frequently deal with integer or floating point data. Integer and floating
point numbers are stored using 4 bytes on CUDA devices. In contrast, most video
data is stored in 8 bits, or 1 byte in order to save space. Some professional video
is stored in 9, 10, or 16 bits, and very rarely as a float [8]. Due to the standard
video pixel taking up just 1 byte, or a fourth of a standard integer, special care
must be taken to ensure fully optimized memory accesses on CUDA devices.
On most CUDA devices, with Compute Capability greater than or equal to
2.0, global memory transactions are coalesced along a 128-byte L1 cache line [21].
Assuming a kernel with a block width of at least 32 threads, and a standard
warp size of 32 threads, if the input data is stored in a standard 4-byte
integer format and each thread requests one integer, the whole
warp request coalesces into a single 32 x 4 byte request, i.e., 128 bytes.
With consumer video’s 8-bit format, this presents a bit of a problem. Since
coalescing is concerned with the requests made per warp, and the standard warp
size is 32 threads, then a standard memory access of one pixel per thread only
requests 32 x 1 bytes. This is only a fourth of the standard 128 byte L1 cache
line, which means that four requests are needed to properly fill an entire cache
line. Ergo, a kernel will only operate at around a fourth of its optimal bus speed.
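
The cache line arithmetic above can be written out as a short Python sketch. The constants restate the figures given in the text (a 32-thread warp and a 128-byte L1 cache line); the function names are illustrative.

```python
WARP_SIZE = 32          # threads per warp
CACHE_LINE_BYTES = 128  # L1 cache line size for CC 2.0 and above


def warp_request_bytes(bytes_per_thread: int) -> int:
    """Contiguous bytes requested by one warp in a single access."""
    return WARP_SIZE * bytes_per_thread


def cache_line_utilization(bytes_per_thread: int) -> float:
    """Fraction of one cache line that a single warp request fills."""
    return min(1.0, warp_request_bytes(bytes_per_thread) / CACHE_LINE_BYTES)
```

One byte per thread fills a quarter of the line, two bytes fill half, and four bytes fill it completely, which motivates the casting technique described next.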
It is possible to work around these access limitations of 8-bit video through
the use of pointer casting. By casting the source pointer from an 8-bit integer
pointer to a 32-bit (4-byte) integer pointer, each warp’s memory transactions
fulfill the 32 x 4 byte requirement for optimal coalescing. Once the 4-byte
integer is retrieved, its contents are unpacked into four 1-byte integers, and
then processed according to the original filter algorithm. Once all bytes have
been processed, the results are repacked into a single 4-byte integer and
written back to global memory, thus fulfilling
the requirements for coalesced reads and writes.
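
The unpack, process, and repack sequence can be emulated on the host in Python. This is only a model of the technique, not the kernel used in this work: Python has no pointer casts, so int.from_bytes stands in for the 4-byte load, each loop iteration plays the role of one GPU thread, and the plane length is assumed to be a multiple of four bytes. The name process_packed is hypothetical.

```python
def process_packed(plane: bytes, pixel_op) -> bytes:
    """Apply pixel_op to every 8-bit pixel, four pixels per 32-bit word."""
    if len(plane) % 4 != 0:
        raise ValueError("plane length must be a multiple of 4 bytes")
    out = bytearray(len(plane))
    for offset in range(0, len(plane), 4):       # one iteration = one thread
        # Single 4-byte load (little-endian, as on NVIDIA GPUs).
        word = int.from_bytes(plane[offset:offset + 4], "little")
        result = 0
        for lane in range(4):
            pixel = (word >> (8 * lane)) & 0xFF  # unpack one pixel
            result |= (pixel_op(pixel) & 0xFF) << (8 * lane)  # repack result
        # Single 4-byte store.
        out[offset:offset + 4] = result.to_bytes(4, "little")
    return bytes(out)
```

The output is identical to processing one pixel at a time; only the number and width of the memory transactions changes.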
Due to the use of strided memory accesses and uniform block distribution,
special consideration must be made when allocating CUDA grids that utilize this
casting technique. Strided memory allocation is a programming technique used
to optimize lookup performance in large array allocations. Memory is allocated
using a unit stride that aligns with a specific device modulus. For example, frames
on the CPU are allocated using a stride of 32 bytes, while they are allocated on
the GPU with a stride of 512 bytes. This is done to ensure that all memory
allocations align well within the device architecture, instead of being offset and
inhibiting read and write performance.
Since each thread is in fact processing four pixels instead of the normal one,
the number of required blocks in the x-dimension reduces to a fourth of the orig-
inal grid size. If a mistake is made and this fact is not accounted for, some
pixels will be redundantly reprocessed, effectively wasting GPU resources. Mem-
ory allocation strides also need to be compensated accordingly in CUDA kernels,
compensating for the fact that the new data source and destination appear to be
a fourth as wide compared to their 1-byte counterparts.
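
The adjustments to grid size and stride can be sketched as follows. The 32-thread block width and the helper names are assumptions made for illustration; the 32-byte and 512-byte alignments are the CPU and GPU strides quoted above.

```python
def aligned_stride(row_bytes: int, alignment: int) -> int:
    """Round a row length up to the next multiple of alignment."""
    return (row_bytes + alignment - 1) // alignment * alignment


def packed_launch_geometry(width: int, stride_bytes: int,
                           block_w: int = 32,
                           pixels_per_thread: int = 4) -> tuple:
    """Return (blocks in x, stride in units of the packed word)."""
    threads_x = (width + pixels_per_thread - 1) // pixels_per_thread
    blocks_x = (threads_x + block_w - 1) // block_w
    # The cast pointer indexes 4-byte words, so the stride shrinks too.
    stride_words = stride_bytes // pixels_per_thread
    return blocks_x, stride_words
```

For a 1920-pixel-wide 8-bit plane, a GPU stride of 2048 bytes becomes 512 words and the grid needs 15 blocks in x instead of 60. Launching the original 60 blocks would process pixels redundantly, as described above.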
Chapter 4
Implementation
The central build system of the Vapoursynth project is Waf, a Python-backed
self-coined “meta build system”. The first step in CUDA integration was to create
a compiler hook for NVIDIA’s NVCC compiler in the Waf framework. This hook
triggers on all files with the “.cu” file extension, which contain all CUDA code.
A few workarounds are required to make the NVCC compiler fully operational
with the base Waf configuration script, particularly with respect to compiler
symbol definition and debugging symbols. This is due to NVCC’s ignorance of
standard compiler flags, which must be passed in via a special ‘-Xcompiler’ flag
to indicate which compiler flags are to be passed directly to the C/C++ compiler.
Vapoursynth requires several unique C/C++ flags in order to properly generate
its static libraries. These special flags require an extra filtering step during Waf’s
build configuration phase in order to get NVCC to work properly.
With CUDA compilation working, a test implementation was developed based
on the simple Invert filter. The Invert filter accepts an input frame and performs
a binary invert on each pixel. Since the actual operation is so concise, with no
extraneous code included in the base filter, the Invert filter proved to be a prime
proof-of-concept for initial CUDA porting efforts.

[Figure 4.1 chart: Invert Filter Optimization Results. Y-axis: execution speed (frames per second), 0 to 450. X-axis: 1, 16, and 32 iterations. Series: Kepler unoptimized, Kepler optimized, Fermi unoptimized, Fermi optimized.]
Figure 4.1: Invert results, with unoptimized and optimized kernels
running on the CUDA Kepler and Fermi architectures.
With all frame data being managed by the TransferFrame filter, Invert is an
extremely simple filter port that requires a basic CUDA kernel and an externally
callable C wrapper. This C wrapper is used to pass frame data to the GPU
kernel from the main processing function. The CUDA kernel is called once for
every plane in the input frame, which usually consists of a total of three planes
for luma and chroma information, as detailed in Chapter 2.
The kernel follows a form of pixel-level parallelization, where each pixel is
operated on independently. Each pixel is read from the source plane, binary
inverted, and then stored in the output frame.
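
A host-side reference for the operation is short enough to state directly. The sketch below assumes that a binary invert means flipping every bit of a sample within its bit depth; the function names are illustrative, and planes are modeled as flat lists of integers.

```python
def invert_plane(plane: list, bits: int = 8) -> list:
    """Flip every bit of each sample within the given bit depth."""
    max_code = (1 << bits) - 1
    # Each element is independent, mirroring one GPU thread per pixel.
    return [pixel ^ max_code for pixel in plane]


def invert_frame(frame: list, bits: int = 8) -> list:
    """Invert a frame plane by plane, as the kernel is launched per plane."""
    return [invert_plane(plane, bits) for plane in frame]
```

A reference like this is also convenient for checking GPU output against expected values.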
An initial version followed a simple design where all threads were assigned to
operate on one pixel each, and then write back their results to the destination
frame. As discussed in Section 3.2, this is a very inefficient design for the CUDA
architecture and provides relatively low throughput. A second, more efficient
kernel using casting for coalesced memory accesses was created. A comparison of
the execution speed between the two implementations is provided in Figure 4.1.

AddBorders                     AssumeFPS        BlankClip       Cache
ClipToProp                     CropAbs/CropRel  DoubleWeave     Expr
FlipVertical/FlipHorizontal    Interleave       LoadPlugin      Loop
Lut                            Lut2             MaskedMerge     Merge
ModifyFrame                    PEMVerifier      PlaneAverage    PlaneDifference
PropToClip                     Resize           Reverse         SelectClip
SelectEvery                    SeparateFields   ShufflePlanes   Splice
StackVertical/StackHorizontal  Transpose        Trim            Turn180
Table 4.1: An overview of the core filters available in Vapoursynth.
Results were tabulated over three separate runs, where each run executed the
Invert filter 1, 16, or 32 times. Additionally, all runs were executed on both a
Fermi and Kepler NVIDIA GPU. The details behind these cards are provided in
Chapter 5.
After successfully implementing a working Invert filter, work began on porting
the standard filter library of Vapoursynth. At the time of this writing, the
standard library consists of the filters detailed in Table 4.1.
Among the standard set of filters, there are a few that actually have no effect
on frame data, and thus require no extra work to provide a GPU-compatible ver-
sion. Several other filters (such as AddBorders, BlankClip, and SeparateFields)
only perform basic memory set or copy operations, and thus provide little ground
for GPGPU research.
It is for this reason that a subset of the standard filter library was selected
for implementation. This subset provides a variety of video processing method-
ologies, which help to demonstrate the strengths and weaknesses of GPU-based
video processing, as well as emphasize differences between CUDA architecture
generations.
The following sections break down each ported filter and explain its strengths
and weaknesses, along with a comparison between its CPU and GPU algorithms.
4.1 Lut
The Lut, or look-up table, filter is one of the simplest filters in the ported set.
The look-up table is precomputed, containing $2^B$ entries, where $B$ is the number
of bits per sample in the source frame. The filter operates by performing a simple
look-up using the value of a source pixel as an index. The value contained in the
look-up table for the given source pixel is then written back to the destination
frame.
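
The look-up procedure can be sketched in Python as follows. The helper names build_lut and apply_lut are hypothetical and are not Vapoursynth API; the mapping function passed to build_lut is whatever per-value transform the user wants precomputed.

```python
def build_lut(bits: int, mapping) -> list:
    """Precompute the table: one entry for each of the 2**bits sample values."""
    return [mapping(value) for value in range(1 << bits)]


def apply_lut(plane: list, lut: list) -> list:
    """Replace each source pixel with the table entry it indexes."""
    # Per pixel: read the source value, read the table entry, write the result.
    return [lut[pixel] for pixel in plane]
```

For example, build_lut(8, lambda v: 255 - v) yields a 256-entry table that reproduces the Invert filter through look-ups alone.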
This boils down to two global memory reads and one global memory write per
pixel. Since there is essentially no arithmetic computation in the Lut filter,
it is entirely memory bound. An obvious optimization would be to load the
look-up table into a different GPU memory space, such as constant or shared
memory. Unfortunately, because the size of the look-up table is not known until
runtime, both options are problematic.
Loading the look-up table into shared memory is essentially impossible given
that only 48KB of shared memory are available per block in CUDA. If an input
clip contains 16 bits per sample, the look-up table requires 2^16 entries of
two bytes each, equaling 131,072 bytes or 128KB, which is well over shared
memory’s size limit.
Constant memory may be a possible alternative, but speed increases may
be limited due to constant memory’s broadcast architecture. If all threads in a
warp access the same memory location, then the memory accesses coalesce and
the resulting look-up is broadcast to the entire warp. However, any requests with
differing addresses are serialized, therefore minimizing the advantages of constant
memory.
Additionally, constant memory must be sized at compile time, not execution
time. Since a look-up table can be a variety of sizes, and the 16-bit table
described above is well over constant memory’s 64KB size limit, constant
memory is not a valid containment location.
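The size argument above can be checked with a few lines of arithmetic. The sketch below assumes one byte per entry for samples of up to 8 bits and two bytes per entry otherwise (consistent with the 131,072 byte figure quoted for 16-bit input). The helper names are hypothetical, and the limits are the 48KB and 64KB values stated in the text.

```python
SHARED_MEMORY_LIMIT = 48 * 1024    # bytes per block, as stated in the text
CONSTANT_MEMORY_LIMIT = 64 * 1024  # bytes, as stated in the text


def lut_size_bytes(bits_per_sample):
    """Table size: 2**B entries, each wide enough to hold one B-bit sample."""
    bytes_per_entry = (bits_per_sample + 7) // 8  # assumed storage width
    return (2 ** bits_per_sample) * bytes_per_entry


def fits(bits_per_sample, limit):
    """True if the whole table fits within the given memory limit."""
    return lut_size_bytes(bits_per_sample) <= limit


for bits in (8, 16):
    print(bits, lut_size_bytes(bits),
          fits(bits, SHARED_MEMORY_LIMIT), fits(bits, CONSTANT_MEMORY_LIMIT))
```

Under these assumptions the 16-bit table exceeds both limits, while an 8-bit table is small; the difficulty is that a single kernel must handle whichever size arrives at runtime.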
4.2 Merge
The Merge filter takes in two clips and produces a blended combination of the
two. A configurable floating point bias can be passed in to indicate a preference
towards one clip or the other, with the default value being 0.5, resulting in equal
parts being merged from the two input clips.
With two input clips and one output clip, the Merge filter, like Lut, requires
two global memory reads and one global memory write per pixel. Unlike Lut, it
also requires an arithmetic operation to produce each output pixel. Granted,
it is a relatively simple arithmetic operation, so Merge is still primarily
memory bound, but it begins to tip the scales towards a computationally bound
filter.
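A minimal reference for the blend might look as follows. The exact formula, rounding, and clamping used by Vapoursynth’s Merge are not given in this chapter, so the weighted sum and round-to-nearest policy here are assumptions, as is the name merge_planes.

```python
def merge_planes(plane_a, plane_b, weight=0.5, bits_per_sample=8):
    """Blend two equally sized planes: weight 0.0 keeps plane_a, 1.0 keeps plane_b."""
    max_value = (1 << bits_per_sample) - 1
    out = []
    for row_a, row_b in zip(plane_a, plane_b):
        out_row = []
        for a, b in zip(row_a, row_b):
            blended = a * (1.0 - weight) + b * weight
            # Round to nearest and clamp to the sample range (assumed policy).
            out_row.append(min(max_value, max(0, int(blended + 0.5))))
        out.append(out_row)
    return out
```

Each output sample needs one read from each input plane, one multiply-add, and one write, which is why the filter remains close to memory bound.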
4.3 Transpose
Vapoursynth’s Transpose filter performs a standard matrix transpose oper-
ation on an input frame. There is no arithmetic operation necessary to render
a pixel value, simply a global memory read and a global memory write. How-
ever, Transpose differs from other filters such as Merge and Lut in the fact that
it requires the use of shared memory for optimal GPU performance. Transpose
also requires particular algorithmic adjustments to compensate for the fact that
each GPU thread operates on four pixels at a time while still providing coalesced
memory accesses for both reads and writes to global memory.
In order to achieve optimal memory performance, particularly on memory
writes, a large block of shared memory is allocated to store intermediate pixel
values. Each 4 byte chunk of pixels is read in from the source frame by a GPU
thread. With all threads working together, memory accesses coalesce into a
single unit as described in section 3.2. This works well for read in, but requires
an adaptation for write out. The reason for this is that a single thread attempting
to write a row of four pixels to a column of four pixels (as required by definition of
a transpose) causes significant memory thrashing for each additional pixel. This
stems from the fact that each pixel in a column is separated by the frame’s unit
stride in memory. Thus, writing four pixels in a column requires four separate
memory transactions.
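The cost of the column write can be made concrete by counting how many memory segments a group of addresses touches. This is a simplified model rather than a description of the hardware: the 128-byte segment size, the 8-bit samples, and the stride of 1920 bytes (equal to the frame width, with no padding) are all assumptions chosen for illustration.

```python
SEGMENT_BYTES = 128  # assumed size of one memory transaction


def segments_touched(addresses):
    """Number of distinct segments, and thus transactions, in this model."""
    return len({address // SEGMENT_BYTES for address in addresses})


def row_addresses(base, count):
    """Byte addresses of `count` adjacent 8-bit pixels in one row."""
    return [base + i for i in range(count)]


def column_addresses(base, stride, count):
    """Byte addresses of `count` vertically adjacent pixels; stride is in bytes."""
    return [base + i * stride for i in range(count)]


row_cost = segments_touched(row_addresses(0, 4))
column_cost = segments_touched(column_addresses(0, 1920, 4))
```

In this model the four pixels of a row fall into a single segment, while the four pixels of a column fall into four different segments.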
This performance problem can be surmounted using shared memory and a
different thread organization model for write back. In short, the new algorithm
has a thread read in a row of pixels and write back a row of pixels. This contrasts
to the standard approach, in which a single thread reads in a row of pixels and
writes back a column of pixels.
In order to achieve coalesced memory accesses for both read and write, use of
CUDA’s synchronization function, __syncthreads(), is required. __syncthreads()
is CUDA’s native, block-level barrier: no thread in a block proceeds past the
call until every thread in that block has reached it. The use of
__syncthreads() and shared memory ensures that all threads have loaded the
necessary data into shared memory before proceeding to the write back phase.
This prevents threads from possibly writing back uninitialized values to
global memory.
Once all threads have reached the synchronization point, they proceed to write
back values to the same relative global memory addresses they read from. The
difference is the manner in which shared memory is accessed. On read, shared
memory is accessed in the same x and y coordinates as the thread’s relative
position in the thread block. On write back, the x and y coordinates are flipped,
and due to shared memory’s high performance architecture, no performance hit
occurs during the column-wise memory access.
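The two-phase scheme described in this section can be simulated on the CPU. The sketch below is a sequential Python model, not the thesis kernel: it handles one sample per simulated thread rather than four, the tile size of 4 is arbitrary, and the nested list named tile merely stands in for shared memory.

```python
TILE = 4  # tile edge length; a hypothetical value that keeps the example small


def transpose_tiled(src, height, width):
    """Transpose a height x width plane one TILE x TILE tile at a time."""
    dst = [[0] * height for _ in range(width)]
    for tile_y in range(0, height, TILE):
        for tile_x in range(0, width, TILE):
            # Phase 1: "thread" (ty, tx) copies one sample into the tile,
            # walking the source row by row (coalesced reads on a GPU).
            tile = [[0] * TILE for _ in range(TILE)]
            for ty in range(TILE):
                for tx in range(TILE):
                    y, x = tile_y + ty, tile_x + tx
                    if y < height and x < width:
                        tile[ty][tx] = src[y][x]
            # A kernel would call __syncthreads() here, between the phases.
            # Phase 2: the destination is also written row by row; the tile
            # is read with swapped coordinates, so the column-wise access
            # lands on the tile instead of on global memory.
            for ty in range(TILE):
                for tx in range(TILE):
                    y, x = tile_x + ty, tile_y + tx  # destination row, column
                    if y < width and x < height:
                        dst[y][x] = tile[tx][ty]
    return dst
```

Both phases walk global memory along rows; only the tile is ever read along a column.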
4.4 Expr
Where Lut, Merge, and Transpose tend to be memory-bound filters, Expr has
the potential to be almost entirely compute-bound due to its purpose as a reverse
Polish notation evaluator. Expr operates by accepting a reverse Polish notation
string, which contains references to source pixels as well as standard mathematical
operations such as addition, subtraction, square root, log, and many more [11].
Additionally, Expr supports basic ternary operators and “greater than” or “less
than” operators for more complex processing.
The CPU version of Expr was written by Fredrik Melbin [11], and includes
a fully optimized SSE2 assembly implementation. Expr works with an internal
stack for operand processing, in addition to a second stack for operation tracking
and execution. The operation stack is evaluated instruction by instruction and
corresponding results are stored on the internal operand stack.
Expr is a rework of Avisynth’s MaskTools, specifically mt_lut, mt_lutxy, and
mt_lutxyz. It is for this reason that Expr supports expressions utilizing up to
three separate input clips, which allows for highly specialized filtering techniques
such as edge isolation, sharpening, and much more.
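To make the evaluation model concrete, the following is a small reverse Polish notation evaluator for a single pixel. It is a sketch of the general technique and not Melbin’s implementation: the token names x, y, and z for the three clips, the operator set, and the rule that a condition greater than zero counts as true are assumptions made for this example.

```python
import math

BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    ">": lambda a, b: 1.0 if a > b else 0.0,
    "<": lambda a, b: 1.0 if a < b else 0.0,
}
UNARY_OPS = {"sqrt": math.sqrt, "log": math.log}


def eval_rpn(expression, x, y=0.0, z=0.0):
    """Evaluate a reverse Polish expression for one pixel from up to three clips."""
    pixels = {"x": x, "y": y, "z": z}
    stack = []  # operand stack of floating point values
    for token in expression.split():
        if token in pixels:
            stack.append(float(pixels[token]))
        elif token in BINARY_OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[token](left, right))
        elif token in UNARY_OPS:
            stack.append(UNARY_OPS[token](stack.pop()))
        elif token == "?":  # ternary: condition if_true if_false ?
            if_false = stack.pop()
            if_true = stack.pop()
            condition = stack.pop()
            stack.append(if_true if condition > 0 else if_false)
        else:
            stack.append(float(token))  # numeric constant
    return stack.pop()
```

For example, eval_rpn("x y + 2 /", 10, 20) averages two pixels, and eval_rpn("x 128 > 255 0 ?", 200) applies a threshold. The instruction sequence is identical for every pixel; only the operand stack contents differ.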
The GPU version of Expr utilizes several important components of the CUDA
architecture and demonstrates a significant performance boost over its CPU
version, even when factoring in the frame transfer overhead to and from the GPU.
Expr’s optimization techniques are described in detail in the following
sections.
4.4.1 L1 Cache vs Shared Memory
Expr, like Lut, Merge, and Transpose, operates on a block of four pixels
per GPU thread, which helps to meet coalescing requirements. However, Expr
requires a special operand stack per thread, where operation results are stored
for future processing. This operand stack is represented as an array of floating
point numbers. Unfortunately, due to thread-level register allocation in CUDA
(which is governed by the NVCC compiler at compile time), this stack cannot be
safely stored in ultra-fast registers, and instead spills over to what CUDA calls
local memory.
Local memory is actually the exact same thing as global memory in CUDA,
only accesses are restricted per thread. Local memory is used whenever local
variables in a CUDA kernel cannot be stored in registers. Luckily, CUDA caches
local memory accesses via the L1 cache, which helps to significantly reduce look-
up latency and encourages a high performance kernel. CUDA’s L1 cache shares its
precious resources with shared memory, with the default configuration providing
48KB of space to shared memory and a mere 16KB to the L1 cache.
The current implementation of Expr does not require any shared memory, but
it does make heavy use of the L1 cache due to its operand stack spillage. In order
to combat the performance hit of this stack spillage, CUDA offers the ability to
change the cache configuration at runtime via the cudaFuncSetCacheConfig()
function. Using this function, it is possible to reverse the default bias towards
shared memory and instead provide the L1 cache with 48KB and shared memory
with a mere 16KB. This increased cache space results in a measurable speed up
in kernel execution, as most, if not all, local memory accesses can be cached in
L1, preventing the significantly slower accesses to global memory.
4.4.2 Constant Memory
Expr operates by pulling from a set of operation instructions, which do not
change during the life of the kernel and are required by all threads in a warp to
operate. It is these properties that make Expr’s operation instructions a prime
target for use in constant memory.
Constant memory provides two distinct advantages to Expr’s kernel execution
which result in a marked performance speedup. The first is the architecture of
constant memory, which is essentially a specialized, read-only data cache that
operates at a much lower latency than global memory.
The second advantage that constant memory provides is its unique broadcast
capabilities. If the same memory address is requested from constant memory by
all threads in a warp, only one look-up occurs with the resulting value broadcast
to all threads of the warp. This saves on memory bandwidth and kernel wait
time, allowing Expr to continue performing arithmetic operations as quickly as
possible.
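The difference between Expr’s instruction fetches and Lut’s table look-ups can be illustrated with a simplified cost model in which one warp-wide constant memory request costs one fetch per distinct address. The model, the function name, and the sample addresses are assumptions for illustration and are not measurements.

```python
WARP_SIZE = 32


def constant_fetches_per_warp(addresses):
    """Serialized fetches for one warp-wide request in this simplified model."""
    if len(addresses) != WARP_SIZE:
        raise ValueError("expected one address per thread in the warp")
    return len(set(addresses))


# Expr: every thread reads the same instruction slot at the same step.
expr_addresses = [7] * WARP_SIZE
# Lut: each thread indexes the table with its own pixel value (made-up values).
lut_addresses = [(13 * lane) % 256 for lane in range(WARP_SIZE)]

expr_cost = constant_fetches_per_warp(expr_addresses)
lut_cost = constant_fetches_per_warp(lut_addresses)
```

In this model the Expr request is served by a single broadcast, while the Lut request degrades to one fetch per thread.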
These optimizations reduce Expr’s dependency on memory operations, en-
abling it to instead focus on pure arithmetic throughput, for which CUDA was
specifically designed. The effects of these optimizations, along with the effect
that different CUDA architectures have on arithmetic throughput, are detailed
in Chapter 5.
4.5 CUDA Streams
To further encourage processing parallelism, CUDA provides a framework for
asynchronous kernel execution with respect to the host. This allows the CPU to
schedule kernel executions and then continue on with its host-side work. Almost
all functions in the standard CUDA library have an asynchronous implementa-
tion, and often only require one additional parameter. This additional parameter
is the CUDA Stream identifier.
Streams are CUDA’s way of organizing simultaneous memory transfers or
kernel executions. The number of kernels that can execute at the same time
depends on the Compute Capability of the GPU. At the time of this writing,
devices of CC 2.0 and higher support concurrent kernel execution across
streams, with CC 2.0+ supporting up to 16 concurrently executing kernels (one
per stream) and the new CC 3.5 supporting up to 32 [20].
CUDA stream support was implemented in the GPU port of Vapoursynth
within the first few versions of the core framework. Early versions assigned
streams per frame via the TransferFrame filter, using a frame-specific property.
TransferFrame simply stored a stream index value in the frame property which
could then be used to retrieve a stream reference from a central pool. This
central pool is allocated once at script startup, with all streams recycled every 16
or 32 frames, according to device Compute Capability. Every GPU-enabled filter
retrieved the stream index from the incoming frame’s properties. The stream
index would then be used to retrieve a stream reference from the central pool,
allowing for fully asynchronous kernel launches within the current filter context.
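The early per-frame scheme can be sketched as a fixed pool indexed through a frame property. This is a plain Python model with no CUDA calls: the strings stand in for stream handles, the property key stream_index is hypothetical, and assigning indices by frame number modulo pool size is an assumed reading of "recycled every 16 or 32 frames".

```python
class StreamPool:
    """Fixed-size pool whose entries stand in for CUDA stream handles."""

    def __init__(self, size):
        self.streams = ["stream-%d" % i for i in range(size)]

    def index_for_frame(self, frame_number):
        # Streams are recycled once every `size` frames (assumed round-robin).
        return frame_number % len(self.streams)

    def get(self, stream_index):
        return self.streams[stream_index]


def transfer_frame(frame_props, frame_number, pool):
    """Tag the frame with a stream index, as TransferFrame is described doing."""
    frame_props["stream_index"] = pool.index_for_frame(frame_number)
    return frame_props


def gpu_filter_stream(frame_props, pool):
    """A GPU filter looks up its stream from the incoming frame's properties."""
    return pool.get(frame_props["stream_index"])
```

With a pool of 16, frames 0 and 16 share a stream, so at most 16 frames are in flight on distinct streams at any time.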
Later versions kept the stream pool but instead assigned streams on a per
plane basis. This removed the need for a frame property, and moved a lot of the
stream handling into the core, away from the filter developer. Additionally, while
earlier versions used synchronous memory copies between the host and device, this
new version allows for completely asynchronous memory copies per plane. The
Kepler architecture is especially adept at utilizing streamed kernels and memory
transfers due to its new Hyper-Q scheduler. Hyper-Q was introduced with the new
GK110 Kepler architecture [19] and allows for 32 simultaneous hardware-managed
connections (compared to the single connection available in Fermi). Each stream
is handled by its own hardware work queue and inter-stream dependencies are
optimized, with operations in one stream no longer blocking other streams. The
effects of Hyper-Q are detailed in Chapter 5.
4.6 Validation
While high performance kernels are an important aspect of this thesis, high
performance is only useful if the output is programmatically correct. In order to
ensure correctness, a suite of Python unit tests was created. Each unit test evaluates
a filter, with several filters having multiple unit tests for a variety of parameter
configurations. A unit test operates by running a generated input clip under
both the CPU and GPU implementations of a target filter. Next, every plane of
every frame is compared in the resulting output clips. If the absolute difference
between the two versions of any plane is greater than zero, the test fails.
All currently implemented filters possess a corresponding unit test, and all tests
pass with 100% bit identical output between the CPU and GPU filter algorithms.
This is extremely useful for framework development and filter refinement, as all
output changes are immediately testable.
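The comparison at the heart of each unit test can be sketched without any Vapoursynth dependency. Here a clip is modeled as a list of frames, a frame as a list of planes, and a plane as a list of rows. These structures and the function names are assumptions for illustration and do not reflect the actual test suite.

```python
def max_abs_difference(plane_a, plane_b):
    """Largest absolute per-sample difference between two equally sized planes."""
    return max(
        (abs(a - b)
         for row_a, row_b in zip(plane_a, plane_b)
         for a, b in zip(row_a, row_b)),
        default=0,
    )


def clips_identical(cpu_clip, gpu_clip):
    """True only if every plane of every frame matches sample for sample."""
    if len(cpu_clip) != len(gpu_clip):
        return False
    for cpu_frame, gpu_frame in zip(cpu_clip, gpu_clip):
        if len(cpu_frame) != len(gpu_frame):
            return False
        for cpu_plane, gpu_plane in zip(cpu_frame, gpu_frame):
            if max_abs_difference(cpu_plane, gpu_plane) > 0:
                return False
    return True
```

A test would then assert clips_identical(cpu_output, gpu_output) for each filter and parameter configuration.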
Chapter 5
Results
Results were collected on two separate CUDA architectures, Fermi and Ke-
pler. The Fermi card is a GTX 560 Ti by MSI [3], with 1024 MB of GDDR5 RAM
and a 256-bit wide memory bus. The Kepler card is a K20Xm [16], with 6144 MB
of RAM and a 384-bit wide memory bus, with error-correcting code (ECC) mem-
ory enabled. All CPU tests were conducted on an Intel Core i7 3770k, running
at 3.5 GHz with a 3.9 GHz Turbo Boost, with 16 GB of DDR3 RAM.
All tests were run using a custom Python framework, with all filters run in a
serial fashion using increasing thread counts, iteration counts, and CPU or GPU
processing. Results are then written back to a CSV file and tabulated in the
following graphs.
Each filter is run with 1, 2, 4, and 8 CPU threads. This provides a direct
comparison between GPU code and several forms of parallelized CPU code. Addi-
tionally, all filters were run with 1, 16, and 32 iterations. A larger iteration count
reduces the overhead of CPU to GPU transfers over the PCI Express bus, demon-
strating the performance benefits of GPU code under a computationally heavy
workload. All iterations utilized a 1920 x 1080 resolution clip consisting of 1000
frames. The clip was generated in the CPU host memory using Vapoursynth’s
BlankClip() function.
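The reason a larger iteration count reduces the relative transfer overhead can be seen with a simple serial cost model: each frame pays for one upload and download pair plus one kernel run per iteration. The model and the function name are assumptions, and the timings in the loop are placeholders, not measurements from this thesis.

```python
def transfer_share(transfer_ms, kernel_ms, iterations):
    """Fraction of per-frame time spent on PCI Express transfers.

    Serial model: one upload/download pair plus `iterations` kernel runs.
    """
    total_ms = transfer_ms + iterations * kernel_ms
    return transfer_ms / total_ms


# Placeholder timings (4.0 ms transfer, 0.5 ms kernel), NOT measured values.
for iterations in (1, 16, 32):
    print(iterations, round(transfer_share(4.0, 0.5, iterations), 3))
```

Whatever the real timings are, the transfer share in this model falls monotonically as the iteration count grows, which is the effect the 16 and 32 iteration runs are designed to expose.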
GPU results are reported using only one CPU thread, as performance de-
grades with each additional CPU thread when operating on a purely GPU-
enabled filter pipeline. Additional research needs to be conducted as to why
this performance degradation actually occurs. Nevertheless, the greatest perfor-
mance for GPU code occurs when the least amount of CPU resources are used,
resulting in very favorable results from a system overhead standpoint. With the
filter chain using the least amount of CPU cycles possible, the rest of the proces-
sor can focus on alternative tasks, such as H.264 compression or high resolution
video playback.
5.1 Lut Results
As detailed in Chapter 4, Lut is a memory-bound kernel, meaning that its
performance is on par with that of a memory copy operation, with a few caveats.
Lut’s GPU performance, as shown in Figure 5.1, is lower than that of every
CPU run for 1 iteration. This is due to the
simple fact that all data must be shipped to the GPU before it can be operated
on, as well as shipped back for output. This performance limitation is reduced
when multiple iterations occur, as seen in the 16 and 32 iteration runs.
Lut’s 16 and 32-iteration performance numbers indicate a more favorable
performance profile, as they consistently beat out the 1 and 2 CPU thread
implementations. The GPU version falls a bit below the numbers achieved by the
4 and 8 CPU thread runs, although only by a small margin. It is important to
remember that the GPU version of Lut is using only a quarter or an eighth of the
CPU resources of those latter CPU runs, while still achieving a comparable result.
[Figure 5.1 (bar chart). Title: LUT. X-axis: cpu 1 thread, cpu 2 thread, cpu 4 thread, cpu 8 thread, gpu-fermi, gpu-kepler. Y-axis: Execution Speed (Frames Per Second), logarithmic, 1 to 10000. Series: 1, 16, and 32 iterations.]
Figure 5.1: Lut execution speed in frames per second. Higher is better.
5.2 Merge Results
Merge shares similar results with Lut, as it too is a memory bound filter.
As detailed in Chapter 4, Merge does require some basic arithmetic in order to
produce its output frame. These extra arithmetic instructions result in a slightly
reduced throughput when compared to Lut, as more time is spent performing
actual computation instead of a pure memory manipulation operation.
Merge’s single threaded performance for one iteration is almost twice as slow
for the GPU algorithm when compared to the CPU algorithm (see Figure 5.2).
This performance gap only increases for every additional thread applied to the
CPU implementation, which is to be expected. However, when multiple iterations
are introduced, the GPU version of Merge gains some ground back by beating out
the one and two thread CPU implementations. It can’t quite beat the four and
eight thread CPU implementations, but the delta is significantly smaller than in the
single iteration execution. Again, it is important to keep in mind that the GPU
version of Merge only uses one CPU thread, so for all iterations the GPU version
of Merge achieves greater or comparable performance using only a fraction of the
CPU.
[Figure 5.2 (bar chart). Title: Merge. X-axis: cpu 1 thread, cpu 2 thread, cpu 4 thread, cpu 8 thread, gpu-fermi, gpu-kepler. Y-axis: Execution Speed (Frames Per Second), logarithmic, 1 to 10000. Series: 1, 16, and 32 iterations.]
Figure 5.2: Merge execution speed in frames per second. Higher is better.
5.3 Transpose Results
Transpose is an interesting filter and compares well against both Lut and
Merge as a memory bound filter. As shown in Figure 5.3, Transpose demonstrates a level
of performance between that of Lut and Merge. This is likely due to Transpose
having a much greater instruction count than Lut, while also not requiring any
additional arithmetic to render an output pixel, as in the case of Merge. Yet
again, as with all memory bound filters, the GPU version of Transpose trails
behind the CPU version for all single iteration tests. Since the actual kernel
execution time on the GPU is so small, Transpose’s performance is governed
almost entirely by the transfer speed between the CPU and the GPU along the
PCI Express bus.
[Figure 5.3 (bar chart). Title: Transpose. X-axis: cpu 1 thread, cpu 2 thread, cpu 4 thread, cpu 8 thread, gpu-fermi, gpu-kepler. Y-axis: Execution Speed (Frames Per Second), logarithmic, 1 to 10000. Series: 1, 16, and 32 iterations.]
Figure 5.3: Transpose execution speed in frames per second. Higher is better.
Interestingly, Transpose demonstrates a performance even greater than Lut
when operating with 16 or 32 iterations. Even with Transpose’s increased
instruction count, it only requires one global memory read and one global memory write
per pixel, whereas Lut requires two global memory reads for every global memory
write. With heavier iteration counts Transpose becomes a high throughput filter
due to its simple memory reordering. Lut and Merge require extra memory reads
or arithmetic operations in order to render an output pixel and thus suffer some
delay between pixel read and pixel write.
5.4 Expr Results
Expr differs from the three previous filters in that it is a
compute-bound filter, meaning that most of its time is spent performing arithmetic
calculations instead of memory lookups. It is for this reason that the GPU
version of Expr demonstrates a significant performance increase over the original
CPU version, even though the latter uses an SSE2 optimized algorithm.
This performance increase stems from CUDA’s throughput oriented cores,
which excel at performing raw numerical computation. CUDA cores are generally
much simpler than standard x86 cores, as x86 processors are optimized
for sequential code performance and use a variety of more complex architectural
features such as increased cache sizes, complex branch prediction, and an
emphasis on instruction pipelining. CUDA cores instead use smaller cache
sizes, simple memory models, and a design emphasis on massively parallel
applications such as graphics or geometry calculations. It is for this reason that CUDA
cores excel at raw number crunching when compared with an x86 processor.
Figure 5.4 details Expr’s CUDA performance against a multithreaded, CPU
SSE2-optimized algorithm. It is quite clear that the CUDA version of Expr offers
a significant performance increase over the CPU version, with the Kepler’s 32 it-
eration (Figure 5.7) run even beating out several 16 iteration (Figure 5.6) runs
on the CPU in pure frames per second. In fact, when comparing single iterations
(Figure 5.5), the CUDA version of Expr demonstrates a 108% performance im-
provement over the SSE2-optimized CPU implementation when running on the
Fermi architecture, and a 68% performance improvement when running on the
Kepler architecture.
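For readers who prefer speedup factors to percentages, the conversion is straightforward. The helper names below are hypothetical, and only the two percentages quoted in the text are taken from the thesis; the underlying frame rates are not restated here.

```python
def improvement_percent(baseline_fps, new_fps):
    """Relative improvement of new_fps over baseline_fps, in percent."""
    return (new_fps / baseline_fps - 1.0) * 100.0


def speedup_factor(percent_improvement):
    """Convert a percentage improvement into a multiplicative speedup."""
    return 1.0 + percent_improvement / 100.0


fermi_speedup = speedup_factor(108.0)   # percentage quoted for Fermi
kepler_speedup = speedup_factor(68.0)   # percentage quoted for Kepler
```

A 108% improvement therefore corresponds to a little more than double the baseline frame rate, and a 68% improvement to roughly 1.7 times the baseline.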
Expr’s performance is largely governed by it being a compute-bound filter.
CUDA excels at performing high speed arithmetic operations, and whereas Lut,
Merge, and Transpose are memory bound and have trouble competing against
the CPU, Expr excels over the CPU in all cases. This is further improved by
enhancements to the Kepler architecture over the Fermi architecture for 16 and
32 iteration runs.
[Figure 5.4 (bar chart). Title: Expr. X-axis: cpu 1 thread, cpu 2 thread, cpu 4 thread, cpu 8 thread, gpu-fermi, gpu-kepler. Y-axis: Execution Speed (Frames Per Second), logarithmic, 1 to 1000. Series: 1, 16, and 32 iterations.]
Figure 5.4: Expr execution speed in frames per second. Higher is better.
[Figure 5.5 (bar chart). Title: Expr - 1 Iteration. Same x-axis categories. Y-axis: Execution Speed (Frames Per Second), 0 to 180.]
Figure 5.5: Expr execution speed in frames per second for 1 filter iteration.
[Figure 5.6 (bar chart). Title: Expr - 16 Iterations. Same x-axis categories. Y-axis: Execution Speed (Frames Per Second), 0 to 35.]
Figure 5.6: Expr execution speed in frames per second for 16 filter iterations.
[Figure 5.7 (bar chart). Title: Expr - 32 Iterations. Same x-axis categories. Y-axis: Execution Speed (Frames Per Second), 0 to 18.]
Figure 5.7: Expr execution speed in frames per second for 32 filter iterations.
As the next generation of CUDA, Kepler significantly increases raw compute
performance over the older Fermi architecture. With a new generation of the
Streaming Multiprocessor and a much larger transistor count, Kepler offers the
potential for a vast performance increase over all prior architectures. These additional
Streaming Multiprocessors require a significant workload in order to produce measurable
performance gains. Due to Expr’s heavy emphasis on arithmetic computation, it
is a prime candidate for taking advantage of all of Kepler’s new CUDA cores,
much more so than the Lut, Merge, or Transpose filters.
For Expr, single iteration runs are almost too short to properly harness
Kepler’s added CUDA cores, but the 16 and 32 iteration runs offer enough workload
to demonstrate a large performance boost over both the Fermi and CPU
architectures. It would be very interesting to see performance numbers from NVIDIA’s
new GTX Titan or GTX 780 GPUs, both of which run off the latest GK110 Kepler
chips. The only significant differences are the number of Streaming Multiprocessors
available for CUDA computation, the amount of GDDR5 memory, and the
use of non-ECC RAM.
5.5 CUDA Stream Results
The use of streams in a CUDA application allows for fully asynchronous com-
munication between the host CPU and device GPU. Instead of kernel calls block-
ing the execution of a CPU thread, stream-enabled kernels return instantly and
allow the CPU thread to continue with other work. CUDA streams are man-
aged by dedicated hardware on devices with Compute Capability 2.0 and above.
Kepler has a dedicated hardware management feature referred to as Hyper-Q
[19], which allows for each stream to be processed within its own hardware work
queue, enabling complete computational independence between streams. Fermi
offers only one hardware queue for all streams, which can introduce false
serialization in a streamed application.
[Figure 5.8 (horizontal bar chart). Title: Fermi Stream Comparison - 1 Iteration. Categories: Expr, LUT, Merge, and Transpose, each with 1, 2, 4, and 8 threads. Axis: Execution Speed (Frames Per Second), 0 to 450. Series: Streamed Memcpy / Kernels, No Streams, Streamed Kernels.]
Figure 5.8: CUDA Streams results for one filter iteration on the Fermi
architecture. Greater is better.
That being said, the current implementation of streams in Vapoursynth pro-
duces interesting results with a multithreaded CPU pipeline paired with a 16
(Fermi) or 32 (Kepler) streamed GPU pipeline. As stated in Chapter 4, all filter
kernels are executed using CUDA streams, allowing their execution and memory
copies to be completely asynchronous.
Figure 5.8 illustrates an important performance delta that streams offer during
multithreaded CPU execution. More specifically, stream-enabled filter kernels
demonstrate a distinct performance boost with four CPU threads when compared
against stream-disabled filter kernels. The Expr, Lut, and Merge kernels display
this performance delta markedly, with the Transpose kernel showing little
difference between streamed and non-streamed execution. Asynchronous memory
transfers appear to make the most significant performance difference, with
a fully asynchronous pipeline (memory copies and kernel executions) achieving
the greatest performance. It is important to note that streams have the greatest
effect on execution times when running with multiple CPU threads. The 2, 4,
and 8 thread executions display a discernible gap between the streamed runs and
their non-streamed counterparts.
It is not quite clear why Expr and Lut display such a significant performance
peak when executing with only streamed kernels and non-streamed memory
copies while using multiple CPU threads. This is an area of interest for future
research.
While the Fermi architecture shows some improvement through the use of
streamed kernels and memory copies, Kepler’s new Hyper-Q technology produces
a significant performance improvement. Figure 5.9 presents the performance of
non-streamed kernels, only streamed kernels, and streamed kernels with streamed
memory copies. The latter approach (which was made possible by assigning
streams on a per-plane basis and enabling fully asynchronous transfers) leads to
a discernible boost in execution speed.
[Figure 5.9 (horizontal bar chart). Title: Kepler Stream Comparison - 1 Iteration. Categories: Expr, LUT, Merge, and Transpose, each with 1, 2, 4, and 8 threads. Axis: Execution Speed (Frames Per Second), 0 to 160. Series: Streamed Memcpy / Kernels, No Streams, Streamed Kernels.]
Figure 5.9: CUDA Stream results for one filter iteration on the Kepler
architecture. Greater is better.
Similar to the Fermi execution, the greatest benefits of a fully streamed
pipeline are seen when utilizing multiple CPU threads. This is likely due to the
fact that CUDA allows for simultaneous memory copies, provided they operate
in opposite directions. With a single CPU thread, bidirectional memory copies
are impossible given that only one frame is pushed through the filter pipeline
at a time. With multiple CPU threads, the potential increases significantly due
to handling multiple frames simultaneously. These bidirectional, simultaneous
memory copies are directly observable through the NVIDIA Visual Profiler, which
reports a variety of important kernel execution metrics.
[Figure 5.10 (bar chart). Title: Complex Script Execution. Y-axis: Execution Speed (Frames Per Second), 0 to 35. Data labels, assuming they follow legend order: Xeon 5.064, Xeon 4 Threads 11.746, i7 3770k 7.069, i7 3770k 4 Threads 17.723, Fermi 30.1, Kepler 25.7806.]
Figure 5.10: The results of a complex script run on an Intel Xeon E5-2650,
i7 3770k, GTX 560 Ti, and Kepler K20Xm GPU.
5.6 Complex Script Results
In order to provide a more “real world” test for the extended CUDA pipeline,
a script was devised that utilizes multiple GPU-capable filters. This script makes
multiple calls to the Expr, Merge, and Transpose filters in order to instantiate a
computationally complex filter chain. This mix of filters and additional complex-
ity provides an excellent performance test for CPU and GPU filtering platforms.
The script was run on two different CPUs and two different GPUs. The Xeon
processor is an Intel Xeon E5-2650 running at 2.0 GHz with a 2.8 GHz Turbo Boost.
The i7 3770k is running at 3.5 GHz with a 3.9 GHz Turbo Boost, and the Fermi and
Kepler cards are detailed above. Figure 5.10 illustrates the execution speed of
each run, with the CPUs utilizing one or four CPU threads. All GPU executions
were run with one CPU thread. The script differs from previous performance tests
by using a 4k resolution input clip, whereas earlier tests were limited to 2k, or
1920 x 1080 pixels. This was done to demonstrate the performance improvement
of a GPU pipeline as video resolutions continue to increase.
The complex script is a very important performance metric, as it illustrates
the raw power of a pure GPU filter pipeline through its ability to transfer data
to the GPU and keep it there. None of the CPU runs are able to produce greater
than 18 frames per second, which is below the standard 24 frames per second of
realtime video. On the other hand, all GPU executions are able to filter frames at
greater than realtime, with roughly 30 frames per second for Fermi and 25 frames
per second for Kepler. In short, a GPU enabled pipeline is able to filter video
at a framerate greater than realtime using only one CPU thread, while a CPU
pipeline is hard pressed to meet 18 frames per second using four CPU threads,
which is well below the realtime framerate threshold.
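The realtime comparison can be reproduced from the data labels of Figure 5.10. The pairing of values to configurations below assumes the labels appear in legend order, which is consistent with the roughly 30 and 25 frames per second quoted for Fermi and Kepler; the dictionary keys and the helper name are chosen for this example.

```python
REALTIME_FPS = 24.0  # realtime threshold used in the text

# Values read from the Figure 5.10 data labels (legend order assumed).
complex_script_fps = {
    "Xeon": 5.064,
    "Xeon 4 Threads": 11.746,
    "i7 3770k": 7.069,
    "i7 3770k 4 Threads": 17.723,
    "Fermi": 30.1,
    "Kepler": 25.7806,
}


def meets_realtime(fps, threshold=REALTIME_FPS):
    """True if the pipeline filters frames at least as fast as realtime."""
    return fps >= threshold


realtime_runs = sorted(
    name for name, fps in complex_script_fps.items() if meets_realtime(fps)
)
best_cpu_fps = max(
    fps for name, fps in complex_script_fps.items()
    if name not in ("Fermi", "Kepler")
)
```

Only the two GPU runs clear the threshold, and the fastest CPU configuration stays below 18 frames per second, matching the discussion above.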
5.7 Fermi and Kepler
Some of the above results are surprising, given that Kepler is intended to be
an improvement over the older Fermi architecture. The lower performance
recorded during testing may be a direct result of the Kepler K20Xm’s use of
ECC memory and of its focus on high precision scientific applications.
The GTX 560 Ti Fermi card, by contrast, is targeted at raw graphics performance;
it has no ECC memory and shorter kernel startup times. Future work
is intended to test the GPU framework on graphics cards such as the GTX 780,
which uses the newer GK110 Kepler architecture while still being targeted at
high performance graphics processing instead of scientific computation.
Chapter 6
Conclusion
CUDA enhanced video processing offers several benefits over traditional CPU-
based video processing. In particular, video filters consisting of many arithmetic
computations see a significant performance benefit when run on massively paral-
lel, throughput-oriented GPUs.
However, several algorithm design characteristics must be observed in order
to obtain optimal filter performance. More specifically, coalesced memory trans-
actions are crucial for all filter implementations, especially for memory-bound
filters requiring several memory reads for one memory write. In addition, the
efficient use of constant or shared memory on the GPU can result in a significant
performance speedup when compared to a global memory-only filter. This
performance difference is best demonstrated through the Expr and Transpose filters,
which use the broadcast capabilities of constant memory and the extreme access
speeds of shared memory, respectively. Additionally, a fully asynchronous execu-
tion pipeline is crucial for obtaining maximum performance while running in a
multithreaded CPU environment.
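The coalescing and shared memory points above can be illustrated with a tiled transpose kernel. What follows is a minimal sketch in the style of the widely used tiled transpose, not the Transpose filter from this project; the names transposeTiled and transposeOnGpu, the 32 x 32 tile size, and the single tightly packed 8-bit plane on the host side are all assumptions made for illustration.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Tile edge length in pixels (an assumed value, not taken from the thesis).
constexpr int TILE = 32;

// Transposes one 8-bit plane: dst(row x, column y) = src(row y, column x).
// src is width x height with row pitch srcPitch (bytes);
// dst is height x width with row pitch dstPitch (bytes).
__global__ void transposeTiled(const uint8_t* src, uint8_t* dst,
                               int width, int height,
                               size_t srcPitch, size_t dstPitch)
{
    // One padding column so that column-wise reads of the tile are spread
    // across shared memory banks.
    __shared__ uint8_t tile[TILE][TILE + 1];

    // Coalesced read: consecutive threadIdx.x values read consecutive bytes.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = src[y * srcPitch + x];

    __syncthreads();

    // Coalesced write: the block indices are swapped and the tile is read
    // column-wise, so consecutive threadIdx.x values still write
    // consecutive bytes of a dst row.
    int dx = blockIdx.y * TILE + threadIdx.x;  // column in dst, < height
    int dy = blockIdx.x * TILE + threadIdx.y;  // row in dst, < width
    if (dx < height && dy < width)
        dst[dy * dstPitch + dx] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: uploads a tightly packed plane, transposes it,
// and downloads the result. Returns an empty vector if a CUDA call fails.
std::vector<uint8_t> transposeOnGpu(const std::vector<uint8_t>& in,
                                    int width, int height)
{
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<size_t>(width) * height);

    std::vector<uint8_t> out(in.size());
    uint8_t* dSrc = nullptr;
    uint8_t* dDst = nullptr;
    size_t srcPitch = 0, dstPitch = 0;

    bool ok =
        cudaMallocPitch(reinterpret_cast<void**>(&dSrc), &srcPitch,
                        width, height) == cudaSuccess &&
        cudaMallocPitch(reinterpret_cast<void**>(&dDst), &dstPitch,
                        height, width) == cudaSuccess &&
        cudaMemcpy2D(dSrc, srcPitch, in.data(), width, width, height,
                     cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
        transposeTiled<<<grid, block>>>(dSrc, dDst, width, height,
                                        srcPitch, dstPitch);
        // This blocking copy on the default stream runs after the kernel.
        ok = cudaMemcpy2D(out.data(), height, dDst, dstPitch, height, width,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dSrc);
    cudaFree(dDst);
    if (!ok) out.clear();
    return out;
}
```

A global memory-only version would have to make either its reads or its writes strided; staging the tile in shared memory lets both sides stay row-contiguous. The blocking copies are used here only to keep the sketch short and are not representative of an asynchronous pipeline.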
All in all, a completely GPU-enabled filtering pipeline is a viable design tar-
get for video processing frameworks like Vapoursynth, and can result in both
saved CPU cycles and a tremendous performance improvement for compute-
bound video filters.
Chapter 7
Future Work
While the current implementation has proven to be a strong and stable plat-
form for GPU video processing, several enhancements are left for future work.
7.1 Multi-GPU Support
The current implementation relies on multiple GPU streams for parallel kernel
processing, but the possibility for a much greater speedup exists in systems with
multiple CUDA-capable GPUs. By striping the workload across two or more
GPUs, two frames can easily be processed in parallel, no matter how complicated
the kernel. This contrasts with GPU streams, which must share resources between
all streams running on the GPU and thus do not guarantee complete parallel
processing.
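The striping idea can be sketched as a round-robin assignment of frames to devices. This is an illustrative sketch only, not part of the current implementation: the names invert8, deviceForFrame, and filterFramesStriped are hypothetical, the invert kernel stands in for an arbitrary filter, and blocking copies from pageable memory are used for brevity where a real pipeline would more likely use pinned memory and per-device streams.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for any per-frame filter kernel: inverts 8-bit pixels in place.
__global__ void invert8(uint8_t* px, size_t n)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) px[i] = static_cast<uint8_t>(255 - px[i]);
}

// Round-robin striping: frame f is handled by device f % deviceCount.
int deviceForFrame(size_t frame, int deviceCount)
{
    assert(deviceCount > 0);
    return static_cast<int>(frame % static_cast<size_t>(deviceCount));
}

// Hypothetical helper that filters every frame in place.
// Returns false if no CUDA device is available or a CUDA call fails.
bool filterFramesStriped(std::vector<std::vector<uint8_t>>& frames)
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount < 1)
        return false;

    std::vector<uint8_t*> dBuf(frames.size(), nullptr);
    bool ok = true;

    // Phase 1: upload each frame to its device and queue the kernel.
    // Kernel launches return immediately, so a kernel queued on one device
    // can run while the next frame is being uploaded to another device.
    for (size_t f = 0; f < frames.size() && ok; ++f) {
        const size_t n = frames[f].size();
        if (n == 0) continue;
        ok = cudaSetDevice(deviceForFrame(f, deviceCount)) == cudaSuccess &&
             cudaMalloc(reinterpret_cast<void**>(&dBuf[f]), n) == cudaSuccess &&
             cudaMemcpy(dBuf[f], frames[f].data(), n,
                        cudaMemcpyHostToDevice) == cudaSuccess;
        if (ok) {
            const unsigned blocks = static_cast<unsigned>((n + 255) / 256);
            invert8<<<blocks, 256>>>(dBuf[f], n);
        }
    }

    // Phase 2: download the results and release the device buffers.
    // Each blocking copy waits for the kernel queued on that device.
    for (size_t f = 0; f < frames.size(); ++f) {
        if (dBuf[f] == nullptr) continue;
        cudaSetDevice(deviceForFrame(f, deviceCount));
        if (ok)
            ok = cudaMemcpy(frames[f].data(), dBuf[f], frames[f].size(),
                            cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaFree(dBuf[f]);
    }
    return ok;
}
```

On a single-GPU system the same code still runs, with every frame assigned to device 0. Unlike streams on one GPU, each device here has its own execution resources and memory.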
7.2 Extended Filter Support
Once the standard library of core filters has been ported, several other Vapoursynth
plugins show strong potential for porting to the GPU. In particular, the Generic
Filters [13] plugin, which implements a large number of mask-based operations
à la Masktools for Avisynth, should see a particularly large speedup when paired
with an all-GPU filtering chain. A number of its basic filters, including
horizontal and vertical convolution, Sobel edge filtering, and local average
inflation/deflation, are already implemented in a basic form in NVIDIA’s NPP library
[17]. A proper evaluation of the NPP library against a custom implementation
is still needed.
One caveat of using the NPP library is that it does not natively support video
with a bit depth greater than 8 bits, which is a basic requirement for a capable
plugin contribution to the Vapoursynth project.
7.3 Expanded Bits Per Pixel Support
In the interest of time, all filters ported to CUDA so far have been limited
to support only 8 bits per pixel. This is by far the most common video format,
which means that these filters will work stably with almost all consumer video.
However, in the interest of being feature complete, these filters need to support all
standard bits per pixel video formats, including 9/10/16-bit and floating point.
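One possible way to cover several sample formats with a single kernel is to template it on the sample type. The sketch below is an assumption about how this could be done, not the project's design: the names peakValue, invertSamples, and invertOnGpu are hypothetical, 9/10/16-bit samples are assumed to be stored in 16-bit words, floating point samples are assumed to be normalized to the range 0 to 1, and row pitch is ignored for brevity.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Largest sample value of a format: 2^bits - 1 for integer samples.
template <typename T>
T peakValue(int bitsPerSample)
{
    assert(bitsPerSample >= 1 &&
           bitsPerSample <= static_cast<int>(8 * sizeof(T)) &&
           bitsPerSample <= 16);
    return static_cast<T>((1u << bitsPerSample) - 1u);
}

// Floating point samples are assumed to be normalized, so the peak is 1.0.
template <>
float peakValue<float>(int /*bitsPerSample*/)
{
    return 1.0f;
}

// Inverts every sample against the format's peak value. The same source
// serves uint8_t, uint16_t, and float planes.
template <typename T>
__global__ void invertSamples(T* samples, size_t count, T peak)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < count) samples[i] = static_cast<T>(peak - samples[i]);
}

// Hypothetical host helper: inverts the samples in place on the GPU.
// Returns false if a CUDA call fails.
template <typename T>
bool invertOnGpu(std::vector<T>& samples, int bitsPerSample)
{
    if (samples.empty()) return true;
    const size_t bytes = samples.size() * sizeof(T);
    T* d = nullptr;

    bool ok =
        cudaMalloc(reinterpret_cast<void**>(&d), bytes) == cudaSuccess &&
        cudaMemcpy(d, samples.data(), bytes,
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const unsigned blocks =
            static_cast<unsigned>((samples.size() + 255) / 256);
        invertSamples<T><<<blocks, 256>>>(d, samples.size(),
                                          peakValue<T>(bitsPerSample));
        ok = cudaMemcpy(samples.data(), d, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d);
    return ok;
}
```

For example, calling invertOnGpu on a std::vector<uint16_t> with bitsPerSample set to 10 inverts against 1023, while the same call on a std::vector<uint8_t> with 8 bits inverts against 255. Filters with format-specific arithmetic would likely need more than a type parameter.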
7.4 Investigate CPU Threading Problems
Currently, when using more than one CPU thread for GPU filtering in Vapoursynth,
performance takes a significant hit, the size of which depends on the filter chain.
Further research is needed to determine the cause and to find ways to improve
performance in a multithreaded environment.
7.5 Providing Support for Non-CUDA Devices
The fundamental limitation of a CUDA-backed Vapoursynth project is that it is
by definition restricted to CUDA-capable devices. At the time of this writing,
this means that only NVIDIA-sanctioned devices are able to run the kernels
written in this project fork.
A direct port to OpenCL is a possible solution, but it would require a complete
rewrite and duplication of the kernel code solely to extend device support.
Another option is offered through the GPU Ocelot [2] project, which provides
the ability to perform a low-level PTX assembly conversion to support alternative
devices such as AMD (formerly ATI) GPUs or the recent Intel MIC (Knights Corner) project.
The benefit of using Ocelot is that no source code rewrite is required, since
it essentially allows a direct binary conversion. The downside is that Ocelot is
still a very young project and can be difficult to set up and configure.
Bibliography
[1] An example of color banding. https://commons.wikimedia.org/wiki/File%3AColour_banding_example01.png.
[2] GPU Ocelot: A modular dynamic compilation framework for heterogeneous
systems. https://code.google.com/p/gpuocelot/.
[3] MSI GTX 560 Ti Hawk. http://www.techpowerup.com/gpudb/b936/msi-gtx-560-ti-hawk.html.
[4] LuaJIT bindings for Vapoursynth. https://github.com/tgoyne/luasynth,
2012.
[5] Apple. Video sample rate and bit depth. http://documentation.apple.com/en/finalcutpro/usermanual/index.html#chapter=C%26section=11%26tasks=true.
[6] I. Brabham. Avisynth version 2.6. http://avisynth.org/mediawiki/Changelist_25-26.
[7] Intel Corporation. IA-32 Intel architecture software developer’s manual. Intel
Corporation, 2001.
[8] Equasys. Color formats for image and video processing. http://www.equasys.de/colorformat.html.
[9] Advanced Micro Devices, Inc. AMD extensions to the 3DNow! and MMX instruction set
manual. Technical report, March 2000.
[10] F. Melbin. Vapoursynth video processing framework. http://www.vapoursynth.com/about/, 2012.
[11] F. Melbin. Expr - Vapoursynth filter. http://vapoursynth.com/doc/functions/expr.html, 2013.
[12] S. Moore. Using streaming SIMD extensions (SSE2) to perform big multiplications.
Application note AP-941, Intel Corporation, 2000. Version 2.0. Order
number 248606-001.
[13] O. Motofumi. Vapoursynth generic filters. http://forum.doom9.org/showthread.php?t=166842, 2013.
[14] Nand. Hi10P info / guide. http://haruhichan.com/wpblog/index.php/205/hi10p-info-guide.html, July 2011.
[15] NVIDIA. NVIDIA CUDA compute capabilities. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities.
[16] NVIDIA. NVIDIA Kepler K20Xm. http://www.techpowerup.com/gpudb/1884/tesla-k20xm.html.
[17] NVIDIA. NVIDIA Performance Primitives. https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NPP_Library.pdf.
[18] NVIDIA. NVIDIA CUDA C programming guide version 5. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, October 2012.
[19] NVIDIA. NVIDIA’s next generation CUDA compute architecture:
Kepler GK110. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.
[20] NVIDIA. Tuning CUDA applications for Kepler. http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html, 2012.
[21] NVIDIA. CUDA: Coalesced access to global memory. http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-global-memory, 2013.
[22] NVIDIA. Parallel programming and computing platform - NVIDIA CUDA.
http://www.nvidia.com/object/cuda_home_new.html, June 2013.
[23] B. Rudiak-Gould. Avisynth. http://avisynth.org/mediawiki/Main_Page, 2003.
[24] SEt. Avisynth 2.6 MT. http://forum.doom9.org/showthread.php?t=148782, 2013.
[25] B. Waggoner. Compression for great digital video: power tips, techniques,
and common sense. Focal Press, 2002.
