University of Texas at El Paso

DigitalCommons@UTEP
Open Access Theses & Dissertations

2014-01-01

Runtime Pipeline I/O Scheduling System for
GPU-Based Heterogeneous Architectures
Julio Cesar Olaya
University of Texas at El Paso, jolaya@miners.utep.edu

Follow this and additional works at: https://digitalcommons.utep.edu/open_etd
Part of the Computer Sciences Commons
Recommended Citation
Olaya, Julio Cesar, "Runtime Pipeline I/O Scheduling System for GPU-Based Heterogeneous Architectures" (2014). Open Access
Theses & Dissertations. 1311.
https://digitalcommons.utep.edu/open_etd/1311

This is brought to you for free and open access by DigitalCommons@UTEP. It has been accepted for inclusion in Open Access Theses & Dissertations
by an authorized administrator of DigitalCommons@UTEP. For more information, please contact lweber@utep.edu.

RUNTIME PIPELINE I/O SCHEDULING SYSTEM
FOR GPU-BASED HETEROGENEOUS ARCHITECTURES

JULIO CESAR OLAYA BUILES
Department of Computer Science

APPROVED:

Rodrigo Romero, Ph.D., Chair

Shirley Moore, Ph.D., Co-chair

Yoonsik Cheon, Ph.D.
Vladik Kreinovich, Ph.D.
Michael McGarry, Ph.D.

Charles Ambler, Ph.D.
Dean of the Graduate School

Copyright ©
by
Julio Cesar Olaya Builes
2014

Dedicated with love to
Luz Stella, Julio,
Jeronimo, Angela, Mauricio,
and Paulina.

RUNTIME PIPELINE I/O SCHEDULING SYSTEM
FOR GPU-BASED HETEROGENEOUS ARCHITECTURES

by

JULIO CESAR OLAYA BUILES

DISSERTATION

Presented to the Faculty of the Graduate School of
The University of Texas at El Paso
in Partial Fulfillment
of the Requirements
for the Degree of

Doctor of Philosophy

Department of Computer Science
THE UNIVERSITY OF TEXAS AT EL PASO
August 2014

Acknowledgements
I would like to express my deepest gratitude to my advisor, Dr. Rodrigo Romero, for his support
and confidence in me. His guidance and advice made my doctorate an incredible and wonderful
experience.
I wish to thank the other members of the committee, Dr. Shirley Moore, Dr. Yoonsik Cheon, Dr.
Michael McGarry, and Dr. Vladik Kreinovich for their guidance, valuable comments, suggestions, and
patience during this investigation.
I want to thank Dr. Ann Gates and Dr. Aaron Velasco for providing me with the means, through
the Cyber-ShARE Center of Excellence, to complete my degree and for their guidance, which made me
a better computer scientist.

This material is based upon work supported in part by the National Science Foundation under
CREST Grant No. HRD-0734825, Grant No. CNS-0923442, and CREST Grant No. HRD-1242122.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
v

Abstract
Heterogeneous architectures can improve the performance of applications with computationally
intensive operations. Even when these architectures may reduce the execution time of applications, there
are opportunities for additional performance improvement as the memory hierarchies of the central
processor cores and the coprocessor cores are separate. Applications running on heterogeneous
architectures where graphics processing units (GPUs) execute throughput-intense, data-parallel
operations may run in a single address space provided by unified virtual addressing or expand the upper
bounds of scalability and high performance computing by explicitly partitioning and transferring data
across orthogonal host and device address spaces. For explicit handling, applications must allocate
space in the GPU global memory, copy input data, invoke kernels, and copy results to the CPU memory.
By overlapping inter-memory data transfers and GPU computation steps, applications may further
reduce execution time. This research presents a software architecture with a runtime pipeline for GPU
input/output scheduling that acts as a bidirectional interface between the GPU computing application
and the physical device. The main aim of this system is to reduce the impact of the processor-memory
performance gap by exploiting device I/O and computation overlap. Evaluation using application
benchmarks shows processing improvements with speedups up to 2.37x with respect to baseline, nonstreamed GPU execution. In addition, the presented input/output scheduling system is a high-level,
systems abstraction that removes application software complexity while exploiting the input/output and
processing concurrency capabilities of the underlying GPU.

vi

Table of Contents
Acknowledgements...................................................................................................................................... v	
  
Abstract .......................................................................................................................................................vi	
  
Table of Contents ...................................................................................................................................... vii	
  
List of Tables ..............................................................................................................................................ix	
  
List of Figures .............................................................................................................................................. x	
  
Chapter 1: Introduction ................................................................................................................................ 1	
  
Chapter 2: Background ................................................................................................................................ 6	
  
2.1	
   Architecture ............................................................................................................................... 6	
  
2.2	
   Programming Model .................................................................................................................. 8	
  
2.3	
   Stream Concurrency ................................................................................................................ 10	
  
2.4	
   Direct Memory Access ............................................................................................................ 11	
  
Chapter 3: Related Projects and Concepts ................................................................................................. 12	
  
3.1	
   Overlapping operations ............................................................................................................ 12	
  
3.2	
   Virtualization ........................................................................................................................... 13	
  
3.3	
   Libraries ................................................................................................................................... 14	
  
3.4	
   Software Approaches ............................................................................................................... 15	
  
Chapter 4: RPIOSS System Model ............................................................................................................ 17	
  
4.1	
   Virtual Layers .......................................................................................................................... 19	
  
4.2	
   Pipeline I/O Scheduling ........................................................................................................... 22	
  
4.3	
   CUDA Context ........................................................................................................................ 27	
  
Chapter 5. Results using RPIOSS .............................................................................................................. 28	
  
5.1	
   Experimental Setup .................................................................................................................. 28	
  
5.2	
   Metrics ..................................................................................................................................... 28	
  
5.3	
   Methodology ............................................................................................................................ 30	
  
5.4	
   Benchmark Descriptions and Analyses of Results .................................................................. 31	
  
5.5	
   Data Transfer Bandwidth ......................................................................................................... 51	
  
5.6	
   Summary of Results ................................................................................................................. 55	
  

vii

Chapter 6: Conclusions and Future Work ................................................................................................. 57	
  
References.................................................................................................................................................. 58	
  
Appendix A: Elapsed Time for Vector Addition ....................................................................................... 62	
  
Appendix B: Elapsed Time for Black-Scholes .......................................................................................... 64	
  
Appendix C: Elapsed Time for Nearest Neighbors ................................................................................... 66	
  
Appendix D: Elapsed Time for 1D Convolution ....................................................................................... 68	
  
Appendix E: Elapsed Time for Histogram ................................................................................................ 70	
  
Appendix F: Vector Addition Benchmark ................................................................................................. 72	
  
Appendix G: Black-Scholes Benchmark ................................................................................................... 76	
  
Appendix H: 1D Convolution Benchmark ................................................................................................ 81	
  
Appendix I: Nearest Neighbors Benchmark .............................................................................................. 85	
  
Appendix J: Histogram Benchmark........................................................................................................... 94	
  
Appendix K: RPIOSS Manager Program .................................................................................................. 97	
  
Appendix L: RPIOSS Shared Memory Subsystem Program .................................................................. 100	
  
Appendix M: RPIOSS Dependence Subsystem Program ....................................................................... 103	
  
Appendix N: RPIOSS Storage Allocation Program ................................................................................ 106	
  
Appendix O: RPIOSS Copy Host-Device Program ................................................................................ 108	
  
Appendix P: RPIOSS Kernel Launcher Program .................................................................................... 110	
  
Appendix Q: RPIOSS Copy Device-Host Program ................................................................................ 113	
  
Appendix R: Device program .................................................................................................................. 115	
  
Appendix S: Reference for RPIOSS API Functions................................................................................ 117	
  
Appendix T: How to use RPIOSS ........................................................................................................... 118	
  
Vita……………………………………………………………………………………………………...119	
  

viii

List of Tables
Table 5.1: Metrics for Vector Addition. ....................................................................................................33	
  
Table 5.2: Metrics for Black-Scholes. .......................................................................................................37	
  
Table 5.3: Metrics for 1D convolution. .....................................................................................................41	
  
Table 5.4: Metrics for Nearest Neighbors .................................................................................................44	
  
Table 5.5: Metrics for histogram benchmark ............................................................................................48	
  
Table A.1: Result of Vector Addition using 32 threads. ...........................................................................62	
  
Table A.2: Result of Vector Addition using 64 threads. ...........................................................................62	
  
Table A.3: Result of Vector Addition using 128 threads. .........................................................................63	
  
Table B.1: Result of Black-Scholes using 32 threads................................................................................64	
  
Table B.2: Result of Black-Scholes using 64 threads................................................................................64	
  
Table B.3: Result of Black-Scholes using 128 threads..............................................................................65	
  
Table C.1: Result of Nearest Neighbors using 32 threads. ........................................................................66	
  
Table C.2: Result of Nearest Neighbors using 64 threads. ........................................................................66	
  
Table C.3: Result of Nearest Neighbors using 128 threads. ......................................................................67	
  
Table D.1: Result of 1D Convolution using 32 threads.............................................................................68	
  
Table D.2: Result of 1D Convolution using 64 threads.............................................................................68	
  
Table D.3: Result of 1D Convolution using 128 threads...........................................................................69	
  
Table E.1: Result of Histogram using 32 threads. .....................................................................................70	
  
Table E.2: Result of Histogram using 64 threads. .....................................................................................70	
  
Table E.3: Result of Histogram using 64 threads. .....................................................................................71	
  

ix

List of Figures
Figure 1.1: CPU and GPU execution flow. .................................................................................................3	
  
Figure 2.1: Host-device high-level architecture of a GPU. .........................................................................9	
  
Figure 2.2: High-level stream components that influence concurrency. ...................................................10	
  
Figure 4.1: Execution timeline with the minimum execution time achieved by the maximum, ideal
overlap of processing and data transfers. ...................................................................................................17	
  
Figure 4.2: High-level structure of RPIOSS. .............................................................................................19	
  
Figure 4.3: Vector addition example using the API functions of RPIOSS. ..............................................20	
  
Figure 4.4: Vector addition using RPIOSS................................................................................................21	
  
Figure 4.5: Vector addition using high-level CUDA API functions. ........................................................21	
  
Figure 4.6: RPIOSS functional blocks.......................................................................................................22	
  
Figure 4.7: Vector addition example using RPIOSS. ................................................................................24	
  
Figure 4.8: Data and synchronization structure of RPIOSS. .....................................................................25	
  
Figure 4.9: Stage Algorithm. .....................................................................................................................26	
  
Figure 5.1: Pseudocode for storage management, data transfer, and grid launch operations of the VA
base program. .............................................................................................................................................32	
  
Figure 5.2: Pseudocode for VA using RPIOSS for GPU I/O and execution. ............................................32	
  
Figure 5.3: RPIOSS Speedup for Vector Addition. ...................................................................................33	
  
Figure 5.4: Percentage of time of each GPU operation for Vector Addition. ...........................................34	
  
Figure 5.5: Pseudocode for storage management, data transfer, and grid launch operations of the of
Black-Scholes base program......................................................................................................................36	
  
Figure 5.6: Pseudocode for Black-Scholes using RPIOSS for GPU I/O and execution. ..........................36	
  
Figure 5.7: Speedup for Black-Scholes. ....................................................................................................37	
  
Figure 5.8: Percentage of time of each GPU operation for Black-Scholes. ..............................................38	
  
Figure 5.9: Pseudocode for storage management, data transfer, and grid launch operations of the C1D
base program. .............................................................................................................................................40	
  
Figure 5.10: Pseudocode for C1D using RPIOSS for GPU I/O and execution. ........................................40	
  
x

Figure 5.11: Speedup for 1D convolution .................................................................................................41	
  
Figure 5.12: Percentage of time of each GPU operation for 1D convolution. ..........................................42	
  
Figure 5.13: Pseudocode for storage management, data transfer, and grid launch operations of the NN
base program. .............................................................................................................................................43	
  
Figure 5.14: NN pseudocode using RPIOSS for GPU I/O and execution.................................................43	
  
Figure 5.15: Speedup for Nearest Neighbors. ...........................................................................................44	
  
Figure 5.16: Percentage of time of each GPU operation for Nearest Neighbors benchmark. ...................45	
  
Figure 5.17: Pseudocode for storage management, data transfer, and grid launch operations of the
histogram base program. ............................................................................................................................46	
  
Figure 5.18: Histogram pseudocode using RPIOSS for GPU I/O and execution. ....................................46	
  
Figure 5.19: Speedup for histogram. .........................................................................................................47	
  
Figure 5.20: Percentage of time of each GPU operation for histogram. ...................................................48	
  
Figure 5.21: Percentage of time of each GPU operation for histogram. ...................................................50	
  
Figure 5.22: Pseudocode for storage management, data transfer, and grid launch operations of the VA
base program. .............................................................................................................................................50	
  
Figure 5.23: Speedup for seismic tomography. .........................................................................................51	
  
Figure 5.24: Percentage of time for each GPU operation for seismic tomography. ..................................51	
  
Figure 5.25: Host memory throughput for Vector Addition......................................................................52	
  
Figure 5.26 Host memory throughput for Black-Scholes. .........................................................................53	
  
Figure 5.27: Host memory throughput for 1D Convolution. .....................................................................53	
  
Figure 5.28: Host memory throughput for Nearest Neighbors. .................................................................54	
  
Figure 5.29: Host memory throughput for Histogram. ..............................................................................54	
  
Figure 5.30: Speedups obtained with RPIOSS. .........................................................................................55	
  

xi

Chapter 1: Introduction
General purpose processors are designed with the best fundamental functionality in mind, i.e.,
they are designed to deliver acceptable performance for the average application. When applications
deviate from the average by requiring substantial specialized processing, processor performance is likely
to be inadequate. Architecturally, such situation presents an opportunity for the introduction of a
coprocessor – a unit of dedicated, special-purpose hardware intended to address the performance deficit
originated by the specialized needs of an application. For instance, floating-point or math coprocessors
used to be standard additions for engineering and heavily numerical applications when processors
included only integer processing capabilities. Other specialized application needs include video
processing [28], graphics processing [11], digital signal processing [9], encryption/decryption [33],
biometrics processing [29], and data-parallel processing [27]. Considering a wide range of application
requirements, coprocessor implementations can range from tightly-coupled, on-chip designs, such as the
latest APUs [3] for data-parallel processing, to cluster designs where coprocessors are dedicated devices
connected to the main bus of general-purpose processors in cluster nodes, such as GPU clusters for
graphics and high-performance computing applications [14].
This dissertation focuses on the utilization of graphics processing units (“GPUs”), a type of
coprocessor that evolved from a hardwired graphics hardware accelerator through a programmable
graphics processing device to a parallel processing coprocessor, for data-parallel applications. Because
of the device origin, graphics processing with GPUs is based on a very mature hardware and software
architecture well supported by application programming interfaces such as DirectX [35] and OpenGL,
the Open Graphics Library [43]. This work, however, is focused on applications that use GPUs for
general-purpose computing or GPGPU computing [30] – a widely applied misnomer as GPGPU
computing targets design and implementation of embarrassingly parallel algorithms for data-parallel,
throughput-intensive applications. For these applications where data abundance is a given, the main
issue is exploiting the highest deliverable throughput of the underlying hardware. Mapping parallel
algorithms to this general processor/data parallel coprocessor hardware platform is a great challenge for
the software architect as the systems state of the art is very low level from the application perspective.
1

What started as expressing any parallel processing as vertex processing, in order to benefit from parallel
computing hardware in graphics processors, has evolved into having the expressive capabilities of
general-purpose languages. However, there is still a systems gap in support of data transfers that is
addressed by every GPGPU application. This gap is the main motivation for this work. The question
addressed by this work is “What systems abstraction is needed to provide applications with a seamless,
high throughput data flow among processors and coprocessors?”
While GPUs are capable of significantly accelerating applications that require intensive, dataparallel computations, a GPU device (“the device”) has a memory hierarchy that is – at least partially –
physically separate from the host computer memory hierarchy. Applications may run in a single address
space, which implies a performance cost as there is no control over the timing and frequency of data
transfers across low-bandwidth high-latency buses, or expand the upper bounds of scalability and high
performance computing by explicitly partitioning and transferring data across orthogonal host and
device address spaces. Taking advantage of the latter alternative results in an essential need for
multiple, expensive data transfers to the device memory before its computations take place and back to
the host main memory once device computations are complete. In addition, host and device tasks and
data transfers must be coordinated to maintain a high and effective utilization of the involved processing
cores and data transfer paths with the aim to maintain the greatest achievable throughput. For this
reason, a number of operational steps, as shown in Figure 1.1, are necessary for accessing the
computational power of the device streaming processors, namely, device memory allocation (1), input
data transfer from host memory to device memory (2), initialization of device for execution of GPU
computational kernels (3), GPU kernel invocation (4), and output data transfer from device memory
back to the host main memory (5). Although these required processing steps may improve the
performance of applications, they are typically programmed as a sequential transaction in which
overhead operations (1, 2, 3, and 5) wrap advantageous high-performance parallel processing (4) on
device cores. This happens because there may be data dependences among sequences, a lack of device
resources for a fully parallel kernel execution, or a combination of both scenarios. This motivates a

2

search for alternatives to reduce the inefficiencies both on the host and the device by overlapping data
transfers with host and device processing operations.
3$

CPU$

Main$
memory$

1$

2$

Device$

5$

Global$$
memory$

GPU$

4$

Mul5processors$1…N$
$

Figure 1.1: CPU and GPU execution flow.
The main aim of exploiting potential overlaps of communication and computation is the
improvement of the application execution time with respect to current approaches which are based either
on unified virtual addressing or on explicit sequences of GPU activity bursts preceded and followed by
communication-induced GPU and CPU idle times.
Unfortunately, sequential execution of device I/O operations implies a data transfer schedule that
negatively impacts the utilization of processors because both the host and the device must wait for each
other to complete their corresponding steps of I/O transactions. This is the default operation mode for
general-purpose programming enhanced with parallel processing on GPUs introduced by NVIDIA [42]
The programming model is based on a combination of a general-purpose sequential language such as
C/C++ and a parallel computing environment such as NVIDIA’s Compute Unified Device Architecture
(CUDA) for GPU computing, which was chosen as the experimental platform for this research.
The CUDA application-programming interface (API) provides blocking functions, which prevent
the GPU kernels from beginning execution until the device has received all the needed data from the
host. To alleviate the performance impact of these functions, the API also provides non-blocking
functions for asynchronous transfers through pinned memory and management of a number of streams.
Each stream represents a queue that contains operations to be executed in order. The API thus enables
splitting data transfers and kernel execution, which affords an additional degree of concurrency that can
3

potentially completely hide or at least lessen the impact of data-transfer overhead through stream-based
asynchronous transfers and kernel execution. However, use of resources and scheduling of a grid may
restrict execution concurrency and encumber performance. For instance, if a kernel consumes too many
resources or the transfer of data is only in one direction due to a long grid, this may cause serialization
[44].
Multiple host threads that share device access provide a second alternative to the use of blocking
functions, which requires device capabilities in which processes can have one context per device [38]
and host threads can share the CUDA context. This approach to exploiting both host and device thread
level parallelism (TLP) further complicates multithreaded host application development with the need to
manage access to the shared context. To share data, host thread binding to the device is required in order
to maintain data consistency and handle dependences.
The chosen strategy for concurrent application development will require additional efforts to
distribute blocks of data to streams, track computation and communication tasks, and manage resources
to avoid idle processors during grid execution. The responsibilities get more complicated when using the
device driver API since it introduces additional tasks such as initializing the device unit, the CUDA
module, and the context for the device. Therefore, there is a need to create a layer of application
management to encapsulate thread handling, data transfers, synchronization, and high-level device
management.
To address the aforementioned limitations and take advantage of the capabilities of GPUs, this
research presents a runtime pipelined I/O scheduling system (RPIOSS), which is a software pipeline
architecture that provides a high-level abstraction to maximize host and device processor utilization and
overall throughput. RPIOSS functionality is located between the user application and the device. The
goals of RPIOSS are:
•

Reducing and hiding of overhead of memory allocation and data transfers.

•

Lessening the latency of kernel launching.

•

Maximizing utilization of host and device processing cores through elimination of sequential
synchronous GPU operations.
4

To this end, RPIOSS asynchronously allocates and transfers blocks of data and launches kernels.
RPIOSS manages operation overlapping through host threads that share access per device. It uses a
producer-consumer approach and set functions to determine dependences at runtime for execution of
multiple independent kernels. RPIOSS API functions encapsulate functionality and enable application
developers to leverage GPU computing through RPIOSS abstractions for concurrent data transfers and
computations.
To address the main question above, this work is based on researching the performance
implications of using RPIOSS to exploit potential overlap of communication and computation between
the host and the device by doing the following:
•

Providing a high-level abstraction for communication and computation that removes the need for
low-level exploitation of device hardware.

•

Presenting a design of a systems abstraction to address computation and communication
overhead for an improved performance.

•

Providing a quantitative and comparative analysis of performance of the proposed abstraction.
The rest of this document is organized as follows. Chapter 2 presents an overview of the GPU

architecture, programming model, and model of parallelism. Chapter 3 presents related work. Chapter 4
introduces the details of the RPIOSS model. Chapter 5 shows results and analysis of execution of
sample GPU applications. Finally, Chapter 6 includes conclusions and future work.

5

Chapter 2: Background
The early GPU was designed as a hardware accelerator to exclusively perform graphics
computations and was not programmable. Nowadays, GPUs are programmable and used as generalpurpose computing devices that developers and researchers are finding increasingly useful for many
diverse applications.
The GPU design was based on the notions of graphics pipelines. One of them was the fixedfunction graphics pipeline, a prominent hardware design from the 80’s and the 90’s, which was not
programmable. In this period of time graphic APIs, which included functions for hardware and software
services, were introduced. For instance, OpenGL, Direct3D, and the initial NVIDIA GeForce GPU had
similar logical graphics pipelines that enabled developers to use API functions to send commands and
data from the CPU main memory to the GPU memory through of a host interface. Each stage of the
graphics pipeline represented an operation. For instance, the vertex shader stage worked on the vertices
of polygons, the transform stage gave each vertex a position on a screen, lighting stages determined
vertex color, the raster and setup stage provided the vertex information that contributed to each rendered
pixel, and the frame buffer interface stage controlled read and write operations to display the image
frame buffer in memory. However, GPU hardware designers noted that some rendering algorithms do
not require using all the stages. This motivated them to change the hardware to make those pipeline
stages programmable [25, 45]. As a result, GPUs have progressed continuously over the last years from
their focus on graphics processing to providing support for general purpose computing applications with
significant data parallelism. Improvements of software and hardware capabilities have been added in
each device generation (e.g., GeForce, Fermi, Tesla, Kepler, and Maxwell). Basic device features are
described in the following subsections [12, 34, 25, 36, 37, 40, 56].
2.1

Architecture
The NVIDIA GeForce 256 architecture was the first programmable card with hardware

acceleration capabilities. It was made available for consumers in 1999, and it opened access to new
hardware features such as texturing, lighting, and geometric transformations. This architecture was a
6

fixed graphic pipeline since it was not possible to modify the data once the developers sent the data to
the pipeline.
The next step of evolution of NVIDIA GPUs was the GeForce 3, which appeared in 2001. This
GPU architecture introduced the first step for general programmability by allowing programming of
some stages such as the pixel and vertex shader. In 2002, the GeForce FX was introduced as a
programmable card that allowed developers to manipulate input and output data operations. In 2004, the
GeForce 6 introduced the first card with an interface for the peripheral component interconnection
express (PCI-express) bus and dynamic flow controls.
In 2006, the GeForce 8 series introduced a major step in the GPU evolution with a generalpurpose computing model and synchronization barriers. It was the first GPU to support C language and
eliminated vector registers by using scalar thread processors. Vector registers represented a single
instruction multiple data (SIMD) lane within a thread block, while the scalar thread processor was a
thread block scheduler that delegates multiple thread blocks to SIMD processors. The GeForce 8 series
also introduced an execution model similar to SIMD called the single introduction multiple thread
(SIMT) model, in which multiple independent threads executed a single instruction. The GeForce 8
series had three key features: a memory hierarchy, a number of streaming multiprocessors (SMs), and a
number of stream processors (SPs), which are the device computing cores. SMs were programmable
unified processors that enable handling of operations such as vertex, pixel, and geometry computations.
The memory hierarchy of the device, as shown in Figure 2.1, has off-chip and on-chip levels. The lowest
level is global memory, the main memory of the GPU, which has the greatest capacity and latency of all
levels. Data in global memory, which may have been transferred into the device from the host memory
or generated through kernel computations, is accessible by all threads in an application. Constant
memory and texture memory, two additional levels in the hierarchy, are used for kernel read-only data
that can be read and modified by code running on the host. On-chip memory includes shared memory
used to reduce latency to access data, but only threads in a thread block have read and write access to it.
The Tesla architecture appeared in 2006 with the GeForce 8800 (G80), which represented the
second generation of programmable GPU computing. It improved uncoalesced transactions and included
7

a 24-bit integer multiplier. An even better architecture was presented in the next generation with Fermi
devices in 2009. The Fermi design included a 64-bit multiplier; duplicated the number of cores; added
load and store units that allowed determination of the source and destination addresses in memory;
added special function units for hardware computation of sine, cosine, and square root; incorporated a
dual warp scheduler in which each scheduler handles a sub-vector of 32 threads; added an L1 cache
memory to take advantage of locality of reference as kernel local memory, which was intended to store
temporary data and had configurable partitioning with shared memory; and included an L2 cache
memory to store SM-shareable data. Other Fermi improvements were faster context switching,
concurrent kernel execution, out-of-order thread block execution, double precision operations, and dual
transfer memory engines. The Kepler architecture introduced dynamic parallelism, Hyper-Q, and the
grid management unit. By adding dynamic parallelism, the GPU may create new work, coordinate
results, and regulate work scheduling without involving host processors. Hyper-Q enabled use of a
single GPU by various CPU cores. The grid management unit managed flexible grids and worked as a
dispatch control system needed by features such as dynamic parallelism. Finally, the latest architecture,
Maxwell, improved energy efficiency.
2.2

Programming Model
The central point of the GPU programming model is to facilitate the implementation of massive

data-parallel algorithms. For instance, the software and hardware interface provided by CUDA contains
high-level API functions for C/C++ programs to execute operations in the device started by processes
running on the host. API functions enable storage allocation in device global memory, data transfers
from and to the GPU, and kernel launches.
An execution configuration distributes kernel execution on the device by explicitly specifying
the creation of a grid of thread blocks to be mapped to streaming multiprocessors. Device hardware
includes a thread block scheduler that distributes blocks to SMs using a round-robin approach.
Scheduling depends on data being available for a thread block. Device hardware uses a warp scheduler
that selects which warp to run. The warp scheduler issues kernel instructions in SIMD mode, which
means that all SPs in a SM execute the same instruction with different data for each thread.
8

Host&
Processor&

I/O&&
bridge&

System&bus&

Memory&bus&

Main&&
memory&

I/O&bus&
Device&
Global&memory&
L2&cache&
f&

Mul-processors&(1…N)&
&&&&&Shared&memory&
&&&&&&L1&cache&
&&&&&&registers&
&&&&&Streaming&processors&

Constant&&&Texture&
Caches&

Figure 2.1: Host-device high-level architecture of a GPU.
The elements of the GPU programming model support data-parallel programming with implicit
scalability. When a host program launches a kernel, it forms a hierarchy of threads. The first level is at
the block level; each block having a unique index in the grid. The second level is the thread level, which
has a unique index for each thread. Using these indices, a kernel determines the index subscripts of a
vector to execute in SIMT mode. In this way, the GPU programming model supports task parallelism.
While data parallelism tends to be fine-grained, task parallelism is coarse-grained. To take advantage of
task parallelism, a host program divides and transfers data and launches kernels in different streams.
The CUDA programming model is based on compute levels that identify device functionality
available in each GPU architecture. For instance, an important feature of compute level 1.0 is to ensure
that atomic operations end without interruption; compute level 1.1 supports overlap of communication
and computation, but not on all devices; compute level 1.2 reduces the problem of coalesced access to
9

global memory and bank problems with shared memory, which is present in lower compute levels;
compute level 1.3 supports double precision calculations; compute level 2.0 supports features such as
configuration of memories, error correcting code base memory checking,

execution of multiple

concurrent kernels, and dual transfer of data; and compute level 3.x increases the grid size and presents
faster atomic operations.
2.3

Stream Concurrency
Stream management enables expression of explicit concurrency at the task level. Streams feature

a queue to issue operations, such as copying data and launching kernels, for either in-order or out-oforder execution. Figure 2.2 shows how host processes introduce commands and operations into streams
through queue structures. The stream scheduler takes commands from each stream and sends them to the
device. The stream scheduler controls two transfer engines – the host-to-device copy engine and the
device-to-host copy engine – and one compute engine, which facilitate overlapping of operations [40,
44].

Host(

Streams(

A1(

A2(

A3( …( An(

B1(

B2(

B3( …( Bn(

C1(

C2(

C3( …( Cn(

Stream(scheduler(
Device(
DMA(
engines(

Thread(block(
scheduler(

Mul8processors((1…n)(
Figure 2.2: High-level stream components that influence concurrency.

10

2.4

Direct Memory Access
There are two main advantages of using of DMA operations in GPU programming. First, data

copies are faster since the device does not require host intervention. Second, DMA enable the device to
read and write to host memory. For DMA transfers, page-locked host memory, also called pinned
memory, is used to map device address space for direct access. These enables kernels to read and write
host memory directly.

11

Chapter 3: Related Projects and Concepts
3.1

Overlapping operations
Numerous research efforts have focused on improving the performance of parallel applications

running on GPUs. For instance, CrystalGPU provides a high level of abstraction, overlaps
communication and computations, and is based on an execution pipeline. Its design consists of stages for
copying data to the device, for kernel execution, and for copying data from the [16]. A master thread
distributes work in round-robin fashion to host threads assigned to each device. Both CrystalGPU and
RPIOSS handle concurrent operations, but the former pre-allocates a GPU buffer to alleviate some of
the overhead, while the latter allocates storage concurrently.
CudaDMA employs warp specialization for improving memory bandwidth utilization [4] in
device global to shared memory transfers. While a set of dedicated warps copies data between on-board
and on-chip memory, additional warps focus on computations using the transferred data in a producerconsumer approach. Similarly, RPIOSS uses a pipeline with host threads for bidirectional transfers
between host and device memory and grid launching. It requires explicit data-copy specification, but it
removes the need for explicit device storage allocation and deallocation. RPIOSS and CudaDMA have
complementary roles, as the former transfers data between host and device memory while the latter
transfers device data between global and shared memory.
StarPU features a dynamic task scheduler for multi-GPU systems with an earliest finish time
policy based on estimates of task duration and data transfer cost. NUMA produces an affinity that
achieves 15% reductions in execution time and can be combined with prefetching, which typically
provides a 20% improvement over scheduling without prefetching. StarPU includes a virtual shared
memory (VSM) system that handles data transfers as required by task scheduling. Like RPIOSS, VSM
takes advantage of communication and computation overlapping to reduce overhead introduced by data
transfers [4].
SemCache focuses on improving application performance by eliminating redundant host-device
data transfers in libraries using distributed shared memory techniques with data-structure granularity.
RPIOSS also eliminates redundant communications at the highest abstraction level, as it uses set
12

functions for dependence checking at the kernel parameter level. While SemCache, like DSM systems,
offers automatic replication and migration, RPIOSS performs data transfers only as needed based on
kernel invocations [1].
A fifth effort uses a data pipe to overlap host and device operations through a memory
management unit [47]. Drawbacks of using a pipe in this project are that it requires extra copy
operations, and it uses operating system resources. In contrast, RPIOSS uses host shared memory to
work in user space and avoid redundant host-side copy operations.
Other research efforts focus on rearranging resources for improving concurrent execution. One
project uses an extension of the CUDA software stack to take advantage of the complete performance of
a multi-GPU architecture [47]. It applies context funneling to enable multiple host threads to share a
device context. The non-stop kernel [53] project introduces host-device message-based communication
during kernel execution. This approach uses pinned memory to pass messages and streams for
transferring the message buffer back and forth between the host and device memory. In addition, it can
provide more threads as necessary. Another work improves concurrency with an elastic kernel. It is
based on a non-serializing stream and elastic kernels that permit fine-grained control of resources [44].
In this project, hardware memory mapping enables adjustment of the number of threads and blocks
during grid initialization and time slices are used for kernel execution and data transfers.
Other works handle dependences dynamically to improve utilization of GPU resources. For
instance, Hyper-Q [8] removes dependences and allocates multiple host threads to a device through a
grid management unit. In addition, it generates multiple hardware work queues to reduce false
dependences. A final project is based on dependence tracking and offers functions to overlap kernel
execution and data transfers [32]. RPIOSS also determines dependences dynamically based on kernel
parameters aiming to reduce data transfers and rewriting of results on the device.
3.2

Virtualization
Virtualization is a well-known approach for managing coprocessors. For instance, previous work

has used virtualization to reduce the complexity of interaction between the main processor and

13

coprocessors for tasks such as transferring data between processors and between applications and the
operating system [20].
Previous research also virtualizes GPUs as data-parallel virtual machines [46]. This approach
hides and manages GPU components using a command processor that is responsible for resource
scheduling. Another earlier work uses GPUs through a virtual warp-centric programming model to
increase parallelism of unbalanced workloads [22]. The cooperative heterogeneous computing
framework, another example of prior work, uses a virtual layer to distribute workloads to both the host
and the device. It uses an intermediate representation to prepare the code for cooperative computing
before sending it to a CUDA device driver. This framework also has a memory model that assigns
memory space, accesses memory, launches a kernel, and merges distributed data. In addition, it uses
global block scheduling [26].
Similar to virtual machines at the hardware level, RPIOSS intercepts and redirects service calls.
For instance, vCUDA is a framework that provides GPU-based hardware acceleration. Its system
architecture provides a virtual library for the guest operating system that catches calls and redirects them
through a virtual machine monitor to virtual CUDA stubs in the host operating system, which in turn
passes down calls to the device driver [52]. GViM, another virtual machine at the hardware level,
provides support to GPU-accelerated virtual machines. GViM also uses a library in the host operating
system to capture, pack, and send calls to the frontend driver. In turn, the frontend driver interacts with
the backend in the host operating system to launch GPU operations and return their results to the host
[20].
3.3

Libraries
A research project on automatic CPU-GPU communication management and optimization [24]

used runtime libraries to manage early storage allocation and data transfer to the device. The aim of this
work was to remove cyclic host/device communication-execution-communication patterns. Another
earlier research was a responsive GPGPU execution model for runtime engines (RGEM) which included
a preemptive scheduler for memory data transfers. RGEM divided a non-preemptive memory transfer
into multiple preemptive data transfers that mapped a buffer in user space into a buffer in the operating
14

system space. Thus, the GPU were able to access data through direct memory access. In addition,
RGEM scheduled kernels based on their priorities [23]. A similar work presented in [31] provided APIs
to connect computations and a compiler that translated API calls to native code. Twin Peaks, a software
platform for OpenCL, enabled GPU applications for execution in CPU cores. It had a runtime system
that scheduled work through a command queue for memory allocation, data transfers, and kernel
execution; the operating system scheduled system level threads [19]. RPIOSS has user-level API
functions to schedule workloads and manage interactions between the operating system and device
driver, thus supporting interactions between the host and device processing cores.
3.4

Software Approaches
Various environments may be used to implement parallel algorithms, but those based on

language extensions possess a combination of longevity and a wide, diverse application base. One of
them is the Message Passing Interface (MPI) – one of the oldest and most popular parallel programming
environments for distributed and clustered architectures. At the beginning, the MPI message-passing
specification was supported in C, C++, and FORTRAN. Recently, however, other bindings are available
in Java, Python, Perl, and R. The message-passing model instantiates a program as a set of independent
processes that interact with one another by sending messages for data transfer and synchronization. MPI
has both blocking and non-blocking functions for asynchronous execution of a process. However,
additional care is needed to avoid race conditions. MPI processes are mapped to one or more processors,
while message passing is an abstraction of the underlying communication network. The model implies
that each processor may have a portion of the data, which creates opportunities for high performance on
NUMA architectures since processes access local data more often than non-local data. Another
important characteristic is that most of the programs follow the SPMD (single program multiple data)
approach since different processes execute the same program [18, 49].
OpenMP is a set of API functions and compiler directives for parallel programming on sharedmemory multiprocessor architectures supported in C, C++, and FORTRAN. This model instantiates a
program as a multithreaded process where light threads are mapped to available processors and data
transfers and synchronization are implemented on shared memory. In an OpenMP process, a master
15

thread executes the sequential sections of the program and forks threads to execute parallel sections.
Performance improvements are obtained when the overhead of thread fork, join, and synchronization
operations is negligible compared to the benefit of parallel execution [18, 49].
The OpenCL (Open Computing Language), a parallel computing API based on C, has a
programming model similar to CUDA that can be used to exploit GPU-based heterogeneous
architectures. Platforms such as AMD/ATI and NVIDIA support the portability offered by OpenCL.
OpenCL and CUDA device architectures and conceptual model are similar. For instance, OpenCL
compute units and processing elements correspond to SMs and SPs in CUDA, respectively. Also, it has
a memory hierarchy similar to CUDA. However, the constant memory in OpenCL can be dynamically
allocated; the local memory in OpenCL is similar to shared memory in CUDA. The data-parallel
execution model of CUDA and OpenCL are similar as well. For instance, the concept of a kernel
function is the same for both OpenCL and CUDA; the terms grid, thread, and block of CUDA
correspond to NDRange, work item, and work group of OpenCL, respectively. OpenCL and RPIOSS
both manage a device through low level APIs. However, the host program in OpenCL creates the
context, determines the number and type of devices, and submits work for execution by creating
command queues with API function calls, while low-level operations of the host program in RPIOSS are
hidden from the developer [2, 25].
Finally, the OpenACC API consists of a set of compiler directives, library routines, and
environment variables. The host program can be written in C, C++, or FORTRAN. Compute-intensive
sections are mapped to the device as parallel or kernel regions. Similar to the CUDA execution model,
OpenACC programs need to transfer data from and to device and launch kernels. The OpenACC
compiler manages the communication and computation steps and device caches based on explicit
program directives [13, 25]

16

Chapter 4: RPIOSS System Model
The Runtime Pipeline I/O Scheduling System (RPIOSS) is a dynamic bidirectional interface
between processing on host cores and processing on device cores. RPIOSS aims to improve application
execution time by exploiting overlap of device input/output operations with host and device
computations. In the general case, an application will execute a number of kernels, and execution time
will be minimized by maximum computation and communication overlap as shown in figure 4.17.
However, this ideal scenario will not happen in an application due to algorithmic, host, and device
resource constraints that reduce the number of opportunities for overlap. A worst-case scenario would be
a sequential combination of segments of host core processing with one device data transfer or a kernel
execution at a time. RPIOSS takes advantage of potential overlap of computation and data transfers both
in the single-kernel and multiple-kernel case as shown in figure 4.18, in which grids are subgrids of
different kernels in the latter case. In the single-kernel case, RPIOSS overlaps device data transfers with
kernel execution by partitioning grid execution into a series of partial transfers and subgrid launches to
reduce the latency of kernel launches. In the multiple-kernel case, RPIOSS combines this strategy with
kernel data-dependence analysis to remove unnecessary device storage handling and data transfers
occurring between contiguously executed kernels from the device perspective.

Figure 4.1: Execution timeline with the minimum execution time achieved by
the maximum, ideal overlap of processing and data transfers.
17

Figure 4.2: Single-kernel and multi-kernel execution timeline showing scheduling of input,
output, and kernel operations partitioned into partial transfers and subgrids to maximize
overlap of operations.
RPIOSS software architecture is based on a concurrent processing pipeline where stages
asynchronously effect host processing, device I/O operations, device computations, and management
overhead tasks. From the application perspective, this strategy may produce an improvement of
execution time with respect to non-streamed device execution. RPIOSS provides a low-overhead
abstract API that, in addition to its main endeavor, simplifies development of high-performance GPU
computing applications. The design of the RPIOSS architecture addresses the following challenges:
•

Hiding of implementation details of API functions for device data transfers and processing.

•

Capturing and redirecting grid inputs and outputs followed by scheduling of operations on the
pipeline stages.

•

Reducing the cost of data transfers and storage allocation and deallocation.

•

Managing asynchronous interleaved operations of the pipeline stages for maximizing utilization
of processing cores.

•

Analyzing and managing kernel data dependences dynamically to avoid redundant transfers.

Given that host memory is managed by processor hardware, the operating system, and application
code and device memory is handled by device hardware and application code, i.e., they are logically and
physically separate, device processing requires an inter-memory data transfer preamble and postamble.
Some device memory handling can be done statically through storage type declarations both on the host
code and on the device code. However, the rest of the memory handling is done dynamically. In
18

addition, most of the data to be processed by the device must be dynamically transferred with a device
bandwidth below that of the processor main bus. All of these operations can become the main sources of
overhead in GPU computing. Approaches to lessen their impact include management of memory with
non-blocking functions and pinned memory and data transfers using DMA.
A high-level structure of the runtime pipelined I/O scheduling system is shown in Figure 4.3. To
improve utilization of host and device processing cores, the software pipeline overlaps the steps for
storage allocation/deallocation, communication, and execution of GPU computing applications. In the
most general case, the host process asynchronously submits pipelinable data and kernel computations to
RPIOSS while proceeding to perform processing on the host until RPIOSS output is needed and
available. RPIOSS determines data dependences to schedule a software pipeline in the device that
produces correct output with a minimum of data transfers and kernel launch latency.

Figure 4.2: High-level structure of RPIOSS.
4.1

Virtual Layers
RPIOSS virtual layers, which are shown in Figure 4.3, specify well-defined logical interfaces.

They have the advantages of hiding implementation details, offering a common interface for similar
devices, and managing the software and hardware resources. The foundation is based on NVIDIA19

provided [39] low-level device driver APIs to manage device storage, transfer data between host and
device, and launch kernels. Using these APIs, high-level language programs can invoke low-level
primitives provided by the device driver. RPIOSS has two virtual layers. Similar to the virtual layer
presented in [18, 49], the first layer enables interaction between the host code and RPIOSS by
intercepting and redirecting input and output parameters. The following RPIOSS API functions are
included in the first layer:
•

rpiossInit establishes the device context for a kernel, a vector to process, a number of stream
processors, and check if the analysis of dependence and the generation of subgrids are on or off.

•

rpiossData specifies input/output data for a kernel, and check if zero copy is on or off.

•

rpiossGrid specifies a kernel execution configuration.
Functions rpiossInit, rpiossData, and rpiossGrid redirect input and output data to dynamically

access them from an RPIOSS data structure in host shared memory. This removes the need for
redundant data copying from the host user level. Figure 4.3 illustrates how RPIOSS API functions can
be used by a GPU computing application.
rpiossInit(P,Vs, Sn, DependencesFlag, SubgridFlag);
rpiossData(a, IN, ZeroCopyFlag);
rpiossData(b, IN, ZeroCopyFlag);
rpiossData(c, OUT, ZeroCopyFlag);
rpiossGrid(K, Bx, By, Bz, Tx, Ty, Tz, Ms);
rpiossRun();
Figure 4.3: Vector addition example using the API functions of RPIOSS.
Vector addition examples using RPIOSS API functions and CUDA API functions are shown for
comparison in Figure 4.4 and Figure 4.5, respectively.

20

rpiossInit(vectorAddPtx, numOfElems, numOfStreams, OFF, OFF);
rpiossData(hostA, IN, OFF);
rpiossData(hostB, IN, OFF);
rpiossData(hostC, OUT, OFF);
rpiossGrid(kernel, Bx, By, Bz, Tx, Ty, Tz, 0);
rpiossRun();
Figure 4.4: Vector addition using RPIOSS.

// allocate device memory
cudaMalloc((void **) &deviceA, sizeBytes);
cudaMalloc((void **) &deviceB, sizeBytes);
cudaMalloc((void **) &deviceC, sizeBytes);
// copy data to device
int portion = numOfElem/numberOfStreams;
for(i = 0; i < numberOfStreams; i++) {
cudaMemcpyAsync(deviceA+i*portion, hostA+i*portion,
portionBytes, cudaMemHostToDevice, stream[i]);
cudaMemcpyAsync(deviceB+i*portion, hostB+i*portion,
portionBytes, cudaMemHostToDevice, stream[i]);
}
// kernel configuration and invocation
dim3 thread(Tx, Ty, Tz);
dim3 block(Bx, By, Bz),
kernel<<<block, thread>>>(deviceA, deviceB, deviceC);
// copy result back to host
for(i = 0; i < number_of_streams; i++) {
cudaMemcpyAsync(hostC+i*portion, deviceC+i*portion,
portionBytes, cudaMemHostToDevice, stream[i]);
}
// Synchronize streams
…
// Deallocate device memory
…
// Destroy streams
…
Figure 4.5: Vector addition using high-level CUDA API functions.

21

The second layer in Figure 4.3 enables interaction between RPIOSS and the device
through the device driver API. RPIOSS releases both single-kernel and multi-kernel application
workloads from the need to explicitly access low-level device interface functions when aiming to
obtain the greatest achievable performance. Stages of the RPIOSS pipeline, as illustrated in
Figure 4.6, are exposed to host processes through RPIOSS API calls that use host shared
memory to execute tasks.
4.2

Pipeline I/O Scheduling
RPIOSS consists of a manager, a dependence subsystem, a host-shared memory subsystem, and

a pipeline subsystem, as shown in Figure 4.6.

Figure 4.6: RPIOSS functional blocks.
4.2.1 Manager
The manager, shown in Appendix K, initializes the dependence subsystem, which removes
redundant data transfers when the host process invokes more than one kernel; invokes the pipeline
subsystem to redirect input and output of host shared memory; destroys host threads; and releases
synchronization resources. The host shared memory subsystem, which runs in user space, handles
RPIOSS data structures used to access application data and synchronization resources.

22

4.2.2

Host Shared Memory Subsystem
The host shared memory subsystem, shown in Appendix L, maps memory into the address

spaces of independent host threads that share a memory region and removes the need to use operating
system calls to exchange data and additional information required by RPIOSS. The subsystem also
controls shared structures that store synchronization and application variables that enable the virtual
layers and the manager to handle application parameters and data with little overhead. This subsystem
uses synchronization and mutual exclusion for robust management of shared application data. In
addition, this subsystem controls set variables and functions that keep track of dependences among
parameters of multi-kernel applications.
4.2.3

Dependence Subsystem
The dependence subsystem, shown in Appendix M, automatically checks kernel parameters for

dependences when an application invokes two or more kernels. Using set functions for dynamic
analysis, this subsystem removes redundant data transfers between the host and the device in either
direction, thus reducing the cost of allocating and deallocating storage, transferring data, and rewriting
data in device memory. For instance, Figure 4.7 shows invocation of three kernels – k1, k2, and k3. Since
k1 and k2 share a common parameter, b, the dependence subsystem creates Setabcxy to handle these two
kernels and avoid allocating and copying twice parameter b. Kernel k3 does not have any common
parameter with k1 and k2. Thus, the dependence subsystem creates Setmnt to handle k3. The overlapping
of communications and computations and the elimination of redundant data transfers offset overhead
introduced by RPIOSS functionality as gird size and data size grow. Another advantage of using
RPIOSS to automatically remove redundant data transfers is the elimination of the need to develop
redundant, more complex, or tightly coupled kernels aimed to manually remove redundant data transfers
to improve application performance.

23

k1(a, b, c);
k2(b, x, y);
k3(m, n, t);
Figure 4.7: Vector addition example using RPIOSS.
Once the dependence subsystem associates parameter subsets with kernels, RPIOSS maps
application processing to a set of discrete tasks implemented by a workflow through its pipeline stages
resulting on a host-device execution timeline as shown in Figure 4.2. Overlapping of work within stages,
on the processor and the device aims to both improve performance and utilization of the host and the
device.
4.2.4

Pipeline Subsystem
The pipeline of RPIOSS consists of four stages as shown in Figure 4.6 and their programs are

shown in Appendix N through R. While the configuration analysis stage determines the sizes of blocks
of data to transfer, the allocation stage defines blocks of data to transfer and allocates storage in the
device global memory. Once the first stage allocates storage for at least one data block, the second stage
copies data from host to device. As soon as input data blocks are transferred, the third stage invokes the
associated kernel subgrid. After kernel execution, the fourth stage copies the result to the host and
deallocates device storage if not needed by further processing on the device. Each stage receives inputs
from the previous stage and produces outputs for the next stage. The first stage takes host inputs and the
last stage produces the final outputs for the application running on the host. For this, RPIOSS uses light
host threads that enable concurrent program execution. Light host threads share their address space and
other resources. Light threads are intended to run in user space rather than operating system kernel
space, but this is thread-library implementation dependent. The disadvantage of using the operating
system kernel space is that it requires additional copy operations, which can generate overhead that may
affect application performance.

24

Figure 4.8 depicts the dynamic structure of RPIOSS. Gray squares represent resource sharing,
lines connecting boxes represent actions, the synchronization primitives are mi, i = 0..3, and the host
threads are Tj, j = 0..4. Figure 4.9 shows the algorithm for each stage. Contiguous pipeline stages behave
as producer/consumer pairs. For instance, T1 in the Allocation stage and T2 in the Transfer To Device
stage are one of such pairs. Once the former allocates storage on device global memory, the latter
transfers data blocks to the allocated space. The figure illustrates RPIOSS use of semaphores for
synchronization, represented by wait and post, to determine when a stage may consume and produce
elements and avoid race conditions.

Host#shared#memory#

Pipeline#ﬂow#

T0#:#Conﬁgura-on#
#######analysis#
m0#
T1#:Alloca-on#

m0#
m1#

T2#:Transfer#to#device#
m1#
m2#
T3#:Kernel#launcher#
m2#
m3#
T4#:Transfer#to#host#
m3#

Synchroniza-on#
m0#:#Synchroniza-on#
wait#
data#
post#
m1#:#Synchroniza-on#
#wait#
data#
post#
m2#:#Synchroniza-on#
#wait#
data#
post#

data:#App.#data#struct.#
read#
write#

m3#:#Synchroniza-on#
wait#
data#
post#

Figure 4.8: Data and synchronization structure of RPIOSS.

25

for i = 0, partitions − 1
wait (put/get)
for j = 0, elements − 1
lock buffer
produce/consume elements post (put/get)
end
...
// additional lines to produce elements
...
wait
lock buffer
produce/consume elements
unlock buffer
post
end
Figure 4.9: Stage Algorithm.
All the stages use an algorithm similar to the one in Figure 4.9. The algorithm is divided into two
main sections before and after the line following the nested for. Called for short the reference line, it
represents the consumer and the producer respectively for each stage, except the last stage that does not
have the producer part. For instance, before the reference line, the first stage consumes the elements by
wrapping the low level CUDA device driver API for allocating memory (e.g. cuMemAlloc) on the GPU,
based on size of blocks provided by the configuration analysis stage. After the reference line, this stage
produces the elements for the next stage by writing into a buffer. Similarly, the second stage consumes
the element by wrapping the low level CUDA device driver API for copying data asynchronously (e.g.,
cuMemcpyAsync) from the main memory of the host to the global memory of the device. After the
26

reference line, this stage produces the elements for the next stage by writing into a buffer. The third
stage executes some additional tasks before the reference line. It identifies whether the input is a scalar
or a vector since the allocation and transfer operations are for vectors. After that, this stage configures
and launches the kernel asynchronously by invoking the primitive provided by device driver (e.g.,
cuLaunchKernel), based on the input parameters given by the user such as grid dimension and name of
the kernel, which is stored in host shared memory. After the reference line, the third stage produces the
elements for the last stage by writing them into a buffer. Finally, the last stage just consumes the
elements by transferring data back asynchronously from the global memory of the device to the main
memory of the host through a low level CUDA API function (e.g. cuMemcpyAsync).
4.3

CUDA Context
A CUDA context contains information about the driver that the applications need. There is one

context for each GPU application. Therefore, a single host thread creates the CUDA context and
accesses it in the context stack through the low level driver API. It is possible that a single host thread
can create more than one context by pushing and popping the context in turn through
cuCTxPushCurrent and cuCtxPopCurrent respectively into the context stack. At any time, only one of
the contexts is active.
Using a similar approach, RPIOSS takes advantage of the context stack that enables the host
thread of each pipeline stage to share a CUDA context and have access to one device. To do this, the
functions located in each stage of the pipeline use the low-level device driver API to change a CUDA
context to other host threads by popping the context and then pushing it from another host thread. The
synchronization primitives become very important to avoid race conditions or system crashes [13, 54].

27

Chapter 5. Results using RPIOSS
This chapter describes benchmarks, methods, metrics, and quantitative and comparative analyses
of using baseline non-streamed and RPIOSS-based execution for GPU computing, which were presented
in [49]. Discussion focuses on application and system behavior. However, application development is
simplified when using RPIOSS as discussed and illustrated in Chapter 4.
5.1

Experimental Setup
A number of benchmarks were run to determine RPIOSS impact on performance of GPU

computing applications with the following experimental setup:
•

Hardware. Experiments were conducted on the Stampede computer of the Texas Advanced
Computing Center. The machine is a Dell PowerEdge C8220 cluster with 2.6 GHz Intel
Xeon Phi coprocessors. The machine has 16 nodes, each with 1 TB of RAM and 32 cores,
and 128 standard compute nodes, each with an NVIDIA Tesla K20 GPU [54].

•

Software. RPIOSS was implemented in C++ using the CUDA v5.5 driver API.

•

Benchmarks. Applications, which are described in detail below and included as source code
in the appendices, written by third parties in C++ using CUDA v5.5 and the NVIDIA SDK
without using streams for computation and communication include the following: Rodinia [8,
48], an application taken from [25], vector addition (VA), Black-Scholes (BS), Nearest
Neighbors (NN), 1D Convolution (C1D), Histogram (HIS), and seismic tomography (ST).
VA, BS, and HIS represent entire benchmark suites, based on the research presented in [58],
in which authors used a clustering technique to select a subset of benchmarks and avoid
running groups of benchmarks that have similar workload behavior.

5.2

Metrics
RPIOSS attempts to improve the performance of the applications by overlapping allocation,

communication, and computation operations. To evaluate the impact of RPIOSS with respect to the
baseline non-streamed execution, this work employs the following metrics based on performance
optimization guidelines [40]:
28

•

Maximize utilization by maximizing parallel execution.

•

Maximize memory throughput by optimizing memory usage.

•

Maximize instruction throughput by optimizing instruction usage.

Based on the guidelines above, the following metrics are used to analyze RPIOSS impact on
performance:
•

Speedup is a metric that provides the ratio between a baseline execution time of the
application and the execution time of the application after a modification under
evaluation. This metric is important for two reasons. First, this metric allows
determination of the overall impact of RPIOSS with respect to the baseline. Second,
speedup can be used to determine the data-set size threshold that RPIOSS needs to
compensate for the overhead it introduces.

•

Global throughput is the sum of the requested global memory load throughput and
requested global memory store throughput. This metric indicates global memory
demanded by a kernel and is related to the effective bandwidth calculation [41]. The
importance of this metric is that it serves as a reference point to determine data transfer
rates.

•

Utilization of issue slots provides the percentage of issue slots that issued at least one
instruction, averaged across all cycles. The significance of this metric is that it provides a
measurement for ILP. Issue slots, which may be a limiting factor for ILP, are resources
for processing instructions.

•

Occupancy is the number of warps concurrently running on a SM divided by the
maximum number of warps that can reside on a SM. This metric is important since it is
related to TLP. Running many threads per SM hides memory latency and helps to
achieve the highest possible device utilization.

•

Number of instructions executed indicates improvements in the instruction throughput,
which can be achieved by minimizing the number of arithmetic instructions, divergence,
and synchronization operations [40].
29

•

Percentage of time of GPU operations denotes the impact of the communication and
computation operations, which is helpful to determine when these operations may
provide a performance improvement.

These metrics are used to analyze the results of running benchmarks except for ST since the
profiler did not report percentage of issue slots, percentage of occupancy, and number of
instructions executed.
5.3

Methodology
GPU computing applications selected as benchmarks were used to evaluate the performance

impact of overlapping storage allocation/deallocation, data transfers, and kernel computations using
RPIOSS. The performance of each application was measured both running with RPIOSS and running
the original, non-streamed base software. Workload sizes, number of threads, and number of streaming
multiprocessors vary according to the benchmark characteristics. For instance, vector addition, BlackScholes, Nearest Neighbors, 1D convolution, and histogram test runs used workload storage ranging
from 64 to 65536 Kbytes; thread numbers ranging from 32 to 128; and number of streaming
multiprocessor ranging from 2 to 8. The seismic tomography workload is based on shot points obtained
from data sets created with the Potrillo Volcanic Field experiment [5] with a range from 7 to 56 shots,
the number of threads used is 32, and the number of streams is 5. Elapsed time measurement started at
storage allocation and finished after receiving results at the host. Reported time is the average of five
runs of each test as shown in Appendix A through E. RPIOSS dynamically determines the number of
blocks to use for each test based on the workload size and the number of threads used.
The validation of results depends on the benchmarks. For instance, the validation of results of the
CUDA SDK benchmarks are automatic, since these benchmarks have a fragment of code that tests the
output without human intervention, by comparing the values of the serial version and the parallel
version. If the output values are the same, the test indicates pass. Otherwise, it indicates fail. The
validation of the result for other benchmarks needs human intervention. It was done by comparing the
output values between the non-streamed execution and the RPIOSS version.

30

5.4

Benchmark Descriptions and Analyses of Results

5.4.1

Vector Addition
Vector addition is defined as
,

where
, and

and

are vectors in

=

, i.e.,
.

This operation, like many other inherently parallel operations, is implemented as some loop in a
sequential program. However, CUDA provides functionality to define a kernel where vector element
additions can be expressed in terms of a primitive task that implements
concurrent instances of thread ,

to be computed by

.

The Vector Addition (VA) program performs a parallel addition of size-n vectors by adding
vector A to vector B and placing the result in vector C. Figure 5.1 shows pseudocode for storage
management, data transfer, and grid launch operations of the base program, which allocates space for
vectors A, B, and C in device global memory, copies to the vectors A and B from host to device,
computes the vector operations to find the value of C, transfers the result back from the device, and
deallocates device storage. Similarly, Figure 5.2 shows VA pseudocode using RPIOSS, which internally
performs the same operations as the base program for device I/O and grid launching. However, RPIOSS
overlaps operations, which require partitioning vectors A, B, and C into subvectors A0, A1,…, Am; B0, B1,…,
Bm; and C0, C1,…, Cm for distribution into a number t of streams. For instance, RPIOSS allocates storage,
copies data to and from the device, and computes using vectors A0, B0, and C0 with stream 0. Then, it
performs the same operations for vectors Am, Bm, and Cm with stream m modulo t.
The VA base program manages the mapping of threads to the elements of vectors A, B, and C.
Vectors are one-dimensional (1D), threads are in a 1D grid, and each thread computes one value of the C
vector. Similarly, RPIOSS implements mapping of threads to values of the A0…m, B0…m, and C0…m
subvectors. Since subvectors are 1D also, threads are organized in a 1D subgrid, and each thread
computes one value of the C0…m subvector. This approach enables the VA base program as well as the
RPIOSS VA to perform the same instruction stream using as many concurrent threads as supported by
the device. In addition, threads coalesce memory requests for read and write operations within each
warp.

31

Allocate device storage for vector A
Allocate device storage for vector B
Allocate device storage for vector C
Copy host A data to device
Copy host B data to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch VA grid
Copy device C data to host
Figure 5.1: Pseudocode for storage management, data transfer,
and grid launch operations of the VA base program.

Initialize RPIOSS
Specify input vector host A
Specify input vector host B
Specify output vector host C
Specify VA grid
Run RPIOSS
Figure 5.2: Pseudocode for VA using RPIOSS for GPU I/O and execution.
Results in Figure 5.3 show that for data set sizes below 4 MB speedup obtained ranges from 1x
and 1.06x. Above the 4 MB threshold, however, performance improvements with RPIOSS are
noticeable; this benchmark achieved speedups of 1.44x when the data size was 64 MB.

32

1.60#
1.40#

Speedup&

1.20#
1.00#
0.80#
0.60#
0.40#
0.20#
51
2#
10
24
#
20
48
#
40
96
#
81
92
#
16
38
0#
32
76
8#
65
53
6#

25
6#

12
8#

64
#

0.00#

Data&size&(Kbytes)&

Figure 5.3: RPIOSS Speedup for Vector Addition.
Additional metrics obtained running the benchmarks include occupancy, issue slot utilization,
global throughput, and number of instruction executed. A clear advantage of using RPIOSS is the
improvement of throughput, shown in Table 5.1, which is 63% better than throughput of the base
benchmark. In addition, utilization of issue slots improved 125%. Results show also that RPIOSS-based
VA executed 45% fewer instructions and it had a 342% better occupancy using RPIOSS.
Table 5.1: Metrics for Vector Addition.
Metrics

Base

RPIOSS

Global throughput (GB/s)

79

129

Issue slot utilization (%)

12

27

Occupancy (%)

19

84

Instructions executed

1,573,944 862,413

Results also show the percentage of time used to copy to and from the device and for kernel
execution. RPIOSS balanced the work of data transfer and execution operations as shown in Figure 5.4.
Therefore, as device operations tended to be balanced, thread level parallelism (TLP) improved in the
system. Furthermore, the fraction of time of execution of the GPU kernel tends to be higher using
33

RPIOSS, as the device has a smaller latency and more warps to process according to the higher
occupancy, which introduces abundant opportunities for TLP. The figure also indicates that the system
is hiding communication overhead, as subgrids are launched after partial data transfers and results start
to be transferred back to main memory as soon as they become available, instead of waiting until the

Percentage)(%))

whole result is ready.
100%#
90%#
80%#
70%#
60%#
50%#
40%#
30%#
20%#
10%#
0%#
Base#VA#

RPIOSS#VA#
Benchmarks)

Execu<on#kernel#

Copy#deviceHhost#

Copy#hostHdevice#

Figure 5.4: Percentage of time of each GPU operation for Vector Addition.
5.4.2

Black-Scholes
Black-Scholes (BS) option pricing is a mathematical model for financial markets that was

developed by Fischer Black and Myron Scholes [7]. It is used for computing the price of put and call
options granted by the holder. There are several kinds of options such as the European-style and the
American-style. The former can operate on the expiration date, and the latter may operate on the
expiration date or any time thereafter.
The option seller grants the option buyer the right to exercise the option at a predetermined strike
price on the expiration date, , (European style) or at a time thereafter (American style). While a call
option gives the right to buy the asset, a put option gives the right to sell it [48]. The profit for a call at
the exercise date is the price of the asset minus the strike price minus the paid option price. The profit
for a put is the strike price minus the asset price minus the paid option price. Considering the riskless
34

rate of return, , as the annual interest rate of bonds or other riskless investments, an investment of
dollars is guaranteed a worth of
currently worth

after

years. Conversely, an asset worth

dollars in

years is

. The Black-Scholes model of an option price is a partial differential equation

that has a closed-form solution for European-style options as follows:

  

where

is the price of an option call,

distribution function,

is the price of an option put,

is the current option price,

is the strike price,

the continuously compounded risk free interest rate, and

is the cumulative normal
is the time to expiration,

is

is the implied volatility of the asset .

Figure 5.5 shows BS pseudocode for storage management, data transfer, and grid launch
operations of the base program, which performs device operations for allocation, copy, and computation
on vectors option strike (OS), option price (OP), option year (OY), call result (OR), and put result (PR).
The BS kernel uses also two scalar values. RPIOSS breaks the vectors into chunks according to a
number of SMs. Figure 5.6 shows BS pseudocode using RPIOSS, which internally allocates storage and
copies from host to device a chunk of the OS0, OP0, and OY0 with stream 0. Once computations are
done and the results are in vectors OR0, and PR0, the system copies the results from device to host.
Similarly, RPIOSS allocates storage and copies from host to device a chunk of the OSm, OPm, and OYm
with stream m. Once computations are done and the results are in vectors ORm and OPm, the system
copies the results from device to host.

35

Allocate device storage for vector OS
Allocate device storage for vector OP
Allocate device storage for vector OY
Allocate device storage for vector OR
Allocate device storage for vector PR
Copy host OS data to device
Copy host OP data to device
Copy host OY data to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch BS grid
Copy device OR data to host
Copy device PR data to host
Figure 5.5: Pseudocode for storage management, data transfer, and
grid launch operations of the of Black-Scholes base program.
Initialize RPIOSS
Specify input vector host OS
Specify input vector host OP
Specify input vector host OY
Specify output vector host OR
Specify output vector host PR
Specify scalar1 and scalar2
Specify BS grid
Run RPIOSS

Figure 5.6: Pseudocode for Black-Scholes using RPIOSS for GPU I/O and execution.

The BS base program manages the mapping of threads to elements of the OS, OP, and OY
vectors. These are 1D vectors. BS arranges threads on a 1D grid and has each thread in the grid compute
one value for the OR vector and one value for the PR vector. Similarly, RPIOSS assigns threads to
process the values of the OS0..m, OP0..m, and OY0..m 1D subvectors. Threads are assigned using a 1D
subgrid where each thread computes one value of the OR0..m and one value of the PR0..m subvectors. This
approach allows both the BS base and RPIOSS BS to concurrently execute the same instruction with

36

multiple threads. In addition, both the base code and the RPIOSS code use device shared memory for
explicit data caching to reduce memory access latency and they coalesce global memory requests.

1.40#
1.20#
Speedup&

1.00#
0.80#
0.60#
0.40#
0.20#

51
2#
10
24
#
20
48
#
40
96
#
81
92
#
16
38
4#
32
76
8#
65
53
6#

25
6#

12
8#

64
#

0.00#

Data&size&(Kbytes)&

Figure 5.7: Speedup for Black-Scholes.
The result in Figure 5.7 shows that for data set sizes below 4 MB speedup obtained ranges from
1x to 1.05x. Above the 4 MB threshold, performance improvements with RPIOSS were better.
Benchmarks achieved speedups of 1.31x when the data size was 64 MB.

Table 5.2: Metrics for Black-Scholes.
Metrics

Base

RPIOSS

Global throughput (GB/s)

40

63

Issue slot utilization (%)

24

23

Occupancy (%)

23

22

Instructions executed

11,996,672 2,064,235

Execution metrics obtained running the benchmarks include occupancy, issue slot utilization,
global throughput, and number of instruction executed. A clear advantage of using RPIOSS is shown in
37

the improvement of throughput in Table 5.2, which is 58% better than the value obtained with the base
benchmark. Utilization of issue slots and occupancy were approximately the same for both programs.
However, the RPIOSS-based benchmark executed 83% fewer instructions. Results show also the
percentage of work used to copy to and from the device and for kernel execution. BS is an intense
computation benchmark. The percentage of time of the kernel execution was about the same for both
programs as shown in Figure 5.8.

Figure 5.8: Percentage of time of each GPU operation for Black-Scholes.
5.4.3

1D Convolution
Convolution is used in a number of applications including signal processing, digital recording,

image processing, video processing, and computer vision. For instance, to convolve in the space domain
an image and a convolution kernel, each result image value is calculated as the weighted sum of
neighboring elements of the corresponding input value multiplied by the weights in the kernel.
Convolution is defined as
function,

is a convolution kernel,

where

is an input

is the output function, indicates convolution, and

indicates multiplication.
In the case of 1D convolution, the input is a 1D array of image values and the convolution
kernel, also referred to as the mask, is a 1D array of weights that transform the input image in some way.
38

For the input function

and the convolution kernel

the transformed output

,

. In the case of values near the beginning

or the end of the input function, missing function or kernel elements can be taken as a constant for
simplicity. Since this computation is embarrassingly parallel, a convolution primitive task can be
implemented as a CUDA kernel that computes the internal product between a subset of the input
function, i.e., the neighborhood of the

input value,

convolution kernel

after either one of them horizontally is reversed and

translated. For instance, the value of the

and the

output element would be calculated as

where

,

and

.
One-dimensional convolution (C1D) is a filter that executes a series of operations for each output
element based on the sum of values times some weights of the closest elements, which are referred to as
the mask or the convolution kernel. When there are a few values in a mask, device constant memory or
shared memory can be used to store them. However, if the mask is large, it can be stored in device
global memory instead. Figure 5.9 and figure 5.10 show pseudocode for storage management, data
transfer, and grid launch operations of the C1D base and its RPIOSS C1D counterpart, respectively.
Both the base and RPIOSS versions manage the mapping of threads to the computation of elements of
the output vector and subvectors, respectively. Both the output vector and subvectors are 1D. Threads
are arranged on a 1D grid or subgrid and each thread computes a value of the output vector or
subvectors. Each output value is computed by multiplying the input values times the values in
corresponding positions of the mask, adding the results, and dividing the sum by the number of elements
in the mask. Note that there are no values in the beginning and the end of an input array matching mask
positions to compute the first or last values of the output vector. To handle these boundary conditions,
the kernel defines the missing values as zero. These are called ghost elements.

39

Allocate device storage for Input
Allocate device storage for Mask
Allocate device storage for Output
Copy host Input data to device
Copy host Mask data to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch C1D grid
Copy device Output data to host

Figure 5.9: Pseudocode for storage management, data transfer,
and grid launch operations of the C1D base program.

Initialize RPIOSS
Specify input vector host Input
Specify input vector host Mask
Specify output vector host Output
Specify C1D grid
Run RPIOSS

Figure 5.10: Pseudocode for C1D using RPIOSS for GPU I/O and execution.
The C1D base program allocates an input and an output vector, copies the input vector to the
device and launches the C1D kernel. After the kernel finishes computations, the program transfers back
to main memory the results in the output vector stored in device global memory. RPIOSS operates in the
same way as the base C1D program. However, as in previous cases, RPIOSS overlaps data transfer and
computation operations. It allocates data blocks of the input and output vectors to a number of SMs and
copies a block of the input vector to each SM. Similarly to the base program, RPIOSS transfers the
results from device to host, but it transfers data blocks of the output vector into the host output vector as
they become available while other computations take place.

40

The results in Figure 5.11 show that for data set sizes below 4 MB, speedup obtained ranges
from 1x to 1.09x. Above the 4 MB threshold, RPIOSS obtained a speedup of 1.55x when the data size
was 64 MB.

1.40#
1.20#
Speedup&

1.00#
0.80#
0.60#
0.40#
0.20#

51
2#
10
24
#
20
48
#
40
96
#
81
92
#
16
38
4#
32
76
8#
65
53
6#

25
6#

12
8#

64
#

0.00#

Data&size&(Kbytes)&

Figure 5.11: Speedup for 1D convolution.
Metrics obtained running the benchmark includes occupancy, issue slot utilization, global
throughput, and number of instructions executed. An advantage of using RPIOSS is the improvement of
global throughput shown in Table 5.3, which is 26% better than the base benchmark. Utilization of issue
slots goes down by 16%. Results also show that RPIOSS-based benchmarks had 27% fewer instructions
executed and the same occupancy than the base program.

Table 5.3: Metrics for 1D convolution.
Metrics

Base

RPIOSS

Global throughput (GB/s)

95

120

Issue slot utilization (%)

19

16

Occupancy (%)

24

24

Instructions executed

44,670,221 32,545,155
41

C1D is the most compute-intensive benchmark used in this work. The percent of time of kernel
execution was virtually the same for the base program and for C1D using RPIOSS as shown in Figure
5.12.

Figure 5.12: Percentage of time of each GPU operation for 1D convolution.
5.4.4

Nearest Neighbors
Nearest Neighbors (NN) is one of the essential data-mining algorithms used to find the closest

objects based on a training set. Input data is a list of latitude and longitude coordinates of hurricanes.
The algorithm computes the Euclidean distance between each pair of objects to determine which objects
are the Nearest Neighbors of each object in the training set.
Figure 5.13 shows pseudocode for storage management, data transfer, and grid launch operations
of the NN base program, which allocates device global memory and copies the data into latitude and
longitude vectors. Then, the kernel computes pairwise distances in parallel and determines the closest
neighbors. Figure 5.14 shows NN pseudocode using RPIOSS for device I/O and grid launching, which
internally checks for data dependences to remove redundant operations if necessary, allocates device
global storage, copies data blocks of the latitude and the longitude vectors to the device, and launches

42

the kernel to perform the computations. Then, RPIOSS copies back the result from device to host with
the Nearest Neighbors for each object.
Allocate device storage for longitude
Allocate device storage for latitude
Allocate device storage for distance
Copy host longitude data to device
Copy host latitude data to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch NN grid
Copy device distance data to host
Figure 5.13: Pseudocode for storage management, data transfer,
and grid launch operations of the NN base program.
Initialize RPIOSS
Specify input vector host longitude
Specify input vector host latitude
Specify output vector host distance
Specify scalar
Specify NN grid
Run RPIOSS
Figure 5.14: NN pseudocode using RPIOSS for GPU I/O and execution.
The NN base manages the mapping of threads to the elements of the 1D latitude and longitude
vectors. Threads are arranged in a 1D grid and each thread computes a value of the distance vector.
Similarly, RPIOSS maps threads to the values of latitude and longitude subvectors. Subvectors and
subgrids are 1D. Each thread in a subgrid calculates a value of the distance subvector.
Results in Figure 5.15 show that for data set sizes below 4 MB, speedup obtained is less than 1x.
Above the 4 MB threshold, performance improvements with RPIOSS are noticeable; benchmarks
achieved speedups of 2.37x when the data size was 64 MB.

43

2.50#

Speedup&

2.00#
1.50#
1.00#
0.50#

51
2#
10
24
#
20
48
#
40
96
#
81
92
#
32
76
8#
65
53
6#

25
6#

12
8#

64
#

0.00#

Data&size&(Kbytes)&

Figure 5.15: Speedup for Nearest Neighbors.
Metrics obtained running the benchmarks include occupancy, issue slot utilization, global memory
throughput, and number of instructions executed. RPIOSS improved throughput, as illustrated in Table
5.4, more than 114% with respect to the base benchmark. Utilization of issue slots improved 150%.
Results also show that RPIOSS-based benchmarks executed 77% fewer instructions with a 844% better
occupancy.
Table 5.4: Metrics for Nearest Neighbors
Metrics

Base

RPIOSS

Global throughput (GB/s)

7

15

Issue slot utilization (%)

12

30

Occupancy (%)

9

85

Instructions executed

5,347,221 1,220,986

Regarding percentage of time used to copy to and from the device and for kernel execution,
RPIOSS obtained a better balance of operations as shown in Figure 5.16. The fraction of time for kernel
execution is higher using RPIOSS. This shows that communication overhead is hidden by overlapping
computations and indicates a source of the reduction of total execution time obtained with RPIOSS.
44

Percentage (%)

100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
Base NN

RPIOSS NN
Benchmarks

Kernel execution

Copy host-device

Copy device-host

Figure 5.16: Percentage of time of each GPU operation for Nearest Neighbors benchmark.
5.4.5

Histogram
A histogram, a non-parametric density estimator, is intuitively appealing because it provides a

visual representation of both the frequency and the relative frequencies of a set of values. Depending on
the application domain, values can be observations, results of simulations, results of calculations, and so
on. The procedure for constructing a common histogram for the set of values

in

, consists in defining a set of contiguous nonoverlapping intervals containing all the values in

and

counting the number of values in each of these intervals, which are usually referred to as bins. For
instance, to construct a histogram with
the

It

number

can

of

be

values

written

in

bins, assuming the bin origin is at

bin

is

, or

is 0 if

as

and the bin width is

where

is

1

,
if
.
for

.
The Histogram (HIS) benchmark determines the frequency of observations within a group of
ranges of values, which are referred to as bins, for information visualization in many applications
including data mining and image processing. The implementation uses a random stream of bytes that are
45

stored in an input vector. Once the histogram vector is allocated and all bins are set to zero, the program
calculates the frequency of each bin by counting the number of items in the input vector that fall within
the bin.
Figure 5.17 shows pseudocode for storage management, data transfer, and grid launch operations
of the HIS base program, which allocates device storage, copies the input and histogram vectors,
launches the kernel that counts the number of bin items in parallel, and accumulates the respective value
in the histogram vector. Finally, the base program copies the result to the host. Figure 5.18 shows
pseudocode for HIS using RPIOSS for device I/O and grid launching, which keeps the histogram vector
in device global memory where it needs to store the values computed by each SM. However, each SM
temporarily stores the result in shared memory while its computations take place. As in previous cases,
RPIOSS allocates device global storage and copies a block of the input vector to each SM, it launches
kernels to perform computations in parallel, and obtains the results stored in the histogram vector.
Allocate device storage for histogram
Allocate device storage for input
Copy host histogram to device
Copy host input to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch HIS grid
Copy device histogram to host

Figure 5.17: Pseudocode for storage management, data transfer,
and grid launch operations of the histogram base program.

Initialize RPIOSS
Specify input/output vector histogram
Specify input vector input
Specify HIS grid
Run RPIOSS

Figure 5.18: Histogram pseudocode using RPIOSS for GPU I/O and execution.
46

The application uses shared memory to store temporally the values of the histogram vector, to
reduce the latency of accessing memory. To allow multiple device threads to increase the same bin,
atomic operations are used to avoid data corruption. In addition, both the HIS base and the RPIOSS HIS
coalesce memory requests.
The HIS base manages the mapping of threads to the elements of the 1D input vector. Threads
are arranged on a 1D grid and each thread checks a value of the input vector. Similarly, RPIOSS
accomplishes the mapping of threads to the values of the 1D input subvectors. Threads are organized in
a 1D subgrid and each thread processes a value of the input subvector.
The results in Figure 5.19 show that for data set sizes below 4 MB, speedup obtained ranges
between 1x and 1.08. Above the 4 MB threshold, however, performance improvements with RPIOSS
reached speedups of 1.23x when the data size was 64 MB.

1.40#
1.20#
Speedup&

1.00#
0.80#
0.60#
0.40#
0.20#

51
2#
10
24
#
20
48
#
40
96
#
81
92
#
16
38
0#
32
76
8#
65
53
6#

25
6#

12
8#

64
#

0.00#

Data&size&(Kbytes)&

Figure 5.19: Speedup for histogram.
RPIOSS achieves an improvement of throughput, as illustrated in Table 5.5, 167% better than the base
benchmark, while utilization of issue slots and occupancy were the same for RPIOSS and the base
benchmarks. RPIOSS had 75% fewer instructions executed.

47

Table 5.5: Metrics for histogram benchmark
Metrics

Base

RPIOSS

Global throughput (GB/s)

3

8

Issues slot (%)

16

16

Occupancy (%)

22

22

Instruction executed

44670221 11242791

Results also show the percentage of time used to copy to and from the device and for kernel
execution. RPIOSS obtained a better balance in data transfers and computations as shown in Figure 5.20
indicating a TLP improvement. As in other benchmarks, the fraction of time of kernel execution tends to
be higher using RPIOSS, showing that it is hiding the communication overhead and reducing execution

Percentage (%)

time by overlapping operations.

100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
Base HIS

RPIOSS HIS
Benchmarks

Kernel execution

Copy host-device

Copy device-host

Figure 5.20: Percentage of time of each GPU operation for histogram.
5.4.6 Seismic Tomography
Seismic tomography (ST) creates a velocity model of the internal structure of the Earth [19, 52].
The main steps of the ST algorithm used in this benchmark are the computation of discrete first arrivals
48

throughout a velocity model and the inversion of traveltime residuals. A major ST performance
limitation is the generation of an updated velocity model in each iteration, as the algorithm smooths
velocity perturbations for better stabilization, convergence speed, and resolution of the new model. A
mean filter is used for smoothing the model while removing image noise with a progressively reduced
window size.
Input data used was obtained with the Potrillo Volcanic Field experiment [5], which was
designed to examine the crust of the Earth across southern New Mexico and Far West Texas. This
experiment, which was carried out in 2003, used 8 explosions, or shot points, and 793 seismometers, or
geophones, spaced at spans of 100 m, 200 m, and 600 m over a region with a length of 205 km.
The smoothing algorithm computes an interval along each axis that contains a 3D window to
smooth, accumulates the velocity perturbations of all cells within the window, and assigns the average
of the sum to the model vertex associated with the 3D window.
Figure 5.21 shows pseudocode for storage management, data transfer, and grid launch operations
of the ST base program, which allocates storage for input and the output vectors and copies the input
vectors to the device. This program has two kernels. The first kernel, kernelT, accumulates velocity
perturbations of each cell of the dus vector into the du vector. It also adds the ray hit count of each cell
of the nrays vector into the nrtot vector. The second kernel, kernelM, divides the values of the du vector
by nonzero values of the nrtot vector. Finally, the base program transfers the du and nrtot vectors from
device global memory to the host. Similarly, Figure 5.22 shows pseudocode of RPIOSS ST for device
I/O and grid launching, which first checks for kernel parameter dependences, since there are two
kernels, to avoid redundant operations. Next, RPIOSS divides vectors into blocks for storage allocation
and copies a block of the vectors into each SM. Then, the kernels are launched to compute the
operations in parallel. Ultimately, RPIOSS transfers the result to the host.

49

Allocate device storage for dus
Allocate device storage for du
Allocate device storage for nrays
Allocate device storage for nrtot
Copy host dus to device
Copy host nrays to device
Specify grid dimensions Bx, By, and Bz
Specify block dimensions Tx, Ty, and Tz
Launch ST gridT
Launch ST gridM
Copy device du to host
Copy device nrtot to host

Figure 5.21: Percentage of time of each GPU operation for histogram.
Initialize RPIOSS
Specify input vector host dus
Specify output vector host du
Specify input vector nrays
Specify output vector nrtot
Specify ST gridT
Specify input/output vector du
Specify input vector nrtot
Specify ST gridM
Run RPIOSS
Figure 5.22: Pseudocode for storage management, data transfer,
and grid launch operations of the VA base program.
Figure 5.23 shows the speedup of ST using different numbers of shot points. The speedup is
negligible for this benchmark for any number of shot points. As shown in Figure 5.24, the same happens
to the percentage of time for each device operation. This figure shows that both the ST base and the
RPIOSS ST implementations are highly compute-intensive with a 98% of the time spent on kernel
execution. While providing no speedup, RPIOSS reduced programming time and complexity of the
implementation.

50

1.01$
1.01$

Speedup&

1.01$
1.01$
1.00$
1.00$
1.00$
1.00$
1.00$
0.99$
7$

14$

21$

28$

35$

42$

49$

56$

Number&of&shot&points&

Figure 5.23: Speedup for seismic tomography.
1.000%
Percentage)(%))

0.995%
0.990%
0.985%
0.980%
0.975%
0.970%
Base%

RPIOSS%
Benchmarks)

Kernel%execu9on%

Copy%deviceAhost%

Copy%hostAdevice%

Figure 5.24: Percentage of time for each GPU operation for seismic tomography.
5.5

Data Transfer Bandwidth
RPIOSS leverages stream concurrency, as discussed in section 2.3, and DMA, as discussed in

section 2.4, to overlap data transfers between host main memory and device global memory with kernel
computations. Data transfers in this case are asynchronous and require using page-locked host memory
to support DMA operations. Figures 5.25 through 5.29 show comparisons of main memory access
throughput for each of the benchmarks except for ST, as performance of this benchmark was not
51

affected by RPIOSS in any of the tests. In all cases, benchmark programs use paged memory while their
RPIOSS implementations use pinned memory. While benchmarks using pinned memory show a more
scalable memory throughput, base programs outperformed their RPIOSS-based counterparts for all data
set sizes below 4 MB. This result has a strong correlation with speedups obtained with RPIOSS that are
shown combined in Figure 5.30.

Figure 5.25: Host memory throughput for Vector Addition.

52

Figure 5.26 Host memory throughput for Black-Scholes.

Figure 5.27: Host memory throughput for 1D Convolution.

53

Figure 5.28: Host memory throughput for Nearest Neighbors.

Figure 5.29: Host memory throughput for Histogram.
54

3.00

Speedup

2.50
2.00
1.50
1.00
0.50

36
65
5

68
32
7

80
16
3

2
81
9

6
40
9

8
20
4

4
10
2

2
51

6
25

8
12

64

0.00

Data size (Kbytes)
VA

BS

NN

C1D

HIS

Figure 5.30: Speedups obtained with RPIOSS.
5.6

Summary of Results
Results show an impact on speedup for VA, BS, NN, C1D, and HIS benchmarks when using

RPIOSS and varying the data set size. For data set sizes below 4 MB, obtained speedup ranges from
1.0x to 1.09x except for the NN benchmark. This region of the results indicates data set sizes for which
RPIOSS overhead either offsets its performance benefit or is excessive, as indicated by the fractional
speedup of NN. Above the 4 MB threshold, performance improvements with RPIOSS are noticeable.
For a 64 MB data set size, speedups are 1.44x for VA, 1.31x for the BS, 2.37x for NN, 1.55x for C1D,
and 1.23x for HIS. Obtained speedup in these cases compensates for overhead introduced by RPIOSS
when creating and destroying host threads, setting up and releasing synchronization resources, running
its pipeline, and accessing host memory. The speedup tends to increase as a function of the data set size,
which is known as the Amdahl effect [17]. Increasing the data set size improves computation time faster
than the increase in data transfer time and other overhead.

55

Advantages of using RPIOSS include the improvement of global memory throughput, which
ranges from 26% to 167% better than the base benchmarks. Utilization of issue slots slightly goes down
for BS and C1D, but improves from 125% to 150% for VA and NN, respectively. This result indicates
that RPIOSS does not improve the percentage of issue slots utilized by applications with intensive
computation such as BS and C1D. However, RPIOSS provides better utilization of issue slots for
applications with less intensive computation such as VA and NN.
Results also show that RPIOSS-based benchmarks executed 45% fewer instructions for VA,
83% fewer for BS, 27% fewer for C1D, 77% fewer for NN, and 75% fewer for HIS than the base
benchmarks. The number of instructions is reduced because each SM can execute instructions for a
number of warps depending on available resources [25]. For instance, VA and NN have 242% and 844%
better occupancy using RPIOSS, respectively; indicating that RPIOSS improves TLP as SMs had more
warps to schedule. However, the BS and the C1D benchmarks obtained a similar occupancy to their base
implementations. Results also indicate that RPIOSS-based benchmarks obtained better instruction level
parallelism (ILP), as RPIOSS reduces thread block size for scheduling. Because of this, device threads
could use more registers and avoid register spilling, which may affect overall performance [15].
Occupancy and utilization of issue slots were better for VA and NN when using RPIOSS. For the
largest data set sizes, the fraction of time used executing kernels tends to be higher using RPIOSS, as
kernel launch latency is reduced and the device has more warps to process thus increasing opportunities
for exploiting both TLP and ILP. For BS and C1D, percentages of kernel execution time remain very
high using RPIOSS because these benchmarks are very compute-intensive as shown in Figures 5.8 and
5.12.

56

Chapter 6: Conclusions and Future Work
This research presents a runtime pipelined I/O scheduling system that improves utilization of
host and device resources by abstracting and overlapping host and processor computations with explicit,
concurrent device I/O operations. RPIOSS uses host threads to manage the computation-communication
overlap through asynchronous streams that share device access and use DMA with page-locked host
memory. Results obtained running a set of benchmarks with RPIOSS show that speedup improves when
the size of data sets is greater than 4 MB – a threshold that depends on the experimental setup – and
reaches 2.37x when the data set size is 64 MB. Obtained speedup tends to improve as data set size
increases, thus indicating a better host and device utilization and a lessened impact of overhead. RPIOSS
reduced and hid the impact of overhead of data transfers between host and device memory, reduced the
latency of kernel-launching functions, and improved utilization of the host and the device through
improved thread level and instruction level parallelism.
Future research would extend RPIOSS to support multiple devices, which would require
functionality for device-driven data partitioning, load balancing, memory management. A second
extension would include extending RPIOSS to similar platforms such as OpenCL which require explicit
management of communication and computation for improved application performance. A third
extension to make RPIOSS more portable would focus on bindings to other programming languages
such as Java, Python, and FORTRAN. A fourth extension would address GPGPU support for hybrid
applications that leverage cluster and shared memory multiprocessor platforms.
Future work could include using a deep pipeline with parallel stages to lessen the effect of
dependences, leverage multi-kernel support in new device architectures, and explore speculation based
on availability of idle resources.
Another opportunity for future work would focus on scheduling integration and cooperative
coordination. Coarse to fine grain scheduling in a GPU computing platform includes scheduling of jobs,
processes, light-weight threads, streams, blocks, and warps. Except for the top level, RPIOSS could
interact with and adapt to other schedulers based on workload behavior and availability of resources.
57

References
[1] N. AlSaber and M. Kulkarni. SemCache: semantics-aware caching for efficient GPU offloading. in
Proceedings of International Conference on Supercomputing (ICS), 2013, pp.421-432.
[2] AMD. Introduction to OpenCL programming. http://http://developer.amd.com/tools-and-sdks/opencl-zone/openclresources/opencl-course-introduction-to-opencl-programming/, 2010 (accessed June 12, 2014).
[3] AMD. Accelerate Processing Units (APUs). http://www.amd.com/en-us/innovations/softwaretechnologies/apu, 2014 (accessed July 10, 2014).
[4] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namys. Data-aware task scheduling on multiaccelerator based platforms. In Proceedings of the 16th International Conference on Parallel and
Distributed Systems (ICPADS), IEEE, 2010, pages. 291-298.
[5] M.G. Averill. A lithospheric investigation of the Southern Rio Grande. PhD thesis, University of
Texas at El Paso, 2007.
[6] M. Bauer, H. Cook, and B. Khailany. CudaDMA: Optimizing GPU memory bandwidth via warp
specialization. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis (SC2012), Seattle, WA, Mar. 2011.
[7] F. Black and M. Scholes. The pricing of options and corporate liabilities. The Journal of Political
Economy, 81, pages 637-654. 1973.
[8] T. Bradley, Hyper-Q example (white paper), NVIDIA, Aug. 2012.
[9] I. Buck. GPU computing: Programming a massively parallel processor. In Proceedings of the
International Symposium on Code Generation and Optimization. IEEE, 2007, pages 17-17.
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, S.H Lee, and K. Skadron. Rodinia: A
Benchmark Suite for Heterogeneous Computing. In Proceeding of the International Symposium on
Workload Characterization, IEEE, pages 44-53. 2009.
[11] S. Che, M. Boyer, J. Meng, D. Tarjan, J.W. Sheaffer, and K. Skadron. A performance study of
general-purpose applications on graphics processors using CUDA. Journal of Parallel and
Distributed Computing, 68(10), 2008, 1370-1380.
[12] S. Cook. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes,
2012.
[13] Cray, NVIDIA, CAPS, and PGI. The OpenACC Application Programming Interface, 2011
(accessed June 12, 2014).
[14] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU cluster for high performance computing.
In Proceedings of the 2004 ACM/IEEE conference on Supercomputing. IEEE computer society,
page 47, 2004.
[15] R. Farber. CUDA Application Design and Development. Elsevier, 2011.
[16] A. Gharaibeh, S. Al-Kiswany, and M. Ripeanu. CrystalGPU: Transparent and efficient utilization
of GPU power. Technical report, 2010.
[17] S.E. Goodman and S.T. Hedetniemi. Introduction to the Design and Analysis of Algorithms,
McGraw-Hill, 1977.
[18] A. Gramma, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing. AddisonWesley, 2003.
58

[19] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B.R. Gaster, and B. Zheng. Twin peaks: a
software platform for heterogeneous computing on general-purpose and graphics processors. In
Proceedings of the 19th international conference on Parallel Architectures and Compilation
Techniques, pages 205-216. ACM, 2010.
[20] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan.
GViM: GPU-accelerated virtual machines. In Proceeding of the 3rd ACM Workshop on Systemlevel Virtualization for High Performance Computing, pages 17–24. ACM, 2009.
[21] J.A. Hole. Nonlinear high-resolution three-dimensional seismic travel time tomography. Journal of
Geophysical Research, 97, 1992, 6553-6562..
[22] S. Hong, S. Kim, T. Oguntebi, and K. Olukotun. Accelerating CUDA graph algorithms at
maximum warp. ACM SIGPLAN Notices 46, no. 8 (2011): 267-276.
[23] T. Jablin, P. Prabhu, J.A. Jablin, N.P. Johnson, S.R. Beard, and D.I. August. Automatic CPU-GPU
communication management and optimization. ACM SIGPLAN Notices 46, no. 6 (2011): 142-151.
[24] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, and R. Rajkumar. RGEM: A
responsive GPGPU execution model for runtime engines. In Proceedings of the Real-Time Systems
Symposium (RTSS), pages 57-66. IEEE, 2011.
[25] D.B. Kirk, and W. M. Hwu, Programming massively parallel processors: a hands-on approach,
Morgan Kaufmann, 2010.
[26] C. Lee, W.W Ro, and J.L. Gaudiot. Cooperative heterogeneous computing for parallel processing
on CPU/GPU hybrids. In Proceedings of the 16th Workshop on Interaction between Compilers
and Computer Architectures (INTERACT). IEEE, 2012.
[27] A. Leist, D.P. Playne, and K.A. Hawick. Exploiting graphical processing units for data-parallel
scientific applications. Concurrency and Computation: Practice and Experience, 21(18), 2009,
2400-2437.
[28] D. Lin, X. Huang, Q. Nguyen, J. Blackburn, C. Rodrigues, T. Huang, M.N. Do, S.J. Patel, and WMW Hwu. The parallelization of video processing. Signal Processing Magazine, IEEE, 26(6),
2009, 103-112.
[29] B.C. Lovell, A. Bigdeli, and S. Mau Invited paper: embedded face and biometric technologies for
national and border security. In Proceedings of the Computer Vision and Pattern Recognition
Workshops (CVPRW), 2011, pages 117-122.
[30] D. Luebke, D., M. Harris, N. Govindaraju, A. Lefohn, M. Houston, J. Owens, M. Segal, M.
Papakipos, and I. Buck. GPGPU: general-purpose computation on graphics hardware. In
Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM, 2006, page 208.
[31] C.K. Luk, S. Hong, and H. Kim. "Qilin: exploiting parallelism on heterogeneous multiprocessors
with adaptive mapping." In Proceedings if the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, pages 45-55. IEEE, 2009.
[32] D. Lustig, and M. Martonosi. Reducing GPU offload latency via fine-grained CPU-GPU
synchronization. In Proceeding of High Performance Computer Architecture, 2013, pages 354365.
[33] S.A. Manavski. CUDA compatible GPU as an efficient hardware accelerator for AES
cryptography. In Proceedings of the International Conference on Signal Processing and
Communications. IEEE, 2007, pages 65-68.
59

[34] C. McClanahan. History and Evolution of GPU Architecture. A Paper Survey
http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper. Pdf, 2010 (accessed
May 9, 2014).
[35] Microsoft. DirectX. http://windows.microsoft.com/en-us/windows7/products/features/directx-11,
2014 (accessed July 10, 2014).
[36] NVIDIA.
NVIDIA’s
next
generation
CUDA
TM
compute
architecture,
Fermi.http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
(accessed May 9, 2014).
[37] NVIDIA. NVIDIA’s next generation CUDA compute architecture, Kepler GK 110.
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
(accessed May 9, 2014).
[38] NVIDIA.

Cuda

toolkit

4.0

4.0/CUDA_4.0_Readiness_Tech_Brief.pdf.

readiness for cuda applications.
(accessed May 9, 2014), 2011.

http://dirac.ruc.dk/manuals/cuda-

[39] NVIDIA. http://docs.nvidia.com/cuda/cuda-samples/index.html, 2013 (accessed January 20, 2014).
[40] NVIDIA. CUDA C Programming Guide.
2013 (accessed March 20, 2014).

http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf,

[41] NVIDIA. CUDA C Best Practice Guide.
(accessed June 27, 2014).

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide,

2014

[42] NVIDIA. http://www.nvidia.com, 2014 (accessed March 20, 2014).
[43] OpenGL. The Industry Foundation for High Performance Graphics. http://www.opengl.org, 2014,
(accessed July 10, 2014).
[44] S. Pai, J.M. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic
kernels. In Proceedings of the 18th International Conference on Architectural Support for
Programming Languages and Operating Systems. ACM, 2013.
[45] D.A. Patterson and J.L. Hennessy. Computer organization and design: the hardware/software
interface. Newnes, 2013.
[46] M. Peercy, M. Segal, and D. Gerstmann. A performance-oriented data parallel virtual machine for
GPUs. ACM SIGGRAPH 2006 Sketches. ACM, 2006.
[47] H. Peters, M. Koper, and N. Luttenberger. Efficiently using a CUDA-enabled GPU as shared
resource. In Proceedings of the 10th International Conference on Computer and Information
Technology (CIT). IEEE, 2010.
[48] V. Podlozhnyuk. Black-Scholes Option Pricing.
May 20, 2014).

http://docs.nvidia.com/cuda/cuda-samples,

2012 (accessed

[49] J.C. Olaya and R. Romero. Runtime pipeline I/O scheduling system for heterogeneous
architectures. In Proceedings of the Extreme Science and Engineering Discovery Environment
(XSEDE). ACM, 2014 pages 45-53.
[50] M.J. Quinn. Parallel Programming in C with MPI and OpenMP, McGrawHill, 2004.
[51] Rodinia.
Rodinia:
accelerating
compute-intensive
applications
with
accelerators.
https://www.cs.virginia.edu/skadron/wiki/rodinia/index.php/Main Page, 2013 (accessed October
16, 2013).
60

[52] L. Shi, H. Chen, and J. Sun, J., vCUDA: GPU Accelerated High Performance Computing for
virtual Machines. IEEE International Symposium on Parallel & Distributed Processing
(IPDPS'09), 2009, pages 1-11.
[53] W. Sun, and R. Ricci, Augmenting Operating Systems with the GPU. Technical Report, University
of Utah, 2010.
[54] Texas Advanced Computing Center. The University of Texas at Austin. Available:
http://www.tacc.utexas.edu. 2013
[55] J.E. Vidale, Finite-difference calculation of travel times in three dimensions. Geophysics, 55(5),
pages 521-526. 1990.
[56] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting concurrent kernel execution on graphic
processing units. In Proceedings of the 2011 International Conference on High Performance
Computing & Simulation, 2011, pages 24-32.
[57] N. Wilt. The CUDA handbook: A comprehensive guide to GPU programming. Pearson Education,
2013.
[58] Y. Zhang, P. Lu, L. Bin, P. Jih-Kwon, and C. Jianmin. Architecture comparisons between Nvidia
and ATI GPUs: Computation parallelism and data communications. In Proceedings of the
International Symposium on Workload Characterization (IISWC), IEEE, 2011, pages 205-215.

61

Appendix A: Elapsed Time for Vector Addition
Table A.1: Result of Vector Addition using 32 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.5759
1.5796
1.5778
1.5766
1.6360
1.5829
1.6648
1.7029
1.7764
2.1536
2.7085

2 Streams
1.7008
1.5766
1.5739
1.5885
1.6344
1.5841
1.6223
1.6587
1.7383
1.8857
1.8857

RPIOSS
4 Streams
1.5754
1.5805
1.5796
1.5706
1.5922
1.5943
1.6303
1.6573
1.7431
1.9038
2.1870

8 Streams
1.5825
1.5726
1.5652
1.5836
1.5874
1.5961
1.6093
1.6594
1.7344
1.8728
2.2398

Table A.2: Result of Vector Addition using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0710
0.0725
0.0731
0.0735
0.0748
0.0720
0.0779
0.0809
0.0878
0.1000
0.1111

2 Streams
0.0671
0.0695
0.0682
0.0692
0.0710
0.0725
0.0725
0.0766
0.0816
0.0935
0.1213

62

RPIOSS
4 Streams
0.0658
0.0689
0.0689
0.0690
0.0692
0.0685
0.0714
0.0737
0.0736
0.0834
0.0940

8 Streams
0.0667
0.0692
0.0687
0.0689
0.0698
0.0695
0.0702
0.0755
0.0827
0.0817
0.0939

Table A.3: Result of Vector Addition using 128 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0651
0.0697
0.0695
0.0701
0.0701
0.0708
0.0715
0.0751
0.0772
0.0833
0.0969

2 Streams
0.0683
0.0717
0.0688
0.0685
0.0694
0.0723
0.0711
0.0734
0.0762
0.0829
0.0939

63

RPIOSS
4 Streams
0.0659
0.0697
0.0697
0.0712
0.0709
0.0702
0.0702
0.7200
0.0785
0.0842
0.0952

8 Streams
0.0661
0.0699
0.0700
0.0702
0.0685
0.0713
0.0710
0.0719
0.0750
0.0835
0.0964

Appendix B: Elapsed Time for Black-Scholes
Table B.1: Result of Black-Scholes using 32 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.5367
1.5974
1.5995
1.6242
1.6839
1.8101
2.0920
2.5057
3.5049
5.7665
9.3438

2 Streams
1.5888
1.6531
1.6502
1.6230
1.6729
1.9553
2.0194
2.4036
3.1241
5.1399
9.8503

RPIOSS
4 Streams
1.5882
1.5919
1.6147
1.6287
1.6647
1.6647
2.0236
2.3603
3.1229
4.6684
9.8503

8 Streams
1.6079
1.6051
1.6748
1.6893
1.7445
1.7774
2.0214
2.3721
3.1260
4.5980
7.2000

Table B.2: Result of Black-Scholes using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.5785
1.5862
1.5972
1.6195
1.6718
1.7761
2.0302
2.3808
3.2030
4.8187
7.4510

2 Streams
1.5817
1.6049
1.6049
1.6210
1.6580
1.7990
1.9380
2.4036
3.0186
4.5412
7.5861

64

RPIOSS
4 Streams
1.6406
1.5920
1.6071
1.6278
1.6649
1.7721
1.9455
2.3603
3.0781
4.5086
7.5861

8 Streams
1.6662
1.6198
1.6202
1.6857
1.6810
1.7733
1.9328
2.3721
3.0421
4.4600
7.5861

Table B.3: Result of Black-Scholes using 128 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.5639
1.5813
1.5950
1.6210
1.7267
1.7766
1.9650
2.3505
3.1486
4.7114
7.8626

2 Streams
1.6494
1.5902
1.6554
1.6296
1.6578
1.7483
1.9803
2.2714
3.0168
4.5141
7.8456

65

RPIOSS
4 Streams
1.6466
1.5939
1.6062
1.6253
1.6651
1.7558
1.9144
2.9688
2.9688
4.3411
7.8589

8 Streams
1.6034
1.6148
1.6177
1.6923
1.6774
1.7623
1.9270
2.9685
2.9685
4.3262
7.1601

Appendix C: Elapsed Time for Nearest Neighbors
Table C.1: Result of Nearest Neighbors using 32 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0035
0.0070
0.0070
0.0272
0.0529
0.1052
0.2051
0.4399
0.8682
1.7505
3.5003

2 Streams
0.1869
0.1841
0.1859
0.1889
0.2062
0.2179
0.2555
0.3262
0.5114
0.7839
1.4029

RPIOSS
4 Streams
0.1815
0.1830
0.1874
0.1895
0.1988
0.2207
0.2422
0.3473
0.4724
0.8390
1.4742

8 Streams
0.1824
0.1840
0.1867
0.1919
0.2049
0.2168
0.2658
0.3329
0.4823
0.8367
1.4715

Table C.2: Result of Nearest Neighbors using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0035
0.0070
0.0143
0.0273
0.1054
0.2056
0.2343
0.4354
0.8696
1.7320
3.4383

2 Streams
0.1860
0.1836
0.1850
0.1913
0.2024
0.2197
0.2425
0.3580
0.5090
0.8330
1.4894

66

RPIOSS
4 Streams
0.1825
0.1821
0.1887
0.1915
0.2039
0.2196
0.2549
0.3327
0.5100
0.8347
1.4874

8 Streams
0.1824
0.1830
0.1742
0.1931
0.2009
0.2175
0.2490
0.3475
0.4809
0.8265
1.4616

Table C.3: Result of Nearest Neighbors using 128 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0070
0.0141
0.0277
0.0549
0.1052
0.2054
0.2399
0.4423
0.8679
1.7336
3.4274

2 Streams
0.1820
0.1840
0.1859
0.1920
0.1989
0.2173
0.2547
0.3317
0.5079
0.7777
1.3938

67

RPIOSS
4 Streams
0.1851
0.1847
0.1873
0.1912
0.2062
0.2165
0.2522
0.3217
0.4809
0.8128
1.4286

8 Streams
0.1827
0.1841
0.1920
0.1935
0.2017
0.2298
0.2501
0.3436
0.5083
0.7777
1.3917

Appendix D: Elapsed Time for 1D Convolution
Table D.1: Result of 1D Convolution using 32 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.6859
1.6994
1.7180
1.7503
1.8017
1.9141
2.1458
2.6033
3.5154
5.3506
9.0132

2 Streams
1.6787
1.6970
1.7053
1.7756
1.8267
1.8345
1.9759
2.3174
2.8453
4.0142
6.3223

RPIOSS
4 Streams
1.6979
1.7020
1.7547
1.7772
1.7619
1.8281
2.0279
2.6167
2.8395
3.9946
6.3223

8 Streams
1.7499
1.7553
1.7036
1.7230
1.8035
1.8304
1.9759
2.5496
2.8911
3.9918
6.3458

Table D.2: Result of 1D Convolution using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.6971
1.7083
1.7264
1.7555
1.8173
1.9424
2.1738
2.6484
3.6109
5.4903
9.2976

2 Streams
1.6887
1.7271
1.7551
1.7550
1.7595
1.8321
1.9829
2.2621
2.8290
3.9774
6.2358

68

RPIOSS
4 Streams
1.6963
1.7023
1.7107
1.7227
1.7589
1.8284
1.9767
2.2550
2.8336
3.9709
6.2527

8 Streams
1.7054
1.7038
1.7076
1.7243
1.7607
1.8314
2.0255
2.3106
2.8343
3.9809
6.2750

Table D.3: Result of 1D Convolution using 128 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
1.7056
1.7249
1.7522
1.8508
1.8740
2.0233
2.3313
2.9528
4.2431
6.6565
11.5985

2 Streams
1.6951
1.7005
1.7579
1.7770
1.7642
1.8483
2.0149
2.3465
2.9890
4.3049
6.8929

69

RPIOSS
4 Streams
1.7006
1.7037
1.7110
1.7258
1.7668
1.8531
2.0668
2.4069
2.9980
4.3031
6.9274

8 Streams
1.7684
1.7166
1.7720
1.7316
1.7779
1.8646
2.0161
2.3478
2.9892
4.2964
6.8560

Appendix E: Elapsed Time for Histogram
Table E.1: Result of Histogram using 32 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0652
0.0698
0.0700
0.0707
0.0761
0.0729
0.0742
0.0788
0.0915
0.106
0.1452

2 Streams
0.0679
0.0691
0.0668
0.0688
0.0688
0.0723
0.0698
0.0739
0.0828
0.0927
0.1181

RPIOSS
4 Streams
0.0670
0.0688
0.0688
0.0607
0.0685
0.0701
0.0682
0.0705
0.0756
0.0800
0.1224

8 Streams
0.0686
0.0698
0.0705
0.0725
0.0734
0.0734
0.0745
0.0788
0.0915
0.1060
0.1452

Table E.2: Result of Histogram using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0710
0.0725
0.0731
0.0735
0.0748
0.0720
0.0779
0.0809
0.0878
0.1000
0.1111

2 Streams
0.0671
0.0695
0.0682
0.0692
0.0710
0.0725
0.0725
0.0766
0.0816
0.0935
0.1213

70

RPIOSS
4 Streams
0.0658
0.0689
0.0689
0.0690
0.0692
0.0685
0.0714
0.0737
0.0736
0.0834
0.0940

8 Streams
0.0667
0.0692
0.0687
0.0689
0.0698
0.0695
0.0702
0.0755
0.0827
0.0817
0.0939

Table E.3: Result of Histogram using 64 threads.
Data size
64
128
256
512
1024
2048
4096
8192
16380
32768
65536

Base
0.0651
0.0697
0.0695
0.0701
0.0701
0.0708
0.0715
0.0751
0.0772
0.0833
0.0969

2 Streams
0.0683
0.0717
0.0688
0.0685
0.0694
0.0723
0.0711
0.0734
0.0762
0.0829
0.0939

71

RPIOSS
4 Streams
0.0659
0.0697
0.0697
0.0712
0.0709
0.0702
0.0702
0.7200
0.0785
0.0842
0.0952

8 Streams
0.0661
0.0699
0.0700
0.0702
0.0685
0.0713
0.0710
0.0719
0.0750
0.0835
0.0964

Appendix F: Vector Addition Benchmark

/*------------------------------------------------------------------------The vector addition program performs a parallel addition of size-n
vectors by adding vector A to vector B and placing the result in vector
C. VA assigns threads to process the elements of vectors A, B, and C.
Since vectors are one-dimensional, threads are in a one-dimension grid
and each thread computes one value of the C vector.
-------------------------------------------------------------------------*/
#include <stdio.h>
#include <cuda_runtime.h>
#include "timers.h"
/**
* CUDA Kernel Device code
*/
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
C[i] = A[i] + B[i];
}
}
/**
* Host main routine
*/
int main(int argc, char **argv)
{
// Error code to check return values for CUDA calls
cudaError_t err = cudaSuccess;
int tx = atoi(argv[1]);
int numElements = atoi(argv[2]);
// Print the vector length to be used, and compute its size
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = (float *)malloc(size);
// Allocate the host input vector B
float *h_B = (float *)malloc(size);
// Allocate the host output vector C

72

float *h_C = (float *)malloc(size);
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL)
{
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i)
{
h_A[i] = rand()/(float)RAND_MAX;
h_B[i] = rand()/(float)RAND_MAX;
}
// Allocate the device input vector A
float *d_A = NULL;
err = cudaMalloc((void **)&d_A, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Allocate the device input vector B
float *d_B = NULL;
err = cudaMalloc((void **)&d_B, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Allocate the device output vector C
float *d_C = NULL;
err = cudaMalloc((void **)&d_C, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector A from host to device (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);

73

}
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector B from host to device (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = tx;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
err = cudaGetLastError();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector C from device to host (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i)
{
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
{
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
err = cudaFree(d_A);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector A (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_B);

74

if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector B (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_C);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector C (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Free host memory
free(h_A);
free(h_B);
free(h_C);
// Reset the device and exit
err = cudaDeviceReset();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
printf("Done\n");
return 0;

75

Appendix G: Black-Scholes Benchmark

/*------------------------------------------------------------------------Black-Scholes is a mathematical modeling instrument for financial
markets that was developed by Fischer Black and Myron Scholes. It is
used for computing the price of put and call options granted by a
holder. There are several kinds of options such as the European-style
and the American-style. The former can operate on the expiration date,
and the latter can operate on the expiration date or any time
thereafter.
References:
V. Podlozhnyuk, Black-Scholes Option Pricing,
http://docs.nvidia.com/cuda/samples/4_Finance/BlackScholes/doc/BlackSch
oles.pdf, 2007, (accessed on August 4, 2007)
F. Black and M. Scholes. The pricing of options and corporate
liabilities. The Journal of Political Economy, 81, pages 637-654. 1973.
-------------------------------------------------------------------------*/
extern "C" void BlackScholesCPU(
float *h_CallResult,
float *h_PutResult,
float *h_StockPrice,
float *h_OptionStrike,
float *h_OptionYears,
float Riskfree,
float Volatility,
int optN
);
#include "BlackScholes_kernel.cuh"
float RandFloat(float low, float high)
{
float t = (float)rand() / (float)RAND_MAX;
return (1.0f - t) * low + t * high;
}
const int OPT_N = 1048576;
const int NUM_ITERATIONS = 512;
//const int
const float
const float

OPT_SZ = OPT_N * sizeof(float);
RISKFREE = 0.02f;
VOLATILITY = 0.30f;

76

////////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// Start logs
printf("[%s] - Starting...\n", argv[0]);
//'h_' prefix - CPU (host) memory space
float
//Results calculated by CPU for reference
*h_CallResultCPU,
*h_PutResultCPU,
//CPU copy of GPU results
*h_CallResultGPU,
*h_PutResultGPU,
//CPU instance of input data
*h_StockPrice,
*h_OptionStrike,
*h_OptionYears;
//'d_' prefix - GPU (device) memory space
float
//Results calculated by GPU
*d_CallResult,
*d_PutResult,
//GPU instance of input data
*d_StockPrice,
*d_OptionStrike,
*d_OptionYears;
double
delta, ref, sum_delta, sum_ref, max_delta, L1norm, gpuTime;
StopWatchInterface *hTimer = NULL;
int i;
int tx = atoi(argv[1]);
int OPT_N = atoi(argv[2]);
int OPT_SZ = OPT_N * sizeof(float);
findCudaDevice(argc, (const char **)argv);
sdkCreateTimer(&hTimer);
printf("Initializing data...\n");
printf("...allocating CPU memory for options.\n");
h_CallResultCPU = (float *)malloc(OPT_SZ);
h_PutResultCPU = (float *)malloc(OPT_SZ);
h_CallResultGPU = (float *)malloc(OPT_SZ);
h_PutResultGPU = (float *)malloc(OPT_SZ);
h_StockPrice
= (float *)malloc(OPT_SZ);
h_OptionStrike = (float *)malloc(OPT_SZ);

77

h_OptionYears

= (float *)malloc(OPT_SZ);

printf("...generating input data in CPU mem.\n");
srand(5347);

//Generate options set
for (i = 0; i < OPT_N;
{
h_CallResultCPU[i]
h_PutResultCPU[i]
h_StockPrice[i]
h_OptionStrike[i]
h_OptionYears[i]
}

i++)
=
=
=
=
=

0.0f;
-1.0f;
RandFloat(5.0f, 30.0f);
RandFloat(1.0f, 100.0f);
RandFloat(0.25f, 10.0f);

printf("...allocating GPU memory
checkCudaErrors(cudaMalloc((void
checkCudaErrors(cudaMalloc((void
checkCudaErrors(cudaMalloc((void
checkCudaErrors(cudaMalloc((void
checkCudaErrors(cudaMalloc((void

for options.\n");
**)&d_CallResult,
**)&d_PutResult,
**)&d_StockPrice,
**)&d_OptionStrike,
**)&d_OptionYears,

OPT_SZ));
OPT_SZ));
OPT_SZ));
OPT_SZ));
OPT_SZ));

printf("...copying input data to GPU mem.\n");
//Copy options data to GPU memory for further processing
checkCudaErrors(cudaMemcpy(d_StockPrice, h_StockPrice,
OPT_SZ,
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_OptionStrike, h_OptionStrike, OPT_SZ,
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_OptionYears, h_OptionYears,
OPT_SZ,
cudaMemcpyHostToDevice));
printf("Data init done.\n\n");
printf("Executing Black-Scholes GPU kernel (%i iterations)...\n",
NUM_ITERATIONS);
checkCudaErrors(cudaDeviceSynchronize());
sdkResetTimer(&hTimer);
sdkStartTimer(&hTimer);
dim3 block(tx, 1, 1);
dim3 grid((OPT_N + tx - 1)/tx, 1, 1);
for (i = 0; i < NUM_ITERATIONS; i++)
{
BlackScholesGPU<<<grid, block>>>(
d_CallResult,
d_PutResult,
d_StockPrice,
d_OptionStrike,
d_OptionYears,
RISKFREE,
VOLATILITY,
OPT_N
);
getLastCudaError("BlackScholesGPU() execution failed\n");
}

78

checkCudaErrors(cudaMemcpy(h_CallResultGPU, d_CallResult, OPT_SZ,
cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_PutResultGPU, d_PutResult, OPT_SZ,
cudaMemcpyDeviceToHost));

printf("Checking the results...\n");
printf("...running CPU calculations.\n\n");
//Calculate options values on CPU
BlackScholesCPU(
h_CallResultCPU,
h_PutResultCPU,
h_StockPrice,
h_OptionStrike,
h_OptionYears,
RISKFREE,
VOLATILITY,
OPT_N
);
printf("Comparing the results...\n");
//Calculate max absolute difference and L1 distance
//between CPU and GPU results
sum_delta = 0;
sum_ref
= 0;
max_delta = 0;
for (i = 0; i < OPT_N; i++)
{
ref
= h_CallResultCPU[i];
delta = fabs(h_CallResultCPU[i] - h_CallResultGPU[i]);
if (delta > max_delta)
{
max_delta = delta;
}
sum_delta += delta;
sum_ref
+= fabs(ref);
}
L1norm = sum_delta / sum_ref;
printf("L1 norm: %E\n", L1norm);
printf("Max absolute error: %E\n\n", max_delta);
printf("Shutting down...\n");
printf("...releasing GPU memory.\n");
checkCudaErrors(cudaFree(d_OptionYears));
checkCudaErrors(cudaFree(d_OptionStrike));
checkCudaErrors(cudaFree(d_StockPrice));
checkCudaErrors(cudaFree(d_PutResult));
checkCudaErrors(cudaFree(d_CallResult));
printf("...releasing CPU memory.\n");
free(h_OptionYears);
free(h_OptionStrike);
free(h_StockPrice);
free(h_PutResultGPU);

79

free(h_CallResultGPU);
free(h_PutResultCPU);
free(h_CallResultCPU);
sdkDeleteTimer(&hTimer);
printf("Shutdown done.\n");
printf("\n[BlackScholes] - Test Summary\n");
cudaDeviceReset();
if (L1norm > 1e-6)
{
printf("Test failed!\n");
exit(EXIT_FAILURE);
}
printf("Test passed\n");
exit(EXIT_SUCCESS);
}

80

Appendix H: 1D Convolution Benchmark

/*------------------------------------------------------------------------Convolution is used for signal, image, and video processing. Onedimensional convolution is a filter that executes a series of
operations for each output element based on the sum of weights of the
closest elements, which are referred to as the mask or the convolution
kernel. Since there are a few values in a mask, device constant memory
can be used to store them. However, the mask can be set in device
global memory instead if it is large.
-------------------------------------------------------------------------*/
extern "C" __global__ void kernel(float *input, float *output, float *mask,
int mask_width, int width)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
float pvalue = 0;
int start_point = i - (mask_width / 2);
for(int j = 0; j < mask_width; j++)
{
if(start_point + j >= 0 && start_point + j < width)
{
pvalue += input[start_point + j] * mask[j];
}
}
output[i] = pvalue;
}
int main(int argc, char **argv)
{
// Host variables
int i;
float *h_input;
float *h_output;
float *h_mask;
// Device variables
float *d_input;
float *d_output;
float *d_mask;
// Error code to check return values for CUDA calls
cudaError_t err = cudaSuccess;
int numThreads = atoi(argv[1]);
int BLOCK_SIZE = atoi(argv[2]);
int MASK_SIZE = numThreads+1;

81

// Allocate the host vector for h_input
h_input = (float *) malloc (BLOCK_SIZE * sizeof(float));
// Allocate the host vector for h_ouput
h_output = (float *) malloc (BLOCK_SIZE *

sizeof(float));

// Allocate the host vector for h_mask
//h_mask = (float *) malloc (MASK_SIZE * sizeof(float));
cudaMallocHost(&h_mask, MASK_SIZE * sizeof(float));
// Verify that allocations succeeded
if (h_input == NULL || h_output == NULL || h_mask == NULL)
{
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initializing the h_input vector
srand(19);
for(i = 0; i < BLOCK_SIZE; i++)
{
h_input[i] = (float) rand()/(float)RAND_MAX;
}
// Initializing the h_mask vector
srand(20);
for(i = 0; i < MASK_SIZE; i++)
{
h_mask[i] = (float) rand()/(float)RAND_MAX;
}
// Allocating device variables
err = cudaMalloc((void **) &d_input, BLOCK_SIZE*sizeof(float));
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_input (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &d_output, BLOCK_SIZE*sizeof(float));
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_input (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &d_mask, MASK_SIZE*sizeof(float));
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_input (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}

82

// Copy data from host to device
err = cudaMemcpy(d_input, h_input, BLOCK_SIZE*sizeof(float),
cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy to device vector d_input (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMemcpy(d_mask, h_mask, MASK_SIZE*sizeof(float),
cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy to device vector d_input (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
int grid = (BLOCK_SIZE + numThreads - 1) / numThreads;
for(int n = 0; n < 200; n++)
kernel<<<grid, numThreads>>>(d_input, d_output, d_mask, MASK_SIZE, BLOCK_SIZE);
// Copy data from device to host
err = cudaMemcpy(h_output, d_output, BLOCK_SIZE*sizeof(float),
cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy from device vector d_ouput(error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
/*for(i = 0; i < BLOCK_SIZE; i++)
{
printf("%d %f\n", i, h_output[i]);
}*/
// Free device memory
err = cudaFree(d_input);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free memory device vector d_input(error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_output);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free memory device vector d_output(error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}

83

err = cudaFree(d_mask);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free memory device vector d_output(error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaDeviceReset();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Free host memory
free(h_input);
free(h_output);
cudaFreeHost(h_mask);
printf("Done !\n");
return 0;
}
}

84

Appendix I: Nearest Neighbors Benchmark

/*------------------------------------------------------------------------Nearest Neighbors (NN) is one of the essential data-mining algorithms
used to find the closest objects based on a training set. Input data
is a list of latitude and longitude coordinates of hurricanes. The
algorithm computes the Euclidean distance between each pair of objects
to determine which objects are the nearest neighbors of each object in
the training set.
-------------------------------------------------------------------------*/
#define
#define
#define
#define

min( a, b )(a) > (b) ? (b) : (a)
ceilDiv( a, b )( (a) + (b) - 1 ) / (b)
print( x ) printf( #x ": %lu\n", (unsigned long) (x) )
DEBUG
false

//#define DEFAULT_THREADS_PER_BLOCK 32
#define
#define
#define
#define

MAX_ARGS 10
REC_LENGTH 53
// size of a record in db
LATITUDE_POS 28 // character position of latitude in each record
OPEN 10000
// initial value of Nearest Neighbors

typedef struct latLong
{
float lat;
float lng;
} LatLong;
typedef struct record
{
char recString[REC_LENGTH];
float distance;
} Record;
int loadData(char *filename,std::vector<Record> &records,
std::vector<LatLong> &locations);
void findLowest(std::vector<Record> &records,float *distances,int numRecords,
int topN);
void printUsage();
int parseCommandline(int argc, char *argv[], char* filename,int *r,float *lat,
float *lng, int *q, int *t, int *p, int *d);

85

/**
* Kernel
* Executed on GPU
* Calculates the Euclidean distance from each record in the database to the target
position
*/
extern "C" __global__ void euclid(float *d_lan, float *d_lon, float *d_distances,
float lat, float lng, int numRecords)
{
//int globalId = gridDim.x * blockDim.x * blockIdx.y + blockDim.x *
blockIdx.x + threadIdx.x;
int globalId = blockDim.x * ( gridDim.x * blockIdx.y + blockIdx.x ) +
threadIdx.x; // more efficient
//LatLong *latLong = d_locations+globalId;
if (globalId < numRecords) {
//float *dist=d_distances+globalId;
d_distances[globalId] = (float)sqrt((lat-d_lan[globalId])*(latd_lan[globalId])+(lng-d_lon[globalId])*(lng-d_lon[globalId]));
}
}
/**
* This program finds the k-Nearest Neighbors
**/
int main(int argc, char* argv[])
{
int
i=0;
float lat, lng;
int quiet=0,timing=0,platform=0,device=0;
std::vector<Record> records;
std::vector<LatLong> locations;
char filename[100];
int resultsCount=10;
// Error code to check return values for CUDA calls
cudaError_t err = cudaSuccess;
// parse command line
if (parseCommandline(argc, argv, filename,&resultsCount,&lat,&lng,
&quiet, &timing, &platform, &device)) {
printUsage();
return 0;
}
int DEFAULT_THREADS_PER_BLOCK = atoi(argv[8]);
int numRecords = loadData(filename,records,locations);
if (resultsCount > numRecords) resultsCount = numRecords;
//for(i=0;i<numRecords;i++)
// printf("%s, %f, %f\n",(records[i].recString),
//
locations[i].lat,locations[i].lng);

86

//Pointers to host memory
float *distances;
//Pointers to device memory
//LatLong *d_locations;
float *d_distances;
// Scaling calculations - added by Sam Kauffman
cudaDeviceProp deviceProp;
cudaGetDeviceProperties( &deviceProp, 0 );
cudaThreadSynchronize();
unsigned long maxGridX = deviceProp.maxGridSize[0];
unsigned long threadsPerBlock = min( deviceProp.maxThreadsPerBlock,
DEFAULT_THREADS_PER_BLOCK );
size_t totalDeviceMemory;
size_t freeDeviceMemory;
cudaMemGetInfo( &freeDeviceMemory, &totalDeviceMemory );
cudaThreadSynchronize();
// 85% arbitrary throttle to compensate for known CUDA bug
unsigned long usableDeviceMemory = freeDeviceMemory * 85 / 100;
// 4 bytes in 3 vectors per thread
unsigned long maxThreads = usableDeviceMemory / 12;
if ( numRecords > maxThreads )
{
fprintf( stderr, "Error: Input too large.\n" );
exit( 1 );
}
// extra
unsigned
unsigned
unsigned

threads will do nothing
long blocks = ceilDiv( numRecords, threadsPerBlock );
long gridY = ceilDiv( blocks, maxGridX );
long gridX = ceilDiv( blocks, gridY );

// There will be no more than (gridY - 1) extra blocks
dim3 gridDim( gridX, gridY );
if ( DEBUG )
{
print( totalDeviceMemory ); // 804454400
print( freeDeviceMemory );
print( usableDeviceMemory );
print( maxGridX ); // 65535
print( deviceProp.maxThreadsPerBlock ); // 1024
print( threadsPerBlock );
print( maxThreads );
print( blocks ); // 130933
print( gridY );
print( gridX );
}

87

/**
* Allocate memory on host and device
*/
float *d_lat, *d_lon;
distances = (float *) malloc (sizeof(float) * numRecords);
// Verify that allocations succeeded
if (distances == NULL)
{
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &d_lat,sizeof(float) * numRecords);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_lat (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &d_lon,sizeof(float) * numRecords);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_lon (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &d_distances,sizeof(float) * numRecords);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector d_distance (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
/**
* Transfer data from host to device
*/
float *h1;
float *h2;
h1 = (float *) malloc (numRecords*sizeof(float));
h2 = (float *) malloc (numRecords*sizeof(float));
for(int i = 0; i < numRecords; i++) {
h1[i] = locations.at(i).lat;
h2[i] = locations.at(i).lng;
}
err = cudaMemcpy( d_lat, h1, sizeof(float) * numRecords,
cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy to device vector d_lat (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);

88

}
err = cudaMemcpy( d_lon, h2, sizeof(float) * numRecords,
cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy to device vector d_lon (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
/**
* Execute kernel
*/
euclid<<< gridDim, threadsPerBlock>>>(d_lat,d_lon,
d_distances,numRecords,lat,lng);
err = cudaGetLastError();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to launch kernel (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
//Copy data from device memory to host memory
cudaMemcpy( distances, d_distances, sizeof(float)*numRecords,
cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy from device vector distances (error code
%s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// find the resultsCount least distances
findLowest(records,distances,numRecords,resultsCount);
// print out results
if (!quiet)
for(i=0;i<resultsCount;i++) {
printf("%s --> Distance=%f\n",records[i].recString,records[i].distance);
}
free(distances);
//Free memory
err = cudaFree(d_lat);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free vector d_lat (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_lon);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free vector d_lon(error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}

89

err = cudaFree(d_distances);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free vector d_distances(error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
free(h1);
free(h2);
endTime1 = gettime();
printf("Total execution of operation %f\n", endTime2 - startTime2);
printf("Total execution time %f\n", endTime1 - startTime1);
}
int loadData(char *filename,std::vector<Record> &records,std::vector<LatLong>
&locations){
FILE
*flist,*fp;
int
i=0;
char dbname[64];
int recNum=0;
/**Main processing **/
flist = fopen(filename, "r");
while(!feof(flist)) {
/**
* Read in all records of length REC_LENGTH
* If this is the last file in the filelist, then done
* else open next file to be read next iteration
*/
if(fscanf(flist, "%s\n", dbname) != 1) {
fprintf(stderr, "error reading filelist\n");
exit(0);
}
fp = fopen(dbname, "r");
if(!fp) {
printf("error opening a db\n");
exit(1);
}
// read each record
while(!feof(fp)){
Record record;
LatLong latLong;
fgets(record.recString,49,fp);
fgetc(fp); // newline
if (feof(fp)) break;
// parse for lat and long
char substr[6];
for(i=0;i<5;i++) substr[i] = *(record.recString+i+28);
substr[5] = '\0';
latLong.lat = atof(substr);

90

for(i=0;i<5;i++) substr[i] = *(record.recString+i+33);
substr[5] = '\0';
latLong.lng = atof(substr);
locations.push_back(latLong);
records.push_back(record);
recNum++;
}
fclose(fp);
}
fclose(flist);
for(i=0;i<rec_count*REC_LENGTH;i++) printf("%c",sandbox[i]);
return recNum;

//
}

void findLowest(std::vector<Record> &records,float *distances,int numRecords,int
topN){
int i,j;
float val;
int minLoc;
Record *tempRec;
float tempDist;
for(i=0;i<topN;i++) {
minLoc = i;
for(j=i;j<numRecords;j++) {
val = distances[j];
if (val < distances[minLoc]) minLoc = j;
}
// swap locations and distances
tempRec = &records[i];
records[i] = records[minLoc];
records[minLoc] = *tempRec;
tempDist = distances[i];
distances[i] = distances[minLoc];
distances[minLoc] = tempDist;
// add distance to the min we just found
records[i].distance = distances[i];
}
}
int parseCommandline(int argc, char *argv[], char* filename,int *r,float *lat,float
*lng,int *q, int *t, int *p, int *d){
int i;
if (argc < 2) return 1; // error
strncpy(filename,argv[1],100);
char flag;
for(i=1;i<argc;i++) {
if (argv[i][0]=='-') {// flag
flag = argv[i][1];
switch (flag) {
case 'r': // number of results
i++;
*r = atoi(argv[i]);

91

break;
case 'l': // lat or lng
if (argv[i][2]=='a') {//lat
*lat = atof(argv[i+1]);
}
else {//lng
*lng = atof(argv[i+1]);
}
i++;
break;
case 'h': // help
return 1;
case 'q': // quiet
*q = 1;
break;
case 't': // timing
*t = 1;
break;
case 'p': // platform
i++;
*p = atoi(argv[i]);
break;
case 'd': // device
i++;
*d = atoi(argv[i]);
break;
}
}
}
// both p and d must be specified if either are specified
if ((*d >= 0 && *p<0) || (*p>=0 && *d<0))
return 1;
return 0;
}
void printUsage(){
printf("Nearest Neighbor Usage\n");
printf("\n");
printf("nearestNeighbor [filename] -r [int] -lat [float] -lng [float] [-hqt] [-p
[int] -d [int]]\n");
printf("\n");
printf("example:\n");
printf("$ ./nearestNeighbor filelist.txt -r 5 -lat 30 -lng 90\n");
printf("\n");
printf("filename
the filename that lists the data input files\n");
printf("-r [int]
the number of records to return (default: 10)\n");
printf("-lat [float] the latitude for Nearest Neighbors (default: 0)\n");
printf("-lng [float] the longitude for Nearest Neighbors (default: 0)\n");
printf("\n");
printf("-h, --help
Display the help file\n");
printf("-q
Quiet mode. Suppress all text output.\n");
printf("-t
Print timing information.\n");
printf("\n");
printf("-p [int]
Choose the platform (must choose both platform and
device)\n");
printf("-d [int]
Choose the device (must choose both platform and
device)\n");

92

printf("\n");
printf("\n");
printf("Notes: 1. The filename is required as the first parameter.\n");
printf("
2. If you declare either the device or the platform,\n");
printf("
you must declare both.\n\n");
}

93

Appendix J: Histogram Benchmark
/*------------------------------------------------------------------------Histogram determines the frequency of observations within a group of ranges of values, which are referred
to as bins, for information visualization in many applications including data mining and image processing.
-------------------------------------------------------------------------*/
#include <stdio.h>
#include <cuda_runtime.h>
#include "timers.h"
#include "histogram_common.h"
#define BIN_COUNT_HISTOGRAM 256
/**
* Kernel function.
*/
__global__ void histogram_kernel(unsigned char *buffer, int sizeB,
unsigned int *histo)
{
shared__ unsigned int tmp[BIN_COUNT_HISTOGRAM];
tmp[threadIdx.x] = 0;
syncthreads();
int i = blockIdx.x * blockDim.x + threadIdx.x;
int offset = blockDim.x * gridDim.x;
while(i < sizeB)
{
atomicAdd( &tmp[buffer[i]], 1);
i += offset;
}
syncthreads();
atomicAdd( &(histo[threadIdx.x]), tmp[threadIdx.x]);
}
/**
* Main function.
*/
int main(int argc, char **argv)
{
int PassFailFlag = 1;
unsigned char *buffer;
unsigned int *histoGPU;
unsigned int *histoCPU;
unsigned char *dev_buffer;
unsigned int *dev_histo;
int size = atoi(argv[1]) * 1024 * 1024;
// Error code to check return values for CUDA calls
cudaError_t err = cudaSuccess;
// Allocate host memory

94

histoGPU = (unsigned int *) malloc (BIN_COUNT_HISTOGRAM * sizeof(int));
buffer = (unsigned char *) malloc (size);
histoCPU = (unsigned int *) malloc (BIN_COUNT_HISTOGRAM * sizeof(int));
if (buffer == NULL || histoCPU == NULL || histoGPU == NULL)
{
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize buffer with randon values
for(int i = 0; i < size; i++)
buffer[i] = rand() % BIN_COUNT_HISTOGRAM;
// Allocate memory in GPU
err = cudaMalloc((void **) &dev_buffer, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate dev_buffer (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMalloc((void **) &dev_histo, BIN_COUNT_HISTOGRAM * sizeof(int));
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device dev_history (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy data to GPU
err = cudaMemcpy(dev_buffer, buffer, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy dev_buffer to device (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Lauch kernel
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int block = prop.multiProcessorCount;
histogram_kernel<<<block * 2, BIN_COUNT_HISTOGRAM>>>(dev_buffer, size,
dev_histo);
err = cudaGetLastError();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to launch kernel (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy data from GPU
err = cudaMemcpy(histoGPU, dev_histo, BIN_COUNT_HISTOGRAM * sizeof(int),
cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy dev_histo to host (error code %s)!\n",

95

cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
histogram256CPU(histoCPU, buffer, size);
//free GPU memory
err = cudaFree(dev_histo);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free dev_histo (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(dev_buffer);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free dev_buffer (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Free CPU memory
free(buffer);
free(histoCPU);
free(histoGPU);
// Reset the device and exit
err = cudaDeviceReset();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
return 0;
}

96

Appendix K: RPIOSS Manager Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Manager
Developer: Julio C. Olaya
Date: June 30, 2013
Description:
RPIOSS consists of a manager, a dependence subsystem, a host-shared
memory subsystem, and a pipeline subsystem. The manager initializes the
dependence subsystem, which removes redundant data transfers when the
host process invokes more than one kernel; invokes the pipeline
subsystem to redirect input and output of host shared memory; destroys
host threads; and releases synchronization resources.
-------------------------------------------------------------------------*/
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

<iostream>
"ErrorDevice.h"
"Pipe.h"
"Device.h"
"Alloc.h"
"CopyHD.h"
"Function.h"
"CopyDH.h"
"Depen.h"
"timers.h"

using namespace std;
// Declare global variables for memory mapping, device, and streams
SHARED *
Pipe::ptrShared;
CUdevice
Pipe::hDevice;
CUcontext
Pipe::hContext;
CUmodule
Pipe::hModule;
CUfunction * Pipe::hFunction;
CUresult
Pipe::err;
CUstream *
Pipe::stream;
/*
* Initialize and destroy hostreads, and create and release synchronization
primitives.
*/
void Pipe::setProducerConsumer() {
pthread_t tid_allocProducer;
pthread_t tid_allocConsumer;
pthread_t tid_copyhdConsumer;
pthread_t tid_funcConsumer;
pthread_t tid_copydhConsumer;
// Initialize semaphores to synchronize the configuration and the

97

// allocations stages, the allocation and the copy host-device stages, the
// copy host-device and kernel function stages, and the kernel function
// and the copy device-host stages.
sem_init(&Pipe::ptrShared->allocMutex, 0, 1);
sem_init(&Pipe::ptrShared->allocNempty, 0, Pipe::ptrShared->numChunk);
sem_init(&Pipe::ptrShared->allocNstored, 0, 0);
sem_init(&Pipe::ptrShared->copyhdMutex, 0, 1);
sem_init(&Pipe::ptrShared->copyhdNempty, 0, Pipe::ptrShared->numChunk);
sem_init(&Pipe::ptrShared->copyhdNstored, 0, 0);
sem_init(&Pipe::ptrShared->funcMutex, 0, 1);
sem_init(&Pipe::ptrShared->funcNempty, 0, Pipe::ptrShared->numChunk);
sem_init(&Pipe::ptrShared->funcNstored, 0, 0);
sem_init(&Pipe::ptrShared->copydhMutex, 0, 1);
sem_init(&Pipe::ptrShared->copydhNempty, 0, Pipe::ptrShared->numChunk);
sem_init(&Pipe::ptrShared->copydhNstored, 0, 0);
//Create producer and consumer host threads for the allocation, copy host//device, kernel, and copy device-host stages.
pthread_create(&tid_allocProducer, 0, &Alloc::produce, 0);
pthread_create(&tid_allocConsumer, 0, &Alloc::consume, 0);
pthread_create(&tid_copyhdConsumer, 0, &CopyHD::consume, 0);
pthread_create(&tid_funcConsumer, 0, &Function::consume, 0);
pthread_create(&tid_copydhConsumer, 0, &CopyDH::consume, 0);
//Wait for produder and consumer host threads to terminate
pthread_join(tid_allocProducer, 0);
pthread_join(tid_allocConsumer, 0);
pthread_join(tid_copyhdConsumer, 0);
pthread_join(tid_funcConsumer, 0);
pthread_join(tid_copydhConsumer, 0);
//Destroy semaphores
sem_destroy(&Pipe::ptrShared->allocMutex);
sem_destroy(&Pipe::ptrShared->allocNempty);
sem_destroy(&Pipe::ptrShared->allocNstored);
sem_destroy(&Pipe::ptrShared->copyhdMutex);
sem_destroy(&Pipe::ptrShared->copyhdNempty);
sem_destroy(&Pipe::ptrShared->copyhdNstored);
sem_destroy(&Pipe::ptrShared->funcMutex);
sem_destroy(&Pipe::ptrShared->funcNempty);
sem_destroy(&Pipe::ptrShared->funcNstored);
sem_destroy(&Pipe::ptrShared->copydhMutex);
sem_destroy(&Pipe::ptrShared->copydhNempty);
sem_destroy(&Pipe::ptrShared->copydhNstored);
}
/**
* Runs the pipeline I/O scheduling system.
*/
void Pipe::runPipe() {
Device device;
// Set memory map object to map the host memory into the address space of
// multiple host threads.
ptrShared = (SHARED *) mmap(NULL, sizeof(SHARED), PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANON, -1, 0);
ptrShared = &shared;

98

// Enable cuda device
hFunction = new CUfunction[1];
// Un/enable dependence subsystem
if (DEPENDENCE == ON)
device.setDevice();
// Create the number of streams according to the number of patitions that
// is specified by the users.
Pipe::stream = new CUstream[Pipe::ptrShared->numChunk];
// Create a temporal vector according to the size of data blocks.
Pipe::ptrShared->devPtr = new CUdeviceptr[Pipe::ptrShared->
numChunk*Pipe::ptrShared->vectorHost.size()];
// Initialize the streams according to the number of patitions.
for(int i = 0; i < Pipe::ptrShared->numChunk; i++)
cuStreamCreate(&Pipe::stream[i], 0);
// Invoke the runtime I/O pipeline scheduling system.
setProducerConsumer();
// Synchronize streams.
for(int i = 0; i < Pipe::ptrShared->numChunk; i++)
cuStreamSynchronize(Pipe::stream[i]);
// Free device memory.
for(int j = 0; j < Pipe::ptrShared->numChunk*Pipe::ptrShared->
vectorHost.size(); j++) {
cuMemFree(Pipe::ptrShared->devPtr[j]);
}
// Destroy streams.
for(int i = 0; i < Pipe::ptrShared->numChunk; i++)
cuStreamDestroy(Pipe::stream[i]) ;
// Destroy the context.
cuCtxDetach(hContext);
}

99

Appendix L: RPIOSS Shared Memory Subsystem Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Shared memory subsystem
Developer: Julio Olaya
Date: June 30, 2014
Description:
The host shared memory subsystem, which runs in user space, handles
RPIOSS
data
structures
used
to
access
application
data
and
synchronization resources. The host shared memory subsystem maps memory
into the address spaces of independent host threads that share a memory
region and removes the need to use operating system calls to exchange
data and additional information required by RPIOSS. This subsystem also
controls shared structures that store synchronization and application
variables that enable the virtual layers and the manager to handle
application parameters and data with little overhead. This subsystem
uses synchronization and mutual exclusion for robust management of
shared application data. In addition, this subsystem controls set
variables and functions that keep track of dependences among parameters
of multi-kernel applications.
--------------------------------------------------------------------------*/
#ifndef SHARED_H_
#define SHARED_H_
#include
#include
#include
#include
#include

<semaphore.h>
<vector>
<string>
<map>
<set>

using namespace std;
// Flags for specifying if the variable is: input, output, input and output.
#define IN 0
#define OUT 1
#define INOUT 2
#define DEP 0
// Flags for specifying the semaphores.
#define SEM_ALLOCMUTEX "allocMutex"
#define SEM_ALLOCNEMPTY "allocNempty"
#define SEM_ALLOCNSTORED "allocNstored"
#define SEM_COPYHDCMUTEX "copyhdMutex"
#define SEM_COPYHDNEMPTY "copyhdNempty"
#define SEM_COPYHDNSTORED "copyhdNstored"
#define SEM_FUNCMUTEX "funcMutex"
#define SEM_FUNCNEMPTY "funcNempty"
#define SEM_FUNCNSTORED "funcNstored"
#define SEM_COPYDHMUTEX "copydhMutex"
#define SEM_COPYDHNEMPTY "copydhNempty"

100

#define SEM_COPYDHNSTORED "copydhNstored"
// Structure that contains a Cuda device pointer.
typedef struct {
CUdeviceptr devPtr;
} DEVICE;
// Defining name for the map type used
typedef map< int *, int, less< int * > > MAP_DEV_INT;
typedef map< float *, int, less< float * > > MAP_DEV_FLOAT;
typedef map< double *, int, less< double * > > MAP_DEV_DOUBLE;
typedef map< int , int, less< int > > MAP_SCA_INT;
typedef map< float , int, less< float > > MAP_SCA_FLOAT;
typedef map< double , int, less< double > > MAP_SCA_DOUBLE;
// Defining name
typedef set<int,
typedef set<int,
typedef set<int,

for the set type used
less< int > > SET_GPU_FUN;
less < int > > SET_DEV_VAR;
less < int > > SET_SCA_VAR;

// Contain the pointer for device memory for int, float, and double type.
typedef struct {
int type;
int flow;
int * ptrIntData;
float * ptrFloatData;
double * ptrDoubleData;
} HOST;
// Contain the scalar variables for int, float, and double type.
typedef struct {
int type;
int scalarIntData;
float scalarFloatData;
double scalarDoubleData;
} SCALAR;
// Contain the variables used to configure the grid
typedef struct {
string nameKernel;
// Name of the kernel function
int bx;
// block x-dimension
int by;
// block y-dimension
int bz;
// block z-dimension
int tx;
// thread x-dimension
int ty;
// thread y-dimension
int tz;
// thread z-dimension
int shMem;
// size of share memory in bytes
string namePTX;
// name of the ptx file
int sizeVector;
// size of the vector
int numChunk;
// number of chunks
vector<int> vectorDevId; // contain vector device ids
vector<int> vectorDevId2; // vector ids for dependences
vector<int> vectorScaId;
// scalar ids for dependences
} GRID;

101

// Contain the host memory variables
typedef struct {
int dep;
int indexPackage;
//
int numVariables;
//
HOST host;
//
SCALAR scalar;
//
GRID grid;
//
CUdeviceptr * devPtr;
//
DEVICE * vectorDevice;
//
vector<int> vtrChunkSize;
//
vector<int> vtrStartIndex;
//
vector<HOST> vectorHost;
//
vector<SCALAR> vectorScalar;
//
vector<GRID> vectorGrid;
//
MAP_DEV_INT mapDevIntId;
//
MAP_DEV_FLOAT mapDevFloatId;
//
MAP_DEV_DOUBLE mapDevDoubleId;
//
MAP_SCA_INT mapScaIntId;
//
MAP_SCA_FLOAT mapScaFloatId;
//
MAP_SCA_DOUBLE mapScaDoubleId;
//
SET_GPU_FUN *set_gpu_fun;
//
SET_DEV_VAR *set_dev_var;
//
SET_SCA_VAR *set_sca_var;
//
sem_t allocMutex;
//
sem_t allocNempty;
//
sem_t allocNstored;
//
sem_t copyhdMutex;
//
sem_t copyhdNempty;
//
sem_t copyhdNstored;
//
sem_t funcMutex;
//
sem_t funcNempty;
//
sem_t funcNstored;
//
sem_t copydhMutex;
//
sem_t copydhNempty;
//
sem_t copydhNstored;
//
} SHARED;

index for packages (sets)
number of variables
host variables structure
device variables structure
grid variable strucuture
device pointer
device container
size of the chunks
start indexes
host vector
scalar vector containter
grid vecor container
map variable for integer vector
map variable for float vector
map variable for double vector
map variable for scalar integer
map variable for scalar float
map variable for scalar double
set variable for kernel functions
set variable for device vectors
set variable for scalar variables
binary semaphore for allocation
empty counter semaphore for alloc
stored counter semaphore for alloc
binary semaphore for copy h-d
empty counter semaphore for coph h-d
stored counter semaphore for alloc
binary semaphore for kernel
empty counter semaphore for alloc
stored counter semaphore for kernel
binary semaphore for copy d-h
empty counter semaphore for copy d-h
stored counter semap for copy d-h

extern "C" SHARED shared;
#endif /* SHARED_H_ */

102

Appendix M: RPIOSS Dependence Subsystem Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Dependence Subsystem
Developer: Julio C. Olaya
Date: August 22, 2013
Description:
The dependence subsystem automatically checks kernel parameters for
dependences when an application invokes two or more kernels. Using set
functions for dynamic analysis, this subsystem removes redundant data
transfers between the host and the device in either direction, thus
reducing the cost of allocating and deallocating storage, transferring
data, and rewriting data in device memory.
--------------------------------------------------------------------------*/
#include <iostream>
#include "Depen.h"
#include "Shared.h"
using namespace std;
/*
* return the number of set.
*/
int Depen::getCounterPackage() { return counterPackage; }
/*
* Create the set for the device vector variable.
*/
void Depen::setDepenDev() {
int tmp;
int flag = 0;
counterPackage = 0;
shared.set_gpu_fun = new SET_GPU_FUN[shared.vectorGrid.size()];
shared.set_dev_var = new SET_DEV_VAR[shared.vectorGrid.size()];
pair<SET_GPU_FUN::const_iterator, bool> pFun;
pair<SET_DEV_VAR::const_iterator, bool> pVar;
pFun = shared.set_gpu_fun[counterPackage].insert(0);
for(int i = 0; i < shared.vectorGrid.at(0).vectorDevId.size(); i++) {
pVar =
shared.set_dev_var[counterPackage].insert(shared.vectorGrid.at(0).vectorDevId.at(i)
);
}

103

set<int>::iterator it;
for(int i = 1; i < shared.vectorGrid.size(); i++) {
for(int j = 0; j < shared.vectorGrid.at(i).vectorDevId.size(); j++) {
tmp = shared.vectorGrid.at(i).vectorDevId.at(j);
const bool isElem = shared.set_dev_var[counterPackage].find(tmp) !=
shared.set_dev_var[counterPackage].end();
if(isElem == 1) {
flag = 1;
break;
}
}
if(flag == 1) {
pFun = shared.set_gpu_fun[counterPackage].insert(i);
for(int j = 0; j < shared.vectorGrid.at(i).vectorDevId.size(); j++) {
pVar =
shared.set_dev_var[counterPackage].insert(shared.vectorGrid.at(i).vectorDevId.at(j)
);
}
flag = 0;
}
else {
counterPackage++;
pFun = shared.set_gpu_fun[counterPackage].insert(i);
for(int j = 0; j < shared.vectorGrid.at(i).vectorDevId.size(); j++) {
pVar =
shared.set_dev_var[counterPackage].insert(shared.vectorGrid.at(i).vectorDevId.at(j)
);
}
}
}
}
/*
* Create set for the scalar variables
*/
void Depen::setDepenSca() {
int tmp;
int flag = 0;
counterPackage2 = 0;
shared.set_sca_var = new SET_SCA_VAR[shared.vectorGrid.size()];
pair<SET_SCA_VAR::const_iterator, bool> pVar;
for(int i = 0; i < shared.vectorGrid.at(0).vectorScaId.size(); i++) {
pVar =
shared.set_sca_var[counterPackage2].insert(shared.vectorGrid.at(0).vectorScaId.at(i
));
}

104

set<int>::iterator it;
for(int i = 1; i < shared.vectorGrid.size(); i++) {
for(int j = 0; j < shared.vectorGrid.at(i).vectorScaId.size(); j++) {
tmp = shared.vectorGrid.at(i).vectorScaId.at(j);
const bool isElem = shared.set_sca_var[counterPackage2].find(tmp) !=
shared.set_sca_var[counterPackage2].end();
if(isElem == 1) {
flag = 1;
break;
}
}
if(flag == 1) {
for(int j = 0; j < shared.vectorGrid.at(i).vectorScaId.size(); j++) {
pVar =
shared.set_sca_var[counterPackage2].insert(shared.vectorGrid.at(i).vectorScaId.at(j
));
}
flag = 0;
}
else {
counterPackage2++;
for(int j = 0; j < shared.vectorGrid.at(i).vectorScaId.size(); j++) {
pVar =
shared.set_sca_var[counterPackage2].insert(shared.vectorGrid.at(i).vectorScaId.at(j
));
}
}
}
}

105

Appendix N: RPIOSS Storage Allocation Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Pipeline Subsystem
Developer: Julio Olaya
Date: July 15, 2013
Description:
This program implements the pipeline allocation stage, which defines
blocks of data to transfer and allocates storage in the device global
memory.
--------------------------------------------------------------------------*/
#include
#include
#include
#include
#include
#include

<iostream>
<vector>
"ErrorDevice.h"
"Pipe.h"
"Alloc.h"
"Shared.h"

using namespace std;
// variable for tracking the blocks of data for the allocation stage
int Alloc::k=0;
/**
* Produce the items (the size of data blocks, the start and end index for
* of the blocks.
*/
void * Alloc::produce(void *) {
int endIndex;
int startIndex;
for(int i = 0; i < Pipe::ptrShared->numChunk; i++) {
sem_wait(&Pipe::ptrShared->allocNempty);
sem_wait(&Pipe::ptrShared->allocMutex);
startIndex = i * Pipe::ptrShared->sizeVector / Pipe::ptrShared->numChunk;
endIndex = (i+1) * Pipe::ptrShared->sizeVector / Pipe::ptrShared->numChunk;
Pipe::ptrShared->vtrStartIndex.push_back(startIndex);
Pipe::ptrShared->vtrChunkSize.push_back(endIndex-startIndex);
sem_post(&Pipe::ptrShared->allocMutex);
sem_post(&Pipe::ptrShared->allocNstored);
}
return 0;
}

106

/**
* Consume the items to allocate memory space on the GPU for the blocks of
* data.
*/
void * Alloc::consume(void *) {
int size;
for(int i = 0; i < Pipe::ptrShared->numChunk; i++) {
sem_wait(&Pipe::ptrShared->copyhdNempty);
sem_wait(&Pipe::ptrShared->copyhdMutex);
sem_wait(&Pipe::ptrShared->allocNstored);
sem_wait(&Pipe::ptrShared->allocMutex);
// Push the context into the CUDA context stack
Pipe::err = cuCtxPushCurrent(Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA create (allocation).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
for(int j = 0; j < Pipe::ptrShared->vectorHost.size(); j++) {
switch (Pipe::ptrShared->vectorHost.at(j).type) {
case 0:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(int);
k = i * Pipe::ptrShared->vectorHost.size() + j;
checkCudaErrors(cuMemAlloc(&Pipe::ptrShared->devPtr[k], size));
break;
case 1:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(float);
k = i * Pipe::ptrShared->vectorHost.size() + j;
checkCudaErrors(cuMemAlloc(&Pipe::ptrShared->devPtr[k], size));
break;
case 2:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(double);
k = i * Pipe::ptrShared->vectorHost.size() + j;
checkCudaErrors(cuMemAlloc(&Pipe::ptrShared->devPtr[k], size));
break;
}
}
// Pop the context from the CUDA context stack
Pipe::err = cuCtxPopCurrent(&Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA end (allocation).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
sem_post(&Pipe::ptrShared->copyhdMutex);
sem_post(&Pipe::ptrShared->copyhdNstored);
sem_post(&Pipe::ptrShared->allocMutex);
sem_post(&Pipe::ptrShared->allocNempty);
}
return 0;
}

107

Appendix O: RPIOSS Copy Host-Device Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Pipeline Subsystem
Developer: Julio Olaya
Date: July 29, 2013
Description:
Once the allocation pipeline stage allocates storage for at least one
data block, this stage copies data from the host to the device.
--------------------------------------------------------------------------*/
#include
#include
#include
#include
#include
#include

<iostream>
<vector>
"ErrorDevice.h"
"Pipe.h"
"CopyHD.h"
"Shared.h"

using namespace std;
// Variable for tracking the number of block of data.
int CopyHD::k=0;
/**
* Consume the items (the size of data blocks) and produce the item for
* next stage.
*/
void * CopyHD::consume(void *) {
int size;
int index;
set<int>::iterator j;
for(int i = 0; i < Pipe::ptrShared->numChunk; i++) {
sem_wait(&Pipe::ptrShared->funcNempty);
sem_wait(&Pipe::ptrShared->funcMutex);
sem_wait(&Pipe::ptrShared->copyhdNstored);
sem_wait(&Pipe::ptrShared->copyhdMutex);
// Push the context into the CUDA context stack
Pipe::err = cuCtxPushCurrent(Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA create (copy h-d).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};

108

for (j=Pipe::ptrShared->set_dev_var[Pipe::ptrShared->
indexPackage].begin(); j != Pipe::ptrShared->
set_dev_var[Pipe::ptrShared->indexPackage].end(); ++j) {
index = *j - Pipe::ptrShared->firstVar;
CopyHD::k = i * Pipe::ptrShared->numVariables + index;
if(Pipe::ptrShared->vectorHost.at(*j).flow == 0 || Pipe::ptrShared->
vectorHost.at(*j).flow == 2) {
switch (Pipe::ptrShared->vectorHost.at(*j).type) {
case 0:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(int);
checkCudaErrors(cuMemcpyAsync(Pipe::ptrShared->devPtr[CopyHD::k],
( CUdeviceptr) Pipe::ptrShared->vectorHost.at(*j).ptrIntData+i*size,
size, Pipe::stream[i]));
break;
case 1:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(float);
checkCudaErrors(
cuMemcpyAsync(Pipe::ptrShared->devPtr[CopyHD::k],
( CUdeviceptr)Pipe::ptrShared>vectorHost.at(*j).ptrFloatData+i*size,
size, Pipe::stream[i]));
break;
case 2:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(double);
checkCudaErrors(
cuMemcpyAsync(Pipe::ptrShared->devPtr[CopyHD::k],
(CUdeviceptr) Pipe::ptrShared->
vectorHost.at(*j).ptrDoubleData+i*size,
size, Pipe::stream[i]));
break;
}
}
}
// Pop the context from the CUDA context stack
Pipe::err = cuCtxPopCurrent(&Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA end (copy h-d).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
sem_post(&Pipe::ptrShared->funcMutex);
sem_post(&Pipe::ptrShared->funcNstored);
sem_post(&Pipe::ptrShared->copyhdMutex);
sem_post(&Pipe::ptrShared->copyhdNempty);
}
return 0;
}

109

Appendix P: RPIOSS Kernel Launcher Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Pipeline Subsystem
Developer: Julio Olaya
Date: July 30, 2013
Description:
This pipeline stage invokes a kernel subgrid as soon as data blocks
have been transferred to the device.
--------------------------------------------------------------------------*/
#include
#include
#include
#include
#include
#include

<iostream>
<vector>
"ErrorDevice.h"
"Pipe.h"
"Function.h"
"Shared.h"

using namespace std;
// Variable for tracking the block of data.
int Function::k=0;
/**
* Consume the items (the size of data blocks) and produce the items for
* the next stage.
*/
void * Function::consume(void *) {
int size;
int j, m,index, scalarSize, vecSize;
set<int>::iterator it;
for(int i = 0; i < Pipe::ptrShared->numChunk; i++) {
sem_wait(&Pipe::ptrShared->copydhNempty);
sem_wait(&Pipe::ptrShared->copydhMutex);
sem_wait(&Pipe::ptrShared->funcNstored);
sem_wait(&Pipe::ptrShared->funcMutex);
// Push the context into the CUDA context stack
Pipe::err = cuCtxPushCurrent(Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA create (Function).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};

110

// Associate the subset to the kernel functions.
for (it=Pipe::ptrShared->set_gpu_fun[Pipe::ptrShared->indexPackage].begin();
it != Pipe::ptrShared->set_gpu_fun[Pipe::ptrShared->indexPackage].end();
++it) {
// Load device variables
scalarSize = Pipe::ptrShared->vectorGrid.at(*it).vectorScaId.size();
vecSize = Pipe::ptrShared->vectorGrid.at(*it).vectorDevId.size();
void * args[scalarSize+vecSize+1];
args[scalarSize+vecSize] = &Pipe::ptrShared->vtrChunkSize.at(i);
for(j = 0; j < vecSize; j++) {
index = Pipe::ptrShared->vectorGrid.at(*it).vectorDevId.at(j) –
Pipe::ptrShared->firstVar;
Function::k = i * Pipe::ptrShared->numVariables + index;
args[j] = &Pipe::ptrShared->devPtr[Function::k];
}
// load scalar values
if(scalarSize > 0) {
for(m = 0; m < scalarSize; m++) {
index = Pipe::ptrShared->vectorGrid.at(*it).vectorScaId.at(m);
switch(Pipe::ptrShared->vectorScalar.at(index).type) {
case 0: args[j++] =
&Pipe::ptrShared->vectorScalar.at(index).scalarIntData;
break;
case 1: args[j++] =
&Pipe::ptrShared->vectorScalar.at(index).scalarFloatData;
break;
case 2: args[j++] =
&Pipe::ptrShared->vectorScalar.at(index).scalarDoubleData;
break;
}
}
}
// Launch the kernel functions
checkCudaErrors(cuLaunchKernel(Pipe::hFunction[*it],
Pipe::ptrShared->vectorGrid.at(*it).bx,
Pipe::ptrShared->vectorGrid.at(*it).by,
Pipe::ptrShared->vectorGrid.at(*it).bz,
Pipe::ptrShared->vectorGrid.at(*it).tx,
Pipe::ptrShared->vectorGrid.at(*it).ty,
Pipe::ptrShared->vectorGrid.at(*it).tz,
Pipe::ptrShared->vectorGrid.at(*it).shMem,
Pipe::stream[i],
args,
0));
if ( Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA create Kernel
1(function) .\n");
cuCtxDetach( Pipe::hContext);
exit(-1);
};
}

111

// Pop the context from the context stack
Pipe::err = cuCtxPopCurrent(&Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA end (function).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
sem_post(&Pipe::ptrShared->copydhMutex);
sem_post(&Pipe::ptrShared->copydhNstored);
sem_post(&Pipe::ptrShared->funcMutex);
sem_post(&Pipe::ptrShared->funcNempty);
}
return 0;
}

112

Appendix Q: RPIOSS Copy Device-Host Program

/*------------------------------------------------------------------------System: RPIOSS
Module: Pipeline Subsystem
Developer: Julio Olaya
Date:
Description:
This pipeline stage copies results from the device to the host and
deallocates device storage if not needed for further processing on the
device.
--------------------------------------------------------------------------*/
#include
#include
#include
#include
#include
#include

<iostream>
<vector>
"ErrorDevice.h"
"Pipe.h"
"CopyDH.h"
"Shared.h"

using namespace std;
// Variable for tracking block of data
int CopyDH::k=0;
/**
* Consume the items (the size of data blocks).
* of the blocks.
*/
void * CopyDH::consume(void *) {
int size;
int index;
set<int>::iterator j;
for(int i = 0; i < Pipe::ptrShared->numChunk; i++) {
sem_wait(&Pipe::ptrShared->copydhNstored);
sem_wait(&Pipe::ptrShared->copydhMutex);
// Push context into CUDA context stack
Pipe::err = cuCtxPushCurrent(Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA create (copy d-h).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};

113

for (j=Pipe::ptrShared->set_dev_var[Pipe::ptrShared->indexPackage].begin();
j != Pipe::ptrShared->set_dev_var[Pipe::ptrShared->indexPackage].end(); ++j){
index = *j - Pipe::ptrShared->firstVar;
CopyDH::k = i * Pipe::ptrShared->numVariables + index;
if(Pipe::ptrShared->vectorHost.at(*j).flow == 1 ||
Pipe::ptrShared->vectorHost.at(*j).flow == 2) {
switch (Pipe::ptrShared->vectorHost.at(*j).type) {
case 0:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(int);
checkCudaErrors(cuMemcpyAsync(( CUdeviceptr)
Pipe::ptrShared->vectorHost.at(*j).ptrIntData+i*size,
Pipe::ptrShared->devPtr[CopyDH::k],size, Pipe::stream[i]));
break;
case 1:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(float);
checkCudaErrors(cuMemcpyAsync(( CUdeviceptr)
Pipe::ptrShared->vectorHost.at(*j).ptrFloatData+i*size,
Pipe::ptrShared->devPtr[CopyDH::k],size, Pipe::stream[i]));
break;
case 2:
size = Pipe::ptrShared->vtrChunkSize.at(i) * sizeof(double);
checkCudaErrors(cuMemcpyAsync(( CUdeviceptr)
Pipe::ptrShared->vectorHost.at(*j).ptrDoubleData+i*size,
Pipe::ptrShared->devPtr[CopyDH::k],size, Pipe::stream[i]));
break;
}
}
}
// Pop context from CUDA context stack
Pipe::err = cuCtxPopCurrent(&Pipe::hContext);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr, "* Error initializing the CUDA end (copy d-h).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
sem_post(&Pipe::ptrShared->copydhMutex);
sem_post(&Pipe::ptrShared->copydhNempty);
}
return 0;
}

114

Appendix R: Device program
/*------------------------------------------------------------------------System: RPIOSS
Module: Pipeline Subsystem
Developer: Julio Olaya
Date:
Description:
This program creates and initializes the device, module and context.
--------------------------------------------------------------------------*/
#include
#include
#include
#include
#include

<iostream>
<string>
"Pipe.h"
"ErrorDevice.h"
"Device.h"

using namespace::std;
void Device::setDeviceDep()
{
const char * tmp;
// create CUDA device & context
cuInit(0);
int deviceCount = 0;
cuDeviceGetCount(&deviceCount);
if (deviceCount == 0) {
printf("There is no device supporting CUDA.\n");
exit (0);
} else {
printf("number of devices = %d\n", deviceCount);
}
checkCudaErrors(cuDeviceGet(&Pipe::hDevice, 0)); // pick first device
// Creeate the cuda context
Pipe::err = cuCtxCreate(&Pipe::hContext, 0, Pipe::hDevice);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr,
"* Error initializing the CUDA create 0 (device).\n");
cuCtxDetach(Pipe::hContext);
exit(-1);
};
// Create module from PTX file
tmp = new char[Pipe::ptrShared->vectorGrid.at(0).namePTX.size()+1];
tmp = const_cast<char *>
(Pipe::ptrShared->vectorGrid.at(0).namePTX.c_str());

115

// Load module
Pipe::err = cuModuleLoad(&Pipe::hModule, tmp);
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr,
"* Error initializing the CUDA module 1 (device).\n");
cuModuleUnload(Pipe::hModule);
cuCtxDetach(Pipe::hContext);
exit(-1);
};
// Get function handle from module
for(int i = 0; i < Pipe::ptrShared->vectorGrid.size(); i++) {
tmp = new char[Pipe::ptrShared->vectorGrid.at(i).nameKernel.size()+1];
tmp = const_cast<char *>
(Pipe::ptrShared->vectorGrid.at(i).nameKernel.c_str());
Pipe::err= cuModuleGetFunction(&Pipe::hFunction[i], Pipe::hModule, tmp);
}
if (Pipe::err != CUDA_SUCCESS) {
fprintf(stderr,
"* Error initializing the CUDA function 2 (device).\n");
cuModuleUnload(Pipe::hModule);
cuCtxDetach(Pipe::hContext);
exit(-1);
};
}

116

Appendix S: Reference for RPIOSS API Functions
rpiossInit(string ptxName, int vectorSize, int numPartitions, dependenceFlag, subgridFlag)
• ptxName – name of PTX file.
• vectorSize – size of the host vector.
• numPartitions – number of chunks.
• dependenceFlag – turn on/off dependence checking in kernel parameters.
• subgridFlag – turn on/off partitioning of grid into subgrids.
rpiossData(int *data, IOFlag, zeroCopyFlag)
• data – name of a vector or scalar of type integer.
• IOFLAG – input/output parameter declaration.
• zeroCopyFlag – turn on/off zero-copy memory access.
rpiossData(float *data, IOFlag, zeroCopyFlag)
• data – name of a vector or scalar of type float.
• IOFLAG – input/output parameter declaration.
• zeroCopyFlag – turn on/off zero-copy memory access.
rpiossData(double *data, IOFlag, zeroCopyFlag)
• data – name of a vector or scalar of type double.
• IOFLAG – input/output parameter declaration.
• zeroCopyFlag – turn on/off zero-copy memory access.
rpiossGrid(string kernelName, Bx, By, Bz, Tx, Ty, Tz, Ms)
• kernelName – name of the kernel function
• Bx – dimension of grid in x-axis.
• By - dimension of grid in y-axis.
• Bz - dimension of grid in z-axis.
• Tx – dimension of block in x-axis.
• Ty - dimension of block in y-axis.
• Tz - dimension of block in z-axis.
• Ms – Shared memory size

117

Appendix T: How to use RPIOSS

The following are sample steps to use RPIOSS:
•

Include in the application source file the RPIOSS header file (Pipeline.h).

•

Use the RPIOSS APIs shown in Figure 4.4, instead of using CUDA APIs shown in
Figure 4.5.

•

Generate a PTX file with the kernel function using a command like the following:
nvcc –ptx filename.cu

•

Compile RPIOSS using the following makefile template to create the executable:

CUDA_PIPE=(location of RPIOSS source)
CUDA_PATH=(location of the CUDA binaries)
all: build
build: filename
filename.o: filename.cu
filename.o=filename.cu
nvcc –I(CUDA_PATH) –o $@ -c $<
$(CUDA_PIPE)/device.o = $(CUDA_PIPE)/device.cpp
gcc –I(CUDA_PATH) –o $@ -c $<
$(CUDA_PIPE)/allocation.o = $(CUDA_PIPE)/allocation.cpp
gcc –I(CUDA_PATH) –o $@ -c $<
$(CUDA_PIPE)/copyHD.o = $(CUDA_PIPE)/copyHD.cpp
gcc –I(CUDA_PATH) –o $@ -c $<
$(CUDA_PIPE)/function.o = $(CUDA_PIPE)/function.cpp
gcc –I(CUDA_PATH) –o $@ -c $<
$(CUDA_PIPE)/copyDH.o = $(CUDA_PIPE)/copyDH.cpp
gcc –I(CUDA_PATH) –o $@ -c $<
filename.o: $(CUDA_PIPE)/device.o $(CUDA_PIPE)/allocation.o
$(CUDA_PIPE)/function.o $(CUDA_PIPE)/copyDH.o

118

$(CUDA_PIPE)/copyHD.o

Vita
Julio Cesar Olaya was born in Ibague, Colombia. He is the first son of Julio Olaya Hernandez
and Luz Stella Builes Poveda. He received his Bachelor of Science degree in Computer Science in the
spring of 2006 at the University of Texas at El Paso and his Master of Science degree in Computer
Science in the summer of 2008 at the Polytechnic University of Puerto Rico. In the fall of 2008, he
started to pursue a doctorate in Computer Science at The University of Texas at El Paso. While pursuing
this doctorate, he worked as a Research Associate at the Cyber-ShARE Center of Excellence under the
supervision of Dr. Rodrigo Romero and Dr. Aaron Velasco. Julio started by designing and implementing
algorithms to accelerate seismic traveltime tomography. After that, he worked on visualization of
uncertainty in 3D seismic tomography; the results of this work were published in the proceedings of the
Computing Alliance of Hispanic-Serving Institutions (CAHSI) 2010 Annual Conference. In the summer
of 2013, he participated in an internship at the Texas Advanced Computing Center (TACC) in UT
Austin where he designed and developed an interactive parallelization tool. Results of this project were
published in the proceedings of Extreme Science and Engineering Discovery Environment’14
(XSEDE’14). At the same conference, Julio published as the principal author work on a runtime
middleware system for GPGPU applications.

Permanent address: Altos de Tivoli
Calle 18 No. 42-00
Casa 23
Neiva, Huila, Colombia

119

