Filesystem integration for a GPU based Java virtual machine by Thornton, Noah Jacob
Copyright
by
Noah Jacob Thornton 
2020
The Thesis Committee for Noah Jacob Thornton
Certifies that this is the approved version of the following thesis:
Filesystem Integration for a
GPU Based Java Virtual Machine
SUPERVISING COMMITTEE:
Christopher J. Rossbach, Supervisor
Milos Gligoric









for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
The University of Texas at Austin
August 2020
Dedicated to my parents, Duane and Jami.
Acknowledgments
Many people have helped me during my education. I wish to thank them all,
but some deserve special recognition.
I would like to thank Dr. Alison Norman, who has helped me on more
occasions than I can count. Her Operating Systems course inspired my interest
in systems. She gave me the privilege of being a teaching assistant for this
same course, allowing me to inspire students in the same way she inspired
me. But most importantly, Alison has been a role model, a mentor, and a friend
throughout my time at UT.
I also want to thank Dr. Chris Rossbach, who has generously helped
me complete my thesis. Additionally, his Advanced Operating Systems and
Concurrency courses were my favorite courses. Without this exposure I don’t
think I would have had enough passion to make it this far.
Lastly, I want to thank Dr. Milos Gligoric who has dedicated much
time and effort to help me with my thesis. He has provided immense amounts
of valuable feedback, which has helped me form a better thesis than I could’ve
ever written alone.
Filesystem Integration for a
GPU Based Java Virtual Machine
Noah Jacob Thornton, M.S.Comp.Sci.
The University of Texas at Austin, 2020
Supervisor: Christopher J. Rossbach
Modern computers utilize many accelerator devices alongside traditional
CPUs. These devices provide additional performance or functionality
to the system for specialized workloads. We specifically investigate graphics
processing units (GPUs), which have become widespread in enterprise and
consumer computers. GPUs can enable demanding graphics-based workloads,
but they can also be used to perform more flexible general purpose computation
(GPGPU). Currently, the most popular GPGPU workloads are related
to machine learning, such as training a neural network. Even so, there are
many other highly parallel workloads that can take advantage
of the GPU architecture. This project focuses on extending GVM, a Java






List of Tables
List of Figures
1 Introduction
1.1 Goals
1.2 Motivation
2 Background
2.1 Accelerators
2.2 Nvidia Hardware
2.2.1 Thread Divergence
2.2.2 GPU Memory
2.2.3 Unified Virtual Addressing
2.2.4 Fully Unified Memory
2.3 GPGPU Programming Model
2.4 GPGPU Execution Model
3 GVM: A GPU-based Java Bytecode Interpreter
3.1 Implementation
3.1.1 Packaging
3.1.2 Transferring
3.1.3 Scheduling
3.1.4 Interpreting
3.2 GVM Limitations
4 GPU File System
4.1 GPU Communication
4.2 File System Semantics
4.2.1 Read And Write
4.2.2 Open and Close
4.2.3 The Remaining Functions
4.3 File Usage
5 GVM File System
5.1 Interface
5.2 RPC
5.2.1 Synchronization
5.3 Memory Management
5.4 Filesystem General Operations
5.5 GVM File System Integration
5.6 Limitations
6 Evaluation
6.1 Memory Stress Test
6.1.1 Results
6.2 RPC Latency
6.2.1 Results
6.3 GVM Performance
6.3.1 Results
7 Related Work
7.1 Networking for GPUs
7.2 FPGA Accelerator Filesystems
7.3 Generic System Calls For GPUs
7.4 High-level GPGPU Programming
7.5 GPU Scheduling
8 Future Work
8.1 Interface





5.1 GVMfs CUDA API
6.1 System Specifications
List of Figures
2.1 Microprocessor Trends (data collection by M. Horowitz, F. Labonte,
O. Shacham, K. Olukotun, L. Hammond, C. Batten, and K. Rupp [38])
2.2 Example Divergence
4.1 GPUfs Overview Diagram (original figure from [41])
5.1 RPC Request Struct
5.2 GVMfs Organization
5.3 GVMfs Logical Flow Chart
6.1 GPU File Backed Memory Test
6.2 Random R/W Total Time (Transfer + Kernel)
6.3 Random R/W Total Kernel Time
6.4 Sequential Read Total Time (Transfer + Kernel)
6.5 RPC Performance
6.6 Checksum Benchmark Algorithm




Since the turn of the century, the rate at which single-threaded CPU per-
formance increases each generation has been slowing. This has driven the
need for architects to find other ways to increase performance for computa-
tionally demanding workloads. One approach is to redesign algorithms such
that they take advantage of parallelism. Parallel algorithms can experience
massive speedups on modern parallel architectures, such as multi-core CPUs
and GPUs.
GPUs are a form of extremely parallel processors and excel at workloads
that exhibit abundant parallelism, but provide mediocre performance in
single-threaded applications and non-vectorizable workloads. In order to enable
programmers to get the most performance out of GPU hardware, hardware
vendors provide low level GPGPU programming interfaces. One such interface
is CUDA: CUDA extends systems programming languages such as C/C++.
Programmers must use these provided interfaces to manage various aspects
of GPU program execution. Proper management for a GPGPU program is
more nuanced than for an average CPU program, due to the amount of hardware
architecture that the runtime exposes. This is compounded by the GPU’s sensitivity
to poor memory management and access patterns. In addition, CUDA does not
provide support for some CPU programming features. In particular, GPGPU
programs lack access to a system call interface.
To fill this void, we explore the feasibility of providing a filesystem
interface to GPGPU programs. We integrate this new GPU filesystem with
GVM, a Java interpreter for GPUs [4]. Enabling Java programs to execute
on a GPU allows preexisting Java programs that utilize the filesystem to run
on device, unmodified. This provides a means to write high-level code that
targets GPUs via the Java language, ultimately simplifying the programming
experience. Before GVMfs, GVM, like other GPGPU code, had no access to
the filesystem. GVMfs fills this void, providing access to persistent storage via
a filesystem API.
1.1 Goals
The goals of this thesis are as follows:
• Provide an overview of parallel GPU architecture, specifically Nvidia
GPUs with respect to GPGPU programming.
• Provide an overview of GVM and existing GPU file system research.
• Detail a design of a GPU file system interface that utilizes unified mem-
ory to provide a GPU ↔ CPU RPC.
• Enumerate the modifications to GVM necessary to enable file support
for Java programs.
1.2 Motivation
GPU programming has limitations that are not encountered in comparable
CPU programming. This is a barrier to entry that is typically mitigated in a few
ways:
• frameworks that provide a higher level interface to the GPU [3]
• libraries to aid low level programming [13,41]
• tools to abstract away the underlying hardware completely [4]
These approaches have different drawbacks and benefits, and require different
design approaches. If we combine aspects of these disjoint approaches,
we can minimize the drawbacks of any single technique while providing
more functionality. Thus, research into tooling and frameworks in this area
will further reduce the difficulty of programming GPUs.
GVM in particular falls into the abstraction category, by providing a
way to run unmodified Java programs. But GVM doesn’t support all Java
programs, and requires specific engineering for many of the programs it does
support. While this is adequate for showing that such a design is feasible, if GVM
supported more Java features, specifically file IO, a greater breadth of Java
programs could be executed. GVMfs seeks to provide this missing functionality,




The computing industry is always chasing the maximum attainable performance,
and this can be done in software, hardware, or both. With respect to
hardware, CPU performance historically increased approximately 100%
every two years until 2005 [2].
Moore’s law and Dennard scaling suggest that transistor density will
double every 18 months [33]. Intuitively, doubling the transistors per chip
should double the performance, but this is not the case with modern chip
design and architectures. For example, the latest AMD architecture transition
only yielded a 13% increase in single-threaded CPU performance [45]. These
gains are a long way from doubling performance, and the cause is the inability
of CPU frequency to scale.
Previous CPU generations had the ability to double CPU frequency.
CPU frequency dictates how many clock cycles a CPU can generate per sec-
ond. A higher CPU frequency generally implies better performance, as the
CPU can do more operations per second. The problem is that CPU performance
has diminishing returns once a certain clock speed is achieved, due
to leakage current and thermal limitations. This can be observed in Figure 2.1,
where frequency has trended around the 3-5 GHz range for the past decade.
Although clock speed is important, it is not the only determining factor in
performance. Every design has a metric called IPC, or instructions per cycle.
Overall performance can be thought of as IPC × clock frequency, so an
increase in either IPC or clock speed can provide a net performance gain [27].
Like frequency, IPC can be difficult to increase because doing so requires
very large architectural engineering efforts.
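The relationship between IPC and clock frequency can be made concrete with a short C sketch. The function names and the numbers in the usage note are illustrative assumptions, not measurements from this thesis, and the model ignores the fact that IPC varies with the workload.

```c
/* Idealized throughput model: instructions per second = IPC * clock frequency. */
double instructions_per_second(double ipc, double frequency_hz) {
    return ipc * frequency_hz;
}

/* Relative gain of a new design over an old one under the same model. */
double design_speedup(double old_ipc, double old_hz,
                      double new_ipc, double new_hz) {
    return instructions_per_second(new_ipc, new_hz) /
           instructions_per_second(old_ipc, old_hz);
}
```

Under this model, a hypothetical design that doubles IPC at an unchanged 2 GHz clock gives design_speedup(1.0, 2.0e9, 2.0, 2.0e9) equal to 2, the same gain as doubling the clock at unchanged IPC.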
Figure 2.1: Microprocessor Trends (data collection by M. Horowitz, F. Labonte,
O. Shacham, K. Olukotun, L. Hammond, C. Batten, and K. Rupp [38])
A different approach to increasing performance is to add more CPU
cores. This is possible due to the increased transistor density gained from
advancements in processor manufacturing. Multiple CPU cores can reside on
a single chip allowing multiple programs or multiple threads from a single
program to run simultaneously in parallel. One might think that providing x
times as many cores would increase overall program performance x times. This is
not necessarily the case. Serial single-threaded code cannot take advantage of
these extra cores, and can run on at most one core at a time. But if a user
has multiple instances of such a program to run simultaneously, then multicore
processors can provide a speedup, because each instance can occupy its own
core. If instead a user has a single program that needs high performance, that
program has to be written specifically to utilize multiple cores
cooperatively. This is called parallel programming. The parallel programming
model introduces many intricacies not exposed when working with
a single-threaded application. For example, two CPU cores accessing the same
data could overwrite each other’s results. Thus, using parallel hardware
as efficiently as possible can be a complicated task for the programmer.
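The overwrite hazard can be illustrated without real threads by writing out one unlucky interleaving by hand. The following C sketch is a single-threaded simulation under that assumption; the type and function names are hypothetical.

```c
/* Shared counter that two cores both try to increment. */
typedef struct {
    int value;
} shared_counter;

/* An increment is really three steps: load, add, store. */
int counter_load(const shared_counter *c) { return c->value; }
void counter_store(shared_counter *c, int v) { c->value = v; }

/* Both cores load before either stores, so one update is lost. */
int racy_interleaving(void) {
    shared_counter c = {0};
    int core0 = counter_load(&c);   /* core 0 reads 0 */
    int core1 = counter_load(&c);   /* core 1 reads 0 */
    counter_store(&c, core0 + 1);   /* core 0 writes 1 */
    counter_store(&c, core1 + 1);   /* core 1 overwrites it with 1 */
    return c.value;
}

/* Each core finishes its load-add-store before the other starts,
 * which is the ordering a lock or an atomic increment would enforce. */
int serialized_interleaving(void) {
    shared_counter c = {0};
    int core0 = counter_load(&c);
    counter_store(&c, core0 + 1);
    int core1 = counter_load(&c);
    counter_store(&c, core1 + 1);
    return c.value;
}
```

The racy ordering ends with a counter value of 1 even though two increments ran, while the serialized ordering ends with 2.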
2.1 Accelerators
CPUs are designed to handle very general purpose workloads. This enables
them to do many things from routing network packets to playing video games.
Performance while executing these tasks is generally good, but not adequate
in all scenarios. Because of this, computers have evolved to have additional
application-specific processors for some select tasks. Many modern CPUs even have
task-specific silicon, or hardware accelerators, within the CPU chip, such as
cryptographic generators and vector processing units [7]. Hardware accelera-
tors can be necessary to enable better performance, reduce power consumption,
lower computational latency, etc.
This thesis focuses on graphics processing units (GPUs). When using
a computer monitor, it is necessary to perform some computation in order to
draw the image on the screen. This was originally done on the CPU, but more
demanding workloads such as video games need more throughput than a CPU
can provide. Some of the earliest GPUs date all the way back to the 1970s,
in home game consoles [18]. These processors have evolved greatly since those
basic implementations. Modern GPU design has evolved to take advantage of
the parallel structure of graphics work, which is a good fit for SIMD (single
instruction multiple data) execution. SIMD is a parallel execution model in
which one set of instructions is executed on multiple execution units in lock-
step [20]. These units all perform the same operations, but on separate data
values. Such an execution model is flexible enough to be applied to non-
graphics problems such as neural networks and cryptographic computation,
which is referred to as general purpose computing on GPUs (GPGPU).
2.2 Nvidia Hardware
We target Nvidia GPUs, so we will define some vendor specific terminology
and architecture. Nvidia GPUs have many execution units called CUDA cores.
CUDA cores are not all independent from one another. CUDA cores are
scheduled in groups of 32 threads (a warp), but the smallest number of
threads allocatable for any task is 1. Since CUDA cores are only scheduled
in warps, this single thread will execute alongside 31 other CUDA cores, which will
perform wasted work. The reason for this is that the GPU is divided into
Streaming Multiprocessors (SMs), each containing a multiple of 32 CUDA cores.
An SM is responsible for scheduling tasks onto the CUDA cores that reside
within it. Each SM has multiple hardware warp schedulers to ensure that if a
given warp stalls on a memory access, a different warp can execute while
waiting for the access. Multiplexing the CUDA cores like this can help hide
the latency incurred by memory operations.
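The cost of warp-granular scheduling can be worked out with simple arithmetic. The C sketch below assumes only the warp size of 32 described above; the helper names are hypothetical.

```c
#define WARP_SIZE 32  /* threads per warp on Nvidia GPUs, as described above */

/* Number of warps needed to run `threads` threads (rounded up). */
unsigned warps_needed(unsigned threads) {
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Lanes that are scheduled but have no thread to run. */
unsigned idle_lanes(unsigned threads) {
    return warps_needed(threads) * WARP_SIZE - threads;
}
```

A task with a single thread still occupies one full warp and leaves 31 lanes idle, while a task with 64 threads fills two warps exactly.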
2.2.1 Thread Divergence
This section may raise the question: why use CPUs when modern GPUs have
thousands of cores? Individual GPU cores are much less powerful and less flexible
than a CPU core. Typically, GPU cores provide performance
advantages when working in parallel.
Thread divergence is a key performance determinant. Thread divergence
is a consequence of warp scheduling. It happens when two or more
threads in a warp take different branches at a given control statement. Since
all threads in a warp must execute the same instructions, the entire warp is
forced to execute both branches [34]. While the warp executes a branch that
does not apply to a given thread, that thread’s results are discarded; results are
saved only for the branch whose condition holds on that particular thread.
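/* Branching version: threads of a warp may take different paths. */
void divergent(int size, int a[], int b[]) {
    for (int i = 0; i < size; i++) {
        if (a[i] > 0) {
            b[i] += 1;
        } else {
            /* assumed: no-op so both functions produce the same result */
            b[i] += 0;
        }
    }
}

/* Branch-free version: the comparison result (0 or 1) is used as data. */
void nondivergent(int size, int a[], int b[]) {
    for (int i = 0; i < size; i++) {
        b[i] += (a[i] > 0);
    }
}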
Figure 2.2: Example Divergence
In the worst case, it is possible to have a branch that is taken by one
single thread in a warp. This would cause 31 of the 32 threads in the warp to
waste cycles executing code that will not be used. Therefore, it is important
to avoid branching statements that cause control flow divergence. GPU programmers
will often replace branching logic with careful data layout and/or
computation. For instance, we can often represent a branch with a data flag
that will change the result of the computation, as shown in Figure 2.2. In
fact, the Nvidia CUDA compiler will attempt to do this automatically [8].
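The cost of divergence can be modeled on the host with a short C sketch. The model is an assumption made for illustration: a warp pays for every path that at least one of its lanes takes. The function name and the cycle counts passed to it are hypothetical.

```c
#include <stdbool.h>

#define WARP_SIZE 32

/* Cycles a warp spends on an if/else under a simple model: the warp runs the
 * `then` path if any lane takes it and the `else` path if any lane takes it. */
unsigned warp_branch_cycles(const bool takes_then[WARP_SIZE],
                            unsigned then_cycles, unsigned else_cycles) {
    bool any_then = false;
    bool any_else = false;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (takes_then[lane]) {
            any_then = true;
        } else {
            any_else = true;
        }
    }
    return (any_then ? then_cycles : 0) + (any_else ? else_cycles : 0);
}
```

When all 32 lanes agree, the warp pays for one path only. As soon as a single lane disagrees, the warp pays for both paths, which is the worst case described above.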
2.2.2 GPU Memory
Nvidia GPUs have multiple types of memory on the physical device, which are
all distinctly separate from CPU main memory. The access pattern to these
memories contributes heavily to how well the program will run. There are
two considerations when accessing GPU memory: alignment and stride. An
aligned memory access is one where the data being accessed resides on a multiple
of the alignment boundary. For example, an 8-byte aligned access would be
to any address that is divisible by 8. Aligned accesses can be coalesced into a
single access to memory, but unaligned accesses could result in, worst case,
one access to memory per thread in the warp. The stride is the space between
data accesses. A common scenario is accessing a single member of a struct
that is contained in an array of structs. Each read would need to be offset
x bytes from the previous read, where x is the size of the struct. The x value
is the stride. If the stride of the memory accesses becomes large, it becomes
impossible to coalesce the accesses regardless of alignment [25].
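The effect of alignment and stride on coalescing can be estimated by counting how many memory segments a warp touches. The C sketch below is a host-side model, not vendor code. The 128-byte segment size is an assumption (the real transaction size depends on the GPU generation and memory type), and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128  /* assumed transaction size; varies by GPU generation */

/* Counts the distinct SEGMENT_BYTES-sized segments touched when each of the
 * 32 lanes reads `elem_bytes` bytes at base + lane * stride_bytes.
 * Assumes stride_bytes >= elem_bytes, so segment numbers never decrease
 * from one lane to the next and comparing with the last one seen suffices. */
unsigned segments_touched(uint64_t base, size_t elem_bytes, size_t stride_bytes) {
    unsigned count = 0;
    uint64_t last_segment = UINT64_MAX;  /* sentinel: no segment seen yet */
    for (unsigned lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t start = base + (uint64_t)lane * stride_bytes;
        uint64_t first = start / SEGMENT_BYTES;
        uint64_t last = (start + elem_bytes - 1) / SEGMENT_BYTES;
        for (uint64_t seg = first; seg <= last; seg++) {
            if (seg != last_segment) {
                count++;
                last_segment = seg;
            }
        }
    }
    return count;
}
```

With 4-byte elements, an aligned unit-stride access touches one segment, the same access shifted by 4 bytes touches two, and a 128-byte stride touches 32, one per thread.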
2.2.3 Unified Virtual Addressing
To simplify CUDA memory management, Nvidia introduced Unified Virtual
Addressing (UVA). Traditionally, a CUDA program has distinct dynamic al-
locators for main memory and GPU memory, but with UVA the programmer
can utilize the CUDA allocator for both device and main memory. The allocator
always provides unique memory addresses, as opposed to separate
allocators that could provide overlapping addresses [14]. UVA also provides a
feature to allocate portable memory [14]. Portable memory is memory that
is always resident in host main memory. The downside is that this memory
must be read to the device via DMA before it can be accessed, incurring PCI
Express bus transfer overhead [26]. This process incurs much more latency
and provides less bandwidth than a normal read to GPU memory.
2.2.4 Fully Unified Memory
Unified memory addresses some limitations of UVA. Unified memory provides
similar access semantics to UVA’s portable memory, but the management is
completely different. Host memory accessible with UVA is never migrated to
device memory; it is only communicated via DMA. With unified memory,
pages can be migrated to and from GPU memory directly [14]. Pages that
are not resident but are accessed by the GPU will be unmapped on the CPU and
mapped into the device’s page tables. Because resident unified memory gets
directly mapped on device, it will provide the same performance as manually
managed GPU memory.
This is significant because unified memory provides a simplified means
of memory management and access. It enables a programmer to avoid manual
copying to and from device, thus helping avoid errors and reduce code
complexity. In addition, programs can oversubscribe unified memory. The
runtime will manage paging this memory in and out when needed without
programmer intervention. The prefetcher tries to predict which memory will
be needed next, and ensures that memory is available. When the runtime can
prefetch memory efficiently, performance is good and comparable to efficient
manual memory management. But if the GPU faults because the prefetcher
fails to anticipate the access pattern, performance takes approximately a 50%
hit [39].
To help the prefetcher make accurate decisions, there is an advise func-
tion similar to madvise for the CPU [31], called cudaMemAdvise. This function
allows the programmer to provide hints about how the allocated memory will
be used, like specifying which device will be accessing the memory most fre-
quently. Proper hint values have a big effect on performance as shown in
Section 6.1.
2.3 GPGPU Programming Model
To improve GPGPU programmability, vendors and third parties created
domain-specific languages (DSLs) and runtimes. These fall into three categories,
described in this section: 1) low level APIs that provide direct access to the
hardware, 2) libraries built for these low level APIs, and 3) frameworks built
to hide the underlying hardware from the user.
First, we discuss low level hardware APIs. The most notable are CUDA
for Nvidia hardware and ROCm for AMD [6]. These APIs extend the C/C++
programming languages. The programmer is given full control of the hardware
such that they can utilize it to its maximum potential. The downside is that
having direct access to the complex micro-architecture makes it easy to make
avoidable mistakes.
To avoid this, libraries such as Thrust provide templated implementations
for common patterns [13]. These algorithms can be used instead of
forcing programmers to code them from scratch, and they provide a correct,
performant solution without requiring deep understanding of the hardware.
Last are domain-specific frameworks, which abstract the underlying
hardware. A prime example comes from machine learning: TensorFlow
is a framework that works on many compute devices [3]. A machine
learning expert can write TensorFlow code, which can then be run on a CPU,
GPU, or even TensorFlow-specific hardware. No matter which device the code
runs on, the interface is the same for the programmer; as such, they may not
even be aware that their code has run on a GPU.
2.4 GPGPU Execution Model
The unique hardware not only changes the programming model, it also
enforces a unique execution model. A traditional program can be much more
dynamic: it can access the filesystem, the network, system calls, etc. When a
program is executing on a GPU, it only has access to memory, and there is no
mechanism to utilize the aforementioned resources.
Due to this, GPU programs require all memory mappings, allocations,
and transfers to occur before the GPU code (the kernel) begins execution. In
this phase the CPU side of the program can read in any necessary IO and then
structure the data efficiently for on-device consumption.
Once the copying and allocation is complete, the GPU kernel can be
launched. When a GPU kernel is running, it can only access GPU memory
mapped before kernel launch. This is a hardware limitation, thus all GPU
programs run within these constraints.
After the kernel finishes execution, the results must be copied from GPU
memory to host memory, either manually or automatically if using UVA/UVM. This
is because there is no concept of a return value from a kernel launch, and direct
memory transfers are the only means of communication.
Chapter 3
GVM: A GPU-based Java
Bytecode Interpreter
Our project builds upon GVM, which is a Java bytecode interpreter built on
top of CUDA for Nvidia GPUs [4]. The goal of GVM is to run unmodified
Java code on a GPU. This is not normally possible, as hardware vendors
don’t provide runtimes for high level languages. Even for the languages that hardware
vendors do support, such as C/C++, programmers are required to specifically target
the vendor’s special API. If Java gained such an API, existing programs would
need modification.
Enabling Java code to run on a GPU opens the door for programmers
completely unfamiliar with GPUs to write and execute code on device. Like-
wise, it means that the large amounts of preexisting Java code can be run
unmodified, providing flexibility to adapt older codebases to work in mod-
ern heterogeneous systems. GVM takes advantage of this to accelerate the
process of testing existing Java codebases. GVM specifically targets sequence-based
test generation. Sequence-based testing [40,44] is a systematic testing
methodology that generates all sequences of method calls (for the system under
test) up to a given bound. Each sequence generated will test a different
combination of statements, eventually producing every possible combination
of method calls. The sequence bound is provided to constrain the amount
of code generated and to provide a termination condition.
If we inspect the code generated, it can be observed to have a tree-like
structure. The root of this tree is a sequence of length 0. From the root, there
are N branches at each node, each representing a possible method call. The
example provided in GVM is a tree class with two methods add and remove
with one possible parameter x. In this scenario, at a given node there would
be 2 new branches generated. Each of these branches corresponds to a possible
method invocation.
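The tree of call sequences described above can be sketched in C. This is an illustrative enumeration under stated assumptions, not GVM's test generator: methods are identified only by an index, and every name below is hypothetical.

```c
/* Number of call sequences of length 0..bound when each step can be any of
 * `methods` calls: 1 + N + N^2 + ... + N^bound (one tree node per sequence). */
unsigned long count_sequences(unsigned methods, unsigned bound) {
    unsigned long total = 0;
    unsigned long level = 1;  /* number of sequences at the current depth */
    for (unsigned depth = 0; depth <= bound; depth++) {
        total += level;
        level *= methods;
    }
    return total;
}

/* Depth-first walk of the same tree. `seq` must have room for `bound`
 * entries and holds the method index chosen at each depth. `visit` is
 * called once per sequence, including the empty one at the root. */
void enumerate_sequences(unsigned methods, unsigned bound, unsigned *seq,
                         unsigned depth,
                         void (*visit)(const unsigned *seq, unsigned len, void *ctx),
                         void *ctx) {
    visit(seq, depth, ctx);
    if (depth == bound) {
        return;  /* the bound is the termination condition */
    }
    for (unsigned m = 0; m < methods; m++) {
        seq[depth] = m;
        enumerate_sequences(methods, bound, seq, depth + 1, visit, ctx);
    }
}

/* Example visitor: counts the sequences it is shown. */
void count_visit(const unsigned *seq, unsigned len, void *ctx) {
    (void)seq;
    (void)len;
    ++*(unsigned long *)ctx;
}
```

For the two-method example (add and remove with one parameter value) and a bound of 3, both functions report 1 + 2 + 4 + 8 = 15 sequences, which shows how quickly the generated code grows with the bound.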
The code generated for a given test suite can be very large and
long-running. In practice, execution time is costly and can delay releasing new
features or issuing bug fixes. To speed this process up, we can dedicate more
CPU cores or faster CPU cores to the problem, but there is a limit on how
many CPU cores can be put in one machine. GVM, running these generated tests
on a GPU in parallel, can achieve more throughput than traditional CPU
testing. This may seem counterintuitive, as single-threaded GPU tasks are
very slow. But the structure of the tests ensures that only portions of the
programs are strictly single-threaded. Much of the execution will be shared
code, thus the GPU can execute them together. Overall GVM outperforms
JPF (Java Pathfinder), a bytecode testing framework, as well as Oracle’s JVM
in the interpreter mode (-Xint) [4]. Additionally, the performance of GVM
is comparable with Oracle’s JVM with the JIT (just-in-time) compiler enabled,
which has been optimized over twenty years.
3.1 Implementation
GVM can be logically divided into 4 phases of execution (similar to the model
described in Section 2.4):
3.1.1 Packaging
Since Java programs target CPU execution, there are some aspects that are
unfriendly to GPUs. For instance, some data structures use multiple levels of
pointer indirection. On a GPU this can cause irregular memory access pat-
terns leading to poor performance. GVM transforms any fields that would
traditionally be pointer-based into array-based representations. Furthermore,
other information crucial to execution such as bytecode, classfiles, and excep-
tion tables are transformed into arrays. Overall, these transformations simplify
the memory accesses the GPU Java interpreters will need to make.
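The idea of replacing pointer-based fields with array-based representations can be illustrated with a linked list. The C sketch below is an assumption about what such a transformation might look like in the simplest case. It is not GVM's actual packaging code or memory layout, and all names are hypothetical.

```c
#include <stddef.h>

/* Pointer-based node, as a CPU-oriented program might define it. */
struct list_node {
    int value;
    struct list_node *next;
};

/* Array-based form: node i stores values[i], and next_index[i] is the
 * position of its successor, or -1 at the end of the list. */
struct flat_list {
    int *values;
    int *next_index;
    size_t length;
};

/* Copies a linked list into caller-provided arrays of `capacity` entries.
 * Returns the number of nodes written, or 0 if the list is empty or does
 * not fit (out->length is 0 in both cases). */
size_t flatten_list(const struct list_node *head, struct flat_list *out,
                    size_t capacity) {
    size_t n = 0;
    out->length = 0;
    for (const struct list_node *p = head; p != NULL; p = p->next) {
        if (n == capacity) {
            return 0;
        }
        out->values[n] = p->value;
        out->next_index[n] = (p->next != NULL) ? (int)(n + 1) : -1;
        n++;
    }
    out->length = n;
    return n;
}
```

After flattening, following a link is an index lookup into a contiguous array rather than a dereference of an arbitrary address, which makes the access pattern more regular.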
3.1.2 Transferring
Once values are prepared for the device, memory to store these arrays can be
allocated and the data can be copied to the device. Only one copy of this data
is needed, as the data is constant and shared by all interpreters [4]. In contrast, each
interpreter instance needs its own heap and static area.
3.1.3 Scheduling
TinyBees are GVM’s Java interpreter instances. Each tinyBee instance is an
entire independent Java interpreter and runs on a single GPU thread. GVM
provides some flexibility in how tinyBee instances are launched. A decision is
made whether to run one tinyBee instance per warp or to completely fill the
warp, based upon how much thread divergence there is in the program under
test. If the code has a lot of divergence, it is beneficial to schedule only one
thread per warp, because the other threads in the warp would be performing
wasted work (Section 2.2.1). GVMfs is run with one tinyBee per warp, as
filesystem operations are often highly divergent.
3.1.4 Interpreting
After the CPU side code is complete, the tinyBee instances will initialize local
data structures necessary for execution. TinyBees can then begin interpreting
the Java bytecode. The interpreter will read a bytecode and apply the corresponding
change to the tinyBee state machine. Bytecodes will continue to be
interpreted until a terminal state is reached.
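The read-and-apply loop can be sketched in a few lines of C. This is a minimal illustration, not tinyBee code: it handles only four opcodes (their numeric values are the standard JVM ones), and the state structure and function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A tiny subset of JVM opcodes; every other opcode is rejected here. */
enum {
    OP_BIPUSH  = 0x10,  /* push the next byte as a signed int */
    OP_IADD    = 0x60,  /* pop two ints, push their sum */
    OP_IMUL    = 0x68,  /* pop two ints, push their product */
    OP_IRETURN = 0xac   /* terminal state: return the top of the stack */
};

#define STACK_SLOTS 16

/* Interpreter state: a program counter and an operand stack. */
struct interp_state {
    size_t pc;
    int32_t stack[STACK_SLOTS];
    int sp;  /* number of values currently on the stack */
};

/* Reads one bytecode at a time and applies its effect to the state.
 * Returns 0 and stores the result when IRETURN is reached; returns -1 on
 * an unsupported opcode, a stack error, or running off the end of the code. */
int interpret(const uint8_t *code, size_t len, int32_t *result) {
    struct interp_state s = { .pc = 0, .sp = 0 };
    while (s.pc < len) {
        uint8_t op = code[s.pc++];
        switch (op) {
        case OP_BIPUSH:
            if (s.pc >= len || s.sp >= STACK_SLOTS) return -1;
            s.stack[s.sp++] = (int8_t)code[s.pc++];
            break;
        case OP_IADD:
            if (s.sp < 2) return -1;
            s.sp--;
            s.stack[s.sp - 1] += s.stack[s.sp];
            break;
        case OP_IMUL:
            if (s.sp < 2) return -1;
            s.sp--;
            s.stack[s.sp - 1] *= s.stack[s.sp];
            break;
        case OP_IRETURN:
            if (s.sp < 1) return -1;
            *result = s.stack[s.sp - 1];
            return 0;
        default:
            return -1;
        }
    }
    return -1;
}
```

Feeding it the bytes for bipush 2, bipush 3, iadd, bipush 4, imul, ireturn evaluates (2 + 3) * 4. A full interpreter would add the remaining opcodes, local variables, a heap, and method frames to the same loop.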
3.2 GVM Limitations
GVM has some limitations. In general, when Java code runs, it is interpreted
from a bytecode format. Therefore, having a Java virtual machine
implementation that can interpret all of the bytecodes for a given program
should enable it to execute said program. This is what GVM does, but just
providing these bytecodes is not enough for most programs.
Java programs typically utilize libraries to provide frequently used func-
tionality. Some of these libraries provide functionality that is not possible
to implement completely in Java code. For these special "native" functions,
Java provides the Java Native Interface (JNI), which allows for calling code
written in a different language [17]. Many low level operations require
interaction with system calls; for example, file operations require calls to open,
close, read, write, etc. Because of this, Java-provided file IO classes such
as RandomAccessFile implement much of their logic in native functions that
invoke C code. This C code can handle the low-level interactions with the
operating system API.
GVM does not implement JNI, but does capture the idea of the JNI
by providing mechanisms to override Java functions with native CUDA code.
This is particularly important as many of the things that are trivial on
CPU-based JVM implementations require special handling on GPUs. Additionally,
GVM has to provide custom implementations for many of the Java standard
library class functions. An example of such a situation is the StringBuilder
class. Strings and objects are laid out in memory in a special way in GVM;
consequently, the standard Java StringBuilder code has to be modified. Since
a typical Java program will utilize many of these standard library classes, GVM
includes many custom function implementations. The downside is that, if a
new program that utilizes a function or class not yet implemented in native





One thing that is common among GPGPU APIs and frameworks is the lack
of support for filesystem access directly from a GPU. This is due to the
fundamental design of a GPU: it is logically disconnected from the system call
interface. Consequently, any data that is needed from the filesystem
must instead be read and copied by the CPU before kernel execution. Explicit
data migration encourages the programmer to structure the data efficiently in
GPU memory. Additionally, utilizing this workflow ensures that data needed
for a given execution will be mapped into GPU memory, reducing latency that
would be incurred by any file IO.
Despite this API design choice, there is nothing barring the creation of a
GPU filesystem interface. In fact, this has been done. GPUfs [41] is a research
project to explore the viability of creating a GPU based filesystem. GPUs
present different challenges than a CPU, thus the design of GPUfs differs from
a classic filesystem interface to address these challenges.
4.1 GPU Communication
During a kernel’s execution it is ordinarily impossible to communicate between
the GPU and CPU; all communication happens before or after, in the form
of data transfers. A filesystem that requires all file operations to be known
before program execution doesn’t provide much value. In order to provide
an interface that can provide access during a kernel execution, there must be
some sort of Remote Procedure Call (RPC) implementation. RPCs are often
implemented by means of shared memory, for example Linux’s D-BUS which
provides RPCs between processes on the same machine [21].
Figure 4.1: GPUfs Overview Diagram (original figure from [41])
GPUfs takes this idea and evolves it to use CUDA unified virtual
addressing (UVA). UVA makes it possible to create a shared memory region,
resident on the CPU, that is accessible from both the CPU and GPU. GPUfs can
then dispatch filesystem requests to this writable CPU memory region. On the
CPU side,
GPUfs has a thread polling this memory region in order to respond to requests
(Figure 4.1). The GPU will mirror this polling pattern to receive the response.
4.2 File System Semantics
The most interesting bit of GPUfs is the unique semantics provided by the file
interface. The POSIX file interface was inherently designed for CPU based
access, one thread performing an action at a time sequentially. When a GPU
program accesses a file it will do so with multiple threads simultaneously and
can even do so from multiple warps simultaneously. For this reason, GPUfs
operations take a different approach, described below.
4.2.1 Read And Write
GPUfs file operations do not keep track of a file offset like standard file de-
scriptors do. Instead GPUfs requires that every read or write (gread/gwrite)
explicitly specify the file offset for the operation. This is not unlike POSIX
pread, which offers the ability to specify an offset when reading from a file
descriptor [24]. While this is not a drastic overhaul of POSIX read/write, it
is important, as it doesn’t make sense for GPUfs to maintain a file offset. For
example, if two gread requests are made for the same file, what would the
resulting offset need to be? If this were to happen on the CPU, one request
would be given exclusive access. Providing exclusive access to a file descriptor
for one warp would require costly synchronization, so it is best to just avoid
this scenario.
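The offset-explicit style can be illustrated on the CPU with POSIX pread. The following C sketch is only an analogy for gread, not GPUfs code: make_scratch_file and read_at are hypothetical helper names, and the scratch file path under /tmp is an assumption. It shows that two reads at different offsets do not depend on, or disturb, the descriptor's shared file offset.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked scratch file holding `len`
 * bytes of `data`. Returns an open descriptor, or -1 on failure. */
int make_scratch_file(const char *data, size_t len) {
    char path[] = "/tmp/offset_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);                 /* the file disappears once it is closed */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read `len` bytes starting at an explicit offset. pread() neither uses
 * nor updates the file offset stored in the descriptor, so concurrent
 * callers never race on shared position state. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset) {
    assert(buf != NULL);
    return pread(fd, buf, len, offset);
}
```

Because every call carries its own offset, the order in which callers issue their reads does not change what each one observes.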
4.2.2 Open and Close
The POSIX open function will open a file and return a unique file descriptor
for each call. File descriptors are managed on a per process basis, so two
processes can have unique handles to a singular file. In contrast, gopen will
return the same file descriptor every time gopen is called on a given file. Thus,
all GPU threads will share the same handle to each open file. Additionally,
after the first gopen call to a file the file data and metadata will have already
been cached since it is shared globally. Caching the metadata will speed up
subsequent calls to gopen.
As for close, in the POSIX world close will release a file descriptor
marking it suitable for reuse. A more nuanced behavior of close is that
the corresponding file will immediately flush any changes residing in the file’s
buffer cache to stable storage. gclose instead takes a lazy approach. When
the open count of a GPUfs file reaches zero nothing occurs and the file data
and metadata remain untouched. In order for the file buffer to be flushed,
the programmer must explicitly sync data when desired with gfsync. gfsync
forces all dirty file buffers for the associated file to be written back to the CPU
file cache, unless there are other threads accessing it at the time of invocation.
gclose is designed to be lazy due to the chaotic behavior of GPU scheduling.
GPU scheduling may cause a file to be opened and closed many times in a short
time span. Flushing the data on every close would be extremely costly in terms
of data transfer [41].
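The shared-handle and lazy-close semantics described in this section can be modeled with a small reference-counted table. The C sketch below is a host-side illustration under stated assumptions, not the GPUfs implementation: the names shared_open, shared_close, shared_sync and shared_mark_written, the table size, and the single-threaded setting are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_FILES 8
#define MAX_NAME  64

/* One entry per distinct file; every opener shares the same entry. */
typedef struct {
    char name[MAX_NAME];
    int  open_count;   /* how many callers currently hold the handle */
    bool cached;       /* data and metadata stay resident once loaded */
    bool dirty;        /* there are writes that have not been synced */
    bool in_use;
} shared_file_t;

static shared_file_t table[MAX_FILES];

/* Return the same handle for every open of the same name. */
int shared_open(const char *name) {
    assert(name != NULL && strlen(name) < MAX_NAME);
    int free_slot = -1;
    for (int i = 0; i < MAX_FILES; i++) {
        if (table[i].in_use && strcmp(table[i].name, name) == 0) {
            table[i].open_count++;    /* cache hit: no reload needed */
            return i;
        }
        if (!table[i].in_use && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return -1;     /* table is full */
    shared_file_t *f = &table[free_slot];
    strcpy(f->name, name);
    f->open_count = 1;
    f->cached = true;
    f->dirty = false;
    f->in_use = true;
    return free_slot;
}

void shared_mark_written(int h) { table[h].dirty = true; }

/* Lazy close: only the count changes; cached and dirty data remain. */
void shared_close(int h) {
    assert(table[h].open_count > 0);
    table[h].open_count--;
}

/* An explicit sync is the only operation that clears the dirty state. */
void shared_sync(int h) { table[h].dirty = false; }
```

In this model a file that is closed and immediately reopened pays no reload cost, which is the situation the lazy design is meant to make cheap.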
4.2.3 The Remaining Functions
GPUfs provides additional functions, but they provide semantics in line with
a CPU-based filesystem. Because of this, we will only mention them at a
high level.
• gmmap/gmunmap: Behaves like mmap/munmap, providing a way to
memory map a file.
• gunlink: Removes a file and reclaims memory.
• gfstat: Returns the file size when gopen was called.
• gftruncate: Truncates a file to a given size and reclaims memory now
unused.
• gmsync: Flushes a file immediately to the CPU.
4.3 File Usage
You may recall the discussion about data needing to be structured in a certain
way in order for the best performance for GPU usage (Section 2.2.2). This
remains a concern, especially when using files. An average file is laid out in a
format designed to be read or written to sequentially. The reason for such a
design is that disks must physically perform these actions. Files that are to
be used with a system such as GPUfs will need special consideration in order
to be efficiently usable in GPU programs. For example, it would be ideal to





We know from the GPUfs project that it is possible to create a filesystem
that targets a GPU. We also know that GVM enables Java code to run on a
GPU and has the ability to have additional functionality added with native
CUDA code. Thus, our contribution is to combine the two concepts, creating
a filesystem extension for GVM called GVMfs. Such an extension allows for a
greater breadth of Java programs to be executable, as many of these programs
rely on the filesystem.
5.1 Interface
Although GPUfs provides an interface for a filesystem, it is complex and is
not fully utilized by many programs. Furthermore, GPUfs’s unique semantics
require more background knowledge to use. For this reason, we decided to look




Before any of these filesystem functions can be implemented on the GPU side,
a means of communication between CPU and GPU is necessary. GVMfs does
this in the form of an RPC similar in concept to the RPC outlined in GPUfs.
The RPC is built upon a region of shared memory allocated with CUDA
unified memory. Recall that CUDA unified memory is accessible from both
GPU and CPU address spaces; therefore, it can be used to provide the means
for bidirectional communication between CPU and GPU.
To ensure reliable communication, it was necessary to construct a
synchronization mechanism on top of the shared memory. It would be beneficial
to utilize a blocking synchronization primitive to avoid the need to poll for
RPC requests; unfortunately, this is not possible. On the GPU there is no way
to manipulate the state of a kernel: kernels are always running once launched.
The only way to ensure a GPU kernel waits for an event is to create a spin
lock. Conversely, the CPU-side process can be blocked easily with primitives
like the futex [22]. However, waking this process from the GPU proves
impossible, due to the lack of access to system calls. With these considerations,
the only option for synchronization utilizable by both the CPU and GPU is a
polling-based implementation. Unfortunately, this design means we must have
at least one CPU thread constantly polling for requests during GPU kernels.
The shared memory region is large enough to accommodate multiple
requests, specifically one request slot per tinyBee instance. Consequently,
each tinyBee is free to make an RPC request regardless of whether another
tinyBee is making a request simultaneously. The format of a request is shown
in Figure 5.1 and has various data fields to accommodate different RPC
request types. As a minor space optimization, we utilize unions to minimize the
memory needed to represent the RPC region.
typedef struct request_t {




        struct { /* Open variables */
            char file_name[MAX_PATH_SIZE];
            permissions_t permissions; /* Unix permissions (rwx) */
        };
        struct { /*** Grow/Close variables ***/
            int host_fd; /* fd on host */
            char *file_mem; /* File backed memory */
            union {
                struct { /*** Grow ***/
                    size_t new_size; /* Size desired for grow */
                    size_t current_size; /* Size of current gpu file */
                };
                struct { /*** Close ***/
                    size_t actual_size; /* Total new size for close */






Figure 5.1: RPC Request Struct
5.2.1 Synchronization
While we can poll for updates to the RPC queue, this is not enough to ensure
correctness. Without memory fences, there is no guarantee that the data for an
RPC request is written before the request is serviced. Such a race condition can
cause a host of issues. To ensure a given set of shared memory operations
performed on the GPU are observed by the CPU, we use __threadfence_system
provided by CUDA [14]. On the CPU we use __sync_synchronize, which
is a full memory barrier provided by GCC [35]. These memory barriers ensure
that our ready-to-read flag is only true when all previous updates to the
struct are visible.
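The ordering between payload writes, fences, and the polled flag can be sketched with a single request slot. The following C code is a CPU-only model under stated assumptions: the slot layout, the state names, and the functions rpc_post, rpc_poll_once and rpc_try_collect are hypothetical and much simpler than the request struct in Figure 5.1. On the GPU side the fence would be __threadfence_system rather than __sync_synchronize.

```c
#include <assert.h>
#include <string.h>

enum { SLOT_EMPTY = 0, SLOT_REQUEST = 1, SLOT_RESPONSE = 2 };

/* Hypothetical single-slot mailbox living in memory both sides can see. */
typedef struct {
    volatile int state;   /* the flag that both sides poll */
    int  opcode;
    char payload[32];
    int  result;
} rpc_slot_t;

/* Requester: write the payload, fence, and only then raise the flag. */
void rpc_post(rpc_slot_t *s, int opcode, const char *payload) {
    assert(strlen(payload) < sizeof s->payload);
    s->opcode = opcode;
    strcpy(s->payload, payload);
    __sync_synchronize();         /* payload must be visible before the flag */
    s->state = SLOT_REQUEST;
}

/* Servicer: one polling step. Returns 1 if a request was handled. */
int rpc_poll_once(rpc_slot_t *s) {
    if (s->state != SLOT_REQUEST) return 0;
    __sync_synchronize();         /* do not read the payload ahead of the flag */
    s->result = (int)strlen(s->payload);   /* stand-in for real work */
    __sync_synchronize();         /* result must be visible before the flag */
    s->state = SLOT_RESPONSE;
    return 1;
}

/* Requester: one polling step for the response. Returns 1 when ready. */
int rpc_try_collect(rpc_slot_t *s, int *result) {
    if (s->state != SLOT_RESPONSE) return 0;
    __sync_synchronize();
    *result = s->result;
    s->state = SLOT_EMPTY;        /* slot can be reused */
    return 1;
}
```

In a real deployment each side would call its polling step in a loop; the functions are written as single steps here so the hand-off can be followed deterministically.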
Figure 5.2: GVMfs Organization
5.3 Memory Management
Because memory allocation and de-allocation can only occur before or after
a CUDA kernel launch, we must provide our own means of management to
enable large dynamic allocations during a kernel. To do this, we must allocate
a large pool of unified memory before the kernel. Allocations are made using
a simple bump pointer. This entire memory pool must be unified memory
because the allocations are used to provide backing for GVMfs file buffers
(Figure 5.2).
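A bump pointer allocator of the kind described above can be sketched in a few lines of C. This is a minimal illustration, not the GVMfs allocator: the names bump_pool_t, bump_init and bump_alloc, the 16-byte alignment, and the use of an ordinary array in place of unified memory are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A pool that hands out memory by advancing a single offset. */
typedef struct {
    uint8_t *base;       /* start of the pre-allocated region */
    size_t   capacity;   /* total bytes available */
    size_t   used;       /* the bump pointer, as an offset from base */
} bump_pool_t;

void bump_init(bump_pool_t *p, void *backing, size_t capacity) {
    assert(p != NULL && backing != NULL);
    p->base = backing;
    p->capacity = capacity;
    p->used = 0;
}

/* Round n up to a multiple of align (align must be a power of two). */
size_t align_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Allocate `size` bytes, or return NULL when the pool is exhausted.
 * There is no per-allocation free; the whole pool is released at once. */
void *bump_alloc(bump_pool_t *p, size_t size) {
    size_t start = align_up(p->used, 16);
    if (start > p->capacity || size > p->capacity - start) return NULL;
    p->used = start + size;
    return p->base + start;
}
```

The trade-off is that individual buffers are never returned to the pool, which keeps allocation trivial at the cost of memory that stays reserved until the pool itself is torn down.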
5.4 Filesystem General Operations
Accessing a file in GVMfs first requires opening the file. When opening a file,
the CPU will first check the file size. File size information is used to allocate a
region of memory from the GVMfs unified memory allocator. After allocation,
the entire file is read into the shared memory region. Once the read completes,
a response can be sent to the GPU via the RPC buffer. This response contains:
a pointer to the shared region (where the file data resides), the file size, the
corresponding file descriptor for the host, and file permissions. We chose to
send more data than necessary at this stage, to enable state management to
be handled on the GPU side. Since all state is maintained on the GPU side,
we can keep the CPU side code very minimal. Furthermore, this state is
maintained on a per warp basis, so each warp will have a different set of open
files.
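The CPU-side handling of an open request can be sketched as follows. This C code is an illustration under stated assumptions: open_response_t and handle_open are hypothetical names, malloc stands in for the unified memory pool, and the RPC transport is omitted. The response fields mirror the ones listed above (buffer pointer, file size, host file descriptor, permissions).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical response record sent back to the GPU after an open. */
typedef struct {
    char  *file_mem;      /* buffer holding the entire file */
    size_t file_size;
    int    host_fd;       /* descriptor kept open on the host */
    mode_t permissions;   /* Unix permission bits */
} open_response_t;

/* Service an open request: check the size, allocate a buffer, and read
 * the whole file into it. Returns 0 on success and -1 on failure. */
int handle_open(const char *path, open_response_t *out) {
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    size_t size = (size_t)st.st_size;

    char *mem = malloc(size > 0 ? size : 1);   /* stand-in for the pool */
    if (mem == NULL) { close(fd); return -1; }

    size_t done = 0;
    while (done < size) {                      /* read() may be partial */
        ssize_t n = read(fd, mem + done, size - done);
        if (n <= 0) { free(mem); close(fd); return -1; }
        done += (size_t)n;
    }

    out->file_mem = mem;
    out->file_size = size;
    out->host_fd = fd;
    out->permissions = st.st_mode & 0777;
    return 0;
}
```

Returning everything in one response is what lets the requester keep all per-file state locally, so later reads and writes need no further round trips.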
Once a file is open, GPU programs can read and write to the file without
involving the CPU. GVMfs read will return a pointer representing an offset into
the shared memory, similar to how a traditional file descriptor will maintain
an offset. This buffer can be written to as well as read. Writing to the
buffer that read returns is equivalent to writing to the file. As such, we
don’t need to provide an explicit write function.
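The pointer-returning read can be modeled with a small per-file state record. The C sketch below is an assumption-laden illustration rather than the GVMfs code: gpu_file_t, file_read, file_seek and file_tell are hypothetical names, and an ordinary array stands in for the shared file buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file state kept by the requester after an open. */
typedef struct {
    char  *file_mem;    /* start of the file-backed buffer */
    size_t file_size;
    size_t offset;      /* current file offset */
} gpu_file_t;

/* Return a pointer into the buffer at the current file offset.
 * Stores through this pointer modify the file contents directly. */
char *file_read(gpu_file_t *f) {
    assert(f->offset <= f->file_size);
    return f->file_mem + f->offset;
}

/* Move the file offset; positions past the end are rejected. */
int file_seek(gpu_file_t *f, size_t pos) {
    if (pos > f->file_size) return -1;
    f->offset = pos;
    return 0;
}

/* Report the current file offset. */
size_t file_tell(const gpu_file_t *f) { return f->offset; }
```

Since the returned pointer aliases the file buffer, a store such as `*file_read(&f) = 'X'` plays the role that an explicit write call would play in a POSIX-style interface.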
In order to write any file changes back to disk, the file must be closed.
Figure 5.3: GVMfs Logical Flow Chart
The close operation communicates to the CPU the file descriptor, a pointer
to the file data, and the size. This is the same information that gets stored
on the GPU when a file is opened. With this metadata the CPU is able to
flush the entire buffer to the original file. If the file has been grown since it
has been opened, the CPU will allocate more space on disk for the new data.
We also provide seek and grow. seek is trivial: it changes the
file offset in the GPU state as a traditional seek call would. grow involves a bit
more complexity, as more memory is needed to back the extra file data. When
a file is grown, a new memory buffer is allocated. The CPU must copy the
contents of the original file buffer into the new file buffer. Then the new buffer
can be returned to the GPU. A side effect of this is that any references to the
old buffer will remain writable and readable, but will no longer be
representative of the file.
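The copy-on-grow behavior and its stale-pointer side effect can be demonstrated with a short C sketch. The names file_buf_t and file_grow are hypothetical, calloc stands in for the unified memory allocator, and zero-filling the new tail is an assumption made here for determinism.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle: the current backing buffer and its size. */
typedef struct {
    char  *file_mem;
    size_t file_size;
} file_buf_t;

/* Grow the file to at least new_size bytes by allocating a larger buffer
 * and copying the old contents into it. The old buffer is deliberately
 * left alone, so earlier pointers stay valid but detached from the file.
 * Returns 0 on success and -1 if allocation fails. */
int file_grow(file_buf_t *f, size_t new_size) {
    assert(f != NULL && f->file_mem != NULL);
    if (new_size <= f->file_size) return 0;   /* already large enough */
    char *bigger = calloc(new_size, 1);       /* new tail is zero-filled */
    if (bigger == NULL) return -1;
    memcpy(bigger, f->file_mem, f->file_size);
    f->file_mem = bigger;
    f->file_size = new_size;
    return 0;
}
```

A caller that cached a pointer before the grow can still write through it, but those writes land in the old buffer and never reach the file, so pointers should be re-fetched after every grow.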
It is worth noting that the GPU only needs to make RPC requests when
opening, closing, or growing a file (shown in Figure 5.3). All other filesystem
operations can occur solely on the GPU. Structuring the code logic this way
ensures that frequent reads and writes will not incur a costly RPC penalty.
Function                     Description
int open(char * path)        Opens a file, returning a file descriptor.
int read(int fd)             Returns a pointer to the file-backed buffer,
                             offset into the buffer corresponding with
                             the file offset.
seek(int fd, size_t pos)     Moves the file offset to be equal to the
                             position parameter.
int tell(int fd)             Returns the current file offset.
grow(int fd, size_t size)    Grows the file to be at least as large as
                             the provided size.
close(fd)                    Closes a file descriptor, flushing it to
                             the host filesystem.
Table 5.1: GVMfs CUDA API
5.5 GVM File System Integration
With the filesystem interface complete we needed to create an interface for our
Java programs that run on GVM. Java has a standard interface for low level
file operations and we implement a subset of these. This is a proof of concept
to show that it is possible to implement this functionality within GVM, such
that a user provided Java program can access the filesystem.
This required a few modifications to GVM. First, GVM must handle
filesystem initialization and cleanup. These are done with code added
to the initialization and cleanup of GVM itself, since the filesystem lifetime
needs to be the same as GVM’s. As for the runtime interface, we implement
a class called NewFile. NewFile implements the functions open, close, write,
writeByte, read, and readByte. These functions invoke the corresponding
CUDA function; for instance, NewFile.open() will invoke the open CUDA
function. The exceptions are the three new functions not present in the CUDA
code: readByte, write, and writeByte. readByte will call read with a size
of 1 byte and return the byte read. write and writeByte will write to a file
buffer produced by read.
As previously discussed, the native filesystem functions operate directly
on a buffer that represents a file. This buffer is not used directly by the code
GVM is interpreting, because GVM arranges array data differently
than native CUDA code. GVM arrays are designed to be generic enough to
handle many data types, and they do so by allocating 4 bytes for each slot.
The native filesystem returns a character array, so this needs to be copied into
GVM’s array layout. On the opposite end, write will need to copy changes
from the GVM array to the device file buffer.
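The layout conversion amounts to widening each file byte into a 4-byte slot on read and narrowing it again on write. The C sketch below illustrates the idea only: widen_to_slots and narrow_from_slots are hypothetical names, and representing a slot as int32_t is an assumption about the 4-byte layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n file bytes into n 4-byte array slots (the read direction). */
void widen_to_slots(const unsigned char *src, int32_t *slots, size_t n) {
    assert(src != NULL && slots != NULL);
    for (size_t i = 0; i < n; i++) {
        slots[i] = src[i];            /* one byte occupies a whole slot */
    }
}

/* Copy n slots back into the byte buffer (the write direction).
 * Only the low 8 bits of each slot are kept. */
void narrow_from_slots(const int32_t *slots, unsigned char *dst, size_t n) {
    assert(slots != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (unsigned char)(slots[i] & 0xFF);
    }
}
```

The slot array is four times the size of the file buffer, and every read or write pays for one pass over the data in addition to the transfer itself.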
5.6 Limitations
As stated, GVM filesystem integration requires a double copy in order to
perform operations. This incurs a performance penalty. It should be possible
to refactor GVM in such a way that the heap uses unified memory. By doing
this, it would be possible for the CPU to access the heap, and read the file
directly into this memory in the correct format. As it stands, the project
doesn’t do this because the GPU filesystem implementation was designed to
be flexible enough to be used for programs other than GVM.
The other main limitation is that the DataInput and DataOutput
interfaces are not fully implemented. If they were, the file access classes in
the Java standard library could utilize them transparently [15, 16]. This would
allow for completely unmodified code to utilize filesystem operations within
GVM instead




To evaluate GVMfs we crafted two microbenchmarks to characterize the
performance of various components, along with a macrobenchmark to
understand how they perform together. Measurements were performed on a modern
mid-range Nvidia GPU with 1920 CUDA cores. The CUDA cores are divided
among 15 SMs, each containing 128 CUDA cores and 4 warp schedulers [12].
Other system specifications are enumerated in Table 6.1.
Component      Specification
CPU            AMD Ryzen 7 2700X
RAM            16 GB DDR4, 3200 MHz
Storage        Adata SU800, 512 GB
GPU            Nvidia GTX 1070
CUDA Version   11.0
GCC Version    9.3
Table 6.1: System Specifications
Figure 6.1: GPU File Backed Memory Test
6.1 Memory Stress Test
The first microbenchmark characterizes the performance of unified memory
on the GPU. At a high level, the benchmark allocates memory for file data,
launches a GPU kernel to read/modify the data, then writes the modified file
data back to the file.
We compare the performance of file-backed memory allocated via three
different methods:
• cudaMalloc returns explicitly managed GPU memory. We must then
copy file data to/from the device with cudaMemcpy.
• cudaHostRegister (UVA) maps a region of host memory so that it is
accessible to the GPU. The region already contains the file data, thus no
memory copy is necessary.
• cudaMallocManaged (UVM) returns a shared region of memory that we can
read the file data directly into (the runtime manages data migration).
In addition to different memory backings, we experiment with different
hinting values for cudaMemAdvise. Note that cudaMemAdvise is only usable
with UVM, so it is not applied to the cudaMalloc or cudaHostRegister
allocations. We now briefly discuss what each cudaMemAdvise option does:
• cudaMemAdviseSetReadMostly tells the runtime that the memory region
will be mostly read, thus creating a read-only copy on the device.
• cudaMemAdviseSetPreferredLocation specifies the preferred location
for the given memory region to reside. This mainly affects data migration:
if a page fault occurs on an unpreferred device, the runtime will try
to establish a direct mapping to the region instead of migrating the page.
• cudaMemAdviseSetAccessedBy is similar to set preferred location, with
the idea being that you specify which device is likely to access the
given region. It differs in effect: setAccessedBy will guarantee that
a given memory region is always mapped in the specified device’s page
table.
• No advice.
Once the allocation is done and file data is read into the memory, the
file backed memory is split into page sized chunks. These pages are divided
evenly among the GPU threads running the benchmark. At this step there is a
configuration option to randomize the order in which these pages are assigned
to GPU threads, in order to force a random access pattern.
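The page assignment step can be sketched on the host as follows. This C code is an illustration, not the benchmark source: build_page_order and thread_range are hypothetical names, the shuffle uses the C library rand() (which has a small modulo bias), and giving the first few threads one extra page when the division is uneven is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill order[0..n) with page indices 0..n-1. When randomize is non-zero
 * the indices are shuffled with a Fisher-Yates pass to force a random
 * access pattern; otherwise they stay sequential. */
void build_page_order(size_t *order, size_t n, int randomize, unsigned seed) {
    assert(order != NULL);
    for (size_t i = 0; i < n; i++) order[i] = i;
    if (!randomize) return;
    srand(seed);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;    /* pick from the unshuffled prefix */
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Compute the slice order[first .. first + count) owned by thread t.
 * Pages are split evenly; the first (n_pages % n_threads) threads take
 * one extra page each. */
void thread_range(size_t n_pages, size_t n_threads, size_t t,
                  size_t *first, size_t *count) {
    assert(n_threads > 0 && t < n_threads);
    size_t base = n_pages / n_threads;
    size_t extra = n_pages % n_threads;
    *count = base + (t < extra ? 1 : 0);
    *first = t * base + (t < extra ? t : extra);
}
```

Each thread would then walk its slice of the order array and touch the pages it names, so the same kernel serves both the sequential and the random configuration.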
The kernel is launched with a total of 2048 threads in 64 warps, ap-
proximately the number of warps this particular GPU can run simultaneously.
Each thread in the kernel will perform an access on every byte of each page
assigned to it (shown in Figure 6.1). An access will read and/or write
depending on the parameters specified at launch. This portion of the benchmark is
intended to test the hardware prefetcher by stressing the memory subsystem
in a manner similar to how filesystem operations would. When all of the pages have been
fully accessed, the kernel exits. Then, if using explicit memory management,
the data is copied back from the GPU to the host.
We time the transfer of data to/from the GPU and the length of the
kernel execution across a range of parameter combinations. With this data we
compile a few graphs to show interesting performance patterns that arose.
Figure 6.2: Random R/W Total Time (Transfer + Kernel)
Figure 6.3: Random R/W Total Kernel Time
6.1.1 Results
The benchmark showed that different memory allocation and advice
combinations yielded different performance in different aspects of execution. When
looking at a random read and write workload, setPreferredLocation
provided the fastest performance at high mapping sizes (Figures 6.2 and 6.3). This
result is not surprising, because telling the runtime to prefer
having direct mappings on the GPU should decrease the number of page
faults.
Figure 6.4: Sequential Read Total Time (Transfer + Kernel)
But when looking at the sequential read-only workload, different pat-
terns arise (Figure 6.4). The performance delta between allocation methods
is much smaller. This is likely because the prefetcher is able to make ac-
curate predictions. Interestingly, we do not see a benefit using ReadMostly
here. We think that this is because there is no contention between CPU and
GPU for this benchmark, thus making a read only copy of the data provides




The next major component we investigated was the custom RPC interface.
Since an RPC is incurred on every open, close, and grow operation, the
latency of an RPC request is very important. In order to test this latency,
we isolated the RPC mechanism. This allows us to time the RPC round-trip
time without any of the processing overhead that is normally required to read
and respond to a filesystem RPC request. Each request has the entire RPC struct
(Figure 5.1) zeroed out to simulate writing real data to the request, and is
then sent to either the GPU or CPU, at which point the receiver immediately
sends a similar response.
One minor point is that it is not easy to generate a wall clock time
for these events on the GPU. CUDA events, which are often used for timing
kernels, will not be precise enough as they include overhead from the actual
kernel launch [9]. Instead we choose to measure the latency on the GPU side
in clock cycles with clock64, which provides the current cycle count of a given




Figure 6.5: RPC Performance
Both GPU and CPU side latency benchmarks were performed across 1000
runs in order to provide a reliable measurement. First we will discuss the
CPU side measurements. A single RPC round trip to the GPU took 8,550 ns.
To provide context, we can compare this to a standard CPU system call and
function call. A simple system call, getpid, took significantly less time at
2,518 ns, and a function call took only 979 ns (shown in Figure 6.5). The RPC
is therefore roughly 3.4x slower than the system call and 8.7x slower than the
function call. In contrast, on the GPU we can only measure an RPC call against
a function call due to the lack of system call support. A GPU side RPC call
takes 28,074 GPU cycles, which is a massive 412x slower than a function call.
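The CPU-side comparison points can be approximated with a small timing harness. The C sketch below averages the cost of an operation over a number of runs using clock_gettime; the helper names now_ns, average_ns, call_getpid and call_nothing are hypothetical, and the code neither exercises the RPC path nor reproduces the numbers reported above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock cost of op() in nanoseconds over `runs` calls. */
double average_ns(void (*op)(void), int runs) {
    assert(op != NULL && runs > 0);
    uint64_t start = now_ns();
    for (int i = 0; i < runs; i++) {
        op();
    }
    return (double)(now_ns() - start) / runs;
}

/* The volatile sink keeps the compiler from discarding the calls. */
static volatile long sink;

/* A simple system call, used as a baseline. Some C libraries have
 * cached the result of getpid, which would hide the kernel crossing. */
void call_getpid(void) { sink = (long)getpid(); }

/* A plain function call, used as the cheapest baseline. */
void call_nothing(void) { sink = 0; }
```

Calling `average_ns(call_getpid, 1000)` and `average_ns(call_nothing, 1000)` yields the two baselines; absolute values depend heavily on the machine, the kernel, and the C library.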
The large variation in cost paid for an RPC versus a function or system
call is due to memory pressure. When both the CPU and GPU are access-
ing the same region, the pages thrash between devices. Each time a page is
migrated to or from GPU there is a performance penalty. The sum of these
penalties is the cost for providing this RPC interface. While this price is
high, CPU-based system calls are also expensive, and that cost is accepted
because they provide value.
6.3 GVM Performance
With an understanding of how the individual building blocks perform, we
can now look at how GVMfs performs overall. To test GVMfs we chose to
create a benchmark around checksumming files. We use a basic longitudinal
redundancy [28] checksum algorithm (Figure 6.6), which accesses each byte of
a given file. This algorithm was used in two implementations: native Java and
Java targeted to GVMfs. As a baseline, we use the native Java implementation
running on a standard JVM. This is anticipated to be the fastest, as it has
native access to the file system.
In the comparison we checksum a 12 KB compiled classfile from GVM.
We benchmark GVM against the native JVM, varying the number of times the
file is checksummed. For the native JVM, we invoke the checksum benchmark
x times. For GVM, we launch x threads, each occupying a warp. We do
this to highlight the ability of GVM to scale with more threads. As a small
note, we did experiment with a C implementation, but it was too fast to provide
a meaningful comparison.
public static int checksum(char[] buf) {
    int size = buf.length;
    int tmp = 0;
    int i = 0;

    while (size-- != 0) {





Figure 6.6: Checksum Benchmark Algorithm
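For reference, one common formulation of a longitudinal redundancy check sums every byte modulo 256 and returns the two's complement of that sum. The C sketch below implements this formulation; the function name lrc_checksum is hypothetical, and it is an assumption that this variant resembles the benchmark, so it is not necessarily identical to the algorithm in Figure 6.6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Longitudinal redundancy check over a buffer: add all bytes modulo 256,
 * then take the two's complement, so that the byte sum of the data plus
 * the checksum is zero modulo 256. Every byte is touched exactly once. */
uint8_t lrc_checksum(const uint8_t *buf, size_t size) {
    assert(buf != NULL || size == 0);
    uint8_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum = (uint8_t)(sum + buf[i]);
    }
    return (uint8_t)((sum ^ 0xFFu) + 1u);
}
```

For the bytes 0x01, 0x02, 0x03 the running sum is 0x06 and the checksum is 0xFA, since 0x06 + 0xFA wraps to zero.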
6.3.1 Results
Figure 6.7: Checksum Benchmark Results
The results of this benchmark highlight both the strengths and limitations
of GVM/GVMfs. The initial startup time for GVM is high because data
transformation and transfer must occur (discussed in Section 3.1).
However, we can see that kernel execution time on the GPU climbs slowly. This
is unexpected; we anticipated kernel execution time would only increase once
all of the GPU warps were occupied. This increase in execution time must
come from contention in the RPC queue, both in terms of memory thrashing
and CPU time. Strangely, the Java implementation has a large spike in
execution time. This is the point at which we thought the GVM execution would
slow down, but we are unsure what is causing this.
GVMfs checksum is outperformed by a single-threaded CPU imple-
mentation. This implementation on GVM receives little benefit from GPU
architecture other than raw cores to execute on. In order to achieve higher
performance it is necessary to have some coordination between the tinyBee
instances in order to avoid RPC contention. But, this would mean exposing
much of the intricacies of GPGPU. A more logical approach would be to ex-
tend GVM to support Java threading, enabling more concurrency control for




GVMfs is a combination and evolution of ideas presented in previous research,
mainly GVM [4] and GPUfs [41]. Since we have already discussed these
projects in depth, we take the opportunity to discuss other related research.
7.1 Networking for GPUs
Networking is not natively supported by GPUs, as networking requires a
system call interface. RDMA (Remote Direct Memory Access) network adapters,
unlike traditional network adapters, can directly access memory.
GPUrdma provides GPUs a direct interface to RDMA network adapters
[19]. GPUrdma network requests are issued directly to the RDMA network
adapter through shared memory; any responses will also be written to this
shared memory. This completely eliminates the need for any CPU intervention.
GPUnet uses the same RDMA network adapters to provide a Unix
Sockets programming interface to GPU programs [30]. These sockets also can
be used to facilitate communication with CPU programs. They demonstrate
that the GPUnet interface is flexible enough to build complex networked GPU
applications, such as MapReduce, without a CPU daemon handling network
operations.
7.2 FPGA Accelerator Filesystems
FPGA accelerators lack a standard filesystem interface. BORPH, an FPGA
OS, provides a filesystem interface to FPGA programs [42]. The filesystem
supports standard files, inter-process communication, and hardware IO. The
BORPH kernel maintains an RPC based system call interface to communicate
between the OS and running FPGA processes.
A different FPGA approach is a filesystem implemented directly in
FPGA logic [32]. Because FPGA logic gets programmed directly into
hardware, the FPGA implementation can outperform a software CPU
implementation. The project showed that a modern FPGA can completely saturate
the disk interface while providing a user-friendly filesystem API.
7.3 Generic System Calls For GPUs
Genesys is a library that provides a generic system call interface to GPGPU
programs [43]. This system does so asynchronously. When multiple system
calls occur in a time window, Genesys will coalesce the system call requests such
that the CPU can service the requests all at once. Genesys is shown to handle
most of Linux’s 300+ system calls including performing filesystem operations,
interacting with the network, and device control system calls.
7.4 High-level GPGPU Programming
PTask [37] and Dandelion [36] provide programmers with a high level API to
describe dataflow operations.
PTask exposes a collection of operations that can be used to connect
nodes (PTasks) of a dataflow graph. Each PTask represents a transformation
to be applied when data passes through it. These PTasks can be arbitrarily
combined to create graphs that model complex computation.
Dandelion uses a language-integrated query (LINQ) approach. LINQ
provides the ability to put data queries directly into code. These queries
operate on entire data collections. Dandelion will automatically transform
these operations into native CUDA code to provide more performance.
7.5 GPU Scheduling
GPUs often need to run multiple tasks simultaneously. Devices have hardware
scheduling, but these schedulers do not enforce fairness or account for latency
sensitivity.
TimeGraph provides a real-time GPU scheduler [29]. A programmer
will specify the scheduling constraints for a given task. These constraints
enumerate the priority of a task, the duration of any given time slice, and how
often it should be scheduled. Tasks are divided up into command groups to
enable smaller units of execution to be issued to the GPU.
GPUpIO focuses on providing I/O driven scheduling to GPGPU pro-
grams [46]. GPUpIO builds upon GPUfs’s filesystem interface [41]. When
performing filesystem operations, threadblocks will save their state and
relinquish control to a different threadblock until the I/O request has been serviced.
This asynchronous behavior hides GPUfs filesystem latency, while freeing GPU




While GVMfs provides a usable file system interface for Java programs, there
are usability and performance improvements that can be made. There are two
main limitations for GVMfs discussed in Section 5.6. In order to address these
limitations, we propose engineering avenues to explore.
8.1 Interface
As mentioned, the native Java file classes utilize common interfaces, mainly
DataOutput and DataInput [15, 16]. These interfaces expose much more sur-
face area than the GVMfs implementation does, therefore there would need to
be more CUDA code added to handle the new functions. On top of this, the
accompanying classes such as File would need to be modified to work on GVM.
Additionally, if any classes that GVM relies on to launch the GPU kernel get
modified, there will be errors because it will invoke the modified code instead
of the default implementation. To avoid these collisions there would need to
be significant code modifications to GVM.
8.2 Performance
The most obvious performance issue is the double copy, which occurs when
transforming the file buffer on the GPU into the array format GVM expects.
This is not incredibly hard to fix with some logic on the GPU side, but doing so
will increase response time for RPC operations. A different option is to transfer
the data onto the device the same way, but parallelize the array transformation
to gain some speedup.
Neither of these solutions addresses the largest contributor to the
performance cost of GVMfs: the CPU. Since the CPU resides on the datapath
for opening, closing, and growing files there will always be a latency penalty
for these operations. Nvidia has provided an interesting way to potentially
mitigate this, GPUDirect. GPUDirect allows for GPUs, network adapters,
SSDs and NVMe drives to access GPU memory directly [10]. GVMfs could





GVMfs shows that it is possible to provide first class filesystem support to
GPGPU programs. But the main contribution is integrating a filesystem into
GVM, enabling Java programs that run on GPUs to access the filesystem
transparently.
This integration has performance limitations mainly due to the lack of
native RPC and memory pressure. For such an implementation to become
competitive with CPU based filesystem access, accelerators need to become
a focus in system design. Current systems treat accelerators as add-ons, but
these compute devices are gaining adoption. Hopefully these devices will see
steady improvements towards equality in the computer architecture hierarchy.
Bibliography
[1] IEEE Standards Association. IEEE standard portable operating system
interface for computer environments. IEEE STD, (1003.1), 1988.
[2] Shekhar Borkar and Andrew A. Chien. The future of microprocessors.
Communications of the ACM, 54, May 2011.
[3] Google Brain. Tensorflow overview. Available at https://www.tensor
flow.org/overview (2020/8/9).
[4] Ahmet Celik, Pangyu Nie, Christopher J. Rossbach, and Milos Gligoric.
Design, implementation, and application of GPU-based Java bytecode
interpreters. Proceedings of the ACM on Programming Languages, 3(177),
2019.
[5] CLOCK GETRES(2) Linux Programmer’s Manual, April 2020.
[6] AMD Corporation. ROCm documentation. Available at https://rocm
docs.amd.com/en/latest/ (2020/8/9).





[8] Nvidia Corporation. CUDA C++ best practices guide. Available
at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/i
ndex.html (2020/8/9).
[9] Nvidia Corporation. Event management. Available at
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html
(2020/8/11).
[10] Nvidia Corporation. GPUDirect. Available at
https://developer.nvidia.com/gpudirect (2020/8/9).
[11] Nvidia Corporation. Memory management. Available at
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
(2020/8/9).




[13] Nvidia Corporation. Thrust: A GPU accelerated library. Available at
https://developer.nvidia.com/thrust (2020/8/9).




[15] Oracle Corporation. Interface DataInput. Available at
https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html (2020/8/9).
[16] Oracle Corporation. Interface DataOutput. Available at
https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html
(2020/8/9).
[17] Oracle Corporation. JNI specification. Available at
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/intro.htm,
August 2020.
[18] Cris Crawford. What are the 6502, antic, ctia/gtia, pokey, and freddie
chips? Available at
https://web.archive.org/web/20160305010645/http://www.atari8.com/node/31
(2020/8/9).
[19] Feras Daoud, Amir Watad, and Mark Silberstein. GPUrdma: GPU-side
library for high performance networking from GPU kernels. Technical
report, June 2016.
[20] Michael J. Flynn. Some computer organizations and their effectiveness.
IEEE Transactions on Computers, C-21, September 1972.
[21] freedesktop.org. D-bus. Available at
https://www.freedesktop.org/wiki/Software/dbus/ (2020/8/9).
[22] FUTEX(2) Linux Programmer’s Manual, June 2020.
[23] GETPID(3P) POSIX Programmer’s Manual, 2013.
[24] IEEE/The Open Group. PREAD(3P) POSIX Programmer’s Manual,
June 2013.
[25] Mark Harris. How to access global memory efficiently in CUDA C/C++
kernels. Available at
https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
(2020/8/9), January 2013.
[26] Mark Harris. Unified memory in CUDA 6. Available at
https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ (2020/8/9),
November 2013.
[27] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. The Morgan Kaufmann Series in Computer
Architecture and Design. Elsevier, 4th edition, 2006.
[28] ISO. Information processing — use of longitudinal parity to detect errors
in information messages. November 1978.
[29] Shinpei Kato, Karthik Lakshmanan, Ragunathan (Raj) Rajkumar, and
Yutaka Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking
environments. Technical report, June 2011.
[30] Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Amir Watad,
Emmett Witchel, and Mark Silberstein. GPUnet: Networking abstractions
for GPU programs. In 11th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 14), pages 201–216, Broomfield, CO,
October 2014. USENIX Association.
[31] MADVISE(2) Linux Programmer’s Manual, April 2020.
[32] Ashwin A. Mendon and Ron Sass. A hardware filesystem implementation
for high-speed secondary storage. International Conference on
Reconfigurable Computing and FPGAs, December 2008.
[33] Gordon E. Moore. Cramming more components onto integrated circuits.
Electronics, 38(8), April 1965.
[34] Philip Nee. Introduction to GPGPU and CUDA programming: Thread
divergence. Available at https://cvw.cac.cornell.edu/gpu/thread_div
(2020/8/9), July 2013.
[35] GNU Project. Built-in functions for atomic memory access. Available at
https://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins.html
(2020/8/10).
[36] Chris Rossbach, Yuan Yu, Jon Currey, and Jean-Philippe Martin.
Dandelion: a compiler and runtime for heterogeneous systems. Technical
Report MSR-TR-2013-44, April 2013.
[37] Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray,
and Emmett Witchel. PTask: Operating system abstractions to manage
GPUs as compute devices. SOSP, 2011.
[38] Karl Rupp. 42 years of microprocessor trend data. Available at
https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
(2020/8/9).




[40] Rohan Sharma, Milos Gligoric, Andrea Arcuri, Gordon Fraser, and Darko
Marinov. Testing container classes: Random or systematic? In
Fundamental Approaches to Software Engineering, pages 262–277, 2011.
[41] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. GPUfs:
Integrating a file system with GPUs. ASPLOS, 2013.
[42] Hayden Kwok-Hay So and Robert Brodersen. File system access from
reconfigurable FPGA hardware processes in BORPH. International
Conference on Field Programmable Logic and Applications, September 2008.
[43] Jan Vesely, Arkaprava Basu, Abhishek Bhattacharjee, Gabriel H. Loh,
Mark Oskin, and Steven K. Reinhardt. Generic system calls for GPUs.
ISCA, June 2018.
[44] Willem Visser, Corina S. Pǎsǎreanu, and Radek Pelánek. Test input
generation for Java containers using state matching. In International
Symposium on Software Testing and Analysis, pages 37–48, 2006.
[45] Steve Walton. 4GHz CPU battle: Ryzen 3900x vs. 3700x vs. core i9-9900k.
Available at https://www.techspot.com/article/1876-4ghz-ryzen-3rd-gen-vs-core-i9/
(2020/8/9), December 2019.
[46] Lior Zeno, Avi Mendelson, and Mark Silberstein. GPUpIO: the case for
I/O-driven preemption on GPUs. Pages 63–71, March 2016.
