Tsinghua Science and Technology
Volume 18

Issue 1

Article 3

2013

GPGPU Cloud: A Paradigm for General Purpose Computing
Liang Hu
College of Computer Science and Technology, Jilin University, Changchun 130012, China

Xilong Che
College of Computer Science and Technology, Jilin University, Changchun 130012, China

Zhenzhen Xie
College of Computer Science and Technology, Jilin University, Changchun 130012, China


Recommended Citation
Liang Hu, Xilong Che, Zhenzhen Xie. GPGPU Cloud: A Paradigm for General Purpose Computing. Tsinghua
Science and Technology 2013, 18(1): 22-33.


TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 03/12 pp22-33
Volume 18, Number 1, February 2013

GPGPU Cloud: A Paradigm for General Purpose Computing
Liang Hu, Xilong Che, and Zhenzhen Xie
Abstract: The Kepler General Purpose GPU (GPGPU) architecture was developed to directly support GPU
virtualization and make GPGPU cloud computing more broadly applicable by providing general purpose computing
capability in the form of on-demand virtual resources. This paper describes a baseline GPGPU cloud system
built on Kepler GPUs, for the purpose of exploring hardware potential while improving task performance. It
presents a general scheme which divides the whole cloud system into a cloud layer, a server layer, and
a GPGPU layer, and illustrates the hardware features, task features, scheduling mechanism, and
execution mechanism of each layer. The paper thus provides a better understanding of general-purpose computing
on a GPGPU cloud.
Key words: Kepler; GK110; GPGPU cloud; virtualization; SMX

1 Introduction

Cloud computing employs Internet-based services
to move computations from self-managed resources
within endpoints to on-demand resources in centralized
infrastructures. Cloud computing removes the
limitations that once existed in traditional computing
paradigms, such as platform, hardware, software,
architecture, and geographical location, thus becoming
a leading trend in high performance computing. In the
cloud computing paradigm, the end users contract with
cloud vendors for customized Virtual Machines (VMs),
and interact with the VMs using only a console/browser
through the Internet, with all the data and applications
maintained on the remote servers accessible to end
users from any device, anywhere, at any time.
In state-of-the-art cloud computing, most of the
computations are ultimately undertaken by multicore CPUs and many-core GPUs, regardless of how
they are allocated. Although contemporary CPUs
and GPUs are manufactured with the same
semiconductor technology, the computational
performance of GPUs has increased more rapidly than
that of CPUs. Divergent design purposes have
resulted in devices with very different abilities at the
same order of transistor counts. CPUs are optimized
for high performance sequential code execution, with
more transistors dedicated to control logic such
as branch prediction and out-of-order execution, while
GPUs are optimized for high performance parallel
code execution, with more transistors dedicated to
arithmetic logic such as floating-point calculations
and transcendental functions. Furthermore, GPUs
have also evolved rapidly from graphics-specific to
general-purpose devices, opening a new era of GPU
computing[1], denoted as GPGPU.

Liang Hu, Xilong Che, and Zhenzhen Xie are with College
of Computer Science and Technology, Jilin University,
Changchun 130012, China.
E-mail: {hul, chexilong}@jlu.edu.cn; kappaxie@sina.com.
To whom correspondence should be addressed.
Manuscript received: 2012-12-10; accepted: 2012-12-28
There has been a flood of research activity assessing
GPGPU performance in cloud environments, leading
to a new paradigm for general purpose computing, the
GPGPU cloud. The power of the GPGPU cloud has been
validated by successful applications in a great many
contexts[2-4]. Several middleware systems[5-8] have
been developed to address GPGPU virtualization,
along with programming models for this paradigm[9, 10]. As
the GPGPU cloud evolved, NVIDIA Corporation
introduced the code name “Kepler” for its
fourth-generation GPU computing architecture[11] in
2012, under serial number “GK110”. Kepler is the most
significant leap forward in NVIDIA’s architecture:
it directly supports GPU virtualization without the
need to invoke third-party middleware, thus making
GPGPU cloud computing more broadly applicable by
providing general purpose computing capability in the
form of on-demand virtual resources.
With the development of the GPGPU cloud, many
researchers have attempted to explain how the GPU works.
Several studies[12, 13] tried to illustrate the operating
mechanism of GPU computing; however, these
illustrations were graphics specific rather than general
purpose. Wang and Shen[4] presented a GPU-based
cloud computing system for transportation management
with the system structure clearly illustrated and
generalized, but most of the operating details
were omitted. Kindratenko et al.[14] discussed such
issues as cluster architecture, resource management,
programming models, and application experience.
However, they illustrated a GPGPU cluster instead of a GPGPU
cloud, and state-of-the-art technologies such as those in Kepler
were not covered.
To our knowledge, this is the first paper to describe
a generalized GPGPU cloud system built on Kepler
GPGPUs, for the purpose of exploring hardware
potential while improving task performance. To keep the
presentation clear and concise, this paper
does not discuss basic concepts such as core, SMX,
thread, warp, block, atomic, and texture; readers are presumed to have
some preliminary knowledge. Also, this paper is not a
technical report describing the hardware logic circuits,
but an overview of the structural components, with only
the necessary components illustrated.

2 A General Scheme of GPGPU Cloud Computing

Figure 1 shows a general scheme of GPGPU cloud
computing. The figure includes the multiple layers in
the hardware hierarchy and in the task hierarchy, as
well as the mapping relation between the two hierarchies.
The hardware and task hierarchies can be either
homogeneous or heterogeneous; they are treated here
as homogeneous for simplicity, since heterogeneous
hierarchies make no significant difference to our
illustration. There are three layers for
mapping tasks to hardware components, from coarse
to fine granularity, as listed below. Note
that only key concepts are given here, with more details
given later.

Fig. 1 A general scheme of GPGPU cloud computing.

(1) Cloud layer A baseline GPGPU cloud system
contains a virtualization management server
maintaining a pool of VMs, a cluster of GPGPU
cloud servers holding the resources for the VM
instances, and a network storage system managing
a pool of Virtual Hard Disks (VHDs). All the cloud
servers are interconnected through a low-latency
interconnection network, such as InfiniBand[15] . In
this layer, a cloud job contains a set of parallel VM
jobs, with each VM job treated as an atomic task
unit. A VM job is executed on a VM allocated for
the user, with its job code and job data stored on the
network storage system before execution.
(2) Server layer A baseline GPGPU cloud server
contains one hypervisor managing all the physical
devices, an array of Kepler GPUs, and a
host memory shared by them. All devices are
interconnected through an interconnection network,
such as a main-board system bus. A VM job invokes
a CUDA procedure, which contains a sequence of
kernel grids, with each kernel grid assigned a stream
ID. In this layer, each kernel grid is treated as
an atomic task unit. A kernel grid is scheduled
by a control step (which is executed by a CPU)
and executed by a GPGPU, whose kernel code and
kernel data are loaded from the network storage
system to the host memory before execution.
(3) GPU layer A Kepler GPU contains a Giga Thread
Engine (GTE), two Data Transfer Engines (DTEs),
an array of SMXs, and a device memory shared by
them. All components are interconnected through
an interconnection network, such as an on-chip
GPU bus. In this layer, a kernel grid contains a set
of parallel kernel blocks, with each block treated as
an atomic task unit. A kernel block is scheduled by
a GTE and executed by an SMX, with the kernel
code and block data loaded by DTEs from the host
memory to the device memory before execution.
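The three-layer mapping above can be summarized as a nested data structure. The following Python sketch is only an illustration of the task hierarchy; the class and field names (CloudJob, VMJob, KernelGrid, atomic_units) are illustrative assumptions and are not part of the system described in this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KernelGrid:
    """Atomic task unit of the server layer; split into blocks in the GPU layer."""
    stream_id: int
    num_blocks: int  # kernel blocks are the atomic task units of the GPU layer


@dataclass
class VMJob:
    """Atomic task unit of the cloud layer; invokes a sequence of kernel grids."""
    name: str
    kernel_grids: List[KernelGrid] = field(default_factory=list)


@dataclass
class CloudJob:
    """A cloud job is a set of parallel VM jobs."""
    vm_jobs: List[VMJob] = field(default_factory=list)


def atomic_units(job: CloudJob) -> dict:
    """Count the atomic task units that each layer schedules for one cloud job."""
    grids = [g for vm_job in job.vm_jobs for g in vm_job.kernel_grids]
    return {
        "cloud_layer_vm_jobs": len(job.vm_jobs),
        "server_layer_kernel_grids": len(grids),
        "gpu_layer_kernel_blocks": sum(g.num_blocks for g in grids),
    }


# Hypothetical cloud job with two VM jobs and three kernel grids in total.
example_job = CloudJob(vm_jobs=[
    VMJob("vm_job_0", [KernelGrid(stream_id=0, num_blocks=8),
                       KernelGrid(stream_id=1, num_blocks=4)]),
    VMJob("vm_job_1", [KernelGrid(stream_id=0, num_blocks=16)]),
])
```

Calling atomic_units(example_job) counts the VM jobs, kernel grids, and kernel blocks, that is, the units scheduled by the cloud layer, the server layer, and the GPU layer, respectively.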
In a GPGPU cloud, the cloud user customizes the
computing ability of the VM(s), and organizes the VM
jobs in each cloud job for execution. If the hypervisor
permits, the cloud user may choose the VM parameters
for a fixed computation ability, such as the number
and type of Virtual CPUs (VCPUs), the number and
capabilities of Virtual GPUs (VGPUs), the Virtual
Memory (VMEM) capacity, the VHD capacity, and the
Virtual Network (VN) bandwidth. This paper simply
assumes that there is one VCPU and one VGPU in each
VM. A cloud user can set up one or more VMs. In
the context of this paper, the source file of a VM job
can be coded using a standard programming language
such as C/C++, with extensions for Message Passing
Interface (MPI) directives[16] and CUDA directives[17].
Such source files can be compiled by the NVCC
compiler[18]. The MPI processes provide the inter-VM
parallelism and the CUDA procedures provide the intra-VM parallelism.
The task granularity in the different layers is scalable
and flexible; thus, the entire computation can be
distributed in a flat manner or a hierarchical manner. A
GPGPU cloud job may contain one or more VM jobs,
and each VM job may create one or more CUDA streams.
From the GPGPU computing perspective, the code
executed by the VCPUs is for scheduling rather than
computing, although it could also undertake CPU
computing tasks while the VGPUs run, as in CPU-GPU
mixed contexts.

3 Cloud Layer

3.1 Task features

A GPGPU cloud job is generated by a cloud user as
the coarsest-granularity task; thus, it is independent and does
not need to communicate with other cloud jobs. Each
cloud job consists of one or more VM job executables,
a set of data files, and a Job Description Document
(JDD). A cloud job uses the JDD to describe the VM job
organization, executable paths, data file paths, resource
requirements, execution parameters, and environment
variables. Each VM job runs on one VM and accesses
a VM job data set, which is a portion or all of the data
files for the entire cloud job. The number of VM jobs
can exceed the number of VMs, in which case some VMs
have multiple VM jobs residing on them. VM jobs in one
cloud job interact with each other through a shared VHD
or message passing for data sharing or synchronization.
Since each VM job has its own program code, the
synchronization may involve either some or all of the
VM jobs within a cloud job.
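As an illustration of what a JDD might carry, the following Python sketch represents a JDD as a dictionary and checks it for the kinds of fields listed above. The paper does not define a concrete JDD format, so the key names, the paths, and the function validate_jdd are all hypothetical.

```python
# Hypothetical JDD layout: the text lists the kinds of fields a JDD holds
# but does not define a concrete format, so every key below is an assumption.
VM_JOB_FIELDS = ("executable", "data_files", "resources", "parameters", "environment")


def validate_jdd(jdd):
    """Return a list of problems found in a JDD; an empty list means it passed."""
    problems = []
    job_files = set(jdd.get("data_files", []))
    vm_jobs = jdd.get("vm_jobs", [])
    if not vm_jobs:
        problems.append("a cloud job needs at least one VM job")
    for index, vm_job in enumerate(vm_jobs):
        for key in VM_JOB_FIELDS:
            if key not in vm_job:
                problems.append(f"vm_jobs[{index}] is missing '{key}'")
        # A VM job data set is a portion or all of the cloud job's data files.
        unknown = set(vm_job.get("data_files", [])) - job_files
        if unknown:
            problems.append(f"vm_jobs[{index}] uses undeclared files {sorted(unknown)}")
    return problems


# Hypothetical cloud job with two VM jobs sharing the same executable.
example_jdd = {
    "data_files": ["/vhd/job/a.dat", "/vhd/job/b.dat"],
    "vm_jobs": [
        {
            "executable": "/vhd/job/solver",
            "data_files": ["/vhd/job/a.dat"],
            "resources": {"vcpus": 1, "vgpus": 1, "vmem_gb": 4},
            "parameters": ["--iterations", "100"],
            "environment": {"CUDA_VISIBLE_DEVICES": "0"},
        },
        {
            "executable": "/vhd/job/solver",
            "data_files": ["/vhd/job/a.dat", "/vhd/job/b.dat"],
            "resources": {"vcpus": 1, "vgpus": 1, "vmem_gb": 4},
            "parameters": ["--iterations", "100"],
            "environment": {"CUDA_VISIBLE_DEVICES": "0"},
        },
    ],
}
```

In this sketch the first VM job accesses a portion of the cloud job data files and the second accesses all of them, matching the two cases described in the text.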
3.2 Hardware features

A baseline GPGPU cloud system consists of a
virtualization management server, a cluster of
GPGPU cloud servers, a network storage system,
and an interconnection network. The virtualization
management server interacts with the cloud users and
provides virtual resources to them through web-based
services. The server maintains three types of virtual
resources: a set of VMs, a set of VHDs, and
a VN. These virtual resources are mapped to physical
resources, with VMs mapped to the cloud servers, VHDs to the
network storage system, and the VN to the interconnection
network. In GPGPU cloud computing, the computing
power of physical resources is partitioned into virtual
resources and invoked indirectly, rather than directly as
in traditional cluster computing.
(1) The cloud servers organize the computing resources
of the cloud. These can be physically implemented
in a computing cluster and logically partitioned
into several VMs allocated to cloud users. The
cloud user employs one or more VMs to provide
a customized computing environment for running
applications and programs.
(2) The network storage system maintains the cloud
storage resources. It can be physically implemented
by one or more disk arrays under the management
of a network file system, and can be logically
partitioned into several VHDs allocated to the cloud
users. A cloud user uses the VHD to store the guest
operating system images for the VMs, the driver
and configuration files, and the cloud job files.
(3) The interconnection network maintains the cloud
networking resources. It can physically connect the
cloud servers in any topology and operate under any
routing rules. The VN runs on top of the physical
connections and translates virtual traffic among the
VMs into physical traffic within or among the cloud
servers.
3.3 Scheduling mechanism

A cloud user can generate multiple cloud jobs, here
simply assumed to be executed sequentially, one job at
a time. The scheduling of each cloud job is achieved
by the scheduling of its VM jobs, and is managed
by the cloud user. More specifically, if the resource
requirements are satisfied, the cloud job is ready for
assignment; otherwise, adjustments are made on either
the VM parameters or the organization of the VM jobs
until they are compatible with each other. After that,
the cloud user assigns each VM job to a target VM and
launches it for execution. The resource requirement for
a VM job is that it cannot exceed the capacity of the
virtual resources in the VM. The resource requirement
for a cloud job is that there are sufficient VMs to run all
its VM jobs at the same time, since the VM jobs might
interact with each other and will stall if some are left
unassigned.
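The two resource requirements can be expressed as a simple check. The following Python sketch is a minimal illustration under stated assumptions: resources are described by dictionaries with hypothetical keys such as "vmem_gb", and VM jobs are placed by a first-fit heuristic. The paper leaves the placement to the cloud user, so the first-fit rule is an assumption, and it may reject a cloud job for which some other placement would succeed.

```python
def vm_job_fits(requirement, capacity):
    """A VM job must not exceed the capacity of any virtual resource of its VM."""
    return all(amount <= capacity.get(name, 0) for name, amount in requirement.items())


def cloud_job_ready(vm_job_requirements, vm_capacities):
    """Check that all VM jobs of a cloud job can reside on the VMs at the same time.

    Placement uses first fit (an assumption): each VM job goes to the first VM
    whose remaining capacity still covers its requirement.
    """
    remaining = [dict(capacity) for capacity in vm_capacities]  # do not mutate inputs
    for requirement in vm_job_requirements:
        for capacity in remaining:
            if vm_job_fits(requirement, capacity):
                for name, amount in requirement.items():
                    capacity[name] = capacity.get(name, 0) - amount
                break
        else:
            # No VM can host this VM job, so the cloud job is not ready.
            return False
    return True
```

For example, two VM jobs that each need 4 GB of VMEM can share one VM with 8 GB, whereas a pair needing 4 GB and 6 GB cannot; in the latter case either the VM parameters or the organization of the VM jobs must be adjusted.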

3.4 Execution mechanism

A cloud job is a logical task concept executed in the
form of all its VM jobs. A VM can run multiple VM
jobs at the same time, as long as the overall resources
consumed by them do not exceed the VM resource
capacity. In this layer, we treat each VM job as an
atomic task unit, with its execution workflow containing
five pipelined stages.
(1) Create stage The cloud user contracts with the
cloud system, sets up one or more VMs on demand,
and uploads the executables and data set files to the
VHD allocated by the cloud system.
(2) Check stage The cloud user checks the
compatibility of the VMs and the VM jobs,
and adjusts the VMs or the VM jobs as needed.
(3) Assign stage Once a cloud job is assigned, all its
VM jobs get assigned. The assignment is performed
by the cloud user, either by hand or by executable
scripts, guided by the associated JDD.
(4) Execute stage The VM jobs run in parallel on
one or more VMs. Each VM job corresponds to
one MPI process during execution, with all
code and data accesses served in the form of
virtual traffic between the VMs and the VHD. The
synchronization among VM jobs is served as virtual
traffic among the VMs.
(5) Clear stage When all the VM jobs in a cloud job
are finished, the cloud user collects the job results,
deletes the cloud job, and moves to the next cloud
job or even terminates the VM instances.

4 Server Layer

4.1 Task features

A VM job executes within a VM and operates on
the VM job data set. Its program code contains
the MPI steps for context maintenance and inter-job
communication, and the CUDA codes for managing
and executing the kernel grids. The MPI steps
include code/data access calls to the VHD, message
send/receive calls to other VM jobs, etc. The CUDA
codes include kernel launch steps, kernel data transfer
steps, and kernel grids. The kernel launch steps and
kernel data transfer steps are control steps for
scheduling the kernel grids, where each kernel grid
contains a GPGPU computing task.

4.2 Hardware features

A baseline GPGPU cloud server contains one or more
CPUs, one or more Kepler GPUs, and host memory,
all these physical devices linked by an interconnection
network, such as a main-board system bus. Each cloud
server has a hypervisor running on top of it, which is
like a low level operating system that runs between the
physical machine layer and the VM layer. With the
help of the hypervisor, a cloud server can be exposed as a
set of VMs. Each VM has one VCPU, one VMEM,
and one VGPU. These virtual devices are mapped to
physical devices, with VCPUs to CPUs, VGPUs to
Kepler GPUs, and VMEMs to host memory.
During the execution of VM jobs, VCPUs execute
all the MPI steps and the control steps of the
CUDA procedures, while VGPUs execute all the kernel grids
assigned by the CUDA procedures. The management
of kernel grids is achieved by the channel mechanism.
Each Kepler GPU has multiple physical channels,
denoted as Hyper-Q links[11], while each VGPU has
one virtual channel. These virtual channels are mapped
to physical channels in a one-to-one manner. All the
VM jobs residing on a VM share the same virtual channel
and assign their kernel grids through it to the same Kepler
GPU for execution. The channel mechanism enhances
the occupancy of the physical devices, as well as the
scalability of the VM jobs.
4.3 Scheduling mechanism

The cloud server layer breaks up the atomicity of a
VM job and focuses on the scheduling of its respective
parts. Figure 2 illustrates the distribution of these parts
during the running of VM jobs. The coordination
between a VCPU and a VGPU for executing a
VM job is achieved by the delivery of Push Buffer
Streams (PBSs)[19]. Each PBS is a sequence of Push
Buffer Commands (PBCs) generated by a VCPU and
consumed by a VGPU; it contains all the scheduling
information for executing a kernel grid.

Fig. 2 Scheduling mechanism of server layer.
When a VCPU executes a kernel data transfer step,
the generated PBS contains several PBCs that describe
the source/target memory addresses, data size, memory
types, etc., denoted as a Transfer PBS (TPBS). When
a VCPU executes a kernel launch step, the generated
PBS contains several PBCs that describe the execution
parameters of the kernel grid, the starting address of the
kernel code, the resource requirements, etc., denoted
as a Launch PBS (LPBS). Therefore, when a CUDA
procedure manages the data transfer and execution for
its kernel grids, the VCPU of a VM sends a sequence of
PBSs to the VGPU of the same VM in response to the
control steps.
Each VGPU has a virtual channel which maps exclusively
to a physical channel of the Kepler GPU. Each
virtual/physical channel is a single FIFO queue that
buffers all the PBSs delivered to it. Therefore, the PBSs
are stored in the order of their arrival, and the Kepler GPU
processes the PBSs inside the channel in the
same order.
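The delivery of PBSs through a channel can be modeled as a producer and a consumer around a FIFO queue. The following Python sketch is a simplified model, not the actual push buffer format: the PBC names inside each PBS ("SetSource", "SetKernel", and so on) and the helper names are hypothetical, chosen only to mirror the kinds of information described above.

```python
from collections import deque


def transfer_pbs(source, target, size_bytes):
    """Build a Transfer PBS (TPBS) as a list of PBCs (hypothetical PBC names)."""
    return {"type": "TPBS",
            "pbcs": [("SetSource", source), ("SetTarget", target), ("SetSize", size_bytes)]}


def launch_pbs(kernel_name, num_blocks, threads_per_block):
    """Build a Launch PBS (LPBS) as a list of PBCs (hypothetical PBC names)."""
    return {"type": "LPBS",
            "pbcs": [("SetKernel", kernel_name), ("SetGridSize", num_blocks),
                     ("SetBlockSize", threads_per_block)]}


class Channel:
    """A single FIFO queue: PBSs are consumed in the order of their arrival."""

    def __init__(self):
        self._queue = deque()

    def deliver(self, pbs):
        """Called by the VCPU side when a control step generates a PBS."""
        self._queue.append(pbs)

    def drain(self):
        """Called by the VGPU side; returns the PBS types in processing order."""
        processed = []
        while self._queue:
            processed.append(self._queue.popleft()["type"])
        return processed


# A typical CUDA procedure: copy input in, launch the kernel grid, copy output back.
channel = Channel()
channel.deliver(transfer_pbs("host", "device", 4096))
channel.deliver(launch_pbs("vector_add", 8, 128))
channel.deliver(transfer_pbs("device", "host", 4096))
processed_order = channel.drain()
```

Because the channel is a single FIFO queue, the input transfer is always processed before the launch, and the launch before the output transfer.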
4.4 Execution mechanism

In GPGPU computing, the MPI steps and the control
steps are assumed to contribute to the kernel grids
management only, rather than to the computations.
Therefore, the computation for a VM job involves
execution of all its kernel grids, where each kernel grid
is the minimal scheduling unit in the server layer. The
kernel grids delivered to the same channel are executed
in sequence, while the kernel grids delivered to different
channels are executed out-of-order. The lifetime of a
kernel grid contains four pipelined stages.
(1) Assign stage The VCPU/CPU executes a kernel
launch step, assigns a semaphore variable in the
VMEM/host memory to the kernel grid, and then
generates an LPBS that represents the assignment of the kernel
grid. The address of the semaphore variable is also
included in the LPBS. There is usually a control
step of data transfer (from VMEM/host memory to
device memory) before this stage to make the input
data ready.
(2) Queue stage The VGPU/GPU accepts the LPBS
of the kernel grid and stores it inside the
virtual/physical channel. When the LPBS reaches
the end of the channel, the GPU judges whether it
is to be processed or stalled.
(3) Execution stage The VGPU/GPU processes the
LPBS, one PBC at a time until the whole PBS
is consumed. Accordingly, the execution of the
associated kernel grid is driven by the scheduling
information contained in the PBCs.
(4) Clear stage When a kernel grid finishes
execution, the VGPU/GPU signals the VCPU/CPU
to release the occupied semaphore, so that
dependent control steps stalled by the semaphore
can be released for execution. Note that there is
usually a control step of data transfer (from device
memory to VMEM/host memory) after this stage
to return the output data.
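The four stages above can be traced with a small state model. The following Python sketch is a simplified illustration in which the semaphore is a Boolean stored under an integer address; the class ServerLayerModel and its method names are hypothetical and do not correspond to any real driver interface.

```python
from collections import deque


class ServerLayerModel:
    """Toy model of the assign, queue, execution, and clear stages of a kernel grid."""

    def __init__(self):
        self.semaphores = {}    # semaphore address -> True while occupied
        self.channel = deque()  # FIFO channel holding LPBSs
        self.log = []           # (stage, kernel grid name) in the order of occurrence
        self._next_address = 0

    def assign(self, grid_name):
        """Assign stage: allocate a semaphore and generate an LPBS that carries its address."""
        address = self._next_address
        self._next_address += 1
        self.semaphores[address] = True
        self.log.append(("assign", grid_name))
        return {"grid": grid_name, "semaphore": address}

    def queue(self, lpbs):
        """Queue stage: store the LPBS inside the channel."""
        self.channel.append(lpbs)
        self.log.append(("queue", lpbs["grid"]))

    def execute_next(self):
        """Execution stage: consume the LPBS at the end of the channel."""
        lpbs = self.channel.popleft()
        self.log.append(("execute", lpbs["grid"]))
        return lpbs

    def clear(self, lpbs):
        """Clear stage: release the semaphore so that dependent steps can proceed."""
        self.semaphores[lpbs["semaphore"]] = False
        self.log.append(("clear", lpbs["grid"]))

    def is_stalled(self, lpbs):
        """A control step that depends on this kernel grid is stalled until clear."""
        return self.semaphores[lpbs["semaphore"]]
```

In this model, a control step that depends on a kernel grid (for example, the transfer of its output data) remains stalled from the assign stage until the clear stage releases the semaphore.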

5 GPGPU Layer

5.1 Task features

A kernel grid contains a large number of kernel threads
that share the same kernel code, but operate on different
thread index (TID) based data addresses of the kernel
grid data set. A kernel grid has to be partitioned into
one or more kernel blocks for scheduling and execution.
The partitioning is derived from two concerns. First,
the thread count of the kernel grid may be too large
for the SMX capacity, so the partitioning guarantees
that each kernel block is small enough to reside on
an SMX. Second, the number of kernel grids ready
for execution might be less than the number of SMXs
within a Kepler GPU; thus, partitioning each kernel grid
into kernel blocks remarkably increases the possibility
that each SMX is busy serving one or more kernel
blocks, which enhances the Kepler GPU occupancy and
thus, the execution performance. The whole kernel grid
data set is also partitioned into a group of kernel block
data sets, with each data set processed exclusively by
one kernel block.
The kernel grid partitioning, however, imposes a
constraint on the kernel code. The kernel
blocks from the same kernel grid may run on different
SMXs, or a portion of blocks may be running while
the rest are waiting unassigned. The blocks that get
resources for execution are called active blocks, while
the unassigned blocks are called pending blocks. For
scalability concerns, kernel blocks are not allowed to
interact with each other during execution. For instance,
each synchronization operation involves only the threads
within a kernel block. Thus, kernel threads from the
same kernel block can interact with each other, but
kernel threads from different blocks cannot interact
with each other, even if they belong to the same kernel
grid. Therefore, when all the threads within a kernel
grid must interact with each other during execution, the
kernel grid must contain only one kernel block. In this
case, the kernel grid size cannot exceed the capacity of
one SMX and its execution is served by only one SMX,
even if all the other SMXs are idle.
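The partitioning of a kernel grid and the TID-based addressing can be illustrated with a short calculation. The following Python sketch assumes a one-dimensional kernel grid; the function names and the limit of 1024 threads per block used in the example are assumptions made for illustration, and the index formula mirrors the usual CUDA idiom blockIdx.x * blockDim.x + threadIdx.x.

```python
def partition_grid(thread_count, max_threads_per_block):
    """Split a 1-D kernel grid into blocks that are small enough for one SMX.

    Returns (number of blocks, threads per block). The last block may contain
    surplus threads, which the kernel code is expected to mask out.
    """
    if thread_count <= 0 or max_threads_per_block <= 0:
        raise ValueError("thread counts must be positive")
    threads_per_block = min(thread_count, max_threads_per_block)
    num_blocks = -(-thread_count // threads_per_block)  # ceiling division
    return num_blocks, threads_per_block


def global_thread_index(block_index, threads_per_block, thread_in_block):
    """Thread index (TID) used to address the kernel grid data set."""
    return block_index * threads_per_block + thread_in_block
```

For example, partition_grid(10000, 1024) yields 10 blocks of 1024 threads, which can be spread over several SMXs, whereas partition_grid(512, 1024) yields a single block that is served by only one SMX.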
In the pre-Kepler architectures[20] , a kernel grid could
only be generated by a control step and one kernel grid
could not invoke another kernel grid. However, in the
Kepler architecture, kernel grids can be executed in a
nested arrangement without the assistance of control
steps. Therefore, the kernel code of the main kernel
grid can invoke the kernel code of one or more sub
kernel grids. This feature enhances the applicability of
the CUDA programming model, for example, by adding
support for recursive algorithms.
5.2 Hardware features

Figure 3 shows the micro-architecture of a Kepler
GPGPU. A Kepler GPU has one GTE and two DTEs
that collaborate as the GPGPU front end and operate
asynchronously on multiple physical channels. All
the LPBSs are served by the GTE and all the TPBSs
are served by the DTEs. The GTE partitions each
kernel grid into kernel blocks driven by an LPBS
and assigns them to SMXs for execution. The DTE
provides streaming bidirectional data transfer between
the VMEM/host memory and the device memory driven
by a TPBS.
The GTE contains a Grid Management Unit (GMU)
and a CUDA Work Distributor (CWD), which work
together to manage the assignments of the kernel
grids. The GMU is a newly introduced component
in the Kepler architecture which can serve not only
the kernel grids created by the control steps of the
CUDA procedures, but also the kernel grids nested
inside other kernel codes that are dynamically generated
as the SMXs execute the kernel codes. The CWD
buffers active kernel grids scheduled by the GMU and
dispatches them to available SMXs for execution. The
GMU and the CWD communicate with each other
through a bidirectional path to enable advanced dispatch
operations, such as conditional hold/suspend/resume of
kernel grid assignment and dependency checks.
Each SMX can execute multiple kernel blocks at
the same time, as long as the blocks do not exceed
its resource capacity. The reference counters[21] record
the execution states of the assigned kernel grids. Each
reference counter is composed of one launch counter
and one complete counter that are coupled to serve one
kernel grid during execution. When one of the kernel
blocks is assigned to an SMX, the launch counter is
increased by one, while when one of the kernel blocks
finishes execution, the complete counter is increased by
one.
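The behavior of a reference counter can be sketched as a pair of integers. In the following Python sketch, the class ReferenceCounter and its derived properties are hypothetical; in particular, treating a kernel grid as finished when the complete counter equals the total block count is our reading of the mechanism rather than a documented hardware rule.

```python
class ReferenceCounter:
    """Launch counter and complete counter coupled to one kernel grid."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.launched = 0   # launch counter
        self.completed = 0  # complete counter

    def on_block_assigned(self):
        """A kernel block is assigned to an SMX: increase the launch counter."""
        if self.launched >= self.total_blocks:
            raise RuntimeError("all kernel blocks have already been assigned")
        self.launched += 1

    def on_block_finished(self):
        """A kernel block finishes execution: increase the complete counter."""
        if self.completed >= self.launched:
            raise RuntimeError("no active kernel block to finish")
        self.completed += 1

    @property
    def pending_blocks(self):
        """Blocks still waiting unassigned."""
        return self.total_blocks - self.launched

    @property
    def active_blocks(self):
        """Blocks assigned to an SMX but not yet finished."""
        return self.launched - self.completed

    @property
    def grid_finished(self):
        """Assumed completion rule: every block has finished."""
        return self.completed == self.total_blocks
```

The difference between the two counters gives the number of active blocks, and the difference between the block count and the launch counter gives the number of pending blocks.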
The entire address space of the device memory is
divided into push buffer, code buffer, and data buffer.
The push buffer takes the form of multiple FIFO
queues, where each queue is associated with one physical
channel and enqueues all the PBSs delivered through
the channel. The code buffer stores the kernel codes
for all kernel grids residing on the GPGPU. A kernel
code might be too long to fit into the SMXs; therefore,
each kernel code is partitioned into segments called
kernel phrases, with one phrase loaded into an SMX
at a time[22] . The data buffer stores the data set for
each kernel grid. Since a kernel grid may process
different data types, the data buffer can be dynamically
allocated as several memory types, including local,
global, constant, texture, and surface memories. Each
memory type is accessed in a unique manner.
Fig. 3 Kepler GPGPU micro-architecture.

(1) Local memory This stores private data for each
kernel grid residing on the GPGPU. Private means
that each address in this memory can only be
accessed by one thread of one kernel grid during
the entire life time of the kernel grid. When a
kernel block needs more registers than an SMX
can provide, the overflow is stored by cells in the
local memory. This is a one-dimensional (1-D)
linear memory which can be read/written by both
the VCPU/CPU and the SMXs.
(2) Global memory This stores public data for each
kernel grid residing on the GPGPU. Public means
that each address in this memory can be accessed
by any thread in any kernel grid within the GPGPU.
Therefore, global memory can be used to transfer
intermediate results from an earlier kernel grid to a
later kernel grid, as long as the earlier one finishes before
the later one starts. This is a 1-D linear memory which
can be read/written by both the VCPU/CPU and the
SMXs.
(3) Constant memory This stores constant data for
each kernel grid residing on the GPGPU. Constant
means that each address in this memory is read only
and can be read by any thread of any kernel grid
within the GPGPU. Therefore, constant memory
can be used to transfer parameters into kernel grids
to guide execution. This is a 1-D linear memory
which can be written only by the VCPU/CPU and
read only by the SMXs.
(4) Texture memory This stores texture data for each
kernel grid residing on the GPGPU. Each data
element stored here is not indexed by an address,
but by a special object called texture reference. A
data access request for texture memory is served
by texture fetching rather than by data addressing.
A texture reference can only be declared as a
static global variable and can not be passed as an
argument to a function. This is a read only 1-D/
2-D/3-D CUDA array which is in fact implemented
using 1-D linear memory. Texture memory can be
written only by the VCPU/CPU and read only by
the SMXs.
(5) Surface memory This stores surface data for each
kernel grid residing on the GPGPU. Each data
element stored here is not indexed by an address,
but by a special object called surface reference. A
data access request for surface memory is served
by surface read/write functions rather than by data addressing.
A surface reference can only be declared as a
static global variable and cannot be passed as an
argument to a function. This is a 1-D/
2-D/3-D CUDA array which is in fact implemented
using 1-D linear memory. Unlike texture memory,
surface memory can be both read and written by
the SMXs, in addition to being written by the
VCPU/CPU.
A 1-D CUDA array is a data structure whose data
elements are arranged in consecutive logical addresses
with one dimension, like a row. A 2-D CUDA array
is a data structure whose data elements are arranged in
consecutive logical addresses with two dimensions, like
a matrix; while a 3-D CUDA array is a data structure
whose data elements are arranged in consecutive logical
addresses with three dimensions, like a cube. In
reality, all kinds of CUDA arrays are implemented
by an address mapping mechanism onto a 1-D linear
memory whose data elements have consecutive physical
addresses. These CUDA arrays are hardware optimized
for texture/surface fetching.
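The idea of mapping a multi-dimensional CUDA array onto 1-D linear memory can be illustrated with a plain row-major linearization. The following Python sketch is only an assumption made for illustration: the actual layout of CUDA arrays is opaque and optimized by the hardware for texture/surface fetching, so it need not be row-major, and the function linear_offset is hypothetical.

```python
def linear_offset(index, shape):
    """Map a 1-D/2-D/3-D logical index onto an offset in 1-D linear memory.

    index is (x,), (x, y), or (x, y, z); shape is (width,), (width, height),
    or (width, height, depth). The x coordinate varies fastest (row-major rows).
    """
    if len(index) != len(shape) or not 1 <= len(shape) <= 3:
        raise ValueError("index and shape must have the same rank, from 1 to 3")
    offset, stride = 0, 1
    for coordinate, extent in zip(index, shape):
        if not 0 <= coordinate < extent:
            raise IndexError("coordinate out of range")
        offset += coordinate * stride
        stride *= extent  # elements covered by one step along the next dimension
    return offset
```

With this mapping, element (x, y, z) of an array of size (width, height, depth) is stored at offset x + y * width + z * width * height, so every logical index corresponds to exactly one physical address.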
5.3 Scheduling mechanism

The scheduling of each kernel grid is in fact the
consumption of its corresponding PBS. At each fetch
cycle, the push buffer fetches one PBS and routes it to
the GTE or a DTE according to its type. The PBSs
from different physical channels are processed out-of-order, while PBSs from the same physical channel
are processed in sequence. However, there is still
a considerable probability that the executions of two
adjacent kernel grids from the same channel
overlap. For example, the PBS of a kernel grid
may be processed before the last kernel block of an earlier
kernel grid finishes. In order to maintain the correctness
of all GPGPU tasks, the stream mechanism is used
to manage the overlapped execution of kernel grids, as
shown in Fig. 4. Each kernel grid has a stream ID which
is bound to all the PBSs associated with the kernel
grid. Logically, each CUDA procedure can create up to
16 streams, where each stream is a logical FIFO queue
with a unique stream ID. In reality, these streams
share one physical channel in a multiplexed manner.
When a PBS reaches the end of the physical channel,
the push buffer checks whether the stream required
by the PBS is occupied by a running kernel grid. If
the stream is vacant, the PBS is free to be processed;
otherwise, the PBS is stalled for next round check. The
stream mechanism guarantees that each data transfer
or kernel execution exclusively occupies one stream.
Thus, kernel grids with the same stream ID run in
strict sequential order, while kernel grids with different
stream IDs run asynchronously. The stream mechanism
also provides intrinsic synchronization of kernel grids.
Consider three adjacent kernel grids (from the same
physical channel) sharing the same stream ID: the first
block of the second grid cannot be executed until all
the blocks of the first grid finish execution, and the first
block of the third grid cannot be executed until all the
blocks of the second grid finish execution. Therefore,
although the kernel blocks in the same kernel grid run
asynchronously, all the kernel threads of each kernel
grid are intrinsically synchronized at the start and end
of the kernel execution.

Fig. 4 Stream mechanism.
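
The vacancy check at the end of the physical channel can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions, not the hardware logic: the function name `simulate_channel`, the round-based timing, the per-grid duration, and the rule that a stalled PBS at the head of the channel blocks the PBSs behind it are all introduced here for the example.

```python
from collections import deque


def simulate_channel(pbs_list):
    """Model one physical channel shared by several logical streams.

    pbs_list: (grid_name, stream_id, duration_in_rounds) tuples in channel order.
    Returns {grid_name: (start_round, end_round)} with end_round exclusive.
    """
    channel = deque(pbs_list)   # PBSs leave the channel in FIFO order
    busy_until = {}             # stream_id -> round at which the stream is vacant again
    schedule = {}
    round_no = 0
    while channel:
        name, stream_id, duration = channel[0]
        if busy_until.get(stream_id, 0) <= round_no:
            # Stream is vacant: the PBS is processed and occupies the stream.
            channel.popleft()
            busy_until[stream_id] = round_no + duration
            schedule[name] = (round_no, round_no + duration)
        # Otherwise the PBS stays at the head and is checked again next round.
        round_no += 1
    return schedule


# Grids A and B share stream 0; grid C uses stream 1 (hypothetical workload).
schedule = simulate_channel([("A", 0, 3), ("B", 0, 2), ("C", 1, 2)])
print(schedule)  # {'A': (0, 3), 'B': (3, 5), 'C': (4, 6)}
```

In this run, B starts only after A has finished because both use stream 0, while C overlaps with B because it uses a different stream.
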
When an LPBS is routed to a GTE, its PBCs
are consumed by the GTE in sequence, one PBC
at a time. When a TPBS is routed to a DTE, its
PBCs are consumed by the DTE in sequence, one
PBC at a time. Each PBC contains a portion of the
scheduling information for a kernel grid, with Table
1 listing exemplary PBCs along with their function
descriptions. The scheduling of each kernel block is
in fact a consumption of its corresponding PBC. More
specifically, each time a “launch” PBC is processed,
one kernel block is assigned to an available SMX for
execution.
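
The in-order consumption of PBCs can be sketched as a simple command interpreter. The Python model below is a hypothetical illustration that handles only three of the command names from Table 1; the function name `consume_lpbs`, the argument layout of each command, and the round-robin choice of SMX are assumptions made for the example and are not taken from the hardware.

```python
def consume_lpbs(pbcs, smx_count):
    """Consume the PBCs of one LPBS in sequence, one PBC at a time.

    pbcs: list of (command_name, argument) tuples (argument format is assumed).
    Returns (parameters, assignments), where assignments holds one
    (block_index, smx_id) pair per consumed Launch command.
    """
    parameters = []
    assignments = []
    next_block = 0
    for command, argument in pbcs:
        if command == "SetParameterSize":
            # Reserve slots for the kernel parameters.
            parameters = [None] * argument
        elif command == "Parameter":
            index, value = argument
            parameters[index] = value
        elif command == "Launch":
            # Each Launch PBC schedules exactly one kernel block.
            smx_id = next_block % smx_count   # placeholder for "an available SMX"
            assignments.append((next_block, smx_id))
            next_block += 1
        else:
            raise ValueError(f"unmodelled PBC: {command}")
    return parameters, assignments


pbcs = [
    ("SetParameterSize", 2),
    ("Parameter", (0, 1024)),
    ("Parameter", (1, 0.5)),
    ("Launch", None),
    ("Launch", None),
    ("Launch", None),
]
params, assigned = consume_lpbs(pbcs, smx_count=2)
print(params)    # [1024, 0.5]
print(assigned)  # [(0, 0), (1, 1), (2, 0)]
```
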

Table 1 Push buffer commands.

Command name       Description
DefineSemaphore    Define a semaphore at a given memory location.
AcquireSemaphore   Wait for a previous kernel to release the semaphore.
SetLaunchID        Associate the blocks of the current kernel with a reference counter.
SetRefCntValue     Set a reference counter used by ResetRefCnt and WaitRefCnt.
ResetRefCnt        Wait for blocks of the previous kernel to complete.
WaitRefCnt         Wait for blocks of the current kernel to complete.
Launch             Launch a block in the current kernel for execution.
SetParameterSize   Allocate space for blocks to accept kernel parameters.
Parameter          Transfer the kernel parameters to blocks when launched.

5.4 Execution mechanism

From the perspective of the hardware device, a kernel
grid is just a logical task concept whose execution is
achieved by the execution of all its kernel blocks. In the
GPGPU layer, a kernel grid is broken into kernel blocks,
with each kernel block treated as an atomic task unit.
The execution of each kernel block passes through
four pipelined stages.
(1) Allocate stage. When an LPBS reaches the GTE,
the GMU acquires the scheduling information for
a kernel grid by consuming several PBCs at the
beginning of the arriving LPBS. The scheduling
information, denoted as state information[23],
contains the starting address and the size of the
associated kernel code in the VMEM/host memory,
the thread count per block, the block count per grid,
the resource requirements for each kernel block,
etc. The GMU allocates a code space of the same
size as the kernel code in the code buffer, loads
the associated kernel code from the VMEM/host
memory into the allocated code space, and records
the starting address of the code space. The GMU
also allocates an unoccupied reference counter to
the kernel grid for tracking its execution state. The
GMU then signals the CWD to add the kernel grid
to the working list.
(2) Dispatch stage. The CWD has a working list
recording kernel grids that are ready for dispatch;
it can also conditionally hold, suspend, or resume one
or more of these kernel grids as
directed by the GMU. The CWD periodically sends
inquiries to the SMXs, searching for available ones
that meet the resource requirements for executing
active kernel grids in the working list. Each time
an SMX replies with an SMX ID, the CWD consumes
one “launch” PBC to dispatch
one kernel block, signals the thread controller of the
target SMX to download the state information of the
kernel block into state registers, and hands over the
execution handle of the dispatched kernel block to
the target SMX. Upon the dispatch of each kernel
block, the CWD increases the associated launch
counter (contained in the reference counter) by one.
(3) Execute stage. The dispatched kernel blocks
execute on the SMXs asynchronously. If a kernel
block dynamically invokes sub kernel grids, the
associated SMX sends the newly generated
workload back to the GMU, where it is prioritized
and sent to the CWD. If an active kernel needs to
pause because it has invoked a sub kernel, the GMU
interacts with the CWD to hold the kernel inactive
until the dependent workload is completed.
Each time an SMX finishes one kernel block, it
releases the resources occupied by the
kernel block and signals the CWD to increase the
associated complete counter (also contained in the
reference counter) by one. When the complete
counter equals the block count of the kernel grid, the
execution of the entire kernel grid is finished.
(4) Clear stage. The CWD releases the reference
counter assigned to the kernel grid that has finished
execution, deletes the associated kernel code from
the code buffer, and frees the corresponding code
space.
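
The role of the reference counter across the four stages can be summarized in a short model. The Python sketch below is a hypothetical illustration: the class names `ReferenceCounter` and `KernelGridTracker`, their methods, and the error checks are assumptions introduced for the example, and the code buffer and SMX resource checks are omitted.

```python
from dataclasses import dataclass


@dataclass
class ReferenceCounter:
    launch_count: int = 0     # blocks dispatched so far (launch counter)
    complete_count: int = 0   # blocks finished so far (complete counter)


class KernelGridTracker:
    """Tracks one kernel grid from the allocate stage to the clear stage."""

    def __init__(self, block_count):
        # Allocate stage: a fresh reference counter is assigned to the grid.
        self.block_count = block_count
        self.counter = ReferenceCounter()
        self.cleared = False

    def dispatch_block(self):
        # Dispatch stage: one "launch" PBC consumed, launch counter plus one.
        if self.counter.launch_count >= self.block_count:
            raise RuntimeError("all blocks already dispatched")
        self.counter.launch_count += 1

    def complete_block(self):
        # Execute stage: an SMX reports one finished block.
        if self.counter.complete_count >= self.counter.launch_count:
            raise RuntimeError("no dispatched block is outstanding")
        self.counter.complete_count += 1

    def is_finished(self):
        # The grid is finished when the complete counter equals the block count.
        return self.counter.complete_count == self.block_count

    def clear(self):
        # Clear stage: only a finished grid may release its resources.
        if not self.is_finished():
            raise RuntimeError("grid has unfinished blocks")
        self.cleared = True


grid = KernelGridTracker(block_count=3)
grid.dispatch_block()
grid.dispatch_block()
grid.complete_block()   # a block may finish before all blocks are dispatched
grid.dispatch_block()
grid.complete_block()
grid.complete_block()
print(grid.is_finished())  # prints True
grid.clear()
```

The interleaving of dispatch and completion in the usage lines reflects the pipelined nature of the stages: blocks of the same grid can be in different stages at the same time.
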

6 Final Remarks

The purpose of this paper is to describe the operating
mechanism of the GPGPU cloud, so that its potential
can be further explored and task performance can
be further improved. The whole cloud system can be
divided into three layers, with this analysis focusing
on the task features, hardware features, scheduling
mechanism, and execution mechanism of each layer.
Four key features are essential to understanding current and
future GPGPU computing paradigms.
• Virtualized. The virtualization management server and
the hypervisors cooperate to realize the mapping from
virtual resources to physical resources. In the cloud
system environment, VMs are mapped to cloud
servers, VNs are mapped to the interconnection network,
and VHDs are mapped to the network storage system.
In the VM environment, VCPUs are mapped to
one or more CPUs, VGPUs are mapped to one or
more Kepler GPUs, and VMEMs are mapped to host
memory.
• Hierarchical. The GPGPU task is arranged in
a hierarchical manner, and the hardware structure of
the GPGPU is also built in a hierarchical manner.
The scheduling mechanism of GPGPU computing
can be considered a mapping from the task
organization to the hardware structure. These
hierarchical organizations achieve a balance between
resource capacity and task flexibility.
• Scalable. A task with a fixed computing complexity
can be arranged as a group of VM jobs and executed
on many VMs, or arranged as one VM job and
executed on one VM with multiple VGPUs, or
arranged as one VM job and executed on one VM
with only one VGPU. These arrangements differ in
that less time is needed when more
physical resources are involved; the task itself is
scalable, with its arrangement decided by the available
physical resources.
• Diverse. All the tasks with the same granularity can
be executed in parallel, while each task execution is
pipelined. All the tasks with the same granularity can
be executed out-of-order, while each task is executed
in-order. Diverse execution patterns are seamlessly fused
in GPGPU computing in pursuit of maximum
execution performance.
Acknowledgements
This work was funded by the European Framework
Programme (FP7) (No. FP7-PEOPLE-2011-IRSES),
the National Natural Science Foundation of China
(Nos. 61073009 and 60873235), the Science-Technology
Development Key Project of Jilin Province of China
(No. 20080318), and the National High-Tech Research
and Development Program (863) of China (No.
2011AA010101).


References
[1] J. Nickolls and W. J. Dally, The GPU computing era, IEEE
Micro., vol. 30, no. 2, pp. 56-69, March-April 2010.
[2] M. Leinweber, L. Baumgartner, M. Mernberger, T.
Fober, E. Hullermeier, G. Klebe, and B. Freisleben,
GPU-based cloud computing for comparing the
structure of protein binding sites, in Proc. of the 6th
IEEE International Conference on Digital Ecosystems
Technologies (DEST’12), Campione d’Italia, Italy, 18-20
June, 2012, pp. 1-6.
[3] T. Nishiyama, S. Yamagiwa, and T. Hisamitsu, Prototyping
GPU-based cloud system for IODP core image database,
in Proc. of the 2011 Second International Conference on
Networking and Computing (ICNC ’11), Osaka, Japan,
November 30-December 02, 2011, pp. 327-331.
[4] K. Wang and Z. Shen, Artificial societies and GPU-based cloud computing for intelligent transportation
management, IEEE Intelligent Systems, vol. 26, no. 4, pp.
22-28, July 2011.
[5] G. Giunta, R. Montella, G. Laccetti, F. Isaila, and J.
G. Blas, A GPU accelerated high performance cloud
computing infrastructure for grid computing based virtual
environmental laboratory, in Advances in Grid Computing,
Z Constantinescu, Ed. InTech, Feb. 28, 2011, Chapter 7,
pp. 121-146.
[6] J. Duato, A. J. Pena, F. Silla, J. C. Fernandez, R. Mayo,
and E. S. Quintana-Orti, Enabling CUDA acceleration
within virtual machines using rCUDA, in Proc. of the 18th
International Conference on High Performance Computing
(HiPC 2011), Bengaluru, India, December 18-21, 2011,
pp. 1-10.
[7] L. Shi, H. Chen, J. Sun, and K. Li, vCUDA:
GPU-accelerated high-performance computing in virtual
machines, IEEE Transactions on Computers, vol. 61, no.
6, pp. 804-816, June 2012.
[8] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche,
N. Tolia, V. Talwar, and P. Ranganathan, GViM:
GPU-accelerated virtual machines, in Proc. of the
3rd ACM Workshop on System-Level Virtualization for
High Performance Computing (HPCVirt 09), Nuremberg,
Germany, March 31, 2009, pp. 17-24.
[9] C. Yang, C. Huang, and C. Lin, Hybrid CUDA, OpenMP,
and MPI parallel programming on multicore GPU clusters,
Computer Physics Communications, vol. 182, no. 1, pp.
266-269, January 2011.
[10] C. Yang, C. Huang, C. Lin, and T. Chang, Hybrid
parallel programming on GPU clusters, in Proc. of the
IEEE International Symposium on Parallel and Distributed
Processing with Applications (ISPA 2010), Taipei, China,
September 6-9, 2010, pp. 142-147.
[11] NVIDIA Corporation, Whitepaper: NVIDIA’s next
generation CUDA compute architecture: Kepler GK110,
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.
[12] K. Fatahalian and M. Houston, A closer look at GPUs,
Communications of the ACM, vol. 51, no. 10, pp. 50-57,
October 2008.
[13] D. Luebke and G. Humphreys, How GPUs work,
Computer, vol. 40, no. 2, pp. 96-100, February 2007.
[14] V. V. Kindratenko, J. J. Enos, G. Shi, M. T. Showerman,
G. W. Arnold, J. E. Stone, J. C. Phillips, and W. Hwu,
GPU clusters for high-performance computing, in Proc.
of the 2009 IEEE International Conference on Cluster
Computing (CLUSTER 2009), New Orleans, Louisiana,
USA, August 31-September 4, 2009, pp. 1-8.
[15] Mellanox Technologies Inc., Introduction to InfiniBand,
http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf, 2012.
[16] Message Passing Interface Project, http://www.mcs.anl.gov/research/projects/mpi/, 2012.
[17] NVIDIA Corporation, NVIDIA CUDA C Programming
Guide, Version 5.0, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, 2012.
[18] NVIDIA Corporation, The CUDA compiler driver
NVCC, Version 5.0, http://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf, 2012.
[19] F. R. Diard, Programming multiple chips from a
command buffer (Assignee NVIDIA Corp.), US Patent
US7528836B2, May 5, 2009.
[20] NVIDIA Corporation, Whitepaper: NVIDIA’s next
generation CUDA compute architecture: Fermi, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2012.
[21] J. F. Duluk Jr., S. D. Lew, and J. R. Nickolls, Counter-based delay of dependent thread group execution (Assignee
NVIDIA Corp.), US Patent US7526634B1, April 28, 2009.
[22] P. C. Mills, J. E. Lindholm, B. W. Coon, G. M.
Tarolli, and J. M. Burgess, Scheduler in multi-threaded
processor prioritizing instructions passing qualification
rule (Assignee NVIDIA Corp.), US Patent US7949855B1,
May 24, 2011.
[23] J. F. Duluk Jr., Predicated launching of compute
thread arrays (Assignee NVIDIA Corp.), US Patent
US7697007B1, April 13, 2010.

Liang Hu received his MEng and PhD
degrees in Computer Science from Jilin
University in 1993 and 1999. Currently,
he is a professor and doctoral supervisor
at the College of Computer Science and
Technology, Jilin University, China. His
research areas are network security and
distributed computing, including related
theories, models, and algorithms of PKI/IBE, IDS/IPS, and Grid
Computing. He is a member of the China Computer Federation.

Xilong Che received his MEng and PhD
degrees in Computer Science from Jilin
University in 2006 and 2009. Currently,
he is a lecturer at the College of Computer
Science and Technology, Jilin University,
China. His current research areas are
parallel and distributed computing,
machine learning, and related applications.
He is a member of the IEEE.

Zhenzhen Xie is a master's degree
candidate at the College of Computer Science
and Technology, Jilin University, China.
Her current research areas are digital
forensic technology for cloud computing
environments and behavior modeling for
digital forensics.

