The case for reconfigurable general purpose GPU computing by Dhar, Ashutosh
© 2014 Ashutosh Dhar
THE CASE FOR RECONFIGURABLE GENERAL PURPOSE GPU
COMPUTING
BY
ASHUTOSH DHAR
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Adviser:
Associate Professor Deming Chen
ABSTRACT
General purpose graphics processing unit (GPU) computing (GPGPU) has
emerged as a new paradigm for programmers to exploit massive amounts
of parallelism for relatively low costs. The abundance of GPUs in desktop
and mobile computing platforms makes them ideal for accelerating tasks on
multiple devices. However, despite the massively parallel architecture, GPUs
are limited by the applications that run on them. The generic architecture of
a GPU allows it to accelerate a large variety of applications, but none of these
applications are capable of exploiting the complete performance capabilities
of the GPU. This underutilization results in wastage of resources and power.
In this work, we propose to introduce reconfiguration to the GPU
architecture in the hope of tuning it to maximize
performance for a given application or of redistributing resources so as to reduce
power consumption.
To my parents and my sister for their love and support.
To my friends for their companionship and support.
To my teachers for their guidance and wisdom.
ACKNOWLEDGMENTS
There are a lot of people I would like to acknowledge. This thesis would not
have been possible without their unconditional support. Each of them has
my deepest respect and gratitude.
First and foremost, I would like to thank my adviser, Dr. Deming Chen,
for his continued guidance and support. Prof. Chen has been a source of
inspiration and motivation for the last two years. His creativity and experience
have provided me with invaluable insight on multiple occasions. Without his
support, this thesis would not have been possible.
My family has been an integral part of my support structure. Without
their love and encouragement, I would not have made it this far.
My friends, old and new, have constantly provided me with encouragement
and have marched along with me as I faced my challenges.
Last but definitely not least, I would like to thank all the members
of my research group for providing me with guidance and counsel. Their
support helped me pave the road.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 GPU Computing
2.2 CUDA and GPU Architecture
2.3 Reconfigurable Computing
CHAPTER 3 RECONFIGURABLE GPU COMPUTING
3.1 Understanding the Inefficiencies
3.2 Reconfiguration as the Solution
3.3 Modifying the GPU Datapath
CHAPTER 4 CONCLUSION
REFERENCES
LIST OF ABBREVIATIONS
ALU Arithmetic and logic unit
API Application programming interface
ASIC Application specific integrated circuit
CAD Computer aided design
CGRA Coarse grain reconfigurable architecture
CMOS Complementary metal-oxide semiconductor
CMP Chip multiprocessor
CNT Carbon nano-tube
CTA Cooperative thread array
CUDA Compute unified device architecture
DSP Digital signal processor
EPIC Explicitly parallel instruction computing
FPGA Field programmable gate array
GPU Graphics processing unit
GPGPU General purpose GPU computing
ILP Instruction level parallelism
OpenCL Open computing language
SIMD Single instruction multiple data
SIMT Single instruction multiple thread
SM Streaming multiprocessor
SOC System on chip
SP Streaming processor
SPMD Single program multiple data
SMT Simultaneous multi-threading
TFET Tunnel field effect transistor
TLP Thread level parallelism
VLIW Very long instruction word
CHAPTER 1
INTRODUCTION
The advent of the nanoscale complementary metal-oxide semiconductor era
has presented new challenges to the design and development of processors,
ranging from process variation and reduced reliability to increased difficulty
in fabrication, verification, and testing [1]. More importantly, the increased
transistor density no longer guarantees improved performance because of lack
of memory bandwidth and increased difficulty in extracting and exploiting
instruction level parallelism (ILP). However, the most profound effect has
been the advent of the power wall, making it prohibitively expensive to cool
processors; the high power densities result in higher operating temperatures
that degrade performance and reliability. Combined with the mobile revolution,
the need for reining in power consumption has become a primary
concern for architects.
In the face of these challenges, multi-core processors emerged as a temporary
solution, sacrificing area for improved performance while trying to
maintain the proportional performance scaling promised by Moore’s law.
However, one of the major challenges since their advent has been the lack
of applications with sufficient parallelism to exploit multi-core systems, as
well as the lack of skilled programmers to program such parallel systems.
Amidst this, the use of graphics processing units (GPUs) for
general purpose computing (GPGPU) was introduced by Nvidia and AMD,
each presenting its own take on GPGPU that conformed to the underlying
architecture of the GPU. GPGPU presented a new paradigm that allowed
programmers to exploit the massively parallel architecture of GPUs while
maintaining a relatively simpler programming model than that of traditional
multiprocessor systems. GPUs offer performance on the order of teraflops
(trillions of floating point operations per second) and are coupled with
memories that have large bandwidths, and thus have made their way into the
world of high-performance computing. At present, GPUs are coupled with general purpose
processors or central processing units (CPUs) as accelerators, which allows
CPUs to offload heavy throughput-oriented computation onto the GPUs.
However, future high performance GPUs are expected to be integrated on
the same dies as their CPU counterparts [2], much like some of the solutions
proposed by AMD’s Fusion architecture and Nvidia’s Tegra architecture.
Having said that, GPUs still face issues similar to those faced by their CPU
counterparts: diminished voltage scaling and the increasing gap between memory
performance and computational performance.
Despite the hardware challenges faced, GPGPU has been experiencing
a growing amount of interest. Each new generation of GPUs provides a
massive improvement in computational performance and memory bandwidth
along with additional architectural features designed to improve the parallel
programming experience. However, there is still one fundamental challenge:
applications. GPUs have massively parallel architectures that allow them
to execute several types of applications, but the architecture is not tuned to
suit the needs of each application. This one-size-fits-all policy allows for a
wide range of executable applications, but limits the performance of certain
applications, which in turn wastes resources on the GPU. More critically, it
wastes power and energy.
As CMOS (complementary metal-oxide semiconductor) scaling comes close
to an end due to the physical constraints of silicon at sub-10 nm nodes, the
question that must be asked is: what’s next? Several technologies have emerged
in response to this question, including spin devices, TFETs (tunnel field effect
transistors), carbon nano-tubes (CNTs), and graphene. CNTs and graphene
have gained a lot of attention in recent times, but they have yet to prove
themselves as viable replacements for CMOS, and several challenges must be overcome
before we can successfully commercialize these technologies. Irrespective of
what we choose as the successor to CMOS technology, it will be some
time before CMOS is completely replaced. Over forty years of silicon-based design
experience, architectures, and CAD (computer aided design) tools will have
to be ported over to the new technology, an effort that is not trivial. More
importantly, after investing billions of dollars in CMOS-based foundries and
technology, manufacturers are likely to be unwilling to move away
from silicon unless presented with a strong case to do so.
Keeping this scenario in mind, perhaps it is time to reconsider the way we
utilize silicon area. As the cost-per-unit will continue to fall, irrespective of
the technology node, perhaps it is time to embrace dark silicon [3].
In this work we present reconfigurable computing as an inspiration to
change the way we think about and utilize silicon. Compared to fixed-function
units, reconfigurable computing paradigms can offer high parallelism and
power efficiency on demand to suit an application while promising the
architect efficient utilization of area. This work aims to make a case for
introducing the reconfigurable computing paradigm to GPGPU in the hope of
improving power and energy efficiency. We focus our reconfiguration on the
datapath so as to minimize the area overheads and to exploit the relatively
simple control mechanisms of the GPU.
CHAPTER 2
BACKGROUND
2.1 GPU Computing
General purpose GPU computing (GPGPU) is predicated on the idea that
certain application workloads are better suited to high throughput
architectures: they operate on large amounts of data and thus demonstrate a large
amount of data-level parallelism, and they have a limited number of control
dependencies. Such applications, while perfectly capable of running on modern
CPUs, have demonstrated significant speedup when run on GPUs because of
the massively parallel architecture of GPUs and their large memory
bandwidth. As an example, consider vector addition.
Figure 2.1: Data parallelism in vector addition.
In a conventional CPU, each element of the array is operated on individually,
iterating through the entire array and calculating the sum of each pair of
elements. However, the data elements have no dependence on one another,
and the resultant array is made of data elements that are completely
independent. Thus, it is possible to compute the sum of each pair of elements
simultaneously, as illustrated in Figure 2.1. This data parallelism allows for
a much faster implementation of vector addition, and the massively parallel
architecture of a GPU is very well suited to exploit this data parallelism.
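The independence of the element-wise sums can be illustrated with a minimal Python sketch. It is not GPU code; the function names are hypothetical and chosen only for illustration. The second version applies the per-element work in an arbitrary order to show that the order of evaluation does not affect the result, which is the property a parallel machine exploits.

```python
def vector_add_sequential(a, b):
    """CPU-style loop: visit one pair of elements at a time, in order."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def add_one_element(i, a, b, out):
    """Work for a single element; it reads and writes only index i."""
    out[i] = a[i] + b[i]


def vector_add_data_parallel(a, b, order=None):
    """Apply the per-element work in an arbitrary order.

    Because no element depends on another, any order (or all elements at
    once on parallel hardware) yields the same result.
    """
    out = [0] * len(a)
    indices = list(order) if order is not None else range(len(a))
    for i in indices:
        add_one_element(i, a, b, out)
    return out


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(vector_add_sequential(a, b))
print(vector_add_data_parallel(a, b, order=[3, 1, 0, 2]))
```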
However, the number of applications that can be mapped to run efficiently
on a GPU is limited, and often only a small segment of the code
can be sufficiently parallelized. Thus, GPUs are used in conjunction with
CPUs as accelerators to speed up either segments of code or entire
applications, allowing the CPU to focus on handling sequential tasks. This has
led to the development of a hierarchy in GPGPU, where the CPU is referred
to as the host and the GPU as the device, in a server-client fashion where
parallel tasks are launched from the CPU and offloaded to the GPU, where
they execute and return to the CPU. While the GPU is completing
its assigned task, the CPU is free to continue execution of other sequential
tasks. This host-device setup has been the standard model since the advent
of GPGPU and adds the constraint that the GPU itself cannot generate its
own workload and is completely dependent on the host. This problem has
been addressed to an extent with the advent of dynamic parallelism [4].
Unlike that of CPUs, the architecture of modern GPUs differs radically across
vendors. Intel, Nvidia, and AMD, the three largest market share holders,
have very different GPU architectures, with only Nvidia and AMD offering
GPGPU support. As a result of varying architectures, the programming
models supported by the GPUs are not uniform. Irrespective of the
programming model, the idea of launching massively parallel workloads and
exploiting data-parallelism remains consistent; the difference lies in
the manner in which memory, synchronization, and workload scheduling are
handled. The two prominent frameworks for GPGPU are CUDA (compute
unified device architecture) and OpenCL (open computing language). CUDA
is Nvidia’s proprietary programming model and architecture for GPGPU,
while OpenCL is a framework developed by the Khronos Group and is a
cross-platform application programming interface (API) that targets
heterogeneous programming across a variety of platforms including GPUs, CPUs,
digital signal processors (DSPs), and field programmable gate arrays (FPGAs)
[5], [6]. OpenCL is supported by a variety of vendors, is capable of
running on both AMD and Nvidia GPUs, and is the preferred API for GPGPU
on AMD GPUs. OpenCL implements a parallel programming model very
similar to that of CUDA; for the purpose of this work, CUDA will be
the primary focus.
The emergence of GPGPU has opened up a number of
research opportunities. The majority of work has focused on the
acceleration of applications by mapping them to GPUs. The range of applications
includes scientific computing, bioinformatics, data mining, computer vision,
and linear algebra. In order to improve the efficiency of applications that
have been mapped to GPUs, a lot of work has focused on demystifying the
back-end execution and operation. The work in [7] presented a tool to understand
memory access patterns of applications, while [8], [9], [10] provided insights into
the micro-architecture of GPUs. In a similar spirit, compiler optimization
efforts [11], [12] have focused on improving
performance by modeling resource utilization and allocation.
2.2 CUDA and GPU Architecture
In order to understand the source of underutilization of resources, it is
important to understand the architecture and its programming model. In this
section the discussion is limited to Nvidia’s CUDA architecture. While CUDA,
running on an Nvidia GPU, and equivalent implementations in OpenCL,
running on AMD hardware, follow an almost identical programming model and
approach, the difference in the underlying architectures has an impact on
performance [13].
2.2.1 CUDA
CUDA is Nvidia’s proprietary architecture and programming model for GPGPU.
It allows programmers to develop programs in a C/C++ styled language,
while treating the GPU as a general processor. The CUDA execution model
is heterogeneous in nature, allowing serial parts of the code to run on the
host (CPU), while the parallel portions of the code are completed on the
device (GPU), where they execute in an SPMD (single program multiple data)
fashion. CUDA divides the total work space into a grid and assigns a thread
to each portion of data. Every thread is a copy of every other, i.e.,
all threads execute the same set of instructions. Thus, parallel implementations in
CUDA can be considered heavily multi-threaded and ideal for
parallel architectures. The segment of code that runs on the GPU is referred
to as the kernel [14]. This kernel is the blueprint for each thread and
defines the behavior of each thread, as well as how they cooperate. Figure 2.2
illustrates the execution model and flow.
Figure 2.2: CUDA execution model.
CUDA defines the term cooperative thread array (CTA) as a group of
threads that execute a kernel concurrently. CUDA launches, schedules,
and executes work on the GPU in the form of CTAs. The total work grid
is divided into blocks, or CTAs, and threads within a CTA execute
concurrently and can communicate with each other. Also, synchronization
constructs are supported on a per-CTA basis, i.e., the programmer can force
synchronization via barriers and atomics among threads within the same block
only. Thus, in essence, each block can be treated as an independent entity,
while threads within the CTA or block are not isolated from each other.
This model, along with basic barrier synchronization, makes CUDA exhibit
a relaxed consistency model, i.e., it is not sequentially consistent.
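The division of the work grid into blocks and threads can be sketched in plain Python. This is an emulation under stated assumptions rather than CUDA code: the helper names (blocks_needed, global_thread_index, launch) are hypothetical, the grid is one-dimensional, and threads run one after another here, whereas a GPU runs them concurrently. The index arithmetic mirrors the usual CUDA idiom of block index times block size plus thread index.

```python
def blocks_needed(n_elements, threads_per_block):
    """Number of CTAs (blocks) needed to give every element a thread."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_thread_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def launch(kernel, n_blocks, block_dim, *args):
    """Emulate a kernel launch: every thread of every block runs `kernel`."""
    for block_idx in range(n_blocks):          # blocks are independent
        for thread_idx in range(block_dim):    # threads within one CTA
            kernel(block_idx, block_dim, thread_idx, *args)


def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Per-thread body: each thread handles exactly one element."""
    i = global_thread_index(block_idx, block_dim, thread_idx)
    if i < len(out):  # guard: the last block may have surplus threads
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
block_dim = 4
launch(vec_add_kernel, blocks_needed(n, block_dim), block_dim, a, b, out)
print(out)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do no work, which is why the bounds check is needed.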
2.2.2 Architecture
Unlike for CPUs, the amount of information available about the architecture and
micro-architecture of GPUs is relatively limited. To a large degree, this
is due to the constant evolution of GPU architecture as well as the fierce
competition between vendors. In this regard, the work done in [8], [9], [10] is
of great use, as are the technical briefs [15], [16] provided by the
manufacturers. While Nvidia has released several GPU architectures in the last
few years, including the G80, Fermi, Kepler, and the latest, Maxwell, we
shall limit our discussion to a Fermi-styled GPU.
Figure 2.3: Overview of a GPU architecture.
A high-level overview of the architecture of an Nvidia GPU is shown
in Figure 2.3. A GPU consists of several streaming multiprocessors or SMs,
along with a unified L2 cache that is coupled with memory controllers via
an on-chip interconnect. The global memory of the system is off-chip, as in
a traditional CPU. Each SM is analogous to a multi-threaded CPU, with its
fetch, decode, execution, and memory/writeback logic forming the pipeline.
When the host launches a kernel on the GPU, each CTA in the grid is
allocated to an SM. Once allocated to an SM, the CTAs reside, execute and
complete on the SM. The GPU can then be thought of as a shared-memory
multiprocessor system, where each processor is multi-threaded, and all the
processors execute the same kernel at a time.
Figure 2.4: The streaming multiprocessor.
Figure 2.4 illustrates a high-level overview of an SM. The SM consists of
several streaming processor (SP) cores, a large register file, and on-chip
memory. The SM’s on-chip memory consists of caches, shared memory, texture
caches, and memory control and dispatch logic. Each SP is a large, heavily
pipelined execution unit capable of executing both integer and floating
point operations. Unlike the ALU (arithmetic and logic unit) of a CPU, it is
not optimized for low latency operation; rather, it is designed to have a deep
pipeline with optimal power utilization. Each thread executing on the SM is
assigned an SP, its own instruction address, and register state. Thus, the total
number of concurrently executing threads is limited by the number of SPs.
In the case of Fermi, this is 32: at any given instant, up to 32 threads can be
executing on the SM, subject to available register and memory resources. A
block assigned to an SM can consist of a large number of threads, up to 1024
in Fermi, and multiple blocks can be assigned to an SM, up to 8 in Fermi; thus a
very large number of threads can be resident on an SM at once. However,
since the pipeline supports the execution of only 32 threads at a time,
the remaining threads need to be scheduled in and out of the core. The SM
manages the threads in groups called warps. These warps are created,
managed, and scheduled by the SM alone [10], [14]. The large number of resident
threads is key to the operation of the SM. By having a large number of warps
in flight, the SM is able to mask the latency of memory accesses by swapping
warps in and out of the pipeline, overlapping the execution of threads with
memory accesses. The scheduling of warps is done with zero overhead. Nvidia
refers to this architecture as SIMT (single instruction multiple thread), and
while it appears to be SIMD, it differs in one key aspect: SIMD exposes the
width of the hardware to the software, which is not the case here. Rather,
programmers write code on a per-thread level, and each thread executes on
an SP.
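The effect of having many warps in flight can be illustrated with a toy model in Python. This is a rough sketch under strong assumptions and not a description of the actual scheduler: every warp is assumed to alternate between a fixed number of issue cycles and a fixed memory wait, warp switching is assumed to be free, and the latency figures used below are illustrative values rather than measurements. The warp size of 32 matches the 32-thread pipeline width described above.

```python
import math

WARP_SIZE = 32  # threads issued together on a Fermi-style SM


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Warps needed to cover one block (a partial warp still occupies a slot)."""
    return math.ceil(threads_per_block / warp_size)


def pipeline_utilization(resident_warps, compute_cycles, memory_latency):
    """Toy estimate of the fraction of cycles in which some warp can issue.

    Assumes every warp repeats the same pattern: issue for `compute_cycles`,
    then wait `memory_latency` cycles, with zero-cost switching between warps.
    """
    if resident_warps <= 0:
        return 0.0
    busy = resident_warps * compute_cycles
    period = compute_cycles + memory_latency
    return min(1.0, busy / period)


print(warps_per_block(1024))               # warps in a maximum-size block
print(pipeline_utilization(1, 4, 396))     # a single warp leaves the pipeline idle
print(pipeline_utilization(48, 4, 396))    # more warps hide more of the latency
print(pipeline_utilization(100, 4, 396))   # enough warps saturate the pipeline
```

In this model, utilization grows linearly with the number of resident warps until the memory latency is fully covered, which is the intuition behind keeping many threads resident on each SM.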
The heavily multi-threaded nature of execution mandates that the SM core
have a very large register file. However, a large multi-ported register file is
not very feasible. Thus, Nvidia uses several banks of single-ported register
files. The register files are coupled with specialized units called operand
collectors, whose purpose is to simulate a multi-ported register file by actively
buffering operands in advance with the assistance of multiple RAM banks
and arbitration logic.
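The reason single-ported banks need operand buffering can be seen in a small Python sketch. The bank mapping (register number modulo the number of banks) and the bank count of four are assumptions made purely for illustration; the actual mapping used by the hardware is not specified here. The sketch only captures the constraint that a single-ported bank can serve one read per cycle.

```python
from collections import Counter


def bank_of(register, num_banks):
    """Hypothetical mapping: registers are interleaved across banks."""
    return register % num_banks


def operand_read_cycles(source_registers, num_banks=4):
    """Cycles needed to read all operands from single-ported banks.

    Each bank serves one read per cycle, so the busiest bank sets the time.
    A register named twice is read only once.
    """
    if not source_registers:
        return 0
    per_bank = Counter(bank_of(r, num_banks) for r in set(source_registers))
    return max(per_bank.values())


print(operand_read_cycles([0, 1, 2]))  # three different banks
print(operand_read_cycles([0, 4, 8]))  # all three operands in bank 0
print(operand_read_cycles([1, 5, 2]))  # two operands share bank 1
```

When operands collide in one bank, the reads are spread over several cycles; buffering operands for several instructions at once is what lets the pipeline keep issuing in the meantime.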
The SM provides a separate pipeline for memory operations, allowing
memory accesses to be dispatched in tandem with ALU executions. These
operations are performed by dedicated load-store units that interact with the local
memory of the SM as well as global memory. Each SM is provided with three
types of caches. The texture cache and constant cache are read-only caches
that provide fast access to the texture memory and constant memory on the
GPU. Apart from these, the SM has a set of configurable memory banks that
are split between L1 cache and shared memory. The size
of the L1 cache and shared memory is decided by the programmer from two
options: 48 KB or 16 KB. The memory resources within the SM are
available only to threads resident on that SM. The global memory and unified
L2 cache are available to all threads executing on all SMs.
2.3 Reconfigurable Computing
Reconfigurable computing is almost synonymous with FPGAs (field
programmable gate arrays). FPGAs provide logic elements on a reconfigurable
fabric in a way that allows designers to connect the logic elements to form a digital
circuit, and the fabric can be reprogrammed several times. Thus the same hardware is
capable of implementing a variety of circuits. While these implementations are
not as efficient as a custom ASIC (application specific integrated circuit),
they involve less design effort and are much faster and more energy-efficient
than software implementations of the same application
run on a general purpose processor. However, the fine-grained customization
offered by FPGAs is not well suited for general computing and is not easy to
program. In this section, two newer paradigms of reconfigurable computing
are discussed.
2.3.1 Coarse Grain Reconfigurable Architectures
Coarse grain reconfigurable architectures (CGRAs) offer a smaller design
space with an easier programming model than FPGAs. CGRAs consist of
an array of processing elements (PEs) coupled together on a mesh fabric that
can be configured to form custom datapaths. These designs are especially
effective for application domains that involve a large amount of computation
such as DSP and multimedia [17]. At first glance, such architectures have a
great deal of similarity to GPU SMs, with a large number of PEs and limited
control logic.
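The idea of configuring an array of PEs into a custom datapath can be sketched in Python. This is a deliberately simplified model, not a description of any particular CGRA: the configuration format, the PE names, and the set of operations are hypothetical. Each configuration entry selects the operation a PE performs and names the sources its inputs are routed from.

```python
import operator

# Operations a processing element (PE) can be configured to perform.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_datapath(config, inputs):
    """Evaluate a configured datapath.

    `config` is a list of (pe_name, op, src_a, src_b) in dataflow order;
    each source names either an external input or an earlier PE's output.
    """
    values = dict(inputs)
    for pe_name, op, src_a, src_b in config:
        values[pe_name] = OPS[op](values[src_a], values[src_b])
    return values


# Configuration 1: (a + b) * c
add_mul = [("pe0", "add", "a", "b"), ("pe1", "mul", "pe0", "c")]
# Configuration 2, same two PEs reconfigured: (a - b) + c
sub_add = [("pe0", "sub", "a", "b"), ("pe1", "add", "pe0", "c")]

inputs = {"a": 2, "b": 3, "c": 4}
print(run_datapath(add_mul, inputs)["pe1"])
print(run_datapath(sub_add, inputs)["pe1"])
```

The same two PEs compute different functions under the two configurations, which is the sense in which the fabric forms a custom datapath per application.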
The design of the PEs varies by architecture and application. MorphoSys [18]
coupled a simple host processor with an array of PEs, each of which
consisted of an ALU, a multiplier, and a register file. Operations were launched
and executed in a SIMD fashion, the width of which was configurable. On
the other hand, the MIT RAW processor [19] was a tiled architecture, where
each tile was a single-issue RISC processor. The tiles were placed on a 2D
mesh. A more interesting approach was that adopted by RICA [20]. RICA
had a heterogeneous setup with multiple instruction cells placed on an
interconnect. Each cell could serve a different purpose, ranging from branching
to ALU operations to memory access. RICA used dynamic reconfiguration to eliminate the
need for elaborate controllers, and was capable of reconfiguring the datapath
within a single cycle to perform any number of instructions. CGRAs were
never a commercial success and lost momentum in academia due to their
limited application set and need for custom compilers. However, the ADRES
[21], [22] architecture was very well received and was capable of exploiting
ILP [21] and even thread-level parallelism [22].
2.3.2 Reconfiguration and General Purpose Processors
Reconfigurable computing as applied to CMPs (chip multiprocessors) has
been utilized in a variety of fashions. However, the focus has predominantly
been on improving single-thread performance. An important step towards
this was the TRIPS architecture [23]. TRIPS used coarse-grain
reconfiguration to target three levels of parallelism: instruction, data, and thread level.
Starting with four 16-wide cores and memory tiles, TRIPS reconfigures the
micro-architecture to be utilized differently depending on the targeted
parallelism level. Core fusion [24] takes a different approach, trading
one level of parallelism for another. Starting with a CMP
composed of lean cores, Core fusion can merge several
cores by fusing them such that they form a larger and more powerful
wide-issue core. This then is a trade-off between ILP and thread-level parallelism
(TLP).
Another more recent effort is Composite cores [25], where the authors
propose a heterogeneous core that integrates a large out-of-order core with a
smaller in-order core with reduced issue-width. Both cores are tightly
coupled together and allow for fine-grained switching between cores to maximize
energy efficiency with limited performance degradation. The work done in
Illusionist [26] also embraces heterogeneity, but in a different fashion.
Illusionist uses a large aggressive core to generate hints for several smaller cores.
These hints guide the smaller cores and accelerate them, giving the illusion
of several larger cores.
A key drawback to all of the above-mentioned approaches is the need
to modify either the control flow, the schedulers, the ISA, or the
compilers. Combined with the area overheads, exploiting reconfiguration on a large
scale, such as a GPU with thousands of computational units, may seem
undesirable. However, the focus in this work will be to limit the amount
of area, control, and software overheads required to exploit the resources of
the GPU.
CHAPTER 3
RECONFIGURABLE GPU COMPUTING
In this chapter, the idea of reconfigurable GPU computing is introduced.
Previous attempts at reconfigurable computing, as discussed in Chapter 2,
have focused on improving the performance of specific applications or a
specific application set, improving energy efficiency, or improving single-thread
performance. In this work, the focus is broader: overall energy, power, and
performance gains. We shall strive towards these goals while aiming for
limited area overheads and acknowledging that there are power-performance
trade-offs. The results presented in [13] suggest that there are several
applications that have sufficient ILP (instruction level parallelism) for AMD’s
VLIW-based architecture to exploit. This VLIW implementation allows it
to operate at lower frequencies and achieve better energy efficiency. On the
other hand, the work done in [27] demonstrates that GPGPU applications
exhibit unbalanced resource utilization. Thus, there is a need to look at how
the vast resources of the GPU can be utilized on a case-by-case basis.
While the inherent limitations of the GPU architecture do not allow it to
embrace the diversity of applications that run on it, another
factor must be considered: programming limitations. Programming GPUs is
not a trivial task. In most cases, careful planning and design are required on
the programmer’s part in order to get the most out of the GPU. Thus, the
onus is on the programmer to ensure that he/she understands the architecture
and implements a suitable algorithm. The end result can be an algorithm
for an application that is not necessarily the best solution. Embracing code
diversity is another goal that reconfigurable GPU computing aspires to.
3.1 Understanding the Inefficiencies
In order to better understand the diversity of applications and their
performance, we shall now examine a simple GPU. In this section
we will study the performance characteristics obtained from the simulation
of a GPU with six SMs and three memory controllers, based on the Fermi
architecture. This setup provides resources similar to a single Kepler-styled
SMX, which can be found on the Tegra K1 mobile SOC (system on chip).
For the purpose of this study, a wide range of applications has been selected
from the Rodinia [28] and Parboil [29] benchmark sets. Each application
consists of one or more kernels that have unique characteristics and resource
requirements. Thus, from this point onwards, we shall examine each kernel
individually, since each kernel executes and operates independently.
Typically, kernels can be classified as compute-bound, memory-bound, or
problem size-bound [27]. In addition to this, kernel performance can be
limited by the amount of branching, i.e., control divergence, as well as by
interconnect limitations. The performance of a processing core is usually
measured in IPC (instructions per cycle). The throughput of the processing core
can vary from application to application, depending on the type of
instructions issued and how many times the pipeline is forced to stall. Consider
the simulation results shown in Figure 3.1 and Figure 3.2 for two different
benchmark sets.
As we can see from the results, a wide range of performance
is present even within the same application. Consider the application
lower-upper decomposition (LUD) from the Rodinia benchmark set. It consists
of three distinct kernels, each of which demonstrates significantly different
performance metrics. We can see that the kernel lud_diagonal has an IPC value
of less than 1, while an IPC value of over 150 is observed for the lud_internal
kernel.
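To put such IPC values in context, they can be compared against a rough upper bound for the simulated configuration. The Python sketch below assumes that IPC is counted in per-thread instructions across the whole GPU and that each SP can retire at most one instruction per cycle; both are assumptions made for illustration, and the way IPC is counted is not stated here. Under those assumptions, six SMs with 32 SPs each give a peak of 192.

```python
def ipc(instructions, cycles):
    """Instructions per cycle for a kernel run."""
    return instructions / cycles


def peak_ipc(num_sms, sps_per_sm):
    """Assumed upper bound: one instruction per SP per cycle."""
    return num_sms * sps_per_sm


def utilization(measured_ipc, num_sms=6, sps_per_sm=32):
    """Fraction of the assumed peak throughput that a kernel achieves."""
    return measured_ipc / peak_ipc(num_sms, sps_per_sm)


print(peak_ipc(6, 32))     # assumed peak for the six-SM configuration
print(utilization(150))    # a kernel with an IPC of 150
print(utilization(1))      # a kernel with an IPC of 1
```

Seen this way, an IPC of 150 corresponds to roughly 78% of the assumed peak, while an IPC below 1 leaves more than 99% of the issue slots unused.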
Upon examination of each of these kernels, we see that lud_diagonal,
which had the worst IPC, has 25% of its instructions issuing memory
operations, while only one control-flow instruction is present. On the other hand,
lud_internal has over 40% of its instructions issuing memory operations. We
also see that they have similar cache miss rates. Thus, the lud_diagonal
kernel is not constrained by memory or interconnect limitations or contention,
nor is it plagued by branch divergence.
(a) Instructions per cycle (IPC)
(b) Average cache miss rate
Figure 3.1: Performance of a six-SM GPU running kernels from the Rodinia
benchmark set.
(a) Instructions per cycle (IPC)
(b) Average cache miss rate
Figure 3.2: Performance of a six-SM GPU running kernels from the Parboil
benchmark set.
Further examination of the lud_diagonal kernel shows us that the pipeline is stalled
for several thousand cycles at the memory stage. These stalls are caused
by shared memory bank conflicts, non-coalesced memory accesses,
or serialized memory accesses. A detailed study done in [30] shows that the
LUD benchmark utilizes a great deal of shared memory resources. Thus,
the kernel is not limited by compute or global memory resources. Rather, it
wastes time and energy waiting on memory resources.
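Two of the stall sources named above, shared memory bank conflicts and non-coalesced accesses, can be illustrated with a short Python sketch. The parameters are assumptions in the style of a Fermi-class GPU (32 shared memory banks, 4-byte bank width, 128-byte global memory segments), and the functions are simplified counting models rather than a simulator. The helper strided is hypothetical and simply generates the byte addresses accessed by the 32 threads of a warp.

```python
from collections import defaultdict

NUM_BANKS = 32        # shared memory banks (Fermi-style assumption)
WORD_BYTES = 4        # successive 4-byte words map to successive banks
SEGMENT_BYTES = 128   # assumed size of one global memory transaction


def bank_conflict_degree(byte_addresses):
    """Worst-case number of distinct words that one bank must serve.

    1 means conflict-free; k means the access is serialized about k times.
    Threads reading the same word share a single (broadcast) access.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(w) for w in words_per_bank.values()), default=0)


def global_transactions(byte_addresses):
    """Number of distinct aligned segments touched by one warp's access."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def strided(stride_words, threads=32):
    """Byte addresses for a warp where thread t accesses word t * stride."""
    return [t * stride_words * WORD_BYTES for t in range(threads)]


for stride in (1, 2, 32):
    addrs = strided(stride)
    print(stride, bank_conflict_degree(addrs), global_transactions(addrs))
```

A unit stride is conflict-free and fully coalesced, while a stride of 32 words sends every thread to the same bank and to a different memory segment, so the access is serialized in both memory spaces.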
A similar analysis of the other benchmarks will show that power and time
are wasted waiting on resources. Thus, the goal of this work is either to improve
performance by redistributing resources or to limit wastage of power
by actively redistributing resources and shutting down control and compute
units.
3.2 Reconfiguration as the Solution
While GPUs are designed as massively parallel machines, with thousands
of threads executing, the programming model and micro-architecture are
relatively simple. Compared to an out-of-order (OoO) wide-issue superscalar
CPU with SMT (simultaneous multi-threading), a GPU has a very simple
in-order pipeline. The distinguishing factor between the two is the significantly
smaller control logic in the GPU, whose only job is to fetch and decode instructions
and schedule warps. This level of simplicity makes GPU pipelines an excellent
target for reconfiguration. The addition and removal of resources is less
challenging than it is for a CPU. Also, given that GPU architectures are
very good at hiding memory latencies by overlapping the execution of
thousands of threads, the overheads involved with reconfiguration can be easily
masked. From an implementation perspective, micro-architectural units that
are to be reconfigured need to be redesigned to take into account the latency
overheads of reconfiguration. The end result of these micro-architecture
modifications could be larger and more power-hungry units. However, in the case
of the GPU, which is capable of tolerating this latency, the design overhead
can be mitigated altogether. Thus, reconfiguration on GPUs is much more
scalable and practical than on multi-core processors. Keeping these
limitations in mind, we shall focus on modifying the datapath alone. In
doing so, excessive area overheads and the complexities of changing the control
logic can be avoided. However, limiting the modifications to the datapath
does not limit the scope of our reconfiguration.
We model the GPU pipeline in accordance with the simulator provided by
[9]. The micro-architecture of a single SM is detailed in Figure 3.3.
Figure 3.3: Micro-architecture of the GPU’s SM.
The described architecture is simple yet sufficiently detailed for our
purposes. Note that the memory pipeline is not shown in the figure. In this
work, we assume that the memory pipeline is independent of the execution
pipeline.
Clearly, the resources available are not being utilized in the best
possible way. In theory, we have two options: maximize the utilization of
either the computation throughput or the memory bandwidth. It is very rare
that both compute and memory will be stressed equally. However, the goal
of our work is to reduce resource under-utilization so as to reduce power
and/or energy consumption. Looking to reconfiguration as our solution, we
can opt for either fine-grained or coarse-grained reconfiguration. It is
important to remember that dynamic reconfiguration is expensive from an
area and power perspective. However, as we have seen, individual kernels
exhibit regular behavior. Thus, we propose a one-time reconfiguration on a
kernel-by-kernel basis.
We shall now explore both options and their application on a
kernel-by-kernel basis.
3.2.1 Fine-Grained Resource Sharing
As our first approach, we propose a low-cost system that fuses resources
from neighboring SMs. This approach targets lowering overall system power.
While common approaches would involve shutting down SMs altogether or
reducing the frequency, we propose that an SM's storage resources instead
be shared with neighboring SMs, while power-consuming resources such as
ALUs, dispatch units, and control units are powered down. In our proposed
architecture, one-third of the SMs are shut down, and the register,
operand dispatch, cache, and shared memory resources of each are shared
between its two neighboring SMs.
As shown in Figure 3.4, the register banks and operand collector units of
the shut-down SM are split between its two neighboring SMs, fifty percent
each, while its remaining units are shut down. In essence, we now have a
GPU with two-thirds the number of SMs, with each SM having fifty percent
additional memory resources. It could be argued that shutting down SMs
would reduce power anyway. However, the goal is not just to reduce power.
Rather, in scenarios where the SMs' compute resources are being
under-utilized, shutting down compute units would not affect the
throughput. Instead, the goal is to provide the remaining SMs with
additional resources such that when neighboring SMs are powered down, they
can try to pick up the slack or at the least deliver the same performance
without wasting power on control and compute logic that is not needed.
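The resource arithmetic behind this scheme can be checked with a short sketch. It assumes that SMs are grouped in threes and that the storage of the powered-down SM is split evenly between the two SMs that stay active; the function and parameter names are illustrative and do not come from the simulator.

```python
from fractions import Fraction


def fine_grained_share(num_sms, group_size=3, donors_per_group=1):
    """Summarize the SM count and storage after fine-grained sharing.

    Each group of `group_size` SMs powers down the compute and control
    logic of `donors_per_group` SMs and splits the donors' storage evenly
    among the SMs that stay active.
    """
    if num_sms % group_size != 0:
        raise ValueError("num_sms must be a multiple of group_size")
    groups = num_sms // group_size
    active_per_group = group_size - donors_per_group
    active_sms = groups * active_per_group
    # Storage per active SM, relative to one baseline SM.
    storage_per_active = 1 + Fraction(donors_per_group, active_per_group)
    return active_sms, storage_per_active


active, storage = fine_grained_share(6)
print(f"{active} active SMs, each with {float(storage):.2f}x baseline storage")
```

For the six-SM GPU this gives four active SMs, each with 1.5 times the baseline storage, which matches the two-thirds and fifty percent figures above.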
In order to quantify this option, we performed an experiment. The
configuration is modeled on what is shown in Figure 3.4. In our first
experiment, we use a GPU with six SMs and three memory controllers, as we
did in the previous section. The results are shown in Figure 3.5.
Figure 3.4: Overview of fine-grained resource sharing: darker shaded units
are turned off; colored blocks indicate fused groups.
As we can see from the results, along with the reduction in power from
shutting down units, certain kernels demonstrate improved performance as
well. One example is kernel2 from the bfs benchmark of the Rodinia set,
which showed not only a 27% improvement in total execution time but also a
nearly 20% reduction in power. All in all, the reductions in execution
time and power result in a 40% improvement in the power-delay product.
Similarly, several other kernels have shown small but significant
improvements in performance along with power cutbacks.
An interesting case is the invert mapping kernel of the kmeans benchmark,
which demonstrated a 37% increase in IPC and a 25% fall in execution time.
However, this came at a 26% increase in power. So, despite shutting down a
third of the execution units, we still see an increase in power. If we
look back at Figure 3.1a, we see that this kernel was particularly
under-performing. The increased resources have allowed it to increase its
throughput, which explains the increase in power.
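The power-delay figures quoted above can be reproduced with a short calculation. The sketch assumes that the power-delay product (PDP) is average power multiplied by execution time, and it treats "nearly 20%" as exactly 20%. The kmeans value is derived here from the reported percentages and is not a measured result.

```python
def relative_pdp(time_change, power_change):
    """Power-delay product relative to the baseline (1.0 = unchanged).

    Both arguments are fractional changes: -0.27 is a 27% reduction,
    +0.26 is a 26% increase.
    """
    return (1 + time_change) * (1 + power_change)


bfs_kernel2 = relative_pdp(time_change=-0.27, power_change=-0.20)
kmeans_invert = relative_pdp(time_change=-0.25, power_change=+0.26)

for name, pdp in [("bfs kernel2", bfs_kernel2),
                  ("kmeans invert mapping", kmeans_invert)]:
    print(f"{name}: PDP = {pdp:.3f}x baseline ({(1 - pdp) * 100:.1f}% lower)")
```

Under these assumptions, bfs kernel2 lands at roughly 0.58 of the baseline PDP, consistent with the 40% improvement reported above. The kmeans kernel lands at about 0.95, so its shorter execution time slightly outweighs its higher power.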
Figure 3.5: Performance and power of fine-grained resource sharing.
3.2.2 Coarse-Grained Fusion of Resources
As a second study, we explore a more coarse-grained approach. Along the
lines of Core Fusion [24], two or more SMs can be fused together. This
would give blocks executing on the new, larger SMs more resources to work
with, allowing for more warps in flight, which could be used in scenarios
where memory latencies are extremely high and the number of resident
threads needs to be higher. It would also allow for wider warps (greater
than 32 threads per warp). Thus, this configuration would appear to be
useful for compute-intensive but memory-bound applications.
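One way to see why a fused SM can keep more warps in flight is to count how many thread blocks fit on an SM at once. The sketch below models a fused SM as the sum of the resources of two SMs, which is an assumption, and every capacity and kernel parameter in it is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMResources:
    registers: int          # 32-bit registers in the register file
    shared_mem_bytes: int   # shared memory capacity
    max_warps: int          # scheduler limit on resident warps


def fuse(a, b):
    """Model a fused SM as the sum of two SMs' resources (an assumption)."""
    return SMResources(a.registers + b.registers,
                       a.shared_mem_bytes + b.shared_mem_bytes,
                       a.max_warps + b.max_warps)


def resident_blocks(sm, regs_per_thread, smem_per_block, threads_per_block,
                    warp_size=32):
    """Number of thread blocks that fit on one SM at the same time."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    limits = [sm.registers // (regs_per_thread * threads_per_block),
              sm.max_warps // warps_per_block]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


# Hypothetical SM and kernel parameters, not taken from the simulator.
base = SMResources(registers=32768, shared_mem_bytes=49152, max_warps=48)
kernel = dict(regs_per_thread=40, smem_per_block=30000, threads_per_block=256)

blocks_base = resident_blocks(base, **kernel)
blocks_fused = resident_blocks(fuse(base, base), **kernel)
print(blocks_base, blocks_fused)
```

With these made-up numbers, shared memory limits each unfused SM to one resident block, so two SMs hold two blocks in total, whereas the fused SM holds three because the leftover shared memory of both SMs is pooled.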
As a part of this work, we performed a preliminary study that fuses the
storage and compute resources of two SMs to form a single larger SM. An
overview is shown in Figure 3.6.
Once again, we limit the reconfiguration to the datapath alone, and the
reconfiguration needed for our fine-grained sharing can be easily expanded
to support this setup.
Figure 3.6: Overview of coarse-grained fusion: darker shaded units are
turned off; colored blocks indicate fused groups.
Once again, we quantify the effect of this via simulation of the six-SM
GPU; the results are shown in Figure 3.7.
The results shown in Figure 3.7 are presented in order to demonstrate
that the reconfiguration scheme does indeed have a potential application
set. A large number of kernels did not benefit and showed decreased
performance; however, that is expected. The goal is to find a
configuration that works best for each kernel.
Having said that, this scheme is very promising, with kernels showing up
to 40% improvement in IPC and a 30% reduction in execution cycles. While
this may not be ideal for low-power scenarios, it may prove to be worth
exploring for larger GPU configurations.
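The per-kernel selection described in this section can be sketched as a simple lookup over simulation results. The sketch assumes the power-delay product as the selection metric, which is one possible choice rather than a method stated in this work, and the kernel names and measurements are made up for illustration.

```python
def best_configuration(results):
    """Pick, for each kernel, the configuration with the lowest PDP.

    `results` maps kernel name -> {configuration name: (cycles, watts)}.
    """
    choice = {}
    for kernel, by_config in results.items():
        choice[kernel] = min(
            by_config,
            key=lambda cfg: by_config[cfg][0] * by_config[cfg][1])
    return choice


# Made-up measurements for two hypothetical kernels; not data from this work.
example_results = {
    "kernel_a": {"baseline": (1000, 10.0),
                 "fine": (800, 8.5),
                 "coarse": (900, 11.0)},
    "kernel_b": {"baseline": (1000, 10.0),
                 "fine": (1300, 8.0),
                 "coarse": (700, 12.0)},
}
selected = best_configuration(example_results)
print(selected)
```

In this toy data set, one kernel is best served by fine-grained sharing and the other by coarse-grained fusion, which is the situation a one-time, kernel-by-kernel reconfiguration is meant to exploit.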
3.3 Modifying the GPU Datapath
We will now briefly discuss the overheads involved in modifying the datapath
in order to support the reconfiguration previously discussed.
There are three main units that need to be modified: the arbitration logic,
the crossbar, and the dispatch unit. In particular, the crossbar and dispatch
units need to be able to route twice as many inputs into their respective
paths when in fused mode. Our methodology involves over-designing each
unit in all SMs so as to enable them to perform in the fused scenarios.
Figure 3.7: Performance of coarse-grain sharing.
In order to do this, a datapath model must be built. For this work, we
have implemented our baseline crossbar and dispatch units in Verilog and
have synthesized them using a 45 nm IBM library.
The results of our initial investigation suggest that the crossbar area
increases by 93%, while the dispatch unit area increases by 18%. Note that
these are not area overheads to the entire SM but just to a single unit.
While the overhead of the dispatch unit is understandable, the crossbar
suffers from the extra area needed to meet the timing constraints
involved. In future work, we will improve on this by relaxing the
constraints with additional pipeline stages in the crossbar.
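Since the reported increases apply to individual units, their effect on the whole SM depends on how much of the SM area each unit occupies. The sketch below weights each unit's growth by an assumed area share. The shares are hypothetical placeholders, because the text does not report them, so the resulting percentage is illustrative only.

```python
UNIT_GROWTH = {"crossbar": 0.93, "dispatch": 0.18}  # reported per-unit growth


def sm_area_overhead(area_share, unit_growth=UNIT_GROWTH):
    """Fractional growth of the whole SM given each unit's share of SM area."""
    return sum(area_share[unit] * growth
               for unit, growth in unit_growth.items())


# Hypothetical area shares; the text does not report them.
assumed_share = {"crossbar": 0.05, "dispatch": 0.02}
overhead = sm_area_overhead(assumed_share)
print(f"SM-level overhead: {overhead * 100:.2f}%")
```

With these placeholder shares the SM grows by about 5%, far less than the 93% seen by the crossbar alone. Actual shares would have to come from a floorplan or synthesis report of the full SM.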
CHAPTER 4
CONCLUSION
While GPGPU may still be in its nascent stages, it is a very promising
paradigm for parallel computing and is gaining momentum rapidly. In the
face of scaling and power challenges, reconfigurable computing offers a
scalable solution for GPUs. Preliminary results presented in this thesis
suggest that both fine-grained and coarse-grained reconfiguration can be
used either to improve the power and energy efficiency of GPUs with
limited impact on their performance, or simply to improve performance.
Both forms of reconfiguration can be implemented more easily on GPUs than
on the multi-core setups on which these techniques have focused until
now. Our study has shown that the reconfiguration can be limited to the
datapath alone.
REFERENCES
[1] “21st Century Computer Architecture,” White Paper, Computing Com-
munity Consortium, 2012.
[2] S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs
and the Future of Parallel Computing,” Micro, IEEE, vol. 31, no. 5, pp.
7–17, Sept 2011.
[3] M. Taylor, “Is Dark Silicon Useful? Harnessing the Four Horsemen of the
Coming Dark Silicon Apocalypse,” in Design Automation Conference
(DAC), 2012 49th ACM/EDAC/IEEE, June 2012, pp. 1131–1136.
[4] Nvidia, “Dynamic Parallelism in CUDA,” 2012, Technical Draft. [On-
line]. Available: http://developer.download.nvidia.com/assets/cuda/
files/CUDADownloads/TechBrief Dynamic Parallelism in CUDA.pdf
[5] D. B. Kirk and W. mei W. Hwu, Programming Massively Parallel
Processors: A Hands-on Approach, 2nd ed. Morgan Kaufmann, 2012.
[6] “Implementing FPGA Design with the OpenCL Standard,” White
Paper-1173, Altera, 2013. [Online]. Available: http://www.altera.com/
literature/wp/wp-01173-opencl.pdf
[7] Y. Kim and A. Shrivastava, “CuMAPz: A Tool to Analyze Memory
Access Patterns in CUDA,” in Design Automation Conference (DAC),
2011 48th ACM/EDAC/IEEE, June 2011, pp. 128–133.
[8] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and
A. Moshovos, “Demystifying GPU Microarchitecture through Mi-
crobenchmarking,” in Performance Analysis of Systems Software
(ISPASS), 2010 IEEE International Symposium on, March 2010, pp.
235–246.
[9] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing
CUDA Workloads Using a Detailed GPU Simulator,” in Performance
Analysis of Systems and Software, 2009. ISPASS 2009. IEEE Interna-
tional Symposium on, April 2009, pp. 163–174.
[10] W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation
and Scheduling for Efficient GPU Control Flow,” in Microarchitecture,
2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium
on, Dec 2007, pp. 407–420.
[11] Y. Liang, Z. Cui, K. Rupnow, and D. Chen, “Register and Thread
Structure Optimization for GPUs,” in Design Automation Conference
(ASP-DAC), 2013 18th Asia and South Pacific, Jan 2013, pp. 461–466.
[12] Z. Cui, Y. Liang, K. Rupnow, and D. Chen, “An Accurate GPU Perfor-
mance Model for Effective Control Flow Divergence Optimization,” in
Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th
International, May 2012, pp. 83–94.
[13] Y. Zhang, L. Peng, B. Li, J.-K. Peir, and J. Chen, “Architecture Com-
parisons Between Nvidia and ATI GPUs: Computation Parallelism and
Data Communications,” in Workload Characterization (IISWC), 2011
IEEE International Symposium on, Nov 2011, pp. 205–215.
[14] Parallel Thread Execution ISA Version 3.1., Nvidia, 2012.
[15] C. Wittenbrink, E. Kilgariff, and A. Prabhu, “Fermi GF100 GPU Ar-
chitecture,” Micro, IEEE, vol. 31, no. 2, pp. 50–59, March 2011.
[16] “Nvidia’s Next Generation CUDA Compute Architecture:
Kepler GK110,” White Paper, Nvidia, 2014. [On-
line]. Available: http://international.download.nvidia.com/pdf/kepler/
NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
[17] Z. ul Abdin and B. Svensson, “Evolution in Architectures and Program-
ming Methodologies of Coarse-grained Reconfigurable Computing,” Mi-
croprocess. Microsyst., vol. 33, no. 3, pp. 161–178, May 2009.
[18] H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and
E. Chaves Filho, “MorphoSys: An Integrated Reconfigurable System
for Data-Parallel and Computation-Intensive Applications,” Computers,
IEEE Transactions on, vol. 49, no. 5, pp. 465–481, May 2000.
[19] M. Taylor, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank,
S. Amarasinghe, A. Agarwal, W. Lee, J. Miller, D. Wentzlaff, I. Bratt,
B. Greenwald, H. Hoffmann, P. Johnson, and J. Kim, “Evaluation of
the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP
and Streams,” in Computer Architecture, 2004. Proceedings. 31st Annual
International Symposium on, June 2004, pp. 2–13.
[20] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir, and T. Arslan,
“The Reconfigurable Instruction Cell Array,” Very Large Scale Integra-
tion (VLSI) Systems, IEEE Transactions on, vol. 16, no. 1, pp. 75–85,
Jan 2008.
[21] B. Mei, S. Vernalde, D. Verkest, and R. Lauwereins, “Design Methodol-
ogy for a Tightly Coupled VLIW/Reconfigurable Matrix Architecture:
A Case Study,” in Design, Automation and Test in Europe Conference
and Exhibition, 2004. Proceedings, vol. 2, Feb 2004, pp. 1224–1229 Vol.2.
[22] K. Wu, A. Kanstein, J. Madsen, and M. Berekovic, “Mt-ADRES: Mul-
tithreading on Coarse-grained Reconfigurable Architecture,” in Proceed-
ings of the 3rd International Conference on Reconfigurable Computing:
Architectures, Tools and Applications, ser. ARC’07, 2007, pp. 26–38.
[23] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger,
S. Keckler, and C. Moore, “Exploiting ILP, TLP, and DLP with the
Polymorphous TRIPS Architecture,” in Computer Architecture, 2003.
Proceedings. 30th Annual International Symposium on, June 2003, pp.
422–433.
[24] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, “Core Fusion:
Accommodating Software Diversity in Chip Multiprocessors,” in Pro-
ceedings of the 34th Annual International Symposium on Computer Ar-
chitecture, ser. ISCA ’07, 2007, pp. 186–197.
[25] A. Lukefahr, S. Padmanabha, R. Das, F. Sleiman, R. Dreslinski,
T. Wenisch, and S. Mahlke, “Composite Cores: Pushing Heterogene-
ity Into a Core,” in Microarchitecture (MICRO), 2012 45th Annual
IEEE/ACM International Symposium on, Dec 2012, pp. 317–328.
[26] A. Ansari, S. Feng, S. Gupta, J. Torrellas, and S. Mahlke, “Illusionist:
Transforming Lightweight Cores into Aggressive Cores on Demand,”
in High Performance Computer Architecture (HPCA2013), 2013 IEEE
19th International Symposium on, Feb 2013, pp. 436–447.
[27] J. Adriaens, K. Compton, N. S. Kim, and M. Schulte, “The Case for
GPGPU Spatial Multitasking,” in High Performance Computer Archi-
tecture (HPCA), 2012 IEEE 18th International Symposium on, Feb
2012, pp. 1–12.
[28] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and
K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Comput-
ing,” in Workload Characterization, 2009. IISWC 2009. IEEE Interna-
tional Symposium on, Oct 2009, pp. 44–54.
[29] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang,
N. Anssari, G. D. Liu, and W. mei W. Hwu, “The Parboil Technical
Report,” University of Illinois, at Urbana-Champaign, IMPACT Tech-
nical Report 12-01.
[30] S. Che, J. Sheaffer, M. Boyer, L. Szafaryn, L. Wang, and K. Skadron,
“A Characterization of the Rodinia Benchmark Suite with Compari-
son to Contemporary CMP workloads,” in Workload Characterization
(IISWC), 2010 IEEE International Symposium on, Dec 2010, pp. 1–11.
