AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU
  Scaling by Cheng, Xianwei et al.
AMOEBA: A Coarse Grained Reconfigurable Architecture
for Dynamic GPU Scaling
Xianwei Cheng
Computer Science and Engineering Department
University of North Texas
xianweicheng@my.unt.edu
Hui Zhao
Computer Science and Engineering Department
University of North Texas
hui.zhao@unt.edu
Mahmut Kandemir
Computer Science and Engineering Department
Pennsylvania State University
mtk2@psu.edu
Beilei Jiang
Computer Science and Engineering Department
University of North Texas
beileijiang@my.unt.edu
Gayatri Mehta
Electrical Engineering Department
University of North Texas
gayatri.mehta@unt.edu
ABSTRACT
Different GPU applications exhibit varying scalability
patterns with network-on-chip (NoC), coalescing, mem-
ory and control divergence, and L1 cache behavior. A
GPU consists of several Streaming Multi-processors (SMs)
that collectively determine how shared resources are
partitioned and accessed. Recent years have seen diver-
gent paths in SM scaling towards scale-up (fewer, larger
SMs) vs. scale-out (more, smaller SMs). However, nei-
ther scaling up nor scaling out can meet the scalability
requirement of all applications running on a given GPU
system, which inevitably results in performance degra-
dation and resource under-utilization for some applica-
tions. In this work, we investigate major design pa-
rameters that influence GPU scaling. We then propose
AMOEBA, a solution to GPU scaling through reconfig-
urable SM cores. AMOEBA monitors and predicts ap-
plication scalability at run-time and adjusts the SM con-
figuration to meet program requirements. AMOEBA
also enables dynamic creation of heterogeneous SMs
through independent fusing or splitting. AMOEBA is
a microarchitecture-based solution and requires no ad-
ditional programming effort or custom compiler sup-
port. Our experimental evaluations with application
programs from various benchmark suites indicate that
AMOEBA is able to achieve a maximum performance
gain of 4.3x, and generates an average performance im-
provement of 47% when considering all benchmarks tested.
1 Introduction
GPUs have emerged as performance accelerators for
general purpose applications and take advantage of the
single instruction multiple threads (SIMT) program-
ming model to improve the performance of data-parallel
computations. Supercomputers [1, 2], cloud servers [3],
desktops [4] and even mobile devices [5] already benefit
significantly from GPUs to achieve better performance
and higher power efficiency. A GPU typically consists of
many compute units (CUs), also called streaming multi-
processors (SMs), and each SM contains a large number
of simple compute cores [6]. GPUs leverage the massive
number of computing cores in SMs to exploit thread-
level parallelism (TLP) in an attempt to hide memory
access latency [7, 8].
The multiprocessor industry has been fueled by Moore’s
Law for many years and processor performance is im-
proved through increasing transistor counts. However,
Moore’s Law is slowing down because we are reaching
the technological limits of how small transistors can be
made. Increasing the chip size can allow us to add more
transistors to a chip. However, this is not a sustainable
solution due to several reasons: (1) there is not enough
power budget to allow all transistors to be powered
on simultaneously, because transistor threshold
voltage does not scale with technology nodes and
per-transistor switching energy remains almost constant
[9], (2) cost becomes prohibitive in manufacturing chips
with ultra-high transistor counts [10], and (3) data com-
munication becomes a bottleneck as chip size increases
[11].
TLP has been considered as a promising solution to
tackle the slowdown of Moore’s law, and GPU archi-
tecture is based on the idea of exploiting TLP. The
computing power of GPU arises from its SIMT archi-
tecture: many threads are executed concurrently in an
SIMD fashion. However, an average programmer may
not be aware of the details of the underlying hardware
to write high-quality code to fully utilize available GPU
resources. As a result, many general purpose applica-
tions are not fully optimized for running on a specific
GPU architecture, and this causes under-utilization of
hardware resources [12, 13, 14, 15, 16, 17, 18]. For
example, it has been observed that cores are idle for
52% to 98% of the execution time for some GPU bench-
marks [12]. Therefore, instead of adding more resources
to GPUs, exploration of optimized resource utilization
techniques can be a more viable option to enhance the
GPU performance and power efficiency.
Earlier efforts have aimed to maximize
GPU resource utilization. For example, several prior
works have proposed to share a GPU among multi-
ple applications and used software-level techniques to
manage resource sharing [19, 20, 21].
arXiv:1911.03364v1 [cs.AR] 8 Nov 2019
On the hardware side, spatial multitasking has been proposed as a
technique that partitions a GPU among multiple kernels at
an SM granularity [22]. Several techniques have been
also proposed to share resources among kernels inside
each SM, such as simultaneous multikernel (SMK) [12],
warp-slicer [23], and GPU Maestro [16]. However, these
techniques need to run multi-programmed workloads to
fully utilize GPU resources (i.e., finding more tasks to
avoid GPU inactive cycles). Also, they do not consider
an application’s scalability and do not configure hard-
ware to meet the software’s resource demands.
An alternative approach is to dynamically reconfigure
hardware to avoid resource under-utilization. Reconfig-
urable cores have been proposed in the past for CPUs
to facilitate parallel execution [24, 9, 25, 26, 27]. How-
ever, the overhead of reconfiguring CPU cores is high
due to the complexities associated with CPU architec-
tures. There have been very few reconfigurable GPU
architectures proposed. For example, R-GPU is pro-
posed to interconnect compute cores inside an SM to
reduce data movement and remove decoding overhead
by assigning each core a fixed operation [14]. In compar-
ison, SGMF is proposed as a dataflow architecture using
coarse-grain reconfigurable fabric [13]. However, SGMF
needs help from the compiler to convert the kernels into
dataflow graphs. Neither work considers the scalability
of applications regarding system bottlenecks, such as
interconnection network, memory access patterns, and
control divergence. In addition, these prior works only
explore intra-SM resource utilization and assume that
the number and size of SMs are fixed. However, sharing
resources among SMs is also important because appli-
cations have varying scalability patterns depending on
SM settings, but exploration of this design space has
largely been ignored in prior work.
In this work, we present AMOEBA, a reconfigurable
architecture to improve GPU resource utilization, per-
formance, and energy efficiency. AMOEBA takes into
account several important application resource requirements,
such as interconnect throughput, memory access
patterns, and control divergence, before selecting an
optimal GPU scaling option. AMOEBA is a coarse-
grained reconfigurable architecture to enable flexible SM
scaling at a low cost. It also explores heterogeneity of
SMs through dynamic fusing and splitting in order to
accommodate program divergence.
We make the following contributions in this paper:
• We investigate the GPU scaling problem under a
resource bound. We identify the important factors
determining whether an SM should be designed in a
scale-up or scale-out fashion. Building upon the re-
sults from this investigation, we propose a coarse-
grained reconfigurable architecture that fuses the base-
line scale-out SMs into larger scale-up SMs. This de-
sign enables optimized resource utilization across SM
boundaries.
• We design an online controller that takes into ac-
count an application’s dynamic behavior and makes
reconfiguration decisions accordingly. The controller
employs a binary logistic regression model to predict
application scalability at low cost.
Figure 1: GPU architecture overview.
• We provide design details of the proposed reconfig-
urable architecture. Our proposed architecture en-
ables coarse-grained SM fusion, and it can provide
support for both scale-up and scale-out GPU config-
urations.
• We propose a scheme to split individual scale-up SM
cores dynamically when program divergence causes
pipeline stalls.
To the best of our knowledge, this is the first paper
to propose a reconfigurable GPU architecture that can
dynamically toggle between scale-up and scale-out op-
tions, with the goal of maximizing resource utilization.
2 Background
2.1 GPU Execution Model and Architecture
The GPU execution model divides the total work space into
a grid and assigns a work item, also called a thread, to
work on each portion of data. Each thread executes the
same set of instructions, and this enables parallel multi-
threaded execution in an SIMT fashion. Each segment
of code loaded to a GPU is called a kernel. A group
of threads that execute a kernel concurrently is referred
to as a workgroup, also called a thread-block or a co-
operative thread-array (CTA). The total work space is
divided into blocks or CTAs, and threads within a given
CTA can communicate with each other.
A high-level view of a GPU architecture is shown in
Figure 1. A GPU consists of multiple compute units
(CUs) or streaming multiprocessors (SMs), which are
analogous to CPU cores. Each SM contains fetch, de-
code, execution and memory access logic, and these
units collectively form a pipeline. There are several
compute cores residing within each SM, and each com-
pute core is a large, heavily-pipelined execution unit
capable of executing both integer and floating point op-
erations. When a kernel is launched, each CTA is dis-
patched to an SM and executes there until its comple-
tion. A CTA is further divided into units called warps,
also known as wavefronts. Typically, there are a large
number of warps in flight inside an SM so that memory
access latency can be masked by concurrent execution.
There is a unified L2 cache that is coupled with memory
controllers, while the global memory is off-chip. On-
chip data communication is implemented through an
on-chip network.
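As a rough illustration of the hierarchy described above, the following Python sketch shows how a one-dimensional workload might be divided into CTAs and warps. The CTA size of 256 threads and the warp size of 32 threads are assumptions chosen for the example; they are not values taken from this paper.

```python
import math


def partition_work(num_threads, cta_size=256, warp_size=32):
    """Return (number of CTAs, warps per CTA) for a 1-D grid.

    cta_size and warp_size are illustrative assumptions.
    """
    # Each CTA covers cta_size threads; the last CTA may be partially filled.
    num_ctas = math.ceil(num_threads / cta_size)
    # Each CTA is further divided into warps of warp_size threads.
    warps_per_cta = math.ceil(cta_size / warp_size)
    return num_ctas, warps_per_cta


print(partition_work(10_000))
```

Under these assumed sizes, every CTA is dispatched to one SM as a unit, and the warps inside it are what the SM's warp scheduler interleaves to mask memory latency.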
Figure 2: SM scaling trends (number of SMs vs.
number of cores per SM) for NVIDIA’s GTX
GPU family [28].
2.2 Scaling of SMs
The primary execution unit in a GPU is a compute core,
and they are grouped into SMs that share resources such
as register file, local memory, and warp scheduler. Due
to the resource limitation, once the chip size is decided,
the total number of compute cores is fixed. Then, there
arises an important scaling problem: how should com-
pute cores and other resources be partitioned into SMs?
That is, should we opt for scale-up SMs (by including
more compute cores and resources into a smaller num-
ber of SMs) or scale-out SMs (by having more SMs with
fewer cores and fewer resources inside)? The scaling
configuration of SMs is critical since it directly determines
the maximum parallelism among GPU threads
and impacts resource sharing and utilization.
Figure 2 shows the scaling trends of NVIDIA’s GTX
GPU family during the last 11 years. The number of
computing cores per SM can be used to represent the
scaling in SM size. We observe that both the size and number
of SMs increased from 2008 to 2011. However,
after 2011, the trends of SM size and SM count start to
move in opposite directions. This is because
we are reaching the limit in terms of the total number
of computing cores that can be integrated into a chip
due to power and area constraints. Therefore, it is not
possible to scale up both the size and number of SMs;
so, we either scale out or scale up, but not both, as
shown in the figure. Since 2017, the most recent trend has
been scaling out. However, the question is
whether this trend of scaling out is sustainable for the
future. And, if not, what is the optimal configuration
for the best performance and resource utilization?
3 Motivation
3.1 Trade-offs in SM Scaling
As discussed above, warps execute in SMs and all threads
in one SM share GPU resources such as shared memory,
L1 cache, register file, warp scheduler, and interconnect
interface. SM scaling greatly influences the
resource utilization and power-performance efficiency.
Due to their different characteristics and resource re-
quirements, different applications exhibit varying SM
scaling patterns.
Figure 3: Performance with SM scaling (a) with
a mesh NoC (b) with a perfect NoC. (x-axis is
the number of SMs and y-axis is IPC normalized
to 16 SMs.)
Figure 4: Memory access coalescing results with
different GPU scaling options. Actual memory
access rate represents the memory accesses after
coalescing. Here, we experiment with different
SM scaling options with 16, 25, 36, and 64 SMs.
We start by investigating the scaling
of multiple benchmarks, and the results are plotted in
Figure 3(a). In this experiment, we fix the total amount
of chip resources but vary the size and the number of
SMs. As can be observed, some applications benefit
from scaling out with smaller SMs (CP, SC ), while
other applications benefit from scale-up SMs (MUM,
RAY ). This result indicates that there is not a scaling
setting that benefits all applications. Motivated by this
observation, we next examine in detail the major fac-
tors that determine an application’s performance with
SM scaling.
(1) NoC Effect on SM Scaling. GPU SMs are
connected to L2 cache and memory controllers through
a network-on-chip (NoC). It has been shown that NoC
is a bottleneck in GPU performance as the chip size
grows [11]. This is due to the particular traffic pattern
exhibited by GPUs. Specifically, all SMs communicate
with the limited number of memory controllers on chip.
As the total on-chip bandwidth is fixed and is shared
by all SMs, more SMs means that each SM receives a
smaller share of the network bandwidth. In addition, a
larger network incurs longer delays due to increased hop
count and contention. As a result, there will be more
negative impact on the performance. We experimented
with different SM scaling options using a perfect NoC
(with zero delay), and the results are plotted in Fig-
ure 3(b). We can observe that when the NoC impact is
removed, more applications (e.g., LPS, AES, CP, and
SC ) achieve better performance with scale out settings.
This means that, for applications that are sensitive to
the on-chip network performance, performance will ul-
timately degrade when we keep scaling out the SMs.
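The two NoC effects described above (a smaller bandwidth share per SM and a larger hop count) can be made concrete with a toy calculation. The sketch below assumes a square mesh containing only SM nodes and uniformly distributed traffic. Both are simplifying assumptions made here for illustration, since real GPU traffic flows mainly between SMs and a few memory controllers.

```python
import math
from itertools import product


def per_sm_bandwidth(total_bandwidth, num_sms):
    """Equal share of a fixed on-chip bandwidth budget per SM."""
    return total_bandwidth / num_sms


def average_hops(k):
    """Average Manhattan distance over all ordered node pairs of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay) in nodes
        for (bx, by) in nodes
    )
    return total / len(nodes) ** 2


# The SM counts used in the experiments are perfect squares.
for num_sms in (16, 25, 36, 64):
    k = math.isqrt(num_sms)
    print(num_sms, per_sm_bandwidth(1.0, num_sms), average_hops(k))
```

Going from 16 to 64 SMs in this toy model quarters the per-SM bandwidth share and roughly doubles the average hop count, which is consistent with the qualitative argument in the text.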
Figure 5: Rate of shared data in L1 caches of
neighboring SMs.
Figure 6: Control divergence caused stalls with
different GPU scaling options.
(2) Memory Locality and SM Scaling. It has
been observed that memory resources inside an SM affect
the performance of some applications. Some applications
may share a lot of data among warps in one SM or
among L1 caches in different SMs. In such cases, scaling
up will improve the utilization of shared memory and
L1 cache, and thereby reduce accesses to memory outside of
an SM. GPUs employ a mechanism called memory coa-
lescing to reduce data movements. The idea is to com-
bine multiple memory accesses from a warp to the same
cache line into a single transaction. Larger SMs can ex-
ecute larger warps, and provide more opportunities for
memory coalescing. We quantitatively characterize the
coalescing effects in SMs with different scaling settings,
as shown in Figure 4. In this figure, the y-axis shows
actual memory access percentage of all load and store
instructions after coalescing. As shown in Figure 4, a
scale up design with 16 SMs has much lower memory
accesses compared to a scale out design with 64 SMs.
That is, as far as coalescing is concerned, scale up SMs
bring more benefits than scale out SMs.
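To illustrate the coalescing idea described above, the following sketch counts how many memory transactions a warp generates once accesses to the same cache line are merged. The 128-byte line size and the byte addresses are assumptions chosen for the example.

```python
def coalesced_transactions(addresses, line_size=128):
    """Transactions left after merging accesses that hit the same cache line."""
    return len({addr // line_size for addr in addresses})


def actual_access_rate(warps, line_size=128):
    """Issued transactions divided by the per-thread accesses requested."""
    requested = sum(len(warp) for warp in warps)
    issued = sum(coalesced_transactions(warp, line_size) for warp in warps)
    return issued / requested


# A unit-stride warp (4-byte elements) touches a single 128-byte line.
unit_stride = [4 * lane for lane in range(32)]
# A warp whose lanes are 128 bytes apart touches 32 different lines.
line_stride = [128 * lane for lane in range(32)]

print(coalesced_transactions(unit_stride))
print(coalesced_transactions(line_stride))
print(actual_access_rate([unit_stride, line_stride]))
```

A lower actual access rate means more accesses were merged, which is the quantity plotted on the y-axis of Figure 4.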
In addition, recent GPU architectures combine data
cache and shared memory functionality into a single
memory block to provide the best overall performance.
This makes the actual L1 cache capacity several times
larger when needed. For example, NVIDIA’s Volta [29]
architecture has a combined capacity of 128 KB/SM,
more than seven times larger than the GP100 data cache,
and all of it is usable as a cache by programs that do
not use shared memory. Considering this trend, we also
investigated L1 data sharing among neighboring SMs
with increased L1 capacity, and the results are plot-
ted in Figure 5. As can be observed, some benchmarks
(such as HW and 3DCV ) exhibit around 10% sharing
rate in the baseline configuration. When L1 capacity
is increased by two or four times, higher sharing rate
is observed in most benchmarks that exhibit data shar-
ing. This means that scaling up SMs by increasing the
L1 capacity can effectively reduce duplicated data and
lead to more efficient utilization of the L1 caches.
(3) Control Divergence and SM Scaling. Re-
cent GPU architectures allow individual threads to fol-
low distinct program paths with control flow on the
SIMD pipeline. Control divergence occurs when threads
in the same warp take different paths upon a condi-
tional branch, which can lead to significant performance
degradation because it increases pipeline stalls [30, 31].
Even though various software techniques have been pro-
posed to better schedule branch instructions [19, 20, 21],
control divergence cannot be totally removed. We have
observed the core inactivity caused by the control di-
vergence, as shown in Figure 6. It can be seen from
this plot that, for scale up SMs, pipeline stalls caused
by branch instructions are much more severe than for scale out
SMs. In fact, for many benchmarks, the cores are stalled
for more than half of the time waiting for branch instructions
to be resolved. This is because, in larger
SMs, the pipeline is wider than in smaller SMs; as a result,
a pipeline stall causes a larger reduction in computation
parallelism. In this sense, applications with many
control instructions need to employ scale out SMs for
better performance.
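The argument that a wider pipeline loses more parallelism per divergent branch can be sketched with a toy model. It assumes that both sides of a divergent branch are executed one after the other while the lanes on the other path sit idle, and that all lanes of the pipeline operate in lockstep. The cycle counts are hypothetical.

```python
def idle_lane_cycles(taken_mask, then_cycles, else_cycles):
    """Lane-cycles wasted when both sides of a branch are serialized.

    taken_mask holds one boolean per lane (True = lane takes the branch).
    """
    width = len(taken_mask)
    taken = sum(taken_mask)
    if taken in (0, width):
        return 0  # no divergence: every lane follows the same path
    # Lanes not on the "then" path idle while it runs, and vice versa.
    return (width - taken) * then_cycles + taken * else_cycles


narrow = [True] + [False] * 31   # one divergent lane on a 32-wide pipeline
wide = [True] + [False] * 63     # one divergent lane on a 64-wide pipeline

print(idle_lane_cycles(narrow, 10, 10))
print(idle_lane_cycles(wide, 10, 10))
```

In this simplified model, the same single divergent thread wastes twice as many lane-cycles on the 64-wide pipeline as on the 32-wide one.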
3.2 Accommodating Control and Memory Di-
vergence through Dynamic SM Scaling
Ideally, threads running on GPUs are able to execute
in a lock step fashion, and consume continuous com-
putation enabled by warp scheduling to avoid pipeline
stalls. However, it has been shown that, for some appli-
cations, control flow divergence and memory divergence
inside warps can significantly degrade performance by
causing stalls in SM pipelines [30, 31, 32]. Memory
divergence occurs when threads from a single warp ex-
perience different memory-reference latencies caused by
cache misses or accessing different DRAM banks. In
current organizations, the entire warp must wait until
the last thread has its reference satisfied. To solve
this problem, several techniques have been proposed to
divide a warp into smaller slices and regroup them to
create a new warp so that divergent threads do not pre-
vent other threads from proceeding in execution [23, 16,
33]. However, to our knowledge, all existing work sub-
divides a warp and reorganizes the threads to build a
new warp to run on the “same sized” SM.
There is a significant drawback of the above men-
tioned techniques when implementing variable warp sizes:
the SM needs to be subdivided to support the execution
of a gang of split warps. For example, the gang-warp
[16] needs to divide an SM into four slices and each
slice works, after splitting, as a small SM. There are
prohibitive hardware overhead and design complexity
issues in this type of approach. In addition, prior work
only considers resource utilization inside an SM, but
not across SMs. In contrast, we consider sharing among
SMs at a larger granularity. We also consider resources
such as NoC, L1 sharing, and coalescing among SMs,
which have never been explored by prior work.
In our proposed approach, we first observe the appli-
cation’s scalability with SM resources such as network
and memory. If we detect that the application works
better with scale up cores, we fuse two small SMs into
one big SM. However, such a scheme fuses all SMs stati-
cally and may not flexibly adapt to a program’s dynamic
divergence. For example, some control and memory di-
vergence between the threads inside a warp may cause
long stalls in the fused SM since the pipeline is much
wider now. Based on this observation, we propose to dy-
namically split the scaled-up SM into two smaller SMs
to handle the control and memory divergence within a
warp. Once we detect that the divergence no longer
exists, we fuse the two SMs back to one.
Since the fused SM already consists of two sets of
execution paths, there is no extra hardware needed to
support slicing, as opposed to the prior work [23]. Note
that we dynamically split and fuse
SMs independently in this scheme. Fusing and splitting
decisions are made based on the current warp’s running
status, locally on each SM. As a result, when using our
approach, at any given time during execution, the GPU
architecture can have two types of SMs: some (fused)
big SMs and some (split) small SMs. Through this type of
dynamic heterogeneity, we are able to further improve
resource utilization and achieve better performance over
the state of the art.
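The per-SM split and fuse behavior described in this section can be pictured as a small state machine that each SM pair runs independently. The sketch below is only an illustration: the divergence measure, the 0.5 split threshold, and the function names are assumptions made here, not the decision logic of the proposed hardware.

```python
from enum import Enum


class Mode(Enum):
    FUSED = "fused"
    SPLIT = "split"


def next_mode(mode, divergent_warp_fraction, split_threshold=0.5):
    """Decide the next mode of one SM pair from locally observed divergence."""
    if mode is Mode.FUSED and divergent_warp_fraction > split_threshold:
        return Mode.SPLIT          # significant divergence: split the fused SM
    if mode is Mode.SPLIT and divergent_warp_fraction == 0.0:
        return Mode.FUSED          # divergence has disappeared: fuse back
    return mode


# Each pair decides on its own, so the GPU can hold both kinds of SMs at once.
modes = [Mode.FUSED, Mode.FUSED, Mode.SPLIT]
observed = [0.8, 0.1, 0.0]
modes = [next_mode(m, d) for m, d in zip(modes, observed)]
print(modes)
```

Because each pair evaluates only its own warps, the resulting mix of fused and split SMs is what produces the dynamic heterogeneity discussed above.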
3.3 Reconfiguration Overhead
Usually, reconfigurable architectures involve redesigning
micro-architecture units, and this may lead to significant
overhead if not handled carefully. For this
reason, there have not been many reconfigurable CPU
architectures proposed in the past. However, in the case
of GPUs, the reconfiguration overhead can be much
lower. This is because GPU SMs have much simpler
structure and control logic, compared to general out-
of-order CPU cores. Specifically, a GPU has a very
simple in-order pipeline which reduces the reconfigu-
ration complexity. In addition, GPUs are designed to
hide memory latency by overlapping the execution of
a large number of threads. As a result, delays caused
by reconfiguration can be conveniently masked. This
makes GPUs excellent candidates for reconfigurable ar-
chitectures. Reconfiguration overhead also heavily de-
pends on the granularity at which reconfiguration takes
place. In this work, we propose a coarse-grained recon-
figurable architecture based on SMs, which can further
reduce design complexity and overhead. Specifically, we
only reconfigure GPUs at an SM level without modify-
ing pipeline structures. We only modify a few managed
resources such as warp queues, L1 cache, and register
files. Therefore, the proposed GPU architecture is very
amenable to reconfiguration.
4 Dynamic SM Scaling through Reconfigura-
tion
The goal of our design is to reduce resource under-
utilization and also improve performance. To reduce
the design complexity, we opt for coarse-grain reconfig-
uration. Since it has been shown that individual kernels
exhibit regular behavior, we propose a one-time recon-
figuration scheme on a kernel-by-kernel basis. Once a
kernel is determined to benefit from scale up SMs, we
fuse every two neighboring SMs to create scale up SMs.
Otherwise, we continue executing the kernel using scale
out SMs.
Figure 7: Reconfiguration controller overview.
Our method is basically a top-down approach:
we first characterize the kernel’s overall scaling behavior
regarding overall GPU resource utilization and then
make a decision regarding whether to fuse or not. Based
on this static fusion scheme, we also propose to refine
the mechanism by allowing individual fused SMs to split
dynamically if warps exhibit significant divergence in
the fused SM.
4.1 Online Reconfiguration Controller
A high level view of our reconfiguration controller is
shown in Figure 7. Profiling has been employed by
many resource utilization techniques to determine an
application’s characteristics [16, 24, 9]. In this work,
we propose to combine online profiling with an offline
trained model to predict scalability. When a new ker-
nel starts, we first evaluate various metrics regarding
its execution. Then, these metrics are fed into a scal-
ability predictor which is already trained offline. The
scalability predictor gives a result indicating whether
the kernel should be executed on scale up or scale out
SMs. Next, we reconfigure the SMs according to this
result and start executing the kernel. After the kernel
finishes, we start the loop again for the next kernel.
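The per-kernel control loop described above can be sketched as follows. The four hooks (`sample_metrics`, `predict_scale_up`, `reconfigure`, `execute`) are hypothetical stand-ins introduced here for illustration; in the proposed design these steps are carried out by hardware, not software.

```python
def run_kernels(kernels, sample_metrics, predict_scale_up, reconfigure, execute):
    """Profile, predict, reconfigure, and run each kernel in turn."""
    decisions = []
    for kernel in kernels:
        metrics = sample_metrics(kernel)       # online profiling of the new kernel
        scale_up = predict_scale_up(metrics)   # offline-trained predictor
        reconfigure("scale-up" if scale_up else "scale-out")
        execute(kernel)                        # run the kernel to completion
        decisions.append((kernel, scale_up))
    return decisions


# Usage with stand-in hooks that only record what happened.
log = []
decisions = run_kernels(
    ["k0", "k1"],
    sample_metrics=lambda kernel: {"noc_latency": 1.0 if kernel == "k0" else 0.1},
    predict_scale_up=lambda metrics: metrics["noc_latency"] > 0.5,
    reconfigure=log.append,
    execute=lambda kernel: log.append("run " + kernel),
)
print(decisions)
print(log)
```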
4.1.1 Online Scalability Sampling
It has been shown that kernels exhibit disparate behav-
ior with SM scalability and resource utilization [12, 34].
Therefore, we cannot profile a single kernel to predict the
behavior of an entire application. Recall, however, that
each kernel is split into smaller blocks, called CTAs,
that execute similar portions of the code. We found
that the CTAs exhibit very consistent behavior, which
closely follows the scalability trend at the kernel granularity.
Figure 8 shows how CTAs follow the same scaling
trend as their kernel, using applications LIB and
RAY. As can be observed, both the kernel and CTAs of
RAY show a scale up trend, whereas LIB kernel and its
CTAs exhibit a scale out trend. Therefore, we propose
to use a CTA to predict the scaling behavior of a kernel.
4.1.2 Scalability Metrics
To profile an application’s scalability with respect to the SM
size and number, we need to identify metrics that can
influence the scalability. Following are the major met-
rics we considered in this work:
(1) NoC throughput: This metric reflects the application’s
“communication intensity”. If the NoC is a bottleneck,
choosing scaled up cores will improve performance
because the SM count would be smaller and the network
size would accordingly be smaller, resulting in each core
having a higher network throughput.
(2) Average NoC latency: This is the average latency
of the packets. It can also be used to evaluate the
communication intensity.
(3) Coalescing rate: The coalescing rate is calculated as
the number of actual memory accesses sent out from each
SM divided by the total number of memory accesses in the
instructions. This metric reflects how much shared data
are requested across warps in an SM.
(4) L1 cache miss rate: This reflects an application’s
demand for local memory. If the miss rate is high and the
data is not streaming, allocating a larger L1 will improve
performance, which means scale-up SMs are expected to
perform better.
(5) MSHR rate: This metric is similar to the coalescing
rate, but it is measured across different instructions.
Scale up SMs will have more instructions in flight,
and this will benefit applications with higher MSHR rates.
(6) Inactive thread rate: This reflects warp control
divergence. It is calculated as the number of cycles threads
spend idling due to control instructions, divided by the
total execution cycles. Kernels with larger control
divergence would favor scale-out SMs.
Figure 8: Kernel and CTA scalability consistency.
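As a minimal sketch of how the ratio-style metrics above could be derived from raw event counts, consider the following. The counter names are hypothetical, and the L1 miss rate is taken to be misses divided by accesses, which is the conventional definition and not one spelled out in the text.

```python
from dataclasses import dataclass


@dataclass
class SampleCounters:
    """Hypothetical event counts gathered while a sampled CTA executes."""
    memory_accesses_requested: int   # accesses named in the instructions
    memory_accesses_issued: int      # accesses actually sent out of the SM
    l1_accesses: int
    l1_misses: int
    idle_cycles_control: int         # cycles threads idle on control instructions
    total_cycles: int


def coalescing_rate(c):
    return c.memory_accesses_issued / c.memory_accesses_requested


def l1_miss_rate(c):
    return c.l1_misses / c.l1_accesses


def inactive_thread_rate(c):
    return c.idle_cycles_control / c.total_cycles


sample = SampleCounters(1000, 250, 500, 100, 300, 1000)
print(coalescing_rate(sample), l1_miss_rate(sample), inactive_thread_rate(sample))
```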
4.1.3 Scalability Predictor
In this work, we propose to use binary logistic regres-
sion, which is a machine learning technique borrowed
from the field of statistics, to predict scalability. Our
model accepts several input parameters and generates a
binary output indicating whether an application needs
to be run with scaled up SMs or scaled out SMs.
Since we only fuse two neighboring SMs to build a scale
up core, we only need a simple regression based model
to predict scalability. The output of the model is simply
binary: yes or no to scale up.
Binary logistic regression estimates the probability
that a characteristic is present (e.g., estimating the
probability of “success”), given the values of explanatory
variables. Unlike the normal distribution, the mean and
variance of the Binomial distribution are not indepen-
dent. Specifically, the mean is denoted by P and the
variance is denoted by P ∗ (1 − P )/n, where n is the
number of observations, and P is the probability of the
event occurring (i.e. whether we need to reconfigure
smaller SMs into bigger SMs) in any one trial. If we
were considering the data in a list rather than a table
form, we would assume that the variable had a mean P ,
and a variance P ∗(1−P ), and this variable would have
a Bernoulli distribution. When we have a proportion
as a response, we use a logistic or logit transformation
to link the dependent variable to the set of explanatory
variables. The logit link has the form:
$$\mathrm{Logit}(P) = \log\left[\frac{P}{1-P}\right]. \tag{1}$$
The term within the square brackets is the odds of an
event occurring. In our case, it indicates whether we
need to configure bigger cores. Using the logit scale
maps a proportion from the interval (0, 1) onto the range
from minus infinity to plus infinity, with Logit(P) = 0 when P =
0.5. When we transform our results back from the logit
(log odds) scale to the original probability scale, our
predicted values will always be at least 0 and at most
1. If there is only one input x, then we can write the
model as:
$$P = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}, \tag{2}$$
where P is the predicted output, b0 is the bias or intercept
term, and b1 is the coefficient for the single input
value (x). We can write the model in terms of odds as:
$$\frac{P}{1-P} = e^{b_0 + b_1 x}. \tag{3}$$
Conversely, the probability of the outcome not occur-
ring is
$$1 - P = \frac{1}{1 + e^{b_0 + b_1 x}}. \tag{4}$$
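For readers who want the intermediate algebra, the odds form and the complement both follow directly from the single-input model for P. Writing z = b_0 + b_1 x as shorthand (a symbol introduced here only for brevity):

```latex
\begin{aligned}
1 - P &= 1 - \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{z}}, \\
\frac{P}{1 - P} &= \frac{e^{z}}{1 + e^{z}} \cdot \left(1 + e^{z}\right) = e^{z}, \\
\log\frac{P}{1 - P} &= z = b_0 + b_1 x .
\end{aligned}
```

The last line shows that the log odds are linear in the input, which is what the multi-input form generalizes.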
For an event with multiple input factors, the modeled
logarithm of the odds is given by:
$$\log\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \text{constant}, \tag{5}$$
where P indicates the probability of an event (e.g., the
chance to scale up by fusing SMs in our case), and bi
are the regression coefficients associated with the reference
group and the xi explanatory variables. We train
this binary logistic model using a large amount of offline
experimental data to obtain the values of b0 through bn. We
then use this model to directly infer the fusing decision
online. Since the model is in fact linear, its implemen-
tation overhead is quite low. We give more details of
the overheads in later sections.
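A minimal sketch of the online inference step is shown below. The coefficient values, the metric ordering, the assumption that metrics are normalized to comparable ranges, and the 0.5 decision threshold are all placeholders invented for illustration; they are not trained values from this work.

```python
import math

# Order of the six input metrics (names are illustrative).
METRICS = (
    "noc_throughput", "noc_latency", "coalescing_rate",
    "l1_miss_rate", "mshr_rate", "inactive_thread_rate",
)

# Hypothetical coefficients; real values would come from offline training.
B0 = -1.0
B = (0.5, 1.2, -0.8, 0.9, 0.6, -2.0)


def log_odds(x, b0=B0, b=B):
    """Linear part of the model: b0 + b1*x1 + ... + bn*xn."""
    return b0 + sum(bi * xi for bi, xi in zip(b, x))


def probability(z):
    """Map log odds back to a probability: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


def should_scale_up(x, threshold=0.5):
    return probability(log_odds(x)) >= threshold


memory_bound = (0.2, 1.0, 0.4, 0.5, 0.3, 0.1)
divergent = (0.2, 1.0, 0.4, 0.5, 0.3, 0.9)
print(should_scale_up(memory_bound), should_scale_up(divergent))
```

With a 0.5 threshold, the decision is equivalent to checking whether the linear sum is non-negative, so the exponential never has to be evaluated at run time. This is one way to read the remark that the model's implementation overhead is low.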
4.2 Design of the Reconfigurable Architecture
The goal of AMOEBA is to create a GPU architecture
that can dynamically change the number and size of its
SMs, based on run-time workload behavior. We propose
to start with a “baseline” scale out machine and fuse
neighboring SMs into a bigger SM if the application is
found to perform better with scaling up. Note that we
allow fusing only two neighboring SMs. This is due to
the following considerations: (1) Our scale out SM has
32 SIMD units and a scaled up SM will have 64 SIMD
units when two SMs are fused. Fusing more SMs would
significantly increase the pipeline width and the prob-
ability of pipeline stalls. In the future, if the scale out
SM gets even smaller, for example, with 16 SIMD units,
then fusing 4 such SMs together will be a more viable
option. Note that our techniques can be easily extended
to fusing more SMs to scale up. (2) Because fused SMs
share resources such as L1 cache, register files, and warp
schedulers, fusing more SMs means increased commu-
nication latency and implementation complexity. For
example, a larger L1 cache will need a longer access
time which will compromise the potential benefit from
the SM fusion. Due to these reasons, we only consider
fusing two neighboring SMs in this paper.
Figure 9: SM reconfiguration via fusion.
Figure 9 shows how two scaled out SMs are fused to
create a scaled up SM. The dashed lines show the fused
units of the two SMs, placed to ensure that they can
work in a lockstep fashion as one SM. The shaded com-
ponents in SM1 are disabled due to SM fusion. In the
fused SM, instructions are first fetched from the fused
L1 I-cache ( 1 ). Then, the instructions are decoded,
and selected instructions are sent to the per warp I-
buffers ( 2 ). Next, the control logic ( 3 ) decides which
instruction to issue and the decision is sent to the issue
unit ( 4 ). Selected warps are then sent to the datapath
of both SM0 ( 5 ) and SM1 ( 6 ) for execution. Mem-
ory accesses are sent from the executing threads to the
fused memory unit ( 7 ).
In Figure 9, there are two baseline SMs, shown as
SM0 and SM1. AMOEBA does not change the execu-
tion units such as SP or SFU. When fused, the regis-
ter files of the two original SMs and score boards work
independently, as in the baseline. AMOEBA does not
change register files, and since the register files are allo-
cated with warps, they are not fused but can be accessed
independently. Thus, there is no change in the through-
put of any individual register file. Similarly, the score
board connection with each register file is not modified
either. However, the connection of the score board in
SM1 to the warp scheduler is removed when two SMs
are fused ( 3 ). Instead, this score board is connected
to the warp scheduler of SM0. This is because when we
fuse two SMs, only one warp scheduler is kept, and it
schedules all warps on both the SMs ( 2 ).
The memory components of the two SMs need to
be fused, and this includes the shared memories, L1
I caches, L1 D caches, and L1 constant cache. We fuse
L1 caches by increasing the cache associativity. To re-
duce the new L1 cache access latency, the SM layout
needs to be modified as shown in Figure 9, so that the
L1 caches of both SMs are placed next to each other
( 1 , 7 ). Since the GPUs are good at hiding memory
access latencies through overlapped warp execution, the
extra delay caused by accessing a larger L1 D cache can
be hidden by warp computation. In our experiments,
we conservatively added one extra cycle in L1 cache ac-
cess due to the cache fusion. Our results show that
this extra delay is hidden quite well by the overlapped
computation.
Each fused SM has a single coalescing unit, created by
fusing the two coalescing units of the original SMs.
Since the warp size is doubled, this leads
to more chances for coalesced memory accesses. After
fusing the two SMs, AMOEBA combines the NoC
routers of the two SMs into one by disabling one SM’s
router. This is implemented by adding a bypass path
in the disabled router. As a result, the network size
is reduced, which significantly reduces the network
latency, and consequently, each router can enjoy a higher
throughput in the network.
Figure 10: Mechanism for switching between
fusing and splitting.
Figure 11: Algorithm to dynamically split a
fused SM to accommodate warp heterogeneity.
4.3 Enabling Heterogeneous SMs through Dy-
namic SM Fusing and Splitting
We propose to fuse SMs to reconfigure the GPU as a
scaled up architecture when we observe that fusing the
resources of two SMs is beneficial from a performance
angle. It needs to be noted that our approach is dif-
ferent from prior works such as variable warps [30] or
warp subdivision [33]. Those works only consider the
resources inside an SM and try to fully utilize them –
there is no cross-SM resource utilization optimization
in those prior studies. Our proposed architecture, on
the other hand, takes into account cross-SM resource
utilization, such as NoC resources, L1 cache sharing be-
tween SMs, and memory access coalescing across SMs.
As a result, it is fundamentally different from the earlier
works.
However, there are still opportunities to further improve
resource utilization in AMOEBA. Specifically, when
we fuse two SMs, there can be scenarios where warp
heterogeneity can cause inefficient pipeline utilization.
For example, even though fusing two SMs can bring
benefits in cache access or NoC, the resulting larger
warp size creates wider pipelines. In this case, diver-
gence in memory or control behavior in warps could lead
to more pipeline stalls, compared to the unfused SMs.
Therefore, we propose a dynamic SM splitting strategy:
when we observe significant warp divergence, and the wide
pipeline causes a performance degradation that outweighs
the benefits from fusion, we split the fused SM
into two separate SMs. In this way, each split SM has
half the pipeline width and the warps that cause diver-
gence can only cause stalls in one of the smaller SMs.
The other SM can keep the computation without being
delayed by the pipeline stalls.
We can have different policies to decide when to split
a given “fused” SM into two independent ones. Note
that, by “independent”, we mean that the two SMs are
running different warps independently on their respective
data paths. However, to reduce the cost of hardware
and context switch, we do not split the shared resources,
such as L1 cache, register files, and NoC interface. We
set up a threshold to decide when to split, which is a
fixed ratio of divergent warps to the total warps running
in the large SM. If the current ratio is greater than the
threshold, we decide to split the SM into two. This fig-
ure also shows how NoC interfaces are bypassed when
two SMs are fused together.
After the SM splits, we move all divergent warps from
the bin to a new SM created from the split. Subse-
quently, the two SMs start the independent execution
of their warps. When the second SM finishes all diver-
gent warps, we re-fuse the two SMs into one. Then,
we start the procedure to collect divergent warps again
and split the SMs when necessary. Thus, this proce-
dure of splitting and fusing is dynamically decided by
the divergence of warps. This mechanism is expected
to maximize resource utilization and reduce stalls in the
fused SMs.
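The split and re-fuse cycle described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the hardware implementation: the class name `FusedSM`, its method names, and the threshold value of 0.25 are assumptions introduced here, since the text does not state the actual threshold.

```python
from dataclasses import dataclass, field

# Hypothetical value: the paper only says the threshold is a fixed ratio.
SPLIT_THRESHOLD = 0.25


@dataclass
class FusedSM:
    """Toy model of one fused SM and its bin of divergent warps."""
    total_warps: int
    divergent_bin: list = field(default_factory=list)
    is_split: bool = False

    def record_divergent(self, warp_id):
        # Collect each divergent warp once in the bin.
        if warp_id not in self.divergent_bin:
            self.divergent_bin.append(warp_id)

    def should_split(self, threshold=SPLIT_THRESHOLD):
        # Split only when the ratio of divergent warps to all warps
        # running on the fused SM exceeds the threshold.
        if self.is_split or self.total_warps == 0:
            return False
        return len(self.divergent_bin) / self.total_warps > threshold

    def split(self):
        # Move all binned divergent warps to the second SM.
        self.is_split = True
        moved = list(self.divergent_bin)
        self.divergent_bin.clear()
        return moved

    def re_fuse(self):
        # Called once the second SM has finished its divergent warps.
        self.is_split = False
```

A caller would invoke `record_divergent` as divergence is detected, check `should_split` periodically, and call `re_fuse` when the second SM drains, after which collection starts again.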
The idea behind splitting is to prevent divergent warps
from causing pipeline stalls. So, we need to separate di-
vergent warps and non-divergent warps into two clusters
and run each cluster on a separate smaller SM, so that
the slow warps do not cause the fast warps to delay.
Suppose that we have split a scale up SM into 2 scale
out SMs (SM 0 and SM 1), and then, we want to run
fast warps on SM 0 and slow warps on SM 1. There
can be different mechanisms that can be used to decide
which warps are moved to SM 1. In this
work, we investigated two methods: (1) direct split,
and (2) warp regrouping. The direct split method is
simple, as it directly divides a divergent warp in the middle
into 2 smaller warps. Then, both the smaller warps
are moved to SM 1. This method has a low cost but
may not have optimal performance. This is because the
slow threads in a divergent warp may be located in different
positions. If we simply cut the warp in half, there
can be varying combinations of resulting warps. For example,
we can have one warp with all fast threads and
one warp with all slow threads. Or, we can have both
smaller warps with partially slow threads. The ideal
case is the first splitting, since we can better remove
negative effects of the slow threads on the fast ones.
Figure 12: Performance results.
Based on this analysis, we propose a second method
that regroups threads into a fast warp and a slow warp.
We then move the slow warp to SM 1 and keep the fast
warp in SM 0. To accomplish this, we first divide the
threads in the original warp into small groups, and label
them as “fast” or “slow” based on divergence. Then, we
regroup them into two warps so that the slowest groups
are all put into a slow warp and moved to SM 1. In our
design, we also periodically check the stalls in the slow
SMs. We periodically move some fast warps to them so
that the resources are not wasted when the slow warps
cause stalls.
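The difference between the two methods can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a warp is modeled as a list of thread ids, `slow` is a hypothetical set of thread ids known to be slow, the group size of 4 is arbitrary, and groups are ranked by how many slow threads they contain. The paper does not specify these details.

```python
def direct_split(warp):
    """Cut a warp in the middle into two smaller warps."""
    mid = len(warp) // 2
    return warp[:mid], warp[mid:]


def regroup(warp, slow, group_size=4):
    """Regroup threads into a fast warp and a slow warp.

    Threads are divided into small groups, groups are ranked by the
    number of slow threads they contain (an assumed labeling rule),
    and the slowest half of the groups forms the slow warp.
    """
    groups = [warp[i:i + group_size] for i in range(0, len(warp), group_size)]
    ranked = sorted(groups, key=lambda g: sum(t in slow for t in g))
    half = len(ranked) // 2
    fast_warp = [t for g in ranked[:half] for t in g]
    slow_warp = [t for g in ranked[half:] for t in g]
    return fast_warp, slow_warp


# Example: slow threads sit in two separate regions of a 16-thread warp.
warp = list(range(16))
slow = {0, 1, 2, 3, 8, 9, 10, 11}
halves = direct_split(warp)            # both halves contain slow threads
fast_warp, slow_warp = regroup(warp, slow)  # slow threads are isolated
```

In this example the direct split leaves four slow threads in each half, while regrouping places all eight slow threads in the slow warp, which is the "ideal case" described above.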
The hardware overhead of the splitting is low because
the split SMs were anyway two independent SMs in the
baseline architecture. We added hardware to fuse them
as described earlier, and splitting them does not need
extra hardware, except the management and storage of
the divergent warps. Therefore, we need a new warp
queue and some simple control logic. Compared to the
prior works [30, 33] that proposed splitting resources
inside one SM, our overhead is very low. Figure 10 and
Figure 11 show the timing and algorithm of our dynamic
splitting and fusing.
Table 1: System configuration. See GPGPU-Sim v3.2.2 [35] for the full list.
Parameter                        Value
Number of Computing Cores        48 cores
Number of Memory Controllers     8
MSHR per Core                    64
Warp Size                        32
SIMD Pipeline Width              8
Number of Threads per Core       1024
Number of CTAs/Core              8
Constant Cache Size/Core         8KB
Texture Cache Size/Core          8KB
L1 Cache Size/Core               16KB
L2 Cache Size/Core               128KB
Number of Registers/Core         16384
Warp Scheduler                   Greedy-Then-Oldest
Shared Memory                    48 KB
Memory Scheduler                 FR-FCFS
Memory Model                     8 MCs, 924 MHz
NoC Channel Width                128 bit
NoC Topology                     mesh
NoC Router Pipeline Stage        2
Figure 13: Control divergence caused stalls.
5 Experimental Evaluation
We simulate our baseline architecture using a cycle-level
simulator (GPGPU-Sim [36]) and faithfully model all
key parameters (Table 1). The baseline GPU consists
of 48 scale out SMs with a warp size of 32. There are
8 memory controllers on the chip. The interconnection
network is a mesh-based NoC. There are two subnets
to avoid deadlock between request and reply messages.
The router has a pipeline with 2 stages. When we perform
reconfiguration, two baseline SMs are fused to create
one scale up SM. We use a wide range of GPU applications
from ISPASS [37], Rodinia [38], Polybench [39]
and Mars [40], to evaluate our design, and execute all
applications to completion. We report performance re-
sults using the geometric mean of IPC speedup (over the
baseline GPU). We also report other evaluation metrics
provided by the simulator such as L1 cache miss rate,
NoC latency, network injection rate, and SM idle rate.
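As a reminder of how the headline metric is formed, the geometric mean of per-benchmark IPC speedups can be computed as in the sketch below. The function name and the example IPC values are made up for illustration; they are not measurements from the paper.

```python
import math


def geomean_speedup(ipc_new, ipc_base):
    """Geometric mean of per-benchmark IPC speedups over the baseline."""
    ratios = [ipc_new[name] / ipc_base[name] for name in ipc_base]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up IPC values for two hypothetical benchmarks A and B.
base = {"A": 1.0, "B": 2.0}
new = {"A": 2.0, "B": 16.0}
speedup = geomean_speedup(new, base)  # speedups of 2x and 8x average to about 4x
```

Unlike the arithmetic mean, the geometric mean is not dominated by a single benchmark with a very large speedup, which matters when one workload improves by several times and others change little.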
5.1 Performance
5.1.1 Performance Impact
Figure 12 illustrates the performance gains when using
AMOEBA. The baseline is a scale out architecture and
we also experiment with direct scale up. We present the
performance of applying three techniques proposed by
AMOEBA. Static fuse configures the SMs only once, before
a kernel’s execution: using the prediction model,
AMOEBA predicts the application’s scalability with SM
size and chooses whether or not to fuse two SMs. The next two
techniques are based on the dynamic heterogeneous SM
scaling. Direct split simply divides a divergent warp
into smaller ones in the middle, whereas warp regrouping
employs a more complicated technique to re-organize
threads into a fast warp and a slow warp. As can be observed,
the benchmark SM achieves the highest improvement in
performance, by 4.25 times. MUM also achieves a significant
performance improvement of 2.11 times. On average,
the 12 benchmarks show around a 47% increase in IPC.
For applications that can benefit from larger SMs,
static fuse achieves almost the same performance
gain as direct scale up. However, some benchmarks prefer
scale out configurations, such as 3MM and ATAX.
Our fusing techniques all perform better than direct
scale up (about 10%) for these workloads. This shows
that AMOEBA can accurately predict the applications’
scalability and the correct reconfiguration can lead to
performance gains. Some workloads are not sensitive
to scaling such as FWT and KM and all AMOEBA
techniques perform similarly to the baseline. In general,
direct split and static fuse bring similar benefits
(on average) for most workloads, except BFS and SM.
Some workloads such as WP even experience performance
degradation, which is mainly due to the fusion
overhead. This is because this technique cannot dynamically
react to workload behavior changes. On the contrary,
warp regrouping achieves a 16% performance gain
over direct split because it can accurately capture a
workload’s dynamic behavior caused by divergence.
Figure 14: L1-I cache miss rate.
Figure 15: L1-D cache miss rate.
5.1.2 Control Stall Impact
Figure 13 plots the SM inactive rate caused by con-
trol divergence which is defined as the fraction of cycles
that SMs are stalled due to control instructions. We
can observe that only part of the workloads suffer from
stalls caused by control divergence. For workloads that
have stalls caused by control divergence, the dynamic fusion schemes
perform better than direct scale up and static fusing
because they can dynamically adjust to the changes in
control divergence. Warp regrouping performs better in
more cases than direct split because fast and slow warps
are allocated to different SMs. Among all cases, the
baseline scale out configuration has the least amount of
stalls because its pipeline width is always smaller than
the other configurations.
5.1.3 Memory Access Impact
L1-I cache miss rate is plotted in Figure 14. Some
benchmarks such as FWT, 3MM, and ATAX are not
sensitive to L1-I cache capacity and fusing does not lead
to any change in their behavior. However, most bench-
marks have their miss rates reduced and the average
reduction is 9%, 20% and 30% for the three AMOEBA
schemes. Sharing L1-I cache through SM fusion reduces
the I cache misses and thus leads to improved perfor-
mance. Figure 15 plots the miss rate of L1-D cache. The
most significant reduction is for SM and its miss rate is
reduced by more than 70%. This is because the shar-
ing of L1 cache increases its effective capacity and this
change directly leads to 4.25 times improvement in per-
formance. Some benchmarks, such as BFS and MUM,
experience increased L1-D cache miss rates. This is be-
cause warp regrouping changes data locality by moving
warps between SMs and this leads to higher miss rates.
Impact of AMOEBA on memory accesses is plotted in
Figure 16. As can be observed, all benchmarks achieve
reduced actual memory access rates compared to the
baseline. Actual memory access rate is calculated as
the actual memory access count divided by the total
number of memory accesses in the instructions. Since
AMOEBA allows SMs to share coalescing units, the ac-
tual number of loads and stores is greatly reduced.
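The actual memory access rate defined above can be illustrated with a simplified coalescing model in Python. The model is an assumption made for this sketch: accesses issued by one warp instruction that fall into the same cache line are merged into a single access, and the 128-byte line size is a hypothetical parameter. It is not the simulator's coalescing logic.

```python
def actual_access_rate(instructions, line_bytes=128):
    """Actual accesses divided by the accesses requested by the instructions.

    `instructions` is a list of warp-level memory instructions, each given
    as the list of byte addresses touched by its threads.
    """
    requested = sum(len(addrs) for addrs in instructions)
    if requested == 0:
        return 0.0
    # Assumed model: one actual access per distinct cache line per instruction.
    issued = sum(len({a // line_bytes for a in addrs}) for addrs in instructions)
    return issued / requested


# 64 threads reading consecutive 2-byte elements (128 bytes in total).
two_small_warps = [[2 * t for t in range(32)], [2 * t for t in range(32, 64)]]
one_fused_warp = [[2 * t for t in range(64)]]

rate_split = actual_access_rate(two_small_warps)  # 2 accesses / 64 = 0.03125
rate_fused = actual_access_rate(one_fused_warp)   # 1 access / 64 = 0.015625
```

In this toy case the two 32-thread warps each issue one access to the same line, while the fused 64-thread warp needs only one, which is the kind of cross-SM coalescing gain the text refers to.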
Figure 16: Actual memory access.
Figure 17: Normalized rate of stalls when MCs
cannot inject to the NoC.
5.1.4 NoC Impact
Figure 17 plots the normalized ICNT stall rate, which
is defined as the rate of stalls when new reply packets
cannot be generated because an MC’s injection queues
are full. This data can reflect the pressure on both
NoC and memory controllers. As can be observed from
this figure, all AMOEBA schemes are able to reduce
this stall rate. For some benchmarks, such as CORR
and COVR, this stall time is totally removed. Since
AMOEBA can fuse SMs and bypass some routers, the
network size is reduced, and this leads to smaller hop
counts. As a result, NoC bottleneck can be greatly re-
lieved for communication-intensive applications. Fig-
ure 18 shows the average network data injection rates
for the SM configurations evaluated. As can be observed
from this plot, all benchmarks have a higher injection
rate under AMOEBA than under the baseline. This is
because we fuse SMs and use only one NoC network in-
terface to inject packets. Even though the injection rate
is higher in AMOEBA schemes, the network size is re-
duced by half and this leads to shorter communication
delays, paving the way to achieve better performance.
5.2 Dynamics of Core Fusion and Splitting
To observe the dynamics of switching between fusing
and splitting, we studied the status of five SMs in bench-
mark RAY. The results are shown in Figure 19. As
shown in this figure, all 5 SMs start with fused execution
because this benchmark favors scale up SMs. After
a period of time, the SMs start to split because enough
divergent warps have been detected and smaller SMs
bring more benefit. However, the switching between
fusing and splitting of each SM is independent of the
others. As a result, at a given time, both
scale up and scale out SMs can exist in the architecture.
This flexible heterogeneity in SM configurations provided by
AMOEBA leads to better performance.
5.3 Analysis of Scalability Prediction Model
We use several performance counters to generate the
detailed metrics required by our scalability prediction
model. Most of these performance counters are already
included in many of today’s GPU systems, including
cache hits and misses, MSHR, and branch instruction
statistics. For metrics that cannot currently be provided by
the performance counters, we propose to add such counters,
e.g., concurrent CTA numbers. Table 2 shows the
coefficients in our scalability prediction model.
Figure 18: NoC injection rate.
Figure 19: Phases of dynamic SM fusion and
splitting.
To analyze the relative contribution of each metric
to overall performance in the prediction model, we plot
the distributed weights of the major metrics. Here, we
consolidate different types of L1 cache miss rate into
one metric called L1 miss rate. The result is shown in
Figure 20. For each metric, its magnitude of impact is
shown as a value between -1 and 1. The magnitude of impact
of a metric is calculated as the coefficient of this
metric × its measured value. For example, the impact magnitude
of the load instruction metric is calculated as the load instruction rate
× its coefficient. All positive impact magnitudes con-
tribute to a scaling up decision, and all negative impact
magnitudes contribute to a scaling out decision. Even-
tually we add all metrics’ impact magnitudes together
and check the sum. If the result is positive, then we
predict to fuse SMs and create a scale up configuration.
Otherwise, we predict that a scale out configuration will
fit better with the application. In this figure, the sum
of the impact magnitudes for BFS and RAY are both
positive. So, these benchmarks favor running on scale
up SMs. On the contrary, CP and PR prefer to run on
scale out SMs. It can also be observed that different applications’
scalability is influenced by different metrics
to varying extents. For example, MSHR plays a more
significant role in BFS and CP, whereas PR and RAY
are more sensitive to the NoC performance compared
to others.
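The decision rule described in this section can be sketched in Python using the coefficients listed in Table 2. Several things here are assumptions made for illustration: the dictionary keys, the function names, the treatment of the Table 2 constant as an intercept added to the sum, and the example metric values, which are invented. The normalization that maps impact magnitudes into the range -1 to 1 for plotting is not reproduced.

```python
# Coefficients copied from Table 2; the key names are chosen for this sketch.
COEFFS = {
    "concurrent_cta": 1.414,
    "control_divergent": 444.628,
    "coalescing": 2057.050,
    "l1d_miss_rate": -313.838,
    "l1i_miss_rate": 1674.513,
    "l1c_miss_rate": -67.277,
    "mshr": -102.971,
    "load_inst_rate": -680.786,
    "store_inst_rate": -804.7,
    "noc": -8.301,
}
CONSTANT = -73.635  # assumed to act as the intercept of the model


def impact_magnitudes(metrics):
    """Impact magnitude of each metric: coefficient x measured value."""
    return {name: COEFFS[name] * value for name, value in metrics.items()}


def predict_scale_up(metrics):
    """Return True to fuse SMs (scale up), False to stay scaled out."""
    total = CONSTANT + sum(impact_magnitudes(metrics).values())
    return total > 0


# Invented metric values, only to exercise the rule.
fuse = predict_scale_up({"coalescing": 0.1, "l1d_miss_rate": 0.2})  # sum > 0
keep = predict_scale_up({"load_inst_rate": 0.1})                    # sum < 0
```

Positive terms push the sum toward a scale up decision and negative terms toward scale out, matching the sign convention described in the text.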
5.4 Comparing with State of the Art
We now compare the performance of AMOEBA against
Dynamic Warp Subdivision (DWS) [33]. The results
are plotted in Figure 21. DWS was proposed by Meng
et al. to divide warps into smaller ones in order to
reduce the stalls caused by memory and control diver-
gence. On average, AMOEBA achieves 27% perfor-
mance gain over DWS. Benchmark SM achieves 3.97
times improvement in performance compared to DWS.
This is because DWS can only improve resource utiliza-
tion inside an SM and cannot harness the benefits of
cross-SM resource sharing. In contrast, AMOEBA can
dynamically change the configurations of SMs and thus
flexibly allow resources to be shared among SMs. Thus,
performance can be further improved through enhanced
resource utilization.
Table 2: Coefficients in scalability prediction model.
Metric              Coefficient    Metric              Coefficient
Constant            -73.635        Concurrent CTA      1.414
Control Divergent   444.628        Coalescing          2057.050
L1D Miss Rate       -313.838       L1I Miss Rate       1674.513
L1C Miss Rate       -67.277        MSHR                -102.971
Load Inst Rate      -680.786       Store Inst Rate     -804.7
NoC                 -8.301
Figure 20: Magnitude of parameter impact on
determining scalability for some applications using
the proposed predictor.
5.5 Area Overheads
There are two types of controllers in the proposed archi-
tecture: online reconfiguration controller for scale up or
scale out, and switch controller for dynamic fusing and
splitting. We propose to implement these controllers in
an IP module in the GPU chip. The major components
in the controllers are a MAC unit, buffers and control
logic. We employ methods similar to those proposed in [32] to
model the buffers in the controllers by using the area
of a latch cell from the NanGate 45 nm Open Cell library.
The resulting area of each bit of the buffer is 4.2
um2, and the total estimated added buffer area is 0.021
mm2. We use a pipelined Booth-Wallace MAC [41],
synthesized by Synopsys Design Compiler using 90
nm technology and scaled to 45 nm. The area of the
MAC is 0.019 mm2. Together with the control logic,
we estimate the two controllers to have an area of 1.52
mm2. For a GeForce 8800GTX, which has 128 SM cores, the
overall area overhead of AMOEBA can be calculated
as the total SM area overhead + controller overhead =
0.021 mm2 × 128 + 1.52 mm2 = 4.208 mm2. Compared
to the total GeForce 8800GTX area of 480 mm2,
AMOEBA incurs an area overhead of 0.88%.
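The area arithmetic above can be checked with a few lines of Python. The variable names are chosen for this sketch, and the controller area uses the 1.52 mm2 figure that appears in the calculation.

```python
# Figures taken from the text of this section.
PER_SM_BUFFER_MM2 = 0.021   # added buffer area per SM
NUM_SMS = 128               # SM cores in the GeForce 8800GTX example
CONTROLLERS_MM2 = 1.52      # the two controllers, as used in the calculation
DIE_MM2 = 480.0             # total GeForce 8800GTX area

total_overhead_mm2 = PER_SM_BUFFER_MM2 * NUM_SMS + CONTROLLERS_MM2
overhead_percent = 100.0 * total_overhead_mm2 / DIE_MM2

# total_overhead_mm2 is about 4.208 and overhead_percent is about 0.88
```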
6 Related Work
There has been plenty of work proposing reconfigurable
architectures for multi-core CPU systems [25, 26, 27,
24, 42]. A multicore architecture is proposed in [25]
that reconfigures cores into a wide VLIW machine to
exploit hybrid forms of parallelism. As a pioneer re-
configurable architecture, TRIPS [26] splits ultra-large
cores to small ones to meet the diverse demand of appli-
cation parallelism. Working in the opposite direction of
reconfiguring cores, Ipek et al. [24] proposed Core Fu-
sion where a large core can be dynamically configured
from a group of independent smaller cores. Core Fusion
is the most closely related work to AMOEBA, but it is
proposed for CPU cores and the core fusing policy and
micro architecture are very different from our work.
Figure 21: Comparison with Dynamic Warp
Subdivision (DWS) [33].
Compared to CPU based multicore systems, there
have been fewer works on reconfigurable GPU architectures.
Voitsechov et al. proposed SGMF, a dataflow architecture
using a coarse-grain reconfigurable fabric, composed
of a grid of interconnected functional units [13].
However, SGMF needs help from the compiler to break the
CUDA/OpenCL kernels into dataflow graphs and in-
tegrates the control flow of the original kernel to pro-
duce a control-data-flow-graph (CDFG). Different from
their work, our proposed scheme does not require com-
piler support. R-GPU is a reconfigurable GPU archi-
tecture that aims to reduce the cycles spent on data
movement and control instructions and focus on data-
computations [14]. It configures GPU cores to create
a spatial computing architecture. R-GPU implements
reconfiguration at a core level within an SM, and does
not consider an application’s scalability while our work
reconfigures at an SM level, and our reconfiguration de-
cision is based on NoC, control instructions and mem-
ory access patterns. Dhar et al. proposed fine grained
and coarse grained reconfigurations of SMs in GPUs in
order to reduce the underutilization of resources and
power consumption [15]. However, their work only re-
configures the datapath inside each SM. Our work also
reconfigures memory and NoC of the system, and we
also propose to use heterogeneous SMs to improve per-
formance and power efficiency.
Heterogeneous multicores have emerged as a promis-
ing approach for CPU-based systems which leverage
cores with different capabilities and complexities to strike
a balance between performance and power [9, 43, 44,
45, 46, 47, 48]. Lukefahr et al. propose composite
cores that consist of big and small compute engines
[9]. Kumar et al. [44] proposed a heterogeneous multi-
core architecture to reduce power dissipation. Hill et
al. showed that there is great potential in performance
improvement of the serial sections of an application us-
ing heterogeneous cores [43]. Our proposed AMOEBA
architecture differs from these heterogeneous architec-
tures in that our heterogeneous cores are dynamically
configurable while these earlier works employ fixed core
configurations. Our design can provide more flexibility
in exploring heterogeneous architectures, and achieve
better resource utilization.
Recently, several approaches have been proposed for
improving GPU resource utilization [12, 16, 17, 18, 49,
50]. Wang et al. propose Simultaneous Multikernel
(SMK) by exploiting heterogeneity of different kernels
[12]. Park et al. proposed GPU Maestro that per-
forms dynamic resource allocation for efficient utiliza-
tion of multitasking GPUs [16]. Wang et al. propose
an application-aware TLP management technique for
a multi-application execution environment in order to
make judicious use of shared resources [17]. To im-
prove resource utilization in concurrent kernel execu-
tion (CKE), Dai et al. proposed mechanisms to reduce
memory stalls [18]. Our proposed work is different from
these prior techniques because it reconfigures SMs so
that they scale according to the application’s dynamic
behaviour.
7 Conclusion
In this work, we propose a reconfigurable GPU archi-
tecture, called AMOEBA, to explore the design space of
GPU scaling. By predicting a given application’s scal-
ability with SM size, the proposed architecture is able
to dynamically configure scale up or scale out SMs in
order to achieve high performance and resource utiliza-
tion. We also propose an optimization strategy to fur-
ther reconfigure each SM based on the warp divergence
at run-time, resulting in a heterogeneous architecture
in which both scale up and scale out SMs co-exist at run-
time. Our evaluation results using various benchmark
programs demonstrate the effectiveness of AMOEBA in
reducing GPU resource under-utilization and improving
system performance and power efficiency.
8 References
[1] Green500 list.
https://www.top500.org/green500/lists/2016/11, 2016.
[2] Top500 list. https://www.top500.org/lists/2016/11, 2016.
[3] Amazon Web Service. https://aws.amazon.com/ec2.
[4] D. Luebke and G. Humphreys, “How gpus work,” in
Computer, vol. 40, no. 2, Feb. 2007.
[5] A. Prakash, H. Amrouch, M. Shafique, T. Mitra, and
J. Henkel, “Improving mobile gaming performance through
cooperative cpu-gpu thermal management,” in Proceedings
of 53rd ACM/EDAC/IEEE Design Automation
Conference (DAC), 2016.
[6] Nvidia. Programming Guide, 2014.
[7] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov,
O. Mutlu, and Y. N. Patt, “Improving gpu performance via
large warps and two-level warp scheduling,” in Proceedings
of 44th Annual International Symposium on
Microarchitecture, 2011.
[8] J. J. K. Park, Y. Park, and S. Mahlke, “Elf: Maximizing
memory level parallelism for gpus with coordinated warp
and fetch scheduling,” in Proceedings of SC15, 2015.
[9] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman,
R. Dreslinski, T. F. Wenisch, and S. Mahlke, “Composite
cores: Pushing heterogeneity into a core,” in Proceedings of
the 45th Annual International Symposium on
Microarchitecture, 2012.
[10] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet,
G. Lemieux, P. P. Pande, C. Grecu, and A. Ivanov,
“System-on-chip: Reuse and integration,” in Proceedings of
IEEE, Vol. 94, No. 6, June 2006, 2006.
[11] A. Bakhoda, J. Kim, and T. M. Aamodt,
“Throughput-effective on-chip networks for manycore
accelerators,” in Proceedings of 43rd Annual IEEE/ACM
International Symposium on Microarchitecture, 2010.
[12] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and
M. Guo, “Simultaneous multikernel gpu: Multi-tasking
throughput processors via fine-grained sharing,” in
Proceedings of IEEE International Symposium on High
Performance Computer Architecture (HPCA), 2016.
[13] D. Voitsechov and Y. Etsion, “Single-graph multiple flows:
Energy efficient design alternative for gpgpus,” in
Proceedings of the 41st International Symposium on
Computer Architecture (ISCA), 2014.
[14] G. V. D. Braak and H. Corporaal, “R-gpu: a reconfigurable
gpu architecture,” in ACM Transactions on Architecture and
Code Optimization, Vol.0, No. 0, Article 0, 2015.
[15] A. Dhar, “The case for reconfigurable general purpose gpu
computing,” in Master Thesis, University of Illinois at
Urbana-Champaign, 2014.
[16] J. Park, Y. Park, and S. Mahlke, “Dynamic resource
management for efficient utilization of multitasking gpus,”
in Proceedings of ASPLOS, 2017.
[17] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog,
“Efficient and fair multi-programming in gpus via effective
bandwidth management,” in Proceedings of IEEE
International Symposium on High Performance Computer
Architecture (HPCA), 2018.
[18] H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, and
H. Zhou, “Accelerate gpu concurrent kernel execution by
mitigating memory pipeline stalls,” in Proceedings of IEEE
International Symposium on High Performance Computer
Architecture (HPCA), 2018.
[19] C. Basaran and K. D. Kang, “Supporting preemptive task
executions and memory copies in gpgpus,” in Proceedings of
Euromicro Conference on Real-Time Systems, 2012.
[20] S. Kato, K. Lakshmanan, R. R. Rajkumar, and
Y. Ishikawa, “Timegraph: Gpu scheduling for real-time
multi-tasking environments,” in Proceedings of the 2011
USENIX conference on USENIX annual technical
conference, 2011.
[21] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte,
“The case for gpgpu spatial multitasking,” in Proceedings of
the 18th HPCA, 2012.
[22] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and
E. Witchel, “Ptask: Operating system abstractions to
manage gpus as compute devices,” in Proceedings of the
23rd ACM Symposium on Operating System Principles,
2011.
[23] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram,
“Warped-slicer: Efficient intra-sm slicing through dynamic
resource partitioning for gpu multiprogramming,” in
Proceedings of the 43rd Annual International Symposium
on Computer Architecture, 2016.
[24] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, “Core
fusion: Accommodating software diversity in chip
multiprocessors,” in Proceedings of the International
Symposium on Computer Architecture (ISCA), 2007.
[25] S. A. Lieberman and S. A. Mahlke, “Extending multicore
architectures to exploit hybrid parallelism in single-thread
applications,” in Proceedings of HPCA, 2007.
[26] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh,
D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ilp,
tlp and dlp with the polymorphous trips architecture,” in
Proceedings of International Symposium on Computer
Architecture, 2003.
[27] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and
M. Horowitz, “Smart memories: a modular reconfigurable
architecture,” in Proceedings of International Symposium
on Computer Architecture, 2000.
[28] NVIDIA, . https://www.techpowerup.com/gpu-
specs/geforce-gtx-780.c1701.
[29] volta-architecture-whitepaper.
https://images.nvidia.com/content/volta-
architecture/pdf/volta-architecture-whitepaper.pdf.
[30] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt,
“Dynamic warp formation and scheduling for efficient gpu
control flow,” in Proceedings of 40th International
symposium on Microarchitecture, 2007.
[31] T. D. Han and T. S. Abdelrahman, “Reducing branch
divergence in gpu programs,” in Proceedings of GPGPU-4
workshop, 2011.
[32] T. Rogers, D. R. Johnson, M. O’Connor, and S. W.
Keckler, “A variable warp size architecture,” in Proceedings
of ISCA, 2015.
[33] J. Meng, D. Tarjan, and K. Skadron, “Dynamic warp
subdivision for integrated branch and memory divergence
tolerance,” in Proceedings of ISCA, 2010.
[34] A. Jadidi, M. Arjomand, M. Kandemir, and C. Das,
“Optimizing energy consumption in gpus through
feedback-driven cta scheduling,” in Proceedings of
SpringSim (HPC) 2017: 12:1-12:12, 2017.
[35] GPGPU-Sim v3.2.2 (2016) GTX 480 Configuration.
https://github.com/chenxuhao/gpgpu-sim-
ndp/tree/master/configs/GTX480.
[36] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and
T. M. Aamodt, “Analyzing cuda workloads using a detailed
gpu simulator,” in IEEE International Symposium on
Performance Analysis of Systems and Software, 2009.
[37] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and
T. M. Aamodt, “Analyzing cuda workloads using a detailed
gpu simulator,” in IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS),
2009.
[38] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer,
S. Lee, and K. Skadron, “Rodinia: A benchmark suite for
heterogeneous computing,” in IEEE International
Symposium on Workload Characterization (IISWC), 2009.
[39] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and
J. Cavazos, “Auto-tuning a high-level language targeted to
gpu codes,” in Innovative Parallel Computing (InPar),
2012.
[40] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang,
“Mars: A mapreduce framework on graphics processors,” in
International Conference on Parallel Architectures and
Compilation Techniques (PACT), 2008.
[41] N. Kumar, M. Bansal, and A. Kaur, “Speed power and area
efficent vlsi architectures of multiplier and accumulator,” in
International Journal of Scientific and Engineering
Research Volume 4, Issue 1, January-2013, 2013.
[42] C. Kim, S. Sethumadhavan, M. S. Govindan,
N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler,
“Composable lightweight processors,” in Proceedings of the
International Symposium on Microarchitecture, 2007.
[43] M. Hill and M. Marty, “Amdahl’s law in the multicore era,”
in IEEE Computer, 41(7), 2008.
[44] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and
D. M. Tullsen, “Single-isa heterogeneous multi-core
architectures: The potential for processor power reduction,”
in Proceedings of the International Symposium on
Microarchitecture, 2003.
[45] P. Greenhalgh, “Big.little processing with arm cortex-a15
& cortex-a7,” in http://www.arm.com/files/downloads/bit-
LITTLE-Final-Final.pdf,
2011.
[46] M. Annavaram, E. Grochowski, and J. Shen, “Mitigating
amdahl’s law through epi throttling,” in Proceedings of the
32nd International Symposium on Computer Architecture,
2005.
[47] R. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, “The
impact of performance asymmetry in emerging multicore
architectures,” in Proceedings of the 32nd International
Symposium on Computer Architecture, 2005.
[48] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt,
“Accelerating critical section execution with asymmetric
multi-core architectures,” in Proceedings of ASPLOS, 2009.
[49] Y. Oh, G. Koo, M. Annavaram, and W. W. Ro,
“Linebacker: Preserving victim cache lines in idle register
files of gpus,” in ISCA, 2019.
[50] A. Pattnaik, X. Tang, O. Kayiran, A. Jog, A. Mishra,
M. T. Kandemir, A. Sivasubramaniam, and C. R. Das,
“Opportunistic computing in gpu architectures,” in
Proceedings of the 46th International Symposium on
Computer Architecture (ISCA), 2019.
